
Why We Built TierZero: From Pager Panic to Calm Co-Pilot

Follow the journey from a career-defining Meta outage to a seven-figure Niantic incident, and see how those scars shaped TierZero, the AI production agent built to keep teams calm under fire.

Anhang Zhu
Co-Founder & CEO at TierZero AI
October 8, 2025 · 4 min read
TierZero packages hard-won incident response lessons and state-of-the-art AI agents that understand your infrastructure into a ready-to-run AI production agent. Any engineering team can deploy it in minutes instead of spending years maintaining brittle runbooks.

Key Takeaways

  • Enterprise-scale disasters usually get fixed by one person engaging in unsustainable heroics.
  • Meta's internal tools proved you can automate the chaos away, but nobody else really pulled it off.
  • TierZero takes that capability and shoves it into an AI production agent you can set up while your coffee brews.

Four months into my gig at Meta, the ads system totally cratered. Revenue froze worldwide. I was the new guy, so naturally, I was holding the pager. I stared at dashboards, doom-scrolled diffs, and wrote panicked updates hoping a senior engineer would magically appear. When my manager finally walked into the war room, he didn't ask for a status. He just opened a console that did the investigating for us.

This console scanned the change stream, figured out the blast radius, and told us how to roll back. We went from guessing to having a plan in minutes. It took structured telemetry and playbooks and turned a crisis into a checklist. We fixed it without heroics. No guessing required.
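
To make that flow concrete, here is a minimal Python sketch of the idea. It is not FBAR's code. The `Change` record, the `DEPENDENTS` map, the 60-minute window, and every service name are hypothetical, made up purely for illustration. It filters recent changes, walks the dependency graph to estimate each change's blast radius, and lists rollback candidates newest first.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    change_id: str
    service: str
    deployed_at: int  # minutes since an arbitrary epoch (illustrative unit)


def blast_radius(service, dependents):
    """Return `service` plus every service that transitively depends on it."""
    seen = {service}
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, ()):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen


def rollback_candidates(changes, dependents, failing_service, alert_at, window=60):
    """Recent changes whose blast radius includes the failing service, newest first."""
    recent = [c for c in changes if 0 <= alert_at - c.deployed_at <= window]
    hits = [c for c in recent if failing_service in blast_radius(c.service, dependents)]
    return sorted(hits, key=lambda c: c.deployed_at, reverse=True)


# Hypothetical topology: each key maps to the services that depend on it.
DEPENDENTS = {
    "auth": ["ads-api", "billing"],
    "ads-api": ["ads-frontend"],
}

# Hypothetical change stream.
CHANGES = [
    Change("c1", "auth", 100),
    Change("c2", "search", 110),
    Change("c3", "ads-api", 115),
    Change("c4", "auth", 10),  # too old to fall inside the window
]

candidates = rollback_candidates(CHANGES, DEPENDENTS, "ads-frontend", alert_at=120)
print([c.change_id for c in candidates])  # prints ['c3', 'c1']
```

A real system would weigh far more signals than recency and topology, but even this toy version shows why a structured change stream beats doom-scrolling diffs.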

FBAR Was Seeing the Future

That tool was Facebook Auto Remediation, or FBAR. I assumed every big tech company had one. Years later I read the public docs on FBAR and realized how deep the automation went. It looked at server stats, change history, and hardware health to fix things before humans even woke up. It was the kind of institutional memory on-call engineers hallucinate about during 3 AM pages.

I didn't realize how rare that actually was. FBAR existed because Facebook threw entire teams at the problem. They encoded topology and recovery logic like it was their religion. Without that scale, most orgs just keep their knowledge in scattered wikis, random spreadsheets, and the brains of people who eventually quit.

The Reality Check at Niantic

Fast forward six years. Niantic bought my gaming startup and I was running infra. A massive incident hit us and my muscle memory kicked in. Run FBAR. Except we didn't have FBAR. We went back to the stone age of manual debugging, restarting services, and listening to whoever yelled their theory the loudest.

The Damage Report

  • One week of painful all-hands cleanup work
  • $200,000+ refunded to angry players
  • One senior engineer burned out hard enough to quit on the spot

That week made it obvious. Most teams still run on heroism. Knowledge lives in docs nobody updates. On-call engineers learn by getting thrown into the fire and hoping the next outage looks like the last one. Automation was still just a nice idea nobody had time to build.

Manual Response Doesn't Scale

When fixing things depends on one person's memory, you hit three big problems immediately.

  • Diagnosis takes forever while teams hunt for the blast radius and customers scream.
  • Knowledge sharing fails because playbooks rot and tribal knowledge never gets written down.
  • Humans burn out from the adrenaline spikes, which makes reliability even worse.

The old solution was building custom rule engines and scripts for years. Facebook had the budget for that. Most companies can't stop shipping features to build internal tools that might help during the next fire.

AI Production Agents Fixed the Math

Modern reasoning models finally let us combine telemetry, logs, and service maps without writing a million rules by hand. We can train an AI production agent to investigate properly, suggest a safe fix, and write it down so the team actually learns something.

TierZero puts that intelligence in a box you can deploy in fifteen minutes. Just toss in your API keys for logging, metrics, source control, and incident tracking. The platform watches deploys and alerts so when things break, the on-call engineer gets a playbook. No more guessing games.

What TierZero Actually Does

  • Instantly figures out what changed and what the blast radius looks like.
  • Ranks mitigation options by risk. Includes owners and rollback steps.
  • Escalates with context. Tells you what changed and what to do next.
  • Automatically writes down the resolution so your post-mortem doesn't suck.
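
As a rough illustration of "ranks mitigation options by risk," here is a small Python sketch. It is not TierZero's implementation. The `Mitigation` fields, the scoring rule, and the example options are all assumptions invented for this example. The only point is that each option carries an owner and a rollback step, and that the list is ordered from safest to riskiest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    name: str
    owner: str
    rollback_step: str
    services_touched: int
    reversible: bool


def risk_score(option):
    """Hypothetical heuristic: wider changes score higher, irreversible ones much higher."""
    irreversible_penalty = 0 if option.reversible else 10
    return option.services_touched + irreversible_penalty


def rank_by_risk(options):
    """Order mitigation options from lowest to highest risk score."""
    return sorted(options, key=risk_score)


# Hypothetical options for a single incident.
OPTIONS = [
    Mitigation("Fail over database", "storage-team", "Fail back manually", 3, False),
    Mitigation("Disable feature flag", "ads-team", "Re-enable the flag", 2, True),
    Mitigation("Revert latest deploy", "ads-team", "Redeploy the reverted build", 1, True),
]

ranked = rank_by_risk(OPTIONS)
for option in ranked:
    print(f"{risk_score(option):>2}  {option.name}  (owner: {option.owner})")
```

With these made-up inputs, the revert comes first and the database failover last, which matches the instinct most on-call engineers already have: try the cheap, reversible fix before the drastic one.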

You don't have to write custom rules. You don't need a special team to maintain it. We took what we learned from FBAR and built a generalized AI production agent that actually works on modern stacks.

Built for the Poor Soul Holding the Pager

TierZero helps the people actually on call. New hires don't panic because the system gives them historical context the second an alert fires. Staff engineers can encode their brain dumps once and keep them relevant. Leaders finally see where incidents are burning time, money, and everyone's will to live.

The tool that saved Meta shouldn't stay locked up in one company. TierZero mixes AI production agents with workflows that actually work. It gives every team a shot at self-healing infrastructure.

Want to See It Work?

Book a TierZero demo. We'll show you what fifteen minutes of setup gets you.

Anhang Zhu

Co-Founder & CEO at TierZero AI

Previously Director of Engineering at Niantic. CTO of Mayhem.gg (acq. Niantic). Owned social infrastructure for 50M+ daily players. Tech Lead for Meta Business Manager.
