Story Time

Why We Built TierZero: From Pager Panic to Calm Co-Pilot

Follow the journey from a career-defining Meta outage to a seven-figure Niantic incident, and see how those scars shaped TierZero, the AI SRE agent built to keep teams calm under fire.

Anhang Zhu, October 8, 2025
TierZero packages hard-won incident response lessons and state-of-the-art AI agents that understand your infrastructure into a ready-to-run AI SRE agent. Any engineering team can deploy it in minutes instead of spending years building brittle runbooks.

Key Takeaways

  • Enterprise-scale incidents still depend on individual heroics at most companies.
  • Meta's internal auto-remediation showed automation can tame chaos, yet it remained an advantage only a few companies could replicate.
  • TierZero compresses that capability into a ready-to-use AI SRE agent that sets up in minutes.

Four months into my first role at Meta, the ads system collapsed. The outage froze revenue worldwide and left me, the newest engineer on the team, holding the pager. I paged through dashboards, scrolled diffs, and drafted frantic updates while silently hoping a senior teammate would materialize. When my manager stepped into the war room hours later, he did not ask for a status report. He opened an internal console that automated remediation.

The console scanned the change stream, mapped likely blast radius, and listed the cleanest rollback steps. Within minutes we transitioned from guesswork to an ordered plan. The system paired structured telemetry with codified playbooks, turning crisis mode into a guided workflow. We closed the incident without guesswork or heroics.
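To make the idea concrete, here is a minimal sketch of the first step such a console performs: filtering a change stream down to the changes that landed shortly before the incident and ranking them as suspects. This is an illustration only, not how FBAR worked internally. The `Change` record, the `rank_suspect_changes` function, the two-hour window, and the service names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One entry in a hypothetical change stream (deploy, config push, flag flip)."""
    change_id: str
    service: str
    deployed_at: datetime


def rank_suspect_changes(changes, incident_start, affected_services,
                         window=timedelta(hours=2)):
    """Return changes deployed within `window` before the incident, most suspicious first.

    Assumed heuristic: changes to an affected service sort ahead of all others,
    and within each group the most recent change comes first.
    """
    candidates = [
        c for c in changes
        if timedelta(0) <= incident_start - c.deployed_at <= window
    ]
    return sorted(
        candidates,
        key=lambda c: (c.service not in affected_services,
                       incident_start - c.deployed_at),
    )


# Illustrative data only.
incident_start = datetime(2025, 1, 1, 12, 0)
changes = [
    Change("c1", "billing", datetime(2025, 1, 1, 11, 50)),
    Change("c2", "ads-serving", datetime(2025, 1, 1, 11, 30)),
    Change("c3", "ads-serving", datetime(2025, 1, 1, 8, 0)),   # too old for the window
    Change("c4", "ads-serving", datetime(2025, 1, 1, 12, 5)),  # landed after the incident
]
for change in rank_suspect_changes(changes, incident_start, {"ads-serving"}):
    print(change.change_id, change.service)
```

A real system would weigh far more signal than timing and service names, but even this crude ordering replaces "scroll every diff" with a short, prioritized list.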

FBAR Was a Glimpse of the Future

That console was Facebook Auto Remediation, or FBAR. I assumed every major company had built something similar. Years later I found Facebook's public overview of FBAR and confirmed how deeply automation was woven into their operations. FBAR continuously correlated server telemetry, change history, and hardware health to schedule remediation before humans even noticed the problem. It was the institutional memory every on-call engineer dreams about.

What I misunderstood was how rare that level of automation actually was. FBAR existed because Facebook invested entire teams into encoding infrastructure topology, incident history, and recovery logic. Without that scale, most organizations kept their knowledge scattered across wikis, spreadsheets, and muscle memory.

The Wake-Up Call at Niantic

Fast forward six years. My gaming startup was acquired by Niantic, and I was leading infrastructure teams. A major incident hit us hard and instinct took over: run FBAR. Except our stack had no FBAR. We fell back to manual debugging, restarting services, and rerunning deploys based on whoever shouted the loudest theory.

The Damage Report

  • One exhausting week of all-hands remediation work
  • $200,000+ refunded to players
  • One senior engineer burned out enough to resign

That experience made the gap painfully clear. Most teams still rely on heroics. Knowledge lives in outdated docs. On-call engineers learn by being thrown into the deep end and hoping the next outage resembles the last one. Automation remained a bespoke luxury.

Manual Incident Response Does Not Scale

When incident response hinges on individual memory, three structural issues surface immediately.

  • Detection and diagnosis slow down while teams chase the true blast radius, and customers continue to feel pain.
  • Knowledge sharing falters because playbooks rot and tribal context never makes it into tools.
  • Humans burn out under relentless adrenaline, which drags reliability down even further.

The old answer was to spend years building custom rule engines and automation scripts. Facebook could afford that. Most companies could not justify pausing feature velocity to craft internal tooling that might help with the next outage.

AI SRE Agents Changed the Equation

Modern reasoning models finally give us a way to synthesize telemetry, change logs, and service maps without hand-coded rules. We can train an AI SRE agent to ask the right investigative questions, suggest safe remediation paths, and capture the result so the organization learns automatically.

TierZero bundles that intelligence into a product any team can deploy in fifteen minutes. Connect your logging, metrics, source control, and incident tracker using API keys. From there the platform keeps a continuous view of deploys, alerts, and customer impact so that when something breaks, the on-call engineer gets a guided playbook instead of guesswork.

What TierZero Delivers Out of the Box

  • Instant reconstruction of recent changes and the most likely blast radius.
  • Ranked mitigation options with risk scoring, owners, and rollback steps.
  • Contextual escalations that summarize what changed plus recommended next actions.
  • Automatic capture of resolution details that accelerate the next post-incident review.
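As a rough illustration of what "ranked mitigation options with risk scoring" can mean in practice, the sketch below filters out options above a risk ceiling and orders the rest by a simple score. It is not TierZero's actual scoring model. The `Mitigation` fields, the `rank_mitigations` function, the "likelihood minus risk" formula, and every number in the example are assumptions chosen for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    """A hypothetical mitigation option presented to the on-call engineer."""
    name: str
    owner: str
    success_likelihood: float  # 0.0-1.0, estimated chance the action resolves the incident
    risk: float                # 0.0-1.0, estimated chance the action makes things worse
    rollback: str              # how to undo the action if it does not help


def rank_mitigations(options, max_risk=0.5):
    """Drop options riskier than `max_risk`, then sort by likelihood minus risk, best first."""
    acceptable = [m for m in options if m.risk <= max_risk]
    return sorted(acceptable,
                  key=lambda m: m.success_likelihood - m.risk,
                  reverse=True)


# Illustrative data only.
options = [
    Mitigation("restart workers", "platform", 0.4, 0.05, "no rollback needed"),
    Mitigation("fail over region", "infra", 0.9, 0.6, "fail back to primary"),
    Mitigation("revert last deploy", "ads-team", 0.8, 0.1, "redeploy the reverted build"),
]
for option in rank_mitigations(options):
    print(f"{option.name} (owner: {option.owner}, rollback: {option.rollback})")
```

The design choice worth noting is the hard risk ceiling: an action that is likely to work but could widen the outage is withheld rather than merely ranked lower, so the list stays safe for a new on-call engineer to act on.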

There are no custom rules to author and no special team to maintain. We took the lessons from FBAR and built a generalized AI SRE agent that works across modern stacks.

Built for the Engineers Carrying the Pager

TierZero augments the people already on call. New hires gain confidence because the system shares historical insight the moment they see their first alert. Staff engineers can encode hard-won knowledge once and keep it evergreen. Leaders finally gain observability into where incidents are burning time, revenue, and morale.

The kind of tool that tamed outages at Meta should not stay locked inside one company. By blending AI SRE agents with proven remediation workflows, TierZero gives every team a path toward self-healing infrastructure.

Ready to See TierZero in Action?

Book a TierZero demo and let us show you what fifteen minutes of setup unlocks for your team.