
Tribal Knowledge Is a Terrible Database

If your most important runbook lives inside one engineer's head, congratulations: you have invented a database with legs, opinions, and vacation plans.

Stick figure team turning one engineer's brain-only knowledge into a shared runbook.

Every engineering team has one person who knows where the bodies are buried.

Not actual bodies. Legal would like me to be very clear about that. I mean the tiny operational mysteries that keep production alive: which dashboard lies after midnight, why the payment worker needs exactly three replicas, why restarting the cache in the wrong order summons a meeting, and which environment variable is named ENABLE_NEW_QUEUE even though the new queue has been old enough to vote.

This person is usually calm, competent, and dangerous to let near a beach vacation.

Because the moment they go offline, the team discovers that a surprising amount of the company is being held together by one human brain, two Slack threads, and a sticky note that says “ask Priya.”

That is tribal knowledge. It feels fast until it is not. It feels efficient until the person who knows the answer is asleep, on a plane, or eating dosa with both hands and absolutely not checking PagerDuty.

For firstrespondr, this is not a side quest. The whole point of an AI operations teammate is that it can investigate with context instead of guessing from raw alerts. It can search your runbooks, compare the current incident with past ones, ask better clarification questions, and suggest safer next steps. But it cannot retrieve wisdom that only exists as a dramatic pause in someone’s memory.

Tribal knowledge has a terrible API

Let us evaluate the human brain as a production knowledge store.

  • Availability: questionable.
  • Latency: depends on coffee.
  • Replication: mostly accidental.
  • Search: “I think Ravi mentioned this during that incident in March?”
  • Access control: whoever is brave enough to interrupt the senior engineer.
  • Backups: none, unless you count the junior engineer who once watched over Zoom and nodded with confidence they did not feel.

This is not because people are careless. It is because teams are busy. Incidents happen. Someone fixes the thing. Everyone says “nice save.” The story becomes folklore. The next sprint starts. The knowledge remains in the warm, squishy database wearing headphones.

Then six months later, production coughs, the same problem returns wearing a small fake mustache, and the team performs the ancient ritual:

  1. Search Slack
  2. Find twelve messages saying “fixed it”
  3. Ask who fixed it
  4. Learn they left the company
  5. Open dashboards with the confidence of an intern near a billing system

Documentation is not homework

Documentation has a branding problem. It sounds like something you do after the real work, like washing dishes after a party where Kubernetes knocked over the furniture.

But good documentation is not homework. It is operational leverage.

When knowledge is documented, the team gets a few very practical superpowers:

  • New engineers stop needing a month of oral history before they can safely touch a service.
  • On-call stops being a trivia contest hosted by a hostile phone.
  • Incidents become searchable evidence instead of campfire stories.
  • firstrespondr can actually help, because agents need written context too.
  • The team can improve the system instead of repeatedly rediscovering it.

The key phrase is written context. Not a 90-page wiki novel. Not a PDF named “final_final_really_final_v3.” Written context means the next person, or the next investigation agent, can answer the obvious questions without guessing:

  • What is this service supposed to do?
  • What does healthy look like?
  • What breaks most often?
  • What should I check first?
  • What actions are safe?
  • What actions need approval?
  • What happened last time?

That is not bureaucracy. That is a flashlight.
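Those questions are small enough to live in a structured snippet rather than a wiki novel. Here is a minimal sketch of what "written context" for one service could look like; the field names and the example values are illustrative, not part of any real firstrespondr schema:

```python
from dataclasses import dataclass, field

@dataclass
class ServiceContext:
    """Tiny, searchable answers to the obvious on-call questions.
    All names here are illustrative, not a real schema."""
    purpose: str                              # what is this service supposed to do?
    healthy: str                              # what does healthy look like?
    common_failures: list[str] = field(default_factory=list)   # what breaks most often?
    check_first: list[str] = field(default_factory=list)       # what should I check first?
    safe_actions: list[str] = field(default_factory=list)      # what actions are safe?
    needs_approval: list[str] = field(default_factory=list)    # what needs approval?
    last_incident: str = "none recorded"      # what happened last time?

# Hypothetical example entry for a checkout service.
checkout = ServiceContext(
    purpose="Takes payment for a cart and emits an order event",
    healthy="p95 latency under 900 ms, queue depth under 1,000",
    common_failures=["payment provider slowness", "worker queue backlog"],
    check_first=["payment provider p95 latency", "payment-worker queue depth"],
    safe_actions=["inspect dashboards", "read logs"],
    needs_approval=["scale worker replicas", "restart cache"],
)
```

Seven fields, maybe ten minutes to fill in, and both a sleepy human and an investigation agent can answer the obvious questions without guessing.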

The best documentation is boring in the right places

Great docs do not need to be poetic. They need to be findable, current, and specific enough that a tired person can use them without becoming an archaeologist.

Bad documentation says:

If latency is high, investigate the downstream dependency.

Thanks, document. Truly, the wisdom of ages.

Useful documentation says:

If checkout latency is above 900 ms for five minutes, check payment provider latency first. If provider p95 is normal, check payment-worker queue depth. If queue depth is above 5,000 and CPU is below 40%, increase worker replicas from 3 to 5 after approval.

One version sounds professional. The other version gives firstrespondr enough context to gather evidence, explain the likely cause, and ask the on-call engineer for the right approval instead of simply yelling “latency.”
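Written that specifically, the runbook is nearly executable. A sketch of the same triage logic in Python, using the thresholds from the paragraph above; the metric inputs are placeholders you would wire to your monitoring, and treating "provider p95 is normal" as "under the same 900 ms bar" is an assumption:

```python
def triage_checkout_latency(checkout_p95_ms: float,
                            provider_p95_ms: float,
                            queue_depth: int,
                            cpu_percent: float) -> str:
    """Encodes the checkout-latency runbook from the text.

    Thresholds (900 ms, 5,000 queue depth, 40% CPU, 3 -> 5 replicas)
    come straight from the written runbook; metric values would come
    from your monitoring stack.
    """
    if checkout_p95_ms <= 900:
        return "healthy: no action"
    if provider_p95_ms > 900:  # assumption: "normal" means under the same 900 ms bar
        return "check payment provider first"
    if queue_depth > 5_000 and cpu_percent < 40:
        return "request approval: scale payment-worker replicas 3 -> 5"
    return "escalate: latency high but no known signature matches"
```

Note that the scaling branch returns a request for approval rather than an action, matching the "after approval" clause in the runbook.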

The funny thing is that teams often know this. They just wait too long to write it down. They try to document once the system is “stable,” which is adorable, because production systems are basically toddlers with SSL certificates. They change. They spill things. They learn new ways to surprise you.

Documentation should be part of the system, not a museum exhibit next to it.

AI agents are only as smart as the notes you leave them

This is where documentation gets more interesting.

If you want firstrespondr to investigate incidents, answer operational questions, or recommend safe next steps, it needs the same thing a good human responder needs: context.

Logs tell you what happened. Metrics tell you how loudly it happened. Deploy history tells you who poked the bear. But documentation tells you what it means inside your weird, specific environment.

An agent can read a generic Kubernetes error. It cannot magically know that your staging cluster always reports one fake failing pod because a vendor health check is dramatic. It cannot know that the cache restart order matters because an old client reconnects like it was written during a thunderstorm. It cannot know that “do not run this migration on Friday” is not superstition but the scar tissue of payroll weekend.

Unless you write it down.

The more knowledge you document, the more useful firstrespondr becomes. Your runbooks become executable guidance. Your incident notes become retrieval context. Your postmortems become training material for the next investigation. Your approval rules become clearer because the agent can say “this is safe to inspect” and “this needs a human.” The organization slowly stops depending on heroic memory and starts depending on shared systems.
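To make "runbooks become retrieval context" concrete, here is a toy sketch of retrieval by keyword overlap. Real agents use embeddings and smarter ranking, but the prerequisite is identical either way: the snippets have to exist in writing before anything can retrieve them.

```python
def retrieve_runbook_snippets(incident: str,
                              snippets: list[str],
                              top_k: int = 2) -> list[str]:
    """Toy retrieval: rank written snippets by word overlap
    with the incident description. Illustrative only; production
    retrieval would use embeddings, not bag-of-words."""
    incident_words = set(incident.lower().split())
    return sorted(
        snippets,
        key=lambda s: len(incident_words & set(s.lower().split())),
        reverse=True,
    )[:top_k]
```

An undocumented fix scores zero in every ranking scheme ever invented, which is the whole point.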

Heroic memory is fun in movies. In production, it is a single point of failure with a calendar invite.

Make docs tiny, timely, and slightly useful

The trick is not to ask engineers to write perfect documentation. Perfect documentation is how you get one beautiful page from 2022 and a team that avoids Confluence like it owes them money.

Ask for tiny docs.

After an incident, write the three things that would have saved ten minutes. After a weird deploy, write the rollback condition. After answering the same question twice, write the answer where search can find it. After discovering a dashboard lies, put a warning next to the dashboard link, ideally before it gets someone promoted to midnight detective.

Make documentation a trail of breadcrumbs, not a cathedral.

And keep the tone human. A runbook can say “if this graph is red, do not panic yet, it lies during batch jobs.” That sentence is more useful than five paragraphs of enterprise fog.

Write it down before it becomes folklore

Tribal knowledge is not evil. It is how teams learn. Every system starts with someone knowing something before the document exists.

The problem starts when that knowledge never leaves the person.

Documented knowledge is kinder to new teammates, kinder to on-call engineers, kinder to future you, and extremely kind to firstrespondr when it is trying to help without making things weird. It turns “ask Priya” into “firstrespondr found the runbook Priya improved after the last incident.” Priya still gets credit. Priya also gets lunch.

So write down the weird stuff. Especially the weird stuff. The obvious parts can be inferred. The weird parts are where the outages hide.

Your team’s knowledge should not require a campfire, a senior engineer, and three Slack search operators.

Put it somewhere searchable. Keep it alive. Let humans sleep. Let agents read.

And please, for the love of future on-call, document why the payment worker needs exactly three replicas.