Banger Drought

Building in public has a screenshot problem.

When we ship something visible — a new retrieval mode, a benchmark, a rewrite of every skill in the catalog — there's a clean story to tell. Numbers go up, graphs render, posts write themselves.

The last few weeks haven't produced anything like that. I want to talk about why, because I think it's the more honest half of building a product.

What we've actually been doing

Almost everything we shipped in the last few weeks falls into one of four buckets.

Security. Tightening the delivery guard. Narrowing alert thresholds that fired too eagerly. Closing gaps that nobody had exploited but somebody could have. Cleaning up the dependency tree. The kind of work where success looks like silence: no incidents, no spam alerts, no surprised emails on a Sunday morning.

Data integrity. A long tail of small inconsistencies between what the code expects and what the database actually holds. Some left over from earlier migrations. Some from scripts that wrote to one table without remembering to sync the cache in another. None of it visible to a user opening the catalog — until one specific code path joins through the stale side of the cache, and then it is.
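To make "the stale side of the cache" concrete, here is a minimal sketch of a drift check. It is not our actual schema: the table names skills and skill_cache, the title column, and the use of SQLite are all assumptions chosen so the example runs on its own. The idea is to join the source table against the cache and report every row where the cache is missing or disagrees.

```python
import sqlite3


def find_cache_drift(conn):
    """Return (skill_id, source_title, cached_title) for every row where
    the cache is missing or disagrees with the source table."""
    query = """
        SELECT s.id, s.title, c.title
        FROM skills AS s
        LEFT JOIN skill_cache AS c ON c.skill_id = s.id
        WHERE c.skill_id IS NULL      -- no cache row at all
           OR c.title IS NOT s.title  -- cache row exists but is stale
        ORDER BY s.id
    """
    return conn.execute(query).fetchall()


# Hypothetical data: skill 2 has a stale cache row, skill 3 has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE skills (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE skill_cache (skill_id INTEGER PRIMARY KEY, title TEXT);
    INSERT INTO skills VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
    INSERT INTO skill_cache VALUES (1, 'alpha'), (2, 'beta-old');
""")

drift = find_cache_drift(conn)
for skill_id, source_title, cached_title in drift:
    print(f"skill {skill_id}: source={source_title!r} cache={cached_title!r}")
```

A check like this is cheap to run on a schedule, and an empty result is the silence you are hoping for.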

Checks and monitoring. The daily scan that runs at 06:00 UTC and tells us if anything moved overnight. The health endpoint that watches the four systems that have to be up. The test suite that has been growing alongside the codebase. Per-skill quality checks. A canary skill that exists to be blocked by the guard, so we know the guard still works.
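The canary idea fits in a few lines. The sketch below is illustrative only: guard_allows is a stand-in for a real delivery guard, and the marker string is invented for the example. What matters is the shape of the check: feed the guard something it must block, and raise an alarm if it lets it through.

```python
# Hypothetical marker that the stand-in guard is expected to block.
CANARY_MARKER = "CANARY-MUST-BLOCK"


def guard_allows(skill_text: str) -> bool:
    """Stand-in guard: returns False (blocked) when a known marker appears."""
    return CANARY_MARKER not in skill_text


def canary_check(guard) -> bool:
    """Return True when the guard still blocks the canary skill."""
    canary_skill = f"harmless filler text with {CANARY_MARKER} embedded"
    return not guard(canary_skill)


def broken_guard(_skill_text: str) -> bool:
    """A guard that has silently stopped working: it allows everything."""
    return True


if __name__ == "__main__":
    print("healthy guard passes canary check:", canary_check(guard_allows))
    print("broken guard passes canary check:", canary_check(broken_guard))
```

The point of the design is that the failure mode you fear (a guard that allows everything) produces a loud result instead of a quiet one.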

Infrastructure hardening. Database-level guards for bug classes that previously relied on every writer remembering to do the right thing. CHECK constraints. Triggers. Things that prevent a category of mistakes structurally instead of asking humans to be careful.
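As a rough illustration of what "structurally" means here, the sketch below uses SQLite because it runs with nothing installed; the table names, the quality column, and the trigger names are assumptions for the example, not our production schema. The CHECK constraints reject bad rows no matter which script writes them, and the triggers keep the cache in step so no writer has to remember to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE skills (
        id      INTEGER PRIMARY KEY,
        title   TEXT NOT NULL CHECK (length(trim(title)) > 0),
        quality REAL NOT NULL CHECK (quality BETWEEN 0 AND 1)
    );
    CREATE TABLE skill_cache (
        skill_id INTEGER PRIMARY KEY,
        title    TEXT NOT NULL
    );

    -- Every insert into skills also writes the cache row.
    CREATE TRIGGER skills_sync_cache_insert AFTER INSERT ON skills
    BEGIN
        INSERT OR REPLACE INTO skill_cache (skill_id, title)
        VALUES (NEW.id, NEW.title);
    END;

    -- Every title change is mirrored into the cache.
    CREATE TRIGGER skills_sync_cache_update AFTER UPDATE OF title ON skills
    BEGIN
        UPDATE skill_cache SET title = NEW.title WHERE skill_id = NEW.id;
    END;
""")

# A writer that knows nothing about the cache still leaves it consistent.
conn.execute("INSERT INTO skills (id, title, quality) VALUES (1, 'alpha', 0.9)")
conn.execute("UPDATE skills SET title = 'alpha v2' WHERE id = 1")
cached = conn.execute(
    "SELECT title FROM skill_cache WHERE skill_id = 1"
).fetchone()[0]

# An out-of-range value is rejected by the database itself.
try:
    conn.execute("INSERT INTO skills (id, title, quality) VALUES (2, 'beta', 1.7)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("cached title:", cached)
print("bad row rejected:", rejected)
```

The tradeoff is that logic in the database is harder to see from application code, so constraints and triggers like these are worth documenting next to the schema.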

None of that ships as a visible feature. None of it goes in a tweet. All of it is why we can keep adding features without watching the foundation crack.

Why this is the work right now

When we launched the platform earlier this year, we built fast. The catalog grew from a handful of skills to over a thousand. Search became hybrid. PowerPacks shipped. Then SkillStreaming. Then a complete rewrite of every skill from "encyclopedia explanations" to "hunting expertise."

Each of those was a banger. Each of them also left small debts behind. Caches that got slightly out of sync. Tests that drifted from the design. Alerts that were calibrated against assumptions that no longer held. Skills that needed more research sources than they were originally seeded with.

You can keep adding features on top of that debt for a while. The product gets bigger. The graphs go up.

Eventually something breaks in a way that didn't show up in development, and the maintenance bill comes due.

We're paying it now. Not because anything broke catastrophically, but because we want to ship the next round of features on something that doesn't wobble.

The lesson I'm taking

A red test is a signal, not a nuisance.

When CI has been sitting at "nine failing tests" for weeks and you've learned to skim past them, you've also learned to skim past whatever the meaningful failures are telling you. New regressions hide in the noise.

For us the fix isn't getting CI green by any means necessary. It's making CI honest, then doing what it points at. Some of the tests we lived with turned out to be pointing at real problems nobody had complained about, because nobody had a baseline to compare against. Some of them turned out to contradict how the system is designed to behave. Sorting those two cases apart is most of the work.

The build-in-public tradeoff

If you only post bangers, your audience sees a team that ships endlessly. They will, eventually, be confused when something breaks. They won't have a mental model for "this product has a maintenance bill."

If you only post plumbing, you sound exhausting.

The honest version is: every product has both phases on a cycle. We were in the banger phase for a quarter. We're in the plumbing phase now. In a few weeks we'll be back to new surfaces. The plumbing weeks are what make the new surfaces survivable.

Most teams have these phases. The difference is which one they make public.

This is the plumbing phase. Posting about it anyway.

What's next

When this round wraps:

A second benchmark drop. We want to publish a quarterly skill-quality benchmark.
ChatGPT integrations, as the surfaces open up.
More transparency posts like this one. The screenshot ones too.

But also more plumbing. There's always more plumbing. That's the deal.

If you're running a small team and a chunk of your quarter goes uncelebrated because it's not shippable: you're doing the job right. Tell people about that part too.