The Bug That Wasn't There

We keep a list of known issues in our own docs. One of them had been sitting there for weeks, labeled "known drift."

It described a trigger on our skills table that was supposed to refresh the full text search index whenever we wrote to certain fields, but only fired on some of them. Writes to three of the fields appeared to skip the refresh. The fix looked obvious. Extend the trigger, backfill the rows that had gone stale, move on.

I sat down to close it. An hour later I had proven the bug existed, planned a careful fix, gotten sign off, and run the first batch. Then Postgres told me the bug had never been real.

Here is what happened, because the way a non-bug survives in your own documentation for months is more interesting than the non-bug.

The confirmation

The doc was specific, so I checked it against the live database instead of trusting it. The trigger really did fire on only six of nine columns. The function it called really did reference all nine. So on paper, writing to the other three fields would leave the index out of date.

Then I measured how many skills were actually affected. I recomputed what the search vector should be for every published skill and compared it to what was stored.

published: 1306
stale_search_vector: 1306

Every single one. A total, catalog wide drift. That number felt like a smoking gun. It should have felt like a broken thermometer.

The careful plan

Backfilling the skills table is the one operation on this platform I treat with real fear. We have taken production down three separate times by running bulk updates that fan out through cascading triggers and saturate the connection pool. So before touching a single row, I read every trigger on the table and every function those triggers call.

The result was reassuring. A write that only touched the search vector column would not null any embeddings, would not mark any chunks stale, would not fire the publish cascade. Each cascading trigger checked specific columns and short circuited on anything else. I wrote the migration, batched it in groups of two hundred, scheduled it off peak. Textbook. I got the go ahead and ran the first batch.

The error

ERROR: 428C9: column "search_vector" can only be updated to DEFAULT
DETAIL: Column "search_vector" is a generated column.

You cannot write to a generated column. Postgres computes it for you, on every row, from an expression baked into the table. I had spent an hour planning how to carefully write a value that the database refuses to accept by design.

So I pulled the actual definition.

setweight(name, 'A') || setweight(description, 'B') || setweight(intent, 'C')

Three fields. Not nine. The search vector had never been built from the nine fields the doc described. It was generated from three, and a generated column is always current with its source columns by construction. There was nothing to backfill. There had never been anything to backfill.

The "1306 stale" was an artifact of comparing the stored value against an expression that never produced it. I had built my own thermometer wrong and then panicked at the reading. As a final insult, one of the three real fields, intent, turned out to be filled on only 318 of the 1306 skills, so for most of the catalog the search index is really just name and description.

Where the lie lived

The trigger and the function it called were dead code. Their assignment to the search vector was silently ignored, because you cannot assign to a generated column. They did nothing. They had probably done nothing for a long time.

They were not harmless, though, because that dead function listed nine fields. At some point someone read it, reasonably concluded the index covered nine fields, and wrote that into our documentation. The doc got repeated. It got more confident. The known drift entry was downstream of a function that had stopped mattering long ago and was still quietly describing a system that did not exist.

Dead code is not neutral. It is documentation that lies, and it lies with the authority of sitting right there in the schema.

What I actually changed

I had already applied my pointless trigger change to production before the error surfaced. The first thing I did was check whether it was harmful. Would a real update to a skill name now fail, because the trigger tries to assign a generated column?

I ran one update inside a transaction and rolled it back.

UPDATE_OK

Tolerated. The assignment is ignored, not rejected. No skill edit had broken. So I reverted my change, restored the trigger to exactly what it had been, and deleted the migration. Production ended the day byte for byte where it started, plus one corrected sentence in the docs.

The real finding, the one underneath the imaginary one, is that our full text search leans on three fields and the rest of the signal comes from the embedding half of our hybrid search. That is fine. Embeddings carry the enrichment fields. The point is that now we know it, instead of believing a comfortable fiction our own tooling had been reciting back to us.

The part worth keeping

I want to be clear about what saved this from becoming an outage. It was not that I avoided the misdiagnosis. I walked straight into it and got all the way to running a batch. What contained it was the boring discipline around the dangerous part. Reading every trigger before the bulk write. Batching instead of one big update. Testing the trigger with a rollback before trusting it. The same caution I thought was protecting me from a real bug ended up protecting me from a fake one.

Two things I am taking forward. When every single row looks broken, suspect the ruler before the world. And when you find dead code, do not leave it sitting because it seems harmless. It is not harmless. It is still talking.