When prod surprises you!

November 29, 2025 (3w ago)

Pushed a new variant of our workflow to prod for alpha testing today. Everything looked smooth in dev — no bugs, decent response times (3-5 mins) — so I felt good about it.

But once in prod, I noticed our image generation module slowing way down. Checked logs, LLM calls, provider console — turns out the 99th percentile response time shot up from about 0.5 minutes to 4 minutes because of the provider.

Then the content compliance check took forever. Traced it down to a recursion limit change — someone bumped it to 500 instead of the usual 5-10, so the agent kept calling the tool endlessly.

Stopped the server, fixed the prompt and limiter, pushed the update — thankfully caught it early thanks to prod tests!

Moral of the story: never underestimate a sneaky numerical bug. Always triple-check those limits 😂

Heart rate’s back to normal now 🙂

Prod Surprise Thumbnail