OpenAI Hunted a Rockset Bug and Found an 18-Year-Old Race Condition

OpenAI Developers

1H AGO

2 min read

TRAINING_INFRA

distributed_training pretraining

OPEN_SOURCE

Some bugs don't announce themselves. They show up as a crash, vanish, and leave you staring at a corrupted stack trace wondering if you've lost your mind. That's exactly what OpenAI's infrastructure team faced inside Rockset, the C++ data system that powers ChatGPT's conversation search and knowledge-base connectors. What looked like one inexplicable bug turned out to be two completely unrelated problems, coincidentally surfacing at the same time.

The scene of the crime

Rockset is a cloud-native search and analytics engine that OpenAI acquired in 2024. It's written in C++ for raw performance, and it handles real-time indexing so ChatGPT can retrieve relevant information mid-conversation. The downside of C++ is that memory bugs cause hard crashes, and these crashes were deeply weird.

The symptoms were things that shouldn't happen in normal code:

A function would finish executing and then return to a NULL address, causing the kernel to kill the process.
The stack pointer register (%rsp) would be off by exactly 8 bytes, with no obvious explanation.
Both failure modes crashed on function return, not during execution.

Every hypothesis the team (or ChatGPT) could think of had strong evidence against it, so the bug seemed impossible. A stray write landing precisely on a return address is theoretically possible but vanishingly rare. A stack pointer misalignment with no inline assembly, setcontext, or longjmp in the code path is even harder to explain.

Doctor mode vs. epidemiologist mode

The team's first instinct was to go deep on individual crash dumps -- what they call "doctor mode." They spent days reconstructing pre-crash history from register contents and stack frames. They read kernel source and Azure-specific patches, and they ran stress tests. Nothing clicked.

The key shift was deciding to gather high-quality population data. They had ChatGPT write a script that downloaded a prefix of each core file, extracted register state, filtered known false positives, and automatically labeled every crash as return-to-null, misaligned-stack, or other. Then they ran it in parallel over every production core dump from the previous year.
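
The labeling step can be sketched roughly as follows. This is a minimal illustration, not the team's actual script: the CrashRecord schema, the expected_rsp field (a stand-in for however the correct stack pointer value was recovered), and the tally helper are all assumptions made for the example. Only the three labels come from the article.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashRecord:
    """Register state pulled from one core file (hypothetical schema)."""
    core_id: str
    rip: int           # instruction pointer at the time of the crash
    rsp: int           # stack pointer at the time of the crash
    expected_rsp: int  # assumption: the value rsp should have held, recovered separately


def classify(rec: CrashRecord) -> str:
    """Label one crash with one of the three categories named in the article."""
    if rec.rip == 0:
        return "return-to-null"
    if abs(rec.rsp - rec.expected_rsp) == 8:
        return "misaligned-stack"
    return "other"


def tally(records: list[CrashRecord], workers: int = 8) -> Counter:
    """Classify every record in parallel and count how often each label occurs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(classify, records))


if __name__ == "__main__":
    # Made-up register values, for illustration only.
    sample = [
        CrashRecord("core-a", rip=0x0, rsp=0x7FFD1000, expected_rsp=0x7FFD1000),
        CrashRecord("core-b", rip=0x401000, rsp=0x7FFD1008, expected_rsp=0x7FFD1000),
        CrashRecord("core-c", rip=0x401000, rsp=0x7FFD1000, expected_rsp=0x7FFD1000),
    ]
    for label, count in tally(sample).items():
        print(f"{label}: {count}")
```

Counting labels across the whole population, rather than reading one dump at a time, is what lets separate crash families show up as distinct buckets.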

The result was immediate. What they had been treating as one weird bug was actually two separate crash populations. The population-level view made the structure obvious in a way that no amount of individual case analysis could.
