Weights & Biases just shipped a rebuilt version of W&B Weave aimed squarely at one problem: how do you keep an agent reliable once it leaves your laptop and starts handling millions of real conversations? The new release treats production traffic as the primary source of truth for agent quality, not a place where things break after the fact.

The pitch is that Weave now provides end-to-end observability to monitor production agents, out-of-the-box signals to surface failure modes, a complete improvement loop from inference to training, and a flexible evaluation framework to prevent regressions. It is free to try, and the company is positioning it as the observability layer for a broader push by parent CoreWeave to close the gap between training and inference.

Why stack traces stopped working

The framing behind the release is worth pausing on. The W&B team argues that agent tracing is no longer just code tracing. A stack trace doesn't describe what an agent did. You need first-class semantics: sessions, turns, steps, tools, sub-agents. Generic observability tools don't speak this language.
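To make "first-class semantics" concrete, here is a minimal sketch of what a trace shaped around sessions, turns, steps, tools, and sub-agents could look like. The class names, fields, and the `failed_tools` helper are all hypothetical and chosen for illustration; this is not Weave's data model or API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Step:
    """One unit of agent work. All names here are hypothetical, not Weave's schema."""
    kind: str  # "llm", "tool", or "sub_agent"
    name: str
    ok: bool = True
    children: List["Step"] = field(default_factory=list)  # steps run by a sub-agent


@dataclass
class Turn:
    """One user message and the steps the agent took to answer it."""
    user_input: str
    steps: List[Step] = field(default_factory=list)


@dataclass
class Session:
    """A whole conversation, made of turns."""
    session_id: str
    turns: List[Turn] = field(default_factory=list)


def walk(step: Step) -> Iterator[Step]:
    """Yield a step and everything nested under it, depth first."""
    yield step
    for child in step.children:
        yield from walk(child)


def failed_tools(session: Session) -> List[str]:
    """Names of tool calls that failed anywhere in the session, including inside sub-agents."""
    return [
        s.name
        for turn in session.turns
        for top in turn.steps
        for s in walk(top)
        if s.kind == "tool" and not s.ok
    ]


session = Session(
    "s-1",
    turns=[
        Turn(
            "Cancel my order",
            steps=[
                Step("llm", "plan"),
                Step(
                    "sub_agent",
                    "orders_agent",
                    children=[
                        Step("tool", "lookup_order"),
                        Step("tool", "cancel_order", ok=False),
                    ],
                ),
            ],
        )
    ],
)

print(failed_tools(session))  # ['cancel_order']
```

A question like "which tool failed inside which sub-agent during this conversation" is a one-liner against a structure like this, whereas a stack trace only records which functions called which.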

They go further on evaluation: LLMs are now good enough that individual traces are no longer the most useful debugging tool. Identifying misbehavior takes aggregate metrics and trends across millions of traces. Offline evals are getting harder to design because the space of valid behaviors is very wide, so signals from online traffic are a more powerful diagnostic tool. In other words, hand-built eval sets can no longer keep up with the behavior space, and the real failures live in production.
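As a rough illustration of the aggregate view, the sketch below computes how often each signal fires per agent version across a batch of traces. The trace shape (a dict with `version` and `signals` keys) and the signal names are assumptions made up for this example, not Weave's format or its built-in signals.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def signal_rates(traces: Iterable[dict]) -> Dict[str, Dict[str, float]]:
    """Fraction of traces per version on which each signal fired at least once."""
    totals: Counter = Counter()
    hits: Dict[str, Counter] = defaultdict(Counter)
    for trace in traces:
        version = trace["version"]
        totals[version] += 1
        # set() so a signal firing twice in one trace still counts as one trace
        for signal in set(trace["signals"]):
            hits[version][signal] += 1
    return {
        version: {signal: count / totals[version] for signal, count in hits[version].items()}
        for version in totals
    }


# Hypothetical sample data: eight traces across two agent versions.
traces: List[dict] = [
    {"version": "v1", "signals": []},
    {"version": "v1", "signals": ["tool_error"]},
    {"version": "v1", "signals": []},
    {"version": "v1", "signals": []},
    {"version": "v2", "signals": ["tool_error"]},
    {"version": "v2", "signals": ["tool_error", "user_frustration"]},
    {"version": "v2", "signals": []},
    {"version": "v2", "signals": []},
]

rates = signal_rates(traces)
print(rates["v1"])  # {'tool_error': 0.25}
print(rates["v2"]["tool_error"])  # 0.5
```

No single trace in this sample looks alarming, but the doubling of the `tool_error` rate from v1 to v2 is exactly the kind of trend that only shows up in aggregate.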
