# How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

Podcast: How I AI
Published: Jun 17, 2026
Reading time: 16 min
Canonical: https://podbrew.app/briefs/how-i-ai-how-braintrust-uses-ai-agents-evals-and-ci-to-ship-better-software-anku

In this episode of How I AI, Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform, explains how AI agents are tackling deeply technical architecture and infrastructure challenges in software development.

The discussion covers how coding agents perform exhaustive benchmarking for tasks like database optimization, using tools such as Codex. We explore the 'agent line' framework for delegating decisions to AI and demystify evaluations as a critical method for improving AI product quality without constant manual adjustments.

This conversation is aimed at senior engineers, staff engineers, and engineering leaders. It shows how agent-driven approaches enable more rigorous testing, scale quality across teams, and raise engineering velocity, so organizations can ship better software faster and more reliably.

## Key takeaways

- Direct AI models with "hard evals" that address real-world software performance issues, like slow queries, rather than generic AI benchmarks.

- Define clear tests and success criteria for AI models to autonomously identify and implement creative solutions to these complex problems.

- AI coding agents can automatically and exhaustively test various database optimization techniques, such as column store formats and execution engines, to significantly speed up complex queries.

- AI agents can conduct significantly more rigorous and continuous benchmarking for complex engineering problems like database optimization, leading to the discovery of solutions that human efforts might miss.

- Staff engineers' skepticism about AI's ability to handle intricate tasks is challenged by agents' capacity to test performance changes more thoroughly than typical human-led benchmarking.

- AI agents enhance engineering quality by mitigating human challenges like context loss and decaying attention on hard, tedious problems, allowing for consistent effort.

- Integrating AI enables organizations to tackle more ambitious and prolonged technical projects, like deep database optimizations, which were previously impractical due to high human resource costs.

- The 'agent line' concept encourages evaluating tasks to determine if they can be equivalently handled by an AI agent, thereby optimizing personal and team productivity.

- Managing multiple concurrent AI agents through tools like tmux sessions for local foreground tasks and remote instances for large-scale experiments allows for efficient and resource-appropriate task execution.

- Utilizing AI models like Codex, known for their ability to disagree and offer alternative perspectives, is highly beneficial when tackling challenging technical problems.

- Cloud development environments and remote computing are becoming indispensable for managing data-heavy engineering tasks that would otherwise strain local machines.

- Constant pressure to use AI agents can lead to "productivity anxiety" and burnout, making individuals feel obligated to be always-on.

- "Chunking your time with AI" helps manage this anxiety and maintain work-life balance by setting explicit periods for AI engagement.

- AI development prioritizes defining the desired 'what' (outcome) over the 'how' (implementation) of a solution.

- Evals function as the contemporary version of Product Requirements Documents (PRDs), specifically outlining success criteria for AI products.

- AI models like GPT-5.4 can automatically generate scoring functions for evaluating agent outputs based on user-defined criteria.

- Over-reliance on "vibe checks" (testing on a few examples) leads to a "whack-a-mole" development cycle where fixing one problem introduces others; quantitative evals are necessary first.

- Subjective design taste can be quantified by translating a designer's specific judgments and feedback into objective evaluation criteria and scoring functions.

- Encoding a designer's aesthetic 'palette' into a system allows their high-quality judgment to be applied consistently and at scale across more product features.

- For AI product development, the primary engineering focus should be on building pipelines for evaluating the AI with real-world data, establishing critical feedback loops for improvement.

## 04:00 - 04:19 Leveraging Hard Evals to Guide AI in Software Improvement

To make AI models truly effective in software engineering, the focus should be on creating "hard evals." These are not general AI benchmarks but real-world performance challenges, such as diagnosing why a specific database query is executing slowly.

By establishing precise tests and success criteria tailored to these difficult problems, AI models can be empowered to work autonomously. This setup allows the models to explore innovative solutions in the background, addressing complex issues without constant human oversight.

This approach enables AI to move beyond theoretical improvements, directly impacting the quality and performance of software systems by solving tangible, business-critical problems. It shifts AI's role from evaluation to active, creative problem-solving in production environments.

> One of the best things that we can do is create really hard evals, not like AI evals, but things like, 'Why is this query so slow?'

## 04:19 - 08:02 Accelerating Database Optimization and Complex Data Migrations with AI Agents

Engineers often face challenges optimizing database queries, especially when dealing with billions of data traces and arbitrary queries that need to pinpoint specific interactions. Manually testing various database techniques like different indexes, prefetching methods, or column store formats for performance improvements is incredibly time-consuming and difficult to scale.

Ankur's team tackles this by identifying slow query patterns and then using a coding agent to automatically test a wide range of database optimization ideas. For instance, the agent can exhaustively try different open-source column store formats and execution engines, computing a matrix of their performance to find the most efficient solution.
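The "matrix of performance" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the format and engine names are placeholders, and `fake_run_query` is a stand-in for whatever harness would actually load data in a given format and run the slow query on a given engine. None of these names come from the episode.

```python
import itertools
import time
from typing import Callable, Dict, List, Tuple

# Placeholder candidates (assumptions, not real products named in the episode).
FORMATS = ["format_a", "format_b"]
ENGINES = ["engine_x", "engine_y"]


def benchmark_matrix(
    run_query: Callable[[str, str], None],
    formats: List[str],
    engines: List[str],
    repeats: int = 3,
) -> Dict[Tuple[str, str], float]:
    """Time run_query for every (format, engine) pair, keeping the best of `repeats` runs."""
    results: Dict[Tuple[str, str], float] = {}
    for fmt, engine in itertools.product(formats, engines):
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(fmt, engine)
            timings.append(time.perf_counter() - start)
        results[(fmt, engine)] = min(timings)
    return results


def fastest(results: Dict[Tuple[str, str], float]) -> Tuple[str, str]:
    """Return the (format, engine) pair with the lowest measured time."""
    return min(results, key=results.get)


def fake_run_query(fmt: str, engine: str) -> None:
    """Hypothetical workload: sleeps for a duration that depends on the pair."""
    cost = {"format_a": 0.002, "format_b": 0.001}[fmt]
    cost += {"engine_x": 0.002, "engine_y": 0.001}[engine]
    time.sleep(cost)
```

Calling `benchmark_matrix(fake_run_query, FORMATS, ENGINES)` returns one timing per combination, and `fastest(...)` picks the winner. An agent would replace `fake_run_query` with real queries and extend the candidate lists.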

Large-scale infrastructure or platform changes are inherently risky, making engineering teams risk-averse to significant shifts in core components. AI coding agents, specifically those leveraging models like GPT and Codex, offer a unique way to mitigate this risk. They enable programmatic testing against complex, long-tail data structures to validate potential solutions before full implementation.

Claire, the show's host, shares an example in which AI agents were crucial for a very complex data migration: moving millions of rows of messy, AI-generated structured and unstructured data between schemas, with numerous edge cases. Humans can script such migrations, but the agents excel at running repeated migrate-and-validate cycles and reporting whether a change is correct or needs a different approach.

> It's like the thing that you shipped is the thing that you get stuck with, certainly on, on the engineering side. And what I love about AI right now and these coding agents in particular, and then Codex in particular, has been the only setup where I have been able to set up a very similar process, which is the outcome I want is X Y Z.

## 08:02 - 12:02 AI Agents Outperform Human Benchmarking in Database Optimization

AI agents demonstrate superior capabilities in database optimization by conducting highly iterative and extensive benchmarking, often utilizing production-like or even live production data. This approach allows agents to rigorously test performance changes and uncover solutions that human engineers might overlook due to time and resource constraints.

Ankur Goyal, with two decades of experience in database work, directly challenges the skepticism of staff engineers who believe AI cannot handle complex engineering tasks. He points out that even top human engineers typically run only a few benchmarks for performance optimizations, sometimes making assumptions about less-prioritized tests, which can lead to flawed outcomes.

For instance, his team used AI agents to run a week of continuous experiments to evaluate different index types, leading to the discovery that Bloom filters, despite their poor reputation, were effective for a specific problem in Braintrust. This process involved not only checking for query speed improvements but also thoroughly benchmarking potential slowdowns.
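Checking for slowdowns as well as speedups can be sketched as a per-query comparison between a baseline and a candidate index. The query names, latencies, and 5% noise tolerance below are invented for illustration and are not Braintrust's numbers.

```python
from typing import Dict, List, Tuple


def compare_latencies(
    baseline_ms: Dict[str, float],
    candidate_ms: Dict[str, float],
    tolerance: float = 0.05,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Split queries into (faster, slower) lists of (query, relative_change).

    Relative changes smaller than `tolerance` are treated as noise and ignored.
    """
    faster: List[Tuple[str, float]] = []
    slower: List[Tuple[str, float]] = []
    for query, base in baseline_ms.items():
        change = (candidate_ms[query] - base) / base
        if change < -tolerance:
            faster.append((query, change))
        elif change > tolerance:
            slower.append((query, change))
    # Largest improvement first, largest regression first.
    faster.sort(key=lambda item: item[1])
    slower.sort(key=lambda item: -item[1])
    return faster, slower


# Hypothetical measurements in milliseconds.
baseline = {"point_lookup": 120.0, "range_scan": 80.0, "full_text": 200.0}
candidate = {"point_lookup": 30.0, "range_scan": 96.0, "full_text": 202.0}
faster, slower = compare_latencies(baseline, candidate)
```

In this made-up data the candidate makes `point_lookup` much faster but slows `range_scan` by 20%. A benchmark that only looked at the target query would miss that trade-off.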

Ultimately, the practical application of AI agents in benchmarking far exceeds the capacity of any staff engineer to manually perform the same volume of rigorous tests, algorithm comparisons, and detailed analyses. This ensures a much higher quality of testing and more reliable performance outcomes in critical engineering areas.

> There's no staff engineer who is running as many rigorous benchmarks and trying out different algorithms and analyzing ideas manually than someone who's using an agent.

## 12:02 - 14:02 AI Elevates Engineering Quality and Scope by Counteracting Human Limitations

AI agents significantly improve the practical quality of engineering work, particularly on hard and tedious problems. Human engineers often lose context over several days and their attention decays, so the quality of their solutions drops. Agents can keep working through these problems consistently, which raises overall quality.

This capability allows companies to undertake much more interesting and complex technical challenges than before. For example, a business might hesitate to dedicate a team of staff engineers for a year to solve deep database indexing problems due to the perceived cost and time commitment. With AI as a co-pilot, such ambitious projects become practically feasible.

Integrating AI into the engineering process lets organizations afford deep technical investigations. Agents can make steady progress on complex issues in the background while human teams ship other products. The result is a higher standard of rigor and performance, with fewer excuses for a backlog of unresolved issues or minor inefficiencies.

> there's just no excuse to not have rigor. Like, if-- and there's no excuse to not have performance.

## 14:02 - 18:03 Ankur Goyal's Framework for Delegating Tasks to AI and Managing Concurrent Workflows

Ankur Goyal introduces the 'agent line,' a framework for evaluating tasks. It suggests that if information can be given to an AI agent to solve a problem equivalently, then that task falls below the agent line. The goal is to continuously push this line upwards by developing smarter AI skills and integrations within a company.

By effectively delegating tasks to AI agents, individuals can free up significant time, enabling them to focus on 'maker schedule' work. Ankur personally structures his day to ensure dedicated focus, abstaining from meetings after noon to maximize his time for coding and deep work, which he finds fulfilling.

His personal workflow involves managing five or six concurrent foreground agents on his computer, each in its own tmux session with a sequential name (for example, 'Braintrust one' through 'Braintrust four'). Each session runs its own UI and services, although problems like port collisions show the need for better isolation when working on complex software.
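One common way to sidestep port collisions between parallel checkouts is to give each session its own block of ports. The sketch below is an assumption about how that could look, not a description of Ankur's setup: the base port, block size, service names, and `braintrust-N` session names are all hypothetical. It relies on the `-e` option of `tmux new-session`, which sets a session environment variable in recent tmux releases (3.2 and later, to the best of my knowledge).

```python
import subprocess
from typing import Dict, List

BASE_PORT = 3000          # assumed first port
PORTS_PER_SESSION = 10    # assumed block size reserved per session
SERVICES = ["ui", "api"]  # hypothetical services each checkout runs


def ports_for(session_index: int) -> Dict[str, int]:
    """Give each session a disjoint block of ports, one port per service."""
    start = BASE_PORT + session_index * PORTS_PER_SESSION
    return {service: start + offset for offset, service in enumerate(SERVICES)}


def tmux_command(session_index: int, workdir: str) -> List[str]:
    """Build a command that starts a detached, named tmux session in workdir."""
    command = ["tmux", "new-session", "-d", "-s", f"braintrust-{session_index}", "-c", workdir]
    for service, port in ports_for(session_index).items():
        # Exposes e.g. UI_PORT=3010 inside the session (requires tmux with -e support).
        command += ["-e", f"{service.upper()}_PORT={port}"]
    return command


def launch(session_index: int, workdir: str) -> None:
    """Start the session; raises CalledProcessError if tmux reports a failure."""
    subprocess.run(tmux_command(session_index, workdir), check=True)
```

Each checkout then reads `UI_PORT` and `API_PORT` from its environment instead of hard-coding a port, so sessions started with different indexes do not fight over the same listener.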

For resource-intensive tasks, Ankur uses remote agents. For example, he runs experiments to improve column store performance and to measure real latency between EC2 and S3 with 4,000 concurrent reads. Such tasks would overwhelm a local machine but are essential for understanding system behavior at scale. He also uses Codex for hard problems because it tends to disagree and offer alternative perspectives, which is valuable for complex problem-solving.
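The shape of a concurrent-read latency experiment can be sketched with `asyncio`. This is a local simulation only: `fake_read` sleeps for a random interval and stands in for a real object-store read, and the helper names are made up for this example.

```python
import asyncio
import math
import random
import time
from typing import Awaitable, Callable, List


async def timed_read(
    read: Callable[[int], Awaitable[None]], key: int, gate: asyncio.Semaphore
) -> float:
    """Run one read under the concurrency limit and return its latency in seconds."""
    async with gate:
        start = time.perf_counter()
        await read(key)
        return time.perf_counter() - start


async def run_load(
    read: Callable[[int], Awaitable[None]], total_reads: int, concurrency: int
) -> List[float]:
    """Issue total_reads reads with at most `concurrency` in flight; return sorted latencies."""
    gate = asyncio.Semaphore(concurrency)
    latencies = await asyncio.gather(
        *(timed_read(read, key, gate) for key in range(total_reads))
    )
    return sorted(latencies)


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted list."""
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(0, rank - 1)]


async def fake_read(key: int) -> None:
    """Hypothetical stand-in for a remote read: sleeps 1 to 5 ms."""
    await asyncio.sleep(random.uniform(0.001, 0.005))
```

For example, `latencies = asyncio.run(run_load(fake_read, total_reads=4000, concurrency=4000))` followed by `percentile(latencies, 99)` gives a tail-latency figure. Against a real store, the numbers only mean something when the client runs close to the data, which is why this kind of test belongs on a remote instance.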

> if we equivalently took the information that we're discussing and we just gave it to an agent, would it solve the same problem?

## 18:03 - 20:03 Essential Setup for AI-Accelerated Engineering Workflows

While commercial background agents work well for standard web applications, a growing number of engineering teams are now building their own custom background agents. This trend is occurring in organizations of all sizes, indicating a move towards more tailored internal coding agents, even when leveraging core infrastructure from major model providers.

A significant development is the increasing investment in cloud development environments and remote computing. This is crucial because running data-heavy engineering processes locally can severely impact machine performance. Remote environments allow engineers to handle resource-intensive tasks more efficiently.

Managing concurrent engineering processes on local machines, such as handling multiple ports, remains a common challenge. Tools like the one released by Chris Tate at Vercel can help streamline this. Beyond tooling, a key practice is to intentionally dedicate time for coding, emphasizing its importance for sustained productivity.

> make time to code. You need it every night. Every one.

## 20:03 - 22:04 Reclaiming Flow and Managing Productivity Anxiety with AI Tools

When working with AI agents, many developers initially lose the "flow state" they once experienced during hands-on coding. However, by adapting workflows and intentionally using AI, it's possible to re-enter this focused and productive state, even with new tools.

A common challenge with AI-augmented work is "productivity anxiety," where individuals feel compelled to constantly engage with AI agents, even during meetings or personal conversations. This pressure to always be "kicking off agents" can lead to burnout and a sense of doing something wrong if not constantly connected.

To counter this, it's beneficial to intentionally "chunk" time with AI. This strategy involves dedicating specific periods for AI-assisted tasks, allowing for focused work without the constant pressure to be online or interacting with agents. This approach helps maintain a healthier work-life balance and reduces anxiety.

Establishing clear boundaries, such as closing laptops during family dinner, is crucial for preserving personal time. This practice helps prevent technology from encroaching on personal life and reinforces the importance of downtime away from AI tools.

> I think what I see is that people feel like if they're in a meeting and they're not kicking off agents, they're doing something wrong, or, or if they're talking to somebody and they're not kicking off agents, they're doing something.

## 24:03 - 26:04 Evals Serve as Modern Product Requirements Documents for AI

AI development, particularly with large transformer-based models, shifts the programming focus from defining 'how' a solution should work to clearly articulating 'what' the desired outcome is. Just as statistical regression computes a slope and y-intercept from a set of data points, AI systems are given the objective and then work out the underlying mechanism to achieve it.
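The regression analogy can be made concrete with ordinary least squares: the data points state the 'what', and the fitting procedure derives the 'how' (the slope and intercept). The data below is invented for illustration.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The "what": four observed points that happen to lie on y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Nobody wrote the rule "multiply by 2 and add 1" by hand; it falls out of the examples. Evals play the role of the data points for an AI product.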

This focus on the 'what' is central to the concept of evals. Evals are a methodology designed to explicitly state 'what success looks like' for an AI product. They are considered the modern equivalent of a Product Requirements Document (PRD), which traditionally describes success criteria in prose.

The key difference between evals and conventional PRDs lies in their quantifiability. While both use prose and provide examples like user stories, evals encode these user stories and success metrics in a way that can be measured. This structured approach allows an AI model or system to independently determine the 'how' (the implementation details) necessary to achieve the clearly defined 'what' (the success criteria).
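As a rough sketch of what encoding a user story in a measurable way might look like, the example below pairs each story with an input and a simple check, then averages the scores. The `EvalCase` structure and the keyword-based scorer are assumptions made for illustration. They are not Braintrust's SDK, and real scorers are usually richer than keyword matching.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    story: str               # the PRD-style user story, kept as prose
    input: str               # what the user would actually ask
    must_mention: List[str]  # measurable stand-in for "success" (non-empty)


def score_case(case: EvalCase, output: str) -> float:
    """Fraction of required terms that appear in the output (case-insensitive)."""
    text = output.lower()
    hits = sum(1 for term in case.must_mention if term.lower() in text)
    return hits / len(case.must_mention)


def run_eval(cases: List[EvalCase], task: Callable[[str], str]) -> float:
    """Run the task on every case and return the mean score."""
    return sum(score_case(case, task(case.input)) for case in cases) / len(cases)


# Hypothetical cases and a fake task standing in for a model call.
cases = [
    EvalCase("As a new user, I want to log my first trace.",
             "How do I log a trace?", ["trace", "API key"]),
    EvalCase("As an analyst, I want to upload test data.",
             "How do I upload a dataset?", ["dataset", "CSV"]),
]
mean_score = run_eval(cases, lambda question: "Set your API key, then log a trace.")
```

The prose story stays readable for humans, while `must_mention` gives a number that can be tracked as the prompt or model changes.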

> In my opinion, Evals are actually the modern version of a PRD.

## 26:04 - 28:04 Generating AI Scoring Functions for Documentation Evals

A demonstration shows how to build an evaluation for AI agents designed to answer questions about Braintrust's documentation. The process begins by populating a dataset with common user questions, which can be uploaded via CSV or auto-generated. A basic prompt is then configured, utilizing a model like GPT-5.4 mini and optionally integrating with an MCP server or Context 7 for indexing documentation.

Once the model generates answers to the questions, manually reviewing each response for quality can be inefficient. To streamline this, the demo illustrates a method where GPT-5.4 is instructed to automatically generate a scoring function.

Users can define specific criteria for the scoring function, such as ensuring code snippets are concise, that only one programming language is used, or that particular stylistic elements like em dashes are avoided. GPT-5.4 then analyzes the outputs against these criteria and produces a new, tailored scoring mechanism.
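A scoring function of the kind described might look roughly like the following. This is a hand-written sketch rather than the function the model generated in the demo: the 15-line conciseness threshold, the equal weighting, and the reliance on fenced code blocks with language tags are all assumptions.

```python
import re
from typing import Dict, List, Tuple

FENCE = "`" * 3            # markdown code fence, built indirectly to keep this block readable
MAX_SNIPPET_LINES = 15     # assumed threshold for "concise"
EM_DASH = "\u2014"

# Matches a fenced block and captures its language tag and body.
CODE_BLOCK = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)


def extract_snippets(answer: str) -> List[Tuple[str, str]]:
    """Return (language, body) for each fenced code block in the answer."""
    return [(lang.lower(), body) for lang, body in CODE_BLOCK.findall(answer)]


def score_answer(answer: str) -> Dict[str, float]:
    """Score one answer on the three criteria, plus an equally weighted overall score."""
    snippets = extract_snippets(answer)
    languages = {lang for lang, _ in snippets if lang}
    concise = all(
        len(body.strip().splitlines()) <= MAX_SNIPPET_LINES for _, body in snippets
    )
    scores = {
        "concise_snippets": 1.0 if concise else 0.0,
        "single_language": 1.0 if len(languages) <= 1 else 0.0,
        "no_em_dashes": 0.0 if EM_DASH in answer else 1.0,
    }
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores
```

Deterministic checks like these are cheap to run on every output. Criteria that need judgment, such as whether the answer is actually correct, would need a different kind of scorer.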

> Hey, can you come up with a good scoring function for these outputs, I care about having concise code snippets, only using one language. And, let's say avoiding em dashes.

## 28:04 - 32:04 Using AI Playgrounds for Safe Agent Development and the Limitations of Vibe-Check Evaluation

Running AI agents, especially coding agents in an "unhinged mode," can pose risks when executed on a local machine. However, performing these operations within a controlled playground environment significantly reduces danger. In a playground, agents can safely experiment with data and prompts without jeopardizing the user's system, allowing for robust testing and iteration in a contained space.

A practical approach involves having AI build a scorer to evaluate how well agents answer questions based on defined criteria. Instead of manually writing these evaluation criteria, a model can generate them, saving significant effort. These evaluations can then be run, and results viewed in aggregate, providing a quantitative measure of performance across numerous examples.

The alternative "vibe check" method, where developers test on a few examples and generalize, proves less effective. While initial vibe checks might seem important, this approach often leads to a "whack-a-mole" game: fixing one issue only causes another to surface after deployment. This cycle prevents consistent improvement and reliable agent behavior.

To overcome this, it is crucial to first conduct extensive quantitative evaluations. Running "a shit ton of evals" helps systematically improve agent performance before seeking subjective feedback. Only after these data-driven assessments show good results should a more qualitative "vibe check" be performed, and it often still reveals flaws that the numbers missed.
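The whack-a-mole effect becomes visible when two eval runs are compared case by case instead of spot-checking a few examples. The sketch below assumes per-case scores between 0 and 1 and a pass threshold of 0.5. The case IDs and scores are invented.

```python
from typing import Dict, List, Union


def diff_runs(
    before: Dict[str, float], after: Dict[str, float], threshold: float = 0.5
) -> Dict[str, Union[float, List[str]]]:
    """Compare two runs over the same cases: aggregate means plus fixed and broken cases."""
    fixed = sorted(c for c in before if before[c] < threshold <= after[c])
    broken = sorted(c for c in before if after[c] < threshold <= before[c])
    return {
        "mean_before": sum(before.values()) / len(before),
        "mean_after": sum(after.values()) / len(after),
        "fixed": fixed,
        "broken": broken,
    }


# Hypothetical scores before and after a prompt change aimed at fixing q1.
before = {"q1": 0.2, "q2": 0.9, "q3": 0.8, "q4": 0.7}
after = {"q1": 0.9, "q2": 0.9, "q3": 0.3, "q4": 0.2}
report = diff_runs(before, after)
```

Here the change fixes `q1`, the example someone was staring at, while breaking `q3` and `q4` and lowering the mean. A vibe check on `q1` alone would have called the change a success.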

> if you do this, you end up playing kind of like a whack-a-mole game.

## 32:04 - 34:04 Quantifying Design Expertise and Aesthetic Taste with Evaluation Systems

Braintrust is working to quantify the subjective 'taste' of their designer, David, by translating his design judgments into objective evaluation scorers. When David provides feedback, like specific conditions under which showing multiple languages is acceptable, Ankur and his team incorporate this expertise to refine their scoring functions.

The goal is to encode David's aesthetic sense into a repeatable system. This allows them to avoid repeating past design errors and enables David's influence to scale across more product features. Instead of him needing to provide feedback on every single instance, his 'palette' can be applied consistently.
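One way to picture encoding a designer's judgment is as a list of named rules, each carrying the designer's note and a check. Everything specific below is hypothetical: the episode does not state the actual conditions under which multiple languages are acceptable, so the "user asked for a comparison" exception and the sample fields are invented purely to show the mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DesignRule:
    name: str
    note: str                       # the designer's feedback, kept in their own words
    passes: Callable[[dict], bool]  # the measurable version of that feedback


def single_language_unless_comparing(sample: dict) -> bool:
    """Hypothetical rule: one language per answer, unless the user asked to compare."""
    languages = set(sample["snippet_languages"])
    return len(languages) <= 1 or sample.get("user_asked_for_comparison", False)


RULES: List[DesignRule] = [
    DesignRule(
        name="single_language",
        note="Multiple languages are fine only when the user asks for a comparison.",
        passes=single_language_unless_comparing,
    ),
]


def apply_rules(sample: dict, rules: List[DesignRule]) -> Dict[str, bool]:
    """Evaluate every rule against one sample and report pass or fail per rule."""
    return {rule.name: rule.passes(sample) for rule in rules}
```

Each new piece of feedback becomes another entry in `RULES`, so the same judgment is applied to every output without the designer reviewing each one.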

This approach lets David's design judgment reach a broader range of product elements than he could review by hand, which raises the overall quality bar.

This process exemplifies how individual expertise can be systematically captured and scaled. By building a system that embodies a designer's specific 'vibe check' and taste, the organization can achieve higher quality outcomes across a larger scope of work.

> We're able to have David's palette applied to more things, like the, the-- I think the quality bar is, or that we're able to hit is higher because we're able to get more things to that, to that bar.

## 34:04 - 38:04 Managing Engineering Velocity Through Feature Carving and Continuous Integration

Product development often leads to quickly building features that introduce too much complexity. The approach to managing customer velocity and improving the user experience is to "carve" away unnecessary features. For instance, a powerful but confusing search implementation was simplified after user complaints to make it more intuitive, demonstrating that removing complexity often leads to a better product.

To manage technical throughput effectively, continuous investment in Continuous Integration (CI) is critical. Instead of pushing out low-quality code when constrained, teams should pause and improve their CI processes. This dedication to CI is presented as a foundational step that "earns the ability to move faster" for the engineering team.

For AI product teams, the fundamental task is establishing robust feedback loops. This involves creating a pipeline that can transform real-world data into actionable evaluations. This process, akin to CI for traditional software, is considered more vital than aspects like prompt engineering or selecting agent frameworks, as it ensures the AI product learns and improves from actual usage.
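A minimal version of such a pipeline might turn logged production interactions into eval cases. The trace schema here (`id`, `input`, `output`, `feedback`) is an assumption for illustration and does not reflect any particular logging format.

```python
from typing import Dict, List


def traces_to_eval_cases(traces: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep interactions users marked as bad, deduplicated by question, as eval cases."""
    cases: List[Dict[str, str]] = []
    seen = set()
    for trace in traces:
        if trace.get("feedback") != "thumbs_down":
            continue
        question = trace["input"].strip()
        if not question or question in seen:
            continue
        seen.add(question)
        cases.append(
            {
                "input": question,
                "bad_output": trace["output"],  # kept so a scorer can check the fix
                "source_trace_id": trace["id"],
            }
        )
    return cases


# Hypothetical production logs.
logs = [
    {"id": "t1", "input": "How do I upload a CSV?", "output": "...", "feedback": "thumbs_down"},
    {"id": "t2", "input": "How do I upload a CSV? ", "output": "...", "feedback": "thumbs_down"},
    {"id": "t3", "input": "What is a scorer?", "output": "...", "feedback": "thumbs_up"},
]
eval_cases = traces_to_eval_cases(logs)
```

Running this on a schedule, and re-running the evals whenever the dataset grows, gives AI products the same kind of feedback loop that CI gives ordinary code.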

Applying evaluation practices extends to internal uses of AI tools, too. For example, some teams run evaluations on how engineers interact with AI code generation tools. This analysis helps identify pain points, understand where users give up, or if agents are requesting excessive permissions, leading to better outcomes and more effective tool integration.

> How do I accelerate my engineering velocity with AI? I was like, Fix your CI.

## 38:04 - 40:05 Ankur Goyal's Hands-On Strategy for Fixing Evaluation Failures

Ankur Goyal described a challenge encountered while moving a workload that runs millions of tokens per second on an open-source model from model A to model B, a scale at which every optimization matters. An existing evaluation script, initially "vibe coded," had become stuck.

When he reviewed the problematic script, Ankur found a confusing, 3,000-line mess of scoring functions. To resolve this, he spent a weekend hand-writing a new evaluation script from scratch, without the aid of AI tools like Copilot or autocomplete.

This manual approach served a dual purpose: it directly solved the problem of the failing eval and, crucially, deepened his personal understanding of the underlying issue. The problem was resolved by the end of Sunday, emphasizing the value of restarting with a focused, hands-on approach.

> partly to improve my own understanding of the problem, I hand-wrote the eval, and then by the end of Sunday, the problem was solved.
