Apr 30, 2026 Usecase

Using LLM telemetry to improve prompts with GEPA

OneQuery Maintainers 9 min read

A practical workflow for using OneQuery to let a GEPA reflection agent inspect Laminar LLM telemetry while improving prompts.

What is GEPA?

GEPA, short for Genetic-Pareto, is a text evolution engine for optimizing prompts, code, agent architectures, configurations, policies, and other artifacts that can be represented as text. The core idea is to use LLM-based reflection over execution traces, then keep candidates through a Pareto-efficient search process instead of collapsing everything into one aggregate score.

That makes GEPA especially useful when failures come with diagnostic context. A reflection model can inspect why a candidate failed, propose a targeted change, and preserve candidates that solve different subsets of the task. The official GEPA guides are here: https://gepa-ai.github.io/gepa/guides/
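To make "preserve candidates that solve different subsets of the task" concrete, here is a minimal sketch of Pareto-style retention over per-example scores. It is a simplified illustration and not GEPA's actual selection code; the candidate names and score vectors are made up for the example.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if a is at least as good as b on every example and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, vec in scores.items()
        if not any(
            dominates(other_vec, vec)
            for other_name, other_vec in scores.items()
            if other_name != name
        )
    ]


# Hypothetical per-example scores (1 = solved, 0 = failed) for three prompts.
scores = {
    "base": [1, 0, 0, 0],
    "casework": [1, 1, 0, 0],
    "verifier": [0, 0, 1, 0],
}

front = pareto_front(scores)
print(front)  # ['casework', 'verifier']
```

Ranking by average score alone would favor "casework" and could discard "verifier", even though "verifier" is the only candidate that solves the third example. Keeping the non-dominated set preserves that complementary behavior for later rounds.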

In practice, the loop has three roles. The executor runs a candidate on tasks and records full traces. The reflector reads those traces to diagnose failure modes and causal patterns. The curator turns those diagnostic insights into an improved candidate, then sends it back through the loop for the next evaluation round.
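The three roles can be sketched as one function that takes the executor, reflector, and curator as callables. This is only an outline of the control flow under stated assumptions: `Trace`, `run_round`, and the toy stand-ins are hypothetical names, and a real run would call an LLM in each role.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trace:
    task: str
    output: str
    correct: bool


def run_round(
    candidate: str,
    tasks: List[str],
    execute: Callable[[str, str], Trace],
    reflect: Callable[[List[Trace]], str],
    curate: Callable[[str, str], str],
) -> str:
    """Run one executor -> reflector -> curator pass and return the next candidate."""
    traces = [execute(candidate, task) for task in tasks]  # executor records traces
    diagnosis = reflect(traces)                            # reflector reads traces
    return curate(candidate, diagnosis)                    # curator edits the candidate


# Toy stand-ins so the sketch runs without an LLM.
def toy_execute(candidate: str, task: str) -> Trace:
    ok = "verify" in candidate.lower()
    return Trace(task=task, output="checked" if ok else "guessed", correct=ok)


def toy_reflect(traces: List[Trace]) -> str:
    failed = [t for t in traces if not t.correct]
    return "answers were not verified" if failed else ""


def toy_curate(candidate: str, diagnosis: str) -> str:
    if not diagnosis:
        return candidate  # nothing to fix, keep the candidate as-is
    return candidate + " Verify the arithmetic before answering."


tasks = ["p1", "p2"]
first = run_round("Solve the problem.", tasks, toy_execute, toy_reflect, toy_curate)
second = run_round(first, tasks, toy_execute, toy_reflect, toy_curate)
```

In the toy run, the first round appends a verification instruction and the second round leaves the candidate unchanged because no trace fails.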

Figure: GEPA pipeline diagram showing executor, reflector, and curator stages with a feedback loop to the next candidate.

Why telemetry matters for GEPA

Prompt optimization often starts with a simple loop: run a program on examples, score the outputs, collect failures, and ask an LLM to rewrite the instruction. That works better than manual prompt editing, but it leaves useful evidence on the table.

A wrong answer is rarely just a wrong final token. The useful signal is in the trajectory: which prompt was used, what reasoning the model wrote, whether the first solve call was already wrong, whether the review call caught the mistake, which examples were similar, and whether failures came from arithmetic, counting, formatting, or an unsupported shortcut.

LLM telemetry systems such as Laminar (https://laminar.sh/) capture that evidence. The question is how to let an optimization agent use it without handing the agent raw credentials, direct database access, or an unbounded query surface. That is where OneQuery fits.

The experiment shape

Figure: Diagram showing an AIME run split into solve and review prompts, Laminar trace spans, OneQuery trace lookup, and GEPA reflection.

The original GEPA AIME experiment solved each problem with a single LLM step. This experiment intentionally split the same AIME-style task into two LLM calls to show a different point: when a workflow has multiple LLM steps, connecting LLM telemetry to the reflection loop is enough to make the GEPA methodology easy to apply.

The first call solved the problem and produced an initial answer. The second call reviewed the initial answer and produced the final answer. Both calls ran under the same Laminar trace so a single problem trajectory could be inspected as a parent span plus solve and review child spans.
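The span layout described above can be illustrated with a small in-memory recorder. This is not the Laminar SDK; `Recorder`, `Span`, and the stand-in `solve` and `review` functions are hypothetical, and the sketch only shows how one parent span and two child spans end up sharing a trace.

```python
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent: Optional[str]
    output: str = ""


class Recorder:
    """Minimal stand-in for a tracing backend: nested spans share one trace id."""

    def __init__(self) -> None:
        self.spans: List[Span] = []
        self._stack: List[Span] = []
        self._trace_id = ""

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        if not self._stack:  # a top-level span starts a new trace
            self._trace_id = uuid.uuid4().hex
        parent = self._stack[-1].name if self._stack else None
        current = Span(name=name, trace_id=self._trace_id, parent=parent)
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            self._stack.pop()


def solve(problem: str) -> str:
    return "42"  # stand-in for the first LLM call


def review(problem: str, initial: str) -> str:
    return initial  # stand-in for the second LLM call


def run_problem(rec: Recorder, problem: str) -> str:
    with rec.span("aime_problem") as root:
        with rec.span("solve") as s:
            s.output = solve(problem)
        with rec.span("review") as r:
            r.output = review(problem, s.output)
        root.output = r.output
    return root.output


rec = Recorder()
final_answer = run_problem(rec, "toy problem")
```

Because both child spans carry the parent's trace id, a later lookup can retrieve the whole trajectory for one problem in a single query.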

The GEPA loop optimized two DSPy predictors: solve.predict and review.predict. That means both the solver prompt and the review prompt were open to improvement through telemetry-backed reflection, rather than the workflow being treated as one opaque prompt. For each candidate prompt, GEPA evaluated examples, built a reflective dataset from successes and failures, and asked a reflection agent to propose an improved instruction.

The important difference from a plain GEPA run was that the reflection agent had a OneQuery tool. Before proposing an instruction, it queried Laminar spans through OneQuery using the problem hash from the feedback. That let it inspect the exact solve and review telemetry for the examples GEPA was reflecting on.
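A lookup keyed by problem hash might be built roughly as follows. Everything specific here is an assumption for illustration: the post does not say how the hash was computed, and the `spans` table, its column names, and the `problem_hash` column are hypothetical rather than the actual Laminar or OneQuery schema.

```python
import hashlib
import re


def problem_hash(problem_text: str) -> str:
    """Hash whitespace-normalized problem text to a short hex key (assumed scheme)."""
    normalized = " ".join(problem_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def build_span_query(p_hash: str, limit: int = 20) -> str:
    """Build a bounded, read-only query for one problem's spans (hypothetical schema)."""
    # Validate the hash before interpolating it, so only hex characters reach the SQL.
    if not re.fullmatch(r"[0-9a-f]{16}", p_hash):
        raise ValueError("problem hash must be 16 lowercase hex characters")
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    return (
        "SELECT name, input, output, start_time "
        "FROM spans "
        f"WHERE problem_hash = '{p_hash}' "
        "ORDER BY start_time "
        f"LIMIT {limit}"
    )


key = problem_hash("Find the number of  ordered pairs (a, b).")
query = build_span_query(key)
print(query)
```

Validating the key and capping the row count keeps the agent's query surface narrow, which matches the goal of bounded reads.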

What the agent saw

Figure: Laminar trace view showing AIME GEPA spans, timeline, transcript, input, model output, and problem solver span details.

The Laminar spans made the failure mode visible. For a given problem, the agent could see whether the solve stage produced a plausible but unsupported answer, whether the review stage merely rubber-stamped it, or whether the review correctly identified an inconsistency.

That distinction matters because the remedy is different. If solve is hallucinating a counting formula, the solve instruction needs stronger casework and verification. If solve is mostly right but review fails to catch arithmetic mistakes, the review instruction needs to behave more like an independent verifier.
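The distinction can be expressed as a small classifier over the solve answer, the review answer, and the expected answer. The labels and the mapping to a target predictor are hypothetical; in the experiment the reflection agent made this judgment by reading spans, not by running code like this.

```python
def diagnose(solve_answer: str, review_answer: str, expected: str) -> str:
    """Label a solve/review trajectory by where it went wrong."""
    if review_answer == expected:
        return "ok" if solve_answer == expected else "review_fixed_solve"
    if solve_answer == expected:
        return "review_broke_correct_answer"
    if review_answer == solve_answer:
        return "review_rubber_stamped"
    return "both_stages_wrong"


# Which instruction a reflection step would most plausibly target for each label.
# A list is used where both prompts are implicated.
TARGET = {
    "ok": [],
    "review_fixed_solve": ["solve.predict"],
    "review_broke_correct_answer": ["review.predict"],
    "review_rubber_stamped": ["solve.predict", "review.predict"],
    "both_stages_wrong": ["solve.predict", "review.predict"],
}

label = diagnose(solve_answer="120", review_answer="120", expected="96")
print(label, TARGET[label])  # review_rubber_stamped ['solve.predict', 'review.predict']
```

Only the final answer is needed for scoring, but the label depends on the intermediate solve answer, which is exactly the information the trace provides.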

In one run, the reflection agent used OneQuery to inspect spans for a failed or partially successful trajectory, then proposed a review instruction that explicitly told the model not to restate the answer, to verify invariants and arithmetic, and to replace unsupported answers with a corrected conclusion.

A sample setup

Setting	Value
Task	AIME-style math problems with solve and review LLM calls
Dataset split	24 train examples, 12 validation examples, 15 test examples
Optimization budget	max-metric-calls set to 72
Telemetry	DSPy cache disabled so Laminar received fresh solve and review traces
OneQuery access	Reflection agent enabled for the onequery-demo org and laminar-aime-gepa source
Artifacts	Experiment repo, result JSON, markdown report, progress plot, and reflection transcripts: https://github.com/wordbricks/onequery-gepa

Results

The experiment repo is available at https://github.com/wordbricks/onequery-gepa. Its report shows a small validation improvement on a 12-example split. The base program solved 3 out of 12 validation examples. After OneQuery-backed telemetry reflection, the first accepted candidate reached 4 out of 12, and a later candidate reached 5 out of 12.

Figure: Diagram showing the actual initial and accepted GEPA prompt text for solve.predict and review.predict.

The prompt changes were not generic rewrites. The base solve prompt simply asked the model to derive an initial answer, while the accepted reflected solve prompt pushed it to reduce problems into clean cases, verify invariants, and avoid unsupported symmetry assumptions. Review prompt candidates were generated during reflection, but the saved accepted candidate kept the review prompt unchanged.

The final test result was 5 out of 15, or 33.33 percent. That is still modest. The useful signal for this use case is that telemetry did not make AIME easy, but it did give GEPA better evidence for diagnosing failures and producing targeted prompt candidates.

Checkpoint	Result
Base validation	3/12 correct, 25.00%
First accepted candidate	4/12 correct, 33.33%
Later accepted candidate	5/12 correct, 41.67%
Final test report	5/15 correct, 33.33%
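The percentages in the table follow directly from the counts. A short check, using only the counts reported above:

```python
# (correct, total) pairs taken from the checkpoint table.
checkpoints = {
    "Base validation": (3, 12),
    "First accepted candidate": (4, 12),
    "Later accepted candidate": (5, 12),
    "Final test report": (5, 15),
}

percentages = {
    name: round(100 * correct / total, 2)
    for name, (correct, total) in checkpoints.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```

With 12 validation examples, each additional solved problem moves the score by about 8.33 percentage points, so the two accepted steps correspond to one extra problem each.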

Figure: Graph showing best validation score rising from 25.00 to 33.33 to 41.67 percent across metric calls, with selected test score at 33.33 percent.

Lessons learned

Optimization needs auditability. The reflection transcripts showed the exact SQL the agent requested, whether OneQuery returned rows, which prompt candidate was proposed, and whether the candidate improved subsample and validation scores.
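One way to keep that audit trail is to write one JSON line per reflection step. The record shape below is an assumption modeled on the fields listed above, not the format used in the experiment repo, and the field values are placeholders rather than data from the run.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ReflectionRecord:
    predictor: str                     # which prompt was being reflected on
    sql: str                           # exact SQL the agent requested
    rows_returned: int                 # whether the lookup found any spans
    proposed_instruction: str          # candidate prompt text
    subsample_score: Optional[float]   # None until evaluated
    validation_score: Optional[float]  # None until evaluated


def to_jsonl_line(record: ReflectionRecord) -> str:
    """Serialize one reflection step as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder values for illustration only.
record = ReflectionRecord(
    predictor="review.predict",
    sql="SELECT name, output FROM spans LIMIT 20",
    rows_returned=2,
    proposed_instruction="Verify invariants and arithmetic; do not restate the answer.",
    subsample_score=None,
    validation_score=None,
)

line = to_jsonl_line(record)
restored = ReflectionRecord(**json.loads(line))
```

Appending such lines to a file gives a replayable record of what the agent looked at before each proposal.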

What comes next

Some workflows should give telemetry-backed GEPA even more room to improve prompts than this one did. Browser automation is one of them. A single task can involve many LLM decisions, repeated tool calls, page observations, retries, and partial failures, so pass/fail labels alone are a thin signal.

For those workflows, LLM telemetry is what makes self-reflection practical. The agent needs to inspect which page state it saw, which tool call it chose, what failed, and how the later steps recovered or compounded the mistake. Once those traces are available, GEPA can optimize prompts against the actual trajectory instead of only the final outcome.

We wanted to run those experiments too, but did not have enough time and budget for this pass. When Codex App Server supports LLM telemetry, we plan to revisit the experiment with richer multi-step agent workflows.

Why this is a good OneQuery use case

This is a natural OneQuery use case because the agent needs real operational data, but it should not hold direct access to the telemetry system. OneQuery turns telemetry lookup into a controlled tool call with org-scoped access, bounded reads, and a record of what was queried.
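The three properties named above (org-scoped access, bounded reads, and a record of what was queried) can be sketched as a thin wrapper around a query backend. This is not the OneQuery API; `GuardedQueryTool`, its parameters, and `fake_backend` are hypothetical and exist only to illustrate the pattern.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List

Row = Dict[str, str]


class GuardedQueryTool:
    """Read-only query tool pinned to one org and source, with a row cap and audit log."""

    def __init__(
        self,
        org: str,
        source: str,
        run_query: Callable[[str, str, str], List[Row]],
        max_rows: int = 50,
    ) -> None:
        self.org = org
        self.source = source
        self.max_rows = max_rows
        self.audit_log: List[Dict[str, object]] = []
        self._run_query = run_query

    def query(self, sql: str) -> List[Row]:
        statement = sql.strip().rstrip(";").strip()
        # Allow a single SELECT statement only; this is a coarse check, not a SQL parser.
        allowed = statement.lower().startswith("select") and ";" not in statement
        self.audit_log.append(
            {
                "at": datetime.now(timezone.utc).isoformat(),
                "org": self.org,
                "source": self.source,
                "sql": statement,
                "allowed": allowed,
            }
        )
        if not allowed:
            raise PermissionError("only single SELECT statements are allowed")
        rows = self._run_query(self.org, self.source, statement)
        return rows[: self.max_rows]  # bounded read


def fake_backend(org: str, source: str, sql: str) -> List[Row]:
    # Stand-in for the telemetry backend; always returns three span rows.
    return [{"name": "aime_problem"}, {"name": "solve"}, {"name": "review"}]


tool = GuardedQueryTool("onequery-demo", "laminar-aime-gepa", fake_backend, max_rows=2)
rows = tool.query("SELECT name FROM spans;")
```

The agent only ever holds the tool, not credentials for the backend, and rejected statements are logged alongside accepted ones.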

That pattern generalizes beyond Laminar and GEPA. Any prompt optimization, eval debugging, support triage, or agent repair loop can benefit from inspecting traces, product events, support tickets, warehouse rows, or observability data. The model should reason over the evidence, while OneQuery controls access to the evidence.

The practical architecture is simple: instrument the LLM app, expose the telemetry source through OneQuery, give the optimizer a narrow query tool, and make every reflection step save its transcript. The result is a prompt improvement workflow that is more grounded, more auditable, and safer than giving an agent direct access to the telemetry backend.
