How We Independently Arrived at OpenAI's ROI Validation Methodology
We've overestimated AI in the short term and underestimated it in the long term. In this uncertain middle ground, here's how we actually prove economic value—and why our process organically arrived at the same methodology as OpenAI's GDPval.
Written by Mike Borg, Co-founder and CEO
The discourse around AI and labor has become strangely bifurcated.
On one side, alarm bells. Dario Amodei talks about a potential “bloodbath” in labor markets. Economists warn we’re unprepared. The transformation narrative dominates conference keynotes.
On the other side, something quieter: we’ve moved past hype into an underestimation phase. ChatGPT’s novelty wore off. Many organizations have become comfortable dismissing LLMs as “not quite there yet” for serious operational work.
The reality is somewhere in between. We’ve overestimated AI in the short term and underestimated it in the long term—and we’re in the messy intermediary stage where both errors coexist.
This creates a practical problem: how do you actually measure whether AI can do economically valuable work in your specific context? Not theoretically. Not based on benchmarks designed by AI labs. In your workflows, with your data, evaluated by your experts.
Yesterday we published our thinking on Harness Engineering. Today: how we prove it works.
How We Work
Every customer engagement starts the same way. We ask for a high-ROI workflow with historical outcomes, clear success criteria, and at least one example that succeeded and one that failed. No cherry-picking.
We run our agent on the same inputs the historical process used. The customer’s domain experts evaluate the results against what actually happened. The evaluation is functionally blind—the historical execution predates our agent, so there’s no contamination. Real outcomes serve as ground truth.
This isn’t a demo. It’s a trial.
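To make the trial structure concrete, here is a minimal sketch of the kind of blind comparison described above. Every name in it (Case, build_blind_review, run_agent, and so on) is illustrative rather than drawn from our product or from GDPval; the only assumptions are the ones stated in the prose: each historical case carries the original inputs, the output the historical process produced, and the recorded real-world outcome.

```python
import random
from dataclasses import dataclass

@dataclass
class Case:
    case_id: str
    inputs: dict             # the same inputs the historical process used
    historical_output: str   # what the original process produced
    historical_outcome: str  # what actually happened (ground truth)

def build_blind_review(cases, run_agent):
    """Pair agent and historical outputs under anonymous labels for expert review."""
    packets = []
    for case in cases:
        outputs = {
            "agent": run_agent(case.inputs),  # the agent sees only the original inputs
            "historical": case.historical_output,
        }
        labels = ["A", "B"]
        random.shuffle(labels)                # reviewers never know which label is which
        key = dict(zip(labels, outputs))      # e.g. {"B": "agent", "A": "historical"}
        packets.append({
            "case_id": case.case_id,
            "outcome": case.historical_outcome,  # real outcome shown to the reviewer
            "candidates": {label: outputs[source] for label, source in key.items()},
            "_key": key,                         # withheld from reviewers until scoring
        })
    return packets

def agent_win_rate(packets, expert_picks):
    """expert_picks maps case_id -> the label the expert judged better against the outcome."""
    wins = sum(1 for p in packets if p["_key"][expert_picks[p["case_id"]]] == "agent")
    return wins / len(packets)
```

The point of the sketch is the shape of the trial, not the code: the agent gets the same inputs the historical process had, the expert sees both outputs without labels, and the recorded outcome is the yardstick.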
The GDPval Connection
OpenAI recently published GDPval—a benchmark where domain experts blindly evaluate AI outputs against historical human outputs.
When we read about it, we realized our sales and onboarding process had organically arrived at the same core methodology. Domain experts evaluating AI against historical baselines, blind to which is which, with real outcomes as ground truth. We didn’t design it this way because of GDPval—we designed it this way because it’s what skeptical operations leaders need to see before they’ll commit.
Why the Harness Matters
Our Harness Engineering approach makes this evaluation meaningful.
The harness enforces deterministic behavior—constrained by the customer’s business ontology—so we can measure accuracy, cost, and time against historical baselines. An agent that “sometimes” gets it right provides no basis for ROI calculation. The harness makes outcomes repeatable, which makes them measurable, which makes ROI defensible.
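Repeatable outcomes are what turn the comparison into arithmetic. The sketch below shows roughly what that arithmetic looks like, assuming per-case figures for historical cost, agent cost, and handling time; the field names, the 95% accuracy floor, and the ROI formula are illustrative assumptions, not figures from an actual engagement.

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    correct: bool            # expert judged the agent output acceptable
    baseline_cost: float     # historical cost of handling the case (USD)
    agent_cost: float        # token + infrastructure cost of the agent run (USD)
    baseline_minutes: float  # historical handling time
    agent_minutes: float     # agent handling time

def workflow_roi(results, accuracy_floor=0.95):
    """Report ROI only if accuracy clears the customer's success criterion."""
    accuracy = sum(r.correct for r in results) / len(results)
    if accuracy < accuracy_floor:
        return {"accuracy": accuracy, "roi": None}  # not defensible yet
    savings = sum(r.baseline_cost - r.agent_cost for r in results)
    spend = sum(r.agent_cost for r in results)
    minutes_saved = sum(r.baseline_minutes - r.agent_minutes for r in results)
    return {
        "accuracy": accuracy,
        "roi": savings / spend,  # net savings per dollar of agent spend
        "minutes_saved": minutes_saved,
    }
```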
De-Risking Adoption
Here’s where our architecture, pricing, and evaluation process converge.
The harness gives us consistency—every token spent produces predictable, auditable outputs. The evaluation process proves ROI before you commit. And our pricing aligns with this: you pay for workflows that have already demonstrated value on your own data.
The result: maximum ROI from every token, and de-risked adoption, because every workflow you pay for has already proven its ROI before it goes live.
When a customer approves an evaluation, the same harness configuration goes into production. Prove it works on one workflow, expand to the next. Same methodology, every time.
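As a hypothetical sketch of that promotion step, not our actual configuration format: the artifact that passed evaluation is treated as immutable, and the only thing that changes on approval is where it runs.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class HarnessConfig:
    workflow: str
    ontology_version: str     # the business ontology the agent is constrained to
    success_criteria: tuple   # e.g. ("accuracy >= 0.95", "cost_per_case <= baseline")
    environment: str = "evaluation"

def promote(config: HarnessConfig, approved: bool) -> HarnessConfig:
    """The configuration that passed evaluation runs in production, unchanged."""
    if not approved:
        raise ValueError(f"{config.workflow}: evaluation not approved; nothing to promote")
    return replace(config, environment="production")
```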
The Bottom Line
We didn’t set out to replicate GDPval. We set out to answer a simple question every operations leader asks: “How do I know this will actually work for us?”
The answer: test it against your own historical performance, with your own experts doing the evaluation, using your own success criteria.
Want to run an evaluation on your workflows? Get in touch.
Related: Harness Engineering: How We Make AI Reliable for Supply Chain Operations