Teaching machines financial judgment.
We take real deals, work them the way an analyst would, and grade the answer against what actually happened. Built for frontier labs, by practitioners.
Research toward a verifier for finance.
Quality cannot be crowdsourced.
The hard part of this domain is not volume. It is judgment, and judgment does not come off an assembly line.
An annotation farm can label a million examples. It cannot tell you whether reported EBITDA is real cash or accounting noise, where a first-lien claim sits in the recovery waterfall, or why a covenant decides the outcome. That takes practitioners in a room, building the framework for how to approach the problem before a single task is written.
This is why the work has to come from people who have been right and wrong with real money on the line. Quality data and quality benchmarks are downstream of quality judgment. Get the framework wrong and you have scaled the wrong answer. We do one thing: finance.
And we scale it the way no vendor can. Our originators underwrite real deals, and every deal becomes new ground to learn from.
Code got smart because every answer has a verifier: the tests pass or they do not. Finance never had that signal for the judgment call itself. Nothing checked how a decision was reached against what later happened, so there was no automatic way to know when a call was right.
With no objective checker, the best you can learn from is consensus. And consensus caps you at average. You cannot beat the market by memorizing it.
A real deal already holds the missing signal: what actually happened. Frame it as a problem to solve, and the realized result becomes a reward, so a model trains against truth, not against opinion.
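To make the idea concrete, here is a minimal sketch of a reward computed from a realized result. It assumes the quantity being judged is a recovery rate between 0 and 1. The function name `outcome_reward`, the recovery-rate framing, and the linear penalty are all illustrative assumptions, not a description of how our rewards are built.

```python
def outcome_reward(predicted_recovery: float, realized_recovery: float) -> float:
    """Hypothetical reward in [0, 1]: 1.0 when the prediction matches what happened.

    Both inputs are assumed to be recovery rates expressed as fractions of par.
    """
    for name, value in (("predicted", predicted_recovery),
                        ("realized", realized_recovery)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} recovery must be in [0, 1], got {value}")
    # Linear penalty on the absolute miss; the realized value is the ground truth.
    return 1.0 - abs(predicted_recovery - realized_recovery)


if __name__ == "__main__":
    # A model that called 60 cents on the dollar when the claim recovered 45.
    print(outcome_reward(0.60, 0.45))
```

The point of the sketch is only that the target comes from the record, not from a panel of raters. Any real reward would depend on the task and on how the outcome is measured.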
Judgment, captured so a model can inherit it.
Nobody is born an analyst. Judgment accumulates, deal by deal, in people who have carried the risk. We capture that wordless calculus in forms a frontier lab can train on: artifacts you own that drop into your own RL and SFT pipelines, not a platform to adopt. Every product is finance, and only finance.
FJB scores a model twice. Once against industry best practice, and once against how the deal was actually underwritten.
The distance between those two scores is the judgment gap. It is the headroom a model has left, a quantity no existing benchmark measures.
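As a rough illustration, the gap can be read as the absolute difference between the two scores. The sketch below assumes both scores sit on the same 0-to-1 scale; the names `CaseScores` and `judgment_gap` and the scale itself are hypothetical, chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseScores:
    """Two scores for one model on one case (0-to-1 scale is an assumption)."""
    best_practice: float    # graded against industry best practice
    as_underwritten: float  # graded against how the deal was actually underwritten


def judgment_gap(scores: CaseScores) -> float:
    """Distance between the two scores, taken here as an absolute difference."""
    return abs(scores.best_practice - scores.as_underwritten)


if __name__ == "__main__":
    example = CaseScores(best_practice=0.80, as_underwritten=0.55)
    print(f"judgment gap: {judgment_gap(example):.2f}")
```

Taking the absolute value treats the gap as a distance, as the text describes it. A signed difference would also show which of the two references the model sits closer to.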
Every case becomes a training environment, not a static dataset.
A real financial situation is rebuilt into a set of graded tasks a model can act in and be scored on, with rewards that check against the record. How that rebuild happens stays in-house, but the controls that make it trainable do not: point-in-time splits so nothing leaks from the future, and every task adjudicated by practitioners before it counts. What reaches a model is the judgment of people who have actually carried the risk.
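One of the named controls, the point-in-time split, can be sketched in a few lines. This is a minimal version under stated assumptions: each case carries a decision date and an outcome date, and a single cutoff date separates training from held-out cases. The `Case` fields and the function `point_in_time_split` are hypothetical names for illustration, not our internal schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Case:
    case_id: str
    decision_date: date  # when the underwriting decision was made
    outcome_date: date   # when the realized result became known


def point_in_time_split(cases: list[Case], cutoff: date) -> tuple[list[Case], list[Case]]:
    """Split cases so nothing known only after the cutoff reaches training.

    Training keeps cases whose outcome was already known before the cutoff.
    Held-out keeps cases decided on or after the cutoff.
    Cases that straddle the cutoff are dropped from both sides.
    """
    train = [c for c in cases if c.outcome_date < cutoff]
    held_out = [c for c in cases if c.decision_date >= cutoff]
    return train, held_out


if __name__ == "__main__":
    cases = [
        Case("a", date(2018, 1, 1), date(2019, 6, 1)),
        Case("b", date(2019, 9, 1), date(2021, 3, 1)),  # straddles the cutoff
        Case("c", date(2020, 5, 1), date(2022, 1, 1)),
    ]
    train, held_out = point_in_time_split(cases, cutoff=date(2020, 1, 1))
    print([c.case_id for c in train], [c.case_id for c in held_out])
```

Dropping the straddling cases is a deliberate choice in this sketch: a deal decided before the cutoff but resolved after it would place a future outcome in training if it were kept there.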
Each task targets a capability frontier models still lack.
The categories isolate distinct kinds of judgment, drawn from how our analysts actually work a problem. The rubric in each category encodes that discipline.
Questions a model team tends to ask first.
Tell us what you are training.
Whether you are a frontier lab working with finance data or an institution with deals to put to work, we will scope it with you directly.
Building the verifier finance still lacks.