FJB: a dual-rubric benchmark for financial judgment. Read the working paper

Outcome-grounded RL environments and evaluation for financial judgment

Teaching machines financial judgment.

We take real deals, work them the way an analyst would, and grade the answer against what actually happened. Built for frontier labs, by practitioners.

Research

Research toward a verifier for finance.

Research note

The anchor travels: one deal’s chatter prices the next deal

Plant one sentence of unverified market chatter in the first deal an AI analyst reads, and its price for an unrelated second deal in the same session moves by half a turn of EBITDA. An isolation instruction removes less than a third of the effect.

Briefing

Prompt Risk Is Investment Risk

We quadrupled a company's biggest risk and the AI's price didn't move. We added one line of market chatter and it moved $45 million.

Evals & RL

You can scaffold away the flip. You can't scaffold away the frame.

564 isolated runs across one synthetic deal and its controlled variants. The recommendation never moved. The price, the leverage, and what the memo noticed moved with the packaging of the question.

Methods

How a prompt's framing quietly answers its own question

The five trigger families that leak the answer into an eval stem, and the one-word test that catches them.

Working paper

FJB: A Dual-Rubric Benchmark for Financial Judgment

Measuring the judgment gap across categories of financial judgment.

Coming soon

The thesis

Quality cannot be crowdsourced.

The hard part of this domain is not volume. It is judgment, and judgment does not come off an assembly line.

An annotation farm can label a million examples. It cannot tell you whether reported EBITDA is real cash or accounting noise, where a first-lien claim sits in the recovery waterfall, or why a covenant decides the outcome. That takes practitioners in a room, building the framework for how to approach the problem before a single task is written.

This is why the work has to come from people who have been right and wrong with real money on the line. Quality data and quality benchmarks are downstream of quality judgment. Get the framework wrong and you have scaled the wrong answer. We do one thing: finance.

And we scale it the way no vendor can. Our originators underwrite real deals, and every deal becomes new ground to learn from.

The asymmetry

A model can be checked. A market never offered to.

Code got smart because every answer has a verifier: the tests pass or they do not. Finance never had that signal for the judgment call itself. Nothing checked how a decision was reached against what later happened, so the field had no automatic way to know when a call was right.

The ceiling

So finance trained on opinion.

With no objective checker, the best you can learn from is consensus. And consensus caps you at average. You cannot beat the market by memorizing it.

The mechanism

Every deal is an environment. Its outcome is the reward.

A real deal already holds the missing signal: what actually happened. Frame it as a problem to solve, and the realized result becomes a reward a model can train against truth, not against opinion.

Products

Judgment, captured so a model can inherit it.

Nobody is born an analyst. Judgment accumulates, deal by deal, in people who have carried the risk. We capture that wordless calculus in forms a frontier lab can train on: artifacts you own that drop into your own RL and SFT pipelines, not a platform to adopt. Every product is finance, and only finance.

Valuation, Underwriting, Risk, Structuring, Markets

Reinforcement learning

Durable environments where real financial work happens: long horizons, primary documents, the tools an analyst actually uses, and an outcome that can be checked. This is reinforcement learning with verifiable rewards (RLVR), applied to finance. We preserve the full complexity of each deal, whether it is the first or the ten-thousandth.

Long-horizon deals

Underwriting that unfolds over days, through ambiguity, dead ends, and revision. We capture work that reflects how real diligence happens while preserving the signal needed to improve a model.

Off-the-shelf datasets (coming soon)

Prebuilt finance corpora, curated for signal and reviewed by practitioners, structured to drop into your training stack without translation work.

Benchmarks and evals

Quality is easy to reduce to the wrong metric. FJB measures domain-faithful lift: judgment that holds against what actually happened, graded on a dual rubric across more than seventy competencies of financial judgment.

Analyst trajectories (coming soon)

Full traces of how an expert works a deal: the documents pulled, the model built, the calls reversed, the thesis defended. Models learn the shape of the work, not just the finished answer.

Supervised fine-tuning (coming soon)

Demonstrations that set the right prior, captured through tooling that lets analysts work naturally while preserving the operating judgment behind every decision.

FJB scores a model twice. Once against industry best practice, and once against how the deal was actually underwritten.

The distance between those two scores is the judgment gap. It is the headroom a model has left, a quantity no existing benchmark measures.
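As a rough illustration of the arithmetic, the gap can be sketched as the absolute difference between the two rubric scores. This is a minimal sketch, not FJB's scoring code: the 0-to-1 scale, the use of an absolute difference, the simple average across tasks, and every name below (`DualRubricScore`, `judgment_gap`, `mean_judgment_gap`) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualRubricScore:
    """Two scores for one model on one task (0-to-1 scale is an assumption)."""
    best_practice: float    # graded against industry best practice
    as_underwritten: float  # graded against how the deal was actually underwritten


def judgment_gap(score: DualRubricScore) -> float:
    """Distance between the two rubric scores (absolute difference, assumed)."""
    return abs(score.best_practice - score.as_underwritten)


def mean_judgment_gap(scores: list[DualRubricScore]) -> float:
    """Average gap across tasks (a simple unweighted mean, assumed)."""
    if not scores:
        raise ValueError("need at least one scored task")
    return sum(judgment_gap(s) for s in scores) / len(scores)
```

For example, a task scored 0.75 against best practice and 0.5 against the actual underwriting would show a gap of 0.25 under these assumptions.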

What FJB measures

Willing to pay

Who benefits from an outcome, and whether the incentives of owners, sponsors, and counterparties actually line up with it.

Value of parts

What the pieces are worth: the capital structure, the priority of each claim, and what the assets genuinely recover.

Future cash flow

Whether reported earnings convert into cash that can meet obligations, and how much runway there really is.

Known and unknown risks

The terms, exposures, and second-order risks that decide an outcome but rarely sit on the surface.

Reinforcement-learning environments

Every case becomes a training environment, not a static dataset.

A real financial situation is rebuilt into a set of graded tasks a model can act in and be scored on, with rewards that check against the record. How that rebuild happens stays in-house, but the controls that make it trainable do not: point-in-time splits so nothing leaks from the future, and every task adjudicated by practitioners before it counts. What reaches a model is the judgment of people who have actually carried the risk.

Verifiable reward

Each task is graded against evidence in the record, so the signal a model trains on is defensible rather than a matter of taste.

Point-in-time by construction

A model sees only what was knowable at the moment of decision. The outcome is held back, so there is no hindsight to reward.

Authored by practitioners

Environments are built and reviewed by analysts who have worked these situations, and admitted only once they clear that bar.
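The point-in-time control described above can be pictured as a date filter applied before a model sees anything. This is a minimal sketch under stated assumptions, not the in-house pipeline: the `Document` type, the `visible_at` helper, and the idea that each document carries a single publication date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    """A primary document in a deal record (hypothetical structure)."""
    name: str
    published: date


def visible_at(documents: list[Document], anchor: date) -> list[Document]:
    """Keep only documents knowable on or before the decision date."""
    return [d for d in documents if d.published <= anchor]


# Hypothetical deal record: the last document postdates the decision
# and must stay hidden from the model.
record = [
    Document("offering memorandum", date(2021, 3, 1)),
    Document("Q2 results", date(2021, 8, 15)),
    Document("restructuring notice", date(2022, 1, 10)),
]
knowable = visible_at(record, anchor=date(2021, 9, 1))
```

Here `knowable` holds the first two documents only. The later notice, which carries the outcome, is held back for grading.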

Seven modes of reasoning

Each task targets a capability frontier models still lack.

The categories isolate distinct kinds of judgment, drawn from how our analysts actually work a problem. The rubric in each category encodes that discipline.

Strategic

Read a situation and choose a course of action under uncertainty.

Diagnostic

Locate the cause beneath a set of symptoms in the record.

Quantitative

Derive and defend a number from the underlying figures.

Counterfactual

Reason about what a different choice would have produced.

Predictive

Forecast forward from only what was knowable at the anchor.

Explanatory

Account for why an outcome occurred, with evidence.

Comparative

Weigh two situations against each other and justify the distinction.

Across all seven, the rubric rewards reasoning that holds up against what actually happened, not against consensus opinion.

For the team evaluating us

Questions a model team tends to ask first.

What is an environment, concretely?

A real deal, rebuilt as a set of graded tasks a model acts in and is scored on. Long horizons, primary documents, the tools an analyst uses, and an outcome that can be checked against the record.

Markets are noisy. How can an outcome be a reward?

The task is framed at the moment of decision and graded on the reasoning that was defensible then, not only on the dollars that followed. A sound call that lost money still scores. A lucky one does not.

How do you keep the answer from leaking into the eval?

Point-in-time by construction. A model sees only what was knowable at the anchor, the outcome is held back, and the framing is checked for the tells that quietly hand over an answer.

Why finance, and not code or math?

Code and math already have verifiers, so models compounded there. Finance's new benchmarks check answers; nothing yet verifies the judgment behind a decision against what actually happened. Closing that gap is the whole company.

Who actually builds these?

People who have carried the risk, not an annotation farm. Quality data is downstream of quality judgment, so the framework comes from practitioners before a single task is written.

Get in touch

Tell us what you are training.

Whether you are a frontier lab working with finance data or an institution with deals to put to work, we will scope it with you directly.

Mutual NDA before anything sensitive is shared

A real person replies, usually within two business days

What a first call covers

How an environment is specified, end to end

A rubric excerpt and how a task is scored

The point-in-time controls that keep an eval clean

Delivery formats and how they drop into your stack

Building the verifier finance still lacks.