Evals & RL Research note

You can scaffold away the flip. You can't scaffold away the frame.

564 isolated runs across one synthetic deal and its controlled variants. The recommendation never moved. The price, the leverage, and what the memo noticed moved with the packaging of the question.

Dissei Data Research · June 7, 2026 · 12 min read

In 2006, researchers asked experienced German judges to roll dice before sentencing a hypothetical shoplifter. The dice were loaded. Judges who rolled high handed down sentences roughly 50% longer than judges who rolled low, from identical case files. The number carried no information and everyone knew it. It moved the sentence anyway.

We ran a family of related experiments against a private equity investment memo, with current language models in the analyst's seat. In phase one we held the case fixed and varied only the sentence introducing it, building on the trigger families from our earlier piece on how a prompt's grammar leaks its conclusion. In phase two we held the sentence fixed and quietly varied the case: titrating a risk factor, deleting a load-bearing line, planting contradictions, applying social pressure. Then we measured what moved: the recommendation, the entry multiple, the leverage tolerance, the reported confidence, and what the model said it weighted.

One note on hygiene before the results. After drafting our conclusions we had an independent frontier model from a different vendor adversarially review the harness, the graders, and every number, with instructions to assume we had fooled ourselves. It returned 24 objections. Several stuck, two of them serious, and the claims below are the ones that survived that review or were corrected by it. The single biggest lesson of the exercise is that unaudited evaluation pipelines manufacture findings, and we see no reason to exempt our own.

Figure: Same model, same memo, two interventions. The meter shows how far the recommended price moved. A fourfold increase in the case's central risk factor moved it nothing. One unverified sentence shaped like market intelligence moved it a fifth of the deal.

—The flip does not happen

On the frontier model, the categorical recommendation did not move once. Not under a favorable frame, not under a skeptical one, not with a 48-hour deadline, not for an enthusiastic managing partner, not with sunk diligence fees, not when we quadrupled the customer-concentration risk, not under two rounds of explicit senior pushback. Three hundred thirty valid runs, one verdict: a conditional proceed.

If we had graded only the recommendation, the headline would be that frontier models are robust, and this note would end here. Grading the recommendation is what most evaluation harnesses do.

A flipped recommendation gets caught at committee. A half-turn of leverage does not.

The movement lives in the continuous dials. In the free-form protocol, a favorable one-sentence frame priced the deal at 10.5 to 11.5 times EBITDA, five runs of five; a skeptical frame priced the identical company at 10.0 to 11.0 times (one of the five runs set its floor at 9.5) and cut maximum leverage from 4.0 to 3.5 times, five of five. Roughly $10M of enterprise value and $10M of debt capacity, moved by sentences that selectively emphasized facts already in the case.
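The dollar figures above are turns of EBITDA multiplied out. A minimal sketch of that arithmetic, assuming the $20M EBITDA the case states in the restatement control later in this note; the function names are ours, for illustration only.

```python
def midpoint(low: float, high: float) -> float:
    """Midpoint of a recommended range, in turns of EBITDA."""
    return (low + high) / 2


def turns_to_dollars_m(turns: float, ebitda_m: float) -> float:
    """Convert a move expressed in turns of EBITDA into $M."""
    return turns * ebitda_m


EBITDA_M = 20.0  # assumption: the case's $20M EBITDA

favorable_mid = midpoint(10.5, 11.5)  # favorable frame range
skeptical_mid = midpoint(10.0, 11.0)  # skeptical frame range

ev_shift_m = turns_to_dollars_m(favorable_mid - skeptical_mid, EBITDA_M)
debt_shift_m = turns_to_dollars_m(4.0 - 3.5, EBITDA_M)

print(f"enterprise value shift: ${ev_shift_m:.0f}M")
print(f"debt capacity shift:    ${debt_shift_m:.0f}M")
```

A half-turn on each dial comes to $10M on each, which is the pair of figures quoted above.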

—Scaffolding works, conditionally

Forcing the model through a structured output schema (every field, every run) flattened all of that on a case with loud, unambiguous problems. Fourteen of fifteen runs returned identical terms; the one exception set its valuation floor a half-turn lower with every other field identical. The frame's only residue was the ordering of thesis bullets. Then we rebuilt the case so the call was genuinely close, and the frame came through the scaffold: confidence dropped a notch under the skeptical frame, leverage tolerance fell by half to three quarters of a turn, and the valuation ceiling came down.

So structure raises the threshold a frame must cross. On the cases where you most need judgment, the close calls, it does not eliminate the sensitivity.

—What moves the price is how information arrives

Inject one sentence of unverified market chatter, comps high or comps low, and the frontier model's recommended range moves by a full turn of EBITDA in the chatter's direction. Measured midpoint to midpoint between the comps-high and comps-low arms, the gap is about $45M on a $215M deal through the full structured protocol, and re-running the whole battery in one globally randomized order reproduced the gap at 2.4 turns. In ten of ten runs the memo builds its fair range around the unverified number and uses it in the valuation commentary, while the diligence list asks to validate everything else: earnings quality, payments economics, bookings. Verifying the comp itself appears on no list. The model does not even schedule the check; it adopts the number.

The controls, rebuilt after review. First we changed a verified case fact instead, titrating the largest customer from 10% to 40% of revenue in five steps with the top-ten share held fixed, so only one variable moves. Median recommended price did not move at any level. The whole response lived one dial down, and at twenty runs per level its shape is a gradient: the share of runs taking one conservative leverage step rose steadily with concentration, from one in twenty at 10% to half at 22% to two-thirds at 40%. Then we changed substance the model could verify: an audited restatement cutting EBITDA from $20M to $17M. Five runs of five recomputed the implied entry multiple, repriced the recommended range down proportionately, and cut leverage half a turn.

So the corrected finding is not that information beats materiality; it is three coupled facts. Unverified price-shaped information is adopted into the price at full strength, with the verification never scheduled. Verified, surfaced substance is translated into price correctly and proportionately. And graded risk-factor variation never reaches the price; it surfaces as a steadily more frequent one-step tightening of leverage. The model can translate substance into price. What is broken is the trigger: repricing follows how information arrives, not how much it matters.
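For the restatement control, "proportionately" has a concrete meaning. The sketch below shows one reading of it, under stated assumptions: the $215M price is the deal size quoted above, the $200M to $220M range is a hypothetical illustration rather than a reported result, and the helper names are ours.

```python
def implied_multiple(price_m: float, ebitda_m: float) -> float:
    """Entry multiple implied by a price at a given EBITDA."""
    return price_m / ebitda_m


def reprice_proportionally(range_m, old_ebitda_m, new_ebitda_m):
    """Scale a dollar price range by the ratio of restated to original EBITDA,
    which is the same as holding the entry multiple constant."""
    low, high = range_m
    return (low * new_ebitda_m / old_ebitda_m, high * new_ebitda_m / old_ebitda_m)


OLD_EBITDA_M, NEW_EBITDA_M = 20.0, 17.0  # the audited restatement in the text
PRICE_M = 215.0                          # deal size quoted in the text

before = implied_multiple(PRICE_M, OLD_EBITDA_M)
after = implied_multiple(PRICE_M, NEW_EBITDA_M)

# Hypothetical range for illustration only, not a reported result.
repriced = reprice_proportionally((200.0, 220.0), OLD_EBITDA_M, NEW_EBITDA_M)

print(f"implied multiple at unchanged price: {before:.2f}x -> {after:.2f}x")
print(f"proportional reprice of $200M-$220M: ${repriced[0]:.0f}M-${repriced[1]:.0f}M")
```

At an unchanged price the restatement pushes the implied multiple up by nearly two turns, which is why a memo that holds its multiple has to bring the dollar range down by 15%.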

—Who the model thinks it is prices the deal

Tell the model it is the deal partner who sourced the opportunity: 4.25 times leverage and higher confidence, five of five. Tell it it is the chief risk officer running an independent challenge: 4.0 times and a notch lower, five of five. Zero overlap on leverage and confidence across ten runs; the valuation ranges, to be precise, overlapped heavily, so the effect is scoped to those two dials. Role assignment is usually discussed as a stylistic device. In a financial workflow it functions as a quiet repricing instruction.

—What is not in the document does not exist

Phase two deleted load-bearing lines and asked the unchanged neutral question. Delete the financing structure: every model still confidently emits a leverage recommendation, and zero runs of fifteen note that the capital structure was never disclosed. Delete the retention metrics: the memos request cohort analysis as if base metrics existed. Delete the customer-concentration line, the most standard item in any buyout memo, and the topic vanishes from fourteen of fifteen memos across three models. Not a risk bullet, not a diligence item, not a question.

The models' diligence lists are case-text-driven, not checklist-driven. They interrogate what is in front of them and do not notice what is not. For anyone scoring thoroughness by rubric against the supplied document, this failure is structurally invisible: every memo looks complete relative to its input.
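One way to make the omission visible is to score each memo against a checklist that lives outside the case text. The sketch below is a minimal illustration, not our harness: the checklist topics and regex patterns are hypothetical, and keyword matching of this kind needs the same raw-output auditing as any other grader.

```python
import re

# Hypothetical checklist. A real one would come from domain practice,
# not from the supplied case document.
CHECKLIST = {
    "customer concentration": [r"customer concentration", r"largest customer"],
    "capital structure": [r"capital structure", r"financing structure", r"debt terms"],
    "retention": [r"retention", r"churn", r"cohort"],
}


def missing_topics(memo: str, checklist: dict = CHECKLIST) -> list:
    """Return checklist topics the memo never mentions, in checklist order."""
    text = memo.lower()
    return [
        topic
        for topic, patterns in checklist.items()
        if not any(re.search(pattern, text) for pattern in patterns)
    ]


memo = "Leverage of 4.0x is supportable. Request cohort analysis to confirm retention."
print(missing_topics(memo))
```

Because the checklist is fixed before the case is edited, deleting a line from the case cannot delete the corresponding question from the rubric.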

—Stories get scrutiny, arithmetic gets believed

We planted two internal contradictions. The first was narrative: a healthy retention figure alongside a cohort line implying the opposite. Thirteen of fifteen runs engaged it, though only three named it as an inconsistency requiring reconciliation rather than absorbing it into a plausible story. The second required one division: $20M of EBITDA on $100M of revenue, labeled as a 25% margin. Zero of fifteen runs caught it, and seven of fifteen repeated the false figure inside their own valuation reasoning, the frontier model included. A wrong number that fits the story does not just slip past. It gets adopted and becomes load-bearing.
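The arithmetic contradiction is one division away from detection. A minimal sketch of the check, with a hypothetical function name and tolerance:

```python
def margin_inconsistency(ebitda_m: float, revenue_m: float,
                         stated_margin: float, tol: float = 0.005):
    """Return the implied margin if it disagrees with the stated one, else None."""
    implied = ebitda_m / revenue_m
    if abs(implied - stated_margin) > tol:
        return implied
    return None


# The planted contradiction: $20M EBITDA on $100M revenue, labeled a 25% margin.
implied = margin_inconsistency(20.0, 100.0, 0.25)
if implied is not None:
    print(f"stated 25% margin, implied {implied:.0%}")
```

The figures imply a 20% margin, five points below the label the memos repeated.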

Figure: The pushback test. After the memo, two escalating replies in the voice of the deal team and the managing partner ask for full leverage at a higher entry. The meter shows how far each tier moved toward the requested terms.

The detail that matters for anyone using reported confidence as a signal: while the small model walked to exactly the requested leverage in five conversations of five, its stated confidence fell in four of them. It complied and reported discomfort simultaneously. Capitulation is a model-tier property, and the cheapest tier does it knowingly.

—The noise floor is a model property

Placebo edits, cosmetic rewording and an immaterial fact, moved the frontier and small models by zero on price midpoint and leverage, which means any quarter-turn effect on them is signal. The mid-tier model's placebo drift was half a turn of EBITDA, which swallows several of its own readings. Before attributing any effect to any model, you need that model's noise floor on that case. Evaluations that skip this step are reporting weather.
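The comparison above can be sketched as a per-model gate: estimate the placebo drift first, then ask whether an effect clears it. The run values below are hypothetical, chosen only to mirror the zero and half-turn floors described in the text, and the function names are ours.

```python
from statistics import median


def noise_floor(baseline_runs, placebo_cells):
    """Largest absolute shift in median, across placebo cells, versus baseline."""
    base = median(baseline_runs)
    return max(abs(median(runs) - base) for runs in placebo_cells.values())


def is_signal(effect_turns, floor_turns):
    """An effect counts as signal only if it exceeds the model's own floor."""
    return abs(effect_turns) > floor_turns


# Hypothetical price midpoints in turns of EBITDA, five runs per cell.
frontier_floor = noise_floor(
    [10.75] * 5,
    {"reworded": [10.75] * 5, "immaterial_fact": [10.75] * 5},
)
mid_tier_floor = noise_floor(
    [10.75, 10.5, 10.75, 11.0, 10.75],
    {
        "reworded": [11.25, 11.0, 11.25, 11.5, 11.25],
        "immaterial_fact": [10.75, 10.75, 10.5, 11.0, 10.75],
    },
)

print(is_signal(0.25, frontier_floor))  # a quarter-turn against a zero floor
print(is_signal(0.25, mid_tier_floor))  # the same quarter-turn against a half-turn floor
```

The same quarter-turn reading is reportable on one model and unreportable on the other, which is why the floor has to be measured per model and per case.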

—What this means for reward signals

Grade the frontier model on its recommendation and it scores 330 for 330, while its price moves by a fifth on one sentence of chatter and not at all on a quadrupled risk factor. Verifiable rewards are only as good as the dial they verify. Materiality, meaning response to information conditional on its verification status, is a measurable reward dimension that almost nothing measures. Omission probes need ground-truth checklists that live outside the case text, because rubrics scored against the supplied document cannot see a missing topic. Contradiction probes need an arithmetic axis, because narrative-conflict detection and arithmetic-conflict detection turn out to be different capabilities. And calibration inverts under pressure in both directions: confidence rising under deadline cues while terms hold, confidence falling while terms cave.

One more, from our own kitchen: every keyword grader we wrote was initially wrong, and three of the five errors flattered the system under test, including, on the second adversarial pass, a substring match that had us crediting the model with scheduling comp verification it never scheduled. The corrections came from reading raw outputs against the graders, line by line, with an adversarial reviewer hunting our mistakes. Building graders that survive that reading is most of the work, and it is the part no benchmark headline shows you.
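To illustrate the kind of substring error described above, here is a hypothetical reconstruction, not our actual grader: a naive check for "comp" credits any diligence item containing those letters, while a stricter check requires a verification verb and a whole-word comps term in the same item. The patterns and example items are assumptions for illustration.

```python
import re

VERIFY = re.compile(r"\b(verif|validat|confirm)\w*")
COMPS = re.compile(r"\b(comps?|comparables?|precedent transactions?)\b")


def naive_grader(diligence_items):
    """Flawed on purpose: any item containing the substring 'comp' gets credit."""
    return any("comp" in item.lower() for item in diligence_items)


def stricter_grader(diligence_items):
    """Credit only items pairing a verification verb with a whole-word comps term."""
    return any(
        VERIFY.search(item.lower()) and COMPS.search(item.lower())
        for item in diligence_items
    )


# Hypothetical diligence list that never schedules comp verification.
items = ["Validate management compensation plan", "Confirm payments economics"]
print(naive_grader(items), stricter_grader(items))
```

The naive grader passes this list on the strength of "compensation". The stricter one still needs to be read against raw outputs, since a regex can be wrong in ways that flatter the system just as easily.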

The model did not fail to answer. It failed to notice, and noticing is what judgment is made of.

Method, briefly

Synthetic cases, single-change controls, isolated API calls, five runs per cell. 564 valid runs with exact attempted-versus-valid accounting, including the review panel's supplementary control arms: a single-variable dose ladder, a verified-restatement control, prompted-omission follow-ups, and a globally randomized re-run of the cue battery. Models: one frontier model and its predecessor in phase one, plus the mid and small tiers of the same family in phase two. Effects are reported as separations between run distributions with exact counts, with exact permutation tests on the key comparisons (p = 0.0079, the floor at five runs per cell; family-wise corrected values floor at 0.064 by construction at this sample size). The baseline cell reproduced identical medians in three independent batches run hours apart. Independently and adversarially reviewed before publication. We do not publish the case, the stems, or the schema: published tests stop measuring, and designing innocent-looking sentences that are not is the work. The model family is named in the technical paper; the point remains which failures survive which controls, not which logo failed.
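The p = 0.0079 floor follows from counting relabelings. With five runs per cell there are 252 ways to split ten runs into two groups of five, and under complete separation only two of them are as extreme as the observed split. A minimal sketch, assuming a two-sided test on the difference in means (the statistic is our assumption) and using the role-assignment leverage values reported above:

```python
from itertools import combinations
from math import comb


def exact_permutation_p(a, b):
    """Two-sided exact permutation test on the absolute difference in means."""
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    extreme = 0
    relabelings = 0
    for idx in combinations(range(len(pooled)), n_a):
        group_sum = sum(pooled[i] for i in idx)
        diff = abs(group_sum / n_a - (total - group_sum) / n_b)
        if diff >= observed - 1e-12:  # tolerance for float comparison
            extreme += 1
        relabelings += 1
    return extreme / relabelings


deal_partner = [4.25] * 5  # leverage, five of five
risk_officer = [4.0] * 5   # leverage, five of five

p = exact_permutation_p(deal_partner, risk_officer)
print(comb(10, 5), round(p, 4))
```

No amount of separation can push p below 2/252 at this sample size, so a corrected threshold across several comparisons has a floor of its own.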

References

B. Englich, T. Mussweiler, F. Strack. Playing dice with criminal sentences: the influence of irrelevant anchors on experts' judicial decision making. Personality and Social Psychology Bulletin 32(2):188-200, 2006. doi:10.1177/0146167205282152
G. B. Northcraft, M. A. Neale. Experts, amateurs, and real estate: an anchoring-and-adjustment perspective on property pricing decisions. Organizational Behavior and Human Decision Processes 39(1):84-97, 1987. doi:10.1016/0749-5978(87)90046-X
A. Tversky, D. Kahneman. Judgment under uncertainty: heuristics and biases. Science 185(4157):1124-1131, 1974. doi:10.1126/science.185.4157.1124
A. Tversky, D. Kahneman. The framing of decisions and the psychology of choice. Science 211(4481):453-458, 1981. doi:10.1126/science.7455683
M. Sharma, M. Tong, T. Korbak, et al. Towards understanding sycophancy in language models. ICLR 2024. arXiv:2310.13548
E. Jones, J. Steinhardt. Capturing failures of large language models via human cognitive biases. NeurIPS 2022. arXiv:2202.12299
M. Binz, E. Schulz. Using cognitive psychology to understand GPT-3. PNAS 120(6):e2218523120, 2023. doi:10.1073/pnas.2218523120
J. M. Echterhoff, Y. Liu, A. Alessa, J. McAuley, Z. He. Cognitive bias in decision-making with LLMs. Findings of EMNLP 2024. doi:10.18653/v1/2024.findings-emnlp.739
L. Mukta. Can you scaffold away bias? 2026. lamismukta.substack.com

The framing battery and the failure-mode battery above are two instruments from the evaluation environments we build, with people who have sat in the committee seat. If you are training or evaluating models on financial judgment and want your reward signal to mean what you think it means, we will scope it with you directly.
