Chi-Square & Cramér's V · No Machine Learning

Overall accuracy is an average. Averages hide where a system fails.

CONFIRM is a statistical audit for any system that sorts things into categories — reward models, classifiers, lithofacies predictions. It doesn't ask "how good is this on average." It asks whether the system is consistently reliable across every category it claims to handle, and proves the answer with a contingency table anyone can re-check by hand.

Zero AI
Illustrative Model · 5 × 2 Table · n = 1,763

Domain        Correct   Incorrect   Accuracy (%)
Safety            441           9             98
Focus             396          99             80
Factuality        361         114             76
Math              122          61             67
Precise IF         61          99             38
Pattern drawn from the published case study (RAMO‑Llama3.1‑8B, Cramér's V = 0.467) — 98 on Safety, 38 on Precise Instruction Following. Same model. 60‑point gap. One overall score would have shown neither number.
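The arithmetic behind that gap can be re-checked in a few lines. This is a minimal sketch that uses only the counts from the illustrative table above; the variable and function names are placeholders chosen for this example, not part of CONFIRM.

```python
# Counts copied from the illustrative 5 x 2 table: (correct, incorrect).
table = {
    "Safety": (441, 9),
    "Focus": (396, 99),
    "Factuality": (361, 114),
    "Math": (122, 61),
    "Precise IF": (61, 99),
}


def accuracy(correct, incorrect):
    """Percent correct for one row, or for pooled counts."""
    return 100.0 * correct / (correct + incorrect)


# Accuracy per domain, then the single pooled number a leaderboard would show.
per_domain = {name: accuracy(c, i) for name, (c, i) in table.items()}
total_correct = sum(c for c, _ in table.values())
total_incorrect = sum(i for _, i in table.values())
n = total_correct + total_incorrect
overall = accuracy(total_correct, total_incorrect)

# Gap between the best and the worst domain.
spread = max(per_domain.values()) - min(per_domain.values())

for name, acc in per_domain.items():
    print(f"{name:<12}{acc:5.1f}%")
print(f"{'Overall':<12}{overall:5.1f}%  (n = {n})")
print(f"Spread between best and worst domain: {spread:.1f} points")
```

The pooled figure lands in the high 70s, which on its own reveals neither the 98 nor the 38.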
174 reward models audited
1,763 prompts per model, RewardBench 2
0.082–0.467 Cramér's V observed range
0 neural networks involved

01 What It Measures

Two questions, asked of any classification system.

When a system sorts cases into categories — sandstone or shale, approve or decline, correct or incorrect — and you have ground truth to check it against, CONFIRM runs the comparison through two classical tests.

Chi-square test
Is the relationship between the system's calls and the actual outcomes real — or could a pattern this strong have appeared by chance alone? This is the question every contingency table in this study had to answer before anything else was reported.
Cramér's V
How strong is that relationship, and does it hold evenly across every category the system is supposed to handle? V runs from 0 (no association between category and outcome) to 1 (category fully determines outcome). A system that's excellent everywhere has low V. A system that's poor everywhere also has low V: its failures are as consistent as the other's successes. High V means the reliability is uneven: strong in some categories, weak in others.
What a grade isn't
A CONFIRM grade is not a verdict on whether to use a system. It's a measurement of how unevenly its reliability is distributed. Whether that unevenness matters depends entirely on what you need the system to do, and where you can least afford it to fail.
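The claim that uniformly good and uniformly poor systems both score low can be checked directly. The sketch below uses three hypothetical systems with 100 cases per category; the counts are invented for illustration and are not taken from the study, and `cramers_v` is a helper written for this example.

```python
import math


def cramers_v(table):
    """Cramér's V for a table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, r_tot in zip(table, row_totals):
        for observed, c_tot in zip(row, col_totals):
            expected = r_tot * c_tot / n  # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical systems, five categories each: [correct, incorrect] per row.
excellent_everywhere = [[95, 5]] * 5
poor_everywhere = [[40, 60]] * 5
uneven = [[98, 2], [80, 20], [76, 24], [67, 33], [38, 62]]

v_excellent = cramers_v(excellent_everywhere)
v_poor = cramers_v(poor_everywhere)
v_uneven = cramers_v(uneven)

print(f"excellent everywhere: V = {v_excellent:.3f}")
print(f"poor everywhere:      V = {v_poor:.3f}")
print(f"uneven:               V = {v_uneven:.3f}")
```

The first two systems have identical accuracy in every category, so V is zero for both even though one is far better than the other. Only the third, whose accuracy changes by category, produces a large V.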

CONFIRM contains no machine learning, no neural networks, no language models. Chi-square and Cramér's V are early-20th-century statistics, in continuous use across medicine, psychology, and engineering. Any result here can be reproduced with a contingency table and a pencil.
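As a rough illustration of "a contingency table and a pencil," here is a standard-library Python sketch that runs both tests on the illustrative 5 × 2 table. It is not TraceSeis's implementation; the function names are placeholders, and the p-value helper uses a closed form that is valid only when the degrees of freedom are even (a 5 × 2 table has 4).

```python
import math


def chi_square(table):
    """Return (chi2, df, n) for a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, r_tot in zip(table, row_totals):
        for observed, c_tot in zip(row, col_totals):
            expected = r_tot * c_tot / n  # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df, n


def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a chi-square variable; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


# Illustrative table from this page: [correct, incorrect] per domain.
observed = [[441, 9], [396, 99], [361, 114], [122, 61], [61, 99]]

chi2, df, n = chi_square(observed)
p_value = chi2_upper_tail_even_df(chi2, df)
v = math.sqrt(chi2 / (n * min(len(observed) - 1, len(observed[0]) - 1)))

print(f"chi2 = {chi2:.1f}, df = {df}, p = {p_value:.2e}, V = {v:.3f}")
```

On these illustrative counts V comes out near 0.39 rather than the 0.467 reported for the real model, which is consistent with the table being an illustration of the pattern and not the model's actual per-domain counts.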


02 The RewardBench 2 Study

174 reward models. Every one of them came back uneven.

Reward models are the second model in the room — they score a large language model's candidate answers during training, and that score becomes the training signal. CONFIRM ran a 5×2 contingency table (domain × correct/incorrect) on every reward model publicly scored against RewardBench 2, a benchmark from the Allen Institute for AI covering Factuality, Math, Precise Instruction Following, Focus, and Safety.
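For reference, the textbook definitions behind a 5 × 2 table can be written out as follows. The symbols (O, E, R, C, r, c) are notation chosen for this sketch and do not appear in the study itself.

```latex
% O_{ij}: observed count in row i (domain), column j (correct / incorrect)
% R_i, C_j: row and column totals; n: total number of prompts
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\mathrm{df} = (r - 1)(c - 1)

V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\; c - 1)}}

% For the 5 x 2 case (r = 5, c = 2):
\mathrm{df} = 4, \qquad V = \sqrt{\chi^2 / n}
```

With only two outcome columns, the min term equals 1, so V depends only on the chi-square statistic and the number of prompts.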

174 / 174 Models rejected independence
0.272 Median Cramér's V
93.1% Weakest at Precise IF (share of models)
70.7% Strongest at Safety (share of models)
p < 0.02 Worst-case significance
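The worst-case significance figure can be sanity-checked from the numbers already on this page. The sketch assumes the lowest-V model (0.082) was scored on the full 1,763 prompts in a 5 × 2 table, so df = 4; because 0.082 is a rounded value, the result is approximate.

```python
import math

N_PROMPTS = 1763   # prompts per model, as stated on this page
MIN_V = 0.082      # lowest observed Cramér's V (rounded)
ROWS, COLS = 5, 2  # domains x (correct / incorrect)

# Invert V = sqrt(chi2 / (n * min(r - 1, c - 1))) to recover chi-square.
chi2 = MIN_V ** 2 * N_PROMPTS * min(ROWS - 1, COLS - 1)

# Upper-tail probability of chi-square with df = 4 (closed form for df = 4).
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Under those assumptions the weakest pattern in the sample still clears p < 0.02, in line with the figure above.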

Where the 174 models fall on Cramér's V

Each mark is one model, positioned by association strength. An illustrative spread anchored to the reported percentiles — not the literal per-model dataset.

Axis marks: 0.05, 0.10 (grade C), 0.20 (grade B), 0.30 (grade A), 0.467 (max)
Illustrative Case Study

RAMO‑Llama3.1‑8B — Cramér's V 0.467, the most uneven model in the sample

Two domains, same model, same leaderboard entry:

98 / 100 Safety
38 / 100 Precise Instruction Following
60 pts Spread, same model

A single leaderboard number for this model would average to something respectable. It would not tell you that anything built on its instruction-following judgment inherits a blind spot the headline score never shows.


03 CONFIRM v2 Grading Scale

Grades are anchored to measured effect sizes — not arbitrary cutoffs.

CONFIRM v2 thresholds are calibrated to meta-analytic effect-size benchmarks (Funder & Ozer, 2019; Gignac & Szodorai, 2016) rather than Cohen's widely cited but, by current consensus, overly strict 1988 conventions.

Grade   Cramér's V (p ≤ 0.05)   Reading
A       ≥ 0.30                  Strong, uneven pattern: roughly 65 vs. 35 per 100 across domains
B       0.20 – 0.30             Clear pattern: roughly 60 vs. 40 per 100
C       0.10 – 0.20             Modest pattern: roughly 55 vs. 45 per 100
D       0.05 – 0.10             Faint but statistically real pattern
F       < 0.05, powered         No distinguishable unevenness detected
I       underpowered            Too little data to tell; not the same as "flat"
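The "per 100" readings match what the binomial effect size display would give: a split of 50 plus or minus half the association, per 100 cases. The page does not say how the readings were derived, so treating them as this display is an assumption, and `per_100_split` is a hypothetical helper written for this sketch.

```python
def per_100_split(v):
    """Return the (higher, lower) success counts per 100 implied by v.

    Assumption: the binomial effect size display, 50 +/- 100 * v / 2.
    """
    higher = 50 + 50 * v
    return higher, 100 - higher


# Lower bounds of grades A, B, and C from the grading scale.
for grade, threshold in (("A", 0.30), ("B", 0.20), ("C", 0.10)):
    higher, lower = per_100_split(threshold)
    print(f"grade {grade} at V = {threshold:.2f}: "
          f"{higher:.0f} vs. {lower:.0f} per 100")
```

These splits describe an idealized two-group contrast; a real five-domain table spreads the same association across more categories.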

In the 174-model study: 64 graded A, 88 graded B, 20 graded C, 2 graded D. None graded F or I — every model had enough data to detect even a small real pattern, and every model showed one.
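The scale can be expressed as a small lookup. This is a sketch under stated assumptions and not CONFIRM's own code: the page does not define how "powered" is determined, so `powered` is passed in as a flag; a range's lower bound is treated as inclusive; and a result that is not significant, or falls below 0.05, is mapped to F when powered and I otherwise.

```python
def confirm_grade(v, p_value, powered, alpha=0.05):
    """Map Cramér's V and a p-value to a letter on the v2 scale.

    Assumptions (not stated in the source): lower bounds are inclusive,
    and `powered` is supplied by the caller from a separate power check.
    """
    significant = p_value <= alpha
    if significant and v >= 0.05:
        if v >= 0.30:
            return "A"
        if v >= 0.20:
            return "B"
        if v >= 0.10:
            return "C"
        return "D"
    # No distinguishable pattern: "flat" only if the test had enough data.
    return "F" if powered else "I"


# Grade counts reported for the 174-model study.
study_counts = {"A": 64, "B": 88, "C": 20, "D": 2, "F": 0, "I": 0}
total_models = sum(study_counts.values())

print(confirm_grade(0.467, 1e-10, powered=True))  # most uneven model
print(confirm_grade(0.272, 1e-10, powered=True))  # median model
print(total_models)
```

The p-values passed in the usage lines are placeholders; only the V values and the grade counts come from this page.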


04 Why This Isn't Just About AI

CONFIRM didn't start in AI. It started under the ground.

Major oil & gas operators needed to know whether lithofacies classifications coming out of a geophysical model — sandstone or shale, predicted from seismic and well-log data — could be trusted, not just on average, but across every rock type the model claimed to distinguish. A misclassified formation sends a drilling program at the wrong target. The tool didn't exist, so TraceSeis built it.

Where it started

Lithofacies classification

A SOM-based model predicts sandstone, shale, or carbonate from seismic and well-log signatures. CONFIRM checks whether that classification holds up evenly across every formation type, or performs well in some and collapses in others — the difference between a model you can drill on and one you can't.

Where it's applied here

Reward model evaluation

A reward model predicts which of two AI-generated answers a human would prefer, across categories like Safety or Math. CONFIRM checks whether that judgment holds up evenly across every category, or performs well in some and collapses in others — the difference between a training signal you can trust and one with a blind spot baked in.

χ² test of independence → Cramér's V → grade — identical procedure, different categories
Get an Independent Read

Bring us a contingency table. Leave with a defensible answer.

If a system you rely on makes categorical calls — and something downstream depends on it being reliable in every category, not just on average — CONFIRM gives you a reproducible answer in days, not a black box.

TraceSeis, Inc. Houston, Texas
info@traceseis.com

Methods: "Subset Heterogeneity in Reward Model Benchmarks: A CONFIRM Validation of 174 RewardBench 2 Models," A. Chaveste Fernandez, 2026. Available on request.