Benchmark

PolicyBench

How much household-level policy calculation can frontier models do from parametric knowledge alone? This benchmark evaluates 4 no-tools models on 5,191 predictions across 13 programs and 100 household scenarios.

Avg within 10%: 72.2%
Best model: 73.8%
Avg error: $5,452
Models benchmarked: 4
Methodology

How the benchmark works

PolicyBench measures one thing: how well frontier models can estimate household-level tax and benefit outputs from the prompt alone. This app shows the current no-tools benchmark on a fixed test set, with ground truth computed by PolicyEngine-US for tax year 2025.

Enhanced CPS households: 100
Scored variables: 13
Model-output targets: 1,300
Frontier models: 4
Task
Each model sees the same household description and one named policy variable, then must return a single numeric answer with no tool use. The exact provider-specific prompts are visible in the scenario explorer, so you can inspect the contract instead of inferring it.
Households
The benchmark samples real households from the Enhanced CPS with a fixed seed. To keep cases realistic and interpretable, the current set is restricted to households with a single tax unit, a single SPM unit, and a single family. Filing status, ages, children, observed wages, and selected non-wage income sources are carried through into both the prompt and the PolicyEngine input.
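The sampling rule above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the `Household` fields, the `sample_households` helper, and the seed value are hypothetical names chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Household:
    # Hypothetical fields; the real Enhanced CPS schema may differ.
    household_id: int
    n_tax_units: int
    n_spm_units: int
    n_families: int


def sample_households(pool, n, seed):
    """Keep single-tax-unit, single-SPM-unit, single-family households,
    then draw n of them reproducibly with a fixed seed."""
    eligible = [
        h for h in pool
        if h.n_tax_units == 1 and h.n_spm_units == 1 and h.n_families == 1
    ]
    # Sort first so the draw does not depend on the input order.
    eligible.sort(key=lambda h: h.household_id)
    rng = random.Random(seed)
    return rng.sample(eligible, n)
```

Calling `sample_households(pool, 100, seed=0)` twice on the same pool returns the same 100 households, which is what makes the test set fixed.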
Ground Truth
PolicyEngine-US computes the authoritative label for every household-variable pair for tax year 2025. The current scope covers adjusted gross income, pre-credit federal income tax, the CTC, the EITC, refundable credits, SNAP, SSI, household-level Medicaid eligibility, household-level free school meals eligibility, state AGI, state pre-credit income tax, state refundable credits, and final state income tax. The benchmark no longer scores PolicyEngine-specific aggregates like total benefits or household net income.
Scoring
Dollar-valued outputs are scored with mean absolute error, mean absolute percentage error, and share within 10% of ground truth. Household booleans like Medicaid and free school meals are scored with classification accuracy. Coverage tracks how often a model produced a parseable numeric answer. The leaderboard is a point estimate on this fixed test set.
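A minimal sketch of these metrics is shown below. Several details are assumptions rather than the benchmark's documented behavior: unparseable answers (`None`) are excluded from the error metrics and counted only in coverage, MAPE is computed over nonzero-truth rows only, and a zero-truth row counts as within 10% only if the prediction is exactly zero. The function names are hypothetical.

```python
def score_dollar(preds, truths, tol=0.10):
    """Score dollar-valued predictions; None marks an unparseable answer."""
    scored = [(p, t) for p, t in zip(preds, truths) if p is not None]
    abs_errs = [abs(p - t) for p, t in scored]
    # Assumption: percentage error is undefined when truth is 0, so skip those rows.
    pct_errs = [abs(p - t) / abs(t) for p, t in scored if t != 0]
    # Assumption: with truth 0 the tolerance band collapses to an exact match.
    hits = [abs(p - t) <= tol * abs(t) for p, t in scored]
    return {
        "coverage": len(scored) / len(truths),
        "mae": sum(abs_errs) / len(scored) if scored else None,
        "mape": sum(pct_errs) / len(pct_errs) if pct_errs else None,
        "within_10pct": sum(hits) / len(scored) if scored else None,
    }


def score_boolean(preds, truths):
    """Classification accuracy over parseable boolean answers."""
    scored = [(p, t) for p, t in zip(preds, truths) if p is not None]
    return sum(p == t for p, t in scored) / len(scored) if scored else None
```

Because many households have a true value of zero for benefits and credits, the treatment of zero-truth rows has a large effect on the within-10% share.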
Current benchmark scope
The latest run in this app evaluates Gemini 3.1 Pro Preview, Claude Opus 4.6, GPT-5.4, and Claude Sonnet 4.6 on 1,300 scored outputs.
Fixed test set, no tools, tax year 2025
Adjusted gross income (Federal tax)
Any Medicaid eligibility (Benefits)
CTC (Credits)
EITC (Credits)
Free school meals eligibility (Benefits)
Refundable credits (Credits)
SNAP (Benefits)
SSI (Benefits)
State adjusted gross income (State tax)
State income tax (State tax)
State refundable credits (State tax)
State tax before refundable credits (State tax)
Tax before refundable credits (Federal tax)
Prediction accuracy

Predicted vs. actual

Each point is one model's prediction for a specific household-program pair. Points on the diagonal are perfect predictions.

Leaderboard

Model rankings

Models ranked by share of predictions within 10% of ground truth in the no-tools condition.

| # | Model | Within 10% | MAE |
| --- | --- | --- | --- |
| 1 | Gemini 3.1 Pro Preview | 73.8% | $10,763 |
| 2 | Claude Opus 4.6 | 73.1% | $3,733 |
| 3 | GPT-5.4 | 72.9% | $3,480 |
| 4 | Claude Sonnet 4.6 | 69.1% | $3,833 |

Key finding: The best no-tools model by within-10% share (Gemini 3.1 Pro Preview) places 73.8% of its predictions within 10% of ground truth, but its mean absolute error of $10,763 is the highest of the four models.
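The headline figures (72.2% average within 10%, $5,452 average error) are consistent with simple unweighted means over the four leaderboard rows. The check below assumes that is how they were computed; the page does not say so explicitly.

```python
# Leaderboard values copied from the rankings: (within-10% share, MAE in dollars).
leaderboard = {
    "Gemini 3.1 Pro Preview": (73.8, 10763),
    "Claude Opus 4.6": (73.1, 3733),
    "GPT-5.4": (72.9, 3480),
    "Claude Sonnet 4.6": (69.1, 3833),
}

avg_within = sum(w for w, _ in leaderboard.values()) / len(leaderboard)
avg_mae = sum(m for _, m in leaderboard.values()) / len(leaderboard)

# The two metrics rank the models differently.
best_by_within = max(leaderboard, key=lambda name: leaderboard[name][0])
best_by_mae = min(leaderboard, key=lambda name: leaderboard[name][1])

print(f"{avg_within:.1f}%")   # 72.2%
print(f"${avg_mae:,.0f}")     # $5,452
print(best_by_within)         # Gemini 3.1 Pro Preview
print(best_by_mae)            # GPT-5.4
```

Ranking by MAE instead of within-10% share would put GPT-5.4 first, so the choice of headline metric matters when reading the leaderboard.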

By program

Program breakdown

Accuracy by program and model (AI alone, without tools). Dollar targets use share within 10% of the true value; household booleans use classification accuracy.

| Program | Claude Opus 4.6 | Claude Sonnet 4.6 | GPT-5.4 | Gemini 3.1 Pro Preview | Avg |
| --- | --- | --- | --- | --- | --- |
| Free school meals eligibility | 95% | 96% | 67% | 94% | 88% |
| CTC | 90% | 90% | 88% | 82% | 88% |
| SSI | 84% | 82% | 87% | 96% | 87% |
| Adjusted gross income | 89% | 83% | 83% | 77% | 83% |
| Any Medicaid eligibility | 86% | 73% | 81% | 91% | 83% |
| EITC | 87% | 69% | 79% | 89% | 81% |
| Refundable credits | 77% | 67% | 79% | 83% | 77% |
| State refundable credits | 72% | 74% | 73% | 75% | 74% |
| State adjusted gross income | 65% | 69% | 77% | 70% | 70% |
| SNAP | 69% | 66% | 69% | 74% | 70% |
| State tax before refundable credits | 64% | 61% | 61% | 61% | 62% |
| Tax before refundable credits | 61% | 52% | 57% | 57% | 57% |
| State income tax | 46% | 47% | 49% | 48% | 48% |
Failure modes

Where models still break

The hardest part of PolicyBench is not saying when a program is zero. It is getting the positive amount right for the households that actually qualify. The cards below split those cases apart so the benchmark is not flattered by easy zero-answer rows.

Read this carefully

For dollar-valued programs, "Overall" reflects within-10% accuracy across all households, including many true-zero cases. "Positive-amount cases" is the harder and more informative number for benefits and refundable credits. For household booleans, the cards instead compare positive-class and negative-class accuracy.
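The split described here can be sketched as follows. The definitions are assumptions: a row is a hit when the prediction is within 10% of truth (an exact match when truth is zero), and the underpredict share is the fraction of positive-truth rows where the prediction falls below truth. The function name is hypothetical, and rows with negative truth fall into neither bucket in this sketch.

```python
def split_accuracy(preds, truths, tol=0.10):
    """Within-10% accuracy overall, on positive-truth rows, and on zero-truth rows."""
    def hit(p, t):
        return abs(p - t) <= tol * abs(t)

    def share(flags):
        return sum(flags) / len(flags) if flags else None

    rows = list(zip(preds, truths))
    positives = [(p, t) for p, t in rows if t > 0]
    zeros = [(p, t) for p, t in rows if t == 0]
    return {
        "overall": share([hit(p, t) for p, t in rows]),
        "positive_amount": share([hit(p, t) for p, t in positives]),
        "zero_amount": share([hit(p, t) for p, t in zeros]),
        # Assumption: share of positive-truth rows predicted below the true amount.
        "underpredict_on_positives": share([p < t for p, t in positives]),
    }
```

When most households have a true value of zero, a model that answers zero everywhere scores well overall while scoring 0% on positive-amount cases, which is why the two numbers are reported separately.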

Dollar targets

| Program | Overall | Positive-amount cases | Zero-amount cases | With children | Low income | High income | Underpredict share on positives |
| --- | --- | --- | --- | --- | --- | --- | --- |
| State income tax | 47.6% | 24.6% | 100.0% | 45.3% | 44.5% | 62.6% | 52.4% |
| Tax before refundable credits | 57.5% | 26.9% | 95.5% | 53.2% | 88.4% | 34.3% | 53.4% |
| State tax before refundable credits | 61.8% | 26.5% | 99.5% | 64.8% | 75.6% | 63.0% | 63.3% |
| SNAP | 69.7% | 20.8% | 97.3% | 69.5% | 30.7% | 100.0% | 70.8% |
| State adjusted gross income | 70.3% | 80.6% | 52.0% | 75.8% | 76.2% | 62.0% | 31.5% |
| State refundable credits | 73.5% | 3.7% | 99.3% | 63.3% | 62.2% | 100.0% | 87.0% |
| Refundable credits | 76.5% | 17.0% | 93.3% | 52.3% | 67.1% | 95.4% | 67.0% |
| EITC | 81.0% | 26.1% | 96.5% | 67.2% | 73.2% | 98.1% | 59.1% |
| Adjusted gross income | 83.2% | 81.2% | 100.0% | 94.5% | 92.1% | 75.7% | 41.2% |

Household boolean

| Program | Overall | Positive households | Negative households | With children | Low income | High income |
| --- | --- | --- | --- | --- | --- | --- |
| Any Medicaid eligibility | 83.0% | 79.5% | 85.1% | 81.9% | 80.5% | 93.5% |
Hardest household segments
| Segment | Scored rows | Score |
| --- | --- | --- |
| Households with children | 1,661 | 69.7% |
| Low-income households | 2,131 | 71.1% |
| Wage-only households | 1,765 | 73.3% |
| Retirement-income households | 1,765 | 74.6% |
| Disabled households | 779 | 80.4% |
| No-income-tax states | 1,351 | 80.6% |
| High-income households | 1,399 | 82.6% |
Deep dive

Scenario explorer

Select a household to see every model's prediction for each program, compared against PolicyEngine's ground truth.

State: PA
Filing status: single
Adults: 1
Children: 0
Income: -$148,009
Exact prompts
Adjusted gross income
GPT/Claude use function calling; Gemini uses JSON mode
Function-call contract
Consider a single filer living in PA for tax year 2025. Adult 1 is 63 years old and has $0 in annual employment income and receives $-144,905 in self-employment income, $-1,242 in short-term capital gains, and $-1,862 in long-term capital gains. They have no children.

What is the federal adjusted gross income (AGI), after above-the-line deductions but before standard or itemized deductions and exemptions for this household? Use the `submit_answer` function exactly once to return the final numeric answer. Put only the numeric value in the `answer` field, with no dollar signs, commas, or other text in that field. Do not rely on plain text for the final answer. If the answer is a dollar amount, give the annual amount. If the answer is a rate, give a decimal (e.g. 0.25 for 25%).
JSON contract
Consider a single filer living in PA for tax year 2025. Adult 1 is 63 years old and has $0 in annual employment income and receives $-144,905 in self-employment income, $-1,242 in short-term capital gains, and $-1,862 in long-term capital gains. They have no children.

What is the federal adjusted gross income (AGI), after above-the-line deductions but before standard or itemized deductions and exemptions for this household? Return a JSON object exactly like {"answer": 1234.5}. Put only the numeric value in the answer field, with no dollar signs, commas, or other text in that field. Do not rely on plain text outside the JSON object for the final answer. If the answer is a dollar amount, give the annual amount. If the answer is a rate, give a decimal (e.g. 0.25 for 25%).
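A strict parser for the JSON contract above might look like the sketch below. The function name `parse_answer` and its rejection rules are assumptions for illustration; the benchmark's own parsing may be more lenient. An answer that fails to parse would count against coverage.

```python
import json
import math


def parse_answer(raw):
    """Return the numeric `answer` from a JSON reply, or None if it is not parseable."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict):
        return None
    value = obj.get("answer")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    try:
        value = float(value)
    except OverflowError:
        return None
    # json.loads accepts NaN and Infinity by default; treat them as unparseable.
    return value if math.isfinite(value) else None
```

For example, `parse_answer('{"answer": 1234.5}')` returns `1234.5`, while `parse_answer('{"answer": "$1,234"}')` returns `None` because the contract requires a bare number in the `answer` field.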
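| Program | Truth | Opus 4.6 | Sonnet 4.6 | GPT-5.4 | Pro Preview |
| --- | --- | --- | --- | --- | --- |
| | -$147,905 | -$148,009 | -$148,009 | -$159,118 | -$147,905 |
| | $0 | $0 | $0 | $0 | $0 |
| | $0 | $0 | $0 | $0 | $0 |
| | No | No | No | Yes | No |
| | $6 | $0 | $0 | $0 | $0 |
| | $0 | $0 | $0 | $0 | $0 |
| | $0 | $0 | $0 | $0 | $0 |
| | Yes | Yes | No | No | Yes |
| | $3,522 | $3,516 | $0 | $0 | $3,504 |
| | $0 | $11,328 | $11,220 | $0 | $0 |
| | $0 | -$148,009 | -$148,009 | $0 | $0 |
| | $6 | $0 | $0 | $0 | $0 |
| | $0 | $0 | $0 | $0 | $0 |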
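The first result row shows a ground truth of -$147,905, while two models answered -$148,009. One reading, offered as a hedged reconstruction rather than PolicyEngine's actual computation, is that -$148,009 is the plain sum of the income items in the prompt, whereas the ground truth applies the federal $3,000 annual cap on net capital losses deductible against other income (for filers other than married filing separately).

```python
# Income items taken from the prompt for this household.
employment = 0
self_employment = -144_905
short_term_gains = -1_242
long_term_gains = -1_862

# Federal annual cap on net capital losses deductible against other income
# (single filer; the cap is lower for married filing separately).
CAPITAL_LOSS_LIMIT = 3_000

# Plain sum of every item, with no loss limitation.
naive_sum = employment + self_employment + short_term_gains + long_term_gains

# Net capital loss is -3,104, but only -3,000 is deductible this year.
net_capital = short_term_gains + long_term_gains
deductible_capital = max(net_capital, -CAPITAL_LOSS_LIMIT)
agi_with_cap = employment + self_employment + deductible_capital

print(naive_sum)      # -148009
print(agi_with_cap)   # -147905
```

The $104 gap is 0.07% of the true value, so both answers land within the 10% band even though only one reproduces the ground truth exactly.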