PolicyBench, by PolicyEngine

Methodology

How the benchmark works

PolicyBench measures one thing: how well frontier models can estimate household-level tax and benefit outputs from the prompt alone. This app shows the current no-tools benchmark on a fixed test set, with ground truth computed by PolicyEngine-US for tax year 2025.

100 Enhanced CPS households

13 scored variables

1,300 model-output targets

4 frontier models

Task

Each model sees the same household description and one named policy variable, then must return a single numeric answer with no tool use. The exact provider-specific prompts are visible in the scenario explorer, so you can inspect the contract instead of inferring it.

Households

The benchmark samples real households from the Enhanced CPS with a fixed seed. To keep cases realistic and interpretable, the current set is restricted to households with a single tax unit, a single SPM unit, and a single family. Filing status, ages, children, observed wage income, and selected non-wage income sources are carried through into both the prompt and the PolicyEngine input.
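A minimal sketch of the filter-then-sample step described above, using only the Python standard library. The field names (`n_tax_units`, `n_spm_units`, `n_families`), the toy records, and the seed value are illustrative assumptions, not the benchmark's actual schema or seed.

```python
import random

def sample_households(households, n, seed):
    """Keep single-unit households, then draw a reproducible sample of size n."""
    eligible = [
        h for h in households
        if h["n_tax_units"] == 1 and h["n_spm_units"] == 1 and h["n_families"] == 1
    ]
    # A local RNG keeps the draw independent of any global random state.
    rng = random.Random(seed)
    return rng.sample(eligible, n)

# Toy records standing in for Enhanced CPS households (hypothetical fields).
# Every third record has two tax units and is filtered out.
toy = [
    {"id": i, "n_tax_units": 2 if i % 3 == 0 else 1, "n_spm_units": 1, "n_families": 1}
    for i in range(30)
]

first = sample_households(toy, n=5, seed=0)
second = sample_households(toy, n=5, seed=0)
```

Because the seed is fixed, `first` and `second` contain the same households in the same order, which is what makes the test set stable across runs.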

Ground Truth

PolicyEngine-US computes the authoritative label for every household-variable pair for tax year 2025. The current scope covers adjusted gross income, pre-credit federal income tax, refundable credits, the CTC, the EITC, SNAP, SSI, household-level Medicaid eligibility, household-level free school meals eligibility, state AGI, state pre-credit income tax, state refundable credits, and final state income tax. The benchmark no longer scores PolicyEngine-specific aggregates like total benefits or household net income.

Scoring

Dollar-valued outputs are scored with mean absolute error, mean absolute percentage error, and share within 10% of ground truth. Household booleans like Medicaid and free school meals are scored with classification accuracy. Coverage tracks how often a model produced a parseable numeric answer. The leaderboard is a point estimate on this fixed test set.
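The metrics above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's scoring code: unparseable answers (`None`) are assumed to count against coverage only, MAPE is assumed to skip rows where the true value is zero, and "within 10%" is assumed to mean `|pred - truth| <= 0.10 * |truth|`, which for a true value of zero requires an exact zero.

```python
def score_dollar(preds, truths):
    """Score one dollar-valued target. preds may contain None for unparseable answers."""
    pairs = [(p, t) for p, t in zip(preds, truths) if p is not None]
    mae = sum(abs(p - t) for p, t in pairs) / len(pairs)
    # Assumption: percentage error is undefined at truth == 0, so skip those rows.
    nonzero = [(p, t) for p, t in pairs if t != 0]
    mape = sum(abs(p - t) / abs(t) for p, t in nonzero) / len(nonzero)
    # Assumption: relative tolerance only, so a true zero needs an exact zero.
    within10 = sum(abs(p - t) <= 0.10 * abs(t) for p, t in pairs) / len(pairs)
    coverage = len(pairs) / len(preds)
    return {"mae": mae, "mape": mape, "within10": within10, "coverage": coverage}

def score_boolean(preds, truths):
    """Classification accuracy for household booleans such as Medicaid eligibility."""
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

# Toy data: three parseable answers and one unparseable one.
preds = [1000.0, 0.0, 540.0, None]
truths = [1000.0, 0.0, 500.0, 200.0]
result = score_dollar(preds, truths)
bool_acc = score_boolean([True, False, True, True], [True, False, False, True])
```

On the toy data, all three parseable answers fall within 10% of the truth, while coverage is 3 of 4.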

Current benchmark scope

The latest run in this app evaluates Gemini 3.1 Pro Preview, Claude Opus 4.6, GPT-5.4, and Claude Sonnet 4.6 on 1,300 scored outputs.

Fixed test set, no tools, tax year 2025

Variable	Category
Adjusted gross income	Federal tax
Any Medicaid eligibility	Benefits
CTC	Credits
EITC	Credits
Free school meals eligibility	Benefits
Refundable credits	Credits
SNAP	Benefits
SSI	Benefits
State adjusted gross income	State tax
State income tax	State tax
State refundable credits	State tax
State tax before refundable credits	State tax
Tax before refundable credits	Federal tax

Prediction accuracy

Predicted vs. actual

Each point is one model's prediction for a specific household-program pair. Points on the diagonal are perfect predictions.

Leaderboard

Model rankings

Models ranked by share of predictions within 10% of ground truth in the no-tools condition.

Model	Within 10%	MAE
Gemini 3.1 Pro Preview	73.8%	$10,763
Claude Opus 4.6	73.1%	$3,733
GPT-5.4	72.9%	$3,480
Claude Sonnet 4.6	69.1%	$3,833

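Key finding: The top-ranked no-tools model (Gemini 3.1 Pro Preview) lands within 10% of ground truth on 73.8% of predictions, with a mean absolute error of $10,763. That MAE is the highest of the four models, so the ranking depends on which metric you read.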
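The leaderboard ranks by within-10% share, but the MAE column orders the models differently. A short sketch that sorts the leaderboard figures both ways:

```python
# (model, share within 10% of ground truth in percent, MAE in dollars),
# copied from the leaderboard above.
leaderboard = [
    ("Gemini 3.1 Pro Preview", 73.8, 10763),
    ("Claude Opus 4.6", 73.1, 3733),
    ("GPT-5.4", 72.9, 3480),
    ("Claude Sonnet 4.6", 69.1, 3833),
]

# Higher within-10% share is better; lower MAE is better.
by_within10 = [m for m, w, _ in sorted(leaderboard, key=lambda r: r[1], reverse=True)]
by_mae = [m for m, _, e in sorted(leaderboard, key=lambda r: r[2])]

print("By within-10%:", by_within10)
print("By MAE:", by_mae)
```

Gemini 3.1 Pro Preview is first by within-10% share and last by MAE, which suggests a small number of large misses rather than many moderate ones.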

By program

Program breakdown

Accuracy by program and model (AI alone, without tools). Dollar targets use share within 10% of the true value; household booleans use classification accuracy.

Program	Claude Opus 4.6	Claude Sonnet 4.6	GPT-5.4	Gemini 3.1 Pro Preview	Avg
Free school meals eligibility	95%	96%	67%	94%	88%
CTC	90%	90%	88%	82%	88%
SSI	84%	82%	87%	96%	87%
Adjusted gross income	89%	83%	83%	77%	83%
Any Medicaid eligibility	86%	73%	81%	91%	83%
EITC	87%	69%	79%	89%	81%
Refundable credits	77%	67%	79%	83%	77%
State refundable credits	72%	74%	73%	75%	74%
State adjusted gross income	65%	69%	77%	70%	70%
SNAP	69%	66%	69%	74%	70%
State tax before refundable credits	64%	61%	61%	61%	62%
Tax before refundable credits	61%	52%	57%	57%	57%
State income tax	46%	47%	49%	48%	48%

Failure modes

Where models still break

The hardest part of PolicyBench is not recognizing when a program pays zero. It is getting the positive amount right for the households that actually qualify. The breakdown below splits those cases apart so the benchmark is not flattered by easy zero-answer rows.

Read this carefully

For dollar-valued programs, "Overall" reflects within-10% accuracy across all households, including many true-zero cases. "Positive-amount cases" is the harder and more informative number for benefits and refundable credits. For household booleans, the comparison is instead between positive-class and negative-class accuracy.
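A sketch of how the overall figure can hide weak performance on positive amounts. The definitions here are assumptions: a hit is `|pred - truth| <= 0.10 * |truth|`, "positive-amount" means a true value above zero, and "underpredict share" means the share of positive-amount cases where the prediction is below the truth. Negative true values (possible for AGI) fall in neither bucket in this sketch.

```python
def split_accuracy(rows):
    """rows: list of (prediction, truth) pairs for one dollar-valued target."""
    def hit(p, t):
        return abs(p - t) <= 0.10 * abs(t)

    def share(subset):
        return sum(hit(p, t) for p, t in subset) / len(subset) if subset else None

    positives = [(p, t) for p, t in rows if t > 0]
    zeros = [(p, t) for p, t in rows if t == 0]
    under = sum(p < t for p, t in positives) / len(positives) if positives else None
    return {
        "overall": share(rows),
        "positive": share(positives),
        "zero": share(zeros),
        "underpredict_on_positives": under,
    }

# Toy target: eight true-zero households answered correctly, and two
# qualifying households missed (one too low, one too high).
rows = [(0.0, 0.0)] * 8 + [(0.0, 400.0), (600.0, 400.0)]
summary = split_accuracy(rows)
```

Here the overall share is 80% even though no positive-amount case is within 10%, which is the pattern the positive/zero split is meant to expose.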

Dollar targets

Target	Overall	Positive-amount cases	Zero-amount cases	With children	Low income	High income	Underpredict share on positives
State income tax	47.6%	24.6%	100.0%	45.3%	44.5%	62.6%	52.4%
Tax before refundable credits	57.5%	26.9%	95.5%	53.2%	88.4%	34.3%	53.4%
State tax before refundable credits	61.8%	26.5%	99.5%	64.8%	75.6%	63.0%	63.3%
SNAP	69.7%	20.8%	97.3%	69.5%	30.7%	100.0%	70.8%
State adjusted gross income	70.3%	80.6%	52.0%	75.8%	76.2%	62.0%	31.5%
State refundable credits	73.5%	3.7%	99.3%	63.3%	62.2%	100.0%	87.0%
Refundable credits	76.5%	17.0%	93.3%	52.3%	67.1%	95.4%	67.0%
EITC	81.0%	26.1%	96.5%	67.2%	73.2%	98.1%	59.1%
Adjusted gross income	83.2%	81.2%	100.0%	94.5%	92.1%	75.7%	41.2%

Household boolean

Target	Overall	Positive households	Negative households	With children	Low income	High income
Any Medicaid eligibility	83.0%	79.5%	85.1%	81.9%	80.5%	93.5%

Hardest household segments

Segment	Scored rows	Accuracy
Households with children	1,661	69.7%
Low-income households	2,131	71.1%
Wage-only households	1,765	73.3%
Retirement-income households	1,765	74.6%
Disabled households	779	80.4%
No-income-tax states	1,351	80.6%
High-income households	1,399	82.6%

Deep dive

Scenario explorer

Select a household to see every model's prediction for each program, compared against PolicyEngine's ground truth.

Household

Prompt variable	Adjusted gross income
State	PA
Filing status	single
Adults	1
Children	0
Income	-$148,009

Exact prompts

Adjusted gross income

GPT/Claude use function calling; Gemini uses JSON mode

Function-call contract

Consider a single filer living in PA for tax year 2025. Adult 1 is 63 years old and has $0 in annual employment income and receives $-144,905 in self-employment income, $-1,242 in short-term capital gains, and $-1,862 in long-term capital gains. They have no children.

What is the federal adjusted gross income (AGI), after above-the-line deductions but before standard or itemized deductions and exemptions for this household? Use the `submit_answer` function exactly once to return the final numeric answer. Put only the numeric value in the `answer` field, with no dollar signs, commas, or other text in that field. Do not rely on plain text for the final answer. If the answer is a dollar amount, give the annual amount. If the answer is a rate, give a decimal (e.g. 0.25 for 25%).

JSON contract

Consider a single filer living in PA for tax year 2025. Adult 1 is 63 years old and has $0 in annual employment income and receives $-144,905 in self-employment income, $-1,242 in short-term capital gains, and $-1,862 in long-term capital gains. They have no children.

What is the federal adjusted gross income (AGI), after above-the-line deductions but before standard or itemized deductions and exemptions for this household? Return a JSON object exactly like {"answer": 1234.5}. Put only the numeric value in the answer field, with no dollar signs, commas, or other text in that field. Do not rely on plain text outside the JSON object for the final answer. If the answer is a dollar amount, give the annual amount. If the answer is a rate, give a decimal (e.g. 0.25 for 25%).
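A minimal sketch of how a reply under the JSON contract might be parsed into a numeric answer, with anything else treated as unparseable for coverage purposes. The function name `parse_json_answer` and the strictness rules (rejecting strings, booleans, and non-finite values) are assumptions; the benchmark's own parser may be more lenient.

```python
import json
import math

def parse_json_answer(raw):
    """Return the numeric answer from a JSON-contract reply, or None if unparseable."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict):
        return None
    value = obj.get("answer")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    value = float(value)
    # Python's json module accepts NaN and Infinity; treat them as unparseable.
    return value if math.isfinite(value) else None

replies = [
    '{"answer": 1234.5}',
    '{"answer": -147905}',
    '{"answer": "$1,234"}',
    "The AGI is about -148,009.",
]
parsed = [parse_json_answer(r) for r in replies]
coverage = sum(p is not None for p in parsed) / len(parsed)
```

Two of the four toy replies satisfy the contract, so coverage on this sample is 50%.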

Truth	Opus 4.6	Sonnet 4.6	GPT-5.4	Pro Preview
-$147,905	-$148,009	-$148,009	-$159,118	-$147,905
$0	$0	$0	$0	$0
$0	$0	$0	$0	$0
No	No	No	Yes	No
$6	$0	$0	$0	$0
$0	$0	$0	$0	$0
$0	$0	$0	$0	$0
Yes	Yes	No	No	Yes
$3,522	$3,516	$0	$0	$3,504
$0	$11,328	$11,220	$0	$0
$0	-$148,009	-$148,009	$0	$0
$6	$0	$0	$0	$0
$0	$0	$0	$0	$0