How the benchmark works
PolicyBench measures one thing: how well frontier models can estimate household-level tax and benefit outputs from the prompt alone. This app shows the current no-tools benchmark on a fixed test set, with ground truth computed by PolicyEngine-US for tax year 2025.
Predicted vs. actual
Each point is one model's prediction for a specific household-program pair. Points on the diagonal are perfect predictions.
Model rankings
Models are ranked by the share of predictions within 10% of ground truth in the no-tools condition.
Key finding: The best no-tools model (Gemini 3.1 Pro Preview) achieves 73.8% accuracy with an average error of $10,763.
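The two headline numbers can be illustrated with a minimal Python sketch. It is not PolicyBench's scoring code: the function names are hypothetical, "average error" is assumed to mean mean absolute error, and the `zero_tol` rule for true-zero targets is an assumption, since a relative 10% band is undefined when the true value is zero. The sample values are illustrative.

```python
def within_tolerance(pred, truth, rel_tol=0.10, zero_tol=1.0):
    """Return True if pred is within rel_tol of truth.

    Assumption: when truth is exactly zero, a prediction counts as a hit
    only if it is within zero_tol dollars of zero.
    """
    if truth == 0:
        return abs(pred) <= zero_tol
    return abs(pred - truth) <= rel_tol * abs(truth)


def score(preds, truths):
    """Share of predictions within 10% of truth, plus mean absolute error."""
    hits = [within_tolerance(p, t) for p, t in zip(preds, truths)]
    errors = [abs(p - t) for p, t in zip(preds, truths)]
    return {
        "share_within_10pct": sum(hits) / len(hits),
        "mean_abs_error": sum(errors) / len(errors),
    }


# Illustrative predictions and ground-truth values for one model.
truths = [3522.0, 0.0, -147905.0, 6.0]
preds = [3516.0, 0.0, -148009.0, 0.0]
result = score(preds, truths)
print(result)  # {'share_within_10pct': 0.75, 'mean_abs_error': 29.0}
```

The last pair shows why the two numbers can diverge: predicting $0 for a true value of $6 is a miss under the 10% rule but adds only $6 to the average error.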
Program breakdown
Accuracy by program and model (AI alone, without tools). Dollar targets use share within 10% of the true value; household booleans use classification accuracy.
| Program | Claude Opus 4.6 | Claude Sonnet 4.6 | GPT-5.4 | Gemini 3.1 Pro Preview | Avg |
|---|---|---|---|---|---|
| Free school meals eligibility | 95% | 96% | 67% | 94% | 88% |
| CTC | 90% | 90% | 88% | 82% | 88% |
| SSI | 84% | 82% | 87% | 96% | 87% |
| Adjusted gross income | 89% | 83% | 83% | 77% | 83% |
| Any Medicaid eligibility | 86% | 73% | 81% | 91% | 83% |
| EITC | 87% | 69% | 79% | 89% | 81% |
| Refundable credits | 77% | 67% | 79% | 83% | 77% |
| State refundable credits | 72% | 74% | 73% | 75% | 74% |
| State adjusted gross income | 65% | 69% | 77% | 70% | 70% |
| SNAP | 69% | 66% | 69% | 74% | 70% |
| State tax before refundable credits | 64% | 61% | 61% | 61% | 62% |
| Tax before refundable credits | 61% | 52% | 57% | 57% | 57% |
| State income tax | 46% | 47% | 49% | 48% | 48% |
Where models still break
The hardest part of PolicyBench is not recognizing when a program's value is zero. It is getting the positive amount right for the households that actually qualify. The cards below split those cases apart so that easy zero-answer rows do not flatter the benchmark.
For dollar-valued programs, "Overall" reflects within-10% accuracy across all households, including many true-zero cases. "Positive-amount cases" is the harder and more informative number for benefits and refundable credits. For household booleans, the cards instead compare positive-class and negative-class accuracy.
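The split described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the app's implementation: the helper names are hypothetical, "positive-amount cases" is taken to mean rows whose true value is greater than zero, and true-zero rows are scored with an assumed $1 absolute tolerance.

```python
def is_hit(pred, truth, rel_tol=0.10, zero_tol=1.0):
    """Within-10% check; zero_tol for true-zero rows is an assumption."""
    if truth == 0:
        return abs(pred) <= zero_tol
    return abs(pred - truth) <= rel_tol * abs(truth)


def share(flags):
    """Fraction of True values, or None when there are no rows."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def dollar_split(rows):
    """rows holds (prediction, truth) pairs for one dollar-valued program."""
    return {
        "overall": share(is_hit(p, t) for p, t in rows),
        "positive_only": share(is_hit(p, t) for p, t in rows if t > 0),
    }


def boolean_split(rows):
    """rows holds (prediction, truth) pairs for one household boolean."""
    return {
        "positive_class": share(p == t for p, t in rows if t),
        "negative_class": share(p == t for p, t in rows if not t),
    }


# Illustrative data: three easy true-zero rows and two qualifying households.
dollar_rows = [(0, 0), (0, 0), (0, 0), (0, 3522), (3516, 3522)]
bool_rows = [(True, True), (False, True), (False, False), (False, False)]
dollar_scores = dollar_split(dollar_rows)
bool_scores = boolean_split(bool_rows)
print(dollar_scores)  # {'overall': 0.8, 'positive_only': 0.5}
print(bool_scores)    # {'positive_class': 0.5, 'negative_class': 1.0}
```

In this toy data the model looks 80% accurate overall, yet it is right for only half of the households that actually receive a positive amount.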
Scenario explorer
Select a household to see every model's prediction for each program, compared against PolicyEngine's ground truth.
Exact prompts: Adjusted gross income
GPT/Claude use function calling; Gemini uses JSON mode.

Function-calling prompt (GPT/Claude):
Consider a single filer living in PA for tax year 2025. Adult 1 is 63 years old and has $0 in annual employment income and receives $-144,905 in self-employment income, $-1,242 in short-term capital gains, and $-1,862 in long-term capital gains. They have no children. What is the federal adjusted gross income (AGI), after above-the-line deductions but before standard or itemized deductions and exemptions for this household? Use the `submit_answer` function exactly once to return the final numeric answer. Put only the numeric value in the `answer` field, with no dollar signs, commas, or other text in that field. Do not rely on plain text for the final answer. If the answer is a dollar amount, give the annual amount. If the answer is a rate, give a decimal (e.g. 0.25 for 25%).

JSON-mode prompt (Gemini):
Consider a single filer living in PA for tax year 2025. Adult 1 is 63 years old and has $0 in annual employment income and receives $-144,905 in self-employment income, $-1,242 in short-term capital gains, and $-1,862 in long-term capital gains. They have no children.
What is the federal adjusted gross income (AGI), after above-the-line deductions but before standard or itemized deductions and exemptions for this household? Return a JSON object exactly like {"answer": 1234.5}. Put only the numeric value in the answer field, with no dollar signs, commas, or other text in that field. Do not rely on plain text outside the JSON object for the final answer. If the answer is a dollar amount, give the annual amount. If the answer is a rate, give a decimal (e.g. 0.25 for 25%).

| Program | Truth | Opus 4.6 | Sonnet 4.6 | GPT-5.4 | Pro Preview |
|---|---|---|---|---|---|
|  | -$147,905 | -$148,009 | -$148,009 | -$159,118 | -$147,905 |
|  | $0 | $0 | $0 | $0 | $0 |
|  | $0 | $0 | $0 | $0 | $0 |
|  | No | No | No | Yes | No |
|  | $6 | $0 | $0 | $0 | $0 |
|  | $0 | $0 | $0 | $0 | $0 |
|  | $0 | $0 | $0 | $0 | $0 |
|  | Yes | Yes | No | No | Yes |
|  | $3,522 | $3,516 | $0 | $0 | $3,504 |
|  | $0 | $11,328 | $11,220 | $0 | $0 |
|  | $0 | -$148,009 | -$148,009 | $0 | $0 |
|  | $6 | $0 | $0 | $0 | $0 |
|  | $0 | $0 | $0 | $0 | $0 |
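
The first row of the table shows two nearby values, -$147,905 and -$148,009. The $104 gap is consistent with the federal $3,000 annual limit on net capital losses that a single filer can deduct against other income. The sketch below reproduces both figures from the prompt's inputs. It is a simplified illustration, not PolicyEngine-US code: `simple_agi` is a hypothetical helper, and it assumes no other income or above-the-line adjustments apply to this household.

```python
CAPITAL_LOSS_LIMIT = 3_000  # federal annual cap on deductible net capital loss (single filer)


def simple_agi(employment, self_employment, short_term_gains, long_term_gains):
    """Simplified AGI: income sources plus net capital gain, with losses capped.

    Assumption: no other income and no above-the-line adjustments.
    """
    net_capital = short_term_gains + long_term_gains
    # A net capital loss beyond the cap cannot reduce this year's AGI.
    deductible_capital = max(net_capital, -CAPITAL_LOSS_LIMIT)
    return employment + self_employment + deductible_capital


# Inputs taken from the prompt above.
capped = simple_agi(0, -144_905, -1_242, -1_862)
uncapped = 0 + -144_905 + -1_242 + -1_862  # plain sum, ignoring the loss cap

print(capped)    # -147905
print(uncapped)  # -148009
```

Under these assumptions, the -$148,009 answers match a plain sum of the stated incomes, while -$147,905 matches the capped calculation.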