
Welcome to the IneqMath Leaderboard!
This is the official leaderboard for IneqMath, an expert-curated dataset of Olympiad-level inequalities.
Please see the paper "Solving Inequality Proofs with Large Language Models" for more details.
Project | arXiv | 🤗 HF Paper | Code | 🤗 Dataset | Leaderboard | Visualization
Submit New Model Evaluation Results
Please upload the JSON file with model evaluation results and fill in the following information. If you have any questions, please contact us at jiayi_sheng@berkeley.edu or lupantech@gmail.com.
• You can revoke or deactivate your key 15 minutes after the evaluation completes. Evaluation typically costs around $5, depending on the size of your submission.
Select the type of your model
Select whether the model is proprietary or open-source
Optional: Select the reasoning effort level
Required JSON Structure:
Your JSON file must include at least these 5 fields for each problem:
[
  {
    "data_id": [integer or string] The ID of the test data,
    "problem": [string] The question text,
    "type": [string] The type of question: 'relation' or 'bound',
    "prompt": [string] The prompt used for the problem,
    "response": [string] The response of the model
  },
  ...
]
You can click the download button below to get an example file. The system will process your submission and calculate accuracy metrics automatically.
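Before uploading, it can save a round trip to check that every record carries the five required fields. The following sketch is not part of the official pipeline, just a plain-Python sanity check over an in-memory list of records (a real run would `json.load` your results file first):

```python
REQUIRED_FIELDS = {"data_id", "problem", "type", "prompt", "response"}

def validate_results(records):
    """Return (index, description) pairs for records that fail the check."""
    errors = []
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - set(rec)
        if missing:
            errors.append((i, "missing: " + ", ".join(sorted(missing))))
        elif rec["type"] not in ("relation", "bound"):
            errors.append((i, "type must be 'relation' or 'bound'"))
    return errors

# Demo on a tiny in-memory submission; the second record lacks "response".
sample = [
    {"data_id": 1, "problem": "...", "type": "relation",
     "prompt": "...", "response": "..."},
    {"data_id": 2, "problem": "...", "type": "bound", "prompt": "..."},
]
print(validate_results(sample))  # prints [(1, 'missing: response')]
```

An empty list means every record has the required shape and a valid `type` value.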
Rank | Model | Size | Type | Source | Date | Overall Acc | Answer Acc | Step Acc (NTC) | Step Acc (NLG) | Step Acc (NAE) | Step Acc (NCE)
---|---|---|---|---|---|---|---|---|---|---|---
1 | o3-pro (medium, 40K) 🥇 | UNK | 🧠 | 🔒 | 2025-06-10 | 46.0% | 68.5% | 95.5% | 73.5% | 86.0% | 94.5%
2 | Gemini 2.5 Pro Preview (40K) 🥈 | UNK | 🧠 | 🔒 | 2025-06-05 | 46.0% | 66.0% | 85.0% | 65.0% | 92.5% | 97.5%
3 | o3-pro (medium, 10K) 🥉 | UNK | 🧠 | 🔒 | 2025-06-10 | 45.5% | 68.0% | 98.5% | 67.0% | 87.0% | 97.5%
4 | Gemini 2.5 Pro (30K) | UNK | 🧠 | 🔒 | 2025-03-25 | 43.5% | 68.0% | 87.5% | 63.0% | 91.0% | 98.0%
5 | o3 (medium, 40K) | UNK | 🧠 | 🔒 | 2025-04-16 | 37.0% | 72.0% | 96.5% | 56.0% | 86.5% | 94.0%
6 | Gemini 2.5 Flash Preview 05-20 (40K) | UNK | 🧠 | 🔒 | 2025-05-20 | 27.5% | 45.0% | 83.0% | 44.0% | 89.0% | 96.0%
7 | Gemini 2.5 Flash (40K) | UNK | 🧠 | 🔒 | 2025-04-17 | 23.5% | 44.5% | 81.0% | 36.5% | 93.5% | 97.5%
8 | o3 (medium, 10K) | UNK | 🧠 | 🔒 | 2025-04-16 | 21.0% | 37.0% | 93.5% | 39.5% | 91.5% | 97.0%
9 | o4-mini (medium, 10K) | UNK | 🧠 | 🔒 | 2025-04-16 | 15.5% | 65.0% | 62.0% | 26.0% | 86.5% | 93.0%
10 | Gemini 2.5 Pro Preview (10K) | UNK | 🧠 | 🔒 | 2025-06-05 | 10.0% | 13.0% | 91.0% | 20.0% | 98.0% | 98.5%
11 | DeepSeek-R1-0528 (40K) | 671B | 🧠 | 🌐 | 2025-05-28 | 9.5% | 73.0% | 49.0% | 52.0% | 20.0% | 92.5%
12 | o3-mini (medium, 10K) | UNK | 🧠 | 🔒 | 2025-01-31 | 9.5% | 62.5% | 37.0% | 22.0% | 77.5% | 95.0%
13 | o1 (medium, 10K) | UNK | 🧠 | 🔒 | 2024-12-17 | 8.0% | 62.5% | 34.5% | 17.5% | 86.5% | 99.5%
14 | o1 (medium, 40K) | UNK | 🧠 | 🔒 | 2024-12-17 | 7.5% | 68.0% | 28.5% | 19.0% | 83.5% | 95.5%
15 | DeepSeek-V3-0324 | 671B | 💬 | 🌐 | 2025-03-25 | 7.0% | 54.5% | 17.5% | 15.0% | 63.0% | 94.5%
16 | Grok 3 mini (medium, 10K) | UNK | 🧠 | 🔒 | 2025-02-19 | 6.0% | 71.5% | 24.0% | 19.5% | 53.5% | 91.0%
17 | Qwen3-235B-A22B (10K) | 235B | 🧠 | 🌐 | 2025-04-28 | 6.0% | 41.0% | 35.0% | 36.0% | 31.0% | 92.5%
18 | Gemini 2.5 Pro (10K) | UNK | 🧠 | 🔒 | 2025-03-25 | 6.0% | 7.0% | 88.5% | 19.0% | 100.0% | 99.5%
19 | Claude Opus 4 (10K) | UNK | 🧠 | 🔒 | 2025-05-14 | 5.5% | 47.5% | 25.0% | 9.0% | 89.5% | 94.0%
20 | DeepSeek-R1 (10K) | 671B | 🧠 | 🌐 | 2025-01-19 | 5.0% | 49.5% | 57.0% | 17.5% | 81.0% | 95.0%
21 | DeepSeek-R1 (Qwen-14B) (10K) | 14B | 🧠 | 🌐 | 2025-01-20 | 5.0% | 40.5% | 21.0% | 21.0% | 35.5% | 85.0%
22 | DeepSeek-R1-0528 (10K) | 671B | 🧠 | 🌐 | 2025-05-28 | 4.5% | 12.0% | 34.0% | 32.5% | 29.0% | 90.5%
23 | Gemini 2.5 Flash (10K) | UNK | 🧠 | 🔒 | 2025-04-17 | 4.5% | 5.5% | 88.0% | 13.5% | 100.0% | 100.0%
24 | Grok 3 | UNK | 💬 | 🔒 | 2025-02-19 | 3.5% | 54.5% | 17.0% | 16.0% | 36.0% | 93.0%
25 | DeepSeek-R1 (Llama-70B) (10K) | 70B | 🧠 | 🌐 | 2025-01-20 | 3.5% | 53.5% | 23.0% | 26.0% | 35.5% | 87.0%
26 | Gemini 2.0 Flash | UNK | 💬 | 🔒 | 2025-02-05 | 3.0% | 49.0% | 15.5% | 13.5% | 55.5% | 94.5%
27 | Claude Sonnet 4 (10K) | UNK | 🧠 | 🔒 | 2025-05-14 | 3.0% | 44.0% | 19.5% | 6.5% | 86.5% | 95.5%
28 | GPT-4o | UNK | 💬 | 🔒 | 2024-08-06 | 3.0% | 37.5% | 32.0% | 3.5% | 92.5% | 94.0%
29 | Qwen2.5-7B | 7B | 💬 | 🌐 | 2024-09-16 | 3.0% | 35.0% | 44.5% | 4.5% | 92.5% | 93.0%
30 | Qwen2.5-72B | 72B | 💬 | 🌐 | 2024-09-16 | 2.5% | 42.0% | 54.5% | 5.0% | 91.0% | 95.0%
31 | GPT-4.1 | UNK | 💬 | 🔒 | 2025-04-14 | 2.5% | 40.5% | 16.0% | 10.0% | 59.5% | 93.5%
32 | Llama-4-Maverick | 128 x 17B | 💬 | 🌐 | 2025-04-05 | 2.5% | 40.5% | 42.5% | 4.0% | 89.0% | 95.0%
33 | QwQ-32B (10K) | 32B | 🧠 | 🌐 | 2025-03-05 | 2.0% | 49.5% | 26.0% | 29.5% | 21.0% | 87.0%
34 | QwQ-32B-preview (10K) | 32B | 🧠 | 🌐 | 2024-11-27 | 2.0% | 43.5% | 28.0% | 30.0% | 22.5% | 87.5%
35 | Claude 3.7 Sonnet (10K) | UNK | 🧠 | 🔒 | 2025-02-19 | 2.0% | 42.0% | 49.0% | 4.0% | 93.5% | 93.0%
36 | GPT-4o mini | UNK | 💬 | 🔒 | 2024-07-18 | 2.0% | 39.5% | 29.0% | 2.5% | 90.0% | 93.0%
37 | Qwen2.5-Coder-32B | 32B | 💬 | 🌐 | 2024-11-10 | 1.5% | 40.5% | 36.0% | 3.0% | 90.5% | 88.5%
38 | Llama-4-Scout | 16 x 17B | 💬 | 🌐 | 2025-04-05 | 1.5% | 33.5% | 30.5% | 3.5% | 93.0% | 92.5%
39 | Gemini 2.0 Flash-Lite | UNK | 💬 | 🔒 | 2025-02-25 | 1.5% | 33.0% | 11.5% | 3.5% | 73.0% | 90.5%
40 | Claude 3.7 Sonnet (8K) | UNK | 🧠 | 🔒 | 2025-02-19 | 1.0% | 41.5% | 49.0% | 2.5% | 93.5% | 92.0%
41 | DeepSeek-R1 (Qwen-1.5B) (10K) | 1.5B | 🧠 | 🌐 | 2025-01-20 | 0.5% | 14.5% | 20.0% | 6.0% | 48.0% | 83.5%
42 | Gemma-2-9B (6K) | 9B | 💬 | 🌐 | 2024-06-25 | 0.0% | 15.5% | 83.5% | 0.5% | 100.0% | 99.0%
43 | Llama-3.1-8B | 8B | 💬 | 🌐 | 2024-07-18 | 0.0% | 14.5% | 90.5% | 0.0% | 99.0% | 92.0%
44 | Llama-3.2-3B | 3B | 💬 | 🌐 | 2024-09-25 | 0.0% | 11.0% | 82.0% | 0.0% | 98.5% | 88.5%
45 | Gemma-2B (6K) | 2B | 💬 | 🌐 | 2024-02-21 | 0.0% | 7.5% | 73.5% | 0.0% | 99.0% | 95.0%
Icons Explanation:
Type: 🧠 = Reasoning Model, 💬 = Chat Model, 🛠️ = Tool-augmented Model
Source: 🔒 = Proprietary Model, 🌐 = Open-source Model
Step Accuracy Abbreviations:
NTC: No Toy Case - step accuracy excluding errors from treating a toy case as a general conclusion
NLG: No Logical Gap - step accuracy excluding logical reasoning gaps
NAE: No Approximation Error - step accuracy excluding approximation errors
NCE: No Calculation Error - step accuracy excluding calculation errors
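One way to read these columns together, based on the judge framework described in the About section: Overall Acc counts a problem as solved only when the final answer is correct and none of the four step-wise judges flags an error, while each Step Acc column reports the pass rate of a single judge. The sketch below illustrates this reading; the field names and verdict format are illustrative assumptions, not the official evaluation code:

```python
def overall_accuracy(judgments):
    """Fraction of problems where the answer judge and all four
    step-wise judges (toy case, logical gap, approximation error,
    calculation error) return a passing verdict."""
    step_judges = ("ntc", "nlg", "nae", "nce")
    solved = sum(
        1 for j in judgments
        if j["answer"] and all(j[k] for k in step_judges)
    )
    return solved / len(judgments)

# Hypothetical verdicts for three problems: only the first fully passes.
verdicts = [
    {"answer": True,  "ntc": True, "nlg": True,  "nae": True, "nce": True},
    {"answer": True,  "ntc": True, "nlg": False, "nae": True, "nce": True},
    {"answer": False, "ntc": True, "nlg": True,  "nae": True, "nce": True},
]
print(overall_accuracy(verdicts))  # prints 0.3333333333333333
```

This conjunction is why Overall Acc is much lower than Answer Acc for every model in the table: a correct final answer with a flawed derivation does not count.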
Process Query
ID | Status | Model | Size | Type | Source | Date | Submission time | Overall Acc | Answer Acc | Step Acc (NTC) | Step Acc (NLG) | Step Acc (NAE) | Step Acc (NCE)
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Status Explanation:
- Processing: Your submission is currently being evaluated by us. This may take several minutes to complete.
- Completed: Evaluation is finished and results are ready. You can submit these results to the leaderboard for public display.
- Pending: Your submission is awaiting admin approval (either for addition to or removal from the leaderboard).
- Leaderboard: Your submission is currently displayed on the public leaderboard and visible to all users.
Actions Available:
- Select "Completed" submissions to submit them to the leaderboard
- Select "Leaderboard" submissions to request their removal from the leaderboard
About
IneqMath is an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws.
Leaderboard Details
- The leaderboard displays model performance on the IneqMath test set.
- Community members can request automated benchmarking of any new model on IneqMath.
- To submit an evaluation request, visit the 🤗 IneqMath Leaderboard.
- To contribute your own model's results, click the Submission button on the leaderboard page and upload a results file that meets our format requirements.
- You can track your model's evaluation progress and view its results in Process Query; enter your email to gain access.
- You may choose whether your results are published on the leaderboard, and you may request the removal of any content already displayed. All such requests are reviewed by us, and the requested action is carried out only once approved.
Submission Instructions
1. Download the dataset
   The IneqMath dataset is available on Hugging Face. Follow the instructions on that page to obtain the test split.
2. Generate a results file
   - For proprietary models, follow the instructions in our GitHub repository to produce a results JSON.
   - For other models, evaluate on the test set and create a JSON file containing at least these five keys for each example:
     {
       "data_id": 1,          // Integer or string: the example's unique ID
       "problem": "…",        // String: the question text
       "type": "relation",    // String: either "relation" or "bound"
       "prompt": "…",         // String: the prompt you used
       "response": "…"        // String: the model's output
     }
3. Submit to the leaderboard
   Click Submission on the leaderboard page, fill in the requested information, and upload your JSON file.
4. Receive your report
   You can track your model's evaluation progress and manage your results in Process Query; enter your email to gain access.
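For step 2, assembling the records can be as simple as looping over the test examples and calling your model once per problem. In this sketch, `generate` and the prompt template are placeholders for your own inference setup, not part of the official pipeline:

```python
import json

def build_results(test_examples, generate):
    """Assemble leaderboard-format records from a model callable.

    `generate` is any function mapping a prompt string to a response string.
    """
    results = []
    for ex in test_examples:
        prompt = f"Solve the following inequality problem:\n{ex['problem']}"
        results.append({
            "data_id": ex["data_id"],
            "problem": ex["problem"],
            "type": ex["type"],  # "relation" or "bound"
            "prompt": prompt,
            "response": generate(prompt),
        })
    return results

# Demo with a stub model; swap in your own inference call, then
# json.dump(records, open("results.json", "w")) to produce the upload file.
examples = [{"data_id": 1, "problem": "Prove a^2 + b^2 >= 2ab.", "type": "relation"}]
records = build_results(examples, generate=lambda p: "By AM-GM, ...")
print(json.dumps(records, indent=2))
```

Each emitted record carries exactly the five required keys, so the file passes the format check on upload.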