
Welcome to the IneqMath Leaderboard!
This is the official leaderboard for IneqMath, an expert-curated dataset of Olympiad-level inequalities.
Please see the paper "Solving Inequality Proofs with Large Language Models" for more details.
Project | arXiv | 🤗 HF Paper | Code | 🤗 Dataset | Leaderboard | Visualization
Submit New Model Evaluation Results
Please upload the JSON file with model evaluation results and fill in the following information. If you have any questions, please contact us at jiayi_sheng@berkeley.edu or lupantech@gmail.com.
• You can revoke or deactivate your key 15 minutes after the evaluation completes. Evaluation typically costs around $5, depending on the size of your submission.
Select the type of your model
Select whether the model is proprietary or open-source
Optional: Select the reasoning effort level
Required JSON Structure:
Your JSON file must include at least these 5 fields for each problem:
[
  {
    "data_id": [integer or string] The ID of the test data,
    "problem": [string] The question text,
    "type": [string] The type of question: 'relation' or 'bound',
    "prompt": [string] The prompt used for the problem,
    "response": [string] The response of the model
  },
  ...
]
You can click the download button below to get an example file. The system will process your submission and calculate accuracy metrics automatically.
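Before uploading, it can save a round trip to check that every record carries the five required fields. The following sketch is not part of the official pipeline, just a plain-Python sanity check over an in-memory list of records (a real run would `json.load` your results file first):

```python
REQUIRED_FIELDS = {"data_id", "problem", "type", "prompt", "response"}

def validate_results(records):
    """Return (index, description) pairs for records that fail the check."""
    errors = []
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - set(rec)
        if missing:
            errors.append((i, "missing: " + ", ".join(sorted(missing))))
        elif rec["type"] not in ("relation", "bound"):
            errors.append((i, "type must be 'relation' or 'bound'"))
    return errors

# Demo on a tiny in-memory submission; the second record lacks "response".
sample = [
    {"data_id": 1, "problem": "...", "type": "relation",
     "prompt": "...", "response": "..."},
    {"data_id": 2, "problem": "...", "type": "bound", "prompt": "..."},
]
print(validate_results(sample))  # prints [(1, 'missing: response')]
```

An empty list means every record has the required shape and a valid `type` value.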
Rank | Model | Size | Type | Source | Date | Overall Acc | Answer Acc | Step Acc (NTC) | Step Acc (NLG) | Step Acc (NAE) | Step Acc (NCE)
---|---|---|---|---|---|---|---|---|---|---|---
1 | o3-pro (medium, 40K) 🥇 | UNK | 🧠 | 🔒 | 2025-06-10 | 46.0% | 68.5% | 95.5% | 73.5% | 86.0% | 94.5%
2 | Gemini 2.5 Pro Preview (40K) 🥈 | UNK | 🧠 | 🔒 | 2025-06-05 | 46.0% | 66.0% | 85.0% | 65.0% | 92.5% | 97.5%
3 | o3-pro (medium, 10K) 🥉 | UNK | 🧠 | 🔒 | 2025-06-10 | 45.5% | 68.0% | 98.5% | 67.0% | 87.0% | 97.5%
4 | Gemini 2.5 Pro (30K) | UNK | 🧠 | 🔒 | 2025-03-25 | 43.5% | 68.0% | 87.5% | 63.0% | 91.0% | 98.0%
5 | o3 (medium, 40K) | UNK | 🧠 | 🔒 | 2025-04-16 | 37.0% | 72.0% | 96.5% | 56.0% | 86.5% | 94.0%
6 | Gemini 2.5 Flash Preview 05-20 (40K) | UNK | 🧠 | 🔒 | 2025-05-20 | 27.5% | 45.0% | 83.0% | 44.0% | 89.0% | 96.0%
7 | Gemini 2.5 Flash (40K) | UNK | 🧠 | 🔒 | 2025-04-17 | 23.5% | 44.5% | 81.0% | 36.5% | 93.5% | 97.5%
8 | o3 (medium, 10K) | UNK | 🧠 | 🔒 | 2025-04-16 | 21.0% | 37.0% | 93.5% | 39.5% | 91.5% | 97.0%
9 | o4-mini (medium, 10K) | UNK | 🧠 | 🔒 | 2025-04-16 | 15.5% | 65.0% | 62.0% | 26.0% | 86.5% | 93.0%
10 | Gemini 2.5 Pro Preview (10K) | UNK | 🧠 | 🔒 | 2025-06-05 | 10.0% | 13.0% | 91.0% | 20.0% | 98.0% | 98.5%
11 | DeepSeek-R1-0528 (40K) | 671B | 🧠 | 🌐 | 2025-05-28 | 9.5% | 73.0% | 49.0% | 52.0% | 20.0% | 92.5%
12 | o3-mini (medium, 10K) | UNK | 🧠 | 🔒 | 2025-01-31 | 9.5% | 62.5% | 37.0% | 22.0% | 77.5% | 95.0%
13 | o1 (medium, 10K) | UNK | 🧠 | 🔒 | 2024-12-17 | 8.0% | 62.5% | 34.5% | 17.5% | 86.5% | 99.5%
14 | o1 (medium, 40K) | UNK | 🧠 | 🔒 | 2024-12-17 | 7.5% | 68.0% | 28.5% | 19.0% | 83.5% | 95.5%
15 | DeepSeek-V3-0324 | 671B | 💬 | 🌐 | 2025-03-25 | 7.0% | 54.5% | 17.5% | 15.0% | 63.0% | 94.5%
16 | Grok 3 mini (medium, 10K) | UNK | 🧠 | 🔒 | 2025-02-19 | 6.0% | 71.5% | 24.0% | 19.5% | 53.5% | 91.0%
17 | Qwen3-235B-A22B (10K) | 235B | 🧠 | 🌐 | 2025-04-28 | 6.0% | 41.0% | 35.0% | 36.0% | 31.0% | 92.5%
18 | Gemini 2.5 Pro (10K) | UNK | 🧠 | 🔒 | 2025-03-25 | 6.0% | 7.0% | 88.5% | 19.0% | 100.0% | 99.5%
19 | Claude Opus 4 (10K) | UNK | 🧠 | 🔒 | 2025-05-14 | 5.5% | 47.5% | 25.0% | 9.0% | 89.5% | 94.0%
20 | DeepSeek-R1 (10K) | 671B | 🧠 | 🌐 | 2025-01-19 | 5.0% | 49.5% | 57.0% | 17.5% | 81.0% | 95.0%
21 | DeepSeek-R1 (Qwen-14B) (10K) | 14B | 🧠 | 🌐 | 2025-01-20 | 5.0% | 40.5% | 21.0% | 21.0% | 35.5% | 85.0%
22 | DeepSeek-R1-0528 (10K) | 671B | 🧠 | 🌐 | 2025-05-28 | 4.5% | 12.0% | 34.0% | 32.5% | 29.0% | 90.5%
23 | Gemini 2.5 Flash (10K) | UNK | 🧠 | 🔒 | 2025-04-17 | 4.5% | 5.5% | 88.0% | 13.5% | 100.0% | 100.0%
24 | Grok 3 | UNK | 💬 | 🔒 | 2025-02-19 | 3.5% | 54.5% | 17.0% | 16.0% | 36.0% | 93.0%
25 | DeepSeek-R1 (Llama-70B) (10K) | 70B | 🧠 | 🌐 | 2025-01-20 | 3.5% | 53.5% | 23.0% | 26.0% | 35.5% | 87.0%
26 | Gemini 2.0 Flash | UNK | 💬 | 🔒 | 2025-02-05 | 3.0% | 49.0% | 15.5% | 13.5% | 55.5% | 94.5%
27 | Claude Sonnet 4 (10K) | UNK | 🧠 | 🔒 | 2025-05-14 | 3.0% | 44.0% | 19.5% | 6.5% | 86.5% | 95.5%
28 | GPT-4o | UNK | 💬 | 🔒 | 2024-08-06 | 3.0% | 37.5% | 32.0% | 3.5% | 92.5% | 94.0%
29 | Qwen2.5-7B | 7B | 💬 | 🌐 | 2024-09-16 | 3.0% | 35.0% | 44.5% | 4.5% | 92.5% | 93.0%
30 | Qwen2.5-72B | 72B | 💬 | 🌐 | 2024-09-16 | 2.5% | 42.0% | 54.5% | 5.0% | 91.0% | 95.0%
31 | GPT-4.1 | UNK | 💬 | 🔒 | 2025-04-14 | 2.5% | 40.5% | 16.0% | 10.0% | 59.5% | 93.5%
32 | Llama-4-Maverick | 128 x 17B | 💬 | 🌐 | 2025-04-05 | 2.5% | 40.5% | 42.5% | 4.0% | 89.0% | 95.0%
33 | QwQ-32B (10K) | 32B | 🧠 | 🌐 | 2025-03-05 | 2.0% | 49.5% | 26.0% | 29.5% | 21.0% | 87.0%
34 | QwQ-32B-preview (10K) | 32B | 🧠 | 🌐 | 2024-11-27 | 2.0% | 43.5% | 28.0% | 30.0% | 22.5% | 87.5%
35 | Claude 3.7 Sonnet (10K) | UNK | 🧠 | 🔒 | 2025-02-19 | 2.0% | 42.0% | 49.0% | 4.0% | 93.5% | 93.0%
36 | GPT-4o mini | UNK | 💬 | 🔒 | 2024-07-18 | 2.0% | 39.5% | 29.0% | 2.5% | 90.0% | 93.0%
37 | Qwen2.5-Coder-32B | 32B | 💬 | 🌐 | 2024-11-10 | 1.5% | 40.5% | 36.0% | 3.0% | 90.5% | 88.5%
38 | Llama-4-Scout | 16 x 17B | 💬 | 🌐 | 2025-04-05 | 1.5% | 33.5% | 30.5% | 3.5% | 93.0% | 92.5%
39 | Gemini 2.0 Flash-Lite | UNK | 💬 | 🔒 | 2025-02-25 | 1.5% | 33.0% | 11.5% | 3.5% | 73.0% | 90.5%
40 | Claude 3.7 Sonnet (8K) | UNK | 🧠 | 🔒 | 2025-02-19 | 1.0% | 41.5% | 49.0% | 2.5% | 93.5% | 92.0%
41 | DeepSeek-R1 (Qwen-1.5B) (10K) | 1.5B | 🧠 | 🌐 | 2025-01-20 | 0.5% | 14.5% | 20.0% | 6.0% | 48.0% | 83.5%
42 | Gemma-2-9B (6K) | 9B | 💬 | 🌐 | 2024-06-25 | 0.0% | 15.5% | 83.5% | 0.5% | 100.0% | 99.0%
43 | Llama-3.1-8B | 8B | 💬 | 🌐 | 2024-07-18 | 0.0% | 14.5% | 90.5% | 0.0% | 99.0% | 92.0%
44 | Llama-3.2-3B | 3B | 💬 | 🌐 | 2024-09-25 | 0.0% | 11.0% | 82.0% | 0.0% | 98.5% | 88.5%
45 | Gemma-2B (6K) | 2B | 💬 | 🌐 | 2024-02-21 | 0.0% | 7.5% | 73.5% | 0.0% | 99.0% | 95.0%
Icons Explanation:
Type: 🧠 = Reasoning Model, 💬 = Chat Model, 🛠️ = Tool-augmented Model
Source: 🔒 = Proprietary Model, 🌐 = Open-source Model
Step Accuracy Abbreviations:
NTC: No Toy Case - step accuracy excluding errors from treating a toy case as a general conclusion
NLG: No Logical Gap - step accuracy excluding logical reasoning gaps
NAE: No Approximation Error - step accuracy excluding approximation errors
NCE: No Calculation Error - step accuracy excluding calculation errors
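One way to read these columns together, based on the judge framework described in the About section: Overall Acc counts a problem as solved only when the final answer is correct and none of the four step-wise judges flags an error, while each Step Acc column reports the pass rate of a single judge. The sketch below illustrates this reading; the field names and verdict format are illustrative assumptions, not the official evaluation code:

```python
def overall_accuracy(judgments):
    """Fraction of problems where the answer judge and all four
    step-wise judges (toy case, logical gap, approximation error,
    calculation error) return a passing verdict."""
    step_judges = ("ntc", "nlg", "nae", "nce")
    solved = sum(
        1 for j in judgments
        if j["answer"] and all(j[k] for k in step_judges)
    )
    return solved / len(judgments)

# Hypothetical verdicts for three problems: only the first fully passes.
verdicts = [
    {"answer": True,  "ntc": True, "nlg": True,  "nae": True, "nce": True},
    {"answer": True,  "ntc": True, "nlg": False, "nae": True, "nce": True},
    {"answer": False, "ntc": True, "nlg": True,  "nae": True, "nce": True},
]
print(overall_accuracy(verdicts))  # prints 0.3333333333333333
```

This conjunction is why Overall Acc is much lower than Answer Acc for every model in the table: a correct final answer with a flawed derivation does not count.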
Process Query
ID | Status | Model | Size | Type | Source | Date | Submission time | Overall Acc | Answer Acc | Step Acc (NTC) | Step Acc (NLG) | Step Acc (NAE) | Step Acc (NCE)
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Status Explanation:
- Processing: Your submission is currently being evaluated by us. This may take several minutes to complete.
- Completed: Evaluation is finished and results are ready. You can submit these results to the leaderboard for public display.
- Pending: Your submission is awaiting admin approval (either for addition to or removal from the leaderboard).
- Leaderboard: Your submission is currently displayed on the public leaderboard and visible to all users.
Actions Available:
- Select "Completed" submissions to submit them to the leaderboard
- Select "Leaderboard" submissions to request their removal from the leaderboard
About
IneqMath is an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws.
Leaderboard Details
- The leaderboard displays model performance on the IneqMath test set.
- Community members can request automated benchmarking of any new model on IneqMath.
- To submit an evaluation request, visit the 🤗 IneqMath Leaderboard.
- To contribute your own model's results, click the Submission button on the leaderboard page and upload a results file that meets our format requirements.
- You can track your model's evaluation progress and view its results in Process Query; enter your email to gain access.
- You may choose whether your results are published on the leaderboard, and you may request the removal of any content already displayed. All such requests are reviewed by us, and the requested action is carried out only once approved.
Submission Instructions
1. Download the dataset
   The IneqMath dataset is available on Hugging Face. Follow the instructions on that page to obtain the test split.
2. Generate a results file
   - For proprietary models, follow the instructions in our GitHub repository to produce a results JSON.
   - For other models, evaluate on the test set and create a JSON file containing at least these five keys for each example:
     {
       "data_id": 1,          // Integer or string: the example's unique ID
       "problem": "…",        // String: the question text
       "type": "relation",    // String: either "relation" or "bound"
       "prompt": "…",         // String: the prompt you used
       "response": "…"        // String: the model's output
     }
3. Submit to the leaderboard
   Click Submission on the leaderboard page, fill in the requested information, and upload your JSON file.
4. Receive your report
   You can track your model's evaluation progress and manage your results in Process Query; enter your email to gain access.
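For step 2, assembling the records can be as simple as looping over the test examples and calling your model once per problem. In this sketch, `generate` and the prompt template are placeholders for your own inference setup, not part of the official pipeline:

```python
import json

def build_results(test_examples, generate):
    """Assemble leaderboard-format records from a model callable.

    `generate` is any function mapping a prompt string to a response string.
    """
    results = []
    for ex in test_examples:
        prompt = f"Solve the following inequality problem:\n{ex['problem']}"
        results.append({
            "data_id": ex["data_id"],
            "problem": ex["problem"],
            "type": ex["type"],  # "relation" or "bound"
            "prompt": prompt,
            "response": generate(prompt),
        })
    return results

# Demo with a stub model; swap in your own inference call, then
# json.dump(records, open("results.json", "w")) to produce the upload file.
examples = [{"data_id": 1, "problem": "Prove a^2 + b^2 >= 2ab.", "type": "relation"}]
records = build_results(examples, generate=lambda p: "By AM-GM, ...")
print(json.dumps(records, indent=2))
```

Each emitted record carries exactly the five required keys, so the file passes the format check on upload.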