📝 The organization of papers follows our survey
"Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks".
🚀 You are welcome to submit an issue to add your LLM4SE benchmarks!
🔥 If you find our survey useful for your research, please cite the following paper:
```bibtex
@article{LLM4SEBenchmark,
  title={Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks},
  author={Hu, Xing and Niu, Feifei and Chen, Junkai and Zhou, Xin and Zhang, Junwei and He, Junda and Xia, Xin and Lo, David},
  year={2025},
  journal={arXiv preprint arXiv:2505.08903},
  url={https://arxiv.org/abs/2505.08903}
}
```

## Requirements and Design
| Task | Benchmarks | Year | Evaluation Metrics | Paper | Link |
|---|---|---|---|---|---|
| Elicitation | NFR-Review | 2018 | - | [Paper] | [Link] |
| | Rahman and Zhu | 2024 | Readability, Understandability, Specificability, Technical-aspects | [Paper] | [Link] |
| | Habib et al. | 2025 | Precision, Recall, F | [Paper] | [Link] |
| | Voria et al. | 2025 | Precision, Recall, F, Accuracy, BLEU, ROUGE, METEOR, Brevity Penalty, Length Ratio | [Paper] | [Link] |
| Analysis | PROMISE NFR | 2007 | - | [Paper] | [Link] |
| | SecReq | 2010 | - | [Paper] | [Link] |
| | PURE | 2017 | - | [Paper] | [Link] |
| | Dalpiaz et al. | 2019 | Precision, Recall, F1-score, AUC | [Paper] | [Link] |
| | ReqEval | 2020 | Precision, Recall, F2, Success rate | [Paper] | [Link] |
| | NFR-SO | 2022 | F1 | [Paper] | [Link] |
| | DAMIR | 2022 | Precision, Recall, F2, Success rate | [Paper] | [Link] |
| | Gärtner and Göhlich | 2024 | Accuracy, Precision, Recall, F | [Paper] | [Link] |
| | Preda et al. | 2024 | Precision, Recall, F | [Paper] | [Link] |
| | Koltoff et al. | 2024 | Precision, Recall, F, Accuracy | [Paper] | [Link] |
| Specification & Validation | Jdoctor | 2018 | Precision, Recall, F | [Paper] | [Link] |
| | DocTer | 2022 | Precision, Recall, F1 | [Paper] | [Link] |
| | Poudel et al. | 2023 | F2, MAP | [Paper] | [Link] |
| | Mandal et al. | 2023 | Precision, Recall, F1 | [Paper] | [Link] |
| | SV-Benchmarks | 2024 | - | [Paper] | [Link] |
| | SpecGenBench | 2024 | #Passes, Success Probability, #Verifier Calls, User Rating | [Paper] | [Link] |
| | Reinpold et al. | 2024 | Precision, Recall, F | [Paper] | [Link] |
| | Krishna et al. | 2024 | Unambiguity, Understandability, Correctness, Verifiability, Consistency, Non-redundancy, Completeness, Conciseness | [Paper] | [Link] |
| | OSVBench | 2025 | Pass@N, Syntax Error, Semantic Error | [Paper] | [Link] |
| Management | Wang et al. | 2020 | Precision, Recall, F1 | [Paper] | [Link] |
| | Helmeczi et al. | 2023 | Accuracy, F1 | [Paper] | [Link] |
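
Most of the requirements benchmarks above report precision, recall, and an F-measure; ReqEval, DAMIR, and Poudel et al. use F2, which weights recall over precision. As a quick reference, a minimal sketch of the general F-beta computation (the function name and the example counts are illustrative, not taken from any benchmark):

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """General F-beta score from confusion counts.

    beta > 1 (e.g., F2) weights recall higher than precision, which is
    why recall-oriented requirements benchmarks report F2 rather than F1.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative example: 8 true positives, 2 false positives, 4 false negatives.
print(f_beta(8, 2, 4))            # F1 ~= 0.73
print(f_beta(8, 2, 4, beta=2.0))  # F2 ~= 0.69 (recall drags it down)
```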
## Coding Assistant
| Task | Benchmarks | Year | Evaluation Metrics | Paper | Link |
|---|---|---|---|---|---|
| Code Generation and Recommendation | Lin et al. | 2013 | BLEU, CodeBLEU | [Paper] | [Link] |
| | Leetcode | 2015 | Passing Test Cases, Runtime, Memory Usage | [Paper] | [Link] |
| | ExampleCheck | 2018 | Misuse rate | [Paper] | [Link] |
| | CONCODE | 2018 | BLEU | [Paper] | [Link] |
| | CoNaLa | 2018 | Precision, Recall, TPR | [Paper] | [Link] |
| | NL2Bash | 2018 | Manual evaluation, BLEU | [Paper] | [Link] |
| | Spider | 2018 | Component Matching, Execution Accuracy | [Paper] | [Link] |
| | CodeSearchNet | 2020 | NDCG, MRR | [Paper] | [Link] |
| | APPS | 2021 | Test Case Average, Strict Accuracy | [Paper] | [Link] |
| | MBPP | 2021 | % solved | [Paper] | [Link] |
| | CodeXGLUE | 2021 | EM, ES | [Paper] | [Link] |
| | HumanEval | 2021 | Pass@k | [Paper] | [Link] |
| | miniF2F | 2021 | Pass@k | [Paper] | [Link] |
| | Lyra | 2021 | BLEU, AST match | [Paper] | [Link] |
| | FC2Code | 2022 | BLEU | [Paper] | [Link] |
| | CodeContests | 2022 | n@k, Pass@k | [Paper] | [Link] |
| | AixBench | 2022 | Correctness, Maintainability, Pass@1 | [Paper] | [Link] |
| | ReCode | 2022 | Robust Pass@k, Drop@k | [Paper] | [Link] |
| | SecurityEval | 2022 | Percentage | [Paper] | [Link] |
| | MathEquations | 2022 | Functional accuracy | [Paper] | [Link] |
| | MBXP | 2022 | Pass@k | [Paper] | [Link] |
| | NumpyEval | 2022 | Pass@k | [Paper] | [Link] |
| | PandasEval | 2022 | Pass@k | [Paper] | [Link] |
| | TorchDataEval | 2022 | Pass@k | [Paper] | [Link] |
| | MonkeyEval | 2022 | Pass@k | [Paper] | [Link] |
| | BeatNumEval | 2022 | Pass@k | [Paper] | [Link] |
| | MTPB | 2022 | Pass Rate, PPL | [Paper] | [Link] |
| | Multi-HumanEval | 2022 | Pass@k | [Paper] | [Link] |
| | DSP | 2022 | Pass@k | [Paper] | [Link] |
| | ExeDS | 2022 | BLEU, CodeBLEU, EM | [Paper] | [Link] |
| | XLCoST | 2022 | BLEU, CodeBLEU, MRR | [Paper] | [Link] |
| | Qing et al. | 2023 | Success Rate | [Paper] | [Link] |
| | ClassEval | 2023 | Pass@k, DEP(F), DEP(M) | [Paper] | [Link] |
| | TACO | 2023 | Pass@k | [Paper] | [Link] |
| | xCodeEval | 2023 | F1, Pass@k, Accuracy | [Paper] | [Link] |
| | CodeApex | 2023 | AC@1, AC@all, AC Rate | [Paper] | [Link] |
| | CloverBench | 2023 | Accept@k | [Paper] | [Link] |
| | Mastropaolo et al. | 2023 | CodeBLEU, Levenshtein Distance | [Paper] | [Link] |
| | CoderEval | 2023 | Pass@k, Acc@k | [Paper] | [Link] |
| | EvalPlus | 2023 | Pass@k | [Paper] | [Link] |
| | Shapkin et al. | 2023 | CodeBLEU, Accuracy | [Paper] | [Link] |
| | CrossCodeBench | 2023 | EM, BLEU, ROUGE-L | [Paper] | [Link] |
| | MultiPL-E | 2023 | Pass@k | [Paper] | [Link] |
| | StudentEval | 2023 | Pass@1 | [Paper] | [Link] |
| | TorchDataComplexEval | 2023 | Pass@k | [Paper] | [Link] |
| | DS-1000 | 2023 | Pass@1 | [Paper] | [Link] |
| | ML-Bench | 2023 | Pass@k | [Paper] | [Link] |
| | LowCoder | 2023 | Accuracy | [Paper] | [Link] |
| | Ren et al. | 2023 | Time Consumption, Answer Correctness | [Paper] | [Link] |
| | CodeAlpaca (Py) | 2023 | - | [Paper] | [Link] |
| | CoLadder | 2023 | Usability, Cognitive Load | [Paper] | [Link] |
| | VeriGen | 2023 | Pass@k | [Paper] | [Link] |
| | SOEval | 2023 | NDCG@K | [Paper] | [Link] |
| | DeceptPrompt | 2023 | ASR, WFR | [Paper] | [Link] |
| | HumanEval-X | 2023 | Pass@k | [Paper] | [Link] |
| | ARCADE | 2023 | Pass@k | [Paper] | [Link] |
| | MCoNaLa | 2023 | BLEU | [Paper] | [Link] |
| | CrossCodeEval | 2023 | Code Match, Identifier Match | [Paper] | [Link] |
| | Pisces | 2023 | BLEU, Syntax-Match, CodeBLEU | [Paper] | [Link] |
| | MBPP/HumanEval/APPS-ET | 2023 | CrystalBLEU, BERTScore, COMET, CodeBERTScore | [Paper] | [Link] |
| | LiveCodeBench | 2024 | Pass@k | [Paper] | [Link] |
| | Mercury | 2024 | Beyond Pass | [Paper] | [Link] |
| | EffiBench | 2024 | ET, NET, MU, TMU, Pass@k | [Paper] | [Link] |
| | MBPP-san-DFY | 2024 | verify@k | [Paper] | [Link] |
| | CoderUJB | 2024 | Pass@k, Count@n, Coverage@n | [Paper] | [Link] |
| | PythonSaga | 2024 | Pass@k | [Paper] | [Link] |
| | DevEval | 2024 | Pass@k, Recall@k | [Paper] | [Link] |
| | Exec-CSN | 2024 | Pass@k | [Paper] | [Link] |
| | Wang et al. | 2024 | BLEU-4, CodeBLEU, Edit Similarity | [Paper] | [Link] |
| | EvoCodeBench | 2024 | Pass@k, Recall@k | [Paper] | [Link] |
| | RustEval | 2024 | Pass@k | [Paper] | [Link] |
| | DevBench | 2024 | Faithfulness, Pass@k | [Paper] | [Link] |
| | BigCodeBench | 2024 | Pass@k | [Paper] | [Link] |
| | OOPEval | 2024 | Pass@k, Pass@o | [Paper] | [Link] |
| | ODEX | 2024 | Pass@k | [Paper] | [Link] |
| | NaturalCodeBench | 2024 | Pass@k | [Paper] | [Link] |
| | PAREval | 2024 | speedup@k, efficiency@k | [Paper] | [Link] |
| | CAASD | 2024 | Pass rate | [Paper] | [Link] |
| | CodeScope | 2024 | Pass@k | [Paper] | [Link] |
| | CodeAgentBench | 2024 | Pass@k | [Paper] | [Link] |
| | JavaBench | 2024 | Completion@k, Compilation@k | [Paper] | [Link] |
| | Chart2Code-160k | 2024 | Execution/pass rate, text match | [Paper] | [Link] |
| | PoorCodeSumEval | 2024 | BLEU, BERTScore | [Paper] | [Link] |
| | ComplexCodeEval | 2024 | BLEU, Syntax Match, Data Flow Match | [Paper] | [Link] |
| | StackEval | 2024 | Acceptance Score | [Paper] | [Link] |
| | Code-Vision | 2025 | Pass@k | [Paper] | [Link] |
| | CodeIF-Bench | 2025 | Pass@k | [Paper] | [Link] |
| | CodeIF | 2025 | Satisfaction Rate | [Paper] | [Link] |
| | LibEvolutionEval | 2025 | F1-score, MRR | [Paper] | [Link] |
| | COFFE | 2025 | Efficient@k | [Paper] | [Link] |
| | Deep-Bench | 2025 | Pass@k | [Paper] | [Link] |
| | DynaCode | 2025 | Pass@k | [Paper] | [Link] |
| | FEA-Bench | 2025 | Precision, Recall | [Paper] | [Link] |
| | MaintainCoder | 2025 | Pass@k, CodeDiff, ASTsim | [Paper] | [Link] |
| | mHumanEval | 2025 | BERTScore | [Paper] | [Link] |
| | REPOEXEC | 2025 | Functional correctness, Dependency utilization | [Paper] | [Link] |
| | Plot2Code | 2025 | Code pass rate, text-match ratio | [Paper] | [Link] |
| | ProjectEval | 2025 | Pass@k | [Paper] | [Link] |
| | SolEval | 2025 | Pass@k, Compile@k, Gas Consumption | [Paper] | [Link] |
| | ConvCodeWorld | 2025 | Pass@k, MRR, Recall | [Paper] | [Link] |
| | Web-Bench | 2025 | Pass@k | [Paper] | [Link] |
| | REPOCOD | 2025 | Pass@k | [Paper] | [Link] |
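
Pass@k dominates the metrics column above. For reference, a minimal sketch of the unbiased pass@k estimator popularized by the HumanEval paper, where n samples are drawn per problem and c of them pass all tests (the function name and example counts are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem:
    1 - C(n - c, k) / C(n, k), given n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative example: 200 samples per problem, 23 pass the tests.
print(pass_at_k(200, 23, 1))   # ~0.12
print(pass_at_k(200, 23, 10))  # ~0.71
```

A benchmark-level score is then the mean of this estimate over all problems.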

| Task | Benchmarks | Year | Language | Evaluation Metrics | Size | Paper | Link |
|---|---|---|---|---|---|---|---|
| Code Summarization | PCSD | 2017 | Python | BLEU, BLEU-4, ROUGE-L, METEOR, CIDEr | 92,545 pairs | [Paper] | [Link] |
| | JCSD | 2018 | Java | Precision, Recall, F-Score, BLEU-4, METEOR, ROUGE-L, CIDEr, BLEU-DC | 87,136 pairs | [Paper] | [Link] |
| | Deepcom | 2018 | Java | BLEU-4, METEOR, ROUGE-L | 69,708 pairs | | [Link] |
| | Funcom | 2019 | Java | BLEU, ROUGE-L, METEOR | 2.1M pairs | | [Link] |
| | CodeXGLUE | 2021 | Java, Python | BLEU, BLEU-4, ROUGE-L, METEOR, USE, MRR | see the code generation table above | | [Link] |
| | Funcom-java-long | 2023 | Java | BLEU | 8,192 methods | | [Link] |
| | CroCoSum | 2023 | English, Chinese | ROUGE, BERTScore | 18,857 pairs | | [Link] |
| | CAPYBARA | 2023 | C | EM, BLEU-4, ROUGE-L, METEOR | 7,826 pairs | | [Link] |
| | BinSum | 2023 | C | BLEU, METEOR, ROUGE-L, Semantic Similarity | 557,664 functions | | [Link] |
| | P-CodeSum | 2024 | Multiple PLs | BLEU-4, ROUGE-L | 1,500 pairs | | [Link] |
| | FILE-CS | 2024 | Python | BLEU, ROUGE-L, METEOR | 98,236 pairs | | [Link] |
| Code Translation | CodeSearchNet | 2020 | Multiple PLs | BLEU, CodeBLEU, METEOR, Exact Match | 6.45M pairs | | [Link] |
| | CodeXGLUE | 2021 | Java, Python | BLEU-4, BLEU, ACC, CodeBLEU | 11,800 pairs | | [Link] |
| | CodeNet | 2021 | C++, Python | Compilation, Runtime Errors, Functional Errors | 4,053 problems, 13.9M samples | | [Link] |
| | CoST | 2022 | Multiple PLs | BLEU, CodeBLEU | 132,046 pairs | | [Link] |
| | XLCoST | 2022 | Multiple PLs | CodeBLEU, BLEU, MRR | 1,002,296 pairs | | [Link] |
| | Nova | 2023 | Binary | BLEU, Exact Match, Instruction LCS | 60,600 pairs | | [Link] |
| | SUT | 2023 | Multiple PLs | Syntax Unit Test Accuracy, Syntax Element Score | 60k parallel, 200k mono | | [Link] |
| | xCodeEval | 2023 | Multiple PLs | Pass@k | 25M examples | | [Link] |
| | CodeTransOcean | 2023 | Multiple PLs | BLEU, CodeBLEU, Exact String Match | 45 languages | | [Link] |
| | G-TransEval | 2023 | Multiple PLs | BLEU, CodeBLEU, Computational Accuracy | 400 pairs | | [Link] |
| | AVATAR | 2023 | Java, Python | BLEU, Syntax Match, CodeBLEU, Execution Accuracy | 62,520 pairs | | [Link] |
| | AVATAR-TC | 2024 | Java, Python | BLEU, CodeBLEU, Compilation Accuracy, Functional Equivalence | 57,368 pairs | | [Link] |
| | RustRepoTrans | 2024 | C, Java, Python → Rust | Pass@k | 375 tasks | | [Link] |
| Code Reasoning | CRUXEval | 2024 | Python | Pass@k | 800 | | [Link] |
| | REval | 2025 | Python | Accuracy, Incremental Consistency Score | 3,152 | | [Link] |
| | DyCodeEval | 2025 | Python | Pass@k, DivPass@k | 591 | | [Link] |
## Software Testing
| Task | Benchmarks | Year | Evaluation Metrics | Paper | Link |
|---|---|---|---|---|---|
| Test Generation | Evosuite SF110 | 2011 | Line coverage, branch coverage, and test correctness | [Paper] | [Link] |
| | Defects4J | 2014 | Number of executable test cases, CodeBLEU, line coverage, branch coverage, number of detected bugs | [Paper] | [Link] |
| | DynaMOSA | 2018 | Line coverage, branch coverage, number of detected bugs | [Paper] | [Link] |
| | BugsInPy | 2020 | Line coverage, branch coverage, number of detected bugs | [Paper] | [Link] |
| | HumanEval | 2021 | Mutation score, Pass@k, number of killed mutants, line coverage, branch coverage | [Paper] | [Link] |
| | MBPP | 2021 | Pass@k | [Paper] | [Link] |
| | APPS | 2021 | Pass@k | [Paper] | [Link] |
| | CodeContests | 2022 | Pass@k | [Paper] | [Link] |
| | HumanEval-X | 2023 | Pass@k | [Paper] | [Link] |
| | CoderUJB | 2024 | Syntax correctness rate, compile passing rate, line coverage | [Paper] | [Link] |
| | SWT-Bench | 2024 | Success rate, change coverage | [Paper] | [Link] |
| | TestBench | 2024 | Syntax/compilation/execution correctness rate, coverage/defect detection rate | [Paper] | [Link] |
| | TestEval | 2025 | Overall/line/branch/path coverage | [Paper] | [Link] |
| | ProjectTest | 2025 | Compilation/correctness/coverage rate | [Paper] | [Link] |
| Assertion Generation | ATLAS | 2020 | Exact match, edit distance, longest common subsequence | [Paper] | [Link] |
| GUI Testing | Themis | 2021 | Number of detected bugs, activity coverage | [Paper] | [Link] |
| | QTypist | 2021 | Passing rate, coverage metrics, activity number, page number | [Paper] | [Link] |
| Testing Automation | LAVA-M | 2016 | Coverage, unique bugs | [Paper] | [Link] |
| | Unibench | 2021 | Quality of bugs, stability of finding bugs, speed of finding bugs, overhead | [Paper] | [Link] |
| | FuzzBench | 2021 | Coverage, unique bugs | [Paper] | [Link] |
| | FuzzGPT | 2024 | Code coverage, API coverage, number of unique crashes | [Paper] | [Link] |
| Testing Prediction | IDoFT | 2019 | Precision, Recall, F1-Score | [Paper] | [Link] |
| | FlakeFlagger | 2021 | Precision, Recall, F1-Score | [Paper] | [Link] |
| Testing Repair | TARBENCH | 2025 | CodeBLEU, BLEU, exact match, repair accuracy | [Paper] | [Link] |
| | Syn-Bench | 2025 | Syntactic/semantic correctness, code coverage | [Paper] | [Link] |
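
Several test generation rows above (e.g., HumanEval) score suites by mutation analysis. A toy sketch of the computation, assuming a mutant counts as "killed" when the generated suite fails on it; all names and mutants here are illustrative:

```python
from typing import Callable, Iterable

def mutation_score(
    mutants: Iterable[Callable[[int, int], int]],
    test_suite: Callable[[Callable[[int, int], int]], bool],
) -> float:
    """Fraction of mutants killed, i.e., mutants on which the suite fails."""
    mutants = list(mutants)
    killed = sum(1 for m in mutants if not test_suite(m))
    return killed / len(mutants)

# Toy mutants of add(a, b) = a + b.
mutants = [
    lambda a, b: a - b,  # operator mutant: should be killed
    lambda a, b: a + b,  # equivalent mutant: survives any suite
    lambda a, b: b,      # statement mutant: should be killed
]

def suite(candidate) -> bool:
    # The suite passes iff every assertion holds for the candidate.
    return candidate(2, 3) == 5 and candidate(0, 4) == 4

print(mutation_score(mutants, suite))  # 2 of 3 killed -> ~0.67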
## AIOps
| Task | Benchmarks | Year | Evaluation Metrics | Paper | Link |
|---|---|---|---|---|---|
| Log Statement Generation | LANCE | 2022 | Correct prediction ratio | [Paper] | [Link] |
| | LogBench | 2024 | Accuracy, Precision, Recall | [Paper] | [Link] |
| | SCLogger | 2024 | Accuracy, Precision, Recall, F1, BLEU, ROUGE | [Paper] | [Link] |
| | AL-Bench | 2025 | Position Accuracy, Level Accuracy, Average Level Distance, Message Accuracy, Dynamic Expression Accuracy, Static Text Similarity | [Paper] | [Link] |
| Log Parsing | Loghub | 2023 | Accuracy | [Paper] | [Link] |
| | Loghub-2.0 | 2024 | Accuracy, F1-score | [Paper] | [Link] |
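
Log parsing benchmarks such as Loghub typically report a grouping-style accuracy: a log line counts as correctly parsed only if its predicted template groups it with exactly the same lines as the ground truth. A minimal sketch under that definition (function names and the sample logs are illustrative):

```python
from collections import defaultdict

def grouping_accuracy(predicted: list[str], truth: list[str]) -> float:
    """Fraction of log lines whose predicted template induces exactly
    the same group of lines as the ground-truth template."""
    def groups(templates: list[str]) -> dict[int, frozenset[int]]:
        by_template = defaultdict(set)
        for i, t in enumerate(templates):
            by_template[t].add(i)
        return {i: frozenset(by_template[t]) for i, t in enumerate(templates)}

    pred_groups, true_groups = groups(predicted), groups(truth)
    correct = sum(1 for i in range(len(truth)) if pred_groups[i] == true_groups[i])
    return correct / len(truth)

pred = ["conn from <*>", "conn from <*>", "disk <*> full", "disk full"]
true = ["conn from <*>", "conn from <*>", "disk <*> full", "disk <*> full"]
print(grouping_accuracy(pred, true))  # 2 of 4 lines -> 0.5
```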
## Maintenance
| Task | Benchmarks | Year | Evaluation Metrics | Paper | Link |
|---|---|---|---|---|---|
| Code Review | CodeReview | 2022 | Exact Match | [Paper] | [Link] |
| | CodeReviewer | 2022 | Exact Match, BLEU | [Paper] | [Link] |
| | AUGER | 2023 | ROUGE, Perfect Prediction Rate | [Paper] | [Link] |
| | Review-Explaining | 2023 | Explanation type correctness, semantic meaning correctness | [Paper] | [Link] |
| | Code-Review-Assist | 2023 | Precision, Recall, F1-score | [Paper] | [Link] |
| | CodeReview-New | 2024 | Exact Match Trim, Exact Match, BLEU | [Paper] | [Link] |
| | ManualReviewComment | 2025 | Precision, Recall, F1 | [Paper] | [Link] |
| Clone Detection | BigCloneBench | 2014 | Precision, Recall, F1 | [Paper] | [Link] |
| | POJ-104 | 2016 | Precision, Recall, MAP | [Paper] | [Link] |
| | Company-C/C++ | 2023 | MRR, Precision, Recall | [Paper] | [Link] |
| | GPTCloneBench | 2023 | Precision, Recall | [Paper] | [Link] |
| | Curated CodeNet | 2023 | Precision, Recall | [Paper] | [Link] |
| Refactoring | JavaRef | 2023 | Accuracy, Exact Match, Edit Distance, Character Error Rate | [Paper] | [Link] |
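
Retrieval-style rows above (e.g., POJ-104, Company-C/C++) report MRR and MAP, which also recur in the bug localization table below. A minimal sketch of both over ranked candidate lists (names and the example data are illustrative):

```python
def mrr(ranked_relevance: list[list[bool]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for rels in ranked_relevance:
        total += next((1.0 / (i + 1) for i, r in enumerate(rels) if r), 0.0)
    return total / len(ranked_relevance)

def mean_average_precision(ranked_relevance: list[list[bool]]) -> float:
    """MAP: mean over queries of precision averaged at each relevant rank."""
    ap_sum = 0.0
    for rels in ranked_relevance:
        hits, precisions = 0, []
        for i, r in enumerate(rels):
            if r:
                hits += 1
                precisions.append(hits / (i + 1))
        ap_sum += sum(precisions) / hits if hits else 0.0
    return ap_sum / len(ranked_relevance)

# Two queries; True marks a relevant candidate at that rank.
runs = [[False, True, True], [True, False, False]]
print(mrr(runs))                     # (1/2 + 1) / 2 = 0.75
print(mean_average_precision(runs))  # ((1/2 + 2/3)/2 + 1) / 2 ~= 0.79
```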
## Quality Management
| Task | Benchmarks | Year | Evaluation Metrics | Paper | Link |
|---|---|---|---|---|---|
| Defect Prediction | Bugs.jar | 2018 | Precision, Recall, F1, Accuracy, MCC | [Paper] | [Link] |
| | Bears | 2019 | Precision, Recall, F1, Accuracy, MCC | [Paper] | [Link] |
| | Zeng et al. | 2021 | Accuracy, Recall, False Discovery Rate, AUC-ROC, AUC-PR | [Paper] | [Link] |
| | Review-Explaining | 2023 | Explanation type correctness, semantic meaning correctness | [Paper] | [Link] |
| | JIT-defects4j | 2022 | F1-score, AUC, Recall@20%Effort, Effort@20%Recall, Popt, Top-N Accuracy | [Paper] | [Link] |
| | Opu et al. | 2025 | Precision, Recall, F1, Accuracy, MCC | [Paper] | [Link] |
| Bug Localization | Ye et al. | 2014 | Accuracy, MRR, MAP | [Paper] | [Link] |
| | Defects4J | 2014 | ACC@K, FPR, Top@N | [Paper] | [Link] |
| | Bench4BL | 2018 | MRR, MAP, HIT@K | [Paper] | [Link] |
| | Devign | 2019 | Top@N | [Paper] | [Link] |
| | BugsInPy | 2020 | ACC@K, Top@N | [Paper] | [Link] |
| | Zhu et al. | 2021 | Accuracy | [Paper] | [Link] |
| | CodeReviewer | 2022 | Accuracy | [Paper] | [Link] |
| | Ciborowska et al. | 2022 | Precision@K, Recall@K, F1-score@K, MRR, MAP | [Paper] | [Link] |
| | Ma et al. | 2023 | MAP, MRR, Top@N | [Paper] | [Link] |
| | RTLLM | 2024 | Hit Rate, Pass@k | [Paper] | [Link] |
| | BeetleBox | 2024 | Accuracy, MRR, MAP | [Paper] | [Link] |
| | SWE-bench | 2024 | Accuracy, MRR, MAP, Top@N, Precision | [Paper] | [Link] |
| | Chandramohan et al. | 2024 | Accuracy, MRR, MAP | [Paper] | [Link] |
| | Stracquadanio et al. | 2024 | Top-1 bug coverage | [Paper] | [Link] |
| | Manke et al. | 2024 | TP, FP | [Paper] | [Link] |
| | D58 | 2024 | Recall, MRR, CandiAvg | [Paper] | [Link] |
| | Saha et al. | 2024 | MRR, MAP, HIT@K | [Paper] | [Link] |
| | Widyasari et al. | 2024 | Top-K | [Paper] | [Link] |
| | LINUXFLBENCH | 2025 | Recall@k, MRR | [Paper] | [Link] |
| | ACPR | 2025 | Accuracy | [Paper] | [Link] |
| Repair | Defects4J | 2014 | # fixed bugs | [Paper] | [Link] |
| | QuixBugs | 2017 | # fixed bugs | [Paper] | [Link] |
| | LMDefects | 2023 | # fixed bugs | [Paper] | [Link] |
| | InferredBugs | 2023 | Ratio of fixed bugs | [Paper] | [Link] |
| | ARHE | 2023 | Accuracy | [Paper] | [Link] |
| | Leetcode-debug | 2023 | Acceptance rate | [Paper] | [Link] |
| | API-Misuse-Repair | 2017 | Exact Match, BLEU, CodeBLEU | [Paper] | [Link] |
| | DebugBench | 2024 | Pass Rate | [Paper] | [Link] |
| | SWE-bench | 2024 | Resolution rate | [Paper] | [Link] |
| | SWE-bench Multimodal | 2024 | Resolution rate | [Paper] | [Link] |
| | SWE-Lancer | 2025 | Resolution rate | [Paper] | [Link] |
| | Multi-SWE-bench | 2025 | Resolution rate | [Paper] | [Link] |
| Vulnerability Detection | Choi et al. | 2017 | Accuracy, F1, AUC | [Paper] | [Link] |
| | Lin et al. | 2017 | Top-k Recall | [Paper] | [Link] |
| | DGBBench | 2017 | Precision, Recall, F1, Accuracy | [Paper] | [Link] |
| | Juliet | 2018 | Precision, Recall, MCC | [Paper] | [Link] |
| | VulDeePecker | 2018 | FN, FP, TN, TP, Precision, Recall, F1, AUC, MCC | [Paper] | [Link] |
| | Draper | 2018 | FN, FP, TN, TP, Precision, Recall, F1, AUC, MCC | [Paper] | [Link] |
| | Devign | 2019 | Accuracy, Precision, Recall, F1, FPR, AUC, Precision@K, MCC | [Paper] | [Link] |
| | Ponta et al. | 2019 | AUC, F1 | [Paper] | [Link] |
| | BigVul | 2020 | Accuracy, Precision, Recall, F1, FPR, AUC, Precision@K, MCC | [Paper] | [Link] |
| | ReVeal | 2020 | Accuracy, Precision, Recall, F1, FPR, AUC, Precision@K | [Paper] | [Link] |
| | SmartBugs | 2020 | Precision, Recall, F1, Top-N Accuracy, MAR, MFR | [Paper] | [Link] |
| | Great | 2020 | Precision, Recall, Accuracy | [Paper] | [Link] |
| | Magma | 2020 | ROC-AUC | [Paper] | [Link] |
| | SolidiFI | 2020 | FN, FP | [Paper] | [Link] |
| | SySeVR | 2021 | FPR, FNR, Precision, Recall, F1 | [Paper] | [Link] |
| | D2A | 2021 | Precision, Recall, MCC | [Paper] | [Link] |
| | PatchDB | 2021 | Precision, Recall, F1 | [Paper] | [Link] |
| | CVEFixes | 2021 | Accuracy, Precision, Recall, F1, FPR | [Paper] | [Link] |
| | CrossVul | 2021 | Accuracy, Precision, Recall, F1, FPR | [Paper] | [Link] |
| | VCMatch | 2022 | AUC, F1 | [Paper] | [Link] |
| | VUDENC | 2022 | Precision, Recall, F1, Accuracy | [Paper] | [Link] |
| | SARD | 2023 | Accuracy, Precision, Recall, F1 | [Paper] | [Link] |
| | DiverseVul | 2023 | Accuracy, Precision, Recall, F1, FPR | [Paper] | [Link] |
| | Web3Bugs | 2023 | TP, TN, FP, FN | [Paper] | [Link] |
| | DeFi Hacks | 2023 | TP, TN, FP, FN | [Paper] | [Link] |
| | VulBench | 2023 | Precision, Recall, F1 | [Paper] | [Link] |
| | OWASP | 2023 | Accuracy | [Paper] | [Link] |
| | TreeVul | 2023 | F1, Macro-F1, MCC | [Paper] | [Link] |
| | FormAI | 2023 | Precision, Recall, F1, Accuracy | [Paper] | [Link] |
| | Hu et al. | 2023 | # hits | [Paper] | [Link] |
| | FalconVulnDB | 2024 | Precision, Recall, F1, Accuracy | [Paper] | [Link] |
| | FormAI-v2 | 2024 | Average Property Violations Per File/Line | [Paper] | [Link] |
| | MoreFixes | 2024 | Accuracy, Precision, Recall, F1 | [Paper] | [Link] |
| | VulEval | 2024 | Precision, Recall, F1, MCC, Precision@k, Recall@k | [Paper] | [Link] |
| | InterPVD | 2024 | FPR, FNR, Accuracy, Precision, F1 | [Paper] | [Link] |
| | ReposVul | 2024 | Accuracy | [Paper] | [Link] |
| | MegaVul | 2024 | Accuracy, Precision, Recall, F1 | [Paper] | [Link] |
| | SecLLMHolmes | 2024 | Response Rate, Accuracy, Correct Reasoning Rate | [Paper] | [Link] |
| | VulDetectBench | 2024 | F1, Accuracy | [Paper] | [Link] |
| | SC-LOC | 2024 | Precision, Recall, Accuracy, F1-Score | [Paper] | [Link] |
| | Ma et al. | 2024 | Precision, Recall, Accuracy, F1-Score | [Paper] | [Link] |
| | FELLMVP | 2024 | Precision, Recall, Accuracy, F1-Score | [Paper] | [Link] |
| | Yıldırım et al. | 2024 | Accuracy | [Paper] | [Link] |
| | Vulcorpus | 2024 | Accuracy, Improvement Suggestion | [Paper] | [Link] |
| | Fang et al. | 2024 | Not vulnerability detection | [Paper] | [Link] |
| | SLFHunter | 2024 | TP, TN, FP, FN, F1-score | [Paper] | [Link] |
| | Guo et al. | 2024 | Precision, Recall, F1-Score | [Paper] | [Link] |
| | VulnPatchPairs | 2024 | Precision, Recall, F1, Accuracy, FPR, FNR | [Paper] | [Link] |
| | Real-Vul | 2024 | Precision, Recall, F1, Accuracy, AUC | [Paper] | [Link] |
| | PairVul | 2024 | Accuracy, Pairwise Accuracy, F1-score, MCC | [Paper] | [Link] |
| | VulSmart | 2024 | Precision, Recall, F1, Accuracy | [Paper] | [Link] |
| | KernJC | 2024 | TP, TN, FP, FN, Precision, Recall, F1, Accuracy | [Paper] | [Link] |
| | LLM4Vuln | 2025 | TP, TN, FP, FN, F1-score | [Paper] | [Link] |
| | VULZOO | 2025 | Precision, Recall, F1, Accuracy | [Paper] | [Link] |
| | CWE-Bench-Java | 2025 | #Detected, Avg. False Discovery Rate, Avg. F1, Precision, Recall | [Paper] | [Link] |
| | CASTLE | 2025 | CASTLE Score, Combination Score, Precision, Recall, Accuracy | [Paper] | [Link] |
| | SecVulEval | 2025 | Human-evaluated scoring rubric | [Paper] | [Link] |
| | JITVUL | 2025 | Precision, Recall, F1, Accuracy | [Paper] | [Link] |
| | Li et al. | 2025 | Precision, Recall, F1, Accuracy | [Paper] | [Link] |
| | BinPool | 2025 | Precision, Recall, F1, Accuracy | [Paper] | [Link] |
| | ICVul | 2025 | Precision, Recall, F1, Accuracy | [Paper] | [Link] |
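
MCC recurs throughout the defect prediction and vulnerability detection rows above because, unlike accuracy or F1, it stays informative under heavy class imbalance. A minimal sketch from confusion counts (the example numbers are illustrative):

```python
from math import sqrt

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient in [-1, 1]; 0 means no better
    than chance, which makes it robust to the imbalance typical of
    vulnerability detection datasets."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Imbalanced example: 990 negatives, 10 positives. A detector that
# finds 6 of the vulnerabilities while raising 20 false alarms:
print(mcc(tp=6, tn=970, fp=20, fn=4))  # ~0.36
```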