CASTLE Benchmark

Logo of the CASTLE Benchmark

The CASTLE Benchmark is a comprehensive dataset and scoring method for evaluating static analyzers, individually or in combination, with a focus on security. It consists of a hand-crafted dataset of 250 micro-benchmark programs (almost 11,000 lines of C code) covering 25 common CWEs. We also introduce the novel CASTLE Score metric to enable fair and reliable comparisons; it accounts for true positive and false positive rates as well as a tool's ability to find the more common issues. The dataset supports comparing single tools as well as measuring the effectiveness of tool combinations.

This dataset was created by Richard A. Dubniczky, Krisztofer Zoltán Horvát, Tamás Bisztray, Mohamed Amine Ferrag, Lucas C. Cordeiro, and Norbert Tihanyi as a joint research project, and the paper is currently under peer review.

The paper preprint is available at: https://arxiv.org/abs/2503.09433

Citing the Paper

@misc{dubniczky2025castle,
    title={CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection}, 
    author={Richard A. Dubniczky and Krisztofer Zoltán Horvát and Tamás Bisztray and Mohamed Amine Ferrag and Lucas C. Cordeiro and Norbert Tihanyi},
    year={2025},
    eprint={2503.09433},
    archivePrefix={arXiv},
    primaryClass={cs.CR},
    url={https://arxiv.org/abs/2503.09433}, 
}

The Complete Dataset

  • CASTLE-C250.json - The parsed and labeled dataset with 250 tests in C.
  • CASTLE-C250.min.json - The minified version of the parsed and labeled dataset with 250 tests in C. It contains everything from the non-minified version but is less readable; recommended for automated use.
  • CASTLE-C250 - All 250 tests in C as individual C files with an accompanying Makefile.
  • CASTLE-Source - The source code repository for the CASTLE Dataset, Tests, Wrappers, Evaluators, Diagrams and more ...
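For automated use, the JSON file can be loaded with any standard JSON library. The following is a minimal Python sketch that counts tests per CWE. The schema is not described in this README, so the top-level `tests` key and the per-test `name` and `cwe` fields are assumptions made for illustration; inspect the actual file and adjust the names before relying on it.

```python
import json
from collections import Counter


def count_tests_per_cwe(dataset: dict) -> Counter:
    """Count tests per CWE id.

    Assumes a hypothetical layout: {"tests": [{"cwe": <int>, ...}, ...]}.
    The real CASTLE-C250.json layout may differ.
    """
    return Counter(test["cwe"] for test in dataset["tests"])


# Tiny in-memory stand-in for the real file (hypothetical field names).
sample = json.loads(
    '{"tests": ['
    '{"name": "t1", "cwe": 787},'
    '{"name": "t2", "cwe": 787},'
    '{"name": "t3", "cwe": 22}'
    ']}'
)

per_cwe = count_tests_per_cwe(sample)
print(per_cwe.most_common())
```

With the real dataset, `sample` would instead come from `json.load` on an opened CASTLE-C250.min.json file.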

Architecture

CASTLE Architecture Framework Diagram

The CASTLE Architecture consists of 4 main stages:

  1. We selected the tested CWEs, created the dataset of tests and labels, and validated the correctness of the C code over many iterations using expert review, static analyzers, formal verification methods, and LLMs. The final result of this step is the dataset JSON file.
  2. In the second step we created wrappers for all tools to automate the evaluation as much as possible. For some open-source tools this means running the tests in a container in sequence, while for others we had to access APIs or manually download the results and parse them afterwards. The output of this step is a report JSON file in a custom common format.
  3. In the manual review phase we looked at all the ~7,500 findings and validated our TP and FP classifications. Some tools marked a different line or CWE than our tests indicate, and in those cases we set our classification accordingly and updated our tests. We ran these tests on the tools at least 3 times between Nov 2024 and Feb 2025 with the updated dataset.
  4. We evaluated the findings and calculated the CASTLE Score for single tools and tool combinations, as well as generated our final toplists and charts.
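The interplay between stages 2 and 3 can be sketched as a first-pass classifier over findings in a common format. This is an illustration only, written in Python: the `Label` and `Finding` types, their field names, and the `REVIEW` outcome are hypothetical and are not the project's actual report format. It assumes a finding counts as a true positive when both CWE and line match the label, and routes any mismatch on a vulnerable test to manual review, since that is where the stage 3 validation was needed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Label:
    """Ground truth for one test (hypothetical structure)."""
    test_id: str
    vulnerable: bool
    cwe: Optional[int] = None
    line: Optional[int] = None


@dataclass(frozen=True)
class Finding:
    """One tool finding in a common format (hypothetical structure)."""
    test_id: str
    cwe: int
    line: int


def classify(finding: Finding, label: Label) -> str:
    """Return "TP", "FP", or "REVIEW" for a finding against its label."""
    if not label.vulnerable:
        # Any finding on a non-vulnerable test is a false positive.
        return "FP"
    if finding.cwe == label.cwe and finding.line == label.line:
        return "TP"
    # Different line or CWE than labeled: leave the decision to a human.
    return "REVIEW"


vulnerable_label = Label("t1", vulnerable=True, cwe=787, line=12)
safe_label = Label("t2", vulnerable=False)

exact = classify(Finding("t1", cwe=787, line=12), vulnerable_label)
off_line = classify(Finding("t1", cwe=787, line=14), vulnerable_label)
false_alarm = classify(Finding("t2", cwe=476, line=3), safe_label)
print(exact, off_line, false_alarm)
```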

Results

We tested a total of 25 tools (13 static analyzers, 2 formal verification tools, 10 LLMs) on the CASTLE C250 Dataset. The results across the 250 tests, ordered by CASTLE Score:

The results of the CASTLE Benchmark

Bar chart comparing the CASTLE Scores of the individual tools and the best 5 tool combinations. The theoretical maximum score is 1250, while the negative score is unbounded, since every false positive subtracts points.
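To see why a score that penalizes false positives is bounded above but not below, consider the toy Python model here. It is not the CASTLE Score formula, which is defined in the paper; the function name and the `tp_points` and `fp_penalty` weights are placeholders chosen only to illustrate the behavior.

```python
def toy_score(true_positives: int, false_positives: int,
              tp_points: int = 5, fp_penalty: int = 1) -> int:
    """Reward each true positive and subtract a penalty per false positive.

    tp_points and fp_penalty are placeholder weights, NOT the paper's values.
    """
    return true_positives * tp_points - false_positives * fp_penalty


# A precise tool: few findings, almost all of them correct.
precise = toy_score(true_positives=40, false_positives=5)
# A noisy tool: more true positives, but buried in false alarms.
noisy = toy_score(true_positives=60, false_positives=400)
print(precise, noisy)
```

True positives are capped by the number of vulnerabilities in the dataset, but nothing caps how many false positives a tool can report, so in this model a noisy tool can fall arbitrarily far below zero.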

The CASTLE Score barcharts

True vs False Positive rates of the tools:

True and False positive rate of the tools

Additional Charts

List of CWEs in the dataset

Strengths of LLMs vs Static Analyzers based on the standard metrics.

Comparing the CASTLE Score of tool combinations vs the better individual tool scores

True positive count for each tool per CWE

Venn diagram of the specific vulnerabilities detected by the three tools in the best-scoring three-way combination. The smaller the intersections, the greater the improvement in CASTLE Score for a limited number of false positives.
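The intuition behind the Venn diagram can be checked with plain set arithmetic. The Python sketch below is illustrative: the tool names and test ids are made up, and it only counts detected vulnerabilities, ignoring false positives and the score weighting.

```python
def combination_gain(detections: dict[str, set[str]]) -> int:
    """Extra vulnerabilities found by the combination over its best member."""
    combined = set().union(*detections.values())
    best_single = max((len(found) for found in detections.values()), default=0)
    return len(combined) - best_single


# Hypothetical tools with little overlap: the combination adds coverage.
disjoint = {
    "tool_a": {"t1", "t2", "t3"},
    "tool_b": {"t3", "t4"},
    "tool_c": {"t5"},
}
# Hypothetical tools that mostly find the same issues: nothing is gained.
overlapping = {
    "tool_a": {"t1", "t2", "t3"},
    "tool_b": {"t1", "t2"},
    "tool_c": {"t3"},
}

gain_disjoint = combination_gain(disjoint)
gain_overlapping = combination_gain(overlapping)
print(gain_disjoint, gain_overlapping)
```

Here the low-overlap combination detects two vulnerabilities beyond its best single tool, while the high-overlap combination detects none.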
