We benchmarked top SAST products, and this is what we learned

When we started to build Bearer, we wanted to understand how to validate the quality of our findings and benchmark them against other tools. Code security scanning solutions are notorious for reporting a lot of false positives and other deficiencies, and even though we believed we could do much better, we needed a way to prove it.

For Java, there is an OWASP project, BenchmarkJava, which makes it easy to compare the output of two software security solutions. Unfortunately, no similar benchmark exists for other languages.

We’ve shared previously how we are building and improving our own solution using Open Source projects. During our conversations with enterprise customers and users, we understood that they face a similar challenge when comparing one solution with another, resulting in incomplete decisions and a lot of frustration later on. With organisations consistently dealing with tight product release timelines in a competitive market, our hope is to help security meet developers where they are, and for them to use this benchmark as a decision enabler.

As we like to build in public as much as possible, keeping in theme with our Open Source engine, we thought it was time to look into how Bearer CLI compares today with other “free” and available SAST solutions on the market.

For this benchmark, we are focusing on a few key features, such as language support, quality of the findings, speed of the scanner, extensibility and relevance of the ruleset, and User Experience (UX) for both security and developers. We will go through them all in detail below, giving you a blueprint and data points for how to compare two or more SAST solutions.

Note: Like any benchmark, ours has some inherent biases. We have documented our decisions as thoroughly as possible so that you have all the context you need to understand them. More importantly, we have included the dataset used to generate the numbers at the end of the article, so that you can explore it yourself and come to your own conclusions.

Landscape

We decided to benchmark Bearer against solutions that are often mentioned as being the ‘modern’ ones, which is a mix of commercial solutions with Free and/or Open Source offerings.

We have selected Semgrep, Snyk Code, and Brakeman (as part of Synopsys), and of course, Bearer CLI.

|  | Bearer CLI | Semgrep | Snyk Code | Brakeman (Synopsys) |
| --- | --- | --- | --- | --- |
| Free offering limitation | N/A | N/A | 100 scans per month | N/A |
| Open Source license | Elastic | LGPL 2.1 | N/A | Unclear |
| Built in | Go | OCaml + Python | Unknown | Ruby |
| Launched in | 2023 | 2017 (originates from Facebook's Pfff OSS project) | 2020 (originates from the Deepcode acquisition) | 2011 |

We will review these solutions across five sections in this benchmark:

  • Language support
  • Quality of findings
  • Speed
  • User Experience
  • New risks coverage

If you want to see all the data points we collected, head directly to ‘The Complete Benchmark’ section at the end of this post.

As you go through this benchmark, you are welcome to find out more about Bearer CLI, or try it directly via GitHub to compare it with your SAST scanner and create your own benchmark. If you are interested in learning how you can manage application security at scale, supercharged with sensitive data context, you can request a demo for Bearer Cloud.

Language support

Language support is probably the #1 factor when deciding to use a SAST product. In today’s world, teams tend to use multiple languages and stacks to build their product, which means they either need a solution with broad language support or have to combine multiple solutions.

Unfortunately, language support is not as simple as “does it support X?” The level of granularity required to offer good support makes it difficult to do well across many languages. Behind language support sit factors such as framework support, relevance of the ruleset, the quality of the rules themselves, and how well they are maintained.

Solutions that support many languages tend to offer very uneven quality from one language to the next, leading customers to either combine multiple options or accept low quality on some of their stacks.

We collected data on the three categories below for our benchmark:

  • Language coverage: Which languages does the solution cover?
  • Ruleset: How many rules are available for each supported language? Even if it’s not necessarily a good indication of the depth of support (details follow), it’s a useful point of comparison across the languages a single solution covers.
  • % of rules that triggered in the benchmark: Providing a lot of rules is great, but do they actually matter? That’s a very different question. One way to assess the quality of a ruleset is to measure how likely its rules are to actually trigger (see the sketch after the table below).
|  | Bearer CLI | Semgrep | Snyk Code | Brakeman (Synopsys) |
| --- | --- | --- | --- | --- |
| Language coverage | JS/TS, Ruby, Java | JS/TS, Ruby, Java, Go, C#, Kotlin, PHP, Python, Scala | JS/TS, Ruby, Java, Go, PHP, Python, C# | Ruby |
| JS/TS: Ruleset | 65 | 208 | 51 | N/A |
| JS/TS: % of rules that triggered in the benchmark | 54% | 18% | 100% | N/A |
| Ruby: Ruleset | 59 | 43 | 26 | 84 |
| Ruby: % of rules that triggered in the benchmark | 58% | 65% | 100% | 65% |
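
If you want to reproduce the “% of rules that triggered” figure for your own codebase, here is a minimal sketch of the approach. It assumes the scanner’s SARIF output lists its full rule catalogue under `runs[].tool.driver.rules` (not every tool populates this the same way) and uses a hypothetical `scan-output.sarif` file name, so treat it as an approximation rather than an official recipe.

```typescript
// rules-triggered.ts: estimate what share of a tool's ruleset actually fired,
// given a SARIF report (the file name is a placeholder).
import { readFileSync } from "fs";

const sarif = JSON.parse(readFileSync("scan-output.sarif", "utf8"));

for (const run of sarif.runs ?? []) {
  // Rules the tool ships, assuming the tool writes its catalogue here.
  const catalogue: string[] = (run.tool?.driver?.rules ?? []).map(
    (rule: { id: string }) => rule.id
  );

  // Rules that produced at least one finding.
  const triggered = new Set<string>();
  for (const result of run.results ?? []) {
    if (result.ruleId) triggered.add(result.ruleId);
  }

  const pct = catalogue.length > 0
    ? Math.round((triggered.size / catalogue.length) * 100)
    : 0;
  console.log(
    `${run.tool?.driver?.name ?? "unknown tool"}: ${triggered.size}/${catalogue.length} rules triggered (~${pct}%)`
  );
}
```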

TL;DR We can clearly see that the size of the ruleset itself is not that important. A large ruleset in which only a few rules ever trigger may indicate a lack of relevance and maintenance, while a smaller ruleset may simply mean its rules are built to be ‘catch-all’ rather than ‘surgical’.

Quality of findings

Beyond language support, what matters most is the quality of the findings, which is also the most difficult part to assess. To run this benchmark, we used 15 top Open Source projects for each language (Ruby and JS/TS) and manually reviewed and classified every finding. Traditional SAST tools are notorious for a high percentage of false positives, so that is our focus here.

Ultimately, we collected data on the three categories below for our benchmark:

  • Total # of findings: How many findings did the solution surface?
  • Total # of false positives: How many of those findings are actually not relevant?
  • % Precision: What share of the reported findings are true positives?
|  | Bearer CLI | Semgrep | Snyk Code | Brakeman (Synopsys) |
| --- | --- | --- | --- | --- |
| JS/TS: Total # of findings | 508 | 216 | 600 | N/A |
| JS/TS: Total # of false positives | 50 | 27 | 234 | N/A |
| JS/TS: % Precision | 90% | 87% | 61% | N/A |
| Ruby: Total # of findings | 333 | 1345 | 327 | 471 |
| Ruby: Total # of false positives | 39 | 765 | 221 | 310 |
| Ruby: % Precision | 88% | 43% | 32% | 34% |

TL;DR Precision is the data point we all have in mind, though it’s important to contextualise it with the number of findings: a tool can achieve 100% precision by reporting only 10 findings, which is why precision needs to be evaluated alongside volume.

Furthermore, one thing this analysis does not capture is the number of false negatives, i.e. the real issues a tool misses entirely. The related ‘recall’ metric measures the share of real issues a tool actually finds, and computing it would require a complete ground truth of every vulnerability in the benchmarked projects.
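
To be explicit about the maths: precision is simply true positives divided by total findings, i.e. (total findings minus false positives) divided by total findings. The short sketch below reproduces the Ruby precision figures from the table above; it is only a restatement of the arithmetic, not part of the benchmark tooling.

```typescript
// precision.ts: recompute the precision figures from the raw counts above.
// precision = true positives / total findings = (findings - false positives) / findings

type Result = { tool: string; findings: number; falsePositives: number };

// Ruby numbers from the table above.
const rubyResults: Result[] = [
  { tool: "Bearer CLI", findings: 333, falsePositives: 39 },
  { tool: "Semgrep", findings: 1345, falsePositives: 765 },
  { tool: "Snyk Code", findings: 327, falsePositives: 221 },
  { tool: "Brakeman", findings: 471, falsePositives: 310 },
];

for (const { tool, findings, falsePositives } of rubyResults) {
  const precision = (findings - falsePositives) / findings;
  console.log(`${tool}: ${(precision * 100).toFixed(0)}% precision`);
}

// Recall (how many real issues were found) would additionally need a
// ground-truth list of every real vulnerability in the scanned projects.
```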

Speed

Speed is a single data point, but an important enough one to deserve its own category. As reported in the GitHub DevEx Survey 2023, 25% of developers’ time is spent waiting for code reviews, so scan speed is an important consideration for DevSecOps programs, mainly for two reasons:

  1. How long does the team need to wait to review findings?
  2. CI/CD runtime costs money. Every second a scan hangs is money lost.
|  | Bearer CLI | Semgrep | Snyk Code | Brakeman (Synopsys) |
| --- | --- | --- | --- | --- |
| JS/TS: Avg execution time | 82 seconds | 33 seconds | 270 seconds | N/A |
| Ruby: Avg execution time | 131 seconds | 72 seconds | 418 seconds | 79 seconds |

TL;DR In general, solutions that operate locally tend to be quite fast, enabling seamless integration into a CI/CD workflow. It is noteworthy that Snyk Code, which runs in the cloud, is significantly slower than the others.

It is crucial to consider speed and precision together, as speed alone holds little value; we strongly recommend weighing these two data points side by side.
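
To make the CI cost point concrete, here is a back-of-the-envelope sketch. The scan volume and per-minute runner price are illustrative assumptions (roughly in line with hosted-runner pricing), not measured values, so substitute your own numbers.

```typescript
// ci-cost.ts: rough monthly runner cost of scan time (all inputs are assumptions).
const scansPerMonth = 600;         // e.g. ~30 push/PR scans per working day
const runnerCostPerMinute = 0.008; // USD, ballpark hosted Linux runner rate

const avgScanSeconds: Record<string, number> = {
  "Bearer CLI": 131, // Ruby averages from the table above
  Semgrep: 72,
  "Snyk Code": 418,
  Brakeman: 79,
};

for (const [tool, seconds] of Object.entries(avgScanSeconds)) {
  const monthlyCost = (seconds / 60) * runnerCostPerMinute * scansPerMonth;
  console.log(`${tool}: ~$${monthlyCost.toFixed(2)} per month in runner time`);
}
```

Raw compute is only half of the equation; the waiting time from point 1 above typically matters even more.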

User Experience

When providing a solution for security engineers and developers, UX is key, and it means something quite specific here: ultimately, we are talking about a tool for engineers.

These are important questions we all need to consider: How difficult is it to set up? How do you run it? Does it require sending source code to a cloud? How many output formats do you have access to? What arguments are available to control the tool precisely? How well does it integrate into your workflow?

We gathered data on the following categories for our benchmark:

  • Setup type & avg time: Engineers have very little time, and therefore little patience. The speed and ease of setup is usually a good indicator of a good developer tool.
  • Execution type: Is the scanning done locally (on your machine or infrastructure) or in a remote cloud? Essentially, do you need to trust your SAST provider with your code?
  • CLI options: Is the solution fully controllable from the CLI, from choosing which rules to run and filtering by severity level, to ignoring specific findings?
  • Output format: The more formats a tool supports, the better it will integrate into your workflow. This is especially true for SAST tools that are run both manually and automatically and need to integrate easily with other tools (see the sketch after the table below).
  • Open rules: Is the code/pattern underlying each rule available? This matters if you want to understand why a finding was (or wasn’t) triggered, and ultimately it builds confidence in the rules themselves.
  • Custom rules support: Providing an excellent bundle of rules as part of language coverage is key, but there are always custom use-cases where specific rules might be required. Being able to build your own rules is the best way to make sure all your use-cases will be covered in the long term, and it helps reduce vendor lock-in.
  • Source Code Management (SCM) integration level: Modern security products need the best possible CI/CD integration. Does it integrate out of the box with GitHub and GitLab? With your CI/CD? Can it annotate pull requests?
|  | Bearer CLI | Semgrep | Snyk Code | Brakeman (Synopsys) |
| --- | --- | --- | --- | --- |
| Setup type and avg time | CLI install (< 1 min.) | CLI install (< 1 min.) | Requires online signup + CLI install | Ruby package install (< 1 min.) |
| Execution type | Local | Local | Cloud | Local |
| CLI options | Complete | Complete | Partial (missing per-rule filtering and the ability to ignore findings) | Complete |
| Output format | JSON, SARIF, HTML | JSON, SARIF, XML | JSON, SARIF, HTML | JSON, SARIF, CSV, HTML |
| Open rules | Yes | Yes | No | Yes |
| Custom rule support | Yes | Yes | Beta | Yes |
| SCM integration level | GitHub and GitLab Security integration, pull request annotation | GitHub and GitLab Security integration | GitHub and GitLab Security integration, pull request annotation | Unclear |
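
As a small illustration of why open output formats matter for workflow integration: since all four tools can emit SARIF, a single post-processing step can gate a pipeline regardless of which scanner produced the report. This is a minimal sketch, not an official integration for any of these tools, and it assumes each finding carries a SARIF `level` field.

```typescript
// gate-on-severity.ts: fail the build if a SARIF report contains findings
// at a blocking level. Works with any scanner that emits SARIF.
import { readFileSync } from "fs";

const BLOCKING_LEVELS = new Set(["error"]); // SARIF levels: none, note, warning, error

const reportPath = process.argv[2] ?? "scan-output.sarif"; // placeholder default
const sarif = JSON.parse(readFileSync(reportPath, "utf8"));

const blocking = (sarif.runs ?? []).flatMap((run: any) =>
  (run.results ?? []).filter((result: any) => BLOCKING_LEVELS.has(result.level))
);

if (blocking.length > 0) {
  console.error(`${blocking.length} blocking finding(s); failing the build.`);
  process.exit(1);
}
console.log("No blocking findings.");
```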

TL;DR Since we’ve benchmarked ‘modern’ solutions, we can clearly see that they mostly live up to that expectation when it comes to User Experience. That said, the best experience comes from the solutions that are open and free, compared with the fully closed-source products.

New risks coverage 

SAST is evolving, just as risks and the role of security teams are. We believe that sensitive data exfiltration risks, combined with third-party service risks, should be part of any SAST product by default.

By adding a sensitive data context layer to our SAST solution, Bearer CLI is able to detect risks such as “leakage of PHI to a logger” or “leakage of PII to OpenAI”, as well as provide an automated privacy report to allow security and privacy engineering teams to kick-start their compliance journey.
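
To make that concrete, here is a hypothetical JS/TS snippet of the kind of pattern a “leakage of PII to a logger” rule is designed to flag. The logger, types, and field names are purely illustrative and are not taken from any of the benchmarked rulesets.

```typescript
// Hypothetical example of the pattern a "PII sent to a logger" rule flags.
import pino from "pino"; // any logging library would do; pino is just an example

const logger = pino();

interface User {
  id: string;
  email: string;    // PII
  fullName: string; // PII
}

function handleLogin(user: User) {
  // Would be flagged: personally identifiable information flows into log output.
  logger.info({ email: user.email, name: user.fullName }, "user logged in");

  // Safer alternative: log an opaque identifier instead.
  logger.info({ userId: user.id }, "user logged in");
}
```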

|  | Bearer CLI | Semgrep | Snyk Code | Brakeman (Synopsys) |
| --- | --- | --- | --- | --- |
| Third-party component detection | Yes | No | No | No |
| Data exfiltration rules | Yes | No | No | No |
| Threat modeling | Yes | No | No | No |
| Privacy report | Yes | No | No | No |

TL;DR Safeguarding sensitive data and preserving privacy have become some of the most significant emerging risks for your organisation. As the nature of risks continues to evolve, it is crucial that the value offered by your SAST solution evolves with them.

The Complete Benchmark

Here is the data point summary of the benchmark:

|  | Bearer CLI | Semgrep | Snyk Code | Brakeman (Synopsys) |
| --- | --- | --- | --- | --- |
| Free offering limitation | N/A | N/A | 100 scans per month | N/A |
| Open Source license | Elastic | LGPL 2.1 | N/A | Unclear |
| Built in | Go | OCaml + Python | Unknown | Ruby |
| Launched in | 2023 | 2017 (originates from Facebook's Pfff OSS project) | 2020 (originates from the Deepcode acquisition) | 2011 |
| Language support |  |  |  |  |
| Language coverage | JS/TS, Ruby, Java | JS/TS, Ruby, Java, Go, C#, Kotlin, PHP, Python, Scala | JS/TS, Ruby, Java, Go, PHP, Python, C# | Ruby |
| JS/TS: Ruleset | 65 | 208 | 51 | N/A |
| JS/TS: % of rules that triggered in the benchmark | 54% | 18% | 100% | N/A |
| Ruby: Ruleset | 59 | 43 | 26 | 84 |
| Ruby: % of rules that triggered in the benchmark | 58% | 65% | 100% | 65% |
| Quality of findings |  |  |  |  |
| JS/TS: Total # of findings | 508 | 216 | 600 | N/A |
| JS/TS: Total # of false positives | 50 | 27 | 234 | N/A |
| JS/TS: % Precision | 90% | 87% | 61% | N/A |
| Ruby: Total # of findings | 333 | 1345 | 327 | 471 |
| Ruby: Total # of false positives | 39 | 765 | 221 | 310 |
| Ruby: % Precision | 88% | 43% | 32% | 34% |
| Speed |  |  |  |  |
| JS/TS: Avg execution time | 82 seconds | 33 seconds | 270 seconds | N/A |
| Ruby: Avg execution time | 131 seconds | 72 seconds | 418 seconds | 79 seconds |
| UX |  |  |  |  |
| Setup type and avg time | CLI install (< 1 min.) | CLI install (< 1 min.) | Requires online signup + CLI install | Ruby package install (< 1 min.) |
| Execution type | Local | Local | Cloud | Local |
| CLI options | Complete | Complete | Partial (missing per-rule filtering and the ability to ignore findings) | Complete |
| Output format | JSON, SARIF, HTML | JSON, SARIF, XML | JSON, SARIF, HTML | JSON, SARIF, CSV, HTML |
| Open rules | Yes | Yes | No | Yes |
| Custom rule support | Yes | Yes | Beta | Yes |
| SCM integration level | GitHub and GitLab Security integration, pull request annotation | GitHub and GitLab Security integration | GitHub and GitLab Security integration, pull request annotation | Unclear |
| New risks coverage |  |  |  |  |
| Third-party component detection | Yes | No | No | No |
| Data exfiltration rules | Yes | No | No | No |
| Threat modeling | Yes | No | No | No |
| Privacy report | Yes | No | No | No |

Our goal is to update this benchmark every once in a while and, ideally, expand it to other solutions and languages. We are excited to hear your feedback and comments, so please don’t hesitate to reach out to us on Twitter @tryBearer or join us on Discord!

Please find here the entire data set used to create this benchmark.