Different levels of true positives and precision in vulnerability scanners
Vulnerability scanners and security tools often create a lot of noise. In this article, we try to clarify what causes the noise in software composition analysis tools and what Debricked is doing to reduce it!
What is precision and why do we care about it?
By precision, we essentially mean the number of true positive results divided by the total number of positive results (true positives plus false positives). So, if an SCA tool generates 100 security alerts but only 90 of them are true positives, it has a precision of 90%.
The reason why we want precision to be as high as possible is quite obvious: you don't want to spend unnecessary resources analysing and solving problems that do not exist. False positives spark irritation and steal time and focus; frankly, they create more problems than they solve.
The tradeoff between precision and recall
Oftentimes we treat these two metrics – precision and recall – as if both simply referred to the accuracy of a given model. While that is true to a certain extent, each measure has a much more specific meaning.
Precision is the proportion of predicted positives that are indeed positive – in other words, how many of the results are relevant?
Precision = true positives / (true positives + false positives)
Recall is the proportion of all actual positives (the total relevant results) that the model correctly identified.
Recall = true positives / (true positives + false negatives)
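To make the two formulas concrete, here is a minimal Python sketch (purely illustrative, not part of any scanner):

    def precision(tp: int, fp: int) -> float:
        """Proportion of predicted positives that are truly positive."""
        return tp / (tp + fp)

    def recall(tp: int, fn: int) -> float:
        """Proportion of actual positives that the model found."""
        return tp / (tp + fn)

    # Example: 100 alerts, of which 90 are real vulnerabilities,
    # while 10 real vulnerabilities were missed entirely.
    print(precision(tp=90, fp=10))  # 0.9
    print(recall(tp=90, fn=10))     # 0.9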
Both measures should be considered when evaluating a tool that relies heavily on data. This is where a trade-off comes into play, since you may increase one of the given statistical means at the cost of another.
This is quite intuitive – if you try to recall everything, you inevitably generate more inaccurate results, which drags down your precision.
Identifying the correct balance between these two depends on careful consideration of the problem at hand that is being solved; in this case – finding vulnerabilities in the open source software that you are using.
“Precision is only the rate of true positives in your results, while recall corresponds to the number of true positives that you find. It is almost impossible to truly know the recall, as we can’t know all the vulnerabilities, despite us always trying to do our best, which poses a dilemma to keep in mind”
Emil Wåreus, Head of Data Science at Debricked.
How do we choose which trade-off to go for? Such a balancing decision depends directly on the purpose of the model and the data. In the case of, for instance, medical screening tests, you would probably want recall to be close to 1.0 in order to find all patients in need of treatment.
If the costs of further medical examination are not high, precision could be sacrificed for higher recall. Understanding both of these statistical constructs is crucial, and both should be carefully examined in the context of vulnerabilities.
What are the 4 results of Binary Classification?
True positives: data points labelled as positive that are actually positive
False positives: data points labelled as positive that are actually negative
True negatives: data points labelled as negative that are actually negative
False negatives: data points labelled as negative that are actually positive
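For illustration, here is a small Python sketch that tallies the four outcomes from lists of actual and predicted labels; the data is made up:

    from collections import Counter

    def confusion_counts(actual: list[bool], predicted: list[bool]) -> Counter:
        """Tally the four binary classification outcomes."""
        counts = Counter()
        for a, p in zip(actual, predicted):
            if p and a:
                counts["true_positive"] += 1
            elif p and not a:
                counts["false_positive"] += 1
            elif not p and not a:
                counts["true_negative"] += 1
            else:
                counts["false_negative"] += 1
        return counts

    # Made-up labels: (actually vulnerable, flagged by the scanner)
    actual    = [True, True, False, False, True]
    predicted = [True, False, True, False, True]
    print(confusion_counts(actual, predicted))
    # 2 true positives, 1 false negative, 1 false positive, 1 true negative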
This poses a challenging optimization problem. How can we choose between precision and recall and what is Debricked’s take on it? Throughout this article we ask Emil Wåreus, co-founder and Head of Data Science, to elaborate on these challenges.
Emil has, amongst other things, contributed insights to the Linux Foundation's Guide on SCA Tools comparison, as described in our previous article on how to choose an SCA tool.
Emil, what is Debricked’s take on the optimization problem – the dilemma between recall and precision?
“The optimization problem has a lot to do with the usability of Software Composition Analysis-tools and vulnerability management. At Debricked we have noticed that having a lot of false positives causes the usage of our tool to decrease significantly. That in turn implies that the value that the tool delivers is reduced”.
The precision levels for vulnerabilities in open source
Now, in order to grasp the dynamics and variety of open source vulnerability issues, let’s dive into the levels of precision.
Layer 0 – Errors in the underlying data sources
This is a different, equally important, topic to discuss. However, in this article we will assume that the input data from vulnerability sources is correct.
Layer 1 – Correctly matching which Open Source is used
The first layer is to correctly understand which open source dependencies are actually being used in the software that you want to scan. There are many concepts to tackle, for example, “asset inventory”, “Software Bill of Materials (SBOM)”, etc. But the underlying challenge remains the same – how do you know, with high precision, that you have identified the correct open source?
To illustrate, a common mismatch occurs when two dependencies share the same name and are written in the same language, but are available from different package managers.
These types of problems are very common. For instance, the "Click" dependency, a popular CLI tool on GitHub, is written in Python, yet there is another dependency with the same name, also written in Python, but used for the Linux stack. This can cause a lot of confusion.
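One simple way to reduce this kind of mix-up is to identify a dependency by more than its bare name. The Python sketch below illustrates the idea; the ecosystem labels and versions are hypothetical:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PackageKey:
        """Identify a dependency by ecosystem, name and version, not name alone."""
        ecosystem: str  # e.g. which package manager or registry it comes from
        name: str
        version: str

    # Two packages both named "click", both written in Python,
    # but from different ecosystems - distinct keys, no confusion.
    pypi_click = PackageKey(ecosystem="pypi", name="click", version="1.0.0")
    distro_click = PackageKey(ecosystem="linux-distro", name="click", version="0.5.0")
    assert pypi_click != distro_click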
In the Software Composition Analysis industry, there are two main techniques for understanding what open source is used:
Snippet Level analysis and matching
The code to be analysed is hashed and compared to a database of so-called snippets from open source projects. This makes it possible to find and match open source code that is not "officially" declared in any dependency files or similar.
This is particularly useful for so-called partial matches, where a developer has copied part of an open source project into the software. However, this behaviour is increasingly uncommon, since most developers today use package managers.
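As a rough illustration of the idea (real snippet matching is far more sophisticated), one could hash overlapping windows of normalized source lines and intersect the fingerprint sets:

    import hashlib

    def snippet_fingerprints(source: str, window: int = 2) -> set[str]:
        """Hash overlapping windows of normalized lines - a much simplified
        stand-in for real snippet-level matching."""
        lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
        chunks = ["\n".join(lines[i:i + window])
                  for i in range(max(len(lines) - window + 1, 1))]
        return {hashlib.sha256(c.encode()).hexdigest() for c in chunks}

    # A large overlap between scanned code and a known OSS file suggests
    # copied code, even when no dependency file declares it.
    oss_code = "def add(a, b):\n    return a + b\n"
    scanned = "# copied helper\ndef add(a, b):\n    return a + b\n"
    overlap = snippet_fingerprints(oss_code) & snippet_fingerprints(scanned)
    print(len(overlap), "matching snippet(s)")  # 1 matching snippet(s)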
Component level analysis derived from dependency files
Just as it sounds: you find all dependency files, such as package-lock.json and composer.lock, and extract the dependencies from them. In some cases the whole dependency tree needs to be resolved, as it is not stated in the files themselves.
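As a minimal example of what such extraction can look like, here is a Python sketch that reads pinned versions from an npm lockfile (v2/v3 format); real tools handle many more formats and resolve full dependency trees:

    import json

    def npm_dependencies(lockfile_text: str) -> dict[str, str]:
        """Extract pinned versions from an npm lockfile, where resolved
        packages live under the "packages" key."""
        lock = json.loads(lockfile_text)
        deps = {}
        for path, meta in lock.get("packages", {}).items():
            if path:  # the empty key "" is the root project itself
                name = path.split("node_modules/")[-1]
                deps[name] = meta.get("version", "unknown")
        return deps

    lock = '{"lockfileVersion": 3, "packages": {"": {}, "node_modules/lodash": {"version": "4.17.21"}}}'
    print(npm_dependencies(lock))  # {'lodash': '4.17.21'}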
What approach does Debricked use?
We primarily use component level analysis, as it is enough for 99% of all customers. However, we also have snippet level capabilities for cases where you must be certain that no developer has accidentally copied and pasted open source code into your software.
Regarding our component level analysis, we have a machine learning model in place with the goal of estimating the confidence of a match between a vulnerability and the software package. We enrich the data we have on vulnerabilities and software packages as much as possible to have more data to compare.
This includes looking at release dates of software, publishing dates of vulnerabilities, references and whether they are referring to similar communities, as well as performing text analysis with natural language processing to identify similarities between different parts of data.
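To give a feel for what such a confidence model might consume, here is a hypothetical Python sketch. The features and weights are invented for illustration and are not Debricked's actual model:

    from dataclasses import dataclass

    @dataclass
    class MatchFeatures:
        """Invented example features for a vulnerability <-> package match."""
        days_between_release_and_disclosure: int
        shared_reference_domains: int  # overlap in advisory/repository references
        name_similarity: float         # e.g. normalized edit distance
        description_similarity: float  # e.g. cosine similarity of text embeddings

    def naive_confidence(f: MatchFeatures) -> float:
        """A hand-weighted stand-in for a trained model's confidence score."""
        score = 0.4 * f.name_similarity + 0.3 * f.description_similarity
        score += 0.2 * min(f.shared_reference_domains, 5) / 5
        score += 0.1 * (1.0 if f.days_between_release_and_disclosure >= 0 else 0.0)
        return score

    print(naive_confidence(MatchFeatures(120, 3, 0.95, 0.8)))  # 0.84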
To summarize – this is not a simple task! Still, it is often overlooked by customers when evaluating different tools.
What is the usual precision rate among commercially available tools and free tools?
Precision on the vulnerability-to-dependency match (Level 1) from a really good tool is generally assumed to be above 90%. In some languages you can even reach 95%, but as matching becomes more challenging, you will need to accept bigger trade-offs with recall.
Free tools, such as GitHub's Dependabot, can have considerably lower precision rates depending on the language. We have even seen performance drop towards 60-70% for this and other free tools. In comparison, Debricked scores about 90% level 1 precision across all supported languages.
Layer 2 – Is the vulnerable part (function) called in the critical component?
The second level investigates whether you are using the vulnerable part and functionality of the specific open source component. When you have a vulnerability in an open source dependency, it usually doesn’t mean that the whole project is affected by that vulnerability.
Some vulnerabilities are more compartmentalised (meaning that maybe a single class or method is vulnerable), such as injection type vulnerabilities or buffer overflows, while other more logical vulnerabilities may not be compartmentalised to one single class or function.
You may be using a vulnerable dependency – but are you using the vulnerable functionality in that dependency? If you can find that method or class, you can then scan your proprietary code and check whether you’re using the vulnerable functionality.
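In a language with good tooling for this, the naive core of such a check could look like the Python sketch below. The library and function names are hypothetical, and real call-graph analysis is far more involved:

    import ast

    def references_function(source: str, vulnerable_name: str) -> bool:
        """Very simplified static check: is a function with the given
        name ever called in this code?"""
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Call):
                func = node.func
                name = getattr(func, "id", getattr(func, "attr", None))
                if name == vulnerable_name:
                    return True
        return False

    code = "import somelib\nresult = somelib.unsafe_parse(data)"
    print(references_function(code, "unsafe_parse"))  # True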
State of the industry
Analysing this is a difficult challenge. No vendor has truly solved this issue for a multitude of languages. Not only do you need to find and isolate the vulnerable functions in millions of OSS projects, which in itself is challenging, but you also have to find the vulnerable parts called by the customer's proprietary software, which can be very hard depending on the languages used. Static analysis in scripted languages, such as JavaScript and Python, is close to impossible, since they are simply not built for it. Therefore, dynamic analysis is usually the better choice for these types of languages.
Debricked’s approach
We have this functionality built for Java, where static analysis works to some extent. We also have decent initial performance in Python, which customers will be able to enjoy before summer 2021. Next, we are looking at JavaScript and C#.
What can this mean in practice?
Checking whether you are calling the vulnerable functions is often done manually today. According to our research, level 2 precision makes it possible to eliminate tens of percent (0-50%) of the total number of vulnerabilities that need to be analysed, which in turn saves a lot of time.
Layer 3 – Is this code ever run in runtime? And how often?
Be aware that you need to prioritise vulnerabilities in code that actually runs in your runtime and production environments. You want to separate dependencies based on how they are used in your software.
For instance, some OSS frameworks and dependencies are only used in tests (e.g. Pytest) – something that is not nearly as critical to fix as a dependency that is in production.
Another example is when you have statically imported a vulnerable dependency and are calling the vulnerable function, but the code is never actually executed at runtime – the dependency is never even installed in the runtime environment. It is therefore not as important to prioritise.
A similar problem is development tools, such as development server environments. Those can be vulnerable too, but can be completely irrelevant for the security of your application in production.
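A first, simple approximation of this separation can be read straight from a manifest file. The Python sketch below splits an npm package.json into production and development-only dependencies, which is a crude proxy for how the code is used at runtime:

    import json

    def split_by_scope(manifest_text: str) -> dict[str, list[str]]:
        """Separate production dependencies from development-only ones, so
        runtime-relevant vulnerabilities can be prioritized first."""
        manifest = json.loads(manifest_text)
        return {
            "production": sorted(manifest.get("dependencies", {})),
            "development_only": sorted(manifest.get("devDependencies", {})),
        }

    manifest = '{"dependencies": {"express": "^4.18.0"}, "devDependencies": {"jest": "^29.0.0"}}'
    print(split_by_scope(manifest))
    # {'production': ['express'], 'development_only': ['jest']}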
Debricked’s Approach
Unfortunately, this is not something that we are doing at Debricked today. However, we are conducting active research in this area and hope to have something to show in the future.
Layer 4 – Is the code reachable and exploitable in production?
The last level answers the question of whether the vulnerability is exploitable in its production environment; these are the most urgent types of vulnerabilities. However, maybe you have coded around a particular vulnerability so that it does not directly affect you, even though you are using the vulnerable code.
Most vulnerabilities are solved simply by updating dependencies – but maybe no update exists that solves the problem, or you cannot update because of serious breaking changes.
You may keep using the vulnerable functions as long as you handle them correctly in your proprietary code. There is, to our knowledge, no tool provider that explicitly offers the feature of checking exploitability in your final production environment and software.
Debricked’s Approach
Just like the previous layer, this is not something we offer today. But of course, we are actively researching methods that could help us do this analysis in a scalable way.
How Debricked aims to dominate
We have built quite an elaborate testing environment for measuring and benchmarking our precision and recall. We run continuous measurements against hundreds of repositories, so the scope of our testing is very large.
Moreover, we have constructed a database of vulnerability matches and maintain a large dataset of manually labeled vulnerabilities. This is a great way for us to examine and monitor our performance and to compare ourselves against competitors.
Thus, we continuously know our performance for the languages that we support in real-time. This creates a feedback loop into our matching algorithms and development initiatives to further increase our performance and impress our customers. Naturally, customer feedback on vulnerabilities is also taken into account.
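Conceptually, such a benchmark boils down to comparing a tool's reported matches against manually labeled ground truth. Here is a minimal Python sketch with made-up example data:

    def benchmark(predicted: set, ground_truth: set) -> tuple:
        """Score a scanner's (dependency, CVE) matches against manual labels."""
        tp = len(predicted & ground_truth)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(ground_truth) if ground_truth else 0.0
        return precision, recall

    labels = {("lodash", "CVE-2021-23337"), ("log4j-core", "CVE-2021-44228")}
    tool = {("lodash", "CVE-2021-23337"), ("click", "CVE-2020-0000")}
    print(benchmark(tool, labels))  # (0.5, 0.5)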
Besides this, we are also conducting research on how new, yet unclassified vulnerabilities can be discovered.