What metrics are important when benchmarking and choosing SCA tools? Ibrahim Haddad, the Executive Director of LF AI & Data Foundation, will delve into this topic and bring you useful and actionable insights into open source management from a compliance and security perspective.
From compliance to future predictions for open source management
As the adoption of open source keeps expanding into new fields, choosing the right tools — ones that let the relevant teams stay proactive in their security work while remaining in control of the end-to-end process — has become an undeniably crucial concern. In the following interview, Ibrahim will also cover the recent discussions around precision and recall and share his take on that topic. Let’s dive in!
Hi Ibrahim! It’s an honor to have you here. Can you tell us more about your background?
Hello and thank you for this opportunity. My name is Ibrahim Haddad and I am the Executive Director of LF AI & Data Foundation, an umbrella organization under the Linux Foundation focused on supporting and advancing the development and innovation of open source AI & Data projects. In prior roles, I worked at Samsung Research, the Linux Foundation, HP, Palm, Motorola and Ericsson Research mostly focused on software development and building platforms using open source software.
Throughout my career, I either developed software, managed R&D teams, or drove open source software initiatives, which made me the primary person responsible for ensuring open source compliance. These efforts taught me, among other things, how to build and run open source program offices and how to establish compliance programs and ensure compliance at scale. I love sharing these experiences through publications that I make available on my website.
You recently wrote a guide on how to evaluate and benchmark SCA tools, which Debricked contributed to. Why is this such an important question?
Thank you for the Debricked contribution to the guide. It was a really cool exercise to have the guide reviewed in the open and to incorporate incoming feedback before it was published, following the open source model.
The open source compliance community in general lacks a standardized way, or a formally recognized set of metrics, to compare SCA tools. My hope was that the guide would at least provide a baseline of metrics to help organizations decide what’s important to them and then make apples-to-apples comparisons of the tools’ functionalities.
Evaluating these tools is actually a daunting process: they can differ vastly in what they do and how they do it, and there are many companies (over 10 and counting) competing in the space with a mixed bag of functionalities and a blurring line between their compliance and security offerings. This is such an important topic because choosing the wrong tool for your needs and priorities, or even the second best, can have a significant impact on your development activities. My advice on this specific point is to be as methodical as possible in the evaluation process and dedicate enough time to learn the ins and outs of the final candidates and how they can help you achieve your goals from a compliance and security perspective. Also, don’t rush the process, and remember that this choice will stay with you for a few years unless you plan to redo the evaluation within a year.
What is the main factor to consider for someone choosing between two SCA tools?
There are many factors, and they are rarely the same for any two organizations. Each company has its own priorities, risk profile, rules, policies, and guidelines. We also need to be aware that organizations are making two choices, not one: a tool that manages their open source legal compliance, and a tool that discovers security vulnerabilities in their code bases. What unites them is that both require an engine to scan the code; if you have that, you can in theory implement both.
In relation to factors to consider, I would consider the metrics listed in the guide I published a year ago, but also pay attention to a few additional items:
- A tool that enables and empowers developers and doesn’t slow them down, break their workflow, or require additional work from them (unless there is an issue that genuinely needs to be dealt with)
- A tool that enables a scan-with-every-commit approach and integrates with every version control system and build system we are using or could be using in the future. Following the open source practice of “release early and often”, I’d like scans to be done as early and as often as possible. What better approach than running them at commit time and failing the commit if a compliance or security red flag is raised?
- A tool with high-quality data, so you can operate on the assumption that the results are accurate and up to date with developments happening outside your organization’s walls; this applies to both project origin and license information and security vulnerability information
- A tool that can ideally integrate project-specific metrics on each project’s health and long-term sustainability. Remember that some of the most impactful vulnerabilities came from poorly maintained projects. With such data, you will know which of your dependencies are the weak links in your chain. You then have the chance to either find healthier projects offering the same functionality, or contribute to the project yourself and help make it viable and sustainable.
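The scan-with-every-commit idea above can be sketched as a small commit-time policy gate. This is a hypothetical illustration, not any real tool’s API: the finding fields, the denied-license set, and the severity threshold are all made-up examples that an organization would set according to its own policy.

```python
# Hypothetical sketch of a commit-time compliance/security gate.
# "findings" stands in for output from whatever SCA tool you use;
# the field names here are illustrative, not a real tool's schema.

DENIED_LICENSES = {"GPL-3.0-only"}  # example policy, set per organization
MAX_SEVERITY = 7.0                  # fail the commit at CVSS >= 7.0

def should_fail_commit(findings):
    """Return (fail, reasons) for a list of scan findings."""
    reasons = []
    for f in findings:
        if f.get("license") in DENIED_LICENSES:
            reasons.append(f"denied license {f['license']} in {f['component']}")
        if f.get("cvss", 0.0) >= MAX_SEVERITY:
            reasons.append(f"high-severity vulnerability in {f['component']}")
    return (bool(reasons), reasons)

# Example: one clean dependency, one that raises two red flags
findings = [
    {"component": "zlib", "license": "Zlib", "cvss": 0.0},
    {"component": "libfoo", "license": "GPL-3.0-only", "cvss": 9.8},
]
fail, reasons = should_fail_commit(findings)
```

In a real pipeline this check would be wired into a pre-commit hook or CI step, with a non-zero exit code rejecting the commit when `fail` is true.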
Recently, discussions have arisen around the amount of noise created by these tools. What is your take on the tradeoff between precision and recall?
I’ve personally used SCA tools for many years and had my share of frustration with false positives. At its core, it’s a data quality issue combined with the challenge of tracing a given body of code to its true origin and capturing its license.
Zlib is a great example that I use often because it has a permissive license and its code is copied into hundreds of components. When you scan software containing zlib code, you will get hundreds of hits with many different licenses. How can you figure out the true origin and source when a piece of code shows hundreds of matches from different sources under different licenses? Ideally, I’d like to get a single positive hit telling me the code matches the zlib project, licensed under the zlib/libpng license.
From a security standpoint, a tool may flag vulnerable code that is never actually called at runtime, resulting in the noise you refer to. It’s true the vulnerable code is there, but it’s not being used, and determining that requires a special breed of tools capable of deducing such information.
To be honest, I think it is a hard balance from a tool perspective. I would really encourage providers to focus on a solution that does not force a tradeoff in the first place. If a tradeoff must happen, it will depend on the organization’s risk profile and the speed at which it wants to move more than anything else.
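The precision/recall tradeoff discussed above can be made concrete when benchmarking a tool against a code base whose true components are known. A minimal sketch, with made-up numbers purely for illustration:

```python
# Precision vs. recall for an SCA tool, measured against a known
# ground truth. The component/license pairs below are illustrative.

def precision_recall(reported, actual):
    """reported/actual are sets of (component, license) identifications."""
    true_positives = reported & actual
    precision = len(true_positives) / len(reported)  # how much of the output is right
    recall = len(true_positives) / len(actual)       # how much of the truth was found
    return precision, recall

actual = {("zlib", "Zlib"), ("openssl", "Apache-2.0"), ("libpng", "libpng-2.0")}
reported = {("zlib", "Zlib"), ("openssl", "Apache-2.0"),
            ("minizip", "Zlib")}  # one real component missed, one spurious hit

p, r = precision_recall(reported, actual)
# p = 2/3: the spurious hit (noise) lowers precision
# r = 2/3: the missed component lowers recall
```

A noisy tool drives precision down; a conservative tool that suppresses uncertain matches drives recall down. This is why the tradeoff ultimately maps onto an organization’s risk profile rather than onto a single “best” setting.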
Lastly, what do you think the future holds for open source management?
I think the space is moving pretty fast, with many players competing for market share and trying to out-innovate each other. The future is certainly bright, and that goes hand in hand with the increased adoption of open source software in new industries and the thousands of new organizations entering the open development ecosystem that will require SCA solutions.
In relation to tooling capabilities, I would list the following predictions for the next 12–24 months. I don’t know whether all of them will materialize in a single solution, but I see various efforts by different vendors going in these directions:
- Full automation of the compliance and security process with assistance of AI and Machine Learning technologies
- Increased accuracy, the ability to correctly identify the true origin of the code and the license under which it was released
- Improved data quality, which is an ongoing issue for knowledge bases and vulnerability information
- Seamless integration with developers’ workflows, for instance by running a scan at every commit
- Improved integration with build systems and package managers, along with the ability to be agnostic towards programming languages
- Improved discovery of security vulnerabilities: not necessarily static vulnerable code that is never run, but vulnerable code that is actually called and executed at runtime
Thank you for your questions and the opportunity to chat.