Data-driven method shows promise for detecting open source supply chain attacks

by Debricked Editorial Team

2022-12-27


Recent years have seen a rise in malicious actors targeting the open source ecosystem, alongside high-profile incidents such as the Log4j vulnerability. The growing popularity of open source software has brought together many code lovers, but it has also exposed the ecosystem to threats such as software supply chain attacks.

These attacks often target widely used open source projects, exploiting the fact that even large projects may rely on smaller, more niche ones that are easier to compromise. As these attacks become more prevalent, open source supply chain security has become an increasingly important area of research.

We are Oliver and David, Master’s students in Information and Communication Engineering Technologies and Electrical Engineering at Lund University. In our Master’s thesis with Debricked, we explored the potential of using machine learning algorithms to detect and classify these attacks. This blog post summarizes our findings.

The starting line

The goal of our thesis is to evaluate the effectiveness of using machine learning algorithms to detect open source supply chain attacks.

To do this, we used state-of-the-art algorithms to convert the source code in our dataset into a simplified data structure known as a tensor. We then applied our machine learning models to this data, analyzing 375 packages and 434 individual files. The results help us understand which approaches work best and which types of attacks are easier to detect using this method.

The “how”

Our study uses a straightforward approach to detect open source supply chain attacks. We first convert the source code into tensors, which allows us to measure how similar different packages are to each other. Based on these similarities, we group packages into clusters and use these clusters to identify new attacks. If a new package is similar to a cluster of known malicious packages, we consider it to be malicious. Using this method, we were able to achieve an F1-score of approximately 0.85, which beats other methods in the same field.
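To make the idea of "similar to a cluster" concrete, here is a minimal sketch in Python. It is not the thesis implementation: the three-dimensional vectors, the cluster names, the use of cosine similarity against a cluster centroid, and the 0.9 threshold are all assumptions chosen for illustration. Real code representations would have far more dimensions and would come from a learned model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def classify(vector, malicious_clusters, threshold=0.9):
    """Flag a vector as malicious if it is close to any known-malicious cluster.

    The threshold is a hypothetical value, not one taken from the thesis.
    """
    for cluster in malicious_clusters:
        if cosine_similarity(vector, centroid(cluster)) >= threshold:
            return True
    return False

# Toy vectors standing in for the tensors derived from source code (made up).
exfiltration_cluster = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
dropper_cluster = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]]
clusters = [exfiltration_cluster, dropper_cluster]

suspicious = [0.85, 0.15, 0.05]  # lies close to the exfiltration cluster
benign = [0.1, 0.9, 0.1]         # far from both clusters

print(classify(suspicious, clusters))
print(classify(benign, clusters))
```

In this toy setup the first vector is flagged and the second is not. The threshold controls the trade-off: lowering it catches more attacks but also flags more legitimate packages.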

The findings

The F1-score is a measure of a model’s performance, with a score of 1 indicating a perfect model that always categorizes correctly. In our study, we achieved an F1-score of 0.85. Our model correctly identified malicious files 79% of the time and non-malicious files 93% of the time. The attack type that our model detected most reliably was data exfiltration, which it identified correctly 86% of the time.
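For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision (how many flagged files are really malicious) and recall (how many malicious files get flagged). The sketch below computes it from raw counts. The counts are hypothetical and are not the thesis's confusion matrix: they assume 100 malicious and 100 non-malicious files, with the detection rates quoted above applied to each group.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts (assumption: 100 malicious and 100 non-malicious files).
tp, fn = 79, 21  # 79% of malicious files flagged, 21% missed
fp = 7           # 7% of non-malicious files wrongly flagged

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under these assumed counts the F1-score comes out at roughly 0.85. With a differently balanced dataset the same detection rates would give a different F1-score, because precision depends on how many non-malicious files there are to misclassify.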

Confusion matrix for the adjacency F-1 approach

The strong result for data exfiltration is not surprising, as exfiltration is the most common type of attack. The two attack types that our model had the most difficulty detecting were financial gain and dropper, with accuracy rates of 50% and 62.5%, respectively. However, these results should be interpreted with caution, as only two samples in our dataset were labeled as financial gain.

Classification results per file broken down by ecosystem, origin, attack type, and obfuscation

Our results also showed that heavier forms of code obfuscation are actually easier to detect. This may seem counterintuitive, but our findings suggest that heavily obfuscated code differs significantly from ordinary code, which makes it stand out. In contrast, non-malicious code is rarely, if ever, obfuscated.

Based on these results, it is possible to develop a system that can scan updates to common packages and detect open source supply chain attacks.

The adventure

Conducting our thesis at Debricked has been an incredible experience. We have had the opportunity to work on a fascinating and important problem, and have received great support throughout the process. Additionally, we have been given the freedom to approach the problem in the way we believe is best, which has allowed us to make significant progress.

Overall, the trust and respect that Debricked has shown us have made this a truly memorable experience.

The journey’s end

We are thrilled to have reached the end of our studies and are proud of the results we have achieved. We hope that our work will be useful for future research at Debricked and beyond, and that it will add value to the company as a whole. It’s been an exciting journey, and we’re happy to have made it to the finish line!

David and Oliver five minutes before presenting their findings to the whole Debricked crew

A side note from the Debricked Editorial Team

As much as we’d love to say otherwise, this capability is not available in our tool…yet. But don’t worry; we’ll let you know if that changes. In the meantime, you can sign up for our newsletter to stay in the loop and never miss a major update. For now, we say thank you to Oliver and David and wish them the best of luck in their future adventures!
