Cloning all GitHub data: How, why and everything in between

by Daniel Wisenhoff
2021-01-19
6 min

Debricked has achieved a not-so-small feat: we are now able to actively keep and maintain a clone of all data on GitHub! For what reason, you may ask? To understand all the whys and hows, we have interviewed our Head of Data Science, Emil Wåréus.

Before we start with the questions, who are we talking to?

My name is Emil and I’m the Head of Data Science at Debricked. My team of five data engineers and I are the masters behind everything related to data. Also, I was the second employee at Debricked!


Let’s start with the million dollar question: why would anyone want a copy of all GitHub data? 

The short answer is – to have a better and faster representation of the data that we need to service our customers. You see, we want to do big computations on all open source. Yes! You heard that right. On all open source.

If we only wanted to monitor a couple of thousand open source projects, we could do it through the standard API calls that GitHub provides.

But our products and solutions are not meant to give customers partial coverage; they are supposed to be extensive. Therefore, we decided to index all 28M projects on GitHub, and that’s not the end of it. Soon we will be adding other large hosting platforms, such as GitLab, and more.

But doing this, cloning all of GitHub that is, poses quite an interesting challenge because of the many different data structures and relational dependencies in the data. Some are loosely coupled and some are tightly coupled.

As a result, huge challenges arise regarding the time complexity of calculations on such a large dataset. For these reasons, we decided to go on a journey and see if we could create an up-to-date hourly mirror of GitHub locally.

Why did you choose to store the data locally?

Haha! Yes, that may sound a bit unintuitive in the era of free cloud credits. But the truth is that it saves us a lot of money. Because the RAM and CPU load is high at all times, with the system constantly churning through data, hosting it ourselves is more cost-efficient.

The cloud is great when you have variable loads, such as ML training. But when you are running a constant workload like ours, scraping data with a lot of database interactions, it is a lot cheaper to host it on-prem.

For all the hardware geeks out there, what does a setup like this look like?

There are a couple of things at play. First, we have a relational database that is not too large, meaning not in the terabyte range. Second, in terms of cloning all the repositories, we are looking at about 20 terabytes of data. Third, we have a graph database in which we model how open source projects relate to one another.

For this we use Neo4j. Essentially, we are doing graph computations to answer questions such as “what are the root dependencies?” and “which open source projects impact which other open source projects?”
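To make the idea concrete, here is a minimal sketch of what such a graph query could look like with the official Neo4j Python driver. The schema it assumes (a :Package label and a DEPENDS_ON relationship) and the connection details are illustrative placeholders, not Debricked’s actual data model.

```python
# Minimal sketch of a dependency-graph query with the official Neo4j Python driver.
# The (:Package) label, the [:DEPENDS_ON] relationship and the connection details
# are hypothetical stand-ins, not Debricked's actual schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

ROOT_DEPENDENCIES = """
MATCH (p:Package {name: $name})-[:DEPENDS_ON*1..]->(dep:Package)
WHERE NOT (dep)-[:DEPENDS_ON]->()
RETURN DISTINCT dep.name AS root_dependency
"""

def root_dependencies(package_name):
    # Follow the transitive dependency chain and keep only the "leaves":
    # packages that do not depend on anything else themselves.
    with driver.session() as session:
        result = session.run(ROOT_DEPENDENCIES, name=package_name)
        return [record["root_dependency"] for record in result]

print(root_dependencies("example-package"))
```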

The final component of the puzzle is to do the actual scraping and analytics. Here we have a cluster with a couple of hundred pods working in parallel to both gather and analyse the incoming data. 

What people don’t realise is that there is a lot of complexity to this. Just cloning a repository is not enough. We need the history of it and we need to create multiple snapshots of the data for different points in time. This multiplies the problem by many factors.
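As a rough illustration of what that can involve, the following sketch mirror-clones a repository and exports snapshots of its tree at a few points in time using plain git commands driven from Python. The URL, dates and helper names are just examples, not Debricked’s actual pipeline.

```python
# Sketch: mirror-clone a repository and export snapshots of it at given points in time.
# The URL, dates and helper names are illustrative; this is not Debricked's pipeline.
import subprocess

def mirror_clone(repo_url, target_dir):
    # --mirror gives a bare clone with full history, all branches and all tags.
    subprocess.run(["git", "clone", "--mirror", repo_url, target_dir], check=True)

def snapshot(repo_dir, cutoff_date, out_tar):
    # Find the last commit on the default branch before the cutoff date...
    sha = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "-1", f"--before={cutoff_date}", "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    # ...and export the repository tree at that commit as a tar archive.
    with open(out_tar, "wb") as out:
        subprocess.run(["git", "-C", repo_dir, "archive", sha], check=True, stdout=out)

mirror_clone("https://github.com/octocat/Hello-World.git", "hello-world.git")
for date in ("2023-01-01", "2023-07-01", "2024-01-01"):
    snapshot("hello-world.git", date, f"hello-world-{date}.tar")
```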

And before I forget, all of that storage is of course on PCI-Express NVMe SSDs sending gigabytes of data for analysis every second. We must also thank AMD for creating such nice server CPUs with many cores! 🙂

Emil, smiling when thinking about his beautiful server.

20 terabytes sounds like a lot of data – what does it entail exactly?

There are about 10 000 – 30 000 new pull requests each hour on GitHub. This is 5–10 times more than about five years ago, and that’s only the open source contributions on GitHub! We are monitoring over 40M repositories, 100M issues, 80M pull requests and 12M active users.

Thanks to our technology, both hardware and software, we are able to keep our model up to date within the hour. So, whenever someone makes a pull request, comments, stars something or follows someone new, we can collect that data with minimal time lag.
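One way to picture that kind of low-lag collection is a polling loop against GitHub’s public Events API, sketched below. The handle_event function is a placeholder, and Debricked’s real ingestion is of course far more elaborate.

```python
# Sketch: poll GitHub's public Events API to pick up new activity with low lag.
# handle_event() is a placeholder; Debricked's real ingestion is far more elaborate.
import time
import requests

EVENTS_URL = "https://api.github.com/events"

def handle_event(event):
    # Placeholder: route PushEvent, PullRequestEvent, IssuesEvent, WatchEvent, ...
    print(event["type"], event["repo"]["name"])

def poll_events(token=None):
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    etag = None
    while True:
        if etag:
            headers["If-None-Match"] = etag  # a 304 response means nothing new since last poll
        resp = requests.get(EVENTS_URL, headers=headers, timeout=30)
        if resp.status_code == 200:
            etag = resp.headers.get("ETag")
            for event in resp.json():
                handle_event(event)
        # GitHub tells clients how often they may poll this endpoint.
        time.sleep(int(resp.headers.get("X-Poll-Interval", 60)))

poll_events()
```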

How did you go about setting this up? Did you wake up one day and think “I need to clone GitHub”? What triggered it?

Haha, no. Of course not! We did GitHub scraping early on in the history of Debricked. It all started when we realised that a lot of vulnerabilities are discussed in issues and we wanted to link and classify those. Through research, we found that this data was highly relevant.

About 2% of all issues are related to security, but only 0.2% are correctly labeled with a security tag. It is often not until much later that those discussions are publicly disclosed as vulnerabilities through NVD and other sources of curated vulnerability information.

To address this, we developed our own machine learning models to classify issues as security related or not. This way, we can provide customers with high security demands with information about issues that may pose a risk but have not yet turned into officially disclosed vulnerabilities.
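Debricked’s models themselves are not public, but a minimal baseline for the same task could look like the scikit-learn sketch below: TF-IDF features plus a logistic regression with class weighting to cope with the fact that only a small fraction of issues are security related. The tiny training set is made up for illustration.

```python
# Baseline sketch of a security-issue classifier: TF-IDF features + logistic regression.
# The training data is made up; Debricked's production models are more sophisticated.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

issues = [
    "Buffer overflow when parsing crafted input",
    "Add dark mode to the settings page",
    "XSS possible via unescaped user comments",
    "Typo in README installation section",
]
labels = [1, 0, 1, 0]  # 1 = security related, 0 = not

classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    # class_weight="balanced" matters in practice: only ~2% of issues are security related.
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
classifier.fit(issues, labels)

# Probability that a new issue is security related.
print(classifier.predict_proba(["Remote code execution in the file upload handler"])[:, 1])
```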

I must brag and say that we have achieved state-of-the-art results in the security text/issue classification space and have surpassed previous research in the area.

How many vulnerabilities have you discovered this way so far?

About 200 000. In contrast, probably the most popular database for vulnerability information, NVD (the National Vulnerability Database), holds about 130 000 entries, many of which are not directly related to open source. We are also investigating how we could properly disclose this data to the public.

How did this later translate into gathering all of the GitHub data?

We realised that this data (issues and vulnerabilities) is not independent of pull requests and other activity, which could provide additional signal to our models. For this reason, we started to scrape commits, releases, code diffs and so on, and perform lots of interesting calculations on the data.

We discovered that with stronger links between vulnerabilities and the original source (the open source projects) we could increase the precision of our data dramatically. This turned out to require quite a technical machine learning solution.

We match the software bill of materials to vulnerabilities and leverage all of our different data points, which gives us superior data quantity and quality when it comes to accurately finding vulnerabilities in our customers’ proprietary code.
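As a very simplified illustration of that matching step, the sketch below checks SBOM components against hypothetical “affected version” ranges using the Python packaging library. The vulnerability records are example data, and the real pipeline combines many more signals than a plain version-range check.

```python
# Sketch: check SBOM components against "affected version" ranges of known vulnerabilities.
# The vulnerability records here are example data, not a real feed, and the real matching
# combines many more signals (commits, issues, advisories) than a version comparison.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

sbom = [
    {"name": "lodash", "version": "4.17.15"},
    {"name": "express", "version": "4.18.2"},
]

vulnerabilities = [
    {"id": "EXAMPLE-0001", "package": "lodash", "affected": "<4.17.19"},
    {"id": "EXAMPLE-0002", "package": "express", "affected": "<4.16.0"},
]

def match(components, vulns):
    for component in components:
        for vuln in vulns:
            if vuln["package"] != component["name"]:
                continue
            if Version(component["version"]) in SpecifierSet(vuln["affected"]):
                yield component["name"], component["version"], vuln["id"]

for name, version, vuln_id in match(sbom, vulnerabilities):
    print(f"{name}@{version} is affected by {vuln_id}")
```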

But the data is not only used for security, right?

Correct! As our customers will know, we also provide health data on open source projects. For example, we look at the trend of super users’ commits to a certain repository to build scores that give us intel about the project’s well-being.

This is a good example of how complex the data can be. You want to monitor the contributors who know their way around a project and see that they keep contributing. This, in combination with all the other data we have on a project, can give us an indication of its health.
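A deliberately naive sketch of that kind of signal is shown below: it looks at whether the most active contributors are committing less than they used to. The data format and scoring are made-up stand-ins for Debricked’s actual health metrics.

```python
# Deliberately naive sketch of a "core contributor" health signal: do the most active
# contributors still commit as much as they used to? The data format and scoring are
# made-up stand-ins for Debricked's actual health metrics.
from collections import Counter

def core_contributor_trend(commits, cutoff, top_n=5):
    """commits: list of (author, 'YYYY-MM') tuples; cutoff splits earlier vs. recent activity."""
    commit_counts = Counter(author for author, _ in commits)
    core = {author for author, _ in commit_counts.most_common(top_n)}

    recent = sum(1 for author, month in commits if author in core and month >= cutoff)
    earlier = sum(1 for author, month in commits if author in core and month < cutoff)
    # A shrinking share of activity from the people who know the project best
    # is treated here as a warning sign (ratio < 1).
    return recent / max(earlier, 1)

commits = [("alice", "2023-02"), ("alice", "2023-03"), ("alice", "2023-09"),
           ("bob", "2023-01"), ("bob", "2023-02"), ("carol", "2023-08")]
print(core_contributor_trend(commits, cutoff="2023-07"))
```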

So, in essence, data has become our core. Gathering, understanding and enriching data from GitHub and other sources has enabled Debricked to create services that were unheard of before.

This all sounds great! But what does this mean for current Debricked customers and potential future users?

The most impactful part is that you get more accurate information on vulnerabilities. We correctly map vulnerabilities to GitHub repositories, commits, issues and so on, which in turn increases our confidence that a vulnerability actually affects an open source project and gives us insight into how it is affected.

This is a general problem in our industry: the precision of the vulnerabilities that you are presented with in your Software Composition Analysis tools, vulnerability scanners and so on.

It is a difficult challenge to match vulnerability descriptions and vulnerability information in, for example, email threads and other highly unstructured data sources to the actual software they affect. And then to find out whether you are using that open source component, determine how you are using it, and whether you are in fact affected by the vulnerability.

Good tools have 85–95% precision (the share of reported vulnerabilities that are true positives). While we have seen competitors and free tools go as low as 60%, Debricked currently has a 90–98% precision rate on the languages we fully support. We manage this while being one tenth the size of some of our competitors. This would not be possible without the data combined with our algorithms and the raw talent in the team.

A little bird whispered that this might open doors for some interesting new functionalities?

Yes! I can’t tell you all the secrets just yet… But by having all this data, we are able to determine whether a vulnerability affects a certain class or function. A vulnerability is only relevant if your software actually uses the class or function containing it.

This can remove second-order false positives and increase the accuracy of your internal triaging of the vulnerability. You will know exactly where in your code the vulnerability is, and whether that code is called. This greatly reduces the number of vulnerabilities you actually have to check, which will save a lot of time. So, stay tuned!
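To give a flavour of what function-level reachability means in practice, here is a drastically simplified sketch for Python code using the standard-library ast module: it builds a call graph of a single module and asks whether a known vulnerable function can be reached from an entry point. Real analysis has to resolve imports, methods and dynamic dispatch across languages, which this does not attempt.

```python
# Drastically simplified sketch of function-level reachability: build a call graph from
# Python source with the standard-library ast module and ask whether a known vulnerable
# function can be reached from an entry point. Real tooling resolves imports, methods
# and dynamic dispatch across many languages; this only sees plain calls in one module.
import ast

SOURCE = """
def parse_upload(data):
    return unsafe_deserialize(data)  # pretend this is the vulnerable function

def unsafe_deserialize(data):
    ...

def main():
    parse_upload(b"payload")
"""

def build_call_graph(source):
    tree = ast.parse(source)
    graph = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            graph[node.name] = {
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            }
    return graph

def reachable(graph, entry, target):
    seen, stack = set(), [entry]
    while stack:
        fn = stack.pop()
        if fn == target:
            return True
        if fn not in seen:
            seen.add(fn)
            stack.extend(graph.get(fn, ()))
    return False

graph = build_call_graph(SOURCE)
print(reachable(graph, "main", "unsafe_deserialize"))  # True: the vulnerable function is called
```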

Thanks a lot for your time Emil! Is there anything else you would add?

Yes. Keep an eye on our blog for some cool insights later on, both about this and other fun stuff! 🙂