
Finding over 6,000 credentials in Twitch's source code - How our source code is a vulnerability

In this video, we break down the recent source code leak at Twitch and discuss what makes our source code a vulnerability. We used GitGuardian's secret detection engine to uncover any hidden credentials inside the Twitch source code. Spoiler alert: we found a lot... over 6,000. We walk through some possible attack paths malicious actors could take and examine why this is such a systemic problem for so many organizations today.

Intro - 0:00
Source code leaks - 0:31
Twitch Breach - 1:05
What happened - 2:16
How did the source code get leaked - 4:19
Security concerns with source code - 5:42
Secrets in Twitch source code - 7:07
How attackers exploit secrets - 9:10
How do we protect source code - 14:50
Wrap up - 17:54

Video Transcript

Introduction to the Twitch source code leak

Hello everyone!

Welcome to another video. In this video, I'm going to tackle an interesting topic. Well, at least I think it's interesting. That's the concept of our source code as a vulnerability in and of itself. I'm not talking about vulnerabilities in our applications because of insecure coding. For instance, I'm not going to talk about cross-site scripting, injection, dependency vulnerabilities, or cryptographic vulnerabilities. I want to talk about our source code in and of itself, and how it can be a vulnerability. If you've been paying attention to the cybersecurity news, you'll have noticed that over the last couple of years, and in particular the last year, adversaries have really been targeting organizations' source code. We can see this in the Codecov supply chain attack, where adversaries were specifically trying to get into organizations' private code repositories. We've seen multiple leaks and breaches, such as with Microsoft, EA Games... But there was one incident in particular that I want to focus on for this video. And that is the Twitch breach that happened in October 2021.

Understanding the Twitch Breach

I've already written a lot of articles about this particular breach. I've written some for Dark Reading and other publications and personal blogs. But I want to tackle it in a bit of a broader sense in this video.
What makes this breach so interesting is that we were really able to take a dive into an organization's source code. And not just the source code for a particular application, or maybe a backend service. No, the entire code base for that entire organization, including the subsidiaries, the backend services, and even the secret projects that no one was meant to know about... And we're going to look at this from the point of view of an adversary! We're going to put on our metaphorical "black hats" and run through this the way an attacker would, so that we can understand what it is about our source code that adversaries are really targeting, and what we can do to protect it. And not just protect our source code, but what can we do to make sure that we're not leaving any gifts around for the attackers that may be trying to access it.

What happened with the Twitch Breach?

So let's take a look at exactly what happened. On October 6th, an anonymous 4chan user posted a link to 125 gigabytes of source code that they claimed was from Twitch. Twitch later confirmed this. In total there were around 6,000 different repositories. And uncompressed, it was about 3 million documents and 250 gigabytes of data! This was a huge amount of information, and when you have a leak of this size, well, there's going to be a lot of interesting stuff to go through. But you also have the constraint that there is so much data it's hard to know what to focus on.

In addition to the repositories, there was also some data about streamers' income and what they were earning. And this was largely the narrative that all the major news media picked up on: that the streamers' income was leaked. However, I've spent literally hours and hours going through this source code. And I can tell you that from the Twitch security team's point of view, the information about what their streamers earn is the absolute least of their concerns. There is a lot of sensitive information that was inside this source code. Before I go too much further, I just want to say that I like Twitch as a company, and this video is by no means designed to try and name and shame them in any way. In fact, there's a lot of evidence that they were using security tools such as gosec in their source code. So they were at least aware and putting some complementary measures in place to prevent sensitive information from being leaked. And the reality is that most organizations of this size, if they had their source code leaked, would probably have a lot of sensitive data exposed: similar to, if not more than, what we're going to go through with Twitch. That being said, there is still a lot of interesting information that was exposed in this breach.

How did the Twitch source code get leaked?

So the first part of the story is: How did attackers get access to this information? The only information we have from Twitch is that the incident was a result of a server configuration change that allowed improper access by an unauthorized third party. What we can probably guess is that a server that contained the git repositories, or a backup of those git repositories, was inadvertently made public. And someone who was scanning Twitch, or perhaps an internal employee who knew of this, or even a previous employee, then downloaded that information. But we can't really be too sure. However, server configuration errors happen all the time. There are endless stories about Amazon S3 buckets accidentally being made public, and adversaries finding them. So this isn't actually that uncommon. And when you look at source code, by its nature it's a very leaky asset. Once it enters our git repositories, it's cloned onto multiple developers' machines and remote workstations. These can be both professional and personal. It's backed up into different areas, it's probably shared in internal wikis, in documentation, and even on messaging systems... So there's no real way to fully protect our source code.
This means if there's anything sensitive inside our source code, adversaries know that this is probably going to be an easy target.

Security concern with source code leaks

So what are the security implications of your source code leaking? Well, it's not really what you might think. It's not the fact that your proprietary information has been leaked, and it's not the fact that adversaries may now be able to find vulnerabilities in how your application works, although that could potentially be true. The biggest threats that we face immediately in this situation are our secrets: our API keys, our credentials, our security certificates, our private keys... All of these can be found inside source code. If you look at what an organization most wants to protect, then you would think that secrets would be at the absolute top. And this is true: companies go to a lot of effort to manage their secrets securely. But they don't go to so much effort to secure their source code. And this is because it would interrupt the workflow of their developers and make it very hard to make progress.
The idea is that these secrets or customer information would never enter your source code. But given the way that version control systems work and how developers operate today, with the increased amount of sensitive information they handle, it's very hard to guarantee that they won't leak. Especially when you take a company the size of Twitch, there are going to be a lot of secrets inside that source code.

Secrets in Twitch's source code

So what about Twitch? Did we find secrets inside the Twitch git repositories?
Yes, we did. In fact, we found over 6,000 of these keys. We scanned the entire source code with GitGuardian's secret detection engine. Usually, we would perform a validity check on these credentials. For instance, making sure that the Amazon keys or the Twilio keys are not only real but also active right now, and pose a real threat. We opted not to do this. The main reason was that we understood Twitch would be performing a bunch of forensics to make sure that malicious third parties didn't access or move laterally into different systems. And we didn't want to confuse this by checking whether their API keys were actually valid.
And we also realized that because this was public, they would be rotating a lot of these keys anyway. So it wouldn't give us a clear indication of what was valid at the time of the leak.
Instead, what we decided to do is look at other indicators, such as when the keys were leaked and the time of the commits, to give us an indication of whether or not we think they were valid at the time of the leak. In total, as I said, we found about 6,600 different secrets inside the Twitch code repositories. This included 194 AWS keys, 69 Twilio keys, 68 Google API keys, hundreds of database connection strings, 14 GitHub OAuth tokens, and even 4 Stripe keys, and this is just to name a few.
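To make the idea of scanning for secrets a little more concrete, here is a minimal sketch in Python of pattern-based detection. It is not how GitGuardian's engine works; it only illustrates the general technique. The two patterns are assumptions based on widely known key prefixes (`AKIA` for AWS access key IDs, `sk_live_` for Stripe live secret keys), and the function name `find_secrets` is hypothetical.

```python
import re

# Illustrative rules only. Real detectors use many more rules, plus entropy
# and context checks, to keep false positives down.
PATTERNS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\b(AKIA[0-9A-Z]{16})\b"),
    # Stripe live secret keys: approximate shape, assumed for illustration.
    "stripe_live_secret_key": re.compile(r"\b(sk_live_[0-9a-zA-Z]{24,})\b"),
}


def find_secrets(text):
    """Return (line_number, rule_name, matched_value) for every match in text."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((line_number, rule_name, match.group(1)))
    return findings
```

Running `find_secrets` over every file and every historical commit is what turns a leaked repository into a list of candidate credentials. Whether each candidate is still valid is a separate question, which is why the commit dates matter.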

How can attackers exploit credentials in source code?

Now, as crazy as this is, what's perhaps the most troubling is that when we look at other organizations of the same size, this is pretty in line with what we would typically find inside a private source code repository. But how would adversaries actually use these keys in an attack?
So what I want to do now is put on a metaphorical black hat, and take a look at what an adversary would do. To simulate a real attack, we're going to pretend that this source code wasn't made public and that we, as a malicious actor, had just been given access to it. What are the first things that we're going to do? Well, we would definitely start scanning these source code repositories for secrets. But when we have results like 6,600 different keys in hundreds of different repositories that all perform different tasks, how do we go about actually doing something malicious? How do we move into critical systems, elevate privileges, and launch our attack?

Well, to start with we'll definitely do two things.
#1: we're going to want to find the highest value keys, and quickly.
And #2, we're going to want to try and cross-reference those keys to find the services that they refer to. So how do we identify these high-value keys?

Well, we want to break up our keys into different areas. Like any attack, we're racing against the clock. We want to make sure that we can launch our attack before Twitch is even aware that we're inside their source code. So we want to identify the keys that are going to allow us to move laterally into different systems. Typically these might be things like cloud service keys, payment system keys, something that's going to give me an immediate quick win.
Other keys that we're going to look for are, for example, our data keys. These are keys that give us access to a data store like a database or an S3 bucket, or encryption keys that we can use to decrypt sensitive data stored in the databases that we gain access to.
And the third area of keys that we're really going to be looking for, but we'll put on the back burner, are what I'll call our secondary attack keys. These are keys that we can use to launch different attacks but that are going to take some time, so we want to put them aside for the moment and get our quick wins first. These are keys like the keys to your messaging systems. For instance, if you have a Slack key, you might be able to post messages and launch a phishing attack that comes from an internal system, so it looks more legit, and use this in a different style of attack. Or maybe you have reCAPTCHA keys, where you can bypass some security implementations to launch an attack that way. These keys are definitely interesting, but they're not what we're going to be looking for immediately in the first few hours once we have this information.

Once we have these keys, we're going to want to separate them into ones that are valid and belong to production services, and ones that are either invalid test credentials or belong to a sandbox or pre-prod environment. So this means that I'm going to want to apply a filter to remove any of these keys that contain certain keywords. For instance, the keywords we found alongside the PayPal Braintree keys let us know what environment they were for. You see, some had a sandbox environment, while others conveniently let us know that this was for a production environment. So we can separate these, and focus just on the production ones. We also want to look at the commit date of these keys and pretty much remove anything that's a few years old, or even a few months old if we have enough information.

For instance, these Stripe keys look very tasty for us to use as a hacker. Unfortunately, the commit goes all the way back to 2015. This means it's highly, highly unlikely that these are still valid. So we're not going to waste any attention or any movements on trying to exploit these keys.
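The two filters described above (environment keywords and commit age) are just as useful for a security team prioritizing its own scan results. Here is a minimal Python sketch of that triage. The `Finding` structure, the keyword list, and the one-year cutoff are all assumptions made for illustration; none of them come from the Twitch data or from any particular tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical keyword list. Plain substring matching is naive ("test" also
# matches "latest"), so a real filter would need to be more careful.
NON_PRODUCTION_KEYWORDS = ("sandbox", "test", "staging", "preprod")


@dataclass
class Finding:
    repository: str
    detector: str
    context: str  # surrounding text, such as the matching line or file path
    commit_date: date


def is_probably_live(finding, today, max_age_days=365):
    """Keep findings with no non-production keyword and a recent commit date."""
    context = finding.context.lower()
    if any(keyword in context for keyword in NON_PRODUCTION_KEYWORDS):
        return False
    return today - finding.commit_date <= timedelta(days=max_age_days)


def triage(findings, today, max_age_days=365):
    """Return the findings worth looking at first, most recent commit first."""
    kept = [f for f in findings if is_probably_live(f, today, max_age_days)]
    return sorted(kept, key=lambda f: f.commit_date, reverse=True)
```

With a cutoff like this, a key committed back in 2015 drops out of the list immediately, which matches the reasoning applied to the old Stripe keys.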

Okay, so now we've filtered and separated out these high-value keys, but we still have hundreds of results. This means that we need to filter these down further, because we don't want to start launching attacks on low-value systems. That's just going to raise alerts that we're exploiting them. So the next thing we're going to do is separate the keys by where we found them. We can take a look at the repositories that we found them in, and identify the ones that look like they're going to be the most valuable. For example, if we found 14 AWS keys inside a repository that was named cloud services... Hmm... That might be an interesting place to start launching our attacks.

Of course, AWS keys are one specific attack path that we may want to take. But there are literally hundreds of different avenues that we can go down after we've accessed the source code and found the information that we want. The purpose of this video isn't really to explore each one, but just to say that if you look inside these large companies' source code, you're going to find a lot of sensitive information. And if you're an attacker and you're trying to access this information, are you going to go after the highly protected secrets management systems that have hundreds of different alarms set up around them, with multiple layers of protection, that are using top-of-the-line encryption? Or are you going to look for the source code that's sprawled across multiple locations, that is backed up into different areas, that needs just a simple configuration mishap to be made public, and which probably contains a lot of those valuable secrets anyway? Well, I know which one I would take.

How do we protect our source code?

But this takes us to the larger problem: how do we protect our source code from this?
And really the solution is that we need to change our way of thinking. It's not about protecting our source code and making it harder for developers to be able to access it. It's really making sure that our source code doesn't contain this highly sensitive information like secrets. There are some ways that we can do this.

#1: Use short-lived credentials where possible

This means that we have a clear path to be able to rotate our keys, and documentation on how to do this. There are products like HashiCorp Vault, which we've talked about, that introduced the concept of dynamic secrets, which are secrets that are generated on demand and have a limited lifetime. And this is great because it means old keys are essentially worthless. However, behind every short-lived credential there is a long-lived credential, so this doesn't completely solve the issue. And not all keys can be dynamically created.
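As a rough illustration of why a short-lived credential loses its value once it has leaked, here is a minimal Python sketch. The `LeasedCredential` class is hypothetical and is not HashiCorp Vault's API; it only models a value that comes with an issue time and a time-to-live.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class LeasedCredential:
    """Hypothetical model of a credential that expires after a fixed lifetime."""
    value: str
    issued_at: datetime
    ttl: timedelta

    def is_expired(self, now):
        # Once the lease has run out, the value is useless to whoever holds it.
        return now >= self.issued_at + self.ttl
```

A credential issued with a one-hour lifetime that ends up in a commit is already expired by the time an attacker finds it days or months later. The long-lived credential used to request it is the thing that still has to be kept out of the repository.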

#2: Make sure the secrets don't end up in repositories in the first place

This is a little bit easier said than done, but there are tools out there that developers can use to help with this. For instance, you can easily create a pre-commit or pre-push git hook that checks your commits before they enter your git repository, and makes sure that they're free from secrets. This is the best place to detect a secret, because if it enters your repository, even if you catch it immediately, and even if you delete the commit, you still need to revoke that key, because that code has been cloned into multiple different locations. So using pre-commit or pre-push git hooks is the best way to catch these secrets. However, this can't be globally enforced and isn't a complete fail-safe solution either.
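Here is a minimal sketch of what such a pre-commit check could look like in Python. It only looks for one pattern (the well-known `AKIA` prefix of AWS access key IDs) in the lines being added, so treat it as an illustration of the mechanism rather than a real detector. The function names are hypothetical; the only git command it relies on is `git diff --cached`, which shows the staged changes.

```python
#!/usr/bin/env python3
import re
import subprocess
import sys

# Single illustrative rule: AWS access key IDs ("AKIA" + 16 uppercase
# letters or digits). A real hook would use a proper detection engine.
SECRET_PATTERN = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def added_lines(diff_text):
    """Yield the lines a unified diff adds, skipping the '+++' file headers."""
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            yield line[1:]


def find_suspect_lines(diff_text):
    """Return the added lines that match the secret pattern."""
    return [line for line in added_lines(diff_text) if SECRET_PATTERN.search(line)]


def main():
    """Scan the staged changes; a non-zero return value should block the commit."""
    staged = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    suspects = find_suspect_lines(staged)
    for line in suspects:
        print(f"Possible secret in staged change: {line.strip()}", file=sys.stderr)
    return 1 if suspects else 0
```

To use it as a hook, you would save it as `.git/hooks/pre-commit`, make it executable, and end the file with `sys.exit(main())`, since git aborts the commit when the hook exits with a non-zero status. Any developer can skip it with `git commit --no-verify`, which is exactly why hooks can't be globally enforced.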

#3: Secret scanning in our repositories

We scanned this with GitGuardian to find these keys, and you can do the same on your repositories. GitGuardian can be installed in a couple of clicks, it's free for teams of fewer than 25 developers, and it will scan your repositories in real time, including all the history, to make sure that secrets aren't present. If we do these things, our source code largely stops being a vulnerability in and of itself.
Of course, there may still be some business logic flaws that an adversary can exploit if they do get the source code. But it buys us time, and it reduces their ability to move quickly from our source code into our other systems, which is essential.

And the final point is that we should always operate under the assumption that our source code is going to be leaked or made public. If we do this, and we can ask ourselves "Could we open source our software today and still sleep well?" and answer yes, then we've achieved the correct level of security inside our source code.

Well, I hope you found this video interesting. Let me know in the comments. If you have any questions, you can always reach out to me on Twitter @advocatemack. And if you enjoyed this video, please give it a like because it helps the Google Overlords know which kind of content to recommend in the future.

Thanks for watching and see you next time!
