The State of Secrets Sprawl on GitHub

How leaky can it Git

GitHub is more than ever “The Place to Be” for developers when it comes to innovating, collaborating and networking.

This amazing “octoverse” gathers more than 50 million developers working on their personal and/or professional projects. So when 60 million repositories are created in a year and nearly 2 billion contributions* are added, mistakes can happen, such as leaking secrets, Intellectual Property or PII.

Some companies may think: “We don’t really care about public GitHub. We are not open sourcing our code, and everything is stored in our private repositories.” But what about these companies’ developers? They most likely have open source repositories of their own, and they can leak secrets there.

*The State of the Octoverse 2020

Secrets Sprawl

Findings

Where leaks come from

Why

What type of secrets do we find

File extensions that cause data breaches

Pro bono alerting

What happens after a leak

Recommendations

To conclude

Secrets Sprawl

Let’s now focus on secrets. You would think that storing secrets in internal Version Control Systems is a very bad practice, yet it is far more frequent than you might expect. Why is that?

API keys, database connection strings, private keys, certificates, usernames and passwords… As organizations move to cloud architectures, SaaS platforms and microservices, developers handle more sensitive information than ever before.

To add to that, companies are pushing for shorter release cycles, developers have many technologies to master, and the complexity of enforcing good security practices increases with the size of the organization, the number of repositories, the number of developer teams and their geographical spread.

As a result, secrets are spreading across organizations, particularly within source code. This pain is so widespread that it even has a name: “secrets sprawl”. Let us introduce you to the concept and show how it can lead to public exposure of some of your most sensitive assets.

At GitGuardian, we’ve been monitoring every single commit pushed to public GitHub since July 2017. Three and a half years later, we’ve uncovered millions of secrets and sent nearly 1 million pro bono alerts to developers in 2020 alone.

Secrets

A secret can be any sensitive data that we want to keep private. When discussing secrets in the context of software development, secrets generally refer to digital authentication credentials that grant access to services, systems and data. These are most commonly API keys, usernames and passwords, or security certificates.
Secrets are what tie together different building blocks of a single application by creating a secure connection between each component. Secrets grant access to the most sensitive systems.

Learn more about secrets on our blog

Secrets Sprawl

Keeping secrets encrypted and tightly wrapped makes it harder for developers to both access and distribute them. This can lead developers to choose the path of least resistance when handling them, which may include hardcoding them into source code, distributing them through email or messaging systems like Slack, saving them directly into config files and storing them inside internal wikis. Once secrets start to enter different systems:

• Attackers can move laterally through infrastructure.
• You lose visibility over where secrets end up.

Commit

A commit is an incremental change made to an individual file or a set of files. When making a commit, the difference (or diff) between the current version of the files and the previous version is saved, including data that was removed.
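To make the point about removed data concrete, here is a minimal Python sketch using the standard library’s difflib. It is only an analogy for how a diff records removed lines, not a description of how Git stores its objects; the file name and the key value are made up for the example.

```python
import difflib

# Hypothetical file contents before and after a "cleanup" commit.
# The key below is a made-up placeholder, not a real credential.
before = [
    'API_KEY = "example-not-a-real-key"\n',
    "print('hello')\n",
]
after = [
    "print('hello')\n",
]


def make_diff(old_lines, new_lines, name="config.py"):
    """Return the unified diff between two versions of a file as a list of lines."""
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=f"a/{name}", tofile=f"b/{name}"
        )
    )


diff_lines = make_diff(before, after)

# Lines starting with "-" (other than the "---" header) hold the removed content.
removed = [
    line[1:].rstrip("\n")
    for line in diff_lines
    if line.startswith("-") and not line.startswith("---")
]

print("".join(diff_lines))
print("Still readable in the diff:", removed)
```

The removed line is gone from the new version of the file, but it remains fully readable in the diff.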

So here is a deep dive into what we find…

What are we looking at

[Figure: 2.5 … public commits scanned/day; almost … public commits scanned/year]

And the volume is growing*

[Figure: more repositories created last year; more contributions to open source projects]

*The State of the Octoverse 2020

What do we find

[Figure: more than … secrets detected/day; over … secrets detected in 2020]

A Growing Number…

[Figure: +20 … compared to previous year]

Where do we find the secrets

Secrets present in these repositories can be either personal or corporate. This is where the risk lies for organizations: some of their corporate secrets are exposed publicly through the personal repositories of their current or former developers.

… of the leaks occur on developers’ personal repositories.

… of leaks on GitHub occur within public repositories owned by organizations.

We launched this audit, and several leaked secrets were brought to our attention. What was very interesting and what we didn’t anticipate was that most of the alerts came from the personal code repositories of our developers.

Anne Hardy, CISO

Where leaks come from

Top 10

India

Brazil

United States

Nigeria

France

Russia

Canada

Bangladesh

Indonesia

Why

Usually these leaks are unintentional, not malevolent. They happen because:

• Developers typically have one GitHub account that they use both for personal and professional purposes, sometimes mixing the repositories.

• It is easy to misconfigure Git and push the wrong data.

• It is easy to forget that the entire Git history remains publicly visible even if sensitive data has since been deleted from the current version of the source code.

Human error exists, but the key is to be alerted and be able to take appropriate action when a leak is found.

Anne Hardy, CISO

Human error is nothing you can avoid and prevent, especially if it is not an error but just laziness, or even provoked, implement a risk based approach and simply add many layers to prevent it in your whole lifecycle.

David Dos Neves, Munich Re

What type of secrets do we find

Secrets are digital authentication credentials that grant access to services, systems and data (API keys, usernames and passwords, or security certificates). The volume and diversity of these digital authentication credentials is growing fast as architectures move to the cloud but also rely on more and more components and apps.

Our larger customers, with 2,000 or more employees, deploy an average of 175 apps per customer, while our smaller customers, with 1,999 or fewer employees, deploy an average of 73 apps per customer.*

➜ Okta

All these categories of secrets expose companies to easy and direct attacks. Leaked cloud provider and data storage secrets can lead to data loss and can also allow attackers to delete infrastructure. Leaked identity provider and messaging system secrets allow attackers to act under a legitimate identity.

Top 10

File extensions that cause data breaches on GitHub

As you might expect, with the many programming languages, frameworks and coding practices adopted throughout the world, there is a very long list of file extensions that can contain secrets. Here is a view of the top 10.

• The top 10 file extensions account for 81% of all the results

• The top 3 account for over 56% of the results

File extensions can be grouped into 3 categories

• Programming languages: Python, JavaScript, PHP, TypeScript

• Data serialization files: JSON, XML, YAML, .properties

• Forbidden or sensitive files: .env, .pem
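The grouping above can be sketched as a small lookup. This is only an illustration: the category names come from the list, while the exact extension spellings (.py, .js, .yml and so on) and the helper name `categorize` are assumptions made for the example.

```python
from pathlib import PurePosixPath

# Categories follow the list above; the concrete extension spellings are
# assumptions for this sketch, not the report's own data.
CATEGORIES = {
    "programming language": {".py", ".js", ".php", ".ts"},
    "data serialization": {".json", ".xml", ".yaml", ".yml", ".properties"},
    "forbidden or sensitive": {".env", ".pem"},
}


def categorize(path: str) -> str:
    """Return the category of a file path based on its extension."""
    name = PurePosixPath(path).name
    # pathlib reports no suffix for dotfiles such as ".env",
    # so fall back to the file name itself in that case.
    ext = PurePosixPath(name).suffix.lower() or name.lower()
    for category, extensions in CATEGORIES.items():
        if ext in extensions:
            return category
    return "other"


for example in ["app/settings.py", ".env", "config/app.properties", "README"]:
    print(example, "->", categorize(example))
```

A lookup like this could be used, for instance, to flag files from the third category before they are ever committed.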

Learn more about how secrets leak through file extensions on our blog*.


Examples of Secrets Leaks

Publicly disclosed examples of recent data breaches through leaked credentials.

Uber Data Breach

May 2014

Hackers discovered credentials in a personal public repository on GitHub that granted access to a database containing private information of thousands of Uber drivers.


Starbucks Data Breach

January 2020

JumpCloud API key found in GitHub repository.


Equifax Data Breach

April 2020

Leaked secrets in personal GitHub account granted access to sensitive data for Equifax customers.


UN Data Breach

January 2021

.gitcredentials in a public repository gave hackers access to private repositories with sensitive information.


Pro bono alerting

Such knowledge of leaked credentials comes with a great responsibility. We alert developers in a pro bono manner. Here is an idea of the volume of alerts we sent in 2020.

• 937,539 alerts were sent pro bono

• 558,085 developers were alerted pro bono

These alerts represented:

• 700,000 unique repositories

• 860,000 unique commits

What happens after a leak

GitGuardian’s algorithm detects a leak in 4 seconds on average (Mean Time To Detect), and the alert is sent right away.

The Median Time To React is 25 minutes. The developer is on the front line of the issue, so most of the potential damage can be nullified very quickly if the developer takes immediate action after the alert.

When a secrets detection solution is in place, security teams also receive dual alerts to make sure they can follow up, remediate and report easily on security incidents.

If you leave your keys to your house in the lock and you notice they are gone then you change the locks.

Allan Alford

Gitignore is not a Vault!

REMINDER

A .gitignore file lets you tell Git which files you don’t want to commit. Files containing your secrets should be listed in your .gitignore file, but the secrets themselves should never be written in plain text inside it… Hundreds of developers made this mistake in 2020.

Don’t share too much!

REMINDER

If you search GitHub for “removed AWS key” you will see thousands of results. Removing a hardcoded secret and pushing a new commit only buries the secret in the history, making it harder for you to find but still accessible to attackers.
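As a rough illustration of why a “removed” key stays exposed, the sketch below scans the text of a patch, including its removed lines, for strings shaped like an AWS access key ID. The pattern is the commonly cited shape of such IDs, the sample value is the placeholder used in AWS’s own documentation, and the function name is hypothetical. This is not a description of how GitGuardian’s engine works; real detection involves far more than one regular expression.

```python
import re

# Commonly cited shape of an AWS access key ID: "AKIA" followed by 16
# uppercase letters or digits. Treat this as an assumption of the sketch.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_key_ids_in_patch(patch: str) -> list[str]:
    """Return key-like strings found on added or removed lines of a patch."""
    hits = []
    for line in patch.splitlines():
        # Both added ("+") and removed ("-") lines stay readable in history;
        # the "+++" and "---" file headers are skipped.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            hits.extend(AWS_KEY_ID.findall(line))
    return hits


# A made-up patch whose commit message could well be "removed AWS key".
sample_patch = """\
--- a/settings.py
+++ b/settings.py
@@ -1,2 +1 @@
-AWS_ACCESS_KEY_ID = "AKIAIOSFODNN7EXAMPLE"
 DEBUG = False
"""

print(find_key_ids_in_patch(sample_patch))
```

The key only appears on a removed line, yet anyone reading the history can still recover it, which is why a leaked secret has to be revoked rather than just deleted.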

Recommendations

Companies can’t avoid the risk of secrets exposure even if they put in place centralized secrets management systems. These systems are typically not deployed across the whole perimeter, and they are not coercive: they do not prevent developers from hardcoding credentials that are stored in the vault.

Solutions are available for them to automate secrets detection and put in place the proper remediation, but the market is far from mature on this subject.
Companies need to scan not only public repositories but also private repositories to prevent lateral movements of malicious actors.

Learn more about detection performance

Some best practices can be followed to limit the risk of secrets exposure or the impact of a leaked credential:

• Never store unencrypted secrets in .git repositories

• Don’t share your secrets unencrypted in messaging systems like Slack

• Store secrets safely

• Restrict API access and permissions
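One common way to keep secrets out of source code, shown here as a minimal sketch rather than a complete secrets management setup, is to read them from the process environment at runtime. The helper name `get_secret` and the variable name in the usage note are hypothetical.

```python
import os


def get_secret(name: str) -> str:
    """Read a secret from the environment instead of hardcoding it.

    Raises RuntimeError if the variable is missing or empty. The error
    message names the variable but never includes a secret value.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required secret: {name}")
    return value
```

An application would then call something like `get_secret("PAYMENT_API_KEY")` at startup, with the value injected by the deployment environment or a secrets manager rather than committed to the repository.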

Developer training programs should be put in place, although they do not eradicate the risk of leaked credentials.

Following best practices is not sufficient, and companies need to secure the SDLC with automated secrets detection. When choosing a secrets detection solution, they need to take into account:

• Capacity to monitor developers’ personal repositories

• Secrets detection performance* – Accuracy, precision & recall

• Real-time alerting

• Integration with remediation workflows

• Easy collaboration between Developers, Threat Response and Ops teams

To conclude

There are millions of commits per day on public GitHub. How can organizations look through the noise and focus exclusively on the information that is of direct interest to them? How can they make sure their secrets are not ending up in their developers’ personal repositories on GitHub?
They can’t prevent developers from having personal repositories, so they need automated detection and efficient remediation tools.

In this analysis of the state of secrets sprawl on GitHub, we focused on secrets, although they are not the only sensitive information that can end up publicly exposed: Intellectual Property, personal and medical data are also at risk. But that is for another “State of” report!

About the GitGuardian detection engine, data gathering & methodology

GitGuardian’s secrets detection engine has been running in production since 2017, analyzing billions of commits coming from GitHub. Since day one, we have trained and benchmarked our algorithms against open source code.
This has allowed GitGuardian to build a language-agnostic secrets detection engine that integrates new secrets, or new ways of declaring secrets, really fast while keeping a really low number of false positives. We have developed the largest library of specific detectors, able to detect more than 482 different types of secrets*.

You can find the exhaustive list here

We are also collecting feedback from the alerts we are sending including the pro bono alerts:

• Explicit feedback when a developer or security team marks an alert as a false alert.

• Implicit feedback when a developer takes down a public repository or deletes a public commit a few minutes after we sent an alert.

Our secrets detection engine is:

• High precision: We want to keep a low number of false positives to avoid alert fatigue.

• High recall: We want to keep a low number of secrets missed to keep our customers safe.

• Fast: While speed is less important than recall and precision, our secrets detection engine is designed to be fast and to scan a common git repository history in under a minute.

• Community and customer driven: Our engine is constantly trained and improved by the feedback of the hundreds of thousands of developers using our applications and by the feedback of our customers.
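For readers less familiar with the two metrics named above, here is a short sketch of the standard definitions of precision and recall. The counts are invented purely for illustration and are not GitGuardian figures.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of raised alerts that were real secrets."""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of real secrets that were actually detected."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0


# Invented example: 90 real secrets flagged, 10 false alerts, 30 secrets missed.
print("precision:", precision(90, 10))
print("recall:", recall(90, 30))
```

Fewer false positives raise precision, which limits alert fatigue; fewer missed secrets raise recall.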
