Secrets like API keys, credentials and security certificates are the crown jewels of organizations but can easily sprawl through all your systems. It is important to be able to gain visibility into your systems and code to find these secrets. In this tutorial, we will run through a simple Python script to scan for secrets in local files and directories. The same principles can be applied to detect secrets anywhere in your CI/CD pipeline.

Links:
Blog post: https://blog.gitguardian.com/scan-secrets/
Example code: https://github.com/mackenziejj/directoryscanner
GitGuardian Dashboard: https://dashboard.gitguardian.com

Open-source dependencies:
python-dotenv: https://pypi.org/project/python-dotenv/
py-gitguardian link: https://github.com/GitGuardian/py-gitguardian
Okay, today we're going to be taking a quick look at how we can create a simple Python script to detect secrets inside local files and directories.

Now, the first question you might have is: what is a secret? Well, a secret can be defined as a digital authentication credential. Typically these are things like API tokens, security certificates, database usernames and passwords; really anything that provides access to an external service or system. Now, these secrets are really the keys to the kingdom, and as developers we really need to make sure that these secrets are stored securely with tight access control. But we also need visibility into our internal systems to be able to see if these secrets have sprawled.

The challenge, though, is that secrets detection can be really quite difficult. This is because secrets are generally computer-generated strings, but they can look identical to non-sensitive computer-generated strings like unique identifiers or public keys. Distinguishing between the two means that we have to take into account the context of the code. Luckily, however, we can do this simply thanks to the GitGuardian API; we can use this to do all of the heavy lifting for us.

Now, I know that using an external API for something as sensitive as secrets may be triggering, but I think it's important to know that the GitGuardian API is stateless and doesn't store any data. In addition to that, I would also really recommend that when we find secrets in compromised places through scanning, we revoke them anyway.

But there are some really big advantages to using this method to do secrets detection. This is because all we need to do is write a simple Python script that compiles the information we want to scan, in our case the files from our local directory, into an array, and then we just palm it off to the API to scan it. We also have the Python GitGuardian client, which is going to do a lot of the communication for us as well. Why this is cool is because really we need to be
able to detect secrets in all of our services. So if all this is doing is compiling information and sending it over, well, this part will be the same for really anything. If we want to scan Slack channels, for example, to see if any secrets have been sent insecurely, maybe we want to check emails, maybe we want to create a pre-commit Git hook, we can do all of this using this method, and it makes it really simple to use custom secrets detection, which I think is really cool. But enough about that; let's jump onto the computer and run through our example.

Okay, so before we get started, just be aware that there is a blog post that goes through all the information that we're going to run through here today, and it has all the lines of code that we're going to go through, with more detailed explanations. So if you want a little bit more information, or you want to go through it at your own pace, then feel free to check that out. The link will be in the description below.

The first thing that I'm going to do is create a project folder, so I'm going to create that in my terminal. Once I'm inside that project folder, I'm going to create two files. First I'm going to create my .env file. This is what's going to store our environment variables, so this is going to be our API token. The second file I'm going to create is our actual script; I'm just going to call this scanner.py. All right, I'm just going to open these up so that they're all ready.

So before we get coding, we first need to install the dependencies that we're going to be using. Let's go to the first one, which is the Python GitGuardian client. Now, this is an open-source project on GitHub, so you can take a peek into the inner workings if you want to, but for the moment we're just going to install this using the Python package manager, pip. So in our terminal or command line we're just going to run the command pip3 install, I'm going to use the --upgrade flag just so that I have the latest version,
and then our project, pygitguardian. This installs the Python client onto our machine so that we can use it within our project.

Now, the next thing that we're going to need to do: as I mentioned before, we're going to be using the API to do the scanning, so we need an API token. We can grab this at the GitGuardian dashboard, and it's completely free. If you go to dashboard.gitguardian.com you can sign up using either GitHub or your email account. Once we have the dashboard open, let's navigate down to the API section and generate a new API key. I'm just going to call this "directory scanner".

So once we have our API token from GitGuardian, we're going to open back up our .env file. I'm going to create a variable, I'm going to call it GG_API_KEY, and we're going to paste our API token into this file. Now, the advantage of using environment variables is that we can keep this API key, the secret, outside of our source code. I'm going to be using a project called python-dotenv, which is going to load in all the variables that we have in this .env file and store them in local memory, and this way we never have to hard-code secrets like API keys into our source code. So just quickly, if you're not familiar with using environment variables, check out python-dotenv. You can install it using pip, just like we did with the GitGuardian client, which is just pip install python-dotenv, and that will install that project for you. That means that with just a few lines of code we can easily and quickly load all the variables within this .env file into our project.

Okay, so we're done, and we can start coding now. But first I'm just going to skip ahead a couple of steps. This is our finished code here, and I just want to outline the four different sections of it that we're going to be using. The first one is just going to be loading in all our dependencies and the modules that we're going to be
using. The second part is going to be processing all of our information and storing it as an array, so in our case it's going to be the files within our directory. The third part is going to take all of that information in the array, break it up into chunks that meet the requirements of the API, and process all of our chunks. And then the last part is going to be printing the results of those chunks from the API. So they're the four sections that we're going to be running through.

Okay, so let's just start by importing our modules. We're going to be using a lot of standard modules. We're going to be using the operating system functionality, os; we're going to be using the system parameters and functions module, which is sys; we're going to be using the traceback module; and we're actually going to be importing another module here, which is our glob module. What this does is find the path names of our files, so we're going to use this to seek and load in all the files that we want to scan.

Now, the next thing that we need to do is import our API token, so that's what we just copied into our .env file. To do this it's just a couple of lines of code: we're going to go from dotenv import load_dotenv, and then on a new line, load_dotenv(). This may seem a little funny, but essentially what we're doing is loading in the dotenv package and then loading our .env file, and that's it; we can actually use all the variables from our .env file within our project now. So let's do that. Let's create a variable called API_KEY, and we're going to assign this variable the value that we gave the GitGuardian API token in our .env file, which, if you remember, was GG_API_KEY.

So now that we have our API key imported, we can move on to the next step, which is importing our GitGuardian API client modules. The core module is the Python GitGuardian client, so we're going to go from pygitguardian and we're going to import our GGClient. So,
remember, we installed pygitguardian, so we're going to import the GitGuardian client from this now. We're also going to import some configuration, so we're going to go from pygitguardian.config import MULTI_DOCUMENT_LIMIT. Now, just quickly, the multi-document limit is essentially the maximum payload that we can send the API. The API allows for asynchronous scanning, which means that we can't clog it up with a huge request for, I don't know, thousands and thousands of documents. So what it does is set a limit of 20 documents or 2 megabytes, and we need to break our payload up into chunks that meet that requirement. MULTI_DOCUMENT_LIMIT just imports all of the variables that create that limit.

Now, the next thing we're going to do is initialize the GitGuardian client, so we're going to pass our API key to the GitGuardian client. We're going to go client equals GGClient and then, in brackets, api_key equals API_KEY. This passes in the key and initializes our client so that it has all the information ready to go.

Now we're ready to start loading the files that we want to scan into an array, so we're going to create a list of dictionaries for scanning. To do this, we first create an empty array called to_scan, and now we're going to use our glob module to find the path names of the files that we want to scan, so for name in glob.glob. Now, we can actually do this a couple of ways. We can either do a recursive scan, which means that we can scan everything forward of where our Python scanner file is, and we can do this by going **/*, or we can specify a specific file path in here. That's what I'm going to do, because I have a folder that has a whole bunch of fake secrets in it that we should be able to detect. And then we're going to go recursive equals True, and this means that it's going to go forward into new folders; it's going to keep going. But I'm going to add an if statement in here to exclude
some files. I already know that I have some secrets in .env files, but I have these well documented and excluded from any Git repositories, so I'm going to ignore any secrets that are within a .env file: if ".env" in name, I want to exclude that. I also want to exclude the folder name. Now, obviously I want to scan what's inside the folders, but the folder itself will create an error if I try to scan it, because it doesn't have any information in it. So I'm also going to exclude os.path.isdir(name), and then we go continue.

Okay, and now we need to start appending these documents into our empty array. So with open(name) as fn, we're going to take our to_scan array and we're going to go append, and we're going to add a document, which is going to be the document itself, so fn.read(), and then we're also going to add our filename, which is the path name of our file, so that's os.path.basename and then name.

Now, at this point, let's just run a quick check to make sure that we don't have any unwanted errors so far. I'm just going to add a quick print statement in here, and we're going to jump into our console, run this, and just make sure that we can see all of the files that have been loaded into our array. So I'm going to run this now; hopefully I'm not going to get an error. Nope, and you can see here all the files and all the information from the files inside that array. So let's get back to our document, and I'm going to comment this out.

Now, the next thing we need to do is process in a chunked way to avoid passing the multi-document limit. If you remember, I talked about the multi-document limit before, and what we're going to do is break up our payload into chunks that are equal to or less than the multi-document limit, because if we exceed this we're going to get an error on the API side. So we're going to create a new empty array here called to_process, and then we're going to loop
through our to_scan array, so for i in range, but we're going to do this in steps. We want to start at 0 and we want to finish at the total length of the array, so len(to_scan), and then we want to do this in steps that equal the multi-document limit. Now, because we pulled this in from the GitGuardian client configuration, we have all the parameters that we need to meet in the variable MULTI_DOCUMENT_LIMIT. Then we're going to create a variable called chunk, and it's going to equal the slice of our to_scan array that we're currently looping through. Using a try block, we're going to scan our current chunk using the multi_content_scan command from the GitGuardian API.

Before we get too far ahead of ourselves, let's also add in some exception handling for where the scan will fail; for example, this could be if the file name is too long for the schema. We're going to add this in by using except Exception as exc. This is where we're going to be using our traceback module, so that we can print the exact line that our exceptions are failing on. So we can go print_exc and, in brackets, 2, and then we're going to use our sys module, and we're going to print off our exception.

We're also going to add in a quick if statement here so that if our scan is not a success, so if not scan.success, we're going to print ourselves a nice friendly error message: "Error scanning some files. Results may be incomplete." And then we're going to print off our scan. Lastly, we're going to add our scan results into our to_process array, so to_process.extend, and then I'm going to go scan.scan_results, and this will add in our scan results.

And that's actually all the code that we need to scan our files. Now, obviously if we run this we won't get any results back yet, because we haven't printed anything. But if this all seems a little bit too much like magic in how simple it is, well, that's thanks to the GitGuardian client that we're using, as well as the GitGuardian API. So again, big advocate for this
method. But let's print off some results so that we can see if we've detected any secrets.

How we're going to print our results is by looping through our scan results, and we're going to do this in a for loop: for i, scan_result in enumerate(to_process). So we're going to pull out our scan results from our to_process array, and then if our scan result has a policy break, we're going to print the chunk file name and we're also going to print the policy break count.

Now, if you're wondering what a policy break is: a policy break is essentially a detected secret. The API can also detect things like forbidden file names, if you don't want them, or banned file extensions, and we call all of that a policy break. So in this case, when we're asking whether it has any policy breaks, we're really asking whether it has any secrets, but we're going to continue with the language of policy breaks so we keep it consistent. So when we get our policy break count, well, that's the number of secrets that it's found in this particular file.

So this here is our first real checkpoint. Let's save this, open our console back up, and run this script, and we'll be able to detect any secrets that we have in here. Now we can take a quick look at our first results, and you can see here that in our services.json file we have one policy break, in our settings.py file we have a policy break, and a few others. So we can see that we have policy breaks, but it doesn't give us any more information about what they are, such as the type of secret that is detected.

We can actually easily add this in. If we go back to our scanner Python script, we can now print the policy break type; this is going to pull up the type of secret that has been detected. This is really easy again. What we're going to do now is go for each policy break in our results, so scan_result.policy_breaks, and we're going to print the policy break type from our scan results. So let's run this
again, and as you can see, the results that we get are the same as before, such as settings.py having one policy break, but it also gives us the type of secret that it is, in this case a Django secret key. And it's the same for the other ones: we have a basic authentication string, we have a Google key. So we can see what types of secrets have been identified by GitGuardian, and that's really quite cool.

We can actually even take this a bit further. Let's say that we want to know the actual secret that is detected, so we want to pull up the actual API key or API token and display it. We can do this too, and again it's just a couple of lines of code. So we go back to our scanner Python script. The secrets themselves are actually called matches, so it's the match of the policy break, and we're going to print our matches here. We're going to follow the same process as before, but this time we're going to go for match in policy_break.matches, and then we're going to print match.match_type. And now if we run the script again, we see that we have all the information as before, but we also have the actual API keys or the actual credentials, the match that triggered the policy break, printed.

So you should be able to see now how we can extrapolate this process of scanning files within directories and move it into different services. Now, you may also want to implement this in an application somewhere, or automate it. One very helpful extra little bit that we're going to go into now is printing the results in a JSON format, because JSON is much easier to use in other processes. We can actually do this super simply, and it's really just a couple of lines of code, so we're quickly going to print this in JSON now.

So, back to our Python scanner script. Let's start again; we'll leave the initial results there, but we're going to print a whole new set of results. We're going to start this the same way as before: for i,
scan_result in enumerate of our to_process array, and then we're going to go: if the scan result has policy breaks, then we're going to print our scan result, and then here's just one small addition, to_json, and that's it. Now let's go back and quickly run this script again, and you will see that, yep, we get our results in a JSON format. This is one of my favorite features because it's so simple and so useful, particularly when you want to automate this in your software development life cycle or in your pipelines.

Now, this is obviously a very basic example, but hopefully you can see the power of this, and you can see where we can potentially go, which gets me excited. If you need any help with this at all, you can reach out to me or any of the support team at GitGuardian by going onto the GitGuardian website. And if you're creating something particularly cool, I would love to hear about it: how you're using the API in your software development life cycle or for your company. So that's it for today, and I hope to see you again in the next tutorial.
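
For reference, the four sections walked through above can be sketched roughly as follows. This is a minimal sketch rather than the exact code shown on screen: the helper names (collect_documents, chunked, scan_documents, report, main) are hypothetical, the default limit of 20 is the document limit mentioned in the walkthrough, and errors="replace" is an added assumption so that non-text files do not stop the run. The client calls and attributes (GGClient, MULTI_DOCUMENT_LIMIT, multi_content_scan, success, scan_results, has_policy_breaks, policy_break_count, policy_breaks, break_type, matches) are assumed to match the installed version of pygitguardian, so check them against the project's README.

```python
import glob
import os


def collect_documents(pattern):
    """Section 2: load every matching file into a list of dictionaries."""
    to_scan = []
    for name in glob.glob(pattern, recursive=True):
        # Skip .env files and directories, as in the walkthrough.
        if ".env" in name or os.path.isdir(name):
            continue
        # errors="replace" is an assumption added here to tolerate binary files.
        with open(name, errors="replace") as fn:
            to_scan.append(
                {"document": fn.read(), "filename": os.path.basename(name)}
            )
    return to_scan


def chunked(items, limit):
    """Yield consecutive slices of at most `limit` items."""
    for i in range(0, len(items), limit):
        yield items[i:i + limit]


def scan_documents(client, to_scan, limit=20):
    """Section 3: scan chunk by chunk; return (document, scan_result) pairs."""
    to_process = []
    for chunk in chunked(to_scan, limit):
        scan = client.multi_content_scan(chunk)
        if not scan.success:
            print("Error scanning some files. Results may be incomplete.")
            print(scan)
            continue
        # Pair each document with its own result so that file names stay
        # correct even when there is more than one chunk.
        to_process.extend(zip(chunk, scan.scan_results))
    return to_process


def report(results):
    """Section 4: build one line per file, policy break and match."""
    lines = []
    for document, scan_result in results:
        if not scan_result.has_policy_breaks:
            continue
        count = scan_result.policy_break_count
        lines.append(f"{document['filename']}: {count} break/s found")
        for policy_break in scan_result.policy_breaks:
            lines.append(f"\t{policy_break.break_type}")
            for match in policy_break.matches:
                lines.append(f"\t\t{match.match_type}: {match.match}")
    return lines


def main(pattern="**/*"):
    """Section 1 plus wiring; needs python-dotenv, pygitguardian and a .env file."""
    # Imported here so the helpers above work without the third-party packages.
    from dotenv import load_dotenv
    from pygitguardian import GGClient
    from pygitguardian.config import MULTI_DOCUMENT_LIMIT

    load_dotenv()
    client = GGClient(api_key=os.getenv("GG_API_KEY"))
    results = scan_documents(
        client, collect_documents(pattern), MULTI_DOCUMENT_LIMIT
    )
    print("\n".join(report(results)))
```

Calling main() from the project folder, with GG_API_KEY set in the .env file and both packages installed, should scan everything below the current directory. Because scan_documents only needs an object with a multi_content_scan method, the same helpers could be pointed at documents gathered from other sources, such as chat messages or staged Git changes, which is the reuse the walkthrough describes.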