CodeSecDays 2024 - Join GitGuardian for a full-day exploration of cutting-edge DevSecOps solutions!

Save my spot!

Webinar - Detecting source code leaks on GitHub

To participate in the stream, go to https://www.crowdcast.io/e/detecting-source-code-leaks-github

Not a day goes by without hearing about a source code leak in the news: Twitch, Samsung, Nvidia, Microsoft… We’re left wondering who’s going to be next?

Join me this time, with Thomas Deschamps, Technical Product Manager at GitGuardian, to discuss:
- Why and how source code is leaking
- The real (security) threats behind source code leaks
- How we detect source code leaks on GitHub

The discussion will be followed by a lightning demo ⚡️ of HasMyCodeLeaked, our latest release. We’ll show you how to fingerprint your proprietary code and generate a personal report to identify any repositories at risk!

---

Did you know 🤔 In 2021, GitHub received around 1,800 DMCA takedown requests to remove more than 19,000 infringing repositories! And that’s just about the tip of the iceberg – most leaks go unnoticed.

---

Secure your seat now and enter the draw to win swag bags and Amazon gift cards.

Video Transcript

All right, hello everyone, we're live. Sorry, we're running a few minutes behind schedule after some last-minute technical difficulties, but I'm super excited to be here with you all today. We're streaming across multiple channels: we're live on Crowdcast, which is the main channel that you can see there, and we're also streaming on LinkedIn and YouTube. If you want to participate in this stream, which I suggest you do, head on over to the Crowdcast link. That way you'll be able to ask questions, participate in the polls, and chat with myself and the handsome Thomas, who is joining me today.

So Thomas, welcome to the stream. I am the developer advocate at GitGuardian, and Thomas is one of our product managers. Thomas, can you maybe introduce yourself to the audience?

Hello Mac, thanks for having me. I'm Thomas, the product manager responsible for the HasMyCodeLeaked product we launched a month ago, which helps you find out if your private code has leaked on public GitHub, which is the subject today.

Yes, right, that is the subject. Let me just share my screen for a minute. All right: detecting source code leaks on GitHub. That is the topic for today. We're going to discuss why that is bad, so we're going to talk about source code leaks in the news and a little bit about why we think this is a trend on the up and up. We're going to look at some of the reasons why source code leaks and the security problems behind that. Then we're going to do our first ever live demo of a new product, which Thomas has been working very hard on with the engineering team here at GitGuardian: a tool that can detect your source code throughout public environments, namely GitHub, and we're going to talk about some unique features we have in the ability to do that. And then, as always, we're going to have a Q&A section that you can
participate in. As always, we have prizes available: we have swag bags, and today, at the end, we're also going to have some Amazon gift cards to give away. You win those swag bags by participating: asking questions and taking part in the chat. So if you haven't said hello in the chat, say hello in the chat. We're going to have some polls, so participating in those is going to help as well. The most active members will get some swag bags from GitGuardian as appreciation.

Before we get started, where in the world is everyone tuning in from? Let me know in the comments. Thomas, you're in the office, it looks like.

This is actually the last day in this office. We're changing offices; it's moving day.

Yeah, moving day. We're based in Bastille in Paris, if you know Paris, and we're moving to Opéra, the area around the opera house in Paris. If you haven't been, the Opéra is one of my favorite buildings in the world; it's the most opulent building I think I've ever seen in my life.

Let me know where you're tuning in from so I can say hello to your part of the world. We've got some from Germany, Texas, Berlin, Singapore, Boston, Canada, Nigeria, Vermont, Wales, Nashville, Tennessee (my brother lives in Nashville, Tennessee), Finland, Poland. Wow, a lot of great places. India. I love to see this. We're getting quite a lot more people in these webinars. I think our first webinar had about 290 people registered, our last one 750, and this one we're on 450. Do you know what, I'm pretty proud of that; I think the numbers are going pretty well.

Okay, so we've got our first poll here, which I'll just make live: why are source code leaks bad? Let me know what you think. Do you think they're bad because they contain secrets, possibly, which obviously we talk a lot about at GitGuardian? Are they bad because of lost IP, perhaps losing the competitive advantage? Code
is, after all, an asset. Does it expose other security vulnerabilities? Those could be things like business logic flaws in your source code; that could be a problem. I'm just trying to post this right now. Or do you think that source code leaks aren't bad? There are a lot of people, and I'm one of them, who think that source code shouldn't be such a valuable asset, so that if it does... there we are. All right, I've got the poll up, sorry about that. So there are people who don't think source code leaks are that bad. I'm particularly one of them: we should aim to get our source code to a point where, if it does leak out in public, it doesn't expose any serious vulnerabilities. But unfortunately this isn't the world we live in, and even if we aim for that, it's very hard to get there. So let me know.

All right, 'exposes security vulnerabilities' is winning out of the gate. There are a lot of security vulnerabilities that can be exposed in our source code, and the tools that we use can also provide attackers with blueprints into our systems. 13 votes, 16 votes for exposing security vulnerabilities. We only have one vote for 'not that bad'; I think that's quite good. Two votes. All right, so exposing security vulnerabilities is definitely... 'exposing security vulnerabilities' with a typo. There we are, corrected that. So that's the clear winner there, and that is true: source code does contain a lot of vulnerabilities that can be picked apart. We've mentioned business logic flaws; perhaps you'll be able to discover some cryptography flaws in the source code, other areas like that. Obviously we have the keys that are leaked, which is a big one that we find, but there are also a lot of other things that can go wrong when your source code leaks out.

So let's talk about some source code leaks that have happened in the news. What are some of the headlines that
we're having at the moment? We've seen a huge trend of source code being leaked. You may have noticed that in October Twitch's entire source code was leaked, and then at the start of this year we had the Lapsus$ group leaking just about everyone's source code: Samsung, Nvidia (I'll talk about Nvidia in a minute), and some Microsoft code. So this is becoming a prevalent problem. These are the main highlights, but source code leaking is a fairly big problem and can come about in lots of different ways.

We obviously want to make sure that our source code doesn't leak, but here's the main problem: source code is a leaky asset. Once source code hits your git repositories, it's basically broadcast into hundreds, even thousands, of different places, and we have no idea where. If it's in your git repositories, it's on your developers' machines, maybe it's backed up into different areas, and it's probably shared on your internal messaging systems. We have no visibility over all of these places, and that's one of the key things when it comes to source code leaks: we don't have any visibility over where the source code actually ends up, and we often don't know if we have a problem.

Source code leaking into public areas, specifically GitHub, is still a big problem. There's something called a DMCA takedown. This is part of a digital copyright law (the Digital Millennium Copyright Act) that basically looks at source code that's proprietary, so source code from a company, that has been leaked into a public repository. If this happens, one of the questions is: what on earth do I do? What do I do if my proprietary source code is leaked in a public repository, perhaps by an employee, a previous employee, or just someone who's using that code? There is a process: you can apply for a DMCA
takedown to GitHub itself if it's leaked on GitHub. We've seen the number of these DMCA takedowns really skyrocket up to 2020, and it's about one in ten thousand repositories. The numbers on the graph look pretty big, and huge numbers of DMCA takedowns are being requested, but at the scale of GitHub this actually isn't that much. That isn't necessarily because proprietary code isn't sprawled across GitHub; we happen to know that it is quite a big problem. But it is really hard to find. How do you find your source code that's leaked onto GitHub when you've got something at that scale? There are going to be thousands of files that match your particular code, and lots of different things you're going to have to look out for. So it's really hard to, one, even find your code and know you have a problem; and then, if you do, you have to go through the legal process of making a DMCA takedown request, which is usually processed pretty fast on GitHub's side, and that code will be removed. We'll take a bit more of a look at that. So the problem is actually bigger than we expect, and it's also completely unknown; it's really hard to know how to do this.

So how does source code usually leak? I have some interesting examples to run through later, but source code leaks normally happen in a few different ways. First, unsecured version control servers. Take the Twitch breach: Twitch's source code leaked in October. I've written a number of blogs and done some videos on this; we found six thousand credentials, or secrets, inside Twitch's source code. It was leaked by someone as a torrent, but how did it leak in the first place? How did that bad actor get access to Twitch's source code? Well, they had a version control system; basically
it was public, only for a brief period of time. That would have come about because perhaps they were updating their infrastructure, or for a couple of other reasons, but essentially their source code was unprotected, and then it ended up out in the wild. I know Thomas has some more examples on Twitch's source code, specifically about where it ended up and how widespread that was.

Second, we can have publicly exposed cloud storage. I don't know about you, but I can't count the number of times I've read articles about a misconfigured Amazon S3 bucket. If we're storing our data and code in places that are misconfigured, then security researchers and malicious actors are going to be able to find them. There are ways to scan or fuzz to identify these; it's a commonly used attack path. So this is another area where source code ends up.

Third, current or former employees and contractors. This is a big one too, and it's interesting because we've talked a little bit about this in the context of secrets. GitHub is quite unique in that you probably have one account for both your personal and your professional use. Why is this a problem? Well, it's really easy to make a mistake and accidentally push code into the wrong repository. The other, slightly more concerning and malicious, side of this: I mentioned the Lapsus$ hacking group and some of the source code leaks they did this year. On their Telegram channel they were basically advertising for insiders to give them access to internal infrastructure, including code repositories. So if you worked for a large telecom company, even if you don't have access to anything special, you probably have access to the git repositories, and an attacker can
move laterally, find other vulnerabilities in that source code, and maybe leak it to the public for a ransom, which is what happened in the case of Lapsus$. So the employees-and-contractors risk can come from a lot of different directions.

Another one that I'm going to talk about briefly is the tooling that we use: misconfigured DevOps tooling. We implement these tools in our CI/CD pipeline and they help us a lot, which is great, but they can accidentally expose or give access to source code if you don't configure them right. So not only do you have to configure your version control systems and your data storage, and keep a check on your employees and what's happening, you also have to make sure that all your tools are correctly configured, because a small misconfiguration can be a big problem.

And then there's fat-fingering: accidentally making a repository public. The crazy thing about GitHub is that it has a public API, which means that when a private repository is made public, this falls under the public event category, an event on GitHub's API that fires when a repository has gone from private to public. This is probably one of the best ways to find sensitive information that's not meant to be there, and attackers, adversaries, and security researchers monitor the GitHub API, particularly this type of event, to try to find when code has leaked.

All right, I'm just going to take a quick look in the chat to see if there's anything I need to address. No, I think we're good. We have a few questions in there; feel free to ask questions along the way. We'll probably only answer them at the end, but ask now so you don't forget.

All right, so now let's talk about the real threats behind source code leaks, that is, the ramifications of this.
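
As an illustration of the public-event monitoring mentioned above, here is a minimal Python sketch. It assumes GitHub's public events feed at https://api.github.com/events and the `PublicEvent` event type, which GitHub documents as the event emitted when a private repository is made public. The function names are hypothetical, and this is only a sketch of the idea, not how GitGuardian or any particular researcher actually monitors GitHub.

```python
import json
import urllib.request

# Public events feed of the GitHub REST API (unauthenticated calls are rate limited).
GITHUB_EVENTS_URL = "https://api.github.com/events"


def fetch_public_events(url=GITHUB_EVENTS_URL):
    """Download one page of recent public events as a list of dicts."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def repos_made_public(events):
    """Return the names of repositories that just went from private to public.

    Each event is a dict with a "type" field; "PublicEvent" is the type
    emitted when a private repository is made public.
    """
    return [
        event["repo"]["name"]
        for event in events
        if event.get("type") == "PublicEvent" and "repo" in event
    ]
```

Calling `repos_made_public(fetch_public_events())` would list the repositories on the current page of the feed that were just made public. A real monitor would poll repeatedly and authenticate to get a higher rate limit; both are left out here to keep the sketch short.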
We talked a little bit in that poll about exposing logic and security vulnerabilities. Source code is like a blueprint. An attacker may be able to gain access to your application without it, but if they have it they can do a whole lot more, and a lot faster. They can run it through code analysis tools, for example, to find vulnerabilities to exploit.

It can also attract unwanted attention. In the case of Twitch, the main headline was that streamers' incomes were leaked; if you look at the mainstream media and what people were reporting, it was the income of the streamers. In the scheme of things, that definitely wasn't the worst thing that leaked out. We found a huge number of credentials that could have been used in much more malicious ways. Thankfully, in Twitch's case, the attackers leaked it publicly, which means Twitch knew about it and announced it; had the attackers kept this private, it could have been worse. But this was reputational damage: even though this wasn't the biggest security issue that could have been reported on, Twitch's reputation was damaged as a result.

So obviously there can be consequences from that, and also financial losses. There are regulations around the type of data that companies hold and the security implications around it. If your source code eventually leads to access to private information, users' information, this can lead to litigation and financial penalties, and to other damage in terms of being non-compliant. So there are lots of different reasons why source code leaks are bad.

One of the other ones that we always talk about is hard-coded credentials. I think I've said multiple times now that we found 6,000 credentials in the Twitch breach. We've done analysis on lots of the main
breaches that have happened recently. We looked at Microsoft, Nvidia, and Samsung, and reviewed how many secrets were in their source code. Spoiler alert: they all have secrets in their source code. You can take a look at that, with examples of the types of secrets that were found. There are also other security researchers who have done reports on the other types of vulnerabilities in these code bases. So it can be quite damaging to have your source code leaked.

Now I just want to take a quick dive, and then I'll stop trying to scare everyone; I'll calm down in a minute. I've cherry-picked a couple of interesting examples based on what we see. I don't want to spend too much time on this, as I have other deep dives into these topics, but let's look at how source code leaks have actually affected companies.

One of the best examples is Nvidia. Nvidia's source code was leaked by Lapsus$. It's not super clear exactly how Lapsus$ gained access to that source code; it could have been for a number of different reasons. But what's fundamental is that there was a bunch of secrets in there. The most critical were some signing keys, and we know that malware was signed using Nvidia's keys. This is where the issue of leaked credentials becomes a very complicated problem, because I have the two keys here, and you'll see under 'Valid from' that both of these keys were expired: one expired in 2014, one expired in 2018.
Logic should mean that these wouldn't be accepted anymore, but that's not how these types of security credentials work, because if you stopped accepting them, all the hardware or older fundamental software that hasn't been updated could stop working. So it's very, very hard to keep track of all of these, and even expired keys can cause a problem. These particular keys were probably buried deep in Nvidia's history, and a security researcher was able to find them and still use them despite them being expired.

The other one we've briefly talked about is misconfigurations from DevOps tools. One interesting example, and probably the most relevant here, was from SonarQube: a security researcher found out that he could access lots of private source code through misconfigured DevOps tools. SonarQube is a great tool. The vulnerability wasn't with SonarQube itself, but because it's so easy to misconfigure when you set these things up, it becomes a vulnerability for those using it. In this case the security researcher was able to gain access to hundreds of different companies' private source code through these DevOps tools, and this particular researcher actually published this source code on a git server. So these were actually out there in the open. He did comply with the DMCA takedowns as they came through, but the companies had to make those requests, and a lot of the companies never actually filed a DMCA takedown for the source code that was leaked by this researcher, simply because they had no way of knowing that it existed there.

All right, I'm now going to invite Thomas to join in here, now that I've sufficiently scared everyone. Hopefully; that was the goal. I come in and scare everyone, and then Thomas comes in and calms
everyone down and shows how we can solve this problem. Thomas, thanks for being here. I want to talk a little bit about how we're going to detect these leaks on GitHub. Just as a reminder, you probably know that GitHub is the largest code-sharing platform we have. As a reminder of how big it is: 73 million developers are on GitHub, and this is absolutely huge. So Thomas, my first question is: why is detecting code so difficult at the scale of GitHub?

You're on mute, if you're there. Can you hear me, Thomas? You're still on mute. Hang on, I'll unmute you. Nope, I can't unmute you. Can you unmute yourself in StreamYard? I hear you okay now, loud and clear.

So I was just saying, what are some of the issues? Like we said, GitHub is the biggest code-sharing platform. In fact, developers say that GitHub is like social media, a social coding platform. A lot of things happen on GitHub: a lot of collaboration, a lot of projects that gather thousands of developers. So it's really something big. It's around 73 million developers today; 16 million developers joined in 2021, so it's really big. In 2021 alone, 61 million new repositories were created on the platform. This is where open source is written; every big open source project is maintained on GitHub, whether it's Terraform, Kubernetes, Rails, Ruby: everything is on GitHub. So it's a lot of work. In fact Scott Chacon, who is a Git evangelist and one of the co-founders of GitHub, invested in GitGuardian in 2019, so we are really close to GitHub, to the technology, and to the ecosystem. And what does that mean for the challenge? If you want to go one slide back... okay.

So you know, at GitGuardian we specialize in securing code by preventing secret sprawl. That means that usually what we do is look at source code and prevent secrets
from being written into it. I think there was a question in the chat, so yes, we have a pre-commit hook that forbids you from committing secrets, for example. That's always great, but we wanted to do more. We've been monitoring public GitHub for almost four years, so we have a database of 15 billion patches. I mean really 15 billion, so that's a lot. And we wanted to go deeper: we wanted to invest in preventing IP leaks the same way we prevent secret sprawl. That's why we started this initiative, HasMyCodeLeaked.

What we do with HasMyCodeLeaked is really simple. When you use Git, each file you commit has a specific signature, called the SHA-1, and with that signature you can look for matches between your private data and public data. So that's what we do: we look for matches. A lot of the code is boilerplate. If you start a new React project, you're going to have a lot of code generated by React. If that code is in both your private repository and a public repository, it's okay; no one cares about that code. But what we discovered is that usually what leaks is not just one file or one part of a file; it's a whole repository. So once in a while you get a lot of matches between private and public, and that's what we want to look at. So we built a tool that allows you to fingerprint your code using Git SHAs and to look for those SHAs on public GitHub.

The product is built in two parts. The first part is a utility that you install on your machine and run from the command line, so it's really for the developer inside you. And you don't have to trust anything, because the code is fully open source. You can look at the code: it's just some Golang, some Go packages. We just look at your VCS, clone the repositories, take the SHAs out, and output them to a JSONL file. I'm gonna, I'm
gonna show you everything right afterwards. Then you just need a token to connect to your VCS. The computation is done on your local machine, so we don't have access to your files; there's just no breach possible that way, it's completely secure. You just send us fingerprints, which are basically useless to us if we wanted to steal your data; but if your code has leaked, they will match. Then you upload your signatures to our server and we look for matches on public GitHub. We crunch the data, because there is a lot of data to crunch, and when we have a result we send you an email with access to a nice dashboard where you can see which repositories are safe and which are not, if any. I've just run it a few minutes ago; I'm not sure I've got the results yet.

Right. Okay, can you hear me, Thomas? Can other people in the stream hear me? I don't think Thomas can hear me. Is my mic still on?

So Mac, do you want me to do a small demo of the command line interface?

Yeah, yeah, I would love that, but I'm not sure if you can hear me. Can you give a thumbs up or thumbs down?

Okay, so I'm going to share my screen. So you're having a look at my terminal. Like I said, we have a CLI utility, written in Golang, called src-fingerprint. If you're using macOS and Homebrew, you can install it just by running that simple command. It's a tap, and mine is already up to date, so I won't have anything to do. Then you have to run a simple command: you pass the VCS token as an argument, and then you run the fingerprint collection and target your provider. If you're using GitHub Enterprise, GitHub on-prem, you can specify a specific host. If you use GitLab, you can just specify GitLab, and if you're using GitLab on-prem you can still specify the host. We only check the private
repositories, because if we checked the public ones we would of course have matches on public GitHub data. Then we just clone your repositories and generate fingerprints. I have around 50 repositories on my personal GitHub account, and I'm going to compute it right now. What you need to know is that the limit will be your bandwidth: if you have low bandwidth it may take time to download the repositories, but if you have good bandwidth it goes quite fast. We're already at 20 repositories, so you see it's quite fast. I've got one big repo, so it takes more time for this one.

So it's done. We didn't speed anything up: one minute and one second, so I think it's quite fast. Maybe not the world record, but just the time to have a nice glass of water. What you see is that we generated a file, fingerprints.jsonl.gz. I'm just going to unzip that file to show you what's inside. Okay, I've got a nice JSONL file, and as you can see there is really nothing sensitive in there. You have the repository name, which is clearly not sensitive, the size, the file path, and the SHA, like I said, which is the signature. So this file is really not sensitive, and that's one of the great things about this approach: you're safe on your side. Even if you lose that file or it gets stolen, it's completely okay.

What I'm going to do next, and I'm going to have to stop sharing my screen and reshare it, is go to HasMyCodeLeaked, go to the bottom of the page, and choose to drop a file. So I've got my gzipped JSONL, I'm adding my email, submitting my file, and okay, 'thank you'. Now we're waiting for an email. This time I'm going to cheat a little bit and show you the results directly, since I ran the command about five minutes ago. It's really fast if you have a small amount of
data like this. So what we see here: for example, maybe I want to check that a specific repository of mine is really safe. I know Workit is important to me. Okay, Workit, my favorite repository, is safe. Then I want to investigate what is high risk and why, what I should check. So I'm going to play with the filters and check which ones are high risk. I see I only have repositories with one specific public repository that matches, which means the match is really specific. Remember that we are comparing our data against 15 billion patches, against millions of repositories, so when you have only one match, that's spooky.

Then you have a really interesting thing, which is the unique matches. Like I said, when you write code you use a lot of boilerplate from libraries and a lot of code generated by your libraries. That code we don't really care about. But then there is everything that you write yourself, all those small things where you put your love and knowledge, where you really create value, and where sometimes you put some secrets and sometimes you have some vulnerabilities. Those files really are unique. So we have the uniqueness data, and I'm going to take the ones that have more than 50% uniqueness. Okay, so there is 'vine video ruby'. What is that? Oh, it's Twitch. In fact I cheated a little bit. Like we said, the Twitch data got leaked a few months ago, so we had a look at it to find some secrets, and we wanted to check whether there had been any leak from Twitch on GitHub. In fact there was a whole repository that surfaced on GitHub the day after Twitch got breached; it was called twitch open source, and it has been taken down by GitHub. So unfortunately, if I go now
and search for Twitch, the user has disappeared, but before that we used to see that the content was blocked by a DMCA complaint. So it's really interesting to see that we come back to the DMCA complaint. And in fact, if we look at the Twitch data... I'm just going to have to circle back to find it. Give me just a few seconds so I can get back into this. Okay, so I'm going to show you for Twitch. It takes a bit longer to load; sorry, I didn't cheat on this one, so we have the loading time.

What we can see for Twitch is that they have a lot of repos that were put on GitHub the day after they were breached. They have four pages, so that's really a lot: around 350 repositories that were uploaded to GitHub after the breach. Obviously I think they had been looking, so they saw the leak on GitHub straight away. But just imagine someone uploads a bit of a repo of yours: it may go unnoticed. That's really what this project is about: checking that your private data is safe, and if it's not, you can ask for a DMCA takedown, you can take action, and you know that you've been breached. That's really the power of the product, and I think everybody should use it. It's free, so you can be safe and be sure that you've not been breached and that your private repositories are safe.

That's especially important if you have a sensitive repository that you want to track. Or, for more trivial cases: if you run coding interviews, a lot of them end up on GitHub too. Or if you're an independent developer: we see a lot of copied things, for example Chrome extensions and so on. So it's actually a powerful tool that really detects that kind of breach,
where the repositories are mirrored. It also addresses the issue of fat fingers, when one of your developers mistakenly pushes a repository to public GitHub instead of your private GitHub or your on-prem instance, and that happens more than we think. And that's actually about it. For Twitch, what is good is that everything was linked to that one user, twitch open source, so that's something great for them. And you can see the example of an author file that is common across several repositories. OK, I'm going to give the mic back to you, Mac. I'm just going to have to check my headphones, because I've had no sound for five minutes, so I hope you hear me OK, but I'm not hearing anything. Can you hear me now? Oh yes, thank goodness. OK, we're good. So I think that was it for the demo; I think it's time for Q&A. Yes, time for Q&A, so let me just pull up my slides again here. Thanks for that. Now, I have some questions, because I wasn't able to talk to you and I was dying on this, I had so many things I wanted to ask. We already have some great questions on Crowdcast, but what I wanted to ask you first is this: you mentioned the 15 billion patches that we have. For those that maybe don't know, what exactly is a patch, and what is that database of 15 billion patches that we're scanning? OK, so I think you're all familiar with your code files: you have a file where all your code is, but every time you do a commit, you make a small change to your file. So in fact your file now is just the sum of the changes you've made since you created it. For example, if you create a file, write puts "test" in it, and commit it, the first patch is just puts "test". So a patch is the difference between two states of a file, between two commits. So in a
patch you have deletions and additions, and if you go to any commit on GitHub, what you're going to see is the differences in files, and those are the patches. I'm not sure it was really clear, because it's really nerdy. No, no, I think that's clear. Now we have some questions here in our question section. One question is: have you considered adding functions to alert on called modules which themselves contain vulnerabilities? So this is kind of talking about scanning the code itself for vulnerabilities. Obviously this tool isn't scanning the code, but this might be a good moment to talk about the relationship with GitGuardian's products, which scan source code for secrets, and how that fits in with HasMyCodeLeaked. It was a random question, so, yeah, I think that's complementary. Secrets, obviously, are really important to check, and when you have a secret leak you need to address it now; it's a no-brainer. For IP detection, it's more that it's good to be alerted and take action, but usually it's not a life-or-death situation. Currently we are working on porting the product I just demoed into our internal monitoring product, so the idea will be for users to have their secrets monitored in one place and their IP detection running in the background, so they are secure on both dimensions. At GitGuardian, we think there are a lot of different things that make your code secure, and currently we are working on these different things to make sure that one day all your code is secure. Secrets are one thing, IP leaks are another, but there are still many things to come. Right, great, that's exciting to hear. We have another question here: how do we filter out proprietary code from known open source shared
libraries and areas like that? So how do we distinguish the open source code found in other shared libraries? You kind of touched on this. Yeah, that's a really good question; in fact, that's one of the first things we stumbled across. When you have open source code, you can find it a lot on GitHub, which means we're going to find a lot of matches for that code on GitHub. So when a file becomes common, we say it's a common file, and we say it's not important. If you have a repository with a lot of boilerplate or a lot of open source code, or if you have forked a project, it won't be shown as high risk, because we only show as high risk the repositories that have unique matches, matches that are significant enough to be considered dangerous for us. So we get rid of the obvious false positives. OK, now we have another question here: if we're matching files using these fingerprints, couldn't someone just add in some random white space to create a unique hash for the file, for example? You talked a little bit about how whole repositories are leaked, so what is the likelihood of someone stealing your code and changing it slightly to avoid that detection? Yeah, that's a good point. If we wanted to find look-alike code, say you have a nice algorithm and someone stole it but used a different IDE, or used a linter to change it, or changed the ordering of the methods, for example, unfortunately we won't catch it. We are limited to exact matches. But the use case we are targeting is not someone who is stealing a bit of code from their employer. That's not what we are looking at;
what we are really looking at is someone who mirrors a complete repository, someone who steals it. To us, that was really the issue we wanted to focus on, the repository that leaked, because usually when there is a leak it's not one file, it's a lot of files and a lot of repos. So we didn't want to focus on finding those small things in code; we wanted to go for the really big issues and the big problems. Right, cool. Now we have a couple of questions here that maybe I'll quickly take. We have one from Gerald, who said he's noticed that our free version of GitGuardian is only free for... oh, I think that question's gone now, but anyway, I'll still answer it. Oh, here we are: it's free for up to 25 employees, for public repos only. That's not actually true. If you have fewer than 25 developers and you want to use our full business plan for GitGuardian, then all you need to do is email support@gitguardian.com with your company name and how many employees you currently have, and we will upgrade you for free to the complete business plan. It's a bit of a manual process, but those are the checks and balances we have in there. So anyway, that's a question I can answer, yay. All right, now, you talked a little bit about this before, but we have a question about what determines the severity: how exactly is high risk versus low risk calculated? OK, so we have three different buckets: high risk, low risk, and a category that we call unknown, because we didn't find a better name, sorry. Low risk first: it means we did not find unique matches, so every match we found was really common. To us it was at the level of boilerplate, and we have nothing that's specific enough to give us a hint that you have a leak. And you have a number of files,
and most of the files have no match on GitHub either, so these repositories we label as safe. On the other side, we have the high-risk repositories. That means they have a high number of unique matches, and a high percentage of the files contained in the repository have matches. In human terms, it means that a lot of its files are found on public GitHub, and because, like we said, we have a lot of files on GitHub, that's usually not just random. So if one of your private repositories is shown as high risk, that means there are repositories that are really close to yours. There is one thing you can look at in the report: you can order the report using the column with the number of matching repositories. If you have only one repository that matches, with a high number of unique files matching, it basically means it's the same repository. Got it. So that's really the way we did that. We are still young on that part. Secrets, we've been doing that for four years now, so we are quite confident with secrets; with IP leaks, a bit less. With that method, yes, we're going to have false positives, but as we've just shown you with Twitch, when there are matches, we will find them. Got it, got it. And of course, as a new tool, this is definitely going to be improving. Now, we're almost at the end of the hour, but we do have a couple more questions, and then we'll wrap up with some prize announcements. There are some questions about using this tool on GitLab and Bitbucket: can we use this tool on GitLab or Bitbucket repositories? I'm assuming the answer is yes, we can fingerprint the code from Bitbucket and GitLab. Yeah, yes. You can only check against public files on github.com, so the public part will only be github.com, but you can fingerprint your code
on GitHub, GitHub on-prem, GitLab, GitLab on-prem, Bitbucket, and Bitbucket on-prem. The only limitation is that the project needs to be based on Git and not on other tools like SVN or Mercurial. Right, yeah. And what is important to know is the scale of GitHub in terms of open source: GitHub is the place where code is going to leak publicly. When it comes to Bitbucket or GitLab, they're a lot more focused on enterprise customers being private, so there aren't as many open source communities there. All right, moving on: prizes. We do have some prizes for participation. Now, I do want to say that we have some Amazon gift cards to give away, and we're going to be giving these away to people who attended this webinar and are using the tool. So if you attended this webinar, give the tool a go using the same email that you used to sign up for this one; I will select some people and email them to let them know if they've won an Amazon gift card. But I'm just going to pull up some analytics on some of the most active people in this stream. It's going to give me a number, I'm just crunching some numbers at the moment. Here we go. All right, so we have a very active user who's in his second stream now. I don't know if I can pronounce this name, so I'm going to post it in the chat. I had to do this last time too, and I got the name wrong as well. So here we are, I'll give it a chance, and I apologize if I get this name wrong, but ola uh olo rafimi ibinzi in bin zitzer [Laughter] I'm very sorry, I think I butchered your name, but congratulations, you've won a swag bag, along with Ivana. So I'll post in the slideshow: congratulations. I'll email you with details, get a postal address from you, and we'll send out that swag bag. So again, I apologize about that name, but I gave it an attempt. But at this
point I'd like to thank everyone for joining this live stream, and Thomas particularly, I would like to thank you. I know we had some technical difficulties in this one, so thanks for staying with us. There will be a recording afterwards, and we will be publishing lots more information about this, so make sure you follow GitGuardian on the relevant channels. We look forward to seeing you all. These webinars are monthly, so also make sure you follow us on Crowdcast; that way you'll get notified whenever they're on. So again, thank you all for being here today. That concludes the webinar for today, so thanks again, and I look forward to seeing you guys next time.
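
The exact-match fingerprinting and the three risk buckets described in the Q&A (high risk, low risk, unknown) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not GitGuardian's actual implementation: the SHA-256 hash, the function names, the shape of public_index, and both thresholds (common_threshold, high_ratio) are hypothetical choices made only to show the idea. A file counts as a "unique match" here when its fingerprint appears in at least one but only a few public repositories; files seen everywhere (licenses, boilerplate) are treated as common and ignored.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Exact-match fingerprint: any change, even whitespace, gives a new hash."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def classify(private_files, public_index, common_threshold=10, high_ratio=0.5):
    """Return 'high', 'low', or 'unknown' for one private repository.

    private_files: {path: content} for the repository being checked.
    public_index: {fingerprint: number of public repositories containing it}
                  (a hypothetical stand-in for the public patch database).
    common_threshold and high_ratio are illustrative values, not real ones.
    """
    if not private_files:
        return "unknown"

    unique_matches = 0
    for content in private_files.values():
        seen_in = public_index.get(fingerprint(content), 0)
        # Matched publicly, but in few enough repositories to be specific.
        if 1 <= seen_in <= common_threshold:
            unique_matches += 1

    if unique_matches == 0:
        # No match at all, or only common boilerplate matched.
        return "low"
    if unique_matches / len(private_files) >= high_ratio:
        # A large share of the files match specifically: likely a mirror.
        return "high"
    return "unknown"


if __name__ == "__main__":
    repo = {
        "core.py": "def secret_sauce():\n    return 42\n",
        "util.py": "def helper():\n    pass\n",
        "LICENSE": "MIT License\n",
    }
    index = {
        fingerprint(repo["core.py"]): 1,      # found in one public repository
        fingerprint(repo["util.py"]): 1,
        fingerprint(repo["LICENSE"]): 50000,  # common file, ignored
    }
    print(classify(repo, index))
```

Because the fingerprint is an exact hash, reformatting a file or adding white space produces a different fingerprint and the match is lost, which is the limitation acknowledged in the answer about look-alike code. The approach is aimed at whole mirrored repositories, where many files match at once.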