
In the modern world, every application connected to the internet faces a variety of threats, some more serious than others. An application with the complexity and data value of S3 Passenger must be protected from these threats in order to safeguard Sqills’ and its customers’ interests.

Many of the threats S3 Passenger faces are managed by measures implemented in both the infrastructure and the application code. However, many of these measures rely on information that must be kept private and secure. Public exposure of this information is therefore a security risk in itself, as it may allow existing security measures to be circumvented. For example: what happens when a developer publishes an API key online?

Before these potential “data” threats can be managed, they first have to be discovered. To achieve this, Sqills has developed an innovative application that allows for the targeted discovery and analysis of threats. Any new threat discovered by this application is reported to Sqills as soon as it is found, so that action can be taken to protect the integrity of S3 Passenger.

Finding a needle in a haystack

Discovering threats on the internet may sound easy, often prompting responses such as “Build a web scraper! We’ve been doing that since the ’90s.” However, the internet is already a huge collection of information, and with approximately half a million web pages created every day, it is still growing. Scraping and analysing all of that information is not only unnecessary but also a waste of resources. So how do we find only the pages that are of interest to us?

The first step is to limit the services that will be used as sources: as far as Sqills is concerned, the only services worth monitoring are those used to communicate with customers. This by itself greatly limits the number of files that could potentially be analysed. However, 99% of these files are still irrelevant to S3 Passenger and its security, which is why a secondary process called dorking is used to narrow down the pages that will be analysed further.

Getting dorky

Dorking is a term that may be familiar to those of you who have done reconnaissance work for bug-bounty hunting or hacking purposes. To simplify: imagine you’re playing “Go Fish” with Google, except you ask for all pages ending in .php instead of the ace of spades. The purpose of a search function is to give you what you ask for, so just like in “Go Fish” it has to comply and hand over exactly what you asked for. For our use case we can use this process to find documents containing information related to Sqills that shouldn’t be publicly available.

Any service that provides advanced search functionality for querying the data within its system is vulnerable to this kind of attack. Many code-hosting and sharing platforms support such functionality, which allows our application to access a wide range of information and check whether anything has been leaked. For example, querying GitHub with the following search query will find .env and .yml files containing the text API_SECRET followed by any characters.

ext:env|ext:yml allintext:"API_SECRET*"
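As an illustration, the sketch below shows how an automated discovery function might issue a similar query against GitHub’s code search API. Note that GitHub’s own search qualifiers differ from the Google-style dork above, and the token handling, query terms and result handling here are illustrative assumptions rather than Sqills’ actual implementation.

# Hypothetical sketch: running a dork-style query against GitHub's code search API.
# GITHUB_TOKEN and the query terms are placeholder assumptions.
import os
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]  # code search requires authentication

def search_leaked_secrets(query="API_SECRET extension:env"):
    response = requests.get(
        "https://api.github.com/search/code",
        params={"q": query, "per_page": 100},
        headers={
            "Authorization": f"token {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        timeout=10,
    )
    response.raise_for_status()
    # Each item points at a file that matched the dork and is a candidate for deeper analysis.
    return [item["html_url"] for item in response.json()["items"]]

if __name__ == "__main__":
    for url in search_leaked_secrets():
        print(url)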

Kowalski, analysis!

Dorks provide us with a list of files that, based on a very broad and generalised search query, might include sensitive information. Due to the limitations of even advanced search implementations the resulting list will always contain false-positive results.

To filter out these false positives (or, for the data scientists among you, type I errors) we need to add another layer of analysis. The filtering and extraction of results is done by analysing the page contents with regular expressions and logical analysis. The regular expressions are used to find and report information that follows specific formats, such as API keys and AWS access key pairs. Logical analysis is applied to find information without a specific format; examples of this are passwords and passphrases, for which Shannon entropy is used as the metric for potential hits.
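To make this concrete, here is a minimal sketch of the two kinds of checks described above: a regular expression for a well-known key format (the AWS access key ID pattern) and a Shannon-entropy score for candidate passwords or passphrases. The exact patterns and the entropy threshold are illustrative assumptions, not the production rules.

# Illustrative sketch of the two analysis layers described above.
# The regex patterns and the entropy threshold are assumptions for demonstration.
import math
import re
from collections import Counter

# Format-based detection: AWS access key IDs follow a well-known pattern.
AWS_ACCESS_KEY_RE = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")

def shannon_entropy(value: str) -> float:
    """Bits of entropy per character; high values suggest randomly generated secrets."""
    counts = Counter(value)
    length = len(value)
    return -sum((n / length) * math.log2(n / length) for n in counts.values())

def analyse(page_content: str, entropy_threshold: float = 4.0) -> dict:
    hits = {
        "aws_access_keys": AWS_ACCESS_KEY_RE.findall(page_content),
        "high_entropy_strings": [],
    }
    # Logical analysis: flag long tokens whose entropy suggests a password or passphrase.
    for token in re.findall(r"\S{16,}", page_content):
        if shannon_entropy(token) > entropy_threshold:
            hits["high_entropy_strings"].append(token)
    return hits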

Caching in on infrastructure

The diagram above shows the infrastructure used to provide a performant scraping system based on the techniques described earlier. The data management and process orchestration capabilities of the system are provided by a permanent server, the “backend” group in the diagram. Within this server, event-driven systems are responsible for orchestrating the analysis process. The analysis workload is split between two fleets of serverless functions: one for the dorking services, dubbed “Discovery”, and one for the deeper analysis, dubbed “Analysis”. These are fleets because each dork and each analysis step runs as a separate function, which increases scalability and reduces the complexity of each function.

Communication between the event-driven system and the serverless function fleets takes place through asynchronous channels such as notification topics and message queues. This allows each component of the analysis process to complete its work independently, without blocking or waiting on the others.
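As a hedged illustration of that decoupling, the snippet below assumes an AWS-style setup in which the backend publishes discovered page locations onto a message queue and each serverless analysis function consumes them independently. The queue URL, message shape and handler signature are assumptions for the sketch; the post does not name the exact services used.

# Sketch of queue-based hand-off between the backend and the analysis fleet,
# assuming an AWS SQS queue; the queue URL and message format are hypothetical.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/analysis-queue"  # placeholder

def enqueue_discovered_page(location: str) -> None:
    """Discovery side: hand a page off for analysis without waiting for the result."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"location": location}))

def handle_analysis_batch(event, context):
    """Serverless analysis function: each queued record is processed independently."""
    for record in event["Records"]:
        location = json.loads(record["body"])["location"]
        # ... fetch the cached content for `location` and run the analysis checks ...
        print(f"analysing {location}")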

So at this point we have two layers of analysis: one that relies on 3rd party services and their advanced search functionality, and one that relies on page content. Both require calls to 3rd party services to retrieve their input or output data, and both are subject to rate limiting. This is especially true because discovery and analysis are distributed over multiple serverless functions, each of which makes its own requests. So how do we keep the system performant without overloading the 3rd party services with requests? The answer, as so often in software development, is caching.

A Redis database is added as a caching service for the serverless function fleets. The discovery fleet is responsible for retrieving the file content of each page it discovers and writing it to the Redis cache. Once the content is in the cache, identified by its location as the key, the analysis fleet can retrieve the page contents without hitting the 3rd party service hosting the content. This lowers the chance of hitting rate limits, reduces data transfer costs and improves analysis throughput.
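A minimal sketch of that caching layer, assuming the standard Python Redis client: the discovery side writes page content keyed by its location, and the analysis side reads from the cache instead of re-fetching from the hosting service. The host name and expiry time are assumptions.

# Sketch of the Redis cache shared by the discovery and analysis fleets.
# Host, port and TTL are placeholder assumptions.
import redis

cache = redis.Redis(host="cache.internal", port=6379, decode_responses=True)

def store_page(location: str, content: str, ttl_seconds: int = 3600) -> None:
    """Discovery fleet: cache the fetched content under its location as the key."""
    cache.set(location, content, ex=ttl_seconds)

def load_page(location: str) -> str | None:
    """Analysis fleet: read from the cache instead of hitting the 3rd party service."""
    return cache.get(location)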

With the system designed as described above, Sqills is currently capable of completing a scan of a single page in 36 ms on average. Extrapolated over the full 90-minute daily scanning window (5,400,000 ms / 36 ms per page), the system can discover, scan and process 150,000 pages a day.

Into the future

The system described above detects leaks after they have happened, but prevention is always preferred over mitigation. Future plans may therefore include an easy-to-use tool that applies the analysis processes developed for this project to scan for and report on issues before information is even shared.

This new capability is not linked to S3 Passenger and could also be reused by other organisations. We are investigating publication as an open source project.

Interested? Please contact us.

About the author

I’m Matthew and I’ve been working at Sqills for almost two years now, until recently as a part-time employee alongside my studies. Designing and implementing this system was the graduation assignment for my bachelor’s degree in software engineering. My thesis, “Designing and implementing a scalable web scraping system to detect, notify and observe leaked sensitive information on public web platforms”, was graded a nine, and I graduated in January 2021. Since graduating I have been a full-time employee at Sqills as part of the platform tools development team, where I work on log aggregation, Kubernetes deployment and security-related tooling.

Do you want to know more? Contact Matthew here!