The genesis and architecture of my CyberGordon project
CyberGordon, an aggregator of your cyber reputation checks
CyberGordon quickly provides you with threat and risk information about observables such as IP addresses or domain names by querying multiple threat intelligence sources.
Thanks to each source that provides free access to great threat intelligence against phishing and malware. Without them, CyberGordon would not exist.
Why CyberGordon?
Whether during my investigations at work or personal surfing sessions, I’m too lazy to use several sources to check if a domain or email address is suspicious or malicious. Some awesome OSINT tools exist, but I didn’t have one that aggregates them all into one simple web interface. On top of that, I wanted to start by building a usable and useful tool on AWS infrastructure that I could share with my entourage. I would have liked to share CyberGordon widely, but I’m constrained by the query limits that free API sources impose. Lastly, the lockdown during the COVID-19 crisis gave me a lot of time, a rare resource that contributed considerably to completing CyberGordon.
Well, why “Gordon”? As a Batman fan, I chose Commissioner James Gordon, a friend and reliable informant of the Dark Knight 🦇 🕵
In April 2021, Gordon became CyberGordon to better reflect its function.
When I built CyberGordon, I tried to follow several rules:
- Simple: get results after pasting observable(s)
Neither I nor my entourage would use a tool that has an extremely complex GUI or that requires sophisticated information for submission. The aim is to copy/paste one or more observables, even if they are in a messy format (listed, quoted or in CSV format), submit them in a single form and get a readable summary. I’m still working on improving this last point.
- Scalable: easily add new sources
Scalability often refers to the ability to automatically adjust capacity to user demand. My intention was more humble: I wanted an evolving system where I can easily add or update a source (called an engine) without impacting the existing ones and without adding delay to the processing of the user’s observable submission.
- Almost free: use adapted and cost-effective services
For my first tool on AWS, and as a non-profit service, cost was of course the most important criterion. All major cloud service providers offer free-tier usage for a few of their main services, for one year (sometimes less) or for life. I spent some time looking for simple and almost free AWS services before building a draft.
- Serverless: minimal maintenance
It started as a challenge: since this is a relatively simple tool, I tried to avoid managing a Linux server, even though I loved managing Debian servers previously… I wanted to test functions (containers of code), where you only manage the code, while the underlying layers (runtime, OS, hardware) are managed by AWS. Of course, code maintenance still has to be done, at least for each runtime update (Python 3.8 to Python 3.9, for example).
- Secure: apply best practices
Last but not least, even though the data handled is not confidential, I have applied some principles: all data, in motion and at rest, is encrypted with an AWS managed key (free); resource permissions are restricted to the minimum needed (least privilege); public exposure is limited; and management actions (API) and users’ HTTP request logs are stored for 6 months.
Attempt with Slack
Before hosting CyberGordon entirely on AWS, I tried to build a front-end on Slack as a ‘bot’, using the Slack Commands feature and processing the commands on AWS. It works like a charm with one engine, but with two or more it is a mess and unusable. Slack is not suitable for presenting multiple results; it is a good chat tool for “one question, one short response” exchanges, but not a reporting tool…
The Slack request is sent to an HTTPS endpoint (hosted on AWS API Gateway) and forwarded to the back-end. Results are then returned using the incoming webhook URL included in the Slack request. As you can see below, results quickly become unreadable when using 2 engines…
A teammate suggested that I generate reports on a webpage and send links to the Slack user. However, after a lot of thinking, this hybrid solution didn’t suit me because it limits the audience to the users of my Slack workspace.
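To make the Slack round trip described above more concrete, here is a minimal sketch of the two steps on the AWS side: reading the form-encoded command body and building the reply that would be posted back to the URL carried in the request. The field names `text` and `response_url` follow Slack's command payload; the function names, the `/gordon` command and the engine names are hypothetical, and this is not the code CyberGordon actually ran.

```python
import json
from urllib.parse import parse_qs


def parse_slash_command(body: str) -> dict:
    """Extract the fields we need from a form-encoded Slack command body."""
    fields = parse_qs(body)
    return {
        "text": fields.get("text", [""])[0],
        "response_url": fields.get("response_url", [""])[0],
    }


def build_reply(observable: str, results: dict) -> str:
    """Render one line per engine and return the JSON payload to POST back."""
    lines = [f"Results for {observable}:"]
    for engine, verdict in sorted(results.items()):
        lines.append(f"- {engine}: {verdict}")
    # "ephemeral" keeps the answer visible only to the user who asked.
    return json.dumps({"response_type": "ephemeral", "text": "\n".join(lines)})
```

With one engine the reply is a couple of lines; every extra engine adds more lines to the same chat message, which is where readability degrades.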
Current architecture and how it works
To get all AWS capabilities and the cheapest prices, all resources are hosted in the “US East (Northern Virginia)” region, except for 2 resources invoked near the user’s location.
Short summary: on the website (1) you paste an observable and submit it; the request is parsed (2), sent to a queue (3) and dispatched to engines (4) that query API sources. During this background work, you’re forwarded to the results page (5).
Components are represented on this diagram and described in detail later on.
1. Static website
All front-end web assets are stored in a single S3 bucket (object storage). To provide caching, reduced latency and encrypted traffic (HTTPS, TLS 1.3), a CloudFront distribution (CDN) is used; the S3 bucket policy only allows traffic from CloudFront.
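As an illustration of a bucket policy that only admits CloudFront, the sketch below builds one as a Python dictionary. It assumes the origin access control style of policy (service principal plus a source ARN condition); the article does not say which mechanism CyberGordon uses, and the bucket name and distribution ARN are placeholders.

```python
import json


def cloudfront_only_policy(bucket: str, distribution_arn: str) -> dict:
    """Bucket policy that lets a single CloudFront distribution read objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCloudFrontReadOnly",
                "Effect": "Allow",
                "Principal": {"Service": "cloudfront.amazonaws.com"},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Only requests made on behalf of this distribution are allowed.
                "Condition": {"StringEquals": {"AWS:SourceArn": distribution_arn}},
            }
        ],
    }


# Placeholder values, for illustration only.
policy = cloudfront_only_policy(
    "example-bucket",
    "arn:aws:cloudfront::111122223333:distribution/EXAMPLE",
)
policy_json = json.dumps(policy, indent=2)
```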
The domain cybergordon.com points to a CloudFront distribution and the DNS zone “cybergordon.com” is entirely managed by AWS Route 53 (DNS service).
2. Request pipeline
When you click on “Analyze!”, an HTTP POST request containing the observable(s) passes through the CyberGordon-Request Lambda@Edge function: Python code deployed to multiple geographical locations so that it executes closer to the user.
The CyberGordon-Request function generates a request ID (UUID version 4) and parses the observable(s) into a list of 7 predefined types: IPv4, FQDN, URL, MD5, SHA-1, SHA-256 and email address. Basically, the function compares the request body with 7 flexible regexes that accept a new line (\n) or a space between observables.
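The parsing step can be sketched roughly as follows. The seven patterns here are simplified stand-ins written for this example, not the regexes CyberGordon uses, and the function names are hypothetical. Tokens are split on whitespace, commas, semicolons and quotes, so a URL that itself contains a comma or semicolon would be cut; a real implementation would need to be more careful.

```python
import re
import uuid

# Illustrative patterns only; they are tried in order, most specific first.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PATTERNS = [
    ("url", re.compile(r"^https?://\S+$", re.IGNORECASE)),
    ("email", re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")),
    ("ipv4", re.compile(rf"^(?:{OCTET}\.){{3}}{OCTET}$")),
    ("sha256", re.compile(r"^[0-9a-fA-F]{64}$")),
    ("sha1", re.compile(r"^[0-9a-fA-F]{40}$")),
    ("md5", re.compile(r"^[0-9a-fA-F]{32}$")),
    ("fqdn", re.compile(r"^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,63}$")),
]


def parse_observables(body: str) -> dict:
    """Split a messy paste into tokens and sort each token into its type."""
    found = {name: [] for name, _ in PATTERNS}
    for token in re.split(r"[\s,;\"']+", body):
        if not token:
            continue
        for name, pattern in PATTERNS:
            if pattern.match(token):
                if token not in found[name]:  # drop duplicates, keep order
                    found[name].append(token)
                break  # first matching type wins
    return found


def new_request(body: str) -> dict:
    """Attach a fresh UUID version 4 to the parsed observables."""
    return {"request_id": str(uuid.uuid4()), "observables": parse_observables(body)}
```

For instance, `new_request('"8.8.8.8", example.com')` would place `8.8.8.8` under `ipv4` and `example.com` under `fqdn`; anything that matches no pattern is silently ignored.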
Then the user is forwarded (HTTP 302 Found) to the results page that is described below.
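A Lambda@Edge function can answer the viewer directly by returning a response object instead of forwarding the request. The sketch below builds such a 302 response; the `/r/<request ID>` path mirrors the result URL format used by the site, while the function name and the `Cache-Control` header are assumptions added for the example.

```python
def redirect_to_results(request_id: str) -> dict:
    """Build a Lambda@Edge response that sends the browser to the results page."""
    return {
        "status": "302",
        "statusDescription": "Found",
        "headers": {
            # Lambda@Edge expects lowercase header names mapping to key/value lists.
            "location": [{"key": "Location", "value": f"/r/{request_id}"}],
            "cache-control": [{"key": "Cache-Control", "value": "no-store"}],
        },
    }
```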
3. Queue pipeline
A small but powerful part that dispatches observables to engines depending on the types they can check against sources. The CyberGordon-Queue SNS topic receives the message and immediately sends a copy to each subscriber that accepts the submitted observable type(s).
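One way to get this behaviour from SNS is a subscription filter policy on each engine, matched against message attributes. The snippet below is a local simulation of that matching rule (all policy keys must match, any listed value is enough), not a call to AWS. The attribute name `types` and the engine names are hypothetical, and whether CyberGordon relies on filter policies is an assumption.

```python
def matches(filter_policy: dict, attributes: dict) -> bool:
    """Simplified SNS-style matching: every key must share at least one value."""
    for key, accepted in filter_policy.items():
        if not set(accepted) & set(attributes.get(key, [])):
            return False
    return True


def dispatch(subscribers: dict, attributes: dict) -> list:
    """Return the names of the subscribers that would receive a copy."""
    return sorted(
        name for name, policy in subscribers.items() if matches(policy, attributes)
    )


# Hypothetical engines and the observable types each one accepts.
SUBSCRIBERS = {
    "engine-ip-reputation": {"types": ["ipv4"]},
    "engine-domain": {"types": ["fqdn", "url"]},
    "engine-hashes": {"types": ["md5", "sha1", "sha256"]},
}

recipients = dispatch(SUBSCRIBERS, {"types": ["ipv4", "fqdn"]})
```

Here a request containing an IPv4 address and a domain reaches the first two engines, while the hash engine is never invoked. Adding an engine only means adding a subscriber with its own policy.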
4. Engine pipeline
Each CyberGordon-Engine Lambda function receives the list of observables simultaneously. The engine checks the integrity of the request (using the SHA-1 function), then retrieves, if applicable, the API token of the remote source from encrypted variables and queries the source over HTTPS. Finally, results are stored in a DynamoDB table (NoSQL database). Results from all engines are stored in a single database record. In the previous implementation, individual results were stored as S3 objects, which made retrieving them four times slower!
All engines query remote sources to get live and fresh information, except for the Offline Feeds engine (E23): an hourly CloudWatch Event Rule (scheduled task) invokes a Lambda function (Python code) that downloads the feeds, transforms them into JSON format and overwrites the existing feed content stored in the main S3 bucket.
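The integrity check and the single-record storage could look something like the sketch below. How the digest is actually computed and transported is not described, so the message layout (`observables` plus a `sha1` field) is an assumption, and a plain dictionary stands in for the DynamoDB table. Note that an unkeyed SHA-1 digest only catches accidental corruption, not deliberate tampering.

```python
import hashlib
import hmac
import json


def digest(observables: list) -> str:
    """SHA-1 over a canonical JSON encoding of the observables."""
    canonical = json.dumps(observables, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def verify(message: dict) -> bool:
    """Recompute the digest and compare it with the one sent in the message."""
    return hmac.compare_digest(digest(message["observables"]), message.get("sha1", ""))


def store_result(table: dict, request_id: str, engine: str, result: dict) -> None:
    """Merge one engine's result into the single record for this request."""
    record = table.setdefault(request_id, {"request_id": request_id, "results": {}})
    record["results"][engine] = result
```

Because every engine writes under its own key inside the same record, the results page needs a single read per request instead of one read per engine.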
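The transformation step of the scheduled function might resemble the following sketch, which assumes a plain-text feed with one indicator per line and `#` comments. The JSON layout and the function name are invented for the example; the download and the upload to S3 are left out.

```python
import json


def feed_to_json(raw: str, source: str) -> str:
    """Turn a text feed (one indicator per line, '#' comments) into JSON."""
    indicators = []
    for line in raw.splitlines():
        # Everything after '#' is treated as a comment, so this suits
        # IP or domain feeds rather than URLs containing fragments.
        entry = line.split("#", 1)[0].strip()
        if entry and entry not in indicators:
            indicators.append(entry)
    return json.dumps(
        {"source": source, "count": len(indicators), "indicators": indicators}
    )
```

Overwriting one JSON object per hour keeps the engine's lookup a simple read of the latest file, with no remote call at request time.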
5. Result pipeline
HTTP GET /get-result.
This HTTP call is handled by the CyberGordon-Results Lambda@Edge function (like the Request function). This function reads the database record that contains all results and returns it as a JSON document.
Result URL example: https://cybergordon.com/r/e3a3a0c9-33c0-46e1-a612-91788ee76d14
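As a rough sketch of the results step, the code below validates the request ID and returns the stored record as a JSON response in the Lambda@Edge response format. It assumes the ID is read from a `/r/<uuid>` path and uses a dictionary in place of the database; the function names and the 404 body are hypothetical.

```python
import json
import uuid


def request_id_from_uri(uri: str):
    """Return the canonical UUID from a '/r/<uuid>' path, or None if malformed."""
    prefix = "/r/"
    if not uri.startswith(prefix):
        return None
    candidate = uri[len(prefix):]
    try:
        parsed = uuid.UUID(candidate)
    except ValueError:
        return None
    # Accept only the canonical hyphenated form of a version 4 UUID.
    if str(parsed) != candidate.lower() or parsed.version != 4:
        return None
    return str(parsed)


def build_response(uri: str, table: dict) -> dict:
    """Look up the record for the request ID and wrap it in an edge response."""
    request_id = request_id_from_uri(uri)
    record = table.get(request_id) if request_id else None
    if record is None:
        status, description, body = "404", "Not Found", {"error": "unknown request"}
    else:
        status, description, body = "200", "OK", record
    return {
        "status": status,
        "statusDescription": description,
        "headers": {
            "content-type": [{"key": "Content-Type", "value": "application/json"}]
        },
        "body": json.dumps(body),
    }
```

Rejecting anything that is not a well-formed UUID before touching the database is a cheap form of input control.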
The current architecture is far from being perfect and suffers from several issues that are more or less obvious:
- Slowness when getting and merging results from each result object. I could merge the engines into one function that would generate a single result object; in that case the CyberGordon-Result function could be dropped.
- Reinforce security (input control).
- Improve the quality of the Python code; there is a long way to go…
- Provide a front-end API and a user account system.
Since 2020 I have made some improvements, which will be the subject of a future article:
- Back up config and code to S3.
- Industrialize deployment with a CI/CD pipeline and Infrastructure as Code with Terraform.
I’m open to any remarks that will help improve CyberGordon!
Thanks to Carole Boijaud, Youssef Sayegh and my darling for their careful proofreading.
Update April 2021: name, web domain, waiting page and result storage