Warmly Technical Challenge
Tasks
- Link companies to the customer's own data, merging existing records (e.g. from their CRM) with newly generated ones;
- Have one single source of truth for all company data, regardless of where it comes from;
- Structure the data in a way that enables efficient filtering:
  - Company profile details (industry, location, size, etc.)
  - How this company is relevant to the customer's business, if it is;
- The filter engine should then feed back into the data ingestion services so they can get more accurate over time (remove the noise);
- Ideally the data flow should be as close to “realtime” as it gets, so analytical reports and dashboards can be built out of it;
- Build a score system based on all the available data so you can show the customer which companies are the best for them to engage.
Introduction
This is the technical diagram portion of the Warmly technical interview. Everything here is my conjecture based on what was presented in the prompt. I would want to collaboratively design this if it were a real project.
Coding Challenge
My coding challenge is located over on my GitLab: kleaders/warmly
Architecture Diagram
Design Assumptions
- Cost matters: this is optimized for cost first
- Scalability: this should scale upwards (and downwards) quickly to meet the needs of customers
- Maintainability: I am designing for the ability to maintain this several years from now. Everything becomes legacy, so let's try to make it less painful by keeping things simple.
Initial and Cheapest Scale
Let's start with a simple and pretty cheap version of the site that should work at a reasonable scale for a while. The Lambdas will autoscale well, and the Kubernetes cluster should scale automatically with some custom configuration.
architecture-beta
    group warmly(logos:aws)[Warmly]
    group kubernetes(logos:aws-eks)[Kubernetes] in warmly

    service db(logos:postgresql)[Serverless PostgreSQL] in warmly
    service message_queue(logos:aws-sqs)[SQS Queue] in warmly
    service upload_bucket(logos:aws-s3)[Customer Upload Bucket] in warmly
    service archive_bucket(logos:aws-s3)[Event Archive Bucket] in warmly
    service file_ingester(logos:aws-lambda)[File Ingestion Lambda] in warmly
    service api_ingester(logos:aws-lambda)[API Ingestion Lambda] in warmly
    service postprocessor(logos:aws-lambda)[Postprocessing Lambda] in warmly
    service web_dashboard(logos:nodejs-icon)[Web Frontend] in kubernetes
    service events_api(logos:nodejs-icon)[Event API] in kubernetes
    service customer_crm(logos:salesforce)[Customer CRM]

    customer_crm:B <-- T:api_ingester
    customer_crm:B --> T:upload_bucket
    upload_bucket:B --> T:file_ingester
    message_queue:R <-- L:api_ingester
    message_queue:T <-- B:file_ingester
    message_queue:R --> L:postprocessor
    events_api:T --> B:message_queue
    postprocessor:R --> L:db
    message_queue:L --> R:archive_bucket
    web_dashboard:T <--> B:db
Ingestion
These are the services that ingest data from customer sources into Warmly. There are several source types we need to design for.
- Flat files: These are pulled or pushed by customers into our S3 bucket. A Lambda is triggered when a file shows up in the bucket (sketched below).
- API Ingestion: Cron-triggered Lambdas that pull from API endpoints and write to the core data source. This is for CRMs that can't push to our Event API.
- Event Stream: These are microservices that expose endpoints allowing for event pushes that are written to the message queue for processing and enrichment. These run in Kubernetes rather than Lambda because they are expected to be running constantly and holding open long-term connections to external customers.
SQS is chosen because it's cheap and easy to work with. There are downsides that Kafka solves (reprocessing, retries, better delivery guarantees), but at this scale Kafka is expensive.
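As a concrete illustration, here is a minimal sketch of what the File Ingestion Lambda could look like, assuming newline-delimited JSON uploads and the AWS SDK v3. The queue URL environment variable and message shape are my own placeholders, not part of the design above.

// file-ingester.ts -- a sketch only; assumes NDJSON uploads and AWS SDK v3.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { SQSClient, SendMessageBatchCommand } from "@aws-sdk/client-sqs";
import type { S3Event } from "aws-lambda";

const s3 = new S3Client({});
const sqs = new SQSClient({});
const QUEUE_URL = process.env.RAW_EVENTS_QUEUE_URL!; // placeholder env var

export const handler = async (event: S3Event): Promise<void> => {
  for (const record of event.Records) {
    // S3 URL-encodes the object key in event notifications.
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
    const res = await s3.send(
      new GetObjectCommand({ Bucket: record.s3.bucket.name, Key: key })
    );
    const body = await res.Body!.transformToString();

    // One raw event per line; forward to SQS in batches of 10 (the SQS batch limit).
    const lines = body.split("\n").filter((line) => line.trim().length > 0);
    for (let i = 0; i < lines.length; i += 10) {
      const entries = lines.slice(i, i + 10).map((line, idx) => ({
        Id: String(idx),
        MessageBody: JSON.stringify({ source: "file", bucket: record.s3.bucket.name, key, raw: line }),
      }));
      await sqs.send(new SendMessageBatchCommand({ QueueUrl: QUEUE_URL, Entries: entries }));
    }
  }
};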
Data Storage
In this architecture we have one primary data store, the Postgres database, and a secondary archive of the raw events in S3.
I chose Serverless (Aurora) PostgreSQL for a few reasons:
- Serverless PostgreSQL is cost efficient for the reliability and scalability you get.
- It scales well for web and real-time querying, which will make up the bulk of the user experience. Customers won't want individual event data, but rather hourly/daily/monthly summaries of what prospects have done.
- It's supported in all languages, and it's super easy to find engineers who understand it.
We pair this with an archival S3 bucket with a bucket lifecycle that ages out old events. Having this backup allows us to reprocess or re-enrich older data if we need to. There are faster and better ways to handle this (like Kafka), but in my experience Kafka pretty quickly gets expensive in AWS usage and in human time. As we scale up we'd investigate adding Kafka to replace SQS and the archive bucket.
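Normally the age-out rule would live in Terraform/CDK, but as a sketch of the intent, it is just an S3 lifecycle configuration. The bucket name and the 365-day retention below are placeholders.

// Archive bucket age-out sketch -- bucket name and retention are placeholders;
// in practice this rule would be managed by IaC rather than application code.
import { S3Client, PutBucketLifecycleConfigurationCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

export async function configureArchiveLifecycle(): Promise<void> {
  await s3.send(new PutBucketLifecycleConfigurationCommand({
    Bucket: "warmly-event-archive", // placeholder bucket name
    LifecycleConfiguration: {
      Rules: [{
        ID: "expire-raw-events",
        Status: "Enabled",
        Filter: { Prefix: "" },    // apply to every archived event
        Expiration: { Days: 365 }, // age out raw events after a year
      }],
    },
  }));
}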
Data Schema
The fields in each data set would obviously vary quite a bit by customer and data source, so I'll need to speak at a pretty high level here. I'll focus on principles and outputs.
The primary types of data we'd want to collect and analyze are going to be things like Events, CRM Data, Company Data, People, Leads, and Customizations (user preferences and filtering). Our output will be recommendations for relevant companies and people to reach out to.
I'd encourage the teams to keep as much data in raw form as we reasonably can by archiving the raw events to the Archive Bucket. Doing that would allow us to reprocess later without losing fidelity. It will be more expensive in storage costs, but storage is pretty cheap and this is primarily a data product, so we should preserve our data as much as we can afford to.
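To make the "keep it raw" idea concrete, here's a rough TypeScript shape for the envelope that could flow through the queue and land in the Archive Bucket. The field names are illustrative, not a committed schema.

// A rough envelope for raw events on the queue and in the archive -- illustrative only.
interface RawEventEnvelope {
  event_id: string;                         // UUID assigned at ingestion so reprocessing is idempotent
  customer_id: number;                      // which customer's pipeline this came from
  source: "file" | "api" | "event_stream";  // which ingestion path produced it
  received_at: string;                      // ISO-8601 timestamp set by the ingester
  payload: unknown;                         // the untouched source record (CRM row, web event, etc.)
}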
Filtering and Matching Companies
Filtering and matching companies to events and the customer's CRM data provides the core data that we base our results on.
A company schema might look like this:
CREATE TABLE company (
  id SERIAL PRIMARY KEY,
  name TEXT,
  aliases TEXT[],      -- company name aliases
  domains TEXT[],      -- email or web domains
  ip_addresses CIDR[], -- CIDR ranges that we can use to find a visitor via IP
  employees INTEGER DEFAULT 0 NOT NULL,
  country INTEGER REFERENCES country(id),
  region INTEGER REFERENCES regions(id),       -- State, province, etc.
  sub_region INTEGER REFERENCES subregions(id) -- County, township, etc.
);
Enrichment Lambda
Here is an example flow of how the Postprocessing Lambda might work with incoming events.
flowchart TB
    crm[CRM Data] --> message_queue(SQS)
    message_queue --> enricher[[Post Processing Lambda]]
    user_table(User) -. select by email domain .-> enricher
    company_table(Company) -. select by domain or ip .-> enricher
    enricher --> decision1{Identified Company?}
    decision1 -- yes --> daily_summary(upsert to DailySummary)
    decision1 -- no --> create_or_unknown(create new company or upsert to **unknown**)
    create_or_unknown -.-> company_table
    create_or_unknown --> daily_summary
This takes all events and pairs them with any known company via ip or email_domain. You could also potentially use third-party cookies if they exist. If it doesn't find an existing company, it creates a new one or adds it to an unknown catch-all.
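A rough sketch of the lookup that Lambda could run against the company table above, using the pg client: the operators are standard Postgres, but the function shape, the unknown-company fallback, and the index note are my assumptions.

// Company matching inside the Postprocessing Lambda -- a sketch, not the real thing.
// Assumes the company table above plus a pg Pool; UNKNOWN_COMPANY_ID is illustrative.
import { Pool } from "pg";

const pool = new Pool();        // connection settings come from PG* env vars
const UNKNOWN_COMPANY_ID = 0;   // hypothetical catch-all row

export async function matchCompany(emailDomain: string | null, ip: string | null): Promise<number> {
  // Try the email/web domain first; domains @> ARRAY[...] can use a GIN index on domains.
  if (emailDomain) {
    const byDomain = await pool.query(
      "SELECT id FROM company WHERE domains @> ARRAY[$1]::text[] LIMIT 1",
      [emailDomain.toLowerCase()]
    );
    if (byDomain.rows.length > 0) return byDomain.rows[0].id;
  }

  // Fall back to the CIDR ranges: "this IP is contained by any of the stored ranges".
  if (ip) {
    const byIp = await pool.query(
      "SELECT id FROM company WHERE $1::inet << ANY (ip_addresses) LIMIT 1",
      [ip]
    );
    if (byIp.rows.length > 0) return byIp.rows[0].id;
  }

  // No match: either create a new company here or fall through to the unknown bucket.
  return UNKNOWN_COMPANY_ID;
}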
The daily summary could contain many different fields based on things we want to track, but in Postgres it could look like this:
CREATE TABLE daily_summaries (
  id SERIAL PRIMARY KEY,
  customer_id INTEGER REFERENCES customer(id) NOT NULL,
  company_id INTEGER REFERENCES company(id) NOT NULL,
  day DATE DEFAULT CURRENT_DATE NOT NULL,
  clicks INTEGER DEFAULT 0 NOT NULL,
  visits INTEGER DEFAULT 0 NOT NULL,
  contacts INTEGER DEFAULT 0 NOT NULL,
  form_submission INTEGER DEFAULT 0 NOT NULL,
  chats INTEGER DEFAULT 0 NOT NULL,
  extra_signals JSONB DEFAULT '{}',      -- Less common signals, e.g. { "emails": 22, "phone_calls": 2, "logins": 0 }
  UNIQUE (customer_id, company_id, day)  -- one row per customer/company/day; the upsert below relies on this
);
By extracting the most important signals into columns, we can upsert easily with something like this:
INSERT INTO daily_summaries (customer_id, company_id, day, clicks)
VALUES ($customer_id, $company_id, $day, 1)
ON CONFLICT (customer_id, company_id, day)
DO UPDATE SET clicks = daily_summaries.clicks + 1;
Filtering Engine
Once we have the data ingested into summarized tables, we can filter and create the reports our customers care about. Doing this requires a customer to give us some basic information about how we should boost company signals or demote uninteresting ones. This information will be stored in another table connected to the customers table.
It's hard to get too detailed here because it strongly depends on customer feedback and research.
CREATE TABLE customer_filters (
  id SERIAL PRIMARY KEY,
  customer_id INTEGER REFERENCES customer(id) NOT NULL,
  field VARCHAR(255), -- This would be "industry" or "country"
  value VARCHAR(255), -- This would be the value, like "crypto" or "Belize"
  boost_by INTEGER    -- The amount to boost the result (positive boosts, negative demotes)
);
We then select all the customer_filters in the filtering engine and apply them to the daily_summaries.
flowchart LR
    daily_summaries(daily_summaries) --> customer_filters(customer_filters)
    customer_filters --> boosted_summaries(Boosted Daily Summaries)
    boosted_summaries --> scoring[[Proprietary Scoring]]
    clicks -. weighted .-> scoring
    contacts -. weighted .-> scoring
    form_submission[form submissions] -. weighted .-> scoring
    scoring --> final_score(Final Weighted Company score)
Because we already have daily summaries of the signals, it's pretty quick to run this query in Postgres and return these rankings in real time. I would probably also put a cron job together to create a nice daily or weekly report that handles this asynchronously and caches it, but it should be easy to run this filtering and boosting with sub-second latency.
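To ground the flow above, here's roughly what the boost-and-score query could look like when run from the reporting service with pg. The weights, the 30-day window, and the country join (which assumes the country table has a name column) are placeholders, and only one filter field is wired up.

// Boost-and-score sketch: hypothetical weights, a 30-day window, and a single
// filter field (country) wired up. Real field handling would be more general.
import { Pool } from "pg";

const pool = new Pool();

export async function topCompanies(customerId: number, limit = 50) {
  const { rows } = await pool.query(
    `WITH base AS (
       SELECT ds.customer_id,
              ds.company_id,
                SUM(ds.clicks)          * 1
              + SUM(ds.visits)          * 2
              + SUM(ds.contacts)        * 5
              + SUM(ds.form_submission) * 10 AS raw_score
       FROM daily_summaries ds
       WHERE ds.customer_id = $1
         AND ds.day >= CURRENT_DATE - INTERVAL '30 days'
       GROUP BY ds.customer_id, ds.company_id
     )
     SELECT b.company_id,
            b.raw_score + COALESCE(SUM(cf.boost_by), 0) AS final_score
     FROM base b
     JOIN company c ON c.id = b.company_id
     LEFT JOIN customer_filters cf
            ON cf.customer_id = b.customer_id
           AND cf.field = 'country'
           AND cf.value = (SELECT name FROM country WHERE id = c.country)
     GROUP BY b.company_id, b.raw_score
     ORDER BY final_score DESC
     LIMIT $2`,
    [customerId, limit]
  );
  return rows; // [{ company_id, final_score }, ...]
}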
Scaling Up
At some point in the future this architecture will need to be rethought. PostgreSQL will still be a great location to store end-user, time-sensitive data for the web apps and reports. However, we would probably need to move the processing of events out of Postgres and S3 into a proper queuing system like Kafka. Kafka is complex and expensive, but it would give us a lot more flexibility and safety in how we process things.
architecture-beta
    group warmly(logos:aws)[Warmly]
    group kubernetes(logos:aws-eks)[Kubernetes] in warmly

    service db(logos:postgresql)[Serverless PostgreSQL] in warmly
    service message_queue(disk)[Kafka] in warmly
    service upload_bucket(logos:aws-s3)[Customer Upload Bucket] in warmly
    service archive_bucket(logos:aws-s3)[Event Archive Bucket] in warmly
    service file_ingester(logos:aws-lambda)[File Ingestion Lambda] in warmly
    service api_ingester(logos:nodejs-icon)[API Ingestion job] in kubernetes
    service postprocessor(logos:nodejs-icon)[Postprocessing jobs] in kubernetes
    service web_dashboard(logos:nodejs-icon)[Web Frontend] in kubernetes
    service events_api(logos:nodejs-icon)[Event API] in kubernetes
    service customer_crm(logos:salesforce)[Customer CRM]

    customer_crm:B --> T:api_ingester
    customer_crm:B --> T:upload_bucket
    upload_bucket:B --> T:file_ingester
    message_queue:T <-- T:api_ingester
    message_queue:L <-- R:file_ingester
    message_queue:B --> T:postprocessor
    message_queue:T --> B:archive_bucket
    message_queue:B <-- T:events_api
    postprocessor:T --> B:db
    web_dashboard:T <--> B:db
Sorry about the formatting; MermaidJS's architecture plugin is still in beta, so it gets pretty messy with complex diagrams.
Notable things that would change:
- SQS replaced with Kafka: This provides better retries, the ability to route data more effectively, and the option to do some inline processing, like streaming SQL.
- Move processing Lambdas to Kubernetes: because we're expecting higher throughput at this scale, it would be more cost-effective to move most processing into Kubernetes.
- Possibly remove the Event Archive bucket: if it is cost-effective, it would make more sense to store 100 days or so of events in Kafka and let Kafka be the primary store of events.
There are several other things we'd need to consider, but I don't have enough information to decide properly now. These are items I'd keep my eye on while also seeking active feedback from the teams.
Postgres Scaling
We'd need to constantly evaluate whether Postgres is scaling effectively for report generation. It's a great end-user store, but if we find we need a much larger data set to properly generate reports, it would make sense to move to something more like Cassandra/ScyllaDB, Elasticsearch, or even real-time processing from S3. A lot of those decisions would depend on customer use cases and data scale.
Data Processing Tools
I have the web and data processing jobs listed here as Node.js apps, but at some point we may find that Python, Rust, or even Java handles them better. I have personal preferences, but I would work with the teams to find the best solution given our skills and the frameworks available.
Caching
Caching is omitted from these diagrams just for simplicity, but adding a CDN or a Redis server for hot paths would improve performance and possibly reduce cost.
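As a rough idea of what that hot-path cache could look like with Redis: the key format, the one-hour TTL, and the topCompanies helper (from the filtering sketch above) are all placeholders.

// Hot-path caching sketch with Redis (ioredis) -- key format, TTL, and the
// topCompanies helper from the earlier sketch are placeholders, not real APIs of ours.
import Redis from "ioredis";
import { topCompanies } from "./scoring"; // hypothetical module from the sketch above

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

export async function cachedTopCompanies(customerId: number) {
  const key = `top-companies:${customerId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const rows = await topCompanies(customerId);
  await redis.set(key, JSON.stringify(rows), "EX", 3600); // cache for an hour
  return rows;
}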