Warmly Technical Challenge

Tasks

  • Link companies to the customer's own data, merging existing records (e.g. from their CRM) with the newly generated ones;
  • Have one single source of truth for all company data, regardless of where it comes from;
  • Structure the data in a way that enables efficient filtering:
    • Company profile details (industry, location, size, etc.)
    • How this company is relevant to the customer's business, if it is;
  • The filter engine should then feed back into the data ingestion services so they can get more accurate over time (remove the noise);
  • Ideally the data flow should be as close to “realtime” as it gets, so analytical reports and dashboards can be built out of it;
  • Build a scoring system based on all the available data so you can show the customer which companies are the best for them to engage.

Introduction

This is the technical diagram portion of the Warmly technical interview. Everything here is my conjecture based on what was presented in the prompt. I would want to collaboratively design this if it were a real project.

Coding Challenge

My coding challenge is located on my GitLab: kleaders/warmly

Architecture Diagram

Design Assumptions

  1. Cost matters: this is optimized for cost first
  2. Scalability: this should scale upwards (and downwards) quickly to meet the needs of customers
  3. Maintainability: I am designing for the ability to maintain this in several years. Everything becomes legacy, so let's try to make it less painful by keeping things simple.

Initial and Cheapest Scale

Let's start with a simple and pretty cheap version of the system that should work at a reasonable scale for a while. The Lambdas will autoscale well, and the Kubernetes cluster should scale automatically with some custom configuration.

 architecture-beta
    group warmly(logos:aws)[Warmly]
    group kubernetes(logos:aws-eks)[Kubernetes] in warmly

    service db(logos:postgresql)[Serverless PostgreSQL] in warmly
    service message_queue(logos:aws-sqs)[SQS Queue] in warmly
    service upload_bucket(logos:aws-s3)[Customer Upload Bucket] in warmly
    service archive_bucket(logos:aws-s3)[Event Archive Bucket] in warmly
    service file_ingester(logos:aws-lambda)[File Ingestion Lambda] in warmly
    service api_ingester(logos:aws-lambda)[API Ingestion Lambda] in warmly

    service postprocessor(logos:aws-lambda)[Postprocessing Lambda] in warmly

    service web_dashboard(logos:nodejs-icon)[Web Frontend] in kubernetes
    service events_api(logos:nodejs-icon)[Event API] in kubernetes

    service customer_crm(logos:salesforce)[Customer CRM]


    customer_crm:B <-- T:api_ingester
    customer_crm:B --> T:upload_bucket

    upload_bucket:B --> T:file_ingester

    message_queue:R <-- L:api_ingester
    message_queue:T <-- B:file_ingester
    message_queue:R --> L:postprocessor
    events_api:T --> B:message_queue

    postprocessor:R --> L:db

    message_queue:L --> R:archive_bucket

    web_dashboard:T <--> B:db

Ingestion

These are services that ingest data from customer sources into Warmly. There are several source types we need to design for.

  1. Flat files: These are pulled from customer systems or pushed by customers into our S3 bucket. A Lambda is triggered when a file shows up in the bucket.
  2. API Ingestion: Cron-triggered Lambdas that pull from API endpoints and write to the core data source. This is for CRMs that can't push to our Event API.
  3. Event Stream: These are microservices that expose endpoints allowing for event pushes, which are written to the message queue for processing and enrichment. These run in Kubernetes rather than Lambda because they are expected to be running constantly and holding open long-term connections to external customers.

SQS is chosen because it's cheap and easy to work with. There are downsides that Kafka solves (reprocessing, retries, stronger delivery guarantees), but at this scale Kafka is expensive.

Data Storage

In this architecture we have one primary data store, the PostgreSQL database, and a secondary archive of the raw events in S3.

I chose a Serverless (Aurora) PostgreSQL for a few reasons:

  • Serverless PostgreSQL is cost-efficient for the reliability and scalability you get.
  • It scales well for web and real-time querying, which will make up the bulk of the user experience. Customers won't want individual event data, but rather hourly/daily/monthly summaries of what prospects have done.
  • It's supported in all languages and it's easy to find engineers who understand it.

We pair this with an archival S3 bucket with a lifecycle policy that ages out old events. Having this backup allows us to reprocess or re-enrich older data if we need to. There are faster and better ways to handle this (like Kafka), but in my experience Kafka gets expensive pretty quickly, both in AWS usage and in human time. As we scale up we'd investigate adding Kafka to replace SQS and the archive bucket.

Data Schema

The fields in each of these data sets would obviously vary quite a bit by customer and data source, so I'll speak at a pretty high level here and focus on principles and outputs.

The primary types of data we'd want to collect and analyze are going to be things like Events, CRM Data, Company Data, People, Leads, and Customizations (user preferences and filtering). Our output will be recommendations for relevant companies and people to reach out to.

I'd encourage the teams to keep as much data in raw form as we reasonably can by archiving the raw events to the Archive Bucket. Doing that allows us to reprocess later without losing fidelity. It will cost more in storage, but storage is pretty cheap and this is primarily a data product, so we should preserve our data as much as we can afford to.

Filtering and Matching Companies

Filtering and matching companies against events and the customer's CRM data provides the core data that we base our results on.

A company schema might look like this:

CREATE TABLE companies (
    id SERIAL PRIMARY KEY,
    name TEXT,
    aliases TEXT[], -- company name aliases
    domains TEXT[], -- email or web domains
    ip_addresses cidr[], -- CIDR ranges that we can use to find a visitor via IP
    employees INTEGER DEFAULT 0 NOT NULL,
    country INTEGER references country(id),
    region INTEGER references regions(id), -- State, province, etc
    sub_region INTEGER references subregions(id) -- County, township, etc
);
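
Lookups against this table will mostly be by domain, alias, or IP, so we would probably want array indexes on those columns. A minimal sketch (index names are illustrative; efficiently indexing the CIDR ranges would need more thought):

CREATE INDEX companies_domains_idx ON companies USING GIN (domains); -- supports domains @> ARRAY[...] lookups
CREATE INDEX companies_aliases_idx ON companies USING GIN (aliases); -- supports alias containment checks
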
Enrichment lambda

Here is an example flow of how a postprocessing lambda might handle incoming events.

flowchart TB
    crm[CRM Data] --> message_queue(SQS)
    message_queue --> enricher[[Post Processing Lambda]]
    user_table(User) -. select by email domain .-> enricher
    company_table(Company) -. select by domain or ip .-> enricher
    enricher --> decision1{Identified Company?}
    decision1 -- yes --> daily_summary(upsert to DailySummary)
    decision1 -- no --> create_or_unknown(create new company or upsert to **unknown**)
    create_or_unknown -.-> company_table
    create_or_unknown --> daily_summary

This takes all events and pairs them with any known company via IP or email domain. You could also potentially use third-party cookies if they exist. If it doesn't find an existing company, it creates a new one or adds the event to an unknown catch-all.
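
In SQL, that company lookup could be a single query against the companies table. A rough sketch, using the same $-placeholder style as the other examples ($domain and $ip would be parsed out of the incoming event, and real code would need to handle multiple or ambiguous matches):

-- Hypothetical lookup used by the postprocessing lambda.
SELECT id
FROM companies
WHERE domains @> ARRAY[$domain::text]   -- exact domain match (covered by the GIN index above)
   OR $ip::inet <<= ANY (ip_addresses)  -- visitor IP falls inside a known CIDR range
LIMIT 1;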

The daily summary could contain many different fields based on the things we want to track, but in PostgreSQL it could look like this:

CREATE TABLE daily_summaries (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER references customers(id) NOT NULL,
    company_id INTEGER references companies(id) NOT NULL,
    day DATE DEFAULT current_date NOT NULL,
    clicks INTEGER DEFAULT 0 NOT NULL,
    visits INTEGER DEFAULT 0 NOT NULL,
    contacts INTEGER DEFAULT 0 NOT NULL,
    form_submission INTEGER DEFAULT 0 NOT NULL,
    chats INTEGER DEFAULT 0 NOT NULL,
    extra_signals JSONB DEFAULT '{}', -- Special signals that are less common, e.g. { "emails": 22, "phone_calls": 2, "logins": 0 }
    UNIQUE (customer_id, company_id, day) -- lets the upsert below target one row per customer/company/day
);

By extracting the most important signals to columns we can upsert easily by using something like this:

INSERT INTO daily_summaries (company_id, customer_id, day, clicks)
    VALUES ($company_id, $customer_id, $day, 1)
ON CONFLICT (customer_id, company_id, day)
    DO UPDATE SET clicks = daily_summaries.clicks + 1
RETURNING id;

Filtering Engine

Once we have the data ingested into summarized tables we can then filter and create the reports our customers care about. Doing this requires a customer to give us some basic information about how we should boost company signals or demote uninteresting ones. This information will be stored in another table connected to the customers table.

It's hard to get too detailed here because it strongly depends on customer feedback and research.

CREATE TABLE customer_filters (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER references customers(id) NOT NULL,
    field VARCHAR(255), -- This would be a field name like "industry" or "country"
    value VARCHAR(255), -- This would be the value, like "crypto" or "Belize"
    boost_by INTEGER -- The amount to boost the result (positive boosts, negative demotes)
);
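
As an illustration, using the example values from the comments above, a customer who wants to boost crypto companies and demote ones based in Belize might have rows like these (the customer id is made up):

-- Illustrative rows only; real values would come from customer preferences and research.
INSERT INTO customer_filters (customer_id, field, value, boost_by) VALUES
    (42, 'industry', 'crypto', 10),
    (42, 'country',  'Belize', -5);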

We then select all of the customer_filters in the filtering engine and apply them to the daily_summaries.

flowchart LR
    daily_summaries(daily_summaries) --> customer_filters(customer_filters)
    customer_filters --> boosted_summaries(Boosted Daily Summaries)
    boosted_summaries --> scoring[[Proprietary Scoring]]
    clicks -. weighted .-> scoring
    contacts -. weighted .-> scoring
    form_submission[form submissions] -. weighted .-> scoring
    scoring --> final_score(Final Weighted Company score)

Because we already have daily summaries of the signals, it's pretty quick to run this query in PostgreSQL and return these rankings in real time. I would probably also put a cron job together to create a nice daily or weekly report that handles this asynchronously and caches it, but it should be easy to run this filtering and boosting with sub-second latency.
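
A hedged sketch of what that query could look like: the signal weights and the 30-day window are illustrative assumptions, the country lookup table is assumed to have a name column, and each additional filterable field (industry, region, etc.) would get its own match branch in the boosts CTE.

-- Sketch only: weights, window, and the country-name match are assumptions.
WITH boosts AS (
    SELECT cf.customer_id, c.id AS company_id, SUM(cf.boost_by) AS boost
    FROM customer_filters cf
    JOIN country co ON cf.field = 'country' AND cf.value = co.name
    JOIN companies c ON c.country = co.id
    GROUP BY cf.customer_id, c.id
)
SELECT ds.company_id,
       SUM(ds.clicks * 1 + ds.contacts * 5 + ds.form_submission * 10)
           + COALESCE(MAX(b.boost), 0) AS score
FROM daily_summaries ds
LEFT JOIN boosts b
       ON b.customer_id = ds.customer_id
      AND b.company_id = ds.company_id
WHERE ds.customer_id = $customer_id
  AND ds.day >= current_date - 30
GROUP BY ds.company_id
ORDER BY score DESC
LIMIT 50;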

Scaling Up

At some point in the future this architecture will need to be rethought. PostgreSQL will still be a great place to store end-user, time-sensitive data for the web apps and reports. However, we would probably need to move the processing of events out of PostgreSQL and S3 into a proper queuing system like Kafka. Kafka is complex and expensive, but it would give us a lot more flexibility and safety in how we process things.

  architecture-beta
    group warmly(logos:aws)[Warmly]
    group kubernetes(logos:aws-eks)[Kubernetes] in warmly
    service db(logos:postgresql)[Serverless PostgreSQL] in warmly
    service message_queue(disk)[Kafka] in warmly
    service upload_bucket(logos:aws-s3)[Customer Upload Bucket] in warmly
    service archive_bucket(logos:aws-s3)[Event Archive Bucket] in warmly
    service file_ingester(logos:aws-lambda)[File Ingestion Lambda] in warmly
    service api_ingester(logos:nodejs-icon)[API Ingestion job] in kubernetes
    service postprocessor(logos:nodejs-icon)[Postprocessing jobs] in kubernetes
    service web_dashboard(logos:nodejs-icon)[Web Frontend] in kubernetes
    service events_api(logos:nodejs-icon)[Event API] in kubernetes
    service customer_crm(logos:salesforce)[Customer CRM]

    customer_crm:B --> T:api_ingester
    customer_crm:B --> T:upload_bucket
    upload_bucket:B --> T:file_ingester
    message_queue:T <-- T:api_ingester
    message_queue:L <-- R:file_ingester
    message_queue:B --> T:postprocessor
    message_queue:T --> B:archive_bucket
    message_queue:B <-- T:events_api
    postprocessor:T --> B:db
    web_dashboard:T <--> B:db

Sorry about the formatting; Mermaid's architecture plugin is still in beta, so it gets pretty messy with complex diagrams.

Notable things that would change:

  1. SQS replaced with Kafka: This provides better retries, the ability to route data more effectively, and the option to do some inline processing, like streaming SQL.
  2. Move processing Lambdas to Kubernetes: Because we're expecting higher throughput at this scale, it would be more cost-effective to move most processing into Kubernetes.
  3. Possibly remove the Event Archive bucket: If it is cost-effective, it would make more sense to store 100 days or so of events in Kafka and let Kafka be the primary store of events.

There are several other things that we'd need to consider, but I don't have enough information to properly decide now. These are items I'd keep my eye on but also seek active feedback from the teams.

Postgres Scaling

We'd need to constantly evaluate whether PostgreSQL is scaling effectively for the generation of reports. It's a great end-user store, but if we find we need a much larger set of data to properly generate reports, it would make sense to move to something more like Cassandra/ScyllaDB, Elasticsearch, or even real-time processing from S3. A lot of those decisions would depend on customer use cases and data scale.
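
Before reaching for a new store, one incremental lever worth evaluating is native range partitioning of the summary tables, so report queries only scan recent data and old data can be aged out cheaply. A minimal sketch, assuming we were willing to recreate and backfill daily_summaries as a partitioned table (the table name, partition names, and monthly granularity are all illustrative):

-- Sketch only: monthly partitions and names are illustrative assumptions.
CREATE TABLE daily_summaries_by_month (
    customer_id INTEGER NOT NULL,
    company_id  INTEGER NOT NULL,
    day         DATE NOT NULL,
    clicks      INTEGER DEFAULT 0 NOT NULL,
    -- ...remaining signal columns as in daily_summaries
    UNIQUE (customer_id, company_id, day) -- partitioned unique constraints must include the partition key
) PARTITION BY RANGE (day);

CREATE TABLE daily_summaries_2025_06
    PARTITION OF daily_summaries_by_month
    FOR VALUES FROM ('2025-06-01') TO ('2025-07-01');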

Data Processing tools

I have the web and data processing jobs listed here as Node.js apps, but at some point we may find that Python, Rust, or even Java handles them better. I have personal preferences, but I would work with the teams to find the best solution given our skills and the frameworks available.

Caching

Caching is omitted from these diagrams just for simplicity, but adding a CDN or a Redis server for hot paths would improve performance and possibly reduce cost.