Web Crawler System Design

Shivansh Dubey
5 min read · May 12, 2024


Problem statement:

Build a service that takes a parent URL and discovers all the sub-URLs reachable from it. The stated constraint: hundreds of users submit crawl requests to the system every hour. The interface is a POST endpoint that accepts the parent URL and a GET endpoint that returns the status of a given jobId: running, failed, or success.

Requirement Gathering:

Functional Requirements :

  1. Crawl Web Pages: The crawler should be able to visit and retrieve web pages from the internet.
  2. URL Extraction: Extract URLs from HTML content to identify new pages to crawl.
  3. Content Indexing: Store the crawled content and metadata for later processing or analysis.
  4. Throttling and Politeness: Implement mechanisms to avoid overloading servers by controlling the rate of requests.
  5. Resuming Interrupted Crawls: Allow the crawler to resume crawling from where it left off in case of interruptions.
  6. Reporting and Logging: Provide logs and reports for monitoring and debugging purposes.
  7. Customizable Configuration: Allow users to customize crawl settings such as crawl depth, crawl time, max URLs, etc.
  8. Prevent infinite URL redirections.
  9. Determine job-finish criteria (a sample enum is sketched after this list).
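
The finish criteria and job states referenced later (FINISH_CRAWLER_CRITERIA, JOB_STATUS) could be modeled as simple enums. The specific values below are assumptions derived from requirement 7 and the status values in the problem statement, not a fixed contract:

    // Hypothetical enums; the exact values are assumptions based on the
    // configurable settings above (crawl depth, crawl time, max URLs).
    public enum FinishCrawlerCriteria {
        MAX_DEPTH_REACHED,      // stop once BFS reaches a configured depth
        MAX_URLS_REACHED,       // stop once a configured number of URLs is extracted
        MAX_CRAWL_TIME_REACHED, // stop once the job exceeds a configured duration
        NO_MORE_URLS            // stop when the crawl frontier is empty
    }

    public enum JobStatus {
        CREATED, RUNNING, SUCCESS, FAILED
    }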

Non Functional Requirements:

1 — Storage Estimation

  • Daily active users: ~10M
  • Avg requests per user per day: 2
  • Avg URLs found per job: 1,000
  • Avg URL size: ~100 bytes (about 100 characters at 1 byte each)
  • Total storage for 1 year: 10,000,000 × 2 × 1,000 × 100 bytes × 365 ≈ 7.3 × 10^14 bytes ≈ 730 TB (~700 TB)

2 — Memory Estimation:

  • Avg time to complete a job: ~24 hours
  • Daily requests: 10M users × 2 ≈ 20M jobs
  • Avg size of one URL content hash: ~50 bytes
  • Avg size of all URL hashes for one job: 50 bytes × 1,000 pages ≈ 50,000 bytes
  • Total memory estimate: 20,000,000 jobs × 50,000 bytes ≈ 10^12 bytes ≈ 1 TB, held for roughly the 1-day job window

3 — Write-to-Read Ratio: ~1000:1, i.e. a write-heavy platform, so read latency is not the primary concern here. A quick back-of-envelope check of these figures is sketched below.
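
A minimal Java sketch of the arithmetic above; the constants mirror the assumptions listed in the bullets, and decimal units are used (1 TB = 10^12 bytes):

    // Back-of-envelope check of the estimates above.
    public class CrawlerEstimates {
        public static void main(String[] args) {
            long dailyUsers      = 10_000_000L; // ~10M DAU
            long requestsPerUser = 2L;
            long urlsPerJob      = 1_000L;
            long bytesPerUrl     = 100L;        // ~100 chars at 1 byte each

            long storagePerDay  = dailyUsers * requestsPerUser * urlsPerJob * bytesPerUrl;
            long storagePerYear = storagePerDay * 365;
            System.out.println("Storage/day  : " + storagePerDay  / 1_000_000_000_000L + " TB"); // ~2 TB
            System.out.println("Storage/year : " + storagePerYear / 1_000_000_000_000L + " TB"); // ~730 TB

            long bytesPerHash    = 50L;
            long hashBytesPerJob = bytesPerHash * urlsPerJob;    // ~50 KB per job
            long dailyJobs       = dailyUsers * requestsPerUser; // ~20M jobs/day
            long memoryPerDay    = dailyJobs * hashBytesPerJob;
            System.out.println("Hash memory/day : " + memoryPerDay / 1_000_000_000_000L + " TB"); // ~1 TB
        }
    }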

High Level Design:

API Design:

API to create a web-crawling request for a page

POST /v1/url/scan
Body:
{
  "root_url": "<String>",
  "user_id": <Long>,
  "finish_criteria": "<FINISH_CRAWLER_CRITERIA>"
}
Headers:
Authorization: Bearer <token>
API-X-TOKEN: <token>

Response:
StatusCode: 200
{
  "request_status": "<String>",
  "jobId": "<String>"  // UUID
}
StatusCode: 429
{
  "error_message": "Too many requests, user quota exhausted!"
}
StatusCode: 500
{
  "error_message": "Something went wrong, please try again after some time!"
}
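
A minimal controller sketch for this endpoint, assuming Spring Boot; JobCreationService, ScanRequest, and ScanResponse are illustrative names and not part of the design itself:

    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.*;

    // Sketch assuming Spring Boot; per-user quota checks are assumed to be
    // enforced at the API gateway before the request reaches this service.
    @RestController
    @RequestMapping("/v1/url")
    public class UrlScanController {

        private final JobCreationService jobCreationService; // hypothetical service

        public UrlScanController(JobCreationService jobCreationService) {
            this.jobCreationService = jobCreationService;
        }

        @PostMapping("/scan")
        public ResponseEntity<ScanResponse> scan(@RequestBody ScanRequest request,
                                                 @RequestHeader("API-X-TOKEN") String token) {
            // Create the job document and kick off the async crawl.
            String jobId = jobCreationService.createJob(request.getRootUrl(),
                                                        request.getUserId(),
                                                        request.getFinishCriteria());
            return ResponseEntity.ok(new ScanResponse("CREATED", jobId));
        }
    }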

API to retrieve job status

GET /v1/job/status?job_id=<jobID>

Response:
StatusCode: 200
{
  "job_status": "<JOB_STATUS>"
}

StatusCode: 404
{
  "error_message": "No job found with given jobID"
}

StatusCode: 500
{
  "error_message": "Something went wrong, please try again after some time!"
}

API to retrieve crawled URLs for a given jobId

GET /v1/urls?job_id=<job_id>

Response:
StatusCode: 200
{
  "job_id": "<String>",
  "job_status": "<JOB_STATUS>",
  "job_completion_time": "<LocalDateTime>",
  "extracted_urls": ["url1", "url2", ...]
}

StatusCode: 202
{
  "job_id": "<String>",
  "job_status": "<JOB_STATUS>",
  "message": "Job is not yet complete"
}

StatusCode: 404
{
  "error_message": "No job found with given jobID"
}

StatusCode: 500
{
  "error_message": "Something went wrong, please try again after some time!"
}
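
The read side can be sketched similarly, again assuming Spring Boot; JobStore and JobView are hypothetical helper types used only for illustration:

    import java.util.Map;
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.*;

    // Sketch of the job-status lookup, assuming Spring Boot.
    @RestController
    public class JobStatusController {

        private final JobStore jobStore; // hypothetical read-side store

        public JobStatusController(JobStore jobStore) {
            this.jobStore = jobStore;
        }

        @GetMapping("/v1/job/status")
        public ResponseEntity<?> status(@RequestParam("job_id") String jobId) {
            JobView job = jobStore.findById(jobId);
            if (job == null) {
                return ResponseEntity.status(404)
                        .body(Map.of("error_message", "No job found with given jobID"));
            }
            return ResponseEntity.ok(Map.of("job_status", job.getStatus()));
        }
    }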

Database Design:

To store job details, we can use a document store like MongoDB; the document structure will look like:

{
  "jobId": "<String>",
  "root_url": "<String>",
  "finish_criteria": "<FINISH_CRAWLER_CRITERIA>",
  "status": "<JOB_STATUS>",
  "created_at": "<LocalDateTime>",
  "last_updated_at": "<LocalDateTime>",
  "job_completed_at": "<LocalDateTime>",
  "extracted_urls": ["url1", "url2", ...]
}
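
Assuming Spring Data MongoDB, the job document could map to an entity like the following sketch; the field names mirror the document above and the annotations are the standard Spring Data ones:

    import java.time.LocalDateTime;
    import java.util.List;
    import org.springframework.data.annotation.Id;
    import org.springframework.data.mongodb.core.mapping.Document;

    // Sketch of the job document as a Spring Data MongoDB entity; uses the
    // FinishCrawlerCriteria and JobStatus enums described earlier.
    @Document(collection = "jobs")
    public class JobDocument {
        @Id
        private String jobId;
        private String rootUrl;
        private FinishCrawlerCriteria finishCriteria;
        private JobStatus status;
        private LocalDateTime createdAt;
        private LocalDateTime lastUpdatedAt;
        private LocalDateTime jobCompletedAt;
        private List<String> extractedUrls;
        // getters/setters omitted for brevity
    }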

To store page-level information for each URL, we can persist it in Cassandra:

CREATE TABLE keystore.page_by_page_hash (
  page_hash varchar,
  job_id bigint,
  url text,
  level int,
  child_urls list<text>,
  PRIMARY KEY ((page_hash), job_id)
);
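
Assuming Spring Data Cassandra, the PageRecord used in the algorithm below could be mapped to this table roughly as follows; this is a sketch, not a definitive mapping:

    import java.util.List;
    import org.springframework.data.cassandra.core.cql.PrimaryKeyType;
    import org.springframework.data.cassandra.core.mapping.Column;
    import org.springframework.data.cassandra.core.mapping.PrimaryKeyColumn;
    import org.springframework.data.cassandra.core.mapping.Table;

    // Sketch assuming Spring Data Cassandra; mirrors the table definition above.
    @Table("page_by_page_hash")
    public class PageRecord {
        @PrimaryKeyColumn(name = "page_hash", ordinal = 0, type = PrimaryKeyType.PARTITIONED)
        private String pageHash;

        @PrimaryKeyColumn(name = "job_id", ordinal = 1, type = PrimaryKeyType.CLUSTERED)
        private long jobId;

        @Column("url")
        private String url;

        @Column("level")
        private int level;

        @Column("child_urls")
        private List<String> childUrls;
        // getters/setters omitted for brevity
    }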

Basic Algorithm:

JobResponse processNewRequest(Request request) {
    boolean isValid = validateRequestAndRateLimit(request);
    if (!isValid) {
        throw new InvalidRequestException("Invalid request");
    }
    String jobId = UUID.randomUUID().toString();
    JobDocument doc = buildJobDocument(jobId, request);
    persistJobDocument(doc);
    // kick off the crawl asynchronously, starting at BFS level 1
    scanUrlAsync(createKafkaRecord(doc.getRootUrl(), jobId, 1));
    return new JobResponse(JobStatus.CREATED, jobId);
}

void scanUrlAsync(KafkaRecord record) {
    String url = record.get("url");
    int level = record.getInt("level");
    String jobId = record.get("jobId");

    Page currentPage = pageParser.get(url);
    String pageHash = currentPage.getPageHash();

    // de-duplication: skip pages whose content hash we have already seen
    // (this also breaks infinite redirect/link loops)
    PageRecord existing = cassandraTemplate.get(TBL_PAGE_RECORD, pageHash);
    if (existing != null) {
        log.info("Page already computed {}", url);
        return;
    }

    List<String> childUrls = urlExtractor.get(currentPage);
    PageRecord newRecord = new PageRecord(jobId, url, level, pageHash, childUrls);
    cassandraTemplate.save(TBL_PAGE_RECORD, newRecord);

    JobDocument jobDoc = mongoDb.get(jobId);
    jobDoc.getExtractedUrls().add(url);
    mongoDb.save(jobDoc);

    // check the finish criteria (max depth, max URLs, max crawl time, ...)
    boolean isJobFinished = validateJob(newRecord, jobDoc);
    if (isJobFinished) {
        log.info("Job finished for jobId {}", jobId);
        return;
    }
    // fan out: enqueue every child URL for the next BFS level
    childUrls.forEach(child -> scanUrlAsync(createKafkaRecord(child, jobId, level + 1)));
}

High Level Design:

Deep Dive:

The implementation above is a scalable version of the BFS algorithm.
Requests are authenticated and routed through the API gateway layer and load balancers.
A job-creation request lands on the Job Creation Service, which creates a new job and initiates web crawling asynchronously by dropping a message on the Kafka topic url_scan_topic_v1.

Messages are polled from url_scan_topic_v1 by the URL Scanner Service, which is responsible for keeping the crawl polite by making sure pages of a particular domain are not bombarded with requests. This is achieved by processing all page extractions for the same domain sequentially: the domain name is used as the key when pushing messages to the topic page_download_topic_v1. The URL Scanner Service also makes sure we do not re-process an already processed URL, guards against infinite loops via page hashing, and updates the job status.
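
A minimal sketch of how this per-domain sequencing could be achieved with the standard Kafka Java client; the PageDownloadPublisher class and the JSON payload shape are illustrative, only the topic name comes from the design above:

    import java.net.URI;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Sketch: key page-download messages by host so one partition (and hence one
    // consumer) handles each domain sequentially, keeping the crawl polite.
    public class PageDownloadPublisher {

        private final KafkaProducer<String, String> producer;

        public PageDownloadPublisher(Properties kafkaProps) {
            this.producer = new KafkaProducer<>(kafkaProps);
        }

        public void publish(String jobId, String url, int level) {
            String domain = URI.create(url).getHost(); // e.g. "example.com"
            String payload = String.format(
                    "{\"jobId\":\"%s\",\"url\":\"%s\",\"level\":%d}", jobId, url, level);
            // Same key => same partition => URLs of one domain are processed in order.
            producer.send(new ProducerRecord<>("page_download_topic_v1", domain, payload));
        }
    }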

The Page Downloader Service consumes the Kafka event containing the URL to be downloaded. The content of the URL is downloaded, persisted in Cassandra against the pageHash key, and a message is produced to a Kafka topic for content extraction.
The Content Extraction Service scans the page content for child URLs and updates them in the page record. Every newly found URL is redirected to url_scan_topic_v1 for the next level of the BFS.
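
A sketch of the hash-and-extract step, assuming Jsoup for HTML parsing and SHA-256 over the raw HTML as the pageHash; the PageProcessor class and its method names are illustrative:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HexFormat;
    import java.util.List;
    import java.util.stream.Collectors;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Sketch of the hash + child-URL extraction used for de-duplication and BFS fan-out.
    public class PageProcessor {

        public String pageHash(String html) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(html.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        }

        public List<String> extractChildUrls(String html, String baseUrl) {
            Document doc = Jsoup.parse(html, baseUrl);
            // "abs:href" resolves relative links against baseUrl.
            return doc.select("a[href]").stream()
                    .map(link -> link.attr("abs:href"))
                    .filter(href -> href.startsWith("http"))
                    .distinct()
                    .collect(Collectors.toList());
        }
    }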

Kafka is used here for async computation.

The URL Scanner, Page Downloader, and Content Extraction microservices can each be scaled independently as and when required.

Job details are stored in a document database (MongoDB), and page-related details are stored in ScyllaDB or Cassandra.

System Health Monitoring:

The system will be monitored with Grafana and Prometheus to keep track of topic lag, API RPS, etc.
Logs will be persisted and monitored via Elasticsearch and Kibana.
Rate limiting against the per-user quota will be handled at the API gateway layer itself.
All requests will be authenticated via an OAuth2 authentication mechanism.
