AI & Systems · 10 min read

Browser Automation at Scale: 50K Tasks Daily

How we architected an enterprise browser automation platform processing 50K+ daily tasks with 99.9% reliability using Selenium and Playwright.

Anshuman Parmar

October 2025

Introduction

When I joined Thunder Marketing Corporation, we had a challenge: automate browser-based workflows at enterprise scale. Not hundreds of tasks, but tens of thousands daily, with a 99.9% reliability requirement.

This article shares how we built a browser automation platform processing 50K+ tasks daily, the architectural decisions that made it possible, and the lessons learned along the way.

The Challenge

Our requirements were demanding:

  • Volume: 50,000+ automated tasks per day
  • Reliability: 99.9% success rate (only 50 failures allowed per day)
  • Latency: Most tasks complete within 30 seconds
  • Diversity: Handle multiple websites with different structures
  • Resilience: Graceful degradation when target sites change

Traditional automation approaches couldn't meet these requirements.
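
To make those targets concrete, the short calculation below turns them into a failure budget and a minimum concurrency figure. It is a back-of-the-envelope sketch, not our sizing tool: it assumes tasks arrive evenly over 24 hours and that every task takes the full 30 seconds, and the function name `capacity_estimate` is illustrative.

```python
import math

def capacity_estimate(tasks_per_day: int, success_target: float, task_seconds: float) -> dict:
    """Back-of-the-envelope sizing from daily volume, reliability target and task duration."""
    seconds_per_day = 24 * 60 * 60
    arrival_rate = tasks_per_day / seconds_per_day  # tasks per second
    # Little's law: average tasks in flight = arrival rate x time per task
    avg_in_flight = arrival_rate * task_seconds
    failure_budget = round(tasks_per_day * (1 - success_target))
    return {
        "tasks_per_second": round(arrival_rate, 2),
        "min_concurrent_tasks": math.ceil(avg_in_flight),
        "failures_allowed_per_day": failure_budget,
    }

estimate = capacity_estimate(50_000, 0.999, 30)
print(estimate)
```

Under those assumptions the load averages about 0.58 tasks per second with at least 18 tasks in flight at any moment. Real traffic is bursty, so the worker pool needs headroom well above that floor.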

Architecture Overview

We built a distributed system with these components:

text
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Task Queue    │────▶│   Worker Pool   │────▶│  Result Store   │
│     (Redis)     │     │  (Kubernetes)   │     │  (PostgreSQL)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         │               ┌───────────────┐               │
         └──────────────▶│   Scheduler   │◀──────────────┘
                         │   (FastAPI)   │
                         └───────────────┘

Component Breakdown

  1. Task Queue (Redis): Holds pending tasks with priority levels
  2. Worker Pool (Kubernetes): Scalable browser workers running Playwright
  3. Scheduler (FastAPI): Orchestrates task distribution and retries
  4. Result Store (PostgreSQL): Persists results and audit logs
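
The idea of a queue with priority levels can be sketched without Redis. The class below is an in-memory stand-in built on `heapq`, meant only to show the ordering behaviour; it is not the Redis implementation, and the names `QueuedTask` and `InMemoryTaskQueue`, the "lower number is more urgent" convention, and the sample payloads are all assumptions.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass(order=True)
class QueuedTask:
    priority: int                        # Lower number = more urgent (assumed convention)
    sequence: int                        # Tie-breaker: keeps FIFO order within a priority
    payload: Any = field(compare=False)  # Excluded from ordering comparisons

class InMemoryTaskQueue:
    """Illustrative stand-in for a priority queue; not the Redis implementation."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()

    def push(self, payload: Any, priority: int = 5) -> None:
        heapq.heappush(self._heap, QueuedTask(priority, next(self._counter), payload))

    def pop(self) -> Optional[Any]:
        """Return the most urgent payload, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap).payload

    def __len__(self) -> int:
        return len(self._heap)

queue = InMemoryTaskQueue()
queue.push({"site": "a", "action": "export"}, priority=5)
queue.push({"site": "b", "action": "login-check"}, priority=1)
queue.push({"site": "c", "action": "export"}, priority=5)
order = [queue.pop()["site"] for _ in range(3)]
print(order)
```

The urgent task for site "b" is served first, and the two priority-5 tasks keep their arrival order.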

Why Playwright Over Selenium

We started with Selenium but migrated to Playwright for several reasons:

Feature                Selenium   Playwright
---------------------  ---------  ------------------------
Auto-wait              Manual     Built-in
Browser contexts       Slow       Fast, isolated
Network interception   Limited    First-class
Debugging              Basic      Excellent (trace viewer)
Parallelization        Complex    Simple

The Migration

python
# Before: Selenium with explicit waits everywhere
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `driver` is an existing WebDriver instance
element = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "submit"))  # Present is not enough to click safely
)
element.click()

# After: Playwright with auto-wait (inside an async function; `page` is a Playwright Page)
await page.click("#submit")  # Auto-waits until the element is actionable

This alone reduced our flaky tests by 40%.

Scaling to 50K Tasks Daily

Worker Pool Design

Each worker runs in a Kubernetes pod with the following resource requests and limits:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: browser-worker
spec:
  replicas: 20  # Starting point; the autoscaler adjusts this based on queue depth
  selector:     # Required by apps/v1; the label name here is illustrative
    matchLabels:
      app: browser-worker
  template:
    metadata:
      labels:
        app: browser-worker
    spec:
      containers:
        - name: worker
          image: browser-worker:latest
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"

Horizontal Pod Autoscaling

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: browser-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: browser-worker
  minReplicas: 10
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_length
        target:
          type: AverageValue
          averageValue: 100

We scale on queue depth rather than CPU because browser automation is I/O bound: workers spend most of their time waiting on page loads and network responses.
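
For an External metric with an `AverageValue` target, the autoscaler aims for roughly that many queued tasks per replica. The function below approximates the resulting replica count for the values in the manifest (target 100, bounds 10 to 50). It is a simplified sketch: it ignores the autoscaler's tolerance band and stabilization windows, and the function name is hypothetical.

```python
import math

def desired_replicas(queue_length: int, target_per_pod: int = 100,
                     min_replicas: int = 10, max_replicas: int = 50) -> int:
    """Approximate the replica count for an External metric with an AverageValue target."""
    raw = math.ceil(queue_length / target_per_pod)
    # Clamp to the minReplicas / maxReplicas bounds from the manifest
    return max(min_replicas, min(max_replicas, raw))

for depth in (300, 2_500, 9_000):
    print(depth, desired_replicas(depth))
```

A queue of 300 stays at the 10-replica floor, 2,500 queued tasks ask for 25 replicas, and 9,000 hits the 50-replica ceiling, at which point the backlog can only drain as fast as 50 workers allow.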

Achieving 99.9% Reliability

Strategy 1: Intelligent Retries

Not all failures are equal. We classify them:

python
import asyncio
from enum import Enum

# Task, Result, AutomationError, MAX_RETRIES, execute_task, classify_failure
# and alert_on_call are defined elsewhere in the platform.

class FailureType(Enum):
    TRANSIENT = "transient"      # Network timeout, retry immediately
    RATE_LIMITED = "rate_limit"  # Back off exponentially
    STRUCTURAL = "structural"    # Site changed, alert humans
    PERMANENT = "permanent"      # Invalid input, don't retry

async def execute_with_retry(task: Task) -> Result:
    last_error = None
    for attempt in range(MAX_RETRIES):
        try:
            return await execute_task(task)
        except AutomationError as e:
            last_error = e
            failure_type = classify_failure(e)

            if failure_type == FailureType.PERMANENT:
                raise  # Don't retry
            elif failure_type == FailureType.RATE_LIMITED:
                await asyncio.sleep(2 ** attempt * 10)  # Exponential backoff: 10s, 20s, 40s, ...
            elif failure_type == FailureType.STRUCTURAL:
                alert_on_call(task, e)
                raise
            else:
                await asyncio.sleep(attempt * 2)  # Linear backoff: 0s, 2s, 4s, ...
    # Retries exhausted: surface the last error instead of silently returning None
    raise last_error
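
The retry loop depends on `classify_failure`, which the article does not show. One way such a classifier could look is sketched below. Everything about the error shape is an assumption: the `status_code`, `missing_selector` and `invalid_input` fields are hypothetical, and the real `AutomationError` may carry different information.

```python
from enum import Enum
from typing import Optional

class FailureType(Enum):
    TRANSIENT = "transient"
    RATE_LIMITED = "rate_limit"
    STRUCTURAL = "structural"
    PERMANENT = "permanent"

class AutomationError(Exception):
    """Hypothetical error shape; the real class may carry different fields."""

    def __init__(self, message: str, status_code: Optional[int] = None,
                 missing_selector: Optional[str] = None, invalid_input: bool = False):
        super().__init__(message)
        self.status_code = status_code
        self.missing_selector = missing_selector
        self.invalid_input = invalid_input

def classify_failure(error: AutomationError) -> FailureType:
    """Map an error to a retry policy, checking the most specific signals first."""
    if error.invalid_input:
        return FailureType.PERMANENT      # Retrying bad input cannot succeed
    if error.status_code == 429:
        return FailureType.RATE_LIMITED   # HTTP 429 Too Many Requests
    if error.missing_selector is not None:
        return FailureType.STRUCTURAL     # Expected element is gone: the page likely changed
    return FailureType.TRANSIENT          # Default: assume a network blip and retry

print(classify_failure(AutomationError("slow down", status_code=429)))
```

Defaulting to `TRANSIENT` keeps unknown errors retryable, at the cost of a few wasted attempts on failures that were never going to succeed.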

Strategy 2: Health Checks and Circuit Breakers

python
from circuitbreaker import circuit
from playwright.async_api import async_playwright

@circuit(failure_threshold=5, recovery_timeout=60)
async def automate_site_a(task: Task) -> Result:
    # If this fails 5 times in a row, stop trying for 60 seconds
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # ... automation logic
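
The decorator hides the state machine, so here is a minimal sketch of the behaviour it provides: open after a run of consecutive failures, then allow a trial request once the cool-down has passed. This is an illustration of the pattern, not the `circuitbreaker` library's internals; the class name and the injectable `clock` parameter are assumptions made to keep the example testable.

```python
import time
from typing import Callable, Optional

class SimpleCircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures, retries after a cool-down."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let a trial request through (the "half-open" state)
        return self._clock() - self._opened_at >= self.recovery_timeout

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()  # Open (or re-open) and restart the cool-down

# Drive the breaker with a fake clock so the example runs instantly
now = [0.0]
breaker = SimpleCircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
blocked = not breaker.allow_request()
now[0] = 61.0
trial_allowed = breaker.allow_request()
print(blocked, trial_allowed)
```

While the circuit is open, tasks for that site fail fast instead of tying up a browser for the full timeout, which protects throughput for every other site.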

Strategy 3: Self-Healing Selectors

Sites change their HTML. We use multiple selector strategies:

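python
from playwright.async_api import ElementHandle, Page, TimeoutError as PlaywrightTimeoutError

class ElementNotFound(Exception):
    """Raised when none of the selector strategies match."""

class ResilientLocator:
    def __init__(self, strategies: list[str]):
        self.strategies = strategies

    async def find(self, page: Page) -> ElementHandle:
        # Try each selector in order; on timeout, fall through to the next one.
        # Worst case is len(strategies) x 5 seconds before giving up.
        for strategy in self.strategies:
            try:
                element = await page.wait_for_selector(
                    strategy,
                    timeout=5000  # milliseconds
                )
                if element:
                    return element
            except PlaywrightTimeoutError:  # A bare except would also swallow cancellation
                continue
        raise ElementNotFound(self.strategies)

# Usage
submit_button = ResilientLocator([
    "#submit-btn",                # ID
    "button[type='submit']",      # Attribute
    "text=Submit",                # Text content
    "button:has-text('Submit')",  # Playwright-specific
])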

Monitoring and Observability

You can't maintain 99.9% reliability without visibility.

Metrics We Track

python
from prometheus_client import Counter, Histogram, Gauge

tasks_total = Counter(
    'automation_tasks_total',
    'Total tasks processed',
    ['site', 'status']
)

task_duration = Histogram(
    'automation_task_duration_seconds',
    'Task execution time',
    ['site'],
    buckets=[1, 5, 10, 30, 60, 120]
)

queue_depth = Gauge(
    'automation_queue_depth',
    'Current queue depth',
    ['priority']
)

Alerting Rules

yaml
groups:
  - name: automation
    rules:
      - alert: HighFailureRate
        expr: |
          sum(rate(automation_tasks_total{status="failed"}[5m]))
            / sum(rate(automation_tasks_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Automation failure rate above 1%"
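
It helps to compare this threshold with the reliability target. The arithmetic below uses only figures from this article (50,000 tasks per day, a 0.1% failure budget, a 1% alert threshold); the helper name is illustrative, and rates are expressed per thousand to keep the maths in integers.

```python
def daily_failures(tasks_per_day: int, failures_per_thousand: int) -> int:
    """Failures per day for a failure rate expressed per thousand tasks."""
    return tasks_per_day * failures_per_thousand // 1000

budget = daily_failures(50_000, 1)         # 0.1% target -> allowed failures per day
at_threshold = daily_failures(50_000, 10)  # 1% alert threshold, if sustained all day
burn_multiple = at_threshold / budget
print(budget, at_threshold, burn_multiple)
```

A failure rate that just trips this alert consumes the daily budget ten times faster than the target allows, so the rule reads as an incident signal rather than an early warning.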

AI-Powered Enhancements

We integrated LLMs to handle edge cases:

Dynamic Element Detection

When standard selectors fail, we use GPT-4 Vision:

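python
import base64
from openai import AsyncOpenAI

client = AsyncOpenAI()  # Reads OPENAI_API_KEY from the environment

async def find_element_with_ai(page, description: str):
    # page.screenshot() returns raw PNG bytes; a data URL needs base64 text
    screenshot = base64.b64encode(await page.screenshot()).decode("ascii")

    response = await client.chat.completions.create(
        model="gpt-4-vision-preview",  # Substitute whichever vision-capable model is available
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Find the {description} element and return its approximate coordinates"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{screenshot}"}}
            ]
        }]
    )

    # parse_coordinates (defined elsewhere) extracts (x, y) pixel coordinates from the reply
    x, y = parse_coordinates(response.choices[0].message.content)
    # page.click() requires a selector, so click at viewport coordinates with the mouse API
    await page.mouse.click(x, y)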

This handles sites with obfuscated selectors or dynamic class names.
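
The `parse_coordinates` helper is not shown above. A minimal sketch of what it might do follows, assuming it receives the model's reply as plain text containing an "x, y" pair. The regular expression, the bounds check, and the 1280×720 viewport (Playwright's default) are assumptions; model replies are free-form, so a production version would need stricter prompting and validation.

```python
import re
from typing import Optional, Tuple

# Matches the first "number, number" pair, e.g. "(640, 355)" or "640.5,355"
_NUMBER_PAIR = re.compile(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")

def parse_coordinates(reply_text: str,
                      viewport: Tuple[int, int] = (1280, 720)) -> Optional[Tuple[float, float]]:
    """Extract the first "x, y" pair from free-form text; None if absent or off-screen."""
    match = _NUMBER_PAIR.search(reply_text)
    if match is None:
        return None
    x, y = float(match.group(1)), float(match.group(2))
    width, height = viewport
    if not (0 <= x <= width and 0 <= y <= height):
        return None  # Reject coordinates that cannot be inside the page
    return x, y

print(parse_coordinates("The submit button is centred at (640, 355)."))
```

Returning `None` rather than guessing lets the caller treat an unusable reply as a failed lookup instead of clicking an arbitrary spot on the page.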

Results

After 9 months of iteration:

  • 50K+ tasks daily with consistent throughput
  • 99.9% success rate (averaging 30-40 failures per day)
  • P95 latency under 25 seconds for standard tasks
  • 60% cost reduction compared to manual processing
  • 85% improvement in system reliability vs. initial version

Key Takeaways

  1. Choose the right tool: Playwright's auto-wait and browser contexts are game-changers
  2. Design for failure: Intelligent retries and circuit breakers are essential
  3. Make selectors resilient: Multiple fallback strategies prevent breakage
  4. Scale horizontally: Browser automation is I/O bound; scale on queue depth
  5. Observe everything: You can't fix what you can't see

Browser automation at scale is challenging, but with the right architecture, it's achievable.


Building automation systems? Let's connect on LinkedIn or check out my work on GitHub.

WRITTEN BY

Anshuman Parmar

Senior Full Stack Developer specializing in AI systems, browser automation, and scalable web applications. Building production-grade solutions that deliver measurable business impact.
