Browser Automation at Scale: 50K Tasks Daily

Introduction

When I joined Thunder Marketing Corporation, we had a challenge: automate browser-based workflows at enterprise scale. That meant not hundreds of tasks but tens of thousands daily, with a 99.9% reliability requirement.

This article shares how we built a browser automation platform processing 50K+ tasks daily, the architectural decisions that made it possible, and the lessons learned along the way.

The Challenge

Our requirements were demanding:

Volume: 50,000+ automated tasks per day
Reliability: 99.9% success rate (only 50 failures allowed per day)
Latency: Most tasks complete within 30 seconds
Diversity: Handle multiple websites with different structures
Resilience: Graceful degradation when target sites change
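
To make these numbers concrete, here is a back-of-the-envelope calculation. It assumes tasks arrive evenly across the day, which real workloads rarely do, so treat the results as a floor rather than a sizing recommendation. The variable names are illustrative only.

```python
import math

TASKS_PER_DAY = 50_000
SUCCESS_TARGET = 0.999
TASK_SECONDS = 30               # upper bound from the latency requirement
SECONDS_PER_DAY = 24 * 60 * 60

# Failure budget: how many tasks may fail per day at the target success rate.
failure_budget = round(TASKS_PER_DAY * (1 - SUCCESS_TARGET))

# Average arrival rate if tasks were spread evenly over the day (an assumption).
arrival_rate = TASKS_PER_DAY / SECONDS_PER_DAY

# Little's law: average tasks in flight = arrival rate x time per task.
avg_in_flight = arrival_rate * TASK_SECONDS

print(f"failure budget: {failure_budget} tasks/day")
print(f"arrival rate:   {arrival_rate:.2f} tasks/s")
print(f"in flight:      {math.ceil(avg_in_flight)} concurrent tasks (average)")
```

Under the even-arrival assumption this works out to roughly 0.58 tasks per second and about 18 tasks in flight at any moment. Bursty traffic pushes the peak well above that average.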

Traditional automation approaches couldn't meet these requirements.

Architecture Overview

We built a distributed system with these components:

text

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Task Queue    │────▶│  Worker Pool    │────▶│  Result Store   │
│   (Redis)       │     │  (Kubernetes)   │     │  (PostgreSQL)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         │              ┌───────────────┐               │
         └─────────────▶│   Scheduler   │◀──────────────┘
                        │   (FastAPI)   │
                        └───────────────┘

Component Breakdown

Task Queue (Redis): Holds pending tasks with priority levels
Worker Pool (Kubernetes): Scalable browser workers running Playwright
Scheduler (FastAPI): Orchestrates task distribution and retries
Result Store (PostgreSQL): Persists results and audit logs
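
The priority behavior of the task queue can be sketched in memory. This is only a stand-in to show the ordering semantics. The article does not say how the Redis side is laid out, so the class and field names below are hypothetical, and a real deployment would use Redis data structures instead of `heapq`.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass(order=True)
class QueuedTask:
    priority: int                       # lower number = more urgent
    sequence: int                       # tie-breaker that preserves FIFO order
    task_id: str = field(compare=False)


class PriorityTaskQueue:
    """In-memory illustration of a queue with priority levels."""

    def __init__(self) -> None:
        self._heap: list[QueuedTask] = []
        self._counter = itertools.count()

    def push(self, task_id: str, priority: int) -> None:
        heapq.heappush(self._heap, QueuedTask(priority, next(self._counter), task_id))

    def pop(self) -> Optional[str]:
        # Returns the most urgent task id, or None when the queue is empty.
        if not self._heap:
            return None
        return heapq.heappop(self._heap).task_id

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityTaskQueue()
queue.push("report-1", priority=5)
queue.push("login-check", priority=0)
queue.push("report-2", priority=5)

order = [queue.pop() for _ in range(3)]
print(order)
```

The urgent task comes out first, and tasks with equal priority keep their insertion order.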

Why Playwright Over Selenium

We started with Selenium but migrated to Playwright for several reasons:

Feature	Selenium	Playwright
Auto-wait	Manual	Built-in
Browser contexts	Slow	Fast, isolated
Network interception	Limited	First-class
Debugging	Basic	Excellent (trace viewer)
Parallelization	Complex	Simple

The Migration

python

# Before: Selenium with explicit waits everywhere
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `driver` is an existing WebDriver instance
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "submit"))
)
element.click()

# After: Playwright with auto-wait (`page` is an existing async Page)
await page.click("#submit")  # Waits until the element is actionable, then clicks

This alone reduced our flaky tests by 40%.

Scaling to 50K Tasks Daily

Worker Pool Design

Each worker runs in a Kubernetes pod with:

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: browser-worker
spec:
  replicas: 20  # Starting point; the autoscaler adjusts this based on queue depth
  selector:
    matchLabels:
      app: browser-worker  # Required by apps/v1; must match the pod template labels
  template:
    metadata:
      labels:
        app: browser-worker
    spec:
      containers:
      - name: worker
        image: browser-worker:latest
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"

Horizontal Pod Autoscaling

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: browser-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: browser-worker
  minReplicas: 10
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: redis_queue_length
      target:
        type: AverageValue
        averageValue: 100

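We scale on queue depth rather than CPU because browser automation is I/O bound: workers spend most of their time waiting on page loads and network responses, so CPU usage is a poor signal of backlog.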
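
As a rough illustration of what this configuration asks for: with an `AverageValue` target, the autoscaler aims for about one replica per 100 queued tasks, clamped to the configured minimum and maximum. The function below is a simplified sketch of that arithmetic. It ignores the autoscaler's tolerance band and stabilization windows, and the function name is made up for this example.

```python
import math


def desired_replicas(
    queue_length: int,
    target_per_pod: int = 100,   # averageValue from the HPA spec
    min_replicas: int = 10,
    max_replicas: int = 50,
) -> int:
    """Approximate replica count the HPA would converge on for a queue length."""
    raw = math.ceil(queue_length / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))


for depth in (400, 3_200, 12_000):
    print(depth, "->", desired_replicas(depth))
```

A queue of 400 tasks stays at the floor of 10 replicas, 3,200 tasks asks for 32, and 12,000 tasks is capped at 50. Once the cap is hit, the backlog drains only as fast as those 50 workers can process it.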

Achieving 99.9% Reliability

Strategy 1: Intelligent Retries

Not all failures are equal. We classify them:

python

import asyncio
from enum import Enum

# Task, Result, AutomationError, MAX_RETRIES, execute_task, classify_failure,
# and alert_on_call are assumed to be defined elsewhere in the worker code.

class FailureType(Enum):
    TRANSIENT = "transient"      # Network timeout, retry with a short linear backoff
    RATE_LIMITED = "rate_limit"  # Back off exponentially
    STRUCTURAL = "structural"    # Site changed, alert humans
    PERMANENT = "permanent"      # Invalid input, don't retry

async def execute_with_retry(task: Task) -> Result:
    for attempt in range(MAX_RETRIES):
        try:
            return await execute_task(task)
        except AutomationError as e:
            failure_type = classify_failure(e)

            if failure_type == FailureType.PERMANENT:
                raise  # Don't retry
            if failure_type == FailureType.STRUCTURAL:
                alert_on_call(task, e)
                raise
            if attempt == MAX_RETRIES - 1:
                raise  # Retries exhausted: surface the last error instead of returning None

            if failure_type == FailureType.RATE_LIMITED:
                await asyncio.sleep(2 ** attempt * 10)  # Exponential backoff: 10s, 20s, 40s, ...
            else:
                await asyncio.sleep(attempt * 2)  # Linear backoff: 0s, 2s, 4s, ...
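
The article does not show `classify_failure`, so the following is only one plausible shape for it. The `AutomationError` fields used here (`status_code`, `selector_missing`) are hypothetical, chosen so the example runs on its own. `retry_delay` mirrors the backoff formulas from the retry loop above.

```python
from enum import Enum
from typing import Optional


class FailureType(Enum):
    TRANSIENT = "transient"
    RATE_LIMITED = "rate_limit"
    STRUCTURAL = "structural"
    PERMANENT = "permanent"


class AutomationError(Exception):
    """Hypothetical error type carrying just enough context to classify."""

    def __init__(self, message: str, status_code: Optional[int] = None,
                 selector_missing: bool = False) -> None:
        super().__init__(message)
        self.status_code = status_code
        self.selector_missing = selector_missing


def classify_failure(error: AutomationError) -> FailureType:
    if error.status_code == 429:
        return FailureType.RATE_LIMITED
    if error.selector_missing:
        return FailureType.STRUCTURAL      # expected element is gone: site likely changed
    if error.status_code is not None and 400 <= error.status_code < 500:
        return FailureType.PERMANENT       # other client errors: retrying won't help
    return FailureType.TRANSIENT           # timeouts, 5xx, anything unknown


def retry_delay(failure_type: FailureType, attempt: int) -> Optional[int]:
    """Seconds to wait before the next attempt, or None when no retry applies."""
    if failure_type is FailureType.RATE_LIMITED:
        return 2 ** attempt * 10
    if failure_type is FailureType.TRANSIENT:
        return attempt * 2
    return None


print(classify_failure(AutomationError("too many requests", status_code=429)))
print([retry_delay(FailureType.RATE_LIMITED, n) for n in range(3)])
```

Defaulting unknown errors to `TRANSIENT` is a judgment call: it favors retrying over giving up, at the cost of occasionally retrying something that will never succeed.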

Strategy 2: Health Checks and Circuit Breakers

python

from circuitbreaker import circuit
from playwright.async_api import async_playwright

@circuit(failure_threshold=5, recovery_timeout=60)
async def automate_site_a(task: Task) -> Result:
    # After 5 consecutive failures, stop calling this function for 60 seconds
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            ...  # automation logic
        finally:
            await browser.close()  # Release the browser even if the task fails
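
To show what the decorator is doing conceptually, here is a minimal synchronous circuit breaker. It is not the `circuitbreaker` library's implementation, only a sketch of the same idea: count consecutive failures, refuse calls while the circuit is open, and allow a trial call once the recovery timeout has passed. All names are illustrative, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Recovery window elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The benefit for a worker pool is that a site that is down fails fast instead of tying up a browser for the full timeout on every task.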

Strategy 3: Self-Healing Selectors

Sites change their HTML. We use multiple selector strategies:

python

from playwright.async_api import ElementHandle, Error as PlaywrightError, Page

class ElementNotFound(Exception):
    """Raised when none of the selector strategies match."""

class ResilientLocator:
    def __init__(self, strategies: list[str]):
        self.strategies = strategies

    async def find(self, page: Page) -> ElementHandle:
        # Try each selector in order; the first one that resolves wins
        for strategy in self.strategies:
            try:
                element = await page.wait_for_selector(strategy, timeout=5000)
            except PlaywrightError:
                continue  # Timed out or invalid selector: try the next strategy
            if element:
                return element
        raise ElementNotFound(self.strategies)

# Usage
submit_button = ResilientLocator([
    "#submit-btn",                    # ID
    "button[type='submit']",          # Attribute
    "text=Submit",                    # Text content
    "button:has-text('Submit')",      # Playwright-specific
])
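
The fallback order can be exercised without a browser by swapping in a stub page. The sketch below uses hypothetical stand-ins (`StubPage`, `StubTimeout`, `find_first`) for the Playwright objects, and returns which strategy matched so a caller could log it. Logging the match is an idea this example adds, not something the article describes.

```python
import asyncio


class StubTimeout(Exception):
    """Stands in for Playwright's timeout error in this browser-free sketch."""


class StubPage:
    """Pretends to be a page where only some selectors resolve."""

    def __init__(self, present: set[str]) -> None:
        self._present = set(present)

    async def wait_for_selector(self, selector: str, timeout: int = 5000) -> str:
        if selector in self._present:
            return f"<element {selector}>"
        raise StubTimeout(selector)


async def find_first(page: StubPage, strategies: list[str]) -> tuple[str, str]:
    """Return (matched_strategy, element) for the first selector that resolves."""
    for strategy in strategies:
        try:
            element = await page.wait_for_selector(strategy, timeout=5000)
        except StubTimeout:
            continue
        return strategy, element
    raise LookupError(f"no strategy matched: {strategies}")


strategies = [
    "#submit-btn",
    "button[type='submit']",
    "text=Submit",
]

# Simulate a redesign that dropped the ID but kept the button itself.
page = StubPage(present={"button[type='submit']", "text=Submit"})
matched, element = asyncio.run(find_first(page, strategies))
print(matched)
```

Here the ID selector fails and the attribute selector takes over. If the primary selector stops matching for a whole site, that is an early hint the markup changed, well before every fallback is exhausted.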

Monitoring and Observability

You can't maintain 99.9% reliability without visibility.

Metrics We Track

python

from prometheus_client import Counter, Histogram, Gauge

tasks_total = Counter(
    'automation_tasks_total',
    'Total tasks processed',
    ['site', 'status']
)

task_duration = Histogram(
    'automation_task_duration_seconds',
    'Task execution time',
    ['site'],
    buckets=[1, 5, 10, 30, 60, 120]
)

queue_depth = Gauge(
    'automation_queue_depth',
    'Current queue depth',
    ['priority']
)

Alerting Rules

yaml

groups:
- name: automation
  rules:
  - alert: HighFailureRate
    expr: |
      sum(rate(automation_tasks_total{status="failed"}[5m]))
      / sum(rate(automation_tasks_total[5m])) > 0.01
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Automation failure rate above 1%"
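
It helps to see how few failures it takes to cross this threshold. The sketch below reproduces the ratio check in plain Python and assumes tasks are spread evenly over the day, which is a simplification. The function names are made up for this example.

```python
TASKS_PER_DAY = 50_000
WINDOW_MINUTES = 5
THRESHOLD = 0.01  # same 1% threshold as the alert expression


def failure_ratio(failed: int, total: int) -> float:
    # With no traffic there is nothing to alert on, so report a ratio of zero.
    return failed / total if total else 0.0


def should_alert(failed: int, total: int, threshold: float = THRESHOLD) -> bool:
    return failure_ratio(failed, total) > threshold


# Tasks expected in one 5-minute window under the even-spread assumption.
tasks_per_window = round(TASKS_PER_DAY * WINDOW_MINUTES / (24 * 60))

for failed in (1, 2, 3):
    ratio = failure_ratio(failed, tasks_per_window)
    print(f"{failed} failed of {tasks_per_window}: {ratio:.2%} alert={should_alert(failed, tasks_per_window)}")
```

At roughly 174 tasks per window, two failures are already above 1%. The `for: 5m` clause matters here, because the condition has to hold for five minutes before the alert fires, which filters out a single unlucky window.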

AI-Powered Enhancements

We integrated LLMs to handle edge cases:

Dynamic Element Detection

When standard selectors fail, we use GPT-4 Vision:

python

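import base64

from openai import AsyncOpenAI

client = AsyncOpenAI()  # Reads OPENAI_API_KEY from the environment

async def find_element_with_ai(page, description: str):
    # page.screenshot() returns raw PNG bytes; a data URL needs base64 text
    screenshot = await page.screenshot()
    encoded = base64.b64encode(screenshot).decode("ascii")

    response = await client.chat.completions.create(
        model="gpt-4-vision-preview",  # Model name as used at the time; substitute any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Find the {description} element and return its approximate coordinates"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}}
            ]
        }]
    )

    # parse_coordinates is a helper defined elsewhere; assumed to return an (x, y) pair
    x, y = parse_coordinates(response)
    await page.mouse.click(x, y)  # Click at viewport coordinates; no selector involved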

This handles sites with obfuscated selectors or dynamic class names.
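
The `parse_coordinates` helper is not shown in the article. One way such a helper might work is sketched below, under the assumption that the model replies with free text containing an `(x, y)` pair in pixels. The function name, the regular expression, and the viewport check are all illustrative choices, and unlike the original helper this version takes the reply text directly.

```python
import re
from typing import Optional, Tuple

# Matches the first "x, y" pair of numbers, with or without parentheses.
_COORD_PATTERN = re.compile(r"\(?\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)?")


def parse_coordinates_from_text(
    reply: str, viewport_width: int, viewport_height: int
) -> Optional[Tuple[float, float]]:
    """Extract an (x, y) pair from a model reply, or None if it is unusable."""
    match = _COORD_PATTERN.search(reply)
    if match is None:
        return None
    x, y = float(match.group(1)), float(match.group(2))
    # Reject points outside the viewport instead of clicking somewhere arbitrary.
    if not (0 <= x <= viewport_width and 0 <= y <= viewport_height):
        return None
    return x, y


print(parse_coordinates_from_text("The button is at (412, 318).", 1280, 720))
print(parse_coordinates_from_text("I could not find that element.", 1280, 720))
```

Returning `None` for missing or out-of-range coordinates lets the caller treat the AI path as one more fallback that can fail cleanly, since model-reported coordinates are approximate at best.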

Results

After 9 months of iteration:

50K+ tasks daily with consistent throughput
99.9% success rate (averaging 30-40 failures per day)
P95 latency under 25 seconds for standard tasks
60% cost reduction compared to manual processing
85% improvement in system reliability vs. initial version

Key Takeaways

Choose the right tool: Playwright's auto-wait and browser contexts are game-changers
Design for failure: Intelligent retries and circuit breakers are essential
Make selectors resilient: Multiple fallback strategies prevent breakage
Scale horizontally: Browser automation is I/O bound; scale on queue depth
Observe everything: You can't fix what you can't see

Browser automation at scale is challenging, but with the right architecture, it's achievable.

Building automation systems? Let's connect on LinkedIn or check out my work on GitHub.
