Introduction
When I joined Thunder Marketing Corporation, we had a challenge: automate browser-based workflows at enterprise scale. Not hundreds of tasks, but tens of thousands daily, with a 99.9% reliability requirement.
This article shares how we built a browser automation platform processing 50K+ tasks daily, the architectural decisions that made it possible, and the lessons learned along the way.
The Challenge
Our requirements were demanding:
- Volume: 50,000+ automated tasks per day
- Reliability: 99.9% success rate (only 50 failures allowed per day)
- Latency: Most tasks complete within 30 seconds
- Diversity: Handle multiple websites with different structures
- Resilience: Graceful degradation when target sites change
Traditional automation approaches couldn't meet these requirements.
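To make these targets concrete, the following back-of-the-envelope sketch derives the daily failure budget and the average number of tasks in flight. It assumes load is spread evenly over 24 hours and treats the 30-second latency target as the duration of every task. Both are simplifications that the article does not state, since real traffic is bursty.

```python
# Back-of-the-envelope capacity numbers from the stated requirements.
# Assumptions (not from the article): even load over 24 hours, and every
# task taking the full 30-second latency target.

TASKS_PER_DAY = 50_000
SUCCESS_TARGET = 0.999
TASK_SECONDS = 30
SECONDS_PER_DAY = 24 * 60 * 60

# Failure budget: the number of tasks allowed to fail each day.
failure_budget = round(TASKS_PER_DAY * (1 - SUCCESS_TARGET))

# Average arrival rate in tasks per second.
arrival_rate = TASKS_PER_DAY / SECONDS_PER_DAY

# Little's law: average tasks in flight = arrival rate * time per task.
avg_in_flight = arrival_rate * TASK_SECONDS

print(f"failure budget: {failure_budget} tasks/day")
print(f"arrival rate:   {arrival_rate:.2f} tasks/s")
print(f"avg in flight:  {avg_in_flight:.1f} tasks")
```

The in-flight figure is an average under ideal conditions, so it is a floor rather than a sizing target. Peaks and retries push the real number higher.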
Architecture Overview
We built a distributed system with these components:
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Task Queue    │────▶│   Worker Pool   │────▶│  Result Store   │
│    (Redis)      │     │  (Kubernetes)   │     │  (PostgreSQL)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                                               │
         │              ┌───────────────┐                │
         └─────────────▶│   Scheduler   │◀───────────────┘
                        │   (FastAPI)   │
                        └───────────────┘
```
Component Breakdown
- Task Queue (Redis): Holds pending tasks with priority levels
- Worker Pool (Kubernetes): Scalable browser workers running Playwright
- Scheduler (FastAPI): Orchestrates task distribution and retries
- Result Store (PostgreSQL): Persists results and audit logs
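The article does not show how priority levels are stored in Redis. As an in-memory stand-in for the same semantics, the sketch below orders tasks by priority and keeps first-in, first-out order within a level. The class names, the payload shape, and the "lower number is more urgent" convention are all assumptions made for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Any


@dataclass(order=True)
class QueuedTask:
    priority: int   # lower number = more urgent (assumed convention)
    sequence: int   # tie-breaker: preserves FIFO order within a priority
    payload: Any = field(compare=False)


class PriorityTaskQueue:
    """In-memory stand-in for a priority queue; not the article's Redis code."""

    def __init__(self) -> None:
        self._heap: list[QueuedTask] = []
        self._counter = itertools.count()

    def push(self, payload: Any, priority: int) -> None:
        heapq.heappush(self._heap, QueuedTask(priority, next(self._counter), payload))

    def pop(self) -> Any:
        return heapq.heappop(self._heap).payload

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityTaskQueue()
queue.push({"site": "a", "action": "export"}, priority=5)
queue.push({"site": "b", "action": "login"}, priority=1)
queue.push({"site": "c", "action": "export"}, priority=5)

# Drain the queue: the urgent task comes first, then the rest in arrival order.
order = [queue.pop()["site"] for _ in range(len(queue))]
print(order)
```

The sequence counter matters because without it two tasks at the same priority would be compared by payload, which is both unordered and, for dictionaries, an error.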
Why Playwright Over Selenium
We started with Selenium but migrated to Playwright for several reasons:
| Feature | Selenium | Playwright |
|---|---|---|
| Auto-wait | Manual | Built-in |
| Browser contexts | Slow | Fast, isolated |
| Network interception | Limited | First-class |
| Debugging | Basic | Excellent (trace viewer) |
| Parallelization | Complex | Simple |
The Migration
```python
# Before: Selenium with explicit waits everywhere
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "submit"))
)
element.click()

# After: Playwright with auto-wait
await page.click("#submit")  # Auto-waits for element
```
This alone reduced our flaky tests by 40%.
Scaling to 50K Tasks Daily
Worker Pool Design
Each worker runs in a Kubernetes pod with:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: browser-worker
spec:
  replicas: 20  # Scales based on queue depth
  selector:
    matchLabels:
      app: browser-worker  # Label name is illustrative; apps/v1 requires a selector
  template:
    metadata:
      labels:
        app: browser-worker
    spec:
      containers:
      - name: worker
        image: browser-worker:latest
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
```
Horizontal Pod Autoscaling
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: browser-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: browser-worker
  minReplicas: 10
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: redis_queue_length
      target:
        type: AverageValue
        averageValue: 100
```
We scale on queue depth rather than CPU because browser automation is I/O bound.
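With an `AverageValue` target, the autoscaler divides the external metric by the current pod count and adjusts replicas until the per-pod average approaches the target. That reduces to roughly the queue length divided by 100, clamped to the replica bounds. The sketch below is an approximation only: it ignores the tolerance band, stabilization windows, and scaling policies that shape real HPA behavior, and the function name is hypothetical.

```python
import math


def desired_replicas(queue_length: int,
                     target_per_pod: int = 100,
                     min_replicas: int = 10,
                     max_replicas: int = 50) -> int:
    """Approximate the HPA result for an External metric with an AverageValue target.

    Simplification: ignores the HPA tolerance band, stabilization windows,
    and scaling policies, all of which affect real behavior.
    """
    raw = math.ceil(queue_length / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))


for depth in (500, 3_000, 10_000):
    print(depth, "->", desired_replicas(depth))
```

With the bounds from the manifest, a queue of 3,000 tasks maps to 30 workers, while anything at or below 1,000 stays at the minimum of 10 and anything at or above 5,000 is capped at 50.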
Achieving 99.9% Reliability
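Retries carry much of the weight here because success compounds across attempts. As a rough illustration, the sketch below computes the chance that at least one of several attempts succeeds. It assumes attempts fail independently, which correlated outages violate, and the 97% per-attempt figure is invented for the example rather than measured.

```python
def success_after_retries(per_attempt_success: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - per_attempt_success) ** attempts


# 0.97 is an illustrative figure, not a measurement from the article.
for attempts in (1, 2, 3):
    rate = success_after_retries(0.97, attempts)
    print(f"{attempts} attempt(s): {rate:.4%}")
```

Under these assumptions a second attempt already lifts the rate from 97% to about 99.91%. In practice some failures are permanent or structural and do not benefit from retrying, which is why the failure type matters.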
Strategy 1: Intelligent Retries
Not all failures are equal. We classify them:
```python
import asyncio
from enum import Enum

class FailureType(Enum):
    TRANSIENT = "transient"       # Network timeout, retry immediately
    RATE_LIMITED = "rate_limit"   # Back off exponentially
    STRUCTURAL = "structural"     # Site changed, alert humans
    PERMANENT = "permanent"       # Invalid input, don't retry

async def execute_with_retry(task: Task) -> Result:
    for attempt in range(MAX_RETRIES):
        try:
            return await execute_task(task)
        except AutomationError as e:
            failure_type = classify_failure(e)

            if failure_type == FailureType.PERMANENT:
                raise  # Don't retry
            elif failure_type == FailureType.STRUCTURAL:
                alert_on_call(task, e)
                raise
            elif attempt == MAX_RETRIES - 1:
                raise  # Retries exhausted: surface the last error
            elif failure_type == FailureType.RATE_LIMITED:
                await asyncio.sleep(2 ** attempt * 10)  # Exponential backoff
            else:
                await asyncio.sleep(attempt * 2)  # Linear backoff
```
Strategy 2: Health Checks and Circuit Breakers
```python
from circuitbreaker import circuit
from playwright.async_api import async_playwright

@circuit(failure_threshold=5, recovery_timeout=60)
async def automate_site_a(task: Task) -> Result:
    # If this fails 5 times in a row, stop trying for 60 seconds
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # ... automation logic
```
Strategy 3: Self-Healing Selectors
Sites change their HTML. We use multiple selector strategies:
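```python
from playwright.async_api import ElementHandle, Page, TimeoutError as PlaywrightTimeoutError

class ElementNotFound(Exception):
    """Raised when none of the selector strategies matches."""

class ResilientLocator:
    def __init__(self, strategies: list[str]):
        self.strategies = strategies

    async def find(self, page: Page) -> ElementHandle:
        # Try each selector in order, waiting up to 5 seconds per strategy
        for strategy in self.strategies:
            try:
                element = await page.wait_for_selector(strategy, timeout=5000)
                if element:
                    return element
            except PlaywrightTimeoutError:
                continue  # This selector did not match; try the next one
        raise ElementNotFound(self.strategies)

# Usage
submit_button = ResilientLocator([
    "#submit-btn",                 # ID
    "button[type='submit']",       # Attribute
    "text=Submit",                 # Text content
    "button:has-text('Submit')",   # Playwright-specific
])
```
Monitoring and Observability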
You can't maintain 99.9% reliability without visibility.
Metrics We Track
```python
from prometheus_client import Counter, Histogram, Gauge

tasks_total = Counter(
    'automation_tasks_total',
    'Total tasks processed',
    ['site', 'status']
)

task_duration = Histogram(
    'automation_task_duration_seconds',
    'Task execution time',
    ['site'],
    buckets=[1, 5, 10, 30, 60, 120]
)

queue_depth = Gauge(
    'automation_queue_depth',
    'Current queue depth',
    ['priority']
)
```
Alerting Rules
```yaml
groups:
- name: automation
  rules:
  - alert: HighFailureRate
    expr: |
      sum(rate(automation_tasks_total{status="failed"}[5m]))
      / sum(rate(automation_tasks_total[5m])) > 0.01
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Automation failure rate above 1%"
```
AI-Powered Enhancements
We integrated LLMs to handle edge cases:
Dynamic Element Detection
When standard selectors fail, we use GPT-4 Vision:
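```python
import base64
from openai import AsyncOpenAI

client = AsyncOpenAI()  # Reads OPENAI_API_KEY from the environment

async def find_element_with_ai(page, description: str):
    # page.screenshot() returns raw PNG bytes; the data URL needs base64 text
    screenshot = base64.b64encode(await page.screenshot()).decode("ascii")

    response = await client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Find the {description} element and return its approximate coordinates"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{screenshot}"}}
            ]
        }]
    )

    # parse_coordinates is assumed to return an (x, y) pair in page pixels
    x, y = parse_coordinates(response)
    await page.mouse.click(x, y)  # page.click() needs a selector, so click by position
```
This handles sites with obfuscated selectors or dynamic class names.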
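The `parse_coordinates` helper is not shown in the article. A minimal sketch of one possible version follows, assuming the model replies with free text that contains a pair such as `(412, 873)`. It works on the reply string, which for the chat completions API is found at `response.choices[0].message.content`. The function name and the reply format are assumptions, and a production version would also need to check that the point lies inside the viewport.

```python
import re

# Hypothetical helper: the article calls parse_coordinates but does not show it.
# Matches an "x, y" pair with optional parentheses, e.g. "(412, 873)" or "412,873".
_COORD_PATTERN = re.compile(r"\(?\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)?")


def parse_coordinates_from_text(reply: str) -> tuple[float, float]:
    """Extract the first 'x, y' pair from a model reply.

    Raises ValueError when no pair is found, so the caller can fall back
    instead of clicking at a guessed position.
    """
    match = _COORD_PATTERN.search(reply)
    if match is None:
        raise ValueError(f"no coordinates found in reply: {reply!r}")
    return float(match.group(1)), float(match.group(2))


print(parse_coordinates_from_text("The button is at approximately (412, 873)."))
```

Raising on a missing pair is a deliberate choice: a click at a wrong position can trigger an unintended action, so failing the task and letting the retry logic classify it is the safer default.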
Results
After 9 months of iteration:
- 50K+ tasks daily with consistent throughput
- 99.9% success rate (averaging 30-40 failures per day)
- P95 latency under 25 seconds for standard tasks
- 60% cost reduction compared to manual processing
- 85% improvement in system reliability vs. initial version
Key Takeaways
- Choose the right tool: Playwright's auto-wait and browser contexts are game-changers
- Design for failure: Intelligent retries and circuit breakers are essential
- Make selectors resilient: Multiple fallback strategies prevent breakage
- Scale horizontally: Browser automation is I/O bound; scale on queue depth
- Observe everything: You can't fix what you can't see
Browser automation at scale is challenging, but with the right architecture, it's achievable.
Building automation systems? Let's connect on LinkedIn or check out my work on GitHub.