AI & Systems · 6 min read

Integrating LLMs in Production: GPT-4, Claude, and Beyond

Practical lessons from integrating multiple LLM providers into production systems—orchestration, fallbacks, and cost optimization.


Anshuman Parmar

June 2025


Introduction

Integrating LLMs into production systems involves far more than making API calls. After deploying AI-powered automation systems at Thunder Marketing and building agentic AI architectures at Sazag Infotech, I've learned that the real challenges are reliability, cost management, and orchestration.

This article shares practical lessons from integrating GPT-4, Claude, and Gemini into production systems.

The Multi-Provider Strategy

Relying on a single LLM provider is risky:

  • Outages happen: OpenAI has had multiple significant outages
  • Rate limits: Heavy usage can hit limits unexpectedly
  • Cost variation: Different providers excel at different tasks
  • Capability differences: Claude handles long contexts better; GPT-4 excels at reasoning

We use a multi-provider approach:

```python
from enum import Enum
from typing import Protocol

class LLMProvider(Enum):
    OPENAI = "openai"
    ANTHROPIC = "anthropic"
    GOOGLE = "google"

class LLMClient(Protocol):
    async def complete(self, prompt: str, **kwargs) -> str:
        ...

class MultiProviderLLM:
    def __init__(self):
        # OpenAIClient, AnthropicClient, and GoogleClient are thin async
        # wrappers around each provider's SDK (defined elsewhere)
        self.providers = {
            LLMProvider.OPENAI: OpenAIClient(),
            LLMProvider.ANTHROPIC: AnthropicClient(),
            LLMProvider.GOOGLE: GoogleClient(),
        }
        self.fallback_order = [
            LLMProvider.OPENAI,
            LLMProvider.ANTHROPIC,
            LLMProvider.GOOGLE,
        ]

    async def complete(
        self,
        prompt: str,
        preferred_provider: LLMProvider | None = None,
        **kwargs,
    ) -> str:
        order = (
            [preferred_provider] + self.fallback_order
            if preferred_provider
            else self.fallback_order
        )
        # Dedupe while preserving order: the preferred provider also
        # appears in the fallback list, and we only want to try it once
        providers = list(dict.fromkeys(order))

        for provider in providers:
            try:
                return await self.providers[provider].complete(prompt, **kwargs)
            except (RateLimitError, ServiceUnavailable) as e:
                logger.warning(f"{provider} failed: {e}")
                continue

        raise AllProvidersFailedError()
```
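To see the fallback path in isolation, here is a minimal, runnable sketch with stub clients standing in for the real SDK wrappers (the `FlakyClient`, `HealthyClient`, and `complete_with_fallback` names are illustrative, not part of our production code):

```python
import asyncio
import logging
from enum import Enum

logger = logging.getLogger(__name__)

class LLMProvider(Enum):
    OPENAI = "openai"
    ANTHROPIC = "anthropic"

class ServiceUnavailable(Exception):
    pass

class FlakyClient:
    """Stub client that always fails, simulating an outage."""
    async def complete(self, prompt: str, **kwargs) -> str:
        raise ServiceUnavailable("simulated outage")

class HealthyClient:
    """Stub client that always succeeds."""
    async def complete(self, prompt: str, **kwargs) -> str:
        return f"response to: {prompt}"

async def complete_with_fallback(providers, order, prompt: str) -> str:
    # Walk the fallback order, returning the first successful response
    for provider in order:
        try:
            return await providers[provider].complete(prompt)
        except ServiceUnavailable as e:
            logger.warning(f"{provider} failed: {e}")
    raise RuntimeError("all providers failed")

providers = {
    LLMProvider.OPENAI: FlakyClient(),
    LLMProvider.ANTHROPIC: HealthyClient(),
}
order = [LLMProvider.OPENAI, LLMProvider.ANTHROPIC]
result = asyncio.run(complete_with_fallback(providers, order, "hello"))
```

Swapping `FlakyClient` for a healthy one shows the first provider being used directly; the caller never sees the outage either way.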

Provider Selection: When to Use What

Based on our production experience:

| Use Case | Best Provider | Why |
| --- | --- | --- |
| Complex reasoning | GPT-4 | Best logical capabilities |
| Long documents | Claude | 200K context window |
| Code generation | GPT-4 / Claude | Both excellent |
| Fast, cheap tasks | GPT-3.5 / Gemini Flash | Cost-effective |
| Vision tasks | GPT-4V / Claude 3 | Best multimodal |

Dynamic Provider Selection

```python
def select_provider(task: Task) -> LLMProvider:
    if task.requires_vision:
        return LLMProvider.OPENAI  # GPT-4V

    if task.context_length > 100_000:
        return LLMProvider.ANTHROPIC  # Claude's long context

    if task.complexity == "simple":
        return LLMProvider.GOOGLE  # Gemini Flash for cost

    return LLMProvider.OPENAI  # GPT-4 as default
```

Cost Optimization

LLM costs can explode quickly. Here's how we keep them manageable.

1. Prompt Caching

Many prompts are repeated. Cache them:

```python
import hashlib

from redis.asyncio import Redis

class CachedLLM:
    def __init__(self, llm: LLMClient, cache: Redis):
        self.llm = llm
        self.cache = cache

    async def complete(self, prompt: str, **kwargs) -> str:
        # Create cache key from prompt + params
        cache_key = hashlib.sha256(
            f"{prompt}:{kwargs}".encode()
        ).hexdigest()

        # Check cache
        cached = await self.cache.get(cache_key)
        if cached:
            return cached

        # Call LLM
        result = await self.llm.complete(prompt, **kwargs)

        # Cache result (1-hour TTL)
        await self.cache.setex(cache_key, 3600, result)

        return result
```

2. Tiered Model Usage

Use cheaper models when possible:

```python
async def smart_complete(prompt: str, task_type: str) -> str:
    if task_type in ["classification", "extraction", "simple_qa"]:
        # Use cheaper model
        return await gpt35_client.complete(prompt)

    if task_type in ["summarization", "translation"]:
        # Medium tier
        return await claude_instant_client.complete(prompt)

    # Complex tasks get GPT-4
    return await gpt4_client.complete(prompt)
```

3. Prompt Optimization

Shorter prompts = lower costs:

```python
# Bad: verbose prompt
prompt = """
You are a helpful assistant that extracts information from text.
Your task is to carefully read the following document and extract
all the key information including names, dates, and amounts.
Please be thorough and accurate in your extraction.
Here is the document:
{document}
"""

# Good: concise prompt
prompt = """Extract names, dates, and amounts from this document:
{document}

Return as JSON: {{"names": [], "dates": [], "amounts": []}}"""
```

This reduced our token usage by 30%.
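Whether a rewrite actually saves tokens is easy to sanity-check. Exact counts need a tokenizer library (e.g. tiktoken for OpenAI models); as a rough rule of thumb, one token is about 4 characters of English text, which is enough to compare two prompt drafts. The `approx_tokens` helper below is an illustrative approximation, not a real tokenizer:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters of English text per token
    return max(1, len(text) // 4)

verbose = """You are a helpful assistant that extracts information from text.
Your task is to carefully read the following document and extract
all the key information including names, dates, and amounts.
Please be thorough and accurate in your extraction.
Here is the document:
"""

concise = """Extract names, dates, and amounts from this document:

Return as JSON: {"names": [], "dates": [], "amounts": []}"""

saved = 1 - approx_tokens(concise) / approx_tokens(verbose)
print(f"approx. {saved:.0%} fewer prompt tokens")
```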

Orchestration with LangChain

For complex workflows, LangChain provides excellent abstractions:

```python
from langchain.chains import LLMChain, SequentialChain
from langchain.prompts import PromptTemplate

# Step 1: Extract key points
extract_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["document"],
        template="Extract key points from: {document}"
    ),
    output_key="key_points"
)

# Step 2: Generate summary
summary_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["key_points"],
        template="Summarize these points: {key_points}"
    ),
    output_key="summary"
)

# Combine into a pipeline
pipeline = SequentialChain(
    chains=[extract_chain, summary_chain],
    input_variables=["document"],
    output_variables=["summary"]
)

result = await pipeline.arun(document=doc)
```

Agentic AI with LangGraph

For complex decision-making, we use LangGraph:

```python
from langgraph.graph import StateGraph, END
from typing import TypedDict

class AgentState(TypedDict):
    task: str
    research: str
    plan: str
    result: str

def should_continue(state: AgentState) -> str:
    if state.get("result"):
        return END
    if state.get("plan"):
        return "execute"
    if state.get("research"):
        return "plan"
    return "research"

# Build the graph
workflow = StateGraph(AgentState)

workflow.add_node("research", research_node)
workflow.add_node("plan", planning_node)
workflow.add_node("execute", execution_node)

workflow.add_conditional_edges(
    "research",
    should_continue,
    {"plan": "plan", END: END}
)
workflow.add_conditional_edges(
    "plan",
    should_continue,
    {"execute": "execute", END: END}
)
workflow.add_conditional_edges(
    "execute",
    should_continue,
    {END: END}
)

workflow.set_entry_point("research")
agent = workflow.compile()
```

Reliability Patterns

Structured Outputs

Force consistent outputs with Pydantic:

```python
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel

class ExtractedData(BaseModel):
    names: list[str]
    dates: list[str]
    amounts: list[float]

parser = PydanticOutputParser(pydantic_object=ExtractedData)

prompt = f"""Extract data from this document:
{document}

{parser.get_format_instructions()}"""

response = await llm.complete(prompt)
data = parser.parse(response)  # Validated ExtractedData object
```

Retry with Backoff

```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60)
)
async def robust_llm_call(prompt: str) -> str:
    return await llm.complete(prompt)
```

Monitoring and Observability

```python
from prometheus_client import Counter, Histogram

llm_requests = Counter(
    'llm_requests_total',
    'Total LLM requests',
    ['provider', 'model', 'status']
)

llm_latency = Histogram(
    'llm_request_duration_seconds',
    'LLM request latency',
    ['provider', 'model']
)

llm_tokens = Counter(
    'llm_tokens_total',
    'Total tokens used',
    ['provider', 'model', 'type']  # 'type' label: prompt or completion
)
```
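Wiring the metrics in is a one-wrapper job: time each call, then increment the counter and observe the histogram with the right labels (with prometheus_client that would be `llm_requests.labels(...).inc()` and `llm_latency.labels(...).observe(...)`). The sketch below substitutes plain dictionaries for the Prometheus objects so it stays dependency-free; `instrumented_complete` and `EchoClient` are illustrative names, not part of our codebase:

```python
import asyncio
import time
from collections import defaultdict

# Dictionary stand-ins for the Counter and Histogram defined above
request_counts: dict[tuple, int] = defaultdict(int)
latencies: dict[tuple, list] = defaultdict(list)

async def instrumented_complete(llm, provider: str, model: str, prompt: str) -> str:
    start = time.perf_counter()
    try:
        result = await llm.complete(prompt)
        request_counts[(provider, model, "success")] += 1
        return result
    except Exception:
        request_counts[(provider, model, "error")] += 1
        raise
    finally:
        # Record latency regardless of outcome
        latencies[(provider, model)].append(time.perf_counter() - start)

class EchoClient:
    """Stub client used to exercise the wrapper."""
    async def complete(self, prompt: str) -> str:
        return prompt.upper()

out = asyncio.run(instrumented_complete(EchoClient(), "openai", "gpt-4", "ping"))
```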

Results

Our LLM integration strategy delivered:

  • 85% task automation accuracy in production
  • 99.5% availability with multi-provider fallbacks
  • 40% cost reduction through caching and tiered models
  • Sub-2s latency for most requests
  • Zero vendor lock-in with abstraction layers

Key Takeaways

  1. Multi-provider is essential: Don't depend on a single LLM provider
  2. Match model to task: Use cheaper models for simple tasks
  3. Cache aggressively: Many prompts repeat; cache the results
  4. Structure your outputs: Pydantic parsers ensure consistency
  5. Monitor everything: Track costs, latency, and success rates

LLMs are powerful tools, but production integration requires careful architecture. The patterns in this article have proven reliable across multiple enterprise deployments.


Building with LLMs? Let's connect on LinkedIn or explore my projects on GitHub.


WRITTEN BY

Anshuman Parmar

Senior Full Stack Developer specializing in AI systems, browser automation, and scalable web applications. Building production-grade solutions that deliver measurable business impact.
