Conversational Chatbot — CareerAlign GenAI Architectures

1

Architecture Overview

The Conversational Chatbot extends the Simple Chat API by adding memory. Instead of treating each message independently, it maintains a history of the conversation and sends previous turns along with each new request. This enables the LLM to understand references like "it", "that", and "as I mentioned".

When to Use

Customer support chatbots that need to track issue context
Interactive tutoring or coaching systems
Any application where users expect follow-up questions to work
Internal knowledge assistants with extended dialogues

Complexity Level

Low-Medium. The core pattern is simple (append messages to an array), but memory management becomes critical as conversations grow. You need strategies for context window limits and session persistence.

Key Insight

The hardest part of building a chatbot is not the LLM call — it is managing memory efficiently. A conversation that exceeds the context window will either fail or lose important context.

2

Architecture Diagram

Architecture diagram — Conversational Chatbot with session store and memory loop

3

Components Deep Dive

📚

Window Buffer Memory

Keep the last N message pairs (e.g., 10 turns). Simple, predictable token usage. Older context is dropped entirely. Best for short, focused conversations.

📝

Summary Memory

Periodically summarize older messages into a condensed form. Keeps key context while reducing tokens. Use the LLM itself to generate the running summary.

🔀

Hybrid Memory

Combine summary of old turns + full recent turns. Best of both worlds: preserves long-term context while keeping recent detail. Most production chatbots use this pattern.

🔑

Session Management

Each conversation gets a unique session ID. Map session IDs to message histories in your store. Handle session creation, expiry, and cleanup.

🗃

Storage Backend

Redis for fast, ephemeral sessions. PostgreSQL for persistent history. DynamoDB for serverless scale. In-memory dict for prototyping only.

✂

Context Truncation

When conversation exceeds context window, truncate strategically. Always keep the system prompt and most recent messages. Never silently fail on context overflow.

4

Implementation

Window Buffer Chatbot

import anthropic

client = anthropic.Anthropic()

class ChatBot:
    def __init__(self, system_prompt, max_history=20):
        self.system = system_prompt
        self.max_history = max_history
        self.messages = []  # list of {role, content} dicts

    def chat(self, user_message: str) -> str:
        # Add user message
        self.messages.append({"role": "user", "content": user_message})

        # Truncate to window
        if len(self.messages) > self.max_history:
            self.messages = self.messages[-self.max_history:]

        # Call LLM with full history
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system=self.system,
            messages=self.messages,
        )
        assistant_msg = response.content[0].text

        # Store assistant reply
        self.messages.append({"role": "assistant", "content": assistant_msg})
        return assistant_msg

Session-Based with Redis

import json, redis, uuid

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def create_session() -> str:
    session_id = str(uuid.uuid4())
    r.setex(session_id, 3600, json.dumps([]))  # 1hr TTL
    return session_id

def chat_with_session(session_id: str, user_msg: str) -> str:
    # Load history
    history = json.loads(r.get(session_id) or "[]")
    history.append({"role": "user", "content": user_msg})

    # Call LLM
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=history[-20:],  # window of last 20
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})

    # Save back
    r.setex(session_id, 3600, json.dumps(history))
    return reply

5

Data Flow

Step-by-step flow for each conversation turn:

1. User sends message — Includes session ID in request header or body
2. Load session history — Retrieve previous messages from session store using session ID
3. Apply memory strategy — Truncate to window, summarize old turns, or hybrid approach
4. Build messages array — System prompt + trimmed history + new user message
5. Call LLM — Send assembled messages to the model
6. Save both turns — Store user message + assistant reply back to session store
7. Return response — Stream or return complete text to the user

6

Trade-offs & Considerations

Memory Strategy	Pros	Cons
Window Buffer	Simple, predictable token cost	Loses early context entirely
Summary Memory	Preserves key context long-term	Extra LLM call to summarize, may lose detail
Hybrid	Best balance of context and cost	More complex to implement
Full History	Never loses context	Hits context window limit, expensive

Watch Out

Token costs scale linearly with conversation length. A 50-turn conversation sends all 50 turns with every request. This is the #1 cost trap in chatbot architectures.

7

Production Checklist

Session store with TTL and automatic cleanup (Redis, DynamoDB)
Token counting before sending to detect context window overflow
Graceful degradation when history is truncated (inform the user)
Session authentication — users can only access their own sessions
Conversation export for user data portability
Memory strategy selection based on conversation type
Concurrent request handling per session (queue or lock)
Analytics: conversation length distribution, drop-off turn number