# LLM for Context Compression/Summarization

## Overview

Research on the best LLMs for context compression (summarizing old messages to save tokens).

**Use case**: Compress old conversation history when the context gets too long.

---

## Ranking: Performance First

Based on general benchmarks and summarization capability:

| Rank | Model | Provider | Strengths |
|------|-------|----------|-----------|
| 1 | **GPT-4.1** | OpenAI | Best overall reasoning, good summarization |
| 2 | **Claude 4 Sonnet** | Anthropic | Excellent at long-context tasks |
| 3 | **Gemini 2.5 Pro** | Google | Massive context, strong reasoning |
| 4 | **GPT-4o** | OpenAI | Balanced, reliable |
| 5 | **Gemini 2.0 Flash** | Google | Fast + good quality |
| 6 | **Claude 3.5 Sonnet** | Anthropic | Good value, fast |
| 7 | **Llama 3.3 70B** | Meta | Open source, good reasoning |
| 8 | **Qwen 3** | Alibaba | Excellent for coding/summarization |
| 9 | **Mistral Large** | Mistral | European option, fast |
| 10 | **Gemma 3** | Google | Lightweight, free |

**Note**: These rankings are subjective and vary by use case. For summarization specifically, fast models (the Flash tier) often work well.

---

## Ranking: Price First (Cheapest)

Sorted by input cost (per 1M tokens):

### Free Models (OpenRouter)

| Model | Input | Output | Context | Notes |
|-------|-------|--------|---------|-------|
| **stepfun/step-3.5-flash:free** | $0 | $0 | 256K | ✅ Currently using |
| **minimax/minimax-m2.5:free** | $0 | $0 | 196K | Good quality |
| **meta-llama/llama-3.3-70b:free** | $0 | $0 | 128K | Solid |
| **arcee-ai/trinity-mini:free** | $0 | $0 | 131K | Lightweight |

### Paid Models (Cheapest)

| Model | Input ($/M) | Output ($/M) | Context | Notes |
|-------|-------------|--------------|---------|-------|
| **google/gemini-1.5-flash-8b** | $0.0375 | $0.15 | 1M | 🏆 Best cheap pick |
| **openai/gpt-5-nano** | $0.05 | $0.40 | 200K | Cheap |
| **qwen/qwen3.5-flash-02-23** | $0.065 | $0.26 | 1M | Great context |
| **google/gemini-2.0-flash-lite** | $0.075 | $0.30 | 1M | Fast |
| **openai/gpt-4.1-nano** | $0.10 | $0.40 | 1M | Good |
| **openai/gpt-4o-mini** | $0.15 | $0.60 | 128K | Reliable |
| **anthropic/claude-3-haiku** | $0.25 | $1.25 | 200K | Fast |

---

## Ranking: Value for Money

Combines performance + price (subjective scoring):

| Rank | Model | Input ($/M) | Performance | Value Score |
|------|-------|-------------|-------------|-------------|
| 1 🏆 | **google/gemini-2.0-flash-lite** | $0.075 | 7/10 | ⭐⭐⭐⭐⭐ |
| 2 | **qwen/qwen3.5-flash** | $0.065 | 6/10 | ⭐⭐⭐⭐⭐ |
| 3 | **stepfun/step-3.5-flash:free** | $0 | 5/10 | ⭐⭐⭐⭐⭐ |
| 4 | **minimax/minimax-m2.5:free** | $0 | 5/10 | ⭐⭐⭐⭐ |
| 5 | **openai/gpt-4o-mini** | $0.15 | 8/10 | ⭐⭐⭐⭐ |
| 6 | **google/gemini-1.5-flash-8b** | $0.0375 | 6/10 | ⭐⭐⭐⭐ |
| 7 | **anthropic/claude-3.5-haiku** | $0.40 | 7/10 | ⭐⭐⭐ |
| 8 | **openai/gpt-4.1** | $1.10 | 9/10 | ⭐⭐⭐ |

---

## Recommendation for Context Compression

### For This Project (Kugetsu/Pi)

**Option 1: Free (Current)**
- `stepfun/step-3.5-flash:free`
- Works, no cost
- Good enough for simple summarization

**Option 2: Best Value**
- `google/gemini-2.0-flash-lite`
- $0.075/M input tokens
- 1M context window
- Fast and reliable

**Option 3: Best Performance**
- `openai/gpt-4.1-nano`
- $0.10/M input tokens
- Excellent reasoning for better summaries

---

## How Compression Would Work

The idea: keep the system prompt and the most recent messages verbatim, and replace everything older with a single summary message.

```typescript
// Compression sketch: keep system prompt + recent tail, summarize the middle.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Render messages as plain text for the summarization prompt.
function formatMessages(messages: Message[]): string {
  return messages.map((m) => `${m.role}: ${m.content}`).join("\n");
}

// `summarize` is whatever client wraps the compression model
// (see the OpenRouter sketch at the end of this doc).
async function compressContext(
  messages: Message[],
  summarize: (prompt: string) => Promise<string>
): Promise<Message[]> {
  const RECENT = 10; // how many recent messages to keep verbatim

  // Too short to be worth compressing.
  if (messages.length <= RECENT + 1) return messages;

  // 1. Take old messages: skip the system prompt, keep the recent tail.
  const oldMessages = messages.slice(1, -RECENT);

  // 2. Send them to the compression model.
  const summary = await summarize(
    `Summarize this conversation concisely:\n\n${formatMessages(oldMessages)}`
  );

  // 3. Return the summarized context.
  return [
    messages[0], // system prompt, untouched
    { role: "user", content: `[Previous conversation summarized: ${summary}]` },
    ...messages.slice(-RECENT), // recent messages, verbatim
  ];
}
```
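Compression only needs to run when the history is actually approaching the limit. Below is a minimal trigger check (reusing the `Message` type above); the 4-characters-per-token ratio and the 0.8 headroom factor are rough assumptions, not measured values:

```typescript
// Rough token estimate: ~4 characters per token is a common heuristic
// for English text; swap in a real tokenizer for accuracy.
const APPROX_CHARS_PER_TOKEN = 4;

function estimateTokens(messages: Message[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / APPROX_CHARS_PER_TOKEN);
}

// Trigger compression once the history uses ~80% of the model's context
// window, leaving headroom for the model's reply.
function shouldCompress(messages: Message[], contextLimit = 128_000): boolean {
  return estimateTokens(messages) > contextLimit * 0.8;
}
```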
---

## Summary

| Priority | Recommended Model | Cost |
|----------|------------------|------|
| **Performance** | GPT-4.1 or Claude 4 Sonnet | $$ |
| **Price** | stepfun/free or Gemini Flash Lite | $0-$0.075/M |
| **Value** | Gemini 2.0 Flash Lite | $0.075/M |

For this POC, I'd recommend:

- **Free**: Keep using `stepfun/step-3.5-flash:free`
- **Production**: Switch to `google/gemini-2.0-flash-lite` ($0.075/M input)
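Since both options are served through OpenRouter's OpenAI-compatible API, switching models can be a config change rather than a code change. A minimal sketch of the `summarize` function used above, assuming an `OPENROUTER_API_KEY` environment variable and a hypothetical `COMPRESSION_MODEL` override:

```typescript
// Minimal OpenRouter client for the compression step. The model string
// is the only thing that changes between the free and paid options.
const COMPRESSION_MODEL =
  process.env.COMPRESSION_MODEL ?? "stepfun/step-3.5-flash:free";

async function summarize(prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: COMPRESSION_MODEL,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`Compression request failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```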