LLM for Context Compression/Summarization
Overview
Research on the best LLMs for context compression, i.e. summarizing old messages to save tokens.
Use case: Compress old conversation history when context gets too long.
Ranking: Performance First
Based on general benchmarks and summarization capability:
| Rank | Model | Provider | Strengths |
|---|---|---|---|
| 1 | GPT-4.1 | OpenAI | Best overall reasoning, good summarization |
| 2 | Claude 4 Sonnet | Anthropic | Excellent at long context tasks |
| 3 | Gemini 2.5 Pro | Google | Massive context, strong reasoning |
| 4 | GPT-4o | OpenAI | Balanced, reliable |
| 5 | Gemini 2.0 Flash | Google | Fast + good quality |
| 6 | Claude 3.5 Sonnet | Anthropic | Good value, fast |
| 7 | Llama 3.3 70B | Meta | Open source, good reasoning |
| 8 | Qwen 3 | Alibaba | Excellent for coding/summarization |
| 9 | Mistral Large | Mistral | European option, fast |
| 10 | Gemma 3 | Google | Lightweight, free |
Note: Performance is subjective and varies by use case. For summarization specifically, fast models (Flash) often work well.
Ranking: Price First (Cheapest)
Sorted by input cost (per 1M tokens):
Free Models (OpenRouter)
| Model | Input | Output | Context | Notes |
|---|---|---|---|---|
| stepfun/step-3.5-flash:free | $0 | $0 | 256K | ✅ Currently using |
| minimax/minimax-m2.5:free | $0 | $0 | 196K | Good quality |
| meta-llama/llama-3.3-70b:free | $0 | $0 | 128K | Solid |
| arcee-ai/trinity-mini:free | $0 | $0 | 131K | Lightweight |
Paid Models (Cheapest)
| Model | Input | Output | Context | Notes |
|---|---|---|---|---|
| google/gemini-1.5-flash-8b | $0.0375 | $0.15 | 1M | 🏆 Best cheap |
| google/gemini-2.0-flash-lite | $0.075 | $0.30 | 1M | Fast |
| qwen/qwen3.5-flash-02-23 | $0.065 | $0.26 | 1M | Great context |
| openai/gpt-5-nano | $0.05 | $0.40 | 200K | Cheap |
| openai/gpt-4.1-nano | $0.10 | $0.40 | 1M | Good |
| openai/gpt-4o-mini | $0.15 | $0.60 | 128K | Reliable |
| anthropic/claude-3-haiku | $0.25 | $1.25 | 200K | Fast |
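To make the per-1M-token prices above concrete, here is a small sketch of what one compression call costs. The `ModelPricing` interface and `compressionCost` helper are illustrative names, not part of any API; the rates are taken from the table above.

```typescript
// Hypothetical helper: estimate the cost of a single compression call
// from the per-1M-token prices listed in the pricing table.
interface ModelPricing {
  inputPerM: number;  // USD per 1M input tokens
  outputPerM: number; // USD per 1M output tokens
}

function compressionCost(
  pricing: ModelPricing,
  inputTokens: number,
  outputTokens: number
): number {
  return (
    (inputTokens / 1_000_000) * pricing.inputPerM +
    (outputTokens / 1_000_000) * pricing.outputPerM
  );
}

// Example: summarizing 50K tokens of history into a ~1K-token summary
// with google/gemini-2.0-flash-lite ($0.075 in / $0.30 out):
const flashLite: ModelPricing = { inputPerM: 0.075, outputPerM: 0.3 };
const cost = compressionCost(flashLite, 50_000, 1_000);
// ≈ $0.00405 per compression call
```

At these rates, even compressing every few turns costs well under a cent per call, which is why the cheap Flash-class models rank so high on value.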
Ranking: Value for Money
Combines performance + price (subjective scoring):
| Rank | Model | Input Cost | Performance | Value Score |
|---|---|---|---|---|
| 1 🏆 | google/gemini-2.0-flash-lite | $0.075 | 7/10 | ⭐⭐⭐⭐⭐ |
| 2 | qwen/qwen3.5-flash | $0.065 | 6/10 | ⭐⭐⭐⭐⭐ |
| 3 | stepfun/step-3.5-flash:free | $0 | 5/10 | ⭐⭐⭐⭐⭐ |
| 4 | minimax/minimax-m2.5:free | $0 | 5/10 | ⭐⭐⭐⭐ |
| 5 | openai/gpt-4o-mini | $0.15 | 8/10 | ⭐⭐⭐⭐ |
| 6 | google/gemini-1.5-flash-8b | $0.0375 | 6/10 | ⭐⭐⭐⭐ |
| 7 | anthropic/claude-3.5-haiku | $0.40 | 7/10 | ⭐⭐⭐ |
| 8 | openai/gpt-4.1 | $1.10 | 9/10 | ⭐⭐⭐ |
Recommendation for Context Compression
For This Project (Kugetsu/Pi)
Option 1: Free (Current)
stepfun/step-3.5-flash:free
- Works, no cost
- Good enough for simple summarization
Option 2: Best Value
google/gemini-2.0-flash-lite
- $0.075/M input tokens
- 1M context window
- Fast and reliable
Option 3: Best Performance
openai/gpt-4.1-nano
- $0.10/M input tokens
- Excellent reasoning for better summaries
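All three options are available through OpenRouter's OpenAI-compatible chat completions endpoint, so switching models is a one-line change. A minimal sketch, assuming `OPENROUTER_API_KEY` is set in the environment (the `buildCompressionRequest` helper and the system prompt wording are illustrative, not a fixed API):

```typescript
// Sketch: calling a compression model via OpenRouter.
// Swap COMPRESSION_MODEL for any model id from the tables above.
const COMPRESSION_MODEL = "google/gemini-2.0-flash-lite";

function buildCompressionRequest(history: string) {
  return {
    model: COMPRESSION_MODEL,
    messages: [
      {
        role: "system",
        content:
          "Summarize the conversation below concisely, keeping names, decisions, and open questions.",
      },
      { role: "user", content: history },
    ],
  };
}

async function compress(history: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildCompressionRequest(history)),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Because the request body is standard OpenAI chat-completions shape, moving from the free stepfun model to gemini-2.0-flash-lite only changes the `model` field.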
How Compression Would Work
```typescript
// Sketch of the compression flow. The `summarize` callback stands in
// for whichever compression model from the tables above is used.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

async function compressContext(
  messages: Message[],
  summarize: (text: string) => Promise<string>
): Promise<Message[]> {
  // 1. Take the old messages: skip the system prompt, keep the 10 most recent.
  const oldMessages = messages.slice(1, -10);

  // 2. Ask the compression model for a concise summary.
  const summary = await summarize(
    "Summarize this conversation concisely:\n" +
      oldMessages.map((m) => `${m.role}: ${m.content}`).join("\n")
  );

  // 3. Rebuild the context: system prompt + summary + recent messages.
  return [
    messages[0], // system prompt
    { role: "user", content: `[Previous conversation summarized: ${summary}]` },
    ...messages.slice(-10), // recent messages, kept verbatim
  ];
}
```
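Compression should only run once the history approaches the model's context limit. One simple trigger, using the common rough heuristic of ~4 characters per token (the `shouldCompress` helper and the 80% threshold are assumptions for this sketch, not part of any API):

```typescript
// Hypothetical trigger: compress once the history passes a fraction
// of the context window, estimated at ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function shouldCompress(
  messages: { role: string; content: string }[],
  contextLimit: number,
  threshold = 0.8
): boolean {
  const total = messages.reduce(
    (sum, m) => sum + estimateTokens(m.content),
    0
  );
  return total > contextLimit * threshold;
}

// With a 128K-token model and the default threshold, compression
// kicks in once the history exceeds ~102K estimated tokens.
```

The character heuristic is crude; a real tokenizer (e.g. one matching the compression model) would be more accurate, but for deciding *when* to compress, a rough estimate is usually enough.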
Summary
| Priority | Recommended Model | Cost |
|---|---|---|
| Performance | GPT-4.1 or Claude 4 Sonnet | $$ |
| Price | stepfun/free or Gemini Flash Lite | $0-0.075 |
| Value | Gemini 2.0 Flash Lite | $0.075 |
For this POC, I'd recommend:
- Free: Keep using stepfun/step-3.5-flash:free
- Production: Switch to google/gemini-2.0-flash-lite ($0.075/M input tokens)