# LLM for Context Compression/Summarization
## Overview
Research on the best LLMs for context compression (summarizing old messages to save tokens).
**Use case**: Compress old conversation history when context gets too long.
---
## Ranking: Performance First
Based on general benchmarks and summarization capability:
| Rank | Model | Provider | Strengths |
|------|-------|----------|-----------|
| 1 | **GPT-4.1** | OpenAI | Best overall reasoning, good summarization |
| 2 | **Claude 4 Sonnet** | Anthropic | Excellent at long-context tasks |
| 3 | **Gemini 2.5 Pro** | Google | Massive context, strong reasoning |
| 4 | **GPT-4o** | OpenAI | Balanced, reliable |
| 5 | **Gemini 2.0 Flash** | Google | Fast + good quality |
| 6 | **Claude 3.5 Sonnet** | Anthropic | Good value, fast |
| 7 | **Llama 3.3 70B** | Meta | Open source, good reasoning |
| 8 | **Qwen 3** | Alibaba | Excellent for coding/summarization |
| 9 | **Mistral Large** | Mistral | European option, fast |
| 10 | **Gemma 3** | Google | Lightweight, free |
**Note**: These rankings are subjective and vary by use case. For summarization specifically, fast models (the Flash variants) often work well.
---
## Ranking: Price First (Cheapest)
Sorted by input cost (per 1M tokens):
### Free Models (OpenRouter)
| Model | Input | Output | Context | Notes |
|-------|-------|--------|---------|-------|
| **stepfun/step-3.5-flash:free** | $0 | $0 | 256K | ✅ Currently using |
| **minimax/minimax-m2.5:free** | $0 | $0 | 196K | Good quality |
| **meta-llama/llama-3.3-70b:free** | $0 | $0 | 128K | Solid |
| **arcee-ai/trinity-mini:free** | $0 | $0 | 131K | Lightweight |
### Paid Models (Cheapest)
| Model | Input | Output | Context | Notes |
|-------|-------|--------|---------|-------|
| **google/gemini-1.5-flash-8b** | $0.0375 | $0.15 | 1M | 🏆 Best cheap |
| **openai/gpt-5-nano** | $0.05 | $0.40 | 200K | Cheap |
| **qwen/qwen3.5-flash-02-23** | $0.065 | $0.26 | 1M | Great context |
| **google/gemini-2.0-flash-lite** | $0.075 | $0.30 | 1M | Fast |
| **openai/gpt-4.1-nano** | $0.10 | $0.40 | 1M | Good |
| **openai/gpt-4o-mini** | $0.15 | $0.60 | 128K | Reliable |
| **anthropic/claude-3-haiku** | $0.25 | $1.25 | 200K | Fast |
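To put these prices in perspective: a single compression call that sends roughly 50K tokens of old history and gets back a ~1K-token summary (illustrative sizes, not measured ones) costs fractions of a cent on the cheapest paid models:

```typescript
// Cost of one compression call, given per-1M-token prices from the table above.
function compressionCost(
  inputTokens: number,
  outputTokens: number,
  inputPricePer1M: number,
  outputPricePer1M: number,
): number {
  return (
    (inputTokens / 1_000_000) * inputPricePer1M +
    (outputTokens / 1_000_000) * outputPricePer1M
  );
}

// google/gemini-1.5-flash-8b: $0.0375 in, $0.15 out
compressionCost(50_000, 1_000, 0.0375, 0.15); // ≈ $0.002

// openai/gpt-4o-mini: $0.15 in, $0.60 out
compressionCost(50_000, 1_000, 0.15, 0.60); // ≈ $0.008
```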
---
## Ranking: Value for Money
Combines performance + price (subjective scoring):
| Rank | Model | Input Cost ($/1M) | Performance | Value Score |
|------|-------|-------------------|-------------|-------------|
| 1 🏆 | **google/gemini-2.0-flash-lite** | $0.075 | 7/10 | ⭐⭐⭐⭐⭐ |
| 2 | **qwen/qwen3.5-flash** | $0.065 | 6/10 | ⭐⭐⭐⭐⭐ |
| 3 | **stepfun/step-3.5-flash:free** | $0 | 5/10 | ⭐⭐⭐⭐⭐ |
| 4 | **minimax/minimax-m2.5:free** | $0 | 5/10 | ⭐⭐⭐⭐ |
| 5 | **openai/gpt-4o-mini** | $0.15 | 8/10 | ⭐⭐⭐⭐ |
| 6 | **google/gemini-1.5-flash-8b** | $0.0375 | 6/10 | ⭐⭐⭐⭐ |
| 7 | **anthropic/claude-3.5-haiku** | $0.40 | 7/10 | ⭐⭐⭐ |
| 8 | **openai/gpt-4.1** | $1.10 | 9/10 | ⭐⭐⭐ |
---
## Recommendation for Context Compression
### For This Project (Kugetsu/Pi)
**Option 1: Free (Current)**
- `stepfun/step-3.5-flash:free` - Works, no cost
- Good enough for simple summarization

**Option 2: Best Value**
- `google/gemini-2.0-flash-lite` - $0.075/M input tokens
- 1M context window
- Fast and reliable

**Option 3: Best Performance**
- `openai/gpt-4.1-nano` - $0.10/M input tokens
- Excellent reasoning for better summaries
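Whichever option is picked, switching models is just a config change: OpenRouter exposes an OpenAI-compatible `/chat/completions` endpoint, so the compression model is selected by its ID string. A minimal sketch (the `OPENROUTER_API_KEY` environment variable name and the error handling are assumptions):

```typescript
// Minimal sketch of one summarization call through OpenRouter.
// Swap COMPRESSION_MODEL to move between the free and paid options above.
const COMPRESSION_MODEL = "google/gemini-2.0-flash-lite"; // or "stepfun/step-3.5-flash:free"

async function summarize(prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`, // assumed env var
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: COMPRESSION_MODEL,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`OpenRouter request failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

This is also the shape of the `summarize` callback used in the compression sketch below.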
---
## How Compression Would Work
A runnable sketch (the `summarize` callback stands in for whatever client calls the compression model, e.g. the OpenRouter call above):

```typescript
// Sketch (not a specific library API): `summarize` is any function that
// sends a prompt to the compression model and returns its reply.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

const KEEP_RECENT = 10; // recent messages kept verbatim

function formatMessages(messages: Message[]): string {
  return messages.map((m) => `${m.role}: ${m.content}`).join("\n");
}

async function compressContext(
  messages: Message[],
  summarize: (prompt: string) => Promise<string>,
): Promise<Message[]> {
  // 1. Take old messages (skip the system prompt, keep the most recent ones)
  const oldMessages = messages.slice(1, -KEEP_RECENT);
  if (oldMessages.length === 0) return messages; // nothing to compress yet

  // 2. Send them to the compression model
  const summary = await summarize(
    `Summarize this conversation concisely:\n\n${formatMessages(oldMessages)}`,
  );

  // 3. Return the summarized context
  return [
    messages[0], // system prompt
    { role: "user", content: `[Previous conversation summarized: ${summary}]` },
    ...messages.slice(-KEEP_RECENT), // recent messages
  ];
}
```
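Compression only needs to run once the context is actually getting long. A simple trigger sketch, using a rough 4-characters-per-token heuristic and an assumed 100K-token budget (both placeholders, not measured values):

```typescript
// Rough trigger: compress when the estimated context size crosses a budget.
// ~4 characters per token is a crude heuristic, not a real tokenizer.
const TOKEN_BUDGET = 100_000; // assumed budget for this sketch

function estimateTokens(messages: Message[]): number {
  return Math.ceil(messages.reduce((sum, m) => sum + m.content.length, 0) / 4);
}

async function maybeCompress(messages: Message[]): Promise<Message[]> {
  if (estimateTokens(messages) < TOKEN_BUDGET) return messages;
  return compressContext(messages, summarize); // summarize from the OpenRouter sketch above
}
```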
---
## Summary
| Priority | Recommended Model | Cost |
|----------|------------------|------|
| **Performance** | GPT-4.1 or Claude 4 Sonnet | $$ |
| **Price** | stepfun/free or Gemini Flash Lite | $0–$0.075 |
| **Value** | Gemini 2.0 Flash Lite | $0.075 |
For this POC, I'd recommend:
- **Free**: Keep using `stepfun/step-3.5-flash:free`
- **Production**: Switch to `google/gemini-2.0-flash-lite` ($0.075/M)