# LLM for Context Compression/Summarization
## Overview
Research on the best LLMs for context compression (summarizing old messages to save tokens).
**Use case**: Compress old conversation history when context gets too long.
---
## Ranking: Performance First
Based on general benchmarks and summarization capability:
| Rank | Model | Provider | Strengths |
|------|-------|----------|-----------|
| 1 | **GPT-4.1** | OpenAI | Best overall reasoning, good summarization |
| 2 | **Claude 4 Sonnet** | Anthropic | Excellent at long-context tasks |
| 3 | **Gemini 2.5 Pro** | Google | Massive context, strong reasoning |
| 4 | **GPT-4o** | OpenAI | Balanced, reliable |
| 5 | **Gemini 2.0 Flash** | Google | Fast + good quality |
| 6 | **Claude 3.5 Sonnet** | Anthropic | Good value, fast |
| 7 | **Llama 3.3 70B** | Meta | Open source, good reasoning |
| 8 | **Qwen 3** | Alibaba | Excellent for coding/summarization |
| 9 | **Mistral Large** | Mistral | European option, fast |
| 10 | **Gemma 3** | Google | Lightweight, free |
**Note**: These rankings are subjective and vary by use case. For summarization specifically, fast models (the Flash variants) often work well.
---
## Ranking: Price First (Cheapest)
Sorted by input cost (per 1M tokens):
### Free Models (OpenRouter)
| Model | Input | Output | Context | Notes |
|-------|-------|--------|---------|-------|
| **stepfun/step-3.5-flash:free** | $0 | $0 | 256K | ✅ Currently using |
| **minimax/minimax-m2.5:free** | $0 | $0 | 196K | Good quality |
| **meta-llama/llama-3.3-70b:free** | $0 | $0 | 128K | Solid |
| **arcee-ai/trinity-mini:free** | $0 | $0 | 131K | Lightweight |
### Paid Models (Cheapest)
| Model | Input | Output | Context | Notes |
|-------|-------|--------|---------|-------|
| **google/gemini-1.5-flash-8b** | $0.0375 | $0.15 | 1M | 🏆 Best cheap |
| **openai/gpt-5-nano** | $0.05 | $0.40 | 200K | Cheap |
| **qwen/qwen3.5-flash-02-23** | $0.065 | $0.26 | 1M | Great context |
| **google/gemini-2.0-flash-lite** | $0.075 | $0.30 | 1M | Fast |
| **openai/gpt-4.1-nano** | $0.10 | $0.40 | 1M | Good |
| **openai/gpt-4o-mini** | $0.15 | $0.60 | 128K | Reliable |
| **anthropic/claude-3-haiku** | $0.25 | $1.25 | 200K | Fast |
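To put these prices in perspective: a single compression call that sends roughly 50K tokens of old history and gets back a ~1K-token summary (illustrative sizes, not measured ones) costs fractions of a cent on the cheapest paid models:

```typescript
// Cost of one compression call, given per-1M-token prices from the table above.
function compressionCost(
  inputTokens: number,
  outputTokens: number,
  inputPricePer1M: number,
  outputPricePer1M: number,
): number {
  return (
    (inputTokens / 1_000_000) * inputPricePer1M +
    (outputTokens / 1_000_000) * outputPricePer1M
  );
}

// google/gemini-1.5-flash-8b: $0.0375 in, $0.15 out
compressionCost(50_000, 1_000, 0.0375, 0.15); // ≈ $0.002

// openai/gpt-4o-mini: $0.15 in, $0.60 out
compressionCost(50_000, 1_000, 0.15, 0.60); // ≈ $0.008
```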
---
## Ranking: Value for Money
Combines performance + price (subjective scoring):
| Rank | Model | Input Cost ($/1M) | Performance | Value Score |
|------|-------|-------------------|-------------|-------------|
| 1 🏆 | **google/gemini-2.0-flash-lite** | $0.075 | 7/10 | ⭐⭐⭐⭐⭐ |
| 2 | **qwen/qwen3.5-flash** | $0.065 | 6/10 | ⭐⭐⭐⭐⭐ |
| 3 | **stepfun/step-3.5-flash:free** | $0 | 5/10 | ⭐⭐⭐⭐⭐ |
| 4 | **minimax/minimax-m2.5:free** | $0 | 5/10 | ⭐⭐⭐⭐ |
| 5 | **openai/gpt-4o-mini** | $0.15 | 8/10 | ⭐⭐⭐⭐ |
| 6 | **google/gemini-1.5-flash-8b** | $0.0375 | 6/10 | ⭐⭐⭐⭐ |
| 7 | **anthropic/claude-3.5-haiku** | $0.40 | 7/10 | ⭐⭐⭐ |
| 8 | **openai/gpt-4.1** | $1.10 | 9/10 | ⭐⭐⭐ |
---
## Recommendation for Context Compression
### For This Project (Kugetsu/Pi)
**Option 1: Free (Current)**
- `stepfun/step-3.5-flash:free` - Works, no cost
- Good enough for simple summarization

**Option 2: Best Value**
- `google/gemini-2.0-flash-lite` - $0.075/M input tokens
- 1M context window
- Fast and reliable

**Option 3: Best Performance**
- `openai/gpt-4.1-nano` - $0.10/M input tokens
- Excellent reasoning for better summaries
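Whichever option is picked, switching models is just a config change: OpenRouter exposes an OpenAI-compatible `/chat/completions` endpoint, so the compression model is selected by its ID string. A minimal sketch (the `OPENROUTER_API_KEY` environment variable name and the error handling are assumptions):

```typescript
// Minimal sketch of one summarization call through OpenRouter.
// Swap COMPRESSION_MODEL to move between the free and paid options above.
const COMPRESSION_MODEL = "google/gemini-2.0-flash-lite"; // or "stepfun/step-3.5-flash:free"

async function summarize(prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`, // assumed env var
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: COMPRESSION_MODEL,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`OpenRouter request failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

This is also the shape of the `summarize` callback used in the compression sketch below.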
---
## How Compression Would Work
A runnable sketch (the `summarize` callback stands in for whatever client calls the compression model, e.g. the OpenRouter call above):

```typescript
// Sketch (not a specific library API): `summarize` is any function that
// sends a prompt to the compression model and returns its reply.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

const KEEP_RECENT = 10; // recent messages kept verbatim

function formatMessages(messages: Message[]): string {
  return messages.map((m) => `${m.role}: ${m.content}`).join("\n");
}

async function compressContext(
  messages: Message[],
  summarize: (prompt: string) => Promise<string>,
): Promise<Message[]> {
  // 1. Take old messages (skip the system prompt, keep the most recent ones)
  const oldMessages = messages.slice(1, -KEEP_RECENT);
  if (oldMessages.length === 0) return messages; // nothing to compress yet

  // 2. Send them to the compression model
  const summary = await summarize(
    `Summarize this conversation concisely:\n\n${formatMessages(oldMessages)}`,
  );

  // 3. Return the summarized context
  return [
    messages[0], // system prompt
    { role: "user", content: `[Previous conversation summarized: ${summary}]` },
    ...messages.slice(-KEEP_RECENT), // recent messages
  ];
}
```
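Compression only needs to run once the context is actually getting long. A simple trigger sketch, using a rough 4-characters-per-token heuristic and an assumed 100K-token budget (both placeholders, not measured values):

```typescript
// Rough trigger: compress when the estimated context size crosses a budget.
// ~4 characters per token is a crude heuristic, not a real tokenizer.
const TOKEN_BUDGET = 100_000; // assumed budget for this sketch

function estimateTokens(messages: Message[]): number {
  return Math.ceil(messages.reduce((sum, m) => sum + m.content.length, 0) / 4);
}

async function maybeCompress(messages: Message[]): Promise<Message[]> {
  if (estimateTokens(messages) < TOKEN_BUDGET) return messages;
  return compressContext(messages, summarize); // summarize from the OpenRouter sketch above
}
```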
---
## Summary
| Priority | Recommended Model | Cost |
|----------|------------------|------|
| **Performance** | GPT-4.1 or Claude 4 Sonnet | $$ |
| **Price** | stepfun/free or Gemini Flash Lite | $0–$0.075 |
| **Value** | Gemini 2.0 Flash Lite | $0.075 |
For this POC, I'd recommend:
- **Free**: Keep using `stepfun/step-3.5-flash:free`
- **Production**: Switch to `google/gemini-2.0-flash-lite` ($0.075/M)