LLM for Context Compression/Summarization
Overview
Research on the best LLMs for context compression, i.e. summarizing old messages to save tokens.
Use case: Compress old conversation history when context gets too long.
Ranking: Performance First
Based on general benchmarks and summarization capability:
| Rank | Model | Provider | Strengths |
|---|---|---|---|
| 1 | GPT-4.1 | OpenAI | Best overall reasoning, good summarization |
| 2 | Claude 4 Sonnet | Anthropic | Excellent at long context tasks |
| 3 | Gemini 2.5 Pro | Google | Massive context, strong reasoning |
| 4 | GPT-4o | OpenAI | Balanced, reliable |
| 5 | Gemini 2.0 Flash | Google | Fast + good quality |
| 6 | Claude 3.5 Sonnet | Anthropic | Good value, fast |
| 7 | Llama 3.3 70B | Meta | Open source, good reasoning |
| 8 | Qwen 3 | Alibaba | Excellent for coding/summarization |
| 9 | Mistral Large | Mistral | European option, fast |
| 10 | Gemma 3 | Google | Lightweight, free |
Note: Performance is subjective and varies by use case. For summarization specifically, fast models (Flash) often work well.
Ranking: Price First (Cheapest)
Sorted by input cost (per 1M tokens):
Free Models (OpenRouter)
| Model | Input | Output | Context | Notes |
|---|---|---|---|---|
| stepfun/step-3.5-flash:free | $0 | $0 | 256K | ✅ Currently using |
| minimax/minimax-m2.5:free | $0 | $0 | 196K | Good quality |
| meta-llama/llama-3.3-70b:free | $0 | $0 | 128K | Solid |
| arcee-ai/trinity-mini:free | $0 | $0 | 131K | Lightweight |
Paid Models (Cheapest)
| Model | Input | Output | Context | Notes |
|---|---|---|---|---|
| google/gemini-1.5-flash-8b | $0.0375 | $0.15 | 1M | 🏆 Best cheap |
| google/gemini-2.0-flash-lite | $0.075 | $0.30 | 1M | Fast |
| qwen/qwen3.5-flash-02-23 | $0.065 | $0.26 | 1M | Great context |
| openai/gpt-5-nano | $0.05 | $0.40 | 200K | Cheap |
| openai/gpt-4.1-nano | $0.10 | $0.40 | 1M | Good |
| openai/gpt-4o-mini | $0.15 | $0.60 | 128K | Reliable |
| anthropic/claude-3-haiku | $0.25 | $1.25 | 200K | Fast |
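To make the per-1M-token prices above concrete, here is a small sketch of what one compression call costs. The `ModelPricing` interface and `compressionCost` helper are illustrative names, not part of any API; the rates are taken from the table above.

```typescript
// Hypothetical helper: estimate the cost of a single compression call
// from the per-1M-token prices listed in the pricing table.
interface ModelPricing {
  inputPerM: number;  // USD per 1M input tokens
  outputPerM: number; // USD per 1M output tokens
}

function compressionCost(
  pricing: ModelPricing,
  inputTokens: number,
  outputTokens: number
): number {
  return (
    (inputTokens / 1_000_000) * pricing.inputPerM +
    (outputTokens / 1_000_000) * pricing.outputPerM
  );
}

// Example: summarizing 50K tokens of history into a ~1K-token summary
// with google/gemini-2.0-flash-lite ($0.075 in / $0.30 out):
const flashLite: ModelPricing = { inputPerM: 0.075, outputPerM: 0.3 };
const cost = compressionCost(flashLite, 50_000, 1_000);
// ≈ $0.00405 per compression call
```

At these rates, even compressing every few turns costs well under a cent per call, which is why the cheap Flash-class models rank so high on value.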
Ranking: Value for Money
Combines performance + price (subjective scoring):
| Rank | Model | Input Cost | Performance | Value Score |
|---|---|---|---|---|
| 1 🏆 | google/gemini-2.0-flash-lite | $0.075 | 7/10 | ⭐⭐⭐⭐⭐ |
| 2 | qwen/qwen3.5-flash | $0.065 | 6/10 | ⭐⭐⭐⭐⭐ |
| 3 | stepfun/step-3.5-flash:free | $0 | 5/10 | ⭐⭐⭐⭐⭐ |
| 4 | minimax/minimax-m2.5:free | $0 | 5/10 | ⭐⭐⭐⭐ |
| 5 | openai/gpt-4o-mini | $0.15 | 8/10 | ⭐⭐⭐⭐ |
| 6 | google/gemini-1.5-flash-8b | $0.0375 | 6/10 | ⭐⭐⭐⭐ |
| 7 | anthropic/claude-3.5-haiku | $0.40 | 7/10 | ⭐⭐⭐ |
| 8 | openai/gpt-4.1 | $1.10 | 9/10 | ⭐⭐⭐ |
Recommendation for Context Compression
For This Project (Kugetsu/Pi)
Option 1: Free (Current)
stepfun/step-3.5-flash:free
- Works, no cost
- Good enough for simple summarization
Option 2: Best Value
google/gemini-2.0-flash-lite
- $0.075/M input tokens
- 1M context window
- Fast and reliable
Option 3: Best Performance
openai/gpt-4.1-nano
- $0.10/M input tokens
- Excellent reasoning for better summaries
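All three options are available through OpenRouter's OpenAI-compatible chat completions endpoint, so switching models is a one-line change. A minimal sketch, assuming `OPENROUTER_API_KEY` is set in the environment (the `buildCompressionRequest` helper and the system prompt wording are illustrative, not a fixed API):

```typescript
// Sketch: calling a compression model via OpenRouter.
// Swap COMPRESSION_MODEL for any model id from the tables above.
const COMPRESSION_MODEL = "google/gemini-2.0-flash-lite";

function buildCompressionRequest(history: string) {
  return {
    model: COMPRESSION_MODEL,
    messages: [
      {
        role: "system",
        content:
          "Summarize the conversation below concisely, keeping names, decisions, and open questions.",
      },
      { role: "user", content: history },
    ],
  };
}

async function compress(history: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildCompressionRequest(history)),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Because the request body is standard OpenAI chat-completions shape, moving from the free stepfun model to gemini-2.0-flash-lite only changes the `model` field.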
How Compression Would Work
```typescript
// Sketch of the compression flow. The `summarize` callback stands in
// for whichever compression model from the tables above is used.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

async function compressContext(
  messages: Message[],
  summarize: (text: string) => Promise<string>
): Promise<Message[]> {
  // 1. Take the old messages: skip the system prompt, keep the 10 most recent.
  const oldMessages = messages.slice(1, -10);

  // 2. Ask the compression model for a concise summary.
  const summary = await summarize(
    "Summarize this conversation concisely:\n" +
      oldMessages.map((m) => `${m.role}: ${m.content}`).join("\n")
  );

  // 3. Rebuild the context: system prompt + summary + recent messages.
  return [
    messages[0], // system prompt
    { role: "user", content: `[Previous conversation summarized: ${summary}]` },
    ...messages.slice(-10), // recent messages, kept verbatim
  ];
}
```
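Compression should only run once the history approaches the model's context limit. One simple trigger, using the common rough heuristic of ~4 characters per token (the `shouldCompress` helper and the 80% threshold are assumptions for this sketch, not part of any API):

```typescript
// Hypothetical trigger: compress once the history passes a fraction
// of the context window, estimated at ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function shouldCompress(
  messages: { role: string; content: string }[],
  contextLimit: number,
  threshold = 0.8
): boolean {
  const total = messages.reduce(
    (sum, m) => sum + estimateTokens(m.content),
    0
  );
  return total > contextLimit * threshold;
}

// With a 128K-token model and the default threshold, compression
// kicks in once the history exceeds ~102K estimated tokens.
```

The character heuristic is crude; a real tokenizer (e.g. one matching the compression model) would be more accurate, but for deciding *when* to compress, a rough estimate is usually enough.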
Summary
| Priority | Recommended Model | Cost |
|---|---|---|
| Performance | GPT-4.1 or Claude 4 Sonnet | $$ |
| Price | stepfun/free or Gemini Flash Lite | $0-0.075 |
| Value | Gemini 2.0 Flash Lite | $0.075 |
For this POC, I'd recommend:
- Free: Keep using stepfun/step-3.5-flash:free
- Production: Switch to google/gemini-2.0-flash-lite ($0.075/M input tokens)