kage-research/llm-compression-research.md
2026-04-09 00:39:52 +00:00

LLMs for Context Compression/Summarization

Overview

Research on best LLMs for context compression (summarizing old messages to save tokens).

Use case: Compress old conversation history when context gets too long.


Ranking: Performance First

Based on general benchmarks and summarization capability:

| Rank | Model | Provider | Strengths |
|------|-------|----------|-----------|
| 1 | GPT-4.1 | OpenAI | Best overall reasoning, good summarization |
| 2 | Claude 4 Sonnet | Anthropic | Excellent at long-context tasks |
| 3 | Gemini 2.5 Pro | Google | Massive context, strong reasoning |
| 4 | GPT-4o | OpenAI | Balanced, reliable |
| 5 | Gemini 2.0 Flash | Google | Fast + good quality |
| 6 | Claude 3.5 Sonnet | Anthropic | Good value, fast |
| 7 | Llama 3.3 70B | Meta | Open source, good reasoning |
| 8 | Qwen 3 | Alibaba | Excellent for coding/summarization |
| 9 | Mistral Large | Mistral | European option, fast |
| 10 | Gemma 3 | Google | Lightweight, free |

Note: Performance is subjective and varies by use case. For summarization specifically, fast models (Flash) often work well.


Ranking: Price First (Cheapest)

Sorted by input cost (per 1M tokens):

Free Models (OpenRouter)

| Model | Input | Output | Context | Notes |
|-------|-------|--------|---------|-------|
| stepfun/step-3.5-flash:free | $0 | $0 | 256K | Currently using |
| minimax/minimax-m2.5:free | $0 | $0 | 196K | Good quality |
| meta-llama/llama-3.3-70b:free | $0 | $0 | 128K | Solid |
| arcee-ai/trinity-mini:free | $0 | $0 | 131K | Lightweight |

Paid Models (Cheapest)

| Model | Input | Output | Context | Notes |
|-------|-------|--------|---------|-------|
| google/gemini-1.5-flash-8b | $0.0375 | $0.15 | 1M | 🏆 Best cheap |
| openai/gpt-5-nano | $0.05 | $0.40 | 200K | Cheap |
| qwen/qwen3.5-flash-02-23 | $0.065 | $0.26 | 1M | Great context |
| google/gemini-2.0-flash-lite | $0.075 | $0.30 | 1M | Fast |
| openai/gpt-4.1-nano | $0.10 | $0.40 | 1M | Good |
| openai/gpt-4o-mini | $0.15 | $0.60 | 128K | Reliable |
| anthropic/claude-3-haiku | $0.25 | $1.25 | 200K | Fast |
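
The per-1M-token prices above translate into a concrete per-compression cost. A minimal sketch, using the gemini-2.0-flash-lite rates from the table above (`compressionCostUSD` is an illustrative helper, not an existing API):

```typescript
// Rough cost in USD of one compression call, given per-1M-token rates.
function compressionCostUSD(
  inputTokens: number,
  outputTokens: number,
  inputPerM: number,
  outputPerM: number,
): number {
  return (
    (inputTokens / 1_000_000) * inputPerM +
    (outputTokens / 1_000_000) * outputPerM
  );
}

// Example: summarize 50K tokens of history into a 1K-token summary
// with gemini-2.0-flash-lite ($0.075 in / $0.30 out):
const cost = compressionCostUSD(50_000, 1_000, 0.075, 0.3);
// ≈ $0.004 per compression
```

At roughly $0.004 per call, even frequent compression is negligible next to the cost of repeatedly resending uncompressed history.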

Ranking: Value for Money

Ranked by value for money (performance relative to price); the performance score is subjective:

| Rank | Model | Input Cost | Performance |
|------|-------|------------|-------------|
| 1 🏆 | google/gemini-2.0-flash-lite | $0.075 | 7/10 |
| 2 | qwen/qwen3.5-flash | $0.065 | 6/10 |
| 3 | stepfun/step-3.5-flash:free | $0 | 5/10 |
| 4 | minimax/minimax-m2.5:free | $0 | 5/10 |
| 5 | openai/gpt-4o-mini | $0.15 | 8/10 |
| 6 | google/gemini-1.5-flash-8b | $0.0375 | 6/10 |
| 7 | anthropic/claude-3.5-haiku | $0.40 | 7/10 |
| 8 | openai/gpt-4.1 | $1.10 | 9/10 |

Recommendation for Context Compression

For This Project (Kugetsu/Pi)

Option 1: Free (Current)

  • stepfun/step-3.5-flash:free - Works, no cost
  • Good enough for simple summarization

Option 2: Best Value

  • google/gemini-2.0-flash-lite - $0.075/M input tokens
  • 1M context window
  • Fast and reliable

Option 3: Best Performance

  • openai/gpt-4.1-nano - $0.10/M input tokens
  • Excellent reasoning for better summaries

How Compression Would Work

```typescript
// Pseudocode for compression — llm.compress and formatMessages are placeholders
interface Message { role: string; content: string }

async function compressContext(messages: Message[]): Promise<Message[]> {
  // 1. Take old messages: skip the system prompt, keep the 10 most recent
  const oldMessages = messages.slice(1, -10);
  if (oldMessages.length === 0) return messages; // nothing old enough to compress

  // 2. Send the old history to the compression model
  const summary = await llm.compress(`
    Summarize this conversation concisely:
    ${formatMessages(oldMessages)}
  `);

  // 3. Rebuild the context: system prompt + summary + recent messages
  return [
    messages[0], // system prompt
    { role: "user", content: `[Previous conversation summarized: ${summary}]` },
    ...messages.slice(-10), // 10 most recent messages
  ];
}
```
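
A related question is when to run compression at all. A minimal sketch of a trigger, assuming a rough chars-per-token heuristic (`estimateTokens`, `shouldCompress`, and the 100K budget are illustrative — a real implementation would count with the model's tokenizer):

```typescript
interface Message { role: string; content: string }

// Rough token estimate: ~4 characters per token for English text.
// Not a real tokenizer — just cheap enough to run on every turn.
function estimateTokens(messages: Message[]): number {
  const chars = messages.reduce((n, m) => n + m.content.length, 0);
  return Math.ceil(chars / 4);
}

function shouldCompress(messages: Message[], budget = 100_000): boolean {
  // Only compress when over budget AND there is history beyond the
  // system prompt + 10-message recent window that compressContext keeps.
  return estimateTokens(messages) > budget && messages.length > 11;
}
```

Checking `shouldCompress` before each model call keeps compression cheap in the common case: it only pays the LLM cost once the budget is actually exceeded.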

Summary

| Priority | Recommended Model | Cost |
|----------|-------------------|------|
| Performance | GPT-4.1 or Claude 4 Sonnet | $$ |
| Price | stepfun/free or Gemini Flash Lite | $0-0.075 |
| Value | Gemini 2.0 Flash Lite | $0.075 |

For this POC, I'd recommend:

  • Free: Keep using stepfun/step-3.5-flash:free
  • Production: Switch to google/gemini-2.0-flash-lite ($0.075/M)
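
Because OpenRouter exposes an OpenAI-compatible chat completions endpoint, switching between the free and paid model is a one-line config change. A minimal sketch (the `COMPRESSION_MODEL` env var and the `buildCompressionRequest`/`summarize` helpers are illustrative names, not part of any existing codebase):

```typescript
// Model slug is config, so free → paid is an env var change, not a code change.
const MODEL = process.env.COMPRESSION_MODEL ?? "stepfun/step-3.5-flash:free";

// Pure helper that builds the chat completions request body.
function buildCompressionRequest(model: string, history: string) {
  return {
    model,
    messages: [
      { role: "user", content: `Summarize this conversation concisely:\n${history}` },
    ],
  };
}

// Call OpenRouter's OpenAI-compatible endpoint and return the summary text.
async function summarize(history: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildCompressionRequest(MODEL, history)),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Keeping the request builder pure also makes it easy to unit-test the prompt without hitting the network.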