Initial commit: kage-research project files

This commit is contained in:
shokollm
2026-04-09 00:39:52 +00:00
commit 71fc8b4495
19 changed files with 5303 additions and 0 deletions

llm-compression-research.md
# LLM for Context Compression/Summarization
## Overview
Research on the best LLMs for context compression (summarizing old messages to save tokens).

**Use case**: compress old conversation history when the context gets too long.
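A compression trigger needs a token estimate first. A minimal sketch, assuming a rough ~4 characters per token ratio and an illustrative 8,000-token budget (neither is a real project setting):

```typescript
interface Message {
  role: string;
  content: string;
}

// Rough heuristic: ~4 characters per token for English text (assumption).
function estimateTokens(messages: Message[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

// Compress once the estimated history size exceeds the budget.
function needsCompression(messages: Message[], budget = 8000): boolean {
  return estimateTokens(messages) > budget;
}
```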

---

## Ranking: Performance First

Based on general benchmarks and summarization capability:

| Rank | Model | Provider | Strengths |
|------|-------|----------|-----------|
| 1 | **GPT-4.1** | OpenAI | Best overall reasoning, good summarization |
| 2 | **Claude 4 Sonnet** | Anthropic | Excellent at long context tasks |
| 3 | **Gemini 2.5 Pro** | Google | Massive context, strong reasoning |
| 4 | **GPT-4o** | OpenAI | Balanced, reliable |
| 5 | **Gemini 2.0 Flash** | Google | Fast + good quality |
| 6 | **Claude 3.5 Sonnet** | Anthropic | Good value, fast |
| 7 | **Llama 3.3 70B** | Meta | Open source, good reasoning |
| 8 | **Qwen 3** | Alibaba | Excellent for coding/summarization |
| 9 | **Mistral Large** | Mistral | European option, fast |
| 10 | **Gemma 3** | Google | Lightweight, free |

**Note**: performance is subjective and varies by use case. For summarization specifically, fast models (Flash) often work well.

---

## Ranking: Price First (Cheapest)

Sorted by input cost (per 1M tokens):
### Free Models (OpenRouter)
| Model | Input | Output | Context | Notes |
|-------|-------|--------|---------|-------|
| **stepfun/step-3.5-flash:free** | $0 | $0 | 256K | ✅ Currently using |
| **minimax/minimax-m2.5:free** | $0 | $0 | 196K | Good quality |
| **meta-llama/llama-3.3-70b:free** | $0 | $0 | 128K | Solid |
| **arcee-ai/trinity-mini:free** | $0 | $0 | 131K | Lightweight |
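
Free models are sometimes rate-limited or temporarily unavailable, so a fallback chain across the slugs above is worth sketching. A hypothetical helper; `call` stands in for whatever LLM client the project uses:

```typescript
// Free-tier model slugs from the table above, in preference order.
const FREE_MODELS = [
  "stepfun/step-3.5-flash:free",
  "minimax/minimax-m2.5:free",
  "meta-llama/llama-3.3-70b:free",
];

// Try each model in order; fall back to the next one on failure
// (rate limit, outage, etc.). Rethrows the last error if all fail.
async function withFallback<T>(
  models: string[],
  call: (model: string) => Promise<T>
): Promise<T> {
  let lastError: unknown;
  for (const model of models) {
    try {
      return await call(model);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```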
### Paid Models (Cheapest)
| Model | Input | Output | Context | Notes |
|-------|-------|--------|---------|-------|
| **google/gemini-1.5-flash-8b** | $0.0375 | $0.15 | 1M | 🏆 Best cheap |
| **google/gemini-2.0-flash-lite** | $0.075 | $0.30 | 1M | Fast |
| **qwen/qwen3.5-flash-02-23** | $0.065 | $0.26 | 1M | Great context |
| **openai/gpt-5-nano** | $0.05 | $0.40 | 200K | Cheap |
| **openai/gpt-4.1-nano** | $0.10 | $0.40 | 1M | Good |
| **openai/gpt-4o-mini** | $0.15 | $0.60 | 128K | Reliable |
| **anthropic/claude-3-haiku** | $0.25 | $1.25 | 200K | Fast |
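
Prices are per 1M tokens, so the cost of a single compression call is simple arithmetic. For example, summarizing a 50K-token history into ~500 output tokens on `google/gemini-2.0-flash-lite` (prices from the table above; the token counts are illustrative):

```typescript
interface Pricing {
  inputPerM: number;  // $ per 1M input tokens
  outputPerM: number; // $ per 1M output tokens
}

function compressionCost(inputTokens: number, outputTokens: number, p: Pricing): number {
  return (inputTokens / 1e6) * p.inputPerM + (outputTokens / 1e6) * p.outputPerM;
}

const flashLite: Pricing = { inputPerM: 0.075, outputPerM: 0.3 };
const costUsd = compressionCost(50_000, 500, flashLite); // ≈ $0.0039 per compression
```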

---

## Ranking: Value for Money

Combines performance and price (subjective scoring):

| Rank | Model | Input Cost | Performance | Value Score |
|------|-------|------------|-------------|-------------|
| 1 🏆 | **google/gemini-2.0-flash-lite** | $0.075 | 7/10 | ⭐⭐⭐⭐⭐ |
| 2 | **qwen/qwen3.5-flash** | $0.065 | 6/10 | ⭐⭐⭐⭐⭐ |
| 3 | **stepfun/step-3.5-flash:free** | $0 | 5/10 | ⭐⭐⭐⭐⭐ |
| 4 | **minimax/minimax-m2.5:free** | $0 | 5/10 | ⭐⭐⭐⭐ |
| 5 | **openai/gpt-4o-mini** | $0.15 | 8/10 | ⭐⭐⭐⭐ |
| 6 | **google/gemini-1.5-flash-8b** | $0.0375 | 6/10 | ⭐⭐⭐⭐ |
| 7 | **anthropic/claude-3.5-haiku** | $0.40 | 7/10 | ⭐⭐⭐ |
| 8 | **openai/gpt-4.1** | $1.10 | 9/10 | ⭐⭐⭐ |
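
The value scores above are subjective. One way to make them reproducible, offered as an assumption rather than the scoring actually used here, is performance per dollar with a small floor so free models don't divide by zero:

```typescript
// Hypothetical value metric: performance points per dollar of input cost.
// The +0.01 floor is an arbitrary choice to keep free ($0) models finite.
function valueScore(performance: number, inputCostPerM: number): number {
  return performance / (inputCostPerM + 0.01);
}
```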

---

## Recommendation for Context Compression

### For This Project (Kugetsu/Pi)

**Option 1: Free (Current)**
- `stepfun/step-3.5-flash:free` - works, no cost
- Good enough for simple summarization

**Option 2: Best Value**
- `google/gemini-2.0-flash-lite` - $0.075/M tokens
- 1M context window
- Fast and reliable

**Option 3: Best Performance**
- `openai/gpt-4.1-nano` - $0.10/M tokens
- Excellent reasoning for better summaries
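
All three options are OpenRouter slugs, and OpenRouter exposes an OpenAI-compatible chat completions endpoint, so switching between them is a one-line change. A sketch (error handling omitted; the response shape follows the OpenAI chat format):

```typescript
const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";

// Build the request body for a summarization call.
function buildSummarizeRequest(model: string, transcript: string) {
  return {
    model,
    messages: [
      { role: "system", content: "Summarize this conversation concisely." },
      { role: "user", content: transcript },
    ],
  };
}

async function summarize(apiKey: string, model: string, transcript: string): Promise<string> {
  const res = await fetch(OPENROUTER_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildSummarizeRequest(model, transcript)),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```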
---
## How Compression Would Work
```typescript
// Pseudocode for compression: llm.compress and formatMessages are
// placeholders for the project's LLM client and prompt helper.
const KEEP_RECENT = 10; // most recent messages are kept verbatim

async function compressContext(messages: Message[]): Promise<Message[]> {
  // 1. Select the old messages: skip the system prompt, keep the recent tail.
  const oldMessages = messages.slice(1, -KEEP_RECENT);
  if (oldMessages.length === 0) return messages; // nothing to compress yet

  // 2. Send them to the compression model.
  const summary = await llm.compress(`
    Summarize this conversation concisely:
    ${formatMessages(oldMessages)}
  `);

  // 3. Rebuild the context: system prompt + summary + recent messages.
  return [
    messages[0], // system prompt
    { role: "user", content: `[Previous conversation summarized: ${summary}]` },
    ...messages.slice(-KEEP_RECENT), // recent messages, untouched
  ];
}
```
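
The `formatMessages` helper referenced above can be as simple as a plain-text transcript (a sketch; the exact format is not critical for summarization):

```typescript
interface Message {
  role: string;
  content: string;
}

// Render messages as a "role: content" transcript for the prompt.
function formatMessages(messages: Message[]): string {
  return messages.map((m) => `${m.role}: ${m.content}`).join("\n");
}
```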
---
## Summary
| Priority | Recommended Model | Cost |
|----------|------------------|------|
| **Performance** | GPT-4.1 or Claude 4 Sonnet | $$ |
| **Price** | stepfun/free or Gemini Flash Lite | $0-0.075 |
| **Value** | Gemini 2.0 Flash Lite | $0.075 |

For this POC, I'd recommend:
- **Free**: keep using `stepfun/step-3.5-flash:free`
- **Production**: switch to `google/gemini-2.0-flash-lite` ($0.075/M)