- Hermes Detective Agency: Open-ended mystery investigation game - Roles: Chief (human), Witness (Kimi), Detective (Hermes) - 5 difficulty levels, community cases, open-ended solving - Scoring: Alignment %, Evidence %, Time - Features: Retry, Journal, Observe mode - Tech: Kimi Vision + Hermes Agent + Pollinations Changelog: - Research phase: Kimi capabilities, Hermes agent, image APIs - Brainstorming: 14 ideas explored - Comparison matrix: Detective selected as winner - Concept finalized with all design decisions
52 lines
1.7 KiB
Markdown
52 lines
1.7 KiB
Markdown
# Research: Kimi Visual Capabilities
|
|
|
|
**Date:** 2026-04-19
|
|
**Purpose:** Validate Kimi's visual strengths for hackathon project
|
|
|
|
## Kimi K2.5 - Multimodal Model
|
|
|
|
### Core Capabilities
|
|
- **Text + Images + Video** input support
|
|
- 256K context length
|
|
- Thinking/non-thinking modes
|
|
- Agent task support
|
|
|
|
### Visual API Models
|
|
- `moonshot-v1-8k-vision-preview`
|
|
- `moonshot-v1-32k-vision-preview`
|
|
- `moonshot-v1-128k-vision-preview`
|
|
- `kimi-k2.5` (latest, supports video)
|
|
|
|
### Supported Formats
|
|
**Images:** png, jpeg, webp, gif
|
|
**Video:** mp4, mpeg, mov, avi, x-flv, mpg, webm, wmv, 3gpp
|
|
|
|
### Unique Visual Features
|
|
1. **Visual Coding** - Kimi Code, Kimi Claw for coding with visual context
|
|
2. **Video Understanding** - Analyzes video content (unique for multimodal models)
|
|
3. **Real-time Visual Chat** - Interactive visual conversation
|
|
|
|
## Kimi K2 Benchmarks (Coding/Agent)
|
|
|
|
| Benchmark | Kimi K2 Score | Notes |
|
|
|-----------|---------------|-------|
|
|
| SWE-bench Verified (Single Attempt) | **65.8%** | Global SOTA for open-source |
|
|
| SWE-bench Multilingual | 47.3% | Outperforms most proprietary |
|
|
| LiveCodeBench v6 | 53.7% | Strong coding |
|
|
| TerminalBench | 30.0% | Agentic tool use |
|
|
| Aider-Polyglot | 60.0% | Code editing |
|
|
| Tau2-Bench (avg) | ~64% | Tool use tasks |
|
|
|
|
## Kimi Visual Strengths Summary
|
|
|
|
✅ **Video understanding** (unique advantage)
|
|
✅ **Visual coding** capabilities
|
|
✅ **Image + Text multimodal**
|
|
✅ **Strong agentic tool use**
|
|
✅ **256K context** for large visual inputs
|
|
|
|
## Sources
|
|
- https://platform.moonshot.cn/docs/guide/kimi-k2-5-quickstart
|
|
- https://moonshotai.github.io/Kimi-K2/
|
|
- https://platform.moonshot.cn/docs/guide/use-kimi-vision-model
|