# Research: Kimi Visual Capabilities

**Date:** 2026-04-19
**Purpose:** Validate Kimi's visual strengths for a hackathon project

## Kimi K2.5 - Multimodal Model

### Core Capabilities

- **Text + images + video** input support
- 256K context length
- Thinking and non-thinking modes
- Agent task support

### Visual API Models

- `moonshot-v1-8k-vision-preview`
- `moonshot-v1-32k-vision-preview`
- `moonshot-v1-128k-vision-preview`
- `kimi-k2.5` (latest; supports video)

### Supported Formats

- **Images:** png, jpeg, webp, gif
- **Video:** mp4, mpeg, mov, avi, x-flv, mpg, webm, wmv, 3gpp

### Unique Visual Features

1. **Visual coding** - Kimi Code and Kimi Claw support coding with visual context
2. **Video understanding** - analyzes video content, which is still rare among multimodal models
3. **Real-time visual chat** - interactive conversation over visual inputs

## Kimi K2 Benchmarks (Coding/Agent)

| Benchmark | Kimi K2 Score | Notes |
|-----------|---------------|-------|
| SWE-bench Verified (single attempt) | **65.8%** | SOTA among open-source models |
| SWE-bench Multilingual | 47.3% | Outperforms most proprietary models |
| LiveCodeBench v6 | 53.7% | Strong coding performance |
| TerminalBench | 30.0% | Agentic tool use |
| Aider-Polyglot | 60.0% | Code editing |
| Tau2-Bench (avg) | ~64% | Tool-use tasks |

## Kimi Visual Strengths Summary

- ✅ **Video understanding** (distinctive advantage)
- ✅ **Visual coding** capabilities
- ✅ **Image + text multimodal** input
- ✅ **Strong agentic tool use**
- ✅ **256K context** for large visual inputs

## Sources

- https://platform.moonshot.cn/docs/guide/kimi-k2-5-quickstart
- https://moonshotai.github.io/Kimi-K2/
- https://platform.moonshot.cn/docs/guide/use-kimi-vision-model
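To make the vision models above concrete, here is a minimal sketch of building a chat request that attaches an image as a base64 data URL. It assumes the OpenAI-compatible message schema (`image_url` content parts) that Moonshot's vision guide describes; the exact field names and the chosen model ID should be verified against the docs linked in Sources before use.

```python
import base64


def build_vision_request(image_path: str, prompt: str,
                         model: str = "moonshot-v1-8k-vision-preview") -> dict:
    """Build an OpenAI-style chat payload with one image and one text part.

    Assumption: Moonshot's vision endpoint accepts base64 data URLs in
    `image_url` content parts, as in the OpenAI-compatible format.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
```

The payload would then be POSTed to the chat completions endpoint with an API key; only the request construction is shown here since the wire format is the part worth pinning down for the hackathon.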
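The supported-format lists above can be turned into a simple pre-upload check, useful for rejecting unsupported files client-side. The format sets come straight from the lists in this document; the extension aliases (e.g. `.jpg` → `jpeg`) are an assumption for convenience, not something the docs state.

```python
from typing import Optional

# Supported input formats, as listed under "Supported Formats" above.
IMAGE_FORMATS = {"png", "jpeg", "webp", "gif"}
VIDEO_FORMATS = {"mp4", "mpeg", "mov", "avi", "x-flv", "mpg", "webm", "wmv", "3gpp"}

# Common extension aliases mapped to the listed format names (assumption).
ALIASES = {"jpg": "jpeg", "flv": "x-flv", "3gp": "3gpp"}


def classify_media(filename: str) -> Optional[str]:
    """Return 'image' or 'video' if the file extension is supported, else None."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    ext = ALIASES.get(ext, ext)
    if ext in IMAGE_FORMATS:
        return "image"
    if ext in VIDEO_FORMATS:
        return "video"
    return None
```

For example, `classify_media("demo.jpg")` returns `"image"` and `classify_media("clip.mp4")` returns `"video"`, while an unsupported extension yields `None`.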