Files
hermes-detective/docs/research-kimi-visual-capabilities.md
shoko ecfd0b1160 feat: Initial commit - Hermes Detective Agency concept
- Hermes Detective Agency: Open-ended mystery investigation game
- Roles: Chief (human), Witness (Kimi), Detective (Hermes)
- 5 difficulty levels, community cases, open-ended solving
- Scoring: Alignment %, Evidence %, Time
- Features: Retry, Journal, Observe mode
- Tech: Kimi Vision + Hermes Agent + Pollinations

Changelog:
- Research phase: Kimi capabilities, Hermes agent, image APIs
- Brainstorming: 14 ideas explored
- Comparison matrix: Detective selected as winner
- Concept finalized with all design decisions
2026-04-20 00:00:30 +00:00

1.7 KiB

Research: Kimi Visual Capabilities

Date: 2026-04-19
Purpose: Validate Kimi's visual strengths for hackathon project

Kimi K2.5 - Multimodal Model

Core Capabilities

  • Text + Images + Video input support
  • 256K context length
  • Thinking/non-thinking modes
  • Agent task support

Visual API Models

  • moonshot-v1-8k-vision-preview
  • moonshot-v1-32k-vision-preview
  • moonshot-v1-128k-vision-preview
  • kimi-k2.5 (latest, supports video)

Supported Formats

Images: png, jpeg, webp, gif
Video: mp4, mpeg, mov, avi, x-flv, mpg, webm, wmv, 3gpp

Unique Visual Features

  1. Visual Coding - Kimi Code, Kimi Claw for coding with visual context
  2. Video Understanding - Analyzes video content (unique for multimodal models)
  3. Real-time Visual Chat - Interactive visual conversation

Kimi K2 Benchmarks (Coding/Agent)

Benchmark Kimi K2 Score Notes
SWE-bench Verified (Single Attempt) 65.8% Global SOTA for open-source
SWE-bench Multilingual 47.3% Outperforms most proprietary
LiveCodeBench v6 53.7% Strong coding
TerminalBench 30.0% Agentic tool use
Aider-Polyglot 60.0% Code editing
Tau2-Bench (avg) ~64% Tool use tasks

Kimi Visual Strengths Summary

Video understanding (unique advantage)
Visual coding capabilities
Image + Text multimodal
Strong agentic tool use
256K context for large visual inputs

Sources