From ecfd0b11609108531a68da4c01e5f4a0d6fcd795 Mon Sep 17 00:00:00 2001 From: shoko Date: Mon, 20 Apr 2026 00:00:30 +0000 Subject: [PATCH] feat: Initial commit - Hermes Detective Agency concept - Hermes Detective Agency: Open-ended mystery investigation game - Roles: Chief (human), Witness (Kimi), Detective (Hermes) - 5 difficulty levels, community cases, open-ended solving - Scoring: Alignment %, Evidence %, Time - Features: Retry, Journal, Observe mode - Tech: Kimi Vision + Hermes Agent + Pollinations Changelog: - Research phase: Kimi capabilities, Hermes agent, image APIs - Brainstorming: 14 ideas explored - Comparison matrix: Detective selected as winner - Concept finalized with all design decisions --- .issues/001-hermes-hackathon-project.md | 63 +++ CHANGELOG.md | 204 ++++++++ docs/chosen-detective-game.md | 502 +++++++++++++++++++ docs/ideas/001-visual-narrative-agent.md | 79 +++ docs/ideas/007-vision-spot-the-difference.md | 138 +++++ docs/ideas/008-visual-detective.md | 397 +++++++++++++++ docs/ideas/COMPARISON.md | 132 +++++ docs/research-hermes-agent.md | 47 ++ docs/research-image-generation-apis.md | 72 +++ docs/research-kimi-visual-capabilities.md | 51 ++ 10 files changed, 1685 insertions(+) create mode 100644 .issues/001-hermes-hackathon-project.md create mode 100644 CHANGELOG.md create mode 100644 docs/chosen-detective-game.md create mode 100644 docs/ideas/001-visual-narrative-agent.md create mode 100644 docs/ideas/007-vision-spot-the-difference.md create mode 100644 docs/ideas/008-visual-detective.md create mode 100644 docs/ideas/COMPARISON.md create mode 100644 docs/research-hermes-agent.md create mode 100644 docs/research-image-generation-apis.md create mode 100644 docs/research-kimi-visual-capabilities.md diff --git a/.issues/001-hermes-hackathon-project.md b/.issues/001-hermes-hackathon-project.md new file mode 100644 index 0000000..3f4a8cd --- /dev/null +++ b/.issues/001-hermes-hackathon-project.md @@ -0,0 +1,63 @@ +# Issue 001: Hermes Agent 
Creative Hackathon Project + +**Status:** active +**Created:** 2026-04-19 +**Tags:** hackathon, hermes-agent, creative + +## Summary + +Participate in the Hermes Agent Creative Hackathon. 16 days, $25k in prizes across two tracks. + +## Tracks + +| Track | Pool | 1st | 2nd | 3rd | +|-------|------|-----|-----|-----| +| Main | $15,000 | $10,000 | $3,500 | $1,500 | +| Kimi | $5,000 | $3,500 | $1,000 | $500 | + +Plus $5k in Kimi Credits for winners. + +## Requirements + +- Submit via Discord (`creative-hackathon-submissions` channel) +- Tweet demo video + writeup tagging @NousResearch +- Kimi Track: must prove Kimi model usage (eligible for both tracks) +- Judged on: creativity, usefulness, presentation +- Deadline: EOD Sunday, May 3rd + +## Creative Domains + +Video, image, audio, 3D, long-form writing, creative software, interactive media + +## Next Steps + +- [ ] Brainstorm 3-5 project ideas +- [ ] Compare ideas (uniqueness, feasibility, wow factor) +- [ ] Decide on final concept +- [ ] Build prototype +- [ ] Test & iterate +- [ ] Produce submission video + writeup + +## Research Summary + +### Kimi Visual Strengths (Validated) +- **Video understanding** - Unique multimodal capability +- **Visual coding** - Kimi Code, Kimi Claw +- **Image + Text** - Full multimodal support +- **Strong benchmarks** - SWE-bench 65.8%, Tau2-Bench ~64% + +### Hermes Agent Capabilities +- **Function calling** - Trained for reliable tool use +- **Structured output** - JSON/Pydantic +- **OpenAI-compatible** - Easy integration +- **Multi-turn agents** - Agentic workflow execution + +### Potential Project Concept +**Visual Agent Pipeline:** Hermes Agent orchestrates workflows, Kimi K2.5 handles visual understanding (image/video analysis), creating a creative visual AI application. + +### Kimi Track Eligibility +To qualify for Kimi Track, must prove Kimi model usage in demo video. + +## Notes + +Session started 2026-04-19. 
Working directory: `/tmp/hermes-hack` diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..6e024f3 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,204 @@ +# Session Log + +## 2026-04-19 + +### 001 - Session Start: Hermes Hackathon + +**What:** Started Hermes Agent Creative Hackathon collaboration. + +**Context:** +- Hackathon: 16 days, $25k prizes (Main $15k, Kimi $5k, $5k Kimi credits) +- Presented by Kimi Moonshot & Nous Research +- Two tracks: Main (any creative use) and Kimi Track (must use Kimi models) +- Deadline: EOD Sunday, May 3rd + +**Action:** +- Set up workflow structure (`.issues/`, `docs/`, git init) +- Created first issue file: `001-hermes-hackathon-project.md` + +**Next:** +- Define project concept and creative domain focus +- Explore Hermes Agent capabilities +- Sketch initial prototype idea + +### 002 - Research Completed + +**What:** Validated Kimi and Hermes Agent capabilities. + +**Findings:** +- Kimi K2.5: multimodal (text+image+video), video understanding, visual coding +- Kimi benchmarks: SWE-bench 65.8%, Tau2 ~64% +- Hermes 3: function calling, structured output, OpenAI-compatible +- Hermes built-in skills: manim-video, ascii-video, ascii-art (accessibility-focused) + +**Action:** +- Created `docs/research-kimi-visual-capabilities.md` +- Created `docs/research-hermes-agent.md` +- Created `docs/research-image-generation-apis.md` +- Updated issue file with research summary + +**Next:** +- Define concrete project concept +- Choose specific creative angle (visual coding? video analysis? image generation?) +- Start rapid prototyping + +--- + +### 003 - Image Gen API Research + +**What:** Found affordable/free image generation API. + +**Findings:** +- **Pollinations AI** ✅: Free tier, OpenAI-compatible, multiple models (Flux, etc.) 
+ - Endpoint: `https://gen.pollinations.ai/image/{prompt}` + - Simple: just curl it, no auth needed for basic + - Models: flux, zimage, wan-image, qwen-image, gptimage + - Cost: Free tier (pollen credits), $1 ≈ 1 Pollen paid + +**Action:** +- Created `docs/research-image-generation-apis.md` +- Updated idea 001 with image gen options + +**Next:** +- Sketch more project ideas for comparison +- Do idea benchmark matrix + +--- + +### 004 - Brainstorming Session + +**What:** Generated 7 project ideas, deeper dive on Idea 007. + +**Ideas Generated:** +1. 001: Visual Narrative Agent (text → image loop) +2. 002: Visual Memory Journal (AI scrapbook) +3. 003: Reverse Design Critic (UI critique + fix) +4. 004: Visual Poem Generator (two-AI art collaboration) +5. 005: Scene-to-Scene Video Storyteller (visual journey) +6. 006: Real-time Visual Debugger (screenshot → fix) +7. 007: Spot the Difference Agent (NEW FOCUS) + +**User Preferences:** +- Want high visual analysis, low reasoning +- Single page webapp, no auth +- Show step-by-step AI process +- Gamification (leaderboard, daily puzzles) + +**Selected for deeper dive:** 007 Vision Puzzle + +**Action:** +- Created `docs/ideas/007-vision-spot-the-difference.md` + +**Next:** +- Compare all ideas to pick winner + +--- + +### 005 - Ideas Comparison + +**What:** Created comparison matrix for all brainstormed ideas. 
+ +**Ideas Compared:** 14 concepts across visual games, interactive, and creative + +**Scoring Criteria:** +- Visual Analysis (30%) +- Multi-Turn (20%) +- Human-AI Interaction (20%) +- Cost Efficiency (15%) +- Uniqueness (10%) +- Fun (5%) + +**Results:** + +| Rank | Idea | Score | +|------|------|-------| +| 🥇 | **033v2 Detective** | **4.7** | +| 🥈 | Auction | 3.9 | +| 🥉 | 032v2 Art Critic | 3.7 | +| 4 | 013 Image Alchemy | 3.6 | +| 5 | 009 Image Tarot | 3.5 | + +**Winner: 033v2 Detective** + +**Why:** +- Best multi-turn (5+ rounds) +- Human actively directs (Chief role) +- Kimi does real visual work +- Cost efficient (mostly text) +- Natural mystery/narrative arc + +**Action:** +- Created `docs/ideas/COMPARISON.md` +- Created `docs/ideas/008-visual-detective.md` (includes multi-agent v2) + +**Next:** +- Discuss and finalize concept details + +--- + +### 006 - Concept Documented + +**What:** Documented chosen 033v2 Detective as `chosen-detective-game.md`. + +**Documented:** +- Elevator pitch +- Game roles (Chief, Witness, Detective) +- Evidence types (crime scene, documents, photos, etc.) +- Round structure (7 rounds per case) +- Scoring system +- UI concept sketch +- Difficulty tiers (Rookie → Chief) +- Daily cases + leaderboard + +**Action:** +- Created `docs/chosen-detective-game.md` + +**Open Questions (for discussion):** +1. How much Witness describes unprompted? +2. Can Detective be wrong? +3. Red herrings — yes/no? +4. Plot twist mid-case? +5. Timer? +6. Replay past cases? +7. Hints system? +8. Skip evidence? +9. Case sources (pre-made/generated)? +10. Image sources (real/AI/illustrated)? +11. Share results? +12. Community cases? + +**Next:** +- Discuss and finalize concept details + +--- + +### 007 - Concept Finalized + +**What:** Finalized Hermes Detective Agency concept after extensive discussion. 
+ +**Key Decisions:** +- **Difficulty:** 5 levels (Easy → Impossible), one case per day +- **Open-ended solving:** No single truth, multiple valid theories +- **Scoring:** Alignment %, Evidence cited %, Time (turns × 10min) +- **Hints:** Embedded in evidence (too obvious, barely obvious, not too obvious) +- **Witness:** Dynamic appearance based on triggers (harder cases) +- **Truth reveal:** Available anytime, doesn't end game +- **Retry:** Unlimited attempts, every documented +- **Journal:** Private by default, publish stats/journal optional +- **Observe:** Watch others' published solves +- **Case source:** 5 starter cases (one per difficulty) + community generation +- **Community:** Visits + reviews (no auth, manipulable but requires effort) +- **Discovery:** Jungle (browse) vs path (direct links from creator) +- **Case format:** YAML-based template +- **Creator tools:** Hermes skill + format validator + +**Action:** +- Updated `docs/chosen-detective-game.md` with full finalized concept + +**Next:** +- Technical architecture +- UI/UX design +- Prompt engineering +- Prototype development + +--- diff --git a/docs/chosen-detective-game.md b/docs/chosen-detective-game.md new file mode 100644 index 0000000..07c2b7f --- /dev/null +++ b/docs/chosen-detective-game.md @@ -0,0 +1,502 @@ +# Project: Hermes Detective Agency + +**Chosen Concept:** 033v2 Detective +**Date:** 2026-04-19 +**Status:** Concept Finalized +**Tags:** hermes-agent, kimi-vision, game, multi-agent, open-ended, community + +--- + +## Concept Summary + +A mystery investigation game where a human (Chief) directs two AI agents — a **Witness** (powered by Kimi Vision) and a **Detective** (powered by Hermes) — to investigate visual cases. + +**Core philosophy:** Open-ended solving. No single truth. Evidence guides, but multiple theories are valid. + +--- + +## Elevator Pitch + +> *"You're the Chief. Your Witness sees everything. Your Detective connects the dots. Build YOUR theory. 
See how it aligns with others."* + +--- + +## The Story + +You run a small detective agency. Your two AI assistants have superhuman abilities: + +- **Witness** can look at any image and describe it perfectly — every detail, every inconsistency, every hidden clue. +- **Detective** can take those observations and build theories, spot patterns, and identify suspects. + +Your job? **Direct the investigation.** Tell them what to look at. Ask the right questions. Build your theory. + +**Key difference:** There's no single "right answer." The creator has an intended story, but your theory is valid if evidence supports it. + +--- + +## Game Roles + +### Chief (Human) +The player. You run the investigation. + +| Action | Effect | +|--------|--------| +| Examine evidence | Witness + Kimi analyze | +| Question suspects | Detective probes, Witness watches | +| Compare items | Kimi highlights differences | +| Build theory | Cite evidence, form conclusion | +| Request truth | See creator's intended story (optional) | + +### Witness (Agent A + Kimi) +The eyes. Analyzes visual evidence. Appears based on triggers. + +| Input | Output | +|-------|--------| +| Crime scene photo | "I see glass shards, muddy footprints, a broken frame..." | +| Suspect photo | "This person has paint on their sleeve..." | +| Document | Extracts text, notes inconsistencies | +| Item close-up | Identifies details Chief might miss | + +**Dynamic Appearance:** In harder cases, Witness doesn't appear until triggered. + +### Detective (Agent B) +The brain. Builds theories, responds to questions. + +| Input | Output | +|-------|--------| +| Witness observations | "Based on evidence, the thief entered through..." | +| Suspect profiles | "Suspect A has motive: insurance fraud..." | +| Human questions | "Good question, Chief. Let me look into that..." 
| +| Theory building | Helps Chief cite evidence for their theory | + +--- + +## Difficulty System + +### Difficulty Levels + +| Difficulty | Description | Evidence | Suspects | Red Herrings | Plot Twist | +|-----------|-------------|----------|----------|---------------|------------| +| **Easy** | Obvious clues, clear path | 4-5 | 2 | ❌ | ❌ | +| **Medium** | Requires comparison | 6-7 | 3 | ❌ | ❌ | +| **Hard** | Red herrings present | 8-9 | 4 | ✅ | ❌ | +| **Hardcore** | Plot twist mid-case | 10-11 | 4 | ✅ | ✅ | +| **Impossible** | All elements, complex | 12+ | 5 | ✅ | ✅ | + +### Daily Structure + +``` +One case per day, everyone gets the same case +Same difficulty for all players +Different case each day +``` + +### Starter Pack (5 Cases) + +| Week | Difficulty | Theme | +|------|------------|-------| +| 1 | Easy | Simple theft | +| 2 | Medium | Missing person | +| 3 | Hard | Corporate fraud | +| 4 | Hardcore | Art heist (plot twist) | +| 5 | Impossible | Multi-layered conspiracy | + +**Approach:** Add cases incrementally during development. + +--- + +## Evidence System + +### Evidence Types + +| Type | What Kimi Sees | Example Clue | +|------|---------------|--------------| +| **Crime scene** | Scene layout, objects, anomalies | "Window was broken from inside" | +| **Surveillance** | People, actions, timestamps | "Person lingered at door for 3 minutes" | +| **Documents** | Text, handwriting, context | "Letter mentions 'meeting at midnight'" | +| **Photos** | People, items, locations | "Suspect's shoes match the footprint" | +| **Maps** | Routes, access points, exits | "Only one entrance visible to street" | +| **Items** | Condition, marks, connections | "Key is copy — grooves don't match original" | + +### Evidence Citation + +Evidence helps build theory. Not all evidence is required. + +``` +Chief's Theory: "I think Suspect B did it." 
+ +📎 Cited Evidence: +- Evidence #3: Crime scene photo +- Evidence #5: Security footage +- Evidence #8: Witness testimony +→ 3/10 evidence cited (30%) + +💬 Detective: "That's a solid theory. The evidence +supports B, but have you considered Evidence #7?" +``` + +### Hints Embedded in Evidence + +Not a separate button. Hints are part of the evidence design. + +| Level | Visibility | Example | +|-------|-----------|---------| +| **Too obvious** | Easy to find | "Letter saying 'I did it'" | +| **Barely obvious** | Check certain places | "Muddy shoes near suspect's home" | +| **Not too obvious** | Requires attention | "Timeline inconsistency in letter" | + +### Witness Trigger System + +In harder cases, Witness appears based on triggers. + +``` +Trigger Example: +Turn 1: Chief examines crime scene photo +Turn 2: Chief finds a hair sample on the floor + ↓ [Trigger activated] +Turn 3: 👁️ Witness appears + ↓ "I recognize this hair... it belongs to Suspect B's dog" +Turn 4: Chief examines suspect's home +Turn 5: 👁️ Witness appears again (new trigger) + ↓ "I saw Suspect B leaving the gallery at midnight..." +``` + +**Indicator:** Each piece of evidence has a note indicating if it triggers Witness appearance. + +--- + +## Open-Ended Solving + +### Core Philosophy + +> **No single truth. Multiple valid theories.** + +| Before | After | +|--------|-------| +| One correct answer | Multiple valid theories | +| Wrong accusation = Fail | Theory valid if evidence supports | +| One winner | Everyone discusses | +| Truth ends game | Truth is guidance, not mandate | + +### Theory Building + +``` +👤 Chief builds theory: +"I think Suspect B did it, with help from Suspect A. +B had access (night guard), A had keys (curator). +They split the insurance money." 
+ +📎 Chief cites evidence: +- Evidence #3: Crime scene (window not broken) +- Evidence #5: Security footage (B was inside) +- Evidence #7: A has master keys +- Evidence #9: Financial records (recent debt) + +💬 Detective responds: +"That's a coherent theory. Your cited evidence +supports collaboration between A and B." +``` + +### Truth Reveal + +**Available anytime. Does NOT end the game.** + +| When | Why | +|------|-----| +| After building theory | "Did I get it right?" | +| When stuck | "Give me guidance" | +| Never | "I want to figure it out myself" | +| After solving | "See how close I was" | + +``` +📜 THE TRUTH (Creator's Intended) + +The case was designed as: +"A and B collaborated. A had keys, B had access. +But C was the real mastermind, funding the whole thing." + +👤 Your theory: +"Suspect B acted alone." + +💬 Comparison: +- Your theory missed the collaboration element +- You correctly identified B as main actor +- Evidence you cited: 80% relevant +- 🎯 65% alignment with intended truth + +💬 But: Your theory is still valid based on evidence! +Discussion continues. Truth is guidance, not mandate. 
+``` + +--- + +## Scoring System + +### Per Case Statistics + +| Metric | Calculation | +|--------|-------------| +| **Time** | Turns × 10 min (simplified) | +| **Evidence** | Evidence cited / Total evidence | +| **Alignment** | How close to creator's intended story | +| **Coherence** | Theory makes sense based on evidence | + +### Statistics Display + +``` +┌─────────────────────────────────────┐ +│ 📊 CASE STATISTICS │ +├─────────────────────────────────────┤ +│ ⏱️ Time: 6 turns × 10 min = 60 min │ +│ 📎 Evidence: 7/10 cited (70%) │ +│ 🎯 Alignment: 85% with creator │ +│ 💬 Theory coherence: Strong │ +├─────────────────────────────────────┤ +│ ⭐ Rating: Sharp Detective │ +└─────────────────────────────────────┘ +``` + +### Rating Tiers + +| Alignment | Rating | +|-----------|--------| +| 90-100% | Master Detective | +| 75-89% | Sharp Detective | +| 50-74% | Promising Detective | +| 25-49% | Apprentice | +| 0-24% | Rookie | + +--- + +## Retry & Journal System + +### Multiple Attempts + +User can solve same case multiple times. + +``` +Case #47 — The Hartwell Heist + +Your Attempts: +├── Attempt #1: 85% alignment, 6 turns 📖 +├── Attempt #2: 92% alignment, 4 turns 📖 +├── Attempt #3: In progress... +└── Best: 92% alignment +``` + +### Journal Documentation + +Every attempt is documented (solve or not). + +``` +Attempt #1: April 19, 2026 +├── Status: Solved +├── Evidence cited: 7/10 +├── Alignment: 85% +├── Theory: "Suspect B acted alone" +└── Notes: "Missed the A-B collaboration" +``` + +### Privacy Settings + +| Setting | Description | +|---------|-------------| +| **Private** | Only you see your attempts | +| **Publish stats** | Everyone sees your stats (default) | +| **Publish journal** | Anyone can read your solve | + +--- + +## Replay (Observe Mode) + +Watch how others solved the case. 
+ +``` +📺 OBSERVE MODE + +@alice's Solve of Case #47 + +Turn 1: Examined crime scene +Turn 2: Found hair sample → Witness appeared +Turn 3: Questioned Suspect B +Turn 4: Examined financial records +Turn 5: Cited evidence, formed theory +Turn 6: Requested truth reveal + +⏱️ 6 turns | 🎯 85% alignment | ⭐ Sharp +``` + +**Only published journals are observable.** + +--- + +## Case Creation System + +### Starter Cases + +5 cases (one per difficulty) as templates. + +**Source:** Real solved cases adapted for the game. + +### Community Cases + +Anyone can create and share cases. + +#### Creation Flow + +``` +1. Choose reference case (optional) + "Let's base this on the Isabella Stewart Gardner theft" + +2. Gather/create evidence + Upload images (crime scene, suspects, documents) + +3. Write case brief + ├── Title, difficulty + ├── Suspect list (names, photos) + ├── Evidence set + ├── Hidden truth (creator's intended story) + ├── Red herrings (optional) + ├── Plot twist (optional) + └── Witness triggers (which evidence triggers Witness) + +4. Test it + Play through yourself to verify solvability + +5. Publish + ├── Private link (friends only) + └── Public (case library) +``` + +### Case Format + +```yaml +case: + title: "The Hartwell Heist" + difficulty: medium + difficulty_description: "Requires comparison of evidence" + + evidence: + - id: 1 + type: photo + image: crime_scene.jpg + description: "Crime scene photograph" + triggers_witness: true + hint_level: not_too_obvious + + - id: 2 + type: document + image: letter.jpg + description: "Anonymous letter found" + triggers_witness: false + hint_level: barely_obvious + + suspects: + - name: "Suspect A" + photo: suspect_a.jpg + description: "Gallery curator" + + truth: + summary: "A and B collaborated..." + alignment_criteria: + - "Correctly identified collaboration" + - "Identified A as key holder" + - "Identified B as main actor" + + witness_triggers: + - evidence_id: 1 + testimony: "I see glass on the floor inside..." 
+``` + +### Case Creator Tools + +| Tool | Purpose | +|------|---------| +| **Skill** | Hermes skill for case creation guidance | +| **Validator** | Verify case format is correct | + +--- + +## Community Moderation + +### Discovery Philosophy + +> **Community cases are the jungle. Direct links are the path.** + +| Discovery Method | Quality | Effort | +|-----------------|---------|--------| +| Case library (browse) | Mixed (jungle) | Low | +| Direct link from creator | Same quality | Medium | +| Social media / community | Trusted (curated) | High | + +### Quality Signals + +| Signal | Description | +|--------|-------------| +| **Visits** | How many times case was played | +| **Reviews** | 👍 or 👎 (no text, requires effort to spam) | + +``` +Case #47B — "The Missing Heirloom" +├── Visits: 234 +├── 👍 45 | 👎 3 +└── Quality score: High +``` + +**Note:** Review manipulation is possible but requires effort. Not perfect, but workable. + +### Sharing Flow + +``` +Creator creates case + ↓ +Tests locally + ↓ +Publishes to community + ↓ +Shares link on social media / Discord + ↓ +Players try directly from creator +``` + +--- + +## Summary of Decisions + +| Element | Decision | +|---------|----------| +| Difficulty | 5 levels (Easy → Impossible) | +| Daily structure | One case per day, same for all | +| Timer | ❌ No (first phase) | +| Hints | ✅ Embedded in evidence | +| Retry | ✅ Unlimited attempts | +| Journal | ✅ Every attempt documented | +| Observe | ✅ Watch published solves | +| Privacy | Private by default | +| Publish | Stats always, journal optional | +| Scoring | Alignment %, Evidence %, Time | +| Open-ended | ✅ No single truth | +| Truth reveal | Available anytime | +| Case source | Real cases + community | +| Witness | Dynamic (triggers in hard cases) | +| Red herrings | ✅ Hard+ difficulty | +| Plot twist | ✅ Hardcore+ difficulty | +| Community | Visits + reviews (no auth) | + +--- + +## What's Next + +Once we finalize the concept: +- Technical architecture +- UI/UX 
design +- Prompt engineering +- Case creation template +- Prototype development + +--- + +## Related Documents + +- `docs/ideas/COMPARISON.md` — Full comparison matrix +- `docs/ideas/008-visual-detective.md` — Initial brainstorm diff --git a/docs/ideas/001-visual-narrative-agent.md b/docs/ideas/001-visual-narrative-agent.md new file mode 100644 index 0000000..2dac70c --- /dev/null +++ b/docs/ideas/001-visual-narrative-agent.md @@ -0,0 +1,79 @@ +# Idea 001: Visual Narrative Agent + +**Date:** 2026-04-19 +**Status:** Idea +**Tags:** hermes-agent, kimi-vision, storytelling, image-generation + +## Concept + +An agentic storytelling system where Hermes orchestrates a narrative loop with Kimi's visual analysis and built-in image generation skills to produce coherent visual stories. + +## User Flow + +1. User provides text prompt (e.g., "A lone astronaut discovers an ancient alien garden on Mars") +2. Hermes plans story structure (scenes, pacing, visual style) +3. For each scene: + - Hermes generates image prompt + - Generate image (Hermes built-in skill: manim / ascii) + - Kimi analyzes generated image + - Kimi's feedback refines next scene's prompt +4. Return compiled visual story to user + +## Key Differentiator + +Most story-to-image tools: **Generate → Done** + +This concept: **Generate → Analyze → Refine → Loop** + +Kimi serves as the **visual reasoning engine** — tells Hermes if the generated image matches the intended scene, catches inconsistencies, and informs prompt refinement for the next scene. + +## Tech Stack + +| Component | Source | Role | +|-----------|--------|------| +| Hermes Agent | Nous Research | Orchestration, planning, decision loop | +| Kimi Vision | Moonshot AI (via gateway) | Image analysis, visual feedback | +| Image Generation | Pollinations AI | Free tier, multiple models (Flux, etc.) 
| + +### Image Generation Options + +| Provider | Free Tier | Quality | Use Case | +|---------|-----------|---------|----------| +| **Pollinations** ✅ | ✅ Yes | Good | Primary (simple, free) | +| **Flux (local)** | ✅ Free | High | If GPU available | +| **Hermes skills** | ✅ Free | Niche | Fallback/ASCII aesthetic | + +### Pollinations API (Primary) +- **Endpoint:** `https://gen.pollinations.ai/image/{prompt}` +- **Models:** flux, zimage, wan-image, qwen-image, etc. +- **Cost:** Free tier (pollen credits), ~$1/1 Pollen paid +- **Auth:** Optional for free tier + +## Strengths + +- ✅ Combines Hermes + Kimi + Pollinations natively +- ✅ Agentic visual feedback loop is unique +- ✅ Visual coherence check via Kimi ensures quality +- ✅ Free tier = low barrier to test +- ✅ User controls output format (default: image) + +## Weaknesses + +- ⚠️ Pollinations quality vs DALL-E/Midjourney (may need to test) +- ⚠️ Kimi requires gateway access (no direct API key) +- ⚠️ Loop adds latency (generate → analyze → refine) +- ⚠️ Need to verify Pollinations reliability + +## Uniqueness Score + +**7/10** — Agentic visual feedback loop is novel, but need to verify if built-in image generation is compelling enough + +## Next Steps + +- [ ] Explore Hermes built-in image skills (manim, ascii) +- [ ] Define output format options +- [ ] Sketch technical architecture + +## Related Ideas + +- See: `002-xxx.md`, `003-xxx.md` for alternatives diff --git a/docs/ideas/007-vision-spot-the-difference.md b/docs/ideas/007-vision-spot-the-difference.md new file mode 100644 index 0000000..2b416a9 --- /dev/null +++ b/docs/ideas/007-vision-spot-the-difference.md @@ -0,0 +1,138 @@ +# Idea 007: Spot the Difference Agent + +**Date:** 2026-04-19 +**Status:** Idea +**Tags:** hermes-agent, kimi-vision, puzzle, gamification, webapp + +## Concept + +A daily "Spot the Difference" puzzle webapp where AI (Kimi + Hermes) analyzes two images and shows its step-by-step process in finding the differences. 
+ +**Core insight:** Use visual analysis strength, minimize reasoning load. + +## User Flow + +1. User opens webapp → sees today's "Spot the Difference" puzzle (two similar images) +2. User can play manually (click on differences) OR +3. User clicks "Let AI Solve" → watches AI's step-by-step analysis +4. AI shows its reasoning process: "Scanning left-to-right... Found difference #1: color mismatch in top-left..." +5. Leaderboard shows attempt stats (anonymous) + +## Why This Works + +| Aspect | Implementation | +|--------|----------------| +| **Visual Analysis** | Kimi compares images pixel-level + semantic | +| **Low Reasoning** | Pattern matching, not complex logic | +| **Step-by-Step** | Show each finding with visual highlight | +| **Gamification** | Daily puzzle, leaderboard, no auth | + +## Puzzle Types + +### Primary: Spot the Difference (v1) +- Two images with subtle differences +- Kimi identifies all differences +- Each found difference highlighted + explanation + +### Secondary (future): +- Find the anomaly (what's wrong in this image?) +- Count the objects (how many X in this image?) +- What's different? (semantic analysis) + +## Technical Stack + +| Component | Source | Role | +|-----------|--------|------| +| Frontend | Single HTML page | Display puzzle, show AI process | +| Image Analysis | Kimi Vision (via gateway) | Compare images, find differences | +| Orchestration | Hermes Agent | Coordinate flow, format output | +| Image Gen | Pollinations AI | Generate daily puzzle pairs | + +### Daily Puzzle Generation +``` +Hermes + Pollinations → Generate base image +Hermes + Pollinations → Generate modified image (with subtle changes) +Store both → Serve to users daily +``` + +### AI Solving Process +``` +1. Hermes receives both images +2. Send to Kimi Vision for analysis +3. Kimi returns list of differences with locations +4. Hermes formats step-by-step explanation +5. 
Frontend animates each finding +``` + +## Features + +### Core +- [ ] Daily puzzle auto-rotates +- [ ] Two-image display (side by side) +- [ ] "Let AI Solve" button +- [ ] Step-by-step visualization of AI findings +- [ ] Show each difference with highlight + explanation + +### Gamification (no auth) +- [ ] Attempt counter (per user session, localStorage) +- [ ] Leaderboard (anonymous, session-based) +- [ ] "Perfect solve" badge (AI found all differences on first pass) + +### Nice to Have +- [ ] Difficulty levels (Easy/Medium/Hard) +- [ ] Share result as image +- [ ] Hint system (Kimi finds 1, user finds rest) + +## Step-by-Step Output Format + +``` +🔍 Scanning image... +✅ Difference #1 found: "The lamp color changed from blue to red" + 📍 Location: Top-left corner + 👆 [Highlighted on image] + +✅ Difference #2 found: "Window shape is slightly different" + 📍 Location: Center-right + 👆 [Highlighted on image] + +... + +🎯 Solved! Found X differences in Y steps. +⏱️ Time: Z seconds +``` + +## Comparison with Other Ideas + +| Aspect | 001 Visual Narrative | 007 Spot the Difference | +|--------|---------------------|------------------------| +| Visual Analysis | Heavy | **Heavy** | +| Reasoning | Medium | **Light** | +| Demo Impact | High | **High** | +| Gamification | Low | **High** | +| Uniqueness | 7/10 | **9/10** | +| Step-by-Step | Yes | **Yes (more natural)** | + +## Why Stronger than 001 + +1. **Tangible use case** — People actually play spot the difference +2. **Clear AI demonstration** — "Watch AI see what you see" +3. **Gamification** — Daily puzzle + leaderboard = engagement +4. **Low reasoning, high vision** — Perfect for Kimi's strength +5. 
**Step-by-step is natural** — Not forced, it's how you'd solve it + +## Risks + +- ⚠️ Need reliable daily puzzle generation (harder than it sounds) +- ⚠️ Kimi analysis quality depends on image complexity +- ⚠️ Need diverse puzzle set to not repeat + +## Next Steps + +- [ ] Test Kimi's spot-the-difference capability +- [ ] Design puzzle generation pipeline +- [ ] Mock up webapp UI +- [ ] Prototype step-by-step visualization + +## Related Ideas + +- See: `001-visual-narrative-agent.md` diff --git a/docs/ideas/008-visual-detective.md b/docs/ideas/008-visual-detective.md new file mode 100644 index 0000000..8d3bf83 --- /dev/null +++ b/docs/ideas/008-visual-detective.md @@ -0,0 +1,397 @@ +# Idea 008: Visual Detective + +**Date:** 2026-04-19 + +## Concept + +Upload a "crime scene" or mystery image. Kimi analyzes every detail. Hermes pieces together clues and generates a detective story/hypothesis. + +## Why Strong + +- Heavy visual analysis (Kimi reads the scene) +- Low reasoning (observation, not complex logic) +- Storytelling naturally fits step-by-step +- Mystery genre = engaging + +## User Flow + +1. Upload image (or get random daily mystery) +2. Kimi: "I see a broken window, muddy footprints, overturned chair..." +3. Hermes: "Based on these clues, here's what likely happened..." +4. Output: Detective story with visual evidence + +## Tech + +- Kimi Vision: Scene analysis +- Hermes: Narrative orchestration +- Pollinations: Generate mystery images + +## Unique? + +- Nobody's doing "AI detective" with your photos +- Could be daily mystery + community solving + +--- + +## 009: Image Tarot Reader + +**Date:** 2026-04-19 + +## Concept + +Upload any image. AI interprets it like a tarot card reading. + +## Why Strong + +- Fun/flirty, low stakes +- Heavy visual analysis (Kimi interprets symbolism) +- Storytelling fits perfectly +- Shareable results + +## User Flow + +1. Upload image OR random draw +2. Kimi: Analyzes composition, colors, objects, mood +3. 
Hermes: "This represents [Tarot card]. Your reading: [Narrative]" +4. Output: Tarot card + 3-card spread interpretation + +## Step-by-Step + +``` +🃏 Drawing your card... +👁️ Analyzing your image... + +Visual Elements Detected: +• A winding road (path in life) +• Setting sun (endings/new beginnings) +• Standing figure (you, the observer) + +🎴 Your Card: The Fool +Interpretation: A new journey awaits. Trust the path ahead... + +Past: Confusion about direction +Present: Standing at the crossroads +Future: Leap of faith required +``` + +## Tech + +- Kimi Vision: Symbol analysis +- Hermes: Tarot narrative generation +- Pollinations: Generate thematic card visuals + +--- + +## 010: Color Emotion Translator + +**Date:** 2026-04-19 + +## Concept + +Upload image. AI analyzes dominant colors and translates them into emotions/mood. + +## Why Strong + +- Pure visual analysis +- Art/design focused +- Generates color palette + emotion report +- Useful for designers + +## User Flow + +1. Upload image +2. Kimi: Extracts colors, analyzes saturation, harmony +3. Hermes: Translates to emotions, generates palette +4. Output: Color palette + emotion breakdown + suggested uses + +## Step-by-Step + +``` +🔍 Scanning colors... +🎨 Extracting dominant palette... + +Detected Colors: +• #2D4A3E (Deep Forest Green) - 45% +• #F5E6D3 (Warm Cream) - 30% +• #8B4513 (Saddle Brown) - 15% +• #CD853F (Peru Gold) - 10% + +🎭 Emotional Profile: +Primary: Grounded, natural, calm +Secondary: Warm, nostalgic, organic +Accent: Vintage, artisanal, trustworthy + +💡 Recommendations: +• Brand Identity for eco-friendly products +• Interior design: cozy cabin aesthetic +• Packaging: artisanal food products +``` + +--- + +## 011: Before/After Time Machine + +**Date:** 2026-04-19 + +## Concept + +Upload an old/historical photo. AI shows what it would look like today or vice versa. 
+ +## Why Strong + +- Historical/educational angle +- Visual transformation is compelling +- Shows AI's understanding of time/changes + +## User Flow + +1. Upload old OR new photo +2. Select transformation direction +3. Kimi: Analyzes context, era, subject +4. Hermes: Predicts/adapts to target era +5. Output: Side-by-side transformation + +## Step-by-Step + +``` +📸 Analyzing source image... +📅 Detected era: 1950s New York Street + +Identifying elements: +• Black & white photography style +• Vintage automobiles (1950s models) +• Fashion: fedoras, swing coats +• Architecture: Art Deco buildings + +🔮 Projecting to 2024... + +Transformation breakdown: +• Colorization: Added natural skin tones + sky colors +• Vehicles: Replaced with modern equivalents +• Architecture: Updated signage, added modern elements +• Fashion: Modernized while preserving style + +✨ Your 1950s scene in 2024! +``` + +--- + +## 012: Visual Haiku Generator + +**Date:** 2026-04-19 + +## Concept + +Upload any image. AI generates a haiku (5-7-5) based on visual elements. + +## Why Strong + +- Minimal reasoning, pure visual +- Artistic/creative output +- Japanese aesthetic + AI = unique +- Highly shareable + +## User Flow + +1. Upload image +2. Kimi: Analyzes scene, mood, elements +3. Hermes: Crafts haiku (strict 5-7-5) +4. Output: Image + haiku + syllable breakdown + +## Step-by-Step + +``` +🖼️ Analyzing your image... + +Scene Elements: +• Autumn forest path +• Golden leaves falling +• Soft morning light through trees + +✍️ Crafting haiku... + +The forest whispers +Golden footsteps on the leaves, +Silence speaks aloud + +📝 Syllable breakdown: +"The" (1) - "forest" (2) - "whispers" (2) = 5 +"Golden" (2) - "footsteps" (2) - "on" (1) - "the" (1) - "leaves" (1) = 7 +"Silence" (2) - "speaks" (1) - "aloud" (2) = 5 +(5) - (7) - (5) ✅ +``` + +--- + +## 013: Image Alchemy + +**Date:** 2026-04-19 + +## Concept + +Upload two random images. AI "fuses" them into a new concept based on their shared elements.
+ +## Why Strong + +- Surprising/comedic combinations +- Pure visual + semantic analysis +- Unique creative output +- Viral potential + +## User Flow + +1. Upload image A (or random) +2. Upload image B (or random) +3. Kimi: Analyzes both separately +4. Hermes: Finds connections, creates fusion +5. Output: New concept + fused image prompt + +## Step-by-Step + +``` +🌀 Analyzing Image A: A Viking ship +• Norse aesthetic +• Ocean voyage +• Historical warrior culture + +🌀 Analyzing Image B: A Coffee shop +• Cozy atmosphere +• Barista craft +• Modern social space + +🔮 Alchemizing... + +Found connections: +• Craft (warrior's craft → barista's craft) +• Ritual (battle ritual → coffee ritual) +• Journey (ocean voyage → daily commute) + +⚗️ Alchemy Result: + +"THE VIKING BARISTA" + +A warrior of the morning, +steering through storms of exhaustion, +claiming the sacred cup. + +Your coffee shop serves mead in horn-shaped mugs, +the barista wears a helmet of foam, +and every latte is a conquest. +``` + +--- + +## 014: Visual Lie Detector + +**Date:** 2026-04-19 + +## Concept + +Upload a photo + claim. AI analyzes if the image supports or contradicts the claim. + +## Why Strong + +- Useful in era of fake news +- Pure visual verification +- Educational about image analysis +- "Is this real?" tool + +## User Flow + +1. Paste claim + upload image +2. Kimi: Analyzes image details +3. Hermes: Compares claim vs evidence +4. 
Output: Verdict + reasoning + +## Step-by-Step + +``` +🔍 Analyzing claim: "This photo was taken in Paris" + +🔬 Image Analysis: +• Architecture: Haussmannian buildings ✓ +• Street signs: French ✓ +• License plates: European format ✓ +• Language: French on signs ✓ +• Vegetation: Consistent with Paris climate ✓ +• Shadows: Consistent with claimed time of day ✓ + +✅ VERDICT: LIKELY AUTHENTIC + +Confidence: 94% +Supporting evidence: 6/6 elements match +Caveats: Metadata not verified +``` + +--- + +## 015: Object Archaeology + +**Date:** 2026-04-19 + +## Concept + +Upload an object close-up. AI identifies it, tells its history/story. + +## Why Strong + +- Educational +- Heavy visual (identification + knowledge) +- Discovery/antiquities angle +- Could work with museum APIs + +## User Flow + +1. Upload object photo +2. Kimi: Visual identification + details +3. Hermes: Tells object's "story" +4. Output: Identity + history narrative + +## Step-by-Step + +``` +🔍 Scanning object... + +Visual Analysis: +• Material: Ceramic +• Style: Ming Dynasty blue and white +• Pattern: Dragon with cloud motifs +• Technique: Underglaze blue + +🏺 Object Identified: +Ming Dynasty (1368-1644) Blue and White Porcelain +Dragon Pattern Bowl + +📜 The Story: +This bowl was crafted during the reign of Emperor Wanli, +at the height of Jingdezhen's porcelain production. +The dragon motif signifies imperial power and protection...
+ +[Full historical narrative] +``` + +--- + +## Quick Comparison Matrix + +| # | Name | Visual | Reasoning | Uniqueness | Fun | +|---|------|--------|-----------|------------|-----| +| 007 | Spot the Difference | Heavy | Light | 9/10 | 8/10 | +| 008 | Visual Detective | Heavy | Light | 8/10 | 9/10 | +| 009 | Image Tarot | Heavy | Light | 8/10 | 10/10 | +| 010 | Color Emotion | Medium | Light | 7/10 | 7/10 | +| 011 | Before/After | Heavy | Medium | 8/10 | 8/10 | +| 012 | Visual Haiku | Heavy | Light | 9/10 | 8/10 | +| 013 | Image Alchemy | Heavy | Light | 10/10 | 10/10 | +| 014 | Lie Detector | Heavy | Medium | 9/10 | 8/10 | +| 015 | Object Archaeology | Heavy | Medium | 8/10 | 8/10 | + +--- + +**My top picks for uniqueness + fun:** +1. **013 Image Alchemy** — Most unique, viral potential +2. **009 Image Tarot** — Fun, shareable, low friction +3. **007 Spot the Difference** — Game + AI demonstration +4. **014 Visual Lie Detector** — Useful, educational + +What stands out to you? diff --git a/docs/ideas/COMPARISON.md b/docs/ideas/COMPARISON.md new file mode 100644 index 0000000..b949245 --- /dev/null +++ b/docs/ideas/COMPARISON.md @@ -0,0 +1,132 @@ +# Ideas Comparison Matrix + +**Date:** 2026-04-19 +**Purpose:** Compare all ideas to select final concept + +--- + +## Scoring Criteria + +| Criteria | Weight | Description | +|----------|--------|-------------| +| **Visual Analysis** | 30% | Heavy Kimi use (aligned with Kimi's strength) | +| **Multi-Turn** | 20% | Not single-turn, builds over time | +| **Human-AI Interaction** | 20% | Human participates, not passive | +| **Cost Efficiency** | 15% | Low API costs (image gen vs analysis) | +| **Uniqueness** | 10% | Stand out from competitors | +| **Fun/Engagement** | 5% | Enjoyable to play/watch | + +**Scoring:** 1-5 (5 = best) + +--- + +## Full Comparison Matrix + +| # | Idea | Visual | Multi-Turn | Human-AI | Cost | Unique | Fun | **Total** | 
+|---|------|--------|------------|----------|------|--------|-----|-----------| +| 001 | Visual Narrative Agent | 4 | 4 | 3 | 2 | 3 | 4 | **3.5** | +| 002 | Visual Memory Journal | 3 | 3 | 2 | 3 | 4 | 3 | **3.0** | +| 003 | Design Critic | 3 | 2 | 2 | 3 | 2 | 3 | **2.6** | +| 004 | Visual Poem | 4 | 2 | 2 | 3 | 4 | 4 | **3.2** | +| 005 | Scene Journey | 4 | 3 | 2 | 2 | 3 | 4 | **3.2** | +| 007 | Spot the Difference | 4 | 2 | 3 | 2 | 4 | 5 | **3.4** | +| 008 | Visual Detective | 4 | 3 | 2 | 3 | 4 | 4 | **3.5** | +| 009 | Image Tarot | 4 | 2 | 3 | 3 | 4 | 5 | **3.5** | +| 013 | Image Alchemy | 4 | 2 | 3 | 2 | 5 | 5 | **3.6** | +| 014 | Lie Detector | 4 | 2 | 3 | 3 | 4 | 4 | **3.4** | +| 032v2 | Art Critic | 5 | 3 | 3 | 3 | 3 | 4 | **3.7** | +| **033v2** | **Detective** | **5** | **5** | **5** | **4** | **4** | **5** | **4.7** | +| 035 | Guess Artist | 5 | 2 | 3 | 3 | 3 | 4 | **3.5** | +| Auction | Auction | 3 | 4 | 5 | 4 | 4 | 4 | **3.9** | + +--- + +## Top Contenders + +| Rank | Idea | Score | Key Strengths | +|------|------|-------|---------------| +| 🥇 | **033v2 Detective** | **4.7** | Best multi-turn, human directs, Kimi does real work | +| 🥈 | Auction | 3.9 | Human describes, human engages, cheap | +| 🥉 | 032v2 Art Critic | 3.7 | Kimi visual analysis, multi-turn | +| 4 | 013 Image Alchemy | 3.6 | Most unique, viral potential | +| 5 | 009 Image Tarot | 3.5 | Fun, shareable | + +--- + +## 033v2 Detective — Why It Wins + +### Alignment with User Goals + +| User Goal | How Detective Meets It | +|-----------|----------------------| +| Heavy visual analysis | Kimi analyzes each piece of evidence | +| Low reasoning | Pattern matching, not complex logic | +| Multi-turn | 5-7 rounds per case | +| Human-AI collaboration | Human (Chief) directs the investigation | +| Cost efficient | Mostly text between Kimi calls | +| Fun/engagement | Mystery + competition | + +### What Makes It Special + +1. **Natural two-agent roles:** Witness (sees) + Detective (thinks) +2. 
**Human as boss:** Chief directs investigation, not passive observer +3. **Multi-turn structure:** Each round builds the case +4. **Kimi's strength shines:** Visual evidence analysis is the core mechanic +5. **Scoring system:** Track cases solved, rounds taken, accuracy + +### Comparison to Other Games + +| Aspect | Spot the Difference | Tarot | Alchemy | **Detective** | +|--------|-------------------|-------|---------|---------------| +| Visual Analysis | 4 | 4 | 4 | **5** | +| Multi-Turn | 2 | 2 | 2 | **5** | +| Human Role | Judge | Receive | Submit | **Direct** | +| Narrative | None | Story | Surprise | **Full Mystery** | +| Replayability | Medium | Low | Medium | **High** | + +--- + +## Recommendation + +**Go with 033v2 Detective.** + +### Why Not Others + +| Idea | Why Not | +|------|---------| +| 001 Visual Narrative | Too similar to others, high cost | +| 007 Spot Difference | Fun but shallow (1-turn) | +| 009 Image Tarot | Not really interactive | +| 013 Image Alchemy | Unique but single interaction | +| Auction | Good but less "AI demonstration" | + +### Detective's Edge + +- **Multi-turn** = not just a quick demo +- **Human directs** = active participation +- **Kimi sees evidence** = clear AI capability showcase +- **Cost efficient** = mostly text +- **Daily cases** = reason to return + +--- + +## Next Steps for 033v2 Detective + +- [ ] Define case structure (5-7 evidence images) +- [ ] Design Chief interface (what buttons/actions) +- [ ] Plan Witness + Detective prompts +- [ ] Mock up UI +- [ ] Prototype with one case + +--- + +## Appendix: Ideas That Could Combine with Detective + +### Detective + Art Critic +Two types of daily content: Mystery case OR Art analysis + +### Detective + Auction +Hybrid mode: Evidence auction where Chief describes to Detective + +### Detective + Spot Difference +Mini-game within case: "Find the clue hidden in this photo" diff --git a/docs/research-hermes-agent.md b/docs/research-hermes-agent.md new file mode 100644 index 
0000000..dc5197c --- /dev/null +++ b/docs/research-hermes-agent.md @@ -0,0 +1,47 @@ +# Research: Hermes Agent Capabilities + +**Date:** 2026-04-19 +**Purpose:** Understand Hermes Agent framework for hackathon integration + +## Hermes 3 (Nous Research) + +### Core Capabilities +- **Advanced agentic capabilities** +- **Reliable function calling** - Trained specifically for tool use +- **Structured output** - JSON mode / Pydantic schemas +- **ChatML prompt format** - OpenAI-compatible +- Multi-turn conversation +- Long context coherence + +### Benchmark Performance +| Benchmark | Hermes 3 Score | +|-----------|---------------| +| IFEval (0-shot) | 61.70% | +| MMLU-Redux | 92.7% | +| MMLU-Pro | 81.1% | +| SimpleQA | 31.0% | + +### Function Calling +- Trained on specific prompts for tool use +- XML-tagged JSON tool call format: `<tool_call>{"name": "...", "arguments": {...}}</tool_call>` +- Supports recursive/chain tool calls +- Native tool integration via NousResearch/Hermes-Function-Calling repo + +## Hermes Agent Framework + +### Key Components +1. **ChatML format** - Structured system/user/assistant turns +2. **Tool definitions** - JSON schema for function signatures +3. **Tool parsing** - Parse and execute function calls +4.
**Response loop** - Multi-turn agentic execution + +### Integration Points +- HuggingFace Transformers +- vLLM inference +- Ollama local deployment +- OpenAI-compatible API + +## Sources +- https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B +- https://github.com/NousResearch/Hermes-Function-Calling +- https://arxiv.org/abs/2408.11857 (Hermes 3 Technical Report) diff --git a/docs/research-image-generation-apis.md b/docs/research-image-generation-apis.md new file mode 100644 index 0000000..5fcd646 --- /dev/null +++ b/docs/research-image-generation-apis.md @@ -0,0 +1,72 @@ +# Research: Image Generation APIs + +**Date:** 2026-04-19 +**Purpose:** Find affordable/free image generation for hackathon project + +## Pollinations AI (Recommended ✅) + +**Why:** Free tier, OpenAI-compatible, multiple models, simple API + +### Quick Start +```bash +# No auth needed for basic +curl "https://gen.pollinations.ai/image/a%20cat%20in%20space" + +# With auth +curl -H "Authorization: Bearer YOUR_KEY" ... 
+``` + +### Models Available +| Model | Type | Notes | +|-------|------|-------| +| `flux` | Default | Good quality | +| `zimage` | Default | Alternative | +| `wan-image` | Quality | Higher quality option | +| `qwen-image` | Quality | Alibaba model | +| `gptimage` | Quality | GPT-based | +| `seedream5` | Style | Special styles | +| `kontext` | Edit | Image editing | + +### Pricing +- **Free tier:** Weekly pollen credits (tier-based) +- **Paid:** $1 ≈ 1 Pollen +- **Free API:** Limited but usable +- **Rate limits:** Anonymous = limited, Seed/Flower = more + +### API Details +- **Base URL:** `https://gen.pollinations.ai` +- **Image endpoint:** `GET /image/{prompt}` +- **OpenAI-compatible:** `POST /v1/images/generations` +- **No setup:** Just curl it + +### Strengths +- ✅ 100% Open Source +- ✅ Free tier available +- ✅ Multiple model options +- ✅ Simple API (no complex setup) +- ✅ OpenAI-compatible SDK + +### Weaknesses +- ⚠️ Quality may not match DALL-E/Midjourney +- ⚠️ Free tier has rate limits +- ⚠️ Infrastructure may vary in reliability + +## Other Options Considered + +| Provider | Free Tier | Quality | Notes | +|----------|-----------|---------|-------| +| **Midjourney** | ❌ No | High | Expensive | +| **Stable Diffusion** | Local only | High | Needs GPU | +| **DALL-E 3** | ❌ No | High | OpenAI pricing | +| **Ideogram** | Limited | Good | API in beta | +| **Flux (Local)** | ✅ Free | High | Self-hosted, needs GPU | + +## Recommendation + +**Primary:** Pollinations AI (free tier + simplicity) +**Fallback:** Flux if we have GPU resources + +## Sources +- https://gen.pollinations.ai +- https://docs.pollinations.ai/ +- https://github.com/pollinations/pollinations diff --git a/docs/research-kimi-visual-capabilities.md b/docs/research-kimi-visual-capabilities.md new file mode 100644 index 0000000..2f283f3 --- /dev/null +++ b/docs/research-kimi-visual-capabilities.md @@ -0,0 +1,51 @@ +# Research: Kimi Visual Capabilities + +**Date:** 2026-04-19 +**Purpose:** Validate 
Kimi's visual strengths for hackathon project + +## Kimi K2.5 - Multimodal Model + +### Core Capabilities +- **Text + Images + Video** input support +- 256K context length +- Thinking/non-thinking modes +- Agent task support + +### Visual API Models +- `moonshot-v1-8k-vision-preview` +- `moonshot-v1-32k-vision-preview` +- `moonshot-v1-128k-vision-preview` +- `kimi-k2.5` (latest, supports video) + +### Supported Formats +**Images:** png, jpeg, webp, gif +**Video:** mp4, mpeg, mov, avi, x-flv, mpg, webm, wmv, 3gpp + +### Unique Visual Features +1. **Visual Coding** - Kimi Code, Kimi Claw for coding with visual context +2. **Video Understanding** - Analyzes video content (unique for multimodal models) +3. **Real-time Visual Chat** - Interactive visual conversation + +## Kimi K2 Benchmarks (Coding/Agent) + +| Benchmark | Kimi K2 Score | Notes | +|-----------|---------------|-------| +| SWE-bench Verified (Single Attempt) | **65.8%** | Global SOTA for open-source | +| SWE-bench Multilingual | 47.3% | Outperforms most proprietary | +| LiveCodeBench v6 | 53.7% | Strong coding | +| TerminalBench | 30.0% | Agentic tool use | +| Aider-Polyglot | 60.0% | Code editing | +| Tau2-Bench (avg) | ~64% | Tool use tasks | + +## Kimi Visual Strengths Summary + +✅ **Video understanding** (unique advantage) +✅ **Visual coding** capabilities +✅ **Image + Text multimodal** +✅ **Strong agentic tool use** +✅ **256K context** for large visual inputs + +## Sources +- https://platform.moonshot.cn/docs/guide/kimi-k2-5-quickstart +- https://moonshotai.github.io/Kimi-K2/ +- https://platform.moonshot.cn/docs/guide/use-kimi-vision-model