fix(kugetsu): prevent excess agent spawning with flock + sequential processing
- count_active_dev_sessions() now excludes pm-agent.json from count
- process_queue() now calls kugetsu start directly (not opencode run)
- process_queue() uses dynamic batch size = available_slots
- process_queue() has retry logic (max 3 attempts) on failure
- cmd_start() now uses flock around critical section
- Added notification types: task_queued, task_dequeued, task_started, task_completed, task_error
- Removed QUEUE_DAEMON_BATCH_SIZE config (no longer needed)

Fixes issue #146
This commit adds one new vendored file, `.github/ISSUES/fix-queue-daemon-excess-agents.md` (67 lines):
# Fix: Queue daemon spawning excess agents due to race condition

## Problem

When enqueueing multiple tasks (e.g., 6 tasks), the queue daemon was spawning many more subagents than expected, eventually exhausting container memory.
**Root Cause:** The combination of:

1. `process_queue()` calling `opencode run` directly instead of `kugetsu start`, bypassing all concurrency logic
2. `count_active_dev_sessions()` counting `pm-agent.json` toward `MAX_CONCURRENT_AGENTS`, reducing effective dev agent slots
3. No atomic locking around session count check + session file creation (TOCTOU race condition)
4. Background spawning of multiple concurrent processes in `process_queue()`

**Expected behavior:** With `MAX_CONCURRENT_AGENTS=3` and 6 tasks:

- Tasks should be processed sequentially via `kugetsu start`
- Only 3 dev agents should run at a time
- Tasks should queue and wait for slots to free up
## Solution

### 1. `count_active_dev_sessions()` - Exclude pm-agent

Only count actual dev agent session files (exclude `pm-agent.json`).
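A minimal sketch of this counting rule, assuming session files live in a flat directory of `*.json` files (the directory path and file layout here are assumptions, not the real kugetsu paths):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: count dev agent session files while excluding the
# PM agent's own session file, so it no longer consumes a dev slot.
SESSION_DIR="${SESSION_DIR:-$(mktemp -d)}"

count_active_dev_sessions() {
  local count=0 f
  for f in "$SESSION_DIR"/*.json; do
    [ -e "$f" ] || continue                               # glob matched nothing
    [ "$(basename "$f")" = "pm-agent.json" ] && continue  # skip the PM agent
    count=$((count + 1))
  done
  echo "$count"
}

# demo: one PM agent session plus two dev agent sessions
mkdir -p "$SESSION_DIR"
touch "$SESSION_DIR/pm-agent.json" "$SESSION_DIR/dev-1.json" "$SESSION_DIR/dev-2.json"
count_active_dev_sessions   # prints 2: pm-agent.json is excluded
```

With the old behavior the same directory would count as 3 sessions, leaving one fewer effective dev slot under `MAX_CONCURRENT_AGENTS`.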
### 2. `process_queue()` - Call `kugetsu start` directly + retry logic

- Call `kugetsu start` directly (foreground, sequential) instead of spawning `opencode run` as a background process
- Dynamic batch size = available slots (removes the need for `QUEUE_DAEMON_BATCH_SIZE`)
- Retry logic (max 3 attempts) on failure
- On failure: clean up the worktree/session and revert the task to the `pending` state
- Save `fork_pid` to the queue item for timeout handling
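The loop above can be sketched as follows. Every name here (`dequeue_task`, `cleanup_worktree_and_session`, `revert_to_pending`, the stubbed `kugetsu`) is a stand-in so the sketch runs on its own; the real implementations live in kugetsu:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of sequential queue processing with a dynamic batch
# size and retry logic. Stubs below exist only so the sketch is runnable.
MAX_CONCURRENT_AGENTS=3
MAX_ATTEMPTS=3

# --- stubs for demonstration only ---
QUEUE_FILE="$(mktemp)"
printf '%s\n' task-a task-b task-c task-d > "$QUEUE_FILE"
count_active_dev_sessions() { echo 1; }            # pretend 1 dev agent is busy
dequeue_task() {
  local t; t="$(head -n1 "$QUEUE_FILE")"
  [ -n "$t" ] || return 1
  tail -n +2 "$QUEUE_FILE" > "$QUEUE_FILE.tmp" && mv "$QUEUE_FILE.tmp" "$QUEUE_FILE"
  echo "$t"
}
kugetsu() { echo "kugetsu $1 $2 (stub)"; }         # always "succeeds"
cleanup_worktree_and_session() { :; }
revert_to_pending() { echo "reverted $1 to pending"; }
# ------------------------------------

process_queue() {
  local active available task attempt
  active="$(count_active_dev_sessions)"
  available=$((MAX_CONCURRENT_AGENTS - active))     # dynamic batch size
  [ "$available" -le 0 ] && return 0

  while [ "$available" -gt 0 ] && task="$(dequeue_task)"; do
    attempt=1
    # Foreground, sequential: kugetsu start, not a background opencode run.
    until kugetsu start "$task"; do
      cleanup_worktree_and_session "$task"
      if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
        revert_to_pending "$task"                   # give up after 3 attempts
        break
      fi
      attempt=$((attempt + 1))
    done
    available=$((available - 1))
  done
}

process_queue   # with 1 busy slot of 3, only 2 of the 4 queued tasks start
```

Because each `kugetsu start` runs in the foreground, the daemon can never have more in-flight starts than free slots, which is what removes the need for `QUEUE_DAEMON_BATCH_SIZE`.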
### 3. `cmd_start()` - Add flock

- Add flock around critical section (count check + fork)
- Track `fork_pid` for queue item timeout handling
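A sketch of the flock critical section, assuming `flock(1)` from util-linux: the slot check and the session-file creation happen under one exclusive lock, closing the TOCTOU window between "count is below the limit" and "my session file exists". The lock path, session layout, and counting command are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: serialize the count-check + session-file creation
# so two concurrent cmd_start invocations cannot both grab the last slot.
LOCK_FILE="${LOCK_FILE:-/tmp/kugetsu.lock}"
SESSION_DIR="${SESSION_DIR:-$(mktemp -d)}"
MAX_CONCURRENT_AGENTS=3

cmd_start() {
  local task="$1"
  (
    flock -x 9 || exit 1               # exclusive lock on fd 9; block until held
    active="$(find "$SESSION_DIR" -name '*.json' ! -name 'pm-agent.json' | wc -l)"
    if [ "$active" -ge "$MAX_CONCURRENT_AGENTS" ]; then
      echo "no free slot" >&2
      exit 1
    fi
    # Still inside the lock: create the session file so a concurrent
    # cmd_start already sees this slot as taken.
    printf '{"task": "%s"}\n' "$task" > "$SESSION_DIR/$task.json"
  ) 9>"$LOCK_FILE"
  # ...fork the agent here and record fork_pid on the queue item...
}

cmd_start demo-task && echo "session created"
```

Without the lock, two processes can both pass the count check before either writes its session file, which is exactly the race described in the Problem section.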
### 4. Notification System

New notification types:

| Event | Type |
|-------|------|
| Task enqueued | `task_queued` |
| Task dequeued | `task_dequeued` |
| Task started | `task_started` |
| Task completed | `task_completed` |
| Task error | `task_error` |
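As an illustration only, a notification emitter for these types might look like the following; the `notify` helper and its JSON payload shape are hypothetical, not kugetsu's actual notification API:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: emit one JSON line per lifecycle event using the
# notification types from the table above.
notify() {
  local type="$1" task="$2"
  printf '{"type":"%s","task":"%s","ts":%s}\n' "$type" "$task" "$(date +%s)"
}

notify task_queued  build-docs
notify task_started build-docs
```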
### 5. Config

- Remove `QUEUE_DAEMON_BATCH_SIZE` (no longer needed - batch size is now dynamic)
## Notification Flow

| Event | Location | Type |
|-------|----------|------|
| Task enqueued | `enqueue_task()` | `task_queued` |
| Task dequeued | `process_queue()` after state change to `notified` | `task_dequeued` |
| Task started | `cmd_start()` after session file created | `task_started` |
| Task completed | `update_queue_item_state()` | `task_completed` |
| Task error | `update_queue_item_state()` | `task_error` |
## Out of Scope

- Re-check loop in `cmd_start()` (verifying whether the session DB is reliable); deferred to a separate research issue
- Buffer mechanism for excess forking (safety failsafe only)
## Status

- [x] Issue created
- [x] Implementation
- [ ] PR created
- [ ] Merged