kugetsu/.github/ISSUES/fix-queue-daemon-excess-agents.md
shokollm 54aa6419eb fix(kugetsu): prevent excess agent spawning with flock + sequential processing
- count_active_dev_sessions() now excludes pm-agent.json from count
- process_queue() now calls kugetsu start directly (not opencode run)
- process_queue() uses dynamic batch size = available_slots
- process_queue() has retry logic (max 3 attempts) on failure
- cmd_start() now uses flock around critical section
- Added notification types: task_queued, task_dequeued, task_started, task_completed, task_error
- Removed QUEUE_DAEMON_BATCH_SIZE config (no longer needed)

Fixes issue #146
2026-04-05 08:44:45 +00:00


# Fix: Queue daemon spawning excess agents due to race condition
## Problem
When enqueueing multiple tasks (e.g., 6 tasks), the queue daemon was spawning many more subagents than expected, eventually exhausting container memory.
**Root Cause:** The combination of:
1. `process_queue()` calling `opencode run` directly instead of `kugetsu start`, bypassing all concurrency logic
2. `count_active_dev_sessions()` counting `pm-agent.json` toward `MAX_CONCURRENT_AGENTS`, reducing effective dev agent slots
3. No atomic locking around session count check + session file creation (TOCTOU race condition)
4. Background spawning of multiple concurrent processes in `process_queue()`
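The TOCTOU race in (3) can be illustrated with a minimal sketch. Note this is illustrative only: the paths, the `start_agent_racy` name, and the session-file layout are assumptions, not the actual kugetsu code. The window between the count check and the session-file creation lets two concurrent starts both observe a free slot.

```shell
# Racy check-then-create pattern (illustrative; names and paths are assumed).
SESSIONS_DIR="${SESSIONS_DIR:-/tmp/kugetsu-sessions}"
MAX_CONCURRENT_AGENTS="${MAX_CONCURRENT_AGENTS:-3}"

count_active_dev_sessions() {
  ls "$SESSIONS_DIR"/*.json 2>/dev/null | wc -l
}

start_agent_racy() {
  local task="$1"
  # Time of check: another process can pass this same test before we
  # reach the "time of use" below, so both spawn into one free slot.
  if [ "$(count_active_dev_sessions)" -ge "$MAX_CONCURRENT_AGENTS" ]; then
    return 1
  fi
  # Time of use: claim the slot by creating the session file.
  echo '{}' > "$SESSIONS_DIR/$task.json"
}
```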
**Expected behavior:** With `MAX_CONCURRENT_AGENTS=3` and 6 tasks:
- Tasks should be processed sequentially via `kugetsu start`
- Only 3 dev agents should run at a time
- Tasks should queue and wait for slots to free up
## Solution
### 1. `count_active_dev_sessions()` - Exclude pm-agent
Only count actual dev agent session files (exclude `pm-agent.json`).
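A minimal sketch of the corrected counter, assuming one JSON session file per agent under a `$SESSIONS_DIR` directory (the layout and variable name are assumptions):

```shell
# Count dev agent session files, skipping the PM agent's file so it no
# longer consumes one of the MAX_CONCURRENT_AGENTS dev slots.
count_active_dev_sessions() {
  local count=0 f
  for f in "$SESSIONS_DIR"/*.json; do
    [ -e "$f" ] || continue                               # glob matched nothing
    [ "$(basename "$f")" = "pm-agent.json" ] && continue  # PM agent holds no dev slot
    count=$((count + 1))
  done
  echo "$count"
}
```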
### 2. `process_queue()` - Call `kugetsu start` directly + retry logic
- Call `kugetsu start` directly (foreground, sequential) instead of spawning `opencode run` as a background process
- Dynamic batch size = available slots (removes need for `QUEUE_DAEMON_BATCH_SIZE`)
- Retry logic (max 3 attempts) on failure
- On failure: clean up the worktree/session and revert the task to the `pending` state
- Save `fork_pid` to queue item for timeout handling
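The loop above can be sketched as follows. This is a sketch under assumptions: `dequeue_task`, `cleanup_task`, and `requeue_task` are stand-in helper names, and `count_active_dev_sessions` is assumed to print the current dev agent count.

```shell
# Sequential queue processing with dynamic batch size and bounded retries.
process_queue() {
  local available task attempt
  available=$(( MAX_CONCURRENT_AGENTS - $(count_active_dev_sessions) ))
  [ "$available" -le 0 ] && return 0     # no free slots: leave tasks queued

  # Dynamic batch size: take exactly as many tasks as there are free slots.
  while [ "$available" -gt 0 ] && task=$(dequeue_task); do
    attempt=1
    until kugetsu start "$task"; do      # foreground and sequential: no background fork
      if [ "$attempt" -ge 3 ]; then      # max 3 attempts
        cleanup_task "$task"             # remove worktree/session
        requeue_task "$task"             # revert the task to pending
        break
      fi
      attempt=$((attempt + 1))
    done
    available=$((available - 1))
  done
}
```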
### 3. `cmd_start()` - Add flock
- Add flock around critical section (count check + fork)
- Track `fork_pid` for queue item timeout handling
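A sketch of the flock-guarded critical section (the lock-file path and session-file naming are assumptions). Holding the lock across the slot check and the session-file creation closes the TOCTOU window between "check" and "create":

```shell
# Serialize the count check + session creation with an exclusive flock.
cmd_start() {
  local task="$1"
  (
    flock -x 9 || exit 1                 # exclusive lock on fd 9
    if [ "$(count_active_dev_sessions)" -ge "$MAX_CONCURRENT_AGENTS" ]; then
      exit 1                             # no free dev slot
    fi
    # Still holding the lock: claim the slot before anyone else can check.
    printf '{"task":"%s"}\n' "$task" > "$SESSIONS_DIR/$task.json"
  ) 9>"$SESSIONS_DIR/.start.lock"
}
```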
### 4. Notification System
New notification types:
| Event | Type |
|-------|------|
| Task enqueued | `task_queued` |
| Task dequeued | `task_dequeued` |
| Task started | `task_started` |
| Task completed | `task_completed` |
| Task error | `task_error` |
### 5. Config
- Remove `QUEUE_DAEMON_BATCH_SIZE` (no longer needed; the batch size is now dynamic)
## Notification Flow
| Event | Location | Type |
|-------|----------|------|
| Task enqueued | `enqueue_task()` | `task_queued` |
| Task dequeued | `process_queue()` after state change to `notified` | `task_dequeued` |
| Task started | `cmd_start()` after session file created | `task_started` |
| Task completed | `update_queue_item_state()` | `task_completed` |
| Task error | `update_queue_item_state()` | `task_error` |
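The five event types in the table could share one dispatcher. This is a hypothetical helper: the `notify` name, the JSON payload shape, and the delivery mechanism are all assumptions, not the actual notification code.

```shell
# Map a queue event to a notification payload; reject unknown types.
notify() {
  local type="$1" task="$2"
  case "$type" in
    task_queued|task_dequeued|task_started|task_completed|task_error)
      # Emit a JSON payload; the real daemon would deliver it to the user.
      printf '{"type":"%s","task":"%s"}\n' "$type" "$task" ;;
    *)
      echo "unknown notification type: $type" >&2
      return 1 ;;
  esac
}
```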
## Out of Scope
- Re-check loop in `cmd_start()` (verifying whether the session DB is reliable); deferred to a separate research issue
- Buffer mechanism for excess forking (intended only as a safety failsafe)
## Status
- [x] Issue created
- [x] Implementation
- [x] PR created (#147)
- [ ] Merged