- count_active_dev_sessions() now excludes pm-agent.json from count
- process_queue() now calls kugetsu start directly (not opencode run)
- process_queue() uses dynamic batch size = available_slots
- process_queue() has retry logic (max 3 attempts) on failure
- cmd_start() now uses flock around critical section
- Added notification types: task_queued, task_dequeued, task_started, task_completed, task_error
- Removed QUEUE_DAEMON_BATCH_SIZE config (no longer needed)

Fixes issue #146
Fix: Queue daemon spawning excess agents due to race condition
Problem
When enqueueing multiple tasks (e.g., 6 tasks), the queue daemon was spawning many more subagents than expected, eventually exhausting container memory.
Root Cause: The combination of:
- process_queue() calling opencode run directly instead of kugetsu start, bypassing all concurrency logic
- count_active_dev_sessions() counting pm-agent.json toward MAX_CONCURRENT_AGENTS, reducing effective dev agent slots
- No atomic locking around session count check + session file creation (TOCTOU race condition)
- Background spawning of multiple concurrent processes in process_queue()
Expected behavior: With MAX_CONCURRENT_AGENTS=3 and 6 tasks:
- Tasks should be processed sequentially via kugetsu start
- Only 3 dev agents should run at a time
- Tasks should queue and wait for slots to free up
Solution
1. count_active_dev_sessions() - Exclude pm-agent
Only count actual dev agent session files (exclude pm-agent.json).
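A minimal sketch of the counting rule above, assuming each session is stored as one JSON file in a single directory. The SESSIONS_DIR variable and its default path are hypothetical; only the function name and pm-agent.json come from this plan.

```shell
#!/usr/bin/env bash
# Assumption: one <name>.json file per running session lives in this directory.
SESSIONS_DIR="${SESSIONS_DIR:-$HOME/.kugetsu/sessions}"   # hypothetical path

# count_active_dev_sessions [DIR]: print the number of session files,
# skipping pm-agent.json so the PM agent does not consume a dev slot.
count_active_dev_sessions() {
    local dir="${1:-$SESSIONS_DIR}" count=0 f
    for f in "$dir"/*.json; do
        [ -e "$f" ] || continue                      # glob matched nothing
        [ "$(basename "$f")" = "pm-agent.json" ] && continue
        count=$((count + 1))
    done
    echo "$count"
}
```

With pm-agent.json and two dev session files present, this prints 2 rather than 3, so all MAX_CONCURRENT_AGENTS slots stay available to dev agents.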
2. process_queue() - Call kugetsu start directly + retry logic
- Call kugetsu start directly (foreground, sequential) instead of spawning opencode run background process
- Dynamic batch size = available slots (removes need for QUEUE_DAEMON_BATCH_SIZE)
- Retry logic (max 3 attempts) on failure
- On failure: cleanup worktree/session and revert to pending state
- Save fork_pid to queue item for timeout handling
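The dynamic batch size and retry behaviour listed above might look roughly like the sketch below. The helper names available_slots and start_with_retry, and the MAX_START_ATTEMPTS variable, are hypothetical. The plan does not say whether the three attempts happen within one call or across daemon cycles; this sketch assumes the former.

```shell
#!/usr/bin/env bash
MAX_CONCURRENT_AGENTS="${MAX_CONCURRENT_AGENTS:-3}"
MAX_START_ATTEMPTS=3          # "max 3 attempts" from the plan

# available_slots ACTIVE: print how many dev agents may still be started.
# This value is the dynamic batch size; it is never negative.
available_slots() {
    local free=$(( MAX_CONCURRENT_AGENTS - $1 ))
    [ "$free" -lt 0 ] && free=0
    echo "$free"
}

# start_with_retry CMD [ARGS...]: run CMD in the foreground, retrying on
# failure. Returns 0 on the first success, 1 once all attempts have failed.
start_with_retry() {
    local attempt
    for (( attempt = 1; attempt <= MAX_START_ATTEMPTS; attempt++ )); do
        "$@" && return 0
        echo "attempt $attempt/$MAX_START_ATTEMPTS failed: $*" >&2
    done
    return 1
}
```

A caller would take at most `available_slots "$(count_active_dev_sessions)"` pending items per cycle and run something like `start_with_retry kugetsu start ...` for each (the arguments to kugetsu start are not specified here). A non-zero return is the point where the worktree/session cleanup and the revert to pending would happen.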
3. cmd_start() - Add flock
- Add flock around critical section (count check + fork)
- Track fork_pid for queue item timeout handling
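One common way to make the count check and the fork atomic is the file-descriptor form of flock(1) from util-linux, sketched below. The LOCK_FILE path and the with_start_lock helper are hypothetical; the actual critical section in cmd_start() may be structured differently.

```shell
#!/usr/bin/env bash
LOCK_FILE="${LOCK_FILE:-/tmp/kugetsu-start.lock}"   # hypothetical path

# with_start_lock CMD [ARGS...]: run CMD while holding an exclusive lock.
# CMD runs in a subshell, so variables it sets are not visible to the caller.
# The function returns the exit status of CMD (or 1 if locking fails).
with_start_lock() {
    (
        flock -x 9 || exit 1      # block until fd 9 is exclusively locked
        "$@"
    ) 9>"$LOCK_FILE"              # lock is released when the subshell exits
}
```

A second caller blocks at the flock line until the first has finished its count check and fork, which closes the TOCTOU window. One caveat with this idiom: a background child forked inside the critical section inherits fd 9 and keeps the lock held, so the agent would need to be launched with the descriptor closed, for example `some_agent_command 9>&- &`.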
4. Notification System
New notification types:
| Event | Type |
|---|---|
| Task enqueued | task_queued |
| Task dequeued | task_dequeued |
| Task started | task_started |
| Task completed | task_completed |
| Task error | task_error |
5. Config
- Remove QUEUE_DAEMON_BATCH_SIZE (no longer needed - batch size is now dynamic)
Notification Flow
| Event | Location | Type |
|---|---|---|
| Task enqueued | enqueue_task() | task_queued |
| Task dequeued | process_queue() after state change to notified | task_dequeued |
| Task started | cmd_start() after session file created | task_started |
| Task completed | update_queue_item_state() | task_completed |
| Task error | update_queue_item_state() | task_error |
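As an illustration of how the call sites in the table could share one emitter, here is a small sketch. The notify function, its arguments, and the tab-separated output format are all assumptions; only the five type names come from this plan.

```shell
#!/usr/bin/env bash
# The five notification types defined in this plan.
NOTIFICATION_TYPES="task_queued task_dequeued task_started task_completed task_error"

# notify TYPE TASK_ID [MESSAGE]: print one tab-separated notification line.
# Unknown types are rejected so a typo at a call site fails loudly.
notify() {
    local type="$1" task_id="$2" message="${3:-}"
    case " $NOTIFICATION_TYPES " in
        *" $type "*) ;;
        *) echo "unknown notification type: $type" >&2; return 1 ;;
    esac
    printf '%s\t%s\t%s\n' "$type" "$task_id" "$message"
}
```

Under these assumptions, enqueue_task() would call `notify task_queued "$task_id"`, and update_queue_item_state() would choose between task_completed and task_error based on the new state.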
Out of Scope
- Re-check loop in cmd_start (checking if session DB is reliable) - deferred to separate research issue
- Buffer mechanism for excess forking (safety failsafe only)
Status
- Issue created
- Implementation
- PR created
- Merged