fix(kugetsu): prevent excess agent spawning with flock + sequential processing
- count_active_dev_sessions() now excludes pm-agent.json from count
- process_queue() now calls kugetsu start directly (not opencode run)
- process_queue() uses dynamic batch size = available_slots
- process_queue() has retry logic (max 3 attempts) on failure
- cmd_start() now uses flock around critical section
- Added notification types: task_queued, task_dequeued, task_started, task_completed, task_error
- Removed QUEUE_DAEMON_BATCH_SIZE config (no longer needed)

Fixes issue #146
This commit adds one new vendored file, `.github/ISSUES/fix-queue-daemon-excess-agents.md` (67 lines):
# Fix: Queue daemon spawning excess agents due to race condition

## Problem

When enqueueing multiple tasks (e.g., 6 tasks), the queue daemon was spawning many more subagents than expected, eventually exhausting container memory.
**Root Cause:** The combination of:

1. `process_queue()` calling `opencode run` directly instead of `kugetsu start`, bypassing all concurrency logic
2. `count_active_dev_sessions()` counting `pm-agent.json` toward `MAX_CONCURRENT_AGENTS`, reducing effective dev agent slots
3. No atomic locking around session count check + session file creation (TOCTOU race condition)
4. Background spawning of multiple concurrent processes in `process_queue()`

**Expected behavior:** With `MAX_CONCURRENT_AGENTS=3` and 6 tasks:

- Tasks should be processed sequentially via `kugetsu start`
- Only 3 dev agents should run at a time
- Tasks should queue and wait for slots to free up
## Solution

### 1. `count_active_dev_sessions()` - Exclude pm-agent

Only count actual dev agent session files (exclude `pm-agent.json`).
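A minimal sketch of this counting rule, assuming session files live in a flat directory of `*.json` files (the directory path and file layout here are assumptions, not the real kugetsu paths):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: count dev agent session files while excluding the
# PM agent's own session file, so it no longer consumes a dev slot.
SESSION_DIR="${SESSION_DIR:-$(mktemp -d)}"

count_active_dev_sessions() {
  local count=0 f
  for f in "$SESSION_DIR"/*.json; do
    [ -e "$f" ] || continue                               # glob matched nothing
    [ "$(basename "$f")" = "pm-agent.json" ] && continue  # skip the PM agent
    count=$((count + 1))
  done
  echo "$count"
}

# demo: one PM agent session plus two dev agent sessions
mkdir -p "$SESSION_DIR"
touch "$SESSION_DIR/pm-agent.json" "$SESSION_DIR/dev-1.json" "$SESSION_DIR/dev-2.json"
count_active_dev_sessions   # prints 2: pm-agent.json is excluded
```

With the old behavior the same directory would count as 3 sessions, leaving one fewer effective dev slot under `MAX_CONCURRENT_AGENTS`.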
### 2. `process_queue()` - Call `kugetsu start` directly + retry logic

- Call `kugetsu start` directly (foreground, sequential) instead of spawning `opencode run` as a background process
- Dynamic batch size = available slots (removes the need for `QUEUE_DAEMON_BATCH_SIZE`)
- Retry logic (max 3 attempts) on failure
- On failure: clean up the worktree/session and revert the task to the `pending` state
- Save `fork_pid` to the queue item for timeout handling
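The loop above can be sketched as follows. Every name here (`dequeue_task`, `cleanup_worktree_and_session`, `revert_to_pending`, the stubbed `kugetsu`) is a stand-in so the sketch runs on its own; the real implementations live in kugetsu:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of sequential queue processing with a dynamic batch
# size and retry logic. Stubs below exist only so the sketch is runnable.
MAX_CONCURRENT_AGENTS=3
MAX_ATTEMPTS=3

# --- stubs for demonstration only ---
QUEUE_FILE="$(mktemp)"
printf '%s\n' task-a task-b task-c task-d > "$QUEUE_FILE"
count_active_dev_sessions() { echo 1; }            # pretend 1 dev agent is busy
dequeue_task() {
  local t; t="$(head -n1 "$QUEUE_FILE")"
  [ -n "$t" ] || return 1
  tail -n +2 "$QUEUE_FILE" > "$QUEUE_FILE.tmp" && mv "$QUEUE_FILE.tmp" "$QUEUE_FILE"
  echo "$t"
}
kugetsu() { echo "kugetsu $1 $2 (stub)"; }         # always "succeeds"
cleanup_worktree_and_session() { :; }
revert_to_pending() { echo "reverted $1 to pending"; }
# ------------------------------------

process_queue() {
  local active available task attempt
  active="$(count_active_dev_sessions)"
  available=$((MAX_CONCURRENT_AGENTS - active))     # dynamic batch size
  [ "$available" -le 0 ] && return 0

  while [ "$available" -gt 0 ] && task="$(dequeue_task)"; do
    attempt=1
    # Foreground, sequential: kugetsu start, not a background opencode run.
    until kugetsu start "$task"; do
      cleanup_worktree_and_session "$task"
      if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
        revert_to_pending "$task"                   # give up after 3 attempts
        break
      fi
      attempt=$((attempt + 1))
    done
    available=$((available - 1))
  done
}

process_queue   # with 1 busy slot of 3, only 2 of the 4 queued tasks start
```

Because each `kugetsu start` runs in the foreground, the daemon can never have more in-flight starts than free slots, which is what removes the need for `QUEUE_DAEMON_BATCH_SIZE`.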
### 3. `cmd_start()` - Add flock

- Add flock around critical section (count check + fork)
- Track `fork_pid` for queue item timeout handling
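A sketch of the flock critical section, assuming `flock(1)` from util-linux: the slot check and the session-file creation happen under one exclusive lock, closing the TOCTOU window between "count is below the limit" and "my session file exists". The lock path, session layout, and counting command are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: serialize the count-check + session-file creation
# so two concurrent cmd_start invocations cannot both grab the last slot.
LOCK_FILE="${LOCK_FILE:-/tmp/kugetsu.lock}"
SESSION_DIR="${SESSION_DIR:-$(mktemp -d)}"
MAX_CONCURRENT_AGENTS=3

cmd_start() {
  local task="$1"
  (
    flock -x 9 || exit 1               # exclusive lock on fd 9; block until held
    active="$(find "$SESSION_DIR" -name '*.json' ! -name 'pm-agent.json' | wc -l)"
    if [ "$active" -ge "$MAX_CONCURRENT_AGENTS" ]; then
      echo "no free slot" >&2
      exit 1
    fi
    # Still inside the lock: create the session file so a concurrent
    # cmd_start already sees this slot as taken.
    printf '{"task": "%s"}\n' "$task" > "$SESSION_DIR/$task.json"
  ) 9>"$LOCK_FILE"
  # ...fork the agent here and record fork_pid on the queue item...
}

cmd_start demo-task && echo "session created"
```

Without the lock, two processes can both pass the count check before either writes its session file, which is exactly the race described in the Problem section.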
### 4. Notification System

New notification types:

| Event | Type |
|-------|------|
| Task enqueued | `task_queued` |
| Task dequeued | `task_dequeued` |
| Task started | `task_started` |
| Task completed | `task_completed` |
| Task error | `task_error` |
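As an illustration only, a notification emitter for these types might look like the following; the `notify` helper and its JSON payload shape are hypothetical, not kugetsu's actual notification API:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: emit one JSON line per lifecycle event using the
# notification types from the table above.
notify() {
  local type="$1" task="$2"
  printf '{"type":"%s","task":"%s","ts":%s}\n' "$type" "$task" "$(date +%s)"
}

notify task_queued  build-docs
notify task_started build-docs
```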
### 5. Config

- Remove `QUEUE_DAEMON_BATCH_SIZE` (no longer needed - batch size is now dynamic)
## Notification Flow

| Event | Location | Type |
|-------|----------|------|
| Task enqueued | `enqueue_task()` | `task_queued` |
| Task dequeued | `process_queue()` after state change to `notified` | `task_dequeued` |
| Task started | `cmd_start()` after session file created | `task_started` |
| Task completed | `update_queue_item_state()` | `task_completed` |
| Task error | `update_queue_item_state()` | `task_error` |
## Out of Scope

- Re-check loop in `cmd_start()` (verifying whether the session DB is reliable); deferred to a separate research issue
- Buffer mechanism for excess forking (safety failsafe only)
## Status

- [x] Issue created
- [x] Implementation
- [ ] PR created
- [ ] Merged