kugetsu/.github/ISSUES/fix-queue-daemon-excess-agents.md
shokollm 54aa6419eb fix(kugetsu): prevent excess agent spawning with flock + sequential processing
- count_active_dev_sessions() now excludes pm-agent.json from count
- process_queue() now calls kugetsu start directly (not opencode run)
- process_queue() uses dynamic batch size = available_slots
- process_queue() has retry logic (max 3 attempts) on failure
- cmd_start() now uses flock around critical section
- Added notification types: task_queued, task_dequeued, task_started, task_completed, task_error
- Removed QUEUE_DAEMON_BATCH_SIZE config (no longer needed)

Fixes issue #146
2026-04-05 08:44:45 +00:00


Fix: Queue daemon spawning excess agents due to race condition

Problem

When enqueueing multiple tasks (e.g., 6 tasks), the queue daemon was spawning many more subagents than expected, eventually exhausting container memory.

Root cause: a combination of four factors:

  1. process_queue() calling opencode run directly instead of kugetsu start, bypassing all concurrency logic
  2. count_active_dev_sessions() counting pm-agent.json toward MAX_CONCURRENT_AGENTS, reducing effective dev agent slots
  3. No atomic locking around session count check + session file creation (TOCTOU race condition)
  4. Background spawning of multiple concurrent processes in process_queue()

Expected behavior: With MAX_CONCURRENT_AGENTS=3 and 6 tasks:

  • Tasks should be processed sequentially via kugetsu start
  • Only 3 dev agents should run at a time
  • Tasks should queue and wait for slots to free up

Solution

1. count_active_dev_sessions() - Exclude pm-agent

Only count actual dev agent session files (exclude pm-agent.json).
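
As a minimal sketch of the counting rule above (not the actual kugetsu code), the function below counts `*.json` session files and skips `pm-agent.json`. The `SESSIONS_DIR` variable and its default path are assumptions made for illustration; the issue does not say where session files live.

```shell
# Hypothetical session directory; the real location is not stated in this issue.
SESSIONS_DIR="${SESSIONS_DIR:-$HOME/.kugetsu/sessions}"

count_active_dev_sessions() {
    local count=0 file
    for file in "$SESSIONS_DIR"/*.json; do
        [ -e "$file" ] || continue    # glob matched nothing
        # The PM agent must not consume a dev agent slot.
        [ "$(basename "$file")" = "pm-agent.json" ] && continue
        count=$((count + 1))
    done
    echo "$count"
}
```

With `pm-agent.json` excluded, MAX_CONCURRENT_AGENTS=3 leaves three slots for dev agents rather than two.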

2. process_queue() - Call kugetsu start directly + retry logic

  • Call kugetsu start directly (foreground, sequential) instead of spawning opencode run as a background process
  • Dynamic batch size = available slots (removes need for QUEUE_DAEMON_BATCH_SIZE)
  • Retry logic (max 3 attempts) on failure
  • On failure: cleanup worktree/session and revert to pending state
  • Save fork_pid to queue item for timeout handling
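
The sequential loop with a dynamic batch size and retries might look roughly like the sketch below. It is an illustration under stated assumptions, not the real `process_queue()`: the helper names `available_slots` and `start_with_retry`, the argument layout, and passing a task id to `kugetsu start` are all hypothetical, and cleanup and `fork_pid` tracking are left out.

```shell
MAX_CONCURRENT_AGENTS="${MAX_CONCURRENT_AGENTS:-3}"
MAX_START_ATTEMPTS=3    # "max 3 attempts" per this issue

# Overridable so the sketch can be exercised without kugetsu installed.
# The argument form of `kugetsu start` is an assumption.
KUGETSU_START="${KUGETSU_START:-kugetsu start}"

available_slots() {
    # $1 = number of active dev sessions
    local free=$((MAX_CONCURRENT_AGENTS - $1))
    [ "$free" -gt 0 ] || free=0
    echo "$free"
}

start_with_retry() {
    # $1 = task id
    local attempt=1
    while [ "$attempt" -le "$MAX_START_ATTEMPTS" ]; do
        # Foreground call: the next task is not attempted until this returns.
        if $KUGETSU_START "$1"; then
            return 0
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

process_queue() {
    # $1 = active dev session count; remaining args = pending task ids
    local slots task
    slots=$(available_slots "$1"); shift
    for task in "$@"; do
        [ "$slots" -gt 0 ] || break    # batch size = available slots
        if start_with_retry "$task"; then
            slots=$((slots - 1))
        else
            # The real fix would clean up the worktree/session here
            # and revert the queue item to pending.
            echo "task $task failed after $MAX_START_ATTEMPTS attempts" >&2
        fi
    done
}
```

Because nothing is sent to the background, at most one start is in flight at any time, and the loop stops once the free slots are used up.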

3. cmd_start() - Add flock

  • Add flock around critical section (count check + fork)
  • Track fork_pid for queue item timeout handling
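
A sketch of the locking pattern, assuming the `flock` utility from util-linux is available: the count check and the session file creation run inside one exclusive lock, so two concurrent starts cannot both see a free slot. `reserve_slot`, `LOCK_FILE`, and `SESSIONS_DIR` are hypothetical names; in the real `cmd_start()` the fork would also sit inside the locked region.

```shell
# Hypothetical paths; the real ones are not stated in this issue.
SESSIONS_DIR="${SESSIONS_DIR:-$HOME/.kugetsu/sessions}"
LOCK_FILE="${LOCK_FILE:-$SESSIONS_DIR/.start.lock}"
MAX_CONCURRENT_AGENTS="${MAX_CONCURRENT_AGENTS:-3}"

count_active_dev_sessions() {
    local count=0 file
    for file in "$SESSIONS_DIR"/*.json; do
        [ -e "$file" ] || continue
        [ "$(basename "$file")" = "pm-agent.json" ] && continue
        count=$((count + 1))
    done
    echo "$count"
}

reserve_slot() {
    # $1 = task id. Returns 0 if a slot was reserved, 2 if all slots are busy.
    (
        # Block until this process holds the exclusive lock on fd 9.
        flock -x 9 || exit 1
        if [ "$(count_active_dev_sessions)" -ge "$MAX_CONCURRENT_AGENTS" ]; then
            exit 2
        fi
        # Check and create happen under the same lock, closing the TOCTOU window.
        : > "$SESSIONS_DIR/$1.json"
    ) 9>"$LOCK_FILE"
}
```

The lock is released when the subshell exits and fd 9 is closed, so an early exit cannot leave it held.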

4. Notification System

New notification types:

| Event | Type |
| --- | --- |
| Task enqueued | task_queued |
| Task dequeued | task_dequeued |
| Task started | task_started |
| Task completed | task_completed |
| Task error | task_error |

5. Config

  • Remove QUEUE_DAEMON_BATCH_SIZE (no longer needed - batch size is now dynamic)

Notification Flow

| Event | Location | Type |
| --- | --- | --- |
| Task enqueued | enqueue_task() | task_queued |
| Task dequeued | process_queue(), after state change to notified | task_dequeued |
| Task started | cmd_start(), after session file created | task_started |
| Task completed | update_queue_item_state() | task_completed |
| Task error | update_queue_item_state() | task_error |
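
Since both task_completed and task_error are emitted from update_queue_item_state(), one way to pick the type is a small state-to-type mapping. The sketch below is only illustrative: the helper name and the state names `completed` and `error` are assumptions, as the issue does not list the queue item states.

```shell
# Hypothetical helper: map a queue item state to its notification type.
notification_type_for_state() {
    case "$1" in
        completed) echo "task_completed" ;;
        error)     echo "task_error" ;;
        *)         return 1 ;;    # other states do not notify from here
    esac
}
```

A caller could send a notification only when the helper succeeds, leaving the other three types to the locations listed in the table.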

Out of Scope

  • Re-check loop in cmd_start (checking if session DB is reliable) - deferred to separate research issue
  • Buffer mechanism for excess forking (safety failsafe only)

Status

  • Issue created
  • Implementation
  • PR created (#147)
  • Merged