jigaido/.github/ISSUE_TEMPLATE/v2-simplify-storage.md
shokollm 2e7b20ed81 Update issue #2 storage design with new file structure
Changed from per-user flat files to group/DM directory structure:
- data/{group_id}/group.json — group bounties
- data/{group_id}/{user_id}.json — user tracking in group
- data/{user_id}/user.json — user personal bounties (DM)
- Groups isolated, no cross-group access
- Tracking is per-group-per-user
2026-04-01 21:31:34 +00:00


Simplify Storage: Replace SQLite with Per-Group and Per-User JSON Files

Status

Proposed

Background

What happened

The SQLite-based storage layer (db.py) introduced several categories of complexity that outweigh its benefits at this stage:

  1. Connection management bugs — With row_factory set and PRAGMA foreign_keys = ON, ON CONFLICT ... DO UPDATE statements silently failed to commit under the implicit transaction handling of Python's sqlite3 module. The fix required setting conn.isolation_level = None (autocommit mode) directly on the connection object after creation. These behaviors are not obvious and took significant debugging time.

  2. Test fragility — The fresh_db fixture patches DB_PATH but the SQLite connection is a module-level singleton with connection-level state. Tests passed in isolation but failed under pytest's caching, and the root cause was subtle enough to require multiple iterations.

  3. Tracking table complexity — The user_bounty_tracking + reminder_log tables with dedup logic add non-trivial query complexity for what is essentially a "bookmark" feature.

  4. Schema migrations — Any schema change requires a migration script. For a personal bot with 2 users and 50 bounties, this overhead is disproportionate.

  5. Cron/reminder system — The daily reminder cron (cron.py) requires a separate process, scheduler (cron), and reminder_log table to prevent duplicate notifications. This is a significant operational surface for a v1.

Why it happened

The current design was over-engineered for the actual usage pattern:

  • Most commands are stateless (one request → one response)
  • The author is the primary (and likely only) user
  • Scale target is 10-100 users, not 10,000+
  • The bot is a personal project, not a production service

SQLite was chosen for "correctness" but at this scale, the correctness guarantees are irrelevant while the complexity is real.

Current state

The bot works and 53/53 tests pass. But db.py is ~300 lines with subtle connection semantics, schema.sql defines 7 tables, cron.py is a separate process, and the command layer (commands.py) is entangled with the DB layer.


Proposal

Replace SQLite with a JSON file storage system — one directory per group or DM user.

Storage Design

data/
├── {group_id}/
│   ├── group.json           # group bounties (all bounties in this group)
│   └── {user_id}.json       # user tracking within this group (which bounty IDs they track)
└── {user_id}/
    └── user.json             # user's personal bounties (DM — only this user)

Bot context lookup:

| Context | Entry point |
|---|---|
| In group (chat_id = -100123) | data/-100123/group.json |
| In DM (chat_id = 123) | data/123/user.json |
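
The lookup above can be sketched as a pure function from chat_id to a path. This is a minimal sketch, not the project's code: the names DATA_DIR, entry_path and tracking_path are hypothetical, and it assumes that group chat IDs are negative and DM chat IDs are positive, as in the two examples.

```python
from pathlib import Path

DATA_DIR = Path("data")  # assumed storage root, matching the tree above


def entry_path(chat_id: int, data_dir: Path = DATA_DIR) -> Path:
    """Map a chat_id to its entry file without scanning the data directory."""
    # Assumption: a negative chat_id is a group, a positive one is a DM.
    filename = "group.json" if chat_id < 0 else "user.json"
    return data_dir / str(chat_id) / filename


def tracking_path(group_id: int, user_id: int, data_dir: Path = DATA_DIR) -> Path:
    """Path of one user's tracking file inside one group's directory."""
    return data_dir / str(group_id) / f"{user_id}.json"
```

Under that sign assumption a group directory and a DM directory can never share a name, so both kinds can sit side by side under data/.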

File: data/{group_id}/group.json — group bounties:

{
  "group_id": -100123,
  "bounties": [
    {
      "id": 1,
      "created_by_user_id": 456,
      "text": "Fix login bug",
      "link": "https://github.com/example/repo/issues/1",
      "due_date_ts": 1735689600,
      "created_at": 1735603200
    }
  ]
}

File: data/{group_id}/{user_id}.json — user tracking in a group:

{
  "user_id": 456,
  "tracked": [1, 5, 9]
}
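
A possible shape for the tracking operations on this file, operating on already-loaded dicts. The function names are hypothetical, and the choice to reject unknown bounty IDs and to skip deleted bounties when listing is an assumption, not something this proposal specifies.

```python
def track(tracking: dict, group: dict, bounty_id: int) -> bool:
    """Add bounty_id to the tracked list; False if unknown or already tracked."""
    known_ids = {b["id"] for b in group["bounties"]}
    if bounty_id not in known_ids or bounty_id in tracking["tracked"]:
        return False
    tracking["tracked"].append(bounty_id)
    return True


def untrack(tracking: dict, bounty_id: int) -> bool:
    """Remove bounty_id from the tracked list; False if it was not tracked."""
    if bounty_id not in tracking["tracked"]:
        return False
    tracking["tracked"].remove(bounty_id)
    return True


def tracked_bounties(tracking: dict, group: dict) -> list:
    """Resolve tracked IDs to bounty objects, skipping IDs with no matching bounty."""
    wanted = set(tracking["tracked"])
    return [b for b in group["bounties"] if b["id"] in wanted]
```

Because the tracking file only stores IDs, a command such as /my needs both files: the user's tracking file and the group's group.json.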

File: data/{user_id}/user.json — user's personal bounties (DM):

{
  "user_id": 123,
  "bounties": [
    {
      "id": 1,
      "text": "Fix login bug",
      "link": "https://github.com/example/repo/issues/1",
      "due_date_ts": 1735689600,
      "created_at": 1735603200
    }
  ]
}

Key design decisions

  1. Group/DM as directory — chat_id is the gateway. Group → data/{group_id}/group.json. DM → data/{user_id}/user.json. No scanning needed.

  2. Tracking is per-group-per-user — data/{group_id}/{user_id}.json stores the list of bounty IDs this user tracks in this group. Simple, isolated.

  3. No cross-group access — Group bounties live only in that group's file. A member of Group A cannot see or track Group B's bounties.

  4. Bounty IDs are sequential integers per group — Not global. Each group.json has its own next_id counter.

  5. No reminders in v1 — Drop the cron/reminder system entirely. The reminder_log table and cron.py are removed.

  6. No admin model in v1 — Anyone in the group can add bounties. Only the bounty creator can edit/delete (enforced by created_by_user_id check).
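
Decisions 4 and 6 can be illustrated on an in-memory group dict. This is a sketch under assumptions: the helper names are hypothetical, and next_id is assumed to be a top-level key of group.json that defaults to 1 when absent (the example file above does not show it).

```python
import time


def add_bounty(group: dict, user_id: int, text: str, link=None, due_date_ts=None) -> dict:
    """Append a bounty using the group's own sequential ID counter."""
    # Assumption: the counter lives at group["next_id"] and starts at 1.
    bounty_id = group.get("next_id", 1)
    bounty = {
        "id": bounty_id,
        "created_by_user_id": user_id,
        "text": text,
        "link": link,
        "due_date_ts": due_date_ts,
        "created_at": int(time.time()),
    }
    group.setdefault("bounties", []).append(bounty)
    group["next_id"] = bounty_id + 1
    return bounty


def can_modify(bounty: dict, user_id: int) -> bool:
    """Only the creator may edit or delete a bounty (no admin model in v1)."""
    return bounty.get("created_by_user_id") == user_id


def delete_bounty(group: dict, user_id: int, bounty_id: int) -> bool:
    """Delete a bounty if it exists and user_id created it."""
    for index, bounty in enumerate(group["bounties"]):
        if bounty["id"] == bounty_id:
            if not can_modify(bounty, user_id):
                return False
            del group["bounties"][index]
            return True
    return False
```

Keeping a counter instead of deriving the next ID from the current list means a deleted bounty's ID is never handed out again, so stale entries in a tracking file cannot point at a different bounty.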

Deleted components

  • db.py — removed entirely
  • schema.sql — removed entirely
  • cron.py — removed entirely
  • reminder_log table — removed
  • user_bounty_tracking table — replaced by the tracked list in data/{group_id}/{user_id}.json
  • groups table — removed (group_id is the directory name and a top-level field in group.json)
  • group_admins table — removed (simplified permission model)

Retained components

  • bot.py — minimal entrypoint
  • commands.py — command parsing and reply logic (simplified)
  • tests/ — simplified to match new data model

Implementation Plan

Phase 1: Data model + storage layer

  1. Create storage.py with:

    • get_user_path(user_id) — returns Path to user's JSON
    • load_user(user_id) — reads and parses JSON, returns dict, creates file if missing
    • save_user(user_id, data) — writes JSON atomically (temp file + rename)
    • next_bounty_id(user_id) — increments and returns next ID for that user's file
  2. No locking needed at v1 scale. Writing to a temp file in the same directory and renaming it over the target makes each write atomic.
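
The four functions listed for Phase 1 could look roughly like this for the DM case (data/{user_id}/user.json). It is a sketch, not the final storage.py: DATA_DIR, the default file contents and the next_id key are assumptions, and group files would need analogous helpers.

```python
import json
import os
import tempfile
from pathlib import Path

DATA_DIR = Path("data")  # assumed storage root


def get_user_path(user_id: int) -> Path:
    """Return the path of the user's personal (DM) bounty file."""
    return DATA_DIR / str(user_id) / "user.json"


def save_user(user_id: int, data: dict) -> None:
    """Write JSON atomically: temp file in the same directory, then rename."""
    path = get_user_path(user_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory keeps the rename on one filesystem, where it is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        os.unlink(tmp_name)  # do not leave a stray temp file behind
        raise


def load_user(user_id: int) -> dict:
    """Read the user's file, creating it with defaults if it is missing."""
    path = get_user_path(user_id)
    if not path.exists():
        # Assumed default shape; next_id is not shown in the example file.
        data = {"user_id": user_id, "next_id": 1, "bounties": []}
        save_user(user_id, data)
        return data
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def next_bounty_id(user_id: int) -> int:
    """Return the next ID for this user's file and persist the incremented counter."""
    data = load_user(user_id)
    bounty_id = data.get("next_id", 1)
    data["next_id"] = bounty_id + 1
    save_user(user_id, data)
    return bounty_id
```

The rename protects readers from seeing a half-written file. It does not serialize two read-modify-write cycles on the same file, which is the case the "no locking" decision accepts at v1 scale.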

Phase 2: Rewrite commands.py

Simplified command set:

| Command | Where | Who | Description |
|---|---|---|---|
| /bounty | Group / DM | Anyone | List all bounties (group-scoped in group, personal in DM) |
| /add <text> [link] [due] | Group | Anyone | Add bounty to group |
| /add <text> [link] [due] | DM | Anyone | Add personal bounty |
| /edit <id> [text] [link] [due] | Group | Creator only | Edit bounty |
| /edit <id> [text] [link] [due] | DM | Creator only | Edit personal bounty |
| /delete <id> | Group | Creator only | Delete bounty |
| /delete <id> | DM | Creator only | Delete personal bounty |
| /track <id> | Group | Anyone | Track a group bounty |
| /untrack <id> | Group | Anyone | Untrack a bounty |
| /my | Group | Anyone | Show tracked group bounties |
| /my | DM | Anyone | Show tracked personal bounties |
| /start | Anywhere | Anyone | Re-initialize user |
| /help | Anywhere | Anyone | Show help |

Removed commands:

  • /admin_add, /admin_remove — no admin model in v1
  • Reminder-related logic — no cron in v1

Phase 3: Simplify bot.py

  • Remove Application.post_init setup (no DB init needed)
  • Bot starts instantly — JSON files created on first use
  • No migration logic

Phase 4: Rewrite tests

  • test_commands.py — keep (parsing is unchanged)
  • test_storage.py — new, tests load_user, save_user, next_bounty_id
  • Remove all DB-dependent tests (test_db.py deleted)

Phase 5: Cleanup

  • Delete db.py, schema.sql, cron.py, test_db.py
  • Delete requirements-dev.txt (dev deps in pyproject.toml)
  • Update README to reflect simplified commands

Estimated effort

  • Storage layer: ~80 lines
  • Commands rewrite: ~200 lines (simpler than current)
  • Tests: ~100 lines
  • Cleanup: trivial

Total: ~1 day of work for one person.


When to revert to SQLite

If any of these become true, SQLite is the right choice:

  • Multiple concurrent users with write conflicts
  • Need for complex queries (across all users, aggregations, etc.)
  • Reminder system with proper deduplication
  • Scale target > 1,000 users
  • Need for ACID guarantees on concurrent writes

For a personal bot with < 100 users, JSON files are the right default.