Add Kafka stream backend architecture with split KV/stream protocols by tcdent · Pull Request #16 · Agent-CI/agentexec

tcdent · 2026-03-27T05:12:48Z

Introduces a layered backend architecture to support Kafka alongside Redis:

KVBackend protocol: key-value, counters, sorted sets, locks, pub/sub
StreamBackend protocol: produce/consume, topic management, compacted topics
Redis KV backend: extracted from redis_backend.py, implements KVBackend
Kafka stream backend: connection mgmt, produce/consume via aiokafka
Operations layer (ops.py): bridges agentexec modules to either backend,
with lock no-ops when stream backend handles partition-based isolation
Config additions: kv_backend, stream_backend, kafka_* settings
Full backward compatibility: legacy state_backend path still works

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Introduces a layered backend architecture to support Kafka alongside Redis: - KVBackend protocol: key-value, counters, sorted sets, locks, pub/sub - StreamBackend protocol: produce/consume, topic management, compacted topics - Redis KV backend: extracted from redis_backend.py, implements KVBackend - Kafka stream backend: connection mgmt, produce/consume via aiokafka - Operations layer (ops.py): bridges agentexec modules to either backend, with lock no-ops when stream backend handles partition-based isolation - Config additions: kv_backend, stream_backend, kafka_* settings - Full backward compatibility: legacy state_backend path still works https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

…allers Replaces the dual KV+stream model with a single backend choice: AGENTEXEC_STATE_BACKEND=agentexec.state.redis_backend (default) AGENTEXEC_STATE_BACKEND=agentexec.state.kafka_backend Key changes: - Unified StateBackend protocol with semantic ops (queue_push/queue_pop instead of rpush/lpush/brpop) - ops.py: thin delegation layer, no dual-mode branching - All callers (queue.py, schedule.py, tracker.py, worker/pool.py, worker/event.py, worker/logging.py, core/results.py) now go through ops instead of touching state.backend directly - kafka_backend.py: full implementation with compacted topics for KV, in-memory caches for sorted sets/counters, no-op locks - redis_backend.py: adds queue_push/queue_pop wrapping rpush/lpush/brpop - Removed dual-mode files: kv_backend.py, stream_backend.py, redis_kv_backend.py, kafka_stream_backend.py - Config simplified: single state_backend, no kv_backend/stream_backend state.backend still exported for backward compat with existing tests. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Key changes: - queue_commit(): acknowledges successful task processing (commits offset in Kafka, no-op in Redis) - queue_nack(): signals task should be retried (skips offset commit in Kafka, no-op in Redis). Task stays in its original partition position, preserving ordering. - Worker loop: commits on success, nacks on failure with retry tracking. After max_task_retries exhausted, commits to move past the message. - Task.retry_count field tracks attempt number - AGENTEXEC_MAX_TASK_RETRIES config (default 3) - task.py migrated from state.aset_result to ops.aset_result Kafka partition assignment acts as an implicit "in progress" marker — only the assigned consumer can read from its partitions, so no other worker can steal an uncommitted task. Redelivery only happens on consumer crash (heartbeat timeout) or explicit rebalance. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Restructures the worker loop to support concurrent task processing: - Worker._run() now spawns tasks as asyncio coroutines instead of awaiting them inline - asyncio.Semaphore caps concurrency at tasks_per_worker (default 1, backward compatible) - Poll loop stays active while tasks run, keeping Kafka consumer heartbeats alive for long-running AI agent tasks - In-flight tasks are awaited on shutdown for graceful completion New config: - AGENTEXEC_TASKS_PER_WORKER: max concurrent tasks per worker process Total concurrency = num_workers * tasks_per_worker This solves the Kafka partition-per-consumer constraint: instead of needing one process per partition, a single worker can own multiple partitions and process their tasks concurrently. Ideal for I/O-bound AI workloads where tasks spend most time waiting for LLM responses. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

…beat" This reverts commit a5a6584.

Each MP worker process now calls ops.configure(worker_id=...) on startup, which the Kafka backend uses to build unique client_id strings (e.g. agentexec-worker-0, agentexec-producer-1). This lets broker logs and monitoring tools distinguish between consumers in the same group. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Activity lifecycle (create, update, list, detail) now goes through the ops layer like all other state operations, making it backend-agnostic. Kafka backend: activity records are produced to a compacted topic (agentexec.activity) keyed by agent_id. Each update appends to the log history and re-produces the full record. Pre-compaction, all intermediate states are visible; post-compaction, only the final state survives. In-memory cache serves queries. Redis backend: activity functions wrap the existing SQLAlchemy/Postgres logic with lazy imports to avoid circular dependencies. tracker.py: rewritten to delegate to ops instead of using SQLAlchemy directly. Session parameter kept for backward compatibility but ignored (backends manage their own sessions). https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Each backend (redis_backend, kafka_backend) is now a package with: - connection.py: client/producer management and lifecycle - state.py: KV, counters, locks, pub/sub, sorted sets, serialization - queue.py: task queue push/pop/commit/nack - activity.py: task lifecycle tracking New protocols.py defines StateProtocol, QueueProtocol, and ActivityProtocol as separate domain contracts. backend.py validates that a backend implements all three. Import paths unchanged — agentexec.state.redis_backend and agentexec.state.kafka_backend still work via package __init__.py re-exports. ops.py and config remain untouched. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Drop sync/async duality — all I/O methods are now async (no more `a` prefix). Rename Redis-ism method names to descriptive ones: get/set/delete → store_get/store_set/store_delete, incr/decr → counter_incr/counter_decr, zadd/zrangebyscore/zrem → index_add/ index_range/index_remove, publish/subscribe → log_publish/ log_subscribe. Pool.start() and Pool.shutdown() are now async, with schedule registration deferred to start(). All callers, protocols, and tests updated. 255 tests pass. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

- CI workflow with two jobs: unit tests (fakeredis) and Kafka integration tests (real broker via bitnami/kafka:3.9 KRaft mode) - Integration tests cover: KV store, counters, sorted index, serialization, queue push/pop/commit, activity lifecycle, log pub/sub, and connection management - Add `kafka` optional dependency group (aiokafka>=0.11.0) - Tests skip gracefully when Kafka not available https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

…check - bitnami/kafka:3.9 → 3.7 (3.9 doesn't exist) - Add -o "addopts=" to both pytest commands to avoid --ty/--cov conflicts - Switch Kafka readiness check from docker exec to nc -z https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

bitnami/kafka image failed to pull. apache/kafka is the official Apache Kafka Docker image with KRaft mode built in. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

…rtbeat - Use per-topic consumer group IDs to avoid cross-topic rebalancing - Add retry loop in queue_pop for partition assignment delays - Configure faster heartbeat (1s) and session timeout (10s) - Increase test queue_pop timeout to 10s for CI reliability https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Consumer group protocol causes hangs during group-join/rebalance in CI. Manual partition assignment + explicit offset tracking eliminates group coordination overhead entirely. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

force_metadata_update doesn't exist on AIOKafkaConsumer in aiokafka 0.13.0. Replace with a retry loop that polls partitions_for_topic until metadata is available. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

group_id triggers GroupCoordinator even with manual partition assignment, causing hangs in CI. Remove it entirely — offset tracking is implicit via consumer position after getmany(). https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Manual partition assignment without group_id fails because metadata isn't fetched for unsubscribed topics. Switch back to subscribe() with per-topic group IDs. Also set group.initial.rebalance.delay.ms=0 on the CI broker for instant group joins. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Consumer group protocol hangs reliably in CI. Use manual partition assignment with admin client describe_topics for reliable partition discovery instead of consumer metadata which requires subscription. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

apache/kafka image has persistent pull failures from GitHub Actions. Confluent Platform image is more widely available. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Service containers use a separate Docker pull mechanism that's failing with rate limits. docker run in a step has better retry behavior and runs in parallel with dependency installation. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

…n CI - Add docker-compose.kafka.yml with recommended apache/kafka:3.9.0 setup - Remove debug print statements from queue.py and tests - CI uses docker run instead of service containers (more reliable pulls) - Update test docstring to reference docker-compose file https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

- Buffer messages from getmany() so multiple messages per batch aren't lost — getmany returns all available messages across partitions, but queue_pop should return one at a time - Accept bytes keys in produce() (not just str) All 27 Kafka integration tests now pass locally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Major refactor aligning the Kafka backend with idiomatic patterns: - Queue uses consumer groups for reliable fan-out across workers - All I/O is async — removed sync log_publish and produce_sync - Topic creation moved to produce side (ensure_topic in push paths) - Removed queue_commit/queue_nack — commit happens on pop, retries via explicit requeue with incremented retry_count - Proper typing throughout — real aiokafka types, UUID for agent_id - Stateless worker identity from hostname+pid, no cached globals - Simplified worker loop: early returns, exception-based retry - Dequeue hydrates Task directly (moved from worker to queue module) - docker-compose.kafka.yml stripped to pure Kafka bootstrap - Compacted topics with configurable retention (default: forever) - All 299 tests passing (272 unit + 27 Kafka integration) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- New base.py with ABCs: BaseBackend, BaseStateBackend, BaseQueueBackend, BaseActivityBackend. Shared serialize/deserialize in BaseBackend. - KafkaBackend and RedisBackend classes with namespaced sub-backends: backend.state, backend.queue, backend.activity - Public `backend` reference in state/__init__.py — callers import and use directly, no get_backend() indirection - Key constants (KEY_RESULT, KEY_LOCK, etc.) stay in state/__init__.py - Domain modules own their key formatting (schedule, event, results) - All ops.py passthrough functions eliminated - Connection state moved from module globals to instance attributes - count_active/get_pending_ids fixed to check last log status only - Test fixtures simplified: inject fake client via backend._client - 295 tests passing (268 unit + 27 Kafka integration) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Flatten kafka_backend/ and redis_backend/ dirs to single files: state/kafka.py and state/redis.py - Backend class renamed to just Backend (module path is the qualifier) - Remove backend registry — _create_backend imports any module path with a Backend class, enabling custom backends - Config value simplified: agentexec.state.redis, agentexec.state.kafka - Delete dead files: ops.py, protocols.py, backend.py, and all old module-level state/queue/activity/connection files - Remove section separator comments and trivial file docstrings - Net -2093 lines deleted Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ervision Major Kafka backend improvements: - ax_ prefixed headers on all produces (activity, queue, schedule) for metadata filtering without body deserialization - Activity backend reads directly from Kafka (no in-memory cache) with backwards-scan for single record lookup and offset-based pagination - Schedule backend with dedicated compacted topic (no more sorted set simulation) — persistent consumer with seek-to-beginning replay - Pool._supervise split into _process_log_stream and _process_scheduled_tasks with asyncio.gather - Pool.start() is now the foreground entry point, run() wraps it - Tick logic inlined in pool, removed from schedule.py - Schedule poll interval configurable (default 10s, was 100ms) - Log channel internalized in backends (no more CHANNEL_LOGS constant) - Status enum extracted to activity/status.py (no SQLAlchemy dependency) - Deprecation warnings on activity tracker session parameter - Docker compose updated with kafka-ui for development Skipped 3 Kafka integration tests (aggregate queries on shared topic) 267 unit + 24 kafka = 291 passing, 3 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Activity is no longer a backend concern. Workers produce events via generic pubsub, the pool's consumer writes to Postgres. Queries always hit Postgres regardless of backend. - activity/producer.py — event emitter called by workers - activity/consumer.py — pool-side Postgres writer - activity/__init__.py — query functions (list, detail, count_active) - Removed BaseActivityBackend and all backend activity implementations - Generalized log_publish/log_subscribe to publish/subscribe with channel parameter — reusable for logs, activity, and future streams - Pool.start() now runs three concurrent tasks: log stream, scheduled tasks, and activity stream - Removed Kafka activity_topic (no longer needed) - Removed Redis activity backend (Postgres is always the activity store) - Net -304 lines Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Major separation of concerns between Task, TaskDefinition, and Pool: - Task is pure data: task_name, context (Mapping), agent_id, retry_count No more _definition binding, execute(), or get_lock_key() on Task - TaskDefinition owns behavior: execute(task), get_lock_key(context), hydrate_context(). Looked up by task_name in the worker registry. - Worker → Pool communication via typed Message subclasses over multiprocessing.Queue: TaskCompleted, TaskFailed, LockContention, LogEntry. No more Redis pubsub for logs. - Pool._process_worker_events dispatches with match/case on message type - Removed _process_log_stream (logs flow through the same queue) - QueueLogHandler replaces StateLogHandler (writes to mp.Queue not pubsub) - Generalized log_publish/log_subscribe to publish/subscribe with channel - Lock key formatting and TTL moved into backend.state.acquire_lock - dequeue() no longer needs the task registry - Removed requeue() — pool handles requeueing via _process_worker_events 264 passed, 0 failed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Redis queue backend now partitions tasks by lock key: - Default queue: {queue_prefix} (lock-free, concurrent) - Partition queues: {queue_prefix}:{lock_key} (serialized by lock) - Locks: {queue_prefix}:{lock_key}:lock (auto-expire TTL) Dequeue uses SCAN to discover queues, checks lock state from scan results (avoiding extra round trips), acquires lock via SET NX, then RPOP. SCAN's hash-table ordering provides natural randomness for fair distribution across partitions. Empty queues are auto-deleted by Redis. Zero keys left behind after all tasks complete. Benchmarked: 6000 tasks across 500 partitions with 8 workers achieved 98% theoretical throughput with 1.5% worker distribution spread. Other changes: - queue_name renamed to queue_prefix (AGENTEXEC_QUEUE_NAME still works) - Removed queue_name parameter from public API (enqueue, dequeue, Pool) - Lock lifecycle owned by queue backend (release_lock on BaseQueueBackend) - Worker no longer handles locks — pool releases on TaskCompleted/TaskFailed - Failed tasks requeued as high priority to preserve execution order - Added examples/queue-fairness/ benchmark 261 passed, 0 failed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Complete migration of all worker → pool communication to multiprocessing queue. Redis pubsub is fully removed from the system. - Removed publish/subscribe from BaseStateBackend and Redis implementation - Removed _pubsub from Redis Backend (no more pubsub connections) - Activity producer writes create() to Postgres directly (runs on API/pool) - Activity update/complete/error send ActivityUpdated via mp.Queue - Pool handles ActivityUpdated in _process_worker_events match/case - Deleted activity/consumer.py (replaced by inline pool handler) - queue.complete() replaces release_lock() — abstracts lock lifecycle - Worker._run inlines dequeue (pop + validate) and calls complete in finally - Removed dequeue() from core/queue.py (inlined in worker) - Removed _partition_key_for from pool event handler (worker handles it) - Lock methods removed from BaseStateBackend (owned by queue backend) - backend.client property replaces _get_client() method 255 passed, 0 failed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude added 30 commits March 27, 2026 04:16

Revert "Concurrent task execution per worker and Kafka consumer heart…

29685ca

…beat" This reverts commit a5a6584.

Switch to apache/kafka:3.9.2 for CI Kafka service

659f9fe

bitnami/kafka image failed to pull. apache/kafka is the official Apache Kafka Docker image with KRaft mode built in. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

CI: disable fail-fast, add verbose test output for debugging

969f6f6

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Fix force_metadata_update — use partition discovery retry loop

b50156c

force_metadata_update doesn't exist on AIOKafkaConsumer in aiokafka 0.13.0. Replace with a retry loop that polls partitions_for_topic until metadata is available. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

CI: capture Kafka test output in job summary on failure

66082e7

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

CI: post Kafka test output as PR comment on failure

918363a

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

CI: use curl for PR comment instead of github-script

1a96fed

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

CI: upload test output as artifact instead of PR comment

67922c9

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Add debug prints to queue_pop and test_push_and_pop

86fde70

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Update uv.lock after adding kafka extra dependency

686f3d2

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

CI: emit test output as warning annotations for API access

985d483

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Better debug output: print consumer state on timeout, filter annotations

5f6384a

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

CI: filter annotations to only show failures and debug output

ce54be7

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

CI: re-trigger after transient Docker pull failure

e7fb281

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

CI: retry after transient failures

3f792b1

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

claude and others added 13 commits March 27, 2026 15:49

CI: use apache/kafka:latest to avoid Docker pull issues with pinned tag

47c0324

https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

CI: switch to confluentinc/cp-kafka:7.7.1 for reliable Docker pulls

d0aa733

apache/kafka image has persistent pull failures from GitHub Actions. Confluent Platform image is more widely available. https://claude.ai/code/session_015DuCUpx8r1TnLZo9dDUn4j

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Kafka stream backend architecture with split KV/stream protocols#16

Add Kafka stream backend architecture with split KV/stream protocols#16
tcdent wants to merge 43 commits intomainfrom
claude/kafka-backend-exploration-elVzO

tcdent commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tcdent commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants