A production-ready, multi-channel AI agent with AIOps, Kubernetes management, security scanning, and human-in-the-loop remediation — built on FastAPI, GitHub Models, and the Model Context Protocol (MCP).
- Overview
- Feature Matrix
- Architecture
- Quick Start
- Channel Setup
- AIOps Engine
- MCP Integration
- Kubernetes Integration
- Security Scanning
- Monitoring & Observability
- Configuration Reference
- API Reference
- Project Structure
- Development
- Deployment
- Contributing
Simple AI Agent is a conversational AI agent that connects Telegram and Slack to powerful backend capabilities:
| Capability | Technology |
|---|---|
| LLM inference | GitHub Models API (GPT-4o, Claude-3 Opus, Llama-3-70B) |
| Chat persistence | PostgreSQL 16 (ACID, JSONB, Alembic migrations) |
| Session caching | Redis 7 (sub-ms access, TTL expiry) |
| Tool execution | MCP — stdio (Kubernetes) + SSE (cloud services) |
| Cluster ops | kubectl — 13 natural-language Kubernetes tools |
| Security scans | SimplePortChecker MCP — 8 security tools |
| AIOps | Watch-loop → Rule engine → Playbooks → RCA |
| Approvals | Human-in-the-loop via chat message |
| Alerting | Prometheus + Alertmanager webhook receiver |
| Observability | Grafana dashboards, structlog JSON, /metrics |
- Telegram — Webhook mode, privacy-mode support, group and private chat
- Slack — Events API, app-mention, IM history, signing-secret verification
- Multiple models — GPT-4o, Claude-3 Opus, Llama-3-70B via GitHub Models
- Model selection priority — conversation override → user pref → channel default → system default
- Conversation history — stored in PostgreSQL, windowed into context
- Streaming-compatible — openai-compatible SDK with GitHub Models endpoint
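The model-selection priority chain above amounts to a first-non-empty lookup. A minimal sketch, assuming a `resolve_model` helper and a `gpt-4o` system default — both illustrative names, not the project's actual API:

```python
# Hypothetical sketch of the documented priority chain:
# conversation override -> user preference -> channel default -> system default.
# resolve_model and the system default are illustrative, not the real API.
from typing import Optional

SYSTEM_DEFAULT = "gpt-4o"


def resolve_model(
    conversation_override: Optional[str],
    user_preference: Optional[str],
    channel_default: Optional[str],
) -> str:
    """Return the first configured model in priority order."""
    for candidate in (conversation_override, user_preference, channel_default):
        if candidate:
            return candidate
    return SYSTEM_DEFAULT
```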
- Full CRUD — pods, deployments, services, namespaces, nodes, events
- Natural language — "show me error pods in production"
- Status filters — error/failed/crash, unhealthy/not-ready, pending, running
- Scaling — `/k8s scale <deployment> <replicas> [ns]`
- Logs — streaming and snapshot log retrieval
- Resource usage — `top pods`, `top nodes`
- Multi-context — switch between clusters
- Port scanning — TCP/UDP port enumeration
- Certificate analysis — TLS issuer, expiry, SANs, protocol
- WAF/CDN detection — Cloudflare, AWS WAF, Azure Front Door, Akamai
- mTLS verification — mutual TLS support check
- Security headers — HSTS, CSP, X-Frame-Options
- OWASP scan — common vulnerability detection
- Full security scan — combined assessment report
- Hybrid identity — identity provider detection
- K8s Watch-Loop — background polling every 30 s (configurable)
  - Detects: `CrashLoopBackOff`, `OOMKilled`, `NotReady` nodes, zero-replica deployments
- Rule Engine — YAML-defined alert rules with severity mapping
- Playbook Executor — ordered step sequences with risk-gated execution
  - `LOW` risk — auto-execute, notify after
  - `MEDIUM` risk — post approval request, await chat response
  - `HIGH` risk — warn + require explicit confirmation
- RCA Engine — LLM-powered root-cause analysis (SRE prompt → JSON report)
- Log Analyzer — structured log pattern matching
- Approval Manager — Redis-backed TTL approvals; chat-native `approve`/`reject`
- Alertmanager receiver — `POST /api/alert/webhook` ingests Prometheus alerts
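The risk-gated execution policy above can be sketched in a few lines. This is an illustrative stand-in, not the project's actual classes — the `Risk` enum values and `PlaybookStep` shape are assumptions:

```python
# Illustrative sketch of the playbook risk gate; enum values and the
# PlaybookStep shape are assumptions, not the project's real classes.
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class PlaybookStep:
    action: str
    risk: Risk


def gate(step: PlaybookStep) -> str:
    """Decide how a step is dispatched, per the documented policy."""
    if step.risk is Risk.LOW:
        return "auto_execute"          # run immediately, notify after
    if step.risk is Risk.MEDIUM:
        return "await_approval"        # post approval request in chat
    return "require_confirmation"      # HIGH: warn + explicit confirmation
```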
- PostgreSQL 16 — users, conversations, messages, channel configs, JSONB metadata
- Redis 7 — session cache (sub-ms), pending approvals (TTL 5 min), AOF persistence
- Alembic migrations — versioned schema management
- Connection pooling — async SQLAlchemy + asyncpg
- Multi-stage Docker build — kubectl bundled, OCI labels, non-root UID 1000
- Security options —
no-new-privileges, isolated network, non-root container - Resource limits — CPU and memory limits/reservations in Compose
- Rich health endpoint — DB, Redis, K8s, Prometheus, watchloop, pending approvals
- Rate limiting —
slowapiper-IP rate limiter on all endpoints - Structured logging — JSON via
structlog, Docker log rotation
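The project logs JSON via `structlog`; the same one-object-per-line idea can be sketched with the standard library alone (this stdlib-only version is for illustration, not the project's actual logging setup):

```python
# Stdlib-only sketch of JSON-structured logging: every log line is one JSON
# object, which log aggregators can parse without regexes. The real app uses
# structlog; this shows the equivalent idea without the dependency.
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("simple-ai-agent")
log.addHandler(handler)
log.setLevel(logging.INFO)
```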
The full traffic-flow diagram is maintained as a D2 source file at docs/hld.d2.
Render to PNG/SVG (requires D2):

```bash
# Install D2: https://d2lang.com/tour/install
d2 docs/hld.d2 docs/hld.svg
# or PNG
d2 docs/hld.d2 docs/hld.png --theme=0
```

The rendered diagram is committed at `docs/hld.svg`.

To regenerate after edits:

```bash
d2 docs/hld.d2 docs/hld.svg --theme=0
```
```
Users
  |
  +-- Telegram Webhook --> /api/webhook -+
  +-- Slack Events API --> /api/webhook -+
                                         |
                                  Channel Router
                                         |
                                  Message Handler
                   +---------------------+--------------------+
                   |                     |                    |
             Session Mgr             AI Layer            K8s Handler
               (Redis)               (GitHub             (NL parser)
                   |                  Models)                 |
                   |                     |               MCP Manager
                   |                     |             +------+------+
              PostgreSQL                LLM        stdio(K8s)   SSE(Security)
              (history)               tokens           |             |
                                                    kubectl   SimplePortChecker
                                                 (subprocess)     (HTTPS)

AIOps (async background):
  Watch-Loop --> Rule Engine --> Playbook Executor --> Approval Manager
      |               |                  |
  K8s Cluster     MCP tools          Redis TTL
                                         |
                              RCA Engine --> GitHub Models (SRE prompt)

Observability:
  App /metrics --> Prometheus --> Grafana dashboards
                       |
                 Alertmanager --> POST /api/alert/webhook --> Rule Engine
```

```
+-----------------------------------------------------+
| Channel Layer          | Telegram / Slack adapters
+-----------------------------------------------------+
| API Layer              | FastAPI, rate-limiter, webhooks
+-----------------------------------------------------+
| Business Logic Layer   | Message handler, session, K8s, approvals
+------------------------+----------------------------+
| AI Layer | AIOps Layer | LLM client | watch-loop, rules, playbooks, RCA
+------------------------+----------------------------+
| MCP Layer              | MCP Manager -> stdio + SSE transports
+-----------------------------------------------------+
| Data Layer             | PostgreSQL + Redis
+-----------------------------------------------------+
| Observability Layer    | Prometheus metrics, structlog JSON, Grafana
+-----------------------------------------------------+
```
| Document | Description |
|---|---|
| `docs/hld.d2` | Full HLD traffic-flow diagram (D2 source) |
| `docs/architecture.md` | Layered architecture, design decisions |
| `docs/component-diagram.md` | Mermaid component interactions |
| `docs/sequence-diagrams.md` | Message flows, startup, MCP flows |
| `docs/database-architecture.md` | PostgreSQL & Redis schema + performance |
| `docs/kubernetes-integration.md` | K8s guide — NL queries, status filters |
| `docs/mcp-integration.md` | MCP multi-transport architecture |
| `docs/mcp-registry.md` | Tool registry and routing |
| `docs/aiops.md` | AIOps engine — watch-loop, rules, playbooks, RCA |
| `docs/slack-setup.md` | Slack bot setup guide |
| Requirement | Version |
|---|---|
| Python | 3.12+ |
| Docker + Compose | v24+ |
| kubectl | 1.28+ (K8s features) |
| GitHub Account | Models API access |
```bash
git clone https://github.com/YOUR_USERNAME/simple-ai-agent.git
cd simple-ai-agent
```

```bash
python3.12 -m venv .venv
source .venv/bin/activate   # macOS / Linux
# .venv\Scripts\activate    # Windows
pip install -r requirements.txt
```

```bash
cp .env.example .env
# Edit .env -- minimum required: GITHUB_TOKEN + at least one bot token
```

```bash
# PostgreSQL and Redis -- schema auto-created on first boot
docker compose up -d postgres redis
```

```bash
./scripts/start_server.sh
# Or manually:
python -m uvicorn src.main:app --reload --host 0.0.0.0 --port 8000
```

```bash
curl http://localhost:8000/health
# {"status":"healthy","database":"healthy","redis":"healthy",...}
```

- Visit https://github.com/settings/tokens -> Fine-grained personal access token
- Enable Models API permission
- Set `GITHUB_TOKEN` in `.env`
- Message @BotFather -> `/newbot`
- Copy token -> `TELEGRAM_TOKEN`
- Groups (recommended): Disable privacy mode via @BotFather -> Bot Settings -> Group Privacy -> OFF

- https://api.slack.com/apps -> New App -> From scratch
- OAuth scopes: `app_mentions:read`, `chat:write`, `im:history`, `users:read`
- Install to workspace -> copy Bot User OAuth Token -> `SLACK_BOT_TOKEN`
- Event Subscriptions webhook: `https://your-domain.com/api/webhook/slack`
- Subscribe to: `app_mention`, `message.im`
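Slack's signing-secret verification (mentioned in the feature list) follows a documented scheme: HMAC-SHA256 over `v0:<timestamp>:<body>`. The app likely relies on `slack_bolt`'s built-in verifier; this stdlib sketch shows the underlying check for reference:

```python
# Stdlib sketch of Slack's documented request-signature scheme. The project
# itself may use slack_bolt's built-in verifier; this shows what it computes.
import hashlib
import hmac


def slack_signature(signing_secret: str, timestamp: str, body: str) -> str:
    """Build the expected X-Slack-Signature value for a request."""
    base = f"v0:{timestamp}:{body}".encode()
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    return f"v0={digest}"


def verify_slack_request(signing_secret: str, timestamp: str, body: str,
                         received_signature: str) -> bool:
    """Constant-time comparison against the signature Slack sent."""
    expected = slack_signature(signing_secret, timestamp, body)
    return hmac.compare_digest(expected, received_signature)
```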
See docs/slack-setup.md for the full walkthrough.
The AIOps engine provides proactive cluster health monitoring and automated remediation with a human-in-the-loop approval gate.
| Component | Purpose |
|---|---|
| K8s Watch-Loop | Polls cluster every K8S_WATCHLOOP_INTERVAL seconds |
| Rule Engine | Matches ClusterEvent objects against configured rules |
| Playbook Executor | Runs ordered remediation steps |
| Approval Manager | Gates MEDIUM/HIGH risk steps via chat |
| RCA Engine | LLM-powered root-cause analysis with structured JSON output |
| Log Analyzer | Pattern recognition on pod/container logs |
| Event | Severity |
|---|---|
| `crash_loop` | critical |
| `oom_killed` | critical |
| `not_ready_node` | critical |
| `replication_failure` | high |
| External Alertmanager alert | varies |
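The Rule Engine loads these mappings from YAML; a minimal sketch of the classification step might look like the following (the `info` fallback for unknown event types is an assumption, not documented behavior):

```python
# Minimal sketch of the event-to-severity mapping in the table above. The
# real Rule Engine reads YAML rules; this hard-codes the same defaults for
# illustration, and the "info" fallback is an assumption.
DEFAULT_SEVERITIES = {
    "crash_loop": "critical",
    "oom_killed": "critical",
    "not_ready_node": "critical",
    "replication_failure": "high",
}


def classify(event_type: str) -> str:
    """Map a ClusterEvent type to a severity; unknown events fall back to info."""
    return DEFAULT_SEVERITIES.get(event_type, "info")
```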
```
Playbook step (MEDIUM / HIGH risk)
        |
        v
Approval Manager --> Redis HSET (TTL: 5 min)
        |
        v
Chat: "Approval required [ID: abc123]
       Action: restart pod nginx-abc in production
       Risk: MEDIUM -- type 'approve abc123' or 'reject abc123'"
        |
   +----+----+
approve    reject
   |          |
Execute    Cancel
 step      playbook
```
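The flow above can be sketched with an in-memory stand-in for the Redis-backed store. Class and method names here are illustrative; the real Approval Manager persists pending approvals in Redis with key expiry:

```python
# In-memory stand-in for the Redis-backed Approval Manager: stores a pending
# approval with a TTL and resolves it on approve/reject. The real store is
# Redis HSET + key expiry; all names here are illustrative.
import time
from typing import Callable, Dict


class ApprovalManagerSketch:
    def __init__(self, ttl_seconds: int = 300,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock                     # injectable for testing
        self.pending: Dict[str, float] = {}    # approval_id -> creation time

    def request(self, approval_id: str) -> None:
        """Register a pending approval (the chat message is sent elsewhere)."""
        self.pending[approval_id] = self.clock()

    def resolve(self, approval_id: str, decision: str) -> str:
        """Resolve an approval: executed, cancelled, or expired past the TTL."""
        created = self.pending.pop(approval_id, None)
        if created is None or self.clock() - created > self.ttl:
            return "expired"
        return "executed" if decision == "approve" else "cancelled"
```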
```
Root Cause Analysis
Pattern: OOMKill
Root Cause: Container exceeded memory limit due to unbounded in-memory cache growth
Confidence: 87%

Supporting Evidence:
- OOMKilled event at 2026-03-02T14:23:11Z
- Memory usage reached 512Mi (limit: 512Mi)
- 3 restarts in last 4 hours

Recommended Actions:
1. Set JVM/app heap limit to 60% of container memory limit
2. Increase memory limit to 768Mi and monitor
3. Add memory usage alert at 80% threshold
```
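Since the RCA Engine asks the model for structured JSON, the receiving side has to validate the reply before trusting it. A sketch of that validation step — the field names mirror the sample report above but are assumptions about the actual schema:

```python
# Sketch of validating the RCA Engine's JSON reply. Field names mirror the
# sample report above but are assumptions about the project's actual schema.
import json

REQUIRED_FIELDS = ("pattern", "root_cause", "confidence",
                   "evidence", "recommended_actions")


def parse_rca(raw: str) -> dict:
    """Parse the LLM's JSON reply, rejecting incomplete reports."""
    report = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in report]
    if missing:
        raise ValueError(f"incomplete RCA report, missing: {missing}")
    return report
```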
```bash
K8S_WATCHLOOP_ENABLED=true
K8S_WATCHLOOP_INTERVAL=30
AUTO_REMEDIATION_ENABLED=false   # true = skip approvals for LOW-risk only
AIOPS_NOTIFICATION_CHANNEL=telegram:YOUR_CHAT_ID
APPROVAL_TIMEOUT_SECONDS=300
ALERTMANAGER_WEBHOOK_SECRET=your-secret
```

Simple AI Agent uses MCP (Model Context Protocol) with two transport types:
| Transport | Server | Use Case |
|---|---|---|
| `stdio` | `scripts/mcp_server.py` | Kubernetes (local subprocess, 13 tools) |
| `SSE` | `https://mcp.simpleportchecker.com/mcp` | Security scanning (cloud, 8 tools) |
```json
{
  "mcpServers": {
    "kubernetes": {
      "type": "stdio",
      "command": "python3",
      "args": ["scripts/mcp_server.py"],
      "description": "Kubernetes management tools via kubectl"
    },
    "simplePortChecker": {
      "type": "sse",
      "url": "https://mcp.simpleportchecker.com/mcp",
      "description": "Security scanning and port checking tools"
    }
  }
}
```

All tools are registered in `MCPManager.tool_registry` (tool_name -> server_name). The `MessageHandler` calls `MCPManager.call_tool(name, params)`, which dispatches to the correct transport automatically.

See docs/mcp-integration.md for protocol details.
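The registry-based dispatch can be illustrated with a toy manager. Transports here are plain callables; the real `MCPManager` wraps stdio subprocesses and SSE HTTP clients, and the class and method names below (other than `tool_registry` and `call_tool`) are assumptions:

```python
# Toy version of registry-based tool dispatch: tool_name -> server_name, then
# route to that server's transport. Transports here are plain callables; the
# real MCPManager wraps stdio subprocesses and SSE clients.
from typing import Any, Callable, Dict, List


class MiniMCPManager:
    def __init__(self) -> None:
        self.tool_registry: Dict[str, str] = {}               # tool -> server
        self.transports: Dict[str, Callable[..., Any]] = {}   # server -> call fn

    def register(self, server: str, tools: List[str],
                 transport: Callable[..., Any]) -> None:
        """Register a server's transport and all the tools it exposes."""
        self.transports[server] = transport
        for tool in tools:
            self.tool_registry[tool] = server

    def call_tool(self, name: str, params: dict) -> Any:
        """Look up the owning server and dispatch through its transport."""
        server = self.tool_registry[name]    # KeyError means unknown tool
        return self.transports[server](name, params)
```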
| Command | Description |
|---|---|
| `/k8s pods [ns]` | List pods |
| `/k8s logs <pod> [ns]` | Get logs |
| `/k8s scale <deploy> <n> [ns]` | Scale deployment |
| `/k8s deployments [ns]` | List deployments |
| `/k8s nodes` | List nodes |
| `/k8s services [ns]` | List services |
| `/k8s namespaces` | List namespaces |
| `/k8s events [ns]` | Recent events |
| `/k8s describe <type> <name> [ns]` | Describe resource |
| `/k8s top pods\|nodes` | Resource usage |
| `/k8s contexts` | Available contexts |
```
show me error pods in production
list failed pods
scale api-server to 3 replicas in staging
get logs from nginx-abc123
what are my nodes
show pending pods in development
```
| Keywords | Shows |
|---|---|
| `error`, `failed`, `crash` | CrashLoopBackOff, Error, ImagePullBackOff |
| `unhealthy`, `not ready` | Containers not ready |
| `pending` | Pending, ContainerCreating |
| `running`, `healthy` | Only healthy running pods |
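The keyword-to-status grouping in the table above reduces to a small filter function. The status strings are standard kubectl pod statuses; the function shape is illustrative, not the project's actual handler:

```python
# Sketch of the keyword -> pod-status filtering described in the table above.
# Status strings are standard kubectl values; the function is illustrative.
from typing import List, Tuple

ERROR_STATUSES = {"CrashLoopBackOff", "Error", "ImagePullBackOff"}
PENDING_STATUSES = {"Pending", "ContainerCreating"}


def filter_pods(pods: List[Tuple[str, str]], keyword: str) -> List[Tuple[str, str]]:
    """pods: (name, status) tuples; keyword: a query word like 'error'."""
    if keyword in ("error", "failed", "crash"):
        wanted = ERROR_STATUSES
    elif keyword == "pending":
        wanted = PENDING_STATUSES
    elif keyword in ("running", "healthy"):
        wanted = {"Running"}
    else:
        return pods   # no recognized filter keyword: show everything
    return [(name, status) for name, status in pods if status in wanted]
```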
See docs/kubernetes-integration.md for the full guide.
Natural language queries powered by the SimplePortChecker MCP server:
```
is port 443 open on example.com
check certificate for github.com
detect waf on mysite.com
full security scan on example.com
check mtls on api.example.com
check security headers on example.com
```
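As a flavor of what the security-headers check reports, here is a stdlib sketch. The actual scan runs remotely on the SimplePortChecker MCP server; the header set and result shape below are assumptions for illustration:

```python
# Stdlib sketch of a security-headers audit, illustrating the kind of result
# the SimplePortChecker scan reports. The header list and result shape are
# assumptions; the real scan runs on the remote MCP server.
from typing import Dict, List

REQUIRED_HEADERS = ("Strict-Transport-Security",
                    "Content-Security-Policy",
                    "X-Frame-Options")


def audit_headers(headers: Dict[str, str]) -> Dict[str, List[str]]:
    """Return which security headers are present/missing (case-insensitive)."""
    lowered = {key.lower() for key in headers}
    present = [h for h in REQUIRED_HEADERS if h.lower() in lowered]
    missing = [h for h in REQUIRED_HEADERS if h.lower() not in lowered]
    return {"present": present, "missing": missing}
```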
```json
{
  "status": "healthy",
  "database": "healthy",
  "redis": "healthy",
  "kubernetes": "healthy (5 namespaces)",
  "prometheus": "healthy",
  "watchloop": "running",
  "pending_approvals": 0,
  "active_incidents": 0
}
```

| Component | Default Port | Purpose |
|---|---|---|
| Prometheus | 9090 | Metrics scraping |
| Grafana | 3000 | Dashboards |
| Alertmanager | 9093 | Alert routing |
| pgAdmin | 5050 | DB admin (debug profile) |
| redis-commander | 8081 | Redis admin (debug profile) |
```bash
# Start observability stack
docker compose up -d prometheus grafana alertmanager

# With debug tools
docker compose --profile debug up -d
```

Add to `alertmanager.yml`:
```yaml
receivers:
  - name: simple-ai-agent
    webhook_configs:
      - url: http://simple-ai-agent:8000/api/alert/webhook
        send_resolved: true
        http_config:
          authorization:
            credentials: "your-alertmanager-webhook-secret"
```

Copy `.env.example` to `.env`.
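With that `authorization` block, Alertmanager sends the credentials as a `Bearer` token in the `Authorization` header. A sketch of how the receiving endpoint might validate it — the function name and simplified header parsing are assumptions:

```python
# Sketch of validating the Alertmanager webhook's Authorization header.
# Constant-time comparison avoids timing leaks; header parsing is simplified
# and the function name is illustrative.
import hmac


def authorized(auth_header: str, secret: str) -> bool:
    """Accept 'Bearer <secret>', the shape Alertmanager's authorization block sends."""
    scheme, _, credentials = auth_header.partition(" ")
    if scheme != "Bearer":
        return False
    return hmac.compare_digest(credentials, secret)
```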
| Variable | Required | Default | Description |
|---|---|---|---|
| `GITHUB_TOKEN` | yes | -- | GitHub fine-grained PAT with Models access |
| `TELEGRAM_TOKEN` | one of | -- | Telegram bot token |
| `SLACK_BOT_TOKEN` | one of | -- | Slack bot token |
| `SLACK_SIGNING_SECRET` | one of | -- | Slack signing secret |
| `DATABASE_URL` | -- | postgres DSN | PostgreSQL async DSN |
| `REDIS_URL` | -- | `redis://localhost:6379/0` | Redis DSN |
| `LOG_LEVEL` | -- | `INFO` | DEBUG / INFO / WARNING / ERROR |
| `ENVIRONMENT` | -- | `development` | `development` or `production` |
| `DEFAULT_MODEL` | -- | `gpt-4` | `gpt-4`, `claude-3-opus`, `llama-3-70b` |
| `RATE_LIMIT_PER_MINUTE` | -- | `60` | Per-IP rate limit |
| `K8S_WATCHLOOP_ENABLED` | -- | `true` | Enable AIOps background poller |
| `K8S_WATCHLOOP_INTERVAL` | -- | `30` | Poll interval in seconds |
| `AUTO_REMEDIATION_ENABLED` | -- | `false` | Skip approvals for LOW-risk steps |
| `AIOPS_NOTIFICATION_CHANNEL` | -- | -- | `telegram:CHAT_ID` or `slack:CHANNEL_ID` |
| `APPROVAL_TIMEOUT_SECONDS` | -- | `300` | Seconds before approval auto-expires |
| `PROMETHEUS_URL` | -- | -- | `http://prometheus:9090` |
| `GRAFANA_URL` | -- | -- | `http://grafana:3000` |
| `GRAFANA_API_KEY` | -- | -- | Grafana API key for annotations |
| `ALERTMANAGER_WEBHOOK_SECRET` | -- | -- | Webhook receiver validation secret |
| Method | Path | Description |
|---|---|---|
| GET | `/` | Root -- name, version, environment |
| GET | `/health` | Full health (DB, Redis, K8s, Prometheus, watchloop) |
| GET | `/ready` | Readiness probe |
| POST | `/api/webhook/telegram` | Telegram update webhook |
| POST | `/api/webhook/slack` | Slack Events API webhook |
| POST | `/api/alert/webhook` | Alertmanager webhook receiver |
| GET | `/api/webhook/test` | Webhook connectivity test |
```
simple-ai-agent/
+-- src/
|   +-- main.py                    # Application entry point & lifespan
|   +-- config.py                  # Pydantic Settings (env vars)
|   +-- ai/
|   |   +-- github_models.py       # GitHub Models API client
|   |   +-- model_selector.py      # Per-user/channel model selection
|   |   +-- context_builder.py     # Conversation window builder
|   |   +-- prompt_manager.py      # System prompt templates
|   +-- channels/
|   |   +-- base.py                # BaseAdapter interface
|   |   +-- telegram_adapter.py    # python-telegram-bot adapter
|   |   +-- slack_adapter.py       # slack_bolt adapter
|   |   +-- router.py              # Fan-out / fan-in router
|   +-- api/
|   |   +-- health.py              # /health, /ready endpoints
|   |   +-- webhooks.py            # /api/webhook/* endpoints
|   |   +-- middleware.py          # Rate limiter setup
|   +-- services/
|   |   +-- message_handler.py     # Intent detection & routing
|   |   +-- session_manager.py     # Redis TTL sessions
|   |   +-- kubernetes_handler.py  # NL K8s query handler
|   |   +-- approval_manager.py    # Human-in-the-loop approvals
|   |   +-- mcp_client.py          # Low-level MCP client helper
|   |   +-- mcp_registry.py        # Tool registry helpers
|   +-- mcp/
|   |   +-- mcp_manager.py         # Lifecycle + routing manager
|   |   +-- base_transport.py      # Transport ABC
|   |   +-- stdio_transport.py     # stdio (subprocess) transport
|   |   +-- sse_transport.py       # SSE (HTTP) transport
|   |   +-- kubernetes_server.py   # K8s MCP server implementation
|   +-- aiops/
|   |   +-- rule_engine.py         # Alert rule matching
|   |   +-- playbooks.py           # Playbook registry & executor
|   |   +-- rca_engine.py          # LLM-powered root-cause analysis
|   |   +-- log_analyzer.py        # Log pattern analysis
|   +-- monitoring/
|   |   +-- watchloop.py           # K8s background watch-loop
|   |   +-- prometheus.py          # Prometheus metrics helpers
|   |   +-- grafana.py             # Grafana annotation helper
|   +-- k8s/
|   |   +-- client.py              # Kubernetes API client wrapper
|   +-- database/
|       +-- models.py              # SQLAlchemy ORM models
|       +-- postgres.py            # Async engine + session factory
|       +-- redis.py               # Redis connection pool
|       +-- repositories/          # Data-access layer
|       +-- migrations/            # Alembic migration scripts
+-- scripts/
|   +-- mcp_server.py              # stdio MCP server (K8s tools)
|   +-- init_db.py                 # Manual DB init helper
|   +-- start_server.sh            # Dev server launcher
|   +-- start_production.sh        # Production launcher
|   +-- stop_server.sh             # Graceful stop
+-- config/
|   +-- prometheus.yml             # Prometheus scrape config
|   +-- alertmanager.yml           # Alertmanager routing config
|   +-- alert_rules.yml            # Prometheus alert rules
|   +-- grafana/                   # Grafana provisioning
+-- docs/
|   +-- hld.d2                     # High-Level Design diagram (D2 source)
|   +-- architecture.md
|   +-- component-diagram.md
|   +-- sequence-diagrams.md
|   +-- database-architecture.md
|   +-- aiops.md
|   +-- kubernetes-integration.md
|   +-- mcp-integration.md
|   +-- mcp-registry.md
|   +-- slack-setup.md
+-- tests/
|   +-- conftest.py
+-- Dockerfile                     # Multi-stage, non-root, kubectl bundled
+-- docker-compose.yml             # Full stack: app + postgres + redis + observability
+-- .mcp-config.json               # MCP server configuration
+-- .env.example                   # Environment template (safe to commit)
+-- .env.production.example        # Production environment template
+-- alembic.ini                    # Migration config
+-- pyproject.toml                 # Build + tool config
+-- requirements.txt               # Python dependencies
```
```bash
pip install -r requirements.txt
pytest                 # all tests
pytest --cov=src       # with coverage report
pytest -k test_aiops   # filter specific tests
```

```bash
black src/        # format
ruff check src/   # lint
mypy src/         # type check
```

```bash
alembic revision --autogenerate -m "add column foo"
alembic upgrade head
alembic downgrade -1
```

```bash
# Full stack
docker compose up -d

# Including observability
docker compose up -d prometheus grafana alertmanager

# Debug tools (pgAdmin + redis-commander)
docker compose --profile debug up -d

# Follow logs
docker compose logs -f app
```

Before deploy:
- Set `GITHUB_TOKEN` and at least one bot token
- Set strong `POSTGRES_PASSWORD` and `REDIS_PASSWORD`
- Mount kubeconfig at `./data/kube/config` for K8s features
- Set `AIOPS_NOTIFICATION_CHANNEL` for alert routing
- Set `ALERTMANAGER_WEBHOOK_SECRET`
- Review CPU/memory limits in `docker-compose.yml`
- Enable TLS termination (nginx/Caddy in front)
- Configure log aggregation

After deploy:

- `GET /health` returns all subsystems healthy
- Test a message in each configured channel
- Test the `/k8s pods` command
- Verify `"watchloop": "running"` in `/health`
- Monitor `docker compose logs -f app` for warnings
```bash
# 1. Copy and configure
cp .env.production.example .env.production
nano .env.production

# 2. Build with version metadata
export VERSION=$(git describe --tags --always)
export VCS_REF=$(git rev-parse --short HEAD)
export BUILD_DATE=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
docker compose build \
  --build-arg VERSION=$VERSION \
  --build-arg VCS_REF=$VCS_REF \
  --build-arg BUILD_DATE=$BUILD_DATE

# 3. Start
docker compose --env-file .env.production up -d

# 4. Verify
curl http://localhost:8000/health
```

| Environment | CPU | RAM | Disk |
|---|---|---|---|
| Development | 1 core | 2 GB | 10 GB |
| Production (minimum) | 2 cores | 4 GB | 50 GB SSD |
| Production (recommended) | 4 cores | 8 GB | 100 GB SSD |
See CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch: `git checkout -b feature/my-feature`
- Commit using conventional commits: `feat: add X`, `fix: Y`, `docs: update Z`
- Push and open a Pull Request against `main`
See SECURITY.md for the vulnerability disclosure policy.
Built-in security practices:
- All secrets via environment variables, never hardcoded
- `.env` excluded from git via `.gitignore`
- Non-root Docker user (UID 1000)
- `no-new-privileges` security option
- Pydantic validation on all inputs
- Rate limiting on all API endpoints
- Read-only kubeconfig mount
- Network isolation via Docker bridge networks
MIT License -- Copyright 2026 Simple AI Agent Contributors
Built with Python 3.12, FastAPI, python-telegram-bot, slack_bolt, and the GitHub Models API