Bloodhound v2 – CI/CD, Validation, and Operational Documentation Updates#2

Open
mmccla1n wants to merge 86 commits into codeplatoon-devops:main from mmccla1n:main

Conversation

@mmccla1n

This pull request introduces updates to the Bloodhound v2 project related to CI/CD workflows, operational tooling, validation infrastructure, and documentation.

Changes to GitHub Actions workflows include introducing a dedicated manual operations workflow (bloodhound_ops.yml), adding a concurrency guard to prevent overlapping executions, expanding Lambda execution logging, streaming CloudWatch logs into CI output, and masking sensitive environment variables in logs.

GitHub Actions authentication was updated to use AWS OIDC role assumption, with a bootstrap script for configuring the IAM provider and role. This eliminates the need for static AWS credentials in CI.

A validation harness was added to support controlled teardown testing using Terraform-created resources, with safeguards requiring validation source identification, explicit target IDs, and validation tags on resources.

Additional operational configuration options were introduced to define teardown limits and execution conditions, including deletion caps, simulation mode, Terraform deployment guard checks, and AWS account verification. Execution observability was expanded through CI summaries and improved Lambda logging visibility.

Repository documentation was also updated to reflect the current project architecture, operational workflows, validation processes, and configuration system.

The validation workflow option has been temporarily removed from CI workflow inputs while stabilization continues, though validation tooling remains available locally via repository scripts.

These updates provide expanded CI/CD workflow capabilities, validation tooling for teardown operations, OIDC-based authentication for CI, improved execution logging, and updated operational documentation.

tsmith4014 and others added 30 commits December 20, 2025 18:36
…wn controls

Summary
- Replace v1 script with v2 package architecture (scanner/budget/whitelist/teardown).
- Slack reporting: scan summary (per region + totals, including 0 counts for scanned types), budget summary, teardown plan/results, and dedicated whitelisted resources list.
- Whitelist: tag-based keep rule (default bloodhound:keep=true) plus optional KEEP_RESOURCE_IDS.
- Teardown: dry-run by default; apply-mode gated by APPLY_CHANGES and supports simulate mode (TEARDOWN_SIMULATE) plus safety rails (TEARDOWN_TARGET_IDS, TEARDOWN_ALLOW_ALL).
- Budgeting: 7-month cohort spend tracking and month-end projection via Cost Explorer.
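
The whitelist rule in the summary above can be sketched as follows. This is a minimal illustration, not the project's code: the tag `bloodhound:keep=true` and the `KEEP_RESOURCE_IDS` setting come from the commit message, while the function names, the comma-separated format, and the case-insensitive comparison are assumptions.

```python
from typing import Iterable, Mapping

# Tag key/value taken from the commit summary; all names below are hypothetical.
KEEP_TAG_KEY = "bloodhound:keep"
KEEP_TAG_VALUE = "true"


def parse_keep_ids(raw: str) -> list[str]:
    """Parse a KEEP_RESOURCE_IDS value, assumed here to be comma-separated."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def is_whitelisted(
    resource_id: str,
    tags: Mapping[str, str],
    keep_resource_ids: Iterable[str] = (),
) -> bool:
    """Return True if the resource should be kept and excluded from teardown."""
    # An explicit ID match always wins, regardless of tags.
    if resource_id in set(keep_resource_ids):
        return True
    # Assumption: the tag value is compared case-insensitively.
    return tags.get(KEEP_TAG_KEY, "").strip().lower() == KEEP_TAG_VALUE
```

Checking the ID list before the tag means an operator can protect a resource without touching its tags, which is one plausible reading of "plus optional KEEP_RESOURCE_IDS".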

Operational
- Add lambda handler entrypoint (lambda_function.lambda_handler) and local runner (run_local.py).
- Add env.example and .env auto-loading for local runs.
- Add .gitignore to prevent committing secrets/venvs/build zips.
- Update requirements to resolve urllib3/botocore conflict.
- Add v2 GitHub Actions workflow (invoke_lambda_v2.yml).
- Add v2 plan doc and split Slack setup into SLACK_SETUP.md.

Notes
- v1 is preserved separately under versions/v1_0/ outside this repo directory; v2 deletes/terminations require explicit env flags.
- Move Lambda entrypoint into handlers/ and update Terraform handler + build pipeline
- Move docs into docs/ and link from README
- Move local runner + AWS helper JSON into tools/
- Remove empty scripts directory
- Keep functionality unchanged (only paths/organization)
… the demo. Feel free to use this; just update it with your AWS profile. It builds 8 EC2 and 2 RDS instances, half of which get whitelisted. Cleaned up README.
…pendency split, and Slack manifest infrastructure

Key changes:
- Separate runtime and development dependencies
  - requirements.txt now contains only Lambda runtime packages
  - requirements-dev.txt added for local development/testing dependencies
- Document Lambda dependency strategy
  - add docs/lambda_packaging.md explaining:
    - why boto3 should not be bundled
    - Lambda runtime dependency behavior
    - packaging workflow
    - future Docker-based packaging
- Add Lambda packaging flow documentation and diagrams
- Introduce Slack manifest-based configuration
  - infra/slack/bloodhound_v2_manifest.json becomes Slack app source of truth
  - add infra/slack/README.md documenting manifest structure and change policy
- Update docs/SLACK_SETUP.md to support manifest-based setup with manual fallback
- Update main README with improved onboarding flow and Slack setup references
- Improve infra documentation
  - clarify Lambda environment variables and secret handling
  - document Terraform build behavior
- Introduce Lambda alias infrastructure for versioned deployments and safe rollback
  - infra/alias.tf
  - lambda_alias_version_override variable
- Improve Terraform packaging pipeline documentation
- Update .gitignore and repo structure for build artifacts
- Prepare repo for future deterministic Docker-based Lambda packaging

No infrastructure behavior changes yet; Terraform deployment remains ZIP-based.
Docker packaging planned for future phase.
- added terraform aws account guard to prevent deploy to wrong account
- added lifecycle.prevent_destroy to lambda iam role and policy
- confirmed terraform plan safe (1 add, 2 update, 0 destroy)

slack integration
- wired /seek and /seek_destroy to lambda function url
- verified slack -> lambda -> aws scan flow
- confirmed async lambda invocation working
- slack messages returning scan + budget + teardown plan

validation
- ran /seek from slack
- confirmed lambda invocation in cloudwatch logs
- lambda run time ~14s, memory usage normal
- dry-run teardown confirmed (no deletes)

docs
- added slack validation doc (watching cloudwatch logs)
- added safe operations guide for teardown controls
- added future infra hardening notes

status
phase 3 complete (slack wired)
phase 4 validation in progress
…eployment guard

Infrastructure safety
- Add Terraform apply-mode guard preventing deployment when APPLY_CHANGES=true unless allow_apply_mode=true
- Add deletion cap via TEARDOWN_MAX_DELETE_COUNT to prevent large accidental teardown operations
- Update lambda.tf and variables.tf with documented safety logic and comments

Runtime safety
- Extend TeardownConfig with max_delete_count
- Add executor guard to abort teardown when plan exceeds deletion limit
- Preserve simulate-mode protections and dry-run behavior
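
A rough sketch of how the runtime guard described above might fit together. `TeardownConfig` and `max_delete_count` are named in the commit message; the field defaults, the exception type, the helper names, and the choice to enforce the cap only in apply mode are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TeardownConfig:
    apply_changes: bool = False  # APPLY_CHANGES; dry-run unless True
    simulate: bool = False       # TEARDOWN_SIMULATE
    max_delete_count: int = 5    # TEARDOWN_MAX_DELETE_COUNT (default value is assumed)


class DeletionCapExceeded(RuntimeError):
    """Raised when a teardown plan is larger than the configured cap."""


def resolve_mode(config: TeardownConfig) -> str:
    """Map the flags to a mode; dry-run stays the default."""
    if not config.apply_changes:
        return "dry-run"
    return "simulate" if config.simulate else "apply"


def guard_plan(plan: Sequence[str], config: TeardownConfig) -> str:
    """Abort before any deletion if the plan exceeds the cap; return the mode."""
    mode = resolve_mode(config)
    # Assumption: the cap only matters when resources would really be deleted.
    if mode == "apply" and len(plan) > config.max_delete_count:
        raise DeletionCapExceeded(
            f"plan has {len(plan)} deletions, cap is {config.max_delete_count}"
        )
    return mode
```

Raising before the executor touches any resource keeps the failure all-or-nothing: either the whole plan is within the cap or nothing is deleted.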

Operational validation
- Add docs/validate_teardown.md with full controlled teardown test procedure
- Update Slack/Lambda validation documentation and common failure scenarios
- Document CLI methods for verifying Lambda environment variables and CloudWatch logs

Configuration updates
- Update env.example and terraform.tfvars.example to include TEARDOWN_MAX_DELETE_COUNT
- Clarify apply-mode behavior and Terraform deployment guard

Documentation improvements
- Expand README teardown controls and safety model
- Update infra README with destructive-mode deployment guard
- Improve V2_PLAN and validation guides for operational clarity

These changes introduce defense-in-depth protections for Bloodhound teardown operations
while providing reproducible validation workflows for engineers.
…entation

- Add automated validation tooling:
  - tools/run_validation_workflow.sh
  - tools/smoke_test_lambda.sh
  - tools/validate_teardown.sh
  - tools/show_validation_history.sh

- Implement controlled teardown validation pipeline

- Add Terraform validation resource:
  - infra/test_resource.tf

- Introduce validation logging and history tracking

- Add configuration system documentation:
  - docs/configuration_system.md
  - docs/run_validation.md

- Update teardown validation documentation

- Improve README with configuration safety warning and documentation index

- Rename architecture document to docs/bloodhound_v2_plan.md

- Update env.example with safety guard configuration

- Add Terraform variables for validation resources and safety controls

This commit introduces a full validation framework for Bloodhound v2 including
smoke testing, controlled teardown verification, configuration safety guards,
and documentation for operational workflows.
…ove teardown validation tooling

Core changes
- Standardized Slack command routing to maintain internal modes  and
- Added support for  preview mode while preserving existing destructive flow
- Ensured Lambda worker receives correct execution flags (apply_changes / simulate)
- Fixed Slack command handler logic and improved safety gating for destructive operations

Slack integration
- Updated Slack manifest to include , , , , and
- Aligned manifest URLs with Lambda Function URL endpoint
- Updated Slack command documentation and operational guidance

Infrastructure
- Updated Terraform outputs and test resource configuration
- Improved Lambda smoke test tooling

Validation & tooling
- Added automated validation workflow scripts
- Improved teardown validation scripts and history utilities
- Added operational docs for Slack app usage and troubleshooting

Documentation
- Updated Slack setup documentation
- Updated configuration system documentation
- Updated validation workflow documentation
- Added troubleshooting and operational runbooks
- Combined header and mode_text generation into a single decision block
- Added explicit DRY RUN banner to prevent operator confusion
…face

- Implement /v2_status Slack command for Bloodhound system status
- Add system health indicator (🟢 🟡 🔴) based on teardown configuration and safety limits
- Improve Slack report formatting (section dividers, vertical service/action lists, Top Regions Affected)
- Update Slack manifest to include /v2_status
- Update validation scripts and teardown tooling references
- Synchronize documentation across README and docs/* with full v2 command set

Commands now supported:
  /v2_seek
  /v2_seek_destroy_plan
  /v2_seek_destroy CONFIRM
  /v2_status
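
The commit message says the health indicator is derived from teardown configuration and safety limits but does not give the rules. The sketch below shows one possible mapping purely as an illustration; the thresholds, the function name, and the meaning assigned to each colour are all assumptions, not the behaviour of `/v2_status`.

```python
def health_indicator(apply_changes: bool, allow_all: bool, max_delete_count: int) -> str:
    """Hypothetical mapping from teardown settings to a status emoji."""
    # Assumed: dry-run cannot delete anything, so it is reported as healthy.
    if not apply_changes:
        return "🟢"
    # Assumed: destructive mode with no effective limit is the riskiest state.
    if allow_all or max_delete_count <= 0:
        return "🔴"
    # Assumed: destructive mode bounded by a deletion cap is a warning state.
    return "🟡"
```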
…safety improvements (WIP)

Summary
-------
Refactors Bloodhound pipeline structure to separate event handling and operational services.
This keeps the orchestration layer clean and improves maintainability of scan, budget, and teardown logic.

Major Changes
-------------
• Introduced new architecture layers
  - handlers/: event interpretation (Slack, validation harness, scheduled runs)
  - services/: operational pipeline logic (scan, budget, status, teardown)

• Extracted pipeline logic from app.py into service modules:
  - scan_service.py
  - budget_service.py
  - status_service.py
  - teardown_service.py

• Added handler modules:
  - slack_handler.py
  - validation_handler.py
  - scheduled_handler.py

• execute_pipeline() now acts as a clean orchestrator coordinating services.

Validation Safety Improvements
------------------------------
• Added validation-mode safeguards to ensure destructive testing can only affect validation resources.
• Validation runs now enforce:
  - target ID filtering
  - validation tag checks
  - restricted teardown scope
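
The three safeguards listed above could be combined into a single filter along these lines. The tag key `bloodhound:validation` and the function name are hypothetical; only the ideas of target ID filtering and validation tag checks come from the commit message.

```python
from typing import Iterable, Mapping

# Hypothetical tag key; the commit does not name the actual validation tag.
VALIDATION_TAG_KEY = "bloodhound:validation"


def restrict_to_validation_scope(
    candidates: Mapping[str, Mapping[str, str]],
    target_ids: Iterable[str],
) -> list[str]:
    """Return only the resource IDs a validation run is allowed to delete.

    `candidates` maps resource ID to its tags.
    """
    targets = set(target_ids)
    if not targets:
        # Refuse to run rather than fall back to "everything found".
        raise ValueError("validation runs require explicit target IDs")
    return sorted(
        rid
        for rid, tags in candidates.items()
        if rid in targets and tags.get(VALIDATION_TAG_KEY) == "true"
    )
```

Requiring both conditions means a mistyped target ID that happens to match a real workload is still skipped, because that workload would not carry the validation tag.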

Validation Workflow (WIP)
-------------------------
Validation pipeline currently under active testing:

Terraform -> create validation instance
Validation script -> capture instance ID
Lambda invocation -> validation mode
Teardown restricted via TEARDOWN_TARGET_IDS
Script verifies instance deletion

Status
------
Validation harness still in progress. Destructive validation behavior being verified before finalizing CI/CD integration.
- Add explicit event routing in app.py for slack_command, validation, and scheduled sources
- Harden validation_handler with source checks and target_ids enforcement
- Document validation harness architecture and payload model
- Update README to reflect Lambda → app.run() → pipeline execution flow
- Clarify dual execution paths (Slack operator vs validation harness)
- Align documentation with v2 slash commands and current validation workflow
- Fix outdated doc references and legacy command notes
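
As a sketch of the explicit routing described above: the three source names (`slack_command`, `validation`, `scheduled`) and the `target_ids` enforcement come from the commit message, while the event shape, the `source` field, and the handler functions are assumptions made for illustration.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]
Result = Dict[str, Any]


def handle_slack(event: Event) -> Result:
    return {"route": "slack_command"}


def handle_validation(event: Event) -> Result:
    # Mirrors the "target_ids enforcement" bullet: no IDs, no run.
    target_ids = event.get("target_ids") or []
    if not target_ids:
        raise ValueError("validation events must include target_ids")
    return {"route": "validation", "target_ids": list(target_ids)}


def handle_scheduled(event: Event) -> Result:
    return {"route": "scheduled"}


# Hypothetical routing table keyed on an assumed "source" field.
ROUTES: Dict[str, Callable[[Event], Result]] = {
    "slack_command": handle_slack,
    "validation": handle_validation,
    "scheduled": handle_scheduled,
}


def route_event(event: Event) -> Result:
    """Dispatch on the event source and reject anything unrecognised."""
    source = event.get("source")
    handler = ROUTES.get(source)
    if handler is None:
        raise ValueError(f"unsupported event source: {source!r}")
    return handler(event)
```

Failing on unknown sources, rather than falling through to a default path, keeps a malformed payload from reaching the teardown pipeline by accident.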
Engineering notes:
Changes made while validating the Bloodhound teardown workflow
and debugging Lambda packaging behavior.

Changes:
- Add jq validation check to ensure Lambda response success
- Add scheduled_handler entrypoint for scheduled scans
- Improve Terraform Lambda packaging triggers and debug visibility
- Exclude __pycache__ and .pyc files from Lambda bundle
- Add AWS CLI '--cli-binary-format raw-in-base64-out' to Lambda invocation workflow
- Add Terraform + Lambda troubleshooting documentation

Validation:
Pipeline verified using run_validation_workflow.sh
with successful EC2 teardown validation.
This commit introduces the Bloodhound teardown validation system
along with several reliability improvements.

Key updates:

- added strict bash mode (set -euo pipefail) to prevent silent script failures
- added Lambda execution metric validation before checking AWS resources
- replaced fixed sleep with a loop that waits until EC2 is fully terminated
- added workflow logging using RUN_ID for each validation run
- added log cleanup to keep only the last 3 validation logs
- limited Lambda rebuilds to actual code changes
- updated troubleshooting documentation for the build pipeline
- confirmed full teardown validation workflow working end-to-end
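
The change from a fixed sleep to a wait loop can be illustrated in Python. The project's script is bash, so this is only a sketch of the idea: the function name, the timeout values, and the injected `get_state` callable are assumptions; `terminated` is the EC2 instance state name reported by AWS.

```python
import time
from typing import Callable


def wait_until_terminated(
    get_state: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the instance reports 'terminated' or the timeout elapses.

    `get_state` is a hypothetical callable that returns the current EC2
    state name, for example by wrapping a describe-instances call.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == "terminated":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Compared with a fixed sleep, the loop returns as soon as termination is observed and reports a timeout explicitly instead of checking once and hoping the instance is gone.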

Validation workflow test:

1. smoke test checks Lambda configuration
2. Terraform creates a disposable EC2 instance
3. Lambda teardown deletes the instance
4. execution metrics are verified
5. EC2 termination is confirmed

Result: PASS

This validation workflow ensures Bloodhound safely deletes targeted resources.
- Add structured CI log groups for improved debugging

- Add Lambda error detection and StatusCode validation

- Stream CloudWatch Lambda logs into GitHub Actions output

- Document GitHub automation in docs/github_actions.md

- Update README with GitHub Actions workflow references
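
For the error detection and StatusCode validation bullet, a check along the following lines is one option. `StatusCode` and `FunctionError` are fields of the Lambda Invoke response; the function name, the expectation of a synchronous (200) invocation, and the assumption that the payload is JSON are illustrative choices.

```python
import json
from typing import Any, Mapping


def check_invoke_response(response: Mapping[str, Any], payload_text: str) -> Any:
    """Validate a synchronous Lambda invoke result and return the parsed payload.

    `response` is the invoke response metadata; `payload_text` is the already
    decoded payload body (assumed to be JSON).
    """
    status = response.get("StatusCode")
    if status != 200:
        raise RuntimeError(f"unexpected StatusCode: {status}")
    # FunctionError is only present when the function itself raised or failed.
    function_error = response.get("FunctionError")
    if function_error:
        raise RuntimeError(f"Lambda reported {function_error}: {payload_text}")
    return json.loads(payload_text)
```

The second check matters because a function that throws still returns StatusCode 200 on a synchronous invoke; only the `FunctionError` field reveals the failure.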
… assumption

- Add scripts/bootstrap_github_oidc.sh to configure GitHub OIDC provider and IAM role
- Detect AWS account ID dynamically using STS
- Add IAM resource tagging for governance and ownership tracking
- Add cleanup trap to remove temporary IAM policy artifacts (trust-policy.json, lambda-policy.json)
- Update GitHub Actions workflow to use OIDC role assumption
- Document GitHub automation and OIDC bootstrap process in README
Add GitHub OIDC bootstrap script and switch CI authentication to role…
- Restrict OIDC subject to repo:*/Bloodhound:ref:refs/heads/main
- Update trust policy automatically if role exists
- Improve documentation and security comments
- Allow forks of Bloodhound repo to assume role
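
To make the subject restriction concrete, here is a sketch of the kind of trust policy such a bootstrap script might generate. The subject pattern is the one quoted in the commit message and the provider host and audience are the standard values for GitHub Actions OIDC on AWS; the function name and the placeholder account ID are assumptions, and the actual contents of trust-policy.json are not shown in this PR.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def build_trust_policy(account_id: str, subject_pattern: str) -> dict:
    """Build an IAM trust policy for GitHub Actions OIDC role assumption."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # The token must be issued for AWS STS.
                    "StringEquals": {f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com"},
                    # StringLike allows the wildcard in the subject pattern.
                    "StringLike": {f"{GITHUB_OIDC_HOST}:sub": subject_pattern},
                },
            }
        ],
    }


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID.
    policy = build_trust_policy("123456789012", "repo:*/Bloodhound:ref:refs/heads/main")
    print(json.dumps(policy, indent=2))
```

The `*` in the owner position of the subject is what lets forks of the Bloodhound repo match, while the `ref:refs/heads/main` suffix limits role assumption to workflows running on the main branch.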
Added temporary debug statements for GA to check output
mmccla1n added 30 commits March 14, 2026 23:04
Bloodhound: add Terraform support for validation workflow
- Removed validation option from workflow_dispatch inputs
- Validation pipeline still exists but requires CI hardening
- Will be re-enabled prior to GA once validation workflow stabilizes
Bloodhound: temporarily disable validation workflow in CI
…ution troubleshooting

Expanded Lambda packaging documentation and troubleshooting guidance for the Bloodhound Lambda deployment.

Changes include:
- Added explanation of build environment vs Lambda runtime differences
- Documented common dependency resolution failures during packaging
- Added guidance on avoiding transitive dependency pinning
- Expanded Docker-based packaging section for future deterministic builds
- Added troubleshooting section covering pip dependency conflicts
- Linked packaging documentation with Terraform troubleshooting guide

These updates were added after encountering a real dependency conflict
between botocore and a manually pinned urllib3 version during Lambda
packaging.

The documentation now explains:
- how Lambda packages are built locally
- why boto3 should not be bundled
- how dependency conflicts occur
- recommended dependency management practices
- the long-term plan for Docker-based packaging

This improves maintainability of the infrastructure documentation and
provides engineers with clear debugging guidance for Lambda packaging failures.
…ocumentation

* Document script-driven build process (build_lambda.sh)
* Introduce layered build directory model (.build/deps, src, lambda_pkg)
* Clarify Terraform triggers and packaging flow
* Improve troubleshooting for archive/build edge cases
* Add guardrails for modifying build pipeline

Ensures documentation reflects deterministic, cache-aware Lambda packaging architecture
- routed scheduled events through dedicated handler instead of run()
- added run_scheduled_scan() as explicit scheduled entrypoint
- removed scheduled flow from generic run() path
- updated lambda router to distinguish scheduled vs default invocations
- aligned documentation to reflect actual execution model
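
A sketch of how a router can keep scheduled invocations off the generic path. `run_scheduled_scan()` and `run()` are named in the commit message, but here they are passed in as parameters because their real signatures are not shown; the detection relies on the `source` and `detail-type` fields that EventBridge sets on scheduled events, and whether the project detects them this way is an assumption.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]


def is_scheduled_event(event: Event) -> bool:
    """True for events produced by an EventBridge schedule rule."""
    return (
        event.get("source") == "aws.events"
        and event.get("detail-type") == "Scheduled Event"
    )


def lambda_router(
    event: Event,
    run_scheduled_scan: Callable[[Event], Any],
    run: Callable[[Event], Any],
) -> Any:
    """Send scheduled events to their own entrypoint; everything else to run()."""
    if is_scheduled_event(event):
        # Scheduled events never reach run(), so run() cannot re-enter this path.
        return run_scheduled_scan(event)
    return run(event)
```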

This change addresses the suspected recursion issue in scheduled Lambda executions.
Validation pending via Terraform deploy, GHA, and Slack testing.
… path

- routed scheduled events through dedicated handler instead of run()
- added validate_scheduler mode to simulate EventBridge scheduled trigger in GHA
- standardized CloudWatch logging across lambda entrypoint
- added request_id tracing for improved log visibility

Validation:
- manual scan/status verified via GHA and CLI
- WIP on scheduler path validation via validate_scheduler mode
Mmc/bloodhound v2..fix(lambda): prevent scheduled recursion and add scheduler validation path
Move Lambda packaging out of Terraform and into scripts/build_lambda.sh.

Key changes:
- Introduced scripts/build_lambda.sh to build the Lambda deployment package
- Default build mode uses AWS SAM Docker image for Amazon Linux compatibility
- Added optional local build mode for faster development
- Terraform terraform_data.build_lambda_pkg now invokes the build script
- archive_file continues to package .build/lambda_pkg into the deployment zip
- Added structured build logging and package visibility for debugging

Benefits:
- Ensures dependencies match the AWS Lambda runtime environment
- Keeps Terraform focused strictly on infrastructure
- Produces deterministic and reproducible Lambda packages
- Improves debugging when diagnosing Lambda import errors
- Enables future CI/CD integration

Docker builds are now the default to ensure production-safe artifacts.
…mbda build system

- corrected Lambda packaging documentation to reflect real .build structure
- removed outdated deps/src build directory references
- documented final Lambda artifact (.build/bloodhound_lambda_v2.zip)
- clarified Docker build environment using SAM build container (public.ecr.aws/sam/build-python3.10)
- added Python packaging metadata explanation (*.dist-info, bin/)
- improved Lambda packaging troubleshooting guidance
- moved Terraform bootstrap import documentation to infra/README.md
- removed Terraform bootstrap section from Slack documentation
- clarified safe operations and teardown validation documentation
- ensured infrastructure docs accurately reflect current build and deployment pipeline
…g artifacts

- add comprehensive quick demo guide covering 8 operational scenarios
- document local execution, Slack commands, teardown planning, and controlled deletion
- add GitHub Actions automation and manual operations walkthroughs
- document CloudWatch log inspection and teardown validation workflow
- add architecture overview documentation
- include demo artifacts (screenshots and PDF walkthroughs)
docs: expose quick_demo guide in README and features documentation
Corrected path to render pdfs correctly
2 participants