
Thread-safe config reads and SIGHUP reload for HTTP API server#89

Merged
mason-sharp merged 7 commits into main from feature/ACE/sighup-follow-up on Mar 18, 2026
Conversation

@mason-sharp
Member

A follow-up to the SIGHUP changes in PR #84.

Replace all bare config.Cfg reads with config.Get() across auth.go, table_diff.go, merkle.go, listen.go, and setup.go to eliminate data races with config.Set() during SIGHUP reload.

Switch HTTP API handlers from stale s.cfg snapshot to config.Get() so reloaded values (block size, concurrency, timeouts, etc.) take effect on the next request. Remove the cfg field from APIServer.

Add SIGHUP config reload handling for both API-only entry points: ./ace server (StartAPIServerCLI) and ./ace start --component=api. Invalid configs are rejected with the previous config retained.

The included test is basic, but the change was verified with additional testing.
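The config.Get()/config.Set() pattern this PR standardizes on can be sketched with an atomic pointer. A minimal sketch, assuming an atomic.Pointer-backed accessor; the Config fields and package layout here are illustrative stand-ins, not the actual pkg/config API:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Config is a stand-in for the real pkg/config.Config; fields are illustrative.
type Config struct {
	TaskStorePath string
}

var current atomic.Pointer[Config]

// Get returns the live config pointer; reads are lock-free.
func Get() *Config { return current.Load() }

// Set atomically swaps in a new config. Goroutines holding an older
// snapshot keep reading it unchanged, which is why handlers should
// snapshot once per request.
func Set(cfg *Config) { current.Store(cfg) }

func main() {
	Set(&Config{TaskStorePath: "/var/lib/ace/tasks"})
	snap := Get()                             // one snapshot, reused for the whole request
	Set(&Config{TaskStorePath: "/tmp/tasks"}) // a concurrent SIGHUP reload swaps the pointer
	fmt.Println(snap.TaskStorePath)           // still /var/lib/ace/tasks
	fmt.Println(Get().TaskStorePath)          // /tmp/tasks
}
```

Unlike a mutex-guarded struct, the swap is a single pointer store, so readers never block and never observe a half-written config.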

@coderabbitai

coderabbitai bot commented Mar 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Centralizes runtime configuration via config.Get(), removes stored cfg from APIServer, adds in-process SIGHUP-driven reload loops for API and scheduler (including security reload), introduces per-task mtree schema caching, updates callers to use live config accessors, and adds tests and docs for config-reload behavior.

Changes

Cohort / File(s) | Summary
API server & handlers
internal/api/http/server.go, internal/api/http/handler.go
Removes cfg *config.Config from APIServer; replaces direct s.cfg reads by calling cfg := config.Get() in handlers. Resolver methods now accept a cfg parameter; certificate validator switched to an atomic pointer with ReloadSecurityConfig to swap validators at runtime.
Reload tests & validator
internal/api/http/config_reload_test.go, internal/api/http/validator_test.go
Adds TestResolversPickUpReloadedConfig and two security-reload integration tests that exercise in-memory config swaps and ReloadSecurityConfig behavior before/after reload.
CLI & reload loops
internal/cli/cli.go
Adds SIGHUP-driven reload loop (runConfigReloadLoop) wired for API and scheduler startup paths; scheduler reload loop now accepts apiServer, uses config.Reload()/config.Set() for atomic in-process config swaps, drains in-flight scheduler jobs, and invokes apiServer.ReloadSecurityConfig when present.
Consistency & merkle tasks
internal/consistency/diff/table_diff.go, internal/consistency/mtree/merkle.go
Switches table-diff limits to use config.Get(); introduces per-task mtreeSchema field and MerkleTreeTask.aceSchema() to memoize schema per-task and avoid mid-task config mixups.
CDC & infra
internal/infra/cdc/listen.go, internal/infra/cdc/setup.go
Replace direct config.Cfg reads with config.Get(); propagate mtreeSchema through change processing and CDC setup and use dynamic CDC config via config.Get().
DB auth config
internal/infra/db/auth.go
Replace config.Cfg usage with cfg := config.Get() in connection-string and pool configuration functions.
Config API
pkg/config/config.go
Changes Reload(path string) → Reload(); reload now always reads the configured CfgPath under lock and no longer accepts an explicit path argument.
Docs
docs/scheduling.md
Adds "Signaling the Scheduler to Reload the Configuration" section describing SIGHUP-triggered reload and waiting for in-flight jobs to finish before swapping configs.

Poem

🐇 I nibble configs as they softly flip,
A SIGHUP tugs the thread and wakes the ship.
Old jobs finish dancing while the new rules wake,
No sudden stops — just a gentle config shake.
Hopping on, the server hums awake.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 15.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: the title directly and concisely summarizes the main objective: making config reads thread-safe and adding SIGHUP reload support for the HTTP API server.
  • Description check ✅ Passed: the description matches the changeset, explaining the core changes (replacing config.Cfg with config.Get(), adding SIGHUP reload, removing the cfg field) and the problems they address.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
internal/consistency/diff/table_diff.go (1)

730-735: Add a nil-config guard in validation path.

cfg.TableDiff is dereferenced immediately; returning a clear validation error is safer than panicking.

Proposed fix
 	cfg := config.Get()
+	if cfg == nil {
+		return fmt.Errorf("configuration not loaded")
+	}
 	if t.BlockSize > cfg.TableDiff.MaxBlockSize && !t.OverrideBlockSize {
 		return fmt.Errorf("block row size should be <= %d", cfg.TableDiff.MaxBlockSize)
 	}

Also applies to: 741-743

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/consistency/diff/table_diff.go` around lines 730 - 735, The
validation currently dereferences cfg.TableDiff without checking for nil; update
the validation path (the block where cfg := config.Get() compares t.BlockSize
against cfg.TableDiff.MaxBlockSize/MinBlockSize and checks t.OverrideBlockSize)
to first guard that cfg.TableDiff != nil and return a clear error (e.g., "table
diff configuration missing") if it is nil; apply the same nil-config guard to
the other identical check later (the code around the second occurrence
referencing cfg.TableDiff at lines ~741-743) so neither MaxBlockSize nor
MinBlockSize dereferences can panic.
internal/infra/cdc/setup.go (1)

42-43: Guard config.Get() before dereferencing.

config.Get().MTree.CDC can panic when config is unavailable. Returning an explicit error is safer.

Proposed fix
-	cfg := config.Get().MTree.CDC
+	rootCfg := config.Get()
+	if rootCfg == nil {
+		return 0, fmt.Errorf("configuration not loaded")
+	}
+	cfg := rootCfg.MTree.CDC
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/infra/cdc/setup.go` around lines 42 - 43, Guard the result of
config.Get() and its nested fields before dereferencing: replace the direct
access cfg := config.Get().MTree.CDC and slot := cfg.SlotName with a nil-check
flow that obtains root := config.Get(), returns an explicit error if root is nil
or if root.MTree or root.MTree.CDC is nil, then assigns cfg := root.MTree.CDC
and slot := cfg.SlotName; ensure the containing function returns an error when
config is unavailable rather than letting a panic occur.
docs/scheduling.md (1)

44-47: Add a concrete command example for sending SIGHUP.

Consider including one explicit example (e.g., kill -HUP <pid>) so operators can apply this immediately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/scheduling.md` around lines 44 - 47, Update the "Signaling the Scheduler
to Reload the Configuration" section to include a concrete command example such
as "kill -HUP <pid>" and an alternative like "pkill -HUP <process_name>", plus a
brief note how to locate the PID (e.g., pgrep or systemctl show) and mention the
systemd variant "systemctl kill -s HUP <service>" so operators immediately have
actionable commands; place this immediately after the sentence that explains
using SIGHUP.
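The concrete commands the reviewer asks for could look like the sketch below. The process and service names (ace, ace.service) are hypothetical stand-ins for the actual deployment, and the last few lines demonstrate the signal round-trip with a throwaway stand-in process rather than the real scheduler:

```shell
# Hypothetical names -- adapt to the actual deployment.
# Find the PID and send SIGHUP to trigger a config reload:
#   pgrep -f 'ace start'
#   kill -HUP <pid>
# One-step alternative:
#   pkill -HUP -f 'ace start'
# Under systemd:
#   systemctl kill -s HUP ace.service

# Self-contained demonstration: a stand-in process that traps SIGHUP.
sh -c 'trap "echo config reloaded" HUP; sleep 2 & wait' &
pid=$!
sleep 0.2
kill -HUP "$pid"   # the trap fires and the process prints: config reloaded
wait "$pid" 2>/dev/null
```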
internal/api/http/handler.go (1)

223-223: Use one config snapshot per request to avoid mixed-version task settings.

At Line 223, Line 363, Line 444, Line 509, Line 584, Line 666, Line 744, Line 802, Line 866, Line 937, Line 1012, and Line 1091, each handler reads config.Get() independently. Since pkg/config/config.go:116-179 allows Set() to swap config between reads, a single request can assemble a task with mixed old/new config values.

♻️ Suggested pattern (snapshot once, reuse)
 func (s *APIServer) handleTableDiff(w http.ResponseWriter, r *http.Request) {
+	cfg := config.Get()
+	if cfg == nil {
+		writeError(w, http.StatusServiceUnavailable, "configuration unavailable")
+		return
+	}
+
 	...
-	task.BlockSize = s.resolveBlockSize(req.BlockSize)
-	task.ConcurrencyFactor = s.resolveConcurrency(req.Concurrency)
-	task.CompareUnitSize = s.resolveCompareUnitSize(req.CompareUnitSize)
-	task.MaxDiffRows = s.resolveMaxDiffRows(req.MaxDiffRows)
+	task.BlockSize = s.resolveBlockSize(cfg, req.BlockSize)
+	task.ConcurrencyFactor = s.resolveConcurrency(cfg, req.Concurrency)
+	task.CompareUnitSize = s.resolveCompareUnitSize(cfg, req.CompareUnitSize)
+	task.MaxDiffRows = s.resolveMaxDiffRows(cfg, req.MaxDiffRows)
 	...
-	task.TaskStorePath = config.Get().Server.TaskStorePath
+	task.TaskStorePath = cfg.Server.TaskStorePath
 }
 
-func (s *APIServer) resolveBlockSize(requested int) int {
+func (s *APIServer) resolveBlockSize(cfg *config.Config, requested int) int {
 	if requested > 0 {
 		return requested
 	}
-	if cfg := config.Get(); cfg != nil && cfg.TableDiff.DiffBlockSize > 0 {
+	if cfg != nil && cfg.TableDiff.DiffBlockSize > 0 {
 		return cfg.TableDiff.DiffBlockSize
 	}
 	return 100000
 }

Also applies to: 363-363, 444-444, 509-509, 584-584, 666-666, 744-744, 802-802, 866-866, 937-937, 1012-1012, 1091-1091

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/http/handler.go` at line 223, Multiple places in handler.go call
config.Get() per-use (e.g., setting task.TaskStorePath =
config.Get().Server.TaskStorePath) which can mix config versions during a
request; fix by taking a single snapshot of the config at the start of each
request handler (e.g., cfg := config.Get()) and then replace subsequent
config.Get() calls in that handler with cfg (use cfg.Server.TaskStorePath,
cfg.OtherField, etc.), ensuring all task assembly and reads (including where
task.TaskStorePath is set) consistently use the same cfg for the duration of the
request.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/api/http/config_reload_test.go`:
- Around line 16-20: The test cleanup currently calls
config.Set(&config.Config{}) when original == nil, which mutates global state;
instead, restore the exact prior state by calling config.Set(original) in all
cases (i.e., allow original to be nil) so that when original is nil the global
remains nil; update the cleanup logic around the original variable and the
config.Set call (references: original, config.Set, config.Config) to ensure no
cross-test pollution.

In `@internal/cli/cli.go`:
- Around line 1456-1462: The reload path currently calls config.Reload and
immediately config.Set the newCfg without validating it; update the reload logic
(the block that calls config.Reload/newCfg and the similar block around lines
where this appears) to validate the reloaded config with
canStartAPIServer(newCfg) (the same check used at startup) before calling
config.Set, and if validation fails log an error with details and continue
without replacing the live config; ensure you reference the functions
config.Reload, config.Set and canStartAPIServer in the change and preserve the
existing success/failure logging behavior.

In `@internal/consistency/mtree/merkle.go`:
- Around line 50-52: The aceSchema() helper currently reads
config.Get().MTree.Schema on every call which can drift during reloads; instead
snapshot the schema once at task/run start and reuse that value. Modify call
sites to accept a schema parameter (or store the snapshot on a per-task struct)
and remove direct use of config.Get().MTree.Schema from aceSchema()/caller code
so aceSchema() returns the pre-captured string (or remove aceSchema and use the
injected snapshot); reference aceSchema(), config.Get().MTree.Schema and any
task/run struct or function that orchestrates the operation to propagate and use
the single captured schema value.

In `@internal/infra/cdc/listen.go`:
- Around line 61-63: The bug is that processChanges reads live config via
config.Get() allowing mid-stream schema changes; instead, capture the schema
once in processReplicationStream and pass it down: modify
processReplicationStream to read cfg := config.Get().MTree.CDC (or specifically
mtreeSchema := cfg.SomeSchemaField) before starting the stream and change the
signature of processChanges(ctx context.Context, pool *pgxpool.Pool, changes
[]cdcMsg, mtreeSchema string) to accept the schema, then update all calls to
processChanges in this file (and related call sites e.g., where
processReplicationStream invokes it) so processChanges uses the passed
mtreeSchema instead of calling config.Get() internally. Ensure any other uses of
config.Get() in this CDC path are replaced with the captured value to prevent
mixed-schema writes.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a2543e0c-d27d-462b-8870-123b5d8e6bb4

📥 Commits

Reviewing files that changed from the base of the PR and between c3af6dd and e620446.

📒 Files selected for processing (10)
  • docs/scheduling.md
  • internal/api/http/config_reload_test.go
  • internal/api/http/handler.go
  • internal/api/http/server.go
  • internal/cli/cli.go
  • internal/consistency/diff/table_diff.go
  • internal/consistency/mtree/merkle.go
  • internal/infra/cdc/listen.go
  • internal/infra/cdc/setup.go
  • internal/infra/db/auth.go
💤 Files with no reviewable changes (1)
  • internal/api/http/server.go

@mason-sharp mason-sharp requested a review from rasifr March 16, 2026 14:31

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/api/http/handler.go (1)

205-226: ⚠️ Potential issue | 🟡 Minor

Use a single config snapshot per handler (including default-cluster resolution).

Some handlers now snapshot config, but others still fetch config inline for TaskStorePath, and resolveClusterName reads default cluster via a separate global call. A SIGHUP mid-handler can still mix config generations in one request.

💡 Proposed pattern
-func (s *APIServer) resolveClusterName(name string) (string, error) {
+func (s *APIServer) resolveClusterName(cfg *config.Config, name string) (string, error) {
    cluster := strings.TrimSpace(name)
    if cluster == "" {
-       cluster = config.DefaultCluster()
+       if cfg != nil {
+           cluster = strings.TrimSpace(cfg.DefaultCluster)
+       }
    }
    if cluster == "" {
        return "", fmt.Errorf("cluster is required")
    }
    return cluster, nil
}
- cluster, err := s.resolveClusterName(req.Cluster)
+ cfg := config.Get()
+ if cfg == nil {
+     writeError(w, http.StatusServiceUnavailable, "configuration unavailable")
+     return
+ }
+ cluster, err := s.resolveClusterName(cfg, req.Cluster)
  if err != nil { ... }

- task.TaskStorePath = config.Get().Server.TaskStorePath
+ task.TaskStorePath = cfg.Server.TaskStorePath

Also applies to: 365-365, 446-446, 511-511, 566-589, 649-673, 750-750, 808-808, 872-872, 943-943, 1018-1018, 1097-1097

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/http/handler.go` around lines 205 - 226, This handler may mix
different config generations during a request; take a single config snapshot at
the top (cfg := config.Get()) and use that same cfg throughout the handler and
in any helper invocations (e.g., pass cfg into resolveClusterName,
resolveBlockSize, resolveConcurrency, resolveCompareUnitSize,
resolveMaxDiffRows, resolveNodes) instead of calling config.Get() again or
reading defaults inside those helpers; update resolveClusterName (and any helper
that currently calls config.Get()) to accept a cfg parameter or fetch the
default cluster externally and pass it in, and replace any inline config.Get()
usages (including TaskStorePath lookups and default-cluster resolution) with the
same cfg variable so the entire handler uses one consistent snapshot.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/cli/cli.go`:
- Around line 1453-1455: The reload goroutine can leak because
runConfigReloadLoop currently ranges over sighupCh which never closes; change
runConfigReloadLoop to accept a context (e.g. ctx context.Context) or the
existing runCtx and update its internal loop to select on either the signal
channel or ctx.Done() so it returns when the command shuts down; then call go
runConfigReloadLoop(runCtx, sighupCh) (or similar) where it’s currently invoked
(references: runConfigReloadLoop and sighupCh and runCtx.Done) so the goroutine
exits on cancellation instead of blocking forever.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 210023ff-5494-4ff0-8a9c-f69518102092

📥 Commits

Reviewing files that changed from the base of the PR and between d6a9256 and eeb8c08.

📒 Files selected for processing (4)
  • internal/api/http/config_reload_test.go
  • internal/api/http/handler.go
  • internal/cli/cli.go
  • pkg/config/config.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/api/http/config_reload_test.go


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/cli/cli.go (2)

1433-1458: ⚠️ Potential issue | 🟠 Major

Surface API server failures out of StartSchedulerCLI.

The API server goroutine only logs and calls stop(). After that, the API-only path on Line 1457 and the scheduler path both treat the shutdown as clean, so a bind/TLS/runtime failure exits ace start --component=api/all with nil instead of surfacing the actual error.

🧭 Suggested direction
+	apiErrCh := make(chan error, 1)
 	if runAPI {
 		if ok, apiErr := canStartAPIServer(cfg); ok {
 			var err error
 			apiServer, err = server.New(cfg)
 			if err != nil {
 				return fmt.Errorf("api server init failed: %w", err)
 			}
 			go func() {
-				if err := apiServer.Run(runCtx); err != nil && !errors.Is(err, context.Canceled) {
-					logger.Error("api server error: %v", err)
-					stop()
-				}
+				apiErrCh <- apiServer.Run(runCtx)
 			}()
@@
 	if !runScheduler {
-		go runConfigReloadLoop(sighupCh, apiServer)
-		<-runCtx.Done()
-		return nil
+		go runConfigReloadLoop(runCtx, sighupCh, apiServer)
+		select {
+		case <-runCtx.Done():
+			return nil
+		case err := <-apiErrCh:
+			if err != nil && !errors.Is(err, context.Canceled) {
+				return fmt.Errorf("api server error: %w", err)
+			}
+			return nil
+		}
 	}

The scheduler path should select on the same apiErrCh and return it as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/cli/cli.go` around lines 1433 - 1458, The API server goroutine
currently only logs and calls stop(), so StartSchedulerCLI treats those
shutdowns as clean; modify the API startup to create an apiErrCh (chan error),
have the goroutine send the non-canceled error into apiErrCh when apiServer.Run
fails, and close the channel on normal exit; then update both the API-only path
(where runScheduler is false and runConfigReloadLoop is used) and the scheduler
path to select on runCtx.Done() and apiErrCh so any error received from apiErrCh
is returned by StartSchedulerCLI (propagate the error instead of returning nil).
Ensure you reference and update the apiServer run goroutine, the apiErrCh, and
the shutdown/select logic where runConfigReloadLoop and <-runCtx.Done() are
used.

1454-1463: ⚠️ Potential issue | 🔴 Critical

Reload API mTLS state in ace start --component=all too.

In the combined path, SIGHUP is handled only by schedulerReloadLoop, and that code never calls apiServer.ReloadSecurityConfig. So config.Set(newCfg) updates handler-facing config, but allowedCNs and CRL state stay frozen at startup in the default ace start mode. A revoked client cert will keep working until restart.

🔐 Proposed fix
-	return schedulerReloadLoop(runCtx, sighupCh)
+	return schedulerReloadLoop(runCtx, sighupCh, apiServer)
 }
@@
 func schedulerReloadLoop(
 	runCtx context.Context,
 	sighupCh <-chan os.Signal,
+	apiServer *server.APIServer,
 ) error {
@@
-				config.Set(newCfg)
+				config.Set(newCfg)
+				if apiServer != nil {
+					if err := apiServer.ReloadSecurityConfig(newCfg); err != nil {
+						logger.Warn("api: security config reload failed (mTLS config unchanged): %v", err)
+					}
+				}
 				logger.Info("scheduler: configuration reloaded successfully")
 				reloaded = true // break inner loop → outer loop restarts scheduler
 			}
 		}

Also applies to: 1535-1562

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/cli/cli.go` around lines 1454 - 1463, SIGHUP handling in the
combined startup path only goes through schedulerReloadLoop and never calls
apiServer.ReloadSecurityConfig, so allowedCNs/CRL aren't refreshed; update the
SIGHUP handling to invoke the API server reload like the API‑only path: either
call apiServer.ReloadSecurityConfig (and then config.Set/newCfg handling) from
inside schedulerReloadLoop where it processes sighupCh, or have the combined
path spawn/run the same runConfigReloadLoop(sighupCh, apiServer) helper in
addition to starting the scheduler; target symbols: schedulerReloadLoop,
runConfigReloadLoop, apiServer.ReloadSecurityConfig, and config.Set/allowedCNs
to ensure the API mTLS state (allowedCNs and CRL) is reloaded on SIGHUP.
♻️ Duplicate comments (1)
internal/cli/cli.go (1)

1454-1456: ⚠️ Potential issue | 🟡 Minor

Stop runConfigReloadLoop explicitly on shutdown.

signal.Stop(sighupCh) does not close sighupCh, so the goroutine started here can outlive shutdown while blocked in for range ch.

🛑 Proposed fix
-		go runConfigReloadLoop(sighupCh, apiServer)
+		go runConfigReloadLoop(runCtx, sighupCh, apiServer)
@@
-	go runConfigReloadLoop(sighupCh, apiServer)
+	go runConfigReloadLoop(runCtx, sighupCh, apiServer)
@@
-func runConfigReloadLoop(ch <-chan os.Signal, apiServer *server.APIServer) {
-	for range ch {
-		logger.Info("api: received SIGHUP – reloading configuration")
-		newCfg, err := config.Reload()
-		if err != nil {
-			logger.Error("api: config reload failed (keeping current config): %v", err)
-			continue
-		}
-		config.Set(newCfg)
-		if apiServer != nil {
-			if err := apiServer.ReloadSecurityConfig(newCfg); err != nil {
-				logger.Warn("api: security config reload failed (mTLS config unchanged): %v", err)
-			}
-		}
-		logger.Info("api: configuration reloaded successfully")
-	}
+func runConfigReloadLoop(runCtx context.Context, ch <-chan os.Signal, apiServer *server.APIServer) {
+	for {
+		select {
+		case <-runCtx.Done():
+			return
+		case <-ch:
+			logger.Info("api: received SIGHUP – reloading configuration")
+			newCfg, err := config.Reload()
+			if err != nil {
+				logger.Error("api: config reload failed (keeping current config): %v", err)
+				continue
+			}
+			config.Set(newCfg)
+			if apiServer != nil {
+				if err := apiServer.ReloadSecurityConfig(newCfg); err != nil {
+					logger.Warn("api: security config reload failed (mTLS config unchanged): %v", err)
+				}
+			}
+			logger.Info("api: configuration reloaded successfully")
+		}
+	}
 }

Also applies to: 1592-1615

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/cli/cli.go` around lines 1454 - 1456, The goroutine started with
runConfigReloadLoop(sighupCh, apiServer) can hang because signal.Stop(sighupCh)
doesn't close the channel; update the shutdown path and runConfigReloadLoop to
use an explicit cancellation mechanism (e.g., accept a context or a done
channel) instead of relying solely on signal.Stop: change runConfigReloadLoop to
select on either ctx.Done()/doneCh or the SIGHUP channel, and on shutdown cancel
the context or close/send on the done channel (or close sighupCh) so the
goroutine exits cleanly; update the caller sites that start runConfigReloadLoop
(the API-only branch and the other occurrence around lines 1592-1615) to pass
the cancellation context/done channel and invoke its cancel/close during
shutdown.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b905cc1d-f3ea-4257-afab-986aba3d1369

📥 Commits

Reviewing files that changed from the base of the PR and between eeb8c08 and 6c573c5.

📒 Files selected for processing (3)
  • internal/api/http/server.go
  • internal/api/http/validator_test.go
  • internal/cli/cli.go

@rasifr rasifr force-pushed the feature/ACE/sighup-follow-up branch from 6c573c5 to ee05009 on March 17, 2026 07:57
mason-sharp and others added 5 commits March 18, 2026 07:12
Replace all bare config.Cfg reads with config.Get() across auth.go,
table_diff.go, merkle.go, listen.go, and setup.go to eliminate data
races with config.Set() during SIGHUP reload.

Switch HTTP API handlers from stale s.cfg snapshot to config.Get() so
reloaded values (block size, concurrency, timeouts, etc.) take effect
on the next request. Remove the cfg field from APIServer.

Add SIGHUP config reload handling for both API-only entry points:
./ace server (StartAPIServerCLI) and ./ace start --component=api.
Invalid configs are rejected with the previous config retained.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tream drift

Capture MTree.CDC and MTree.Schema at the start of processReplicationStream
and pass mtreeSchema into processChanges instead of calling config.Get()
from concurrent goroutines. This prevents a SIGHUP reload from causing
mixed-schema writes across change batches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
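The capture-and-pass shape this commit describes can be sketched as follows; the type and function names are simplified stand-ins for the real CDC code, not its actual API:

```go
package main

import "fmt"

// cdcMsg stands in for the real CDC change-message type.
type cdcMsg struct{ table string }

// liveSchema stands in for config.Get().MTree.Schema.
func liveSchema() string { return "mtree" }

// qualify builds a schema-qualified table name.
func qualify(schema, table string) string { return schema + "." + table }

// processChanges receives the schema as a parameter instead of reading
// live config, so every batch in one stream resolves tables against
// the same schema.
func processChanges(changes []cdcMsg, mtreeSchema string) {
	for _, c := range changes {
		fmt.Println(qualify(mtreeSchema, c.table))
	}
}

func main() {
	// Captured once when the replication stream starts; a SIGHUP
	// reload mid-stream cannot cause mixed-schema writes.
	mtreeSchema := liveSchema()
	processChanges([]cdcMsg{{"orders"}, {"users"}}, mtreeSchema)
}
```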
…rift

Convert aceSchema() from a package-level function reading live config to
a lazy-init method on MerkleTreeTask. The schema is captured on first
call and reused for the task lifetime, preventing a SIGHUP reload from
causing mixed-schema references within a single merkle tree operation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
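The lazy-init pattern this commit describes can be sketched like so. In this sketch liveSchema stands in for config.Get().MTree.Schema, and the real method may additionally need a sync.Once if a task calls it from multiple goroutines:

```go
package main

import "fmt"

// liveSchema stands in for config.Get().MTree.Schema; a var so the
// demo can simulate a reload swapping the value.
var liveSchema = func() string { return "mtree_v1" }

// MerkleTreeTask memoizes the schema for its lifetime.
type MerkleTreeTask struct {
	mtreeSchema string
}

// aceSchema captures the live schema on first call and reuses the
// cached value, so a mid-task config reload cannot mix schema
// references within one merkle tree operation.
func (t *MerkleTreeTask) aceSchema() string {
	if t.mtreeSchema == "" {
		t.mtreeSchema = liveSchema()
	}
	return t.mtreeSchema
}

func main() {
	task := &MerkleTreeTask{}
	first := task.aceSchema()
	liveSchema = func() string { return "mtree_v2" } // simulate a SIGHUP reload
	fmt.Println(first, task.aceSchema())             // both mtree_v1
}
```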
- Handler resolvers now accept *config.Config so each HTTP request uses
  a single config snapshot, preventing mixed values if SIGHUP fires
  mid-handler.
- Extract duplicated SIGHUP reload goroutine into runConfigReloadLoop,
  used by both StartSchedulerCLI (API-only branch) and StartAPIServerCLI.
- Update config_reload_test to verify per-snapshot consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace direct config.Cfg reads with config.Get() in both
  StartSchedulerCLI and StartAPIServerCLI for consistency.
- Drop unused path parameter from config.Reload() since all callers
  rely on the saved CfgPath; the function now reads it under the lock
  internally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mason-sharp mason-sharp force-pushed the feature/ACE/sighup-follow-up branch from 7a3b355 to 03fce10 on March 18, 2026 14:17
rasifr and others added 2 commits March 18, 2026 20:37
Add ReloadSecurityConfig to APIServer that atomically swaps the
certValidator on SIGHUP so allowedCNs and CRL changes take effect
without a restart. Uses atomic.Pointer for lock-free reads per request.

Wire into runConfigReloadLoop so both ace server and
ace start --component=api/all pick up security config changes on SIGHUP.
Security config reload failure is logged as a warning and does not
block the main config reload.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
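The atomic validator swap this commit describes can be sketched as below; certValidator and the allowed-CN check are simplified stand-ins for the real mTLS validator, not its actual fields or methods:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// certValidator is a stand-in for the real mTLS validator state
// (allowed CNs, parsed CRL, etc.).
type certValidator struct {
	allowedCNs map[string]bool
}

type APIServer struct {
	validator atomic.Pointer[certValidator]
}

// ReloadSecurityConfig builds a fresh validator and swaps it in
// atomically; per-request reads stay lock-free via Load().
func (s *APIServer) ReloadSecurityConfig(allowedCNs []string) {
	v := &certValidator{allowedCNs: make(map[string]bool, len(allowedCNs))}
	for _, cn := range allowedCNs {
		v.allowedCNs[cn] = true
	}
	s.validator.Store(v)
}

// allowed is what a request handler would consult per connection.
func (s *APIServer) allowed(cn string) bool {
	v := s.validator.Load()
	return v != nil && v.allowedCNs[cn]
}

func main() {
	s := &APIServer{}
	s.ReloadSecurityConfig([]string{"client-a"})
	fmt.Println(s.allowed("client-a")) // true
	s.ReloadSecurityConfig([]string{"client-b"}) // e.g., on SIGHUP
	fmt.Println(s.allowed("client-a")) // false: the change takes effect without a restart
}
```

Each reload builds a complete new validator before the Store, so in-flight requests see either the old or the new state, never a partially updated one.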
- add canStartAPIServer validation to runConfigReloadLoop so a
  semantically invalid config is rejected on SIGHUP in the API-only
  path, matching the dry-run guard already present in schedulerReloadLoop
- correct stale comment on config.Cfg that implied direct reads were
  acceptable in concurrent code; bare reads are only safe during
  single-threaded startup
- expand docs/scheduling.md SIGHUP section to cover all three
  long-running modes (scheduler, api, all), not just the scheduler

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mason-sharp mason-sharp merged commit 0a2edff into main Mar 18, 2026
3 checks passed
