
MG-216: add openshift/mustgather toolset #159

Open
swghosh wants to merge 2 commits into openshift:main from swghosh:wg-must-gather-mcp-tools

Conversation

@swghosh
Member

@swghosh swghosh commented Feb 25, 2026

Offline debugging and analysis use cases require data parsed from a must-gather archive rather than from a live cluster's kube-apiserver endpoint. Some tools mirror the core toolset, but there are additional tools beyond resources_get, notably journal logs, kubelet logs, etc.

Summary by CodeRabbit

  • New Features

    • Added openshift/mustgather toolset enabling offline analysis of Kubernetes/OpenShift must-gather archives.
    • Introduced tools to query ETCD data, cluster events, pod logs, node diagnostics, and monitoring metrics (Prometheus/AlertManager).
    • Added resource registration capability to MCP servers for dynamic resource management at runtime.
  • Tests

    • Added comprehensive test suites for must-gather toolset and resource registrar functionality.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 25, 2026
@openshift-ci

openshift-ci Bot commented Feb 25, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

// FindContainerDir locates the must-gather container directory
// (e.g., quay-io-okd-scos-content-sha256-...)
func FindContainerDir(basePath string) (string, error) {
	entries, err := os.ReadDir(basePath)

there are a few considerations regarding where the must-gather can be stored:

  • if the server is running locally in a dev setup, it can access the must-gather from the filesystem
  • if it is running on the target cluster, we should probably make sure the must-gather can be read from a PV
  • the last case is where the must-gather needs to be accessed remotely; this will require further discussion on where must-gathers from customer cases are stored and how they can be accessed

@Cali0707

/cc @bentito

IIRC you were also interested in the offline bits

@openshift-ci openshift-ci Bot requested a review from bentito February 26, 2026 20:55
@swghosh swghosh changed the title WIP: add "openshift/mustgather" toolset [WIP] MG-216: add "openshift/mustgather" toolset Feb 27, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 27, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 27, 2026

@swghosh: This pull request references MG-216 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Offline debugging and analysis use cases require data parsed from a must-gather archive rather than from a live cluster's kube-apiserver endpoint. Some tools mirror the core toolset, but there are additional tools beyond resources_get, notably journal logs, kubelet logs, etc.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 4, 2026
@openshift-merge-robot

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@swghosh swghosh force-pushed the wg-must-gather-mcp-tools branch from f10d877 to c3cd32c on April 6, 2026 21:25
@coderabbitai

coderabbitai Bot commented Apr 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This pull request introduces the mustgather toolset for offline analysis of OpenShift must-gather archives. It adds infrastructure for dynamic MCP resource registration, archive loading and indexing capabilities, and implements 23 tools covering ETCD, events, monitoring, pod logs, nodes, and generic resource queries.

Changes

  • Resource Registration Infrastructure
    Files: pkg/api/toolsets.go, pkg/mcp/resource.go, pkg/mcp/gosdk.go, pkg/mcp/mcp.go
    Added ResourceRegistrar and ResourceTemplateRegistrar interfaces to api.ToolHandlerParams, implemented registrars wrapping mcp.Server, and adjusted ServerCapabilities initialization to omit the explicit Resources: nil assignment.

  • Module and Import Registration
    Files: pkg/mcp/modules.go, internal/tools/update-readme/main.go
    Added blank imports for pkg/toolsets/mustgather to register the mustgather toolset at initialization.

  • Must-Gather Data Types
    Files: pkg/ocp/mustgather/types.go
    New file defining 30+ exported Go types for must-gather operations: log options, metadata, ETCD health, Prometheus/AlertManager monitoring payloads, node diagnostics, and resource filtering options.

  • Must-Gather Archive Operations
    Files: pkg/ocp/mustgather/loader.go, pkg/ocp/mustgather/index.go, pkg/ocp/mustgather/provider.go
    New files implementing archive loading (resolver, YAML parsing, namespace- and cluster-scoped resource ingestion), in-memory resource indexing with label/field selector support, and a provider interface exposing resource, log, and diagnostics queries.

  • Must-Gather Backend Utilities
    Files: pkg/ocp/mustgather/etcd.go, pkg/ocp/mustgather/monitoring.go
    New files adding archive-backed accessors for ETCD health/members/object counts and Prometheus/AlertManager status, configuration, rules, targets, and TSDB metrics.

  • Mustgather Toolset Implementation
    Files: pkg/toolsets/mustgather/toolset.go, pkg/toolsets/mustgather/use.go, pkg/toolsets/mustgather/resources.go, pkg/toolsets/mustgather/namespaces.go, pkg/toolsets/mustgather/pod_logs.go, pkg/toolsets/mustgather/events.go, pkg/toolsets/mustgather/nodes.go, pkg/toolsets/mustgather/etcd.go, pkg/toolsets/mustgather/monitoring.go, pkg/toolsets/mustgather/helpers.go
    New toolset implementing 23 server tools: mustgather_use (archive loader/registrar), resource queries, namespace listing, pod log retrieval with filtering/tailing/grep, event queries by resource/time, node diagnostics, kubelet logs, ETCD queries, and Prometheus/AlertManager metrics and status. Includes a provider holder, typed extractors, and formatting utilities.

  • Test Support
    Files: internal/test/mcp.go, pkg/mcp/resource_test.go, pkg/mcp/toolsets_test.go
    Added McpClient resource listing/reading/template methods; introduced ResourceRegistrarSuite to verify resource and resource-template registrar behavior; added TestOpenShiftMustGatherToolset to validate tool metadata against a JSON snapshot.

  • Test Data
    Files: pkg/mcp/testdata/toolsets-openshift-mustgather-tools.json
    New 622-line JSON file defining 23 mustgather tool schemas with required/optional parameters, enums, and descriptions.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Client
    participant MCP Server
    participant Toolset
    participant Provider
    participant Archive

    User->>Client: Call mustgather_use(path)
    Client->>MCP Server: ToolCall mustgather_use
    MCP Server->>Toolset: Execute handler with ResourceRegistrar
    Toolset->>Archive: Load archive from path
    Archive-->>Provider: Create Provider with loaded resources
    Toolset->>Provider: GetMetadata()
    Provider-->>Toolset: metadata (version, timestamp, counts)
    Toolset->>MCP Server: ResourceRegistrar.AddResource(must-gather://...)
    MCP Server->>MCP Server: Store resource with handler
    Toolset-->>Client: Success response with archive info

    User->>Client: Call mustgather_pod_logs_get(namespace, pod)
    Client->>MCP Server: ToolCall mustgather_pod_logs_get
    MCP Server->>Toolset: Execute handler
    Toolset->>Provider: GetPodLog(namespace, pod)
    Provider->>Archive: Read log file from pod directory
    Archive-->>Provider: Log content
    Provider-->>Toolset: Log content string
    Toolset-->>Client: Formatted log output

    User->>Client: ReadResource(must-gather://...)
    Client->>MCP Server: ReadResource request
    MCP Server->>MCP Server: Invoke resource handler
    MCP Server-->>Client: Archive metadata content

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A rabbit hops through archives deep,
Where ETCD secrets safely sleep,
Pod logs and events, nodes and more,
All indexed fast, impossible to ignore!
With must-gather tools, analysis flies—
No OpenShift data escapes my eyes! 🔍

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 34.72%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): the PR title clearly identifies the main change: adding an openshift/mustgather toolset. The title is concise, directly references the Jira issue (MG-216), and accurately summarizes the primary objective.
  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 6, 2026
@swghosh swghosh marked this pull request as ready for review April 6, 2026 21:26
@swghosh swghosh changed the title [WIP] MG-216: add "openshift/mustgather" toolset MG-216: add "openshift/mustgather" toolset Apr 6, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 6, 2026
@openshift-ci openshift-ci Bot requested review from Kaustubh-pande and matzew April 6, 2026 21:26
@swghosh swghosh changed the title MG-216: add "openshift/mustgather" toolset MG-216: add openshift/mustgather toolset Apr 6, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 6, 2026

@swghosh: This pull request references MG-216 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Offline debugging and analysis use cases require data parsed from a must-gather archive rather than from a live cluster's kube-apiserver endpoint. Some tools mirror the core toolset, but there are additional tools beyond resources_get, notably journal logs, kubelet logs, etc.

Summary by CodeRabbit

  • New Features
  • Added must-gather archive analysis toolset for offline OpenShift cluster inspection with ETCD, event, monitoring, node, and pod log tools.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (7)
pkg/toolsets/mustgather/etcd.go (1)

173-175: Add deterministic tie-break for equal counts.

Equal counts currently follow map iteration order. Adding a secondary sort key (resource) makes output stable across runs.

♻️ Proposed refinement
 	sort.Slice(entries, func(i, j int) bool {
-		return entries[i].count > entries[j].count
+		if entries[i].count == entries[j].count {
+			return entries[i].resource < entries[j].resource
+		}
+		return entries[i].count > entries[j].count
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/etcd.go` around lines 173 - 175, Current sort of
entries uses only count so ties are non-deterministic; update the comparator
passed to sort.Slice (the anonymous func used with entries) to break ties by the
resource field: return entries[i].count > entries[j].count || (entries[i].count
== entries[j].count && entries[i].resource < entries[j].resource). This ensures
deterministic, stable ordering for equal counts while keeping primary sort by
count.
pkg/toolsets/mustgather/use.go (1)

56-73: Consider removing previously registered must-gather resource before adding a new one.

When mustgather_use is called multiple times with different paths, the old must-gather://<old-path> resource remains registered alongside the new one. This could lead to stale resources accumulating in the MCP server.

♻️ Proposed fix to clean up old resource
 	// Register MCP resource
 	if params.ResourceRegistrar != nil {
 		resourceURI := "must-gather://" + path
+		// Remove any previously registered must-gather resource
+		// (simple approach: remove the exact URI being registered)
+		params.ResourceRegistrar.RemoveResources(resourceURI)
 		content := fmt.Sprintf("Must-Gather Archive\n"+

Alternatively, track the previously registered URI in package state and remove it before registering the new one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/use.go` around lines 56 - 73, The package currently
registers a new must-gather resource via params.ResourceRegistrar.AddResource
without removing any previously registered must-gather URI, causing stale
entries; add logic in mustgather_use to remove the prior registered resource
before adding the new one by tracking the last registered URI in package state
(e.g., a package-level variable like lastMustGatherURI) and call
params.ResourceRegistrar.RemoveResource(lastMustGatherURI) if non-empty, then
update lastMustGatherURI to the new resourceURI after successful AddResource; if
concurrent calls are possible, protect the package-level tracker with a mutex to
avoid races.
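
For reference, a minimal sketch of the package-state approach this prompt describes. The resourceRegistrar interface below is a hypothetical stand-in; the PR's actual api.ResourceRegistrar method names and signatures may differ.

package mustgather

import "sync"

// resourceRegistrar is a minimal stand-in for the real registrar interface.
type resourceRegistrar interface {
	AddResource(uri, content string)
	RemoveResource(uri string)
}

var (
	lastMustGatherMu  sync.Mutex
	lastMustGatherURI string
)

// registerArchiveResource swaps in the new must-gather resource, removing the
// previously registered URI so stale archives do not accumulate across
// repeated mustgather_use calls. The mutex guards concurrent tool calls.
func registerArchiveResource(reg resourceRegistrar, uri, content string) {
	lastMustGatherMu.Lock()
	defer lastMustGatherMu.Unlock()
	if lastMustGatherURI != "" && lastMustGatherURI != uri {
		reg.RemoveResource(lastMustGatherURI)
	}
	reg.AddResource(uri, content)
	lastMustGatherURI = uri
}
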
pkg/toolsets/mustgather/events.go (1)

110-115: String comparison for timestamps works but is fragile.

The sorting relies on string comparison of lastTimestamp (e.g., "2024-01-15T10:30:00Z" > "2024-01-14T10:30:00Z"). This works for ISO 8601 / RFC 3339 timestamps, which Kubernetes uses, but adding a comment would clarify this assumption.

📝 Suggested documentation
 	// Sort by lastTimestamp descending
+	// String comparison works correctly for RFC 3339 timestamps (lexicographic order = chronological order)
 	sort.Slice(filtered, func(i, j int) bool {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/events.go` around lines 110 - 115, The sort block
that uses sort.Slice(filtered, ...) and compares ti and tj from
unstructured.NestedString(filtered[i].Object, "lastTimestamp") relies on
RFC3339/ISO‑8601 lexical ordering; add a concise comment above this sort
explaining that Kubernetes timestamps are RFC3339 and therefore string
comparison yields correct chronological ordering, and optionally note that if
non‑RFC3339 input is expected the code should parse to time.Time
(time.Parse(time.RFC3339)) before comparing for robustness.
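
If non-RFC 3339 input ever needs to be tolerated, parsing before comparing is the robust route. A self-contained sketch (standalone example, not the PR's code):

package main

import (
	"fmt"
	"sort"
	"time"
)

// sortByTimestampDesc orders RFC 3339 timestamps newest-first, parsing to
// time.Time and falling back to lexical order (which is also chronological
// for well-formed RFC 3339 strings) when parsing fails.
func sortByTimestampDesc(timestamps []string) {
	sort.SliceStable(timestamps, func(i, j int) bool {
		ti, errI := time.Parse(time.RFC3339, timestamps[i])
		tj, errJ := time.Parse(time.RFC3339, timestamps[j])
		if errI != nil || errJ != nil {
			return timestamps[i] > timestamps[j]
		}
		return ti.After(tj)
	})
}

func main() {
	ts := []string{"2024-01-14T10:30:00Z", "2024-01-15T10:30:00Z"}
	sortByTimestampDesc(ts)
	fmt.Println(ts) // [2024-01-15T10:30:00Z 2024-01-14T10:30:00Z]
}
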
pkg/toolsets/mustgather/helpers.go (1)

125-135: Mapping "firing" to [OK] may be semantically misleading.

In healthSymbol, the status "firing" (typically indicating an active alert) is mapped to [OK], which could be confusing when displaying alert states. Consider whether "firing" should map to [WARN] or [FAIL] instead, or document this intentional mapping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/helpers.go` around lines 125 - 135, The healthSymbol
function maps "firing" to "[OK]", which is misleading; update the switch in
healthSymbol so the "firing" string is removed from the OK case and instead
handled as a warning (map "firing" to "[WARN]"), and update any related unit
tests or callers that expect the old behavior; locate the healthSymbol function
and change the case membership accordingly so "firing" falls through to the WARN
branch (or add an explicit case returning "[WARN]").
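
A sketch of the remapping this suggests; the exact set of status strings the PR's healthSymbol handles is an assumption here.

// healthSymbol with "firing" reported as a warning rather than OK.
func healthSymbol(status string) string {
	switch status {
	case "up", "healthy", "ok":
		return "[OK]"
	case "firing", "pending", "degraded":
		return "[WARN]" // an active alert is no longer reported as OK
	case "down", "unhealthy", "error":
		return "[FAIL]"
	default:
		return "[?]"
	}
}
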
pkg/toolsets/mustgather/monitoring.go (2)

395-398: Fragile string manipulation for inserting summary.

Using strings.Replace to insert the summary after the first "\n\n" is brittle and could break if the header format changes. Consider building the output with the summary in the correct position from the start.

♻️ Suggested approach
-	output := "## Prometheus Rules\n\n"
-	if typeFilter != "" {
-		output += fmt.Sprintf("Filter: %s rules only\n\n", typeFilter)
-	}
+	var headerParts []string
+	headerParts = append(headerParts, "## Prometheus Rules")
 
 	// ... counting logic ...
 
-	summary := fmt.Sprintf("Total: %d groups, %d rules\n\n", totalGroups, totalRules)
-	// Insert summary after header
-	output = strings.Replace(output, "\n\n", "\n\n"+summary, 1)
+	output := "## Prometheus Rules\n\n"
+	output += fmt.Sprintf("Total: %d groups, %d rules\n\n", totalGroups, totalRules)
+	if typeFilter != "" {
+		output += fmt.Sprintf("Filter: %s rules only\n\n", typeFilter)
+	}
+	// ... rest of output building

This requires restructuring to count first, then build output.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/monitoring.go` around lines 395 - 398, The current
insertion of summary using strings.Replace on output is brittle; instead compute
totalGroups/totalRules first and construct the final output explicitly rather
than mutating with Replace: build the header, then append the summary (using the
summary variable you already format), then append the remainder of the formatted
rules/body into output so you no longer rely on finding "\n\n" in output (update
the code paths that set summary and output around the variables summary,
totalGroups, totalRules and remove strings.Replace usage).
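
A compact sketch of the count-first restructuring; countRules is a hypothetical helper standing in for the counting logic already present in the file.

// Count first, then build header, summary, and filter note in order,
// with no post-hoc string surgery on the assembled output.
totalGroups, totalRules := countRules(groups) // hypothetical helper
var b strings.Builder
b.WriteString("## Prometheus Rules\n\n")
b.WriteString(fmt.Sprintf("Total: %d groups, %d rules\n\n", totalGroups, totalRules))
if typeFilter != "" {
	b.WriteString(fmt.Sprintf("Filter: %s rules only\n\n", typeFilter))
}
// ... append the formatted rule groups to b, then use b.String() as before ...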

494-526: Consider consolidating duplicate formatting functions.

formatNameValueSection and formatNameValueBytesSection have nearly identical logic, differing only in the value formatter (formatNumber vs formatBytes). Consider a single function with a formatter parameter.

♻️ Proposed consolidation
+// formatNameValueSectionWith formats a list of name-value pairs using the provided value formatter
+func formatNameValueSectionWith(title string, items []mg.NameValue, limit int, formatValue func(int64) string) string {
+	output := fmt.Sprintf("#### %s\n\n", title)
+	if len(items) == 0 {
+		return output + "No data available.\n\n"
+	}
+	count := len(items)
+	if limit > 0 && count > limit {
+		count = limit
+	}
+	for i := 0; i < count; i++ {
+		output += fmt.Sprintf("%-60s %s\n", truncate(items[i].Name, 60), formatValue(items[i].Value))
+	}
+	output += "\n"
+	return output
+}
+
 // formatNameValueSection formats a list of name-value pairs as a section
 func formatNameValueSection(title string, items []mg.NameValue, limit int) string {
-	// ... current implementation ...
+	return formatNameValueSectionWith(title, items, limit, formatNumber)
 }
 
 // formatNameValueBytesSection formats a list of name-value pairs with byte formatting
 func formatNameValueBytesSection(title string, items []mg.NameValue, limit int) string {
-	// ... current implementation ...
+	return formatNameValueSectionWith(title, items, limit, formatBytes)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/monitoring.go` around lines 494 - 526, The two nearly
identical functions formatNameValueSection and formatNameValueBytesSection
should be consolidated into a single function that accepts a value formatter
parameter; implement a new function (e.g.,
formatNameValueSectionWithFormatter(title string, items []mg.NameValue, limit
int, formatValue func(int64) string)) that performs the shared
header/empty-check/limit/truncate loop and calls formatValue(items[i].Value) for
formatting, then replace calls to formatNameValueSection and
formatNameValueBytesSection to use this new function with formatNumber and
formatBytes respectively and remove the duplicated functions.
pkg/ocp/mustgather/loader.go (1)

132-144: Stop traversing pod log trees during resource discovery.

The pods/... guard currently runs only for files, because directories return on Line 136 before that check executes. On large must-gathers this still walks every pod log path under namespaces/*/pods, which can dominate load time even though none of those files will ever be parsed as resources.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ocp/mustgather/loader.go` around lines 132 - 144, The walker callback in
the filepath.Walk call is currently returning on directories before checking for
the "pods/..." prefix, causing the walker to descend into all pod log subtrees;
update the anonymous callback used in filepath.Walk (the func(path string, info
os.FileInfo, err error) error) to compute rel := filepath.Rel(nsDir, path) early
and, if info.IsDir() && strings.HasPrefix(rel,
"pods"+string(filepath.Separator)), return filepath.SkipDir to prevent
descending into pod log directories; keep the existing file-suffix checks for
.yaml/.yml for regular files.
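
A self-contained sketch of the pruning walker this prompt describes, assuming resources are YAML files and pod logs live under a pods/ subtree of each namespace directory:

package mustgather

import (
	"os"
	"path/filepath"
	"strings"
)

// walkNamespaceDir visits resource YAML files under nsDir while skipping
// entire pod log subtrees, so large must-gathers load without traversing
// every log file.
func walkNamespaceDir(nsDir string, visitYAML func(path string) error) error {
	return filepath.Walk(nsDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, relErr := filepath.Rel(nsDir, path)
		if relErr != nil {
			return relErr
		}
		if info.IsDir() {
			// Prune the whole pods/ subtree: log files are never resources.
			if rel == "pods" || strings.HasPrefix(rel, "pods"+string(filepath.Separator)) {
				return filepath.SkipDir
			}
			return nil
		}
		if strings.HasSuffix(path, ".yaml") || strings.HasSuffix(path, ".yml") {
			return visitYAML(path)
		}
		return nil
	})
}
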
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/mcp/mcp.go`:
- Around line 87-90: The Initialize response doesn't advertise the Resources
capability so clients can't discover resources added later via AddResource();
update the initialization capability block (where Prompts, Tools, Logging are
set) to include Resources by adding a Resources field using
mcp.ResourceCapabilities (e.g., set ListChanged based on configuration.Stateless
similar to Prompts/Tools) so the server pre-advertises resource support;
reference the Initialize handling logic that relies on s.resources.len() and the
AddResource() method to ensure clients will know Resources can be registered
dynamically.

In `@pkg/ocp/mustgather/etcd.go`:
- Around line 78-83: ReadETCDFile currently joins filename into
p.metadata.ContainerDir + "etcd_info" directly which allows path traversal;
validate and sanitize filename before reading: reject absolute paths
(filepath.IsAbs), reject any filename that after filepath.Clean contains ".."
path segments or path separators, or alternatively construct the absolute path
and verify it is inside the etcd_info base using filepath.Rel to ensure the
resulting filePath is within the base (p.metadata.ContainerDir + "etcd_info")
before calling os.ReadFile; update ReadETCDFile to perform these checks and
return an error for invalid traversal attempts.

In `@pkg/ocp/mustgather/index.go`:
- Around line 171-217: matchesLabelSelector and matchesFieldSelector currently
ignore malformed clauses and unsupported fields which widens matches; change
them to reject invalid selectors by returning false on any malformed part (e.g.,
kv len != 2 or empty key/value) and in matchesFieldSelector treat unknown keys
(not metadata.name/metadata.namespace) as invalid and return false instead of
continuing; update both functions (matchesLabelSelector and
matchesFieldSelector) to validate each clause strictly and fail-fast when a
clause is malformed or unsupported so callers receive no-match for bad
selectors.
- Around line 105-139: The List method in ResourceIndex builds result.Items and
truncates by opts.Limit without a deterministic ordering, causing
non-reproducible subsets; after populating result.Items (but before applying the
Limit) sort it deterministically (e.g. stable sort by GetNamespace(), then
GetName(), then GetUID() or creation timestamp) so repeated calls return the
same first N items; update ResourceIndex.List to perform this stable sort on
result.Items prior to the existing limit logic referenced by symbols
ResourceIndex.List, result.Items, matchesListOptions, and opts.Limit.

In `@pkg/ocp/mustgather/loader.go`:
- Around line 21-47: Load currently masks IO errors: in Load, do not treat all
FindContainerDir errors as ignorable—only tolerate the specific "container
directory not found" case; for any other error returned by FindContainerDir
return it immediately. Also stop swallowing errors from
loadClusterScopedResources and loadNamespacedResources: if either returns a
non-nil error, propagate that error from Load (do not append nil result or
continue). Update the logic in Load (referencing FindContainerDir, Load,
loadClusterScopedResources, loadNamespacedResources, and loadMetadata) so only
the explicit "container dir not found" condition is ignored and all other errors
are returned.

In `@pkg/ocp/mustgather/monitoring.go`:
- Around line 119-127: GetReplicaNumbers currently treats any non-exact match as
"all" which hides bad input; change the switch on replicaParam in
GetReplicaNumbers to only return []int{0} for "prometheus-k8s-0" or "0" and
[]int{1} for "prometheus-k8s-1" or "1", and make the default branch return nil
(or an empty slice) to signal invalid input instead of returning both replicas;
also update any callers of GetReplicaNumbers to detect len==0 (or nil) and
surface/return a clear validation error indicating the replicaParam was unknown.

In `@pkg/ocp/mustgather/provider.go`:
- Around line 176-201: The readGzipFile function currently swallows read errors
and always returns nil error; update read loop in readGzipFile to treat io.EOF
as normal (break) but return any other non-nil readErr to the caller (after
writing any bytes read), so decompression/truncation failures propagate; locate
the function readGzipFile and modify the loop to check readErr (if readErr ==
io.EOF break else return "", readErr).
- Around line 205-210: TailLines currently returns an empty string when content
ends with a newline because Split produces a trailing empty element; update
TailLines to trim trailing newline characters from content (e.g., use
strings.TrimRight(content, "\n")) before splitting or otherwise ignore the final
empty element so that TailLines("a\nb\n", 1) returns "b"; apply the change
inside the TailLines function to preserve existing behavior for
non-newline-terminated input.

In `@pkg/toolsets/mustgather/etcd.go`:
- Around line 155-197: The code accepts negative limit values from getInt which
can lead to misleading output (e.g., "top -1"); clamp and normalize limit
immediately after it's read: capture the raw value from getInt into a variable
(limit), then if limit < 0 set limit = 0 (treat negative as "show all") or
alternatively set limit = len(counts) to avoid negative slicing; use the
normalized limit when truncating entries and when composing the final summary
message (the "... showing top %d of %d" line) so it never prints a negative
number. Ensure you reference and update uses of limit in the slicing block, the
limit > 0 checks, and the final output message.

In `@pkg/toolsets/mustgather/monitoring.go`:
- Around line 192-197: The code currently calls mg.GetReplicaNumbers(replica)
then always uses only replicas[0], so passing "all" is ignored; update the logic
that uses replicas (the replicas variable and r usage) to handle multiple
entries: loop over the slice returned by mg.GetReplicaNumbers(replica) and
execute the monitoring/target logic for each replica, aggregating or merging
results appropriately (or, if intended, validate and reject "all" earlier and
adjust the tool description), ensuring references to params.GetArguments and
getString remain unchanged and any downstream variables that expected a single
replica are updated to accept per-replica processing.

In `@pkg/toolsets/mustgather/resources.go`:
- Around line 66-70: The current mustgatherResourcesGet uses the package-global
provider via getProvider(), which allows mustgather_use to change a global
`provider` and leak archives across sessions; instead, make the active
must-gather scoped to the session by retrieving the provider from the
request/session context in api.ToolHandlerParams (or adding a session-scoped
provider field) and update mustgatherResourcesGet to use that session-scoped
provider; also update mustgather_use to write the selected archive into the
session-scoped storage rather than the package-global `provider`, and propagate
this change to all mustgather_resources_* handlers so they read the provider
from the session (not getProvider/global `provider`).

---

Nitpick comments:
In `@pkg/ocp/mustgather/loader.go`:
- Around line 132-144: The walker callback in the filepath.Walk call is
currently returning on directories before checking for the "pods/..." prefix,
causing the walker to descend into all pod log subtrees; update the anonymous
callback used in filepath.Walk (the func(path string, info os.FileInfo, err
error) error) to compute rel := filepath.Rel(nsDir, path) early and, if
info.IsDir() && strings.HasPrefix(rel, "pods"+string(filepath.Separator)),
return filepath.SkipDir to prevent descending into pod log directories; keep the
existing file-suffix checks for .yaml/.yml for regular files.

In `@pkg/toolsets/mustgather/etcd.go`:
- Around line 173-175: Current sort of entries uses only count so ties are
non-deterministic; update the comparator passed to sort.Slice (the anonymous
func used with entries) to break ties by the resource field: return
entries[i].count > entries[j].count || (entries[i].count == entries[j].count &&
entries[i].resource < entries[j].resource). This ensures deterministic, stable
ordering for equal counts while keeping primary sort by count.

In `@pkg/toolsets/mustgather/events.go`:
- Around line 110-115: The sort block that uses sort.Slice(filtered, ...) and
compares ti and tj from unstructured.NestedString(filtered[i].Object,
"lastTimestamp") relies on RFC3339/ISO‑8601 lexical ordering; add a concise
comment above this sort explaining that Kubernetes timestamps are RFC3339 and
therefore string comparison yields correct chronological ordering, and
optionally note that if non‑RFC3339 input is expected the code should parse to
time.Time (time.Parse(time.RFC3339)) before comparing for robustness.

In `@pkg/toolsets/mustgather/helpers.go`:
- Around line 125-135: The healthSymbol function maps "firing" to "[OK]", which
is misleading; update the switch in healthSymbol so the "firing" string is
removed from the OK case and instead handled as a warning (map "firing" to
"[WARN]"), and update any related unit tests or callers that expect the old
behavior; locate the healthSymbol function and change the case membership
accordingly so "firing" falls through to the WARN branch (or add an explicit
case returning "[WARN]").

In `@pkg/toolsets/mustgather/monitoring.go`:
- Around line 395-398: The current insertion of summary using strings.Replace on
output is brittle; instead compute totalGroups/totalRules first and construct
the final output explicitly rather than mutating with Replace: build the header,
then append the summary (using the summary variable you already format), then
append the remainder of the formatted rules/body into output so you no longer
rely on finding "\n\n" in output (update the code paths that set summary and
output around the variables summary, totalGroups, totalRules and remove
strings.Replace usage).
- Around line 494-526: The two nearly identical functions formatNameValueSection
and formatNameValueBytesSection should be consolidated into a single function
that accepts a value formatter parameter; implement a new function (e.g.,
formatNameValueSectionWithFormatter(title string, items []mg.NameValue, limit
int, formatValue func(int64) string)) that performs the shared
header/empty-check/limit/truncate loop and calls formatValue(items[i].Value) for
formatting, then replace calls to formatNameValueSection and
formatNameValueBytesSection to use this new function with formatNumber and
formatBytes respectively and remove the duplicated functions.

In `@pkg/toolsets/mustgather/use.go`:
- Around line 56-73: The package currently registers a new must-gather resource
via params.ResourceRegistrar.AddResource without removing any previously
registered must-gather URI, causing stale entries; add logic in mustgather_use
to remove the prior registered resource before adding the new one by tracking
the last registered URI in package state (e.g., a package-level variable like
lastMustGatherURI) and call
params.ResourceRegistrar.RemoveResource(lastMustGatherURI) if non-empty, then
update lastMustGatherURI to the new resourceURI after successful AddResource; if
concurrent calls are possible, protect the package-level tracker with a mutex to
avoid races.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a666ab20-563f-4424-9df4-9b31e2b85c53

📥 Commits

Reviewing files that changed from the base of the PR and between 46c45a9 and c3cd32c.

📒 Files selected for processing (24)
  • go.mod
  • internal/tools/update-readme/main.go
  • pkg/api/toolsets.go
  • pkg/mcp/gosdk.go
  • pkg/mcp/mcp.go
  • pkg/mcp/modules.go
  • pkg/mcp/testdata/toolsets-openshift-mustgather-tools.json
  • pkg/mcp/toolsets_test.go
  • pkg/ocp/mustgather/etcd.go
  • pkg/ocp/mustgather/index.go
  • pkg/ocp/mustgather/loader.go
  • pkg/ocp/mustgather/monitoring.go
  • pkg/ocp/mustgather/provider.go
  • pkg/ocp/mustgather/types.go
  • pkg/toolsets/mustgather/etcd.go
  • pkg/toolsets/mustgather/events.go
  • pkg/toolsets/mustgather/helpers.go
  • pkg/toolsets/mustgather/monitoring.go
  • pkg/toolsets/mustgather/namespaces.go
  • pkg/toolsets/mustgather/nodes.go
  • pkg/toolsets/mustgather/pod_logs.go
  • pkg/toolsets/mustgather/resources.go
  • pkg/toolsets/mustgather/toolset.go
  • pkg/toolsets/mustgather/use.go

Comment thread pkg/mcp/mcp.go (Outdated)

Comment thread pkg/ocp/mustgather/etcd.go
Comment on lines +78 to +83
func (p *Provider) ReadETCDFile(filename string) ([]byte, error) {
	filePath := filepath.Join(p.metadata.ContainerDir, "etcd_info", filename)
	data, err := os.ReadFile(filePath)
	if err != nil {
		return nil, fmt.Errorf("ETCD file %s not found: %w", filename, err)
	}

⚠️ Potential issue | 🟠 Major

Harden ReadETCDFile against path traversal.

At Line 79, filename is joined directly into the archive path. If a non-constant/untrusted filename reaches this method later, ../ traversal could read files outside etcd_info.

🔒 Proposed fix
 import (
 	"encoding/json"
 	"fmt"
 	"os"
 	"path/filepath"
+	"strings"
 )
@@
 func (p *Provider) ReadETCDFile(filename string) ([]byte, error) {
-	filePath := filepath.Join(p.metadata.ContainerDir, "etcd_info", filename)
+	clean := filepath.Clean(filename)
+	if filepath.IsAbs(clean) || clean == ".." || strings.HasPrefix(clean, ".."+string(filepath.Separator)) {
+		return nil, fmt.Errorf("invalid ETCD filename: %q", filename)
+	}
+
+	baseDir := filepath.Join(p.metadata.ContainerDir, "etcd_info")
+	filePath := filepath.Join(baseDir, clean)
+	rel, err := filepath.Rel(baseDir, filePath)
+	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
+		return nil, fmt.Errorf("invalid ETCD filename: %q", filename)
+	}
+
 	data, err := os.ReadFile(filePath)
 	if err != nil {
 		return nil, fmt.Errorf("ETCD file %s not found: %w", filename, err)
 	}
 	return data, nil
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (revised function; the original snippet appears above)

func (p *Provider) ReadETCDFile(filename string) ([]byte, error) {
	clean := filepath.Clean(filename)
	if filepath.IsAbs(clean) || clean == ".." || strings.HasPrefix(clean, ".."+string(filepath.Separator)) {
		return nil, fmt.Errorf("invalid ETCD filename: %q", filename)
	}

	baseDir := filepath.Join(p.metadata.ContainerDir, "etcd_info")
	filePath := filepath.Join(baseDir, clean)
	rel, err := filepath.Rel(baseDir, filePath)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return nil, fmt.Errorf("invalid ETCD filename: %q", filename)
	}

	data, err := os.ReadFile(filePath)
	if err != nil {
		return nil, fmt.Errorf("ETCD file %s not found: %w", filename, err)
	}
	return data, nil
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ocp/mustgather/etcd.go` around lines 78 - 83, ReadETCDFile currently
joins filename into p.metadata.ContainerDir + "etcd_info" directly which allows
path traversal; validate and sanitize filename before reading: reject absolute
paths (filepath.IsAbs), reject any filename that after filepath.Clean contains
".." path segments or path separators, or alternatively construct the absolute
path and verify it is inside the etcd_info base using filepath.Rel to ensure the
resulting filePath is within the base (p.metadata.ContainerDir + "etcd_info")
before calling os.ReadFile; update ReadETCDFile to perform these checks and
return an error for invalid traversal attempts.

Comment on lines +105 to +139
func (idx *ResourceIndex) List(_ context.Context, gvk schema.GroupVersionKind, namespace string, opts ListOptions) *unstructured.UnstructuredList {
	result := &unstructured.UnstructuredList{}

	if namespace != "" {
		// Namespace-scoped query
		nsMap, ok := idx.byNamespace[namespace]
		if !ok {
			return result
		}
		gvkMap, ok := nsMap[gvk]
		if !ok {
			return result
		}
		for _, obj := range gvkMap {
			if matchesListOptions(obj, opts) {
				result.Items = append(result.Items, *obj.DeepCopy())
			}
		}
	} else {
		// Cluster-wide query
		gvkMap, ok := idx.byGVK[gvk]
		if !ok {
			return result
		}
		for _, obj := range gvkMap {
			if matchesListOptions(obj, opts) {
				result.Items = append(result.Items, *obj.DeepCopy())
			}
		}
	}

	// Apply limit
	if opts.Limit > 0 && len(result.Items) > opts.Limit {
		result.Items = result.Items[:opts.Limit]
	}

⚠️ Potential issue | 🟠 Major

Sort before applying limit.

mustgather_resources_list uses this path for user-visible results, but there is no stable ordering before opts.Limit is applied. Two identical calls can therefore return different subsets, which makes archive analysis hard to reproduce when users work with a small limit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ocp/mustgather/index.go` around lines 105 - 139, The List method in
ResourceIndex builds result.Items and truncates by opts.Limit without a
deterministic ordering, causing non-reproducible subsets; after populating
result.Items (but before applying the Limit) sort it deterministically (e.g.
stable sort by GetNamespace(), then GetName(), then GetUID() or creation
timestamp) so repeated calls return the same first N items; update
ResourceIndex.List to perform this stable sort on result.Items prior to the
existing limit logic referenced by symbols ResourceIndex.List, result.Items,
matchesListOptions, and opts.Limit.
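
One way to make the ordering deterministic, run just before the existing limit logic; a sketch, not the PR's code:

package mustgather

import (
	"sort"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// sortItemsDeterministically orders list items by namespace, name, then UID
// so that applying opts.Limit afterwards always returns the same subset.
func sortItemsDeterministically(items []unstructured.Unstructured) {
	sort.SliceStable(items, func(i, j int) bool {
		a, b := &items[i], &items[j]
		if a.GetNamespace() != b.GetNamespace() {
			return a.GetNamespace() < b.GetNamespace()
		}
		if a.GetName() != b.GetName() {
			return a.GetName() < b.GetName()
		}
		return a.GetUID() < b.GetUID()
	})
}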

Comment on lines +171 to +217
func matchesLabelSelector(labels map[string]string, selector string) bool {
	parts := strings.Split(selector, ",")
	for _, part := range parts {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 {
			continue
		}
		key := strings.TrimSpace(kv[0])
		value := strings.TrimSpace(kv[1])
		if labels[key] != value {
			return false
		}
	}
	return true
}

func matchesFieldSelector(obj *unstructured.Unstructured, selector string) bool {
	parts := strings.Split(selector, ",")
	for _, part := range parts {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 {
			continue
		}
		key := strings.TrimSpace(kv[0])
		value := strings.TrimSpace(kv[1])

		switch key {
		case "metadata.name":
			if obj.GetName() != value {
				return false
			}
		case "metadata.namespace":
			if obj.GetNamespace() != value {
				return false
			}
		}
	}
	return true
}

⚠️ Potential issue | 🟠 Major

Reject malformed selectors instead of matching everything.

pkg/toolsets/mustgather/resources.go (Lines 106-118) forwards raw selector strings here. Right now malformed label clauses and unsupported field names are skipped with continue, so typos like app or spec.nodeName=... can silently broaden the result set instead of failing fast.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ocp/mustgather/index.go` around lines 171 - 217, matchesLabelSelector and
matchesFieldSelector currently ignore malformed clauses and unsupported fields
which widens matches; change them to reject invalid selectors by returning false
on any malformed part (e.g., kv len != 2 or empty key/value) and in
matchesFieldSelector treat unknown keys (not metadata.name/metadata.namespace)
as invalid and return false instead of continuing; update both functions
(matchesLabelSelector and matchesFieldSelector) to validate each clause strictly
and fail-fast when a clause is malformed or unsupported so callers receive
no-match for bad selectors.
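
A fail-fast sketch of the label matcher (the field matcher would change analogously); equality-only, like the original:

package mustgather

import "strings"

// matchesLabelSelectorStrict rejects malformed clauses (missing '=' or empty
// key) by matching nothing, instead of silently widening the result set.
func matchesLabelSelectorStrict(labels map[string]string, selector string) bool {
	for _, part := range strings.Split(selector, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 || strings.TrimSpace(kv[0]) == "" {
			return false // malformed clause: no match rather than match-all
		}
		if labels[strings.TrimSpace(kv[0])] != strings.TrimSpace(kv[1]) {
			return false
		}
	}
	return true
}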

Comment on lines +21 to +47
func Load(path string) (*LoadResult, error) {
	containerDir, err := FindContainerDir(path)
	if err != nil {
		// If no container dir found, use the path directly
		containerDir = path
	}

	metadata := MustGatherMetadata{
		Path:         path,
		ContainerDir: containerDir,
	}

	// Load metadata files
	loadMetadata(containerDir, &metadata)

	// Load all resources
	var resources []*unstructured.Unstructured

	clusterResources, err := loadClusterScopedResources(containerDir)
	if err == nil {
		resources = append(resources, clusterResources...)
	}

	namespacedResources, err := loadNamespacedResources(containerDir)
	if err == nil {
		resources = append(resources, namespacedResources...)
	}

⚠️ Potential issue | 🟠 Major

Don't treat unreadable archives as a successful load.

Load falls back to containerDir = path for any FindContainerDir error, and then drops the resource-loader errors on Lines 39-47. A nonexistent path, file path, or permission failure can therefore "load" with zero resources and only fail later as false not found results. Please only ignore the "container directory not found" case and return other I/O errors immediately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ocp/mustgather/loader.go` around lines 21 - 47, Load currently masks IO
errors: in Load, do not treat all FindContainerDir errors as ignorable—only
tolerate the specific "container directory not found" case; for any other error
returned by FindContainerDir return it immediately. Also stop swallowing errors
from loadClusterScopedResources and loadNamespacedResources: if either returns a
non-nil error, propagate that error from Load (do not append nil result or
continue). Update the logic in Load (referencing FindContainerDir, Load,
loadClusterScopedResources, loadNamespacedResources, and loadMetadata) so only
the explicit "container dir not found" condition is ignored and all other errors
are returned.
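
A sketch of the narrowed error handling inside Load. ErrContainerDirNotFound is a hypothetical sentinel that FindContainerDir would need to return for errors.Is to distinguish the tolerable case:

containerDir, err := FindContainerDir(path)
if err != nil {
	if !errors.Is(err, ErrContainerDirNotFound) {
		return nil, err // real I/O failure: do not mask it
	}
	containerDir = path // only the "not found" case falls back
}

clusterResources, err := loadClusterScopedResources(containerDir)
if err != nil {
	return nil, fmt.Errorf("loading cluster-scoped resources: %w", err)
}
namespacedResources, err := loadNamespacedResources(containerDir)
if err != nil {
	return nil, fmt.Errorf("loading namespaced resources: %w", err)
}
resources := append(clusterResources, namespacedResources...)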

Comment on lines +176 to +201
func readGzipFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer func() { _ = f.Close() }()

	reader, err := gzip.NewReader(f)
	if err != nil {
		return "", err
	}
	defer func() { _ = reader.Close() }()

	var sb strings.Builder
	buf := make([]byte, 32*1024)
	for {
		n, readErr := reader.Read(buf)
		if n > 0 {
			sb.Write(buf[:n])
		}
		if readErr != nil {
			break
		}
	}

	return sb.String(), nil

⚠️ Potential issue | 🟠 Major

Return decompression failures to the caller.

Any non-nil readErr just exits the loop, and the function then returns sb.String(), nil. A truncated or corrupt kubelet log will therefore look like a successful read with missing tail data, which is very hard to spot during troubleshooting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ocp/mustgather/provider.go` around lines 176 - 201, The readGzipFile
function currently swallows read errors and always returns nil error; update
read loop in readGzipFile to treat io.EOF as normal (break) but return any other
non-nil readErr to the caller (after writing any bytes read), so
decompression/truncation failures propagate; locate the function readGzipFile
and modify the loop to check readErr (if readErr == io.EOF break else return "",
readErr).
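
One way to apply the fix is to replace the manual read loop with io.Copy, which propagates decompression errors by construction. A sketch:

package mustgather

import (
	"compress/gzip"
	"io"
	"os"
	"strings"
)

// readGzipFile decompresses path into a string, surfacing any read or
// decompression error instead of silently returning partial content.
func readGzipFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer func() { _ = f.Close() }()

	reader, err := gzip.NewReader(f)
	if err != nil {
		return "", err
	}
	defer func() { _ = reader.Close() }()

	var sb strings.Builder
	if _, err := io.Copy(&sb, reader); err != nil {
		return "", err // corrupt or truncated gzip data surfaces here
	}
	return sb.String(), nil
}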

Comment on lines +205 to +210
func TailLines(content string, n int) string {
	lines := strings.Split(content, "\n")
	if len(lines) <= n {
		return content
	}
	return strings.Join(lines[len(lines)-n:], "\n")

⚠️ Potential issue | 🟠 Major

Trim trailing newlines before tailing.

Newline-terminated logs currently grow an empty final element, so tail=1 on a\nb\n returns the blank trailer instead of b. That affects both pod log and kubelet log tailing.

Suggested fix
 func TailLines(content string, n int) string {
-	lines := strings.Split(content, "\n")
+	hadTrailingNewline := strings.HasSuffix(content, "\n")
+	trimmed := strings.TrimRight(content, "\n")
+	lines := strings.Split(trimmed, "\n")
 	if len(lines) <= n {
 		return content
 	}
-	return strings.Join(lines[len(lines)-n:], "\n")
+	tail := strings.Join(lines[len(lines)-n:], "\n")
+	if hadTrailingNewline {
+		tail += "\n"
+	}
+	return tail
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ocp/mustgather/provider.go` around lines 205 - 210, TailLines currently
returns an empty string when content ends with a newline because Split produces
a trailing empty element; update TailLines to trim trailing newline characters
from content (e.g., use strings.TrimRight(content, "\n")) before splitting or
otherwise ignore the final empty element so that TailLines("a\nb\n", 1) returns
"b"; apply the change inside the TailLines function to preserve existing
behavior for non-newline-terminated input.

Comment on lines +155 to +197
limit := getInt(params.GetArguments(), "limit", 50)

counts, err := p.GetETCDObjectCount()
if err != nil {
	return api.NewToolCallResult("", fmt.Errorf("failed to get ETCD object counts: %w", err)), nil
}

// Sort by count descending
type entry struct {
	resource string
	count    int64
}
entries := make([]entry, 0, len(counts))
var total int64
for k, v := range counts {
	entries = append(entries, entry{resource: k, count: v})
	total += v
}
sort.Slice(entries, func(i, j int) bool {
	return entries[i].count > entries[j].count
})

if limit > 0 && len(entries) > limit {
	entries = entries[:limit]
}

output := fmt.Sprintf("## ETCD Object Counts\n\nTotal objects: %s across %d resource types\n\n", formatNumber(total), len(counts))

// Find max resource name length for alignment
maxLen := 0
for _, e := range entries {
	if len(e.resource) > maxLen {
		maxLen = len(e.resource)
	}
}

for _, e := range entries {
	output += fmt.Sprintf("%-*s %s\n", maxLen, e.resource, formatNumber(e.count))
}

if limit > 0 && len(counts) > limit {
	output += fmt.Sprintf("\n... showing top %d of %d resource types\n", limit, len(counts))
}

⚠️ Potential issue | 🟡 Minor

Validate limit to avoid invalid output states.

At Line 155, non-positive values are accepted; e.g., negative input can produce misleading text at Line 196 (top -1).

🛠 Proposed fix
 	limit := getInt(params.GetArguments(), "limit", 50)
+	if limit <= 0 {
+		return api.NewToolCallResult("", fmt.Errorf("limit must be greater than 0")), nil
+	}
@@
-	if limit > 0 && len(entries) > limit {
+	if len(entries) > limit {
 		entries = entries[:limit]
 	}
@@
-	if limit > 0 && len(counts) > limit {
+	if len(counts) > limit {
 		output += fmt.Sprintf("\n... showing top %d of %d resource types\n", limit, len(counts))
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (revised version; the original snippet appears above)

limit := getInt(params.GetArguments(), "limit", 50)
if limit <= 0 {
	return api.NewToolCallResult("", fmt.Errorf("limit must be greater than 0")), nil
}

counts, err := p.GetETCDObjectCount()
if err != nil {
	return api.NewToolCallResult("", fmt.Errorf("failed to get ETCD object counts: %w", err)), nil
}

// Sort by count descending
type entry struct {
	resource string
	count    int64
}
entries := make([]entry, 0, len(counts))
var total int64
for k, v := range counts {
	entries = append(entries, entry{resource: k, count: v})
	total += v
}
sort.Slice(entries, func(i, j int) bool {
	return entries[i].count > entries[j].count
})

if len(entries) > limit {
	entries = entries[:limit]
}

output := fmt.Sprintf("## ETCD Object Counts\n\nTotal objects: %s across %d resource types\n\n", formatNumber(total), len(counts))

// Find max resource name length for alignment
maxLen := 0
for _, e := range entries {
	if len(e.resource) > maxLen {
		maxLen = len(e.resource)
	}
}

for _, e := range entries {
	output += fmt.Sprintf("%-*s %s\n", maxLen, e.resource, formatNumber(e.count))
}

if len(counts) > limit {
	output += fmt.Sprintf("\n... showing top %d of %d resource types\n", limit, len(counts))
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/etcd.go` around lines 155 - 197, The code accepts
negative limit values from getInt which can lead to misleading output (e.g.,
"top -1"); clamp and normalize limit immediately after it's read: capture the
raw value from getInt into a variable (limit), then if limit < 0 set limit = 0
(treat negative as "show all") or alternatively set limit = len(counts) to avoid
negative slicing; use the normalized limit when truncating entries and when
composing the final summary message (the "... showing top %d of %d" line) so it
never prints a negative number. Ensure you reference and update uses of limit in
the slicing block, the limit > 0 checks, and the final output message.

Comment on lines +192 to +197
args := params.GetArguments()
replica := getString(args, "replica", "0")
healthFilter := getString(args, "health", "")

replicas := mg.GetReplicaNumbers(replica)
r := replicas[0]

⚠️ Potential issue | 🟡 Minor

Replica handling ignores "all" option for targets tool.

The tool description says replica can be "0, 1, or all" with default "0", but the implementation always uses only replicas[0] regardless of input. If "all" is passed, it will return [0, 1] but only the first element is used.

🐛 Proposed fix to honor "all" or clarify description

Option 1: Update description to clarify only one replica is used:

-						"replica": {Type: "string", Description: "Prometheus replica (0, 1, or all). Default: 0"},
+						"replica": {Type: "string", Description: "Prometheus replica (0 or 1). Default: 0"},

Option 2: Support multiple replicas (more complex):

 	replicas := mg.GetReplicaNumbers(replica)
-	r := replicas[0]
 
-	targets, err := p.GetPrometheusActiveTargets(r)
+	output := ""
+	for _, r := range replicas {
+		targets, err := p.GetPrometheusActiveTargets(r)
+		// ... handle each replica
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/monitoring.go` around lines 192 - 197, The code
currently calls mg.GetReplicaNumbers(replica) then always uses only replicas[0],
so passing "all" is ignored; update the logic that uses replicas (the replicas
variable and r usage) to handle multiple entries: loop over the slice returned
by mg.GetReplicaNumbers(replica) and execute the monitoring/target logic for
each replica, aggregating or merging results appropriately (or, if intended,
validate and reject "all" earlier and adjust the tool description), ensuring
references to params.GetArguments and getString remain unchanged and any
downstream variables that expected a single replica are updated to accept
per-replica processing.
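
A sketch of option 2; GetPrometheusActiveTargets follows the snippet above, while formatTargets is a hypothetical formatter:

var out strings.Builder
for _, r := range mg.GetReplicaNumbers(replica) {
	targets, err := p.GetPrometheusActiveTargets(r)
	if err != nil {
		return api.NewToolCallResult("", fmt.Errorf("replica %d: %w", r, err)), nil
	}
	out.WriteString(fmt.Sprintf("## prometheus-k8s-%d targets\n\n", r))
	out.WriteString(formatTargets(targets, healthFilter)) // hypothetical helper
}
return api.NewToolCallResult(out.String(), nil), nil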

Comment thread pkg/toolsets/mustgather/resources.go Outdated
Comment on lines +66 to +70
func mustgatherResourcesGet(params api.ToolHandlerParams) (*api.ToolCallResult, error) {
p, err := getProvider()
if err != nil {
return api.NewToolCallResult("", err), nil
}

⚠️ Potential issue | 🟠 Major

Scope the active must-gather per session, not process-wide.

The lookup here is backed by the package-global provider in pkg/toolsets/mustgather/helpers.go (Lines 11-31). Once another client calls mustgather_use, subsequent mustgather_resources_* calls can start reading that other archive, which breaks isolation and can leak customer data between concurrent sessions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/resources.go` around lines 66 - 70, The current
mustgatherResourcesGet uses the package-global provider via getProvider(), which
allows mustgather_use to change a global `provider` and leak archives across
sessions; instead, make the active must-gather scoped to the session by
retrieving the provider from the request/session context in
api.ToolHandlerParams (or adding a session-scoped provider field) and update
mustgatherResourcesGet to use that session-scoped provider; also update
mustgather_use to write the selected archive into the session-scoped storage
rather than the package-global `provider`, and propagate this change to all
mustgather_resources_* handlers so they read the provider from the session (not
getProvider/global `provider`).
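
A minimal sketch of what session scoping could look like, assuming a stable session ID can be derived from api.ToolHandlerParams; the accessor and the import path below are assumptions, not the PR's actual API:

package mustgather

import (
	"fmt"
	"sync"

	mg "github.com/openshift/kubernetes-mcp-server/pkg/ocp/mustgather" // assumed import path
)

// sessionProviders maps an MCP session ID to that session's active
// must-gather provider, so concurrent clients never read each other's archives.
var (
	sessionMu        sync.RWMutex
	sessionProviders = map[string]*mg.Provider{}
)

// setSessionProvider is what mustgather_use would call after loading an archive.
func setSessionProvider(sessionID string, p *mg.Provider) {
	sessionMu.Lock()
	defer sessionMu.Unlock()
	sessionProviders[sessionID] = p
}

// sessionProvider resolves the provider for the calling session only.
func sessionProvider(sessionID string) (*mg.Provider, error) {
	sessionMu.RLock()
	defer sessionMu.RUnlock()
	p, ok := sessionProviders[sessionID]
	if !ok {
		return nil, fmt.Errorf("no must-gather loaded in this session; call mustgather_use first")
	}
	return p, nil
}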

@openshift-ci-robot

openshift-ci-robot commented Apr 6, 2026

@swghosh: This pull request references MG-216 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Offline debugging analysis use cases require us to get data parsed from a must-gather archive rather than online cluster's kapiserver endpoint. Some tools are similar to the core toolset but there are more tools beyond resources_get especially journal logs, kubelet logs, etc.

Summary by CodeRabbit

  • New Features
  • Added must-gather archive analysis toolset for offline OpenShift inspection: archive loading/indexing, resource listing/getting, namespaces, ETCD checks, events queries, monitoring (Prometheus/Alertmanager), node diagnostics, and pod log retrieval/grep.
  • Documentation
  • README tooling now includes the must-gather toolset in generated tool listings.
  • Tests
  • Added test to verify must-gather toolset is listed and matches the expected snapshot.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/toolsets/mustgather/events.go`:
- Around line 36-37: The limit parameter across mustgather event endpoints is
currently uncapped/treated as unlimited; update the validation and defaulting so
every "limit" schema and parser enforces a sane default and an upper cap (e.g.,
default=100, max=1000) and treats non-positive values as the default;
specifically modify the limit schema entries and the parsing logic used by
mustgather_events_by_resource and other event handlers in
pkg/toolsets/mustgather/events.go so they clamp incoming limit to the range [1,
MAX_LIMIT] (use a named constant MAX_LIMIT) and consistently apply the default
when absent or <=0.
- Around line 134-139: The sort/filter is currently comparing raw timestamp
strings from unstructured.NestedString for "lastTimestamp", which misorders
events and drops ones missing or with unparsable strings; update the sort and
any filter logic (the sort.Slice call operating on variable filtered and other
similar blocks) to extract "lastTimestamp", "eventTime", and "firstTimestamp"
via unstructured.NestedString, parse them with time.Parse(time.RFC3339) (or try
multiple Kubernetes timestamp layouts), fall back to zero time when parsing
fails, and sort by the resulting time.Time values (newest first); apply the same
parsing-and-fallback approach to the other occurrences noted (the blocks around
the other sort/filter usages at the referenced locations) so all comparisons use
parsed times rather than raw strings.
- Around line 233-240: The code parses untilStr into untilTime but doesn't
reject inverted ranges; add a validation after parsing that when hasUntil is
true and sinceStr is non-empty (or sinceTime was parsed) you compare untilTime
and sinceTime and return an api.NewToolCallResult error if untilTime is before
sinceTime (e.g., "invalid time range: 'until' must not be before 'since'"),
updating the same error-return path used in this file; reference the untilTime,
hasUntil, untilStr and sinceStr/sinceTime variables to locate where to add this
check.

In `@pkg/toolsets/mustgather/pod_logs.go`:
- Around line 141-151: The current code calls p.GetPodLog(...) and then does
strings.Split(logs, "\n"), which materializes the entire container log; change
the implementation to a streaming/reader-based path (e.g., add or use a
GetPodLogStream/ReadCloser variant on the provider that accepts mg.PodLogOptions
and returns an io.ReadCloser or stream) and replace the strings.Split usage with
a bufio.Scanner (or equivalent) to scan lines incrementally; while scanning,
apply the grep/filter logic and keep only the needed tail matches using a
fixed-size ring buffer or deque to avoid buffering the whole log, ensure you
close the stream, and propagate errors the same way as before (references:
GetPodLog, mg.PodLogOptions, and the function performing the current
split/grep).
- Around line 77-94: The code currently accepts any integer for tail (assigned
via tail := getInt(args, "tail", 0)) and forwards negative values to the
provider; add explicit validation after retrieving tail to reject negative
values: if tail < 0 return api.NewToolCallResult("", fmt.Errorf("tail must be
non-negative")), nil. Apply the same check in the other handler that reads tail
(the similar block around where tail is assigned for the grep/mustgather grep
path) so both mustgatherPodLogsGet/code paths consistently return a validation
error for negative tail values rather than forwarding or treating them as
unlimited.
- Around line 168-185: Compute and store the total number of matches before you
truncate matchingLines for tailing, e.g. capture totalMatches :=
len(matchingLines) immediately before the block that slices matchingLines when
tail > 0, then use totalMatches in the header’s "Found %d matching line(s)"
message while still slicing matchingLines for display; update the code around
the matchingLines/tail logic and the header construction (symbols:
matchingLines, tail, totalMatches, header, namespace, pod, filter, container,
previous, caseInsensitive) so the summary shows the full match count but the
output only contains the tailed subset.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cabd9a7d-7a1f-4b77-bd60-5d401b3e8d88

📥 Commits

Reviewing files that changed from the base of the PR and between c3cd32c and 99db57c.

📒 Files selected for processing (3)
  • pkg/mcp/testdata/toolsets-openshift-mustgather-tools.json
  • pkg/toolsets/mustgather/events.go
  • pkg/toolsets/mustgather/pod_logs.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/mcp/testdata/toolsets-openshift-mustgather-tools.json

Comment on lines +36 to +37
"limit": {Type: "integer", Description: "Maximum number of events to return (default: 100)"},
},

⚠️ Potential issue | 🟠 Major

Normalize and cap result limits across all event tools.

limit is currently uncapped (and effectively unlimited for non-positive values), and mustgather_events_by_resource has no limit at all. Large archives can produce oversized responses.

Proposed fix
+const (
+	defaultEventsListLimit   = 100
+	defaultEventsByTimeLimit = 200
+	defaultEventsByResLimit  = 200
+	maxEventsLimit           = 1000
+)
+
+func normalizeLimit(v, def int) int {
+	if v <= 0 {
+		return def
+	}
+	if v > maxEventsLimit {
+		return maxEventsLimit
+	}
+	return v
+}
+
 // schema: mustgather_events_by_resource
 					Properties: map[string]*jsonschema.Schema{
 						"name":      {Type: "string", Description: "Resource name"},
 						"namespace": {Type: "string", Description: "Resource namespace"},
 						"kind":      {Type: "string", Description: "Resource kind (optional, narrows search)"},
+						"limit":     {Type: "integer", Description: "Maximum number of events to return (default: 200, max: 1000)"},
 					},
 
 // mustgatherEventsList
-	limit := getInt(args, "limit", 100)
+	limit := normalizeLimit(getInt(args, "limit", defaultEventsListLimit), defaultEventsListLimit)
 
 // mustgatherEventsByResource
+	limit := normalizeLimit(getInt(args, "limit", defaultEventsByResLimit), defaultEventsByResLimit)
+	if len(matched) > limit {
+		matched = matched[:limit]
+	}
 
 // mustgatherEventsByTime
-	limit := getInt(args, "limit", 200)
+	limit := normalizeLimit(getInt(args, "limit", defaultEventsByTimeLimit), defaultEventsByTimeLimit)

Also applies to: 53-57, 79-80, 101-104, 142-144, 203-206, 222-223, 282-284

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/events.go` around lines 36 - 37, The limit parameter
across mustgather event endpoints is currently uncapped/treated as unlimited;
update the validation and defaulting so every "limit" schema and parser enforces
a sane default and an upper cap (e.g., default=100, max=1000) and treats
non-positive values as the default; specifically modify the limit schema entries
and the parsing logic used by mustgather_events_by_resource and other event
handlers in pkg/toolsets/mustgather/events.go so they clamp incoming limit to
the range [1, MAX_LIMIT] (use a named constant MAX_LIMIT) and consistently apply
the default when absent or <=0.

Comment on lines +134 to +139
// Sort by lastTimestamp descending
sort.Slice(filtered, func(i, j int) bool {
ti, _, _ := unstructured.NestedString(filtered[i].Object, "lastTimestamp")
tj, _, _ := unstructured.NestedString(filtered[j].Object, "lastTimestamp")
return ti > tj
})

⚠️ Potential issue | 🟠 Major

Sort/filter events using parsed timestamps, not raw strings.

Current logic compares timestamp strings and only uses lastTimestamp. That can misorder results and silently exclude valid events when lastTimestamp is absent or unparseable.

Proposed fix
+func parseEventTime(event *unstructured.Unstructured) (time.Time, bool) {
+	for _, path := range [][]string{
+		{"lastTimestamp"},
+		{"eventTime"},
+		{"metadata", "creationTimestamp"},
+	} {
+		ts, found, _ := unstructured.NestedString(event.Object, path...)
+		if !found || ts == "" {
+			continue
+		}
+		if t, err := time.Parse(time.RFC3339, ts); err == nil {
+			return t, true
+		}
+	}
+	return time.Time{}, false
+}
+
 // Example in mustgatherEventsByTime loop
-	lastTimestamp, _, _ := unstructured.NestedString(event.Object, "lastTimestamp")
-	if lastTimestamp == "" {
-		continue
-	}
-	eventTime, err := time.Parse(time.RFC3339, lastTimestamp)
-	if err != nil {
+	eventTime, ok := parseEventTime(event)
+	if !ok {
 		continue
 	}
 
 // Example sort (descending)
-	return ti > tj
+	ti, oki := parseEventTime(&filtered[i])
+	tj, okj := parseEventTime(&filtered[j])
+	if !oki {
+		return false
+	}
+	if !okj {
+		return true
+	}
+	return ti.After(tj)

Also applies to: 193-197, 248-256, 275-280

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/events.go` around lines 134 - 139, The sort/filter is
currently comparing raw timestamp strings from unstructured.NestedString for
"lastTimestamp", which misorders events and drops ones missing or with
unparsable strings; update the sort and any filter logic (the sort.Slice call
operating on variable filtered and other similar blocks) to extract
"lastTimestamp", "eventTime", and "firstTimestamp" via
unstructured.NestedString, parse them with time.Parse(time.RFC3339) (or try
multiple Kubernetes timestamp layouts), fall back to zero time when parsing
fails, and sort by the resulting time.Time values (newest first); apply the same
parsing-and-fallback approach to the other occurrences noted (the blocks around
the other sort/filter usages at the referenced locations) so all comparisons use
parsed times rather than raw strings.

Comment on lines +233 to +240
var untilTime time.Time
hasUntil := untilStr != ""
if hasUntil {
untilTime, err = time.Parse(time.RFC3339, untilStr)
if err != nil {
return api.NewToolCallResult("", fmt.Errorf("invalid until time format, expected RFC3339 (e.g. 2026-01-15T12:00:00Z): %w", err)), nil
}
}

⚠️ Potential issue | 🟡 Minor

Reject inverted time ranges (until before since).

Without this check, invalid time windows return a misleading “No events found” instead of an error.

Proposed fix
 	if hasUntil {
 		untilTime, err = time.Parse(time.RFC3339, untilStr)
 		if err != nil {
 			return api.NewToolCallResult("", fmt.Errorf("invalid until time format, expected RFC3339 (e.g. 2026-01-15T12:00:00Z): %w", err)), nil
 		}
+		if untilTime.Before(sinceTime) {
+			return api.NewToolCallResult("", fmt.Errorf("until must be greater than or equal to since")), nil
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/events.go` around lines 233 - 240, The code parses
untilStr into untilTime but doesn't reject inverted ranges; add a validation
after parsing that when hasUntil is true and sinceStr is non-empty (or sinceTime
was parsed) you compare untilTime and sinceTime and return an
api.NewToolCallResult error if untilTime is before sinceTime (e.g., "invalid
time range: 'until' must not be before 'since'"), updating the same error-return
path used in this file; reference the untilTime, hasUntil, untilStr and
sinceStr/sinceTime variables to locate where to add this check.

Comment on lines +77 to +94
tail := getInt(args, "tail", 0)

if namespace == "" || pod == "" {
return api.NewToolCallResult("", fmt.Errorf("namespace and pod are required")), nil
}

logType := mg.LogTypeCurrent
if previous {
logType = mg.LogTypePrevious
}

logs, err := p.GetPodLog(mg.PodLogOptions{
Namespace: namespace,
Pod: pod,
Container: container,
LogType: logType,
TailLines: tail,
})

⚠️ Potential issue | 🟡 Minor

Reject negative tail values explicitly.

Line 77 and Line 126 accept any integer, but the same invalid input currently behaves differently: mustgatherPodLogsGet forwards it to the provider, while grep silently treats it as unlimited. Returning a validation error here will make both tools deterministic.

🛠️ Suggested fix
 	tail := getInt(args, "tail", 0)
+	if tail < 0 {
+		return api.NewToolCallResult("", fmt.Errorf("tail must be >= 0")), nil
+	}
 
 	if namespace == "" || pod == "" {
 		return api.NewToolCallResult("", fmt.Errorf("namespace and pod are required")), nil
 	}
 	tail := getInt(args, "tail", 0)
 	caseInsensitive := getBool(args, "caseInsensitive", false)
+	if tail < 0 {
+		return api.NewToolCallResult("", fmt.Errorf("tail must be >= 0")), nil
+	}
 
 	if namespace == "" || pod == "" {
 		return api.NewToolCallResult("", fmt.Errorf("namespace and pod are required")), nil
 	}

Also applies to: 126-133

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/pod_logs.go` around lines 77 - 94, The code currently
accepts any integer for tail (assigned via tail := getInt(args, "tail", 0)) and
forwards negative values to the provider; add explicit validation after
retrieving tail to reject negative values: if tail < 0 return
api.NewToolCallResult("", fmt.Errorf("tail must be non-negative")), nil. Apply
the same check in the other handler that reads tail (the similar block around
where tail is assigned for the grep/mustgather grep path) so both
mustgatherPodLogsGet/code paths consistently return a validation error for
negative tail values rather than forwarding or treating them as unlimited.

Comment on lines +141 to +151
logs, err := p.GetPodLog(mg.PodLogOptions{
Namespace: namespace,
Pod: pod,
Container: container,
LogType: logType,
})
if err != nil {
return api.NewToolCallResult("", fmt.Errorf("failed to get pod logs: %w", err)), nil
}

lines := strings.Split(logs, "\n")

⚠️ Potential issue | 🟠 Major

Avoid full-log materialization in the grep path.

Even when the caller only wants the last few matches, Line 141 loads the entire container log and Line 151 splits the whole string before filtering. On large must-gather archives this makes mustgather_pod_logs_grep scale with total log size in both memory and latency, which is risky for the offline-debugging use case this toolset is targeting. A streaming/provider-side grep path would be much safer here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/pod_logs.go` around lines 141 - 151, The current code
calls p.GetPodLog(...) and then does strings.Split(logs, "\n"), which
materializes the entire container log; change the implementation to a
streaming/reader-based path (e.g., add or use a GetPodLogStream/ReadCloser
variant on the provider that accepts mg.PodLogOptions and returns an
io.ReadCloser or stream) and replace the strings.Split usage with a
bufio.Scanner (or equivalent) to scan lines incrementally; while scanning, apply
the grep/filter logic and keep only the needed tail matches using a fixed-size
ring buffer or deque to avoid buffering the whole log, ensure you close the
stream, and propagate errors the same way as before (references: GetPodLog,
mg.PodLogOptions, and the function performing the current split/grep).

Comment on lines +168 to +185
if tail > 0 && len(matchingLines) > tail {
matchingLines = matchingLines[len(matchingLines)-tail:]
}

header := fmt.Sprintf("Logs for pod %s/%s filtered by '%s'", namespace, pod, filter)
if container != "" {
header += fmt.Sprintf(", container %s", container)
}
if previous {
header += " (previous)"
}
if caseInsensitive {
header += " (case-insensitive)"
}
if tail > 0 {
header += fmt.Sprintf(" (last %d matches)", tail)
}
header += fmt.Sprintf(":\n\nFound %d matching line(s)\n\n", len(matchingLines))

⚠️ Potential issue | 🟡 Minor

Keep the total match count separate from the tailed result.

Line 185 reports len(matchingLines) after Lines 168-170 may already have dropped earlier hits. With tail=10, 100 matches will be reported as Found 10 matching line(s), which makes the summary inaccurate.

🛠️ Suggested fix
-	var matchingLines []string
+	var matchingLines []string
+	totalMatches := 0
 	for _, line := range lines {
 		compareLine := line
 		if caseInsensitive {
 			compareLine = strings.ToLower(line)
 		}
 		if strings.Contains(compareLine, searchFilter) {
+			totalMatches++
 			matchingLines = append(matchingLines, line)
 		}
 	}
@@
-	header += fmt.Sprintf(":\n\nFound %d matching line(s)\n\n", len(matchingLines))
+	header += fmt.Sprintf(":\n\nFound %d matching line(s)", totalMatches)
+	if tail > 0 && totalMatches > len(matchingLines) {
+		header += fmt.Sprintf(", returning last %d", len(matchingLines))
+	}
+	header += "\n\n"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/pod_logs.go` around lines 168 - 185, Compute and
store the total number of matches before you truncate matchingLines for tailing,
e.g. capture totalMatches := len(matchingLines) immediately before the block
that slices matchingLines when tail > 0, then use totalMatches in the header’s
"Found %d matching line(s)" message while still slicing matchingLines for
display; update the code around the matchingLines/tail logic and the header
construction (symbols: matchingLines, tail, totalMatches, header, namespace,
pod, filter, container, previous, caseInsensitive) so the summary shows the full
match count but the output only contains the tailed subset.

@openshift-ci-robot

openshift-ci-robot commented Apr 6, 2026

@swghosh: This pull request references MG-216 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Offline debugging analysis use cases require us to get data parsed from a must-gather archive rather than online cluster's kapiserver endpoint. Some tools are similar to the core toolset but there are more tools beyond resources_get especially journal logs, kubelet logs, etc.

Summary by CodeRabbit

  • New Features
  • Added a must-gather archive analysis toolset for offline OpenShift inspection: archive loading/indexing, resource listing/getting, namespaces, ETCD checks, events queries, monitoring (Prometheus/Alertmanager), node diagnostics, and pod log retrieval/grep. Selecting an archive registers it as an available resource and surfaces metadata/counts.
  • Documentation
  • README generation now includes the must-gather toolset in tool listings.
  • Tests
  • Added test verifying the must-gather toolset is listed and matches the expected snapshot.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/toolsets/mustgather/pod_logs.go (1)

81-81: Clarify that limit returns the first N matching lines.

The current description "Maximum number of lines to return" doesn't indicate whether it's the first or last N. This differs from tail in the grep tool, which returns the last N matches. Consider updating for clarity.

📝 Suggested description update
-						"limit":     {Type: "integer", Description: "Maximum number of lines to return (default: 500)"},
+						"limit":     {Type: "integer", Description: "Maximum number of lines to return from start of range (default: 500)"},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/pod_logs.go` at line 81, Update the JSON schema
description for the "limit" parameter to explicitly state it returns the first N
matching lines (not the last), so callers understand it behaves differently from
grep's tail behavior; locate the "limit" schema entry in
pkg/toolsets/mustgather/pod_logs.go (the `"limit": {Type: "integer",
Description: ...}` entry) and change the Description to something like "Maximum
number of matching lines to return (first N matches, default: 500)" to clearly
indicate it returns the first matches.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/toolsets/mustgather/pod_logs.go`:
- Around line 273-291: The loop in PodLogs that currently silently drops lines
without parseable RFC3339Nano timestamps should count them and surface that
metric: add a skippedLines int counter and increment it whenever you continue
due to spaceIdx == -1 or time.Parse error (the variables in this block are
lines, spaceIdx, lineTime, sinceTime, hasUntil, untilTime). After processing,
include skippedLines in the log header (only when >0) so callers know how many
lines were omitted; update the header-building code that emits the pod log
metadata to append a "(skipped X non-timestamped lines)" note when skippedLines
> 0. Ensure behavior otherwise unchanged.

---

Nitpick comments:
In `@pkg/toolsets/mustgather/pod_logs.go`:
- Line 81: Update the JSON schema description for the "limit" parameter to
explicitly state it returns the first N matching lines (not the last), so
callers understand it behaves differently from grep's tail behavior; locate the
"limit" schema entry in pkg/toolsets/mustgather/pod_logs.go (the `"limit":
{Type: "integer", Description: ...}` entry) and change the Description to
something like "Maximum number of matching lines to return (first N matches,
default: 500)" to clearly indicate it returns the first matches.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0e9fc2f6-fb76-4956-a795-f523a3d7dacb

📥 Commits

Reviewing files that changed from the base of the PR and between 99db57c and 6a66cc7.

📒 Files selected for processing (1)
  • pkg/toolsets/mustgather/pod_logs.go

Comment on lines +273 to +291
for _, line := range lines {
if line == "" {
continue
}
// Each line starts with an RFC3339Nano timestamp followed by a space
spaceIdx := strings.IndexByte(line, ' ')
if spaceIdx == -1 {
continue
}
lineTime, err := time.Parse(time.RFC3339Nano, line[:spaceIdx])
if err != nil {
continue
}
if lineTime.Before(sinceTime) {
continue
}
if hasUntil && lineTime.After(untilTime) {
continue
}

⚠️ Potential issue | 🟡 Minor

Silently skipping lines without parseable timestamps may hide relevant data.

Lines without a valid RFC3339Nano prefix (e.g., multi-line stack traces, malformed entries, or logs collected without --timestamps) are silently dropped. Users won't know if data was lost. Consider either including non-timestamped lines when they immediately follow a matching timestamped line, or at minimum reporting how many lines were skipped.

💡 Suggested improvement to track skipped lines
 	lines := strings.Split(logs, "\n")
 	var matchingLines []string
+	skippedLines := 0
 	for _, line := range lines {
 		if line == "" {
 			continue
 		}
 		// Each line starts with an RFC3339Nano timestamp followed by a space
 		spaceIdx := strings.IndexByte(line, ' ')
 		if spaceIdx == -1 {
+			skippedLines++
 			continue
 		}
 		lineTime, err := time.Parse(time.RFC3339Nano, line[:spaceIdx])
 		if err != nil {
+			skippedLines++
 			continue
 		}

Then include skippedLines in the header if non-zero:

 	header += fmt.Sprintf(":\n\nFound %d matching line(s)\n\n", len(matchingLines))
+	if skippedLines > 0 {
+		header = strings.TrimSuffix(header, "\n\n")
+		header += fmt.Sprintf(" (%d lines skipped due to missing/invalid timestamps)\n\n", skippedLines)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/pod_logs.go` around lines 273 - 291, The loop in
PodLogs that currently silently drops lines without parseable RFC3339Nano
timestamps should count them and surface that metric: add a skippedLines int
counter and increment it whenever you continue due to spaceIdx == -1 or
time.Parse error (the variables in this block are lines, spaceIdx, lineTime,
sinceTime, hasUntil, untilTime). After processing, include skippedLines in the
log header (only when >0) so callers know how many lines were omitted; update
the header-building code that emits the pod log metadata to append a "(skipped X
non-timestamped lines)" note when skippedLines > 0. Ensure behavior otherwise
unchanged.

@swghosh swghosh force-pushed the wg-must-gather-mcp-tools branch from 6a66cc7 to 2ab4bf2 Compare April 7, 2026 05:50
@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026

@swghosh: This pull request references MG-216 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Offline debugging analysis use cases require us to get data parsed from a must-gather archive rather than online cluster's kapiserver endpoint. Some tools are similar to the core toolset but there are more tools beyond resources_get especially journal logs, kubelet logs, etc.

Summary by CodeRabbit

  • New Features
  • Added a must-gather archive analysis toolset for offline OpenShift inspection: archive loading/indexing, resource listing/getting, namespaces, ETCD checks, events queries, monitoring (Prometheus/Alertmanager), node diagnostics, and pod log retrieval/grep. Selecting an archive registers it as an available resource and surfaces metadata and counts.
  • Documentation
  • README generation now includes the must-gather toolset in tool listings.
  • Tests
  • Added test verifying the must-gather toolset is listed and matches the expected snapshot.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
pkg/toolsets/mustgather/pod_logs.go (1)

167-177: ⚠️ Potential issue | 🟠 Major

Stream the filtered log paths instead of buffering the whole file.

Both filtering handlers call GetPodLog and then strings.Split the full string before applying tail / limit, so a request for a handful of matches still scales with the entire log size. A reader/scanner-based provider path would be much safer for large must-gather archives.

Also applies to: 261-271

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/pod_logs.go` around lines 167 - 177, The code
currently calls GetPodLog and then does strings.Split(logs, "\n") (see GetPodLog
and the lines variable) which buffers the entire log into memory before applying
tail/limit; change to a streaming approach: add or use a reader-returning
variant (e.g., GetPodLogStream or modify GetPodLog to return an
io.ReadCloser/reader from PodLogOptions) and replace the strings.Split-based
logic with a bufio.Scanner/reader that applies the filter, tail and limit while
scanning; update both occurrences (the block around the GetPodLog call and the
duplicate at lines 261-271) so they consume and filter lines incrementally and
close the reader when done.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/toolsets/mustgather/pod_logs.go`:
- Around line 233-253: Validate the input early: after reading limit with getInt
(and before parsing times) return an error via api.NewToolCallResult when limit
is negative (require limit >= 0) and when both untilStr and sinceStr are
provided ensure untilTime >= sinceTime (i.e., after parsing sinceTime and
untilTime, if hasUntil and untilTime.Before(sinceTime) return
api.NewToolCallResult with a clear error like "until must be equal or after
since"); update the same validation later where limit and until are used (same
checks around the code paths that reference limit, sinceStr, untilStr,
sinceTime, untilTime, and hasUntil) so callers get a fast, descriptive error
instead of empty results.
- Around line 114-137: The header can be ambiguous when the caller passed an
empty container and the provider resolved it to the first container; update the
code so the actual container name used is surfaced in the header. Modify
GetPodLog (or its caller contract) to return the resolved container name
alongside logs (or expose the resolved container via mg.PodLogOptions/result),
then when building the header in pod_logs.go use that resolved container value
(not the original empty input) so the header prints "container <resolved-name>"
on multi-container pods; apply the same change to the other similar handlers
referenced (the blocks around 167-217 and 261-317) so all logs include the
actual container chosen.
- Around line 295-311: Record the total number of matches before truncation and
use that count in the header instead of the post-truncate slice length: store
totalMatches := len(matchingLines) before the limit logic in the pod_logs.go
function (where matchingLines and limit are handled), then apply the truncation
(matchingLines = matchingLines[:limit] if needed) and build header using
totalMatches for the "Found %d matching line(s)" message while keeping the
returned matchingLines slice limited.

---

Duplicate comments:
In `@pkg/toolsets/mustgather/pod_logs.go`:
- Around line 167-177: The code currently calls GetPodLog and then does
strings.Split(logs, "\n") (see GetPodLog and the lines variable) which buffers
the entire log into memory before applying tail/limit; change to a streaming
approach: add or use a reader-returning variant (e.g., GetPodLogStream or modify
GetPodLog to return an io.ReadCloser/reader from PodLogOptions) and replace the
strings.Split-based logic with a bufio.Scanner/reader that applies the filter,
tail and limit while scanning; update both occurrences (the block around the
GetPodLog call and the duplicate at lines 261-271) so they consume and filter
lines incrementally and close the reader when done.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 80d44601-e66d-434d-bea0-19b88272f14e

📥 Commits

Reviewing files that changed from the base of the PR and between 6a66cc7 and 2ab4bf2.

📒 Files selected for processing (2)
  • pkg/mcp/testdata/toolsets-openshift-mustgather-tools.json
  • pkg/toolsets/mustgather/pod_logs.go

Comment on lines +114 to +137
logs, err := p.GetPodLog(mg.PodLogOptions{
Namespace: namespace,
Pod: pod,
Container: container,
LogType: logType,
TailLines: tail,
})
if err != nil {
return api.NewToolCallResult("", fmt.Errorf("failed to get pod logs: %w", err)), nil
}

header := fmt.Sprintf("Logs for pod %s/%s", namespace, pod)
if container != "" {
header += fmt.Sprintf(", container %s", container)
}
if previous {
header += " (previous)"
}
if tail > 0 {
header += fmt.Sprintf(" (last %d lines)", tail)
}
header += ":\n\n"

return api.NewToolCallResult(header+logs, nil), nil

⚠️ Potential issue | 🟠 Major

Report the actual container when defaulting to one.

pkg/ocp/mustgather/provider.go:57-70 resolves an omitted container to the first available container, but these handlers do not surface which container was actually used. On multi-container pods that makes the response ambiguous and easy to misread.

💡 Suggested fix
+	if container == "" {
+		containers, err := p.ListPodContainers(namespace, pod)
+		if err != nil {
+			return api.NewToolCallResult("", fmt.Errorf("failed to resolve pod containers: %w", err)), nil
+		}
+		if len(containers) == 0 {
+			return api.NewToolCallResult("", fmt.Errorf("no containers found for pod %s/%s", namespace, pod)), nil
+		}
+		container = containers[0]
+	}
+
 	logs, err := p.GetPodLog(mg.PodLogOptions{
 		Namespace: namespace,
 		Pod:       pod,
 		Container: container,
@@
-	if container != "" {
-		header += fmt.Sprintf(", container %s", container)
-	}
+	header += fmt.Sprintf(", container %s", container)

Also applies to: 167-217, 261-317

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/pod_logs.go` around lines 114 - 137, The header can
be ambiguous when the caller passed an empty container and the provider resolved
it to the first container; update the code so the actual container name used is
surfaced in the header. Modify GetPodLog (or its caller contract) to return the
resolved container name alongside logs (or expose the resolved container via
mg.PodLogOptions/result), then when building the header in pod_logs.go use that
resolved container value (not the original empty input) so the header prints
"container <resolved-name>" on multi-container pods; apply the same change to
the other similar handlers referenced (the blocks around 167-217 and 261-317) so
all logs include the actual container chosen.

Comment on lines +233 to +253
limit := getInt(args, "limit", 500)

if namespace == "" || pod == "" {
return api.NewToolCallResult("", fmt.Errorf("namespace and pod are required")), nil
}
if sinceStr == "" {
return api.NewToolCallResult("", fmt.Errorf("since is required (RFC3339 format)")), nil
}

sinceTime, err := time.Parse(time.RFC3339, sinceStr)
if err != nil {
return api.NewToolCallResult("", fmt.Errorf("invalid since time format, expected RFC3339 (e.g. 2026-01-15T10:00:00Z): %w", err)), nil
}

var untilTime time.Time
hasUntil := untilStr != ""
if hasUntil {
untilTime, err = time.Parse(time.RFC3339, untilStr)
if err != nil {
return api.NewToolCallResult("", fmt.Errorf("invalid until time format, expected RFC3339 (e.g. 2026-01-15T12:00:00Z): %w", err)), nil
}

⚠️ Potential issue | 🟡 Minor

Reject invalid by-time inputs up front.

limit < 0 currently disables truncation, and until before since is reported as “no matching lines” instead of a caller error. Failing fast here keeps empty results meaningful.

🛠️ Suggested fix
 	limit := getInt(args, "limit", 500)
+	if limit < 0 {
+		return api.NewToolCallResult("", fmt.Errorf("limit must be >= 0")), nil
+	}
@@
 	if hasUntil {
 		untilTime, err = time.Parse(time.RFC3339, untilStr)
 		if err != nil {
 			return api.NewToolCallResult("", fmt.Errorf("invalid until time format, expected RFC3339 (e.g. 2026-01-15T12:00:00Z): %w", err)), nil
 		}
+		if untilTime.Before(sinceTime) {
+			return api.NewToolCallResult("", fmt.Errorf("until must be greater than or equal to since")), nil
+		}
 	}

Also applies to: 295-296

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/pod_logs.go` around lines 233 - 253, Validate the
input early: after reading limit with getInt (and before parsing times) return
an error via api.NewToolCallResult when limit is negative (require limit >= 0)
and when both untilStr and sinceStr are provided ensure untilTime >= sinceTime
(i.e., after parsing sinceTime and untilTime, if hasUntil and
untilTime.Before(sinceTime) return api.NewToolCallResult with a clear error like
"until must be equal or after since"); update the same validation later where
limit and until are used (same checks around the code paths that reference
limit, sinceStr, untilStr, sinceTime, untilTime, and hasUntil) so callers get a
fast, descriptive error instead of empty results.

Comment on lines +295 to +311
if limit > 0 && len(matchingLines) > limit {
matchingLines = matchingLines[:limit]
}

header := fmt.Sprintf("Logs for pod %s/%s between %s and ", namespace, pod, sinceStr)
if hasUntil {
header += untilStr
} else {
header += "latest"
}
if container != "" {
header += fmt.Sprintf(", container %s", container)
}
if previous {
header += " (previous)"
}
header += fmt.Sprintf(":\n\nFound %d matching line(s)\n\n", len(matchingLines))

⚠️ Potential issue | 🟡 Minor

Keep the total match count separate from the limited output.

After limit truncates matchingLines, the header reports only the returned subset. With 5,000 matches and limit=500, callers are told there were 500 matches.

🛠️ Suggested fix
-	if limit > 0 && len(matchingLines) > limit {
+	totalMatches := len(matchingLines)
+	if limit > 0 && totalMatches > limit {
 		matchingLines = matchingLines[:limit]
 	}
@@
-	header += fmt.Sprintf(":\n\nFound %d matching line(s)\n\n", len(matchingLines))
+	header += fmt.Sprintf(":\n\nFound %d matching line(s)", totalMatches)
+	if limit > 0 && totalMatches > len(matchingLines) {
+		header += fmt.Sprintf(", returning first %d", len(matchingLines))
+	}
+	header += "\n\n"
📝 Committable suggestion

Suggested change
	totalMatches := len(matchingLines)
	if limit > 0 && totalMatches > limit {
		matchingLines = matchingLines[:limit]
	}

	header := fmt.Sprintf("Logs for pod %s/%s between %s and ", namespace, pod, sinceStr)
	if hasUntil {
		header += untilStr
	} else {
		header += "latest"
	}
	if container != "" {
		header += fmt.Sprintf(", container %s", container)
	}
	if previous {
		header += " (previous)"
	}
	header += fmt.Sprintf(":\n\nFound %d matching line(s)", totalMatches)
	if limit > 0 && totalMatches > len(matchingLines) {
		header += fmt.Sprintf(", returning first %d", len(matchingLines))
	}
	header += "\n\n"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/pod_logs.go` around lines 295 - 311, Record the total
number of matches before truncation and use that count in the header instead of
the post-truncate slice length: store totalMatches := len(matchingLines) before
the limit logic in the pod_logs.go function (where matchingLines and limit are
handled), then apply the truncation (matchingLines = matchingLines[:limit] if
needed) and build header using totalMatches for the "Found %d matching line(s)"
message while keeping the returned matchingLines slice limited.


// ReadETCDFile reads a raw ETCD info JSON file and returns its content
func (p *Provider) ReadETCDFile(filename string) ([]byte, error) {
filePath := filepath.Join(p.metadata.ContainerDir, "etcd_info", filename)

@swghosh can we add a little defensive check on the constructed filePath here, to ensure that nothing can escape the etcd_info path on the container?

Because this is an exported method, it is causing the security scan action to fail even though all current call sites use hard-coded paths that are safe.
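
Something like the following could work — a sketch only, reusing the field names from the snippet above and rejecting anything that resolves outside etcd_info (imports assumed: fmt, os, path/filepath, strings):

// ReadETCDFile reads a raw ETCD info JSON file, refusing filenames that
// would escape the etcd_info directory (e.g. "../secrets.json").
func (p *Provider) ReadETCDFile(filename string) ([]byte, error) {
	baseDir := filepath.Join(p.metadata.ContainerDir, "etcd_info")
	filePath := filepath.Join(baseDir, filename) // Join also runs filepath.Clean
	rel, err := filepath.Rel(baseDir, filePath)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return nil, fmt.Errorf("invalid etcd_info filename %q: path escapes base directory", filename)
	}
	return os.ReadFile(filePath)
}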

Comment on lines +12 to +13
providerMu sync.RWMutex
provider *mg.Provider

@swghosh how will this work when more than one user connects to the MCP server and tries to check different must-gathers?

Is that a valid use case? If not, can we guard against it somehow?

I'm worried about two users calling mustgather_use with different paths.

Maybe we can cache these per session (should be on the ToolHandlerParams context iirc)

Member Author

Yup, that's a very valid use case.
We'd need to cache the must-gather path per session.

Comment on lines +33 to +45
// Load metadata files
loadMetadata(containerDir, &metadata)

// Load all resources
var resources []*unstructured.Unstructured

clusterResources, err := loadClusterScopedResources(containerDir)
if err == nil {
resources = append(resources, clusterResources...)
}

namespacedResources, err := loadNamespacedResources(containerDir)
if err == nil {

Do we maybe want to estimate how much memory loading the resources will consume before doing so, and compare that against how much memory the pod has? Some must-gathers can be quite large - this may crash the pod with an OOM.


Also (or instead), is it possible to lazy-load this, or does that not make sense?


Lazy loading makes sense - also I'm wondering why we need to pre-load at all, @swghosh. Have you noticed any issues getting data on demand?
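
For illustration, one possible lazy shape — index only file paths during the initial walk and decode YAML on first access. All names here are illustrative, not the PR's actual types:

import (
	"fmt"
	"os"
	"sync"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/yaml"
)

// lazyIndex maps a resource key to its file path, so the archive walk stays
// cheap and YAML is only decoded when a tool actually asks for the object.
type lazyIndex struct {
	mu    sync.Mutex
	paths map[string]string                     // resource key -> file path
	cache map[string]*unstructured.Unstructured // decoded objects, filled on demand
}

func newLazyIndex() *lazyIndex {
	return &lazyIndex{
		paths: map[string]string{},
		cache: map[string]*unstructured.Unstructured{},
	}
}

func (idx *lazyIndex) Get(key string) (*unstructured.Unstructured, error) {
	idx.mu.Lock()
	defer idx.mu.Unlock()
	if obj, ok := idx.cache[key]; ok {
		return obj, nil // already decoded earlier
	}
	path, ok := idx.paths[key]
	if !ok {
		return nil, fmt.Errorf("resource %q not found in archive", key)
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	obj := &unstructured.Unstructured{}
	if err := yaml.Unmarshal(data, &obj.Object); err != nil {
		return nil, fmt.Errorf("decoding %s: %w", path, err)
	}
	idx.cache[key] = obj
	return obj, nil
}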

if n > 0 {
sb.Write(buf[:n])
}
if readErr != nil {

Maybe we want to verify that this is actually an io.EOF error rather than some other kind of error, so that we don't silently swallow real failures.
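
e.g., something along these lines, assuming the enclosing function returns (string, error) and errors/io are imported:

	if readErr != nil {
		if errors.Is(readErr, io.EOF) {
			break // clean end of stream
		}
		return "", fmt.Errorf("reading must-gather file: %w", readErr) // surface real failures
	}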

Comment on lines +74 to +81
nsSet := make(map[string]struct{})
for ns := range idx.byNamespace {
nsSet[ns] = struct{}{}
}
idx.namespaces = make([]string, 0, len(nsSet))
for ns := range nsSet {
idx.namespaces = append(idx.namespaces, ns)
}

Can't we simplify this to just iterate over the keys in idx.byNamespace? As far as I can tell, nsSet has exactly the same keys as idx.byNamespace, so I'm not sure why we're building it here.
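
i.e., something like (the sort.Strings call is an optional extra, since map iteration order is random):

	idx.namespaces = make([]string, 0, len(idx.byNamespace))
	for ns := range idx.byNamespace {
		idx.namespaces = append(idx.namespaces, ns)
	}
	sort.Strings(idx.namespaces) // optional: deterministic ordering for output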

// healthSymbol returns a symbol for health status
func healthSymbol(health string) string {
switch strings.ToLower(health) {
case "up", "healthy", "ok", "true", "firing":

@swghosh you're the expert, but maybe we want to move "firing" to "[WARN]"?

Comment thread pkg/mcp/mcp.go Outdated
Prompts: &mcp.PromptCapabilities{ListChanged: !configuration.Stateless},
Tools: &mcp.ToolCapabilities{ListChanged: !configuration.Stateless},
Logging: &mcp.LoggingCapabilities{},
Prompts: &mcp.PromptCapabilities{ListChanged: !configuration.Stateless},

Let's not change this line if it's not needed.

Comment thread pkg/api/toolsets.go Outdated
Comment on lines +105 to +112
// ResourceRegistrar allows tools to register MCP resources at runtime.
// For example, the mustgather_use tool registers a must-gather:// resource
// after loading an archive.
type ResourceRegistrar interface {
AddResource(uri, name, description, mimeType, content string)
RemoveResources(uris ...string)
}


This type of change will need to be considered upstream first; we don't want to change this downstream.

Member Author

Added an upstream PR for the MCP resource and template registration boilerplate:
containers#1091

"path/filepath"
"strings"

"gopkg.in/yaml.v3"

Maybe sigs.k8s.io/yaml, like we use in other places? This seems to promote the other yaml dependency from indirect to direct.

Member Author

Updated to use sigs.k8s.io/yaml instead.

@swghosh
Member Author

swghosh commented Apr 22, 2026

re: #159 (comment) and #159 (comment)

@Cali0707 whenever you get a chance to look at this, PTAL at another proof-of-concept MCP resources server I've been playing around with: https://github.com/swghosh/mcp-x-mg/tree/main/mustgather-resource-mcp-server#mcp-resources.
It aligns better with code-mode-style approaches; please feel free to drop your feedback here or over Slack.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 22, 2026
@swghosh swghosh force-pushed the wg-must-gather-mcp-tools branch from 2ab4bf2 to 7604f45 Compare April 22, 2026 19:37
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 22, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 22, 2026

@swghosh: This pull request references MG-216 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Offline debugging analysis use cases require us to get data parsed from a must-gather archive rather than online cluster's kapiserver endpoint. Some tools are similar to the core toolset but there are more tools beyond resources_get especially journal logs, kubelet logs, etc.

Summary by CodeRabbit

  • New Features

  • Added openshift/mustgather toolset enabling offline analysis of Kubernetes/OpenShift must-gather archives.

  • Introduced tools to query ETCD data, cluster events, pod logs, node diagnostics, and monitoring metrics (Prometheus/AlertManager).

  • Added resource registration capability to MCP servers for dynamic resource management at runtime.

  • Tests

  • Added comprehensive test suites for must-gather toolset and resource registrar functionality.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@swghosh swghosh force-pushed the wg-must-gather-mcp-tools branch from 7604f45 to 2d6467c Compare April 22, 2026 19:43
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
pkg/mcp/resource_test.go (1)

62-65: Prefer fail-fast assertions for captured registrars.

After Line 63/Line 64, consider asserting s.resourceReg and s.templateReg are non-nil to make wiring failures explicit before subtests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcp/resource_test.go` around lines 62 - 65, The test calls
s.CallTool("test_init", ...) to populate registrars but doesn't assert they were
actually set; add fail-fast assertions immediately after that call to ensure
s.resourceReg and s.templateReg are non-nil so wiring failures fail the test
early. Locate the initialization call (s.CallTool("test_init",
map[string]any{})) and add assertions checking s.resourceReg != nil and
s.templateReg != nil (use the test suite's Require/Assert helpers) before
proceeding to subtests.
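
A minimal sketch of the fail-fast wiring check, assuming the suite embeds testify's suite.Suite so the Require helpers are available:

	s.CallTool("test_init", map[string]any{})
	s.Require().NotNil(s.resourceReg, "resource registrar should be captured by test_init")
	s.Require().NotNil(s.templateReg, "template registrar should be captured by test_init")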
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/mcp/resource.go`:
- Around line 10-15: There are two conflicting declarations of type
resourceRegistrar (used to implement api.ResourceRegistrar) in this package;
remove one to avoid duplicate-type compile errors and keep the implementation
that preserves the safer closure capture pattern (resourceContent := content)
used in the gosdk.go variant. Delete either the resourceRegistrar block in
pkg/mcp/resource.go or in pkg/mcp/gosdk.go, then ensure the remaining
resourceRegistrar struct references server *mcp.Server and implements the same
methods, and that any handler/closure code uses the resourceContent := content
pattern to avoid closure capture bugs.

In `@pkg/toolsets/mustgather/nodes.go`:
- Around line 230-242: The header currently uses len(matchingLines) after the
tail truncation, so it shows the tailed count instead of the total matches;
capture the total match count before you apply the tail (e.g. totalMatches :=
len(matchingLines) immediately before the tail logic) and then use totalMatches
in the header formatting instead of len(matchingLines); adjust the existing
block around matchingLines, tail, header, node, filter, and caseInsensitive in
nodes.go so the "Found X matching line(s)" reflects the original total while the
displayed matchingLines remain tailed.
- Around line 125-137: The sections slice defines two entries with the same key
"pods", making the second entry (title "Pod Resources", value diag.PodResources)
unreachable when filtering by shouldInclude; update the key for the Pod
Resources entry to a unique identifier such as "podresources" (or another
distinct string) and update any related schema/description to reflect the new
key so that shouldInclude("podresources") will correctly include
diag.PodResources.

---

Nitpick comments:
In `@pkg/mcp/resource_test.go`:
- Around line 62-65: The test calls s.CallTool("test_init", ...) to populate
registrars but doesn't assert they were actually set; add fail-fast assertions
immediately after that call to ensure s.resourceReg and s.templateReg are
non-nil so wiring failures fail the test early. Locate the initialization call
(s.CallTool("test_init", map[string]any{})) and add assertions checking
s.resourceReg != nil and s.templateReg != nil (use the test suite's
Require/Assert helpers) before proceeding to subtests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b0c259ea-bea9-4a6a-acd1-a13998bcfbd9

📥 Commits

Reviewing files that changed from the base of the PR and between 2ab4bf2 and 7604f45.

📒 Files selected for processing (26)
  • internal/test/mcp.go
  • internal/tools/update-readme/main.go
  • pkg/api/toolsets.go
  • pkg/mcp/gosdk.go
  • pkg/mcp/mcp.go
  • pkg/mcp/modules.go
  • pkg/mcp/resource.go
  • pkg/mcp/resource_test.go
  • pkg/mcp/testdata/toolsets-openshift-mustgather-tools.json
  • pkg/mcp/toolsets_test.go
  • pkg/ocp/mustgather/etcd.go
  • pkg/ocp/mustgather/index.go
  • pkg/ocp/mustgather/loader.go
  • pkg/ocp/mustgather/monitoring.go
  • pkg/ocp/mustgather/provider.go
  • pkg/ocp/mustgather/types.go
  • pkg/toolsets/mustgather/etcd.go
  • pkg/toolsets/mustgather/events.go
  • pkg/toolsets/mustgather/helpers.go
  • pkg/toolsets/mustgather/monitoring.go
  • pkg/toolsets/mustgather/namespaces.go
  • pkg/toolsets/mustgather/nodes.go
  • pkg/toolsets/mustgather/pod_logs.go
  • pkg/toolsets/mustgather/resources.go
  • pkg/toolsets/mustgather/toolset.go
  • pkg/toolsets/mustgather/use.go
✅ Files skipped from review due to trivial changes (5)
  • pkg/mcp/toolsets_test.go
  • pkg/toolsets/mustgather/toolset.go
  • pkg/mcp/modules.go
  • pkg/ocp/mustgather/types.go
  • pkg/mcp/testdata/toolsets-openshift-mustgather-tools.json
🚧 Files skipped from review as they are similar to previous changes (10)
  • pkg/toolsets/mustgather/namespaces.go
  • pkg/toolsets/mustgather/use.go
  • internal/tools/update-readme/main.go
  • pkg/mcp/mcp.go
  • pkg/ocp/mustgather/monitoring.go
  • pkg/toolsets/mustgather/helpers.go
  • pkg/toolsets/mustgather/monitoring.go
  • pkg/toolsets/mustgather/events.go
  • pkg/ocp/mustgather/etcd.go
  • pkg/toolsets/mustgather/resources.go

Comment thread pkg/toolsets/mustgather/nodes.go Outdated
Comment on lines +125 to +137
sections := []struct {
	key, title, content string
}{
	{"sysinfo", "System Info", diag.SysInfo},
	{"lscpu", "CPU Info (lscpu)", diag.Lscpu},
	{"cpu", "CPU Affinities", diag.CPUAffinities},
	{"irq", "IRQ Affinities", diag.IRQAffinities},
	{"lspci", "PCI Devices (lspci)", diag.Lspci},
	{"dmesg", "Kernel Messages (dmesg)", diag.Dmesg},
	{"cmdline", "Kernel Boot Parameters", diag.ProcCmdline},
	{"pods", "Pods Info", diag.PodsInfo},
	{"pods", "Pod Resources", diag.PodResources},
}

⚠️ Potential issue | 🟡 Minor

Duplicate key "pods" causes one section to be unreachable.

Lines 135 and 136 both use "pods" as the key. When a user passes include=pods, only the first matching section (Pods Info) is included because shouldInclude("pods") returns the same result for both. Either Pod Resources should use a different key (e.g., "podresources"), or the description should clarify this behavior.

 		{"cmdline", "Kernel Boot Parameters", diag.ProcCmdline},
 		{"pods", "Pods Info", diag.PodsInfo},
-		{"pods", "Pod Resources", diag.PodResources},
+		{"podresources", "Pod Resources", diag.PodResources},
 	}

Update the schema description accordingly:

-						"include":     {Type: "string", Description: "Comma-separated diagnostics to include: kubelet,sysinfo,cpu,irq,pods,lscpu,lspci,dmesg,cmdline (default: all)"},
+						"include":     {Type: "string", Description: "Comma-separated diagnostics to include: kubelet,sysinfo,cpu,irq,pods,podresources,lscpu,lspci,dmesg,cmdline (default: all)"},
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sections := []struct {
	key, title, content string
}{
	{"sysinfo", "System Info", diag.SysInfo},
	{"lscpu", "CPU Info (lscpu)", diag.Lscpu},
	{"cpu", "CPU Affinities", diag.CPUAffinities},
	{"irq", "IRQ Affinities", diag.IRQAffinities},
	{"lspci", "PCI Devices (lspci)", diag.Lspci},
	{"dmesg", "Kernel Messages (dmesg)", diag.Dmesg},
	{"cmdline", "Kernel Boot Parameters", diag.ProcCmdline},
	{"pods", "Pods Info", diag.PodsInfo},
	{"pods", "Pod Resources", diag.PodResources},
}

sections := []struct {
	key, title, content string
}{
	{"sysinfo", "System Info", diag.SysInfo},
	{"lscpu", "CPU Info (lscpu)", diag.Lscpu},
	{"cpu", "CPU Affinities", diag.CPUAffinities},
	{"irq", "IRQ Affinities", diag.IRQAffinities},
	{"lspci", "PCI Devices (lspci)", diag.Lspci},
	{"dmesg", "Kernel Messages (dmesg)", diag.Dmesg},
	{"cmdline", "Kernel Boot Parameters", diag.ProcCmdline},
	{"pods", "Pods Info", diag.PodsInfo},
	{"podresources", "Pod Resources", diag.PodResources},
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/nodes.go` around lines 125 - 137, The sections slice
defines two entries with the same key "pods", making the second entry (title
"Pod Resources", value diag.PodResources) unreachable when filtering by
shouldInclude; update the key for the Pod Resources entry to a unique identifier
such as "podresources" (or another distinct string) and update any related
schema/description to reflect the new key so that shouldInclude("podresources")
will correctly include diag.PodResources.

Comment on lines +230 to +242
// Apply tail from the end of matches
if tail > 0 && len(matchingLines) > tail {
	matchingLines = matchingLines[len(matchingLines)-tail:]
}

header := fmt.Sprintf("Kubelet logs for node %s filtered by '%s'", node, filter)
if caseInsensitive {
	header += " (case-insensitive)"
}
if tail > 0 {
	header += fmt.Sprintf(" (last %d matches)", tail)
}
header += fmt.Sprintf(":\n\nFound %d matching line(s)\n\n", len(matchingLines))

⚠️ Potential issue | 🟡 Minor

Match count header shows tailed count, not total matches.

After applying tail truncation on Lines 231-232, Line 242 reports len(matchingLines) which is the tailed count. If there were 500 matches and tail=10, users see "Found 10 matching line(s)" instead of the actual 500.

Suggested fix
+	totalMatches := len(matchingLines)
 	// Apply tail from the end of matches
 	if tail > 0 && len(matchingLines) > tail {
 		matchingLines = matchingLines[len(matchingLines)-tail:]
 	}
 
 	header := fmt.Sprintf("Kubelet logs for node %s filtered by '%s'", node, filter)
 	if caseInsensitive {
 		header += " (case-insensitive)"
 	}
 	if tail > 0 {
 		header += fmt.Sprintf(" (last %d matches)", tail)
 	}
-	header += fmt.Sprintf(":\n\nFound %d matching line(s)\n\n", len(matchingLines))
+	header += fmt.Sprintf(":\n\nFound %d matching line(s)\n\n", totalMatches)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Apply tail from the end of matches
if tail > 0 && len(matchingLines) > tail {
	matchingLines = matchingLines[len(matchingLines)-tail:]
}

header := fmt.Sprintf("Kubelet logs for node %s filtered by '%s'", node, filter)
if caseInsensitive {
	header += " (case-insensitive)"
}
if tail > 0 {
	header += fmt.Sprintf(" (last %d matches)", tail)
}
header += fmt.Sprintf(":\n\nFound %d matching line(s)\n\n", len(matchingLines))

totalMatches := len(matchingLines)
// Apply tail from the end of matches
if tail > 0 && len(matchingLines) > tail {
	matchingLines = matchingLines[len(matchingLines)-tail:]
}

header := fmt.Sprintf("Kubelet logs for node %s filtered by '%s'", node, filter)
if caseInsensitive {
	header += " (case-insensitive)"
}
if tail > 0 {
	header += fmt.Sprintf(" (last %d matches)", tail)
}
header += fmt.Sprintf(":\n\nFound %d matching line(s)\n\n", totalMatches)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/toolsets/mustgather/nodes.go` around lines 230 - 242, The header
currently uses len(matchingLines) after the tail truncation, so it shows the
tailed count instead of the total matches; capture the total match count before
you apply the tail (e.g. totalMatches := len(matchingLines) immediately before
the tail logic) and then use totalMatches in the header formatting instead of
len(matchingLines); adjust the existing block around matchingLines, tail,
header, node, filter, and caseInsensitive in nodes.go so the "Found X matching
line(s)" reflects the original total while the displayed matchingLines remain
tailed.

@swghosh swghosh force-pushed the wg-must-gather-mcp-tools branch from 2d6467c to ac8be42 Compare April 22, 2026 19:50

@bentito bentito left a comment


This is a big PR, but it introduces highly valuable and much-needed capabilities for offline RCA. I fully support merging this soon and mostly as-is to get the baseline functionality in place. We can address the architectural and performance suggestions below as follow-up work.

Only my comments below about concurrency on the package-level variables may need addressing pre-merge.

In particular, as a follow-up, we need to ensure that our recent injection ideas for the netedge toolset integrate smoothly with this offline mode after it merges.

Here is a technical review and suggestions for future iterations, specifically looking at this from an AI agent's perspective:

1. Agent Workflow: The use Initialization Step

From an agent's perspective, forcing the LLM to explicitly call mustgather_use before accessing any other tools creates a brittle, linear workflow. Agents frequently forget prerequisite setup steps, requiring extra turns to recover from errors.

  • Suggestion for follow-up: Initialize the must-gather provider lazily, as an internal check performed by every other must-gather tool call; if initialization hasn't happened yet, do it silently when needed (see the sketch below).
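
A minimal sketch of that lazy-initialization idea, assuming the package-level provider from helpers.go; Provider and loadArchive here are stand-ins for the PR's actual type and constructor:

package mustgather

import (
	"fmt"
	"sync"
)

// Provider and loadArchive stand in for the real mg.Provider and its loader.
type Provider struct{ basePath string }

func loadArchive(path string) (*Provider, error) { return &Provider{basePath: path}, nil }

var (
	providerMu sync.Mutex
	provider   *Provider
)

// ensureProvider lets every tool handler run without a prior mustgather_use:
// it initializes the shared provider on first use, under a lock.
func ensureProvider(path string) (*Provider, error) {
	providerMu.Lock()
	defer providerMu.Unlock()
	if provider != nil {
		return provider, nil
	}
	p, err := loadArchive(path)
	if err != nil {
		return nil, fmt.Errorf("initializing must-gather provider: %w", err)
	}
	provider = p
	return provider, nil
}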

2. Concurrency and State Management

Currently, the loaded archive state is stored in a package-level global variable:

// pkg/toolsets/mustgather/helpers.go
var provider *mg.Provider
  • Suggestion for change: In a multi-client HTTP/SSE deployment, if User A calls mustgather_use(path="/archive1") and User B calls mustgather_use(path="/archive2"), User B will overwrite the global state, breaking User A's session. Moving to the startup-configuration approach mentioned above (tying the server instance to a specific archive) or implementing session-scoped state would resolve this race condition safely (see the sketch below).
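
A rough sketch of the session-scoped alternative, reusing the Provider stand-in from the sketch above and keying state by MCP session ID (the session-ID plumbing is assumed to come from the server):

// providerRegistry maps each MCP session to its own loaded archive, so
// concurrent clients of an HTTP/SSE server cannot clobber one another.
type providerRegistry struct {
	mu        sync.RWMutex
	bySession map[string]*Provider
}

func (r *providerRegistry) Use(sessionID string, p *Provider) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.bySession == nil {
		r.bySession = make(map[string]*Provider)
	}
	r.bySession[sessionID] = p
}

func (r *providerRegistry) Get(sessionID string) (*Provider, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	p, ok := r.bySession[sessionID]
	return p, ok
}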

3. Memory Efficiency in Log Processing

The log filtering tools (mustgather_pod_logs_grep and mustgather_pod_logs_by_time) currently read the entire log file into memory using readTextFile before splitting and iterating.

  • Suggestion for follow-up: Because must-gather pod logs can easily reach hundreds of megabytes, this approach risks significant memory spikes or OOM kills on the MCP server. Refactoring these tools to use a bufio.Scanner that streams the file line by line from disk, applying the regex/time filters during the scan and retaining only the matched lines, would make them significantly more robust (see the sketch below).
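
A minimal sketch of that streaming approach, assuming a plain-text log file on disk; the function and parameter names are illustrative, and it also tracks the pre-tail total, echoing the header fix discussed earlier:

package mustgather

import (
	"bufio"
	"os"
	"regexp"
)

// grepLogStream filters a log file line by line so the whole file never
// sits in memory; only the last `tail` matches are retained.
func grepLogStream(path string, pattern *regexp.Regexp, tail int) (matches []string, total int, err error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, 0, err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // tolerate long log lines
	for sc.Scan() {
		line := sc.Text()
		if !pattern.MatchString(line) {
			continue
		}
		total++ // total match count, independent of tailing
		matches = append(matches, line)
		if tail > 0 && len(matches) > tail {
			matches = matches[1:] // keep only the most recent matches
		}
	}
	return matches, total, sc.Err()
}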

Great work on the core parsing and indexing logic! Let's get this merged and iterate on the optimizations.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 3, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bentito, swghosh
Once this PR has been reviewed and has the lgtm label, please assign manusa for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@swghosh swghosh force-pushed the wg-must-gather-mcp-tools branch from ac8be42 to 1ef296e Compare May 5, 2026 08:24
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 5, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 5, 2026

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2026
swghosh and others added 2 commits May 5, 2026 22:38
* Add MCP resources to Toolset interface

Signed-off-by: Swarup Ghosh <swghosh@redhat.com>

* Add MCP resource capabilities to Server

Signed-off-by: Swarup Ghosh <swghosh@redhat.com>

* React to changes in the Toolset interface

Signed-off-by: Swarup Ghosh <swghosh@redhat.com>

* Update suggestions encountered during review

Signed-off-by: Swarup Ghosh <swghosh@redhat.com>

* Update suggestions from review

Signed-off-by: Swarup Ghosh <swghosh@redhat.com>

* Add more test coverage for MCP resources

(assertion pairing in resource_test.go, Blob/MIMEType-override coverage, filter/mutator tests, AddResource URI-validation panic)

Signed-off-by: Swarup Ghosh <swghosh@redhat.com>

---------

Signed-off-by: Swarup Ghosh <swghosh@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Swarup Ghosh <swghosh@redhat.com>

Add mustgather_events_by_time tool

Add mustgather_pod_logs_grep tool

Add mustgather_pod_logs_by_time tool

Signed-off-by: Swarup Ghosh <swghosh@redhat.com>

Update to use MCP resources
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Signed-off-by: Swarup Ghosh <swghosh@redhat.com>
@swghosh swghosh force-pushed the wg-must-gather-mcp-tools branch from 1ef296e to 95aa015 Compare May 5, 2026 17:11
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 5, 2026

@swghosh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/lint 95aa015 link true /test lint
ci/prow/test 95aa015 link true /test test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
