Skip to content

[CELEBORN-2316] Introduce metadata operation metrics#3673

Open
AmandeepSingh285 wants to merge 10 commits intoapache:mainfrom
AmandeepSingh285:adding-metadata-metrics
Open

[CELEBORN-2316] Introduce metadata operation metrics#3673
AmandeepSingh285 wants to merge 10 commits intoapache:mainfrom
AmandeepSingh285:adding-metadata-metrics

Conversation

@AmandeepSingh285
Copy link
Copy Markdown

@AmandeepSingh285 AmandeepSingh285 commented Apr 29, 2026

What changes were proposed in this pull request?

Introduce metrics around RocksDB operations. Metrics to have success and failure count for metadata operations for RocksDB observability.

Why are the changes needed?

Introduce metrics around RocksDB metadata operations. Current implementation, metadata operations do not have any observability added. RocksDB goes into a read only mode when any critical errors are encountered which results in all write operations failing. Observability around metadata operations is critical and failures could result in metadata entering an inconsistent state.

Does this PR resolve a correctness bug?

No.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested in local staging setup -
image

@SteNicholas SteNicholas changed the title [CELEBORN-2316] Added metrics for metadata operations [CELEBORN-2316] Introduce metadata operation metrics Apr 30, 2026
@SteNicholas
Copy link
Copy Markdown
Member

SteNicholas commented Apr 30, 2026

@AmandeepSingh285, thanks for first contribution. Please run dev/reformat to format code changes. Then add introduced metrics into celeborn-dashboard.json.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds worker-side metrics to improve observability of RocksDB-backed metadata operations by counting read/write successes and failures.

Changes:

  • Introduces MetadataMetrics and wires it into RocksDB put/get/delete and iterator reads/seeks to track success/failure counts.
  • Registers a new labeled counter in WorkerSource for metadata operation status (operation = read/write, status = success/fail).
  • Threads WorkerSource / AbstractSource through DBProvider.initDB(...) call sites so RocksDB can emit metrics.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
worker/src/test/java/org/apache/celeborn/service/deploy/worker/shuffledb/DBProviderSuiteJ.java Updates DB init calls to pass a WorkerSource after initDB signature change.
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/StorageManager.scala Passes workerSource into DB initialization for graceful-shutdown recovery DB.
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/WorkerSource.scala Adds labeled counters for metadata read/write success/failure.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/PartitionFilesSorter.java Passes source into DB initialization for sorted-files recovery DB.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDBIterator.java Wraps iterator seek/next in metric-recording reads (including status() checks).
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDB.java Wraps RocksDB put/get/delete with metric recording and passes metrics into iterator.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/MetadataMetrics.java New helper to record labeled success/failure counters around RocksDB operations.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/DBProvider.java Extends initDB to accept an AbstractSource and constructs RocksDB with it.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@AmandeepSingh285
Copy link
Copy Markdown
Author

@SteNicholas will work on updating the celeborn dashboard. Could you help with an initial review for the change. Thanks!

@SteNicholas
Copy link
Copy Markdown
Member

@AmandeepSingh285, the current implementation overall LGTM. Please add metrics in grafana dashboard.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Copy link
Copy Markdown
Member

@SteNicholas SteNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AmandeepSingh285, please run dev/reformat to format code changes and also document the introduced metrics in monitoring.md.

Comment thread assets/grafana/celeborn-dashboard.json Outdated
Comment thread assets/grafana/celeborn-dashboard.json Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants