[CELEBORN-2316] Introduce metadata operation metrics#3673
[CELEBORN-2316] Introduce metadata operation metrics#3673AmandeepSingh285 wants to merge 10 commits intoapache:mainfrom
Conversation
|
@AmandeepSingh285, thanks for first contribution. Please run |
There was a problem hiding this comment.
Pull request overview
This PR adds worker-side metrics to improve observability of RocksDB-backed metadata operations by counting read/write successes and failures.
Changes:
- Introduces
MetadataMetricsand wires it into RocksDBput/get/deleteand iterator reads/seeks to track success/failure counts. - Registers a new labeled counter in
WorkerSourcefor metadata operation status (operation= read/write,status= success/fail). - Threads
WorkerSource/AbstractSourcethroughDBProvider.initDB(...)call sites so RocksDB can emit metrics.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| worker/src/test/java/org/apache/celeborn/service/deploy/worker/shuffledb/DBProviderSuiteJ.java | Updates DB init calls to pass a WorkerSource after initDB signature change. |
| worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/StorageManager.scala | Passes workerSource into DB initialization for graceful-shutdown recovery DB. |
| worker/src/main/scala/org/apache/celeborn/service/deploy/worker/WorkerSource.scala | Adds labeled counters for metadata read/write success/failure. |
| worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/PartitionFilesSorter.java | Passes source into DB initialization for sorted-files recovery DB. |
| worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDBIterator.java | Wraps iterator seek/next in metric-recording reads (including status() checks). |
| worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDB.java | Wraps RocksDB put/get/delete with metric recording and passes metrics into iterator. |
| worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/MetadataMetrics.java | New helper to record labeled success/failure counters around RocksDB operations. |
| worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/DBProvider.java | Extends initDB to accept an AbstractSource and constructs RocksDB with it. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@SteNicholas will work on updating the celeborn dashboard. Could you help with an initial review for the change. Thanks! |
|
@AmandeepSingh285, the current implementation overall LGTM. Please add metrics in grafana dashboard. |
There was a problem hiding this comment.
@AmandeepSingh285, please run dev/reformat to format code changes and also document the introduced metrics in monitoring.md.
What changes were proposed in this pull request?
Introduce metrics around RocksDB operations. Metrics to have success and failure count for metadata operations for RocksDB observability.
Why are the changes needed?
Introduce metrics around RocksDB metadata operations. Current implementation, metadata operations do not have any observability added. RocksDB goes into a read only mode when any critical errors are encountered which results in all write operations failing. Observability around metadata operations is critical and failures could result in metadata entering an inconsistent state.
Does this PR resolve a correctness bug?
No.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tested in local staging setup -
