[CELEBORN-2257] Add reporting of remote disks during registration.#3597
[CELEBORN-2257] Add reporting of remote disks during registration.#3597Dzeri96 wants to merge 9 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes worker registration disk reporting so the master can see remote storage (HDFS/S3/OSS) immediately (before the first heartbeat), and refactors disk snapshot APIs / slot-allocation logic to distinguish local vs remote disks more clearly.
Changes:
- Renamed disk snapshot / healthy-dir helpers to explicitly mean “local” and added an “all disks” snapshot.
- Updated worker registration/heartbeat disk reporting to incorporate remote disks.
- Simplified master slot-allocation filtering by embedding disk-type metadata into
StorageInfo.Typeand using it in allocation logic.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| worker/src/test/scala/org/apache/celeborn/service/deploy/worker/storage/StorageManagerSuite.scala | Updates mocks to the renamed localDisksSnapshot() API. |
| worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/StorageManager.scala | Introduces localDisksSnapshot() / allDisksSnapshot() and renames “healthy working dirs” to local-only. |
| worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Worker.scala | Switches registration to report all disks; refactors heartbeat disk update flow. |
| worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala | Uses local-only healthy working dirs check for slot reservation. |
| tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/CelebornHashCheckDiskSuite.scala | Updates test to use localDisksSnapshot(). |
| master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java | Simplifies disk filtering using StorageInfo.Type metadata; refactors usable-slot bookkeeping. |
| common/src/main/scala/org/apache/celeborn/common/meta/WorkerInfo.scala | Refactors slot recomputation / propagation logic and uses isDFS. |
| common/src/main/java/org/apache/celeborn/common/protocol/StorageInfo.java | Adds isDFS + mask metadata into StorageInfo.Type and introduces isAvailable(...). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3597 +/- ##
==========================================
- Coverage 67.13% 67.07% -0.06%
==========================================
Files 357 357
Lines 21860 21935 +75
Branches 1943 1947 +4
==========================================
+ Hits 14674 14711 +37
- Misses 6166 6213 +47
+ Partials 1020 1011 -9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
SteNicholas
left a comment
There was a problem hiding this comment.
@Dzeri96, thanks for contribution. Could you explain which fix mainly provided in this pull request?
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@SteNicholas So my changes are explained in the PR description pretty well I think. While @eolivelli was running his tests, he noticed that the current faulty behaviour presents a problem when the auto-scaling spawns new nodes. In this moment, the system is under pressure, and yet the newly-spawned nodes don't report remote disks, leading to performance degredation and the need to spawn more nodes. In hindsight though, I should have limited the PR to just this. It's just that while I was trying to understand the code, I made the other changes to make it more readable for myself. In the end I decided to include them too since we will be working on this part of the project a lot in the future. Also, don't forget to help me with writing a test! |
for PR overall LGTM, Additionally, you can use |
|
I'll try to write some tests this week and will ping you when I'm done @RexXiong |
SteNicholas
left a comment
There was a problem hiding this comment.
@Dzeri96, thanks for update. The changes overall LGTM. Please add integration test for the changes.
|
This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days. |
|
I haven't forgotten about this PR. It's just that the tests have proven to be hard to write and I hadn't had that much time. Hopefully I'll do it soon! |
|
I've spent so much time fighting Mockito on this one (issue opened), that I forgot what I wrote in this PR. @SteNicholas Please be so kind to take another look. |
# Conflicts: # master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java
|
The logs have expired in the meantime, but I think that the failing tests are not because of me. You guys can try to re-trigger them and take a look. |
|
Please check and fix CI |
|
I can locally reproduce some failing tests from the service suite. Let me check what's wrong... |
|
@SteNicholas Actually, the tests seem flaky. Sometimes they pass and sometimes they don't. Do you think this has something to do with my code or was it there before? |
What changes were proposed in this pull request?
Why are the changes needed?
Does this PR resolve a correctness bug?
Not a correctness bug
Does this PR introduce any user-facing change?
No
How was this patch tested?
Important: I want help from the community on how to write tests for this.