[2091][performance] Track throughput metrics by florianscheidl · Pull Request #2124 · ecmwf/WeatherGenerator

florianscheidl · 2026-03-27T17:49:21Z

Description

Implements optional on-the-fly throughput metrics, logged per training step. The new metrics are named "performance.throughput.*" and track per-device and global throughput in terms of:

batches per second,
samples per second,
MB per second.

In multi-device and multi-node setups, we make an all-reduce call to get the global throughput metrics.
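A minimal sketch of how such per-step throughput tracking could work, under stated assumptions. The class `ThroughputTracker`, the function `global_throughput`, the metric key suffixes, and the decimal-megabyte convention are all hypothetical and not taken from this PR's code. The global value is assumed to be the sum of per-device rates, which in a real multi-device run would be obtained with an all-reduce using a sum operation rather than the plain Python sum shown here.

```python
import time
from collections import deque

BYTES_PER_MB = 1e6  # assumption: decimal megabytes


class ThroughputTracker:
    """Hypothetical sliding-window tracker; not the class added in this PR."""

    def __init__(self, window=10, clock=time.perf_counter):
        if window < 2:
            raise ValueError("window must be greater than 1")
        self._clock = clock
        # Each entry: (timestamp, cumulative batches, samples, bytes).
        self._history = deque(maxlen=window)
        self._batches = self._samples = self._bytes = 0

    def step(self, num_samples, num_bytes):
        """Record one finished training step and return per-device rates."""
        self._batches += 1
        self._samples += num_samples
        self._bytes += num_bytes
        self._history.append(
            (self._clock(), self._batches, self._samples, self._bytes)
        )
        t0, b0, s0, n0 = self._history[0]
        t1, b1, s1, n1 = self._history[-1]
        elapsed = t1 - t0
        if elapsed <= 0:  # first step (or a non-advancing clock): no rate yet
            return {}
        return {
            "batches_per_s": (b1 - b0) / elapsed,
            "samples_per_s": (s1 - s0) / elapsed,
            "mb_per_s": (n1 - n0) / BYTES_PER_MB / elapsed,
        }


def global_throughput(per_rank_metrics):
    """Sum per-device rates; stands in for an all-reduce with a SUM op."""
    keys = per_rank_metrics[0].keys()
    return {k: sum(m[k] for m in per_rank_metrics) for k in keys}
```

The window bounds how far back the rate is averaged, so a larger window gives smoother but slower-reacting numbers. Injecting the clock keeps the tracker easy to test without real waiting.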

Usage

To activate throughput tracking, set track_performance_metrics: True under train_logging in the training config (see the performance_*.yaml configs added in this PR). Then run with a base configuration, e.g.:

../WeatherGenerator-private/hpc/launch-slurm.py --time 8 --nodes=1 --base-config config/config_jepa.yml --config config/performance_jepa_config.yml

Issue Number

Closes #2091.

Preview:

We investigated the effect of batch size on throughput; see https://gitlab.jsc.fz-juelich.de/hedgedoc/SUW6Zq-BR3uYCwU3hmIb6w?both#.

[Screenshots from MLflow]

Checklist before asking for review

I have performed a self-review of my code
My changes comply with basic sanity checks:
- I have fixed formatting issues with ./scripts/actions.sh lint
- I have run unit tests with ./scripts/actions.sh unit-test
- I have documented my code and I have updated the docstrings.
- I have added unit tests, if relevant
I have tried my changes with data and code:
- I have run the integration tests with ./scripts/actions.sh integration-test
- (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
- (bigger changes and experiments) I have shared a hedgedoc in the GitHub issue with all the configurations and runs for these experiments
I have informed and aligned with people impacted by my change:
- for config changes: the MatterMost channels and/or a design doc
- for changes of dependencies: the MatterMost software development channel

clessig

Some minor points but essentially ready to be merged

clessig · 2026-04-16T12:47:07Z

                self.ema_model.update(self.cf.general.istep * batch_size_total, batch_size_total)

+            self.perf_tracker.step(
+                batch,


What is batch needed for?

We compute the throughput in MB from the batch.
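One way to derive a byte count from a batch, sketched under assumptions: the batch is treated as a nested structure of mappings and sequences whose leaves expose an `nbytes` attribute (as NumPy arrays and `memoryview` objects do) or are raw byte buffers. The helper name `batch_nbytes` is hypothetical, and the real batch type in this PR may be structured differently.

```python
from collections.abc import Mapping, Sequence


def batch_nbytes(batch):
    """Recursively sum the size in bytes of all array-like leaves in a batch."""
    if hasattr(batch, "nbytes"):  # e.g. NumPy arrays, memoryview
        return int(batch.nbytes)
    if isinstance(batch, (bytes, bytearray)):
        return len(batch)
    if isinstance(batch, Mapping):
        return sum(batch_nbytes(value) for value in batch.values())
    if isinstance(batch, Sequence) and not isinstance(batch, str):
        return sum(batch_nbytes(item) for item in batch)
    return 0  # scalars, strings and other metadata are not counted
```

Dividing the result by the chosen megabyte size (an assumption, e.g. 1e6) and by the elapsed time per step gives a throughput in MB per second.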

florianscheidl added 30 commits March 25, 2026 15:49

Throughput calculation with some shortcuts

c5f3f7e

meta device

a847645

small config to quickly iterate

a164514

debug refactor

52acede

Track throughput in terms of source data size

a889e5d

Correct torch device wrapping

a9f67d9

Update get_last_lr behavior

66b0578

Update minimal config

5fc0081

Provide loss function

80691d1

lr scheduler none issue without warmup

32c238f

Throughput warmup behaviour implies no updates

5f10d0f

Reinitialize throughput tracker

0dd6b67

Increasing throughput logging instead of resetting

062c953

Smaller config

35ce423

Align datatypes

16aa00f

Update config since we're running on 4 gpus per node

bc1333a

Destroying process group to get successful training?

312ccc4

Use bigger config again to avoid chaos

e5b18f3

Adapt window size to see throughput metrics

954daa5

Adapt config hoping metrics will be logged to mlflow

fe1952f

window must be greater than 1

15bc44a

Add step to logging for perf metrics

841af12

Add tests, isolate perf related functions, add doc strings

c7ca2a6

Configure perf metrics logging behind config

844a33b

Bigger config based on jepa

bf3cb64

Try with increasing number of samples

44e7672

Update config

8f94e78

Cleanup

db20a6e

Update config

1dbd25e

Clean up and track global throughput in mb

7274c75

florianscheidl added 9 commits March 31, 2026 17:44

Fix function args

4f7bee6

Move tracking to train logging

c012030

Synching across ranks

8478ac9

Increase number of samples

639433c

FlopCounterMode kills flash attention, remove flop counting

df68572

Update tests

126e244

Clean up trainer

46a856d

Remove legacy contexts

a3165d2

Comment cleanup

6e18869

florianscheidl changed the title ~~[2091][performance] Track throughput and utilization metrics (optional)~~ [2091][performance] Track throughput metrics Apr 2, 2026

florianscheidl and others added 2 commits April 2, 2026 17:41

Lint update

a0dc799

Merge branch 'develop' into fscheidl/flo-85-first-iteration-of-performance-metric-profiling

9fb0855

florianscheidl marked this pull request as ready for review April 2, 2026 15:46

florianscheidl added 5 commits April 2, 2026 19:13

Remove unused world size and comment

2f93888

Jepa perf config num_samples 2

1591460

throughput per step metrics

b381d83

Remove per step metrics

1f2c4d8

Drop legacy import

896cade

ekouts suggested changes Apr 14, 2026

Comment thread tests/test_performance_utils.py Outdated

Comment thread tests/test_performance_utils.py Outdated

github-project-automation bot moved this to In Progress in WeatherGen-dev Apr 14, 2026

ekouts suggested changes Apr 14, 2026

Comment thread src/weathergen/utils/performance.py Outdated

Implemented suggestions

126d629

ekouts approved these changes Apr 16, 2026

florianscheidl and others added 2 commits April 16, 2026 10:13

adjust shape

c778af0

Merge branch 'develop' into fscheidl/flo-85-first-iteration-of-performance-metric-profiling

3501e98

florianscheidl requested a review from clessig April 16, 2026 08:15

clessig approved these changes Apr 16, 2026

florianscheidl added 3 commits April 16, 2026 16:03

Rename configs

1d59059

Refactor batch-size-per-gpu arg

0f9fb96

Correct remaining test

cca02c0

florianscheidl wants to merge 73 commits into ecmwf:develop from florianscheidl:fscheidl/flo-85-first-iteration-of-performance-metric-profiling

3 participants
