Skip to content

Rolfhm/dev/mutable config#2123

Draft
rolfhm wants to merge 9 commits intoecmwf:developfrom
rolfhm:rolfhm/dev/mutable-config
Draft

Rolfhm/dev/mutable config#2123
rolfhm wants to merge 9 commits intoecmwf:developfrom
rolfhm:rolfhm/dev/mutable-config

Conversation

@rolfhm
Copy link
Copy Markdown
Contributor

@rolfhm rolfhm commented Mar 27, 2026

Description

Beginning of a mutable config class for values that change during a run.

Issue Number

Closes #1528

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hegdedoc in the github issue with all the configurations and runs for this experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@github-actions github-actions bot added infra Issues related to infrastructure model Related to model training or definition (not generic infra) labels Mar 27, 2026
Copy link
Copy Markdown
Collaborator

@clessig clessig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work. We probably need some discussion of how to write it and to make sure we don't need to modify too much code related to reading/writing config.

Comment thread src/weathergen/model/model_interface.py Outdated
overrides={},
):
model_creation_device = "meta" if with_ddp and with_fsdp else "cuda"
model_creation_device = "meta" if mcf.with_ddp and with_fsdp else "cuda"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inconsistent since with_ddp and with_fsdp are specified in the config. I think we should remove with_ddp in the config--DDP will always be used when possible and this is standard and correct behaviour.

We need to remove it in all configs then.

Comment thread src/weathergen/model/model_interface.py Outdated
halflife_steps=target_and_aux_calc_params.get("ema_halflife_in_thousands", 1e-3),
rampup_ratio=target_and_aux_calc_params.get("ema_ramp_up_ratio", 0.09),
is_model_sharded=(cf.with_ddp and cf.with_fsdp),
is_model_sharded=(with_ddp and cf.with_fsdp),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We re-determine this multiple times. I think we should introduce a flag in the mcf--this is more robust.


_logger = logging.getLogger(__name__)

class MutableConfig():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right name. It's not mutable during the run (except for istep) but it's determined at run initialization time. If I start from istep then this is more the run state.


class MutableConfig():
"""
lalala
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update

self.with_ddp = None
self.run_history = []

print("Trolololo 1")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

lalala
"""

def __init__(self):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this class have serialization and deserialization code? Also: a lot of code around the regular config is to ensure it is loaded and stored correctly. This has not been touched?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cf is just a DictConfig from OmegaConf as far as I can tell. Is there any built in code in DictConfig to handle it?
Wondering if it would be easier to just make RunState (former MutableConfig) also a DictConfig? @tjhunter

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OmegaConf.DictConfig makes a lot of sense to me: it has some useful functionality (like merge()) that we want to have.


# adjust len to split loading across all workers and ensure it is multiple of batch_size
len_chunk = ((self.len // cf.world_size) // self.batch_size) * self.batch_size
len_chunk = ((self.len // mcf.world_size) // self.batch_size) * self.batch_size
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.world_size

Comment thread src/weathergen/train/trainer_base.py Outdated
mcf.with_ddp = world_size > 1

return cf
return cf, mcf
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to return cf

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general.run_id and data_loader_rng_seed are also computed in this routine. Should they also be moved to runstate?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is the decision if we consider run_id as a mutable object or not. It's (usually) determined at init time but then always constant. I hence would prefer if it's part of the immutable config.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infra Issues related to infrastructure model Related to model training or definition (not generic infra)

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

config should be immutable, dynamic parameters should be tracked in run_params

2 participants