
Feature/visualize masking #2146

Open

wael-mika wants to merge 6 commits into ecmwf:develop from wael-mika:feature/visualize-masking

Conversation

@wael-mika
Contributor

Description

Implementation of the masking visualization script

Issue Number

Closes #2145

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the github issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@github-actions bot added the labels data (Anything related to the datasets used in the project) and eval (anything related to the model evaluation pipeline) on Mar 31, 2026
This script loads a config, builds a MultiStreamDataSampler, extracts one batch,
and plots a single variable for source and target with masking/cropping applied.
The script can run on a CPU or login node without GPUs.
Please activate your .venv before running.
Collaborator

this is wrong. You should not have to do that. uv run should handle running without activating the environment.

Contributor Author

I agree with you, but after dealing with different systems, each running a different Python version, I found that using uv run is the easiest way to handle this situation. I can add a plain python command as well if you agree.

@TillHae (Contributor) left a comment

Looks good, I just have a few concerns with your script.

I didn't check parsing, docstrings, plotting, meta info, instance checks and didn't test the script.


def _to_numpy(arr):
    if isinstance(arr, torch.Tensor):
        return arr.detach().cpu().numpy()
Contributor

Why do we have to move this to cpu?

Contributor Author

This script can run entirely on the CPU; we do not need a GPU for it.

Contributor

But if I have a GPU on my local machine, I might want to use it for performance.

Contributor Author

I believe most users will run this on HPCs, as the data is stored there. That said, the script only retrieves one sample, so the CPU is perfectly adequate on any machine. Performance is not important here: the script generates a single plot before training.
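For context, the conversion under discussion can be written as a device-agnostic helper. This is a sketch rather than the PR's exact code (the fallback branch for non-tensor inputs is an assumption); the key point is that .cpu() is a no-op for tensors already on the host, so calling it unconditionally is safe whether the batch was loaded on CPU or GPU.

```python
import numpy as np
import torch


def to_numpy(arr):
    """Sketch of a tensor-to-NumPy helper (hypothetical name).

    detach() drops the autograd graph, and .cpu() is a no-op for
    tensors already on the host, so the conversion works the same
    regardless of which device the batch lives on.
    """
    if isinstance(arr, torch.Tensor):
        return arr.detach().cpu().numpy()
    # Assumed fallback: accept plain lists/arrays as well.
    return np.asarray(arr)
```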

    np.ndarray
        Boolean array of same length as lats/lons indicating visibility.
    """
    if len(lats) == 0:
Contributor

Is len(lats) == len(lons) all the time?

Contributor Author

Yes, always. Both lats and lons are derived from the same coords array via coords[:, 0] and coords[:, 1], so they are guaranteed to have equal length at every call site.
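The invariant can be illustrated with a minimal sketch (the coords values below are hypothetical; the PR derives lats and lons the same way, as two slices of one array):

```python
import numpy as np

# Hypothetical (N, 2) coords array of [lat, lon] rows. Because lats
# and lons are both column slices of the same array, their lengths
# match by construction at every call site.
coords = np.array([[50.1, 4.3], [48.8, 2.3], [51.5, -0.1]])
lats = coords[:, 0]
lons = coords[:, 1]
assert len(lats) == len(lons) == len(coords)
```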

    if max_points is None or len(vals) <= max_points:
        return lats, lons, vals
    idxs = rng.choice(len(vals), size=max_points, replace=False)
    return lats[idxs], lons[idxs], vals[idxs]
Contributor

This might be dangerous to do, as it generates a discontinuous array that looks continuous.

Contributor Author

Disagree, for scatter plots. Unlike line plots, scatter does not connect adjacent points: each point is rendered independently, so random downsampling only reduces density; it doesn't create false continuity. The points=N label already shows how many points are displayed.
_downsample is only used when "--max-points" is set for high-resolution data; it is not used if not specifically set up.
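The guarded behaviour described above can be sketched as follows (the function signature and the seed parameter are assumptions based on the excerpt, not the PR's exact code):

```python
import numpy as np


def downsample(lats, lons, vals, max_points=None, seed=0):
    """Randomly thin points for a scatter plot.

    A no-op unless max_points is set and the arrays exceed it,
    mirroring the --max-points guard discussed above. Seeding the
    generator makes repeated plots reproducible (an assumption,
    not necessarily the PR's behaviour).
    """
    if max_points is None or len(vals) <= max_points:
        return lats, lons, vals
    rng = np.random.default_rng(seed)
    idxs = rng.choice(len(vals), size=max_points, replace=False)
    return lats[idxs], lons[idxs], vals[idxs]
```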

    if len(vals_ref) == 0:
        vals_ref = vals_src if len(vals_src) else vals_tgt
    vmin = np.nanpercentile(vals_ref, 2)
    vmax = np.nanpercentile(vals_ref, 98)
Contributor

It doesn't make a big difference in most cases, but I think cutting the outliers can be unwanted when we are interested in them and want to see them in the plot. Especially when evaluating masking this can be interesting.

Contributor Author

I agree, but this is not an evaluation plot, so outliers are clipped here to keep the colorscale consistent.
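One way to reconcile both points would be an optional clipping flag, sketched below. The function name and the clip_percentiles parameter are hypothetical additions for illustration, not part of the PR:

```python
import numpy as np


def color_limits(vals, clip_percentiles=(2, 98)):
    """Color range for the plot.

    By default clips to the 2nd/98th percentiles for a consistent
    colorscale, matching the excerpt above; passing
    clip_percentiles=None keeps outliers visible (hypothetical
    option, not in the PR).
    """
    if clip_percentiles is None:
        return float(np.nanmin(vals)), float(np.nanmax(vals))
    lo, hi = clip_percentiles
    return float(np.nanpercentile(vals, lo)), float(np.nanpercentile(vals, hi))
```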

    )
else:
    target_var_idx, target_var_name = _resolve_var_idx(
        ds.source_channels,
Contributor

Why do we have to use ds.source_channels when we have no target_tokens in our target stream?

Contributor Author

When visualizing masking, I am choosing only one variable, which is enough for this purpose. Source and target here represent the student and teacher views, which are different and can be set up in the config.


pairs_data.append(
    {
        "pair_idx": source_idx,
Contributor

Shouldn't the key "pair_idx" point to the value of pair_indices defined in l.829?

Contributor Author

source_idx already is a value from pair_indices (the loop iterates for source_idx in pair_indices), so "pair_idx": source_idx stores the source sample index, which is meaningful: it identifies which batch sample this is. Storing the enumeration position (0, 1, 2, ...) would be less informative and would make the filenames pair0_, pair1_ regardless of which samples were selected.
That said, the naming pair_idx vs source_idx is admittedly redundant; pair_idx could be renamed to make it clearer.

@TillHae (Contributor) commented Mar 31, 2026

Thanks for clarifying. Now everything sounds logical to me.



Development

Successfully merging this pull request may close these issues.

Initiative: Visualize masking before pretraining/finetuning

3 participants