Conversation
…ntly fixing the white error on the maps
| This script loads a config, builds a MultiStreamDataSampler, extracts one batch, | ||
| and plots a single variable for source and target with masking/cropping applied. | ||
| This script can run on a cpu or logging node without GPUs. | ||
| Please activate your .venv before running. |
There was a problem hiding this comment.
this is wrong. You should not have to do that. uv run should handle running without activating the environment.
There was a problem hiding this comment.
I agree with you, but after dealing with different systems, there are different Python versions running on each of those, and I found out that using uv run is the easiest way to handle this situation. I can add a command using python as well if you agree
TillHae
left a comment
There was a problem hiding this comment.
Looks good, I just have a few concerns with your script.
I didn't check parsing, docstrings, plotting, meta info, instance checks and didn't test the script.
|
|
||
| def _to_numpy(arr): | ||
| if isinstance(arr, torch.Tensor): | ||
| return arr.detach().cpu().numpy() |
There was a problem hiding this comment.
Why do we have to move this to cpu?
There was a problem hiding this comment.
This script can be run completely on cpu, we do not need GPU for it
There was a problem hiding this comment.
But if I have a gpu on my local machine I might want to use it for performance
There was a problem hiding this comment.
I believe most of users will be using HPCs as the data is stored there. That said, it is only retrieving one sample so cpu is perfectly suited even on any machine. The performance is not important here as the script is used to generate one plot before training.
| np.ndarray | ||
| Boolean array of same length as lats/lons indicating visibility. | ||
| """ | ||
| if len(lats) == 0: |
There was a problem hiding this comment.
Is len(lats) == len(lons) all the time?
There was a problem hiding this comment.
Yes, always. Both lats and lons are derived from the same coords array via coords[:, 0] and coords[:, 1], so they are guaranteed to have equal length at every call site.
| if max_points is None or len(vals) <= max_points: | ||
| return lats, lons, vals | ||
| idxs = rng.choice(len(vals), size=max_points, replace=False) | ||
| return lats[idxs], lons[idxs], vals[idxs] |
There was a problem hiding this comment.
This might be dangerous to do as it generates a discontinuous array that looks continuous
There was a problem hiding this comment.
Disagree. For scatter plots. Unlike line plots, scatter does not connect adjacent points. Each point is rendered independently, so random downsampling only reduces density. It doesn't create false continuity. The points=N label already shows how many points are displayed.
_downsample is only used when you set up "--max-points" for high resolution data and it is not used in individual cases if not specificaly sat up
| if len(vals_ref) == 0: | ||
| vals_ref = vals_src if len(vals_src) else vals_tgt | ||
| vmin = np.nanpercentile(vals_ref, 2) | ||
| vmax = np.nanpercentile(vals_ref, 98) |
There was a problem hiding this comment.
It doesn't make a big difference in most cases, but I think cutting the outliers can be unwanted sometimes, where we are interested in them and want to see it in our plot. Especially when evaluating masking this can be interesting.
There was a problem hiding this comment.
I agree, but this is not an evaluation plot, so extreme events are prevented here to keep the colorscale consistent.
| ) | ||
| else: | ||
| target_var_idx, target_var_name = _resolve_var_idx( | ||
| ds.source_channels, |
There was a problem hiding this comment.
Why do we have to use ds.source_channels when we have no target_tokens in our target stream?
There was a problem hiding this comment.
When visualizing masking, I am choosing only one variable, which is enough for this purpose. Source and target here represent the student and teacher view which are different and can be set up in the config
|
|
||
| pairs_data.append( | ||
| { | ||
| "pair_idx": source_idx, |
There was a problem hiding this comment.
Shouldn't the key "pair_idx" point to the value of pair_indices defined in l.829
There was a problem hiding this comment.
source_idx already is a value from pair_indices (the loop iterates for source_idx in pair_indices). So "pair_idx": source_idx stores the source sample index, which is meaningful (it identifies which batch sample this is). Storing the enumeration position (0, 1, 2…) would be less informative and would make the filename pair0_, pair1_ regardless of which samples were selected.
Though the naming pair_idx vs source_idx is admittedly redundant — pair_idx could be renamed to make it clearer.
|
Thanks for clarifying. Now everything sounds logical to me. |
Description
Implementation of the masking visualization script
Issue Number
Closes #2145
Is this PR a draft? Mark it as draft.
Checklist before asking for review
./scripts/actions.sh lint./scripts/actions.sh unit-test./scripts/actions.sh integration-testlaunch-slurm.py --time 60