
Fix FEM load-balanced layout reinitialization#487

Closed
aaadelmann wants to merge 8 commits into master from fem-load-balanced-debug

Conversation

Member

@aaadelmann aaadelmann commented Mar 24, 2026

This PR fixes the FEM load-balancing failure, in collaboration with Hugo, closing #485.

The real problem was stale FEM field state after repartition. When Alpine triggered load balancing, the
solver called LagrangeSpace::initialize(mesh, layout) again, but the internal FEM workspace field resultField in src/FEM/LagrangeSpace.hpp was only passed through initialize(...). Since Field::initialize() is one-shot, that call became a no-op after the first setup, so resultField kept the old layout and old extents after repartition. The next evaluateAx() then operated on a workspace that no longer matched the redistributed domain decomposition, which explains the failed warmup solve, the huge t=0 field energy, and the later halo/exchange crashes.

The fix is to explicitly resize that workspace on every FEM reinitialization:

  • in src/FEM/LagrangeSpace.hpp, resultField.updateLayout(layout) is now called in both layout-aware initialization paths, so the workspace is always rebuilt to the current repartitioned layout
  • in alpine/LoadBalancer.hpp, rho_m->setFieldBC(rho_m->getFieldBC()) is now called after rho_m->updateLayout(*fl), so the periodic BC neighbor metadata is refreshed for the redistributed density field
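The second bullet relies on BC neighbor metadata being captured at `setFieldBC()` time. A hedged sketch with mock types (names and structure are assumptions, not the real IPPL classes) shows why re-applying the same BC object after `updateLayout()` refreshes the cache:

```cpp
#include <cassert>

// Mock layout: leftNeighbor stands in for the rank topology after repartition.
struct Layout2 { int leftNeighbor; };

// Mock BC: caches neighbor info derived from the layout at set-time.
struct BC { int cachedNeighbor = -1; };

struct DensityField {
    Layout2 layout{0};
    BC bc;

    void updateLayout(const Layout2& l) { layout = l; }  // does NOT touch bc
    BC getFieldBC() const { return bc; }
    void setFieldBC(BC b) {
        b.cachedNeighbor = layout.leftNeighbor;  // recompute from current layout
        bc = b;
    }
};

int refreshed_neighbor() {
    DensityField rho;
    rho.setFieldBC(BC{});             // neighbor cached for the initial layout (rank 0)
    rho.updateLayout(Layout2{3});     // repartition changes the neighbor, cache is stale
    rho.setFieldBC(rho.getFieldBC()); // re-apply the same BC: cache refreshed to rank 3
    return rho.bc.cachedNeighbor;
}
```

In this model, `setFieldBC(getFieldBC())` looks like a no-op but re-derives the neighbor metadata from the now-current layout, mirroring the intent of the LoadBalancer change.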

Behavior After This Change

With this patch, the reduced FEM load-balanced runs no longer exhibit the old failure mode:

  • no operator extent mismatch
  • no 1000-iteration warmup stall
  • no t=0 energy blow-up by orders of magnitude
  • no follow-on halo corruption from stale FEM workspace state

This should not change the intended FEM discretization, conservation properties, or solver formulation. It is a state-management fix after repartition, not a numerical algorithm change. The floating-point path should follow the same correct code path as before load balancing; the difference is that the redistributed layout is now actually propagated into the FEM workspace and the BC metadata is refreshed.

Validation

I validated the fix in two stages.

Local reduced 2-rank CPU case:

  • control FEM 1.0: t=0 Ex = 9.9526424907379614
  • load-balanced FEM 0.01: t=0 Ex = 9.9532389888656407

A100 2-GPU reduced case on Merlin:

  • control FEM 1.0: t=0 Ex = 9.3334714317865917, CG entries 0,23 then 0,12
  • load-balanced FEM 0.01: t=0 Ex = 9.3018131240904989, CG entries 0,23 then 0,19

@aaadelmann aaadelmann self-assigned this Mar 24, 2026
@aaadelmann aaadelmann requested a review from s-mayani March 24, 2026 09:39
@aaadelmann aaadelmann added the bug Something isn't working label Mar 24, 2026
@aaadelmann aaadelmann linked an issue Mar 24, 2026 that may be closed by this pull request
// Initialize the resultField
// Initialize or resize the workspace field to the current layout.
resultField.initialize(mesh, layout);
resultField.updateLayout(layout);
Collaborator
I am not sure I understand this updateLayout here: there is no change of the layout between initialize and updateLayout, so updateLayout does not need to be called.

// Initialize the resultField
// Initialize or resize the workspace field to the current layout.
resultField.initialize(mesh, layout);
resultField.updateLayout(layout);
Collaborator

Same as above.

IpplTimings::startTimer(tupdateLayout);
(*E_m).updateLayout(*fl);
(*rho_m).updateLayout(*fl);
rho_m->setFieldBC(rho_m->getFieldBC());
Collaborator

I ran this, and it caused an MPI abort due to an invalid count...

Collaborator

@s-mayani s-mayani Mar 24, 2026

I ran it with both this change and the updateLayout change now, and it works. However, I still do not understand why the updateLayout is needed. In my opinion, updateLayout should be called before the solve, not in the LagrangeSpace constructor...

Collaborator

I think it works because load balancing only happens once in this Landau Damping case (at the beginning), and hence the call in the constructor is enough. However, if load balancing changed the domain decomposition again during the simulation, I think it would crash again and produce incorrect results. The solution would be to call updateLayout on the resultField every time the loadBalancer is called.
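The suggestion above can be sketched with mock types (the names MockLagrangeSpace, MockLoadBalancer, and repartition are illustrative assumptions, not the real IPPL interfaces): the load balancer notifies the FEM space on every repartition, not only at construction, so the workspace tracks the current decomposition even if the domain is redistributed again later.

```cpp
#include <cassert>

// Mock layout: version stands in for the current domain decomposition.
struct LayoutV { int version; };

struct MockLagrangeSpace {
    LayoutV workspaceLayout{0};
    // Rebinds the internal workspace (resultField in the real code) to the layout.
    void updateLayout(const LayoutV& l) { workspaceLayout = l; }
};

struct MockLoadBalancer {
    MockLagrangeSpace* space;
    void repartition(const LayoutV& newLayout) {
        // ... redistribute particles and fields ...
        space->updateLayout(newLayout);  // keep the FEM workspace in sync every time
    }
};

int final_version() {
    MockLagrangeSpace space;
    MockLoadBalancer lb{&space};
    lb.repartition(LayoutV{1});  // first load balance (at the beginning)
    lb.repartition(LayoutV{2});  // later repartition mid-simulation
    return space.workspaceLayout.version;  // 2: workspace follows the latest layout
}
```

With the constructor-only approach, the second `repartition` would leave the workspace on layout 1, which is the crash scenario the comment predicts.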

@s-mayani
Collaborator

I have made a branch with the same changes proposed by Hugo but taking into account my comments.
Here it is: https://github.com/s-mayani/ippl/tree/fix_loadbalancing_fem

It seems to me that the addition of the line in alpine/LoadBalancer.hpp is not needed:
rho_m->setFieldBC(rho_m->getFieldBC());
Furthermore, I have now added an updateLayout function which updates the LagrangeSpace every time it is called.

Maybe Hugo can check it, and write a test where the load balancing changes during the simulation such that we can verify it.

@s-mayani
Collaborator

Hugo is an AI tool.

@aaadelmann
Member Author

Please push all your changes to and also add the messages of this PR to, then let's see what Hugo can do :)

@s-mayani
Collaborator

I have merged my branch into this one, and will look at the regression test soon.

@s-mayani
Collaborator

Closing this pull request, clean version in PR #489.

@s-mayani s-mayani closed this Mar 25, 2026

Labels

bug (Something isn't working), gitlab-mirror


Development

Successfully merging this pull request may close these issues.

Load balancing in Alpine not working for FEM solver on GPU

2 participants