
Fix FEM load-balanced layout reinitialization#487

Closed
aaadelmann wants to merge 8 commits into master from fem-load-balanced-debug

Conversation

Member

@aaadelmann aaadelmann commented Mar 24, 2026

This PR fixes the FEM load-balancing failure, in collaboration with Hugo, closing #485.

The real problem was stale FEM field state after repartition. When Alpine triggered load balancing, the
solver called LagrangeSpace::initialize(mesh, layout) again, but the internal FEM workspace field resultField in src/FEM/LagrangeSpace.hpp was only passed through initialize(...). Since Field::initialize() is one-shot, that call became a no-op after the first setup, so resultField kept the old layout and old extents after repartition. The next evaluateAx() then operated on a workspace that no longer matched the redistributed domain decomposition, which explains the failed warmup solve, the huge t=0 field energy, and the later halo/exchange crashes.

The fix is to explicitly resize that workspace on every FEM reinitialization:

  • in src/FEM/LagrangeSpace.hpp, resultField.updateLayout(layout) is now called in both layout-aware initialization paths, so the workspace is always rebuilt to the current repartitioned layout
  • in alpine/LoadBalancer.hpp, rho_m->setFieldBC(rho_m->getFieldBC()) is now called after rho_m->updateLayout(*fl), so the periodic BC neighbor metadata is refreshed for the redistributed density field
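The second bullet relies on BC neighbor metadata being captured at `setFieldBC()` time. A hedged sketch with mock types (names and structure are assumptions, not the real IPPL classes) shows why re-applying the same BC object after `updateLayout()` refreshes the cache:

```cpp
#include <cassert>

// Mock layout: leftNeighbor stands in for the rank topology after repartition.
struct Layout2 { int leftNeighbor; };

// Mock BC: caches neighbor info derived from the layout at set-time.
struct BC { int cachedNeighbor = -1; };

struct DensityField {
    Layout2 layout{0};
    BC bc;

    void updateLayout(const Layout2& l) { layout = l; }  // does NOT touch bc
    BC getFieldBC() const { return bc; }
    void setFieldBC(BC b) {
        b.cachedNeighbor = layout.leftNeighbor;  // recompute from current layout
        bc = b;
    }
};

int refreshed_neighbor() {
    DensityField rho;
    rho.setFieldBC(BC{});             // neighbor cached for the initial layout (rank 0)
    rho.updateLayout(Layout2{3});     // repartition changes the neighbor, cache is stale
    rho.setFieldBC(rho.getFieldBC()); // re-apply the same BC: cache refreshed to rank 3
    return rho.bc.cachedNeighbor;
}
```

In this model, `setFieldBC(getFieldBC())` looks like a no-op but re-derives the neighbor metadata from the now-current layout, mirroring the intent of the LoadBalancer change.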

Behavior After This Change

With this patch, the reduced FEM load-balanced runs no longer exhibit the old failure mode:

  • no operator extent mismatch
  • no 1000-iteration warmup stall
  • no t=0 energy blow-up by orders of magnitude
  • no follow-on halo corruption from stale FEM workspace state

This should not change the intended FEM discretization, conservation properties, or solver formulation. It is a state-management fix after repartition, not a numerical algorithm change. The floating-point path should follow the same correct code path as before load balancing; the difference is that the redistributed layout is now actually propagated into the FEM workspace and the BC metadata is refreshed.

Validation

I validated the fix in two stages.

Local reduced 2-rank CPU case:

  • control FEM 1.0: t=0 Ex = 9.9526424907379614
  • load-balanced FEM 0.01: t=0 Ex = 9.9532389888656407

A100 2-GPU reduced case on Merlin:

  • control FEM 1.0: t=0 Ex = 9.3334714317865917, CG entries 0,23 then 0,12
  • load-balanced FEM 0.01: t=0 Ex = 9.3018131240904989, CG entries 0,23 then 0,19

@aaadelmann aaadelmann self-assigned this Mar 24, 2026
@aaadelmann aaadelmann requested a review from s-mayani March 24, 2026 09:39
@aaadelmann aaadelmann added the bug Something isn't working label Mar 24, 2026
@aaadelmann aaadelmann linked an issue Mar 24, 2026 that may be closed by this pull request
// Initialize the resultField
// Initialize or resize the workspace field to the current layout.
resultField.initialize(mesh, layout);
resultField.updateLayout(layout);
Collaborator
I am not sure I understand this updateLayout here: there is no change of the layout between initialize and updateLayout, so updateLayout does not need to be called.

// Initialize the resultField
// Initialize or resize the workspace field to the current layout.
resultField.initialize(mesh, layout);
resultField.updateLayout(layout);
Collaborator

Same as above.

IpplTimings::startTimer(tupdateLayout);
(*E_m).updateLayout(*fl);
(*rho_m).updateLayout(*fl);
rho_m->setFieldBC(rho_m->getFieldBC());
Collaborator

I ran this, and it caused an MPI abort due to an invalid count...

Collaborator

@s-mayani s-mayani Mar 24, 2026

I ran it with both this change and the updateLayout change now, and it works. However, I still do not understand why the updateLayout is needed. In my opinion, updateLayout should be called before the solve, not in the LagrangeSpace constructor...

Collaborator

I think it works because load balancing only happens once in this Landau Damping case (at the beginning), and hence the call in the constructor is enough. However, if load balancing changed the domain decomposition again during the simulation, I think it would crash again and produce incorrect results. The solution would be to call updateLayout on the resultField every time the loadBalancer is called.
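The suggestion above can be sketched with mock types (the names MockLagrangeSpace, MockLoadBalancer, and repartition are illustrative assumptions, not the real IPPL interfaces): the load balancer notifies the FEM space on every repartition, not only at construction, so the workspace tracks the current decomposition even if the domain is redistributed again later.

```cpp
#include <cassert>

// Mock layout: version stands in for the current domain decomposition.
struct LayoutV { int version; };

struct MockLagrangeSpace {
    LayoutV workspaceLayout{0};
    // Rebinds the internal workspace (resultField in the real code) to the layout.
    void updateLayout(const LayoutV& l) { workspaceLayout = l; }
};

struct MockLoadBalancer {
    MockLagrangeSpace* space;
    void repartition(const LayoutV& newLayout) {
        // ... redistribute particles and fields ...
        space->updateLayout(newLayout);  // keep the FEM workspace in sync every time
    }
};

int final_version() {
    MockLagrangeSpace space;
    MockLoadBalancer lb{&space};
    lb.repartition(LayoutV{1});  // first load balance (at the beginning)
    lb.repartition(LayoutV{2});  // later repartition mid-simulation
    return space.workspaceLayout.version;  // 2: workspace follows the latest layout
}
```

With the constructor-only approach, the second `repartition` would leave the workspace on layout 1, which is the crash scenario the comment predicts.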

@s-mayani
Collaborator

I have made a branch with the same changes proposed by Hugo but taking into account my comments.
Here it is: https://github.com/s-mayani/ippl/tree/fix_loadbalancing_fem

It seems to me that the addition of the line in alpine/LoadBalancer.hpp is not needed:
rho_m->setFieldBC(rho_m->getFieldBC());
Furthermore, I have now added an updateLayout function which updates the LagrangeSpace every time it is called.

Maybe Hugo can check it, and write a test where the load balancing changes during the simulation such that we can verify it.

@s-mayani
Collaborator

Hugo is an AI tool.

@aaadelmann
Member Author

Please push all your changes to and also add the messages of this PR to, then let's see what Hugo can do :)

@s-mayani
Collaborator

I have merged my branch into this one, and will look at the regression test soon.

@s-mayani
Collaborator

Closing this pull request, clean version in PR #489.

@s-mayani s-mayani closed this Mar 25, 2026

Labels

bug (Something isn't working), gitlab-mirror


Development

Successfully merging this pull request may close these issues.

Load balancing in Alpine not working for FEM solver on GPU

2 participants