Fix FEM load-balanced layout reinitialization #487
Conversation
```diff
-// Initialize the resultField
+// Initialize or resize the workspace field to the current layout.
 resultField.initialize(mesh, layout);
+resultField.updateLayout(layout);
```
I am not sure I understand this updateLayout here: there is no change of layout between initialize and updateLayout, so updateLayout does not need to be called.
```diff
 IpplTimings::startTimer(tupdateLayout);
 (*E_m).updateLayout(*fl);
 (*rho_m).updateLayout(*fl);
+rho_m->setFieldBC(rho_m->getFieldBC());
```
I ran this, it caused an MPI Abort due to invalid count...
I ran it with both this change and the updateLayout change now, and it works. However, I still do not understand why the updateLayout is needed. In my opinion, updateLayout should be called before the solve, not in the LagrangeSpace constructor...
I think it works because load balancing only happens once in this Landau damping case (at the beginning), and hence works with the call only in the constructor. However, if load balancing changed the domain decomposition again during the simulation, I think it would crash again and produce incorrect results. The solution would be to call updateLayout on the resultField every time the load balancer is called.
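The suggestion above can be sketched as a minimal mock in C++. The types and the onLoadBalance hook are illustrative assumptions, not the real IPPL/Alpine API; the point is only that the FEM workspace field must be refreshed on every repartition, not just once:

```cpp
#include <cassert>

// Hypothetical minimal stand-ins for the real IPPL types.
struct Layout { int version = 0; };

struct Field {
    Layout layout;
    void updateLayout(const Layout& l) { layout = l; }  // always applies
};

struct Solver {
    Field resultField;  // internal FEM workspace, easy to forget
};

// Called on EVERY repartition, not only once in a constructor:
// refresh the field layouts, including the solver's workspace.
void onLoadBalance(const Layout& fl, Field& E, Field& rho, Solver& solver) {
    E.updateLayout(fl);
    rho.updateLayout(fl);
    solver.resultField.updateLayout(fl);  // keep the workspace in sync too
}
```

If the workspace update is instead done only in a constructor, a second repartition leaves resultField on the old decomposition, which matches the crash mode described above.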
I have made a branch with the same changes proposed by Hugo but taking into account my comments. It seems to me that the addition of the line in alpine/LoadBalancer.hpp is not needed: maybe Hugo can check it and write a test where the load balancing changes during the simulation, so that we can verify it.
Hugo is an AI tool |
Please push all your changes to and also add the messages of this PR to , then let's see what Hugo can do :)
I have merged my branch to this one, and will look at the regression test soon. |
Closing this pull request, clean version in PR #489. |
This PR fixes the FEM load-balancing failure, in collaboration with Hugo, closing #485.
The real problem was stale FEM field state after repartition. When Alpine triggered load balancing, the solver called LagrangeSpace::initialize(mesh, layout) again, but the internal FEM workspace field resultField in src/FEM/LagrangeSpace.hpp was only passed through initialize(...). Since Field::initialize() is one-shot, that call became a no-op after the first setup, so resultField kept the old layout and old extents after repartition.

The next evaluateAx() then operated on a workspace that no longer matched the redistributed domain decomposition, which explains the failed warmup solve, the huge t=0 field energy, and the later halo-exchange crashes.
The fix is to explicitly resize that workspace on every FEM reinitialization:
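The shape of that fix, sketched against hypothetical types (the names mirror the PR discussion, but the real IPPL signatures may differ):

```cpp
#include <cassert>

struct Layout { int nLocal = 0; };
struct Mesh {};

// Mock workspace field with one-shot initialize(), as discussed above.
struct WorkspaceField {
    bool initialized = false;
    Layout layout;
    void initialize(const Mesh&, const Layout& l) {
        if (initialized) return;  // one-shot: no-op after first setup
        layout = l;
        initialized = true;
    }
    void updateLayout(const Layout& l) { layout = l; }
};

struct LagrangeSpace {
    WorkspaceField resultField;
    void initialize(const Mesh& mesh, const Layout& layout) {
        // Initialize or resize the workspace field to the current layout.
        resultField.initialize(mesh, layout);
        resultField.updateLayout(layout);  // the fix: never leave it stale
    }
};
```

With the extra updateLayout, re-running initialize() after a repartition resizes the workspace instead of silently keeping the old decomposition.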
Behavior After This Change
With this patch, the reduced FEM load-balanced runs no longer exhibit the old failure mode:
This should not change the intended FEM discretization, conservation properties, or solver formulation. It is a state-management fix after repartition, not a numerical algorithm change. The numerics follow the same code path as before load balancing; the difference is that the redistributed layout is now actually propagated into the FEM workspace, along with refreshed BC metadata.
Validation
I validated the fix in two stages.
Local reduced 2-rank CPU case:
A100 2-GPU reduced case on Merlin: