Whilst configuring machines with the grubcmdline the check for the command line is not idempotent, so the same machine will reboot every time.
I suspect this is either
- related to
crashkernel appearing multiple times in the args (which we should fix in the caller, but should the role does not guard against) or
resume=UUID gets packed into the cmdline multiple times
The role should either throw a fatal: message, or not notify the reboot handler in these cases
Logs
TASK [stackhpc.linux.grubcmdline : Notify reboot handler] **********************
Friday 20 December 2024 11:22:41 +0000 (0:00:00.049) 0:00:08.921 *******
changed: [hv.example.com] =>
msg: Old command line was BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-503.15.1.el9_5.x86_64 root=UUID=6caabe9f-f426-4105-a38e-52e98545a68a ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=
c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
....
TASK [stackhpc.linux.grubcmdline : Display GRUB_CMDLINE_LINUX_DEFAULT] *********
Friday 20 December 2024 11:22:42 +0000 (0:00:00.039) 0:00:09.543 *******
ok: [hv.example.com] =>
grub_cmdline_linux_default: crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
...
TASK [stackhpc.linux.grubcmdline : Display GRUB_CMDLINE_LINUX] *****************
Friday 20 December 2024 11:22:42 +0000 (0:00:00.038) 0:00:09.582 *******
ok: [hv.example.com] =>
grub_cmdline_linux: crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
.....
.....
TASK [stackhpc.linux.grubcmdline : Display newly computed GRUB_CMDLINE_LINUX_DEFAULT] ***
Friday 20 December 2024 11:22:42 +0000 (0:00:00.046) 0:00:09.816 *******
ok: [hv.example.com] =>
grub_cmdline_linux_new:
- crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
- resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4
- rhgb
- quiet
- crashkernel=auto
- rhgb
- quiet
- crashkernel=auto
After reboot:
cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-503.15.1.el9_5.x86_64 root=UUID=6caabe9f-f426-4105-a38e-52e98545a68a ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
Calling Module
vars:
kernel_cmdline_args:
- "rhgb"
- "quiet"
- "crashkernel=auto"
kernel_cmdline_args_remove:
- hugepage
- selinux
roles:
- name: stackhpc.linux.grubcmdline
vars:
kernel_cmdline: "{{ kernel_cmdline_args }}"
kernel_cmdline_remove: "{{ kernel_cmdline_args_remove }}"
Own thoughts:
- Changing the default parameters on
/etc/grub2.cfg or grubby args is working around the existing tooling built into the OS and grub:
Adding args should be done through templating files into /etc/grub.d for each override (RH Grubby Docs and StackOverflow example), such as /etc/grub.d/60-stackhpc-grubcmdline-quiet then simply invoke grubby or grub-mkconfig without args
Removing entries would still edit /etc/default/grub, but ansible.builtin.lineinfile should do the "surgical cuts" instead to minimise the changes to the file rather than trying generate a "new" cmdline using Jinja templates (which could generate wrong and result in an unbootable machine).
Alternatively, a grub script can be templated out /etc/grub.d which iterates over the GRUB_CMDLINE_LINUX* variables and appends every argument back except the matching one using the for in and if statements built into grub script. This would be more correct, but I don't have an adaptable example to hand.
The biggest advantages are:
- Prevents the role fighting upstream changes if they change the default cmdline (e.g. in-place OS upgrades)
- Would avoid the problems with the UUID
- Makes troubleshooting easy to diagnose if it's a distro or kernel flag change:
mv /etc/grub.d/*stackhpc-grubcmdline /etc/grub.d.bak && grubby ...
- Idempotent by default, template through ansible then use a handler for re-running
grubby and notify reboot if any template / lineinfile has changed
If not the above changes:
- I think
old_cmdline != kernel_cmdline | select() | sort | list is potentially brittle as there will be more problems like this in the future. Instead it could be something akin to the following pseudo code:
changed_when: not(all(grub_cmdline_linux_new in kernel_cmdline)) or any(grub_cmdline_linux_remove in kernel_cmdline)
This would only check for keywords the user has explicitly set in the role, rather than being affected by parameters (such as root=UUID=abc...) which they have not specified at all. However, this still leaves the potential problems around directly generating /etc/default/grub
Whilst configuring machines with the
grubcmdlinethe check for the command line is not idempotent, so the same machine will reboot every time.I suspect this is either
crashkernelappearing multiple times in the args (which we should fix in the caller, but should the role does not guard against) orresume=UUIDgets packed into the cmdline multiple timesThe role should either throw a
fatal:message, or not notify the reboot handler in these casesLogs
After reboot:
Calling Module
Own thoughts:
/etc/grub2.cfgor grubby args is working around the existing tooling built into the OS and grub:Adding args should be done through templating files into
/etc/grub.dfor each override (RH Grubby Docs and StackOverflow example), such as/etc/grub.d/60-stackhpc-grubcmdline-quietthen simply invokegrubbyorgrub-mkconfigwithout argsRemoving entries would still edit
/etc/default/grub, butansible.builtin.lineinfileshould do the "surgical cuts" instead to minimise the changes to the file rather than trying generate a "new" cmdline using Jinja templates (which could generate wrong and result in an unbootable machine).Alternatively, a grub script can be templated out
/etc/grub.dwhich iterates over theGRUB_CMDLINE_LINUX*variables and appends every argument back except the matching one using thefor inandifstatements built into grub script. This would be more correct, but I don't have an adaptable example to hand.The biggest advantages are:
mv /etc/grub.d/*stackhpc-grubcmdline /etc/grub.d.bak && grubby ...grubbyandnotify rebootif any template /lineinfilehas changedIf not the above changes:
old_cmdline != kernel_cmdline | select() | sort | listis potentially brittle as there will be more problems like this in the future. Instead it could be something akin to the following pseudo code:This would only check for keywords the user has explicitly set in the role, rather than being affected by parameters (such as
root=UUID=abc...) which they have not specified at all. However, this still leaves the potential problems around directly generating/etc/default/grub