Skip to content

grubcmdline notifies reboot handler unnecessarily  #36

@DavidFair

Description

@DavidFair

Whilst configuring machines with the grubcmdline the check for the command line is not idempotent, so the same machine will reboot every time.

I suspect this is either

  • related to crashkernel appearing multiple times in the args (which we should fix in the caller, but should the role does not guard against) or
  • resume=UUID gets packed into the cmdline multiple times

The role should either throw a fatal: message, or not notify the reboot handler in these cases

Logs

TASK [stackhpc.linux.grubcmdline : Notify reboot handler] **********************
Friday 20 December 2024  11:22:41 +0000 (0:00:00.049)       0:00:08.921 ******* 
changed: [hv.example.com] => 
  msg: Old command line was BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-503.15.1.el9_5.x86_64 root=UUID=6caabe9f-f426-4105-a38e-52e98545a68a ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=
c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
....
TASK [stackhpc.linux.grubcmdline : Display GRUB_CMDLINE_LINUX_DEFAULT] *********
Friday 20 December 2024  11:22:42 +0000 (0:00:00.039)       0:00:09.543 ******* 
ok: [hv.example.com] => 
  grub_cmdline_linux_default: crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
...
TASK [stackhpc.linux.grubcmdline : Display GRUB_CMDLINE_LINUX] *****************
Friday 20 December 2024  11:22:42 +0000 (0:00:00.038)       0:00:09.582 ******* 
ok: [hv.example.com] => 
  grub_cmdline_linux: crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto
.....
.....
  
  
TASK [stackhpc.linux.grubcmdline : Display newly computed GRUB_CMDLINE_LINUX_DEFAULT] ***
Friday 20 December 2024  11:22:42 +0000 (0:00:00.046)       0:00:09.816 ******* 
ok: [hv.example.com] => 
  grub_cmdline_linux_new:
  - crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
  - resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4
  - rhgb
  - quiet
  - crashkernel=auto
  - rhgb
  - quiet
  - crashkernel=auto

After reboot:

cat /proc/cmdline 
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-503.15.1.el9_5.x86_64 root=UUID=6caabe9f-f426-4105-a38e-52e98545a68a ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=c38017eb-0531-46fa-bf97-a9265a9caed4 rhgb quiet crashkernel=auto rhgb quiet crashkernel=auto

Calling Module

  vars:
    kernel_cmdline_args:
      - "rhgb"
      - "quiet"
      - "crashkernel=auto"
    kernel_cmdline_args_remove:
      - hugepage
      - selinux
  roles:
    - name: stackhpc.linux.grubcmdline
      vars:
        kernel_cmdline: "{{ kernel_cmdline_args }}"
        kernel_cmdline_remove: "{{ kernel_cmdline_args_remove }}"

Own thoughts:

  • Changing the default parameters on /etc/grub2.cfg or grubby args is working around the existing tooling built into the OS and grub:

Adding args should be done through templating files into /etc/grub.d for each override (RH Grubby Docs and StackOverflow example), such as /etc/grub.d/60-stackhpc-grubcmdline-quiet then simply invoke grubby or grub-mkconfig without args

Removing entries would still edit /etc/default/grub, but ansible.builtin.lineinfile should do the "surgical cuts" instead to minimise the changes to the file rather than trying generate a "new" cmdline using Jinja templates (which could generate wrong and result in an unbootable machine).
Alternatively, a grub script can be templated out /etc/grub.d which iterates over the GRUB_CMDLINE_LINUX* variables and appends every argument back except the matching one using the for in and if statements built into grub script. This would be more correct, but I don't have an adaptable example to hand.

The biggest advantages are:

  • Prevents the role fighting upstream changes if they change the default cmdline (e.g. in-place OS upgrades)
  • Would avoid the problems with the UUID
  • Makes troubleshooting easy to diagnose if it's a distro or kernel flag change: mv /etc/grub.d/*stackhpc-grubcmdline /etc/grub.d.bak && grubby ...
  • Idempotent by default, template through ansible then use a handler for re-running grubby and notify reboot if any template / lineinfile has changed

If not the above changes:

  • I think old_cmdline != kernel_cmdline | select() | sort | list is potentially brittle as there will be more problems like this in the future. Instead it could be something akin to the following pseudo code:
changed_when: not(all(grub_cmdline_linux_new in kernel_cmdline)) or any(grub_cmdline_linux_remove in kernel_cmdline)

This would only check for keywords the user has explicitly set in the role, rather than being affected by parameters (such as root=UUID=abc...) which they have not specified at all. However, this still leaves the potential problems around directly generating /etc/default/grub

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions