DNM: kola testiso folding proof of concept#4511
DNM: kola testiso folding proof of concept#4511dustymabe wants to merge 13 commits intocoreos:mainfrom
Conversation
Store the test execution timeout context on RuntimeConfig.TestExecTimeout so that BaseCluster.SSH can enforce the timeout without requiring callers to pass a context.Context through every function signature. The harness sets TestExecTimeout to h.TimeoutContext() when building the RuntimeConfig in runTest(). BaseCluster.SSH uses Start()+Wait() with a select on this context, closing the SSH session when the context is cancelled. This enables us to essentially have timeout checking on every SSH() call we do. Written-by: <anthropic/claude-opus-4-6>
There were two deferred functions and we might as well combine them into one.
In the previous commit we plumbed through timeout/cancelling into every SSH command so now we don't really need to SkipStartMachine and then call mach.Start() inside a RunWithExecTimeoutCheck() here in RunTest any longer.
Some tests don't start their machines via the Harness, but rather directly in the tests via `NewMachine()`. In those cases we were releasing the memory reservation prematurely. Let's asynchronously wait for the machines to be up and then release the memory reservation.
The early return guard that checks t.ReservedMemoryCountMiB == 0 was performed outside the mutex. Since releaseMemoryCount can be called concurrently from both the async memory-release goroutine and the deferred cleanup function, this unsynchronized read could see a stale value under the Go memory model, potentially causing a double-subtract. Move the check inside the mutex to ensure the read of t.ReservedMemoryCountMiB has a proper happens-before relationship with the write that zeroes it. Written-by: <anthropic/claude-opus-4-6>
Maintaining separate QemuMachineOptions and MachineOptions structs led to complications: awkward two-level nesting at call sites, a shadowed Firmware field, and type assertions needed just to pass QEMU-specific options. Move the QEMU-only fields (HostForwardPorts, DisablePDeathSig, OverrideBackingFile, Nvme, Cex) into MachineOptions and delete the QemuMachineOptions type. Add EnsureNoQEMUOnlyOptions() so non-QEMU platforms (aws, azure, gcp, do, esx, openstack) reject these options early with a clear error. Keep NewMachineWithQemuOptions as a thin alias on qemu/qemuiso clusters for backward compatibility. Written-by: <anthropic/claude-opus-4.6>
MultiPathDisk, PrimaryDisk, MinMemory, NumaNodes, AdditionalNics, AppendKernelArgs, AppendFirstbootKernelArgs, and Firmware are only implemented by QEMU-based platforms. Move them below the QEMU-only comment boundary and check them in EnsureNoQEMUOnlyOptions(), removing the now-redundant per-platform rejection checks in the non-QEMU cluster implementations. The remaining cross-platform fields are AdditionalDisks (qemu, qemuiso, azure, gcloud), MinDiskSize (qemu, aws), and InstanceType (azure). Written-by: <anthropic/claude-opus-4.6>
Now that QemuMachineOptions has been folded into MachineOptions, NewMachineWithQemuOptions is just a trivial wrapper around NewMachineWithOptions. Delete it and update all callers to use NewMachineWithOptions through the Cluster interface directly. This removes the need for type assertions to *qemu.Cluster in tests that only used them to access NewMachineWithQemuOptions. Tests that still need the type assertion (luks tang setup, ostree sync, rhcos upgrade) keep it for platform-specific branching logic, but call NewMachineWithOptions through the interface rather than through the concrete type. Written-by: <anthropic/claude-opus-4.6>
Previously, kola spawn only entered the DisablePDeathSig / spawnMachineOptions code path when the platform was "qemu". This guard existed because DisablePDeathSig and the machine options JSON file were QEMU-specific concepts that required a type assertion to *qemu.Cluster. Now that MachineOptions is the single unified type used by all platforms, and non-QEMU platforms call EnsureNoQEMUOnlyOptions() in their NewMachineWithOptions(), the guard is no longer necessary: - On QEMU: behavior is unchanged. - On non-QEMU with --remove=false: previously this silently ignored the user's intent (DisablePDeathSig has no meaning outside QEMU) and created the machine via NewMachine() with no options. Now it calls NewMachineWithOptions() which will reject DisablePDeathSig with a clear error, surfacing the fact that the flag combination is unsupported rather than silently doing the wrong thing. - On non-QEMU with --machine-options: same improvement — the user gets a clear error if the JSON contains QEMU-only fields. Written-by: <anthropic/claude-opus-4.6>
…ptions The Test struct duplicated many fields from platform.MachineOptions (MultiPathDisk, AdditionalDisks, PrimaryDisk, MinMemory, MinDiskSize, NumaNodes, AdditionalNics, AppendKernelArgs, AppendFirstbootKernelArgs, InstanceType). The harness then manually copied each one into a MachineOptions before creating machines. Replace all of these with a single MachineOptions field on the Test struct. The harness now uses t.MachineOptions directly. Fields that are not MachineOptions (InjectContainer, ReservedMemoryCountMiB) stay on the Test struct. Update the two test files (multipath.go, network.go) that set these fields in their registrations, and update harness.go external test registration to populate the MachineOptions sub-struct. Written-by: <anthropic/claude-opus-4.6>
The basic.uefi, basic.uefi-secure, and basic.nvme tests previously set ClusterSize to 0 and manually created machines inside wrapper functions (uefiWithBasicTests, uefiSecureWithBasicTests, nvmeBasicTests) just to pass Firmware/Nvme in MachineOptions. They then manually called ScpKolet and LocalTests. Now that the Test struct carries MachineOptions directly, these tests can set ClusterSize to 1 with the desired MachineOptions and use LocalTests as their Run function. The harness handles machine creation, kolet upload, and native function dispatch automatically. This removes the three wrapper functions, runBasicTests, and the kola/cluster imports that were only needed for them. Written-by: <anthropic/claude-opus-4.6>
|
Skipping CI for Draft Pull Request. |
There was a problem hiding this comment.
Code Review
This pull request performs a major refactoring of the kola test harness and platform abstraction layer. It consolidates QEMU-specific machine options into a unified MachineOptions struct and introduces a TestExecTimeout context within the platform's SSH implementation to handle command timeouts automatically. Additionally, it integrates the install-media tests into the standard kola registry and simplifies machine creation across several test suites. I have no feedback to provide as there were no review comments to assess.
This merges mantle/cmd/kola/testiso.go into mantle/kola/tests/install-media/test-install-media.go. It keeps most of the actual testing the same but switches it so that now the tests are just registered with the harness and then called just like any other kola test. There is a lot of work still to do to dedupe code with mantle/platform/qemu but it at least compiles and the tests run.
They can be run as part of the normal kola run now. Let's just hack the jenkinsfile up a bit to test this out.
ffb0696 to
b77efce
Compare
| root.AddCommand(cmdTestIso) | ||
| for _, test := range getAllTests(kola.CosaBuild) { | ||
| register.RegisterTest(®ister.Test{ | ||
| Name: "install-media." + test, |
There was a problem hiding this comment.
This is a good opportunity to reclassify these tests as "install-media" tests because not all of them actually test ISO images.
| register.RegisterTest(®ister.Test{ | ||
| Name: "install-media." + test, | ||
| Description: "The %s install test", | ||
| Run: runTestInstall, |
There was a problem hiding this comment.
the runTestInstall function is mostly what we had in the past. It parses the test name and then dispatches the right test.
| addNmKeyfile = false | ||
| enable4k = false | ||
| enableMultipath = false | ||
| enableUefi = false | ||
| enableUefiSecure = false | ||
| isOffline = false |
There was a problem hiding this comment.
These globals won't really work if we were to run tests in parallel. @nikita-dubrovskii PR #4377 doesn't have this problem.
|
@dustymabe: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This builds on #4509 and #4510
and shows a way to keep some the testiso tests being defined the way they are today where there is just a list of "test names" that are then translated into actual tests that get run based on each of the components in the name (i.e. each component of
install-media.miniso-install.4k.uefimeans something and the test function picks up on it and does the right thing.