fix(ds-identify): parse multi-line YAML datasource_list#6883
Conversation
Multi-line YAML lists are valid YAML. The prior code assumed that
multi-line datasource_list values could not be parsed, but both
multi-line flow sequences and block sequences are standard YAML:
# multi-line flow sequence
datasource_list: [
Azure
]
# block sequence
datasource_list:
- Azure
- None
The existing read_datasource_list() only handled single-line flow
sequences (e.g. 'datasource_list: [Azure, None]'). When a config
file contains a multi-line flow sequence, the parser saw only '['
without the closing ']', produced an empty dslist, and fell back to
the full default datasource list. On Azure VMs this meant
ds-identify checked every datasource unnecessarily.
Add read_datasource_list_multiline() to handle both multi-line flow
sequences and block sequences as a fallback when single-line parsing
produces no result. Also tighten get_single_line_flow_sequence() to
reject values that lack a closing ']', so incomplete flow sequences
fall through to the multi-line parser correctly.
An alternative approach would be to shell out to python3 for YAML
parsing (e.g. python3 -c 'import yaml; ...'), since python3 is a
hard dependency of cloud-init and is guaranteed to be present.
This would handle every valid YAML construct correctly with zero
edge cases and much less code. However, ds-identify is designed as
a pure POSIX shell script for speed — it runs as a systemd generator
early in boot where startup latency matters, and on systems where
cloud-init should be disabled it avoids the cost of a python3
interpreter entirely. The shell-based approach preserves that
property while covering the formats seen in practice. We could
maybe do one and attempt to fall back to the other?
When datasource_list is present in config but cannot be parsed by
any method, ds-identify now emits a distinct error indicating the
key was found but unparsable. I think maybe this error needs to go
to console or otherwise be detected by cloud-init later and escalated
as part of status, etc.
663ee0f to
5800d1d
Compare
| local files="" f="" line="" val="" mode="" dslist="" found_file="" | ||
| local flow_content="" | ||
| files="${PATH_ETC_CI_CFG} ${PATH_ETC_CI_CFG_D}/*.cfg" | ||
| # shellcheck disable=2086 |
There was a problem hiding this comment.
Could you explain the need to disable SC2086? I see that the files var contains a string that expands two constants with a space in between them. Then, on line 669, files are set without quoting $files, so the files var will expand into two strings based on the space in the $files string acting as an implicit delimiter. Is this intended behavior, and if so, does $PATH_ETC_CI_CFG contain (or not possess) a file extension? It would seem that this is intended and avoids the need to iterate over a list object for each file name when setting the files, but I want to be sure.
| mode="" | ||
| while IFS= read -r line || [ -n "$line" ]; do | ||
| line="${line%%#*}" | ||
| if [ "$mode" = "flow" ]; then |
There was a problem hiding this comment.
There's been a lot of implicit use of conditionals with test ([ condition ]) commands, but here, we explicitly set the if/then condition. Is there a reason for mixing and matching implicit and explicit conditional use? My preference is explicit for human readability, but ultimately, this should probably stay consistent with whatever the other files favor unless there is a reason to mix the two.
Multi-line YAML lists are valid YAML. The prior code assumed that multi-line datasource_list values could not be parsed, but both multi-line flow sequences and block sequences are standard YAML:
multi-line flow sequence
datasource_list: [
Azure
]
block sequence
datasource_list:
- Azure
- None
The existing read_datasource_list() only handled single-line flow sequences (e.g. 'datasource_list: [Azure, None]'). When a config file contains a multi-line flow sequence, the parser saw only '[' without the closing ']', produced an empty dslist, and fell back to the full default datasource list. On Azure VMs this meant ds-identify checked every datasource unnecessarily.
Add read_datasource_list_multiline() to handle both multi-line flow sequences and block sequences as a fallback when single-line parsing produces no result. Also tighten get_single_line_flow_sequence() to reject values that lack a closing ']', so incomplete flow sequences fall through to the multi-line parser correctly.
An alternative approach would be to shell out to python3 for YAML parsing (e.g. python3 -c 'import yaml; ...'), since python3 is a hard dependency of cloud-init and is guaranteed to be present. This would handle every valid YAML construct correctly with zero edge cases and much less code. However, ds-identify is designed as a pure POSIX shell script for speed — it runs as a systemd generator early in boot where startup latency matters, and on systems where cloud-init should be disabled it avoids the cost of a python3 interpreter entirely. The shell-based approach preserves that property while covering the formats seen in practice. We could maybe do one and attempt to fall back to the other?
When datasource_list is present in config but cannot be parsed by any method, ds-identify now emits a distinct error indicating the key was found but unparsable. I think maybe this error needs to go to console or otherwise be detected by cloud-init later and escalated as part of status, etc.