[SPEC] Add relative paths to v4 spec #15630
steveloughran left a comment
Commented a bit on `..` in path resolution.
It'd be a good test to submit a v3 manifest with `file:/tables/table1/../../etc/passwd` as a path and see whether relativizing it detects the invalid path at that point.
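For illustration, here is a minimal Python sketch of what such a test might look like. It assumes the prefix-stripping relativization rule proposed in this PR, and `escapes_base` is a hypothetical helper, not part of the spec or of any Iceberg API. Whether relativization is expected to reject such a path is exactly the open question.

```python
import posixpath

# Values taken from the comment above.
table_location = "file:/tables/table1"
path = "file:/tables/table1/../../etc/passwd"

# Assumed rule: strip the table location plus one separator when it is a prefix.
prefix = table_location + "/"
relative = path[len(prefix):] if path.startswith(prefix) else path


def escapes_base(relative_path: str) -> bool:
    """Hypothetical check: True if '..' segments climb above the base location."""
    normalized = posixpath.normpath(relative_path)
    return normalized == ".." or normalized.startswith("../")


print(relative)                # ../../etc/passwd
print(escapes_base(relative))  # True
```

With plain prefix stripping the path relativizes to `../../etc/passwd` without complaint, so detecting it would need an explicit check along the lines of the hypothetical one here.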
rambleraptor left a comment
I've got a couple of stylistic suggestions to help improve readability.
stevenzwu left a comment
Overall, it looks good to me. Just some minor comments/questions.
wypoon left a comment
I feel that you're trying to avoid mentioning path separators, and that that makes things unclear and confusing. I feel that it makes more sense to say that table location does not end in a path separator, that relative paths do not begin with a path separator, and when appending relative paths, we need to add the path separators in the appropriate places.
| All location fields in format versions 3 and prior contain fully-qualified paths. | ||
| Version 4 of the Iceberg spec adds support for relative locations in metadata, enabling tables to be relocated without rewriting metadata files. Relative locations are allowed in all metadata tracked location fields and are resolved against the table's base location. The table's location may be fixed in table metadata or inferred, but is intended to be managed and supplied by a catalog. Requirements for relativization and resolution are in [Relative Paths](#path-resolution) |
do you want to link to #paths-in-metadata?
+1, that includes Path Relativization and Path Resolution. Relative Paths is a little confusing if it does not link to the section with the same name.
kevinjqliu left a comment
LGTM
slight nit for clarification
| Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. | ||
| * If an absolute path starts with the table location immediately followed by a separator character, the relative path is the remainder of the string after the separator character. | ||
| * If an absolute path does not start with the table location immediately followed by the separator character, it is stored as an absolute path. |
nit: It might be helpful to explicitly state "stored as an absolute path without modification".
I'm not sure we need to clarify. I think that's important to say for consuming from metadata, but how the persisted path is arrived at is different. If something is producing the path, it can pretty much do whatever it wants with the structure as long as it's absolute when it's first persisted.
This is a little nuanced, but I think it would be overreaching in this particular context.
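As a reading aid, the two relativization bullets quoted in this thread could be sketched in Python as follows. The function name `relativize` and the example paths are illustrative assumptions, not names from the spec or from any Iceberg implementation.

```python
SEPARATOR = "/"


def relativize(absolute_path: str, table_location: str) -> str:
    """Sketch of the quoted rule: strip '<table location>/' when it is a prefix."""
    prefix = table_location + SEPARATOR
    if absolute_path.startswith(prefix):
        # Relative path is the remainder of the string after the separator.
        return absolute_path[len(prefix):]
    # Not under the table location: keep the path as given.
    return absolute_path


print(relativize("s3://bucket/db/tbl/data/f.parquet", "s3://bucket/db/tbl"))
# data/f.parquet
print(relativize("s3://bucket/db/tbl2/data/f.parquet", "s3://bucket/db/tbl"))
# s3://bucket/db/tbl2/data/f.parquet
```

The second call shows why the rule requires the separator immediately after the location: `s3://bucket/db/tbl2/...` shares the string prefix `s3://bucket/db/tbl` but is not inside the table.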
| ### Paths in Metadata | ||
| Path strings stored in Iceberg metadata location fields are classified as one of two types: |
Nit: There are a few references to "fully qualified path" later in the context of v3 and prior, without it being explicitly defined. Since we're classifying paths into two types below, it might be worth briefly noting that fully qualified paths from v3 and prior are considered absolute paths. This could help connect the dots more easily.
We don't want to do this (see other comments on this topic). We don't want to go back and define things that weren't defined for prior versions since it could introduce additional requirements on older versions. The prior spec only referred to "fully-qualified" and "URI with Scheme" for fields and we're not trying to rewrite those versions of the spec.
| * If the path contains a URI scheme, it is absolute and is used without modification. | ||
| * If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. | ||
| The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below). |
I'm not sure we need the examples for duplicate separators. I think that's pretty straightforward?
Others asked for this explicitly to show what is expected if you suffix/prefix with a separator and what the behavior would look like. The point is to show that you do not de-dup or strip them.
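To make the "no de-dup, no strip" behavior concrete, here is a hedged Python sketch of the resolution rule quoted in this thread. The function name `resolve` and the regex used to detect a URI scheme (based on RFC 3986 scheme syntax) are assumptions; the quoted text does not say how "contains a URI scheme" is to be detected.

```python
import re

# Assumption: a scheme is a letter followed by letters, digits, '+', '.' or '-',
# then ':' (RFC 3986 scheme syntax).
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def resolve(path: str, table_location: str) -> str:
    """Sketch: absolute paths pass through; relative paths are joined with '/'."""
    if _SCHEME.match(path):
        return path
    # Plain concatenation: existing separators are neither stripped nor merged.
    return table_location + "/" + path


print(resolve("data/f.parquet", "s3://bucket/tbl"))        # s3://bucket/tbl/data/f.parquet
print(resolve("data/f.parquet", "s3://bucket/tbl/"))       # s3://bucket/tbl//data/f.parquet
print(resolve("/data/f.parquet", "s3://bucket/tbl"))       # s3://bucket/tbl//data/f.parquet
print(resolve("s3://other/x.parquet", "s3://bucket/tbl"))  # s3://other/x.parquet
```

The second and third calls show the doubled separator that results when the table location ends in `/` or the relative path begins with one, which is why the quoted text recommends a table location without a trailing separator.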
| #### Table Location Specification | ||
| When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be provided. How the table location is persisted or determined when not specified in metadata is not a table-level concern; catalogs should provide a table's location |
Suggested change:
| When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be maintained and provided by the catalog. |
We don't want to restrict this to catalogs only.
Please see this comment: #15630 (comment)
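A rough Python sketch of the precedence described in the quoted paragraph, for illustration only. `base_location` and `supplied_location` are hypothetical names, and the supplier is deliberately left generic rather than tied to a catalog.

```python
from typing import Mapping, Optional


def base_location(
    table_metadata: Mapping[str, object],
    supplied_location: Optional[str] = None,
) -> str:
    """Sketch: prefer the metadata `location` field, else an externally supplied one."""
    location = table_metadata.get("location")
    if location is not None:
        # Field present in table metadata: used directly as the base location.
        return str(location)
    if supplied_location is None:
        # Field absent (v4 and later): something outside the metadata must supply it.
        raise ValueError("table location must be provided when metadata has no `location`")
    return supplied_location


print(base_location({"location": "s3://bucket/tbl"}))     # s3://bucket/tbl
print(base_location({}, supplied_location="s3://b/tbl"))  # s3://b/tbl
```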
| The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below.) | ||
| Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. |
nit: I had to read this sentence multiple times to understand it
Me too ;), might need a comma somewhere.
| Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. |
It looks like we use `v4` instead of `V4` in all other places.
Adds text to the spec for relative paths.
See full proposal at #13141