Core: Add V4 location relativization utilities#16174
Conversation
| * against bare local paths. | ||
| */ | ||
| static String resolve(String path, String tableLocation) { | ||
| if (path == null || isAbsolute(path) || !isAbsolute(tableLocation)) { |
There was a problem hiding this comment.
!isAbsolute(tableLocation) check doesn't make sense. By definition, the table location must be absolute or omitted, so this can only return false. At this point, the table location should have been provided (either from metadata or from the catalog). We should assert that it's not null, but probably not check at this point that it is a valid uri (I feel we should assume that check has already been performed).
| } | ||
|
|
||
| @Test | ||
| public void testRelativizeLocationNotUnderTableLocation() { |
There was a problem hiding this comment.
I'd like for this to test both a table/path mismatch and a bucket mismatch.
rdblue
left a comment
There was a problem hiding this comment.
I think the code is correct. There are a couple minor comments on the tests, but otherwise I think this is good.
| * location is returned as-is. | ||
| */ | ||
| public static String relativizeLocation(String tableLocation, String location) { | ||
| if (location.startsWith(tableLocation)) { |
There was a problem hiding this comment.
Prefix-collision bug: startsWith matches sibling locations that share a prefix with the table location.
relativizeLocation("s3://bucket/table", "s3://bucket/table_v2/data/00000-0.parquet")
-> "_v2/data/00000-0.parquet" // wrong; this location is NOT under tableLocation
The new testRelativizeLocationNotUnderTableLocation covers different-bucket and different-path cases, but neither shares a prefix with table, so this case still slips through. Worth a test for the table vs table_v2 style collision.
One possible solution if the path separator is / char.
if (location.startsWith(tableLocation)) {
String rest = location.substring(tableLocation.length());
if (rest.isEmpty() || rest.startsWith("/")) {
return rest;
}
}
return location;There was a problem hiding this comment.
I had this defensive check in the initial version of this PR, but removed it because I got feedback that we should not do assumptions about path separators. cc @danielcweeks
There was a problem hiding this comment.
The issue here isn't really about whether / is a separator — at the storage layer, S3 is a flat keyspace (bucket + opaque object key), and there's no folder concept in S3 itself. Requiring path-separator semantics at that level isn't right.
The actual problem with byte-prefix startsWith is relocation correctness, which is the whole purpose of supporting relative paths.
Same-location round-trip looks fine in isolation:
tableLocation = "s3://bucket/table"
file = "s3://bucket/table_v2/file" (a sibling of the table, not a child)
relativize -> "_v2/file"
resolve back -> "s3://bucket/table_v2/file" ✓ round-trips
But move the table to s3://newbucket/newtable and resolve the same stored relative form:
resolve("s3://newbucket/newtable", "_v2/file")
-> "s3://newbucket/newtable_v2/file"
The file that originally lived at s3://bucket/table_v2/file (a sibling of the old table) now resolves to s3://newbucket/newtable_v2/file, with no reason that key should exist or contain the right data. A file that isn't structurally a child of the table location cannot survive relocation when stored as a relative path, and the whole point of relative paths is to make relocation safe.
The spec PR's rule is "If a file's absolute path shares a common prefix with the table location, the relative portion should be stored. Otherwise, the absolute path should be stored." — but it doesn't pin down what "common prefix" means. Byte-prefix permits the case above; a boundary/segment-aware notion forbids it. The relocation argument points to the latter — otherwise relative paths are unsafe for any sibling-prefix layout.
cc @rdblue / @danielcweeks: should s3://bucket/table_v2/file be relativizable against s3://bucket/table? If not, the spec should make that explicit, and relativizeLocation should refuse to produce a relative form in that case.
There was a problem hiding this comment.
I think the scenario Steven described is definitely possible. I had called out this in the proposal doc.
I agree that this is something we need to consider in the spec.
There was a problem hiding this comment.
I think it's reasonable to add defensive checks if it can be done efficiently in the reference implementation, but I'm not convinced there's a way to cleanly solve for this in the spec. This scenario should resolve to an absolute path, but I don't think we can provide a concrete solution without relying on a separator character.
There was a problem hiding this comment.
This scenario should resolve to an absolute path
Great. This is the first thing that we need to agree on.
but I don't think we can provide a concrete solution without relying on a separator character.
This is the tricky part. The location is an URI (not a s3 object key string). It might be reasonable to assume / as a separator character for hierarchies?
The spec already mentioned the path separator. Can we require that the relative portion must start with the path separator?
Paths used as prefixes must not end in a path separator.
The relative portion is appended to the prefix without
introduction of any additional separator characters.
There was a problem hiding this comment.
Can we require that the relative portion must start with the path separator
Alternatively, we can flip it require that the table location must end with a path separator. Both options solve this problem.
Implementors will find it a bit surprising that the relative path starts with a /. For instance, if you use the Rust Url libraries you get surprising results:
use url::Url;
fn main() {
let base = Url::parse("s3://bucket/mytable").unwrap();
let iceberg_url = base.join("/data/file.parquet").unwrap();
# prints out: Iceberg absolute: s3://bucket/data/file.parquet (mytable is dropped)
println!("Iceberg absolute: {iceberg_url}");
}
This can be avoided if we flip it and require the location must end with the separator instead. The downside of this approach is that we might lose the early termination done in the current version of this PR and may have to look at more characters for a colon instead. But honestly it may not make much of a difference in modern CPUs if JVM is using JIT compilation with SIMD instructions.
There was a problem hiding this comment.
There was originally pushback against including the separator at the end of the table prefix (I don't remember the specific argument, but I'll look into that). I don't see a good reason at this point to not change the separator location. That would also align better with what people intuitively think of when looking at a path (relative paths typically don't start with a /).
I'll follow up on this.
There was a problem hiding this comment.
Thanks for the follow up. The new spec does solve the problem. I have updated the code in this PR accordingly.
Add isAbsolute, resolve, and relativize methods for converting between absolute and relative file paths. These will be used by v4 metadata to store locations relative to the table location.
|
@danielcweeks @rdblue @stevenzwu @nastra FYI, I have updated the implementation of relative paths based on the latest relative path spec PR. |
|
|
||
| String resolved = LocationUtil.resolveLocation(tableLocation, relativized); | ||
| assertThat(resolved).isEqualTo(absoluteLocation); | ||
| } |
There was a problem hiding this comment.
maybe also add a test where tableLocation ends with / or the relative location starts with / to cover the case that is mentioned in the spec: https://github.com/apache/iceberg/pull/15630/changes#diff-36347a47c3bf67ea2ef6309ea96201814032d21bb5f162dfae4045508c15588aR213
| * as-is. Otherwise, the table location and the relative location are joined by the URI separator | ||
| * character {@code /}. | ||
| */ | ||
| public static String resolveLocation(String tableLocation, String location) { |
There was a problem hiding this comment.
minor: we're not guarding against null parameters here and in relativizeLocation. Also might be worth adding a comment that this is expecting tableLocation to not end with / or location to not start with /
There was a problem hiding this comment.
did you forget to push the guarding against null parameters?
|
|
||
| @Test | ||
| public void testResolveRelativeLocations() { | ||
| String tableLocation = "s3://bucket/table"; |
There was a problem hiding this comment.
These are good tests, but now that the spec contains a table of examples, I would make sure they include all of the cases from the spec. I think that just means changing bucket/table to bucket/db/table.
| } | ||
|
|
||
| @Test | ||
| public void testResolveAbsoluteLocationsUnchanged() { |
There was a problem hiding this comment.
Similar to the comment below, it would be good to have a different bucket and a matching bucket different path test.
| } | ||
|
|
||
| @Test | ||
| public void testRelativizeLocationEqualToTableLocation() { |
There was a problem hiding this comment.
I agree with this decision.
| } | ||
|
|
||
| @Test | ||
| public void testResolveWithTrailingOrLeadingSlashProducesDuplicateSeparator() { |
There was a problem hiding this comment.
I'm glad to see this. I would also add the examples from the spec here.
There was a problem hiding this comment.
Yes, added examples from spec.
Add utility methods for converting between absolute and relative file paths. These will be used by v4 metadata to store locations relative to the table location.
Proposal: https://s.apache.org/iceberg-spec-relative-path
Spec PR: #15630