Skip to content

Core: Add V4 location relativization utilities#16174

Open
anoopj wants to merge 11 commits into
apache:mainfrom
anoopj:v4-location-util
Open

Core: Add V4 location relativization utilities#16174
anoopj wants to merge 11 commits into
apache:mainfrom
anoopj:v4-location-util

Conversation

@anoopj
Copy link
Copy Markdown
Contributor

@anoopj anoopj commented Apr 30, 2026

Add utility methods for converting between absolute and relative file paths. These will be used by v4 metadata to store locations relative to the table location.

Proposal: https://s.apache.org/iceberg-spec-relative-path
Spec PR: #15630

@github-actions github-actions Bot added the core label Apr 30, 2026
@anoopj anoopj moved this to In review in V4: metadata tree Apr 30, 2026
Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
* against bare local paths.
*/
static String resolve(String path, String tableLocation) {
if (path == null || isAbsolute(path) || !isAbsolute(tableLocation)) {
Copy link
Copy Markdown
Contributor

@danielcweeks danielcweeks Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!isAbsolute(tableLocation) check doesn't make sense. By definition, the table location must be absolute or omitted, so this can only return false. At this point, the table location should have been provided (either from metadata or from the catalog). We should assert that it's not null, but probably not check at this point that it is a valid uri (I feel we should assume that check has already been performed).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
@anoopj anoopj force-pushed the v4-location-util branch from 1b8accb to ce7d435 Compare May 1, 2026 00:05
@anoopj anoopj requested a review from danielcweeks May 1, 2026 00:09
@anoopj anoopj force-pushed the v4-location-util branch from ce7d435 to cbfb44b Compare May 1, 2026 15:07
Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
@anoopj anoopj requested a review from danielcweeks May 6, 2026 21:59
Comment thread core/src/main/java/org/apache/iceberg/util/LocationUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/util/LocationUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/util/LocationUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/util/LocationUtil.java
Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestRelativePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RelativePathUtil.java Outdated
@anoopj anoopj force-pushed the v4-location-util branch from 681076f to fa10500 Compare May 8, 2026 15:12
@anoopj anoopj requested review from nastra and rdblue May 8, 2026 15:23
Comment thread core/src/test/java/org/apache/iceberg/util/TestLocationUtil.java Outdated
}

@Test
public void testRelativizeLocationNotUnderTableLocation() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like for this to test both a table/path mismatch and a bucket mismatch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Done.

Copy link
Copy Markdown
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code is correct. There are a couple minor comments on the tests, but otherwise I think this is good.

* location is returned as-is.
*/
public static String relativizeLocation(String tableLocation, String location) {
if (location.startsWith(tableLocation)) {
Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefix-collision bug: startsWith matches sibling locations that share a prefix with the table location.

relativizeLocation("s3://bucket/table", "s3://bucket/table_v2/data/00000-0.parquet")
  -> "_v2/data/00000-0.parquet"   // wrong; this location is NOT under tableLocation

The new testRelativizeLocationNotUnderTableLocation covers different-bucket and different-path cases, but neither shares a prefix with table, so this case still slips through. Worth a test for the table vs table_v2 style collision.

One possible solution if the path separator is / char.

  if (location.startsWith(tableLocation)) {
    String rest = location.substring(tableLocation.length());
    if (rest.isEmpty() || rest.startsWith("/")) {
      return rest;
    }
  }

  return location;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this defensive check in the initial version of this PR, but removed it because I got feedback that we should not do assumptions about path separators. cc @danielcweeks

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here isn't really about whether / is a separator — at the storage layer, S3 is a flat keyspace (bucket + opaque object key), and there's no folder concept in S3 itself. Requiring path-separator semantics at that level isn't right.

The actual problem with byte-prefix startsWith is relocation correctness, which is the whole purpose of supporting relative paths.

Same-location round-trip looks fine in isolation:

tableLocation = "s3://bucket/table"
file          = "s3://bucket/table_v2/file"   (a sibling of the table, not a child)
relativize    -> "_v2/file"
resolve back  -> "s3://bucket/table_v2/file"  ✓ round-trips

But move the table to s3://newbucket/newtable and resolve the same stored relative form:

resolve("s3://newbucket/newtable", "_v2/file")
  -> "s3://newbucket/newtable_v2/file"

The file that originally lived at s3://bucket/table_v2/file (a sibling of the old table) now resolves to s3://newbucket/newtable_v2/file, with no reason that key should exist or contain the right data. A file that isn't structurally a child of the table location cannot survive relocation when stored as a relative path, and the whole point of relative paths is to make relocation safe.

The spec PR's rule is "If a file's absolute path shares a common prefix with the table location, the relative portion should be stored. Otherwise, the absolute path should be stored." — but it doesn't pin down what "common prefix" means. Byte-prefix permits the case above; a boundary/segment-aware notion forbids it. The relocation argument points to the latter — otherwise relative paths are unsafe for any sibling-prefix layout.

cc @rdblue / @danielcweeks: should s3://bucket/table_v2/file be relativizable against s3://bucket/table? If not, the spec should make that explicit, and relativizeLocation should refuse to produce a relative form in that case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the scenario Steven described is definitely possible. I had called out this in the proposal doc.

I agree that this is something we need to consider in the spec.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's reasonable to add defensive checks if it can be done efficiently in the reference implementation, but I'm not convinced there's a way to cleanly solve for this in the spec. This scenario should resolve to an absolute path, but I don't think we can provide a concrete solution without relying on a separator character.

Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This scenario should resolve to an absolute path

Great. This is the first thing that we need to agree on.

but I don't think we can provide a concrete solution without relying on a separator character.

This is the tricky part. The location is an URI (not a s3 object key string). It might be reasonable to assume / as a separator character for hierarchies?

The spec already mentioned the path separator. Can we require that the relative portion must start with the path separator?

Paths used as prefixes must not end in a path separator. 
The relative portion is appended to the prefix without
 introduction of any additional separator characters.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we require that the relative portion must start with the path separator

Alternatively, we can flip it require that the table location must end with a path separator. Both options solve this problem.

Implementors will find it a bit surprising that the relative path starts with a /. For instance, if you use the Rust Url libraries you get surprising results:

use url::Url;

fn main() {
      let base = Url::parse("s3://bucket/mytable").unwrap();

      let iceberg_url = base.join("/data/file.parquet").unwrap();
      # prints out: Iceberg absolute: s3://bucket/data/file.parquet (mytable is dropped)
      println!("Iceberg absolute: {iceberg_url}");
}

This can be avoided if we flip it and require the location must end with the separator instead. The downside of this approach is that we might lose the early termination done in the current version of this PR and may have to look at more characters for a colon instead. But honestly it may not make much of a difference in modern CPUs if JVM is using JIT compilation with SIMD instructions.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was originally pushback against including the separator at the end of the table prefix (I don't remember the specific argument, but I'll look into that). I don't see a good reason at this point to not change the separator location. That would also align better with what people intuitively think of when looking at a path (relative paths typically don't start with a /).

I'll follow up on this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow up. The new spec does solve the problem. I have updated the code in this PR accordingly.

Comment thread core/src/main/java/org/apache/iceberg/util/LocationUtil.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/util/TestLocationUtil.java
anoopj added 8 commits May 19, 2026 13:22
Add isAbsolute, resolve, and relativize methods for converting
between absolute and relative file paths. These will be used by
v4 metadata to store locations relative to the table location.
@anoopj anoopj force-pushed the v4-location-util branch from c511f29 to bafcd0d Compare May 19, 2026 20:22
@anoopj
Copy link
Copy Markdown
Contributor Author

anoopj commented May 19, 2026

@danielcweeks @rdblue @stevenzwu @nastra FYI, I have updated the implementation of relative paths based on the latest relative path spec PR.


String resolved = LocationUtil.resolveLocation(tableLocation, relativized);
assertThat(resolved).isEqualTo(absoluteLocation);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe also add a test where tableLocation ends with / or the relative location starts with / to cover the case that is mentioned in the spec: https://github.com/apache/iceberg/pull/15630/changes#diff-36347a47c3bf67ea2ef6309ea96201814032d21bb5f162dfae4045508c15588aR213

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good callout. Done.

* as-is. Otherwise, the table location and the relative location are joined by the URI separator
* character {@code /}.
*/
public static String resolveLocation(String tableLocation, String location) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: we're not guarding against null parameters here and in relativizeLocation. Also might be worth adding a comment that this is expecting tableLocation to not end with / or location to not start with /

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you forget to push the guarding against null parameters?


@Test
public void testResolveRelativeLocations() {
String tableLocation = "s3://bucket/table";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good tests, but now that the spec contains a table of examples, I would make sure they include all of the cases from the spec. I think that just means changing bucket/table to bucket/db/table.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

@Test
public void testResolveAbsoluteLocationsUnchanged() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the comment below, it would be good to have a different bucket and a matching bucket different path test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

}

@Test
public void testRelativizeLocationEqualToTableLocation() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this decision.

}

@Test
public void testResolveWithTrailingOrLeadingSlashProducesDuplicateSeparator() {
Copy link
Copy Markdown
Contributor

@rdblue rdblue May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad to see this. I would also add the examples from the spec here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, added examples from spec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

5 participants