Fix/import issue #1240
Open
bplatz wants to merge 4 commits into
…ld_class_stat_entries domain mismatch). The SPOT-merge collector stores ValueTypeTag values into prop_dts (it only has the per-flake o_type byte, not the dict). The consumers were treating those keys as DatatypeDictId indices into dt_tags, which produced wrong tags post-reindex (xsd:integer→xsd:boolean, xsd:date→rdf:langString, …) and UNKNOWN on the import path (which passed &[] for dt_tags). Drop the dt_tags parameter from build_class_stat_entries and build_class_stats_json; the stored u16 is already a ValueTypeTag value, so cast directly to u8. Update the SpotClassStats doc to reflect the correct domain. Adds a regression test asserting xsd:string, xsd:integer, xsd:boolean, xsd:dateTime, xsd:date, and IRI-ref tags after reindex.
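A minimal sketch of the domain mismatch described in this commit, for reviewers who want to see why the dictionary lookup produced wrong tags. The `ValueTypeTag` enum here is a hypothetical stand-in with made-up numeric values, and `tag_via_dict_lookup` / `tag_via_direct_cast` are illustrative names rather than functions from the codebase.

```rust
/// Hypothetical stand-in for the real `ValueTypeTag`. The numeric values are
/// invented for illustration and do not match the actual encoding.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum ValueTypeTag {
    Unknown = 0,
    String = 1,
    Integer = 2,
    Boolean = 3,
}

impl ValueTypeTag {
    /// Decode a raw tag byte, falling back to `Unknown` for unmapped values.
    pub fn from_u8(raw: u8) -> Self {
        match raw {
            1 => Self::String,
            2 => Self::Integer,
            3 => Self::Boolean,
            _ => Self::Unknown,
        }
    }
}

/// Old behavior (sketch): treat the stored key as a dictionary index into
/// `dt_tags`. The key is really a tag value, so this reads an unrelated slot,
/// or falls off the end when `dt_tags` is empty.
pub fn tag_via_dict_lookup(key: u16, dt_tags: &[ValueTypeTag]) -> ValueTypeTag {
    dt_tags
        .get(key as usize)
        .copied()
        .unwrap_or(ValueTypeTag::Unknown)
}

/// New behavior (sketch): the stored u16 already holds a tag value, so it is
/// narrowed to u8 and decoded directly. Assumes stored keys fit in a u8.
pub fn tag_via_direct_cast(key: u16) -> ValueTypeTag {
    ValueTypeTag::from_u8(key as u8)
}
```

With a dictionary ordered `[String, Integer, Boolean]`, a stored `Integer` key (2 in this sketch) indexes the `Boolean` slot, which mirrors the xsd:integer→xsd:boolean symptom. An empty slice yields `Unknown`, which mirrors the import path.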
These were the only unambiguous OType→ValueTypeTag arms missing from the stats datatype mapping; previously class stats reported them as UNKNOWN. NUM_BIG_OVERFLOW is intentionally left unmapped: it carries both arbitrary-precision xsd:decimal and i64-overflow xsd:integer (both share ObjKind::NUM_BIG), and the SPOT-merge RunRecordV2 stream doesn't carry the dt sid needed to disambiguate. Comment in id_hook.rs explains what plumbing would be required to fix this faithfully. Extends reindex_class_stats_report_correct_datatypes to assert both new arms.
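As a rough illustration of the mapping policy in this commit: unambiguous carriers get a concrete tag, while an ambiguous carrier stays `Unknown` rather than guessing. Both enums and `stats_tag_for` are hypothetical simplifications. The real `OType` and `ValueTypeTag` have many more variants and a different representation.

```rust
/// Hypothetical, heavily reduced view of the per-flake object type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OType {
    XsdInteger,
    XsdGYearMonth,
    XsdGMonthDay,
    /// Carries both arbitrary-precision decimals and overflowed integers.
    NumBigOverflow,
}

/// Hypothetical, heavily reduced view of the stats datatype tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ValueTypeTag {
    Integer,
    GYearMonth,
    GMonthDay,
    Unknown,
}

/// Map an object type to a stats tag (illustrative function name).
pub fn stats_tag_for(o_type: OType) -> ValueTypeTag {
    match o_type {
        OType::XsdInteger => ValueTypeTag::Integer,
        // The two arms this commit adds: each has exactly one possible datatype.
        OType::XsdGYearMonth => ValueTypeTag::GYearMonth,
        OType::XsdGMonthDay => ValueTypeTag::GMonthDay,
        // Ambiguous without the datatype sid, so report Unknown instead of
        // picking xsd:decimal or xsd:integer arbitrarily.
        OType::NumBigOverflow => ValueTypeTag::Unknown,
    }
}
```

Using an exhaustive `match` without a wildcard arm means a newly added `OType` variant fails to compile until someone decides how it should be reported.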
The import path built the FIR6 root inline with `IndexStats.size = 0`, even though `total_commit_size` was correctly tracked and stored on the root. The normal indexing path runs a size-propagation block in `root_assembly` that copies `root.total_commit_size` into `stats.size` and proportionally allocates per-graph sizes by flake count, but the import path skipped it. Result: `info` reported `size: 0` for the ledger, every named graph, and stats.size after a streaming import, while the same data landing via the normal commit path showed correct sizes. Factors the 18-line size-distribution block into `IndexStats::distribute_total_size_by_flakes` and uses it from root_assembly, both incremental sites, and the new import site. Regression test in `import_collects_stats` asserts `stats.size > 0` and per-graph `graphs[0].size > 0` after import.
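For context, proportional allocation by flake count can be sketched as below. This is not the actual 18-line block: the struct layouts, field names, and the floor-division rounding are assumptions made for illustration, and only the method name `distribute_total_size_by_flakes` comes from this commit.

```rust
/// Hypothetical per-graph stats entry (field names are assumptions).
#[derive(Debug, Default)]
pub struct GraphStats {
    pub flakes: u64,
    pub size: u64,
}

/// Hypothetical reduced `IndexStats` (field names are assumptions).
#[derive(Debug, Default)]
pub struct IndexStats {
    pub size: u64,
    pub graphs: Vec<GraphStats>,
}

impl IndexStats {
    /// Copy the total commit size into `size`, then give each graph a share
    /// proportional to its flake count.
    pub fn distribute_total_size_by_flakes(&mut self, total_commit_size: u64) {
        self.size = total_commit_size;

        let total_flakes: u64 = self.graphs.iter().map(|g| g.flakes).sum();
        if total_flakes == 0 {
            // Nothing to apportion; per-graph sizes are left untouched.
            return;
        }

        for graph in &mut self.graphs {
            // Widen to u128 so the multiplication cannot overflow.
            let share =
                (total_commit_size as u128 * graph.flakes as u128) / total_flakes as u128;
            graph.size = share as u64;
        }
    }
}
```

Because this sketch uses floor division, the per-graph sizes can sum to slightly less than the total. Whether the real helper redistributes that remainder is not stated here.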
Fixes incorrect datatype labels in per-class stats (`stats.classes[*].properties[*].types`).

The class stats collector stores datatype buckets as `ValueTypeTag` values, but the stats builder was treating those values as `DatatypeDictId` indices. This caused initial imports to report literal property types as `UNKNOWN`, and rebuilds to report deterministic but wrong labels like `@id`, `xsd:boolean`, or `rdf:langString`.

This PR keeps the class stats path consistently in the `ValueTypeTag` domain and removes the now-misleading `dt_tags` lookup. It also adds missing unambiguous `OType` mappings for `xsd:gYearMonth` and `xsd:gMonthDay`, while intentionally leaving ambiguous carriers like `NUM_BIG_OVERFLOW` as `UNKNOWN` until they can be disambiguated with arena access or preserved semantic datatype tags.

Test Plan

`ValueTypeTag` values, including string, integer, boolean, dateTime, date, gYearMonth, and `@id` references.