From ede8e42db0dfca8c834a8b2e2890811284806a7d Mon Sep 17 00:00:00 2001
From: Gabor Szadovszky
Date: Fri, 11 Dec 2020 18:04:58 +0100
Subject: [PATCH 1/6] PARQUET-1950: Define core features / compliance level

---
 CoreFeatures.md                | 181 +++++++++++++++++++++++++++++++++
 README.md                      |   7 ++
 src/main/thrift/parquet.thrift |   7 ++
 3 files changed, 195 insertions(+)
 create mode 100644 CoreFeatures.md

diff --git a/CoreFeatures.md b/CoreFeatures.md
new file mode 100644
index 000000000..f79c13fb0
--- /dev/null
+++ b/CoreFeatures.md
@@ -0,0 +1,181 @@
+
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is becasue `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**
+* `BROTLI` **(?)**
+* `LZ4` **(?)**
+* `ZSTD` **(?)**
+
+### Statistics
+
+However understanding statistics is not crucial to read the data in a file we
+still list these features as wrongly specified/implemented statistics can still
+cause losing data unnoticed.
+The following features related to statistics are supported.
+* The row group level min/max values: The fields `min\_value` and `max\_value`
+  shall be used in the `Statistics` object according to the specification
+* [Column Index](PageIndex.md)
+
+NOTE: Writing page level statistics to the data page headers is not required.
+
+The list of `column\_orders` in `FileMetaData` must be set according to the
+notes. See the special handlings required for floating point numbers at
+`ColumnOrder`.
+
diff --git a/README.md b/README.md
index 3f837906f..4aae4777c 100644
--- a/README.md
+++ b/README.md
@@ -239,6 +239,13 @@ There are many places in the format for compatible extensions:
 - Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.
 
+## Compatibility
+Because of the many features got into the Parquet format it is hard for the
+different implementations to keep up. We introduced the list of "core
+features". This document is versioned by the parquet format releases and defines
+a compliance level for the different implementations. See
+[CoreFeatures.md](CoreFeatures.md) for more details.
+
 ## Contributing
 Comment on the issue and/or contact [the parquet-dev mailing list](http://mail-archives.apache.org/mod_mbox/parquet-dev/) with your questions and ideas.
 Changes to this core format definition are proposed and discussed in depth on the mailing list. You may also be interested in contributing to the Parquet-MR subproject, which contains all the Java-side implementation and APIs. See the "How To Contribute" section of the [Parquet-MR project](https://github.com/apache/parquet-mr#how-to-contribute)
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 0e091d7e8..4b42e6f9e 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -1041,6 +1041,13 @@ struct FileMetaData {
    * Used only in encrypted files with plaintext footer.
    */
   9: optional binary footer_signing_key_metadata
+
+  /**
+   * This field might be set with the version number of a parquet-format release
+   * if this file is created by using only the features listed in the related
+   * list of core features. See CoreFeatures.md for details.
+   */
+  10: optional string compliance_level
 }
 
 /** Crypto metadata for files with encrypted footer **/

From 75bd1b7a8b586375f7196156512196f1073ded75 Mon Sep 17 00:00:00 2001
From: Gabor Szadovszky
Date: Mon, 14 Dec 2020 12:13:24 +0100
Subject: [PATCH 2/6] Apply suggestions from code review

Co-authored-by: emkornfield
---
 CoreFeatures.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/CoreFeatures.md b/CoreFeatures.md
index f79c13fb0..521508b52 100644
--- a/CoreFeatures.md
+++ b/CoreFeatures.md
@@ -24,8 +24,8 @@ list is a subset of the features which parquet-format makes available.
 
 ## Purpose
 
-The list of core features for a certian release makes a compliance level that
-the different implementations can tied to. If an implementation claims that it
+The list of core features for a certain release makes a compliance level that
+for implementations . If an implementation claims that it
 provides the functionality of a parquet-format release core features it must
 implement all of the listed features according the specification (both read and
 write path). This way it is easier to ensure compatibility between the
@@ -34,7 +34,7 @@ We cannot and don't want to stop our clients to use any features that are not
 on this list but it shall be highlighted that using these features might make
 the written parquet files unreadable by other implementations. We can say that
 the features available in a parquet-format release (and one of the
-implementations of it) and not on this list are experimental.
+implementations of it) and not on the core feature list are experimental.
 
 ## Versioning
 
@@ -56,7 +56,7 @@ The idea is to only include features which are specified correctly and proven
 to be useful for everyone. Because of that we require to have at least two
 different implementations that are released and widely tested.
 
-## The "list"
+## Core feature list
 
 This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
 where all the data structures we might use in a parquet file are defined.
@@ -165,7 +165,7 @@ The following compression algorithms are supported (including `UNCOMPRESSED`).
 
 ### Statistics
 
-However understanding statistics is not crucial to read the data in a file we
+Statistics are not required for reading data but incorrect or under specified statistics implementation can cause data loss.
 still list these features as wrongly specified/implemented statistics can still
 cause losing data unnoticed.
 The following features related to statistics are supported.
@@ -178,4 +178,3 @@ NOTE: Writing page level statistics to the data page headers is not required.
 The list of `column\_orders` in `FileMetaData` must be set according to the
 notes. See the special handlings required for floating point numbers at
 `ColumnOrder`.
-

From d2bab9e6acc71046312045d299bdf4d4abf612e4 Mon Sep 17 00:00:00 2001
From: Gabor Szadovszky
Date: Mon, 14 Dec 2020 12:30:02 +0100
Subject: [PATCH 3/6] PARQUET-1950: Address comments

* Separate requirements for writers and readers
* Add requirement for interoperability tests
* Remove TODO about the dictionary encoding of FIXED values as it seems only parquet-mr has limitations but the other implementations do not
* Remove LZO because licensing issues makes hard to include them in the implementations
---
 CoreFeatures.md | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/CoreFeatures.md b/CoreFeatures.md
index 521508b52..0cbe09ed5 100644
--- a/CoreFeatures.md
+++ b/CoreFeatures.md
@@ -24,17 +24,18 @@ list is a subset of the features which parquet-format makes available.
 
 ## Purpose
 
-The list of core features for a certain release makes a compliance level that
-for implementations . If an implementation claims that it
-provides the functionality of a parquet-format release core features it must
-implement all of the listed features according the specification (both read and
-write path). This way it is easier to ensure compatibility between the
-different parquet implementations.
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
 We cannot and don't want to stop our clients to use any features that are not
 on this list but it shall be highlighted that using these features might make
 the written parquet files unreadable by other implementations. We can say that
 the features available in a parquet-format release (and one of the
-implementations of it) and not on the core feature list are experimental.
+implementations of it) and not on the *core feature list* are experimental.
 
 ## Versioning
 
@@ -54,7 +55,9 @@ in the thrift object `FileMetaData`.
 
 The idea is to only include features which are specified correctly and proven
 to be useful for everyone. Because of that we require to have at least two
-different implementations that are released and widely tested.
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
 
 ## Core feature list
 
@@ -149,16 +152,11 @@ NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
 deprecated and not used directly (boolean values are encoded with this under
 PLAIN) so not included in this list.
 
-**TODO**: In parquet-mr dictionary encoding is not enabled for
-FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
-behind. Any experience/idea about this from other implementations?
-
 ### Compression
 
 The following compression algorithms are supported (including `UNCOMPRESSED`).
 * `SNAPPY`
 * `GZIP`
-* `LZO` **(?)**
 * `BROTLI` **(?)**
 * `LZ4` **(?)**
 * `ZSTD` **(?)**

From fd5bf7dde9b9a112d0fd738fe1fe57c0bea63674 Mon Sep 17 00:00:00 2001
From: Gabor Szadovszky
Date: Mon, 1 Feb 2021 13:18:24 +0100
Subject: [PATCH 4/6] PARQUET-1950: Fix review comments

---
 CoreFeatures.md | 20 ++++++++++----------
 README.md       |  6 +++---
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/CoreFeatures.md b/CoreFeatures.md
index 0cbe09ed5..ee684b580 100644
--- a/CoreFeatures.md
+++ b/CoreFeatures.md
@@ -29,11 +29,11 @@ implementations. If a writer implementation claims that it is at a certain
 compliance level then it must use only features from the *core feature list* of
 that parquet-format release. If a reader implementation claims the same if must
 implement all of the listed features. This way it is easier to ensure
-compatibility between the different parquet implementations.
+compatibility between the different Parquet implementations.
 
 We cannot and don't want to stop our clients to use any features that are not
 on this list but it shall be highlighted that using these features might make
-the written parquet files unreadable by other implementations. We can say that
+the written Parquet files unreadable by other implementations. We can say that
 the features available in a parquet-format release (and one of the
 implementations of it) and not on the *core feature list* are experimental.
 
@@ -43,13 +43,13 @@ This document is versioned by the parquet-format releases which follows the
 scheme of semantic versioning. It means that no feature will be deleted from
 this document under the same major version. (We might deprecate some, though.)
 Because of the semantic versioning if one implementation supports the core
-features of the parquet-format release `a.b.x` it must be able to read any
-parquet files written by implementations supporting the release `a.d.y` where
-`b >= d`.
+features of the parquet-format release `a.c.x` it must be able to read any
+Parquet files written by implementations supporting the release `a.b.y` where
+`c >= b`.
 
-If a parquet file is written according to a released version of this document
+If a Parquet file is written according to a released version of this document
 it might be a good idea to write this version into the field `compliance_level`
-in the thrift object `FileMetaData`.
+in the Thrift object `FileMetaData`.
 
 ## Adding new features
 
@@ -61,8 +61,8 @@ implementation can read the data written by the other one and vice versa.
 
 ## Core feature list
 
-This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
-where all the data structures we might use in a parquet file are defined.
+This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a Parquet file are defined.
 
 ### File structure
 
@@ -94,7 +94,7 @@ here.
 
 The [logical type](LogicalTypes.md)s are practically annotations helping to
 understand the related primitive type (or structure). Originally we have had
-the `ConvertedType` enum in the thrift file representing all the possible
+the `ConvertedType` enum in the Thrift file representing all the possible
 logical types. After a while we realized it is hard to extend and so introduced
 the `LogicalType` union. For backward compatibility reasons we allow to use the
 old `ConvertedType` values according to the specified rules but we expect that
diff --git a/README.md b/README.md
index 4aae4777c..221a38f5a 100644
--- a/README.md
+++ b/README.md
@@ -240,9 +240,9 @@ There are many places in the format for compatible extensions:
 - Page types: Additional page types can be added and safely skipped.
 
 ## Compatibility
-Because of the many features got into the Parquet format it is hard for the
-different implementations to keep up. We introduced the list of "core
-features". This document is versioned by the parquet format releases and defines
+Because of the many features that have been added to the Parquet format not all
+of the implementations was able to keep up. We introduced the list of "core
+features". This document is versioned by the Parquet format releases and defines
 a compliance level for the different implementations. See
 [CoreFeatures.md](CoreFeatures.md) for more details.
 

From 2dfe463c948948f7d9624bee3cdd4706eb3488b5 Mon Sep 17 00:00:00 2001
From: Gabor Szadovszky
Date: Tue, 2 Feb 2021 09:55:43 +0100
Subject: [PATCH 5/6] PARQUET-1950: exclude column chunk file reference

---
 CoreFeatures.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/CoreFeatures.md b/CoreFeatures.md
index ee684b580..043adfc8f 100644
--- a/CoreFeatures.md
+++ b/CoreFeatures.md
@@ -74,6 +74,16 @@ The following page types are supported:
 
 **TODO**: list optional fields that must be filled properly.
 
+#### Column chunk file reference
+
+The optional field `file_path` in the `ColumnChunk` object of the Parquet footer
+(aka Parquet Thrift file) makes it available to reference an external file. This
+option was used for different features like _summary files_ or
+_external column chunks_. These features were never specified correctly and
+they did not spread across the different implementations. Because of that we do
+not include these features in this document and therefore the field `file_path`
+is not supported.
+
 ### Types
 
 #### Primitive types

From b400ff23fa1363e8e8641be8751a3e1ba4ebf2b1 Mon Sep 17 00:00:00 2001
From: Gabor Szadovszky
Date: Wed, 10 Mar 2021 10:16:49 +0100
Subject: [PATCH 6/6] PARQUET-1950: Remove LZ4 due to its deprecation

---
 CoreFeatures.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/CoreFeatures.md b/CoreFeatures.md
index 043adfc8f..258744ca4 100644
--- a/CoreFeatures.md
+++ b/CoreFeatures.md
@@ -168,7 +168,6 @@ The following compression algorithms are supported (including `UNCOMPRESSED`).
 * `SNAPPY`
 * `GZIP`
 * `BROTLI` **(?)**
-* `LZ4` **(?)**
 * `ZSTD` **(?)**
 
 ### Statistics