Skip to content

GenerateStatistics API Change #75

@paulgc

Description

@paulgc

Towards the goal of adding support for computing statistics over structured data (e.g., arbitrary protocol buffers, parquet data), GenerateStatistics API will take Arrow tables as input instead of Dict[FeatureName, ndarray]. The API will only accept Arrow tables whose columns are ListArray of primitive types (e.g., int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, binary, string, unicode) .

This change should be a no-op if you construct the pipeline using the default decoders (e.g., tfdv.DecodeTFExample and tfdv.DecodeCSV) or if you are using the utility methods to generate statistics (e.g., tfdv.generate_statistics_from_tfrecord, tfdv.generate_statistics_from_csv and tfdv.generate_statistics_from_dataframe).

TFDV 0.14 will have this new behavior. Let us know if you have any issues with migrating to the new API.

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions