Towards the goal of adding support for computing statistics over structured data (e.g., arbitrary protocol buffers, Parquet data), the GenerateStatistics API will take Arrow tables as input instead of Dict[FeatureName, ndarray]. The API will only accept Arrow tables whose columns are ListArray of primitive types (e.g., int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, binary, string, unicode).
This change should be a no-op if you construct the pipeline using the default decoders (e.g., tfdv.DecodeTFExample and tfdv.DecodeCSV) or if you are using the utility methods to generate statistics (e.g., tfdv.generate_statistics_from_tfrecord, tfdv.generate_statistics_from_csv and tfdv.generate_statistics_from_dataframe).
TFDV 0.14 will have this new behavior. Let us know if you have any issues with migrating to the new API.