hipscat.io.parquet_metadata
Utility functions for handling parquet metadata files.
Module Contents
Functions
- row_group_stat_single_value – Convenience method to find the min and max inside a statistics dictionary, and raise an error if they're unequal.
- get_healpix_pixel_from_metadata – Get the healpix pixel according to a parquet file's metadata.
- write_parquet_metadata – Generate parquet metadata, using the already-partitioned parquet files.
- write_parquet_metadata_for_batches – Write parquet metadata files for some pyarrow table batches.
- Generator for metadata fragment row groups in a parquet metadata file.
- row_group_stat_single_value(row_group, stat_key: str)
Convenience method to find the min and max inside a statistics dictionary, and raise an error if they’re unequal.
- Parameters:
row_group – dataset fragment row group
stat_key (str) – column name of interest.
- Returns:
The value of the specified row group statistic
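The min/max check can be illustrated in pure Python. This is a sketch of the logic only, not hipscat's implementation: the real function reads statistics from a pyarrow dataset fragment row group, whereas here the statistics are assumed to be flattened into a plain dict keyed by column name (a hypothetical shape chosen for illustration).

```python
def stat_single_value(stats: dict, stat_key: str):
    """Return the statistic for stat_key, insisting that min == max.

    Useful for partition columns such as Norder and Npix, where every
    row in a row group is expected to share a single value.
    """
    if stat_key not in stats:
        raise ValueError(f"row group has no statistics for {stat_key}")
    min_value, max_value = stats[stat_key]["min"], stats[stat_key]["max"]
    if min_value != max_value:
        raise ValueError(
            f"{stat_key} is not constant in this row group "
            f"({min_value} != {max_value})"
        )
    return min_value

# Every row in this row group has Norder == 3, so min == max == 3.
print(stat_single_value({"Norder": {"min": 3, "max": 3}}, "Norder"))  # → 3
```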
- get_healpix_pixel_from_metadata(metadata: pyarrow.parquet.FileMetaData, norder_column: str = 'Norder', npix_column: str = 'Npix') → hipscat.pixel_math.healpix_pixel.HealpixPixel
Get the healpix pixel according to a parquet file’s metadata.
This is determined by the values of Norder and Npix in the table's data.
- Parameters:
metadata (pyarrow.parquet.FileMetaData) – full metadata for a single file.
- Returns:
Healpix pixel representing the Norder and Npix from the first row group.
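Continuing the same sketch, the pixel lookup reduces to reading the (constant) Norder and Npix statistics from the first row group. `HealpixPixel` below is a minimal stand-in for hipscat.pixel_math.healpix_pixel.HealpixPixel, and the flattened-statistics dict is again a hypothetical shape, not the pyarrow FileMetaData API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealpixPixel:
    """Minimal stand-in for hipscat's HealpixPixel (order + pixel number)."""
    order: int
    pixel: int


def pixel_from_stats(stats: dict, norder_column: str = "Norder",
                     npix_column: str = "Npix") -> HealpixPixel:
    """Read the healpix pixel from one row group's column statistics.

    Because the partition columns are constant within a file, the min
    statistic alone identifies the value (min == max).
    """
    return HealpixPixel(order=stats[norder_column]["min"],
                        pixel=stats[npix_column]["min"])


stats = {"Norder": {"min": 5, "max": 5}, "Npix": {"min": 120, "max": 120}}
print(pixel_from_stats(stats))  # → HealpixPixel(order=5, pixel=120)
```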
- write_parquet_metadata(catalog_path: str, order_by_healpix=True, storage_options: dict = None, output_path: str = None)
Generate parquet metadata, using the already-partitioned parquet files for this catalog.
For more information on the general parquet metadata files, and why we write them, see https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files
- Parameters:
catalog_path (str) – base path for the catalog
order_by_healpix (bool) – use False if the dataset is not to be reordered by breadth-first healpix pixel (e.g. secondary indexes)
storage_options – dictionary that contains abstract filesystem credentials
output_path (str) – base path for writing out metadata files; defaults to catalog_path if unspecified
- write_parquet_metadata_for_batches(batches: List[List[pyarrow.RecordBatch]], output_path: str = None, storage_options: dict = None)
Write parquet metadata files for some pyarrow table batches. This writes the batches to a temporary parquet dataset using local storage, and generates the metadata for the partitioned catalog parquet files.
- Parameters:
batches (List[List[pa.RecordBatch]]) – create one row group per RecordBatch, grouped into tables by the inner list.
output_path (str) – base path for writing out metadata files
storage_options – dictionary that contains abstract filesystem credentials