hipscat.io.parquet_metadata#

Utility functions for handling parquet metadata files

Module Contents#

Functions#

row_group_stat_single_value(row_group, stat_key)

Convenience method to find the min and max inside a statistics dictionary, and raise an error if they're unequal.

get_healpix_pixel_from_metadata(...)

Get the healpix pixel according to a parquet file's metadata.

write_parquet_metadata(catalog_path[, ...])

Generate parquet metadata, using the already-partitioned parquet files

write_parquet_metadata_for_batches(batches[, ...])

Write parquet metadata files for some pyarrow table batches.

read_row_group_fragments(metadata_file[, storage_options])

Generator for metadata fragment row groups in a parquet metadata file.

row_group_stat_single_value(row_group, stat_key: str)[source]#

Convenience method to find the min and max inside a statistics dictionary, and raise an error if they’re unequal.

Parameters:
  • row_group – dataset fragment row group

  • stat_key (str) – column name of interest.

Returns:

The value of the specified row group statistic
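As an illustration of the documented contract (using a plain dictionary of column statistics in place of a pyarrow row group, so the function and key names here are hypothetical stand-ins):

```python
def stat_single_value(column_stats: dict, stat_key: str):
    """Return the single value a column takes in a row group.

    Mirrors the documented behavior: look up the min and max
    statistics for ``stat_key`` and raise if they differ.
    """
    stats = column_stats[stat_key]
    if stats["min"] != stats["max"]:
        raise ValueError(f"{stat_key} statistic min != max in row group")
    return stats["min"]

# A row group where every row has Norder == 1:
print(stat_single_value({"Norder": {"min": 1, "max": 1}}, "Norder"))
```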

get_healpix_pixel_from_metadata(metadata: pyarrow.parquet.FileMetaData, norder_column: str = 'Norder', npix_column: str = 'Npix') hipscat.pixel_math.healpix_pixel.HealpixPixel[source]#

Get the healpix pixel according to a parquet file’s metadata.

This is determined by the values of the Norder and Npix columns in the table's data.

Parameters:
  • metadata (pyarrow.parquet.FileMetaData) – full metadata for a single file.

  • norder_column (str) – column name holding the HEALPix order.

  • npix_column (str) – column name holding the HEALPix pixel.

Returns:

Healpix pixel representing the Norder and Npix from the first row group.
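A minimal sketch of this lookup, assuming plain dictionaries of first-row-group statistics in place of pyarrow metadata (the names below are hypothetical; hipscat's own `HealpixPixel` lives in `hipscat.pixel_math.healpix_pixel`):

```python
from collections import namedtuple

# Hypothetical stand-in for hipscat's HealpixPixel
HealpixPixel = namedtuple("HealpixPixel", ["order", "pixel"])


def pixel_from_stats(column_stats: dict,
                     norder_column: str = "Norder",
                     npix_column: str = "Npix") -> HealpixPixel:
    """Build a pixel from first-row-group statistics.

    Each column must hold a single value (min == max), matching the
    documented behavior of get_healpix_pixel_from_metadata.
    """
    def single(key):
        stats = column_stats[key]
        if stats["min"] != stats["max"]:
            raise ValueError(f"{key} is not constant in the row group")
        return stats["min"]

    return HealpixPixel(single(norder_column), single(npix_column))


print(pixel_from_stats({"Norder": {"min": 3, "max": 3},
                        "Npix": {"min": 45, "max": 45}}))
```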

write_parquet_metadata(catalog_path: str, order_by_healpix=True, storage_options: dict = None, output_path: str = None)[source]#

Generate parquet metadata, using the already-partitioned parquet files for this catalog.

For more information on the general parquet metadata files, and why we write them, see https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

Parameters:
  • catalog_path (str) – base path for the catalog

  • order_by_healpix (bool) – use False if the dataset is not to be reordered by breadth-first healpix pixel (e.g. secondary indexes)

  • storage_options – dictionary that contains abstract filesystem credentials

  • output_path (str) – base path for writing out metadata files; defaults to catalog_path if unspecified

write_parquet_metadata_for_batches(batches: List[List[pyarrow.RecordBatch]], output_path: str = None, storage_options: dict = None)[source]#

Write parquet metadata files for some pyarrow table batches. This writes the batches to a temporary parquet dataset using local storage, and generates the metadata for the partitioned catalog parquet files.

Parameters:
  • batches (List[List[pa.RecordBatch]]) – create one row group per RecordBatch, grouped into tables by the inner list.

  • output_path (str) – base path for writing out metadata files; defaults to catalog_path if unspecified

  • storage_options – dictionary that contains abstract filesystem credentials

read_row_group_fragments(metadata_file: str, storage_options: dict = None)[source]#

Generator for metadata fragment row groups in a parquet metadata file.

Parameters:
  • metadata_file (str) – path to _metadata file.

  • storage_options – dictionary that contains abstract filesystem credentials