hipscat.io.parquet_metadata#

Utility functions for handling parquet metadata files

Module Contents#

Functions#

row_group_stat_single_value(row_group, stat_key)

Convenience method to find the min and max inside a statistics dictionary, and raise an error if they're unequal.

get_healpix_pixel_from_metadata(...)

Get the healpix pixel according to a parquet file's metadata.

write_parquet_metadata(catalog_path[, ...])

Generate parquet metadata, using the already-partitioned parquet files

write_parquet_metadata_for_batches(batches[, ...])

Write parquet metadata files for some pyarrow table batches.

read_row_group_fragments(metadata_file[, storage_options])

Generator for metadata fragment row groups in a parquet metadata file.

row_group_stat_single_value(row_group, stat_key: str)[source]#

Convenience method to find the min and max inside a statistics dictionary, and raise an error if they’re unequal.

Parameters:
  • row_group – dataset fragment row group

  • stat_key (str) – column name of interest.

Returns:

The value of the specified row group statistic
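As an illustration of the documented contract (using a plain dictionary of column statistics in place of a pyarrow row group, so the function and key names here are hypothetical stand-ins):

```python
def stat_single_value(column_stats: dict, stat_key: str):
    """Return the single value a column takes in a row group.

    Mirrors the documented behavior: look up the min and max
    statistics for ``stat_key`` and raise if they differ.
    """
    stats = column_stats[stat_key]
    if stats["min"] != stats["max"]:
        raise ValueError(f"{stat_key} statistic min != max in row group")
    return stats["min"]

# A row group where every row has Norder == 1:
print(stat_single_value({"Norder": {"min": 1, "max": 1}}, "Norder"))
```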

get_healpix_pixel_from_metadata(metadata: pyarrow.parquet.FileMetaData, norder_column: str = 'Norder', npix_column: str = 'Npix') hipscat.pixel_math.healpix_pixel.HealpixPixel[source]#

Get the healpix pixel according to a parquet file’s metadata.

This is determined by the values of the Norder and Npix columns in the table's data.

Parameters:
  • metadata (pyarrow.parquet.FileMetaData) – full metadata for a single file.

  • norder_column (str) – column name holding the HEALPix order.

  • npix_column (str) – column name holding the HEALPix pixel.

Returns:

Healpix pixel representing the Norder and Npix from the first row group.
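A minimal sketch of this lookup, assuming plain dictionaries of first-row-group statistics in place of pyarrow metadata (the names below are hypothetical; hipscat's own `HealpixPixel` lives in `hipscat.pixel_math.healpix_pixel`):

```python
from collections import namedtuple

# Hypothetical stand-in for hipscat's HealpixPixel
HealpixPixel = namedtuple("HealpixPixel", ["order", "pixel"])


def pixel_from_stats(column_stats: dict,
                     norder_column: str = "Norder",
                     npix_column: str = "Npix") -> HealpixPixel:
    """Build a pixel from first-row-group statistics.

    Each column must hold a single value (min == max), matching the
    documented behavior of get_healpix_pixel_from_metadata.
    """
    def single(key):
        stats = column_stats[key]
        if stats["min"] != stats["max"]:
            raise ValueError(f"{key} is not constant in the row group")
        return stats["min"]

    return HealpixPixel(single(norder_column), single(npix_column))


print(pixel_from_stats({"Norder": {"min": 3, "max": 3},
                        "Npix": {"min": 45, "max": 45}}))
```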

write_parquet_metadata(catalog_path: str, order_by_healpix=True, storage_options: dict = None, output_path: str = None)[source]#

Generate parquet metadata, using the already-partitioned parquet files for this catalog.

For more information on the general parquet metadata files, and why we write them, see https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

Parameters:
  • catalog_path (str) – base path for the catalog

  • order_by_healpix (bool) – use False if the dataset is not to be reordered by breadth-first healpix pixel (e.g. secondary indexes)

  • storage_options – dictionary that contains abstract filesystem credentials

  • output_path (str) – base path for writing out metadata files; defaults to catalog_path if unspecified

write_parquet_metadata_for_batches(batches: List[List[pyarrow.RecordBatch]], output_path: str = None, storage_options: dict = None)[source]#

Write parquet metadata files for some pyarrow table batches. This writes the batches to a temporary parquet dataset using local storage, and generates the metadata for the partitioned catalog parquet files.

Parameters:
  • batches (List[List[pa.RecordBatch]]) – create one row group per RecordBatch, grouped into tables by the inner list.

  • output_path (str) – base path for writing out metadata files; defaults to catalog_path if unspecified

  • storage_options – dictionary that contains abstract filesystem credentials

read_row_group_fragments(metadata_file: str, storage_options: dict = None)[source]#

Generator for metadata fragment row groups in a parquet metadata file.

Parameters:
  • metadata_file (str) – path to _metadata file.

  • storage_options – dictionary that contains abstract filesystem credentials