hats.io.parquet_metadata
========================

.. py:module:: hats.io.parquet_metadata

.. autoapi-nested-parse::

   Utility functions for handling parquet metadata files

   ..
       !! processed by numpydoc !!


Functions
---------

.. autoapisummary::

   hats.io.parquet_metadata.write_parquet_metadata
   hats.io.parquet_metadata.aggregate_column_statistics
   hats.io.parquet_metadata.aggregate_column_statistics_from_cache
   hats.io.parquet_metadata.per_partition_statistics_from_cache
   hats.io.parquet_metadata.per_partition_statistics
   hats.io.parquet_metadata.write_per_partition_statistics_from_metadata
   hats.io.parquet_metadata.pick_metadata_schema_file
   hats.io.parquet_metadata.nested_frame_to_vo_schema
   hats.io.parquet_metadata.write_voparquet_in_common_metadata


Module Contents
---------------

.. py:function:: write_parquet_metadata(catalog_path: str | pathlib.Path | upath.UPath, *, order_by_healpix=True, output_path: str | pathlib.Path | upath.UPath | None = None, create_thumbnail: bool = False, thumbnail_threshold: int = 1000000, create_metadata: bool = True, create_per_partition_stats: bool = False)

   
   Write Parquet dataset-level metadata files (and optional thumbnail) for a catalog.

   Creates files::

       catalog/
       ├── data_thumbnail.parquet           (only if create_thumbnail=True)
       ├── per_partition_statistics.parquet (only if create_per_partition_stats=True)
       ├── ...
       └── dataset/
           ├── _common_metadata             (always written)
           ├── _metadata                    (only if create_metadata=True)
           └──  ...

   ``dataset/_common_metadata`` contains the full schema of the dataset. This file
   records all of the columns and their types, as well as any file-level key-value
   metadata associated with the full Parquet dataset.

   ``dataset/_metadata`` contains the combined row group footers from all Parquet files
   in the dataset, which allows readers to read the entire dataset without having
   to open each individual Parquet file. This file can be large for datasets with
   many files, so users may choose to omit it by setting ``create_metadata=False``.

   ``data_thumbnail.parquet`` gives the user a quick overview of the whole dataset.
   It is a compact file containing one row from each data partition, up to a maximum
   of ``thumbnail_threshold`` rows.

   ``per_partition_statistics.parquet`` contains summary statistics from all columns
   in data partition files, e.g. column min/max values, count of null values, etc.
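
   As a rough illustration of the thumbnail described above, the sketch below keeps one
   row per partition until a row cap is reached. Plain Python lists stand in for Parquet
   files, ``build_thumbnail`` is a hypothetical helper (not part of ``hats``), and taking
   the *first* row of each partition is an assumption.

   .. code-block:: python

      from itertools import islice


      def build_thumbnail(partitions, thumbnail_threshold=1_000_000):
          """Return one row per non-empty partition, capped at thumbnail_threshold rows."""
          # Assumption: the first row of each partition is the one that is sampled.
          first_rows = (rows[0] for rows in partitions if rows)
          return list(islice(first_rows, thumbnail_threshold))


      # Three toy partitions, each a list of rows.
      partitions = [
          [{"id": 1}, {"id": 2}],
          [{"id": 3}],
          [{"id": 4}, {"id": 5}, {"id": 6}],
      ]
      thumbnail = build_thumbnail(partitions, thumbnail_threshold=2)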

   :Parameters:

       **catalog_path** : str | Path | UPath
           Base path for the catalog root.

       **order_by_healpix** : bool, default=True
           If True, reorder the combined metadata by breadth-first HEALPix pixel ordering.
           Set False for datasets that should not be reordered (e.g., secondary indexes).
           Does not modify dataset files on disk.

       **output_path** : str | Path | UPath | None, default=None
           Base path to write metadata files. If None, uses ``catalog_path``.

       **create_thumbnail** : bool, default=False
           If True, writes a compact ``data_thumbnail.parquet`` containing one row per
           sampled file.

       **thumbnail_threshold** : int, default=1_000_000
           Maximum number of rows in the thumbnail. The thumbnail holds one row per
           partition, so it has fewer rows when the catalog has fewer partitions than
           ``thumbnail_threshold``.

       **create_metadata** : bool, default=True
           If True, writes ``dataset/_metadata`` combining row group footers.

       **create_per_partition_stats** : bool, default=False
           If True, writes ``per_partition_statistics.parquet`` containing summary
           statistics from all columns in data partition files.


   :Returns:

       int
           Total number of rows across all parquet files in the dataset.


   .. rubric:: Notes

   For more information on the general Parquet metadata files, and why we write them, see
   https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

   For more information on HATS-specific metadata files and conventions, see
   https://www.ivoa.net/documents/Notes/HATS/
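
   The ordering applied by ``order_by_healpix`` can be pictured with a small sketch. It
   assumes that "breadth-first" means every pixel of a lower HEALPix order comes before
   any pixel of a higher order, with pixels sorted by number within an order. The
   ``(order, pixel)`` tuples and ``breadth_first_sorted`` are illustrative stand-ins,
   not ``hats`` objects.

   .. code-block:: python

      def breadth_first_sorted(pixels):
          """Sort (order, pixel) pairs by order first, then by pixel number."""
          return sorted(pixels, key=lambda pair: (pair[0], pair[1]))


      pixels = [(1, 47), (0, 11), (1, 44), (0, 4)]
      ordered = breadth_first_sorted(pixels)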


   ..
       !! processed by numpydoc !!

.. py:function:: aggregate_column_statistics(metadata_file: str | pathlib.Path | upath.UPath, *, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, only_numeric_columns: bool = False, include_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel] = None)

   
   Read footer statistics in parquet metadata, and report on global min/max values.


   :Parameters:

       **metadata_file** : str | Path | UPath
           path to the ``_metadata`` file

       **exclude_hats_columns** : bool
           exclude HATS spatial and partitioning fields
           from the statistics. Defaults to True.

       **exclude_columns** : list[str]
           additional columns to exclude from the statistics.

       **include_columns** : list[str]
           if specified, only return statistics for the column
           names provided. Defaults to None, and returns all non-hats columns.

       **only_numeric_columns** : bool
           only include columns that are numeric (integer or floating point) in the
           statistics. If True, the entire frame should be numeric.
           (Default value = False)

       **include_pixels** : list[HealpixPixel]
           if specified, only return statistics
           for the pixels indicated. Defaults to None, and returns all pixels.


   :Returns:

       pd.DataFrame
           Pandas dataframe with global summary statistics
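
   Conceptually, a global minimum is the smallest of the per-partition minimums, and a
   global maximum is the largest of the per-partition maximums. The sketch below shows
   that reduction on plain dictionaries; ``aggregate_min_max`` and the sample values are
   hypothetical and only illustrate the idea, not the actual implementation.

   .. code-block:: python

      def aggregate_min_max(per_partition):
          """Combine per-partition (min, max) pairs into one global pair per column."""
          global_stats = {}
          for partition_stats in per_partition:
              for column, (low, high) in partition_stats.items():
                  if column in global_stats:
                      current_low, current_high = global_stats[column]
                      global_stats[column] = (min(current_low, low), max(current_high, high))
                  else:
                      global_stats[column] = (low, high)
          return global_stats


      # Each dictionary holds the (min, max) of every column in one partition.
      per_partition = [
          {"ra": (10.0, 45.0), "mag": (14.2, 21.0)},
          {"ra": (45.0, 90.0), "mag": (13.8, 20.5)},
      ]
      global_stats = aggregate_min_max(per_partition)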


   ..
       !! processed by numpydoc !!

.. py:function:: aggregate_column_statistics_from_cache(metadata_file: str | pathlib.Path | upath.UPath, *, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, only_numeric_columns: bool = False, include_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel] = None)

   
   Use cached footer statistics in parquet metadata to report on global min/max values.


   :Parameters:

       **metadata_file** : str | Path | UPath
           path to the ``_metadata`` file

       **exclude_hats_columns** : bool
           exclude HATS spatial and partitioning fields
           from the statistics. Defaults to True.

       **exclude_columns** : list[str]
           additional columns to exclude from the statistics.

       **include_columns** : list[str]
           if specified, only return statistics for the column
           names provided. Defaults to None, and returns all non-hats columns.

       **only_numeric_columns** : bool
           only include columns that are numeric (integer or floating point) in the
           statistics. If True, the entire frame should be numeric.
           (Default value = False)

       **include_pixels** : list[HealpixPixel]
           if specified, only return statistics
           for the pixels indicated. Defaults to None, and returns all pixels.


   :Returns:

       pd.DataFrame
           Pandas dataframe with global summary statistics
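
   One plausible reading of how the column filters interact is sketched below. The HATS
   field names in ``HATS_COLUMNS``, the helper ``select_columns``, and the rule that
   exclusions win over inclusions are all assumptions made for illustration; check them
   against your catalog and ``hats`` version.

   .. code-block:: python

      # Assumed names of the HATS spatial and partitioning fields.
      HATS_COLUMNS = ("_healpix_29", "Norder", "Dir", "Npix")


      def select_columns(
          all_columns, *, exclude_hats_columns=True, exclude_columns=None, include_columns=None
      ):
          """Pick the columns that statistics would be reported for."""
          excluded = set(exclude_columns or [])
          if exclude_hats_columns:
              excluded.update(HATS_COLUMNS)
          return [
              name
              for name in all_columns
              # An empty or missing include list means "consider every column".
              if (not include_columns or name in include_columns) and name not in excluded
          ]


      all_columns = ["_healpix_29", "ra", "dec", "mag", "Norder", "Dir", "Npix"]
      default_selection = select_columns(all_columns)
      without_mag = select_columns(all_columns, exclude_columns=["mag"])
      only_ra = select_columns(all_columns, include_columns=["ra", "Norder"])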


   ..
       !! processed by numpydoc !!

.. py:function:: per_partition_statistics_from_cache(metadata_file: str | pathlib.Path | upath.UPath, *, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, only_numeric_columns: bool = False, include_stats: list[str] = None, multi_index: bool = False, include_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel] = None, per_row_group: bool = False)

   
   Read footer statistics in parquet metadata, and report on statistics about
   each pixel partition.

   The statistics gathered are a subset of the available attributes in the
   ``pyarrow.parquet.ColumnChunkMetaData``:

   - ``min_value`` - minimum value seen in a single data partition
   - ``max_value`` - maximum value seen in a single data partition
   - ``null_count`` - number of null values
   - ``row_count`` - total number of values. note that this will only vary by column
     if you have some nested columns in your dataset
   - ``disk_bytes`` - Compressed size of the data in the parquet file, in bytes
   - ``memory_bytes`` - Uncompressed size, in bytes

   :Parameters:

       **metadata_file** : str | Path | UPath
           path to the ``_metadata`` file

       **exclude_hats_columns** : bool
           exclude HATS spatial and partitioning fields
           from the statistics. Defaults to True.

       **exclude_columns** : list[str]
           additional columns to exclude from the statistics.

       **include_columns** : list[str]
           if specified, only return statistics for the column
           names provided. Defaults to None, and returns all non-hats columns.

       **only_numeric_columns** : bool
           only include columns that are numeric (integer or
           floating point) in the statistics. If True, the entire frame should be numeric.
           (Default value = False)

       **include_stats** : list[str]
           if specified, only return the kinds of values from list
           (min_value, max_value, null_count, row_count, disk_bytes, memory_bytes).
           Defaults to None, and returns all values.

       **multi_index** : bool
           should the returned frame be created with a multi-index, first on
           pixel, then on column name? Default is False, and instead indexes on pixel, with
           separate columns per-data-column and stat value combination.
           (Default value = False)

       **include_pixels** : list[HealpixPixel]
           if specified, only return statistics
           for the pixels indicated. Defaults to None, and returns all pixels.

       **per_row_group** : bool
           should the returned data be even more fine-grained and provide
           per row group (within each pixel) level statistics? Default is currently False.


   :Returns:

       pd.DataFrame
           Pandas dataframe with granular per-pixel statistics
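
   The difference between ``per_row_group=True`` and the default can be pictured as a
   roll-up: row-group statistics for a column are folded into one record per partition.
   The sketch below assumes minimums and maximums are combined with ``min``/``max`` and
   the counts and sizes are summed; ``combine_row_groups`` and the sample numbers are
   hypothetical.

   .. code-block:: python

      def combine_row_groups(row_groups):
          """Fold row-group statistics for one column into partition-level statistics."""
          return {
              "min_value": min(rg["min_value"] for rg in row_groups),
              "max_value": max(rg["max_value"] for rg in row_groups),
              "null_count": sum(rg["null_count"] for rg in row_groups),
              "row_count": sum(rg["row_count"] for rg in row_groups),
              "disk_bytes": sum(rg["disk_bytes"] for rg in row_groups),
              "memory_bytes": sum(rg["memory_bytes"] for rg in row_groups),
          }


      # Statistics for one column, split over two row groups of the same partition.
      row_groups = [
          {"min_value": 3, "max_value": 9, "null_count": 0,
           "row_count": 100, "disk_bytes": 400, "memory_bytes": 800},
          {"min_value": 1, "max_value": 7, "null_count": 2,
           "row_count": 50, "disk_bytes": 210, "memory_bytes": 400},
      ]
      partition_stats = combine_row_groups(row_groups)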


   ..
       !! processed by numpydoc !!

.. py:function:: per_partition_statistics(metadata_file: str | pathlib.Path | upath.UPath, *, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, only_numeric_columns: bool = False, include_stats: list[str] = None, multi_index: bool = False, include_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel] = None, per_row_group: bool = False)

   
   Read footer statistics in parquet metadata, and report on statistics about
   each pixel partition.

   The statistics gathered are a subset of the available attributes in the
   ``pyarrow.parquet.ColumnChunkMetaData``:

   - ``min_value`` - minimum value seen in a single data partition
   - ``max_value`` - maximum value seen in a single data partition
   - ``null_count`` - number of null values
   - ``row_count`` - total number of values. note that this will only vary by column
     if you have some nested columns in your dataset
   - ``disk_bytes`` - Compressed size of the data in the parquet file, in bytes
   - ``memory_bytes`` - Uncompressed size, in bytes

   :Parameters:

       **metadata_file** : str | Path | UPath
           path to the ``_metadata`` file

       **exclude_hats_columns** : bool
           exclude HATS spatial and partitioning fields
           from the statistics. Defaults to True.

       **exclude_columns** : list[str]
           additional columns to exclude from the statistics.

       **include_columns** : list[str]
           if specified, only return statistics for the column
           names provided. Defaults to None, and returns all non-hats columns.

       **only_numeric_columns** : bool
           only include columns that are numeric (integer or
           floating point) in the statistics. If True, the entire frame should be numeric.
           (Default value = False)

       **include_stats** : list[str]
           if specified, only return the kinds of values from list
           (min_value, max_value, null_count, row_count, disk_bytes, memory_bytes).
           Defaults to None, and returns all values.

       **multi_index** : bool
           should the returned frame be created with a multi-index, first on
           pixel, then on column name? Default is False, and instead indexes on pixel, with
           separate columns per-data-column and stat value combination.
           (Default value = False)

       **include_pixels** : list[HealpixPixel]
           if specified, only return statistics
           for the pixels indicated. Defaults to None, and returns all pixels.

       **per_row_group** : bool
           should the returned data be even more fine-grained and provide
           per row group (within each pixel) level statistics? Default is currently False.


   :Returns:

       pd.DataFrame
           Pandas dataframe with granular per-pixel statistics
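
   The two result layouts selected by ``multi_index`` can be compared with plain
   dictionaries. In this sketch the helpers ``flatten`` and ``multi_index_layout`` are
   hypothetical, pixels are written as ``(order, pixel)`` tuples, and the
   ``"column: stat"`` label format is an assumption rather than the exact naming used in
   the returned frame.

   .. code-block:: python

      def flatten(stats_by_pixel):
          """One entry per pixel, with one key per (column, stat) combination."""
          return {
              pixel: {
                  f"{column}: {stat}": value
                  for column, column_stats in columns.items()
                  for stat, value in column_stats.items()
              }
              for pixel, columns in stats_by_pixel.items()
          }


      def multi_index_layout(stats_by_pixel):
          """One entry per (pixel, column) pair, with one key per stat."""
          return {
              (pixel, column): dict(column_stats)
              for pixel, columns in stats_by_pixel.items()
              for column, column_stats in columns.items()
          }


      stats_by_pixel = {
          (0, 4): {
              "ra": {"min_value": 10.0, "max_value": 45.0},
              "dec": {"min_value": -5.0, "max_value": 5.0},
          },
      }
      flat = flatten(stats_by_pixel)
      indexed = multi_index_layout(stats_by_pixel)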


   ..
       !! processed by numpydoc !!

.. py:function:: write_per_partition_statistics_from_metadata(catalog_base_dir: str | pathlib.Path | upath.UPath)

   
   Reads the footer statistics from the ``dataset/_metadata`` file, collects the
   per-pixel statistics, and writes them to ``per_partition_statistics.parquet``.


   :Parameters:

       **catalog_base_dir** : str | Path | UPath
           base path for the catalog


   ..
       !! processed by numpydoc !!

.. py:function:: pick_metadata_schema_file(catalog_base_dir: str | pathlib.Path | upath.UPath) -> upath.UPath | None

   
   Determines the appropriate file to read for parquet metadata
   stored in the ``_common_metadata`` or ``_metadata`` files.


   :Parameters:

       **catalog_base_dir** : str | Path | UPath
           base path for the catalog


   :Returns:

       UPath | None
           path to a parquet file containing metadata schema.
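
   A minimal sketch of this kind of lookup, using only ``pathlib``, is shown below. It
   assumes both files live under ``dataset/`` and that ``_common_metadata`` is preferred
   when both exist; that preference order and the helper name ``pick_schema_file`` are
   assumptions, not a statement of how ``hats`` implements it.

   .. code-block:: python

      import tempfile
      from pathlib import Path


      def pick_schema_file(catalog_base_dir):
          """Return the first existing metadata file under dataset/, or None."""
          dataset_dir = Path(catalog_base_dir) / "dataset"
          # Assumption: _common_metadata is checked before _metadata.
          for name in ("_common_metadata", "_metadata"):
              candidate = dataset_dir / name
              if candidate.exists():
                  return candidate
          return None


      with tempfile.TemporaryDirectory() as tmp:
          # No dataset directory yet, so nothing can be found.
          missing = pick_schema_file(tmp)

          dataset_dir = Path(tmp) / "dataset"
          dataset_dir.mkdir()
          (dataset_dir / "_metadata").touch()
          only_metadata = pick_schema_file(tmp).name

          (dataset_dir / "_common_metadata").touch()
          both_present = pick_schema_file(tmp).name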


   ..
       !! processed by numpydoc !!

.. py:function:: nested_frame_to_vo_schema(nested_frame: nested_pandas.NestedFrame, *, verbose: bool = False, field_units: dict | None = None, field_ucds: dict | None = None, field_descriptions: dict | None = None, field_utypes: dict | None = None)

   
   Create VOTableFile metadata, based on the names and types of fields in the NestedFrame.
   Add ancillary attributes to fields where they are provided in the optional dictionaries.
   Note on field names with nested columns: to include ancillary attributes (units, ucds, etc)
   for a nested sub-column, use dot notation (e.g. ``"lightcurve.band"``). You can add ancillary
   attributes for the entire nested column group using the nested column name (e.g. ``"lightcurve"``).


   :Parameters:

       **nested_frame** : npd.NestedFrame
           nested frame representing catalog data. this can be empty, as we only need to
           know about the column names and types.

       **verbose** : bool
           Should we print out additional debugging statements about the vo metadata?

       **field_units** : dict | None
           dictionary mapping column names to astropy units (or string representation of units)

       **field_ucds** : dict | None
           dictionary mapping column names to UCDs (Uniform Content Descriptors)

       **field_descriptions** : dict | None
           dictionary mapping column names to free-text descriptions

       **field_utypes** : dict | None
           dictionary mapping column names to utypes


   :Returns:

       VOTableFile
           VO object containing all relevant metadata (but no data)
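
   The dot-notation convention for nested sub-columns can be illustrated with a small
   lookup over the optional dictionaries. ``collect_attributes`` is a hypothetical
   helper written for this sketch, and the column names and values are made up.

   .. code-block:: python

      def collect_attributes(
          field_name, *, field_units=None, field_ucds=None,
          field_descriptions=None, field_utypes=None,
      ):
          """Gather the ancillary attributes supplied for one field name."""
          sources = {
              "unit": field_units,
              "ucd": field_ucds,
              "description": field_descriptions,
              "utype": field_utypes,
          }
          return {
              attribute: mapping[field_name]
              for attribute, mapping in sources.items()
              if mapping and field_name in mapping
          }


      field_units = {"ra": "deg", "lightcurve.mjd": "d"}
      field_descriptions = {
          "lightcurve": "Time series of detections",  # whole nested column group
          "lightcurve.band": "Filter name",           # one nested sub-column
      }

      band = collect_attributes(
          "lightcurve.band", field_units=field_units, field_descriptions=field_descriptions
      )
      group = collect_attributes(
          "lightcurve", field_units=field_units, field_descriptions=field_descriptions
      )
      mjd = collect_attributes(
          "lightcurve.mjd", field_units=field_units, field_descriptions=field_descriptions
      )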


   ..
       !! processed by numpydoc !!

.. py:function:: write_voparquet_in_common_metadata(catalog_base_dir: str | pathlib.Path | upath.UPath, *, verbose: bool = False, field_units: dict | None = None, field_ucds: dict | None = None, field_descriptions: dict | None = None, field_utypes: dict | None = None)

   
   Create VOTableFile metadata, based on the names and types of fields in the parquet files,
   and write to a ``catalog_base_dir/dataset/_common_metadata`` parquet file.
   Add ancillary attributes to fields where they are provided in the optional dictionaries.
   Note on field names with nested columns: to include ancillary attributes (units, ucds, etc)
   for a nested sub-column, use dot notation (e.g. ``"lightcurve.band"``). You can add ancillary
   attributes for the entire nested column group using the nested column name (e.g. ``"lightcurve"``).


   :Parameters:

       **catalog_base_dir** : str | Path | UPath
           base path for the catalog

       **verbose** : bool
           Should we print out additional debugging statements about the vo metadata?

       **field_units** : dict | None
           dictionary mapping column names to astropy units (or string representation of units)

       **field_ucds** : dict | None
           dictionary mapping column names to UCDs (Uniform Content Descriptors)

       **field_descriptions** : dict | None
           dictionary mapping column names to free-text descriptions

       **field_utypes** : dict | None
           dictionary mapping column names to utypes


   ..
       !! processed by numpydoc !!