hipscat.catalog.association_catalog#

Submodules#

Package Contents#

Classes#

AssociationCatalog

A HiPSCat Catalog for enabling fast joins between two HiPSCat catalogs

AssociationCatalogInfo

Catalog Info for a HiPSCat Association Catalog

PartitionJoinInfo

Association catalog metadata describing which pairs of partitions contain matches in the join

class AssociationCatalog(catalog_info: CatalogInfoClass, pixels: hipscat.catalog.healpix_dataset.healpix_dataset.PixelInputTypes, join_pixels: JoinPixelInputTypes, catalog_path=None, moc: mocpy.MOC | None = None, storage_options: Dict[Any, Any] | None = None)[source]#

Bases: hipscat.catalog.healpix_dataset.healpix_dataset.HealpixDataset

A HiPSCat Catalog for enabling fast joins between two HiPSCat catalogs

Catalogs of this type are partitioned based on the partitioning of the left catalog. The partition_join_info metadata file specifies all pairs of pixels in the Association Catalog, corresponding to each pair of partitions in each catalog that contain rows to join.

CatalogInfoClass: typing_extensions.TypeAlias#
catalog_info: AssociationCatalog.CatalogInfoClass#
JoinPixelInputTypes#
get_join_pixels() pandas.DataFrame[source]#

Get join pixels listing all pairs of pixels from left and right catalogs that contain matching association rows

Returns:

pd.DataFrame with each row being a pair of pixels from the primary and join catalogs

static _get_partition_join_info_from_pixels(join_pixels: JoinPixelInputTypes) hipscat.catalog.association_catalog.partition_join_info.PartitionJoinInfo[source]#
classmethod _read_args(catalog_base_dir: hipscat.io.FilePointer, storage_options: Dict[Any, Any] | None = None) Tuple[CatalogInfoClass, hipscat.catalog.healpix_dataset.healpix_dataset.PixelInputTypes, JoinPixelInputTypes][source]#
classmethod _check_files_exist(catalog_base_dir: hipscat.io.FilePointer, storage_options: dict = None)[source]#
class AssociationCatalogInfo[source]#

Bases: hipscat.catalog.dataset.base_catalog_info.BaseCatalogInfo

Catalog Info for a HiPSCat Association Catalog

primary_catalog: str | None#

Catalog name for the primary (left) side of association

primary_column: str | None#

Column name in the primary (left) side of join

primary_column_association: str | None#

Column name in the association table that matches the primary (left) side of join

join_catalog: str | None#

Catalog name for the joining (right) side of association

join_column: str | None#

Column name in the joining (right) side of join

join_column_association: str | None#

Column name in the association table that matches the joining (right) side of join

contains_leaf_files: bool = False#

Whether or not the association catalog contains leaf parquet files

required_fields#
DEFAULT_TYPE#
REQUIRED_TYPE#
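The four column fields above describe how rows of the association table link the two catalogs. The following is a minimal sketch of that relationship in plain Python, not hipscat code: every catalog row, column name, and value below is hypothetical and chosen only for illustration.

```python
# Hypothetical primary (left) and join (right) catalog rows.
primary_rows = [{"obj_id": 1, "mag": 17.2}, {"obj_id": 2, "mag": 18.9}]
join_rows = [{"src_id": 10, "flux": 3.1}, {"src_id": 11, "flux": 4.7}]

# Hypothetical association table: one row per matching (primary, join) pair.
association_rows = [
    {"primary_id": 1, "join_id": 10},
    {"primary_id": 1, "join_id": 11},
]

# Field names mirror AssociationCatalogInfo; the values are assumptions.
info = {
    "primary_column": "obj_id",
    "primary_column_association": "primary_id",
    "join_column": "src_id",
    "join_column_association": "join_id",
}

# Index each catalog by its own join column.
primary_index = {row[info["primary_column"]]: row for row in primary_rows}
join_index = {row[info["join_column"]]: row for row in join_rows}

# Resolve each association row into one combined output row.
joined = [
    {
        **primary_index[assoc[info["primary_column_association"]]],
        **join_index[assoc[info["join_column_association"]]],
    }
    for assoc in association_rows
]
```

In this sketch `primary_column_association` and `join_column_association` name the association-table columns, while `primary_column` and `join_column` name the columns in the catalogs themselves.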
class PartitionJoinInfo(join_info_df: pandas.DataFrame, catalog_base_dir: str = None)[source]#

Association catalog metadata describing which pairs of partitions contain matches in the join

PRIMARY_ORDER_COLUMN_NAME = 'Norder'#
PRIMARY_PIXEL_COLUMN_NAME = 'Npix'#
JOIN_ORDER_COLUMN_NAME = 'join_Norder'#
JOIN_PIXEL_COLUMN_NAME = 'join_Npix'#
COLUMN_NAMES#
_check_column_names()[source]#
primary_to_join_map() Dict[hipscat.pixel_math.healpix_pixel.HealpixPixel, List[hipscat.pixel_math.healpix_pixel.HealpixPixel]][source]#

Generate a map from a single primary pixel to one or more pixels in the join catalog.

The implementation relies on nested comprehensions, so read it carefully. Each entry pairs a (primary order/pixel) key with an [array of (join order/pixel)] values.

Returns:

dictionary mapping (primary order/pixel) to [array of (join order/pixel)]
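The grouping described above can be sketched with the standard library alone. This is an illustration of the shape of the result, not the library's implementation: `Pixel` is a hypothetical stand-in for `HealpixPixel`, and the rows are made-up values laid out as (Norder, Npix, join_Norder, join_Npix).

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for hipscat.pixel_math.healpix_pixel.HealpixPixel.
Pixel = namedtuple("Pixel", ["order", "pixel"])

# Made-up rows shaped like (Norder, Npix, join_Norder, join_Npix).
rows = [
    (0, 11, 1, 44),
    (0, 11, 1, 45),
    (1, 16, 2, 64),
]


def primary_to_join_map(rows):
    """Group join pixels under the primary pixel they are paired with."""
    mapping = defaultdict(list)
    for norder, npix, join_norder, join_npix in rows:
        mapping[Pixel(norder, npix)].append(Pixel(join_norder, join_npix))
    return dict(mapping)


pixel_map = primary_to_join_map(rows)
```

Here the primary pixel at order 0, pixel 11 maps to two join pixels, while the primary pixel at order 1, pixel 16 maps to one.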

write_to_metadata_files(catalog_path: hipscat.io.FilePointer = None, storage_options: dict = None)[source]#

Generate parquet metadata, using the known partitions.

Parameters:
  • catalog_path (FilePointer) – base path for the catalog

  • storage_options (dict) – dictionary that contains abstract filesystem credentials

Raises:

ValueError – if no path is provided and one could not be inferred.

write_to_csv(catalog_path: hipscat.io.FilePointer = None, storage_options: dict = None)[source]#

Write all partition data to CSV files.

Two files will be written:

  • partition_info.csv - covers all primary catalog pixels, and should match the file structure

  • partition_join_info.csv - covers all pairwise relationships between primary and join catalogs.

Parameters:
  • catalog_path – FilePointer to the directory where the partition_join_info.csv file will be written

  • storage_options (dict) – dictionary that contains abstract filesystem credentials

Raises:

ValueError – if no path is provided and one could not be inferred.
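As a rough picture of what a partition_join_info.csv file might contain, the sketch below writes and re-reads pairwise rows using only the standard library. The four column names come from the PartitionJoinInfo constants; the column order, the sample values, and the use of an in-memory buffer instead of a catalog directory are assumptions, and the real file may carry additional columns.

```python
import csv
import io

# Column names taken from the PartitionJoinInfo constants.
COLUMNS = ["Norder", "Npix", "join_Norder", "join_Npix"]

# Made-up pairs of (primary pixel, join pixel).
pairs = [
    {"Norder": 0, "Npix": 11, "join_Norder": 1, "join_Npix": 44},
    {"Norder": 0, "Npix": 11, "join_Norder": 1, "join_Npix": 45},
]

# Write to an in-memory buffer rather than a real catalog directory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerows(pairs)
csv_text = buffer.getvalue()

# Read the rows back and restore the integer types.
read_back = [
    {name: int(value) for name, value in row.items()}
    for row in csv.DictReader(io.StringIO(csv_text))
]
```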

classmethod read_from_dir(catalog_base_dir: hipscat.io.FilePointer, storage_options: dict = None) PartitionJoinInfo[source]#

Read partition join info from a file within a hipscat directory.

This looks for a partition_join_info.csv file first and, if it is not found, falls back to a _metadata file. The fallback is typically slower for large catalogs, so a warning is issued to the user. In internal testing with large catalogs, reading the CSV file takes less than a second, while reading the _metadata file can take 10-20 seconds.

Parameters:
  • catalog_base_dir – path to the root directory of the catalog

  • storage_options (dict) – dictionary that contains abstract filesystem credentials

Returns:

A PartitionJoinInfo object with the data from the file

Raises:

FileNotFoundError – if neither desired file is found in the catalog_base_dir
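The lookup order described above can be sketched as follows. `choose_join_info_file` is a hypothetical helper written for this illustration, not part of hipscat; it assumes both files sit directly in the catalog base directory and works on local paths only, ignoring `storage_options`.

```python
import tempfile
import warnings
from pathlib import Path


def choose_join_info_file(catalog_base_dir):
    """Hypothetical helper: pick the file to read, preferring the CSV."""
    base = Path(catalog_base_dir)
    csv_file = base / "partition_join_info.csv"
    metadata_file = base / "_metadata"
    if csv_file.exists():
        return csv_file
    if metadata_file.exists():
        # The fallback works but is typically slower for large catalogs.
        warnings.warn("Reading partition join info from _metadata; this may be slow.")
        return metadata_file
    raise FileNotFoundError(f"No partition join info found in {base}")


# Exercise the fallback path with a directory holding only a _metadata file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "_metadata").touch()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        chosen = choose_join_info_file(tmp).name
    warning_count = len(caught)
```

Preferring the CSV keeps the common case fast, and the warning makes the slower path visible to the caller.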

classmethod read_from_file(metadata_file: hipscat.io.FilePointer, strict: bool = False, storage_options: dict = None) PartitionJoinInfo[source]#

Read partition join info from a _metadata file to create an object

Parameters:
  • metadata_file (FilePointer) – FilePointer to the _metadata file

  • storage_options (dict) – dictionary that contains abstract filesystem credentials

  • strict (bool) – use strict parsing of the _metadata file. This is slower, but gives more helpful error messages in the case of invalid data.

Returns:

A PartitionJoinInfo object with the data from the file

classmethod _read_from_metadata_file(metadata_file: hipscat.io.FilePointer, strict: bool = False, storage_options: dict = None) pandas.DataFrame[source]#

Read partition join info from a _metadata file into a pandas DataFrame

Parameters:
  • metadata_file (FilePointer) – FilePointer to the _metadata file

  • storage_options (dict) – dictionary that contains abstract filesystem credentials

  • strict (bool) – use strict parsing of the _metadata file. This is slower, but gives more helpful error messages in the case of invalid data.

Returns:

A pandas.DataFrame with the data from the file

classmethod read_from_csv(partition_join_info_file: hipscat.io.FilePointer, storage_options: dict = None) PartitionJoinInfo[source]#

Read partition join info from a partition_join_info.csv file to create an object

Parameters:
  • partition_join_info_file (FilePointer) – FilePointer to the partition_join_info.csv file

  • storage_options (dict) – dictionary that contains abstract filesystem credentials

Returns:

A PartitionJoinInfo object with the data from the file

classmethod _read_from_csv(partition_join_info_file: hipscat.io.FilePointer, storage_options: dict = None) pandas.DataFrame[source]#

Read partition join info from a partition_join_info.csv file into a pandas DataFrame

Parameters:
  • partition_join_info_file (FilePointer) – FilePointer to the partition_join_info.csv file

  • storage_options (dict) – dictionary that contains abstract filesystem credentials

Returns:

A pandas.DataFrame with the data from the file