hipscat.catalog.association_catalog

Package Contents

Classes

`AssociationCatalog`	A HiPSCat Catalog for enabling fast joins between two HiPSCat catalogs
`AssociationCatalogInfo`	Catalog Info for a HiPSCat Association Catalog
`PartitionJoinInfo`	Association catalog metadata describing which partitions contain matches in the join

class AssociationCatalog(catalog_info: CatalogInfoClass, pixels: hipscat.catalog.healpix_dataset.healpix_dataset.PixelInputTypes, join_pixels: JoinPixelInputTypes, catalog_path=None, moc: mocpy.MOC | None = None, storage_options: Dict[Any, Any] | None = None)

Bases: hipscat.catalog.healpix_dataset.healpix_dataset.HealpixDataset

A HiPSCat Catalog for enabling fast joins between two HiPSCat catalogs

Catalogs of this type are partitioned according to the partitioning of the left (primary) catalog. The partition_join_info metadata file lists every pair of pixels in the association catalog; each pair corresponds to a partition of the primary catalog and a partition of the join catalog that contain rows to join.

CatalogInfoClass: typing_extensions.TypeAlias

catalog_info: AssociationCatalog.CatalogInfoClass

JoinPixelInputTypes

get_join_pixels() → pandas.DataFrame

Get the join pixels, listing all pairs of pixels from the left and right catalogs that contain matching association rows.

Returns: pd.DataFrame in which each row is a pair of pixels from the primary and join catalogs

static _get_partition_join_info_from_pixels(join_pixels: JoinPixelInputTypes) → hipscat.catalog.association_catalog.partition_join_info.PartitionJoinInfo

classmethod _read_args(catalog_base_dir: hipscat.io.FilePointer, storage_options: Dict[Any, Any] | None = None) → Tuple[CatalogInfoClass, hipscat.catalog.healpix_dataset.healpix_dataset.PixelInputTypes, JoinPixelInputTypes]

classmethod _check_files_exist(catalog_base_dir: hipscat.io.FilePointer, storage_options: dict = None)

class AssociationCatalogInfo

Bases: hipscat.catalog.dataset.base_catalog_info.BaseCatalogInfo

Catalog Info for a HiPSCat Association Catalog

primary_catalog: str | None – Catalog name for the primary (left) side of the association

primary_column: str | None – Column name in the primary (left) side of the join

primary_column_association: str | None – Column name in the association table that matches the primary (left) side of the join

join_catalog: str | None – Catalog name for the joining (right) side of the association

join_column: str | None – Column name in the joining (right) side of the join

join_column_association: str | None – Column name in the association table that matches the joining (right) side of the join

contains_leaf_files: bool = False – Whether the association catalog contains leaf parquet files

required_fields

DEFAULT_TYPE

REQUIRED_TYPE
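As a rough illustration of how the six name fields relate to one another, the sketch below joins two tiny in-memory tables through an association table. It does not use hipscat: `AssociationInfoSketch` and `join_via_association` are hypothetical stand-ins that only mirror the documented field names, all catalog and column values are invented, and keys are assumed to be unique within each table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssociationInfoSketch:
    """Hypothetical stand-in that mirrors the documented field names only."""

    primary_catalog: Optional[str] = None
    primary_column: Optional[str] = None
    primary_column_association: Optional[str] = None
    join_catalog: Optional[str] = None
    join_column: Optional[str] = None
    join_column_association: Optional[str] = None
    contains_leaf_files: bool = False


def join_via_association(info, primary_rows, association_rows, join_rows):
    """Pair primary and join rows that the association table links together."""
    # Index each side by its own key column (assumes keys are unique).
    primary_by_key = {row[info.primary_column]: row for row in primary_rows}
    join_by_key = {row[info.join_column]: row for row in join_rows}
    matches = []
    for link in association_rows:
        left = primary_by_key.get(link[info.primary_column_association])
        right = join_by_key.get(link[info.join_column_association])
        if left is not None and right is not None:
            matches.append((left, right))
    return matches


# Invented example values; none of these names come from a real catalog.
info = AssociationInfoSketch(
    primary_catalog="catalog_a",
    primary_column="id",
    primary_column_association="a_id",
    join_catalog="catalog_b",
    join_column="obj_id",
    join_column_association="b_obj_id",
    contains_leaf_files=True,
)
primary_rows = [{"id": 1, "mag": 17.2}, {"id": 2, "mag": 18.9}]
join_rows = [{"obj_id": 10, "flux": 0.4}, {"obj_id": 11, "flux": 0.7}]
association_rows = [{"a_id": 1, "b_obj_id": 11}, {"a_id": 2, "b_obj_id": 10}]

matched = join_via_association(info, primary_rows, association_rows, join_rows)
```

Here `primary_column` names the key in the primary table and `primary_column_association` names the matching column in the association table; the `join_*` fields play the same roles for the right side.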

class PartitionJoinInfo(join_info_df: pandas.DataFrame, catalog_base_dir: str = None)

Association catalog metadata describing which partitions contain matches in the join

PRIMARY_ORDER_COLUMN_NAME = 'Norder'

PRIMARY_PIXEL_COLUMN_NAME = 'Npix'

JOIN_ORDER_COLUMN_NAME = 'join_Norder'

JOIN_PIXEL_COLUMN_NAME = 'join_Npix'

COLUMN_NAMES

_check_column_names()

primary_to_join_map() → Dict[hipscat.pixel_math.healpix_pixel.HealpixPixel, List[hipscat.pixel_math.healpix_pixel.HealpixPixel]]

Generate a map from a single primary pixel to one or more pixels in the join catalog.

The implementation relies on nested comprehensions, so read it with care. It pairs each (primary order/pixel) tuple with an array of (join order/pixel) tuples.

Returns: dictionary mapping (primary order/pixel) to [array of (join order/pixel)]
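The grouping described above can be sketched with plain Python as follows. This is an illustration rather than the library's implementation: `PixelSketch` is a hypothetical stand-in for `HealpixPixel`, the rows are invented, and an explicit loop replaces the comprehensions for readability. Only the four column names come from the constants listed above.

```python
from collections import namedtuple

# Hypothetical stand-in for hipscat.pixel_math.healpix_pixel.HealpixPixel.
PixelSketch = namedtuple("PixelSketch", ["order", "pixel"])


def primary_to_join_map_sketch(rows):
    """Group join pixels under the primary pixel they are paired with."""
    mapping = {}
    for row in rows:
        primary = PixelSketch(row["Norder"], row["Npix"])
        join = PixelSketch(row["join_Norder"], row["join_Npix"])
        # One primary pixel may map to several join pixels.
        mapping.setdefault(primary, []).append(join)
    return mapping


# Invented rows using the documented column names.
rows = [
    {"Norder": 0, "Npix": 11, "join_Norder": 1, "join_Npix": 44},
    {"Norder": 0, "Npix": 11, "join_Norder": 1, "join_Npix": 45},
    {"Norder": 1, "Npix": 16, "join_Norder": 1, "join_Npix": 16},
]
pixel_map = primary_to_join_map_sketch(rows)
```

In this sketch the primary pixel at order 0, pixel 11 maps to two join pixels, while the pixel at order 1, pixel 16 maps to one.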

write_to_metadata_files(catalog_path: hipscat.io.FilePointer = None, storage_options: dict = None)

Generate parquet metadata, using the known partitions.

Parameters:

catalog_path (FilePointer) – base path for the catalog
storage_options (dict) – dictionary that contains abstract filesystem credentials

Raises:

ValueError – if no path is provided and one could not be inferred.

write_to_csv(catalog_path: hipscat.io.FilePointer = None, storage_options: dict = None)

Write all partition data to CSV files.

Two files will be written:

partition_info.csv - covers all primary catalog pixels, and should match the file structure

partition_join_info.csv - covers all pairwise relationships between primary and join catalogs.

Parameters:

catalog_path – FilePointer to the directory where the partition_join_info.csv file will be written
storage_options (dict) – dictionary that contains abstract filesystem credentials

Raises:

ValueError – if no path is provided and one could not be inferred.
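The relationship between the two files can be sketched with the standard library alone. This is not hipscat's writer. The function name, the `default_path` parameter (standing in for an inferred path), and the two-column layout of partition_info.csv are all assumptions made for illustration. Only the file names and the four join column names come from this page.

```python
import csv
import tempfile
from pathlib import Path


def write_partition_csvs_sketch(pairs, catalog_path=None, default_path=None):
    """Write both CSV files for (Norder, Npix, join_Norder, join_Npix) pairs."""
    target = catalog_path if catalog_path is not None else default_path
    if target is None:
        raise ValueError("no catalog path provided and none could be inferred")
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)

    # partition_info.csv: one row per distinct primary pixel (assumed layout).
    primary_pixels = sorted({(order, pixel) for order, pixel, _, _ in pairs})
    with open(target / "partition_info.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Norder", "Npix"])
        writer.writerows(primary_pixels)

    # partition_join_info.csv: one row per primary/join pixel pair.
    with open(target / "partition_join_info.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Norder", "Npix", "join_Norder", "join_Npix"])
        writer.writerows(pairs)
    return target


# Invented pixel pairs, written to a temporary directory.
pairs = [(0, 11, 1, 44), (0, 11, 1, 45), (1, 16, 1, 16)]
with tempfile.TemporaryDirectory() as tmp:
    out_dir = write_partition_csvs_sketch(pairs, catalog_path=tmp)
    info_lines = (out_dir / "partition_info.csv").read_text().splitlines()
    join_lines = (out_dir / "partition_join_info.csv").read_text().splitlines()
```

The first file collapses repeated primary pixels into a single row each, while the second keeps every pairwise relationship.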

classmethod read_from_dir(catalog_base_dir: hipscat.io.FilePointer, storage_options: dict = None) → PartitionJoinInfo

Read partition join info from a file within a hipscat directory.

This looks for a partition_join_info.csv file first and, if that is not found, falls back to a _metadata file. The second approach is typically slower for large catalogs, so a warning is issued to the user. In internal testing with large catalogs, the first approach takes less than a second, while the second can take 10-20 seconds.

Parameters:

catalog_base_dir – path to the root directory of the catalog
storage_options (dict) – dictionary that contains abstract filesystem credentials

Returns:

A PartitionJoinInfo object with the data from the file

Raises:

FileNotFoundError – if neither desired file is found in the catalog_base_dir
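The lookup order described above can be sketched as follows. This is a simplified illustration and not the library's code: `choose_partition_join_source` is a hypothetical helper, it works on local paths only, it assumes both files sit directly in the catalog root, and the warning text is invented.

```python
import tempfile
import warnings
from pathlib import Path


def choose_partition_join_source(catalog_base_dir):
    """Return the file to read, preferring the CSV over _metadata."""
    base = Path(catalog_base_dir)
    csv_file = base / "partition_join_info.csv"
    metadata_file = base / "_metadata"
    if csv_file.exists():
        return csv_file
    if metadata_file.exists():
        # The slower path is still usable, but the caller is warned.
        warnings.warn("Reading from _metadata, which can be slow for large catalogs.")
        return metadata_file
    raise FileNotFoundError(f"No partition join info found in {base}")


with tempfile.TemporaryDirectory() as tmp:
    # Only _metadata exists: the fallback is chosen and a warning is raised.
    (Path(tmp) / "_metadata").touch()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        fallback_name = choose_partition_join_source(tmp).name
    # Once the CSV exists, it takes precedence.
    (Path(tmp) / "partition_join_info.csv").touch()
    preferred_name = choose_partition_join_source(tmp).name
```

Writing the CSV alongside the parquet metadata therefore lets readers skip the slower path.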

classmethod read_from_file(metadata_file: hipscat.io.FilePointer, strict: bool = False, storage_options: dict = None) → PartitionJoinInfo

Read partition join info from a _metadata file to create an object

Parameters:

metadata_file (FilePointer) – FilePointer to the _metadata file
storage_options (dict) – dictionary that contains abstract filesystem credentials
strict (bool) – use strict parsing of the _metadata file. This is slower, but gives more helpful error messages in the case of invalid data.

Returns:

A PartitionJoinInfo object with the data from the file

classmethod _read_from_metadata_file(metadata_file: hipscat.io.FilePointer, strict: bool = False, storage_options: dict = None) → pandas.DataFrame

Read partition join info from a _metadata file into a DataFrame

Parameters:

metadata_file (FilePointer) – FilePointer to the _metadata file
storage_options (dict) – dictionary that contains abstract filesystem credentials
strict (bool) – use strict parsing of the _metadata file. This is slower, but gives more helpful error messages in the case of invalid data.

Returns:

A pandas.DataFrame with the data from the file

classmethod read_from_csv(partition_join_info_file: hipscat.io.FilePointer, storage_options: dict = None) → PartitionJoinInfo

Read partition join info from a partition_join_info.csv file to create an object

Parameters:

partition_join_info_file (FilePointer) – FilePointer to the partition_join_info.csv file
storage_options (dict) – dictionary that contains abstract filesystem credentials

Returns:

A PartitionJoinInfo object with the data from the file

classmethod _read_from_csv(partition_join_info_file: hipscat.io.FilePointer, storage_options: dict = None) → pandas.DataFrame

Read partition join info from a partition_join_info.csv file into a DataFrame

Parameters:

partition_join_info_file (FilePointer) – FilePointer to the partition_join_info.csv file
storage_options (dict) – dictionary that contains abstract filesystem credentials

Returns:

A pandas.DataFrame with the data from the file
