hats.catalog#

Catalog data wrappers

Classes#

AssociationCatalog

A HATS Catalog for enabling fast joins between two HATS catalogs

PartitionJoinInfo

Association catalog metadata specifying which partition pairs match in the join

Catalog

A HATS Catalog with data stored in a HEALPix Hive partitioned structure

CatalogType

Enum for possible types of catalog

Dataset

A base HATS dataset that contains a properties file

TableProperties

Container class for catalog metadata

MapCatalog

A HATS table to represent non-point-source data in a continuous map.

MarginCatalog

A HATS Catalog used to contain the 'margin' of another HATS catalog.

PartitionInfo

Container class for per-partition info.

Package Contents#

class AssociationCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], join_pixels: list | pandas.DataFrame | hats.catalog.association_catalog.partition_join_info.PartitionJoinInfo, catalog_path=None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None)[source]#

Bases: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset

A HATS Catalog for enabling fast joins between two HATS catalogs

Catalogs of this type are partitioned based on the partitioning of the left catalog. The partition_join_info metadata file specifies all pairs of pixels in the Association Catalog, corresponding to the pairs of partitions, one from each catalog, that contain rows to join.

join_info#
get_join_pixels() → pandas.DataFrame[source]#

Get join pixels listing all pairs of pixels from left and right catalogs that contain matching association rows

Returns:

pd.DataFrame with each row being a pair of pixels from the primary and join catalogs
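
A minimal usage sketch, assuming hats.read_hats as the loader; the catalog path is hypothetical:

```python
import hats

# Load an association catalog from disk (hypothetical path).
assoc = hats.read_hats("data/my_association_catalog")

# Each row pairs a primary (Norder, Npix) with a join (join_Norder, join_Npix).
join_pixels = assoc.get_join_pixels()
print(join_pixels.head())
```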

static _get_partition_join_info_from_pixels(join_pixels: list | pandas.DataFrame | hats.catalog.association_catalog.partition_join_info.PartitionJoinInfo) → hats.catalog.association_catalog.partition_join_info.PartitionJoinInfo[source]#
class PartitionJoinInfo(join_info_df: pandas.DataFrame, catalog_base_dir: str = None)[source]#

Association catalog metadata specifying which partition pairs match in the join

PRIMARY_ORDER_COLUMN_NAME = 'Norder'#
PRIMARY_PIXEL_COLUMN_NAME = 'Npix'#
JOIN_ORDER_COLUMN_NAME = 'join_Norder'#
JOIN_PIXEL_COLUMN_NAME = 'join_Npix'#
COLUMN_NAMES#
data_frame#
catalog_base_dir = None#
_check_column_names()[source]#
primary_to_join_map() → dict[hats.pixel_math.healpix_pixel.HealpixPixel, list[hats.pixel_math.healpix_pixel.HealpixPixel]][source]#

Generate a map from a single primary pixel to one or more pixels in the join catalog.

Dense comprehensions are happening here, so watch out! We create a tuple of (primary order/pixel) and an array of (join order/pixel) tuples.

Returns:

dictionary mapping (primary order/pixel) to [array of (join order/pixel)]
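
For illustration, a sketch of consuming this map, assuming hats.read_hats and a hypothetical catalog path:

```python
import hats

# Map each primary pixel to the join-catalog pixels it pairs with.
assoc = hats.read_hats("data/my_association_catalog")
pixel_map = assoc.join_info.primary_to_join_map()
for primary_pixel, join_pixel_list in pixel_map.items():
    print(primary_pixel, "->", join_pixel_list)
```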

write_to_csv(catalog_path: str | pathlib.Path | upath.UPath | None = None)[source]#

Write all partition data to CSV files.

Two files will be written:

  • partition_info.csv - covers all primary catalog pixels, and should match the catalog's file structure

  • partition_join_info.csv - covers all pairwise relationships between primary and join catalogs.

Parameters:

catalog_path – path to the directory where the partition_join_info.csv file will be written

Raises:

ValueError – if no path is provided and none could be inferred.

classmethod read_from_dir(catalog_base_dir: str | pathlib.Path | upath.UPath | None = None) → PartitionJoinInfo[source]#

Read partition join info from a partition_join_info file within a hats directory.

Parameters:

catalog_base_dir – path to the root directory of the catalog

Returns:

A PartitionJoinInfo object with the data from the file

Raises:

FileNotFoundError – if the desired file is not found in the catalog_base_dir
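
A sketch of the read/write round trip with write_to_csv above; the directory path is hypothetical:

```python
from hats.catalog.association_catalog.partition_join_info import PartitionJoinInfo

# Read join info from an existing catalog directory, then rewrite the CSV files.
join_info = PartitionJoinInfo.read_from_dir("data/my_association_catalog")
join_info.write_to_csv("data/my_association_catalog")
```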

classmethod read_from_csv(partition_join_info_file: str | pathlib.Path | upath.UPath) → PartitionJoinInfo[source]#

Read partition join info from a partition_join_info.csv file to create an object

Parameters:

partition_join_info_file (UPath) – path to the partition_join_info.csv file

Returns:

A PartitionJoinInfo object with the data from the file

classmethod _read_from_csv(partition_join_info_file: str | pathlib.Path | upath.UPath) → pandas.DataFrame[source]#

Read partition join info from a partition_join_info.csv file into a pandas DataFrame

Parameters:

partition_join_info_file (UPath) – path to the partition_join_info.csv file

Returns:

A pandas DataFrame with the data from the file

class Catalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None)[source]#

Bases: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset

A HATS Catalog with data stored in a HEALPix Hive partitioned structure

Catalogs of this type are partitioned spatially, contain partition_info metadata specifying the pixels in the Catalog, and on disk conform to the parquet partitioning structure Norder={order}/Dir={directory}/Npix={pixel}.parquet

generate_negative_tree_pixels() → list[hats.pixel_math.HealpixPixel][source]#

Get the leaf nodes at each healpix order that have zero catalog data.

For example, if a catalog only has data points in pixel 0 at order 0, this method returns order 0’s pixels 1 through 11. Used for getting full coverage on margin caches.

Returns:

List of HealpixPixels representing the ‘negative tree’ for the catalog.
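
A short sketch of the example above, assuming hats.read_hats and a hypothetical catalog whose data sits only in order-0 pixel 0:

```python
import hats

catalog = hats.read_hats("data/my_catalog")
negative_pixels = catalog.generate_negative_tree_pixels()
# For data only in order-0 pixel 0, this holds order-0 pixels 1 through 11.
```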

class CatalogType[source]#

Bases: str, enum.Enum

Enum for possible types of catalog

OBJECT = 'object'#
SOURCE = 'source'#
ASSOCIATION = 'association'#
INDEX = 'index'#
MARGIN = 'margin'#
MAP = 'map'#
classmethod all_types()[source]#

Fetch a list of all catalog types
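
A quick sketch of working with the enum:

```python
from hats.catalog import CatalogType

print(CatalogType.all_types())
# Because CatalogType subclasses str, members compare equal to their string values.
assert CatalogType.MARGIN == "margin"
```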

class Dataset(catalog_info: hats.catalog.dataset.table_properties.TableProperties, catalog_path: str | pathlib.Path | upath.UPath | None = None, schema: pyarrow.Schema | None = None)[source]#

A base HATS dataset that contains a properties file and data stored in parquet files

catalog_info#
catalog_name#
catalog_path = None#
on_disk#
catalog_base_dir = None#
schema = None#
aggregate_column_statistics(exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None)[source]#

Read footer statistics in parquet metadata, and report on global min/max values.

Parameters:
  • exclude_hats_columns (bool) – exclude HATS spatial and partitioning fields from the statistics. Defaults to True.

  • exclude_columns (List[str]) – additional columns to exclude from the statistics.

  • include_columns (List[str]) – if specified, only return statistics for the column names provided. Defaults to None, and returns all non-hats columns.
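
A usage sketch; the catalog path and the column names "ra" and "dec" are hypothetical:

```python
import hats

catalog = hats.read_hats("data/my_catalog")
# Global min/max values per column, read from parquet footers only.
stats = catalog.aggregate_column_statistics(include_columns=["ra", "dec"])
print(stats)
```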

per_pixel_statistics(exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, include_stats: list[str] = None, multi_index=False)[source]#

Read footer statistics in parquet metadata, and report on statistics about each pixel partition.

Parameters:
  • exclude_hats_columns (bool) – exclude HATS spatial and partitioning fields from the statistics. Defaults to True.

  • exclude_columns (List[str]) – additional columns to exclude from the statistics.

  • include_columns (List[str]) – if specified, only return statistics for the column names provided. Defaults to None, and returns all non-hats columns.

  • include_stats (List[str]) – if specified, only return the kinds of values from list (min_value, max_value, null_count, row_count). Defaults to None, and returns all values.

  • multi_index (bool) – should the returned frame be created with a multi-index, first on pixel, then on column name? Default is False, which instead indexes on pixel, with a separate column per data-column and stat-value combination.
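
A usage sketch; the catalog path and column names are hypothetical:

```python
import hats

catalog = hats.read_hats("data/my_catalog")
# Row counts per pixel partition, indexed first on pixel, then on column name.
per_pixel = catalog.per_pixel_statistics(
    include_columns=["ra", "dec"],
    include_stats=["row_count"],
    multi_index=True,
)
```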

class TableProperties(/, **data: Any)[source]#

Bases: pydantic.BaseModel

Container class for catalog metadata

catalog_name: str = None#
catalog_type: hats.catalog.catalog_type.CatalogType = None#
total_rows: int = None#
ra_column: str | None = None#
dec_column: str | None = None#
default_columns: list[str] | None = None#

Which columns should be read from parquet files, when the user doesn’t otherwise specify.

primary_catalog: str | None = None#

Reference to object catalog. Relevant for nested, margin, association, and index.

margin_threshold: float | None = None#

Threshold of the pixel boundary, expressed in arcseconds.

primary_column: str | None = None#

Column name in the primary (left) side of join.

primary_column_association: str | None = None#

Column name in the association table that matches the primary (left) side of join.

join_catalog: str | None = None#

Catalog name for the joining (right) side of association.

join_column: str | None = None#

Column name in the joining (right) side of join.

join_column_association: str | None = None#

Column name in the association table that matches the joining (right) side of join.

contains_leaf_files: bool | None = None#

Whether or not the association catalog contains leaf parquet files.

indexing_column: str | None = None#

Column that we provide an index over.

extra_columns: list[str] | None = None#

Any additional payload columns included in index.

model_config#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

npix_suffix: str = None#

Suffix of the Npix partitions. In the standard HATS directory structure, this is ‘.parquet’ because there is a single file in each Npix partition and it is named like ‘Npix=313.parquet’. Other valid directory structures include those with the same single file per partition but which use a different suffix (e.g., npix_suffix = ‘.parq’ or ‘.snappy.parquet’), and also those in which the Npix partitions are actually directories containing 1+ files underneath (and then npix_suffix = ‘/’).
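
As an illustration of how the suffix participates in leaf paths, a hand-rolled sketch; the Dir grouping factor of 10,000 is assumed here, not taken from this page:

```python
# Compose a leaf path under the standard layout (illustrative only).
norder, npix, npix_suffix = 1, 47, ".parquet"
directory = (npix // 10_000) * 10_000  # Dir groups pixels in blocks of 10,000
leaf = f"Norder={norder}/Dir={directory}/Npix={npix}{npix_suffix}"
# -> "Norder=1/Dir=0/Npix=47.parquet"
```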

classmethod space_delimited_list(str_value: str) → list[str][source]#

Convert a space-delimited list string into a Python list of strings.

serialize_as_space_delimited_list(str_list: Iterable[str]) → str[source]#

Convert a Python list of strings into a space-delimited string.
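
A small sketch, assuming the validator is directly callable as the classmethod documented above; "ra dec mag" is a hypothetical value:

```python
from hats.catalog.dataset.table_properties import TableProperties

# Parse a space-delimited properties-file value into a Python list.
cols = TableProperties.space_delimited_list("ra dec mag")
# -> ["ra", "dec", "mag"]
```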

check_allowed_and_required() → typing_extensions.Self[source]#

Check that type-specific fields are appropriate, and required fields are set.

copy_and_update(**kwargs)[source]#

Create a validated copy of these table properties, updating the fields provided in kwargs.

explicit_dict(by_alias=False, exclude_none=True)[source]#

Create a dict, based on fields that have been explicitly set, and are not “extra” keys.

extra_dict(by_alias=False, exclude_none=True)[source]#

Create a dict, based on fields that are “extra” keys.

__str__()[source]#

Friendly string representation based on named fields.

classmethod read_from_dir(catalog_dir: str | pathlib.Path | upath.UPath) → typing_extensions.Self[source]#

Read field values from a java-style properties file.

to_properties_file(catalog_dir: str | pathlib.Path | upath.UPath) → typing_extensions.Self[source]#

Write fields to a java-style properties file.
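
A sketch of a read/update/write round trip; both paths and the new catalog name are hypothetical:

```python
from hats.catalog.dataset.table_properties import TableProperties

props = TableProperties.read_from_dir("data/my_catalog")
updated = props.copy_and_update(catalog_name="my_catalog_copy")
updated.to_properties_file("data/my_catalog_copy")
```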

class MapCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None)[source]#

Bases: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset

A HATS table to represent non-point-source data in a continuous map.

class MarginCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None)[source]#

Bases: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset

A HATS Catalog used to contain the ‘margin’ of another HATS catalog.

Catalogs of this type are used alongside a primary catalog, and contain the margin points for each HEALPix pixel - any points that are within a certain distance of the HEALPix pixel boundary. This ensures spatial operations such as crossmatching can be performed efficiently while maintaining accuracy.

filter_by_moc(moc: mocpy.MOC) → typing_extensions.Self[source]#

Filter the pixels in the margin catalog to only include the margin pixels that overlap with the moc

For the case of margin pixels, this includes any pixels whose margin areas may overlap with the moc. This filtering is not always exact, but it always includes every pixel that overlaps, and may include extra partitions that do not.

Parameters:

moc (mocpy.MOC) – the moc to filter by

Returns:

A new margin catalog with only the pixels that overlap, or whose margin areas overlap, with the moc. Note that we reset the total_rows to None, as updating it would require a scan over the new pixel sizes.
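
A usage sketch; margin_catalog stands in for an already-loaded MarginCatalog, and the cone coverage map is built with mocpy:

```python
import astropy.units as u
from astropy.coordinates import Latitude, Longitude
from mocpy import MOC

# A 1-degree cone as the filtering region (hypothetical sky position).
moc = MOC.from_cone(
    lon=Longitude(120 * u.deg), lat=Latitude(-30 * u.deg),
    radius=1 * u.deg, max_depth=10,
)
filtered = margin_catalog.filter_by_moc(moc)
```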

class PartitionInfo(pixel_list: list[hats.pixel_math.healpix_pixel.HealpixPixel], catalog_base_dir: str = None)[source]#

Container class for per-partition info.

METADATA_ORDER_COLUMN_NAME = 'Norder'#
METADATA_PIXEL_COLUMN_NAME = 'Npix'#
pixel_list#
catalog_base_dir = None#
get_healpix_pixels() → list[hats.pixel_math.healpix_pixel.HealpixPixel][source]#

Get healpix pixel objects for all pixels represented as partitions.

Returns:

List of HealpixPixel

get_highest_order() → int[source]#

Get the highest healpix order for the dataset.

Returns:

int representing highest order.

write_to_file(partition_info_file: str | pathlib.Path | upath.UPath | None = None, catalog_path: str | pathlib.Path | upath.UPath | None = None)[source]#

Write all partition data to CSV file.

If no paths are provided, the catalog base directory from the read_from_dir call is used.

Parameters:
  • partition_info_file – path to where the partition_info.csv file will be written.

  • catalog_path – base directory for a catalog where the partition_info.csv file will be written.

Raises:

ValueError – if no path is provided and none could be inferred.

classmethod read_from_dir(catalog_base_dir: str | pathlib.Path | upath.UPath | None) → PartitionInfo[source]#

Read partition info from a file within a hats directory.

This will look for a partition_info.csv file and, if not found, a _metadata file. The second approach is typically slower for large catalogs, so a warning is issued to the user. In internal testing with large catalogs, the first approach takes less than a second, while the second can take 10-20 seconds.

Parameters:

catalog_base_dir – path to the root directory of the catalog

Returns:

A PartitionInfo object with the data from the file

Raises:

FileNotFoundError – if neither desired file is found in the catalog_base_dir
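
A minimal sketch; the catalog path is hypothetical:

```python
from hats.catalog import PartitionInfo

# Prefers partition_info.csv; falls back to _metadata with a warning.
info = PartitionInfo.read_from_dir("data/my_catalog")
print(info.get_highest_order())
```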

classmethod read_from_file(metadata_file: str | pathlib.Path | upath.UPath) → PartitionInfo[source]#

Read partition info from a _metadata file to create an object

Parameters:

metadata_file (UPath) – path to the _metadata file

Returns:

A PartitionInfo object with the data from the file

classmethod _read_from_metadata_file(metadata_file: str | pathlib.Path | upath.UPath) → list[hats.pixel_math.healpix_pixel.HealpixPixel][source]#

Read partition info list from a _metadata file.

Parameters:

metadata_file (UPath) – path to the _metadata file

Returns:

The list of HealpixPixel extracted from the data in the metadata file

classmethod read_from_csv(partition_info_file: str | pathlib.Path | upath.UPath) → PartitionInfo[source]#

Read partition info from a partition_info.csv file to create an object

Parameters:

partition_info_file (UPath) – path to the partition_info.csv file

Returns:

A PartitionInfo object with the data from the file

classmethod _read_from_csv(partition_info_file: str | pathlib.Path | upath.UPath) → PartitionInfo[source]#

Read partition info from a partition_info.csv file to create an object

Parameters:

partition_info_file (UPath) – path to the partition_info.csv file

Returns:

A PartitionInfo object with the data from the file

as_dataframe()[source]#

Construct a pandas dataframe for the partition info pixels.

Returns:

Dataframe with order, directory, and pixel info.

classmethod from_healpix(healpix_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel]) → PartitionInfo[source]#

Create a partition info object from a list of constituent healpix pixels.

Parameters:

healpix_pixels – list of healpix pixels

Returns:

A PartitionInfo object with the same healpix pixels
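
A short sketch; the (order, pixel) values are arbitrary examples:

```python
from hats.catalog import PartitionInfo
from hats.pixel_math import HealpixPixel

info = PartitionInfo.from_healpix([HealpixPixel(0, 4), HealpixPixel(1, 20)])
print(info.as_dataframe())  # one row per partition: order, directory, pixel
```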

calculate_fractional_coverage()[source]#

Calculate what fraction of the sky is covered by partition tiles.
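
The underlying arithmetic can be reproduced by hand, on the assumption that the method sums per-pixel areas: an order-k HEALPix tessellation has 12 * 4**k pixels, so each pixel covers the reciprocal of that fraction of the sky.

```python
from hats.catalog import PartitionInfo
from hats.pixel_math import HealpixPixel

info = PartitionInfo.from_healpix([HealpixPixel(0, 4), HealpixPixel(1, 20)])
# Each order-k pixel covers 1 / (12 * 4**k) of the sphere.
fraction = sum(1 / (12 * 4**pixel.order) for pixel in info.get_healpix_pixels())
# -> 1/12 + 1/48 ≈ 0.104
```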