hats.catalog#
Catalog data wrappers
Classes#
- AssociationCatalog: A HATS Catalog for enabling fast joins between two HATS catalogs
- PartitionJoinInfo: Association catalog metadata describing which partition matches occur in the join
- Catalog: A HATS Catalog with data stored in a HEALPix Hive partitioned structure
- CatalogType: Enum for possible types of catalog
- Dataset: A base HATS dataset that contains a properties file and the data contained in parquet files
- TableProperties: Container class for catalog metadata
- MapCatalog: A HATS table to represent non-point-source data in a continuous map.
- MarginCatalog: A HATS Catalog used to contain the 'margin' of another HATS catalog.
- PartitionInfo: Container class for per-partition info.
Package Contents#
- class AssociationCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], join_pixels: list | pandas.DataFrame | hats.catalog.association_catalog.partition_join_info.PartitionJoinInfo, catalog_path=None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None)[source]#
Bases:
hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset
A HATS Catalog for enabling fast joins between two HATS catalogs
Catalogs of this type are partitioned based on the partitioning of the left catalog. The partition_join_info metadata file specifies all pairs of pixels in the Association Catalog, corresponding to each pair of partitions in each catalog that contain rows to join.
- join_info#
- get_join_pixels() pandas.DataFrame [source]#
Get join pixels listing all pairs of pixels from left and right catalogs that contain matching association rows
- Returns:
pd.DataFrame with each row being a pair of pixels from the primary and join catalogs
- static _get_partition_join_info_from_pixels(join_pixels: list | pandas.DataFrame | hats.catalog.association_catalog.partition_join_info.PartitionJoinInfo) hats.catalog.association_catalog.partition_join_info.PartitionJoinInfo [source]#
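A minimal usage sketch, assuming an association catalog on disk at a hypothetical path and loaded with hats.read_hats:

    import hats

    # Hypothetical path to an association catalog directory.
    association = hats.read_hats("./my_association_catalog")

    # One row per (primary, join) partition pair that contains matching rows.
    join_pixels = association.get_join_pixels()
    print(join_pixels.head())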
- class PartitionJoinInfo(join_info_df: pandas.DataFrame, catalog_base_dir: str = None)[source]#
Association catalog metadata describing which partition matches occur in the join
- PRIMARY_ORDER_COLUMN_NAME = 'Norder'#
- PRIMARY_PIXEL_COLUMN_NAME = 'Npix'#
- JOIN_ORDER_COLUMN_NAME = 'join_Norder'#
- JOIN_PIXEL_COLUMN_NAME = 'join_Npix'#
- COLUMN_NAMES#
- data_frame#
- catalog_base_dir = None#
- primary_to_join_map() dict[hats.pixel_math.healpix_pixel.HealpixPixel, list[hats.pixel_math.healpix_pixel.HealpixPixel]] [source]#
Generate a map from a single primary pixel to one or more pixels in the join catalog.
Note that the implementation relies on nested comprehensions: each (primary order, pixel) key is mapped to a list of (join order, pixel) tuples.
- Returns:
dictionary mapping (primary order/pixel) to [array of (join order/pixel)]
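For illustration, a short sketch of consuming the map; join_info is assumed to be a populated PartitionJoinInfo instance:

    # Keys are primary-catalog HealpixPixel objects; values list the
    # join-catalog HealpixPixel objects each one pairs with.
    pixel_map = join_info.primary_to_join_map()
    for primary_pixel, join_pixel_list in pixel_map.items():
        print(primary_pixel, "->", join_pixel_list)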
- write_to_csv(catalog_path: str | pathlib.Path | upath.UPath | None = None)[source]#
Write all partition data to CSV files.
Two files will be written:
- partition_info.csv - covers all primary catalog pixels, and should match the file structure
- partition_join_info.csv - covers all pairwise relationships between primary and join catalogs.
- Parameters:
catalog_path – path to the directory where the partition_join_info.csv file will be written
- Raises:
ValueError – if no path is provided and none could be inferred.
- classmethod read_from_dir(catalog_base_dir: str | pathlib.Path | upath.UPath | None = None) PartitionJoinInfo [source]#
Read partition join info from a partition_join_info file within a hats directory.
- Parameters:
catalog_base_dir – path to the root directory of the catalog
- Returns:
A PartitionJoinInfo object with the data from the file
- Raises:
FileNotFoundError – if the desired file is not found in the catalog_base_dir
- classmethod read_from_csv(partition_join_info_file: str | pathlib.Path | upath.UPath) PartitionJoinInfo [source]#
Read partition join info from a partition_join_info.csv file to create an object
- Parameters:
partition_join_info_file (UPath) – path to the partition_join_info.csv file
- Returns:
A PartitionJoinInfo object with the data from the file
- classmethod _read_from_csv(partition_join_info_file: str | pathlib.Path | upath.UPath) pandas.DataFrame [source]#
Read partition join info from a partition_join_info.csv file into a dataframe
- Parameters:
partition_join_info_file (UPath) – path to the partition_join_info.csv file
- Returns:
A pandas DataFrame with the data from the file
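A hedged round-trip sketch for this metadata, with a hypothetical catalog directory:

    from hats.catalog import PartitionJoinInfo

    # Hypothetical root directory containing partition_join_info.csv.
    join_info = PartitionJoinInfo.read_from_dir("./my_association_catalog")

    # Re-write the partition_info.csv and partition_join_info.csv files elsewhere.
    join_info.write_to_csv("./copy_of_association_catalog")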
- class Catalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None)[source]#
Bases:
hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset
A HATS Catalog with data stored in a HEALPix Hive partitioned structure
Catalogs of this type are partitioned spatially, contain partition_info metadata specifying the pixels in the Catalog, and on disk conform to the parquet partitioning structure Norder=/Dir=/Npix=.parquet
- generate_negative_tree_pixels() list[hats.pixel_math.HealpixPixel] [source]#
Get the leaf nodes at each healpix order that have zero catalog data.
For example, if an example catalog only had data points in pixel 0 at order 0, then this method would return order 0’s pixels 1 through 11. Used for getting full coverage on margin caches.
- Returns:
List of HealpixPixels representing the ‘negative tree’ for the catalog.
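Following the docstring's example, a sketch with a hypothetical catalog path:

    import hats

    catalog = hats.read_hats("./my_catalog")  # hypothetical path

    # If the catalog only has data in pixel 0 at order 0, this returns
    # order 0's pixels 1 through 11.
    negative_pixels = catalog.generate_negative_tree_pixels()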
- class CatalogType[source]#
Bases:
str, enum.Enum
Enum for possible types of catalog
- OBJECT = 'object'#
- SOURCE = 'source'#
- ASSOCIATION = 'association'#
- INDEX = 'index'#
- MARGIN = 'margin'#
- MAP = 'map'#
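Because the enum also inherits from str, members compare equal to their string values; a small illustration:

    from hats.catalog import CatalogType

    assert CatalogType.MARGIN == "margin"                # plain string comparison
    assert CatalogType("object") is CatalogType.OBJECT   # construct from a value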
- class Dataset(catalog_info: hats.catalog.dataset.table_properties.TableProperties, catalog_path: str | pathlib.Path | upath.UPath | None = None, schema: pyarrow.Schema | None = None)[source]#
A base HATS dataset that contains a properties file and the data contained in parquet files
- catalog_info#
- catalog_name#
- catalog_path = None#
- on_disk#
- catalog_base_dir = None#
- schema = None#
- aggregate_column_statistics(exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None)[source]#
Read footer statistics in parquet metadata, and report on global min/max values.
- Parameters:
exclude_hats_columns (bool) – exclude HATS spatial and partitioning fields from the statistics. Defaults to True.
exclude_columns (List[str]) – additional columns to exclude from the statistics.
include_columns (List[str]) – if specified, only return statistics for the column names provided. Defaults to None, and returns all non-hats columns.
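A usage sketch; the column names are hypothetical, and catalog is assumed to be a loaded Dataset:

    # Global min/max per column, read from the parquet footer statistics.
    stats = catalog.aggregate_column_statistics(include_columns=["ra", "dec"])
    print(stats)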
- per_pixel_statistics(exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, include_stats: list[str] = None, multi_index=False)[source]#
Read footer statistics in parquet metadata, and report on statistics about each pixel partition.
- Parameters:
exclude_hats_columns (bool) – exclude HATS spatial and partitioning fields from the statistics. Defaults to True.
exclude_columns (List[str]) – additional columns to exclude from the statistics.
include_columns (List[str]) – if specified, only return statistics for the column names provided. Defaults to None, and returns all non-hats columns.
include_stats (List[str]) – if specified, only return the kinds of values from list (min_value, max_value, null_count, row_count). Defaults to None, and returns all values.
multi_index (bool) – should the returned frame be created with a multi-index, first on pixel, then on column name? Defaults to False, and instead indexes on pixel, with separate columns per data-column and stat-value combination.
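Under the same assumptions, a sketch of per-partition statistics keyed by a (pixel, column) multi-index:

    pixel_stats = catalog.per_pixel_statistics(
        include_columns=["ra", "dec"],
        include_stats=["min_value", "max_value"],
        multi_index=True,
    )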
- class TableProperties(/, **data: Any)[source]#
Bases:
pydantic.BaseModel
Container class for catalog metadata
- catalog_name: str = None#
- catalog_type: hats.catalog.catalog_type.CatalogType = None#
- total_rows: int = None#
- ra_column: str | None = None#
- dec_column: str | None = None#
- default_columns: list[str] | None = None#
Which columns should be read from parquet files, when the user doesn’t otherwise specify.
- primary_catalog: str | None = None#
Reference to object catalog. Relevant for nested, margin, association, and index.
- margin_threshold: float | None = None#
Threshold of the pixel boundary, expressed in arcseconds.
- primary_column: str | None = None#
Column name in the primary (left) side of join.
- primary_column_association: str | None = None#
Column name in the association table that matches the primary (left) side of join.
- join_catalog: str | None = None#
Catalog name for the joining (right) side of association.
- join_column: str | None = None#
Column name in the joining (right) side of join.
- join_column_association: str | None = None#
Column name in the association table that matches the joining (right) side of join.
- contains_leaf_files: bool | None = None#
Whether or not the association catalog contains leaf parquet files.
- indexing_column: str | None = None#
Column that we provide an index over.
- extra_columns: list[str] | None = None#
Any additional payload columns included in index.
- model_config#
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- npix_suffix: str = None#
Suffix of the Npix partitions. In the standard HATS directory structure, this is ‘.parquet’ because there is a single file in each Npix partition and it is named like ‘Npix=313.parquet’. Other valid directory structures include those with the same single file per partition but which use a different suffix (e.g., npix_suffix = ‘.parq’ or ‘.snappy.parquet’), and also those in which the Npix partitions are actually directories containing 1+ files underneath (and then npix_suffix = ‘/’).
- classmethod space_delimited_list(str_value: str) list[str] [source]#
Convert a space-delimited list string into a python list of strings.
- serialize_as_space_delimited_list(str_list: Iterable[str]) str [source]#
Convert a python list of strings into a space-delimited string.
- check_allowed_and_required() typing_extensions.Self [source]#
Check that type-specific fields are appropriate, and required fields are set.
- copy_and_update(**kwargs)[source]#
Create a validated copy of these table properties, updating the fields provided in kwargs.
- explicit_dict(by_alias=False, exclude_none=True)[source]#
Create a dict, based on fields that have been explicitly set, and are not “extra” keys.
- extra_dict(by_alias=False, exclude_none=True)[source]#
Create a dict, based on fields that are “extra” keys.
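A brief sketch of updating validated properties; the new name is illustrative, and catalog is assumed to be a loaded Dataset:

    # copy_and_update re-validates the model with the new field values.
    updated = catalog.catalog_info.copy_and_update(catalog_name="renamed_catalog")

    # Only explicitly-set, non-"extra" fields appear here.
    print(updated.explicit_dict())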
- class MapCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None)[source]#
Bases:
hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset
A HATS table to represent non-point-source data in a continuous map.
- class MarginCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None)[source]#
Bases:
hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset
A HATS Catalog used to contain the ‘margin’ of another HATS catalog.
Catalogs of this type are used alongside a primary catalog, and contain the margin points for each HEALPix pixel: any points that are within a certain distance of the HEALPix pixel boundary. This is used to ensure spatial operations such as crossmatching can be performed efficiently while maintaining accuracy.
- filter_by_moc(moc: mocpy.MOC) typing_extensions.Self [source]#
Filter the pixels in the margin catalog to only include the margin pixels that overlap with the moc
For the case of margin pixels, this includes any pixels whose margin areas may overlap with the moc. This is not always done with a high accuracy, but always includes any pixels that will overlap, and may include extra partitions that do not.
- Parameters:
moc (mocpy.MOC) – the moc to filter by
- Returns:
A new margin catalog with only the pixels that overlap, or whose margin area overlaps, with the moc. Note that we reset the total_rows to None, as updating it would require a scan over the new pixel sizes.
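A hedged sketch of filtering by a cone-shaped MOC, using mocpy's MOC.from_cone and hypothetical paths:

    import astropy.units as u
    from mocpy import MOC
    import hats

    margin = hats.read_hats("./my_catalog_margin")  # hypothetical margin catalog

    # 1-degree cone; pixels whose margin area may overlap it are retained.
    cone = MOC.from_cone(lon=45 * u.deg, lat=30 * u.deg, radius=1 * u.deg, max_depth=10)
    filtered = margin.filter_by_moc(cone)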
- class PartitionInfo(pixel_list: list[hats.pixel_math.healpix_pixel.HealpixPixel], catalog_base_dir: str = None)[source]#
Container class for per-partition info.
- METADATA_ORDER_COLUMN_NAME = 'Norder'#
- METADATA_PIXEL_COLUMN_NAME = 'Npix'#
- pixel_list#
- catalog_base_dir = None#
- get_healpix_pixels() list[hats.pixel_math.healpix_pixel.HealpixPixel] [source]#
Get healpix pixel objects for all pixels represented as partitions.
- Returns:
List of HealpixPixel
- get_highest_order() int [source]#
Get the highest healpix order for the dataset.
- Returns:
int representing highest order.
- write_to_file(partition_info_file: str | pathlib.Path | upath.UPath | None = None, catalog_path: str | pathlib.Path | upath.UPath | None = None)[source]#
Write all partition data to CSV file.
If no paths are provided, the catalog base directory from the read_from_dir call is used.
- Parameters:
partition_info_file – path to where the partition_info.csv file will be written.
catalog_path – base directory for a catalog where the partition_info.csv file will be written.
- Raises:
ValueError – if no path is provided and none could be inferred.
- classmethod read_from_dir(catalog_base_dir: str | pathlib.Path | upath.UPath | None) PartitionInfo [source]#
Read partition info from a file within a hats directory.
This will look for a partition_info.csv file, and if not found, will look for a _metadata file. The second approach is typically slower for large catalogs, so a warning is issued to the user. In internal testing with large catalogs, the first approach takes less than a second, while the second can take 10-20 seconds.
- Parameters:
catalog_base_dir – path to the root directory of the catalog
- Returns:
A PartitionInfo object with the data from the file
- Raises:
FileNotFoundError – if neither desired file is found in the catalog_base_dir
- classmethod read_from_file(metadata_file: str | pathlib.Path | upath.UPath) PartitionInfo [source]#
Read partition info from a _metadata file to create an object
- Parameters:
metadata_file (UPath) – path to the _metadata file
- Returns:
A PartitionInfo object with the data from the file
- classmethod _read_from_metadata_file(metadata_file: str | pathlib.Path | upath.UPath) list[hats.pixel_math.healpix_pixel.HealpixPixel] [source]#
Read partition info list from a _metadata file.
- Parameters:
metadata_file (UPath) – path to the _metadata file
- Returns:
The list of HealpixPixel extracted from the data in the metadata file
- classmethod read_from_csv(partition_info_file: str | pathlib.Path | upath.UPath) PartitionInfo [source]#
Read partition info from a partition_info.csv file to create an object
- Parameters:
partition_info_file (UPath) – path to the partition_info.csv file
- Returns:
A PartitionInfo object with the data from the file
- classmethod _read_from_csv(partition_info_file: str | pathlib.Path | upath.UPath) PartitionInfo [source]#
Read partition info from a partition_info.csv file to create an object
- Parameters:
partition_info_file (UPath) – path to the partition_info.csv file
- Returns:
A PartitionInfo object with the data from the file
- as_dataframe()[source]#
Construct a pandas dataframe for the partition info pixels.
- Returns:
Dataframe with order, directory, and pixel info.
- classmethod from_healpix(healpix_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel]) PartitionInfo [source]#
Create a partition info object from a list of constituent healpix pixels.
- Parameters:
healpix_pixels – list of healpix pixels
- Returns:
A PartitionInfo object with the same healpix pixels
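A small sketch constructing partition info directly from pixels:

    from hats.catalog import PartitionInfo
    from hats.pixel_math import HealpixPixel

    info = PartitionInfo.from_healpix([HealpixPixel(0, 11), HealpixPixel(1, 0)])
    print(info.get_highest_order())  # 1
    print(info.as_dataframe())       # order, directory, and pixel columns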