hats.catalog#
Catalog data wrappers
Classes#
- AssociationCatalog: A HATS Catalog for enabling fast joins between two HATS catalogs
- PartitionJoinInfo: Association catalog metadata describing which partition matches occur in the join
- Catalog: A HATS Catalog with data stored in a HEALPix Hive partitioned structure
- CatalogType: Enum for possible types of catalog
- Dataset: A base HATS dataset that contains a properties file and the data contained in parquet files
- TableProperties: Container class for catalog metadata
- MapCatalog: A HATS table to represent non-point-source data in a continuous map.
- MarginCatalog: A HATS Catalog used to contain the 'margin' of another HATS catalog.
- PartitionInfo: Container class for per-partition info.
Package Contents#
- class AssociationCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], join_pixels: list | pandas.DataFrame | hats.catalog.association_catalog.partition_join_info.PartitionJoinInfo, catalog_path=None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None)[source]#
Bases:
hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset
A HATS Catalog for enabling fast joins between two HATS catalogs
Catalogs of this type are partitioned based on the partitioning of the left catalog. The partition_join_info metadata file specifies all pairs of pixels in the Association Catalog, corresponding to each pair of partitions in each catalog that contain rows to join.
- join_info#
- get_join_pixels() pandas.DataFrame [source]#
Get join pixels listing all pairs of pixels from left and right catalogs that contain matching association rows
- Returns:
pd.DataFrame with each row being a pair of pixels from the primary and join catalogs
- static _get_partition_join_info_from_pixels(join_pixels: list | pandas.DataFrame | hats.catalog.association_catalog.partition_join_info.PartitionJoinInfo) hats.catalog.association_catalog.partition_join_info.PartitionJoinInfo [source]#
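A minimal usage sketch, assuming an association catalog on disk at a hypothetical path and loaded with hats.read_hats:

    import hats

    # Hypothetical path to an association catalog directory.
    association = hats.read_hats("./my_association_catalog")

    # One row per (primary, join) partition pair that contains matching rows.
    join_pixels = association.get_join_pixels()
    print(join_pixels.head())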
- class PartitionJoinInfo(join_info_df: pandas.DataFrame, catalog_base_dir: str = None)[source]#
Association catalog metadata describing which partition matches occur in the join
- PRIMARY_ORDER_COLUMN_NAME = 'Norder'#
- PRIMARY_PIXEL_COLUMN_NAME = 'Npix'#
- JOIN_ORDER_COLUMN_NAME = 'join_Norder'#
- JOIN_PIXEL_COLUMN_NAME = 'join_Npix'#
- COLUMN_NAMES#
- data_frame#
- catalog_base_dir = None#
- primary_to_join_map() dict[hats.pixel_math.healpix_pixel.HealpixPixel, list[hats.pixel_math.healpix_pixel.HealpixPixel]] [source]#
Generate a map from a single primary pixel to one or more pixels in the join catalog.
Note that the implementation relies on nested comprehensions: each (primary order, pixel) key is mapped to a list of (join order, pixel) tuples.
- Returns:
dictionary mapping (primary order/pixel) to [array of (join order/pixel)]
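For illustration, a short sketch of consuming the map; join_info is assumed to be a populated PartitionJoinInfo instance:

    # Keys are primary-catalog HealpixPixel objects; values list the
    # join-catalog HealpixPixel objects each one pairs with.
    pixel_map = join_info.primary_to_join_map()
    for primary_pixel, join_pixel_list in pixel_map.items():
        print(primary_pixel, "->", join_pixel_list)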
- write_to_csv(catalog_path: str | pathlib.Path | upath.UPath | None = None)[source]#
Write all partition data to CSV files.
Two files will be written:
- partition_info.csv - covers all primary catalog pixels, and should match the file structure
- partition_join_info.csv - covers all pairwise relationships between primary and join catalogs.
- Parameters:
catalog_path – path to the directory where the partition_join_info.csv file will be written
- Raises:
ValueError – if no path is provided and none could be inferred.
- classmethod read_from_dir(catalog_base_dir: str | pathlib.Path | upath.UPath | None = None) PartitionJoinInfo [source]#
Read partition join info from a partition_join_info file within a hats directory.
- Parameters:
catalog_base_dir – path to the root directory of the catalog
- Returns:
A PartitionJoinInfo object with the data from the file
- Raises:
FileNotFoundError – if the desired file is not found in the catalog_base_dir
- classmethod read_from_csv(partition_join_info_file: str | pathlib.Path | upath.UPath) PartitionJoinInfo [source]#
Read partition join info from a partition_join_info.csv file to create an object
- Parameters:
partition_join_info_file (UPath) – path to the partition_join_info.csv file
- Returns:
A PartitionJoinInfo object with the data from the file
- classmethod _read_from_csv(partition_join_info_file: str | pathlib.Path | upath.UPath) pandas.DataFrame [source]#
Read partition join info from a partition_join_info.csv file into a dataframe
- Parameters:
partition_join_info_file (UPath) – path to the partition_join_info.csv file
- Returns:
A pandas DataFrame with the data from the file
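A hedged round-trip sketch for this metadata, with a hypothetical catalog directory:

    from hats.catalog import PartitionJoinInfo

    # Hypothetical root directory containing partition_join_info.csv.
    join_info = PartitionJoinInfo.read_from_dir("./my_association_catalog")

    # Re-write the partition_info.csv and partition_join_info.csv files elsewhere.
    join_info.write_to_csv("./copy_of_association_catalog")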
- class Catalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None)[source]#
Bases:
hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset
A HATS Catalog with data stored in a HEALPix Hive partitioned structure
Catalogs of this type are partitioned spatially, contain partition_info metadata specifying the pixels in the Catalog, and on disk conform to the parquet partitioning structure Norder=/Dir=/Npix=.parquet
- generate_negative_tree_pixels() list[hats.pixel_math.HealpixPixel] [source]#
Get the leaf nodes at each healpix order that have zero catalog data.
For example, if an example catalog only had data points in pixel 0 at order 0, then this method would return order 0’s pixels 1 through 11. Used for getting full coverage on margin caches.
- Returns:
List of HealpixPixels representing the ‘negative tree’ for the catalog.
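Following the docstring's example, a sketch with a hypothetical catalog path:

    import hats

    catalog = hats.read_hats("./my_catalog")  # hypothetical path

    # If the catalog only has data in pixel 0 at order 0, this returns
    # order 0's pixels 1 through 11.
    negative_pixels = catalog.generate_negative_tree_pixels()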
- class CatalogType[source]#
Bases:
str, enum.Enum
Enum for possible types of catalog
- OBJECT = 'object'#
- SOURCE = 'source'#
- ASSOCIATION = 'association'#
- INDEX = 'index'#
- MARGIN = 'margin'#
- MAP = 'map'#
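Because the enum also inherits from str, members compare equal to their string values; a small illustration:

    from hats.catalog import CatalogType

    assert CatalogType.MARGIN == "margin"                # plain string comparison
    assert CatalogType("object") is CatalogType.OBJECT   # construct from a value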
- class Dataset(catalog_info: hats.catalog.dataset.table_properties.TableProperties, catalog_path: str | pathlib.Path | upath.UPath | None = None, schema: pyarrow.Schema | None = None)[source]#
A base HATS dataset that contains a properties file and the data contained in parquet files
- catalog_info#
- catalog_name#
- catalog_path = None#
- on_disk#
- catalog_base_dir = None#
- schema = None#
- aggregate_column_statistics(exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None)[source]#
Read footer statistics in parquet metadata, and report on global min/max values.
- Parameters:
exclude_hats_columns (bool) – exclude HATS spatial and partitioning fields from the statistics. Defaults to True.
exclude_columns (List[str]) – additional columns to exclude from the statistics.
include_columns (List[str]) – if specified, only return statistics for the column names provided. Defaults to None, and returns all non-hats columns.
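A usage sketch; the column names are hypothetical, and catalog is assumed to be a loaded Dataset:

    # Global min/max per column, read from the parquet footer statistics.
    stats = catalog.aggregate_column_statistics(include_columns=["ra", "dec"])
    print(stats)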
- per_pixel_statistics(exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, include_stats: list[str] = None, multi_index=False)[source]#
Read footer statistics in parquet metadata, and report on statistics about each pixel partition.
- Parameters:
exclude_hats_columns (bool) – exclude HATS spatial and partitioning fields from the statistics. Defaults to True.
exclude_columns (List[str]) – additional columns to exclude from the statistics.
include_columns (List[str]) – if specified, only return statistics for the column names provided. Defaults to None, and returns all non-hats columns.
include_stats (List[str]) – if specified, only return the kinds of values from list (min_value, max_value, null_count, row_count). Defaults to None, and returns all values.
multi_index (bool) – should the returned frame be created with a multi-index, first on pixel, then on column name? Defaults to False, and instead indexes on pixel, with separate columns per data-column and stat-value combination.
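Under the same assumptions, a sketch of per-partition statistics keyed by a (pixel, column) multi-index:

    pixel_stats = catalog.per_pixel_statistics(
        include_columns=["ra", "dec"],
        include_stats=["min_value", "max_value"],
        multi_index=True,
    )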
- class TableProperties(/, **data: Any)[source]#
Bases:
pydantic.BaseModel
Container class for catalog metadata
- catalog_name: str = None#
- catalog_type: hats.catalog.catalog_type.CatalogType = None#
- total_rows: int = None#
- ra_column: str | None = None#
- dec_column: str | None = None#
- default_columns: list[str] | None = None#
Which columns should be read from parquet files, when the user doesn’t otherwise specify.
- primary_catalog: str | None = None#
Reference to object catalog. Relevant for nested, margin, association, and index.
- margin_threshold: float | None = None#
Threshold of the pixel boundary, expressed in arcseconds.
- primary_column: str | None = None#
Column name in the primary (left) side of join.
- primary_column_association: str | None = None#
Column name in the association table that matches the primary (left) side of join.
- join_catalog: str | None = None#
Catalog name for the joining (right) side of association.
- join_column: str | None = None#
Column name in the joining (right) side of join.
- join_column_association: str | None = None#
Column name in the association table that matches the joining (right) side of join.
- contains_leaf_files: bool | None = None#
Whether or not the association catalog contains leaf parquet files.
- indexing_column: str | None = None#
Column that we provide an index over.
- extra_columns: list[str] | None = None#
Any additional payload columns included in index.
- model_config#
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- npix_suffix: str = None#
Suffix of the Npix partitions. In the standard HATS directory structure, this is ‘.parquet’ because there is a single file in each Npix partition and it is named like ‘Npix=313.parquet’. Other valid directory structures include those with the same single file per partition but which use a different suffix (e.g., npix_suffix = ‘.parq’ or ‘.snappy.parquet’), and also those in which the Npix partitions are actually directories containing 1+ files underneath (and then npix_suffix = ‘/’).
- classmethod space_delimited_list(str_value: str) list[str] [source]#
Convert a space-delimited list string into a python list of strings.
- serialize_as_space_delimited_list(str_list: Iterable[str]) str [source]#
Convert a python list of strings into a space-delimited string.
- check_allowed_and_required() typing_extensions.Self [source]#
Check that type-specific fields are appropriate, and required fields are set.
- copy_and_update(**kwargs)[source]#
Create a validated copy of these table properties, updating the fields provided in kwargs.
- explicit_dict(by_alias=False, exclude_none=True)[source]#
Create a dict, based on fields that have been explicitly set, and are not “extra” keys.
- extra_dict(by_alias=False, exclude_none=True)[source]#
Create a dict, based on fields that are “extra” keys.
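A brief sketch of updating validated properties; the new name is illustrative, and catalog is assumed to be a loaded Dataset:

    # copy_and_update re-validates the model with the new field values.
    updated = catalog.catalog_info.copy_and_update(catalog_name="renamed_catalog")

    # Only explicitly-set, non-"extra" fields appear here.
    print(updated.explicit_dict())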
- class MapCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None)[source]#
Bases:
hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset
A HATS table to represent non-point-source data in a continuous map.
- class MarginCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None)[source]#
Bases:
hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset
A HATS Catalog used to contain the ‘margin’ of another HATS catalog.
Catalogs of this type are used alongside a primary catalog, and contain the margin points for each HEALPix pixel: any points that are within a certain distance of the HEALPix pixel boundary. This is used to ensure spatial operations such as crossmatching can be performed efficiently while maintaining accuracy.
- filter_by_moc(moc: mocpy.MOC) typing_extensions.Self [source]#
Filter the pixels in the margin catalog to only include the margin pixels that overlap with the moc
For the case of margin pixels, this includes any pixels whose margin areas may overlap with the moc. This is not always done with a high accuracy, but always includes any pixels that will overlap, and may include extra partitions that do not.
- Parameters:
moc (mocpy.MOC) – the moc to filter by
- Returns:
A new margin catalog with only the pixels that overlap, or whose margin area overlaps, with the moc. Note that we reset the total_rows to None, as updating it would require a scan over the new pixel sizes.
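A hedged sketch of filtering by a cone-shaped MOC, using mocpy's MOC.from_cone and hypothetical paths:

    import astropy.units as u
    from mocpy import MOC
    import hats

    margin = hats.read_hats("./my_catalog_margin")  # hypothetical margin catalog

    # 1-degree cone; pixels whose margin area may overlap it are retained.
    cone = MOC.from_cone(lon=45 * u.deg, lat=30 * u.deg, radius=1 * u.deg, max_depth=10)
    filtered = margin.filter_by_moc(cone)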
- class PartitionInfo(pixel_list: list[hats.pixel_math.healpix_pixel.HealpixPixel], catalog_base_dir: str = None)[source]#
Container class for per-partition info.
- METADATA_ORDER_COLUMN_NAME = 'Norder'#
- METADATA_PIXEL_COLUMN_NAME = 'Npix'#
- pixel_list#
- catalog_base_dir = None#
- get_healpix_pixels() list[hats.pixel_math.healpix_pixel.HealpixPixel] [source]#
Get healpix pixel objects for all pixels represented as partitions.
- Returns:
List of HealpixPixel
- get_highest_order() int [source]#
Get the highest healpix order for the dataset.
- Returns:
int representing highest order.
- write_to_file(partition_info_file: str | pathlib.Path | upath.UPath | None = None, catalog_path: str | pathlib.Path | upath.UPath | None = None)[source]#
Write all partition data to CSV file.
If no paths are provided, the catalog base directory from the read_from_dir call is used.
- Parameters:
partition_info_file – path to where the partition_info.csv file will be written.
catalog_path – base directory for a catalog where the partition_info.csv file will be written.
- Raises:
ValueError – if no path is provided and none could be inferred.
- classmethod read_from_dir(catalog_base_dir: str | pathlib.Path | upath.UPath | None) PartitionInfo [source]#
Read partition info from a file within a hats directory.
This will look for a partition_info.csv file, and if not found, will look for a _metadata file. The second approach is typically slower for large catalogs, so a warning is issued to the user. In internal testing with large catalogs, the first approach takes less than a second, while the second can take 10-20 seconds.
- Parameters:
catalog_base_dir – path to the root directory of the catalog
- Returns:
A PartitionInfo object with the data from the file
- Raises:
FileNotFoundError – if neither desired file is found in the catalog_base_dir
- classmethod read_from_file(metadata_file: str | pathlib.Path | upath.UPath) PartitionInfo [source]#
Read partition info from a _metadata file to create an object
- Parameters:
metadata_file (UPath) – path to the _metadata file
- Returns:
A PartitionInfo object with the data from the file
- classmethod _read_from_metadata_file(metadata_file: str | pathlib.Path | upath.UPath) list[hats.pixel_math.healpix_pixel.HealpixPixel] [source]#
Read partition info list from a _metadata file.
- Parameters:
metadata_file (UPath) – path to the _metadata file
- Returns:
The list of HealpixPixel extracted from the data in the metadata file
- classmethod read_from_csv(partition_info_file: str | pathlib.Path | upath.UPath) PartitionInfo [source]#
Read partition info from a partition_info.csv file to create an object
- Parameters:
partition_info_file (UPath) – path to the partition_info.csv file
- Returns:
A PartitionInfo object with the data from the file
- classmethod _read_from_csv(partition_info_file: str | pathlib.Path | upath.UPath) PartitionInfo [source]#
Read partition info from a partition_info.csv file to create an object
- Parameters:
partition_info_file (UPath) – path to the partition_info.csv file
- Returns:
A PartitionInfo object with the data from the file
- as_dataframe()[source]#
Construct a pandas dataframe for the partition info pixels.
- Returns:
Dataframe with order, directory, and pixel info.
- classmethod from_healpix(healpix_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel]) PartitionInfo [source]#
Create a partition info object from a list of constituent healpix pixels.
- Parameters:
healpix_pixels – list of healpix pixels
- Returns:
A PartitionInfo object with the same healpix pixels
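A small sketch constructing partition info directly from pixels:

    from hats.catalog import PartitionInfo
    from hats.pixel_math import HealpixPixel

    info = PartitionInfo.from_healpix([HealpixPixel(0, 11), HealpixPixel(1, 0)])
    print(info.get_highest_order())  # 1
    print(info.as_dataframe())       # order, directory, and pixel columns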