hats.io.parquet_metadata

Utility functions for handling parquet metadata files

Functions

`write_parquet_metadata`(catalog_path, *[, ...])	Write Parquet dataset-level metadata files (and optional thumbnail) for a catalog.
`aggregate_column_statistics`(metadata_file, *[, ...])	Read footer statistics in parquet metadata, and report on global min/max values.
`aggregate_column_statistics_from_cache`(metadata_file, *[, ...])	Use cached footer statistics in parquet metadata to report on global min/max values.
`per_partition_statistics_from_cache`(metadata_file, *[, ...])	Read footer statistics in parquet metadata, and report on statistics about each pixel partition.
`per_partition_statistics`(metadata_file, *[, ...])	Read footer statistics in parquet metadata, and report on statistics about each pixel partition.
`write_per_partition_statistics_from_metadata`(...)	Read the footer statistics from the dataset/_metadata file, collect the per-pixel statistics, and write them to per_partition_statistics.parquet.
`pick_metadata_schema_file`(catalog_base_dir) → upath.UPath | None	Determines the appropriate file to read for parquet metadata.
`nested_frame_to_vo_schema`(nested_frame, *[, verbose, ...])	Create VOTableFile metadata, based on the names and types of fields in the NestedFrame.
`write_voparquet_in_common_metadata`(catalog_base_dir, *[, ...])	Create VOTableFile metadata, based on the names and types of fields in the parquet files, and write to a catalog_base_dir/dataset/_common_metadata parquet file.

Module Contents

write_parquet_metadata(catalog_path: str | pathlib.Path | upath.UPath, *, order_by_healpix=True, output_path: str | pathlib.Path | upath.UPath | None = None, create_thumbnail: bool = False, thumbnail_threshold: int = 1000000, create_metadata: bool = True, create_per_partition_stats: bool = False)

Write Parquet dataset-level metadata files (and optional thumbnail) for a catalog.

Creates files:

catalog/
├── data_thumbnail.parquet           (only if create_thumbnail=True)
├── per_partition_statistics.parquet (only if create_per_partition_stats=True)
├── ...
└── dataset/
    ├── _common_metadata             (always written)
    ├── _metadata                    (only if create_metadata=True)
    └──  ...

dataset/_common_metadata contains the full schema of the dataset. This file records all of the columns and their types, as well as any file-level key-value metadata associated with the full Parquet dataset.

dataset/_metadata contains the combined row group footers from all Parquet files in the dataset, which allows readers to read the entire dataset without having to open each individual Parquet file. This file can be large for datasets with many files, so users may choose to omit it by setting create_metadata=False.

data_thumbnail.parquet gives the user a quick overview of the whole dataset. It is a compact file containing one row from each data partition, up to a maximum of thumbnail_threshold rows.

per_partition_statistics.parquet contains summary statistics from all columns in data partition files, e.g. column min/max values, count of null values, etc.

Parameters:

catalog_path (str | Path | UPath): Base path for the catalog root.
order_by_healpix (bool, default=True): If True, reorder combined metadata by breadth-first HEALPix pixel ordering. Set False for datasets that should not be reordered (e.g., secondary indexes). Does not modify dataset files on disk.
output_path (str | Path | UPath | None, default=None): Base path to write metadata files. If None, uses catalog_path.
create_thumbnail (bool, default=False): If True, writes a compact data_thumbnail.parquet containing one row per sampled file.
thumbnail_threshold (int, default=1_000_000): Maximum number of rows in the thumbnail. The thumbnail holds one row per partition, so it contains fewer rows when the dataset has fewer files than thumbnail_threshold.
create_metadata (bool, default=True): If True, writes dataset/_metadata combining row group footers.
create_per_partition_stats (bool, default=False): If True, writes per_partition_statistics.parquet containing summary statistics from all columns in data partition files.

Returns:

int: Total number of rows across all parquet files in the dataset.

Notes

For more information on the general Parquet metadata files, and why we write them, see https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

For more information on HATS-specific metadata files and conventions, see https://www.ivoa.net/documents/Notes/HATS/
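
The file layout described above can be restated as a small sketch. The helper `expected_metadata_files` is hypothetical (it is not part of hats) and only mirrors the documented file tree for a given combination of flags. The `write_all_metadata` wrapper shows one possible call using the keyword arguments documented here; it imports hats lazily, so the rest of the sketch can be run without the package installed.

```python
from pathlib import Path


def expected_metadata_files(
    catalog_path,
    *,
    create_thumbnail=False,
    create_metadata=True,
    create_per_partition_stats=False,
):
    """Hypothetical helper: list the files the documentation says are written."""
    root = Path(catalog_path)
    files = [root / "dataset" / "_common_metadata"]  # always written
    if create_metadata:
        files.append(root / "dataset" / "_metadata")
    if create_thumbnail:
        files.append(root / "data_thumbnail.parquet")
    if create_per_partition_stats:
        files.append(root / "per_partition_statistics.parquet")
    return files


def write_all_metadata(catalog_path):
    """Call write_parquet_metadata with every optional output enabled."""
    # Imported lazily so that defining this function does not require hats.
    from hats.io.parquet_metadata import write_parquet_metadata

    # The return value is documented as the total row count of the dataset.
    return write_parquet_metadata(
        catalog_path,
        create_thumbnail=True,
        create_per_partition_stats=True,
    )
```

The sketch assumes the files land under catalog_path; when output_path is given, the documentation states that it is used as the base path instead.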

aggregate_column_statistics(metadata_file: str | pathlib.Path | upath.UPath, *, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, only_numeric_columns: bool = False, include_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel] = None)

Read footer statistics in parquet metadata, and report on global min/max values.

Parameters:

metadata_file (str | Path | UPath): path to _metadata file
exclude_hats_columns (bool): exclude HATS spatial and partitioning fields from the statistics. Defaults to True.
exclude_columns (list[str]): additional columns to exclude from the statistics.
include_columns (list[str]): if specified, only return statistics for the column names provided. Defaults to None, and returns all non-HATS columns.
only_numeric_columns (bool): only include columns that are numeric (integer or floating point) in the statistics. If True, the entire frame should be numeric. Defaults to False.
include_pixels (list[HealpixPixel]): if specified, only return statistics for the pixels indicated. Defaults to None, and returns all pixels.

Returns:

pd.DataFrame: Pandas dataframe with global summary statistics
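
To make "global min/max" concrete, here is a minimal pure-Python sketch of the reduction, assuming the global minimum is the smallest per-partition minimum and the global maximum is the largest per-partition maximum. The function name `aggregate_min_max`, the partition labels, and the sample values are all hypothetical; this is not the hats implementation, which reads the values from Parquet footers and returns a pandas dataframe.

```python
def aggregate_min_max(per_partition):
    """Reduce per-partition (min, max) pairs to one global pair per column.

    `per_partition` maps a partition label to {column: (min, max)}.
    A value of None means the partition carried no statistic for that column.
    """
    totals = {}
    for stats_by_column in per_partition.values():
        for column, (low, high) in stats_by_column.items():
            entry = totals.setdefault(column, {"min_value": None, "max_value": None})
            if low is not None and (entry["min_value"] is None or low < entry["min_value"]):
                entry["min_value"] = low
            if high is not None and (entry["max_value"] is None or high > entry["max_value"]):
                entry["max_value"] = high
    return totals


# Made-up footer values for two partitions.
footers = {
    "Norder=0/Npix=4": {"ra": (10.0, 45.0), "mag": (14.2, 21.7)},
    "Norder=1/Npix=44": {"ra": (45.0, 67.5), "mag": (None, None)},
}
global_stats = aggregate_min_max(footers)
```

Partitions without statistics for a column are skipped rather than treated as zero, so a missing footer value cannot distort the global range.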

aggregate_column_statistics_from_cache(metadata_file: str | pathlib.Path | upath.UPath, *, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, only_numeric_columns: bool = False, include_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel] = None)

Use cached footer statistics in parquet metadata to report on global min/max values.

Parameters:

metadata_file (str | Path | UPath): path to _metadata file
exclude_hats_columns (bool): exclude HATS spatial and partitioning fields from the statistics. Defaults to True.
exclude_columns (list[str]): additional columns to exclude from the statistics.
include_columns (list[str]): if specified, only return statistics for the column names provided. Defaults to None, and returns all non-HATS columns.
only_numeric_columns (bool): only include columns that are numeric (integer or floating point) in the statistics. If True, the entire frame should be numeric. Defaults to False.
include_pixels (list[HealpixPixel]): if specified, only return statistics for the pixels indicated. Defaults to None, and returns all pixels.

Returns:

pd.DataFrame: Pandas dataframe with global summary statistics

per_partition_statistics_from_cache(metadata_file: str | pathlib.Path | upath.UPath, *, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, only_numeric_columns: bool = False, include_stats: list[str] = None, multi_index: bool = False, include_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel] = None, per_row_group: bool = False)

Read footer statistics in parquet metadata, and report on statistics about each pixel partition.

The statistics gathered are a subset of the available attributes in the pyarrow.parquet.ColumnChunkMetaData:

min_value - minimum value seen in a single data partition
max_value - maximum value seen in a single data partition
null_count - number of null values
row_count - total number of values. Note that this only varies by column if the dataset has nested columns
disk_bytes - Compressed size of the data in the parquet file, in bytes
memory_bytes - Uncompressed size, in bytes

Parameters:

metadata_file (str | Path | UPath): path to _metadata file
exclude_hats_columns (bool): exclude HATS spatial and partitioning fields from the statistics. Defaults to True.
exclude_columns (list[str]): additional columns to exclude from the statistics.
include_columns (list[str]): if specified, only return statistics for the column names provided. Defaults to None, and returns all non-HATS columns.
only_numeric_columns (bool): only include columns that are numeric (integer or floating point) in the statistics. If True, the entire frame should be numeric. Defaults to False.
include_stats (list[str]): if specified, only return the kinds of values from the list (min_value, max_value, null_count, row_count, disk_bytes, memory_bytes). Defaults to None, and returns all values.
multi_index (bool): if True, the returned frame has a multi-index, first on pixel, then on column name. If False (the default), the frame is indexed on pixel only, with a separate column for each combination of data column and statistic.
include_pixels (list[HealpixPixel]): if specified, only return statistics for the pixels indicated. Defaults to None, and returns all pixels.
per_row_group (bool): if True, return finer-grained statistics for each row group within each pixel. Defaults to False.

Returns:

pd.DataFrame: Pandas dataframe with granular per-pixel statistics
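
The six statistic names listed above plausibly correspond to attributes of a pyarrow.parquet.ColumnChunkMetaData object and its `statistics` member. The sketch below spells out one such mapping. The mapping itself is an assumption rather than a statement of how hats computes the values, and `summarize_column_chunk` is a hypothetical helper. It is duck-typed, so the demonstration uses a stand-in object instead of a real Parquet file.

```python
from types import SimpleNamespace


def summarize_column_chunk(chunk):
    """Hypothetical helper: pull the six documented statistics from one column chunk.

    `chunk` is expected to expose the attributes of
    pyarrow.parquet.ColumnChunkMetaData that are used below.
    """
    stats = chunk.statistics  # None when the writer stored no statistics
    has_range = stats is not None and stats.has_min_max
    return {
        "min_value": stats.min if has_range else None,
        "max_value": stats.max if has_range else None,
        "null_count": stats.null_count if stats is not None else None,
        "row_count": chunk.num_values,
        "disk_bytes": chunk.total_compressed_size,
        "memory_bytes": chunk.total_uncompressed_size,
    }


# Stand-in for a real column chunk, with made-up numbers.
fake_chunk = SimpleNamespace(
    statistics=SimpleNamespace(has_min_max=True, min=1.5, max=9.0, null_count=2),
    num_values=100,
    total_compressed_size=640,
    total_uncompressed_size=800,
)
summary = summarize_column_chunk(fake_chunk)
```

With pyarrow installed, a real chunk can be obtained from `pyarrow.parquet.read_metadata(path).row_group(i).column(j)` and passed to the same helper.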

per_partition_statistics(metadata_file: str | pathlib.Path | upath.UPath, *, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, only_numeric_columns: bool = False, include_stats: list[str] = None, multi_index: bool = False, include_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel] = None, per_row_group: bool = False)

Read footer statistics in parquet metadata, and report on statistics about each pixel partition.

The statistics gathered are a subset of the available attributes in the pyarrow.parquet.ColumnChunkMetaData:

min_value - minimum value seen in a single data partition
max_value - maximum value seen in a single data partition
null_count - number of null values
row_count - total number of values. Note that this only varies by column if the dataset has nested columns
disk_bytes - Compressed size of the data in the parquet file, in bytes
memory_bytes - Uncompressed size, in bytes

Parameters:

metadata_file (str | Path | UPath): path to _metadata file
exclude_hats_columns (bool): exclude HATS spatial and partitioning fields from the statistics. Defaults to True.
exclude_columns (list[str]): additional columns to exclude from the statistics.
include_columns (list[str]): if specified, only return statistics for the column names provided. Defaults to None, and returns all non-HATS columns.
only_numeric_columns (bool): only include columns that are numeric (integer or floating point) in the statistics. If True, the entire frame should be numeric. Defaults to False.
include_stats (list[str]): if specified, only return the kinds of values from the list (min_value, max_value, null_count, row_count, disk_bytes, memory_bytes). Defaults to None, and returns all values.
multi_index (bool): if True, the returned frame has a multi-index, first on pixel, then on column name. If False (the default), the frame is indexed on pixel only, with a separate column for each combination of data column and statistic.
include_pixels (list[HealpixPixel]): if specified, only return statistics for the pixels indicated. Defaults to None, and returns all pixels.
per_row_group (bool): if True, return finer-grained statistics for each row group within each pixel. Defaults to False.

Returns:

pd.DataFrame: Pandas dataframe with granular per-pixel statistics
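
The per_row_group option distinguishes row-group-level statistics from partition-level ones. As a rough sketch of how the two levels relate, the following rolls several row-group records for one column up into a single partition-level record. It assumes that the partition minimum is the smallest row-group minimum, the partition maximum is the largest row-group maximum, and the counts and sizes are sums. The function name and sample numbers are hypothetical; this is not the hats implementation.

```python
def rollup_row_groups(row_group_stats):
    """Merge per-row-group statistics for one column into partition-level values."""
    mins = [s["min_value"] for s in row_group_stats if s["min_value"] is not None]
    maxs = [s["max_value"] for s in row_group_stats if s["max_value"] is not None]
    merged = {
        "min_value": min(mins) if mins else None,
        "max_value": max(maxs) if maxs else None,
    }
    # Counts and sizes are additive across row groups.
    for key in ("null_count", "row_count", "disk_bytes", "memory_bytes"):
        merged[key] = sum(s[key] for s in row_group_stats)
    return merged


# Two made-up row groups belonging to the same pixel partition.
row_groups = [
    {"min_value": 3.0, "max_value": 8.0, "null_count": 1,
     "row_count": 50, "disk_bytes": 300, "memory_bytes": 400},
    {"min_value": 1.0, "max_value": 5.0, "null_count": 0,
     "row_count": 70, "disk_bytes": 420, "memory_bytes": 560},
]
partition_stats = rollup_row_groups(row_groups)
```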

write_per_partition_statistics_from_metadata(catalog_base_dir: str | pathlib.Path | upath.UPath)

Reads the footer statistics from the dataset/_metadata file, collects the per-pixel statistics, and writes them to per_partition_statistics.parquet.

Parameters:

catalog_base_dir (str | Path | UPath): base path for the catalog

pick_metadata_schema_file(catalog_base_dir: str | pathlib.Path | upath.UPath) → upath.UPath | None

Determines the appropriate file to read for parquet metadata stored in the _common_metadata or _metadata files.

Parameters:

catalog_base_dir (str | Path | UPath): base path for the catalog

Returns:

UPath | None: path to a parquet file containing metadata schema.
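
The selection between the two files can be sketched with pathlib. The preference order used here (dataset/_common_metadata first, then dataset/_metadata, else None) is an assumption that this page does not spell out, and `pick_schema_file_sketch` is a hypothetical stand-in that works on local paths only.

```python
import tempfile
from pathlib import Path


def pick_schema_file_sketch(catalog_base_dir):
    """Return the first existing metadata file under dataset/, or None."""
    dataset_dir = Path(catalog_base_dir) / "dataset"
    # Assumed preference order: the schema-only file first.
    for name in ("_common_metadata", "_metadata"):
        candidate = dataset_dir / name
        if candidate.is_file():
            return candidate
    return None


# Exercise the three cases against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "dataset"
    dataset_dir.mkdir()
    nothing_found = pick_schema_file_sketch(tmp)

    (dataset_dir / "_metadata").touch()
    only_metadata = pick_schema_file_sketch(tmp).name

    (dataset_dir / "_common_metadata").touch()
    both_present = pick_schema_file_sketch(tmp).name
```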

nested_frame_to_vo_schema(nested_frame: nested_pandas.NestedFrame, *, verbose: bool = False, field_units: dict | None = None, field_ucds: dict | None = None, field_descriptions: dict | None = None, field_utypes: dict | None = None)

Create VOTableFile metadata, based on the names and types of fields in the NestedFrame. Add ancillary attributes to fields where they are provided in the optional dictionaries. Note on field names with nested columns: to include ancillary attributes (units, ucds, etc) for a nested sub-column, use dot notation (e.g. "lightcurve.band"). You can add ancillary attributes for the entire nested column group using the nested column name (e.g. "lightcurve").

Parameters:

nested_frame (npd.NestedFrame): nested frame representing catalog data. This can be empty, as only the column names and types are needed.
verbose (bool): if True, print additional debugging statements about the VO metadata.
field_units (dict | None): dictionary mapping column names to astropy units (or string representation of units)
field_ucds (dict | None): dictionary mapping column names to UCDs (Uniform Content Descriptors)
field_descriptions (dict | None): dictionary mapping column names to free-text descriptions
field_utypes (dict | None): dictionary mapping column names to utypes

Returns:

VOTableFile: VO object containing all relevant metadata (but no data)

write_voparquet_in_common_metadata(catalog_base_dir: str | pathlib.Path | upath.UPath, *[, verbose, field_units, field_ucds, field_descriptions, field_utypes])

Create VOTableFile metadata, based on the names and types of fields in the parquet files, and write to a catalog_base_dir/dataset/_common_metadata parquet file. Add ancillary attributes to fields where they are provided in the optional dictionaries. Note on field names with nested columns: to include ancillary attributes (units, ucds, etc) for a nested sub-column, use dot notation (e.g. "lightcurve.band"). You can add ancillary attributes for the entire nested column group using the nested column name (e.g. "lightcurve").

Parameters:

catalog_base_dir (str | Path | UPath): base path for the catalog
verbose (bool): if True, print additional debugging statements about the VO metadata.
field_units (dict | None): dictionary mapping column names to astropy units (or string representation of units)
field_ucds (dict | None): dictionary mapping column names to UCDs (Uniform Content Descriptors)
field_descriptions (dict | None): dictionary mapping column names to free-text descriptions
field_utypes (dict | None): dictionary mapping column names to utypes
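
The dot notation for nested sub-columns is easy to get wrong, so here is a small sketch that builds the optional dictionaries and checks their keys against a known column layout before they are passed on. The helpers `split_field_key` and `unknown_field_keys`, the column names, and the sample values are all hypothetical. The `write_vo_metadata` wrapper shows one possible call using the keyword arguments documented here and imports hats lazily.

```python
def split_field_key(key):
    """Split "lightcurve.band" into ("lightcurve", "band"); flat keys get None."""
    column, _, sub_column = key.partition(".")
    return column, (sub_column or None)


def unknown_field_keys(columns, *field_dicts):
    """Return keys that match no column in `columns`.

    `columns` maps each top-level column to its sub-column names
    (an empty list for ordinary, non-nested columns).
    """
    unknown = set()
    for field_dict in field_dicts:
        for key in field_dict:
            column, sub_column = split_field_key(key)
            if column not in columns:
                unknown.add(key)
            elif sub_column is not None and sub_column not in columns[column]:
                unknown.add(key)
    return sorted(unknown)


# Made-up catalog layout with one nested column group.
columns = {"ra": [], "dec": [], "lightcurve": ["mjd", "mag", "band"]}
field_units = {"ra": "deg", "dec": "deg", "lightcurve.mjd": "d", "lightcurve.mag": "mag"}
field_descriptions = {
    "lightcurve": "Time series of detections",  # applies to the whole nested group
    "lightcurve.band": "Filter name",  # applies to one nested sub-column
}
field_ucds = {"ra": "pos.eq.ra", "dec": "pos.eq.dec", "lightcurve.flux": "phot.flux"}

# "lightcurve.flux" is deliberately wrong: the layout has no such sub-column.
typos = unknown_field_keys(columns, field_units, field_descriptions, field_ucds)


def write_vo_metadata(catalog_base_dir):
    """Write VO metadata using the dictionaries defined above."""
    # Imported lazily so that defining this function does not require hats.
    from hats.io.parquet_metadata import write_voparquet_in_common_metadata

    write_voparquet_in_common_metadata(
        catalog_base_dir,
        field_units=field_units,
        field_descriptions=field_descriptions,
    )
```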