hats.io.parquet_metadata

Utility functions for handling parquet metadata files

Functions

write_parquet_metadata(catalog_path[, ...])

Generate parquet metadata, using the already-partitioned parquet files for this catalog.

read_row_group_fragments(metadata_file)

Generator for metadata fragment row groups in a parquet metadata file.

_nonemin(value1, value2)

Similar to numpy's nanmin, but excludes None values.

_nonemax(value1, value2)

Similar to numpy's nanmax, but excludes None values.

_pick_columns(first_row_group[, exclude_hats_columns, ...])

Convenience method to find the desired columns and their indexes, given some conventional user preferences.

aggregate_column_statistics(metadata_file[, ...])

Read footer statistics in parquet metadata, and report on global min/max values.

per_pixel_statistics(metadata_file[, ...])

Read footer statistics in parquet metadata, and report on statistics about each pixel partition.

Module Contents

write_parquet_metadata(catalog_path: str | pathlib.Path | upath.UPath, order_by_healpix=True, output_path: str | pathlib.Path | upath.UPath | None = None)

Generate parquet metadata, using the already-partitioned parquet files for this catalog.

For more information on parquet metadata files in general, and why we write them, see https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

Parameters:
  • catalog_path (str | Path | UPath) – base path for the catalog

  • order_by_healpix (bool) – whether to reorder the dataset by breadth-first healpix pixel. Defaults to True; use False if the dataset should not be reordered (e.g. secondary indexes)

  • output_path (str | Path | UPath | None) – base path for writing out metadata files. Defaults to catalog_path if unspecified

Returns:

total number of rows in the dataset.

read_row_group_fragments(metadata_file: str)

Generator for metadata fragment row groups in a parquet metadata file.

Parameters:

metadata_file (str) – path to _metadata file.

_nonemin(value1, value2)

Similar to numpy’s nanmin, but excludes None values.

NB: If both values are None, this will still return None.

_nonemax(value1, value2)

Similar to numpy’s nanmax, but excludes None values.

NB: If both values are None, this will still return None.
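The None-handling described for these two helpers can be illustrated with a minimal sketch. The names `none_min` and `none_max` are hypothetical stand-ins written for this example; they are not the library's source, only one plausible way to get the documented behavior.

```python
def none_min(value1, value2):
    """Return the smaller value, ignoring None (illustrative stand-in)."""
    if value1 is None:
        return value2
    if value2 is None:
        return value1
    return min(value1, value2)


def none_max(value1, value2):
    """Return the larger value, ignoring None (illustrative stand-in)."""
    if value1 is None:
        return value2
    if value2 is None:
        return value1
    return max(value1, value2)


print(none_min(3, None))     # 3: the None is skipped
print(none_max(2, 7))        # 7: ordinary max when both are present
print(none_min(None, None))  # None: nothing to compare
```

Returning None when both inputs are None lets the helpers be folded over a sequence of row-group statistics, starting from None, without special-casing the first element.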

_pick_columns(first_row_group, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None)

Convenience method to find the desired columns and their indexes, given some conventional user preferences.
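As a rough illustration of how the three preferences might combine, the sketch below filters a plain list of column names. It is an assumption-laden stand-in: `pick_columns` is a hypothetical name, it takes a list of names rather than a row group, and the `hats_columns` default is a placeholder list of HATS field names chosen for the example, not taken from the library.

```python
def pick_columns(
    column_names,
    exclude_hats_columns=True,
    exclude_columns=None,
    include_columns=None,
    hats_columns=("_healpix_29", "Norder", "Dir", "Npix"),  # placeholder names
):
    """Return (kept_names, kept_indexes) after applying the preferences."""
    excluded = set(exclude_columns or [])
    if exclude_hats_columns:
        excluded.update(hats_columns)
    included = set(include_columns or [])

    # Keep a column if it is in the include list (or no include list was
    # given) and it has not been excluded.
    indexes = [
        index
        for index, name in enumerate(column_names)
        if (not included or name in included) and name not in excluded
    ]
    return [column_names[index] for index in indexes], indexes


columns = ["_healpix_29", "ra", "dec", "mag", "Norder"]
print(pick_columns(columns, exclude_columns=["mag"]))
# (['ra', 'dec'], [1, 2])
```

Returning the indexes alongside the names means later code can look up per-column statistics by position in each row group instead of matching on names again.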

aggregate_column_statistics(metadata_file: str | pathlib.Path | upath.UPath, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, include_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel] = None)

Read footer statistics in parquet metadata, and report on global min/max values.

Parameters:
  • metadata_file (str | Path | UPath) – path to _metadata file

  • exclude_hats_columns (bool) – exclude HATS spatial and partitioning fields from the statistics. Defaults to True.

  • exclude_columns (List[str]) – additional columns to exclude from the statistics.

  • include_columns (List[str]) – if specified, only return statistics for the column names provided. Defaults to None, and returns all non-hats columns.

  • include_pixels (list[HealpixPixel]) – if specified, only return statistics for the pixels indicated. Defaults to None, and returns all pixels.

Returns:

dataframe with global summary statistics
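To make "global min/max" concrete, here is a minimal sketch of folding per-row-group footer statistics into one summary per column. It uses plain dictionaries in place of the parquet footer and the returned dataframe; `aggregate_min_max` and `_fold` are hypothetical names, and the input layout is an assumption made for the example.

```python
def _fold(func, current, new):
    """Apply func (min or max) to two values, ignoring None."""
    if current is None:
        return new
    if new is None:
        return current
    return func(current, new)


def aggregate_min_max(row_groups):
    """Combine {column: (min, max)} dicts, one per row group, into global values."""
    totals = {}
    for group in row_groups:
        for column, (low, high) in group.items():
            cur_low, cur_high = totals.get(column, (None, None))
            totals[column] = (_fold(min, cur_low, low), _fold(max, cur_high, high))
    return totals


# Two row groups; the first has no statistics recorded for "mag".
row_groups = [
    {"ra": (10.0, 20.0), "mag": (None, None)},
    {"ra": (5.0, 15.0), "mag": (18.2, 21.0)},
]
print(aggregate_min_max(row_groups))
# {'ra': (5.0, 20.0), 'mag': (18.2, 21.0)}
```

Because only footer statistics are combined, a summary of this kind can be produced without reading any row data.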

per_pixel_statistics(metadata_file: str | pathlib.Path | upath.UPath, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, include_stats: list[str] = None, multi_index=False, include_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel] = None)

Read footer statistics in parquet metadata, and report on statistics about each pixel partition.

Parameters:
  • metadata_file (str | Path | UPath) – path to _metadata file

  • exclude_hats_columns (bool) – exclude HATS spatial and partitioning fields from the statistics. Defaults to True.

  • exclude_columns (List[str]) – additional columns to exclude from the statistics.

  • include_columns (List[str]) – if specified, only return statistics for the column names provided. Defaults to None, and returns all non-hats columns.

  • include_pixels (list[HealpixPixel]) – if specified, only return statistics for the pixels indicated. Defaults to None, and returns all pixels.

  • include_stats (List[str]) – if specified, only return the listed kinds of statistics (min_value, max_value, null_count, row_count). Defaults to None, and returns all of them.

  • multi_index (bool) – if True, the returned frame has a multi-index, first on pixel, then on column name. Defaults to False, in which case the frame is indexed on pixel only, with a separate column for each combination of data column and stat value.

Returns:

dataframe with granular per-pixel statistics
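The two layouts selected by multi_index can be pictured with plain dictionaries standing in for the returned dataframe. Everything here is illustrative: `(order, pixel)` tuples stand in for HealpixPixel objects, the function names are hypothetical, and the `"column: stat"` label format for the flat layout is an assumption, not a documented naming scheme.

```python
# Per-pixel statistics, keyed by an (order, pixel) tuple.
stats = {
    (1, 44): {"ra": {"min_value": 10.0, "max_value": 20.0}},
    (1, 45): {"ra": {"min_value": 20.0, "max_value": 30.0}},
}


def flat_layout(stats):
    """One row per pixel; one entry per (data column, stat) combination."""
    return {
        pixel: {
            f"{column}: {stat}": value
            for column, column_stats in columns.items()
            for stat, value in column_stats.items()
        }
        for pixel, columns in stats.items()
    }


def multi_index_layout(stats):
    """One row per (pixel, column name) pair; one entry per stat."""
    return {
        (pixel, column): dict(column_stats)
        for pixel, columns in stats.items()
        for column, column_stats in columns.items()
    }


print(flat_layout(stats)[(1, 44)])
# {'ra: min_value': 10.0, 'ra: max_value': 20.0}
print(multi_index_layout(stats)[((1, 45), "ra")])
# {'min_value': 20.0, 'max_value': 30.0}
```

The flat form keeps one row per pixel, which is convenient for joining against other per-pixel tables. The multi-index form keeps the set of stat columns fixed no matter how many data columns are included.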