API Reference

This section documents the core modules of esrf-data-compressor.

esrf_data_compressor.finder

esrf_data_compressor.finder.finder.discover_datasets(path_components, base_root)
Parameters:
  • path_components (List[str])

  • base_root (str)

Return type:

List[str]

esrf_data_compressor.finder.finder.find_vds_files(path_components, base_root, filter_expr, *, max_workers=None)
Discover each dataset HDF5, then for each top-level group (e.g. “1.1”):
  • treat each filter key "A/B/C" as a dataset path under that group, i.e. grp["A"]["B"]["C"][()].

  • if any filter’s desired substring is found in the dataset’s value, classify that group’s VDS sources as TO COMPRESS, with reason="grp/A/B/C contains 'val'".

  • otherwise classify them as REMAINING, with reason="grp/A/B/C=<actual>".

Datasets already compressed with the JP2KCompressor’s Blosc2/Grok filter (ID 32026) are also detected; those files are classified as REMAINING with reason "<already compressed>".

Returns two lists of (vds_source_path, reason).
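
The classification rule above can be sketched without HDF5 at all. The following is a minimal illustration, not the library's implementation: nested dictionaries stand in for an HDF5 group, and the names lookup, classify_group and the filters dictionary (filter key mapped to desired substring) are hypothetical. How filter_expr is parsed into such pairs, and how several non-matching filters are combined into one reason, are assumptions.

```python
from collections.abc import Mapping
from typing import Dict, List, Optional, Tuple


def lookup(group: Mapping, key: str) -> Optional[str]:
    """Walk a nested mapping along an 'A/B/C' path; return None if any part is absent."""
    node = group
    for part in key.split("/"):
        if not isinstance(node, Mapping) or part not in node:
            return None
        node = node[part]
    return str(node)


def classify_group(group_name: str, group: Mapping, filters: Dict[str, str]) -> Tuple[str, str]:
    """Return (category, reason) for one top-level group such as "1.1"."""
    reasons: List[str] = []
    for key, wanted in filters.items():
        actual = lookup(group, key)
        if actual is not None and wanted in actual:
            # First matching filter decides: the group goes to TO COMPRESS.
            return "TO COMPRESS", f"{group_name}/{key} contains '{wanted}'"
        reasons.append(f"{group_name}/{key}={actual}")
    # Assumption: reasons of several non-matching filters are joined with "; ".
    return "REMAINING", "; ".join(reasons)


# Hypothetical group content, standing in for grp["instrument"]["detector"]["type"][()].
group = {"instrument": {"detector": {"type": "marana"}}}
print(classify_group("1.1", group, {"instrument/detector/type": "mar"}))
print(classify_group("1.1", group, {"instrument/detector/type": "eiger"}))
```

The sketch leaves out the already-compressed check and the mapping from a group to its VDS source files, since the description does not say how either is done.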

Parameters:
  • path_components (List[str])

  • base_root (str)

  • filter_expr (Optional[str])

  • max_workers (Optional[int])

Return type:

Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]

esrf_data_compressor.finder.finder.write_report(to_list, rem_list, output_path)
Parameters:
  • to_list (List[Tuple[str, str]])

  • rem_list (List[Tuple[str, str]])

  • output_path (str)

esrf_data_compressor.checker

esrf_data_compressor.checker.run_check.run_ssim_check(raw_files, method, report_path, layout='sibling')
Given a list of raw HDF5 file paths, partitions into:

  • to_check → those with an expected compressed counterpart according to layout

  • missing → those without one

Writes a report to report_path:
  • ‘=== NOT COMPRESSED FILES ===’ listing each missing file

  • then for each to_check pair, computes SSIM in parallel and appends per‐dataset SSIM lines under ‘=== <stem> ===’ with full paths
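
As an illustration of the partition step, here is a minimal sketch for the sibling layout only. It assumes the <basename>_<method>.h5 naming that compress_files documents for that layout. The helper names sibling_path and partition are hypothetical, and the mirror layout is not covered.

```python
import os
import tempfile
from typing import List, Tuple


def sibling_path(raw_path: str, method: str) -> str:
    """Map <dir>/<stem>.h5 to <dir>/<stem>_<method>.h5 (sibling layout only)."""
    stem, ext = os.path.splitext(raw_path)
    return f"{stem}_{method}{ext}"


def partition(raw_files: List[str], method: str) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split raw files into (raw, compressed) pairs to check and files with no counterpart."""
    to_check: List[Tuple[str, str]] = []
    missing: List[str] = []
    for raw in raw_files:
        comp = sibling_path(raw, method)
        if os.path.exists(comp):
            to_check.append((raw, comp))
        else:
            missing.append(raw)
    return to_check, missing


# Demo on empty placeholder files; the partition only looks at file existence.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.h5", "a_jp2k.h5", "b.h5"]:
        open(os.path.join(tmp, name), "w").close()
    raw_files = [os.path.join(tmp, "a.h5"), os.path.join(tmp, "b.h5")]
    to_check, missing = partition(raw_files, "jp2k")
    checked_names = [os.path.basename(r) for r, _ in to_check]
    missing_names = [os.path.basename(m) for m in missing]
    print(checked_names, missing_names)
```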

Parameters:
  • raw_files (list[str])

  • method (str)

  • report_path (str)

  • layout (str)

Return type:

None

esrf_data_compressor.checker.ssim.compute_ssim_for_dataset_pair(orig_path, comp_path, dataset_relpath)

Given two HDF5 files and the relative 3D dataset path (e.g., ‘entry_0000/ESRF-ID11/marana/data’), compute SSIM on the first (z=0) and last (z=Z-1) slices. Returns (ssim_first, ssim_last). If a slice is constant, SSIM = 1.0.
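
To make the first/last-slice comparison concrete, here is a simplified pure-Python sketch. It is not the function's actual implementation: it uses a single-window (global) SSIM over the whole slice rather than a sliding-window SSIM, takes nested lists instead of HDF5 datasets, and returns 1.0 as soon as either slice is constant, which is one possible reading of the constant-slice rule. The names global_ssim, ssim_first_last and data_range are hypothetical.

```python
from typing import Sequence, Tuple

Slice = Sequence[Sequence[float]]


def global_ssim(a: Slice, b: Slice, data_range: float) -> float:
    """Single-window SSIM over two equally shaped 2D slices (simplified)."""
    xs = [float(v) for row in a for v in row]
    ys = [float(v) for row in b for v in row]
    # Assumed reading of the constant-slice rule: either slice constant gives 1.0.
    if min(xs) == max(xs) or min(ys) == max(ys):
        return 1.0
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) * (x - mx) for x in xs) / n
    vy = sum((y - my) * (y - my) for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Standard SSIM stabilising constants, scaled by the data range.
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx * mx + my * my + c1) * (vx + vy + c2)
    return num / den


def ssim_first_last(orig: Sequence[Slice], comp: Sequence[Slice], data_range: float) -> Tuple[float, float]:
    """Compare only slice z=0 and slice z=Z-1 of two 3D volumes."""
    return (
        global_ssim(orig[0], comp[0], data_range),
        global_ssim(orig[-1], comp[-1], data_range),
    )


# Tiny 2x2x2 volume: a varying first slice and a constant last slice.
volume = [[[0, 1], [2, 3]], [[5, 5], [5, 5]]]
print(ssim_first_last(volume, volume, data_range=3.0))
```

Checking only the two end slices keeps the comparison cheap on large stacks, at the cost of not sampling the interior of the volume.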

Parameters:
  • orig_path (str)

  • comp_path (str)

  • dataset_relpath (str)

Return type:

tuple[float, float]

esrf_data_compressor.checker.ssim.compute_ssim_for_file_pair(orig_path, comp_path)

Compute SSIM for every 3D dataset under orig_path vs. comp_path. Returns (basename, [report_lines…]), where each line is either: “<dataset_relpath>: SSIM_first=… SSIM_last=…” or an error message.

Parameters:
  • orig_path (str)

  • comp_path (str)

Return type:

tuple[str, list[str]]

esrf_data_compressor.compressors

class esrf_data_compressor.compressors.base.Compressor

Bases: object

Abstract base class. Subclasses must implement compress_file().

compress_file(input_path, output_path, **kwargs)
Parameters:
  • input_path (str)

  • output_path (str)

class esrf_data_compressor.compressors.base.CompressorManager(workers=None, cratio=10, method='jp2k', layout='sibling')

Bases: object

Manages parallel compression and overwrite.

Each worker process is given up to 2 Blosc2 threads (or fewer if the machine has fewer than 4 cores). The number of worker processes is then total_cores // threads_per_worker (at least 1). If the user explicitly passes workers, we cap it to total_cores, then recompute threads_per_worker = min(2, total_cores // workers).
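
The allocation rule can be written out as a small function. This is a sketch of the arithmetic described above, not the class's code: plan_workers is a hypothetical name, and reading "fewer" as exactly one thread on machines with fewer than 4 cores is an assumption.

```python
from typing import Optional, Tuple


def plan_workers(total_cores: int, workers: Optional[int] = None) -> Tuple[int, int]:
    """Return (workers, threads_per_worker) for a machine with total_cores cores."""
    if workers is None:
        # Assumption: machines with fewer than 4 cores get 1 thread per worker.
        threads = 2 if total_cores >= 4 else 1
        return max(1, total_cores // threads), threads
    # Explicit request: cap at the core count, then recompute threads per worker.
    workers = max(1, min(workers, total_cores))
    threads = max(1, min(2, total_cores // workers))
    return workers, threads


print(plan_workers(8))       # default on 8 cores
print(plan_workers(8, 16))   # request above the core count is capped
```

Under these assumptions an 8-core machine defaults to 4 workers with 2 threads each, while an explicit request for 16 workers is capped to 8 workers with 1 thread each.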

Usage:

    mgr = CompressorManager(cratio=10, method='jp2k')
    mgr.compress_files([…])
    mgr.overwrite_files([…])

Parameters:
  • workers (int | None)

  • cratio (int)

  • method (str)

  • layout (str)

compress_files(file_list)

Compress each .h5 in file_list in parallel:
  • sibling layout: produce <basename>_<method>.h5 next to each source.

  • mirror layout: write compressed files to RAW_DATA_COMPRESSED with the same file names.

Does not overwrite originals. At the end, prints total elapsed time and data rate in MB/s.

Parameters:

file_list (list[str])

Return type:

None

overwrite_files(file_list)

Overwrites files only if they have a compressed sibling:

  1. Rename <file>.h5 → <file>.h5.bak

  2. Rename <file>_<method>.h5 → <file>.h5

After processing all files, removes the backup .h5.bak files.
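
The two rename steps can be sketched with the standard library alone. This is an illustration under assumptions, not the method's code: overwrite_with_sibling is a hypothetical helper, plain text files stand in for HDF5 files, and the backup is left in place because cleanup is described as a separate, later step.

```python
import os
import tempfile


def overwrite_with_sibling(raw_path: str, method: str) -> bool:
    """Swap in <file>_<method>.h5 for <file>.h5, keeping <file>.h5.bak as a backup."""
    stem, ext = os.path.splitext(raw_path)
    compressed = f"{stem}_{method}{ext}"
    if not os.path.exists(compressed):
        return False  # no compressed sibling: leave the original untouched
    os.replace(raw_path, raw_path + ".bak")  # step 1: <file>.h5 -> <file>.h5.bak
    os.replace(compressed, raw_path)         # step 2: <file>_<method>.h5 -> <file>.h5
    return True


# Demo with placeholder text files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "scan.h5")
    comp = os.path.join(tmp, "scan_jp2k.h5")
    with open(raw, "w") as f:
        f.write("raw")
    with open(comp, "w") as f:
        f.write("compressed")
    swapped = overwrite_with_sibling(raw, "jp2k")
    with open(raw) as f:
        content_after = f.read()
    backup_exists = os.path.exists(raw + ".bak")
    sibling_gone = not os.path.exists(comp)
    print(swapped, content_after, backup_exists, sibling_gone)
```

Keeping the backup until every file has been processed is what makes a later restore possible if something goes wrong partway through a batch.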

Parameters:

file_list (list[str])

Return type:

None

remove_backups(file_list)
Parameters:

file_list (list[str])

Return type:

None

restore_backups(file_list)
Parameters:

file_list (list[str])

Return type:

None

class esrf_data_compressor.compressors.jp2k.JP2KCompressor

Bases: object

Uses Blosc2+Grok (JPEG2000) to compress each z‐slice of a 3D HDF5 dataset.

compress_file(input_path, output_path, cratio=10, nthreads=None)
Parameters:
  • input_path (str)

  • output_path (str)

  • cratio (int)

  • nthreads (int | None)

Return type:

None

class esrf_data_compressor.compressors.jp2k.JP2KCompressorWrapper(cratio=10, nthreads=None)

Bases: object

Wraps JP2KCompressor so we can pass cratio and nthreads from higher‐level code.

Parameters:
  • cratio (int)

  • nthreads (int | None)

compress_file(input_path, output_path, **kwargs)
Parameters:
  • input_path (str)

  • output_path (str)

esrf_data_compressor.utils

esrf_data_compressor.utils.hdf5_helpers.copy_attrs(src, dst)

Copy all attributes from src to dst.

Parameters:
  • src (AttributeManager)

  • dst (AttributeManager)

esrf_data_compressor.utils.hdf5_helpers.exit_with_error(msg)

Print an error message to stderr and exit(1).

Parameters:

msg (str)

esrf_data_compressor.utils.utils.parse_report(report_path)

Read a report with sections ‘## TO COMPRESS ##’ / ‘## REMAINING ##’ and return just the list of file paths under TO COMPRESS.
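
A minimal sketch of such a parser follows. It works on the report text rather than a path, and the entry format is an assumption: each entry is taken to be one line whose first tab-separated field is the file path, optionally followed by a reason. The name parse_to_compress and the sample paths are hypothetical.

```python
from typing import List


def parse_to_compress(text: str) -> List[str]:
    """Collect the paths listed under '## TO COMPRESS ##' and ignore '## REMAINING ##'."""
    paths: List[str] = []
    in_section = False
    for line in text.splitlines():
        line = line.strip()
        if line == "## TO COMPRESS ##":
            in_section = True
        elif line == "## REMAINING ##":
            in_section = False
        elif in_section and line:
            # Assumption: the path is the first tab-separated field of the line.
            paths.append(line.split("\t", 1)[0])
    return paths


# Hypothetical report content.
sample = (
    "## TO COMPRESS ##\n"
    "/data/a.h5\t1.1/A/B/C contains 'val'\n"
    "/data/b.h5\n"
    "\n"
    "## REMAINING ##\n"
    "/data/c.h5\t<already compressed>\n"
)
print(parse_to_compress(sample))
```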

Parameters:

report_path (str)

Return type:

list[str]