tokio.connectors.hdf5 module

Provide a TOKIO-aware HDF5 class that knows how to interpret schema versions encoded in a TOKIO HDF5 file and translate a universal schema into file-specific schemas. Also supports dynamically mapping static HDF5 datasets into new derived datasets.

class tokio.connectors.hdf5.Hdf5(*args, **kwargs)[source]

Bases: h5py._hl.files.File

Hdf5 file class with extra hooks to parse different schemas

Provides an h5py.File-like class with added methods that expose a generic API capable of decoding the different schemata used to store file system load data.

Variables:
  • always_translate (bool) – If True, looking up datasets by keys will always attempt to map that key to a new dataset according to the schema even if the key matches the name of an existing dataset.
  • dataset_providers (dict) – Map of logical dataset names (keys) to dicts that describe the functions used to convert underlying literal dataset data into the format expected when dereferencing the logical dataset name.
  • schema (dict) – Map of logical dataset names (keys) to the literal dataset names in the underlying file (values)
  • _version (str) – Defined and used at initialization time to determine what schema to apply to map the HDF5 connector API to the underlying HDF5 file.
  • _timesteps (dict) – Keyed by dataset name (str) and has values corresponding to the timestep (in seconds) between each sampled datum in that dataset.
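
A minimal usage sketch; the file name tokio_tts.hdf5 and the logical dataset name datatargets/readbytes are illustrative, not part of the API:

    import tokio.connectors.hdf5

    # Open an example TOKIO Time Series file (illustrative name) read-only;
    # Hdf5 behaves like h5py.File
    with tokio.connectors.hdf5.Hdf5('tokio_tts.hdf5', 'r') as hdf_file:
        # Dereference a logical dataset name; the connector maps it to the
        # literal (or derived) dataset based on the file's schema version
        dataset = hdf_file['datatargets/readbytes']
        print(dataset.shape)
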
__getitem__(key)[source]

Resolve dataset names into actual data

Provides a single interface through which standard keys can be dereferenced and a semantically consistent view of data is returned regardless of the schema of the underlying HDF5 file.

Returns the underlying h5py.Dataset either through direct access or through a 1:1 mapping between the standardized key and an underlying dataset name, or returns a numpy array if an underlying h5py.Dataset must be transformed to match the structure and semantics of the requested data.

Can also suffix datasets with special meta-dataset names (e.g., “/missing”) to access data that is related to the root dataset.

Parameters:key (str) – The standard name of a dataset to be accessed.
Returns:
  • h5py.Dataset if key is a literal dataset name
  • h5py.Dataset if key maps directly to a literal dataset name given the file schema version
  • numpy.ndarray if key maps to a provider function that can calculate the requested data
Return type:h5py.Dataset or numpy.ndarray
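
For illustration, the same bracket syntax covers both schema-mapped lookups and meta-dataset suffixes; the file and dataset names are assumptions:

    import tokio.connectors.hdf5

    # Illustrative file and dataset names
    with tokio.connectors.hdf5.Hdf5('tokio_tts.hdf5', 'r') as hdf_file:
        # Standard key, resolved through the schema to a literal dataset
        readbytes = hdf_file['datatargets/readbytes']

        # Meta-dataset suffix: data related to the root dataset; here the
        # missing-data indicator, generated by a provider function as a
        # numpy array
        missing = hdf_file['datatargets/readbytes/missing']
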
__init__(*args, **kwargs)[source]

Initialize an HDF5 file

This is just an HDF5 file object; the magic is in the additional methods and indexing that are provided by the TOKIO Time Series-specific HDF5 object.

Parameters:ignore_version (bool) – If True, do not raise KeyError if the HDF5 file does not contain a valid version.
_get_columns_h5lmt(dataset_name)[source]

Get the column names of an h5lmt dataset

_get_missing_h5lmt(dataset_name, inverse=False)[source]

Return the FSMissingGroup dataset from an H5LMT file

Encodes a hot mess of hacks to return something that looks like what get_missing() would return for a real dataset.

Parameters:
  • dataset_name (str) – name of dataset to access
  • inverse (bool) – return 0 for missing and 1 for present if True
Returns:Array of numpy.int8 of 1 and 0 to indicate the presence or absence of specific elements
Return type:numpy.ndarray

_resolve_schema_key(key)[source]

Given a key, either return a key that can be used to index self directly, or return a provider function and arguments to generate the dataset dynamically

_to_dataframe(dataset_name)[source]

Convert a dataset into a dataframe via TOKIO HDF5 schema

_to_dataframe_h5lmt(dataset_name)[source]

Convert a dataset into a dataframe via H5LMT native schema

commit_timeseries(timeseries, **kwargs)[source]

Writes contents of a TimeSeries object into a group

Parameters:
  • timeseries (tokio.timeseries.TimeSeries) – the time series to save as a dataset within self
  • kwargs (dict) – Extra arguments to pass to self.create_dataset()
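
A hedged sketch that copies a dataset from one file into another by way of a TimeSeries object; the file names, dataset name, and compression setting are illustrative:

    import tokio.connectors.hdf5

    # Load an existing dataset (illustrative names) into a TimeSeries object ...
    with tokio.connectors.hdf5.Hdf5('tokio_tts.hdf5', 'r') as src_file:
        series = src_file.to_timeseries('datatargets/readbytes')

    # ... then commit it into a new file; ignore_version=True because the new
    # file carries no version yet, and extra kwargs such as compression are
    # passed through to create_dataset()
    with tokio.connectors.hdf5.Hdf5('copy_of_tts.hdf5', 'w',
                                    ignore_version=True) as dest_file:
        dest_file.commit_timeseries(series, compression='gzip')
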
get_columns(dataset_name)[source]

Get the column names of a dataset

Parameters:dataset_name (str) – name of dataset whose columns will be retrieved
Returns:Array of column names, or an empty array if no columns are defined
Return type:numpy.ndarray
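
For example (file and dataset names assumed):

    import tokio.connectors.hdf5

    # Illustrative file and dataset names
    with tokio.connectors.hdf5.Hdf5('tokio_tts.hdf5', 'r') as hdf_file:
        # Print each column label (e.g., one per storage target)
        for column in hdf_file.get_columns('datatargets/readbytes'):
            print(column)
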
get_index(dataset_name, target_datetime)[source]

Turn a datetime object into an integer that can be used to reference specific times in datasets.
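
A short sketch, assuming the same illustrative file and dataset names as above:

    import datetime
    import tokio.connectors.hdf5

    with tokio.connectors.hdf5.Hdf5('tokio_tts.hdf5', 'r') as hdf_file:
        # Map a wall-clock time to an integer index into the dataset's time axis
        index = hdf_file.get_index('datatargets/readbytes',
                                   datetime.datetime(2019, 1, 1, 0, 30))
        row = hdf_file['datatargets/readbytes'][index]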

get_missing(dataset_name, inverse=False)[source]

Convert a dataset into a matrix indicating the absence of data

Parameters:
  • dataset_name (str) – name of dataset to access
  • inverse (bool) – return 0 for missing and 1 for present if True
Returns:Array of numpy.int8 of 1 and 0 to indicate the presence or absence of specific elements
Return type:numpy.ndarray
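
For example (file and dataset names assumed):

    import tokio.connectors.hdf5

    # Illustrative file and dataset names
    with tokio.connectors.hdf5.Hdf5('tokio_tts.hdf5', 'r') as hdf_file:
        # With inverse=False, elements that were never populated are flagged as 1
        missing = hdf_file.get_missing('datatargets/readbytes')
        print("missing elements:", missing.sum())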

get_timestamps(dataset_name)[source]

Return timestamps dataset corresponding to given dataset name

This method returns a dataset, not a numpy array, so you can face severe performance penalties trying to iterate directly on the return value! To iterate over timestamps, it is almost always better to dereference the dataset to get a numpy array and iterate over that in memory.

Parameters:dataset_name (str) – Logical name of dataset whose timestamps should be retrieved
Returns:The dataset containing the timestamps corresponding to dataset_name.
Return type:h5py.Dataset
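
A sketch of the recommended pattern (illustrative file and dataset names), dereferencing the returned dataset into a numpy array before iterating:

    import tokio.connectors.hdf5

    with tokio.connectors.hdf5.Hdf5('tokio_tts.hdf5', 'r') as hdf_file:
        # get_timestamps() returns an h5py.Dataset; [...] pulls it into memory
        timestamps = hdf_file.get_timestamps('datatargets/readbytes')[...]
        print(timestamps[0], timestamps[-1])
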
get_timestep(dataset_name, timestamps=None)[source]

Cache or calculate the timestep for a dataset
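
For example (file and dataset names assumed):

    import tokio.connectors.hdf5

    # Illustrative file and dataset names
    with tokio.connectors.hdf5.Hdf5('tokio_tts.hdf5', 'r') as hdf_file:
        # Number of seconds between consecutive samples in the dataset
        timestep = hdf_file.get_timestep('datatargets/readbytes')
        print(timestep)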

get_version(dataset_name=None)[source]

Get the version attribute from an HDF5 file dataset

Parameters:dataset_name (str) – Name of the dataset whose version should be retrieved. If None, return the global file’s version.
Returns:The version string for the specified dataset
Return type:str
set_version(version, dataset_name=None)[source]

Set the version attribute of an HDF5 file dataset

Provide a portable way to set the global schema version or the version of a specific dataset.

Parameters:
  • version (str) – The new version to be set
  • dataset_name (str) – Name of the dataset whose version should be set. If None, set the global file’s version.
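
A sketch covering both get_version() and set_version(); the file name, dataset name, and version string are illustrative:

    import tokio.connectors.hdf5

    with tokio.connectors.hdf5.Hdf5('tokio_tts.hdf5', 'r+') as hdf_file:
        # Read the global schema version, then the version of one dataset
        print(hdf_file.get_version())
        print(hdf_file.get_version('datatargets/readbytes'))

        # Stamp a new version on that one dataset only
        hdf_file.set_version('1', dataset_name='datatargets/readbytes')
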
to_dataframe(dataset_name)[source]

Convert a dataset into a dataframe

Parameters:dataset_name (str) – dataset name to convert to DataFrame
Returns:DataFrame indexed by datetime objects corresponding to the timestamps, with columns labeled appropriately and values taken from the dataset
Return type:pandas.DataFrame
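
For example (file and dataset names assumed):

    import tokio.connectors.hdf5

    # Illustrative file and dataset names
    with tokio.connectors.hdf5.Hdf5('tokio_tts.hdf5', 'r') as hdf_file:
        # DataFrame indexed by datetime, one column per column label
        dataframe = hdf_file.to_dataframe('datatargets/readbytes')
        print(dataframe.describe())
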
to_timeseries(dataset_name, light=False)[source]

Creates a TimeSeries representation of a dataset

Create a TimeSeries object populated with the data from an existing HDF5 dataset.

Responsible for setting timeseries.dataset_name, timeseries.columns, timeseries.dataset, timeseries.dataset_metadata, timeseries.group_metadata, and timeseries.timestamp_key

Parameters:
  • dataset_name (str) – Name of existing dataset in self to convert into a TimeSeries object
  • light (bool) – If True, don’t actually load datasets into memory; reference them directly from the HDF5 file
Returns:The in-memory representation of the given dataset.
Return type:tokio.timeseries.TimeSeries
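
For example (file and dataset names assumed):

    import tokio.connectors.hdf5

    # Illustrative file and dataset names
    with tokio.connectors.hdf5.Hdf5('tokio_tts.hdf5', 'r') as hdf_file:
        # light=True leaves the data on disk and references it from the HDF5 file
        series = hdf_file.to_timeseries('datatargets/readbytes', light=True)
        print(series.columns)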

tokio.connectors.hdf5.get_insert_indices(my_timestamps, existing_timestamps)[source]

Given new timestamps and an existing series of timestamps, find the indices where they overlap so that new data can be inserted into the middle of an existing dataset
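
A minimal sketch, assuming integer epoch timestamps with a fixed 10-second step; the arrays are fabricated for illustration:

    import numpy
    import tokio.connectors.hdf5

    # Existing series covers timestamps 0-50; the new series overlaps its middle
    existing_timestamps = numpy.arange(0, 60, 10)
    my_timestamps = numpy.arange(20, 50, 10)

    # Indices describing where the new timestamps line up with the existing ones
    indices = tokio.connectors.hdf5.get_insert_indices(my_timestamps,
                                                       existing_timestamps)
    print(indices)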

tokio.connectors.hdf5.missing_values(dataset, inverse=False)[source]

Identify matrix values that are missing

Because we initialize datasets with -0.0, we can scan the sign bit of every element of an array to determine how many data were never populated. This converts negative zeros to ones and all other values to zeros, then counts the number of missing elements in the array.

Parameters:
  • dataset – dataset to access
  • inverse (bool) – return 0 for missing and 1 for present if True
Returns:Array of numpy.int8 of 1 and 0 to indicate the presence or absence of specific elements
Return type:numpy.ndarray
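
A small self-contained sketch; passing a plain numpy array as the dataset argument is an assumption here:

    import numpy
    import tokio.connectors.hdf5

    # Simulate a partially populated dataset: -0.0 marks never-written elements
    dataset = numpy.full((4, 4), -0.0)
    dataset[0, :] = 12345.0

    # 1 flags missing elements, 0 flags populated ones (inverse=False)
    missing = tokio.connectors.hdf5.missing_values(dataset)
    print(missing.sum(), "elements were never populated")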