tokio.cli.archive_collectdes module

Dumps a lot of data out of Elasticsearch using the Python API and native scrolling support. Output is either native JSON from Elasticsearch or serialized TOKIO TimeSeries (TTS) HDF5 files.

Can use the PYTOKIO_ES_USER and PYTOKIO_ES_PASSWORD environment variables to pass HTTP authentication credentials to the Elasticsearch connector.
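These variables can be set from Python before invoking the CLI; the credentials below are placeholders for illustration:

```python
import os

# Placeholder credentials for illustration only; the CLI reads these
# environment variables and forwards them for HTTP authentication.
os.environ["PYTOKIO_ES_USER"] = "esuser"
os.environ["PYTOKIO_ES_PASSWORD"] = "espass"
```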

tokio.cli.archive_collectdes.dataset2metadataset_key(dataset_key)[source]

Return the metadataset name corresponding to a dataset name

Parameters:dataset_key (str) – Name of a dataset
Returns:Name of the corresponding metadataset
Return type:str
tokio.cli.archive_collectdes.main(argv=None)[source]

Entry point for the CLI interface

tokio.cli.archive_collectdes.metadataset2dataset_key(metadataset_name)[source]

Return the dataset name corresponding to a metadataset name

Metadatasets are never stored in the HDF5 file; they are only used to hold data needed to correctly calculate dataset values. This function maps a metadataset name to its corresponding dataset name.

Parameters:metadataset_name (str) – Name of a metadataset
Returns:Name of the corresponding dataset, or None if metadataset_name does not appear to be a metadataset name
Return type:str
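The dataset2metadataset_key/metadataset2dataset_key pair can be pictured with a hypothetical sketch. The underscore-prefix convention below is an assumption made for illustration; pytokio's actual naming convention may differ:

```python
# Hypothetical naming convention: a metadataset prefixes the final path
# component of its dataset with an underscore. This only illustrates the
# key <-> key mapping idea; pytokio's real convention may differ.
def to_metadataset_key(dataset_key):
    parts = dataset_key.rsplit("/", 1)
    parts[-1] = "_" + parts[-1]
    return "/".join(parts)

def to_dataset_key(metadataset_name):
    parts = metadataset_name.rsplit("/", 1)
    if not parts[-1].startswith("_"):
        return None  # does not look like a metadataset name
    parts[-1] = parts[-1][1:]
    return "/".join(parts)
```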
tokio.cli.archive_collectdes.normalize_cpu_datasets(inserts, datasets)[source]

Normalize CPU load datasets

Divide each element of CPU datasets by the number of CPUs counted at each point in time. Necessary because these measurements are reported on a per-core basis, but not all cores may be reported for each timestamp.

Parameters:
  • inserts (list of tuples) – list of inserts that were used to populate datasets
  • datasets (dict of TimeSeries) – all of the datasets being populated
Returns:

Nothing
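The normalization amounts to an element-wise divide of summed per-core values by a per-cell count of reporting cores. A minimal numpy sketch with illustrative numbers:

```python
import numpy as np

# Illustrative sketch: summed per-core CPU load values alongside a matrix
# counting how many cores actually reported for each cell.
summed_load = np.array([[150.0, 80.0],
                        [300.0, 90.0]])
cores_reported = np.array([[3, 2],
                           [4, 0]])  # one cell where no cores reported

# Divide element-wise, leaving cells with no reporting cores at zero
avg_load = np.where(cores_reported > 0,
                    summed_load / np.maximum(cores_reported, 1),
                    0.0)
```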

tokio.cli.archive_collectdes.pages_to_hdf5(pages, output_file, init_start, init_end, query_start, query_end, timestep, num_servers, devices_per_server, threads=1)[source]

Take pages from an Elasticsearch query and store them in output_file as HDF5.

Parameters:
  • pages (list) – A list of page objects (dictionaries)
  • output_file (str) – Path to an HDF5 file in which page data should be stored
  • init_start (datetime.datetime) – Lower bound of time (inclusive) to be stored in the output_file. Used when creating a non-existent HDF5 file.
  • init_end (datetime.datetime) – Upper bound of time (inclusive) to be stored in the output_file. Used when creating a non-existent HDF5 file.
  • query_start (datetime.datetime) – Retrieve data greater than or equal to this time from Elasticsearch
  • query_end (datetime.datetime) – Retrieve data less than or equal to this time from Elasticsearch
  • timestep (int) – Time, in seconds, between successive sample intervals to be used when initializing output_file
  • num_servers (int) – Number of discrete servers in the cluster. Used when initializing output_file.
  • devices_per_server (int) – Number of SSDs per server. Used when initializing output_file.
  • threads (int) – Number of parallel threads to utilize when parsing the Elasticsearch output
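Assuming a newly initialized file is laid out with one row per timestep between init_start and init_end and one column per device (an assumption about the HDF5 layout, with illustrative values):

```python
import datetime

# Illustrative values; one-row-per-timestep, one-column-per-SSD layout is
# an assumption about how the HDF5 datasets are initialized.
init_start = datetime.datetime(2019, 1, 1, 0, 0, 0)
init_end = datetime.datetime(2019, 1, 2, 0, 0, 0)
timestep = 10          # seconds between successive samples
num_servers = 4
devices_per_server = 8

num_rows = int((init_end - init_start).total_seconds()) // timestep
num_cols = num_servers * devices_per_server
```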
tokio.cli.archive_collectdes.process_page(page)[source]

Go through a list of docs and insert their data into a numpy matrix. In the future this should be a flush function attached to the CollectdEs connector class.

Parameters:page (dict) – A single page of output from an Elasticsearch scroll query. Should contain a hits key.
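A scroll page is a plain dictionary whose hits key wraps the matched documents; the field names inside _source below are illustrative, not the actual collectd schema:

```python
# Minimal sketch of the page layout an Elasticsearch scroll query returns;
# each matched document carries its record under the "_source" key.
page = {
    "hits": {
        "hits": [
            {"_source": {"hostname": "nvme01", "plugin": "disk", "read": 1024.0}},
            {"_source": {"hostname": "nvme02", "plugin": "disk", "read": 2048.0}},
        ]
    }
}

docs = [hit["_source"] for hit in page["hits"]["hits"]]
total_read = sum(doc["read"] for doc in docs)
```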
tokio.cli.archive_collectdes.reset_timeseries(timeseries, start, end, value=-0.0)[source]

Zero out a region of a tokio.timeseries.TimeSeries dataset

Parameters:
  • timeseries (tokio.timeseries.TimeSeries) – TimeSeries object from which a subset of data should be zeroed
  • start (datetime.datetime) – Time at which zeroing of all columns in timeseries should begin
  • end (datetime.datetime) – Time at which zeroing all columns in timeseries should end (exclusive)
  • value – value which should be set in every element being reset
Returns:

Nothing
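The zeroing can be sketched against a plain numpy matrix, assuming rows are evenly spaced timestep seconds apart starting at the dataset's first timestamp (values below are illustrative):

```python
import datetime
import numpy as np

# Sketch of zeroing a [start, end) region of a time-indexed matrix;
# assumes evenly spaced rows beginning at t0.
t0 = datetime.datetime(2019, 1, 1)
timestep = 10
data = np.ones((6, 4))

start = datetime.datetime(2019, 1, 1, 0, 0, 20)
end = datetime.datetime(2019, 1, 1, 0, 0, 50)
i_start = int((start - t0).total_seconds()) // timestep  # inclusive row
i_end = int((end - t0).total_seconds()) // timestep      # exclusive row
data[i_start:i_end, :] = -0.0  # the function's default reset value
```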

tokio.cli.archive_collectdes.update_datasets(inserts, datasets)[source]

Insert list of tuples into a dataset

Insert a list of tuples into a tokio.timeseries.TimeSeries object serially

Parameters:
  • inserts (list of tuples) –

    List of tuples which should be serially inserted into a dataset. The tuples can be of the form

    • dataset name (str)
    • timestamp (datetime.datetime)
    • column name (str)
    • value

    or

    • dataset name (str)
    • timestamp (datetime.datetime)
    • column name (str)
    • value
    • reducer name (str)

    where

    • dataset name is the key used to retrieve a target tokio.timeseries.TimeSeries object from the datasets argument
    • timestamp and column name reference the element to be updated
    • value is the new value to insert into the given (timestamp, column name) location within dataset
    • reducer name is either None (to replace whatever value currently exists in the (timestamp, column name) location) or ‘sum’ (to add value to the existing value)
  • datasets (dict) – Dictionary mapping dataset names (str) to tokio.timeseries.TimeSeries objects
Returns:

Number of elements in inserts that were not inserted because their timestamp values fell outside the range of the dataset being updated

Return type:

int
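The insert-tuple protocol can be mimicked with plain dicts standing in for tokio.timeseries.TimeSeries objects. This sketch only illustrates the None-vs-‘sum’ reducer behavior and the skipped-insert count; names and timestamps are illustrative:

```python
import datetime

# Plain-dict stand-ins for TimeSeries objects. Real TimeSeries lookups
# resolve (timestamp, column) to matrix indices and reject out-of-range
# timestamps; here an unknown dataset name stands in for that case.
datasets = {"readrates": {}}

t0 = datetime.datetime(2019, 1, 1, 0, 0, 0)
t1 = datetime.datetime(2019, 1, 1, 0, 0, 10)
inserts = [
    ("readrates", t0, "nvme01", 100.0),        # 4-tuple: plain replace
    ("readrates", t0, "nvme01", 50.0, "sum"),  # 5-tuple: add to existing
    ("missing",   t1, "nvme02", 75.0),         # unknown dataset: skipped
]

errors = 0
for insert in inserts:
    name, timestamp, column, value = insert[:4]
    reducer = insert[4] if len(insert) > 4 else None
    target = datasets.get(name)
    if target is None:
        errors += 1  # stands in for the out-of-range counter
        continue
    key = (timestamp, column)
    if reducer == "sum" and key in target:
        target[key] += value
    else:
        target[key] = value
```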