Datasets

A Dataset is a Python class providing a curated set of data with specific helper functions.

When working with data, we often write code to transform, preprocess, and adapt it to our needs. While it may be satisfying to understand the deep magic under the hood, this process can be very time consuming. Once somebody else has done the hard work of preparing the data for a given purpose and designing relevant functions to process it, their code can be packaged as a dataset and made available to others through a CliMetLab plugin.

CliMetLab has built-in datasets (as examples), and most datasets are available as plugins.

Accessing data in a dataset

First, the relevant plugin must be installed, generally using pip:

pip install --quiet climetlab-demo-dataset

Second, the dataset can be loaded with load_dataset() as follows:

>>> import climetlab as cml
>>> ds = cml.load_dataset("demo-dataset")

Note that the plugin package providing the dataset must be installed with pip (for example, pip install climetlab-demo-dataset). If the package is not installed, CliMetLab fails with a NameError exception:

>>> import climetlab
>>> ds = climetlab.load_dataset("demo-dataset")
NameError: Cannot find dataset 'demo-dataset' (values are ...)

When the dataset some-dataset appears to be unavailable, the cause may also be a typo in the dataset name (such as confusing some-dataset with somedataset).
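The failure described above can be handled explicitly. The sketch below wraps a loader function and turns the NameError into a message that reminds the user to check the spelling and install the plugin. The names load_or_explain and fake_loader are hypothetical and only serve the illustration; in real use one would presumably pass climetlab.load_dataset as the loader.

```python
def load_or_explain(name, loader):
    """Call loader(name); on NameError, re-raise with an installation hint.

    `loader` is assumed to behave like climetlab.load_dataset, which is
    described above as raising NameError for unknown dataset names.
    """
    try:
        return loader(name)
    except NameError as exc:
        hint = (
            f"Dataset {name!r} was not found. Check the spelling of the name "
            "and make sure the plugin providing it is installed with pip."
        )
        raise NameError(hint) from exc


# Hypothetical stand-in loader, so the sketch runs without any plugin installed.
def fake_loader(name):
    known = {"demo-dataset": "<demo dataset object>"}
    if name not in known:
        raise NameError(f"Cannot find dataset '{name}'")
    return known[name]


print(load_or_explain("demo-dataset", fake_loader))
try:
    load_or_explain("demodataset", fake_loader)  # typo: missing hyphen
except NameError as err:
    print(err)
```

Chaining with "from exc" keeps the original CliMetLab message in the traceback, so no information is lost.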

Note

When sharing a Python notebook, it is good practice to add !pip install climetlab-... at the top of the notebook.
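Outside a notebook, where the !pip syntax is not available, a similar effect can be obtained from plain Python. The following is a minimal sketch using only the standard library; the helper names is_installed and ensure_plugin are hypothetical, and the module and package names are the demo ones used on this page.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def ensure_plugin(module_name, pip_name):
    """Install pip_name with the running interpreter's pip if module_name is missing."""
    if not is_installed(module_name):
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--quiet", pip_name]
        )


# Example call, left commented out because it would download a package:
# ensure_plugin("climetlab_demo_dataset", "climetlab-demo-dataset")
```

Calling pip through sys.executable installs the package into the same environment as the running interpreter, which avoids a common source of confusion when several Python installations coexist.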

The plugin name does not have to match the dataset name, and one plugin usually provides several datasets. As an example, the plugin climetlab_s2s_ai_challenge provides the datasets s2s-ai-challenge-training-input and s2s-ai-challenge-training-output:

>>> !pip install climetlab_s2s_ai_challenge
>>> climetlab.load_dataset("s2s-ai-challenge-training-input")
>>> climetlab.load_dataset("s2s-ai-challenge-training-output")

There is no need to import the plugin package to load the dataset:

>>> import climetlab_demo_dataset  # Not needed

Currently, the best way to know which plugin needs to be installed to access a given dataset is to look at the list of plugins (non-exhaustive).
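For plugins that are already installed, the mapping between dataset names and the packages providing them can be inspected locally. The sketch below assumes that plugins register their datasets as Python entry points under a group named "climetlab.datasets"; this group name is an assumption not stated on this page, and the function name installed_dataset_plugins is hypothetical.

```python
from importlib.metadata import entry_points


def installed_dataset_plugins(group="climetlab.datasets"):
    """Map each registered dataset name to its 'module:attribute' target.

    The default group name is an assumption about how plugins register
    their datasets; adjust it if the plugins use a different group.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later: selectable entry points.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9: a dict mapping group names to entry points.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


for dataset_name, target in sorted(installed_dataset_plugins().items()):
    print(f"{dataset_name} -> {target}")
```

If no plugin is installed, the function returns an empty dictionary and nothing is printed. It cannot reveal plugins that have not been installed yet, so the list of plugins remains the reference for discovery.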

Todo

Design a streamlined way to register and publish plugins.

Xarray for gridded data

Gridded data are typically field data, such as temperature or wind from climate or weather models, or satellite images.

>>> dsc = climetlab.load_dataset("dataset-name", **options)
>>> dsc.to_xarray()

Pandas for non-gridded data

Non-gridded data are typically tabular, unstructured data such as observations. They often include columns for the latitude and longitude of the data.

>>> dsc = climetlab.load_dataset("dataset-name", **options)
>>> dsc.to_pandas()

Generic options

Some arguments in the options dictionary are always included in climetlab.load_dataset(), climetlab.Dataset.to_xarray() and climetlab.Dataset.to_pandas() (see developer/dataset-options).

Todo

Currently no options are added by CliMetLab.

Other arguments are defined by the plugin maintainer and are documented in the plugin documentation.

The URL of the plugin documentation is provided by the plugin with:

>>> dsc = climetlab.load_dataset("dataset-name")
>>> dsc = climetlab.dataset("dataset-name")
>>> dsc.documentation