How to create a dataset plugin

From the end-user’s perspective, a Dataset is a object created using cml.load_dataset(name, *args) with the appropriate name and arguments, which provides data.

From the plugin’s developer perspective, a Dataset is a Python class that inherits from the CliMetLab class climetlab.Dataset. This class contains the Python code providing specific helper functions and curated access to the data. Dataset can also be defined from yaml files if they have no specific Python code and rely on (yet to defined) standard conventions.

CliMetLab has build-in example datasets for demo purposes. And more examples can be found in the non-exhaustive list of CliMetLab plugins.

Note

Naming convention: A plugin package name should preferably starts with climetlab- and use “-”. The Python package to import should starts with climetlab_ and use “_”.

With a Python package

Here is a minimal example of pip package defining a dataset plugin : https://github.com/ecmwf/climetlab-demo-dataset

The plugin mechanism relies on using entry_points. The three lines highlighted below are registering the class climetlab_demo_dataset.DemoDataset with entry_points. Then as seen in the example notebook, the end-user can use this external plugin to access the class cml.load_dataset("demo-dataset").

This is exhaustively described in the Python reference documentation and here are more details about how on CliMetLab uses it.

  setuptools.setup(
      name="climetlab-demo-dataset",
      version="0.0.1",
      description="Example climetlab external dataset plugin",

      entry_points={"climetlab.datasets":
              ["demo-dataset = climetlab_demo_dataset:DemoDataset"]
      },

  )

With a Python package (automated)

While creating the package manually from the documentation and from the example above is possible, there is also a semi-automated way relying on cookiecutter to generate a pip package from a template. The generated package has a predefined dataset and is ready to be shared on Github and distributed.

For detailed information, please see the README file).

pip install cookiecutter
cookiecutter https://github.com/ecmwf-lab/climetlab-cookiecutter/dataset

With a YAML file

YAML file definitions can be used for simple datasets which rely on existing built-in data source, and cannot be as flexible to end-users. The following example shows how to use a source when the data consists of a single file downloadable from a URL.

---
dataset:
  source: url
  args:
    url: http://download.ecmwf.int/test-data/metview/gallery/temp.bufr

  metadata:
    documentation: Sample BUFR file containing TEMP messages

Todo

Document the YAML file way to create a dataset. Choose a good way to implement the workflow.

  • Create a dataset YAML file.

  • distribute it.