Datasets

The following documentation is intended for Cannlytics Data Scientists who wish to use or publish datasets on Hugging Face. The documentation covers:

Cloning a Dataset
Dataset Components
Publishing Datasets on Hugging Face
Testing a Dataset

Cloning a Dataset

Install Git Large File Storage (LFS) and clone your repository, for example cannabis_licenses:

# Make sure that you have git-lfs installed.
git lfs install

git clone https://huggingface.co/datasets/cannlytics/cannabis_licenses

Dataset Components

A dataset is composed of:

A README.md describing the dataset's contents. You can see a template or read more about dataset cards.
Your data, either:
- A .csv file and / or a .json file for your data.
- URLs to your .csv and / or .json file(s).
Note that it is helpful to split your dataset into training and test datafiles.
An optional loading script, for example cannabis_licenses.py.

Publishing Datasets on Hugging Face

First you will want to create your dataset repository. Make sure that you are in the virtual environment where you have installed Hugging Face's datasets package, then run the following command to login using your Hugging Face Hub credentials:
```
huggingface-cli login
```
Then you can create a new dataset repository:
```
huggingface-cli repo create your_new_dataset --type dataset --organization cannlytics
```
Second, you can create your metadata file, dataset_infos.json, and test your new dataset loading script with:
```
datasets-cli test path/to/<your-dataset-folder> --save_infos --all_configs
```

Finally, you can upload your data files through the Hugging Face user interface, commit through VS Code, or commit with the command line:

cp /datasets/your_new_dataset/*.csv .
git lfs track *.csv
git add .gitattributes
git add *.csv
git commit -m "add csv files"
cp /datasets/your_new_dataset/dataset_infos.csv .
cp /datasets/your_new_dataset/load_script.py .
git add --all
git status
git commit -m "First version of the your_new_dataset dataset."
git push

Making a Dataset Pull Request

When you want to make a pull request to a specific dataset, first, create a pull request branch on Hugging Face, then checkout the branch:

git fetch origin refs/pr/2:pr/2
git checkout pr/2

Next, do your modifications and track all of your changes, including any large data files, e.g. .csv files:

git lfs track *.csv
git add *.csv
git commit -m "Added `csv` files"
git add --all
git status
git commit -m "Updated `cannabis_licenses` dataset."
git push

Finally, make the pull request:

git push origin pr/2:refs/pr/2

🎉 Congratulations, your pull request is now ready to be reviewed and merged by the repository admin!

Testing a Dataset

You can create dummy data for the dataset with:

datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate

You can also load the dataset locally in Python, for example:

from datasets import load_dataset

# Load the dataset.
dataset = load_dataset('cannabis_licenses.py', 'ca')
data = dataset['data']
assert len(data) > 0
print('Read %i data points.' % len(data))