Working with Hugging Face Datasets | by Wei-Meng Lee | Jun, 2023


Learn how to access the datasets on Hugging Face Hub and how you can load them remotely using DuckDB and the Datasets library


As an AI platform, Hugging Face builds, trains, and deploys state-of-the-art open-source machine learning models. In addition to hosting these trained models, Hugging Face also hosts datasets (https://huggingface.co/datasets) that you can use in your own projects.

In this article, I will show you how you can access the datasets in Hugging Face, and how you can programmatically download them onto your local computer. Specifically, I will show you how to:

  • load the datasets remotely using DuckDB’s support for httpfs
  • stream the datasets using the Datasets library by Hugging Face

The Hugging Face Datasets server is a lightweight web API for visualizing the different types of datasets stored on the Hugging Face Hub. You can use its REST API to query those datasets. The following sections provide a short tutorial on what you can do with the API at https://datasets-server.huggingface.co/.

Getting a list of datasets hosted on the Hub

To get a list of datasets that you can retrieve from Hugging Face, use the following statement with the valid endpoint:

$ curl -X GET "https://datasets-server.huggingface.co/valid"

The statement returns a JSON result.

The datasets that work without errors are listed under the valid key in the result. One example of a valid dataset is 0-hero/OIG-small-chip2.

Validating a dataset

To validate a dataset, use the following statement with the is-valid endpoint together with the dataset parameter:

$ curl -X GET "https://datasets-server.huggingface.co/is-valid?dataset=0-hero/OIG-small-chip2"

If the dataset is valid, you will see the following result:

{"valid":true}
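The same endpoint URLs can also be assembled in Python rather than typed by hand. The following is a minimal sketch using only the standard library; the build_url helper is a hypothetical name introduced here for illustration and is not part of any Hugging Face library.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"


def build_url(endpoint, **params):
    """Return the URL for a Datasets server endpoint plus its query parameters."""
    # safe="/" keeps the slash in names such as "0-hero/OIG-small-chip2" readable.
    query = urlencode(params, safe="/")
    return f"{BASE_URL}/{endpoint}?{query}" if query else f"{BASE_URL}/{endpoint}"


is_valid_url = build_url("is-valid", dataset="0-hero/OIG-small-chip2")
print(is_valid_url)
```

The resulting string can be passed to curl or to any HTTP client of your choice.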

Getting the list of configurations and splits of a dataset

A dataset typically has splits (a training set, a validation set, and a test set). It may also have configurations, which are sub-datasets within a larger dataset.

Configurations are common for multilingual speech datasets. For more details on splits, visit: https://huggingface.co/docs/datasets-server/splits.

To get the splits of a dataset, use the following statement with the splits endpoint and the dataset parameter:

$ curl -X GET "https://datasets-server.huggingface.co/splits?dataset=0-hero/OIG-small-chip2"

The following result will be returned:

{
"splits": [
{
"dataset":"0-hero/OIG-small-chip2",
"config":"0-hero--OIG-small-chip2",
"split":"train"
}
],
"pending":[],
"failed":[]
}

For this dataset, there is only a single train split.

Here is an example of a dataset (“duorc”) that has multiple splits and configurations:

{
"splits": [
{
"dataset": "duorc",
"config": "SelfRC",
"split": "train",
"num_bytes": 239852925,
"num_examples": 60721
},
{
"dataset": "duorc",
"config": "SelfRC",
"split": "validation",
"num_bytes": 51662575,
"num_examples": 12961
},
{
"dataset": "duorc",
"config": "SelfRC",
"split": "test",
"num_bytes": 49142766,
"num_examples": 12559
},
{
"dataset": "duorc",
"config": "ParaphraseRC",
"split": "train",
"num_bytes": 496683105,
"num_examples": 69524
},
{
"dataset": "duorc",
"config": "ParaphraseRC",
"split": "validation",
"num_bytes": 106510545,
"num_examples": 15591
},
{
"dataset": "duorc",
"config": "ParaphraseRC",
"split": "test",
"num_bytes": 115215816,
"num_examples": 15857
}
]
}
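A flat list like this is easier to read once it is grouped by configuration. The sketch below works on a trimmed copy of the "duorc" response above (the num_bytes fields are omitted for brevity); splits_by_config and examples_per_config are hypothetical helper names used only for illustration.

```python
from collections import defaultdict

# Trimmed copy of the "duorc" response shown above (num_bytes omitted).
payload = {
    "splits": [
        {"dataset": "duorc", "config": "SelfRC", "split": "train", "num_examples": 60721},
        {"dataset": "duorc", "config": "SelfRC", "split": "validation", "num_examples": 12961},
        {"dataset": "duorc", "config": "SelfRC", "split": "test", "num_examples": 12559},
        {"dataset": "duorc", "config": "ParaphraseRC", "split": "train", "num_examples": 69524},
        {"dataset": "duorc", "config": "ParaphraseRC", "split": "validation", "num_examples": 15591},
        {"dataset": "duorc", "config": "ParaphraseRC", "split": "test", "num_examples": 15857},
    ]
}


def splits_by_config(payload):
    """Map each configuration name to the list of its split names."""
    grouped = defaultdict(list)
    for entry in payload["splits"]:
        grouped[entry["config"]].append(entry["split"])
    return dict(grouped)


def examples_per_config(payload):
    """Sum num_examples over all splits of each configuration."""
    totals = defaultdict(int)
    for entry in payload["splits"]:
        totals[entry["config"]] += entry["num_examples"]
    return dict(totals)


grouped = splits_by_config(payload)
totals = examples_per_config(payload)
for config, splits in grouped.items():
    print(config, splits, totals[config])
```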

Obtaining the first 100 rows

Often, you might want to retrieve a subset of the dataset for examination purposes, instead of downloading everything. The following statement uses the first-rows endpoint with the dataset, config, and split parameters to extract the first 100 rows of the specified dataset:

$ curl -X GET "https://datasets-server.huggingface.co/first-rows?dataset=0-hero/OIG-small-chip2&config=0-hero--OIG-small-chip2&split=train"

The JSON result contains the first 100 rows of the specified dataset:

{
"dataset": "0-hero/OIG-small-chip2",
"config": "0-hero--OIG-small-chip2",
"split": "train",
"features": [{
"feature_idx": 0,
"name": "user",
"type": {
"dtype": "string",
"_type": "Value"
}
}, {
"feature_idx": 1,
"name": "chip2",
"type": {
"dtype": "string",
"_type": "Value"
}
}],
"rows": [{
"row_idx": 0,
"row": {
"user": "I've heard that it's a good idea to have a will. What is a will?\n\n",
"chip2": "A will is a legal document that specifies how your property should be distributed after you die. It can also specify who should care for any children or other dependents you may have. It's important to make sure that your will is valid and up-to-date, since the laws governing wills vary from state to state."
},
"truncated_cells": []
}, {
"row_idx": 1,
"row": {
"user": "How do I find a job?\n\n",
"chip2": "The best way to find a job is to create a strong resume and cover letter that highlights your skills, experience, and qualifications. You can also search online for job openings in your area of expertise or contact local employers directly to inquire about available positions. Networking with people in the same field or industry can also be helpful in finding employment opportunities."
},
"truncated_cells": []
}, {
"row_idx": 2,
"row": {
"user": "Produce a detailed written description of a gloomy scene inside of a mansion with a domed ceiling, pillars, and a door.\n\n",
"chip2": "Circular in shape, the floor is completely made from square-cut sections of black marble. Large, blackened marble pillars line the walls, ending with a domed glass ceiling high overhead. A narrow gap on the far side of the room leads to a sweeping stairwell with a mithglin-plated door beneath it."
},
"truncated_cells": []
}, {

...

}, {
"row_idx": 98,
"row": {
"user": "I need to learn how to budget my money better. What advice can you give me?\n\n",
"chip2": "One of the best ways to budget your money better is to make a budget and track your spending. Start by writing down all your sources of income and all your expenses. Go over each item closely to determine what you can cut back on. Prioritize the things that need to be paid first, such as rent and utilities, and try to make sure you’re setting aside some money for savings. Additionally, setting spending limits for yourself can help you stay on track. Finally, regularly review your budget to make sure you are staying within your budgeting goals."
},
"truncated_cells": []
}, {
"row_idx": 99,
"row": {
"user": "Produce a detailed written description of a serene scene outside a mountain road with monir trees and a boulder.\n\n",
"chip2": "A large granite boulder, apparently fallen from the mountain's craggy cliffs in some distant age, lies beside the cobblestoned road. Surrounded by a thick stand of monir trees, the boulder provides a convenient resting place for travellers, and a brief respite from the rigors of the journey."
},
"truncated_cells": []
}]
}
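Each entry in rows wraps the actual record in a row key, next to bookkeeping fields such as row_idx and truncated_cells. The sketch below pulls out the column names and the plain records. The payload is a hand-written stand-in that mirrors the shape of the response above, with placeholder cell values rather than real dataset content, and the helper names are hypothetical.

```python
# Hand-written stand-in mirroring the first-rows response shape;
# the cell values are placeholders, not real dataset content.
payload = {
    "features": [
        {"feature_idx": 0, "name": "user", "type": {"dtype": "string", "_type": "Value"}},
        {"feature_idx": 1, "name": "chip2", "type": {"dtype": "string", "_type": "Value"}},
    ],
    "rows": [
        {"row_idx": 0, "row": {"user": "question 0", "chip2": "answer 0"}, "truncated_cells": []},
        {"row_idx": 1, "row": {"user": "question 1", "chip2": "answer 1"}, "truncated_cells": []},
    ],
}


def column_names(payload):
    """Return the column names in feature order."""
    ordered = sorted(payload["features"], key=lambda f: f["feature_idx"])
    return [feature["name"] for feature in ordered]


def plain_rows(payload):
    """Return just the records, dropping row_idx and truncated_cells."""
    return [item["row"] for item in payload["rows"]]


columns = column_names(payload)
records = plain_rows(payload)
print(columns)
print(records[0])
```

A list of plain dictionaries like records can be handed directly to pandas.DataFrame if you want a tabular view.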

Getting a slice of the dataset

You can also get a slice of the dataset. The following statement uses the rows endpoint with the dataset, config, split, offset, and length parameters to extract three rows of the specified dataset starting from the third row:

$ curl -X GET "https://datasets-server.huggingface.co/rows?dataset=0-hero/OIG-small-chip2&config=0-hero--OIG-small-chip2&split=train&offset=2&length=3"

The result will look like this:

{
"features": [{
"feature_idx": 0,
"name": "user",
"type": {
"dtype": "string",
"_type": "Value"
}
}, {
"feature_idx": 1,
"name": "chip2",
"type": {
"dtype": "string",
"_type": "Value"
}
}],
"rows": [{
"row_idx": 2,
"row": {
"user": "Produce a detailed written description of a gloomy scene inside of a mansion with a domed ceiling, pillars, and a door.\n\n",
"chip2": "Circular in shape, the floor is completely made from square-cut sections of black marble. Large, blackened marble pillars line the walls, ending with a domed glass ceiling high overhead. A narrow gap on the far side of the room leads to a sweeping stairwell with a mithglin-plated door beneath it."
},
"truncated_cells": []
}, {
"row_idx": 3,
"row": {
"user": "I have a sprained ankle and need to know what I can do to help it heal faster.\n\n",
"chip2": "You may want to try some ice therapy, which will help reduce swelling and inflammation. Additionally, you could use an elastic bandage or compression wrap to provide additional support for your ankle. It's also important to stay off of it as much as possible, so that it can heal properly."
},
"truncated_cells": []
}, {
"row_idx": 4,
"row": {
"user": "How do you roast a turkey?\n\n",
"chip2": "To roast a turkey, start by preheating your oven to 350 degrees F. Place the turkey in a roasting pan, and rub it with butter or olive oil. Sprinkle with salt and pepper, and any other herbs or seasonings you like. Place the turkey in the oven and cook for approximately 20 minutes per pound, or until the internal temperature registers 165 degrees. Once the turkey is cooked, remove it from the oven and let it sit for 10-15 minutes before carving."
},
"truncated_cells": []
}]
}
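Because offset is zero-based, offset=2 with length=3 returns the rows with row_idx 2, 3, and 4. To walk through a larger range, you can generate successive offset and length pairs. The sketch below assumes a page size of 100 (chosen to match the first-rows example; check the API documentation for the actual limit on length), and page_params is a hypothetical helper name.

```python
def page_params(total_rows, page_size=100):
    """Yield (offset, length) pairs that together cover rows 0..total_rows-1."""
    for offset in range(0, total_rows, page_size):
        # The last page may be shorter than page_size.
        yield offset, min(page_size, total_rows - offset)


pages = list(page_params(250))
for offset, length in pages:
    print(f"offset={offset}&length={length}")
```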

Getting the Parquet files of a dataset

While the datasets on the Hugging Face Hub can be published in a wide variety of formats (CSV, JSONL, etc.), the Datasets server automatically converts all public datasets to the Parquet format. The Parquet format offers significant performance improvements, especially for large datasets, as later sections will demonstrate.

Apache Parquet is a file format that is designed to support fast data processing for complex data.

To get the list of Parquet files for a dataset, use the following statement with the parquet endpoint and the dataset parameter:

$ curl -X GET "https://datasets-server.huggingface.co/parquet?dataset=0-hero/OIG-small-chip2"  

The above statement returns the following JSON result:

{
"parquet_files": [{
"dataset": "0-hero/OIG-small-chip2",
"config": "0-hero--OIG-small-chip2",
"split": "train",
"url": "https://huggingface.co/datasets/0-hero/OIG-small-chip2/resolve/refs%2Fconvert%2Fparquet/0-hero--OIG-small-chip2/parquet-train.parquet",
"filename": "parquet-train.parquet",
"size": 51736759
}],
"pending": [],
"failed": []
}

In particular, the value of the url key specifies the location where you can download the dataset in Parquet format, which is https://huggingface.co/datasets/0-hero/OIG-small-chip2/resolve/refs%2Fconvert%2Fparquet/0-hero--OIG-small-chip2/parquet-train.parquet in this example.

Now that you have seen how to use the Datasets server REST API, let’s see how you can download the datasets programmatically.

In Python, the easiest way is to use the requests library:

import requests

r = requests.get("https://datasets-server.huggingface.co/parquet?dataset=0-hero/OIG-small-chip2")
j = r.json()

print(j)

The result of the json() function is a Python dictionary:

{
'parquet_files': [
{
'dataset': '0-hero/OIG-small-chip2',
'config': '0-hero--OIG-small-chip2',
'split': 'train',
'url': 'https://huggingface.co/datasets/0-hero/OIG-small-chip2/resolve/refs%2Fconvert%2Fparquet/0-hero--OIG-small-chip2/parquet-train.parquet',
'filename': 'parquet-train.parquet',
'size': 51736759
}
],
'pending': [],
'failed': []
}

Using this dictionary result, you can use a list comprehension to find the URL for the dataset in Parquet format:

urls = [f['url'] for f in j['parquet_files'] if f['split'] == 'train']
urls

The urls variable is a list of URLs for the Parquet files in the train split:

['https://huggingface.co/datasets/0-hero/OIG-small-chip2/resolve/refs%2Fconvert%2Fparquet/0-hero--OIG-small-chip2/parquet-train.parquet']
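The %2F sequences in this URL are percent-encoded slashes. Decoding the path makes it easier to see that the file lives under the refs/convert/parquet revision of the dataset repository. The following sketch uses only the standard library; the URL is copied from the result above.

```python
from urllib.parse import unquote, urlsplit

url = (
    "https://huggingface.co/datasets/0-hero/OIG-small-chip2/resolve/"
    "refs%2Fconvert%2Fparquet/0-hero--OIG-small-chip2/parquet-train.parquet"
)

# Split off the path component, then turn %2F back into "/".
decoded_path = unquote(urlsplit(url).path)
filename = decoded_path.rsplit("/", 1)[-1]

print(decoded_path)
print(filename)
```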

Downloading the Parquet file using DuckDB

DuckDB can load a dataset remotely, directly from its URL.

First, ensure you install DuckDB if you have not done so:

!pip install duckdb 

Then, create a DuckDB instance and install httpfs:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

The httpfs extension is a loadable extension that implements a file system for reading and writing remote files.

Once httpfs is installed and loaded, you can load the Parquet dataset from Hugging Face Hub by using a SQL query:

con.sql(f'''
SELECT * from '{urls[0]}'
''').df()

The df() function above converts the result of the query to a Pandas DataFrame.

One great feature of Parquet is that it stores data in a columnar format. So if your query requests only a single column, only that column is downloaded to your computer:

con.sql(f'''
SELECT "user" from '{urls[0]}'
''').df()

In the above query, only the “user” column is downloaded.

This Parquet feature is especially useful for large datasets: imagine the time and space you can save by downloading only the columns you need.
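If you select columns often, it can help to build the query text with a small function. The sketch below is one way to do that, not a DuckDB API: build_select and quote_identifier are hypothetical helpers. Column names are wrapped in double quotes, the standard SQL way of quoting identifiers, which is presumably why "user" is quoted in the query above, since user is also an SQL keyword.

```python
def quote_identifier(name):
    """Quote a column name as an SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(url, columns=None):
    """Build a SELECT statement over a remote Parquet file; all columns if none given."""
    column_list = ", ".join(quote_identifier(c) for c in columns) if columns else "*"
    # Escape single quotes so the URL stays a valid SQL string literal.
    source = url.replace("'", "''")
    return f"SELECT {column_list} FROM '{source}'"


url = (
    "https://huggingface.co/datasets/0-hero/OIG-small-chip2/resolve/"
    "refs%2Fconvert%2Fparquet/0-hero--OIG-small-chip2/parquet-train.parquet"
)
query = build_select(url, ["user"])
print(query)
```

The returned string can then be passed to con.sql() in place of the hand-written query.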

In some cases, you don’t even need to download the data at all. Consider the following query:

con.sql(f'''
SELECT count(*) from '{urls[0]}'
''').df()

No row data needs to be downloaded, as this request can be fulfilled simply by reading the metadata of the Parquet file.

Here is another example, this time using DuckDB to load a different dataset (“mstz/heart_failure”):

import requests

r = requests.get("https://datasets-server.huggingface.co/parquet?dataset=mstz/heart_failure")
j = r.json()
urls = [f['url'] for f in j['parquet_files'] if f['split'] == 'train']

con.sql(f'''
SELECT * from '{urls[0]}'
''').df()

This dataset has 299 rows and 13 columns.

We could perform some aggregation on the age column:

con.sql(f"""
SELECT
SUM(IF(age<40,1,0)) AS 'Under 40',
SUM(IF(age BETWEEN 40 and 49,1,0)) AS '40-49',
SUM(IF(age BETWEEN 50 and 59,1,0)) AS '50-59',
SUM(IF(age BETWEEN 60 and 69,1,0)) AS '60-69',
SUM(IF(age BETWEEN 70 and 79,1,0)) AS '70-79',
SUM(IF(age BETWEEN 80 and 89,1,0)) AS '80-89',
SUM(IF(age>89,1,0)) AS 'Over 89',
FROM '{urls[0]}'
"""
).df()

The result is a single-row DataFrame with one count per age band.
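To make the logic of the SUM(IF(...)) expressions concrete, here is a plain-Python sketch of the same banding. Note that BETWEEN is inclusive at both ends, so 40 and 49 both fall into the 40-49 band. The ages below are made-up integer values for illustration, not values from the dataset, and age_band is a hypothetical helper.

```python
from collections import Counter


def age_band(age):
    """Return the label of the band an integer age falls into."""
    if age < 40:
        return "Under 40"
    if age > 89:
        return "Over 89"
    low = (age // 10) * 10  # e.g. 63 -> 60
    return f"{low}-{low + 9}"


# Hypothetical ages, chosen to hit every band and both edges of 40-49.
sample_ages = [35, 40, 49, 50, 63, 75, 82, 95]
counts = Counter(age_band(age) for age in sample_ages)
print(dict(counts))
```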

Using the same result, we could also draw a bar chart:

con.sql(f"""
SELECT
SUM(IF(age<40,1,0)) AS 'Under 40',
SUM(IF(age BETWEEN 40 and 49,1,0)) AS '40-49',
SUM(IF(age BETWEEN 50 and 59,1,0)) AS '50-59',
SUM(IF(age BETWEEN 60 and 69,1,0)) AS '60-69',
SUM(IF(age BETWEEN 70 and 79,1,0)) AS '70-79',
SUM(IF(age BETWEEN 80 and 89,1,0)) AS '80-89',
SUM(IF(age>89,1,0)) AS 'Over 89',
FROM '{urls[0]}'
"""
).df().T.plot.bar(legend=False)

Using the Datasets library

To make working with data from Hugging Face easy and efficient, Hugging Face has its own Datasets library (https://github.com/huggingface/datasets).

To install the datasets library, use the pip command:

!pip install datasets

The load_dataset() function loads the specified dataset:

from datasets import load_dataset

dataset = load_dataset('0-hero/OIG-small-chip2',
split='train')

When you load the dataset for the first time, the entire dataset (in Parquet format) is downloaded to your computer.

The returned dataset is of type datasets.arrow_dataset.Dataset. So what can you do with it? First, you can convert it to a Pandas DataFrame:

dataset.to_pandas()

You can also get the first row of the dataset by using an index:

dataset[0]

This will return the first row of the data:

{
'user': "I've heard that it's a good idea to have a will. What is a will?\n\n",
'chip2': "A will is a legal document that specifies how your property should be distributed after you die. It can also specify who should care for any children or other dependents you may have. It's important to make sure that your will is valid and up-to-date, since the laws governing wills vary from state to state."
}

There are many other things you can do with this datasets.arrow_dataset.Dataset object; I will leave them for you to explore.

Streaming the dataset

Again, when dealing with large datasets, it is often not feasible to download the entire dataset to your computer before doing anything with it. In the previous section, calling the load_dataset() function downloaded the entire dataset onto my computer.

This particular dataset took up 82.2 MB of disk space. You can imagine the time and disk space needed for larger datasets.

Fortunately, the Datasets library supports streaming. Dataset streaming lets you work with a dataset without downloading it — the data is streamed as you iterate over the dataset. To use streaming, set the streaming parameter to True in the load_dataset() function:

from datasets import load_dataset

dataset = load_dataset('0-hero/OIG-small-chip2',
split='train',
streaming=True)

The type of dataset is now datasets.iterable_dataset.IterableDataset, instead of datasets.arrow_dataset.Dataset. So how do you use it? You can use the iter() function on it, which returns an iterator object:

i = iter(dataset)

To get a row, call the next() function, which returns the next item in an iterator:

next(i)

You will now see the first row as a dictionary:

{
'user': "I've heard that it's a good idea to have a will. What is a will?\n\n",
'chip2': "A will is a legal document that specifies how your property should be distributed after you die. It can also specify who should care for any children or other dependents you may have. It's important to make sure that your will is valid and up-to-date, since the laws governing wills vary from state to state."
}

Calling the next() function on i again will return the next row:

{
'user': 'How do I find a job?\n\n',
'chip2': 'The best way to find a job is to create a strong resume and cover letter that highlights your skills, experience, and qualifications. You can also search online for job openings in your area of expertise or contact local employers directly to inquire about available positions. Networking with people in the same field or industry can also be helpful in finding employment opportunities.'
}

And so on.
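Calling next() repeatedly gets tedious when you want several rows at once. Since a streamed dataset is an ordinary Python iterable, itertools.islice can take the first few rows without consuming the rest. The sketch below uses a generator as a hypothetical stand-in for the streamed dataset so that it runs without any download.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields one row dict at a time."""
    for idx in range(1_000_000):
        yield {"row_idx": idx}


# Only three rows are ever produced; the generator is never run to the end.
first_three = list(islice(fake_stream(), 3))
print(first_three)
```

With a real streamed dataset, you would pass the dataset object to islice() in place of fake_stream().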

Shuffling the dataset

You can also shuffle the dataset by using the shuffle() function on the dataset variable, like this:

shuffled_dataset = dataset.shuffle(seed = 42, 
buffer_size = 500)

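In the above example, say your dataset has 10,000 rows. The shuffle() function first fills a buffer with the first 500 rows and then randomly selects examples from that buffer; each selected example is replaced in the buffer by the next row from the dataset.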

By default, the buffer size is 1,000.
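The idea behind buffer-based shuffling can be sketched in a few lines of plain Python. This is an illustration of the general technique under my own assumptions, not the Datasets library's actual implementation, and buffered_shuffle is a hypothetical name.

```python
import random


def buffered_shuffle(rows, buffer_size, seed=None):
    """Yield rows in approximately shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for row in rows:
        if len(buffer) < buffer_size:
            # Fill the buffer before yielding anything.
            buffer.append(row)
            continue
        # Pick a random buffered row, yield it, and put the new row in its place.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = row
    # The stream is exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=42))
print(shuffled)
```

A small buffer keeps memory use low but only mixes nearby rows: the first row yielded always comes from the first buffer_size rows of the stream. A larger buffer gives a more thorough shuffle at the cost of more memory.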

Other tasks

You can perform more tasks using streaming, such as:

  • Splitting the dataset
  • Interleaving the dataset — combining two datasets by alternating rows between each dataset
  • Modifying the columns of a dataset
  • Filtering a dataset

Check out https://huggingface.co/docs/datasets/stream for more details.

In this article, I have shown you how you can access the datasets stored on Hugging Face Hub. Because the datasets are available in Parquet format, you can access them remotely without downloading each dataset in its entirety. You can access the datasets using either DuckDB or the Datasets library provided by Hugging Face.


