Those developing machine learning (ML) models, or just getting into ML for the first time, have it good: never before have so many open source datasets been freely available to get you started.
Access to open source datasets brings a number of benefits. For starters, it allows you to focus on developing your model rather than on data management, which would otherwise require you to first gather and curate large collections of data. Existing datasets also make it easy to see what goes into a dataset, as they have already been wrangled and labeled for classification. And thanks to the efforts of other ML practitioners, these datasets have proven themselves effective for training and validating models.
Where do you begin? To help answer this question, we've put together our top five picks of open source datasets that you can use in your ML projects today.
1. MNIST Image Database
Although somewhat of a cliché nowadays, the MNIST (Modified National Institute of Standards and Technology) database contains 60,000 images of handwritten digits from 0 through 9 for training your models, and an additional 10,000 for testing and validation. This database is popular with those learning how to solve the classical problem of identifying digits from images (e.g., via a neural network).
The database actually contains two sets of data: one with the images themselves, and a second containing the labels corresponding to each image. Each image is a two-dimensional, 28x28-pixel, anti-aliased array, where each pixel is a grayscale value from 0 through 255.
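As an illustration, assuming you have TensorFlow installed, you can load MNIST in a single call via its built-in Keras datasets module (a minimal sketch; downloading and caching the files is handled by Keras):

```python
import tensorflow as tf

# Load MNIST: 60,000 training and 10,000 test images, with matching labels.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Each image is a 28x28 array of grayscale values (0-255);
# scale them to [0, 1] before feeding a neural network.
x_train = x_train / 255.0
x_test = x_test / 255.0

print(x_train.shape)  # (60000, 28, 28)
print(x_test.shape)   # (10000, 28, 28)
```

From here, a flattened 784-value vector per image is a common input format for a simple classifier.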
2. Kaggle Datasets
Here at PerceptiLabs we're huge fans of Jupyter Notebook. So it should be no surprise that we're also huge fans of Kaggle, which hosts a collection of notebooks for ML and has made numerous datasets available. This top pick isn't so much about a single dataset, but rather Kaggle's vast collection of freely-available datasets.
Kaggle's datasets span a huge range of areas, from stock market analysis to sports, and come in all dimensions and sizes. At the time of writing, during the COVID-19 pandemic, the majority of the most recent datasets on Kaggle relate to different aspects of the virus.
Numerous datasets are associated with Kaggle Kernels (a Kernel is a combination of code, input, and output, as described here), which contain notebooks with all sorts of information and even code for working with the dataset.
3. IMDb Dataset
The IMDb, or Internet Movie Database, is an online database of information and reviews about movies, games, and other media. An IMDb dataset has been created containing the text of 50,000 movie reviews, split into 25,000 reviews for training and 25,000 for testing and validation.
This dataset is useful for building models which analyze and classify text, a common application for ML. And as demonstrated in this tutorial, the IMDb dataset can be used as the basis for a model that performs transfer learning and binary classification from qualitative data (i.e., the sentences describing the reviews of movies).
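To see the dataset's structure, the Keras datasets module (assuming TensorFlow is installed) ships a pre-tokenized version, where each review is a sequence of word indices and each label is 0 (negative) or 1 (positive):

```python
import tensorflow as tf

# Load the IMDb reviews, keeping only the 10,000 most frequent words;
# rarer words are replaced with an out-of-vocabulary marker.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=10000)

print(len(x_train), len(x_test))  # 25000 25000

# Decode the first review back to text. Indices 0-2 are reserved
# for padding/start/unknown markers, hence the offset of 3.
word_index = tf.keras.datasets.imdb.get_word_index()
index_to_word = {i + 3: w for w, i in word_index.items()}
first_review = " ".join(index_to_word.get(i, "?") for i in x_train[0])
print(first_review[:80])
```

The integer encoding is convenient for feeding an embedding layer directly, which is the usual first step in a text-classification model.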
4. Auto MPG Dataset
Regression problems (i.e., predicting the output of a continuous value such as a person's age) are another area where ML can really shine.
For those getting started with building an ML model for regression, check out the Auto MPG Dataset from the UCI Machine Learning Repository. The Auto MPG Dataset contains the following information about cars from the 1970s and 1980s:
- MPG (miles per gallon)
- Number of cylinders
- Displacement
- Horsepower
- Weight
- Acceleration
- Model year
- Origin
- Vehicle name
The data is provided via three ASCII text files that you can open in any text editor:
- auto-mpg.data: contains the updated version of the dataset.
- auto-mpg.names: contains documentation describing the format of auto-mpg.data.
- auto-mpg.data-original: contains the original dataset in which a number of MPG values were unknown.
Of these files, you will likely want to build a model using auto-mpg.data as the data source (auto-mpg.names simply documents its format). Using this data you could, for example, train a model that predicts the MPG of a vehicle, as described in this TensorFlow tutorial. And because you can write Python code and work with TensorFlow in PerceptiLabs, you can easily create a similar model there.
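To sketch what loading auto-mpg.data might look like (assuming pandas is installed; the column names follow the documentation in auto-mpg.names, and the URL points to the dataset's location in the UCI repository):

```python
import pandas as pd

# Location of auto-mpg.data in the UCI Machine Learning Repository.
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"

# Column names taken from the auto-mpg.names documentation file.
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model_year", "origin"]

# Fields are space-delimited, with "?" marking unknown horsepower values.
# The vehicle name follows a tab on each line, so comment="\t" drops it,
# leaving only the numeric columns.
df = pd.read_csv(url, names=columns, na_values="?",
                 comment="\t", sep=" ", skipinitialspace=True)

print(df.shape)  # (398, 8)
print(df["horsepower"].isna().sum())  # 6 unknown values
```

With the data in a DataFrame, you can drop or impute the missing horsepower values and use the remaining columns as regression features.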
5. Gym Environments
Reinforcement learning has become popular as ML practitioners seek to teach machines to learn in more general ways, as humans do. The Gym toolkit provides a number of environments that can be used for control theory, continuous control, and reward-based reinforcement learning.
There are a wide variety of environments ranging from Atari game RAM observations and RGB screen images, to simple text-based "grids" composed of letters representing safe and unsafe places to navigate.
Currently PerceptiLabs allows you to select from three Atari game Gym environments (Breakout, Bank Heist, and Demon Attack) that you can include in your model using our Environment component. Selecting other environments is as simple as modifying the component's code.
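To get a feel for the Gym API outside of PerceptiLabs, here is a minimal random-agent loop. This is a sketch assuming the gym package (pre-0.26 API, where step() returns four values) is installed; it uses the lightweight CartPole environment rather than an Atari game, which would additionally require the Atari ROMs:

```python
import gym

# Create a simple classic-control environment (no extra dependencies needed).
env = gym.make("CartPole-v1")

observation = env.reset()
total_reward = 0.0

for _ in range(100):
    # Sample a random action; a real agent would choose based on the observation.
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        observation = env.reset()

env.close()
print(total_reward)
```

Swapping in an Atari environment is a one-line change to the gym.make() call, which mirrors how PerceptiLabs' Environment component lets you switch environments by editing its code.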
Using Datasets in PerceptiLabs
Datasets are easy to use in PerceptiLabs and you have a few options for importing them.
For dataset files that you download, simply drag and drop a Data component onto your workspace, double-click it, select the file as described in Step 1 of our Quickstart Guide, and click Apply.
For programmatically loading datasets there are two options:
- select and apply a file to a Data component in your workspace as described above, and then modify the component's Python code to invoke APIs that read the data; or
- drag and drop a Custom component into your workspace and write Python code that invokes data-reading APIs.
We hope these top picks provide a good starting point for finding datasets that you can use in your projects. And, we encourage you to visit our community resources to share your experiences with datasets and to learn how others are working with their datasets in PerceptiLabs.