Benchmarking DL Models: How to do it Right

In this post, we discuss why benchmarking models is key to deriving an optimal version, and show how PerceptiLabs makes it easy for you.

Merriam-Webster defines benchmark as "something that serves as a standard by which others may be measured or judged." For DL practitioners, that something is a benchmark DL model against which you compare other models.

In Using Evaluations and Comparisons to Make Awesome DL Models, we touched on how comparing models with different settings helps you find the one with the best predictive performance. By developing a benchmark DL model, you can experiment and iterate on several new DL models to see how much each increases predictive performance.

So, let's dive into model comparisons using benchmarks a bit further.

Not That Kind of Benchmarking

Before we go in-depth, note that benchmarking can also refer to the process of benchmarking a state-of-the-art (SOTA) model (e.g., an experimental model) against a baseline model, which is considered by the DL community to be the accepted solution for a specific task. This type of benchmarking is often done using the same established dataset(s) used for the baseline model (e.g., the Microsoft COCO dataset).

While you can certainly use PerceptiLabs for that kind of benchmarking, what we're talking about in this blog is creating your own benchmark and experimental models in your quest to find the optimal model for your problem.

Prepare to Benchmark

When you first start benchmarking, you may have an existing model (e.g., from a past project deployment), or you may be creating a new model from scratch. Either way, there are a number of considerations.

First, ensure you're working with quality, curated data samples. This means using a dataset of sufficient size and one free of different types of biases. Biased datasets can significantly impact your model's predictive performance, especially those with imbalanced classes. If you need further convincing, and have a Netflix subscription, check out Coded Bias, a documentary that discusses the consequences of biased data. Also, be sure to read our blog: Working with Unbalanced Datasets.
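A quick sanity check for class imbalance can be sketched in a few lines of Python. The labels and the 20% threshold below are illustrative assumptions; substitute your own dataset's label column and whatever imbalance tolerance makes sense for your task:

```python
from collections import Counter

def check_class_balance(labels, warn_ratio=0.2):
    """Count samples per class and flag any class whose count falls
    below warn_ratio of the largest class's count."""
    counts = Counter(labels)
    largest = max(counts.values())
    flagged = {cls: n for cls, n in counts.items()
               if n < warn_ratio * largest}
    return counts, flagged

# Illustrative labels -- replace with your dataset's label column.
labels = ["cat"] * 900 + ["dog"] * 850 + ["bird"] * 50
counts, flagged = check_class_balance(labels)
print(flagged)  # "bird" is heavily under-represented relative to "cat"
```

A flagged class is a cue to gather more samples, resample, or weight the loss before you treat any model trained on that data as a trustworthy benchmark.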

Next up, ensure you maintain consistency across your tests, especially the following:

  • Use the same modeling and testing tool(s) and version(s) across your tests. This ensures that your test apparatus remains the same and avoids introducing variables that different tools or tests could throw into the mix.
  • Use the same dataset during training.
  • Ensure your tool produces results that are measurable, comparable, and reproducible and, ideally, offers some form of explainability.
  • Consider keeping each test's randomizer seed the same when you want to reproduce previous test results, or if you are working on marginal improvements.
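The seed-consistency point above can be sketched with Python's built-in `random` module. Note that DL frameworks keep their own random state (e.g., NumPy and TensorFlow each have separate seeding calls), so this snippet only illustrates the idea:

```python
import random

SEED = 42  # keep this fixed across runs you want to compare

def seeded_shuffle(items, seed=SEED):
    """Shuffle a copy of items deterministically so every run
    sees the samples in the same order."""
    rng = random.Random(seed)  # local RNG; avoids touching global state
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled

run_a = seeded_shuffle(range(10))
run_b = seeded_shuffle(range(10))
print(run_a == run_b)  # True: identical seed gives identical ordering
```

Pinning the seed like this means a change in results between two runs can be attributed to the parameter you changed, not to randomness.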

Finally, it's best practice to only change one parameter at a time (e.g., don't iterate on both the model architecture and number of iterations at the same time). This makes it easier to identify what effect each change has.

Thankfully, PerceptiLabs makes it easy to adhere to these practices.

Prepare to Document Your Benchmarking

Much of your time will be spent making changes to your models and observing the resulting changes in training and test results. Having a formal note-taking process to record this information is key to making comparisons.

Note-taking functionality will soon be added to PerceptiLabs so you will be able to take notes and keep them stored with your model. Until then, one workaround is to add non-connected Custom Components to your model and store notes as code comments, as suggested in this forum posting. Alternatively, a good old-fashioned pen and notepad (or a text document) will also work.

In addition, most of PerceptiLabs' Training View panes include a button to take a screenshot of the respective output. These screenshots can be invaluable for capturing data points during training.

Benchmark it with PerceptiLabs

Once you’ve created your benchmark model, the benchmarking process involves creating new DL models and then adjusting their training and data settings, modifying their topologies, and even changing dataset training/test splits. Since each model stores all of your model parameters and settings, you can effectively treat each new model you create as a new version. All of this is very easy to do in PerceptiLabs.
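One of the adjustments mentioned above, changing the training/test split, can be sketched in plain Python. The fraction and seed here are illustrative; the point is that a fixed seed keeps the split reproducible while you vary it deliberately between model versions:

```python
import random

def split_dataset(samples, test_fraction=0.2, seed=42):
    """Deterministically shuffle and split samples into train/test
    lists. Fraction and seed are illustrative -- vary per experiment."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(range(100), test_fraction=0.2)
print(len(train), len(test))  # 80 20
```

Because the split is a parameter like any other, change it on its own iteration rather than alongside architecture or hyperparameter changes.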

To quickly create new models from the same dataset, use PerceptiLabs' Overview screen. The Overview screen helps you manage your models and also displays information for each model including their training statuses and the date and time when they were trained, as shown in Figure 1:

Figure 1: Overview screen for managing your models.

One tip here is to use descriptive model names that hint at what the model comprises. You can rename a model at any time in the Overview screen, and names can be as long as you need. You may also want to encode the version number into the name, for example:

  • MNIST Modified Conv v1

Note: Additional version control features will soon be added to PerceptiLabs.
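A naming scheme like the one above is easy to keep consistent with a small helper. The format itself is just an assumption; adapt it to whatever convention your team prefers:

```python
def model_name(base, version):
    """Compose a descriptive, versioned model name.
    The 'base vN' format is an illustrative convention."""
    return f"{base} v{version}"

print(model_name("MNIST Modified Conv", 1))  # MNIST Modified Conv v1
```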

As you experiment, you will generally rely on PerceptiLabs' Training and Evaluate views to ensure that performance during training is meeting your expectations.

The Training View provides real-time statistics during training. In general, focus on the Precision, Recall, and F1 Score in the view's Performance tab, as well as the Gradients in the View box as shown in Figure 2:

Figure 2: PerceptiLabs' Training View showing real-time statistics.

Use these to quickly determine if training is progressing in the right direction, or if it should be stopped early to make adjustments. When model training completes, be sure to take note of these, as well as the accuracy and global loss for training and validation.
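As a refresher on what those numbers mean: precision is the fraction of positive predictions that are correct, recall is the fraction of actual positives the model finds, and F1 is their harmonic mean. A minimal sketch, using illustrative label lists rather than real model output:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute binary precision, recall, and F1 from label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels for illustration: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
```

Knowing how these are computed makes it easier to reason about why, say, precision improved while recall dropped between two model versions.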

After training completes, PerceptiLabs' Evaluate View can run tests and display a confusion matrix, classification metrics (e.g., Accuracy, Precision, and Recall), segmentation metrics (e.g., IoU), and visualizations. Take note of these and compare them against tests in your other DL models. These will further corroborate your training observations to help determine if your model is better or worse than your other DL models and your benchmark model.
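For segmentation, IoU (Intersection over Union) is the overlap between the predicted and ground-truth masks divided by their combined area. A sketch over flat binary masks (real masks are 2D image arrays; flat lists keep the idea visible):

```python
def iou(pred_mask, true_mask):
    """Intersection over Union for flat binary masks (lists of 0/1)."""
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    return intersection / union if union else 1.0  # two empty masks agree

pred = [1, 1, 0, 0, 1]
true = [1, 0, 0, 1, 1]
print(iou(pred, true))  # 2 overlapping pixels / 4 in the union = 0.5
```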

You can select all of your trained models on the Evaluate View and run tests on them in a single operation. PerceptiLabs then displays their results together so you can compare them as shown in Figure 3:

Figure 3: PerceptiLabs' Evaluate View showing test results for three models.

Once you've trained and compared models using PerceptiLabs, it's time to try them out on real-world data. Set up a sandbox testing environment and run the models through a test suite to see how they perform, especially on specific edge cases. One option is to deploy to Gradio from PerceptiLabs and try a few key samples to see how your models behave.
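A sandbox test suite like the one described above can be as simple as running each candidate model over a fixed set of labeled edge-case samples and recording the failures. Everything in this sketch is hypothetical: `toy_model` stands in for your deployed model's inference call, and the edge cases stand in for your real hard samples:

```python
def run_edge_case_suite(predict, cases):
    """Run a model's predict function over labeled edge-case samples
    and report accuracy plus the cases it gets wrong."""
    failures = [(sample, expected, predict(sample))
                for sample, expected in cases
                if predict(sample) != expected]
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures

# Hypothetical stand-in model and edge cases, for illustration only.
toy_model = lambda x: "positive" if x >= 0 else "negative"
edge_cases = [(0, "positive"), (-0.0001, "negative"), (1e9, "positive")]
acc, fails = run_edge_case_suite(toy_model, edge_cases)
print(acc, fails)  # 1.0 [] -- this toy model passes all its edge cases
```

Running every candidate through the same fixed suite keeps the comparison apples-to-apples, just like keeping the dataset and seed fixed during training.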

Ship It!

Once you think you've found the optimal model, deploy it for real-world inference and consider it your new benchmark. However, do plan for continual benchmarking post deployment because new issues from real-world data will occur, and at some point the model will likely succumb to model decay or drift.
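A first, deliberately simple way to watch for that decay is to compare recent production accuracy against your benchmark accuracy. The 0.05 tolerance and the accuracy figures below are illustrative assumptions; in practice you would pick thresholds from your own acceptance criteria:

```python
def detect_drift(baseline_accuracy, recent_accuracies, tolerance=0.05):
    """Flag drift when mean recent accuracy drops more than
    `tolerance` below the benchmark accuracy (threshold illustrative)."""
    recent = sum(recent_accuracies) / len(recent_accuracies)
    return recent < baseline_accuracy - tolerance, recent

# Benchmark accuracy 0.92; recent production accuracy is slipping.
drifted, recent_mean = detect_drift(0.92, [0.90, 0.86, 0.82])
print(drifted)  # the recent average has fallen below the tolerance floor
```

When the flag trips, it's a signal to retrain or re-benchmark, starting the comparison cycle described in this post over again.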

Benchmark Your DL Models Today!

If you're not already using PerceptiLabs – what are you waiting for? Our Quickstart Guide will get you quickly running the free version, then put some of the advice from this blog into practice to benchmark your own DL models.