AI Benchmark to Measure Machine Learning Performance
I install and run the AI Benchmark to measure machine learning performance on Windows Native and Windows Subsystem for Linux.
Introduction
Over the past two years, I’ve consistently included the AI Benchmark in almost all SkatterBencher overclocking guides. My first use of this benchmark was overclocking the UHD Graphics 750 in SkatterBencher #28. The reason I use this benchmark so often is quite apparent to anyone following recent trends: AI and deep learning have become significant topics.
AI Benchmark Alpha is an open-source Python library designed to assess the AI performance of different hardware platforms, including CPUs, GPUs, and TPUs. It relies on the TensorFlow machine learning library, offering a lightweight solution to measure inference and training speed for essential Deep Learning models.
Installing and running AI Benchmark is slightly more complicated than you’d expect. Since the process has evolved since I first introduced the benchmark, I thought providing an updated installation and benchmarking guide would be helpful.
In this post, I explain how to install and run the AI Benchmark on Windows Native and Windows Subsystem for Linux. I’ll assume you have already satisfied the basic Windows machine learning requirements, such as a compatible Windows 10 or 11 operating system, suitable hardware, and the necessary drivers. At the end of the blog post, I’ll also provide some quick benchmark numbers of the EK Flat PC featuring the NVIDIA RTX A5000 GPU.
Windows ML & DirectML
Historically, TensorFlow and machine learning were easiest to set up on Linux. However, much progress has been made to enable them on Windows machines. Nowadays, Windows is well equipped for machine learning workloads.
Windows ML is a high-performance, reliable API for deploying hardware-accelerated ML inferences on Windows devices. It is available for Windows 8.1 or higher and is included with the Windows installation since Windows 10 version 1809.
Direct Machine Learning, or DirectML, is a component under the Windows Machine Learning umbrella. The higher-level WinML API is primarily model-focused, with its load-bind-evaluate workflow. But if you’re counting milliseconds and squeezing frame times, then DirectML will meet your machine learning needs. So for real-time, high-performance, low-latency, and resource-constrained scenarios, such as measuring benchmark performance, we should use DirectML rather than Windows ML.
DirectML requires a DirectX 12 capable device. Almost all commercially-available graphics cards released in the last several years support DirectX 12.
I first used DirectML in 2021 when I started working with AI Benchmark. The reason wasn’t better accuracy but compatibility, since the TensorFlow machine learning library did not support Intel graphics. Nowadays, I use it because TensorFlow 2.10 was the last TensorFlow release that supported GPU on Windows Native. Starting with TensorFlow 2.11, you must use TensorFlow in WSL 2 or the TensorFlow-DirectML-Plugin in Windows Native.
A couple of additional notes before we get started. First, we’ll need to set up a Python environment since AI Benchmark is a Python application. I find that the most convenient method is to use Anaconda. Second, we’ll be using TensorFlow 2 for this installation guide.
Installing AI Benchmark on Windows Native
- First, download and install Anaconda for Windows.
- After completing the installation, run the Anaconda Prompt.
- Create a new Python environment for the benchmark. Make sure to specify Python version 3.10, as the tensorflow-directml-plugin supports only versions 3.7, 3.8, 3.9, and 3.10.
- conda create -n aibench python=3.10
- Activate your newly created environment
- conda activate aibench
- Download and install the base TensorFlow-CPU
- pip install tensorflow-cpu
- Download and install the tensorflow-directml-plugin package
- pip install tensorflow-directml-plugin
- Download and install the numpy 1.23 package
- pip install numpy==1.23
- Download and install the ai_benchmark package
- pip install ai_benchmark
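After the installation, a quick sanity check of the environment can save some head-scratching later. The snippet below is a minimal sketch, not part of AI Benchmark itself: the helper names (`python_supported`, `installed_versions`, `report`) are my own, and it assumes Python 3.8 or newer because it relies on `importlib.metadata`. It checks the Python version against the range listed above and reports which of the four packages pip has installed.

```python
import sys
from importlib import metadata

# Python versions that tensorflow-directml-plugin supports, per the list above.
SUPPORTED_VERSIONS = {(3, 7), (3, 8), (3, 9), (3, 10)}

# Package names exactly as used in the pip commands above.
PACKAGES = ("tensorflow-cpu", "tensorflow-directml-plugin", "numpy", "ai_benchmark")


def python_supported(version=None):
    """Return True if the (major, minor) version is in the supported set."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) in SUPPORTED_VERSIONS


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report():
    """Print a short summary of the active environment."""
    status = "supported" if python_supported() else "NOT supported by the plugin"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Calling `report()` inside the activated aibench environment should list a version next to each package; a "not installed" entry points at the step to repeat.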
Installing AI Benchmark on Windows Subsystem for Linux
- First, make sure WSL is correctly installed on your Windows PC.
- Open PowerShell and type wsl --install
- Follow the installation instructions
- Once WSL is installed, open the WSL prompt.
- Now download the Anaconda for Linux 64-Bit (x86) installer.
- Browse to ./Users/{your_pc_name}/Downloads (or any folder you have write permissions on)
- wget https://repo.anaconda.com/archive/Anaconda3-2023.07-2-Linux-x86_64.sh (or latest Linux package)
- Next, install Anaconda
- bash ./Anaconda3-2023.07-2-Linux-x86_64.sh (run from the folder you downloaded it to; adjust the file name if you downloaded a newer package)
- Then, close and reopen Windows Subsystem for Linux
- Create a new Python environment for the benchmark. Make sure to specify Python version 3.10, as the tensorflow-directml-plugin supports only versions 3.7, 3.8, 3.9, and 3.10.
- conda create -n aibench python=3.10
- Activate your newly created environment
- conda activate aibench
- Download and install the base TensorFlow-CPU
- pip install tensorflow-cpu
- Download and install the tensorflow-directml-plugin package
- pip install tensorflow-directml-plugin
- Download and install the numpy 1.23 package
- pip install numpy==1.23
- Download and install the ai_benchmark package
- pip install ai_benchmark
Running the AI Benchmark
Starting the AI Benchmark is the same on Windows Native and Windows Subsystem for Linux.
- Open the Anaconda prompt
- Activate the conda environment with AI Benchmark
- conda activate aibench
- Start Python
- python
- Import the AI Benchmark package
- from ai_benchmark import AIBenchmark
- Specify the benchmark parameters
- benchmark = AIBenchmark(use_CPU=None, verbose_level=3)
- use_CPU=True runs the benchmark on the CPU
- use_CPU=None runs the benchmark on the GPU
- verbose_level=0 runs the test silently
- verbose_level=1 runs the test with a short summary
- verbose_level=2 provides information about each run
- verbose_level=3 provides the TensorFlow logs during the run
- Then, lastly, start the benchmark
- benchmark.run()
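If you run the benchmark often, typing the steps into an interactive session gets tedious. The sketch below wraps them in a small script. The function names `use_cpu_flag` and `run_ai_benchmark` are hypothetical helpers of mine; only `AIBenchmark`, `use_CPU`, `verbose_level`, and `run()` come from the steps above.

```python
def use_cpu_flag(device):
    """Translate 'cpu' or 'gpu' into the use_CPU argument described above."""
    device = device.lower()
    if device == "cpu":
        return True   # use_CPU=True runs the benchmark on the CPU
    if device == "gpu":
        return None   # use_CPU=None runs the benchmark on the GPU
    raise ValueError(f"device must be 'cpu' or 'gpu', got {device!r}")


def run_ai_benchmark(device="gpu", verbose_level=1):
    """Run the full benchmark and return whatever benchmark.run() returns."""
    # Imported lazily so that use_cpu_flag() works without TensorFlow installed.
    from ai_benchmark import AIBenchmark

    benchmark = AIBenchmark(use_CPU=use_cpu_flag(device), verbose_level=verbose_level)
    return benchmark.run()
```

With the aibench environment active, calling `run_ai_benchmark("cpu")` or `run_ai_benchmark("gpu", verbose_level=3)` from Python mirrors the manual steps.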
AI Benchmark Optimizations and Tricks
Since deep neural network and machine learning performance is a big selling point, companies work hard to release performance-optimizing software packages for their hardware. I tend to use these packages in my SkatterBencher guides.
oneDNN for Intel CPU Architectures
Intel oneDNN is an open-source, high-performance library designed to accelerate deep learning applications on Intel architecture CPUs. It provides optimized primitives for various deep learning operations, such as convolutions, inner products, and other key operations used in neural networks.
The oneAPI Deep Neural Network Library (oneDNN) optimizations are available in the official x86-64 TensorFlow after v2.5. The feature is off by default before v2.9, but users can enable those CPU optimizations by configuring the environment variable TF_ENABLE_ONEDNN_OPTS.
- Windows Native: set TF_ENABLE_ONEDNN_OPTS=1
- Windows Subsystem for Linux: export TF_ENABLE_ONEDNN_OPTS=1
Since TensorFlow v2.9, the oneAPI Deep Neural Network Library (oneDNN) optimizations are enabled by default.
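The same variable can also be set from within Python, which is handy when you want to compare runs with the optimizations on and off. This is a minimal sketch with a helper name of my own (`set_onednn_opts`). It assumes the variable has to be in place before TensorFlow is imported, which is why it warns if TensorFlow is already loaded.

```python
import os
import sys
import warnings


def set_onednn_opts(enabled=True):
    """Set TF_ENABLE_ONEDNN_OPTS to '1' or '0' for the current process."""
    if "tensorflow" in sys.modules:
        # Assumption: TensorFlow reads the variable when it is first imported.
        warnings.warn("TensorFlow is already imported; the setting may be ignored.")
    os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1" if enabled else "0"
    return os.environ["TF_ENABLE_ONEDNN_OPTS"]
```

Call `set_onednn_opts(True)` or `set_onednn_opts(False)` first, and only then import `ai_benchmark`.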
ZenDNN for AMD CPU Architectures
I also came across an AMD equivalent library called ZenDNN. However, I’ve yet to try this on an AMD system, so I’ll leave you a link for now.
cuDNN or TensorRT for NVIDIA GPU Architectures
You can also rely on the NVIDIA CUDA Deep Neural Network library (cuDNN) for NVIDIA GPUs. cuDNN is a GPU-accelerated library of primitives for deep neural networks. The installation requires the regular TensorFlow package rather than the TensorFlow-DirectML-Plugin. Since TensorFlow 2.10 was the last release to support the GPU on Windows Native, I install it on WSL.
The installation is pretty straightforward. After installing Anaconda for Linux on WSL, do the following:
- First, create and activate a new Anaconda environment
- conda create -n aibenchNV python=3.10
- conda activate aibenchNV
- Then, install the appropriate CUDA Toolkit. You can find a support matrix on NVIDIA’s website.
- conda install -c conda-forge cudatoolkit=11.8
- Next, install the cuDNN package
- pip install nvidia-cudnn-cu11==8.9.4.25
- Then, install TensorFlow
- pip install tensorflow
- Then, install numpy 1.23
- pip install numpy==1.23
- Lastly, install AI Benchmark
- pip install ai_benchmark
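Before starting a long benchmark run, it can be worth confirming that TensorFlow actually sees the GPU. The sketch below uses TensorFlow's `tf.config.list_physical_devices`; the wrapper `visible_gpus` is a hypothetical helper of mine that returns `None` when TensorFlow isn't installed in the active environment.

```python
def visible_gpus():
    """Return the GPU device names TensorFlow can see, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow loaded but found no usable GPU.
    return [device.name for device in tf.config.list_physical_devices("GPU")]
```

If `visible_gpus()` returns an empty list, the CUDA Toolkit and cuDNN versions are the first things I would double-check against the support matrix.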
Now you can run the AI Benchmark, as I explained earlier in this post.
Instead of cuDNN, you can also consider installing the TensorRT Python library. Similarly, TensorRT is a deep-learning library powered by CUDA. TensorRT provides APIs and parsers to import trained models from all major deep learning frameworks. It then generates optimized runtime engines deployable in the data center, automotive, and embedded environments.
ROCm for AMD GPU Architectures
Lastly, I also want to mention AMD’s ROCm software package. ROCm is an open-source software stack for GPU computation featuring a collection of drivers, development tools, and APIs enabling GPU programming from the low-level kernel to end-user applications.
While ROCm is fully integrated into ML frameworks such as PyTorch and TensorFlow, it’s currently unavailable on Windows Native, so I’ve yet to use ROCm optimizations for AMD graphics cards. So, just like with ZenDNN, I’ll leave you with a link to the documentation.
Multiple GPUs
These days, it’s common to have multiple graphics devices in a single system. Usually, that’s the integrated graphics of the CPU and a high-performance discrete graphics card. If you want to switch between the graphics devices, you can set an environment variable before starting Python.
- Windows Native: set DML_VISIBLE_DEVICES=0,1
- Windows Subsystem for Linux: export DML_VISIBLE_DEVICES=0,1
Each number indicates a specific device. On the EK Flat PC, device 0 is the NVIDIA dGPU, and device 1 is the Intel iGPU. By default, AI Benchmark will run on the first available device. So, if I want to run on the iGPU, I’d have to set DML_VISIBLE_DEVICES=1.
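If you prefer to pick the device from a script rather than the shell, the variable can be set in Python as well. This is a small sketch with a helper name of my own (`select_dml_devices`). It assumes the variable must be set before TensorFlow and the DirectML plugin are imported, and the device indices follow the numbering described above.

```python
import os


def select_dml_devices(*indices):
    """Set DML_VISIBLE_DEVICES to a comma-separated list of device indices."""
    if not indices:
        raise ValueError("pass at least one device index")
    for index in indices:
        if not isinstance(index, int) or index < 0:
            raise ValueError(f"device indices must be non-negative integers, got {index!r}")
    value = ",".join(str(index) for index in indices)
    # Assumption: this runs before TensorFlow (and the plugin) are imported.
    os.environ["DML_VISIBLE_DEVICES"] = value
    return value
```

On the EK Flat PC, `select_dml_devices(1)` would expose only the Intel iGPU, while `select_dml_devices(0, 1)` matches the example above.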
DXGI_ERROR_DEVICE_REMOVED
The AI Benchmark is a pretty tough benchmark that can take a long time. Sometimes, you may run into an error called DXGI_ERROR_DEVICE_REMOVED while running the benchmark. That happens when the device takes too long to complete a workload and Windows triggers a timeout.
You can increase the timeout with the registry entry “TdrDelay” to solve the issue. This registry entry extends the time Windows waits for the GPU to respond before it triggers the timeout. See https://docs.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
- KeyPath: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
- KeyValue: TdrDelay
- ValueType: REG_DWORD
- ValueData: Number of seconds to delay. The default value is 2 seconds.
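For those who would rather script the change than click through the Registry Editor, here is a minimal sketch that builds the matching `reg add` command. The helper names and the 10-second value used in the example are my own illustrative choices, not a recommendation. Running the command requires an elevated (administrator) prompt, and as far as I know the new value only takes effect after a restart.

```python
import subprocess

# Registry key from the list above, using the HKLM abbreviation that reg.exe accepts.
TDR_KEY = r"HKLM\System\CurrentControlSet\Control\GraphicsDrivers"


def tdr_delay_command(seconds):
    """Build the reg.exe command that sets TdrDelay to the given number of seconds."""
    if not isinstance(seconds, int) or seconds < 1:
        raise ValueError("seconds must be a positive integer")
    return [
        "reg", "add", TDR_KEY,
        "/v", "TdrDelay",      # value name
        "/t", "REG_DWORD",     # value type
        "/d", str(seconds),    # value data
        "/f",                  # overwrite without prompting
    ]


def apply_tdr_delay(seconds):
    """Run the command; Windows only, from an elevated prompt."""
    subprocess.run(tdr_delay_command(seconds), check=True)
```

For example, `apply_tdr_delay(10)` would raise the timeout from the default 2 seconds to 10 seconds.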
AI Benchmark Performance of EK Flat PC
To end the blog post, I’d like to show you how I use the AI Benchmark performance to characterize the machine learning performance of a system. The system I have on hand is the EK Flat PC, a small form factor computer targeted at the High-Performance Embedded Compute market (HPEC).
The High-Performance Embedded Computing (HPEC) market represents a convergence of two distinct yet related domains: High-Performance Computing (HPC) and the embedded systems market. Let’s break down each component and how they come together in the HPEC market:
- High-Performance Computing (HPC): HPC players use supercomputers and high-end computing clusters to solve complex and compute-intensive problems. These systems are designed to handle massive amounts of data and perform calculations at high speeds. HPC is commonly used in scientific research, simulations, weather forecasting, financial modeling, and other applications that require significant computational power.
- Embedded Systems: Embedded systems are specialized computing systems designed to perform specific tasks within a larger system. Embedded systems are often deployed in resource-constrained scenarios, such as limited space or unique environmental characteristics.
HPEC (High-Performance Embedded Computing) emerges at the intersection of these two markets by bridging the gap between the computational capabilities of HPC and the practical constraints of embedded systems. It enables embedded devices to handle demanding tasks traditionally found in high-performance computing, expanding the range of applications that can benefit from advanced computational capabilities.
The EK Flat PC comes in an HPEC-size package measuring only 2.3 × 19.8 × 40.2 cm, or 2070 cm³ (about 2 liters), and weighs less than 5 kg. The weight is a consequence of the internal water block, which cools the various compute hardware, including a CPU and a GPU. The Flat PC connects to an external cooling unit. Liquid cooling helps keep the CPU and GPU cool enough to boost to their maximum frequencies.
In my case, I hooked up the Flat PC to an external 360 radiator for testing.
The EK Flat PC has three DirectML-capable hardware components: the Core i9-11900H Tiger Lake CPU, the Intel® UHD Graphics for 11th Gen Intel® Processors integrated graphics, and a discrete NVIDIA RTX A5000 Laptop GPU. Combined, they have a compute performance of about 20 TFLOPS. For this Flat PC, that works out to about 10 GFLOPS per cm³ (10 TFLOPS per liter).
When we put these components through their paces in AI Benchmark, we can see why NVIDIA is the current King of AI. The GPU outperforms the CPU in AI Benchmark many times over.
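The compute density figure is simple arithmetic on the numbers quoted above. As a quick sketch, using the approximate 20 TFLOPS total and the stated 2070 cm³ volume (the function name `compute_density` is my own):

```python
TOTAL_TFLOPS = 20.0   # approximate combined compute performance quoted above
VOLUME_CM3 = 2070.0   # package volume quoted above, in cubic centimeters


def compute_density(total_tflops, volume_cm3):
    """Return (GFLOPS per cubic centimeter, TFLOPS per liter)."""
    gflops_per_cm3 = total_tflops * 1000.0 / volume_cm3   # 1 TFLOPS = 1000 GFLOPS
    tflops_per_liter = total_tflops / (volume_cm3 / 1000.0)  # 1 liter = 1000 cm3
    return gflops_per_cm3, tflops_per_liter


gflops_per_cm3, tflops_per_liter = compute_density(TOTAL_TFLOPS, VOLUME_CM3)
print(f"{gflops_per_cm3:.2f} GFLOPS per cubic centimeter")
print(f"{tflops_per_liter:.2f} TFLOPS per liter")
```

Both values come out at roughly 9.66, which rounds to the "about 10" figure above.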
Conclusion
Anyway, that’s all for today! I hope you enjoyed this short blog post and are eager to run AI Benchmark on your own system!
I want to thank my Patreon supporters for supporting my work. If you have any questions or comments, please drop them in the comment section below.
See you next time!