Benchmarks are vital tools for measuring the performance of Central Processing Units (CPUs) in the most popular consumer devices, writes Ajay Joshi. With the advent of more complicated use cases and compute workloads, benchmarks are becoming increasingly important in the home device market, which includes digital television (DTV) and set-top box (STB)/over-the-top (OTT) devices.

In the past 20 years, benchmarks have evolved at a rapid pace, moving from standalone measurements to reflecting real-world use cases to gain a more accurate representation of performance.

As a result, these more modern benchmarks are likely to be more beneficial for modern home devices, where the user experience is paramount. While many of the older standalone benchmarks could be considered “too simple” for the more complex compute workloads taking place on DTV and STB/OTT devices, they can still find a role in the early development stages of CPU IP.

The home device market

The home device market uses a variety of CPUs that span a wide range of performance and efficiency points. These range from Arm’s Cortex-M products, often used in sensors within home devices, right through to our Cortex-A products, which enable more complex use cases on DTV and STB/OTT devices, such as multi-stream video or advanced user interfaces.

Therefore, it is important to understand the spectrum of different benchmarks and how they measure overall system performance.

The different CPU benchmarks

At Arm, we use a spectrum of benchmarks to measure performance. The key categories are:

  • Synthetic
  • Micro
  • Kernel
  • Application
  • Use case-based benchmarks

Application and use case-based benchmarks have a higher correlation with the end-user experience. However, the cost and complexity of setting them up, maintaining them, and using them for pre-silicon analysis also increase significantly. The opposite is true for synthetic and micro benchmarks.

Application benchmarks consist of complete, real programs that are widely used to solve various compute challenges, but they can be complex to port and set up. One example is SPEC CPU, which is the most popular benchmark for measuring CPU performance.

Use case-based benchmarks – which include Speedometer, used to measure the responsiveness of web applications – are very representative of end usage, but they are challenging to port, run, and maintain, and they require full-system platforms for projecting scores.

Also, the metrics that characterize use-case performance are difficult to capture on pre-silicon platforms. This makes it challenging to use these benchmarks for performance exploration in the early stages of the design cycle.

Moreover, use case-based benchmarks can often be impractical as a measure of IP performance. For example, a web browsing benchmark exercises a complex software stack and system as well as the CPU IP – not just the CPU in isolation.

Synthetic benchmarks, such as Dhrystone and CoreMark, are useful for measuring core CPU performance during development. They are also quite easy to port, run, and maintain, and they do not require a full system setup for projecting scores.
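
To give a flavour of this category, below is a minimal sketch of a synthetic, ALU-bound benchmark in C. It is not actual Dhrystone or CoreMark code; the workload, iteration count, and reporting are invented for illustration, but it shows the general pattern: a small, self-contained integer loop is timed and reported as a rate, with no dependence on storage, networking, or a full software stack.

```c
/* Minimal sketch of a synthetic CPU benchmark (illustrative only; not
 * Dhrystone or CoreMark code). A small, ALU-bound integer loop is timed
 * and reported as a rate. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define ITERATIONS 100000000UL   /* arbitrary iteration count for illustration */

int main(void)
{
    volatile uint32_t acc = 1;   /* volatile stops the compiler deleting the loop */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < ITERATIONS; i++) {
        /* simple integer work: multiply, add, shift, xor */
        acc = acc * 1664525u + 1013904223u;
        acc ^= acc >> 16;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("checksum %u, %.1f million iterations/s\n",
           (unsigned)acc, ITERATIONS / secs / 1e6);
    return 0;
}
```

In practice, real synthetic benchmarks also fix compiler options and validate results so that scores remain comparable across platforms.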

Micro benchmarks are typically used to measure specific aspects of IP performance, such as memory latency and bandwidth. For example, the STREAM benchmark is used to measure sustained memory bandwidth. 
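
As an illustration of the micro-benchmark idea, the sketch below follows the spirit of STREAM’s “triad” kernel (a[i] = b[i] + s*c[i]) in plain C. It is not the official STREAM source, which adds further kernels, result validation, and run rules; the array size and scaling factor here are arbitrary choices for illustration.

```c
/* Sketch in the spirit of STREAM's triad kernel (a[i] = b[i] + s*c[i]);
 * not the official STREAM benchmark. Arrays are sized to exceed typical
 * last-level caches so that memory bandwidth dominates. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)   /* 16M doubles = 128 MB per array (illustrative) */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];          /* triad: two loads + one store per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 3.0 * N * sizeof(double);   /* read b, read c, write a */
    printf("sustained bandwidth: %.2f GB/s (a[0] = %f)\n", bytes / secs / 1e9, a[0]);

    free(a); free(b); free(c);
    return 0;
}
```

The reported figure approximates sustained bandwidth rather than peak theoretical bandwidth, which is precisely what makes micro benchmarks of this kind useful for memory-system analysis.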

Kernel benchmarks consist of algorithms extracted from larger applications, and they can help assess performance on those applications’ key hotspots. Geekbench is one example of a kernel benchmark that is broadly used in mobile and desktop applications.
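
The kernel-benchmark approach can be sketched as follows: a hot routine – here a simple FIR filter of the kind that might be extracted from an audio or video pipeline – is isolated with fixed inputs and timed on its own. This is not code from Geekbench or any real suite; the routine, tap count, and sample count are hypothetical and chosen purely for illustration.

```c
/* Sketch of a kernel benchmark (illustrative only; not Geekbench code).
 * A hypothetical hotspot, a plain FIR filter, is isolated with fixed
 * inputs and timed on its own. */
#include <stdio.h>
#include <time.h>

#define TAPS    64          /* hypothetical filter length */
#define SAMPLES (1 << 20)   /* hypothetical input size */
#define RUNS    20

/* the extracted "hotspot": a straightforward FIR filter over a sample block */
static void fir(const float *in, float *out, const float *coef, int n)
{
    for (int i = TAPS; i < n; i++) {
        float acc = 0.0f;
        for (int t = 0; t < TAPS; t++)
            acc += coef[t] * in[i - t];
        out[i] = acc;
    }
}

static float in[SAMPLES], out[SAMPLES], coef[TAPS];

int main(void)
{
    for (int i = 0; i < SAMPLES; i++) in[i] = (float)(i % 97) / 97.0f;
    for (int t = 0; t < TAPS; t++)    coef[t] = 1.0f / TAPS;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < RUNS; r++)
        fir(in, out, coef, SAMPLES);       /* time only the extracted kernel */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("FIR kernel: %.2f ms per run (out[100] = %f)\n",
           secs * 1000.0 / RUNS, out[100]);
    return 0;
}
```

Because only the extracted routine is timed, the result reflects CPU and cache behaviour on that hotspot rather than whole-application performance.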

The value of different CPU benchmarks

Each category of benchmark has its own place and value during the various stages of hardware and software product design.

For example, synthetic benchmarks such as Dhrystone can be used as a power proxy for longer-running benchmarks; micro benchmarks are used to measure peak achievable bandwidth; kernel benchmarks provide an estimate of the performance of specific algorithms; and application and use case-based benchmarks run late in the design cycle and provide an accurate measure of system-level performance.

However, it is also important to note that the choice of benchmarks used for pre-silicon projections and post-silicon benchmarking, particularly for CPUs, can be driven by factors beyond benchmark measurements alone. These include specific requests from partners, popularity in the ecosystem, competitive pressures, and the technical merit of benchmarks.

More complex use cases

To be useful, a benchmark must be representative of the characteristics of the target real-world applications and use cases. In the past, when use cases were fewer and more straightforward, Dhrystone and CoreMark were sufficient to measure system performance. Today, however, home devices run a greater variety of more complex use cases, so more advanced benchmarks, such as SPEC CPU or Geekbench, are considered more relevant for effectively measuring this range of complex workloads.

Different roles for CPU benchmarks

At the same time, it should be noted that any benchmark suite represents only a portion of the performance spectrum of real-world usage. It therefore remains important to use a wide range of benchmark suites to cover a broad set of applications and performance points.

In fact, there can still be a role for benchmarks such as Dhrystone and CoreMark that purely measure the compute-bound performance of synthetic programs, but these are best deployed in the early stages of processor and IP development. Combining these benchmarks with more advanced kernel, application, and use case-based benchmarks can provide useful performance insights during the early development stages of IP.

Indeed, benchmarks are not ‘one-size-fits-all’ across processors, devices, and use cases. It is worth reflecting on the compute requirements and capabilities of individual processors when selecting the most appropriate benchmark to use for home devices.

By Ajay Joshi, Distinguished Engineer, Arm