GPU Verification Challenges

Graphics Processing Units (GPUs) were originally designed for 3-D graphics rendering in applications such as PC gaming. Their use has since been generalized to any workload that requires parallel computation over large data sets, such as machine learning and artificial intelligence.

GPUs are massively parallel compute engines (Fig. 1) optimized for throughput. They contain thousands of cores designed for heavy mathematical workloads, such as matrix multiplication, on streams of similar data. Running thousands of parallel threads across those cores complicates resource sharing and increases the risk of bugs due to race conditions. In addition, GPU cores may include many of the performance-optimizing architectural features of CPU cores, features that add complexity and give rise to simulation-resistant superbugs.

Figure 1: Massive parallelism in GPUs
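
To make the race-condition risk concrete, here is a minimal CUDA sketch. It is a software-level analogy, not a model of any GPU's hardware: thousands of threads perform an unsynchronized read-modify-write on one shared counter, and updates are silently lost. This is the same class of interleaving bug that shared hardware resources must be designed, and verified, to prevent.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Racy: each thread's read-modify-write can interleave with
    // others', so increments are silently lost.
    __global__ void count_racy(int *counter) {
        *counter = *counter + 1;
    }

    // Correct: atomicAdd serializes the read-modify-write.
    __global__ void count_atomic(int *counter) {
        atomicAdd(counter, 1);
    }

    int main() {
        int *counter, host = 0;
        cudaMalloc(&counter, sizeof(int));

        cudaMemcpy(counter, &host, sizeof(int), cudaMemcpyHostToDevice);
        count_racy<<<1024, 256>>>(counter);
        cudaMemcpy(&host, counter, sizeof(int), cudaMemcpyDeviceToHost);
        printf("racy:   %d (expected %d)\n", host, 1024 * 256);

        host = 0;
        cudaMemcpy(counter, &host, sizeof(int), cudaMemcpyHostToDevice);
        count_atomic<<<1024, 256>>>(counter);
        cudaMemcpy(&host, counter, sizeof(int), cudaMemcpyDeviceToHost);
        printf("atomic: %d\n", host);

        cudaFree(counter);
        return 0;
    }

The racy version typically reports far fewer than the expected 262,144 increments, and the shortfall varies from run to run, which is exactly what makes such bugs hard to catch with simulation.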

GPU computing makes it possible to run compute and graphics workloads on the GPU simultaneously to improve performance. This adds complexity: the hardware must manage the sharing of resources between two asynchronous instruction streams, compute and graphics, each with different requirements.
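
As a software-level analogy for these asynchronous streams (the kernel names and workloads below are illustrative assumptions; real graphics work arrives through a graphics API, not a CUDA kernel), two CUDA streams submit work to the same device with no guaranteed interleaving, leaving the hardware to arbitrate shared cores, caches, and memory bandwidth between them:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void compute_work(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f + 1.0f;
    }

    __global__ void graphics_like_work(float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = y[i] * 0.5f;
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));
        cudaMemset(y, 0, n * sizeof(float));

        // Two independent streams: the device may run these kernels
        // concurrently, in either order, or time-sliced.
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);
        compute_work<<<n / 256, 256, 0, s0>>>(x, n);
        graphics_like_work<<<n / 256, 256, 0, s1>>>(y, n);
        cudaDeviceSynchronize();

        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
        cudaFree(x);
        cudaFree(y);
        printf("done\n");
        return 0;
    }

Because the interleaving is not fixed, any bug in the arbitration between the two streams may appear only under rare timing conditions, another reason such logic resists simulation.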

GPU Blocks with Simulation-Resistant Superbugs

Many functional units in a GPU are commonly found to contain simulation-resistant superbugs, such as:

Resource manager

Instruction scheduler

Execution units

Load-store unit

L2 cache

These blocks have too many input combinations, and too many possible temporal orderings between events, to cover completely in simulation.
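
A back-of-the-envelope calculation shows why (the figures are illustrative assumptions, not measurements of any particular design). Consider a scheduler that tracks 64 in-flight warps, each in one of 4 states:

    4^64 = 2^128 ≈ 3.4 × 10^38 possible scheduler states

and that is before accounting for the temporal orderings of events that move the design between those states. No simulation campaign can enumerate a space of that size; exhaustive coverage requires proof-based analysis.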

Oski Formal for GPUs

Oski’s Formal Sign-Off Methodology enables exhaustive analysis of all possible design states. Oski has developed the expertise to anticipate where GPU superbugs are most likely to occur and how to flush them out. Contact Oski to learn more.