Simulation-Resistant ‘Superbugs’ Are Here. Time to Rethink Your Verification Strategy

Simulation-resistant superbugs are on the rise and wreaking havoc in an array of design domains including CPUs, GPUs, networking, wireless, functional safety, and machine learning. These functional superbugs are resistant to simulation because extreme corner cases are required to activate and detect them. They are a side-effect of designer innovations in parallelism and concurrency to offset the slowing of frequency scaling in a Post-Moore’s Law Era [1]. Similarly, functional techniques like clock gating that work around the stall in process-driven power efficiency gains lead to a further class of superbugs. Emulation may sometimes find superbugs, but not until very late in the design cycle. Typically, superbugs escape to silicon and are found in the lab or by an end customer.

The Verification Crisis in a Post-Moore’s Law Era

Verification has never been harder. CPU developers can no longer rely solely on making individual processors go faster and instead pack more cores on a chip to meet performance and efficiency goals. This surge in functional complexity and the increasing adoption of parallelism and concurrency have contributed to an increase in insidious bugs that are nearly impossible to find pre-silicon. Other factors compound the problem: the root cause of a superbug can be elusive during debug, and a superbug may be hiding sister superbugs that are revealed only once the first one is identified [2]. As a result, it is difficult to predict when all of these superbugs will be fixed and the chip will be blessed for tape-out.

Now imagine data centers and networking systems built from thousands of these multi-core chips, each susceptible to non-deterministic bugs, and you start to understand the scale of the superbug epidemic and the seemingly insurmountable hurdle it presents in verification.

Simulation-resistant superbugs affect many design domains and are typically application-specific. For example, CPU functionality related to cache coherence, speculative issue, data prefetching, and memory subsystems may foster superbugs. In networking applications, the use of resource managers for multiple ports, linked list controllers, and XBARs creates superbug exposure. In the wireless domain, multi-user MIMO, Rx and Tx channel interference, and aggressive 802.11 standard compliance are prime illustrations of functionality that contributes to superbugs. The chart below depicts common areas or functionalities that are susceptible to superbugs by domain. Applying application-specific verification methodologies is the key to combating these superbugs.

Employing the Right Verification Strategy

With mask costs reaching extraordinary levels, it’s never been more important to ensure that all bugs are identified and fixed before tape-out. However, the set of complex issues that parallelism and concurrency introduce, including non-determinism, race conditions, deadlock, and performance and scalability challenges, has forced a change in verification strategies.

Simulation has historically been the verification tool of choice because it provides high controllability and observability, but its execution speed limits how much can reasonably be tested. Simulation is strong in verifying designs that are more sequential in execution even if they implement complex functionality. However, simulation runs into a wall when trying to account for all the different scenarios brought on by parallelism and concurrency.

Emulation has been growing in popularity because its speed allows for the execution of low-level software that can stimulate the design in ways that simulation cannot. It is also powerful for exploring system-level performance and validating software. However, relying on emulation to discover corner-case superbugs can be detrimental to meeting schedule demands because emulation is not typically focused on verifying corner cases and is deployed late in the design flow. Finding serious bugs at this late stage is better than a bug escape to silicon, but it can still derail a design schedule, delaying market release and sacrificing profits.

Is Deep Application-Specific Formal the Answer?

Formal sign-off methodologies have the power to prove the absence of bugs, including superbugs brought on by parallelism and concurrency. Formal techniques require a thorough understanding of the low-level details of the design. When bugs are found, formal methods identify the conditions under which the bug occurs, even if those conditions are bizarre corner cases that no one would ever think of.

Deep application-specific formal methodologies are ideally suited to corner-case coverage because every scenario that parallelism and concurrency make possible is verified exhaustively, no matter how unlikely it is.
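The idea of exhaustive exploration can be illustrated with a toy example. The sketch below is not a formal verification tool and not Oski’s methodology; it is a minimal explicit-state search over a hypothetical model of two agents that each increment a shared counter with a non-atomic load and store. All names (`successors`, `invariant`, `check`) are invented for illustration.

```python
from collections import deque

# State: (pc0, pc1, tmp0, tmp1, shared)
# Per agent: pc 0 = about to load, pc 1 = about to store, pc 2 = done.
INITIAL = (0, 0, 0, 0, 0)

def successors(state, atomic):
    """Yield (label, next_state) for every agent that can take a step."""
    pcs, tmps, shared = list(state[0:2]), list(state[2:4]), state[4]
    for agent in (0, 1):
        pc = pcs[agent]
        if pc == 2:
            continue
        new_pcs, new_tmps, new_shared = pcs[:], tmps[:], shared
        if atomic:
            # Fixed design: load and store happen as one indivisible step.
            new_shared, new_pcs[agent] = shared + 1, 2
            label = f"agent{agent}: atomic increment"
        elif pc == 0:
            new_tmps[agent], new_pcs[agent] = shared, 1
            label = f"agent{agent}: load"
        else:
            new_shared, new_pcs[agent] = tmps[agent] + 1, 2
            label = f"agent{agent}: store"
        yield label, (new_pcs[0], new_pcs[1], new_tmps[0], new_tmps[1], new_shared)

def invariant(state):
    """Once both agents are done, the counter must equal 2."""
    pc0, pc1, _, _, shared = state
    return not (pc0 == 2 and pc1 == 2) or shared == 2

def check(initial, atomic):
    """Breadth-first search over every reachable state.
    Returns (states_discovered, counterexample_trace or None)."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return len(parent), trace[::-1]
        for label, nxt in successors(state, atomic):
            if nxt not in parent:
                parent[nxt] = (state, label)
                queue.append(nxt)
    return len(parent), None

if __name__ == "__main__":
    print(check(INITIAL, atomic=False))  # buggy model: prints a failing trace
    print(check(INITIAL, atomic=True))   # fixed model: no trace is found
```

In the non-atomic model the search returns a four-step trace in which both agents load before either stores, which is exactly the interleaving that loses an update. In the atomic model the search visits every reachable state and finds no violation, which for this toy model amounts to a proof. Production formal tools work symbolically on RTL rather than enumerating states one at a time, but the guarantee has the same flavor.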

Formal verification can zero in on specific application behavior, making it the most robust and efficient way to knock down superbugs. There are many examples of mission-critical functionality that have been formally signed off. All of these formal methods were developed in response to post-silicon bugs being overly common in these block types.

Formal sign-off can happen in parallel with other verification methods, but the testbenches can take a significant amount of time to think through and create. Formal experts like Oski have been through this drill numerous times and have accumulated patterns and best practices that can save months of verification time. Application-specific abstraction techniques are often key to overcoming the exponential proof complexity that can result from a naive use of formal tools.

Such experts offload your verification team so that simulation engineers can focus on simulation-friendly blocks, and emulation engineers can focus on system-level integration and software bugs. This approach leverages the strengths of each verification domain to deliver chip-level functional sign-off within a project’s schedule.

While superbugs might seem like an implacable foe, raising the specter of deficient systems that fail after deployment, they need not be. Formal sign-off with the assistance of an experienced team like Oski Technology can eliminate superbugs without delaying time-to-market.




Going to DAC? Sign up for a complimentary Superbug Risk Assessment, where we’ll rank-order your blocks by superbug risk.

Not going to DAC? You can still get your complimentary Superbug Risk Assessment. Click here.