
HPCA 2025


Computing Systems Resilience to Hardware Faults:
Tackling Complexity and Scale

Sunday, March 2, 2025, Las Vegas, NV, USA
All-Day tutorial held in conjunction with HPCA 2025

Tutorial Summary

Errors and failures in systems are a significant concern in the era of advanced and ubiquitous computing. Faults in silicon chips arise from imperfections in the fabrication process, incomplete volume manufacturing test flows, device variability and marginalities, circuit aging, radiation and other environmental effects, and circuit design bugs. As CPUs, GPUs, AI accelerators (AIAs), and memory chips grow steadily more complex and the scale of their deployment in all domains (HPC, cloud, edge, IoT) increases dramatically, the rates at which system failures or silent computational errors occur due to any of these threats have become alarming.

Chip manufacturers, system integrators, hyperscalers and, of course, the research community are joining forces to tackle the complexity and scale of the problem. This tutorial discusses recent industry disclosures and novel research approaches for dealing with silicon defects at scale. The critical point is the effort to measure and quantify the scale of the problem through modeling, analysis, and experimentation, both in simulation and on real chips and systems. Early and accurate analysis matters in several ways. It can guide effective decisions in circuit, system, and software design for identifying and correcting the root causes of faults and for minimizing their impact on software execution and, through this, on the user's perception of whether the computing system operates correctly.

The tutorial starts with a brief overview of key resilience terminology, followed by a review of techniques to model the vulnerability of a microprocessor to hardware faults. The tutorial then focuses on state-of-the-art simulation-based methods for fast assessment of the vulnerability of the entire system stack to different types of hardware faults (transient, permanent, timing, bugs) in the silicon compute units (CPUs, GPUs, AIAs). The simulation-based strategies are built on early microarchitectural models that provide sufficient cycle-level hardware modeling accuracy and, most of all, very fast, complete execution of workloads. These are mandatory requirements for measuring the rates of silent data corruptions (SDCs) and other types of failures. They also facilitate a very broad design space exploration and support careful sensitivity studies for all microarchitectural parameters of modern computing engines that affect the rate of execution errors.
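
As a rough illustration of the kind of experiment such methods run, the following minimal Python sketch injects single-bit transient faults into a toy four-register machine and classifies each run as masked, SDC, or crash. Everything in it (the machine, the register layout, the names `run`, `classify`, and `campaign`, and the 16-bit register width) is a hypothetical stand-in chosen for brevity; it is not the tutorial's framework, which operates on cycle-level microarchitectural simulators.

```python
import random
from collections import Counter

WIDTH = 16                 # bits per modeled register (assumption)
MASK = (1 << WIDTH) - 1
NUM_REGS = 4               # [index, accumulator, bound, temp]

def run(data, fault=None):
    """Sum `data` on the toy machine. `fault` is (step, reg, bit) or None."""
    regs = [0, 0, len(data), 0]
    step = 0
    while True:
        # Transient fault: flip one bit of one register before this step.
        if fault is not None and fault[0] == step:
            regs[fault[1]] ^= 1 << fault[2]
        if regs[0] >= regs[2]:          # loop exit test: index >= bound
            break
        regs[3] = data[regs[0]]         # load into temp (may raise IndexError)
        regs[1] = (regs[1] + regs[3]) & MASK
        regs[0] += 1
        step += 1
    return regs[1]

def classify(data, fault):
    """Compare a faulty run against the fault-free (golden) run."""
    golden = run(data)
    try:
        out = run(data, fault)
    except IndexError:
        return "crash"                  # out-of-bounds access stops the run
    return "masked" if out == golden else "sdc"

def campaign(data, n_injections, seed=0):
    """Statistical fault injection: sample random (step, reg, bit) faults."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_injections):
        fault = (rng.randrange(len(data)),
                 rng.randrange(NUM_REGS),
                 rng.randrange(WIDTH))
        counts[classify(data, fault)] += 1
    return counts
```

Calling `campaign(list(range(1, 9)), 1000)` returns a `Counter` with the number of runs that fell into each class. Even in this toy, the outcome depends on where the fault lands: a flip in the temp register is overwritten before use and is masked, a flip in the accumulator silently corrupts the result, and a flip that enlarges the bound leads to an out-of-range access.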

The simulation-based framework supports CPUs of different ISAs (x86, Arm, RISC-V), a wide range of microarchitectures with different complexities, as well as GPUs and domain-specific accelerators (such as AI accelerators, AIAs) for data-intensive workloads integrated in complex SoCs. Results and case studies will be presented for the analysis of different hardware fault types and architectures/microarchitectures, as well as for the validation of the strategy's findings on real systems.

Apart from the analysis of the vulnerability of different compute chips (CPUs, GPUs, AIAs), the framework is utilized for (a) the automatic generation of effective test programs that can be used at different stages of large-scale systems (cloud data centers, supercomputers), (b) the estimation of the rates of SDC incidents at scale, and (c) the assessment of the relative likelihood of different microarchitectures, instruction classes, and hardware structures to produce SDCs.
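
For intuition on item (b), a common first-order way to turn a vulnerability estimate into a fleet-level number is to scale a raw fault rate, expressed in FIT (failures per 10^9 device-hours), by the fraction of faults that end as SDCs, and then multiply by fleet size and operating time. The sketch below does that arithmetic in Python. The function names and all numeric inputs are hypothetical placeholders, not measurements from the tutorial or the listed publications, and the tutorial's actual estimation method may differ.

```python
def sdc_fit(raw_fit_per_bit, bits, sdc_fraction):
    """SDC FIT of one structure: raw fault rate scaled by its SDC fraction."""
    return raw_fit_per_bit * bits * sdc_fraction

def expected_incidents(fit, devices, hours):
    """Expected events for `devices` chips running `hours` each (FIT is per 1e9 h)."""
    return fit * devices * hours / 1e9

# Hypothetical inputs: 0.001 FIT per bit, a 1 Mbit structure, 5% of faults -> SDC,
# deployed on 100,000 chips for one year (8,760 hours).
fit = sdc_fit(1e-3, 1_000_000, 0.05)
per_year = expected_incidents(fit, 100_000, 8_760)
print(f"{fit:.1f} FIT per chip, about {per_year:.1f} expected SDC incidents per year")
```

The point of the arithmetic is that a per-chip rate that looks negligible can still translate into a steady stream of incidents once it is multiplied by fleet size and operating hours.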

Key topics to be discussed:

  • Effects of silicon defects at scale
  • Hardware faults modeling and simulation
  • Fault injection at microarchitecture level and speed up techniques
  • Resilience measurements for CPUs, GPUs, and AI Accelerators
  • Faults in on-chip arrays and functional units
  • Functional test programs generation
  • SDCs and failure rates estimation

Organizers/Presenters

Dimitris Gizopoulos, George Papadimitriou, Odysseas Chatzopoulos, Nikos Karystinos (University of Athens)

Tentative Agenda

08:00 - 08:30: Welcome and Introduction

  • Tutorial outline
  • Resilience basics: defects, faults, errors, failures

08:30 - 09:30: Silicon defects and errors at scale

  • Silent Data Corruptions at scale: industry reports
  • Suspected causes of errors and failures at scale

09:30 - 12:00: Modeling and measuring the effects of faults

  • Microarchitecture level fault modeling
  • Vulnerability factors revisited
  • Fault injection infrastructure for CPUs, GPUs, and AIAs
  • Faults in on-chip arrays and functional units

12:20 - 13:00: Lunch

13:00 - 14:00: Case Studies

  • Case studies for CPUs, GPUs, AIAs resilience

14:00 - 15:30: Use Cases

  • Automatic functional program generation
  • SDC incidents measurements at scale

15:30 - 16:30: Discussion and Wrap-up

  • Future directions and open discussion with attendees

Target audience

This tutorial is designed for students, researchers, engineers, and practitioners active in the fields of computer architecture and systems reliability. Attendees should have a basic understanding of microprocessor architecture/microarchitecture and of hardware faults and errors.

Short bios

Dimitris Gizopoulos (dgizop@di.uoa.gr) is a Professor at the Department of Informatics & Telecommunications of the University of Athens, where he leads the Computer Architecture Lab. The group's research focuses on the dependability, the energy-efficiency, and the performance of computer architectures. Gizopoulos has published more than 190 papers in conferences and journals, has served and is currently serving as Associate Editor for several IEEE and ACM Transactions and Magazines, and as a member of Program, Organizing and Steering Committees of IEEE and ACM conferences. Gizopoulos is an IEEE Fellow, a Golden Core member of the IEEE Computer Society and a Distinguished ACM member. He received the 2024 ACM SIGMICRO Distinguished Service Award.


George Papadimitriou (georgepap@di.uoa.gr) is a Postdoctoral Researcher in the Department of Informatics & Telecommunications of the University of Athens. His research focuses on dependability and energy-efficient computer architectures, microprocessor reliability, functional correctness of hardware designs, and design validation of microprocessors. He has published more than 60 papers in international conferences and journals. He received the 2022 Best Paper Award from IEEE Transactions on Computers for his journal publication in IEEE TC 2022, and the 2023 TTTC/ITC Gerald W. Gordon Student Award. He is an IEEE member.



Odysseas Chatzopoulos (Od.Chatzopoulos@di.uoa.gr) is a PhD student in the Department of Informatics & Telecommunications of the University of Athens. His research focuses on energy-efficient microprocessor design, and dependable computing modeling and assessment.


Nikos Karystinos (n.karystinos@di.uoa.gr) is a PhD student in the Department of Informatics & Telecommunications of the University of Athens. His research focuses on dependable computing modeling and assessment.


Publications

HPCA 2025 - "Veritas: Demystifying Silent Data Corruptions: μArch-Level Modeling and Fleet Data of Modern x86 CPUs", O. Chatzopoulos, N. Karystinos, G. Papadimitriou, D. Gizopoulos, H. D. Dixit, and S. Sankar, IEEE International Symposium on High-Performance Computer Architecture (HPCA 2025), Las Vegas, NV, USA, March 2025.


DATE 2025 - "From Gates to SDCs: Understanding Fault Propagation Through the Compute Stack", O. Chatzopoulos, G. Papadimitriou, D. Gizopoulos, H. D. Dixit, and S. Sankar, Design, Automation, and Test in Europe, Lyon, France, March 2025.


SIGARCH Blog 2024 - "SDCs: A B C", D. Gizopoulos, Computer Architecture Today blog, September 16, 2024.


ISCA 2024 - "Harpocrates: Breaking the Silence of CPU Faults through Hardware-in-the-Loop Program Generation", N. Karystinos, O. Chatzopoulos, G. Fragkoulis, G. Papadimitriou, D. Gizopoulos, and S. Gurumurthi, ACM/IEEE International Symposium on Computer Architecture (ISCA 2024), Buenos Aires, Argentina, June 2024.


CARRV 2024 - "Advancing Cloud Computing Capabilities on gem5 by Implementing the RISC-V Hypervisor Extension", G. Fragkoulis, N. Karystinos, G. Papadimitriou, and D. Gizopoulos, Eighth Workshop on Computer Architecture Research with RISC-V (CARRV 2024) - in conjunction with the IEEE/ACM International Symposium on Microarchitecture (MICRO 2024), Austin, Texas, USA, November 2024.


CLUSTER 2024 - "GPU Reliability Assessment: Insights Across the Abstraction Layers", L. Yang, G. Papadimitriou, D. Sartzetakis, A. Jog, E. Smirni, and D. Gizopoulos, IEEE International Conference on Cluster Computing (CLUSTER 2024), Kobe, Japan, September 2024.


HPCA 2024 - "gem5-MARVEL: Microarchitecture-Level Resilience Analysis of Heterogeneous SoC Architectures", O. Chatzopoulos, G. Papadimitriou, V. Karakostas, and D. Gizopoulos, IEEE International Symposium on High-Performance Computer Architecture (HPCA 2024), March 2024.


IOLTS 2024 - "Silent Data Corruptions in Computing: Understand and Quantify", T. Macieira, S. Gurumurthy, S. Gurumurthi, A. Haggag, G. Papadimitriou, and D. Gizopoulos, IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS 2024), Rennes, Brittany, France, July 2024.


ETS 2024 - "Silent Data Corruptions in Computing Systems: Early Predictions and Large-Scale Measurements", D. Gizopoulos, G. Papadimitriou, O. Chatzopoulos, N. Karystinos, H. Dixit, and S. Sankar, IEEE European Test Symposium (ETS 2024), The Hague, Netherlands, May 2024.


HPCA 2023 - "AVGI: Microarchitecture-Driven, Fast and Accurate Vulnerability Assessment", G. Papadimitriou and D. Gizopoulos, IEEE International Symposium on High-Performance Computer Architecture (HPCA 2023), February 2023.


MICRO 2023 - "Impact of Voltage Scaling on Soft Errors Susceptibility of Multicore Server CPUs", D. Agiakatsikas, G. Papadimitriou, V. Karakostas, D. Gizopoulos, M. Psarakis, C. Belanger-Champagne, and E. Blackmore, IEEE/ACM International Symposium on Microarchitecture (MICRO 2023), October 2023.


IOLTS 2023 - "Silent Data Corruptions: The Stealthy Saboteurs of Digital Integrity", G. Papadimitriou, D. Gizopoulos, H. D. Dixit, and S. Sankar, IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS 2023), July 2023.


TC 2023 - "Silent Data Corruptions: Microarchitectural Perspectives", G. Papadimitriou and D. Gizopoulos, IEEE Transactions on Computers, Volume: 72, Issue: 11, pp. 3072-3085, November 2023.


TETC 2023 - "Anatomy of On-Chip Memory Hardware Fault Effects Across the Layers", G. Papadimitriou and D. Gizopoulos, IEEE Transactions on Emerging Topics in Computing, Volume: 11, Issue: 2, pp. 420-431, June 2023.


ISPASS 2022 - "gpuFI-4: A Microarchitecture-Level Framework for Assessing the Cross-Layer Resilience of Nvidia GPUs", D. Sartzetakis, G. Papadimitriou, and D. Gizopoulos, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2022), May 2022.


TC 2022 - "Soft Error Effects on Arm Microprocessors: Early Estimations Versus Chip Measurements", P. Bodmann, G. Papadimitriou, R. L. Rech Jr, D. Gizopoulos, and P. Rech, IEEE Transactions on Computers, Volume: 71, Issue: 10, pp. 2358-2369, October 2022.


ISCA 2021 - "Demystifying the System Vulnerability Stack: Transient Fault Effects Across the Layers", G. Papadimitriou and D. Gizopoulos, ACM/IEEE International Symposium on Computer Architecture (ISCA 2021), June 2021.


DSN 2019 - "Demystifying Soft Error Assessment Strategies on ARM CPUs: Microarchitectural Fault Injection vs. Neutron Beam Experiments", A. Chatzidimitriou, P. Bodmann, G. Papadimitriou, D. Gizopoulos, and P. Rech, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2019), June 2019.


DSN 2017 - "RT Level vs. Microarchitecture Level Reliability Assessment: Case Study on ARM Cortex-A9 CPU", A. Chatzidimitriou, M. Kaliorakis, D. Gizopoulos, M. Iacaruso, M. Pipponzi, R. Mariani, and S. Di Carlo, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2017), June 2017.



Creative Commons License Created by Computer Architecture Lab @ UoA
This work is licensed under a CC License
University of Athens
Dept. of Informatics and Telecommunications

Address:
Panepistimiopolis, Ilissia
Athens, Greece, GR 157 84

Phone:
+30 210 727 5145
Email:
dgizop AT di DOT uoa DOT gr