
Electronic Prognostics-Enterprise Computing Infrastructure

Proactive Detection of Degradation in CPUs, GPUs, and System Boards

The HPC and AI Data Center Reliability Challenges


Modern enterprise data centers operate some of the most valuable computing assets in the world. Large-scale clusters composed of CPUs, GPUs, ASICs, and densely populated system boards support mission-critical AI training, inference, analytics, and cloud workloads. These assets are expected to operate continuously, often under extreme thermal, electrical, and duty-cycle stresses.


Despite extensive onboard instrumentation, unexpected hardware failures remain common and increasingly expensive. When CPUs, GPUs, or system boards fail without warning, the consequences include lost availability, degraded cluster throughput, emergency maintenance, and significant operational disruption, especially during long AI training runs.


The root cause is not a lack of telemetry. Modern servers and accelerators generate enormous volumes of data. The problem is that conventional monitoring approaches cannot interpret this data in a predictive way.


Quantized Signals Mask Early Degradation 


Many telemetry signals in modern enterprise hardware are quantized by design, with a leading source of quantization originating in low-resolution Analog-to-Digital (A/D) conversion chips embedded throughout servers, GPUs, system boards, and power delivery subsystems. For cost, bandwidth, and legacy reasons, most platforms rely on 8-bit (or similarly low-resolution) A/D converters, which fundamentally limit signal fidelity before the data ever reaches monitoring software.


As a result, many commonly used telemetry signals appear as:


  • Integer-valued temperatures
  • Discretized voltage and current readings
  • Coarse fan speed measurements
  • Counters, flags, and error registers
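
As a rough illustration of why this matters, the sketch below models an idealized 8-bit converter in Python. The 0 to 255 °C input range, the `quantize` helper, and the drift values are assumptions chosen for the example, not figures from any specific platform.

```python
def quantize(value, lo=0.0, hi=255.0, bits=8):
    """Map an analog value to the nearest level of an idealized ADC.

    lo, hi, and bits are illustrative assumptions, not vendor specifications.
    """
    levels = 2 ** bits - 1                 # number of steps across the range
    step = (hi - lo) / levels              # size of one least-significant bit
    clamped = min(max(value, lo), hi)      # saturate out-of-range inputs
    code = round((clamped - lo) / step)    # integer code the ADC would report
    return lo + code * step                # value the monitoring software sees


# Hypothetical slow drift: 0.3 degrees C in total, well under one step.
true_temps = [60.1 + 0.03 * i for i in range(11)]
reported = [quantize(t) for t in true_temps]

print(sorted(set(reported)))   # one distinct value: the drift is invisible
```

With a one-degree step, every sample in this hypothetical drift reports the same reading, so a monitor looking only at the quantized value sees a flat line.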


Early-stage degradation mechanisms rarely produce large excursions in any single signal. Instead, they manifest as subtle, correlated pattern changes across many signals. Quantization collapses these continuous dynamics into “stair-step” behavior, masking early degradation signatures and causing conventional anomaly detection algorithms to suffer from:


  • Elevated false-alarm probabilities (Type-I errors)
  • Elevated missed-alarm probabilities (Type-II errors) 
  • Elevated No Faults Found (NFFs) for returned Field Replaceable Units (FRUs)


TNP DeQuantize: Restoring Signal Fidelity Upstream of EP (NEW)


TNP solves this problem using DeQuantize, a signal-reconstruction capability applied upstream of the core multivariate anomaly detection algorithm (AI-MSET™).


DeQuantize transforms low-resolution, quantized telemetry into high-fidelity signals that more accurately represent the underlying physical behavior of the hardware, without requiring any new sensors or hardware changes. Depending on the severity of quantization, DeQuantize reconstructs lost dynamics using proven spectral and statistical techniques derived from expired prior art.


By restoring meaningful signal structure before anomaly detection is performed, DeQuantize:


  • Dramatically reduces false alarms caused by quantization artifacts
  • Prevents missed detection of early degradation
  • Enables AI-MSET™ to fully exploit multivariate correlations that are otherwise hidden


This DeQuantize → AI-MSET™ architecture is essential for achieving reliable Electronic Prognostics in modern data centers and is a key reason why TNP’s EP outperforms conventional monitoring approaches.


Univariate Monitoring Does Not Scale


Conventional monitoring treats each signal independently, applying fixed high/low thresholds. In a large GPU cluster with hundreds of thousands or millions of telemetry signals, this approach is fundamentally unscalable and blind to system-level behavior.


Individual signals may remain “within limits” and never trigger a high/low threshold alert while the system as a whole is already degrading, a condition that AI-MSET™ detects from anomalous correlation patterns.
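
The gap between per-signal limits and system-level behavior can be shown with a toy example. The sketch below is not AI-MSET™; it fits an ordinary least-squares line between two hypothetical signals (load and temperature) on healthy data, then checks a new reading both ways. All values, limits, and names are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical healthy training data: temperature tracks load.
loads = [10, 20, 30, 40, 50, 60, 70, 80, 90]
temps = [35.2, 39.9, 45.1, 50.0, 54.8, 60.1, 65.0, 69.9, 75.0]
slope, intercept = fit_line(loads, temps)

TEMP_LIMIT = 85.0        # assumed fixed high threshold, degrees C
RESIDUAL_LIMIT = 3.0     # assumed tolerance on the learned relationship

# New reading: modest load, but running hotter than the healthy pattern.
load_now, temp_now = 40, 62.0
threshold_alarm = temp_now > TEMP_LIMIT
residual = temp_now - (slope * load_now + intercept)
correlation_alarm = abs(residual) > RESIDUAL_LIMIT

print(threshold_alarm, correlation_alarm)
```

The reading sits well below the fixed limit, so the threshold check stays silent, while the departure from the learned load-to-temperature relationship is large enough to flag.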


Thresholds Detect Failures Too Late


Threshold-based alarms typically fire only when damage is already significant and failure is imminent. At that point:


  • Emergency board replacement is required
  • MTTR increases
  • Spare inventory must be maintained on site
  • Cluster availability is directly impacted


What Is Electronic Prognostics (EP)?


Electronic Prognostics (EP) is a multivariate anomaly detection approach that monitors all telemetry metrics for a server and learns the patterns of correlation among the signals while the server is in a healthy state. In surveillance mode, EP then proactively detects the incipience and early progression of degradation mechanisms that lead to electronic failures of chips and system boards, or to anomalies that cause entire servers to crash. The approach is well proven on business-critical IT systems in data centers.


EP does not wait for alarms: it predicts failures before they occur.


TNP’s EP capability is powered by AI-MSET™, a multivariate machine-learning approach designed to model normal system behavior and detect subtle deviations in learned correlation patterns that indicate emerging faults in CPUs, GPUs, and system boards.


Statistical Detection with Extremely Low False Alarm Probabilities


Deviations from the learned multivariate baseline are evaluated using Sequential Probability Ratio Tests (SPRTs). When combined with DeQuantized telemetry, this provides:


  • Statistically rigorous change-point detection
  • Confidence-graded alerts
  • Extremely low false-alarm and missed-alarm probabilities, even in highly instrumented, quantized environments
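
For readers unfamiliar with the test, the following is a minimal sketch of Wald's classic SPRT for a shift in the mean of Gaussian residuals. It illustrates the general technique only; the parameter values, the reset-on-accept behavior, and the `first_sprt_alarm` helper are assumptions for this example and not a description of TNP's implementation.

```python
import math


def first_sprt_alarm(residuals, shift, sigma, alpha=0.01, beta=0.01):
    """Return the index at which the SPRT first accepts H1, or None.

    H0: residuals ~ N(0, sigma^2)      (healthy)
    H1: residuals ~ N(shift, sigma^2)  (degraded)
    alpha and beta are the target false-alarm and missed-alarm probabilities.
    """
    upper = math.log((1 - beta) / alpha)    # crossing accepts H1 (alarm)
    lower = math.log(beta / (1 - alpha))    # crossing accepts H0 (healthy)
    llr = 0.0
    for i, r in enumerate(residuals):
        # Log-likelihood ratio increment for a Gaussian mean shift.
        llr += (shift / sigma ** 2) * (r - shift / 2)
        if llr >= upper:
            return i
        if llr <= lower:
            llr = 0.0                        # healthy decision: restart test
    return None


# Hypothetical residual stream: healthy for 50 samples, then a one-sigma shift.
stream = [0.0] * 50 + [1.0] * 20
print(first_sprt_alarm(stream, shift=1.0, sigma=1.0))   # prints 59
```

In this contrived stream the shift begins at index 50 and the test alarms ten samples later. The two decision boundaries are set directly from the chosen false-alarm and missed-alarm probabilities, which is what allows an alert to carry a stated confidence level.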


Solving the Supplier’s NFF Problem


For system manufacturers and component suppliers, unexpected failures create an additional challenge: No-Fault-Found (NFF) board returns.


Quantization-induced false alarms and missed detections are a major contributor to NFFs, because boards are often removed from service without any reproducible failure signature. AI-MSET™-enabled logical Black Box Recorder (BBR) files, generated from DeQuantized multivariate telemetry, eliminate NFFs by providing quantitative, time-resolved degradation signatures. These signatures allow design engineers and component suppliers to:


  • Pinpoint the true onset of degradation
  • Distinguish real failure mechanisms from monitoring artifacts
  • Disambiguate sensor anomalies from anomalies in the assets
  • Perform fast, accurate, reproducible Root Cause Analysis that feeds quantitative Pareto analyses, so the supplier’s design teams can prioritize modifications that eliminate the most prominent (or most costly) fault mechanisms in future platform revisions
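
A Pareto analysis of this kind is simple to compute once each returned board carries a root-cause label. The sketch below is illustrative only: the labels and counts are invented placeholders, not field data, and `pareto_table` is a hypothetical helper.

```python
from collections import Counter


def pareto_table(root_causes):
    """Rank causes by frequency and attach the cumulative share (percent)."""
    counts = Counter(root_causes).most_common()
    total = sum(n for _, n in counts)
    rows, running = [], 0
    for cause, n in counts:
        running += n
        rows.append((cause, n, round(100 * running / total, 1)))
    return rows


# Placeholder labels, one per analyzed board return (not real data).
labels = (["solder joint"] * 5 + ["capacitor aging"] * 3
          + ["sensor drift"] * 2)

for cause, n, cumulative in pareto_table(labels):
    print(f"{cause:16s} {n:2d} {cumulative:5.1f}%")
```

The cumulative column shows how few mechanisms account for most returns, which is the information a design team would use to rank fixes.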


How AI-MSET™ Enables Predictive Detection


Multivariate Pattern Learning with AI-MSET™


AI-MSET™ models the normal multivariate behavior of complex electronic systems, including:


  • Processors
  • Power delivery networks
  • Thermal subsystems
  • Memory and interconnects
  • Board-level electronics


Rather than evaluating signals one at a time (traditional univariate anomaly detection), AI-MSET™ learns how dozens, hundreds, or thousands of dynamic signals behave together under healthy operating conditions. This multivariate-trained AI model becomes a high-fidelity baseline against which all future behavior is compared.


Because degradation mechanisms affect multiple subsystems simultaneously, pattern-based detection is essential.
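
The idea of comparing each new observation against a learned healthy baseline can be sketched with auto-associative kernel regression, a generic similarity-based estimator. This is a swapped-in textbook technique and not AI-MSET™ itself, whose internals are not described on this page. The three normalized signals, the bandwidth, and all function names are assumptions.

```python
import math


def kernel_estimate(query, memory, bandwidth=0.1):
    """Estimate the expected healthy state for `query`.

    Each stored healthy vector is weighted by a Gaussian kernel of its
    distance to the query; the estimate is the weighted average.
    """
    weights = []
    for vec in memory:
        dist_sq = sum((q - v) ** 2 for q, v in zip(query, vec))
        weights.append(math.exp(-dist_sq / (2 * bandwidth ** 2)))
    total = sum(weights)
    dims = len(query)
    return [sum(w * vec[d] for w, vec in zip(weights, memory)) / total
            for d in range(dims)]


def residual_norm(query, memory):
    """Euclidean size of the gap between observed and expected state."""
    expected = kernel_estimate(query, memory)
    return math.sqrt(sum((q - e) ** 2 for q, e in zip(query, expected)))


# Hypothetical healthy memory: three normalized signals that move together
# (load, temperature, voltage margin) at eleven operating points.
memory = [(u / 10, 0.5 * u / 10 + 0.2, 1 - u / 10) for u in range(11)]

healthy = (0.55, 0.475, 0.45)    # consistent with the learned pattern
degraded = (0.55, 0.775, 0.45)   # same load, temperature running high

print(residual_norm(healthy, memory), residual_norm(degraded, memory))
```

The healthy observation yields a residual near zero, while the observation that breaks the learned relationship yields a clearly larger one. Residuals of this kind are what a downstream statistical test would evaluate.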


Statistical Detection with Extremely Low False Alarms


Deviations from the learned multivariate baseline are evaluated using Sequential Probability Ratio Tests (SPRTs). This provides:


  • Statistically rigorous change-point detection
  • Confidence-graded alerts
  • Extremely low false-alarm and missed-alarm probabilities


The result is early, reliable predictive warnings, often days or weeks before failure, without overwhelming operators with noise.


Enabling Dynamic Reconfiguration (DR)


Predictive lead time enables a powerful operational capability: Dynamic Reconfiguration (DR).

When AI-MSET™ identifies a system board at risk:


  1. Workload mobility software migrates active workloads to a healthy (idle-standby) board in a nearby machine.
  2. The at-risk board is transitioned to an idle state.
  3. The board is replaced during planned maintenance.
  4. Workloads are dynamically reconfigured back onto the new board.


The end customer never experiences an outage. Failures are prevented rather than reacted to.
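
The four steps above can be expressed as a small planning routine. This is a sketch under stated assumptions: the `Board` record, the state names, the action labels, and `plan_reconfiguration` are hypothetical and do not correspond to any particular workload-mobility product.

```python
from dataclasses import dataclass


@dataclass
class Board:
    name: str
    state: str          # "active" or "idle-standby" (assumed state names)
    at_risk: bool = False


def plan_reconfiguration(boards):
    """Return ordered (action, source, target) steps for each at-risk board."""
    spares = [b for b in boards if b.state == "idle-standby" and not b.at_risk]
    plan = []
    for board in boards:
        if not (board.at_risk and board.state == "active"):
            continue
        if not spares:
            plan.append(("escalate-no-spare", board.name, None))
            continue
        spare = spares.pop(0)
        plan.append(("migrate-workloads", board.name, spare.name))
        plan.append(("set-idle", board.name, None))
        plan.append(("replace-at-planned-maintenance", board.name, None))
        plan.append(("migrate-workloads", spare.name, board.name))
    return plan


fleet = [Board("board-A", "active", at_risk=True),
         Board("board-B", "idle-standby")]
for step in plan_reconfiguration(fleet):
    print(step)
```

The explicit "escalate-no-spare" branch is a design choice for the sketch: the workflow depends on a healthy standby board being available, so its absence should be surfaced rather than silently skipped.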


Black Box Recorders (BBRs) and Traceback Root Cause Analysis (RCA)


TNP’s EP solution includes Black Box Recorder (BBR) files.


BBRs are not physical devices. They are compact, logical flat files stored in the data historian DB alongside existing OS log files and telemetry archives (e.g., Prometheus™ and/or DCGM™ exports).

BBRs capture:


  • Pre-failure telemetry history
  • AI-MSET™ residuals
  • Exact SPRT alarm traces identifying which individual signals, or combinations of signals, are exhibiting degradation, with exact timestamps for the onset of early-warning anomalous behavior
  • The dynamic evolution of a “severity index” for the asset


Together, these records enable fast, accurate traceback root cause analysis (RCA). See the publications in the Bibliography for real use-case examples from high-end enterprise servers in data centers.
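
This page does not publish a BBR file format, so the record below is purely a hypothetical layout, written as one JSON object per line to match the "flat file" description. Field names, signal names, and values are placeholders.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class BBRRecord:
    """One hypothetical Black Box Recorder entry (schema is an assumption)."""
    timestamp: str        # ISO 8601 time of the observation
    asset_id: str         # server or board identifier
    signal: str           # telemetry signal name
    observed: float       # measured value
    residual: float       # observed minus model estimate
    sprt_alarm: bool      # whether the SPRT tripped on this sample
    severity_index: float


def to_flat_file_line(record):
    """Serialize a record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)


def from_flat_file_line(line):
    """Parse a JSON line back into a record."""
    return BBRRecord(**json.loads(line))


example = BBRRecord("2025-01-01T00:00:00Z", "board-A", "vcore_volts",
                    0.91, -0.04, True, 0.6)
line = to_flat_file_line(example)
print(line)
```

A line-oriented text format is used here only because it can sit next to ordinary log files and be appended to without rewriting earlier entries.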


Proven Lineage, Failure Mechanisms, and Business Impact


Pioneering Experience in Electronic Prognostics


TNP’s technology innovators bring over two decades of pioneering experience in MSET- and SPRT-based Electronic Prognostics for complex engineered systems. Through extensive refereed journal and conference publications, they demonstrated how multivariate models can learn the correlation patterns present in normal systems and apply those models to hundreds of thousands of telemetry signals across large computing environments.


This work established the capability to proactively detect the incipience of degradation mechanisms leading to failures of CPUs, GPUs, system boards, and the diverse electronic and electromechanical components mounted on those boards.


The original MSET methodology revolutionized multivariate anomaly detection across industries ranging from nuclear energy and aerospace to enterprise IT. TNP’s third-generation AI-MSET™ builds on that foundation, achieving even lower false-alarm probabilities, scaling to million-sensor fleets, and operating out-of-the-box with modern telemetry frameworks including Prometheus™, Redfish™, Sentry™, and NVIDIA™ DCGM™.


Failure Precursor Mechanisms Detected by AI-MSET™


AI-MSET™ has demonstrated the ability to proactively detect the onset of a wide range of degradation mechanisms that ultimately cause enterprise computing failures. These include mechanical, thermal, electrical, and materials-related mechanisms affecting CPUs, GPUs, and system boards.


Table 1 provides representative examples of CPU and GPU package failure precursor mechanisms proactively detected by AI-MSET™, including solder joint degradation, thermal attach issues, power delivery degradation, interconnect wear-out, vibration-induced failures, capacitor aging, and sensor drift.

                                                                                                                                                                                                              

Solving the Supplier’s NFF Problem


For system manufacturers and component suppliers, unexpected failures create an additional challenge: No-Fault-Found (NFF) board returns.


Returned boards whose failures cannot be reproduced in test environments:


  • Drive massive sparing inventories. If a platform has a 50% NFF rate for expensive System Boards, 50% more spares are needed in the sparing depots around the world
  • Complicate sparing supply-chain logistics and challenge service-level agreement (SLA) commitments
  • Obscure true failure mechanisms
  • Slow platform improvement cycles


AI-MSET™-enabled BBRs eliminate NFFs by providing quantitative, time-resolved degradation signatures. These signatures allow design engineers and component suppliers to identify root causes precisely and eliminate failure mechanisms in future platform revisions.


Economic Impact and Summary


For enterprise Owners, EP enables:


  • Near-zero unplanned outages
  • Significantly reduced MTTRs
  • Smaller on-site spare inventories
  • Predictable cluster performance


For Suppliers, EP enables:


  • Elimination of NFFs
  • Reduced global spares
  • Faster service resolution
  • Accelerated design improvement
  • Improved reliability & customer satisfaction


Electronic Prognostics powered by AI-MSET™ transforms enterprise computing reliability from reactive monitoring to predictive prevention with confidence.
 

It is advanced reliability engineering for the AI era.

Electronic Prognostics Bibliography


Historical Perspective

  

EP Predictive Detection of Incipient Degradation Modes in High-End Enterprise Servers

“No Faults Found” Avoidance for Failed System Boards

Original development of MSET-based Electronic Prognostics (proactive detection of all types of incipient degradation mechanisms that cause CPUs, GPUs, System Boards, and Enterprise Servers to be at elevated risk of failure, days and often weeks in advance of failure), and avoidance of “No Faults Found” (NFFs, also called “No Trouble Found” NTFs).

Originally published as Sun Microsystems Contrarian Minds Blog (2004)

Copyright © 2025 True North Prognostics - All Rights Reserved.



True North Prognostics, LLC

614 5th Ave. Ste D-1

San Diego, CA 92101

Phone: 844-565-2770

Fax: 866-476-9393

info@tnprognostics.com
