
The HPC and AI Data Center Reliability Challenges
Modern enterprise data centers operate some of the most valuable computing assets in the world. Large-scale clusters composed of CPUs, GPUs, ASICs, and densely populated system boards support mission-critical AI training, inference, analytics, and cloud workloads. These assets are expected to operate continuously, often under extreme thermal, electrical, and duty-cycle stresses.
Despite extensive onboard instrumentation, unexpected hardware failures remain common and increasingly expensive. When CPUs, GPUs, or system boards fail without warning, the consequences include lost availability, degraded cluster throughput, emergency maintenance, and significant operational disruption, especially during long AI training runs.
The root cause is not a lack of telemetry. Modern servers and accelerators generate enormous volumes of data. The problem is that conventional monitoring approaches cannot interpret this data in a predictive way.
Quantized Signals Mask Early Degradation
Many telemetry signals in modern enterprise hardware are quantized by design, with a leading source of quantization originating in low-resolution Analog-to-Digital (A/D) conversion chips embedded throughout servers, GPUs, system boards, and power delivery subsystems. For cost, bandwidth, and legacy reasons, most platforms rely on 8-bit (or similarly low-resolution) A/D converters, which fundamentally limit signal fidelity before the data ever reaches monitoring software.
As a result, many commonly used telemetry signals appear as:
Early-stage degradation mechanisms rarely produce large excursions in any single signal. Instead, they manifest as subtle, correlated pattern changes across many signals. Quantization collapses these continuous dynamics into “stair-step” behavior, masking early degradation signatures and causing conventional anomaly detection algorithms to suffer from:
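The stair-step effect can be illustrated with a minimal sketch. It assumes an idealized 8-bit converter spanning a hypothetical 0 to 100 range and a made-up sensor reading that drifts smoothly upward; the `quantize` helper and all values are illustrative and do not describe any specific platform.

```python
def quantize(value, v_min=0.0, v_max=100.0, bits=8):
    """Round a reading to the nearest level of an idealized N-bit converter."""
    step = (v_max - v_min) / (2 ** bits - 1)   # size of one quantization step
    clipped = min(max(value, v_min), v_max)
    return v_min + round((clipped - v_min) / step) * step

# Hypothetical sensor: a reading that drifts smoothly from 60.0 to 61.0.
true_signal = [60.0 + i / 199 for i in range(200)]
reported = [quantize(v) for v in true_signal]

distinct_true = len(set(true_signal))
distinct_reported = len(set(reported))
print(f"distinct values in the true signal: {distinct_true}")
print(f"distinct values after 8-bit quantization: {distinct_reported}")
```

The smooth drift survives only as a handful of flat plateaus, so any change smaller than one quantization step is invisible in the reported data.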
TNP DeQuantize: Restoring Signal Fidelity Upstream of EP
TNP solves this problem using DeQuantize, a signal-reconstruction capability applied upstream of the core multivariate anomaly detection algorithm (AI-MSET™).
DeQuantize transforms low-resolution, quantized telemetry into high-fidelity signals that more accurately represent the underlying physical behavior of the hardware, without requiring any new sensors or hardware changes. Depending on the severity of quantization, DeQuantize reconstructs lost dynamics using proven spectral and statistical techniques derived from expired prior art.
By restoring meaningful signal structure before anomaly detection is performed, DeQuantize:
This DeQuantize → AI-MSET™ architecture is essential for achieving reliable Electronic Prognostics in modern data centers and is a key reason why TNP’s EP outperforms conventional monitoring approaches.
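The general idea of recovering fidelity from quantized data can be sketched with a deliberately simple stand-in. The example below is not the DeQuantize algorithm, whose details are not given here. It uses a plain centered moving average, a basic statistical low-pass filter, on a simulated signal. The ramp, the noise level, the 0 to 100 converter range, and the window size are all assumptions chosen for illustration.

```python
import math
import random

def quantize(value, v_min=0.0, v_max=100.0, bits=8):
    """Round a reading to the nearest level of an idealized N-bit converter."""
    step = (v_max - v_min) / (2 ** bits - 1)
    clipped = min(max(value, v_min), v_max)
    return v_min + round((clipped - v_min) / step) * step

def moving_average(samples, half_width=25):
    """Centered moving average; the window shrinks symmetrically near the ends."""
    n = len(samples)
    smoothed = []
    for i in range(n):
        h = min(half_width, i, n - 1 - i)
        window = samples[i - h:i + h + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def rms_error(estimate, truth):
    """Root-mean-square difference between an estimate and the true signal."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth))

random.seed(0)                                    # reproducible simulated noise
n = 1000
truth = [60.0 + i / (n - 1) for i in range(n)]    # slow underlying drift
noisy = [t + random.gauss(0.0, 0.2) for t in truth]
reported = [quantize(v) for v in noisy]           # what monitoring software sees
restored = moving_average(reported)

raw_error = rms_error(reported, truth)
restored_error = rms_error(restored, truth)
print(f"RMS error, raw quantized signal: {raw_error:.3f}")
print(f"RMS error, after smoothing:      {restored_error:.3f}")
```

Even this crude filter brings the reported signal closer to the underlying drift, which gives a sense of why reconstruction before anomaly detection can matter. A production approach would need to handle severe quantization and fast dynamics that a moving average cannot.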
Univariate Monitoring Does Not Scale
Conventional monitoring treats each signal independently, applying fixed high/low thresholds. In a large GPU cluster with hundreds of thousands or millions of telemetry signals, this approach is fundamentally unscalable and blind to system-level behavior.
Individual signals may remain “within limits” and generate no high/low threshold alerts while the system as a whole is already degrading. AI-MSET™ catches this degradation from anomalous correlation patterns.
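A two-signal toy case shows how a reading can stay inside its threshold while its relationship to another signal has clearly changed. This is only a sketch: the power and temperature values, the limits, and the straight-line model are hypothetical, and a least-squares fit stands in for the far richer multivariate model described in this document.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical healthy training data: board power (W) and temperature (C).
healthy_power = [200, 250, 300, 350, 400, 450, 500]
healthy_temp = [50.2, 54.9, 60.1, 65.0, 69.8, 75.1, 80.0]
slope, intercept = fit_line(healthy_power, healthy_temp)

TEMP_HIGH_LIMIT = 90.0   # assumed fixed univariate threshold
RESIDUAL_LIMIT = 2.0     # assumed tolerance around the learned relationship

def check(power, temp):
    """Compare a fixed threshold test with a correlation-based residual test."""
    residual = temp - (slope * power + intercept)
    return {
        "threshold_alarm": temp > TEMP_HIGH_LIMIT,
        "correlation_alarm": abs(residual) > RESIDUAL_LIMIT,
        "residual": residual,
    }

# 67 C is well under the 90 C limit, but too hot for a 300 W load.
observation = check(power=300, temp=67.0)
print(observation)
```

The threshold test stays silent because 67 C is an unremarkable temperature in isolation. The residual test flags it because, at that power level, the healthy relationship predicts roughly 60 C.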
Thresholds Detect Failures Too Late
Threshold-based alarms typically fire only when damage is already significant and failure is imminent. At that point:
What Is Electronic Prognostics (EP)?
Electronic Prognostics (EP) is a multivariate anomaly detection algorithm that monitors all telemetry metrics for a server and learns the patterns of correlation among the signals while the server is in a healthy state. In surveillance mode, EP then proactively detects the incipience and early progression of degradation mechanisms that lead to electronic failures of chips or system boards, or to anomalies that cause entire servers to crash. EP is well proven on business-critical IT systems in data centers.
EP does not wait for alarms: it predicts failures before they occur.
TNP’s EP capability is powered by AI-MSET™, a multivariate machine-learning approach designed to model normal system behavior and detect subtle deviations in learned correlation patterns that indicate emerging faults in CPUs, GPUs, and system boards.
Statistical Detection with Extremely Low False Alarm Probabilities
Deviations from the learned multivariate baseline are evaluated using Sequential Probability Ratio Tests (SPRTs). When combined with DeQuantized telemetry, this provides:
Solving the Supplier’s NFF Problem
For system manufacturers and component suppliers, unexpected failures create an additional challenge: No-Fault-Found (NFF) board returns.
Quantization-induced false alarms and missed detections are a major contributor to NFFs, because boards are often removed from service without any reproducible failure signature. AI-MSET™-enabled logical Black Box Recorder (BBR) files, generated from DeQuantized multivariate telemetry, eliminate NFFs by providing quantitative, time-resolved degradation signatures. These signatures allow design engineers and component suppliers to:
How AI-MSET™ Enables Predictive Detection
Multivariate Pattern Learning with AI-MSET™
AI-MSET™ models the normal multivariate behavior of complex electronic systems, including:
Rather than evaluating signals one at a time (traditional univariate anomaly detection), AI-MSET™ learns how dozens, hundreds, or thousands of dynamic signals behave together under healthy operating conditions. This multivariate-trained AI model becomes a high-fidelity baseline against which all future behavior is compared.
Because degradation mechanisms affect multiple subsystems simultaneously, pattern-based detection is essential.
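The learn-then-compare pattern can be sketched with a generic memory-based estimator. The example below uses auto-associative kernel regression as a stand-in. It is not the AI-MSET™ algorithm, and the three normalized signals, the stored states, and the bandwidth are hypothetical. Healthy states are stored in a memory, each new observation is estimated as a similarity-weighted blend of those states, and the residual between observation and estimate is what gets monitored.

```python
import math

def estimate(observation, memory, bandwidth=0.1):
    """Estimate an observation as a Gaussian-weighted average of stored healthy states."""
    weights = []
    for state in memory:
        dist_sq = sum((o - s) ** 2 for o, s in zip(observation, state))
        weights.append(math.exp(-dist_sq / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("observation is too far from every stored healthy state")
    return [
        sum(w * state[j] for w, state in zip(weights, memory)) / total
        for j in range(len(observation))
    ]

def residuals(observation, memory, bandwidth=0.1):
    """Observed value minus model estimate, per signal."""
    est = estimate(observation, memory, bandwidth)
    return [o - e for o, e in zip(observation, est)]

# Hypothetical healthy memory: normalized load, temperature, and fan speed
# that rise and fall together.
memory = [[i / 10, i / 10, i / 10] for i in range(11)]

healthy = residuals([0.5, 0.5, 0.5], memory)
degraded = residuals([0.5, 0.8, 0.5], memory)   # temperature too high for the load

print("healthy residuals: ", [round(r, 3) for r in healthy])
print("degraded residuals:", [round(r, 3) for r in degraded])
```

In the degraded case no single value is extreme, yet the combination does not match any healthy pattern, so the residuals move away from zero and the temperature channel shows the largest deviation.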
Statistical Detection with Extremely Low False Alarms
Deviations from the learned multivariate baseline are evaluated using Sequential Probability Ratio Tests (SPRTs). This provides:
The result is early, reliable predictive warnings, often days or weeks before failure, without overwhelming operators with noise.
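A minimal sketch of a Sequential Probability Ratio Test applied to model residuals follows. It implements the textbook Wald test for a shift in the mean of Gaussian residuals. The shift size, noise level, error probabilities, and residual values are assumptions for illustration, and the function name `sprt_mean_shift` is hypothetical.

```python
import math

def sprt_mean_shift(residuals, shift, sigma, alpha=0.001, beta=0.001):
    """Wald SPRT for H0 (mean 0) against H1 (mean `shift`) with Gaussian noise.

    alpha is the target false-alarm probability and beta the target
    missed-alarm probability. Returns the index of the first sample at which
    H1 is accepted, or None if no alarm is raised.
    """
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this value
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this value
    llr = 0.0                                # cumulative log-likelihood ratio
    for i, r in enumerate(residuals):
        llr += (shift / sigma ** 2) * (r - shift / 2.0)
        if llr >= upper:
            return i
        if llr <= lower:
            llr = 0.0                        # healthy decision: restart the test
    return None

# Hypothetical residual stream: healthy for 50 samples, then a sustained offset.
stream = [0.0] * 50 + [0.5] * 50

no_alarm = sprt_mean_shift([0.0] * 100, shift=0.5, sigma=0.25)
alarm_index = sprt_mean_shift(stream, shift=0.5, sigma=0.25)
print("alarm on healthy stream:", no_alarm)
print("alarm index on degraded stream:", alarm_index)
```

The two decision boundaries are set directly from the chosen false-alarm and missed-alarm probabilities, which is what lets the test accumulate evidence over several samples instead of reacting to a single noisy reading.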
Enabling Dynamic Reconfiguration (DR)
Predictive lead time enables a powerful operational capability: Dynamic Reconfiguration (DR).
When AI-MSET™ identifies a system board at risk:
The end customer never experiences an outage. Failures are prevented rather than reacted to.
Black Box Recorders (BBRs) and Traceback Root Cause Analysis (RCA)
TNP’s EP solution includes Black Box Recorder (BBR) files.
BBRs are not physical devices. They are compact, logical flat files stored in the data historian DB alongside existing OS log files and telemetry archives (e.g., Prometheus™ and/or DCGM™ exports).
BBRs capture:
Proven Lineage, Failure Mechanisms, and Business Impact
Pioneering Experience in Electronic Prognostics
TNP’s technology innovators bring over two decades of pioneering experience in MSET- and SPRT-based Electronic Prognostics for complex engineered systems. Through extensive refereed journal and conference publications, they demonstrated how multivariate models can learn the correlation patterns present in normal systems and apply those models to hundreds of thousands of telemetry signals across large computing environments.
This work established the capability to proactively detect the incipience of degradation mechanisms leading to failures of CPUs, GPUs, system boards, and the diverse electronic and electromechanical components mounted on those boards.
The original MSET methodology revolutionized multivariate anomaly detection across industries ranging from nuclear energy and aerospace to enterprise IT. TNP’s third-generation AI-MSET™ builds on that foundation, achieving even lower false-alarm probabilities, scaling to million-sensor fleets, and operating out-of-the-box with modern telemetry frameworks including Prometheus™, Redfish™, Sentry™, and NVIDIA™ DCGM™.
Failure Precursor Mechanisms Detected by AI-MSET™
AI-MSET™ has demonstrated the ability to proactively detect the onset of a wide range of degradation mechanisms that ultimately cause enterprise computing failures. These include mechanical, thermal, electrical, and materials-related mechanisms affecting CPUs, GPUs, and system boards.
Table 1 below provides representative examples of CPU and GPU package failure precursor mechanisms proactively detected by AI-MSET™, including solder joint degradation, thermal attach issues, power delivery degradation, interconnect wear-out, vibration-induced failures, capacitor aging, and sensor drift.
Solving the Supplier’s NFF Problem
For system manufacturers and component suppliers, unexpected failures create an additional challenge: No-Fault-Found (NFF) board returns.
Returned boards that cannot be reproduced in test environments:
AI-MSET™-enabled BBRs eliminate NFFs by providing quantitative, time-resolved degradation signatures. These signatures allow design engineers and component suppliers to identify root causes precisely and eliminate failure mechanisms in future platform revisions.
Economic Impact and Summary
For enterprise Owners, EP enables:
For Suppliers, EP enables:
Electronic Prognostics powered by AI-MSET™ transforms enterprise computing reliability from reactive monitoring to predictive prevention with confidence.
It is advanced reliability engineering for the AI era.
EP Predictive Detection of Incipient Degradation Modes in High-End Enterprise Servers
“No Faults Found” Avoidance for Failed System Boards
Original development of MSET-based Electronic Prognostics (proactive detection of all types of incipient degradation mechanisms that cause CPUs, GPUs, System Boards, and Enterprise Servers to be at elevated risk of failure, days and often weeks in advance of failure), and avoidance of “No Faults Found” (NFFs, also called “No Trouble Found” NTFs).
Copyright © 2025 True North Prognostics - All Rights Reserved.