STIM: Predicting Memory Uncorrectable Errors with Spatio-Temporal Transformer
Zhexiong Liu, , Siduo Jiang
Research papers and technical reports on machine learning, natural language processing, and AI systems.
Zhexiong Liu, , Siduo Jiang
Siduo Jiang, , William Casey King
Zhexiong Liu, , Siduo Jiang
Siduo Jiang, , Andrew Fogarty, William Casey King, Alberto Todeschini, Hossein Vahabi
US-2025-0272192A1
Systems and methods are directed to training and using a spatial-temporal transformer to predict memory errors. The system aggregates historical data, including error logs from data centers, by time windows and generates, from the aggregated historical data, a spatial representation of the errors and a set of micro features for each time window in an observation period. A memory feature vector is generated for each time window by flattening the spatial representation and appending the corresponding set of micro features to the end of the flattened spatial representation. The spatial-temporal transformer is trained by applying the memory feature vector for each time window to a transformer encoder. This training process is repeated for each observation period within a data collection period. At inference time, a similar process is performed to generate inference memory feature vectors for an inference observation period, which are applied to the trained transformer to predict errors.
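The feature-construction step in the abstract above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the log schema `(timestamp, row, col)`, the 2-D grid used as the spatial representation, the two micro features (total error count and number of distinct affected cells), and the function name `build_memory_feature_vectors`. The transformer encoder itself is not shown; the resulting list of per-window vectors is the sequence it would consume.

```python
def build_memory_feature_vectors(error_logs, window_seconds, num_windows,
                                 grid_shape, start_time=0):
    """Turn raw error logs into one feature vector per time window.

    error_logs: iterable of (timestamp, row, col) tuples (hypothetical schema).
    grid_shape: (rows, cols) of the assumed spatial error map.
    """
    rows, cols = grid_shape
    # One spatial error-count grid per time window in the observation period.
    grids = [[[0] * cols for _ in range(rows)] for _ in range(num_windows)]

    # Aggregate the logs by time window.
    for timestamp, row, col in error_logs:
        window = int((timestamp - start_time) // window_seconds)
        if 0 <= window < num_windows and 0 <= row < rows and 0 <= col < cols:
            grids[window][row][col] += 1

    vectors = []
    for grid in grids:
        # Flatten the spatial representation row by row.
        flat = [count for grid_row in grid for count in grid_row]
        # Hypothetical micro features: total errors and distinct affected cells.
        micro = [sum(flat), sum(1 for count in flat if count > 0)]
        # Append the micro features to the end of the flattened grid.
        vectors.append(flat + micro)
    return vectors


logs = [(5, 0, 1), (12, 1, 1), (14, 1, 1), (31, 0, 0)]
sequence = build_memory_feature_vectors(logs, window_seconds=10,
                                        num_windows=4, grid_shape=(2, 2))
for vector in sequence:
    print(vector)
```

Each vector has `rows * cols + 2` entries, so a 2x2 grid yields six values per window. Repeating the call over successive observation periods would produce the training sequences the abstract describes.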
US-2025-0217177A1
A method, computer program product, and computing system for collecting data concerning interruptions associated with a plurality of virtual machines, and for collecting hardware information concerning one or more nodes hosting the plurality of virtual machines at a time generally contemporaneous with the interruptions of the plurality of virtual machines. A correlation is generated between interruptions of at least a subset of the plurality of virtual machines and one or more hardware component attributes of the one or more nodes.
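As a rough illustration of relating interruptions to hardware attributes, the sketch below groups virtual machines by one attribute of their host node and compares interruption rates across groups. This is only one simple way to surface such a relationship and is not claimed to be the method in the patent. The record layout, the `firmware` attribute, and the function name `interruption_rate_by_attribute` are hypothetical.

```python
from collections import Counter


def interruption_rate_by_attribute(vm_records, attribute):
    """Return the fraction of interrupted VMs for each value of a host attribute.

    vm_records: list of dicts with an "interrupted" flag and a "hardware" dict
    describing the host node when the data was collected (hypothetical schema).
    """
    totals = Counter()
    interrupted = Counter()
    for record in vm_records:
        value = record["hardware"][attribute]
        totals[value] += 1
        if record["interrupted"]:
            interrupted[value] += 1
    # Rate per attribute value; every key in totals has a count of at least 1.
    return {value: interrupted[value] / totals[value] for value in totals}


records = [
    {"vm": "vm-1", "interrupted": True, "hardware": {"firmware": "1.0"}},
    {"vm": "vm-2", "interrupted": True, "hardware": {"firmware": "1.0"}},
    {"vm": "vm-3", "interrupted": False, "hardware": {"firmware": "1.0"}},
    {"vm": "vm-4", "interrupted": False, "hardware": {"firmware": "2.0"}},
]
rates = interruption_rate_by_attribute(records, "firmware")
print(rates)
```

A large gap between groups flags an attribute worth investigating, though a difference in rates alone does not establish that the attribute causes the interruptions.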