\chapter{Background}\label{s:background}

\section{Digital Twinning}\label{ss:digital-twinning}
% To fix: remove the \gls commands for ExaDigiT.
% This is getting silly.

\gls{ed}~\cite{DBLP:conf/sc/BrewerMKWBHSGGW24} is an open-source framework for developing digital twins of supercomputers.
It consists of three modules:
\begin{enumerate*}[label=(\arabic*)]
    \item a resource allocator and power simulator,
    \item a thermal cooling model, and
    \item an augmented reality 3D model
\end{enumerate*}
of the supercomputer.
\gls{ed} has been used at the Frontier supercomputer at the Oak Ridge National Laboratory in the USA, where it successfully predicted potential energy losses.
Alongside the framework architecture, Brewer \etal provide an open-source artifact and an extensive set of verification and validation experiments.

The authors distinguish several kinds of digital twin within \gls{ed}, namely the
\begin{enumerate*}[label=(\arabic*)]
    \item descriptive twin,
    \item informative twin,
    \item predictive twin,
    \item comprehensive twin, and
    \item autonomous twin,
\end{enumerate*}
which together form \gls{ed}.
The \emph{predictive twin} leverages data-driven operational analytics to create \gls{ml} models.
The authors argue that, alongside simulation, \gls{ml} models should also play a significant role in modeling system workloads, \eg in application fingerprinting.
Within the \emph{autonomous twin}, the authors use \gls{rl} to train agents that make control decisions in order to optimize different processes.

To model the cooling system, the authors use the Modelica software, and to predict power draw they use a Python script.
The authors provide an intuitive way to interact with the system through a visual dashboard and an advanced augmented reality model.
They posit that the best way to address the three V's of data (velocity, volume, and variety) is to couple augmented reality with dashboards.

SmartDC~\cite{DBLP:conf/noms/ZhangZLZWC22} is a digital twin solution for optimizing power consumption in datacenters.
Specifically, Zhang \etal propose that \gls{ai}-enhanced modeling paired with digital twinning can help make dynamic adjustments to the datacenter cooling subsystem.
SmartDC is reported to achieve an energy-saving rate of 41\% at a China Telecom datacenter.
However, the main purpose of SmartDC is not to interact continuously with the facility, but to provide additional datasets for training a more accurate \gls{ml} solution.
% This digital twin together with ExaDigiT use computational fluid dynamics (CFD).
% ExaDigiT uses open-source Modelica software and SmartDC uses proprietary 6SigmaDC.
% At this point it would make sense to create the distinction between _structural_ digital twinning and _behavioural_ digital twinning.
% Link to 6SigmaDC: https://www5.cadence.com/trial_datacenter_insights_lp.html

DyTwin~\cite{DBLP:conf/sc/TaheriBPRHDEWPM24} is an adaptive digital twin with visualization and anomaly detection features.

Predictive modeling uses statistics to predict outcomes.
When deployed commercially, for example in datacenters, predictive modeling is often referred to as predictive analytics~\cite{Wikipedia:PredictiveModelling}.
Almost any statistical model can be used for prediction, but predictive analytics is nowadays often treated as synonymous with machine learning.
A widely used example is linear regression.
A major limitation of predictive analytics is that history cannot always predict the future.
Using historical data to predict outcomes works only under the assumption that the system exhibits long-lasting patterns.
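To make this assumption concrete, consider simple linear regression as an illustrative sketch (the symbols $x_i$, $y_i$, $\beta_0$, $\beta_1$, and $\varepsilon_i$ are notation introduced here for illustration, not taken from the cited works).
Given $n$ historical observations $(x_i, y_i)$, the model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ is fitted by least squares:
\[
    \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
    \qquad
    \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},
\]
where $\bar{x}$ and $\bar{y}$ denote the sample means.
A prediction $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ for a new input $x$ is then only as reliable as the assumption that the relation observed in the historical data still holds.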
Additionally, no matter how extensive the training data is, there is always the possibility of new variables that have not been considered or even defined, yet are critical to the outcome of the prediction~\cite{Wikipedia:PredictiveModelling}.

%Here you have to cite Deisenroth, 2024, chapter 8.1.4.
An inference function is a machine learning model that uses probabilistic parameter estimation~\cite{}.
A prime example of using probability to find a good machine learning model is Bayesian inference.
% Stanford Encyclopedia of Philosophy, Douven 2017
Inference from data to the best explanation is called abduction.

%Include something about data-preprocessing in the pipeline.
%See the article by Fei Tao

\subsection{Datacenter simulation}\label{sss:simulation}
\input{sources/dt_features_comparison.tex}

\section{Datacenters}\label{ss:datacenters}
Explain the high-risk phenomena that occur in datacenters, including failures.
% Ask Jesse if you can have both of such tables in this section
\input{sources/simulator_comparison.tex}

One of the key arguments for a datacenter digital twin is that datacenters are already instrumented with hundreds of monitoring sensors and collect the data these produce.
Monitoring of server racks and VMs, together with CPU profiling, yields large volumes of data.
Data analytics, such as ODA, can turn this data into meaningful insights about datacenter operation.
Moreover, advances in sensor and IoT technology have further increased the amount of available information.
ODA can help predict failures, support equipment maintenance, and cut operating costs.
A key remaining challenge, however, is connecting the physical and virtual spaces.
A digital twin addresses this challenge.
%[citation needed]

As of 2026, there is no consensus on what a digital twin is.
By extension, there is also no consensus on the definition of a datacenter digital twin.
A generic definition is needed.

\subsection{Failures}\label{sss:failures}
%Why predictive analytics?
% Why predictive behaviour?
%What is below here is true, but nonetheless the argumentation should be slightly changed. And a citation is needed.
So far, little effort has been made to integrate analytics that enable consistent and reliable prediction of datacenter behavior into a holistic digital twin of a datacenter.
Nor has the fidelity of failure modeling in datacenter simulation increased: the failure model is still linear.
% Since a datacenter simulator is quite different from a digital twin, we cannot use the same computation methods (not as they are right now, at least) -- we must adapt them.
The prediction models used for the digital twin are currently the same as those used for the datacenter simulator.
Because a digital twin is not a standalone simulator, both the prediction and the modeling of failures must change.
The longer the digital twin runs, the more accurate its predictions become; all results are aggregated.

% Why has not anyone done this before?
Moreover, this has become possible only because of recent developments in high-performance computing (HPC).
Between 2003 and 2011, the compute needed to run a digital twin was simply not available: the concept existed, but the hardware had not yet caught up.
In the last decade, however, multicore computing paradigms and the advent of GPU computing have finally enabled the computation needed to run a digital twin.
As a result, running a digital twin is far more relevant today than it was ten years ago.
This also explains why digital twins of datacenters have emerged only recently: the current widespread availability of HPC makes them feasible.

Because of judgment born of experience, the evolution of existing datacenters is fairly successful; however, the development of new, modern datacenters is fraught with unexpected problems that result in schedule delays and cost overruns.
Optimal datacenter management is characterized by high service availability and low downtime.
Achieving this in a 21\textsuperscript{st}-century datacenter requires revolutionary changes in the way datacenters are operated and maintained.
One concept that promises such a change is the \gls{dcdt}.