
\chapter{Background}\label{s:background}

\section{Overview}\label{s:background_overview}

\section{Datacenters}\label{ss:datacenters}

\subsection{Computing Infrastructure}\label{sss:failures}

\subsection{Datacenter Simulation}\label{sss:simulation}

Predictive modelling uses statistics to predict outcomes.
When deployed commercially, for example in datacenters, predictive modelling is often referred to as predictive analytics~\cite{Wikipedia:PredictiveModelling}.
Almost any statistical model can be used for prediction, but today predictive analytics is often treated as synonymous with machine learning.
A common example of such an analysis is linear regression.
A major limitation of predictive analytics is that history cannot always predict the future.
Using historical data to predict outcomes works only under the assumption that the system exhibits certain long-lasting patterns.
Additionally, however extensive the training data, there may always be variables that have not been considered, or even defined, yet are critical to the outcome of the prediction~\cite{Wikipedia:PredictiveModelling}.
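
To make the linear-regression example concrete, consider a minimal sketch in our own notation (the symbols below are illustrative assumptions and are not taken from the cited source).
Given $n$ historical observations $(x_i, y_i)$, a simple linear model is fitted by least squares,
\[
	(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 ,
\]
and the prediction for a new input $x$ is
\[
	\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x .
\]
The prediction is meaningful only as long as the relationship captured by $\hat{\beta}_0$ and $\hat{\beta}_1$ persists, which is exactly the assumption of long-lasting patterns stated above.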

%Here you have to cite Deisenroth, 2024, chapter 8.1.4.
An inference function is a machine learning model which uses probabilistic parameter estimation~\cite{}.
A prime example of using probability to find a good machine learning model is Bayesian inference.
% Stanford Encyclopedia of Philosophy, Douven 2017
Inference from data to the best explanation is called abduction.
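
To make the Bayesian inference mentioned above concrete, we give a minimal sketch in our own notation (the symbols $\theta$ for the model parameters and $\mathcal{D}$ for the observed data are illustrative assumptions).
A prior belief $p(\theta)$ is updated after observing $\mathcal{D}$ according to Bayes' theorem:
\[
	p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})},
	\qquad
	p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta .
\]
The posterior $p(\theta \mid \mathcal{D})$ then expresses how plausible each parameter value is in light of the data, rather than committing to a single point estimate.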


%What is below here is true, but nonetheless the argumentation should be slightly changed. And a citation is needed.
However, little effort has been made to integrate analytics that enable consistent and reliable prediction of datacenter behaviour into a holistic digital twin of a datacenter.
Nor has the fidelity of failure modeling inside a datacenter simulation increased:
the failure model is still a linear model.
Because a datacenter simulator differs substantially from a digital twin, the same computation methods cannot be used in their current form and must be adapted.
At present, the prediction models used for the digital twin are the same as those used for the datacenter simulator.
Since a digital twin is not a standalone simulator, both the way we predict failures and the way we model them must change.

\ipsum[1-2]

\input{sources/simulator_comparison.tex}
\section{Digital Twinning}\label{ss:digital-twinning}
% To fix: remove the \gls commands for ExaDigiT.
% This is getting silly.
\subsection{What is Digital Twinning?}\label{sss:what_is_digital_twinning}
% Here talk a bit about different types of data analytics that are performed in a digital twin.
``A \emph{digital twin} is a set of virtual information constructs that mimics the structure, context and behaviour of a natural, engineered or social system, is dynamically updated with data from its physical twin, has predictive capability, and informs decisions that realize value''~\cite{DBLP:usdoe/report/AP26894}.
A crucial characteristic that differentiates digital twinning from simulation and statistical modelling is the \emph{digital thread}: a bi-directional channel that enables continuous interaction between the virtual and physical entities.
The longer a \gls{dt} operates, the more accurate its predictions become, because a holistic twin combines historical patterns with up-to-date monitoring data.
A generic \gls{dt} architecture, adapted from Tao \etal~\cite{DBLP:conf/cirp/TAO2018169}, is depicted in Figure~\ref{fig:five_dimensional_dt} (see also Section~\ref{s:intro}).

% Why has not anyone done this before?
Digital twinning has only recently become feasible, owing to developments in \gls{hpc}.
Between 2003 and 2011, the compute capacity needed to run a digital twin was simply not available.
Thus, while the concept existed, the hardware had not yet caught up.
In the last decade, however, multicore computing paradigms and the advent of GPU computing have finally enabled the computation needed to run digital twins.
As a result, digital twins are more relevant today than they were ten years ago~\cite{DBLP:conf/cirp/TAO2018169}.

\begin{figure}
	\centering
	\includegraphics[width=0.95\linewidth]{images/five_dimensional_dt.pdf}
	\caption{A basic framework for the \gls{dt}. Four core elements of a \gls{dt} are defined: the physical entity \one, the simulated virtual twin \two, a service for out-of-band data analytics \three, and a persistent store of historical data \four. The latter two are crucial to the \gls{dt} because they are necessary to gain meaningful monitoring insights. Adapted from Tao \etal~\cite{DBLP:conf/cirp/TAO2018169}.}
	%Fei Tao is a renowned figure with over 62k citations. He is a figure of authority on digital twins.%
	\label{fig:five_dimensional_dt}
\end{figure}
% (3) in the original paper by Fei Tao is referenced to just `Services`.
% Nonetheless I name them here as Data Analysis Services, because what Fei Tao lists (e.g., fault detection, fault determination, fault-tolerant management, maintenance) is inherently reliant on good data analytics.
\subsection{Digital Twins across Domains}\label{sss:digital_twins_across_domains}

\subsection{Digital Twins for Datacenters}\label{sss:digital_twins_for_datacenters}

In this section, we survey the work related to datacenter digital twinning.
We summarize our results in Table~\ref{tab:dt_features_comparison} to compare and contrast the features of existing datacenter digital twins.
We select only the digital twins that adhere most closely to the \gls{nasem} definition~\cite{DBLP:usdoe/report/AP26894}.

\input{sources/dt_features_comparison.tex}

ExaDigiT~\cite{DBLP:conf/sc/BrewerMKWBHSGGW24} is an open-source framework for developing digital twins of supercomputers.
It consists of three modules:
\begin{enumerate*}[label=(\arabic*)]
	\item a resource allocator and power simulator,
	\item a thermal cooling model, and
	\item an augmented reality 3D model
\end{enumerate*}
of the supercomputer.
ExaDigiT has been used at the Frontier supercomputer at Oak Ridge National Laboratory in the USA, where it successfully predicted potential energy losses.
Alongside the framework architecture, Brewer \etal provide an open-source artifact and an extensive set of verification and validation experiments.
The authors distinguish five kinds of digital twin within ExaDigiT: \begin{enumerate*}[label=(\arabic*)]
	\item the descriptive twin,
	\item the informative twin,
	\item the predictive twin,
	\item the comprehensive twin, and
	\item the autonomous twin,
\end{enumerate*}
which together form the system.
The \emph{predictive twin} leverages data-driven operational analytics to create \gls{ml} models.
The authors argue that, alongside simulation, \gls{ml} models should play a significant role in modeling system workloads, \eg in application fingerprinting.
Within the \emph{autonomous twin}, the authors use \gls{rl} to train agents that make control decisions to optimize different processes.
To model the cooling system the authors use Modelica, and to predict power draw they use a Python script.
The authors provide an intuitive way to interact with the system through a visual dashboard and an advanced augmented reality model.
The authors posit that the best way to address the three Vs of data (velocity, volume, and variety) is to couple augmented reality with dashboards.

SmartDC~\cite{DBLP:conf/noms/ZhangZLZWC22} is a digital twin solution for optimizing power consumption in datacenters.
Specifically, Zhang \etal propose that \gls{ai}-enhanced modeling paired with digital twinning can help make dynamic adjustments to the datacenter cooling subsystem.
SmartDC is reported to achieve an energy-saving rate of 41\% at a China Telecom datacenter.
However, the main purpose of SmartDC is not to interact continuously with the facility, but to provide additional training data for a more accurate \gls{ml} solution:
the digital twin is designed to supply extra datasets for training \gls{ai} models.
% This digital twin together with ExaDigiT use computational fluid dynamics (CFD).
% ExaDigiT uses open-source Modelica software and SmartDC uses proprietary 6SigmaDC.
% At this point it would make sense to create the distinction between _structural_ digital twinning and _behavioural_ digital twinning.
% Link to 6SigmaDC: https://www5.cadence.com/trial_datacenter_insights_lp.html

DyTwin~\cite{DBLP:conf/sc/TaheriBPRHDEWPM24} is an adaptive digital twin with visualization and anomaly detection features.

% What is more, Microsoft already offers digital twinning as a service https://azure.microsoft.com/en-us/products/digital-twins/
% Documentation: https://learn.microsoft.com/en-us/azure/digital-twins/
% Moreover, NVIDIA is doing too as well https://www.nvidia.com/en-sg/omniverse/

Many \glspl{dcdt} model the cooling systems inside the warehouse, because in a typical datacenter cooling accounts for more than 40\% of total electricity usage~\cite{DBLP:conf/AppliedEnergy/Zhao20}.
Since the cooling subsystem is mainly airflow-based, \gls{dt} designers often opt for a \gls{cfd} approach to model the facility.
A digital twin of the cooling subsystem is needed primarily because operational strategies are often inefficient:
the cooling system parameters are frequently held constant, regardless of outdoor temperature and similar conditions~\cite{DBLP:conf/AppliedEnergy/Zhao20}.
%Zhang \etal argues that their system is akin to an IoT sensor, essentially.
% This is an important consideration -- DT is not simply a sensor, it must have predictive capabilities and be able to simulate the future.
% Zhang argues that ``digital twin services'' are enabled by simulation monitoring \etc.
% Nonetheless, I dub that they are primarily data analysis services.

\gls{oda} can be performed either in-band (in real time) or out-of-band (on historical data).
Likewise, Zhao \etal show that ``always-on'' analytics (akin to in-band \gls{oda}) and ``on-demand'' analytics (akin to out-of-band \gls{oda}) are crucial to the digital twin system.

%Include something about data-preprocessing in the pipeline.
%See the article by Fei Tao

\begin{figure}
	\centering
	\includegraphics[width=0.7\linewidth]{images/system_model.pdf}
	\caption{A generic system model for data center digital twin deployments.}
	\label{fig:system_model}
\end{figure}

The design of DyTwin~\cite{DBLP:conf/sc/TaheriBPRHDEWPM24} incorporates a ``virtual-to-virtual'' digital thread between different digital twins.
Zhao \etal include this element in their architecture too~\cite{DBLP:conf/AppliedEnergy/Zhao20}.
Moreover, a crucial parallel between the work of Zhao \etal and ExaDigiT is the concept of multiple models within a single digital twin.
Brewer \etal likewise argue that ExaDigiT is composed of five ``smaller'' twins.

%In Zhang \etal the digital twin can communicate with different other digital twins, as in the work of Taheri \etal.
%To do this, the working program has an API, with a specific API endpoint to communicate with other Digital Twins.
%In your work, consider adding such an endpoint, albeit explain in future work that you envision \emph{implementing} this endpoint in the future.