\chapter{Introduction}\label{s:intro}
Presently, computer and network systems play a crucial part in the digital industry.
The transport, education and government sectors largely depend on digital services, which are hosted in datacenters~\cite{DBLP:journals/corr/IosupKLVG22}.
To address the recent rise in demand for computation, due to the advancements in Artificial Intelligence, managers expand datacenters with new components and more heterogeneous architectures (\eg GPUs, NPUs)~\cite{DBLP:conf/date/MilojicicFDR21}.
However, this expansion significantly increases datacenter complexity.
To support better operational decisions despite the massive scale, promising technologies such as Datacenter Digital Twins are emerging~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.

% Why is it important?
Datacenters house large numbers of computers for processing and storing data from various organizations and fields of activity.
Over 3 million jobs in the Netherlands directly depend on cloud services, which are hosted in datacenters.
Since many public services continue to move online (\eg online administration and taxation, education), the fraction of Dutch professionals who depend on the cloud for work will exceed 35\% by 2025~\cite{DBLP:journals/corr/IosupKLVG22}.

% What is changing?
In the modern \gls{ai} economy datacenters need diverse and scalable server architectures, because inference-based workloads require more heterogeneous server components (GPUs, TPUs, NPUs \etc) to perform well.
Nowadays, datacenter operators try to meet customer expectations by adding more specialized hardware~\cite{DBLP:conf/date/MilojicicFDR21}, at the cost of increased system complexity.
As a result, operating a modern warehouse-scale datacenter with thousands of diverse servers is a difficult challenge that requires fast and well-informed decisions from on-site engineers.

The computational requirements of \gls{ai} are expected to increase in the future~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
Datacenter complexity will continue to grow, and it will become more difficult to manage.
Future servers will include even more specialized hardware, which, while improving datacenter performance, will exhibit behaviour that is harder to predict.
The rapid expansion of datacenters has already increased the incidence of service failures across all cloud services~\cite{DBLP:conf/acsos/TalluriOVTI21}.
Preventing failure-caused outages in advance could help datacenter operators reduce operational costs, as over 20\% of all reported outages cost more than 1 million US\$~\cite{DBLP:report/AnnualOutageAnalysis2025}.
%Moreover, datacenter outages can have catastrophic consequences, cite Fabian.

In short, the high computational demand of \gls{ai} and the end of Dennard scaling have resulted in the rise of larger and more heterogeneous datacenter architectures~\cite{DBLP:conf/date/MilojicicFDR21}.
Both developments create a need for more careful datacenter management to tackle the unprecedented complexity and ensure the availability of all cloud services.
To address this new problem, the concept of a datacenter \gls{dt} was proposed~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.

\begin{figure}
	\centering
	\includegraphics[width=0.95\linewidth]{images/five_dimensional_dt.pdf}
	\caption{A basic framework for the \gls{dt}. Four core elements of a \gls{dt} are defined: the physical entity \one, the simulated virtual twin \two, a service for out-of-band data analytics \three, and a persistent storage of historical data \four. The latter two are crucial to the \gls{dt} because they are necessary to gain meaningful monitoring insights. Adapted from Tao \etal~\cite{DBLP:conf/cirp/TAO2018169}.}
	%Fei Tao is a renowned figure with over 62k citations. He is a figure of authority on digital twins.%
	\label{fig:five_dimensional_dt}
\end{figure}
% (3) in the original paper by Fei Tao is referenced to just `Services`.
% Nonetheless I name them here as Data Analysis Services, because what Fei Tao lists (e.g., fault detection, fault determination, fault-tolerant management, maintenance) is inherently reliant on good data analytics.

\section{Context}\label{s:context}

% A digital twin is often called a virtual twin.
% The communication between a physical entity and the digital twin is referred to as a digital thread.
A \gls{dt} is a virtual model of an intended or actual real-world system that serves as its counterpart for purposes such as simulation, integration, testing, monitoring and maintenance.
The digital twin replicates the physical system in order to predict failures, prescribe real-time actions for mitigating unexpected events, and observe and evaluate the behaviour of the system~\cite{WIKI:page/DigitalTwin}.

Most modern \gls{dt} usages are related to prognostics and system health management~\cite{DBLP:conf/cirp/TAO2018169}.
For example, in aerospace engineering, the \gls{dt} analyzes operational data (\eg temperature, vibration) to predict when an airplane component is likely to fail.
The \gls{dt} can reliably manage the health of the physical entity by detecting fatigue cracks on aircraft wings or damage to wind turbine blades~\cite{DBLP:journal/IJAE/Teugel2011}.
This allows maintenance to be scheduled proactively, reducing unplanned downtime and preventing catastrophic failures.
Forecasting future maintenance and managing the physical health of an object or facility are the prime purposes of many \gls{dt}s used in practice~\cite{DBLP:conf/AIAA/Teugel2012}.

The first mention of a \gls{dt} dates back to 2003, when Dr. Michael Grieves of Dassault Syst\`emes introduced the three core components of a \gls{dt}: the virtual entity, the physical entity and the two-way connection between them (see Figure~\ref{fig:five_dimensional_dt}).
Due to insufficient technological foundations, little work is available on \gls{dt}s between 2003 and 2018, and it is only with the rapid growth of cloud computing, \gls{iot} and Big Data analytics that \gls{dt}s have re-emerged.
Today, research focuses on bridging the gap between the long-established foundations of \gls{dt}s and novel applications in academia and industry, such as the \gls{dcdt}~\cite{DBLP:conf/cirp/TAO2018169, DBLP:journals/computer/AthavaleBBMMPS24}.

A \gls{dcdt} mirrors the structure, context and behaviour of a datacenter~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
Crucial to \gls{dcdt} operation are predictive capabilities and the continuous interaction with the real-world datacenter.
There already exist \gls{dcdt} deployments.
For example, ExaDigiT~\cite{DBLP:conf/sc/BrewerMKWBHSGGW24} is a framework for digital twin development of supercomputers.
It has been demonstrated at the Frontier supercomputer and it facilitates virtual prototyping and system optimization.

Nonetheless, existing \gls{dcdt}s are still very limited in their capabilities, as the definition and scope of the \gls{dcdt} concept are shallow and unclear.
After all, only recently did the hardware capabilities needed to continuously simulate a datacenter become available~\cite{DBLP:conf/cirp/TAO2018169}.
Many \gls{dcdt} frameworks still lack critical data analysis components, fault detection mechanisms, profiling techniques \etc~\cite{DBLP:conf/wosp/SumanCNTMI24}, rendering them unusable in large-scale systems.
Such limitations gravely reduce the applicability of \gls{dcdt}s in real-world scenarios~\cite{DBLP:journals/corr/IosupKLVG22}.
\gls{dcdt}s are urgently needed, because datacenters experience hundreds of unexpected events every day, such as service failures or hardware faults.
Downtime resulting from failures disrupts users and leads to unfulfilled \gls{sla}s~\cite{DBLP:conf/acsos/TalluriOVTI21}.
% On the operational side, two main areas have been instrumental for improving datacenter efficiency: simulations and analysis of system telemetry. Additional improvements necessitate innovative tools that focus on end-to-end improvement, such as digital twins~\cite{DBLP:ExaDigiT}.
% DT's merge both simulation and telemetry to develop a holistic virtual representation of the system, bridging both the physical and virtual worlds.
However, predicting datacenter behaviour quickly and reliably is a non-trivial problem that remains insufficiently addressed in the existing \gls{dcdt} architectures~\cite{DBLP:conf/wosp/SumanCNTMI24, DBLP:journals/computer/AthavaleBBMMPS24} and deployments~\cite{DBLP:conf/sc/BrewerMKWBHSGGW24}.


\section{Problem statement}\label{s:problem-statement}



We envision \gls{dcdt}s as systems indispensable in future datacenters, actively interacting with the real-world facility, lowering operational costs and predicting hardware failures and software faults.
In this work, we address the lack of a unified \gls{dcdt} definition and the absence of predictive capabilities in existing \gls{dcdt} system designs.
We argue that the current state-of-the-art \gls{dcdt}s lack the predictive capabilities that are essential to real-time facility management of a modern datacenter.
A \gls{dt} without predictive capabilities cannot maintain the health of the datacenter effectively.
We posit that including holistic predictive analysis in \gls{dcdt} design can aid efficient datacenter management and prevent missed \gls{sla}s.
We propose that digital twinning can be enhanced by integrating predictive analytics through \gls{oda}.

\section{Research Questions}\label{s:research-questions}

\emph{Main Research Question:} How to define, design and evaluate a predictive \gls{dcdt}?\\

\noindent We divide the problem of designing a predictive \gls{dcdt} into three research questions:
\begin{enumerate}[label=\emph{RQ\textsubscript{\arabic*}}, align=left, itemsep=0pt]
	% First research question stolen from Capelin by Georgios Andreadis and adapted to my work.
	\item \emph{How to capture and assess the current state-of-the-art of digital twinning for datacenters?}\\
	      There is currently a lack of a unified definition of what constitutes a \gls{dcdt}, and the differences between a \gls{dcdt} and a generic \gls{dt}.
	      It is necessary that we establish a common definition of a \gls{dcdt} in the research community.
	      We must develop a holistic \gls{dcdt} model that factors in the necessary components of a \gls{dt}.
	      This is very challenging, because the \gls{dcdt} definition must address many kinds of operational and technical requirements while remaining compatible with the existing background on \gls{dt}s.
	\item \emph{How to design a \gls{dcdt} system model using discrete-event simulation and predictive data analysis?}\\
	      Existing \gls{dcdt} frameworks lack the necessary predictive capabilities to prevent unplanned behaviour in datacenters.
	      In this work, we aim to explore the design space of a predictive \gls{dcdt} and the different design trade-offs.
	      Through discrete-event simulation, we aim to provide the foundation for the system model to interact with a physical datacenter.
	      This is a very challenging task, because there are many functional and non-functional requirements of a \gls{dcdt} that need careful consideration.
	      The architecture must comply with the generic \gls{dt} model and address the non-trivial challenges in operating a modern datacenter.
	\item \emph{How to evaluate and validate a \gls{dcdt} model in relation to system requirements?}\\
	      To understand the operation of the proposed system and whether it meets its design goals, we need to measure its performance.
	      This is a challenging and non-trivial task that requires careful design of a set of experiments that realistically demonstrate the workings of a datacenter digital twin.

\end{enumerate}

\section{Research Methodology}\label{s:research-methodology}
% Alternative formulation in case there is no time to format the results as the literature survey, taken from Mastenbroek et al.
% Toward addressing RQ1 and RQ2 we survey in Chapter 2 the existing state of the art in risk analysis.
%We conduct a review of literature of closely-related fields as well as separate engineering science such as aerospace engineering.
% This will aid in identifying the most important use-cases for digital twins and in return, the crucial functional and non-functional requirements.
% We analyze the found use-cases in the context of datacenters or brainstorm how we can adapt them to datacenters.
To answer \emph{RQ\textsubscript{1}} we conduct a standard literature review as proposed by \textit{Kitchenham et al.}~\cite{DBLP:journals/infsof/KitchenhamPBBTNL10} under the guidance of the supervisor.
Firstly, we specify the review research questions and determine the right review method.
Secondly, we conduct the review and find the potential datacenter digital twin use-cases.
This includes summaries of a selection of relevant articles and their findings.
Lastly, based on the found use-cases, we formulate the functional and non-functional requirements for the predictive \gls{dcdt} reference architecture.

To answer \emph{RQ\textsubscript{2}} we closely follow the \textit{AtLarge Design Process}~\cite{DBLP:conf/icdcs/IosupVTETBFMT19} under the guidance of the supervisor, and propose a simulation-based \gls{dcdt} system that meets the requirements listed as a part of \emph{RQ\textsubscript{1}}.
Firstly, following the literature review, we list the functional and non-functional requirements of a predictive \gls{dcdt}.
We specify the pragmatic and innovative design possibilities to include in the reference architecture.
The designed system builds upon the OpenDC platform for datacenter simulation~\cite{DBLP:conf/ccgrid/MastenbroekAJLB21}, extending it with predictive analysis capabilities.
Lastly, we ensure that the design is scientific and testable and can be evaluated with comprehensive experiments.

To answer \emph{RQ\textsubscript{3}} we implement a prototype of the designed reference architecture.
We design comprehensive experiments that evaluate and validate the prototype based on the reference architecture.
We first gather a set of questions worth asking about the performance and impact of the predictive \gls{dcdt} and then set out to answer them with the prototype.
We define the appropriate experiment setup(s) and perform the experiments on specified hardware, considering different usage scenarios.

\section{Thesis Contributions}\label{s:thesis-contributions}


\begin{enumerate}[label=\emph{C\textsubscript{\arabic*}}, align=left, labelsep=0pt]
	\item An open-source \gls{dcdt} prototype for predictive facility maintenance, with data analysis supported by in-band and out-of-band telemetry and discrete-event simulation.
	\item Extensive evaluation and validation experiments of the system.
	\item Demonstration of the \gls{dcdt} paired with a simulated datacenter.
\end{enumerate}

\section{Plagiarism Declaration}\label{s:plagiarism-declaraion}
I hereby declare that this thesis is my own independent work and writing.
The thesis does not contain any material copied from other sources (person, Internet, or \gls{ai}), and has not been submitted for assessment elsewhere.
I acknowledge that the use of material from other works, or the paraphrasing of such material, without proper citation or credit will be treated as plagiarism.
I declare that this thesis is free from \gls{ai} generated content and has been written without the help of any \gls{ai} tools.

\section{Societal Impact}\label{s:societal-impact}

\section{Open Science}\label{s:open-science}
Abiding by the FAIR data principles, the entire source code of the prototype and related work has been made available in the repository at \url{https://git.denounce.ai/opendc.git}.
The reuse and reproduction of experiments are explained in a detailed guide at the root of the repository, along with the necessary dependencies and experimental setup.

\section{Thesis Structure}\label{s:thesis-structure}
The remainder of the thesis is structured as depicted in Figure~\ref{fig:thesis_structure}.
In Chapter~\ref{s:background}, we describe the relevant background information.
In Chapter~\ref{s:design}, we present the design of the \gls{dcdt}.
In Chapter~\ref{s:evaluation}, we evaluate a prototype of the system and validate it against the set of functional and non-functional requirements.
In Chapter~\ref{s:conclusion}, we conclude the thesis with a summary of contributions and potential future work.
\begin{figure}
	\centering
	\includegraphics[width=\linewidth]{images/thesis_structure.pdf}
	\caption{Structure of this thesis, with suggested reading flows.}
	\label{fig:thesis_structure}
\end{figure}