\chapter{Introduction}\label{s:intro}
Presently, computer and network systems play a crucial part in the digital industry.
The transport, education and government sectors largely depend on digital services, which are hosted in datacenters~\cite{DBLP:journals/corr/IosupKLVG22}.
To address the recent rise in demand for computation, due to the advancements in Artificial Intelligence, managers expand datacenters with new components and more heterogeneous architectures (\eg GPUs, NPUs)~\cite{DBLP:conf/date/MilojicicFDR21}.
However, in return datacenter complexity increases significantly.
To make better operational decisions despite the massive scale, promising technologies arise such as Datacenter Digital Twins~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.

% Why is it important?
Datacenters house large numbers of computers for processing and storing data from various organizations and fields of activity.
Over 3 million jobs in the Netherlands directly depend on cloud services, which are hosted in datacenters.
Since many public services continue to move online (\eg online administration and taxation, education), the fraction of Dutch professionals who depend on the cloud for work will exceed 35\% by 2025~\cite{DBLP:journals/corr/IosupKLVG22}.

% What is changing?
In the modern \gls{ai} economy, datacenters need diverse and scalable server architectures, because inference-based workloads require more heterogeneous server components (GPUs, TPUs, NPUs \etc) to perform well.
Nowadays, datacenter operators try to meet customer expectations by adding more specialized hardware~\cite{DBLP:conf/date/MilojicicFDR21}, at the cost of increased system complexity.
As a result, operating a modern datacenter warehouse with thousands of diversified servers is a difficult challenge that requires fast and well-informed decisions from on-site engineers.

The computational requirements of \gls{ai} are expected to increase in the future~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
Datacenter complexity will continue to grow, and it will become more difficult to manage.
Future servers will include even more specialized hardware, which, while improving datacenter performance, will exhibit behaviour that is harder to predict.
The rapid expansion of datacenters has already increased the incidence of service failures across all cloud services~\cite{DBLP:conf/acsos/TalluriOVTI21}.
Preventing failure-caused outages in advance could help datacenter operators reduce operational costs, as over 20\% of all reported outages cost more than 1 million US\$~\cite{DBLP:report/AnnualOutageAnalysis2025}.
%Moreover, datacenter outages can have catastrophic consequences, cite Fabian.

In short, the high computational demand of \gls{ai} and the end of Dennard scaling have resulted in the rise of larger and more heterogeneous datacenter architectures~\cite{DBLP:conf/date/MilojicicFDR21}.
Both developments create a need for more careful datacenter management to tackle the unprecedented complexity and ensure the availability of all cloud services.
To address this new problem, the concept of a datacenter \gls{dt} was proposed~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.

\begin{figure}
	\centering
	\includegraphics[width=0.8\linewidth]{images/simple_dt.pdf}
	\caption{Elements of the digital twin ecosystem~\cite{DBLP:modsim24/presentation/Iosup2024}.}
	\label{fig:simple_dt}
\end{figure}

\section{Context}\label{s:context}

% A digital twin is often called a virtual twin.
% The communication between a physical entity and the digital twin is referred to as a digital thread.
A \gls{dt} is a virtual model of an intended or actual real-world system that serves as its counterpart for purposes such as simulation, integration, testing, monitoring and maintenance.
The digital twin replicates the physical system to predict failures, prescribe real-time actions for mitigating unexpected events, and observe and evaluate the behaviour of the system~\cite{WIKI:page/DigitalTwin}.

Most modern \gls{dt} usages are related to prognostics and system health management~\cite{DBLP:conf/cirp/TAO2018169}.
For example, in aerospace engineering, the \gls{dt} analyzes operational data (\eg temperature, vibration) to predict when an airplane component is likely to fail.
The \gls{dt} can reliably manage the health of the physical entity by detecting fatigue cracks on aircraft wings or damage to wind turbine blades~\cite{DBLP:journal/IJAE/Teugel2011}.
This allows maintenance to be scheduled proactively, reducing unplanned downtime and preventing catastrophic failures.
Forecasting future maintenance and managing the physical health of an object or facility are the prime purposes of many \gls{dt}s used in practice~\cite{DBLP:conf/AIAA/Teugel2012}.

The first mention of a \gls{dt} dates back to 2003, when Dr. Michael Grieves of Dassault Syst\`emes introduced the three core components of a \gls{dt}: the virtual entity, the physical entity and the two-way connection between them (see Figure~\ref{fig:simple_dt}).
Due to insufficient technological foundations, little work is available on \gls{dt}s between 2003 and 2018, and it is only with the rapid growth of cloud computing, \gls{iot} and Big Data analytics that \gls{dt}s have re-emerged.
Today, research is focused on bridging the gap between the long-established foundations of \gls{dt}s and novel applications in academia and industry, such as the \gls{dcdt}~\cite{DBLP:conf/cirp/TAO2018169, DBLP:journals/computer/AthavaleBBMMPS24}.

A \gls{dcdt} mirrors the structure, context and behaviour of a datacenter~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
Crucial to \gls{dcdt} operation are predictive capabilities and the continuous interaction with the real-world datacenter.
There already exist \gls{dcdt} deployments.
For example, ExaDigiT~\cite{DBLP:conf/sc/BrewerMKWBHSGGW24} is a framework for digital twin development of supercomputers.
It has been demonstrated at the Frontier supercomputer and it facilitates virtual prototyping and system optimization.

Nonetheless, existing \gls{dcdt}s are still very limited in their capabilities, as the definition and scope of the \gls{dcdt} concept are shallow and unclear.
After all, only recently did the hardware capabilities needed to continuously simulate a datacenter become available~\cite{DBLP:conf/cirp/TAO2018169}.
Many \gls{dcdt} frameworks still lack critical data analysis components, fault detection mechanisms, profiling techniques \etc~\cite{DBLP:conf/wosp/SumanCNTMI24}, rendering them unusable in large-scale systems.
Such limitations gravely reduce the applicability of \gls{dcdt}s in real-world scenarios~\cite{DBLP:journals/corr/IosupKLVG22}.
\gls{dcdt}s are urgently needed, because datacenters experience hundreds of unexpected events every day, such as service failures or hardware faults.
Downtime, which is the result of failures, disturbs the users and leads to unfulfilled \gls{sla}s~\cite{DBLP:conf/acsos/TalluriOVTI21}.
% On the operational side, two main areas have been instrumental for improving datacenter efficiency: simulations and analysis of system telemetry. Additional improvements necessitate innovative tools that focus on end-to-end improvement, such as digital twins~\cite{DBLP:ExaDigiT}.
% DT's merge both simulation and telemetry to develop a holistic virtual representation of the system, bridging both the physical and virtual worlds.
However, predicting datacenter behaviour quickly and reliably is a non-trivial problem that remains insufficiently addressed in the existing \gls{dcdt} architectures~\cite{DBLP:conf/wosp/SumanCNTMI24, DBLP:journals/computer/AthavaleBBMMPS24} and deployments~\cite{DBLP:conf/sc/BrewerMKWBHSGGW24}.


\section{Problem statement}\label{s:problem-statement}



We envision \gls{dcdt}s as indispensable systems in future datacenters, actively interacting with the real-world facility, lowering operational costs and predicting hardware failures and software faults.
In this work, we address the lack of a unified \gls{dcdt} definition and the absence of predictive capabilities in existing \gls{dcdt} system designs.
We argue that the current state-of-the-art \gls{dcdt}s lack the predictive capabilities that are essential to real-time facility management of a modern datacenter.
A \gls{dt} without predictive capabilities cannot maintain the health of the datacenter effectively.
We posit that including holistic predictive analysis in \gls{dcdt} design can aid in efficient datacenter management and prevent missed \gls{sla}s.
We propose that digital twinning can be enhanced by integrating predictive analytics through \gls{oda}.

\section{Research Questions}\label{s:research-questions}

\emph{Main Research Question:} How to enable predictive analytics in datacenters through digital twinning?\\

\noindent We divide the problem of designing a predictive \gls{dcdt} into three research questions:
\begin{enumerate}[label=\emph{RQ\textsubscript{\arabic*}}, align=left, itemsep=0pt]
	% First research question stolen from Capelin by Georgios Andreadis and adapted to my work.
	\item \emph{How to assess the current state-of-the-art of digital twinning for datacenters?}\\
	      There is currently no unified system model of what constitutes a \gls{dcdt}, nor a clear account of the differences between existing \gls{dcdt} deployments.
	      It is necessary that we establish a common model of a \gls{dcdt} in the research community.
	      We must develop a holistic \gls{dcdt} model that factors in the necessary components of a \gls{dt}.
	      This is very challenging, because the \gls{dcdt} system model must address many kinds of operational and technical requirements, compatible with the existing background on \gls{dt}s.
	\item \emph{How to design a \gls{dcdt} system model using discrete-event simulation and predictive data analysis?}\\
	      % You should start referring to my_system as a framework, rather than a standalone system.
	      Existing \gls{dcdt} frameworks lack the necessary predictive capabilities to prevent unplanned behaviour in datacenters.
	      In this work, we aim to explore the design space of a predictive \gls{dcdt} and the different design trade-offs.
	      Through discrete-event simulation, we aim to provide the foundation for the system model to interact with a physical datacenter.
	      This is a very challenging task, because there are many functional and non-functional requirements of a \gls{dcdt} that need careful consideration.
	      The architecture must comply with the generic \gls{dt} model and address the non-trivial challenges in operating a modern datacenter.
	\item \emph{How to evaluate and validate a \gls{dcdt} model in relation to system requirements?}\\
	      To understand the operation of the proposed system and whether it meets its design goals, we need to measure its performance.
	      This is a challenging and non-trivial task that requires the careful design of a set of experiments that realistically show how a datacenter digital twin works.

\end{enumerate}

\section{Research Methodology}\label{s:research-methodology}
% Alternative formulation in case there is no time to format the results as the literature survey, taken from Mastenbroek et al.
% Toward addressing RQ1 and RQ2 we survey in Chapter 2 the existing state of the art in risk analysis.
%We conduct a review of literature of closely-related fields as well as separate engineering science such as aerospace engineering.
% This will aid in identifying the most important use-cases for digital twins and in return, the crucial functional and non-functional requirements.
% We analyze the found use-cases in the context of datacenters or brainstorm how we can adapt them to datacenters.
To answer \emph{RQ\textsubscript{1}} we conduct a literature review as proposed by \textit{Kitchenham et al.} \cite{DBLP:journals/infsof/KitchenhamPBBTNL10} along with the guidance of the supervisor.
Firstly, we determine the right review method.
Secondly, we identify works related to \gls{dcdt}s using various search strings
(\eg ``Datacenter Digital Twinning'', ``ICT Virtual Twin'').
To search for the results we use the digital libraries of Google Scholar, DBLP, ACM Digital Library, IEEE Xplore, Springer \etc
Thirdly, we select work relevant to our research and organize the details of each article.
A potential outcome of this review is a system model for \gls{dcdt}s.
We expect the literature review to supply us with potential use-cases for the predictive \gls{dcdt}.
Based on the found use-cases, we formulate the functional and non-functional requirements for the predictive \gls{dcdt} reference architecture.

To answer \emph{RQ\textsubscript{2}} we closely follow the \textit{AtLarge Design Process} \cite{DBLP:conf/icdcs/IosupVTETBFMT19} under the guidance of the supervisor, and propose a simulation-based \gls{dcdt} system that meets the requirements listed as a part of  \emph{RQ\textsubscript{1}}.
Firstly, following the literature review, we list the functional and non-functional requirements of a predictive \gls{dcdt}.
We specify the pragmatic and innovative design possibilities to include in the reference architecture.
The designed system builds upon the OpenDC platform for datacenter simulation~\cite{DBLP:conf/ccgrid/MastenbroekAJLB21}, extending it with predictive analysis capabilities.
Lastly, we ensure that the design is scientific and testable and can be evaluated with comprehensive experiments.

To answer \emph{RQ\textsubscript{3}} we implement a prototype of the designed reference architecture.
We design comprehensive experiments that evaluate and validate the prototype based on the reference architecture.
We first gather a set of questions worth asking about the performance and impact of the predictive \gls{dcdt} and then set out to answer them with the prototype.
We define the correct experiment setup(s) and perform the experiments on specified hardware, considering different usage scenarios.

\section{Thesis Contributions}\label{s:thesis-contributions}

\begin{enumerate}
	\item \textbf{Conceptual}:

	      \begin{enumerate}[label=\emph{C\textsubscript{\arabic*}}, align=left, labelsep=0pt]
		      \item We conduct a systematic literature review and detailed analysis of existing works on digital twinning in the scientific research community.
		            We collect and organize the characteristics of \gls{dcdt}s and, based on our findings, propose a unified system model of the design space.
		      \item We propose the design of \gls{my_system}, a discrete-event \gls{dcdt} for reliable and timely failure prediction in datacenters.
		            \gls{my_system} includes a set of novel system components which leverage \gls{oda} and discrete-event simulation.

		      \item We evaluate \gls{my_system} using a novel experimentation technique and datacenter workload traces from the industry.
		            We design a method to evaluate \gls{dcdt}s without costly real-world experimentation.
		            We conduct a set of experiments and analyse the results.
	      \end{enumerate}
	\item \textbf{Technical:}

	      \begin{enumerate}[label=\emph{C\textsubscript{\arabic*}}, align=left, labelsep=0pt, start=4]
		      \item We prototype \gls{my_system} following the established \gls{dt} design principles using discrete-event simulation and \gls{oda}.
		            We include the code as an Open Science artifact and ensure the prototype remains accessible to the broader scientific community, together with exhaustive project documentation.
		      \item We provide the experiment setup, validation and evaluation of \gls{my_system} for predicting datacenter failures in real time as an Open Science artifact.
	      \end{enumerate}

\end{enumerate}
\section{Plagiarism Declaration}\label{s:plagiarism-declaraion}
I hereby declare that this thesis is my own independent work and writing.
The thesis does not contain any material copied from other sources (person, Internet, or \gls{ai}), and has not been submitted for assessment elsewhere.
I acknowledge that the usage of material from other works or paraphrase of such material without proper citations or credit will be treated as plagiarism.
I declare that this thesis is free from \gls{ai} generated content and has been written without the help of any \gls{ai} tools.

\section{Societal Impact}\label{s:societal-impact}
Any program that is difficult to understand and reason about is sure to accumulate technical debt.
However, large-scale systems can sometimes be inherently complex and hard to comprehend.
% Cite Frederick P Brooks here.
In such scenarios, software that can aid system management is necessary.
Computer systems, more complex now than ever before, must remain accessible to be beneficial to the digital society.
This work addresses the four grand societal challenges related to this goal: \begin{enumerate*}[label=(\arabic*)]
	\item manageability
	\item responsibility
	\item sustainability
	\item usability
\end{enumerate*}~\cite{DBLP:journals/corr/IosupKLVG22}.
\gls{my_system} addresses (1) directly by making large-scale datacenter management easier.
We address (2) by ensuring our work adheres to the FAIR principles of Open Science.
Moreover, in this thesis we try to make \gls{dcdt} systems more understandable to the broader scientific community by providing a unified system model.
Additionally, we contribute to responsible software design by adhering to best software engineering practices in the design of the prototype.
(3) is addressed indirectly, as the insights provided by a holistic, \gls{oda}-powered \gls{dcdt} can help datacenter managers make more sustainable decisions in the future.
We contribute to (4) by helping predict unexpected failures and lowering operational costs, ensuring datacenters can continue to be usable in the future.
We believe this work has a strong societal impact because it addresses the four grand societal challenges described by Iosup \etal, and we hope that through this work we can advance the scientific research community towards a more sustainable future.

\section{Open Science}\label{s:open-science}
Abiding by the FAIR data principles, the entire source code of the prototype and related work has been made available at the \url{https://git.denounce.ai/opendc.git} repository.
The reuse and reproduction of experiments are explained in a detailed guide at the root of the repository, along with the necessary dependencies and experimental setup.

\section{Thesis Structure}\label{s:thesis-structure}
The remainder of the thesis is structured as depicted in Figure~\ref{fig:thesis_structure}.
In Chapter~\ref{s:background}, we describe the relevant background information.
In Chapter~\ref{s:design}, we present the design of the \gls{dcdt}.
In Chapter~\ref{s:evaluation}, we evaluate a prototype of the system and validate it against the set of functional and non-functional requirements.
In Chapter~\ref{s:conclusion}, we conclude the thesis with a summary of contributions and potential future work.
\begin{figure}[b!]
	\centering
	\includegraphics[width=\linewidth]{images/thesis_structure.pdf}
	\caption{Structure of this thesis, with suggested reading flows.}
	\label{fig:thesis_structure}
\end{figure}