\chapter{Introduction}\label{s:intro}
Presently, computer and network ecosystems play a crucial part in the digital industry.
The transport, education and government sectors largely depend on server-side services, which are hosted in datacenters~\cite{DBLP:journals/corr/IosupKLVG22}.
To address the recent rise in demand for computation, due to the Artificial Intelligence revolution, managers expand datacenters with new components and more heterogeneous architectures (\eg GPUs, NPUs)~\cite{DBLP:conf/date/MilojicicFDR21}.
However, datacenter complexity increases significantly as a result.
To support better operational decisions despite the massive scale, promising technologies such as Datacenter Digital Twins are emerging~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.

% Why is it important?
Datacenters house large numbers of computers that process and store data for organizations across many fields of activity.
Over 3 million jobs in the Netherlands directly depend on cloud services, which are hosted in datacenters.
Since many public services continue to move online (\eg online administration and taxation, education), the fraction of Dutch professionals who depend on the cloud for work will exceed 35\% by 2025~\cite{DBLP:journals/corr/IosupKLVG22}.

% What is changing?
In the modern \gls{ai} economy, datacenters need diverse and scalable server architectures, because inference-based workloads require more heterogeneous server components (GPUs, TPUs, NPUs, \etc) to perform well.
Nowadays, datacenter operators try to meet customer expectations by adding more specialized hardware~\cite{DBLP:conf/date/MilojicicFDR21}, at the cost of increased system complexity.
As a result, operating a modern datacenter with thousands of diversified servers is a difficult challenge that requires fast and well-informed decisions from on-site engineers.

The \gls{ai} computational requirements are expected to increase in the future~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
Datacenter complexity will continue to grow, and it will become more difficult to manage.
Future servers will include more specialized hardware, which, while improving datacenter performance, will exhibit behaviour that is harder to predict.
The rapid expansion of datacenters has already increased the incidence of service failures across all cloud services~\cite{DBLP:conf/acsos/TalluriOVTI21}.
Preventing failure-caused outages in advance could help datacenter operators reduce operational costs, as over 20\% of all reported outages cost more than 1 million US\$~\cite{DBLP:report/AnnualOutageAnalysis2025}.
%Moreover, datacenter outages can have catastrophic consequences, cite Fabian.

In short, high computational demand and the end of Dennard scaling have resulted in the rise of larger and more heterogeneous datacenter architectures~\cite{DBLP:conf/date/MilojicicFDR21}.
Both developments create a need for more careful datacenter management to tackle the unprecedented complexity and ensure the availability of all cloud services.
Two specific goals help satisfy this need: maximizing monitoring insights to support better-informed decisions, and minimizing the downtime caused by maintenance and hardware failures.
To address these problems, the concept of a datacenter \gls{dt} was proposed~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.

\begin{figure}
	\centering
	\includegraphics[width=0.95\linewidth]{images/five_dimensional_dt.pdf}
	\caption{A basic framework for the \gls{dt}. Four core elements of a \gls{dt} are defined: the physical entity \one, the simulated virtual twin \two, a service for out-of-band data analytics \three, and a persistent storage of historical data \four. The latter two are crucial to the \gls{dt} because they are necessary to gain meaningful monitoring insights. Adapted from Tao \etal~\cite{DBLP:conf/cirp/TAO2018169}.}
	%Fei Tao is a renowned figure with over 62k citations. He is a figure of authority on digital twins.%
	\label{fig:five_dimensional_dt}
\end{figure}
% (3) in the original paper by Fei Tao is referenced to just `Services`.
% Nonetheless I name them here as Data Analysis Services, because what Fei Tao lists (e.g., fault detection, fault determination, fault-tolerant management, maintenance) is inherently reliant on good data analytics.

\section{Context}\label{s:context}

% A digital twin is often called a virtual twin.
% The communication between a physical entity and the digital twin is referred to as a digital thread.
A \gls{dt} is a digital model of an intended or actual real-world system that serves as a digital counterpart of it for purposes such as simulation, integration, testing, monitoring and maintenance.
The digital twin replicates the physical system to predict failures, prescribe real-time actions for mitigating unexpected events, and observe and evaluate the profile of the system~\cite{WIKI:page/DigitalTwin}.

Most modern uses of \gls{dt}s relate to prognostics and system health management~\cite{DBLP:conf/cirp/TAO2018169}.
For example, in aerospace engineering, the \gls{dt} analyzes operational data (\eg temperature, vibration) to predict when an airplane component is likely to fail.
The \gls{dt} can reliably manage the health of the physical entity by detecting fatigue cracks on aircraft wings or damage to wind turbine blades~\cite{DBLP:journal/IJAE/Teugel2011}.
This allows maintenance to be scheduled proactively, reducing unplanned downtime and preventing catastrophic failures.
Forecasting future maintenance and managing system health virtually are the prime purposes of many \gls{dt}s used in practice~\cite{DBLP:conf/AIAA/Teugel2012}.

The first mention of a \gls{dt} dates back to 2003, when Dr.\ Michael Grieves of Dassault Syst\`emes introduced the three core components of a \gls{dt}: the virtual entity, the physical entity and the two-way connection between them (see Figure~\ref{fig:five_dimensional_dt}).
Due to insufficient technological foundations, little work is available on \gls{dt}s between 2003 and 2018, and it is only with the rapid growth of cloud computing, \gls{iot} and Big Data analytics that \gls{dt}s have re-emerged.
Today, research is focused on bridging the gap between the long-established foundations of \gls{dt}s and novel applications in academia and industry, such as the \gls{dcdt}~\cite{DBLP:conf/cirp/TAO2018169, DBLP:journals/computer/AthavaleBBMMPS24}.

A \gls{dcdt} mirrors the structure, context and behaviour of a datacenter~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
Crucial to \gls{dcdt} operation are predictive capabilities and the continuous interaction with the real-world datacenter.
Digital twin deployments already exist.
For example, ExaDigiT~\cite{DBLP:conf/sc/BrewerMKWBHSGGW24} is a framework for digital twin development of supercomputers.
It has been demonstrated on the Frontier supercomputer, where it facilitates virtual prototyping and system optimization.
Quick and correct decision-making in a 21\textsuperscript{st}-century datacenter is a hard task.
Unexpected events such as service failures or hardware faults often result in downtime that disrupts users and leaves \gls{sla}s unfulfilled~\cite{DBLP:conf/acsos/TalluriOVTI21}.
However, predicting datacenter behaviour quickly and reliably is a non-trivial problem that remains insufficiently addressed in existing \gls{dcdt} architectures~\cite{DBLP:conf/wosp/SumanCNTMI24, DBLP:journals/computer/AthavaleBBMMPS24}.

\section{Problem statement}\label{s:problem-statement}

In this work, we argue that current state-of-the-art Datacenter Digital Twins lack the predictive capabilities that are essential to real-time facility management of a modern datacenter.
We propose that digital twinning can be enhanced by integrating predictive analytics through \gls{oda}.

\section{Research Questions}\label{s:research-questions}

\begin{enumerate}[label=\textbf{RQ\arabic*.}, align=left]
	\item \textbf{How to define five \gls{dcdt} use cases and their functional and non-functional requirements?}
	\item \textbf{How to design a \gls{dcdt} system model using discrete-event simulation and operational data analysis?}
	\item \textbf{How to validate whether the \gls{dcdt} system meets the functional and non-functional requirements?}
\end{enumerate}

\section{Research Methodology}\label{s:research-methodology}

\section{Thesis Contributions}\label{s:thesis-contributions}

\section{Plagiarism Declaration}\label{s:plagiarism-declaraion}
I hereby declare that this thesis is my own independent work and writing.
The thesis does not contain any material copied from other sources (person, Internet, or AI), and has not been submitted for assessment elsewhere.

\section{Societal Impact}\label{s:societal-impact}

\section{Thesis Structure}\label{s:thesis-structure}