Diffstat (limited to 'content/intro.tex')
-rw-r--r-- content/intro.tex 84
1 files changed, 45 insertions, 39 deletions
diff --git a/content/intro.tex b/content/intro.tex
index 246d8bf..97935d8 100644
--- a/content/intro.tex
+++ b/content/intro.tex
@@ -1,29 +1,32 @@
\chapter{Introduction}\label{s:intro}
-Modern society is a technological society.
-Presently, computer and network ecosystems play a crucial part not only in the digital industry, but also in everyone's daily lives.
-Today, the transport, education and government sectors largely depend on server-side services, which are hosted in datacentres~\cite{DBLP:journals/corr/IosupKLVG22}.
-To address the recent rise in demand due to the \gls{ai} revolution managers expand datacenters with new components and more heterogenous architectures (e.g., GPUs and NPUs)~\cite{DBLP:conf/date/MilojicicFDR21}.
+Presently, computer and network ecosystems play a crucial part in the digital industry.
+The transport, education and government sectors largely depend on server-side services, which are hosted in datacenters~\cite{DBLP:journals/corr/IosupKLVG22}.
+To address the recent rise in demand for computation, due to the Artificial Intelligence revolution, managers expand datacenters with new components and more heterogeneous architectures (\eg GPUs, NPUs)~\cite{DBLP:conf/date/MilojicicFDR21}.
However, in return datacenter complexity increases significantly.
-To make better operational decisions despite the massive scale, promising technologies arise such as \gls{dcdt}.
+To make better operational decisions despite the massive scale, promising technologies such as Datacenter Digital Twins are emerging~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
-\section{Context}\label{s:context}
% Why is it important?
Datacenters house large volumes of computers for processing and storage of data from various organizations and fields of activity.
-76\% of large companies worldwide spend more than 5 million USD\$ on hosted services each month, making datacenters one of the most important components of the digital society~\cite{DBLP:report/Flexera2026}.
-Additionally, in Netherlands alone over 25\% of professionals depend on cloud services in their everyday work.
-Faced with growing demand, this fraction will exceed 35\% by 2025~\cite{DBLP:journals/corr/IosupKLVG22}.
-% Why is this a problem now?
-
-The increasing popularity of \gls{genai} and monthly releases of powerful \gls{llm} have driven the demand for datacenter services for the past 4 years.
-In the \gls{ai} economy datacenters need diverse and scalable server architectures, because inference-based workloads require more heterogenous server components (GPUs, TPUs, NPUs \etc) to perform well.
-As such, datacenter operators try to meet customer expectations by adding more specialized hardware~\cite{DBLP:conf/date/MilojicicFDR21}, at a cost of increased system complexity.
+Over 3 million jobs in the Netherlands directly depend on cloud services, which are hosted in datacenters.
+Since many public services continue to move online (\eg online administration and taxation, education), the fraction of Dutch professionals who depend on the cloud for work will exceed 35\% by 2025~\cite{DBLP:journals/corr/IosupKLVG22}.
+
+% What is changing?
+In the modern \gls{ai} economy, datacenters need diverse and scalable server architectures, because inference-based workloads require more heterogeneous server components (GPUs, TPUs, NPUs \etc) to perform well.
+Nowadays, datacenter operators try to meet customer expectations by adding more specialized hardware~\cite{DBLP:conf/date/MilojicicFDR21}, at a cost of increased system complexity.
In return, operating a modern datacenter warehouse with thousands of diversified servers presents a difficult challenge that requires fast and well-informed decisions from on-site engineers.
-Quick and correct decision-making in a 21\textsuperscript{st} century datacenter is a hard task.
-Oftentimes unexpected events such as \eg service failures or hardware faults result in a downtime that disturbs the users and produces unfulfilled \gls{sla}.
-What is more, the rapid expansion of datacenters promotes increased presence of failures across all cloud services~\cite{DBLP:conf/acsos/TalluriOVTI21}.
-Currently, preventing service outages in advance could help datacenter operators reduce substantial operational costs, as over 20\% of all reported failure-caused outages amount to more than 1 million US\$~\cite{DBLP:report/AnnualOutageAnalysis2025}.
-However, predicting datacenter behaviour quickly and reliably is a non-trivial problem that still remains insufficiently unaddressed~\cite{DBLP:conf/wosp/SumanCNTMI24}.
+The \gls{ai} computational requirements are expected to increase in the future~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
+Datacenter complexity will continue to grow, and it will become more difficult to manage.
+Future servers will include more specialized hardware, which, while improving datacenter performance, will exhibit behaviour that is harder to predict.
+Already the rapid expansion of datacenters has increased the presence of service failures across all cloud services~\cite{DBLP:conf/acsos/TalluriOVTI21}.
+Preventing failure-caused outages in advance could help datacenter operators reduce operational costs, as over 20\% of all reported outages cost more than 1 million US\$~\cite{DBLP:report/AnnualOutageAnalysis2025}.
+%Moreover, datacenter outages can have catastrophic consequences, cite Fabian.
+
+In short, the high computational demand and the end of Dennard scaling have resulted in the rise of larger and more heterogeneous datacenter architectures~\cite{DBLP:conf/date/MilojicicFDR21}.
+Both developments create a need for more careful datacenter management to tackle the unprecedented complexity and ensure availability of all cloud services.
+Specific goals that can help satisfy these needs include maximizing monitoring insights to support better-informed decisions and minimizing the downtime caused by maintenance and hardware failures.
+To address these problems, the concept of a datacenter \gls{dt} was proposed~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
+
\begin{figure}
\centering
\includegraphics[width=0.95\linewidth]{images/five_dimensional_dt.pdf}
@@ -34,36 +37,39 @@ However, predicting datacenter behaviour quickly and reliably is a non-trivial p
% (3) in the original paper by Fei Tao is referenced to just `Services`.
% Nonetheless I name them here as Data Analysis Services, because what Fei Tao lists (e.g., fault detection, fault determination, fault-tolerant management, maintenance) is inherently reliant on good data analytics.
-The expanding \gls{ai} economy and the end of Moore's law have resulted in the rise of more heterogeneous datacenter architectures~\cite{DBLP:conf/date/MilojicicFDR21}.
-This means that in modern datacenters there are more server racks and each rack may contain multiple different hardware architectures.
-These events have created a need for:
-\begin{enumerate}
- \item More careful datacenter management to tackle the unprecedented complexity
- \item Greater availability of cloud services
- \item Lesser downtime and lower electricity cost
-\end{enumerate}
-Specific goals that can help satisfy these needs are:
-\begin{enumerate}
- \item Reducing the downtime of failured-caused outages
- \item Maximising the monitoring insights that can help make better informed operational decisions
- \item Minimizing the downtime caused by server maintenance and hardware inspections
-\end{enumerate}
+\section{Context}\label{s:context}
+
+% A digital twin is often called a virtual twin.
+% The communication between a physical entity and the digital twin is referred to as a digital thread.
+A \gls{dt} is a digital model of an intended or actual real-world system that serves as a digital counterpart of it for purposes such as simulation, integration, testing, monitoring and maintenance.
+The digital twin replicates the physical system to predict failures, prescribe real-time actions for mitigating unexpected events, and observe and evaluate the profile of the system~\cite{WIKI:page/DigitalTwin}.
+
+Most modern \gls{dt} applications relate to prognostics and system health management~\cite{DBLP:conf/cirp/TAO2018169}.
+For example, in aerospace engineering, the \gls{dt} analyzes operational data (\eg temperature, vibration) to predict when an airplane component is likely to fail.
+The \gls{dt} can reliably manage the health of the physical entity by detecting fatigue cracks on aircraft wings or damage to wind turbine blades~\cite{DBLP:journal/IJAE/Teugel2011}.
+This allows maintenance to be scheduled proactively, reducing unplanned downtime and preventing catastrophic failures.
+Forecasting future maintenance and managing system health virtually are the prime purposes of many \gls{dt}s used in practice~\cite{DBLP:conf/AIAA/Teugel2012}.
+
+The first mention of a \gls{dt} dates back to 2003, when Dr. Michael Grieves of Dassault Syst\`emes introduced the three core components of a \gls{dt}: the virtual entity, the physical entity and the two-way connection (see Figure~\ref{fig:five_dimensional_dt}).
+Due to insufficient technological foundations, little work is available on \gls{dt}s between 2003 and 2018, and it is only with the rapid growth of cloud computing, \gls{iot} and Big Data analytics that \gls{dt}s have re-emerged.
+Today, research is focused on bridging the gap between the long-established foundations of \gls{dt}s and novel applications in academia and industry, such as the \gls{dcdt}~\cite{DBLP:conf/cirp/TAO2018169, DBLP:journals/computer/AthavaleBBMMPS24}.
A \gls{dcdt} mirrors the structure, context and behaviour of a datacenter~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
-Crucial to \gls{dt} operation are predictive capabilities and the continuous interaction with the real-world datacenter.
+Crucial to \gls{dcdt} operation are predictive capabilities and the continuous interaction with the real-world datacenter.
There already exist digital twin deployments.
For example, ExaDigiT~\cite{DBLP:conf/sc/BrewerMKWBHSGGW24} is a framework for digital twin development of supercomputers.
-It has been demonstrated at the Frontier supercomputer and it facilitates virtual prototyping and system optimization, however it lacks core \gls{dt} functions, such as reliable predictive analytics.
+It has been demonstrated at the Frontier supercomputer and it facilitates virtual prototyping and system optimization.
+Quick and correct decision-making in a 21\textsuperscript{st} century datacenter is a hard task.
+Oftentimes, unexpected events such as service failures or hardware faults result in downtime that disturbs the users and produces unfulfilled \gls{sla}~\cite{DBLP:conf/acsos/TalluriOVTI21}.
+However, predicting datacenter behaviour quickly and reliably is a non-trivial problem that remains insufficiently addressed in the existing \gls{dcdt} architectures~\cite{DBLP:conf/wosp/SumanCNTMI24, DBLP:journals/computer/AthavaleBBMMPS24}.
\section{Problem statement}\label{s:problem-statement}
-In this work we argue that the current state-of-the-art ICT Digital Twins lack predictive capabilities that are essential to real-time facility management.
-We propose that digital twinning can be enhanced by integrating \gls{oda} through predictive analytics.
+In this work we argue that the current state-of-the-art Datacenter Digital Twins lack sufficient predictive capabilities that are essential to real-time facility management of a modern datacenter.
+We propose that digital twinning can be enhanced by integrating predictive analytics through \gls{oda}.
\section{Research Questions}\label{s:research-questions}
-We divide the problem of enabling predictive analytics using digital twinning into three research questions:
-
\begin{enumerate}[label=\textbf{RQ\arabic*.}, align=left]
\item \textbf{How to define 5 \gls{dcdt} use-cases and their functional and non-functional requirements?}
\item \textbf{How to design a \gls{dcdt} system model using discrete-event simulation and operational data analysis?}