Diffstat (limited to 'content')
-rw-r--r-- content/background.tex 60
-rw-r--r-- content/intro.tex 84
2 files changed, 67 insertions, 77 deletions
diff --git a/content/background.tex b/content/background.tex
index e65dc17..55cfe16 100644
--- a/content/background.tex
+++ b/content/background.tex
@@ -13,22 +13,6 @@ A prime example of using probability to find a good machine learning model is Ba
% Stanford Encyclopedia of Philosophy, Douven 2017
The process of inference from data to provide the best explanation is called abduction.
-
-
-A \gls{dt} is a digital model of an intended or actual real-world system that serves as a digital counterpart of it for purposes such as simulation, integration, testing, monitoring and maintenance %cite the Wikipedia page here!.
-The system requires real-time synchronization with the actual system.
-A closed loop of continuous feedback exists between the digital twin and physical object.
-
-The digital twin replicates the physical system to predict failures and opportunities for changing, to prescribe real-time actions for optimizing and/or mitigating unexpected events, observing and evaluating the profile of the system.
-
-A digital twin is often called a virtual twin.
-
-The communication between a physical entity and the digital twin is referred to as a digital thread.
-
-One key application is predictive maintenance, where the digital twin analyzes operational data (e.g., temperature, vibration) to predict when a component is likely to fail.
-
-This allows maintenance to be scheduled proactively, reducing unplanned downtime and preventing catastrophic failures.
-
%Include something about data-preprocessing in the pipeline.
%See the article by Fei Tao
@@ -41,14 +25,7 @@ ODA can predict failures, help maintain the equipment, save bills, cut costs.
But currently one of the key challenges is to somehow connect the physical and virtual spaces.
The answer to how to do this is a digital twin.
-Since DT's are relatively a new concept, I think they require a short introduction to their history.
-It's enough to mention that the first presentation was done by Grieves in 2003, from 2003 to 2018 we have seen a slow incline in numbers of papers (around 50) and now DT's are re-emerging.
-
-You must include the DT white paper from 2014.
-The concept of a \gls{dt} dates back to 2003, when Dr. Michael Grieves of Dassault Syst\'emes introduced the 3 core components of a \gls{dt}: the virtual entity, physical entity and the two-way connection (see Figure \ref{fig:five_dimensional_dt}).
-Due to insufficient technological foundations, little work is available on \gls{dt}s between 2003 and 2018~\cite{DBLP:conf/cirp/TAO2018169}, and it is only with the rapid growth of cloud computing, \gls{iot} and big data analytics that \gls{dt}s have re-emerged.
-Today, research is focused on bridging the gap between the long-established foundations of \gls{dt}s and new, novel applications in academia and industry, such as the \gls{dcdt}.
%[citation needed]
As of 2026, there is no consensus on what a digital twin is.
@@ -56,23 +33,30 @@ By proxy, there is neither consensus on what is the definition of a datacenter d
A generic definition is needed.
-Most of \gls{dt} usages are related to prognostics and health management.
-
-
-One of the many applications of \gls{dt} is timely system maintenance.
-In aerospace engineering, the \gls{dt} can reliably manage the health of the physical entity by detecting \eg fatigue cracks on aircraft wings or damage to the wind turbine blades~\cite{DBLP:conf/cirp/TAO2018169}.
-A forecast of future maintenance and virtual health management are the prime purpose of many \gls{dt}s~\cite{DBLP:conf/AIAA/Teugel2012}.
-
-Optimal datacenter management is characterized by high service availability and low downtime.
-However, achieving this in a 21\textsuperscript{st} century datacenter requires revolutionary changes in the way datacenters are operated and maintained.
-A concept that creates just such a revolutionary change is the \gls{dcdt}.
-% This sentence is stolen from an article.
-% Make sure to paraphrase it.
-
-% This is stolen from the AIAA article.
-% Make sure to paraphrase this.
+%Why predictive analytics? Why predictive behaviour?
+%What is below here is true, but nonetheless the argumentation should be slightly changed. And a citation is needed.
+However, little effort has been made to integrate analytics that enable consistent and reliable prediction of datacenter behaviour into a holistic digital twin of a datacenter.
+Nor has the fidelity of failure modeling inside a datacenter simulation increased.
+The failure model is still a linear model.
+% Since a datacenter simulator is quite different from a digital twin, we cannot use the same computation methods (not as they are right now, at least) -- we must adapt them.
+The prediction models used for the digital twin are the same as those used for the datacenter simulator.
+Since a digital twin is not a standalone simulator, a change to how we both predict and model failures is necessary.
+The longer the DT is working, the more accurate its predictions.
+All the results are aggregated.
+% Why has not anyone done this before?
+This has become possible only because of recent developments in High Performance Computing.
+Between 2003 and 2011 the compute needed to run a Digital Twin was simply not there.
+As such, while the concept existed, the hardware had not yet caught up.
+However, in the last decade, multicore computing paradigms and the advent of GPU computing have finally enabled the computation needed to run a Digital Twin.
+This is what has changed, so that today running a digital twin is relevant, much more relevant than it was 10 years ago.
+This is also why nobody has done a Digital Twin of a datacenter before.
+The current widespread availability of HPC makes this possible.
+Because of judgement born out of experience, the evolution of existing datacenters is fairly successful; however, the development of new, modern datacenters is fraught with unexpected problems that result in weight growth, schedule delays and cost overruns.
+Optimal datacenter management is characterized by high service availability and low downtime.
+Achieving this in a 21\textsuperscript{st} century datacenter requires revolutionary changes in the way datacenters are operated and maintained.
+A concept that creates just such a revolutionary change is the \gls{dcdt}.
diff --git a/content/intro.tex b/content/intro.tex
index 246d8bf..97935d8 100644
--- a/content/intro.tex
+++ b/content/intro.tex
@@ -1,29 +1,32 @@
\chapter{Introduction}\label{s:intro}
-Modern society is a technological society.
-Presently, computer and network ecosystems play a crucial part not only in the digital industry, but also in everyone's daily lives.
-Today, the transport, education and government sectors largely depend on server-side services, which are hosted in datacentres~\cite{DBLP:journals/corr/IosupKLVG22}.
-To address the recent rise in demand due to the \gls{ai} revolution managers expand datacenters with new components and more heterogenous architectures (e.g., GPUs and NPUs)~\cite{DBLP:conf/date/MilojicicFDR21}.
+Presently, computer and network ecosystems play a crucial part in the digital industry.
+The transport, education and government sectors largely depend on server-side services, which are hosted in datacenters~\cite{DBLP:journals/corr/IosupKLVG22}.
+To address the recent rise in demand for computation, due to the Artificial Intelligence revolution, managers expand datacenters with new components and more heterogeneous architectures (\eg GPUs, NPUs)~\cite{DBLP:conf/date/MilojicicFDR21}.
However, in return datacenter complexity increases significantly.
-To make better operational decisions despite the massive scale, promising technologies arise such as \gls{dcdt}.
+To make better operational decisions despite the massive scale, promising technologies such as Datacenter Digital Twins are emerging~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
-\section{Context}\label{s:context}
% Why is it important?
Datacenters house large volumes of computers for processing and storing data from various organizations and fields of activity.
-76\% of large companies worldwide spend more than 5 million USD\$ on hosted services each month, making datacenters one of the most important components of the digital society~\cite{DBLP:report/Flexera2026}.
-Additionally, in Netherlands alone over 25\% of professionals depend on cloud services in their everyday work.
-Faced with growing demand, this fraction will exceed 35\% by 2025~\cite{DBLP:journals/corr/IosupKLVG22}.
-% Why is this a problem now?
-
-The increasing popularity of \gls{genai} and monthly releases of powerful \gls{llm} have driven the demand for datacenter services for the past 4 years.
-In the \gls{ai} economy datacenters need diverse and scalable server architectures, because inference-based workloads require more heterogenous server components (GPUs, TPUs, NPUs \etc) to perform well.
-As such, datacenter operators try to meet customer expectations by adding more specialized hardware~\cite{DBLP:conf/date/MilojicicFDR21}, at a cost of increased system complexity.
+Over 3 million jobs in the Netherlands directly depend on cloud services, which are hosted in datacenters.
+Since many public services continue to move online (\eg online administration and taxation, education), the fraction of Dutch professionals who depend on the cloud for work will exceed 35\% by 2025~\cite{DBLP:journals/corr/IosupKLVG22}.
+
+% What is changing?
+In the modern \gls{ai} economy, datacenters need diverse and scalable server architectures, because inference-based workloads require more heterogeneous server components (GPUs, TPUs, NPUs \etc) to perform well.
+Nowadays, datacenter operators try to meet customer expectations by adding more specialized hardware~\cite{DBLP:conf/date/MilojicicFDR21}, at the cost of increased system complexity.
In return, operating a modern datacenter warehouse with thousands of diversified servers presents a difficult challenge that requires fast and well-informed decisions from on-site engineers.
-Quick and correct decision-making in a 21\textsuperscript{st} century datacenter is a hard task.
-Oftentimes unexpected events such as \eg service failures or hardware faults result in a downtime that disturbs the users and produces unfulfilled \gls{sla}.
-What is more, the rapid expansion of datacenters promotes increased presence of failures across all cloud services~\cite{DBLP:conf/acsos/TalluriOVTI21}.
-Currently, preventing service outages in advance could help datacenter operators reduce substantial operational costs, as over 20\% of all reported failure-caused outages amount to more than 1 million US\$~\cite{DBLP:report/AnnualOutageAnalysis2025}.
-However, predicting datacenter behaviour quickly and reliably is a non-trivial problem that still remains insufficiently unaddressed~\cite{DBLP:conf/wosp/SumanCNTMI24}.
+The \gls{ai} computational requirements are expected to increase in the future~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
+Datacenter complexity will continue to grow, and it will become more difficult to manage.
+Future servers will include more specialized hardware, which, while improving datacenter performance, will exhibit behaviour that is harder to predict.
+Already the rapid expansion of datacenters has increased the presence of service failures across all cloud services~\cite{DBLP:conf/acsos/TalluriOVTI21}.
+Preventing failure-caused outages in advance could help datacenter operators reduce operational costs, as over 20\% of all reported outages amount to more than 1 million US\$~\cite{DBLP:report/AnnualOutageAnalysis2025}.
+%Moreover, datacenter outages can have catastrophic consequences, cite Fabian.
+
+In short, the high computational demand and the end of Dennard scaling have resulted in the rise of larger and more heterogeneous datacenter architectures~\cite{DBLP:conf/date/MilojicicFDR21}.
+Both events create a need for more careful datacenter management to tackle the unprecedented complexity and ensure availability of all cloud services.
+Specific goals that can help satisfy these needs involve maximizing the monitoring insights that support better-informed decisions and minimizing the downtime caused by maintenance and hardware failures.
+To address these problems, the concept of a datacenter \gls{dt} was proposed~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
+
\begin{figure}
\centering
\includegraphics[width=0.95\linewidth]{images/five_dimensional_dt.pdf}
@@ -34,36 +37,39 @@ However, predicting datacenter behaviour quickly and reliably is a non-trivial p
% (3) in the original paper by Fei Tao is referenced to just `Services`.
% Nonetheless I name them here as Data Analysis Services, because what Fei Tao lists (e.g., fault detection, fault determination, fault-tolerant management, maintenance) is inherently reliant on good data analytics.
-The expanding \gls{ai} economy and the end of Moore's law have resulted in the rise of more heterogeneous datacenter architectures~\cite{DBLP:conf/date/MilojicicFDR21}.
-This means that in modern datacenters there are more server racks and each rack may contain multiple different hardware architectures.
-These events have created a need for:
-\begin{enumerate}
- \item More careful datacenter management to tackle the unprecedented complexity
- \item Greater availability of cloud services
- \item Lesser downtime and lower electricity cost
-\end{enumerate}
-Specific goals that can help satisfy these needs are:
-\begin{enumerate}
- \item Reducing the downtime of failured-caused outages
- \item Maximising the monitoring insights that can help make better informed operational decisions
- \item Minimizing the downtime caused by server maintenance and hardware inspections
-\end{enumerate}
+\section{Context}\label{s:context}
+
+% A digital twin is often called a virtual twin.
+% The communication between a physical entity and the digital twin is referred to as a digital thread.
+A \gls{dt} is a digital model of an intended or actual real-world system that serves as a digital counterpart of it for purposes such as simulation, integration, testing, monitoring and maintenance.
+The digital twin replicates the physical system to predict failures, prescribe real-time actions for mitigating unexpected events, and observe and evaluate the profile of the system~\cite{WIKI:page/DigitalTwin}.
+
+Most modern uses of \gls{dt}s relate to prognostics and system health management~\cite{DBLP:conf/cirp/TAO2018169}.
+For example, in aerospace engineering, the \gls{dt} analyzes operational data (\eg temperature, vibration) to predict when an airplane component is likely to fail.
+The \gls{dt} can reliably manage the health of the physical entity by detecting fatigue cracks on aircraft wings or damage to wind turbine blades~\cite{DBLP:journal/IJAE/Teugel2011}.
+This allows maintenance to be scheduled proactively, reducing unplanned downtime and preventing catastrophic failures.
+Forecasting future maintenance and managing system health virtually are the prime purposes of many \gls{dt}s used in practice~\cite{DBLP:conf/AIAA/Teugel2012}.
+
+The first mention of a \gls{dt} dates back to 2003, when Dr. Michael Grieves of Dassault Syst\`emes introduced the three core components of a \gls{dt}: the virtual entity, the physical entity and the two-way connection between them (see Figure~\ref{fig:five_dimensional_dt}).
+Due to insufficient technological foundations, little work is available on \gls{dt}s between 2003 and 2018, and it is only with the rapid growth of cloud computing, \gls{iot} and Big Data analytics that \gls{dt}s have re-emerged.
+Today, research is focused on bridging the gap between the long-established foundations of \gls{dt}s and novel applications in academia and industry, such as the \gls{dcdt}~\cite{DBLP:conf/cirp/TAO2018169, DBLP:journals/computer/AthavaleBBMMPS24}.
A \gls{dcdt} mirrors the structure, context and behaviour of a datacenter~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
-Crucial to \gls{dt} operation are predictive capabilities and the continuous interaction with the real-world datacenter.
+Crucial to \gls{dcdt} operation are predictive capabilities and the continuous interaction with the real-world datacenter.
There already exist digital twin deployments.
For example, ExaDigiT~\cite{DBLP:conf/sc/BrewerMKWBHSGGW24} is a framework for digital twin development of supercomputers.
-It has been demonstrated at the Frontier supercomputer and it facilitates virtual prototyping and system optimization, however it lacks core \gls{dt} functions, such as reliable predictive analytics.
+It has been demonstrated at the Frontier supercomputer and it facilitates virtual prototyping and system optimization.
+Quick and correct decision-making in a 21\textsuperscript{st} century datacenter is a hard task.
+Unexpected events such as service failures or hardware faults often result in downtime that disrupts users and leaves \gls{sla}s unfulfilled~\cite{DBLP:conf/acsos/TalluriOVTI21}.
+However, predicting datacenter behaviour quickly and reliably is a non-trivial problem that remains insufficiently addressed in the existing \gls{dcdt} architectures~\cite{DBLP:conf/wosp/SumanCNTMI24, DBLP:journals/computer/AthavaleBBMMPS24}.
\section{Problem statement}\label{s:problem-statement}
-In this work we argue that the current state-of-the-art ICT Digital Twins lack predictive capabilities that are essential to real-time facility management.
-We propose that digital twinning can be enhanced by integrating \gls{oda} through predictive analytics.
+In this work we argue that the current state-of-the-art Datacenter Digital Twins lack sufficient predictive capabilities, which are essential to the real-time facility management of a modern datacenter.
+We propose that digital twinning can be enhanced by integrating predictive analytics through \gls{oda}.
\section{Research Questions}\label{s:research-questions}
-We divide the problem of enabling predictive analytics using digital twinning into three research questions:
-
\begin{enumerate}[label=\textbf{RQ\arabic*.}, align=left]
\item \textbf{How to define 5 \gls{dcdt} use-cases and their functional and non-functional requirements?}
\item \textbf{How to design a \gls{dcdt} system model using discrete-event simulation and operational data analysis?}