From a5a140c6286e8b113ca8d371f88e3ed54e731cea Mon Sep 17 00:00:00 2001
From: mjkwiatkowski
Date: Sun, 17 May 2026 14:21:09 +0200
Subject: feat: added lots of citations and slowly finishing the introduction

---
 content/intro.tex | 76 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 59 insertions(+), 17 deletions(-)

(limited to 'content/intro.tex')

diff --git a/content/intro.tex b/content/intro.tex
index 58a759a..246d8bf 100644
--- a/content/intro.tex
+++ b/content/intro.tex
@@ -1,39 +1,81 @@
 \chapter{Introduction}\label{s:intro}

-Today's transportation systems, education and government largely depend on server-side services, which are hosted in datacentres~\cite{DBLP:journals/corr/IosupKLVG22}.
-To facilitate the rising demand managers expand datacenters with new components and more heterogenous architectures (e.g., GPUs and NPUs)~\cite{DBLP:conf/date/MilojicicFDR21}.
+Modern society is a technological society.
+Presently, computer and network ecosystems play a crucial part not only in the digital industry, but also in everyone's daily lives.
+Today, the transport, education and government sectors largely depend on server-side services, which are hosted in datacenters~\cite{DBLP:journals/corr/IosupKLVG22}.
+To address the recent rise in demand due to the \gls{ai} revolution, managers expand datacenters with new components and more heterogeneous architectures (e.g., GPUs and NPUs)~\cite{DBLP:conf/date/MilojicicFDR21}.
 However, in return datacenter complexity increases significantly.
-To make better operational decisions despite the massive scale, new, promising technologies arise, such as datacenter Digital Twins.
+To make better operational decisions despite the massive scale, promising technologies arise, such as \gls{dcdt}.

 \section{Context}\label{s:context}
-Datacenters are one of the most important components of the digital society.
-For example, over 25\% of professionals in the Netherlands depend on cloud services in their everyday work.
-Faced with growing demand, this fraction will exceed 35\% by 2025~\cite{DBLP:journals/corr/IosupKLVG22}.
-What is more, the surge of AI and Machine Learning workloads opens the need for versatile server architectures, pushing datacenter managers to meet customer expectations by adding more specialized hardware~\cite{DBLP:conf/date/MilojicicFDR21}.
-In return, operating a modern datacenter with thousands of diversified servers presents a yet unsolved, non-trivial challenge that requires fast and well-informed decisions from on-site engineers.
+% Why is it important?
+Datacenters house a large number of computers for processing and storing data from various organizations and fields of activity.
+76\% of large companies worldwide spend more than 5 million USD on hosted services each month, making datacenters one of the most important components of the digital society~\cite{DBLP:report/Flexera2026}.
+Additionally, in the Netherlands alone, over 25\% of professionals depend on cloud services in their everyday work.
+Faced with growing demand, this fraction was projected to exceed 35\% by 2025~\cite{DBLP:journals/corr/IosupKLVG22}.
+% Why is this a problem now?

-To aid in datacenter management, operators turn to \gls{oda}, which is the process of analyzing monitoring data to gain insights into the system behavior.
-For example, OMNI at \gls{nersc} and Wintermute at \gls{lrz} employ descriptive analytics to optimize power usage effectivenes~\cite{DBLP:conf/icppw/BourassaJBCJVS19} and prescriptive analysis for energy efficient scheduling~\cite{DBLP:conf/hpdc/NettiMGOTO020}.
-Nonetheless, we observe a critical lack of predictive analysis capabilities~\cite{DBLP:conf/wosp/SumanCNTMI24} among the existing \gls{oda} frameworks.
-In result, datacenter operators are often confronted with operational decisions with limited time to react, which can lead to missed \gls{sla}.
+The increasing popularity of \gls{genai} and monthly releases of powerful \gls{llm} have driven the demand for datacenter services for the past four years.
+In the \gls{ai} economy, datacenters need diverse and scalable server architectures, because inference-based workloads require more heterogeneous server components (GPUs, TPUs, NPUs, \etc) to perform well.
+As such, datacenter operators try to meet customer expectations by adding more specialized hardware~\cite{DBLP:conf/date/MilojicicFDR21}, at the cost of increased system complexity.
+In return, operating a modern datacenter warehouse with thousands of diversified servers presents a difficult challenge that requires fast and well-informed decisions from on-site engineers.

-``Lab-built, preproduction, or early hardware does \textit{not} work as defined, does \textit{not} work reliably and does \textit{not} stay the same from day to day'',
-according to Frederick P. Brooks.
-A solution is a dependable simulator of the system~\cite{DBLP:books/daglib/Brooks0080747}.
-A novel improvement on simulation is a datacenter \gls{dt}~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
+Quick and correct decision-making in a 21\textsuperscript{st}-century datacenter is a hard task.
+Oftentimes, unexpected events such as service failures or hardware faults result in downtime that disturbs the users and produces unfulfilled \gls{sla}.
+What is more, the rapid expansion of datacenters leads to an increased presence of failures across all cloud services~\cite{DBLP:conf/acsos/TalluriOVTI21}.
+Preventing service outages in advance could help datacenter operators avoid substantial operational costs, as over 20\% of all reported failure-caused outages cost more than 1 million USD~\cite{DBLP:report/AnnualOutageAnalysis2025}.
+However, predicting datacenter behaviour quickly and reliably is a non-trivial problem that remains insufficiently addressed~\cite{DBLP:conf/wosp/SumanCNTMI24}.
+\begin{figure}
+    \centering
+    \includegraphics[width=0.95\linewidth]{images/five_dimensional_dt.pdf}
+    \caption{A basic framework for the \gls{dt}. Four core elements of a \gls{dt} are defined: the physical entity \one and the simulated virtual twin \two. A service for out-of-band data analytics \three and a persistent storage of historical data \four are crucial to the \gls{dt} because they are necessary to gain meaningful monitoring insights. Adapted from Tao \etal~\cite{DBLP:conf/cirp/TAO2018169}.}
+    %Fei Tao is a renowned figure with over 62k citations. He is a figure of authority on digital twins.%
+    \label{fig:five_dimensional_dt}
+\end{figure}
+% (3) in the original paper by Fei Tao is referred to as just `Services`.
+% Nonetheless I name them here as Data Analysis Services, because what Fei Tao lists (e.g., fault detection, fault determination, fault-tolerant management, maintenance) is inherently reliant on good data analytics.
+The expanding \gls{ai} economy and the end of Moore's law have resulted in the rise of more heterogeneous datacenter architectures~\cite{DBLP:conf/date/MilojicicFDR21}.
+This means that in modern datacenters there are more server racks and each rack may contain multiple different hardware architectures.
+These events have created a need for:
+\begin{enumerate}
+    \item More careful datacenter management to tackle the unprecedented complexity
+    \item Greater availability of cloud services
+    \item Less downtime and lower electricity cost
+\end{enumerate}
+Specific goals that can help satisfy these needs are:
+\begin{enumerate}
+    \item Reducing the downtime of failure-caused outages
+    \item Maximising the monitoring insights that can help make better informed operational decisions
+    \item Minimising the downtime caused by server maintenance and hardware inspections
+\end{enumerate}
+
+A \gls{dcdt} mirrors the structure, context and behaviour of a datacenter~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
+Crucial to \gls{dt} operation are predictive capabilities and the continuous interaction with the real-world datacenter.
+There already exist digital twin deployments.
+For example, ExaDigiT~\cite{DBLP:conf/sc/BrewerMKWBHSGGW24} is a framework for digital twin development of supercomputers.
+It has been demonstrated at the Frontier supercomputer and it facilitates virtual prototyping and system optimization; however, it lacks core \gls{dt} functions, such as reliable predictive analytics.

 \section{Problem statement}\label{s:problem-statement}
+In this work we argue that the current state-of-the-art ICT Digital Twins lack predictive capabilities that are essential to real-time facility management.
+We propose that digital twinning can be enhanced by integrating \gls{oda} through predictive analytics.
+
 \section{Research Questions}\label{s:research-questions}
+We divide the problem of enabling predictive analytics using digital twinning into three research questions:
+\begin{enumerate}[label=\textbf{RQ\arabic*.}, align=left]
+    \item \textbf{How to define five \gls{dcdt} use cases and their functional and non-functional requirements?}
+    \item \textbf{How to design a \gls{dcdt} system model using discrete-event simulation and operational data analysis?}
+    \item \textbf{How to validate whether the \gls{dcdt} system meets the functional and non-functional requirements?}
+\end{enumerate}

 \section{Research Methodology}\label{s:research-methodology}

 \section{Thesis Contributions}\label{s:thesis-contributions}

 \section{Plagiarism Declaration}\label{s:plagiarism-declaraion}
-I hereby declare that this thesis is my own independent work and writing.
+I hereby declare that this thesis is my own independent work and writing.
 The thesis does not contain any material copied from other sources (person, Internet, or AI), and has not been submitted for assessment elsewhere.

 \section{Societal Impact}\label{s:societal-impact}
-- 
cgit v1.2.3