Diffstat (limited to 'content/intro.tex')
| -rw-r--r-- | content/intro.tex | 84 |
1 files changed, 45 insertions, 39 deletions
diff --git a/content/intro.tex b/content/intro.tex
index 246d8bf..97935d8 100644
--- a/content/intro.tex
+++ b/content/intro.tex
@@ -1,29 +1,32 @@
 \chapter{Introduction}\label{s:intro}
-Modern society is a technological society.
-Presently, computer and network ecosystems play a crucial part not only in the digital industry, but also in everyone's daily lives.
-Today, the transport, education and government sectors largely depend on server-side services, which are hosted in datacentres~\cite{DBLP:journals/corr/IosupKLVG22}.
-To address the recent rise in demand due to the \gls{ai} revolution managers expand datacenters with new components and more heterogenous architectures (e.g., GPUs and NPUs)~\cite{DBLP:conf/date/MilojicicFDR21}.
+Presently, computer and network ecosystems play a crucial part in the digital industry.
+The transport, education and government sectors largely depend on server-side services, which are hosted in datacenters~\cite{DBLP:journals/corr/IosupKLVG22}.
+To address the recent rise in demand for computation, due to the Artificial Intelligence revolution, managers expand datacenters with new components and more heterogeneous architectures (\eg GPUs, NPUs)~\cite{DBLP:conf/date/MilojicicFDR21}.
 However, in return datacenter complexity increases significantly.
-To make better operational decisions despite the massive scale, promising technologies arise such as \gls{dcdt}.
+To make better operational decisions despite the massive scale, promising technologies arise such as Datacenter Digital Twins~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
-\section{Context}\label{s:context}
 % Why is it important?
 Datacenters house large volume of computers for processing and storage of data from various organizations and fields of activity.
-76\% of large companies worldwide spend more than 5 million USD\$ on hosted services each month, making datacenters one of the most important components of the digital society~\cite{DBLP:report/Flexera2026}.
-Additionally, in Netherlands alone over 25\% of professionals depend on cloud services in their everyday work.
-Faced with growing demand, this fraction will exceed 35\% by 2025~\cite{DBLP:journals/corr/IosupKLVG22}.
-% Why is this a problem now?
-
-The increasing popularity of \gls{genai} and monthly releases of powerful \gls{llm} have driven the demand for datacenter services for the past 4 years.
-In the \gls{ai} economy datacenters need diverse and scalable server architectures, because inference-based workloads require more heterogenous server components (GPUs, TPUs, NPUs \etc) to perform well.
-As such, datacenter operators try to meet customer expectations by adding more specialized hardware~\cite{DBLP:conf/date/MilojicicFDR21}, at a cost of increased system complexity.
+Over 3 million jobs in the Netherlands directly depend on cloud services, which are hosted in datacenters.
+Since many public services continue to move online (\eg online administration and taxation, education), the fraction of Dutch professionals who depend on the cloud for work will exceed 35\% by 2025~\cite{DBLP:journals/corr/IosupKLVG22}.
+
+% What is changing?
+In the modern \gls{ai} economy datacenters need diverse and scalable server architectures, because inference-based workloads require more heterogenous server components (GPUs, TPUs, NPUs \etc) to perform well.
+Nowadays, datacenter operators try to meet customer expectations by adding more specialized hardware~\cite{DBLP:conf/date/MilojicicFDR21}, at a cost of increased system complexity.
 In return, operating a modern datacenter warehouse with thousands of diversified servers presents a difficult challenge that requires fast and well-informed decisions from on-site engineers.
-Quick and correct decision-making in a 21\textsuperscript{st} century datacenter is a hard task.
-Oftentimes unexpected events such as \eg service failures or hardware faults result in a downtime that disturbs the users and produces unfulfilled \gls{sla}.
-What is more, the rapid expansion of datacenters promotes increased presence of failures across all cloud services~\cite{DBLP:conf/acsos/TalluriOVTI21}.
-Currently, preventing service outages in advance could help datacenter operators reduce substantial operational costs, as over 20\% of all reported failure-caused outages amount to more than 1 million US\$~\cite{DBLP:report/AnnualOutageAnalysis2025}.
-However, predicting datacenter behaviour quickly and reliably is a non-trivial problem that still remains insufficiently unaddressed~\cite{DBLP:conf/wosp/SumanCNTMI24}.
+The \gls{ai} computational requirements are expected to increase in the future~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
+Datacenter complexity will continue to grow, and it will become more difficult to manage.
+Future servers will include more specialized hardware, which, while improving datacenter performance, will exhibit behaviour that is harder to predict.
+Already the rapid expansion of datacenters has increased the presence of service failures across all cloud services~\cite{DBLP:conf/acsos/TalluriOVTI21}.
+Preventing failure-caused outages in advance could help datacenter operators reduce operational costs, as over 20\% of all reported outages amount to more than 1 million US\$~\cite{DBLP:report/AnnualOutageAnalysis2025}.
+%Moreover, datacenter outages can have catastrophic consequences, cite Fabian.
+
+In short, the high computational demand and the end of Dennard's scaling have resulted in the rise of larger and more heterogeneous datacenter architectures~\cite{DBLP:conf/date/MilojicicFDR21}.
+Both events create a need for more careful datacenter management to tackle the unprecedented complexity and ensure availability of all cloud services.
+Specific goals that can help satisfy these needs involve maximising the monitoring insights to help make better informed decisions and minimizing the downtime caused by maintenance and hardware failures.
+To address these problems a concept of a datacenter \gls{dt} was proposed~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
+
 \begin{figure}
     \centering
     \includegraphics[width=0.95\linewidth]{images/five_dimensional_dt.pdf}
@@ -34,36 +37,39 @@ However, predicting datacenter behaviour quickly and reliably is a non-trivial p
 % (3) in the original paper by Fei Tao is referenced to just `Services`.
 % Nonetheless I name them here as Data Analysis Services, because what Fei Tao lists (e.g., fault detection, fault determination, fault-tolerant management, maintenance) is inherently reliant on good data analytics.
-The expanding \gls{ai} economy and the end of Moore's law have resulted in the rise of more heterogeneous datacenter architectures~\cite{DBLP:conf/date/MilojicicFDR21}.
-This means that in modern datacenters there are more server racks and each rack may contain multiple different hardware architectures.
-These events have created a need for:
-\begin{enumerate}
-    \item More careful datacenter management to tackle the unprecedented complexity
-    \item Greater availability of cloud services
-    \item Lesser downtime and lower electricity cost
-\end{enumerate}
-Specific goals that can help satisfy these needs are:
-\begin{enumerate}
-    \item Reducing the downtime of failured-caused outages
-    \item Maximising the monitoring insights that can help make better informed operational decisions
-    \item Minimizing the downtime caused by server maintenance and hardware inspections
-\end{enumerate}
+\section{Context}\label{s:context}
+
+% A digital twin is often called a virtual twin.
+% The communication between a physical entity and the digital twin is referred to as a digital thread.
+A \gls{dt} is a digital model of an intended or actual real-world system that serves as a digital counterpart of it for purposes such as simulation, integration, testing, monitoring and maintenance.
+The digital twin replicates the physical system to predict failures, prescribe real-time actions for mitigating unexpected events, observing and evaluating the profile of the system~\cite{WIKI:page/DigitalTwin}.
+
+Most of modern \gls{dt} usages are related to prognostics and system health management~\cite{DBLP:conf/cirp/TAO2018169}.
+For example, in aerospace engineering, the \gls{dt} analyzes operational data (\eg temperature, vibration) to predict when a airplane component is likely to fail.
+The \gls{dt} can reliably manage the health of the physical entity by detecting fatigue cracks on aircraft wings or damage to the wind turbine blades~\cite{DBLP:journal/IJAE/Teugel2011}.
+This allows maintenance to be scheduled proactively, reducing unplanned downtime and preventing catastrophic failures.
+A forecast of future maintenance and virtual health management are the prime purpose of many \gls{dt}s used in practice~\cite{DBLP:conf/AIAA/Teugel2012}.
+
+The first mention of a \gls{dt} dates back to 2003, when Dr. Michael Grieves of Dassault Syst\'emes introduced the 3 core components of a \gls{dt}: the virtual entity, physical entity and the two-way connection (see Figure \ref{fig:five_dimensional_dt}).
+Due to insufficient technological foundations, little work is available on \gls{dt}s between 2003 and 2018, and it is only with the rapid growth of cloud computing, \gls{iot} and Big Data analytics that \gls{dt}s have re-emerged.
+Today, research is focused on bridging the gap between the long-established foundations of \gls{dt}s and new, novel applications in academia and industry, such as the \gls{dcdt}~\cite{DBLP:conf/cirp/TAO2018169, DBLP:journals/computer/AthavaleBBMMPS24}.
 A \gls{dcdt} mirrors the structure, context and behaviour of a datacenter~\cite{DBLP:journals/computer/AthavaleBBMMPS24}.
-Crucial to \gls{dt} operation are predictive capabilities and the continuous interaction with the real-world datacenter.
+Crucial to \gls{dcdt} operation are predictive capabilities and the continuous interaction with the real-world datacenter.
 There already exist digital twin deployments.
 For example, ExaDigiT~\cite{DBLP:conf/sc/BrewerMKWBHSGGW24} is a framework for digital twin development of supercomputers.
-It has been demonstrated at the Frontier supercomputer and it facilitates virtual prototyping and system optimization, however it lacks core \gls{dt} functions, such as reliable predictive analytics.
+It has been demonstrated at the Frontier supercomputer and it facilitates virtual prototyping and system optimization.
+Quick and correct decision-making in a 21\textsuperscript{st} century datacenter is a hard task.
+Oftentimes unexpected events such as \eg service failures or hardware faults result in a downtime that disturbs the users and produces unfulfilled \gls{sla}~\cite{DBLP:conf/acsos/TalluriOVTI21}.
+However, predicting datacenter behaviour quickly and reliably is a non-trivial problem that remains insufficiently unaddressed in the existing \gls{dcdt} architectures ~\cite{DBLP:conf/wosp/SumanCNTMI24, DBLP:journals/computer/AthavaleBBMMPS24}.
 \section{Problem statement}\label{s:problem-statement}
-In this work we argue that the current state-of-the-art ICT Digital Twins lack predictive capabilities that are essential to real-time facility management.
-We propose that digital twinning can be enhanced by integrating \gls{oda} through predictive analytics.
+In this work we argue that the current state-of-the-art Datacenter Digital Twins lack sufficient predictive capabilities that are essential to real-time facility management of a modern datacenter.
+We propose that digital twinning can be enhanced by integrating predictive analytics through \gls{oda}.
 \section{Research Questions}\label{s:research-questions}
-We divide the problem of enabling predictive analytics using digital twinning into three research questions:
-
 \begin{enumerate}[label=\textbf{RQ\arabic*.}, align=left]
     \item \textbf{How to define 5 \gls{dcdt} use-cases and their functional and non-functional requirements?}
     \item \textbf{How to design a \gls{dcdt} system model using discrete-event simulation and operational data analysis?}
