Diffstat (limited to 'main.tex')
| -rw-r--r-- | main.tex | 128 |
1 files changed, 95 insertions, 33 deletions
@@ -23,16 +23,14 @@ \begin{tcolorbox}[title=DCDT's lack predictive analytics] We need Datacenter Digital Twins (DCDT) to be better able to detect and solve issues in critical ICT infrastructure~\cite{DBLP:journals/computer/AthavaleBBMMPS24}. However, DCDT's are still actively developed and lack crucial features such as predictive analytics~\cite{DBLP:usdoe/report/AP26894} to \emph{e.g.,} prevent unexpected failures. - With predictive analysis (\emph{e.g.,} simulation) DCDT's could save millions of lost \$USD~\cite{DBLP:conf/acsos/TalluriOVTI21}. \end{tcolorbox} \begin{center} \includegraphics[width=0.9\linewidth]{images/predictive_analytics.pdf} \end{center} \tiny - \textbf{Figure 1.2:} Where does our work fit within the field of datacenter digital twinning? - There are 5 core elements to any Digital Twin: \myCircled{A} The Digital $\rightarrow$ Physical Twin link, \myCircled{B} the Physical Twin (\emph{e.g.,} the datacenter), \myCircled{C} the Physical $\rightarrow$ Digital Twin link, \myCircled{D} the Digital Twin, \myCircled{E} the features necessary to any Digital Twin. - \textcolor{Green}{\faHighlighter~Highlighted areas are the contributions from this thesis, which include the autonomous actions resulting from predictive insights \myCircledGreen{A} and the predictive analysis itself within \myCircledGreen{E}.} + \textbf{Figure 1.2:} Datacenter Digital Twin Diagram. There are 5 core elements to any Digital Twin: \myCircled{A} The Digital $\rightarrow$ Physical Twin link, \myCircled{B} the Physical Twin (\emph{e.g.,} the datacenter), \myCircled{C} the Physical $\rightarrow$ Digital Twin link, \myCircled{D} the Digital Twin, \myCircled{E} the features necessary to any Digital Twin. 
+ \textcolor{Green}{\faHighlighter~Highlighted areas are the contributions from this thesis, which include the autonomous actions resulting from predictive insights \myCircledGreen{A} and the predictive analysis (including simple storage capabilities) within \myCircledGreen{E}.} \end{frame} \begin{frame}\frametitle{Research Questions} @@ -50,13 +48,13 @@ \begin{tcolorbox}[title=Research Question 3] % no "and validate?" - How to evaluate a datacenter digital twin architecture in relation to system requirements? + How to validate and evaluate a datacenter digital twin architecture in relation to system requirements? \end{tcolorbox} \end{frame} \begin{frame}\frametitle{\textbf{RQ1}: Literature Review I} - \begin{tcolorbox}[title=Main Finding] - The literature on DCDTs is scarce. + \begin{tcolorbox}[title=Main Finding I] + The literature on DCDTs is sparse. Some systems barely classify as DTs (\emph{e.g.,} Kalibre~\cite{DBLP:conf/sensys/WangZD0TCWZ20}, ChatTwin~\cite{DBLP:conf/sensys/LiW0Z0T23}). Existing deployments specialize in \textcolor{Red}{Cooling and Heat Modelling}, together with \textcolor{Red}{3D visualizations}. Most lack predictive modelling of DC operations. @@ -99,22 +97,22 @@ % Change to Datacenter (Physical Twin) \includegraphics[width=1.15\textwidth]{images/ref_architecture.pdf} \end{center} - \vspace{-0.2cm} + \vspace{-0.15cm} \tiny \textbf{Figure 1.4:} The predictive datacenter digital twin reference architecture. - The architecture was designed with the \emph{AtLarge Design Process}~\cite{DBLP:conf/icdcs/IosupVTETBFMT19}. + The architecture was designed with the \emph{AtLarge Design Process}~\cite{DBLP:conf/icdcs/IosupVTETBFMT19} over several iterations in the past months. 
\vspace{0.2cm} \end{minipage} - \hspace{0.8cm} + \hspace{0.6cm} \begin{minipage}[b]{0.45\linewidth} \begin{center} - \includegraphics[width=1.15\linewidth]{images/implementation.png} + \includegraphics[width=1.17\linewidth]{images/implementation.png} \end{center} \vspace{-0.2cm} \tiny - \textbf{Figure 1.5:} The prototype components based on \textbf{Figure 1.4}. + \textbf{Figure 1.5:} The prototype -- \emph{Sunfish}, and its components based on \textbf{Figure 1.4}. The time-series data flows first to the \texttt{Grafana} dashboard, \texttt{PostgreSQL} database and \texttt{Redis} cache~\cite{DBLP:conf/sc/TaheriBPRHDEWPM24}. - \vspace{0.2cm} + \vspace{0.1cm} \end{minipage} % We decided to use discrete-event simulation, as opposed to computational fluid dynamics because of the high overheads of development time needed for CFD. @@ -124,6 +122,7 @@ \end{frame} % You should skip \hfill completely or in favour of \hspace very minimally. \begin{frame}\frametitle{\textbf{RQ3}: Experimental Setup} + \hspace{-0.3cm} \begin{minipage}[b]{0.45\linewidth} \begin{tcolorbox}[title=Problem, colbacktitle=red!70!black,colback=red!20!white] We cannot just go and test digital twins on large systems, because we do not have large systems at hand. @@ -150,6 +149,7 @@ \end{minipage} \end{frame} + \begin{frame}\frametitle{\textbf{RQ3}: Experimental Results I} % You have some model, and this can be based on multiple traces. %Get insight from CINECA --> you get a probability of certain hosts failing. @@ -157,20 +157,31 @@ %If you incorporate that? If you can make the case that because of our new digital twin we can incorporate such models, anomaly/failure detection, from CINECA. %If we had that in, we can reach these kinds of gains. % @Mateusz there is really not a possibility to incorporate CINECA's models, so to address Dante's feedback, I created this experiment. 
- - \begin{tcolorbox}[title=Failure Detection: Main Finding I] - On average, \emph{Sunfish} can detect 14.5\% of unexpected failures in the physical twin. - We show, that digital twinning \emph{can} be used for failure detection. - + % If a single host crashes for the entire workload, that's not really that bad. + % If a lot of hosts suddenly crash but for a really short time, that's terrible. + % Failures that are more intensive are worse than failures with long duration. + \begin{tcolorbox}[title=Main Finding II] + We posit digital twinning can be used for failure detection to the benefit of DC operators. + We replicate an experiment from DyTwin~\cite{DBLP:conf/sc/TaheriBPRHDEWPM24} designed by Milojicic \etal to show our system can reliably detect \emph{unexpected} host failures. \end{tcolorbox} + \hspace{-0.2cm} \begin{minipage}[b]{0.45\linewidth} \begin{center} - \includegraphics[width=1.1\textwidth]{images/23_Jun_2026_102028.pdf} + \includegraphics[width=1.1\textwidth]{images/25_Jun_2026_152341.pdf} \end{center} \vspace{-0.3cm} \tiny - \textbf{Figure 1.5:} Experiment 1 Setup: The Digital Twin estimates the failures based on the Normal Distribution \emph{N\textasciitilde($\mu$,$\sigma$)} with $\mu = 1.5$ and $\sigma = 0.5$. - ``Real'' OpenDC failures come from a WhatsApp user reports. + \textbf{Figure 1.7a:} Experiment 1a. In this experiment we use red and yellow alarms to notify datacenter operators of unexpected failures. + We use a threshold based on predictions done by the simulator and a statistical distribution. + \end{minipage} + \hspace{0.6cm} + \begin{minipage}[b]{0.45\linewidth} + \begin{center} + \includegraphics[width=1.1\textwidth]{images/25_Jun_2026_161052.pdf} + \end{center} + \vspace{-0.3cm} + \tiny + \textbf{Figure 1.7b:} Experiment 1b. The mean failure detection rate is around 15\%. Even though this seems low, if we look at \textbf{Fig. E.1} (see Extra Slides), this simply means around 15\% of failures are unexpected. 
\end{minipage} % Explain what the axis are in the figure caption. % Talk about the experimental setup in the figure. @@ -178,36 +189,61 @@ \end{frame} \begin{frame}\frametitle{\textbf{RQ3}: Experimental Results II} - \begin{tcolorbox}[title=Scheduling Optimization: Main Finding II] - Here explain what did you find. + \begin{tcolorbox}[title=Main Finding III] + \emph{Sunfish} is capable of dynamic adjustments to the physical twin at runtime, and can lower the mean number of failed tasks. \end{tcolorbox} - + \hspace{0.2cm} + % Let's say we have some knowledge about the kind of workload we are going to run, e.g., Skype video calls. + % We can then estimate on previous Skype node failures and one of statistical distributions when are failures likely to happen. + % During the experiment, we unfortunately do not know what kind of distribution will the failures follow, so we constantly check to see which one fits best, and dynamically adjust the scheduling policy based on that. + %---% + % Step 1: We know we are going to soon run a workload coming in from Skype. Let's try to predict the failure pattern we might encounter. + % Run the OpenDC simulator 5 times to estimate the possible failure patterns. Save the results inside the Digital Twin. + % Step 2: Run the Digital Twin. Each time a new metric comes in, update the similarity score of each possible distribution. + % If the distribution with the similarity score that is the highest is about to match timestamps with the running workload AND according to the distribution we are going to experience failures in hosts A,B,C,D, % We decide to stop scheduling tasks on hosts A,B,C and D (we send a message to the running datacenter). + \begin{minipage}[b]{0.45\linewidth} + \begin{center} + \includegraphics[width=1.1\textwidth]{images/23_Jun_2026_102028.pdf} + \end{center} + \vspace{-0.3cm} + \tiny + \textbf{Figure 1.8a:} Experiment 2a. 
We can see in this plot which failure distribution is the most likely to be the true distribution while the simulation is running. + \end{minipage} + \begin{minipage}[b]{0.45\linewidth} + \begin{center} + \includegraphics[width=1.1\textwidth]{images/23_Jun_2026_102028.pdf} + \end{center} + \vspace{-0.3cm} + \tiny + \textbf{Figure 1.8b:} Experiment 2b. The gains in number of failures from turning faulty hosts in advance. + \end{minipage} \end{frame} \begin{frame}\frametitle{Key Takeaways} - \begin{tcolorbox}[title=What is the societal context?] + \begin{tcolorbox}[title=Societal Context] Datacenter manageability is a top-priority for the digital society. Over 3 million jobs in the Netherlands directly depend on cloud services, which are hosted in datacenters~\cite{DBLP:journals/corr/IosupKLVG22}. \end{tcolorbox} - \begin{tcolorbox}[title=What problem did we solve?] + \begin{tcolorbox}[title=Problem Statement] DCDT's, still under development, lack crucial features such as predictive analytics to manage datacenters well. The entire DCDT design space remains largely unexplored. \end{tcolorbox} - \begin{tcolorbox}[title=How did we solve this problem?] - Our contributions are: a thorough literature survey with a system model, a DCDT reference architecture, and prototype-based experiments via a novel evaluation method. + \begin{tcolorbox}[title=Contributions] + (1) A thorough literature survey with a system model, (2) a DCDT reference architecture, and (3) prototype-based experiments via a novel evaluation method. \end{tcolorbox} - \begin{tcolorbox}[colbacktitle=red!70!black, colback=red!20!white,title=What did we find?] - \emph{Sunfish} is able to detect around 20\% of unexpected failures based on discrete-event predictions, and can predict the most efficient scheduling policies for given workloads. 
+ \begin{tcolorbox}[title=Main Findings] + \emph{Sunfish} can reliably detect unexpected failures based on discrete-event predictions, and can serve as a foundation for additional research and future work. \end{tcolorbox} % Mandatory to mention here the future work that you see happening. % Not enough space for another tcolorbox. \end{frame} -\setcounter{framenumber}{5} -\setbeamertemplate{footline}[page number]{ +\setcounter{framenumber}{3} +\setbeamertemplate{footline}[page number]{} + % Unfortunately this must remain here. \setbeamercolor{frametitle}{fg=Brown,bg=Brown!20} @@ -217,13 +253,36 @@ \usebeamerfont{frametitle}\insertframetitle\hfill \end{beamercolorbox} } - \begin{frame}[allowframebreaks]\frametitle{Extra Slides: References} \tiny \bibliographystyle{is-plain} \bibliography{main.bib} \end{frame} +\begin{frame}\frametitle{Extra Slides: Technical Setup } + \begin{tcolorbox}[title=What is the simulation workload?] + The compute workload is BitBrainsSmall. + The failure traces include user reports from Gmail, WhatsApp, Facebook and Twitter. + For predictions we use \texttt{prefabs}~\cite{DBLP:journals/fgcs/VersluisCGLPCUI23}. + \end{tcolorbox} + \begin{tcolorbox}[title=What is the experiment environment?] A commodity laptop: Framework Laptop 13, with 32GB of DDR5 RAM and an AMD Ryzen 7840U processor and an ArchLinux OS with Linux 7.0.13-arch1-1 kernel. + + \end{tcolorbox} + + + \begin{tcolorbox}[title=How did we adjust OpenDC (Physical Twin)?] + We use a SURF~\cite{DBLP:journals/fgcs/VersluisCGLPCUI23} datacenter topology with 277 hosts. + We wrote a custom Kotlin \texttt{ComputeMonitor} to export live-metrics into Kafka, and a custom Kotlin \texttt{HTTPClient} to talk to the digital twin. + We add a new scheduling mechanism, the \texttt{SmartScheduler}. + + \end{tcolorbox} + \begin{tcolorbox}[title=Which metrics do we measure?] 
+ Timestamps, host names, uptime, downtime, CPU utilization \emph{etc.} + \end{tcolorbox} + +\end{frame} + + \begin{frame}\frametitle{Extra Slides: Why Digital Twinning?} \begin{tcolorbox}[title=Definition] A DCDT mirrors the structure, context and behaviour of a datacenter~\cite{DBLP:journals/computer/AthavaleBBMMPS24}. The prerequisite to any digital twin is good monitoring and sensing capabilities in the physical entity. @@ -253,6 +312,9 @@ \tiny \textbf{Figure E.3:} Real-time control that is tightly-coupled with the IT equipment is a prerequisite for timely predictions within seconds/minutes~\cite{DBLP:journals/computer/AthavaleBBMMPS24}. \end{frame} + + + % Computational Fluid Dynamics (CFD) have high computation overhead, unsuitable for real-time simulation of a dynamic datacenter. %Moreover oftentimes a poorly configured CFD model can lead to high error rates~\cite{DBLP:conf/sensys/WangZD0TCWZ20}. %Data-driven Machine Learning performs poorly by the cases not covered in the training data.
