From 0203254b709614fa732c114aa25916f61b8b3275 Mon Sep 17 00:00:00 2001
From: Niels Thiele <noleu66@posteo.net>
Date: Sun, 22 Jun 2025 12:31:21 +0200
Subject: Implemented Single GPU Support & outline of host-level allocation
 policies (#342)

* renamed performance counter to distinguish different resource types

* added GPU, modelled similar to CPU

* added GPUs to machine model

* list of GPUs instead of single instance

* renamed memory speed to bandwidth

* enabled parsing of GPU resources

* split powermodel into cpu and GPU powermodel

* added gpu parsing tests

* added idea of host level scheduling

* added tests for multi gpu parsing

* renamed powermodel to cpupowermodel

* clarified naming of cpu and gpu components

* added resource type to flow supplier and edge

* added resourcetype

* added GPU components and resource type to fragments

* added GPU to workload and updated resource usage retrieval

* implemented first version of multi resource

* added name to workload

* renamed performance counters

* removed commented out code

* removed deprecated comments

* included demand and supply into calculations

* resolving rebase mismatches

* moved resource type from flowedge class to common package

* added available resources to machines

* cleaner separation of whether a workload is started from a simmachine or a vm

* Replaced exception with dedicated enum

* Only looping over resources that are actually used

* using hashmaps to handle resourcetype instead of arrays for readability

* fixed condition

* tracking finished workloads per resource type

* removed resource type from flowedge

* made supply and demand distribution resource specific

* added power model for GPU

* removed unused test setup

* removed deprecated comments

* removed unused parameter

* added ID for GPU

* added GPUs and GPU performance counters (naively)

* implemented capturing of GPU statistics

* added reminders for future implementations

* renamed properties for better identification

* added capturing GPU statistics

* implemented first tests for GPUs

* unified access to performance counters

* added interface for general compute resource handling

* implemented multi resource support in simmachine

* added individual edge to VM per resource

* extended compute resource interface

* implemented multi-resource support in PSU

* implemented generic retrieval of computeresources

* implemented multi-resource support in vm

* made method use more resource specific

* implemented simple GPU tests

* rolled back frequency and demand use

* made naming independent of used resource

* using workloads resources instead of VMs to determine available resource

* implemented determination of used resources in workload

* removed logging statements

* implemented reading from workload

* fixed naming for host-level allocation

* fixed next deadline calculation

* fixed forwarding supply

* reduced memory footprint

* made GPU powermodel nullable

* made GPU powermodel configurable in topology

* implemented tests for basic gpu scheduler

* added gpu properties

* implemented weights, filter and simple cpu-gpu scheduler

* spotless apply

* spotless apply pt. 2

* fixed capitalization

* spotless kotlin run

* implemented column export

* todo update

* removed code comments

* Merged PerformanceCounter classes into one & removed interface

* removed GPU-specific powermodel

* Rebase master: kept both versions of TopologyFactories

* renamed CpuPowermodel to resource independent Powermodel

Moved it from Cpu package to power package

* implemented a default for getResourceType & removed overrides where possible

* split getResourceType into Consumer and Supplier

* added power as resource type

* reduced supply demand from arrayList to single value

* combining GPUs into one large GPU, until full multi-gpu support

* merged distribution policy enum with corresponding factory

* added comment

* post-rebase fixes

* aligned naming

* Added GPU metrics to task output

* Updates power resource type to uppercase.

Standardizes the `ResourceType.Power` enum to `ResourceType.POWER`
for consistency with other resource types and improved readability.

* Removes deprecated test assertions

Removes commented-out assertions in GPU tests.

These assertions are no longer needed and clutter the test code.

* Renames MaxMinFairnessStrategy to Policy

Renames MaxMinFairnessStrategy to MaxMinFairnessPolicy for
clarity and consistency with naming conventions. This change
affects the factory and distributor to use the updated name.

* applies spotless

* nulls GPUs as they are not used
---
 .../compute/simulator/host/GpuHostModel.java       | 33 ++++++++++++++++
 .../opendc/compute/simulator/host/HostModel.java   | 19 +++++++--
 .../compute/simulator/service/ComputeService.java  | 10 ++---
 .../opendc/compute/simulator/service/HostView.java | 11 ++++--
 .../compute/simulator/service/ServiceFlavor.java   | 24 ++++++++---
 .../compute/simulator/telemetry/GuestGpuStats.java | 44 +++++++++++++++++++++
 .../compute/simulator/telemetry/HostGpuStats.java  | 46 ++++++++++++++++++++++
 7 files changed, 171 insertions(+), 16 deletions(-)
 create mode 100644 opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/GpuHostModel.java
 create mode 100644 opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/GuestGpuStats.java
 create mode 100644 opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/HostGpuStats.java

(limited to 'opendc-compute/opendc-compute-simulator/src/main/java')
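
Reviewer note: a minimal sketch of how the two host-model records in this
patch fit together. The record shapes are copied from the diff below; the
wrapper class `GpuHostModelSketch`, the helper `totalGpuCores`, and all
numeric values are illustrative assumptions and are not part of the patch.
The point it illustrates is that the three-argument `HostModel` constructor
leaves `gpuHostModels` as null, so callers presumably need a null check.

```java
import java.util.List;

public class GpuHostModelSketch {
    // Same component list as the record added in GpuHostModel.java.
    public record GpuHostModel(
            double gpuCoreCapacity, int gpuCoreCount, long GpuMemoryCapacity, double GpuMemorySpeed) {}

    // Same component list and convenience constructor as HostModel.java after this patch.
    public record HostModel(
            double cpuCapacity, int coreCount, long memoryCapacity, List<GpuHostModel> gpuHostModels) {
        public HostModel(double cpuCapacity, int coreCount, long memoryCapacity) {
            this(cpuCapacity, coreCount, memoryCapacity, null);
        }

        // Hypothetical helper (not in the patch): treats a null list as "no GPUs".
        public int totalGpuCores() {
            if (gpuHostModels == null) {
                return 0;
            }
            return gpuHostModels.stream().mapToInt(GpuHostModel::gpuCoreCount).sum();
        }
    }

    public static void main(String[] args) {
        // CPU-only host: the GPU list is null, not an empty list.
        HostModel cpuOnly = new HostModel(3200.0, 8, 16_384L);

        // Host with one GPU; the values are made up for illustration.
        GpuHostModel gpu = new GpuHostModel(1500.0, 5120, 40L, 1555.0);
        HostModel withGpu = new HostModel(3200.0, 8, 16_384L, List.of(gpu));

        System.out.println("CPU-only host GPU cores: " + cpuOnly.totalGpuCores());
        System.out.println("GPU host GPU cores: " + withGpu.totalGpuCores());
    }
}
```

Defaulting to `List.of()` instead of null would remove the need for the
check, but that would be a behavior change beyond what this patch does.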

diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/GpuHostModel.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/GpuHostModel.java
new file mode 100644
index 00000000..97aaa820
--- /dev/null
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/GpuHostModel.java
@@ -0,0 +1,33 @@
+/*
+ * Copyright (c) 2022 AtLarge Research
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in all
+ * copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+package org.opendc.compute.simulator.host;
+
+/**
+ * A model for a GPU in a host.
+ *
+ * @param gpuCoreCapacity The capacity of the GPU cores in Hz.
+ * @param gpuCoreCount    The number of GPU cores.
+ * @param GpuMemoryCapacity The capacity of the GPU memory in GB.
+ * @param GpuMemorySpeed   The speed of the GPU memory in GB/s.
+ */
+public record GpuHostModel(double gpuCoreCapacity, int gpuCoreCount, long GpuMemoryCapacity, double GpuMemorySpeed) {}
diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/HostModel.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/HostModel.java
index 1ea73ea6..6464a56c 100644
--- a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/HostModel.java
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/HostModel.java
@@ -22,11 +22,24 @@
 
 package org.opendc.compute.simulator.host;
 
+import java.util.List;
+
 /**
  * Record describing the static machine properties of the host.
  *
- * @param cpuCapacity The total CPU capacity of the host in MHz.
- * @param coreCount The number of logical processing cores available for this host.
+ * @param cpuCapacity    The total CPU capacity of the host in MHz.
+ * @param coreCount      The number of logical processing cores available for this host.
  * @param memoryCapacity The amount of memory available for this host in MB.
  */
-public record HostModel(double cpuCapacity, int coreCount, long memoryCapacity) {}
+public record HostModel(double cpuCapacity, int coreCount, long memoryCapacity, List<GpuHostModel> gpuHostModels) {
+    /**
+     * Create a new host model.
+     *
+     * @param cpuCapacity    The total CPU capacity of the host in MHz.
+     * @param coreCount      The number of logical processing cores available for this host.
+     * @param memoryCapacity The amount of memory available for this host in MB.
+     */
+    public HostModel(double cpuCapacity, int coreCount, long memoryCapacity) {
+        this(cpuCapacity, coreCount, memoryCapacity, null);
+    }
+}
diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ComputeService.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ComputeService.java
index 2b4306af..835c7186 100644
--- a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ComputeService.java
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ComputeService.java
@@ -198,7 +198,7 @@ public final class ComputeService implements AutoCloseable, CarbonReceiver {
                 HostView hv = hostToView.get(host);
                 final ServiceFlavor flavor = task.getFlavor();
                 if (hv != null) {
-                    hv.provisionedCores -= flavor.getCoreCount();
+                    hv.provisionedCpuCores -= flavor.getCpuCoreCount();
                     hv.instanceCount--;
                     hv.availableMemory += flavor.getMemorySize();
                 } else {
@@ -496,7 +496,7 @@ public final class ComputeService implements AutoCloseable, CarbonReceiver {
             if (result.getResultType() == SchedulingResultType.FAILURE) {
                 LOGGER.trace("Task {} selected for scheduling but no capacity available for it at the moment", task);
 
-                if (flavor.getMemorySize() > maxMemory || flavor.getCoreCount() > maxCores) {
+                if (flavor.getMemorySize() > maxMemory || flavor.getCpuCoreCount() > maxCores) {
                     // Remove the incoming image
                     taskQueue.remove(req);
                     tasksPending--;
@@ -531,7 +531,7 @@ public final class ComputeService implements AutoCloseable, CarbonReceiver {
                 attemptsSuccess++;
 
                 hv.instanceCount++;
-                hv.provisionedCores += flavor.getCoreCount();
+                hv.provisionedCpuCores += flavor.getCpuCoreCount();
                 hv.availableMemory -= flavor.getMemorySize();
 
                 activeTasks.put(task, host);
@@ -612,12 +612,12 @@ public final class ComputeService implements AutoCloseable, CarbonReceiver {
 
         @NotNull
         public ServiceFlavor newFlavor(
-                @NotNull String name, int cpuCount, long memorySize, @NotNull Map<String, ?> meta) {
+                @NotNull String name, int cpuCount, long memorySize, int gpuCoreCount, @NotNull Map<String, ?> meta) {
             checkOpen();
 
             final ComputeService service = this.service;
             UUID uid = new UUID(service.clock.millis(), service.random.nextLong());
-            ServiceFlavor flavor = new ServiceFlavor(service, uid, name, cpuCount, memorySize, meta);
+            ServiceFlavor flavor = new ServiceFlavor(service, uid, name, cpuCount, memorySize, gpuCoreCount, meta);
 
             //            service.flavorById.put(uid, flavor);
             //            service.flavors.add(flavor);
diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/HostView.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/HostView.java
index 7c548add..c07f58c7 100644
--- a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/HostView.java
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/HostView.java
@@ -31,7 +31,8 @@ public class HostView {
     private final SimHost host;
     int instanceCount;
     long availableMemory;
-    int provisionedCores;
+    int provisionedCpuCores;
+    int provisionedGpuCores;
 
     /**
      * Scheduler bookkeeping
@@ -83,8 +84,12 @@ public class HostView {
     /**
      * Return the provisioned cores on the host.
      */
-    public int getProvisionedCores() {
-        return provisionedCores;
+    public int getProvisionedCpuCores() {
+        return provisionedCpuCores;
+    }
+
+    public int getProvisionedGpuCores() {
+        return provisionedGpuCores;
     }
 
     @Override
diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ServiceFlavor.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ServiceFlavor.java
index eddde87e..8a4359b4 100644
--- a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ServiceFlavor.java
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ServiceFlavor.java
@@ -36,22 +36,31 @@ public final class ServiceFlavor implements Flavor {
     private final ComputeService service;
     private final UUID uid;
     private final String name;
-    private final int coreCount;
+    private final int cpuCoreCount;
     private final long memorySize;
+    private final int gpuCoreCount;
     private final Map<String, ?> meta;
 
-    ServiceFlavor(ComputeService service, UUID uid, String name, int coreCount, long memorySize, Map<String, ?> meta) {
+    ServiceFlavor(
+            ComputeService service,
+            UUID uid,
+            String name,
+            int cpuCoreCount,
+            long memorySize,
+            int gpuCoreCount,
+            Map<String, ?> meta) {
         this.service = service;
         this.uid = uid;
         this.name = name;
-        this.coreCount = coreCount;
+        this.cpuCoreCount = cpuCoreCount;
         this.memorySize = memorySize;
+        this.gpuCoreCount = gpuCoreCount;
         this.meta = meta;
     }
 
     @Override
-    public int getCoreCount() {
-        return coreCount;
+    public int getCpuCoreCount() {
+        return cpuCoreCount;
     }
 
     @Override
@@ -59,6 +68,11 @@ public final class ServiceFlavor implements Flavor {
         return memorySize;
     }
 
+    @Override
+    public int getGpuCoreCount() {
+        return gpuCoreCount;
+    }
+
     @NotNull
     @Override
     public UUID getUid() {
diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/GuestGpuStats.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/GuestGpuStats.java
new file mode 100644
index 00000000..1aba13e3
--- /dev/null
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/GuestGpuStats.java
@@ -0,0 +1,44 @@
+/*
+ * Copyright (c) 2022 AtLarge Research
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in all
+ * copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+package org.opendc.compute.simulator.telemetry;
+
+/**
+ * Statistics about the GPUs of a guest.
+ *
+ * @param activeTime The cumulative time (in seconds) that the GPUs of the guest were actively running.
+ * @param idleTime The cumulative time (in seconds) the GPUs of the guest were idle.
+ * @param stealTime The cumulative GPU time (in seconds) that the guest was ready to run, but not granted time by the host.
+ * @param lostTime The cumulative GPU time (in seconds) that was lost due to interference with other machines.
+ * @param capacity The available GPU capacity of the guest (in MHz).
+ * @param usage Amount of GPU resources (in MHz) actually used by the guest.
+ * @param utilization The utilization of the GPU resources (in %) relative to the total GPU capacity.
+ */
+public record GuestGpuStats(
+        long activeTime,
+        long idleTime,
+        long stealTime,
+        long lostTime,
+        double capacity,
+        double usage,
+        double demand,
+        double utilization) {}
diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/HostGpuStats.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/HostGpuStats.java
new file mode 100644
index 00000000..e42d7704
--- /dev/null
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/HostGpuStats.java
@@ -0,0 +1,46 @@
+/*
+ * Copyright (c) 2022 AtLarge Research
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in all
+ * copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+package org.opendc.compute.simulator.telemetry;
+
+/**
+ * Statistics about the GPUs of a host.
+ *
+ * @param activeTime The cumulative time (in seconds) that the GPUs of the host were actively running.
+ * @param idleTime The cumulative time (in seconds) the GPUs of the host were idle.
+ * @param stealTime The cumulative GPU time (in seconds) that virtual machines were ready to run, but were not able to.
+ * @param lostTime The cumulative GPU time (in seconds) that was lost due to interference between virtual machines.
+ * @param capacity The available GPU capacity of the host (in MHz).
+ * @param demand Amount of GPU resources (in MHz) the guests would use if there were no GPU contention or GPU
+ *               limits.
+ * @param usage Amount of GPU resources (in MHz) actually used by the host.
+ * @param utilization The utilization of the GPU resources (in %) relative to the total GPU capacity.
+ */
+public record HostGpuStats(
+        long activeTime,
+        long idleTime,
+        long stealTime,
+        long lostTime,
+        double capacity,
+        double demand,
+        double usage,
+        double utilization) {}
-- 
cgit v1.2.3