From 0203254b709614fa732c114aa25916f61b8b3275 Mon Sep 17 00:00:00 2001
From: Niels Thiele
Date: Sun, 22 Jun 2025 12:31:21 +0200
Subject: Implemented Single GPU Support & outline of host-level allocation policies (#342)

* renamed performance counter to distinguish different resource types
* added GPU, modelled similar to CPU
* added GPUs to machine model
* list of GPUs instead of single instance
* renamed memory speed to bandwidth
* enabled parsing of GPU resources
* split powermodel into cpu and GPU powermodel
* added gpu parsing tests
* added idea of host level scheduling
* added tests for multi gpu parsing
* renamed powermodel to cpupowermodel
* clarified naming of cpu and gpu components
* added resource type to flow supplier and edge
* added resourcetype
* added GPU components and resource type to fragments
* added GPU to workload and updated resource usage retrieval
* implemented first version of multi resource
* added name to workload
* renamed performance counters
* removed commented out code
* removed deprecated comments
* included demand and supply into calculations
* resolving rebase mismatches
* moved resource type from flowedge class to common package
* added available resources to machines
* cleaner separation if workload is started of simmachine or vm
* Replaced exception with dedicated enum
* Only looping over resources that are actually used
* using hashmaps to handle resourcetype instead of arrays for readability
* fixed condition
* tracking finished workloads per resource type
* removed resource type from flowedge
* made supply and demand distribution resource specific
* added power model for GPU
* removed unused test setup
* removed deprecated comments
* removed unused parameter
* added ID for GPU
* added GPUs and GPU performance counters (naively)
* implemented capturing of GPU statistics
* added reminders for future implementations
* renamed properties for better identification
* added capturing GPU statistics
* implemented first tests for GPUs
* unified access to performance counters
* added interface for general compute resource handling
* implemented multi resource support in simmachine
* added individual edge to VM per resource
* extended compute resource interface
* implemented multi-resource support in PSU
* implemented generic retrieval of computeresources
* implemented multi-resource support in vm
* made method use more resource specific
* implemented simple GPU tests
* rolled back frequency and demand use
* made naming independent of used resource
* using workloads resources instead of VMs to determine available resource
* implemented determination of used resources in workload
* removed logging statements
* implemented reading from workload
* fixed naming for host-level allocation
* fixed next deadline calculation
* fixed forwarding supply
* reduced memory footprint
* made GPU powermodel nullable
* made Gpu powermodel configurable in topology
* implemented tests for basic gpu scheduler
* added gpu properties
* implemented weights, filter and simple cpu-gpu scheduler
* spotless apply
* spotless apply pt. 2
* fixed capitalization
* spotless kotlin run
* implemented column export
* todo update
* removed code comments
* Merged PerformanceCounter classes into one & removed interface
* removed GPU specific powermodel
* Rebase master: kept both versions of TopologyFactories
* renamed CpuPowermodel to resource independent Powermodel
  Moved it from Cpu package to power package
* implemented default of getResourceType & removed overrides if possible
* split getResourceType into Consumer and Supplier
* added power as resource type
* reduced supply demand from arrayList to single value
* combining GPUs into one large GPU, until full multi-gpu support
* merged distribution policy enum with corresponding factory
* added comment
* post-rebase fixes
* aligned naming
* Added GPU metrics to task output
* Updates power resource type to uppercase.
  Standardizes the `ResourceType.Power` enum to `ResourceType.POWER` for consistency with other resource types and improved readability.
* Removes deprecated test assertions
  Removes commented-out assertions in GPU tests. These assertions are no longer needed and clutter the test code.
* Renames MaxMinFairnessStrategy to Policy
  Renames MaxMinFairnessStrategy to MaxMinFairnessPolicy for clarity and consistency with naming conventions. This change affects the factory and distributor to use the updated name.
* applies spotless
* nulls GPUs as it is not used
---
 .../compute/simulator/host/GpuHostModel.java       | 33 ++++++++++++++++
 .../opendc/compute/simulator/host/HostModel.java   | 19 +++++++--
 .../compute/simulator/service/ComputeService.java  | 10 ++---
 .../opendc/compute/simulator/service/HostView.java | 11 ++++--
 .../compute/simulator/service/ServiceFlavor.java   | 24 ++++++++---
 .../compute/simulator/telemetry/GuestGpuStats.java | 44 +++++++++++++++++++++
 .../compute/simulator/telemetry/HostGpuStats.java  | 46 ++++++++++++++++++++++
 7 files changed, 171 insertions(+), 16 deletions(-)
 create mode 100644 opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/GpuHostModel.java
 create mode 100644 opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/GuestGpuStats.java
 create mode 100644 opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/HostGpuStats.java

(limited to 'opendc-compute/opendc-compute-simulator/src/main/java')

diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/GpuHostModel.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/GpuHostModel.java
new file mode 100644
index 00000000..97aaa820
--- /dev/null
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/GpuHostModel.java
@@ -0,0 +1,33 @@
+/*
+ * Copyright (c) 2022 AtLarge Research
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in all
+ * copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+package org.opendc.compute.simulator.host;
+
+/**
+ * A model for a GPU in a host.
+ *
+ * @param gpuCoreCapacity The capacity of the GPU cores hz.
+ * @param gpuCoreCount The number of GPU cores.
+ * @param GpuMemoryCapacity The capacity of the GPU memory in GB.
+ * @param GpuMemorySpeed The speed of the GPU memory in GB/s.
+ */
+public record GpuHostModel(double gpuCoreCapacity, int gpuCoreCount, long GpuMemoryCapacity, double GpuMemorySpeed) {}
diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/HostModel.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/HostModel.java
index 1ea73ea6..6464a56c 100644
--- a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/HostModel.java
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/host/HostModel.java
@@ -22,11 +22,24 @@
 
 package org.opendc.compute.simulator.host;
 
+import java.util.List;
+
 /**
  * Record describing the static machine properties of the host.
  *
- * @param cpuCapacity The total CPU capacity of the host in MHz.
- * @param coreCount The number of logical processing cores available for this host.
+ * @param cpuCapacity The total CPU capacity of the host in MHz.
+ * @param coreCount The number of logical processing cores available for this host.
  * @param memoryCapacity The amount of memory available for this host in MB.
  */
-public record HostModel(double cpuCapacity, int coreCount, long memoryCapacity) {}
+public record HostModel(double cpuCapacity, int coreCount, long memoryCapacity, List gpuHostModels) {
+    /**
+     * Create a new host model.
+     *
+     * @param cpuCapacity The total CPU capacity of the host in MHz.
+     * @param coreCount The number of logical processing cores available for this host.
+     * @param memoryCapacity The amount of memory available for this host in MB.
+     */
+    public HostModel(double cpuCapacity, int coreCount, long memoryCapacity) {
+        this(cpuCapacity, coreCount, memoryCapacity, null);
+    }
+}
diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ComputeService.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ComputeService.java
index 2b4306af..835c7186 100644
--- a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ComputeService.java
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ComputeService.java
@@ -198,7 +198,7 @@ public final class ComputeService implements AutoCloseable, CarbonReceiver {
                 HostView hv = hostToView.get(host);
                 final ServiceFlavor flavor = task.getFlavor();
                 if (hv != null) {
-                    hv.provisionedCores -= flavor.getCoreCount();
+                    hv.provisionedCpuCores -= flavor.getCpuCoreCount();
                     hv.instanceCount--;
                     hv.availableMemory += flavor.getMemorySize();
                 } else {
@@ -496,7 +496,7 @@ public final class ComputeService implements AutoCloseable, CarbonReceiver {
             if (result.getResultType() == SchedulingResultType.FAILURE) {
                 LOGGER.trace("Task {} selected for scheduling but no capacity available for it at the moment", task);
 
-                if (flavor.getMemorySize() > maxMemory || flavor.getCoreCount() > maxCores) {
+                if (flavor.getMemorySize() > maxMemory || flavor.getCpuCoreCount() > maxCores) {
                     // Remove the incoming image
                     taskQueue.remove(req);
                     tasksPending--;
@@ -531,7 +531,7 @@ public final class ComputeService implements AutoCloseable, CarbonReceiver {
                 attemptsSuccess++;
 
                 hv.instanceCount++;
-                hv.provisionedCores += flavor.getCoreCount();
+                hv.provisionedCpuCores += flavor.getCpuCoreCount();
                 hv.availableMemory -= flavor.getMemorySize();
 
                 activeTasks.put(task, host);
@@ -612,12 +612,12 @@ public final class ComputeService implements AutoCloseable, CarbonReceiver {
 
         @NotNull
         public ServiceFlavor newFlavor(
-                @NotNull String name, int cpuCount, long memorySize, @NotNull Map meta) {
+                @NotNull String name, int cpuCount, long memorySize, int gpuCoreCount, @NotNull Map meta) {
             checkOpen();
 
             final ComputeService service = this.service;
             UUID uid = new UUID(service.clock.millis(), service.random.nextLong());
-            ServiceFlavor flavor = new ServiceFlavor(service, uid, name, cpuCount, memorySize, meta);
+            ServiceFlavor flavor = new ServiceFlavor(service, uid, name, cpuCount, memorySize, gpuCoreCount, meta);
 
             // service.flavorById.put(uid, flavor);
             // service.flavors.add(flavor);
diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/HostView.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/HostView.java
index 7c548add..c07f58c7 100644
--- a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/HostView.java
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/HostView.java
@@ -31,7 +31,8 @@ public class HostView {
     private final SimHost host;
     int instanceCount;
     long availableMemory;
-    int provisionedCores;
+    int provisionedCpuCores;
+    int provisionedGpuCores;
 
     /**
      * Scheduler bookkeeping
@@ -83,8 +84,12 @@ public class HostView {
     /**
      * Return the provisioned cores on the host.
      */
-    public int getProvisionedCores() {
-        return provisionedCores;
+    public int getProvisionedCpuCores() {
+        return provisionedCpuCores;
+    }
+
+    public int getProvisionedGpuCores() {
+        return provisionedGpuCores;
     }
 
     @Override
diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ServiceFlavor.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ServiceFlavor.java
index eddde87e..8a4359b4 100644
--- a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ServiceFlavor.java
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/service/ServiceFlavor.java
@@ -36,22 +36,31 @@ public final class ServiceFlavor implements Flavor {
     private final ComputeService service;
     private final UUID uid;
     private final String name;
-    private final int coreCount;
+    private final int cpuCoreCount;
     private final long memorySize;
+    private final int gpuCoreCount;
     private final Map meta;
 
-    ServiceFlavor(ComputeService service, UUID uid, String name, int coreCount, long memorySize, Map meta) {
+    ServiceFlavor(
+            ComputeService service,
+            UUID uid,
+            String name,
+            int cpuCoreCount,
+            long memorySize,
+            int gpuCoreCount,
+            Map meta) {
         this.service = service;
         this.uid = uid;
         this.name = name;
-        this.coreCount = coreCount;
+        this.cpuCoreCount = cpuCoreCount;
         this.memorySize = memorySize;
+        this.gpuCoreCount = gpuCoreCount;
         this.meta = meta;
     }
 
     @Override
-    public int getCoreCount() {
-        return coreCount;
+    public int getCpuCoreCount() {
+        return cpuCoreCount;
     }
 
     @Override
@@ -59,6 +68,11 @@ public final class ServiceFlavor implements Flavor {
         return memorySize;
     }
 
+    @Override
+    public int getGpuCoreCount() {
+        return gpuCoreCount;
+    }
+
     @NotNull
     @Override
     public UUID getUid() {
diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/GuestGpuStats.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/GuestGpuStats.java
new file mode 100644
index 00000000..1aba13e3
--- /dev/null
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/GuestGpuStats.java
@@ -0,0 +1,44 @@
+/*
+ * Copyright (c) 2022 AtLarge Research
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in all
+ * copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+package org.opendc.compute.simulator.telemetry;
+
+/**
+ * Statistics about the GPUs of a guest.
+ *
+ * @param activeTime The cumulative time (in seconds) that the GPUs of the guest were actively running.
+ * @param idleTime The cumulative time (in seconds) the GPUs of the guest were idle.
+ * @param stealTime The cumulative GPU time (in seconds) that the guest was ready to run, but not granted time by the host.
+ * @param lostTime The cumulative GPU time (in seconds) that was lost due to interference with other machines.
+ * @param capacity The available GPU capacity of the guest (in MHz).
+ * @param usage Amount of GPU resources (in MHz) actually used by the guest.
+ * @param utilization The utilization of the GPU resources (in %) relative to the total GPU capacity.
+ */
+public record GuestGpuStats(
+        long activeTime,
+        long idleTime,
+        long stealTime,
+        long lostTime,
+        double capacity,
+        double usage,
+        double demand,
+        double utilization) {}
diff --git a/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/HostGpuStats.java b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/HostGpuStats.java
new file mode 100644
index 00000000..e42d7704
--- /dev/null
+++ b/opendc-compute/opendc-compute-simulator/src/main/java/org/opendc/compute/simulator/telemetry/HostGpuStats.java
@@ -0,0 +1,46 @@
+/*
+ * Copyright (c) 2022 AtLarge Research
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in all
+ * copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+package org.opendc.compute.simulator.telemetry;
+
+/**
+ * Statistics about the GPUs of a host.
+ *
+ * @param activeTime The cumulative time (in seconds) that the GPUs of the host were actively running.
+ * @param idleTime The cumulative time (in seconds) the GPUs of the host were idle.
+ * @param stealTime The cumulative GPU time (in seconds) that virtual machines were ready to run, but were not able to.
+ * @param lostTime The cumulative GPU time (in seconds) that was lost due to interference between virtual machines.
+ * @param capacity The available GPU capacity of the host (in MHz).
+ * @param demand Amount of GPU resources (in MHz) the guests would use if there were no GPU contention or GPU
+ *               limits.
+ * @param usage Amount of GPU resources (in MHz) actually used by the host.
+ * @param utilization The utilization of the GPU resources (in %) relative to the total GPU capacity.
+ */
+public record HostGpuStats(
+        long activeTime,
+        long idleTime,
+        long stealTime,
+        long lostTime,
+        double capacity,
+        double demand,
+        double usage,
+        double utilization) {}
-- 
cgit v1.2.3