SlugLab/CXLMemSim

CXLMemSim and MEMU

CXLMemSim

CXLMemSim is a CXL.mem simulator that applies a target latency to model memory accesses from the CPU's perspective. At the application level, the penalty it computes accounts for the reorder buffer (ROB) and the different cache-line states.

Prerequisite

root@victoryang00-ASUS-Zenbook-S-14-UX5406SA-UX5406SA:/home/victoryang00/CLionProjects/CXLMemSim-dev/build# uname -a
Linux victoryang00-ASUS-Zenbook-S-14-UX5406SA-UX5406SA 6.13.0-rc4+ #12 SMP PREEMPT_DYNAMIC Fri Jan 24 07:08:46 CST 2025 x86_64 x86_64 x86_64 GNU/Linux

User input

SPDLOG_LEVEL=debug ./CXLMemSim -t ./microbench/ld -i 5 -c 0,2 -d 85 -c 100,100 -w 85.5,86.5,87.5,85.5,86.5,87.5,88. -o "(1,(2,3))"
  1. -t Target: The path to the executable
  2. -i Interval: The epoch of the simulator, in milliseconds
  3. -c CPUSet: The core ID on which to run the executable; the remaining threads are pinned (setaffinity) to one other core
  4. -d Dram Latency: The current platform's DRAM latency, default is 85 ns # mark that bw in the remote
  5. -b, -l Bandwidth, Latency: Each takes a two-element vector, the first value for reads and the second for writes
  6. -c Capacity: The capacity of each memory; the first entry is local memory and the remaining entries follow the order of the input vector
  7. -w Weight: Use the heuristic to calculate the bandwidth
  8. -o Topology: Construct the topology using Newick tree syntax; (1,(2,3)) stands for
            1
          /
0 - local
          \
                   2
         switch  / 
                 \ 
                  3
  9. env SPDLOG_LEVEL: The log level that is displayed.
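
To make the -o option more concrete, here is a minimal sketch of how a Newick string such as (1,(2,3)) could be read into a nested structure, treating each number as a memory endpoint and each parenthesized group as one level of the tree. The names parse_newick and leaf_depths are hypothetical and are not part of CXLMemSim; the simulator's own parser may behave differently.

```python
def parse_newick(text):
    """Parse a string such as "(1,(2,3))" into nested lists of ints.

    Hypothetical helper for illustration; not CXLMemSim's own parser.
    """
    pos = 0

    def peek():
        # Current character, or "" once the input is exhausted.
        return text[pos] if pos < len(text) else ""

    def node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume "("
            children = [node()]
            while peek() == ",":
                pos += 1  # consume ","
                children.append(node())
            if peek() != ")":
                raise ValueError(f"expected ')' at offset {pos}")
            pos += 1  # consume ")"
            return children
        start = pos
        while peek().isdigit():
            pos += 1
        if start == pos:
            raise ValueError(f"expected a number at offset {pos}")
        return int(text[start:pos])

    tree = node()
    if pos != len(text):
        raise ValueError(f"unexpected input at offset {pos}")
    return tree


def leaf_depths(tree, depth=0):
    """Map each endpoint id to the number of groups that enclose it."""
    if isinstance(tree, int):
        return {tree: depth}
    depths = {}
    for child in tree:
        depths.update(leaf_depths(child, depth + 1))
    return depths


topology = parse_newick("(1,(2,3))")
print(topology)               # [1, [2, 3]]
print(leaf_depths(topology))  # {1: 1, 2: 2, 3: 2}
```

In this reading, endpoint 1 has depth 1 (attached directly to the root) while endpoints 2 and 3 have depth 2 (one switch further away), which matches the diagram above.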

Cite

@article{yangyarch23,
  title={CXLMemSim: A pure software simulated CXL.mem for performance characterization},
  author={Yiwei Yang and Pooneh Safayenikoo and Jiacheng Ma and Tanvir Ahmed Khan and Andrew Quinn},
  journal={arXiv preprint arXiv:2303.06153},
  booktitle={The fifth Young Architect Workshop (YArch'23)},
  year={2023}
}

MEMU

Compute Express Link (CXL) 3.0 introduces powerful memory pooling and promises to transform datacenter architectures. However, the lack of available CXL 3.0 hardware and the complexity of multi-host configurations pose significant challenges to the community. This paper presents MEMU, a comprehensive emulation framework that enables full CXL 3.0 functionality, including multi-host memory sharing and pooling support. MEMU emulates CXL 3.0 features (such as fabric management, dynamic memory allocation, and coherent memory sharing across multiple hosts) in advance of real hardware availability. An evaluation of MEMU shows that it achieves performance within about 3x of projected native CXL 3.0 speeds while remaining fully compatible with existing CXL software stacks. We demonstrate the utility of MEMU through a case study on a genomics pipeline, observing up to a 15% improvement in application performance compared to traditional RDMA-based approaches. MEMU is open-source and publicly available, aiming to accelerate CXL 3.0 research and development.

sudo ip link add br0 type bridge
sudo ip link set br0 up
sudo ip addr add 192.168.100.1/24 dev br0
for i in 0; do
    sudo ip tuntap add tap$i mode tap
    sudo ip link set tap$i up
    sudo ip link set tap$i master br0
done
mkdir build
cd build
wget https://asplos.dev/about/qemu.img
wget https://asplos.dev/about/bzImage
cp qemu.img qemu1.img
../qemu_integration/launch_qemu_cxl1.sh
# in qemu
vi /usr/local/bin/*.sh
# change 192.168.100.10 to 11
vi /etc/hostname
# change node0 to node1
exit
# out of qemu
../qemu_integration/launch_qemu_cxl.sh &
../qemu_integration/launch_qemu_cxl1.sh &

For multiple hosts, you'll need VXLAN:

#!/bin/bash
set -eux

DEV=enp23s0f0np0
BR=br0
VNI=100
MCAST=239.1.1.1
BR_IP_SUFFIX=$(hostname | grep -oE '[0-9]+$' || echo 1)   # optional auto-index
# Or set manually:
# BR_IP_SUFFIX=<1..4>

# Clean up
ip link del $BR 2>/dev/null || true
ip link del vxlan$VNI 2>/dev/null || true

# Create bridge
ip link add $BR type bridge
ip link set $BR up

# Create multicast VXLAN (no remote attribute!)
ip link add vxlan$VNI type vxlan id $VNI group $MCAST dev $DEV dstport 4789 ttl 10
ip link set vxlan$VNI up
ip link set vxlan$VNI master $BR

# Assign overlay IP
ip addr add 192.168.100.$BR_IP_SUFFIX/24 dev $BR

# Optional: add local TAPs for QEMU
for i in 0 1; do
    ip tuntap add tap$i mode tap
    ip link set tap$i up
    ip link set tap$i master $BR
done

echo "Bridge $BR ready on host $(hostname)"

Run the script above on every host, then edit each QEMU guest's IP address in /usr/local/bin/setup* and its hostname in /etc/hostname.

CXL Type 2 GPU Emulation

This module enables GPU compute through CXL Type 2 device emulation, allowing a guest VM to access the host's NVIDIA GPU via the CXL.cache coherency protocol.

Architecture Overview

┌──────────────────────────────────────────────────────────────────────────────┐
│                              GUEST VM                                        │
│  ┌──────────────────┐    ┌──────────────────────────────────────────────┐    │
│  │ CUDA Application │    │           Guest libcuda.so Shim              │    │
│  │  (cuda_test.c)   │───▶│  - Intercepts CUDA Driver API calls          │    │
│  │                  │    │  - Translates to CXL GPU command protocol    │    │
│  └──────────────────┘    │  - Maps BAR2 via /sys/bus/pci/.../resource2  │    │
│                          └──────────────────┬───────────────────────────┘    │
│                                             │ MMIO Read/Write                │
│  ┌──────────────────────────────────────────▼───────────────────────────┐    │
│  │                    CXL Type 2 PCI Device                             │    │
│  │                    Vendor: 0x8086  Device: 0x0d92                    │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │    │
│  │  │    BAR0     │  │    BAR2     │  │    BAR4     │  │    BAR6     │  │    │
│  │  │  Component  │  │ Cache/GPU   │  │   Device    │  │   MSI-X     │  │    │
│  │  │  Registers  │  │  Command    │  │   Memory    │  │             │  │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘  │    │
│  └──────────────────────────────────────────┬───────────────────────────┘    │
│                                             │                                │
│  ┌──────────────────────────────────────────▼───────────────────────────┐    │
│  │              Linux Kernel: cxl_type2_accel.ko                        │    │
│  │  - Binds to CXL Type 2 PCI device                                    │    │
│  │  - Configures CXL.cache and CXL.mem capabilities                     │    │
│  │  - Manages cache coherency state                                     │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────────────────┘
                                    │
                              PCIe/CXL Bus
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                              QEMU HOST                                       │
│  ┌───────────────────────────────────────────────────────────────────────┐   │
│  │                    CXL Type 2 Device Emulation                        │   │
│  │                    (hw/cxl/cxl_type2.c)                               │   │
│  │                                                                       │   │
│  │  ┌──────────────────────────────────────────────────────────────────┐ │   │
│  │  │                   GPU Command Interface                          │ │   │
│  │  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐    │ │   │
│  │  │  │ Command Regs │  │ Result Regs  │  │    Data Buffer       │    │ │   │
│  │  │  │ 0x00-0x3F    │  │ 0x80-0x9F    │  │    0x1000-0xFFFF     │    │ │   │
│  │  │  │              │  │              │  │    (60KB for PTX,    │    │ │   │
│  │  │  │ CMD, PARAMS  │  │ RESULT0-3    │  │     memcpy data)     │    │ │   │
│  │  │  └──────────────┘  └──────────────┘  └──────────────────────┘    │ │   │
│  │  └────────────────────────────┬─────────────────────────────────────┘ │   │
│  │                               │                                       │   │
│  │  ┌────────────────────────────▼─────────────────────────────────────┐ │   │
│  │  │                   Coherency Engine                               │ │   │
│  │  │  - Cache line tracking (MESI-like states)                        │ │   │
│  │  │  - Snoop request handling                                        │ │   │
│  │  │  - Writeback management                                          │ │   │
│  │  └────────────────────────────┬─────────────────────────────────────┘ │   │
│  │                               │                                       │   │
│  │  ┌────────────────────────────▼─────────────────────────────────────┐ │   │
│  │  │                   hetGPU Backend                                 │ │   │
│  │  │                   (hw/cxl/cxl_hetgpu.c)                          │ │   │
│  │  │  - Loads libcuda.so via dlopen()                                 │ │   │
│  │  │  - Translates commands to real CUDA API calls                    │ │   │
│  │  │  - Manages GPU context, memory, kernel launches                  │ │   │
│  │  └────────────────────────────┬─────────────────────────────────────┘ │   │
│  └───────────────────────────────┼───────────────────────────────────────┘   │
│                                  │ dlsym() calls                             │
│  ┌───────────────────────────────▼───────────────────────────────────────┐   │
│  │              /usr/lib/x86_64-linux-gnu/libcuda.so                     │   │
│  │              (NVIDIA CUDA Driver Library)                             │   │
│  │  cuInit, cuCtxCreate, cuMemAlloc, cuLaunchKernel, ...                 │   │
│  └───────────────────────────────┬───────────────────────────────────────┘   │
│                                  │                                           │
│  ┌───────────────────────────────▼───────────────────────────────────────┐   │
│  │                    NVIDIA GPU Hardware                                │   │
│  │                    (e.g., RTX 3090, A100)                             │   │
│  └───────────────────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────────────────┘

GPU Command Protocol

The guest communicates with the CXL Type 2 device via MMIO registers in BAR2:

Offset          Register     Description
0x0000          MAGIC        Magic number: 0x43584C32 ("CXL2")
0x0004          VERSION      Interface version
0x0008          STATUS       Device status (READY, BUSY, ERROR)
0x0010          CMD          Command register - write triggers execution
0x0014          CMD_STATUS   Command status (IDLE, RUNNING, COMPLETE)
0x0018          CMD_RESULT   Result/error code
0x0040-0x0078   PARAM0-7     Command parameters
0x0080-0x0098   RESULT0-3    Command results
0x0140          TOTAL_MEM    Total GPU memory
0x1000-0xFFFF   DATA         Data buffer for PTX, memcpy
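
As a rough illustration of how the register table above could be written down in code, the sketch below lists the offsets as constants and derives the per-register offsets of PARAM0-7 and RESULT0-3. The 8-byte stride is an assumption inferred from the listed ranges (0x40 to 0x78 for eight parameters), and the helper names are hypothetical rather than taken from the guest shim or the QEMU device model.

```python
import struct

# Offsets copied from the BAR2 register table.
REG_MAGIC = 0x0000
REG_VERSION = 0x0004
REG_STATUS = 0x0008
REG_CMD = 0x0010
REG_CMD_STATUS = 0x0014
REG_CMD_RESULT = 0x0018
REG_PARAM_BASE = 0x0040   # PARAM0-7
REG_RESULT_BASE = 0x0080  # RESULT0-3
REG_TOTAL_MEM = 0x0140
DATA_OFFSET = 0x1000
DATA_END = 0xFFFF

MAGIC_CXL2 = 0x43584C32

# Assumption: 8-byte stride, inferred from PARAM0-7 spanning 0x40-0x78.
REG_STRIDE = 8

# Size of the data buffer implied by the 0x1000-0xFFFF range.
DATA_SIZE = DATA_END - DATA_OFFSET + 1


def param_offset(index):
    """Offset of PARAM<index> (hypothetical helper)."""
    if not 0 <= index <= 7:
        raise IndexError("PARAM index must be in 0..7")
    return REG_PARAM_BASE + index * REG_STRIDE


def result_offset(index):
    """Offset of RESULT<index> (hypothetical helper)."""
    if not 0 <= index <= 3:
        raise IndexError("RESULT index must be in 0..3")
    return REG_RESULT_BASE + index * REG_STRIDE


def magic_as_text(value):
    """Render the 32-bit magic number as its four ASCII characters."""
    return struct.pack(">I", value).decode("ascii")


print(hex(param_offset(7)))       # 0x78
print(hex(result_offset(3)))      # 0x98
print(magic_as_text(MAGIC_CXL2))  # CXL2
print(DATA_SIZE // 1024)          # 60
```

The last line agrees with the 60KB data buffer shown in the architecture diagram.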

Supported Commands

Command                Code   Description
CMD_INIT               0x01   Initialize GPU
CMD_GET_DEVICE_COUNT   0x02   Get number of GPUs
CMD_CTX_CREATE         0x10   Create CUDA context
CMD_CTX_SYNC           0x12   Synchronize context
CMD_MEM_ALLOC          0x20   Allocate device memory
CMD_MEM_FREE           0x21   Free device memory
CMD_MEM_COPY_HTOD      0x22   Copy host to device
CMD_MEM_COPY_DTOH      0x23   Copy device to host
CMD_MODULE_LOAD_PTX    0x30   Load PTX module
CMD_FUNC_GET           0x32   Get kernel function
CMD_LAUNCH_KERNEL      0x40   Launch GPU kernel
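
The command codes above map naturally onto an enumeration. The following sketch is one possible encoding, with a small decoder for raw codes. The grouping by high nibble (device, context, memory, module, launch) is only an observation about the listed values and is labeled as an assumption; the names GpuCommand, decode and command_group are hypothetical.

```python
from enum import IntEnum


class GpuCommand(IntEnum):
    """Command codes from the Supported Commands table."""
    INIT = 0x01
    GET_DEVICE_COUNT = 0x02
    CTX_CREATE = 0x10
    CTX_SYNC = 0x12
    MEM_ALLOC = 0x20
    MEM_FREE = 0x21
    MEM_COPY_HTOD = 0x22
    MEM_COPY_DTOH = 0x23
    MODULE_LOAD_PTX = 0x30
    FUNC_GET = 0x32
    LAUNCH_KERNEL = 0x40


# Assumption: the high nibble groups related commands. This is an
# observation about the listed codes, not a documented rule.
GROUPS = {0x0: "device", 0x1: "context", 0x2: "memory",
          0x3: "module", 0x4: "launch"}


def decode(code):
    """Return the CMD_* name for a raw code, or None if it is not listed."""
    try:
        return "CMD_" + GpuCommand(code).name
    except ValueError:
        return None


def command_group(code):
    """Return the assumed group of a listed command code."""
    return GROUPS[GpuCommand(code) >> 4]


print(decode(0x20))         # CMD_MEM_ALLOC
print(decode(0x11))         # None
print(command_group(0x23))  # memory
```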

Data Flow Example: cuMemAlloc

Guest                     QEMU CXL Type 2              Host GPU
  │                             │                          │
  │ 1. Write size to PARAM0     │                          │
  │ ──────────────────────────▶ │                          │
  │                             │                          │
  │ 2. Write CMD_MEM_ALLOC      │                          │
  │ ──────────────────────────▶ │                          │
  │                             │ 3. Call cuMemAlloc_v2()  │
  │                             │ ───────────────────────▶ │
  │                             │                          │
  │                             │ 4. Return device pointer │
  │                             │ ◀─────────────────────── │
  │                             │                          │
  │                             │ 5. Store in RESULT0      │
  │ 6. Poll CMD_STATUS          │                          │
  │ ──────────────────────────▶ │                          │
  │                             │                          │
  │ 7. Read RESULT0 (dev ptr)   │                          │
  │ ◀────────────────────────── │                          │
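
The seven steps above can be mimicked with a small in-memory model. This is only a sketch under several assumptions that the text does not state: control registers are treated as 32-bit and PARAM/RESULT registers as 64-bit little-endian values, the numeric encodings of IDLE, RUNNING and COMPLETE are invented, and the host GPU is replaced by a bump allocator standing in for cuMemAlloc_v2(). MockType2Device and guest_mem_alloc are hypothetical names, not code from the shim or from hw/cxl/cxl_type2.c.

```python
import struct

# Register offsets and command code from the tables above.
REG_CMD, REG_CMD_STATUS, REG_CMD_RESULT = 0x10, 0x14, 0x18
REG_PARAM0, REG_RESULT0 = 0x40, 0x80
CMD_MEM_ALLOC = 0x20

# Assumption: numeric state encodings are not given in the text.
CMD_IDLE, CMD_RUNNING, CMD_COMPLETE = 0, 1, 2


class MockType2Device:
    """In-memory stand-in for BAR2; the 'GPU' is a bump allocator."""

    def __init__(self):
        self.bar2 = bytearray(0x10000)
        self.next_ptr = 0x10000000  # made-up device address

    def read32(self, off):
        return struct.unpack_from("<I", self.bar2, off)[0]

    def read64(self, off):
        return struct.unpack_from("<Q", self.bar2, off)[0]

    def write64(self, off, value):
        struct.pack_into("<Q", self.bar2, off, value)

    def write32(self, off, value):
        struct.pack_into("<I", self.bar2, off, value)
        if off == REG_CMD:  # a write to CMD triggers execution
            self._execute(value)

    def _execute(self, cmd):
        self.write32(REG_CMD_STATUS, CMD_RUNNING)
        if cmd == CMD_MEM_ALLOC:
            size = self.read64(REG_PARAM0)
            ptr = self.next_ptr           # steps 3-4: stand-in for cuMemAlloc_v2()
            self.next_ptr += size
            self.write64(REG_RESULT0, ptr)  # step 5
            result = 0
        else:
            result = 1  # hypothetical error code for unsupported commands
        self.write32(REG_CMD_RESULT, result)
        self.write32(REG_CMD_STATUS, CMD_COMPLETE)


def guest_mem_alloc(dev, size):
    """Guest side of the flow: write, trigger, poll, read back."""
    dev.write64(REG_PARAM0, size)                      # step 1
    dev.write32(REG_CMD, CMD_MEM_ALLOC)                # step 2
    while dev.read32(REG_CMD_STATUS) != CMD_COMPLETE:  # step 6
        pass
    if dev.read32(REG_CMD_RESULT) != 0:
        raise RuntimeError("allocation failed")
    return dev.read64(REG_RESULT0)                     # step 7


dev = MockType2Device()
print(hex(guest_mem_alloc(dev, 4096)))  # 0x10000000
print(hex(guest_mem_alloc(dev, 256)))   # 0x10001000
```

Because the mock executes the command synchronously inside the CMD write, the polling loop exits on its first check; a real device could stay in the RUNNING state for some time.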

Setup Instructions

1. Build QEMU with CXL Type 2 Support

cd lib/qemu/build
meson setup --reconfigure
ninja

2. Build Guest libcuda Shim

cd qemu_integration/guest_libcuda
make

3. Load Kernel Modules (in Guest)

modprobe cxl_core
modprobe cxl_port
modprobe cxl_cache
modprobe cxl_type2_accel

4. Run CUDA Applications (in Guest)

# Set library path to use the CXL shim instead of real libcuda
LD_LIBRARY_PATH=/path/to/guest_libcuda ./your_cuda_app

# Enable debug logging
CXL_CUDA_DEBUG=1 LD_LIBRARY_PATH=. ./cuda_test

QEMU Command Line Options

# cache-size: CXL.cache size; mem-size: device-attached memory
# hetgpu-lib: CUDA library path; hetgpu-device: GPU device index
-device cxl-type2,id=cxl-gpu0,\
cache-size=128M,\
mem-size=4G,\
hetgpu-lib=/path/to/libcuda.so,\
hetgpu-device=0

CXL.cache Coherency

The CXL Type 2 device implements CPU-GPU cache coherency:

    CPU Cache                  CXL Type 2 Device
        │                            │
        │  ◀─── Snoop Request ────── │  (GPU wants exclusive access)
        │                            │
        │  ──── Snoop Response ────▶ │  (CPU provides data/invalidates)
        │                            │
        │  ◀─── Writeback ────────── │  (GPU writes back dirty data)
        │                            │

This enables:

  • Zero-copy data sharing between CPU and GPU
  • Coherent memory regions visible to both processors
  • Reduced memory copy overhead for GPU compute
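
The snoop and writeback exchange in the diagram above can be pictured as a pair of state transitions on one cache line. The sketch below uses textbook MESI states, which the architecture overview only describes as "MESI-like", so it is a simplified illustration under that assumption and not the coherency engine's actual logic. The function names are hypothetical, and the choice that a writeback leaves the device copy invalid is also an assumption.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def snoop_for_exclusive(cpu_state):
    """CPU-side response when the device requests exclusive access.

    Returns (new_cpu_state, cpu_supplies_data). The CPU copy is always
    invalidated; only a Modified line has newer data to hand over.
    """
    supplies_data = cpu_state is State.MODIFIED
    return State.INVALID, supplies_data


def device_writeback(device_state):
    """Device-side writeback of a line.

    Returns (new_device_state, memory_updated). Assumption: the line is
    evicted after a writeback, so a Modified line becomes Invalid.
    """
    if device_state is State.MODIFIED:
        return State.INVALID, True
    return device_state, False


print(snoop_for_exclusive(State.MODIFIED))  # (<State.INVALID: 'I'>, True)
print(snoop_for_exclusive(State.SHARED))    # (<State.INVALID: 'I'>, False)
print(device_writeback(State.MODIFIED))     # (<State.INVALID: 'I'>, True)
```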
