The CXL.mem simulator applies the target latency from the CPU's perspective, folding ROB (reorder buffer) behavior and the different cacheline states into an application-level penalty.
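As a rough illustration of that idea, the sketch below shows one way such a penalty could be computed. The function, state names, and scaling factors are hypothetical, not CXLMemSim's actual code:

```c
/* Hypothetical sketch of the latency model described above; names and
 * factors are illustrative, not CXLMemSim's actual implementation. */
enum cacheline_state { CL_LOCAL_HIT, CL_SHARED, CL_MODIFIED_REMOTE, CL_MISS };

double access_penalty_ns(enum cacheline_state s, double target_ns,
                         double dram_ns, double rob_hidden_ns) {
    /* Extra latency of CXL.mem over local DRAM, scaled by how costly
     * the cacheline's coherence state makes the access. */
    static const double state_factor[] = {
        [CL_LOCAL_HIT] = 0.0,       /* served from cache, no penalty */
        [CL_SHARED] = 0.5,          /* some coherence traffic */
        [CL_MODIFIED_REMOTE] = 1.5, /* dirty line must be fetched back */
        [CL_MISS] = 1.0,            /* full round trip to CXL memory */
    };
    double extra = (target_ns - dram_ns) * state_factor[s];
    /* The ROB overlaps part of the stall with other in-flight work. */
    extra -= rob_hidden_ns;
    return extra > 0.0 ? extra : 0.0;
}
```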
root@victoryang00-ASUS-Zenbook-S-14-UX5406SA-UX5406SA:/home/victoryang00/CLionProjects/CXLMemSim-dev/build# uname -a
Linux victoryang00-ASUS-Zenbook-S-14-UX5406SA-UX5406SA 6.13.0-rc4+ #12 SMP PREEMPT_DYNAMIC Fri Jan 24 07:08:46 CST 2025 x86_64 x86_64 x86_64 GNU/Linux

SPDLOG_LEVEL=debug ./CXLMemSim -t ./microbench/ld -i 5 -c 0,2 -d 85 -c 100,100 -w 85.5,86.5,87.5,85.5,86.5,87.5,88. -o "(1,(2,3))"

- -t Target: the path to the executable
- -i Interval: the epoch of the simulator, in milliseconds
- -c CPUSet: the core IDs to run the executable on; the remaining threads are pinned (setaffinity) to one other core
- -d DRAM Latency: the current platform's DRAM latency, default 85 ns
- -b, -l Bandwidth, Latency: each takes a two-element vector, the first value for read and the second for write
- -c Capacity: the memory capacity; the first entry is local, and the rest follow the input vector
- -w Weight: the weights used by the heuristic that calculates bandwidth
- -o Topology: constructs the topology using Newick tree syntax; (1,(2,3)) stands for the tree below (another example follows this list):

```
      1
     /
  0 - local
     \        2
      switch /
             \
              3
```

- env SPDLOG_LEVEL sets the log level of the output.
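As an illustrative reading of the same Newick syntax (an assumption, not spelled out in the docs): "(1,2)" would attach endpoints 1 and 2 directly to the local node 0, with no switch in between.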
@article{yangyarch23,
  title={CXLMemSim: A pure software simulated CXL.mem for performance characterization},
  author={Yang, Yiwei and Safayenikoo, Pooneh and Ma, Jiacheng and Khan, Tanvir Ahmed and Quinn, Andrew},
  journal={arXiv preprint arXiv:2303.06153},
  booktitle={The fifth Young Architect Workshop (YArch'23)},
  year={2023}
}

Compute Express Link (CXL) 3.0 introduces powerful memory pooling and promises to transform datacenter architectures. However, the lack of available CXL 3.0 hardware and the complexity of multi-host configurations pose significant challenges to the community. This paper presents MEMU, a comprehensive emulation framework that enables full CXL 3.0 functionality, including multi-host memory sharing and pooling support. MEMU provides emulation of CXL 3.0 features, such as fabric management, dynamic memory allocation, and coherent memory sharing across multiple hosts, in advance of real hardware availability. An evaluation of MEMU shows that it achieves performance within about 3x of projected native CXL 3.0 speeds while maintaining complete compatibility with existing CXL software stacks. We demonstrate the utility of MEMU through a case study on a genomics pipeline, observing up to a 15% improvement in application performance compared to traditional RDMA-based approaches. MEMU is open-source and publicly available, aiming to accelerate CXL 3.0 research and development.

To bring up the emulated guests, first create a bridge and a TAP device on the host:
sudo ip link add br0 type bridge
sudo ip link set br0 up
sudo ip addr add 192.168.100.1/24 dev br0
for i in 0; do
sudo ip tuntap add tap$i mode tap
sudo ip link set tap$i up
sudo ip link set tap$i master br0
done
mkdir build
cd build
wget https://asplos.dev/about/qemu.img
wget https://asplos.dev/about/bzImage
cp qemu.img qemu1.img
../qemu_integration/launch_qemu_cxl1.sh
# in qemu
vi /usr/local/bin/*.sh
# change 192.168.100.10 to 11
vi /etc/hostname
# change node0 to node1
exit
# out of qemu
../qemu_integration/launch_qemu_cxl.sh &
../qemu_integration/launch_qemu_cxl1.sh &

For multiple physical hosts, you'll need VXLAN:
#!/bin/bash
set -eux
DEV=enp23s0f0np0
BR=br0
VNI=100
MCAST=239.1.1.1
BR_IP_SUFFIX=$(hostname | grep -oE '[0-9]+$' || echo 1) # optional auto-index
# Or set manually:
# BR_IP_SUFFIX=<1..4>
# Clean up
ip link del $BR 2>/dev/null || true
ip link del vxlan$VNI 2>/dev/null || true
# Create bridge
ip link add $BR type bridge
ip link set $BR up
# Create multicast VXLAN (no remote attribute!)
ip link add vxlan$VNI type vxlan id $VNI group $MCAST dev $DEV dstport 4789 ttl 10
ip link set vxlan$VNI up
ip link set vxlan$VNI master $BR
# Assign overlay IP
ip addr add 192.168.100.$BR_IP_SUFFIX/24 dev $BR
# Optional: add local TAPs for QEMU
for i in 0 1; do
ip tuntap add tap$i mode tap
ip link set tap$i up
ip link set tap$i master $BR
done
echo "Bridge $BR ready on host $(hostname)"for every host and edit the qemu's ip with /usr/local/bin/setup* and /etc/hostname.
This module enables GPU compute through CXL Type 2 device emulation, allowing a guest VM to access the host's NVIDIA GPU via the CXL.cache coherency protocol.
GUEST VM
  CUDA Application (cuda_test.c)
        │  CUDA Driver API calls
        ▼
  Guest libcuda.so Shim
        - Intercepts CUDA Driver API calls
        - Translates them to the CXL GPU command protocol
        - Maps BAR2 via /sys/bus/pci/.../resource2
        │  MMIO Read/Write
        ▼
  CXL Type 2 PCI Device (Vendor: 0x8086, Device: 0x0d92)
        BAR0: Component Registers
        BAR2: Cache/GPU Command
        BAR4: Device Memory
        BAR6: MSI-X
        │
        ▼
  Linux Kernel: cxl_type2_accel.ko
        - Binds to the CXL Type 2 PCI device
        - Configures CXL.cache and CXL.mem capabilities
        - Manages cache coherency state
        │
        │  PCIe/CXL Bus
        ▼
QEMU HOST
  CXL Type 2 Device Emulation (hw/cxl/cxl_type2.c)
        GPU Command Interface:
          Command Regs  0x00-0x3F     (CMD, PARAMS)
          Result Regs   0x80-0x9F     (RESULT0-3)
          Data Buffer   0x1000-0xFFFF (60KB for PTX, memcpy data)
        │
        ▼
  Coherency Engine
        - Cache line tracking (MESI-like states)
        - Snoop request handling
        - Writeback management
        │
        ▼
  hetGPU Backend (hw/cxl/cxl_hetgpu.c)
        - Loads libcuda.so via dlopen()
        - Translates commands to real CUDA API calls
        - Manages GPU context, memory, kernel launches
        │  dlsym() calls
        ▼
  /usr/lib/x86_64-linux-gnu/libcuda.so (NVIDIA CUDA Driver Library)
        cuInit, cuCtxCreate, cuMemAlloc, cuLaunchKernel, ...
        │
        ▼
  NVIDIA GPU Hardware (e.g., RTX 3090, A100)
The guest communicates with the CXL Type 2 device via MMIO registers in BAR2:
| Offset | Register | Description |
|---|---|---|
| 0x0000 | MAGIC | Magic number: 0x43584C32 ("CXL2") |
| 0x0004 | VERSION | Interface version |
| 0x0008 | STATUS | Device status (READY, BUSY, ERROR) |
| 0x0010 | CMD | Command register - write triggers execution |
| 0x0014 | CMD_STATUS | Command status (IDLE, RUNNING, COMPLETE) |
| 0x0018 | CMD_RESULT | Result/error code |
| 0x0040-0x0078 | PARAM0-7 | Command parameters |
| 0x0080-0x0098 | RESULT0-3 | Command results |
| 0x0140 | TOTAL_MEM | Total GPU memory |
| 0x1000-0xFFFF | DATA | Data buffer for PTX, memcpy |
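For illustration, a guest-side user program could reach these registers by mapping BAR2 through sysfs, as the shim does. The sketch below assumes an example PCI address (find the real one with lspci) and keeps error handling minimal; the register offsets come from the table above:

```c
/* Sketch: map BAR2 via sysfs and verify the MAGIC register.
 * The PCI address below is an example; locate the actual device
 * with lspci (vendor 0x8086, device 0x0d92). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR2_SIZE   0x10000
#define REG_MAGIC   0x0000
#define MAGIC_CXL2  0x43584C32u  /* "CXL2" */

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:0d:00.0/resource2",
                  O_RDWR | O_SYNC);
    if (fd < 0)
        return 1;
    volatile uint32_t *bar2 = mmap(NULL, BAR2_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (bar2 == MAP_FAILED)
        return 1;
    if (bar2[REG_MAGIC / 4] == MAGIC_CXL2)
        printf("CXL Type 2 GPU device is ready\n");
    munmap((void *)bar2, BAR2_SIZE);
    close(fd);
    return 0;
}
```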
Commands are written to the CMD register to trigger execution:

| Command | Code | Description |
|---|---|---|
| CMD_INIT | 0x01 | Initialize GPU |
| CMD_GET_DEVICE_COUNT | 0x02 | Get number of GPUs |
| CMD_CTX_CREATE | 0x10 | Create CUDA context |
| CMD_CTX_SYNC | 0x12 | Synchronize context |
| CMD_MEM_ALLOC | 0x20 | Allocate device memory |
| CMD_MEM_FREE | 0x21 | Free device memory |
| CMD_MEM_COPY_HTOD | 0x22 | Copy host to device |
| CMD_MEM_COPY_DTOH | 0x23 | Copy device to host |
| CMD_MODULE_LOAD_PTX | 0x30 | Load PTX module |
| CMD_FUNC_GET | 0x32 | Get kernel function |
| CMD_LAUNCH_KERNEL | 0x40 | Launch GPU kernel |
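For convenience, the same codes transcribed as a C header; the enum mirrors the table, and the identifier names are assumptions since the real headers aren't shown here:

```c
/* Command codes for the BAR2 CMD register, transcribed from the table. */
enum cxl_gpu_cmd {
    CMD_INIT             = 0x01, /* initialize GPU */
    CMD_GET_DEVICE_COUNT = 0x02, /* number of GPUs */
    CMD_CTX_CREATE       = 0x10, /* create CUDA context */
    CMD_CTX_SYNC         = 0x12, /* synchronize context */
    CMD_MEM_ALLOC        = 0x20, /* allocate device memory */
    CMD_MEM_FREE         = 0x21, /* free device memory */
    CMD_MEM_COPY_HTOD    = 0x22, /* copy host to device */
    CMD_MEM_COPY_DTOH    = 0x23, /* copy device to host */
    CMD_MODULE_LOAD_PTX  = 0x30, /* load PTX module */
    CMD_FUNC_GET         = 0x32, /* get kernel function */
    CMD_LAUNCH_KERNEL    = 0x40, /* launch GPU kernel */
};
```

The sequence below walks through CMD_MEM_ALLOC end to end.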
Guest                    QEMU CXL Type 2                Host GPU
  │                             │                           │
  │ 1. Write size to PARAM0     │                           │
  │─────────────────────────────▶                           │
  │                             │                           │
  │ 2. Write CMD_MEM_ALLOC      │                           │
  │─────────────────────────────▶                           │
  │                             │ 3. Call cuMemAlloc_v2()   │
  │                             │───────────────────────────▶
  │                             │                           │
  │                             │ 4. Return device pointer  │
  │                             │◀──────────────────────────│
  │                             │                           │
  │                             │ 5. Store in RESULT0       │
  │ 6. Poll CMD_STATUS          │                           │
  │─────────────────────────────▶                           │
  │                             │                           │
  │ 7. Read RESULT0 (dev ptr)   │                           │
  │◀────────────────────────────│                           │
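Put together, the allocation flow maps to a few MMIO accesses. The sketch below is built only from the register table and sequence above: it assumes PARAM0 and RESULT0 are 64-bit values split across two 32-bit words (consistent with their 8-byte strides), and it assumes an encoding for the COMPLETE status, which is not specified here:

```c
/* Sketch of the CMD_MEM_ALLOC sequence (steps 1-7 above); 'bar2' is the
 * BAR2 mapping from the earlier example. */
#include <stdint.h>

#define REG_CMD             0x0010
#define REG_CMD_STATUS      0x0014
#define REG_PARAM0          0x0040
#define REG_RESULT0         0x0080
#define CMD_MEM_ALLOC       0x20
#define CMD_STATUS_COMPLETE 0x02 /* assumed encoding of COMPLETE */

uint64_t cxl_gpu_mem_alloc(volatile uint32_t *bar2, uint64_t size)
{
    /* 1. Write the allocation size to PARAM0 (64-bit, two 32-bit words). */
    bar2[REG_PARAM0 / 4]     = (uint32_t)size;
    bar2[REG_PARAM0 / 4 + 1] = (uint32_t)(size >> 32);

    /* 2. Writing the command register triggers execution in QEMU, which
     *    calls cuMemAlloc_v2() on the host and stores the result
     *    (steps 3-5). */
    bar2[REG_CMD / 4] = CMD_MEM_ALLOC;

    /* 6. Poll CMD_STATUS until the device reports completion. */
    while (bar2[REG_CMD_STATUS / 4] != CMD_STATUS_COMPLETE)
        ;

    /* 7. Read the device pointer back from RESULT0. */
    uint64_t lo = bar2[REG_RESULT0 / 4];
    uint64_t hi = bar2[REG_RESULT0 / 4 + 1];
    return (hi << 32) | lo;
}
```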
# Build QEMU with the CXL Type 2 device
cd lib/qemu/build
meson setup --reconfigure
ninja

# Build the guest libcuda shim
cd qemu_integration/guest_libcuda
make

# In the guest: load the CXL driver stack
modprobe cxl_core
modprobe cxl_port
modprobe cxl_cache
modprobe cxl_type2_accel

# Set library path to use the CXL shim instead of the real libcuda
LD_LIBRARY_PATH=/path/to/guest_libcuda ./your_cuda_app

# Enable debug logging
CXL_CUDA_DEBUG=1 LD_LIBRARY_PATH=. ./cuda_test

QEMU device options:

-device cxl-type2,id=cxl-gpu0,\
        cache-size=128M,\                 # CXL.cache size
        mem-size=4G,\                     # Device-attached memory
        hetgpu-lib=/path/to/libcuda.so,\  # CUDA library path
        hetgpu-device=0                   # GPU device index

The CXL Type 2 device implements CPU-GPU cache coherency:
CPU Cache                                   CXL Type 2 Device
    │                                               │
    │◀─────────────── Snoop Request ────────────────│  (GPU wants exclusive access)
    │                                               │
    │─────────────── Snoop Response ───────────────▶│  (CPU provides data/invalidates)
    │                                               │
    │◀───────────────── Writeback ──────────────────│  (GPU writes back dirty data)
    │                                               │
This enables:
- Zero-copy data sharing between CPU and GPU
- Coherent memory regions visible to both processors
- Reduced memory copy overhead for GPU compute
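To make the snoop flow concrete, here is a minimal sketch of the MESI-like bookkeeping a coherency engine performs when the GPU requests exclusive access. The types and transitions are illustrative, not the actual code in hw/cxl/cxl_type2.c:

```c
/* Illustrative MESI-like cacheline tracking for the snoop flow above;
 * not the actual QEMU implementation. */
#include <stdbool.h>
#include <stdint.h>

enum cl_state { CL_INVALID, CL_SHARED, CL_EXCLUSIVE, CL_MODIFIED };

struct cacheline {
    uint64_t addr;      /* cacheline-aligned guest physical address */
    enum cl_state cpu;  /* CPU's view of the line */
};

/* GPU wants exclusive access: snoop the CPU cache. Returns true if the
 * CPU held dirty data that must be written back before the GPU proceeds. */
bool snoop_for_exclusive(struct cacheline *cl)
{
    bool writeback_needed = (cl->cpu == CL_MODIFIED);
    /* Whatever the prior state, the CPU copy is invalidated so the
     * device can own the line; dirty data is supplied via writeback. */
    cl->cpu = CL_INVALID;
    return writeback_needed;
}
```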