



# Making the Case for CXL Native Memory

Wolley Inc.

**Presenter: San Chang** 

©2023 Flash Memory Summit. All Rights Reserved



# CXL Memory Module for Enterprise/Server Applications





- CXL Memory Module: the memory "**module**" with CXL interface
  - Usually built with a CXL-to-DDR controller and a number of DDR chips
  - The module can carry multiple DDR chips to offer high memory capacity

### CXL Memory Module in Memory Hierarchy









- Typical client applications do not require far memory
- Both CXL switch and CXL-to-DDR controllers are "overkill" for client applications

- Client Host directly attaches to memory chips with CXL interface
- LPDDR5 replacement?



|                              | Serial Interface: CXL<br>(8-lane PCIe Gen5) | Parallel Interface: LPDDR5<br>(2x LPDDR5 6400 16-bit) |
|------------------------------|---------------------------------------------|-------------------------------------------------------|
| Bandwidth                    | 256 Gbps, full duplex                       | 204.8 Gbps, half duplex                               |
| Power Consumption (PHY only) | 1.5 ~ 2 pJ/bit                              | 2 ~ 2.5 pJ/bit                                        |
| IO (Pin Count)               | 49                                          | 68 x 2                                                |
| Single-command Latency       | 80-100 ns                                   | 40-45 ns                                              |

- Longer latency (*really*?)
- Smaller area: device controller
- Smaller area: host controller
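The bandwidth figures in the table follow from the raw signaling rates. The sketch below reproduces that arithmetic, assuming 32 GT/s per PCIe Gen5 lane per direction (raw, before encoding and protocol overhead) and 6400 MT/s per LPDDR5 data pin. The function names are illustrative, and the per-pin figures are derived here from the table's pin counts rather than taken from the slides.

```python
# Raw link bandwidth arithmetic for the comparison table (illustrative sketch).

PCIE_GEN5_GTPS_PER_LANE = 32.0  # assumption: raw PCIe Gen5 rate per lane, per direction
LPDDR5_MTPS_PER_PIN = 6400.0    # LPDDR5-6400 transfer rate per data pin


def cxl_raw_gbps(lanes: int) -> float:
    """Raw bandwidth per direction of a CXL link built on PCIe Gen5 lanes."""
    return lanes * PCIE_GEN5_GTPS_PER_LANE


def lpddr5_raw_gbps(channels: int, bits_per_channel: int) -> float:
    """Raw bandwidth of LPDDR5 channels; reads and writes share this bus."""
    return channels * bits_per_channel * LPDDR5_MTPS_PER_PIN / 1000.0


def gbps_per_pin(bandwidth_gbps: float, pin_count: int) -> float:
    """Bandwidth delivered per IO pin."""
    return bandwidth_gbps / pin_count


cxl = cxl_raw_gbps(lanes=8)
lpddr5 = lpddr5_raw_gbps(channels=2, bits_per_channel=16)

print(f"CXL x8:    {cxl:.1f} Gbps, {gbps_per_pin(cxl, 49):.2f} Gbps per pin")
print(f"2x LPDDR5: {lpddr5:.1f} Gbps, {gbps_per_pin(lpddr5, 68 * 2):.2f} Gbps per pin")
```

Because the CXL figure applies to each direction while the LPDDR5 figure is shared between reads and writes, the gap widens further for mixed read/write traffic.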

# Areas of expertise



- Gem5
  - ARM Research, AMD Research, Google, Micron, Metempsy, HP, and Samsung
  - Multiple ISAs (x86, Alpha, ARM, SPARC, MIPS, POWER, RISC-V)
- CXL Modeling
  - Wolley-developed model to simulate full-duplex interface
  - Gen5x8, 32GB/s: 1GHz, 1ns (PIPE: 32bit per lane => 32B 8-lane @1GHz)
- DRAMSim3
  - An extension of the well-known DRAM model DRAMSim2
  - Cycle-accurate, with support for multiple protocols (DDR3, DDR4, LPDDR3, LPDDR4, GDDR5, GDDR6, HBM, HMC, STT-MRAM)
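The "Gen5x8, 32GB/s" setting of the CXL model follows from the PIPE width: 32 bits per lane across 8 lanes gives 32 bytes per 1 ns cycle. A minimal sketch of that arithmetic follows. The 64-byte payload is an assumed cache-line size used only for illustration, and link-layer framing overhead is ignored.

```python
# Sketch of the PIPE-width arithmetic behind the "Gen5x8, 32GB/s" model setting.

def bytes_per_cycle(lanes: int, pipe_bits_per_lane: int) -> int:
    """Bytes moved across the PIPE interface in one clock cycle."""
    return lanes * pipe_bits_per_lane // 8


def link_gb_per_s(lanes: int, pipe_bits_per_lane: int, clock_ghz: float) -> float:
    """Bandwidth per direction in GB/s for a given PIPE clock."""
    return bytes_per_cycle(lanes, pipe_bits_per_lane) * clock_ghz


def cycles_to_transfer(payload_bytes: int, lanes: int, pipe_bits_per_lane: int) -> int:
    """Whole cycles needed to move a payload (ceiling division)."""
    per_cycle = bytes_per_cycle(lanes, pipe_bits_per_lane)
    return -(-payload_bytes // per_cycle)


LANES, PIPE_BITS, CLOCK_GHZ = 8, 32, 1.0
CACHE_LINE_BYTES = 64  # assumption: typical cache-line size, not stated in the slides

print(bytes_per_cycle(LANES, PIPE_BITS), "bytes per cycle")
print(link_gb_per_s(LANES, PIPE_BITS, CLOCK_GHZ), "GB/s per direction")
print(cycles_to_transfer(CACHE_LINE_BYTES, LANES, PIPE_BITS), "cycles per cache line")
```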

[Figure: CXL Native Memory model. Block labels: Command, Generic DRAM, BankState, Optional Thermal Module]

#### System Simulation - [CPU + CXL] $\iff$ CXL Native Memory



#### System Simulation - CXL Interface + CXL Native Memory



- Simulation includes
  - CXL transaction layer
  - CXL link layer
  - CXL ARB/MUX
- The detailed timing parameters are taken from our IP design
- We use an existing DRAM controller wrapper to emulate the memory management inside the CXL Native Memory
  - Without DDR, a real implementation could have a shorter latency



#### CXL Latency Outperforms in Congested Bandwidth – Case 1 (L1\_32KB)




- Application latency is much shorter with CXL Native Memory in real applications, particularly when memory bandwidth is under high utilization
- R/W Ratio
  - perlbench (r53/w47)
  - gcc (r54/w46)
  - x264 (r64/w36)
  - leela (r79/w21)
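One way to see why the read/write mix matters: on a half-duplex bus, reads and writes share one data path, while on a full-duplex link each direction carries only its own share. The rough sketch below assumes read data and write data travel in opposite directions of the CXL link, ignores command, header, and bus-turnaround overhead, and uses the raw bandwidths from the interface comparison table. The 150 Gbps demand is a hypothetical figure, not a measured one.

```python
# Illustrative utilization comparison for the four workloads' read/write mixes.

RW_RATIO = {  # (read %, write %) from the slide
    "perlbench": (53, 47),
    "gcc": (54, 46),
    "x264": (64, 36),
    "leela": (79, 21),
}

CXL_GBPS_PER_DIRECTION = 256.0  # full duplex: available in each direction
LPDDR5_GBPS_SHARED = 204.8      # half duplex: shared by reads and writes


def half_duplex_utilization(demand_gbps, bus_gbps=LPDDR5_GBPS_SHARED):
    """All traffic, reads plus writes, competes for one shared bus."""
    return demand_gbps / bus_gbps


def full_duplex_utilization(demand_gbps, read_pct, write_pct,
                            per_direction_gbps=CXL_GBPS_PER_DIRECTION):
    """Only the busier direction limits a full-duplex link."""
    busiest_share = max(read_pct, write_pct) / 100.0
    return demand_gbps * busiest_share / per_direction_gbps


DEMAND_GBPS = 150.0  # hypothetical total memory traffic

for name, (read_pct, write_pct) in RW_RATIO.items():
    hd = half_duplex_utilization(DEMAND_GBPS)
    fd = full_duplex_utilization(DEMAND_GBPS, read_pct, write_pct)
    print(f"{name:10s} half-duplex {hd:.0%}  full-duplex (busier direction) {fd:.0%}")
```

Under these assumptions, the more balanced the read/write mix, the more headroom the full-duplex link retains at the same total demand.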



#### CXL Latency Outperforms in Congested Bandwidth – Case 2 (L1\_32KB, L2\_256KB, L3\_8MB)

- Adopting a three-level cache architecture noticeably reduces latency
  - Below 150ns (w/o CXL)
  - Below 100ns (w/ CXL)
- Observation
  - Without CXL, latency generally gets longer when memory bandwidth utilization is higher





#### CXL Memory Performance Better than LPDDR



- Every configuration executes the same number of CPU cycles
  - More instructions executed means better performance
- Higher performance consumes more memory bandwidth
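Since every configuration runs for the same number of CPU cycles, the comparison reduces to instructions per cycle (IPC). A minimal sketch of that metric follows. The cycle and instruction counts are hypothetical placeholders, not results from the presentation.

```python
# IPC comparison at a fixed cycle budget (all counts below are hypothetical).

def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle; higher is better at a fixed cycle count."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def relative_performance(instructions_a: int, instructions_b: int) -> float:
    """Performance of A relative to B when both ran the same number of cycles."""
    if instructions_b <= 0:
        raise ValueError("instructions_b must be positive")
    return instructions_a / instructions_b


CYCLES = 1_000_000            # hypothetical fixed cycle budget
INSTR_CONFIG_A = 1_200_000    # hypothetical count for one memory configuration
INSTR_CONFIG_B = 1_000_000    # hypothetical count for another configuration

print("IPC A:", ipc(INSTR_CONFIG_A, CYCLES))
print("IPC B:", ipc(INSTR_CONFIG_B, CYCLES))
print("A relative to B:", relative_performance(INSTR_CONFIG_A, INSTR_CONFIG_B))
```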



#### The Viewpoint of Meta on System Latency

- Memory Source
  - DDR Memory (blue line)
  - CXL Memory (red line)
- Summary
  - When DDR bandwidth hits its limit, latency clearly suffers (red face)
  - CXL memory provides extra bandwidth
    - Move the "right" amount of pages to CXL memory
  - Overall application latency decreases
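As a rough illustration of what the "right" amount could mean, one simple heuristic is to split memory traffic so that DDR and CXL reach the same bandwidth utilization. This heuristic is an assumption made here for illustration, not the method described in the Meta talk. It also assumes traffic scales with the share of pages placed on each memory, and the bandwidth numbers are hypothetical.

```python
# Balanced-utilization split between DDR and CXL memory (illustrative heuristic).

def balanced_cxl_fraction(ddr_gbps: float, cxl_gbps: float) -> float:
    """Fraction of traffic to steer to CXL so both memories are equally utilized."""
    return cxl_gbps / (ddr_gbps + cxl_gbps)


def utilizations(demand_gbps: float, ddr_gbps: float, cxl_gbps: float,
                 cxl_fraction: float) -> tuple:
    """Return (DDR utilization, CXL utilization) for a given traffic split."""
    ddr_util = demand_gbps * (1.0 - cxl_fraction) / ddr_gbps
    cxl_util = demand_gbps * cxl_fraction / cxl_gbps
    return ddr_util, cxl_util


DDR_GBPS = 300.0     # hypothetical DDR bandwidth
CXL_GBPS = 100.0     # hypothetical CXL bandwidth
DEMAND_GBPS = 320.0  # hypothetical demand, above what DDR alone can serve

fraction = balanced_cxl_fraction(DDR_GBPS, CXL_GBPS)
print("DDR only:", utilizations(DEMAND_GBPS, DDR_GBPS, CXL_GBPS, 0.0))
print("Balanced:", utilizations(DEMAND_GBPS, DDR_GBPS, CXL_GBPS, fraction))
```

With these placeholder numbers, DDR alone would be oversubscribed, while the balanced split keeps both memories below saturation.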

#### Bandwidth utilization has a significant impact on application latency

Understanding memory usage in datacenters and Enabling software for CXL-Memory, OCP, 2022 (https://www.youtube.com/watch?v=IS2CE-1sgsE)







# CXL Native Memory as "eMMC for Memory"



|                                  | CXL Native Memory                                                 | CXL Memory Module                                                    |
|----------------------------------|-------------------------------------------------------------------|----------------------------------------------------------------------|
| Applications                     | Client                                                            | Server                                                               |
| Definition                       | the memory "<mark>chip</mark>" with CXL interface                 | the memory "<mark>module</mark>" with CXL interface                  |
| Implementation                   | CXL controller and memory will be tightly coupled inside the chip | Usually built with a CXL-to-DDR controller and a number of DDR chips |
| Diagram labels                   | CXL Native Memory: CXL controller, chiplet                        | CXL Memory Module                                                    |
| Storage analogy (diagram labels) | eMMC: eMMC controller, chiplet                                    | SSD module: PCIe SSD controller chip, DDR, NAND flash chip           |
| Application space                | Client Application Space                                          | Server Application Space                                             |

# Call-For-Action from Client/Mobile Users

- In order to make progress on CXL Native Memory, the baton lies in the hands of client/mobile users
  - Memory companies won't create new products without customer interest
- Technical/business benefits of CXL Native Memory to client/mobile users
  - Better performance (higher bandwidth, lower application latency)
  - More future-proof (LPDDR5 is reaching its per-pin performance limit)
  - Smaller/cheaper client host processor (die size is no longer IO-bound due to the LPDDR5 IP)
  - The host processor no longer needs to be responsible for media management; this allows easier switching between different CXL Native Memory chips, which is a significant advantage in business negotiations
- Dear client/mobile people: let us work together to promote and deliver CXL Native Memory to the industry





- The CXL memory module has received most of the attention so far, for server applications requiring memory disaggregation
- In this presentation, we highlight *CXL Native Memory* as another interesting memory device for client applications
- Most people have the impression that, because of the inherent difference between serial and parallel operation, CXL has a higher latency than LPDDR5
  - But this is only true for a single-command comparison; we showed through simulations that, for any practical workload, CXL actually has a lower application latency than LPDDR5
- Wolley will work with memory companies and client/mobile users on CXL Native Memory products

