As new technologies often do, Memory1 has sparked some fascinating conversations. Amongst customers and system designers, one common area of interest revolves around the oft-referenced, seldom-defined notion of “performance”. Historically, there’s been little reason to deeply question, quantify, or analyze system memory performance and the tradeoffs that define it. In fact, the prevailing view on memory performance can be framed as a simple comparative:
“Memory is fast. Storage is slow.”
Until recently, there was no real need for additional complexity. For most of us, that simple comparison encapsulated the key points of interest. System memory provides the shortest path to the CPU, so data spends minimal time in transit. Location, location, location. And since “system memory” has been synonymous with “DRAM”, the comparative effectively became “DRAM is fast. Storage is slow”. With the launch of Memory1, however, system designers now have a new option. As a result, probing questions that challenge assumptions regarding “system memory performance” are being brought to the forefront.
“OK…” you might say, “but, objectively speaking, DRAM really is fast, right? Isn’t memory performance measured in nanoseconds?”
The answer is a resounding “sometimes”.
Debunking DRAM Determinism
First, let’s align on a few facts:
Fact #1: DRAM DIMMs conform to the rigidly defined DDR4 protocol and support deterministic latencies measured in nanoseconds.
Fact #2: DRAM DIMMs can be arrayed in parallel across multiple memory channels and across multiple CPUs.
Due to this powerful combination of device-level latency and system-level parallelism, DRAM consistency has rarely been questioned. Instead, system designers have historically focused on analyzing storage consistency (or lack thereof).
Truth be told, the focus on storage has been thoroughly warranted due to DRAM’s limited capacity and relatively high cost. Traditional wisdom tells us that system memory, as implemented with DRAM, can’t hold all the data that you’d like it to…and therefore, large amounts of data must be pulled in from storage as needed. As a result, the impact of high, inconsistent storage latencies far outweighed any potential concern over DRAM consistency. Location, location, location. Now, however, with the emergence of alternative, high-density system memory solutions, a closer look is merited. To fairly evaluate the features and benefits of alternative solutions, an accurate understanding of real-world DRAM performance is required.
(Spoiler alert: DRAM doesn’t guarantee the invariable consistency that you may have assumed.)
Navigating NUMA Non-Uniformity
“So what about those nanosecond latencies?”
They exist. Just not always. The vast majority of modern servers leverage the NUMA architecture to connect CPUs and system memory. NUMA, which stands for “Non-Uniform Memory Access”, enables multiple CPUs to access a shared pool of system memory. This shared pool is created by aggregating the memory connected to each local CPU. However, as its name implies, access to memory within a NUMA-based design is non-uniform, which, in practice, translates to non-deterministic.
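On Linux systems, this non-uniformity is visible in the firmware-reported node distance table (the same data `numactl --hardware` prints). Below is an illustrative sketch for a hypothetical 2-socket server; the distance values follow the common convention (10 for local, ~21 for one remote hop) and are assumptions for illustration, not measurements of any particular platform:

```python
# Illustrative SLIT-style distance matrix for a hypothetical 2-socket server.
# By convention, a node's distance to itself is 10, and remote distances are
# scaled against that baseline (21 means roughly 2.1x the local access cost).
# These values are assumptions for illustration only.
distances = [
    [10, 21],  # node 0 -> node 0 (local), node 0 -> node 1 (remote)
    [21, 10],  # node 1 -> node 0 (remote), node 1 -> node 1 (local)
]

def relative_cost(src: int, dst: int) -> float:
    """Access cost from node `src` to node `dst`, relative to a local access."""
    return distances[src][dst] / distances[src][src]

print(relative_cost(0, 0))  # local access: 1.0
print(relative_cost(0, 1))  # remote access: 2.1x the local cost
```

In other words, even with identical DRAM DIMMs on every node, where the data lives relative to the requesting CPU changes the effective access cost.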
“But wait…today, system memory is DRAM, right? And doesn’t DRAM conform to the rigid DDR4 protocol? And aren’t multiple DRAM devices arrayed in parallel within the memory subsystem? So aren’t deterministic latencies a given?”
Yes. And Yes. And Yes. And No. The keys to these seemingly conflicting truths lie in the realities of NUMA connectivity between CPUs. If DRAM is accessed on a single, local CPU node, the access times will be:
- Solely governed by the DDR4 protocol
- Measured in nanoseconds
- Extremely consistent
So, how realistic is the single-node scenario?
Well, if a server’s target workload can only leverage small amounts of memory, then the memory connected to a single CPU node could be sufficient. In practice, however, the majority of real-world applications require much more memory than one CPU node can support. This is one of the key reasons that modern servers employ multiple CPUs. NUMA connects those CPUs to one another using fast point-to-point interconnect technologies…however, locally connected memory remains more readily accessible than memory connected to other CPUs.
That’s critically important because, in many cases, the data a CPU requests is not present in its locally connected memory and must instead be fetched from memory attached to another CPU. Retrieving data from non-local system memory requires inter-CPU data transfer across the point-to-point interconnect. This adds complexity, and the consistency typically associated with DRAM access goes right out the window. Even worse, in some 4-socket systems, two separate hops between processors may be required for a single data retrieval. Location, location, location. So while inter-CPU transfers aren’t slow, you may be waiting much longer for data than you’d anticipated. With modern applications demanding more and more memory, the impact of these delays can be quite significant.
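The cost of those remote accesses can be sketched as a simple weighted average. The latency figures below are illustrative assumptions (in the general range reported for two-socket NUMA systems), not specifications:

```python
# Expected memory latency as a weighted average of local and remote accesses.
# Latency figures are illustrative assumptions: ~90 ns for a local DRAM
# access, ~150 ns for an access requiring one remote (inter-CPU) hop.
LOCAL_NS = 90.0
REMOTE_NS = 150.0

def expected_latency_ns(local_hit_rate: float) -> float:
    """Average access latency given the fraction of locally served requests."""
    return local_hit_rate * LOCAL_NS + (1.0 - local_hit_rate) * REMOTE_NS

for rate in (1.0, 0.75, 0.5):
    print(f"local hit rate {rate:.0%}: {expected_latency_ns(rate):.0f} ns")
```

The takeaway: as the local hit rate drops, average latency climbs steadily toward the remote figure, which is exactly why keeping more data local to each CPU matters.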
Maximizing Main Memory
“Ok, ok. Enough about the problems. Can you do better?”
Well, ideally, each CPU would have access to more local data, thereby maximizing the hit rate of local requests and minimizing the latency and consistency impact of inter-CPU transfers. Obviously, this is easier said than done. The maximum number of physically connected DIMMs is directly determined by the CPU architecture. Unfortunately, this leaves no flexibility to expand the number of attached memory devices.
To increase the amount of locally accessible memory in a meaningful way, we have to focus on the memory technology itself. We need to pack more memory into each available DIMM slot. And that’s where Memory1 comes in. With capacities up to 256GB per DIMM, Memory1 allows up to four times more data to remain local to each CPU. Location, location, location. So yes, we can do better. Memory1 is already blazing this trail. Improved data locality…compelling economics…seamless integration. What’s not to like?
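The locality gain can be put in rough numbers. The sketch below assumes, for illustration, a CPU with 12 DIMM slots and a 64GB conventional DRAM DIMM as the baseline (both are assumptions; only the 256GB Memory1 figure comes from the text above):

```python
# Back-of-the-envelope local-capacity comparison per CPU socket.
# DIMM_SLOTS and the 64GB DRAM baseline are illustrative assumptions;
# the 256GB Memory1 capacity is the figure cited in the article.
DIMM_SLOTS = 12
DRAM_DIMM_GB = 64      # assumed conventional DRAM DIMM capacity
MEMORY1_DIMM_GB = 256  # Memory1 capacity per DIMM

dram_local_gb = DIMM_SLOTS * DRAM_DIMM_GB        # locally attached DRAM
memory1_local_gb = DIMM_SLOTS * MEMORY1_DIMM_GB  # locally attached Memory1

print(f"DRAM local capacity:    {dram_local_gb} GB")
print(f"Memory1 local capacity: {memory1_local_gb} GB")
print(f"Ratio: {memory1_local_gb // dram_local_gb}x")
```

Under those assumptions, each socket goes from 768GB to 3TB of locally attached memory, matching the fourfold improvement described above.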
Jerome McFarland, Director, Marketing