

### The Art of Memory System Design



#### **Pipelining and the cache** (Designing...,M.J.Quinn, '87)

**Instruction Pipelining** is the use of pipelining to allow more than one instruction to be in some stage of execution at the same time.

#### Ferranti ATLAS (1963):

- Pipelining reduced the average time per instruction by 375%
- Memory could not keep up with the CPU, needed a cache.

Cache memory is a small, fast memory unit used as a buffer between a processor and primary memory

### **Principle of Locality**

#### • Principle of Locality

states that programs access a relatively small portion of their address space at <u>any instance of time</u>

#### <u>Two types of locality</u>

• <u>Temporal locality</u> (locality in time) If an item is referenced, then <u>the same</u> item will tend to be referenced soon "the tendency to reuse recently accessed data items"

#### <u>Spatial locality</u> (locality in space)

If an item is referenced, then <u>nearby</u> items will be referenced soon "the tendency to reference nearby data items"



### Memory Hierarchy of a Modern Computer System

- By taking advantage of the principle of locality:
  - -Present the user with as much memory as is available in the cheapest technology.
  - -Provide access at the speed offered by the fastest technology.



| Speed (ns): 1s     | 10s | 100s | 10,000,000s | 10,000,000,000s |
|--------------------|-----|------|-------------|-----------------|
| Size (bytes): 100s |     |      | (10s ms)    | (10s sec)       |
|                    | Ks  | Ms   | Gs          | Ts              |

### Cache Memory Technology: SRAM 1 bit cell layout



#### Figure 3. IBM's 6-Transistor Memory Cell



# Basic DRAM design

- DRAM replaces all but one transitors of flip-flop with a capacitor
- => smaller!
- Capacitor stores information
- Charge leakage requires periodic refreshment (sense & rewrite)



### Memories Technology and Principle of Locality

- Faster Memories are more expensive per bit
- Slower Memories are usually smaller in area size per bit

| Memory<br>Technology | Typical access<br>time | \$ per Mbyte in<br>1997 |
|----------------------|------------------------|-------------------------|
| SRAM                 | 5-25 ns                | \$100-\$250             |
| DRAM                 | 60-120 ns              | \$5-\$10                |
| Magnetic Disk        | 10-20 million ns       | \$0.10-\$0.20           |

## Cache Memory Technology: SRAM



#### • Speed.

The primary advantage of an SRAM over DRAM is speed.

The fastest DRAMs on the market still require 5 to 10 processor clock cycles to access the first bit of data.

SRAMs can operate at processor speeds of 250 MHz and beyond, with access and cycle times equal to the clock cycle used by the microprocessor

#### • Density.

when 64 Mb DRAMs are rolling off the production lines, the largest SRAMs are expected to be only 16 Mb.

see reference: http://www.chips.ibm.com/products/memory/sramoperations/sramop.html

• Volatility.

Unlike DRAMs, SRAM cells do not need to be refreshed. SRAMs are available 100% of the time for reading & writing.

#### • Cost.

If cost is the primary factor in a memory design, then DRAMs win hands down.

If, on the other hand, performance is a critical factor, then a well-designed SRAM is an effective cost performance solution.

### **Memory Hierarchy of a Modern Computer System**

- By taking advantage of the principle of locality:
  - -Present the user with as much memory as is available in the cheapest technology.
  - -Provide access at the speed offered by the fastest technology.
- DRAM is slow but cheap and dense:
  - -Good choice for presenting the user with a BIG memory system
- SRAM is fast but expensive and not very dense:

-Good choice for providing the user FAST access time.

### **Cache Terminology**

A hit if the data requested by the CPU is in the upper level

Hit rate or Hit ratio

is the fraction of accesses found in the upper level

#### Hit time

is the time required to access data in the upper level
= <detection time for hit or miss> + <hit access time>

#### A miss if the data is not found in the upper level

Miss rate or (1 – hit rate)

is the fraction of accesses <u>not</u> found in the upper level Miss penalty

is the time required to access data in the lower level
= <lower access time>+<reload processor time>

### **Cache Example**



Hit time = Time 1

**Miss penalty = Time 2 + Time 3** 

#### **Basic Cache System**

Figure 1. Basic Cache System



स्म 📢

### Cache Memory Technology: SRAM Block diagram

Figure 2. Simplified Block Diagram of a Synchronous SRAM

≝ ≪



### Cache Memory Technology: SRAM timing diagram

≓ •ા

Figure 4. Reading from Memory (Flow Thru mode)



Note: DQ0 is the data associated with Address 0 (A0). DQ1 is the data associated with Address 1 (A1).

### **Direct Mapped Cache**

• Direct Mapped: assign the cache location based on the address of the word in memory

• cache\_address = memory\_address modulo cache\_size;



Memory

**Observe there is a Many-to-1 memory to cache relationship** 

### **Direct Mapped Cache: Data Structure**

There is a Many-to-1 relationship between memory and cache

How do we know whether the data in the cache corresponds to the requested word?

#### tags

- contain the address information required to identify whether a word in the cache corresponds to the requested word.
- tags need only to contain the upper portion of the memory address (often referred to as a page address)

#### valid bit

indicates whether an entry contains a valid address

### **Direct Mapped Cache: Temporal Example**

|   |       |                                 |       |                                  |      | E E                                       |  |
|---|-------|---------------------------------|-------|----------------------------------|------|-------------------------------------------|--|
|   | w \$2 | ,10 110<br>2,11 010<br>3,10 110 | (\$0) | Miss: vali<br>Miss: vali<br>Hit! | 1 ** | \$1,22(\$0)<br>\$2,26(\$0)<br>\$3,22(\$0) |  |
|   | Index | Valid                           |       | Tag                              |      | Data                                      |  |
|   | 000   | Ν                               |       |                                  |      |                                           |  |
|   | 001   | Ν                               |       |                                  |      |                                           |  |
| L | 010   | Y                               |       | 11                               | Mem  | Memory[11010]                             |  |
|   | 011   | Ν                               |       |                                  |      |                                           |  |
|   | 100   | Ν                               |       |                                  |      |                                           |  |
|   | 101   | Ν                               |       |                                  |      |                                           |  |
|   | 110   | Y                               |       | 10                               | Mem  | ory[10110]                                |  |
|   | 111   | Ν                               |       |                                  |      |                                           |  |

#### **Direct Mapped Cache: Worst case, always miss!**

|   |       |                                 |       |                                      |                  | <u></u> 3                                |
|---|-------|---------------------------------|-------|--------------------------------------|------------------|------------------------------------------|
|   | w \$2 | ,10 110<br>2,11 110<br>3,00 110 | (\$0) | Miss: vali<br>Miss: tag<br>Miss: tag | d Iw<br>Iw<br>Iw | \$1,22(\$0)<br>\$2,30(\$0)<br>\$3,6(\$0) |
|   | Index | Valid                           |       | Tag                                  |                  | Data                                     |
|   | 000 N |                                 |       |                                      |                  |                                          |
|   | 001   | Ν                               |       |                                      |                  |                                          |
|   | 010   | Ν                               |       |                                      |                  |                                          |
|   | 011   | Ν                               |       |                                      |                  |                                          |
|   | 100   | Ν                               |       |                                      |                  |                                          |
|   | 101   | Ν                               |       |                                      |                  |                                          |
| Ľ | 110   | Υ                               |       | 00                                   | Me               | emory[00110]                             |
|   | 111   | N                               |       |                                      |                  |                                          |

#### **Direct Mapped Cache: Mips Architecture**



### **Bits in a Direct Mapped Cache**

How many total bits are required for a direct mapped cache with 64KB (= 2<sup>16</sup> KiloBytes) of data and one word (=32 bit) blocks assuming a 32 bit <u>byte</u> memory address? Cache index width = log<sub>2</sub> words

 $= \log_2 2^{16}/4 = \log_2 2^{14}$  words = 14 bits

Block address width = <br/> byte address width >  $-\log_2$  word = 32 - 2 = 30 bits

Tag size = <block address width> - <cache index width> = 30 - 14 = 16 bits

Cache block size = <valid size>+<tag size>+<block data size> = 1 bit + 16 bits + 32 bits = 49 bits Total size = <Cache word size> × <Cache block size>

> = 2<sup>14</sup> words × 49 bits = 784 × 2<sup>10</sup> = 784 Kbits = 98 KB = 98 KB/64 KB = 1.5 times overhead

### **Split Cache: Exploiting the Harvard Architectures**





<u>Von Neuman architecture</u> Area efficient but requires higher bus bandwidth because instructions and data must compete for memory. <u>Harvard architecture</u> was coined to describe machines with separate memories. **Speed efficient:** Increased parallelism (split cache).

#### **Modern Systems: Pentium Pro and PowerPC**



#### **RAM** (main memory) : von Neuman Architecture

**Cache:** uses Harvard Architecture separate Instruction/Data caches

| Characteristic      | Intel Pentium Pro                 | PowerPC 604                      |  |
|---------------------|-----------------------------------|----------------------------------|--|
| Cache organization  | Split instruction and data caches | Split intruction and data caches |  |
| Cache size          | 8 KB each for instructions/data   | 16 KB each for instructions/data |  |
| Cache associativity | Four-way set associative          | Four-way set associative         |  |
| Replacement         | Approximated LRU replacement      | LRU replacement                  |  |
| Block size          | 32 bytes                          | 32 bytes                         |  |
| Write policy        | Write-back                        | Write-back or write-through      |  |

### **Cache schemes**



#### Hits vs. Misses





<u>Read hits</u>

-this is what we want!

#### <u>Read misses</u>

-stall the CPU, fetch block from memory, deliver to cache, and restart.

#### <u>Write hits</u>

-write-through: can replace data in cache and memory.

-write-buffer: write data into cache and buffer.

-write-back: write the data only into the cache.

#### <u>Write misses</u>

-read the entire block into the cache, then write the word.

### **Example: The DECStation 3100 cache**

DECStation uses a write-through harvard architecture cache

- 128 KB total cache size (=32K words)
  - = 64 KB instruction cache (=16K words)
  - + 64 KB data cache (=16K words)
- 10 processor clock cycles to write to memory

## The DECStation 3100 miss rates

• A split instruction and data cache increases the bandwidth

| Benchmark<br>Program         | gcc  | spice | 1.2% miss, also means<br>that 98.2% of the time it<br>is in the cache. So |
|------------------------------|------|-------|---------------------------------------------------------------------------|
| Instruction<br>miss rate     | 6.1% | 1.2%  | using a cache pays off!                                                   |
| Data<br>miss rate            | 2.1% | 1.3%  | Why a lower miss rate?                                                    |
| Effective split<br>miss rate | 5.4% | 1.2%  | Numerical programs tend to consist of a lot                               |
| Combined miss rate           | 4.8% |       | of small program loops                                                    |

split cache has slightly worse miss rate

### **Review: Principle of Locality**

#### • Principle of Locality

states that programs access a relatively small portion of their address space at <u>any instance of time</u>

#### <u>Two types of locality</u>

• <u>Temporal locality</u> (locality in time) If an item is referenced, then <u>the same</u> item will tend to be referenced soon "the tendency to reuse recently accessed data items"

### <u>Spatial locality</u> (locality in space)

If an item is referenced, then <u>nearby</u> items will be referenced soon "the tendency to reference nearby data items"

### **Spatial Locality**

#### <u>Temporal only cache</u>

cache block contains only one word (No spatial locality).

#### <u>Spatial locality</u>

Cache block contains multiple words.

- When a miss occurs, then fetch multiple words.
- <u>Advantage</u>

Hit ratio increases because there is a high probability that the adjacent words will be needed shortly.

Disadvantage

Miss penalty increases with block size

### Spatial Locality: 64 KB cache, 4 words

- 64KB cache using four-word (16-byte word)
- 16 bit tag, 12 bit index, 2 bit block offset, 2 bit byte offset.



### Performance

Use split caches because there is more spatial locality in

| Program<br>Block size        | gcc<br>=1 | gcc<br>=4 | spice<br>=1 | spice<br>=4 |
|------------------------------|-----------|-----------|-------------|-------------|
| Instruction<br>miss rate     | 6.1%      | 2.0%      | 1.2%        | 0.3%        |
| Data<br>miss rate            | 2.1%      | 1.7%      | 1.3%        | 0.6%        |
| Effective split<br>miss rate | 5.4%      | 1.9%      | 1.2%        | 0.4%        |
| Combined miss rate           | 4.8%      | 4.8%      |             |             |

Temporal only split cache: has slightly worse miss rate Spatial split cache: has lower miss rate

### **Cache Block size Performance**

• Increasing the block size tends to decrease miss rate:

