CSC 213: Computer Architecture


Lecture 6b: Cache Memory

November 30, 2021


Agenda

1 Memory Hierarchy

2 Cache Memory

3 Cache Design Issues


Importance of Memory System

Every instruction makes at least one memory reference
to fetch the instruction
Typically more memory references are made
to fetch an operand
to store an operand
A program's memory references often determine the ultimate
performance of the program


Processor-DRAM Performance Gap


How do you Bridge the Gap?

Goal: Provide an illusion of a fast, large and cheap memory system.
Method: Memory hierarchy.


Memory Hierarchy Diagram


Mechanics of Technology

The basic mechanics of creating memory directly affect the
first three characteristics of the hierarchy:
Decreasing cost per bit
Increasing capacity
Increasing access time
The fourth characteristic, decreasing frequency of access by the
processor, is met because of a principle known as locality of
reference


Locality of Reference

Due to the nature of programming, instructions and data tend
to cluster together (loops, subroutines, and data structures)
Over a long period of time, clusters will change
Over a short period, clusters will tend to be the same
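
This clustering is easy to see in ordinary code. Below is a minimal C
illustration (array size and contents are arbitrary): the sequential
sweep over data[] exhibits spatial locality, while the repeated
references to sum, i, and the loop instructions themselves exhibit
temporal locality.

#include <stdio.h>

int main(void) {
    int data[1024];
    int sum = 0;

    /* Spatial locality: consecutive elements of data[] fall in the
       same or adjacent memory blocks. */
    for (int i = 0; i < 1024; i++)
        data[i] = i;

    /* Temporal locality: sum, i, and the loop instructions are
       referenced over and over within a short period. */
    for (int i = 0; i < 1024; i++)
        sum += data[i];

    printf("sum = %d\n", sum);
    return 0;
}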


Breaking Memory into Levels

Assume a hypothetical system has two levels of memory
Level 2 contains all instructions and data
Level 1 doesn't have room for everything, so when a new
cluster is required, the cluster it replaces must be sent back to
level 2
These principles can be applied to more than just two levels


Performance of a Simple Two-Level Memory


Memory Hierarchy - Performance Examples

A processor has access to two levels of memory. Level 1 has
an access time of 0.01 µs and level 2 has an access time of
0.1 µs.
If 95% of the memory accesses are found in the faster level,
then the average access time might be:

(0.95)(0.01 µs) + (0.05)(0.01 µs + 0.1 µs)
= 0.0095 + 0.0055 = 0.015 µs
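
The same calculation as a small C sketch (the variable names are
mine; the miss term charges both levels because a missed word is
first brought into level 1 and then read from there):

#include <stdio.h>

int main(void) {
    /* Parameters from the example above; times in microseconds. */
    double t1  = 0.01;   /* level 1 access time                   */
    double t2  = 0.1;    /* level 2 access time                   */
    double hit = 0.95;   /* fraction of accesses found in level 1 */

    double avg = hit * t1 + (1.0 - hit) * (t1 + t2);

    printf("average access time = %.4f us\n", avg);  /* 0.0150 us */
    return 0;
}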


Hierarchy List

Registers
L1 Cache
L2 Cache
Main memory
Disk cache
Disk
Optical
Tape


Cache

What is it? A cache is a small amount of fast memory
What makes small fast?
Simpler decoding logic
More expensive SRAM technology
Close proximity to processor – Cache sits between normal main
memory and CPU or it may be located on CPU chip or module


Cache (2)


Cache Structure

Cache includes tags to identify the address of the block of
main memory contained in a line of the cache
Each word in main memory has a unique n-bit address
There are M = 2^n / K blocks of K words in main memory
Cache contains C lines of K words each, plus a tag uniquely
identifying the block of K words


Cache Structure (2)


Cache operation – overview

CPU requests contents of memory location
Check cache for this data
If present, get from cache (fast)
If not present, read required block from main memory to cache
Then deliver from cache to CPU
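
A minimal C sketch of this flow, using a toy direct-mapped cache
(direct mapping is covered later; the sizes here are illustrative,
not the 64 kByte example used below):

#include <stdio.h>
#include <stdbool.h>

#define NUM_LINES   8   /* cache lines          */
#define BLOCK_WORDS 4   /* words per block/line */

static unsigned memory[1024];   /* toy main memory (word-addressed) */
static unsigned cache[NUM_LINES][BLOCK_WORDS];
static unsigned tags[NUM_LINES];
static bool     valid[NUM_LINES];

unsigned read_word(unsigned addr) {
    unsigned word  = addr % BLOCK_WORDS;      /* offset within block */
    unsigned block = addr / BLOCK_WORDS;      /* memory block number */
    unsigned line  = block % NUM_LINES;       /* direct mapping      */
    unsigned tag   = block / NUM_LINES;

    if (!valid[line] || tags[line] != tag) {  /* miss                */
        for (int w = 0; w < BLOCK_WORDS; w++) /* read block to cache */
            cache[line][w] = memory[block * BLOCK_WORDS + w];
        tags[line]  = tag;
        valid[line] = true;
    }
    return cache[line][word];                 /* deliver from cache  */
}

int main(void) {
    for (unsigned i = 0; i < 1024; i++) memory[i] = i * 10;
    printf("%u\n", read_word(42));  /* miss: block fetched; prints 420 */
    printf("%u\n", read_word(43));  /* hit: same block; prints 430     */
    return 0;
}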


Cache Read Flowchart


Cache Design

Addressing
Size
Mapping Function
Replacement Algorithm
Write Policy
Block Size
Number of Caches


Cache Addressing

Where does cache sit?
Between processor and virtual memory management unit (MMU)
Between MMU and main memory
Logical cache (virtual cache) stores data using virtual
addresses
Processor accesses cache directly, without going through the MMU
Cache access is faster, since it occurs before MMU address
translation
Virtual addresses use the same address space for different
applications
Must flush cache on each context switch
Physical cache stores data using main memory physical
addresses


Cache size

Cost
More cache is expensive
Speed
More cache is faster (up to a point)
Larger decoding circuits slow down a cache
An algorithm is needed for mapping main memory addresses to
lines in the cache; this takes more time than a direct RAM access


Typical Cache Organization


Mapping Functions

A mapping function is the method used to locate a memory
address within a cache
It is used when copying a block from main memory to the
cache and it is used again when trying to retrieve data from
the cache
There are three kinds of mapping functions
Direct
Associative
Set Associative


Cache Example

These notes use an example of a cache to illustrate each of
the mapping functions.
The characteristics of the cache used are:
Size: 64 kByte
Block size: 4 bytes
i.e. the cache has 16k (2^14) lines of 4 bytes each
Address bus: 24-bit
i.e., 16 Mbytes of main memory divided into 4M four-byte blocks


Direct Mapping

Each block of main memory maps to only one cache line
i.e. if a block is in cache, it will always be found in the same
place
Line number is calculated using the following function

i = j modulo m

where
i = cache line number
j = main memory block number
m = number of lines in the cache


Direct Mapping Address Structure


Each main memory address can be divided into two fields
Least significant w bits identify a unique word within a block
Remaining s bits specify which block in memory. These are
divided into two fields
Least significant r bits of these s bits identify which line in
the cache
Most significant s-r bits uniquely identify the block held in a
line of the cache

Address layout: | Tag (s-r bits) | Line (r bits) | Word (w bits) |


Direct Mapping Address Structure - Example

Address layout: | Tag (s-r): 8 bits | Line or Slot (r): 14 bits | Word (w): 2 bits |

24 bit address
2 bit word identifier (4 byte block)
22 bit block identifier
8 bit tag (=22-14)
14 bit slot or line
No two blocks in the same line have the same Tag field
Check contents of cache by finding line and checking Tag
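
A minimal C sketch of this decomposition using shifts and masks;
the address value itself is arbitrary, chosen only for illustration:

#include <stdio.h>

int main(void) {
    unsigned addr = 0x16339C;              /* a 24-bit byte address    */

    unsigned word = addr & 0x3;            /* bits 1..0:   2-bit word  */
    unsigned line = (addr >> 2) & 0x3FFF;  /* bits 15..2:  14-bit line */
    unsigned tag  = (addr >> 16) & 0xFF;   /* bits 23..16: 8-bit tag   */

    printf("tag=0x%02X line=0x%04X word=%u\n", tag, line, word);
    return 0;
}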


Direct Mapping Cache Organization

Cache line | Main memory blocks held
0          | 0, m, 2m, 3m, ..., 2^s - m
1          | 1, m+1, 2m+1, ..., 2^s - m + 1
...        | ...
m-1        | m-1, 2m-1, 3m-1, ..., 2^s - 1


Direct Mapping Cache Line Table


Direct Mapping Summary

Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
Number of lines in cache = m = 2^r
Size of tag = (s - r) bits


Direct Mapping pros & cons

Simple
Inexpensive
Fixed location for given block
If a program repeatedly accesses 2 blocks that map to the same
line, cache misses are very high (thrashing)


Associative Mapping

A main memory block can load into any line of cache
Memory address is interpreted as:
Least significant w bits = word position within block
Most significant s bits = tag used to identify which block is
stored in a particular line of cache
Every line’s tag must be examined for a match
Cache searching gets expensive and slower


Associative Mapping from Cache to Main Memory


Fully Associative Cache Organization


Associative Mapping Address Structure - Example

Address layout: | Tag (22 bits) | Word (2 bits) |

22 bit tag stored with each 32 bit block of data
Compare tag field with tag entry in cache to check for hit
Least significant 2 bits of address identify the required word
within the 32 bit data block

Address | Tag    | Data     | Cache line
FFFFFC  | 3FFFFF | 24682468 | 3FFF
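
A minimal C sketch of the split (the address is the one from the
table row above; the tag is simply the address with the 2 word bits
shifted off, which is why 0xFFFFFC yields tag 0x3FFFFF):

#include <stdio.h>

int main(void) {
    unsigned addr = 0xFFFFFC;     /* 24-bit address from the example */

    unsigned word = addr & 0x3;   /* 2-bit word offset               */
    unsigned tag  = addr >> 2;    /* 22-bit tag: the block number    */

    printf("tag=0x%06X word=%u\n", tag, word);  /* tag=0x3FFFFF */
    return 0;
}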


Associative Mapping Summary

Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
Number of lines in cache = undetermined
Size of tag = s bits


Set Associative Mapping

Cache is divided into a number of sets, v
Each set contains a number of lines, k
A given memory block maps to any line in a given set
e.g. Block B can be in any line of set i
2 lines per set is the most common organization,
called 2-way set associative mapping
A given block can be in one of 2 lines in only one set


Set Associative Mapping (2)

Address length is s + w bits
Cache is divided into a number of sets, v = 2^d
k blocks/lines can be contained within each set
k lines per set is called a k-way set associative mapping
Number of lines in the cache = v * k = k * 2^d
Size of tag = (s - d) bits
Hybrid of Direct and Associative:
k = 1: this is basically direct mapping
v = 1: this is associative mapping


Mapping From Main Memory to Cache: v Associative


Alternative Mapping: k-way Associative


K-Way Set Associative Cache Organization


Set Associative Mapping Example

Using a two-way set associative mapping
Divides the 16K lines into 8K sets
This requires a 13 bit set number
With 2 word bits, this leaves 9 bits for the tag
Block number in main memory is taken modulo 2^13
Blocks beginning with the addresses 000000h, 008000h,
010000h, 018000h, 020000h, 028000h, etc. map to the
same set, Set 0.
Blocks beginning with the addresses 000004h, 008004h,
010004h, 018004h, 020004h, 028004h, etc. map to the
same set, Set 1.


Set Associative Mapping Address Structure - Example

Address layout: | Tag (9 bits) | Set (13 bits) | Word (2 bits) |

Use set field to determine cache set to look in
Compare tag field to see if we have a hit
e.g.,
Address  | Tag | Data     | Set
1FF 7FFC | 1FF | 12345678 | 1FFF
001 7FFC | 001 | 11223344 | 1FFF
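
A minimal C sketch of the split for the two addresses above (written
out in full hex as 0xFFFFFC and 0x00FFFC): both map to set 0x1FFF
but carry different tags, so a two-way set can hold both at once.

#include <stdio.h>

int main(void) {
    unsigned addrs[] = { 0xFFFFFC, 0x00FFFC };

    for (int i = 0; i < 2; i++) {
        unsigned a    = addrs[i];
        unsigned word = a & 0x3;            /* bits 1..0:   2-bit word */
        unsigned set  = (a >> 2) & 0x1FFF;  /* bits 14..2:  13-bit set */
        unsigned tag  = (a >> 15) & 0x1FF;  /* bits 23..15: 9-bit tag  */
        printf("addr=0x%06X tag=0x%03X set=0x%04X word=%u\n",
               a, tag, set, word);
    }
    return 0;
}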


Set Associative Mapping Summary

Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
Number of lines in set = k
Number of sets = v = 2^d
Number of lines in cache = kv = k * 2^d
Size of tag = (s - d) bits


Replacement Algorithms

There must be a method for selecting which line in the cache
is going to be replaced when there's no room for a new line
Direct mapping
There is no need for a replacement algorithm with direct
mapping
Each block only maps to one line
Replace that line


Replacement Algorithms (2)


Associative & Set Associative
Hardware implemented algorithm (for speed)
Least recently used (LRU)
Replace the block that hasn't been touched in the longest
period of time
Two-way set associative simply uses a USE bit: when one block
is referenced, its USE bit is set while its partner in the set
is cleared (see the sketch after this list)
First in first out (FIFO)
Replace the block that has been in the cache longest
Least frequently used (LFU)
Replace the block which has had the fewest hits
Random
Only slightly lower performance than the use-based algorithms
LRU, FIFO, and LFU
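
A minimal C sketch of the USE-bit scheme for one two-way set (the
names touch and victim are mine):

#include <stdio.h>
#include <stdbool.h>

/* One USE bit per line is enough for a two-way set. */
typedef struct { bool use[2]; } set2_t;

/* Referencing a line sets its USE bit and clears its partner's. */
static void touch(set2_t *s, int line) {
    s->use[line]     = true;
    s->use[1 - line] = false;
}

/* The replacement victim is the line whose USE bit is clear. */
static int victim(const set2_t *s) {
    return s->use[0] ? 1 : 0;
}

int main(void) {
    set2_t s = { { false, false } };
    touch(&s, 0);                              /* line 0 referenced */
    printf("replace line %d\n", victim(&s));   /* prints 1          */
    return 0;
}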


Write Policy

Must not overwrite a cache block unless main memory is up
to date
Two main problems:
If cache is written to, main memory is invalid; if main
memory is written to, cache is invalid
Can occur if I/O can address main memory directly
Multiple CPUs may have individual caches; once one cache is
written to, the copies in the other caches become invalid


Write through

All writes go to main memory as well as cache
Multiple CPUs can monitor main memory traffic to keep local
(to CPU) cache up to date
Lots of traffic
Slows down writes


Write back

Updates initially made in cache only
Update bit for cache slot is set when update occurs
If block is to be replaced, write to main memory only if
update bit is set
Other caches get out of sync
I/O must access main memory through cache
Research shows that 15% of memory references are writes
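
The update-bit mechanics can be sketched in a few lines of C (the
type and function names are mine, and write_block_to_memory is a
stand-in for the real memory write):

#include <stdio.h>
#include <stdbool.h>

typedef struct {
    unsigned tag;
    unsigned data[4];
    bool     valid;
    bool     dirty;   /* the "update bit" described above */
} line_t;

/* Stand-in for the actual write of a block to main memory. */
static void write_block_to_memory(const line_t *l) {
    printf("flushing dirty block, tag=0x%X\n", l->tag);
}

/* A write touches only the cache and sets the update bit. */
static void write_word(line_t *l, int word, unsigned value) {
    l->data[word] = value;
    l->dirty = true;          /* main memory is now out of date */
}

/* Main memory is written only when a dirty line is replaced. */
static void evict(line_t *l) {
    if (l->valid && l->dirty)
        write_block_to_memory(l);
    l->valid = false;
    l->dirty = false;
}

int main(void) {
    line_t l = { .tag = 0xAB, .valid = true, .dirty = false };
    write_word(&l, 0, 42);
    evict(&l);    /* flushes, because the update bit was set */
    return 0;
}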


Multiple Processors/Multiple Caches

Even if a write through policy is used, other processors may
have invalid data in their caches
In other words, if a processor updates its cache and updates
main memory, a second processor may have been using the
same data in its own cache, which is now invalid.


Solutions to Prevent Problems with Multiprocessor/Cache Systems

Bus watching with write through
Each cache controller watches the bus to see if data it contains
is being written to main memory by another processor. All
processors must be using the write through policy
Hardware transparency
a “big brother” watches all caches, and upon seeing an update
to any processor’s cache, it updates main memory AND all of
the caches
Noncacheable memory
Any shared memory (identified with a chip select) may not be
cached.


Line Size

There is a relationship between line size (i.e., the number of
words in a line in the cache) and hit ratios
As the line size (block size) goes up, the hit ratio can go up,
since more words are available to exploit the principle of
locality of reference
As block size increases, however, the number of blocks that fit
in the cache goes down, and the hit ratio will begin to go back
down after a while
Lastly, as the block size increases, the chance of a hit to a
word farther from the initially referenced word goes down
No definitive optimum value has been found


Multi-Level Caches

Increases in transistor densities have allowed caches to be
placed inside the processor chip
Internal caches have very short wires (within the chip itself)
and are therefore quite fast, even faster than any zero
wait-state memory access outside of the chip
This means that a super fast internal cache (level 1) can sit
inside the chip, while an external cache (level 2) still provides
faster access than main memory


Unified versus Split Caches

One cache for data and instructions (unified) or two, one for
data and one for instructions (split)
Advantages of unified cache
Higher hit rate
Balances load of instruction and data fetch
Only one cache to design & implement
Advantages of split cache
Eliminates cache contention between instruction fetch/decode
unit and execution unit
Important in pipelining


Intel x86 caches

80386 – no on-chip cache
80486 – 8 kByte on-chip cache using 16 byte lines and four-way
set associative organization (main memory had 32 address lines – 4 GByte)
Pentium (all versions)
Two on chip L1 caches
Data & instructions
Pentium III – L3 cache added off chip


Pentium 4 L1 and L2 Caches

L1 cache
8k bytes
64 byte lines
Four way set associative
L2 cache
Feeding both L1 caches
256k
128 byte lines
8 way set associative


Intel Cache Evolution

Problem: External memory slower than the system bus.
Solution: Add external cache using faster memory technology. (386)

Problem: Increased processor speed results in the external bus
becoming a bottleneck for cache access.
Solution: Move the external cache on-chip, operating at the same
speed as the processor. (486)

Problem: Internal cache is rather small, due to limited space on chip.
Solution: Add external L2 cache using faster technology than main
memory. (486)

Problem: Contention occurs when both the Instruction Prefetcher and
the Execution Unit simultaneously require access to the cache. In
that case, the Prefetcher is stalled while the Execution Unit's data
access takes place.
Solution: Create separate data and instruction caches. (Pentium)

Problem: Increased processor speed results in the external bus
becoming a bottleneck for L2 cache access.
Solutions: Create a separate back-side bus that runs at a higher
speed than the main (front-side) external bus; the BSB is dedicated
to the L2 cache. (Pentium Pro) Move the L2 cache onto the processor
chip. (Pentium II)

Problem: Some applications deal with massive databases and must have
rapid access to large amounts of data. The on-chip caches are too small.
Solutions: Add external L3 cache. (Pentium III) Move the L3 cache
on-chip. (Pentium 4)


Pentium 4 Block Diagram


Pentium 4 Operation – Core Processor

Fetch/Decode Unit
Fetches instructions from L2 cache
Decode into micro-ops
Store micro-ops in L1 cache
Out of order execution logic
Schedules micro-ops
Based on data dependence and resources
May speculatively execute
Execution units
Execute micro-ops
Data from L1 cache
Results in registers
Memory subsystem – L2 cache and system bus


Pentium 4 Design Reasoning

Decodes instructions into RISC-like micro-ops before L1 cache
Micro-ops fixed length – superscalar pipelining and scheduling
Pentium instructions long & complex
Performance improved by separating decoding from scheduling
& pipelining
Data cache is write back – Can be configured to write through
L1 cache controlled by 2 bits in register
CD = cache disable
NW = not write through
2 instructions to invalidate (flush) cache and write back then
invalidate
L2 and L3 are 8-way set-associative – Line size 128 bytes
