Week 1
Introduction and metrics
Week 2
Metrics and ISA
Week 3
To improve processor performance, we introduce a technique called pipelining. Pipelining splits instruction execution into multiple stages. Ideally, throughput increases by a factor equal to the number of stages.
A pipeline usually has 5 stages: IF, ID, EXE, MEM, WB.
- Instruction Fetch (IF): fetch the instruction from memory and update PC = PC + 4.
- Instruction Decode (ID): read source registers from the register file and sign-extend the immediate value.
- Execution (EXE): execute the instruction using the first source register and either the second source register or the immediate value. Branch conditions/targets can also be computed in this stage.
- Memory Access (MEM): perform the load or store.
- Write Back (WB): write the result back to the register file.
Each stage has a pipeline buffer that passes information to the next stage. These buffers are driven by control signals.
One problem for pipelining is balancing the delay of each stage, since the slowest stage (longest circuit delay) determines the clock period.
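A rough illustration of the two points above, with made-up stage delays (the numbers are assumptions, not from the notes): the slowest stage sets the pipelined clock period, and the ideal speedup only equals the number of stages when the stages are balanced.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical stage delays in picoseconds (illustrative numbers only). */
    int stage_ps[5] = {200, 250, 300, 280, 220};   /* IF, ID, EXE, MEM, WB */
    int n = 5, total = 0, max = 0;

    for (int i = 0; i < n; i++) {
        total += stage_ps[i];
        if (stage_ps[i] > max) max = stage_ps[i];
    }

    /* Unpipelined: one instruction every 'total' ps.
       Pipelined: clock period = slowest stage; ideally one instruction per cycle. */
    printf("unpipelined period = %d ps, pipelined period = %d ps\n", total, max);
    printf("ideal speedup = %.2f (equals #stages only if stages are balanced)\n",
           (double)total / max);
    return 0;
}
```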
Week 4
Pipeline Hazards are events that restrict the pipeline flow.
- Structural Hazard: resource conflicts. For instance, a processor with a single memory unit can have a structural hazard when an instruction fetch and a memory instruction execute at the same time.
- Data Hazard: an instruction depends on the result of an earlier instruction that is still in the pipeline.
- Control Hazard: the next PC is unknown until a branch or jump is resolved.
Static branch predictor: fixed prediction (e.g., always predict not-taken).
Week 5
Scoreboarding is a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependences. Instructions are still issued in order.
Scoreboarding limitations
- Structural hazard: the functional unit needed by the current instruction is busy.
- WAW, RAW, and WAR hazards are all resolved with stalls.
- Operands are only read when both are available.
Main idea:
- Split ID into two stages:
- Issue: decode the instruction, check for structural hazards and WAW hazards.
- Read operands: wait until there is no RAW hazard, then read the operands from the registers.
- Execution
- Write Back: check for WAR hazards.
Week 6
Tomasulo’s Algorithm
- Reservation stations: a buffer in front of every functional unit, so the processor doesn't stall at issue on a structural hazard.
- Register renaming: operands are obtained via tags/reservation stations rather than only from the register file. The renaming information is stored in the rename table.
- A common data bus (CDB) connects every reservation station and broadcasts results.
Tomasulo limitations
- Branches stall execution: basic Tomasulo does not support branch speculation; it waits until the branch is resolved.
- Loads and stores are performed in order.
How can we support branch prediction? Speculative execution.
Multi-issue processors
- Superscalar: instructions are chosen dynamically by the hardware.
- VLIW: instructions are chosen statically by the compiler (e.g., Intel Itanium).
With Tomasulo's algorithm, we cannot tell which instructions come after a branch due to out-of-order execution.
- Identify instructions after the branch.
- Exceptions in speculative code should be buffered before actually being raised.
- Precise exceptions: when an exception is raised, all instructions after the excepting instruction are squashed.
Add a reorder buffer (ROB) to keep track of the original program order when issuing instructions.
Issue in order -> Execute out of order -> Commit in order
Week 7
Without speculation, instructions can only be fetched once a branch is resolved.
Why do we need reservation stations when we have a reorder buffer?
The reorder buffer holds outputs (results waiting to commit), while reservation stations buffer inputs (operands waiting for execution).
Tomasulo with Hardware Speculation
Issue -> Execute -> Write Result (ROB) -> Commit
Trace cache
Midterm review: all until superscalars
Macro-op fusion: fuses simple instruction combinations into one to reduce instruction count, similar to peephole optimization.
Practical limitations to ILP: programs can only have a certain level of concurrency
Week 9
Midterm review + Midterm
Week 10
Temporal Locality: recently accessed locations have a higher chance of being accessed again.
Spatial Locality: locations near a recent memory access have a higher chance of being accessed.
SRAM: cache. DRAM: main memory.
Cache Block placement
- Fully Associative (one set): a block can go anywhere.
- Lower miss rate.
- Must search the whole cache to find a block.
- Direct Mapped: a block can only go to location (block address) mod (number of blocks in the cache).
- Simplest approach.
- Multiple blocks map to the same location, resulting in a higher miss rate.
- No choice of replacement policy (only one possible location).
- Set Associative: n-way associative; each set can hold at most n blocks.
- Higher level caches: 2- or 4-way common (faster search).
- Lower level caches: 8- to 32-way common.
Cache Block Identification: Tag - Index - Block Offset
- Example: Cache 32 KBytes, 2-way, 64 Bytes per line, Address 32 bits
- 0x000249F0 = (0000 0000 0000 0010 0100 1001 1111 0000)_2
- Block offset = log_2 64 = 6 bits
- Index = log_2(32K / 64 / 2 (2-way)) = 15 - 6 - 1 = 8 bits
- Tag = 32 - 8 - 6 = 18 bits
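A minimal sketch of how these fields would be extracted from the example address above (the constants mirror the bit widths just computed):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Cache from the example: 32 KB, 2-way, 64-byte lines, 32-bit address. */
    const uint32_t addr        = 0x000249F0;
    const int      offset_bits = 6;    /* log2(64)           */
    const int      index_bits  = 8;    /* log2(32K / 64 / 2) */

    uint32_t offset = addr & ((1u << offset_bits) - 1);                 /* low 6 bits  */
    uint32_t index  = (addr >> offset_bits) & ((1u << index_bits) - 1); /* next 8 bits */
    uint32_t tag    = addr >> (offset_bits + index_bits);               /* top 18 bits */

    /* Prints tag=0x9, index=0x27, offset=0x30 for this address. */
    printf("tag=0x%X index=0x%X offset=0x%X\n", tag, index, offset);
    return 0;
}
```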
Eviction Methods: which cache block to evict?
- Random
- Least-recently-used(LRU)
- Not-recently-used (NRU): evict any cache block other than the most-recently-used one.
Inclusive cache
- lower level cache has a copy of every block in higher-level caches.
- pros: in parallel systems, if a block is not present in the lower-level cache, there is no need to search the higher-level caches.
- cons: if a block is evicted from the lower level, it must also be evicted from the higher levels.
Exclusive cache
- each level holds distinct cache blocks.
- pros: efficient use of space, since there are no duplicate cache blocks.
- cons: cache coherence across different processors is more complex.
Average Memory Access Time (AMAT) = Hit time + Miss rate * Miss penalty = Hit rate * Hit time + Miss rate * Miss time, where Miss time = Hit time + Miss penalty.
- Hit time is always paid because, whether or not the block is in the cache, we still need to check the cache.
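A small worked example of the formula for a two-level hierarchy; the hit times and miss rates below are illustrative assumptions, not numbers from the notes.

```c
#include <stdio.h>

int main(void) {
    /* Illustrative numbers only. */
    double l1_hit  = 1.0;    /* cycles                 */
    double l1_miss = 0.05;   /* L1 miss rate           */
    double l2_hit  = 10.0;   /* cycles                 */
    double l2_miss = 0.20;   /* L2 local miss rate     */
    double mem     = 100.0;  /* cycles to main memory  */

    /* AMAT = Hit time + Miss rate * Miss penalty, applied recursively:
       the L1 miss penalty is itself the AMAT of the L2 cache. */
    double l2_amat = l2_hit + l2_miss * mem;       /* 10 + 0.2*100 = 30   */
    double amat    = l1_hit + l1_miss * l2_amat;   /* 1 + 0.05*30  = 2.5  */

    printf("L2 AMAT = %.2f cycles, overall AMAT = %.2f cycles\n", l2_amat, amat);
    return 0;
}
```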
Techniques for reducing hit time
- Victim Cache: stores blocks that are evicted from L1. (Also reduces Miss rate or Miss penalty)
Techniques for reducing miss penalty
- Early restart: request the words in normal order and send the requested word to the processor as soon as it arrives, instead of waiting until the whole cache line is filled.
- Critical word first: request the missed word from memory first and send it to the processor as soon as it arrives.
- Merging write buffer: the CPU only stalls on a write when the write buffer is full. (A later write may overtake an earlier write.)
Cache Miss Types
- Compulsory(Cold) miss: when a block is accessed for the first time.
- Capacity miss: the block was evicted because the cache was full.
- Conflict miss: the block was evicted because its set was full. (Fully associative caches do not have this miss type.)
Reducing Cold Miss Rates
- Large block size
- Prefetch: speculate on future instruction/data accesses and fetch them into the cache. Prefetches should be neither too late nor too early.
- Hardware prefetch: sequential and strided prefetching. A stream buffer (works well for instruction caches) prevents cache pollution.
- Software prefetch: explicit prefetch instructions, often combined with loop unrolling or software pipelining (see the sketch after this list).
- Restricted to loops with array accesses and it’s hard to get right.
- High associativity caches
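A minimal sketch of software prefetching in a loop, assuming a GCC/Clang-style __builtin_prefetch; the prefetch distance is an illustrative tuning parameter.

```c
#include <stddef.h>

/* Sum an array while prefetching ahead. DIST depends on memory latency and
   the work per iteration; the value here is illustrative only. */
double sum_with_prefetch(const double *a, size_t n) {
    const size_t DIST = 16;   /* elements to prefetch ahead */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}
```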
Basic Cache Optimization(+ improvement, - worse)
- Larger block size: + Miss rate, - Miss penalty. Doesn’t affect Hit time and power consumption.
- Bigger cache: + Miss rate(improves capacity misses), - Hit time, - Power.
- High associativity: + Miss rate(improves conflict misses), - Hit time, - Power.
- Multilevel caches: + Miss penalty(data might be found in L2 cache).
- Give priority to read misses over writes: if a read miss occurs while there are writes waiting in the write buffer, the read miss is handled first. + Miss penalty.
- Avoid virtual to physical address translation lookup: + Hit time.
Compiler Optimizations:
- Instruction reordering: reduce conflict misses
- Data reordering: merging two arrays into one array of structs, loop interchange (swap nested loops so memory is accessed sequentially), loop fusion, and blocking (operate on cache-sized blocks of data); see the sketch below.
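A minimal sketch of loop interchange on a row-major C array: the second version walks memory sequentially, which is the effect described above.

```c
#define N 1024
double a[N][N];

/* Poor locality: the inner loop strides by N doubles
   (column-order walk over a row-major array). */
double sum_column_order(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* After loop interchange: the inner loop touches consecutive addresses,
   so each fetched cache line is fully used. */
double sum_row_order(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```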
Week 11
Virtual Memory
- Programs may be too large to fit in physical memory; virtual memory is a method to share physical memory among a large number of processes.
- Each process has its own, full, address space.
- Virtual Memory: pages
- Physical Memory: frames
- Pages that have not been used recently may be swapped to disk (the swap disk).
- A virtual memory miss is a page fault: the page is not in memory, so the OS needs to retrieve it from disk (very slow).
Page Table
- OS maintains a table that maps all virtual pages to physical page frames.
- One page table per process (the page table register points to the page table of the current process) and one for the OS.
- Placement is fully associative: a virtual page can map to any physical frame.
- OS maintains a list of free frames.
Page Table stores info for translating virtual page number to physical page number.
Methods to make Page Tables space-efficient
- Inverted page table: one entry per physical frame, searched through a hash of the virtual page number.
- Hierarchical page table: an n-level page table; only the last level holds the physical frame numbers (see the walk sketch below).
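A minimal sketch of a two-level translation walk under a made-up toy layout (10 + 10 bits of virtual page number, 12-bit pages); the names and structures are assumptions for illustration only.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy 32-bit layout: 10-bit level-1 index, 10-bit level-2 index, 12-bit page offset. */
typedef struct {
    bool     valid;
    uint32_t frame;        /* physical frame number (meaningful at the last level) */
} pte_t;

typedef struct {
    pte_t *l2[1024];       /* level-1 table: each entry points to a level-2 table (or NULL) */
} toy_page_table_t;

/* Translate a virtual address; returns false on a page fault.
   Each level of the walk is one extra memory access before the data access itself. */
bool translate(const toy_page_table_t *pt, uint32_t vaddr, uint32_t *paddr) {
    uint32_t i1     = (vaddr >> 22) & 0x3FF;   /* level-1 index */
    uint32_t i2     = (vaddr >> 12) & 0x3FF;   /* level-2 index */
    uint32_t offset =  vaddr        & 0xFFF;   /* page offset   */

    pte_t *l2 = pt->l2[i1];                    /* memory access #1 */
    if (l2 == NULL) return false;              /* page fault */

    pte_t pte = l2[i2];                        /* memory access #2 */
    if (!pte.valid) return false;              /* page fault */

    *paddr = (pte.frame << 12) | offset;       /* the data access then uses this address */
    return true;
}
```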
Paging means that every memory access involves at least 2 memory accesses: 1. read the page table to get the physical address (one access per level) 2. get the data from the physical address.
What can we do to make paging faster?
Translation Lookaside Buffer
- A fully associative (Content Addressable Memory, CAM) cache of page table entries, looked up in parallel.
- Every entry has several status bits (Valid, R/W, User/Supervisor, Dirty, Accessed), a tag, and data (the physical page number).
Virtually tagged problems
- Synonyms (alias problem): different virtual addresses point to the same physical address.
- A write to copy 1 would not be reflected in copy 2.
- Homonyms: the same virtual address maps to different physical addresses due to process switching.
- Possible solutions: 1. flush the cache on a context switch (increases the miss rate) 2. add the PID to the cache tag.
Methods to address a cache in a virtual-memory system
- Physically Indexed, physically tagged: translation first, increasing L1 hit time.
- Virtually Indexed, virtually tagged: cannot distinguish synonyms/homonyms in cache.
- Virtually Indexed, physically tagged: the L1 cache is indexed with the virtual address, and tags can be checked after translation.
- Physically Indexed, virtually tagged: not practical.
Does physically indexed, physically tagged mean the TLB and the cache have to be accessed sequentially? Not if PageSize >= #Sets * BlockSize: the cache index then comes entirely from the page offset, so cache indexing can start in parallel with the TLB lookup.
Week 12
Motivation for multicores
- Power wall, ILP wall
Parallel architecture = computing model + communication model
- Computing model: organization of cores and how data is processed
- Communication model: how do cores communicate?
- Shared memory: communication is implicit (via loads and stores); synchronization is explicit.
- Message passing: communication is explicit (via messages); synchronization is implicit.
Multicore processors
- Uniform memory access (UMA): physically centralized memory -> Symmetric Multiprocessor(SMP)
- Non-Uniform memory access (NUMA): physically distributed memory -> Distributed Shared-Memory
Communication Model
- Thread communication is done through shared-memory variables
- Explicit data synchronization, done by the developers
The main goal for Cache Coherence is to make caches invisible.
Single Writer, Multiple Readers
- Write Propagation: writes are eventually visible in all processors.
- Write Serialization: writes are in the same order in all processors.
Cache Coherence Protocol: keep track of what processors have copies of what data.
- Invalidate protocols: get rid of data with old values, usually used with write-back caches.
- +: multiple writes to the cache block only require one invalidation.
- +: less bandwidth since it doesn’t need to send new value of the data.
- -: write-back data to memory when evicting a modified block.
- Update protocols: update every cache's copy of the data, usually used with write-through caches.
- +: new value can be re-used without the need to ask for it again.
- +: data can always be read from memory.
- -: possible multiple useless updates.
How can cache coherence protocols be implemented?
- Software coherence: programmer or compiler-controlled
- Hardware coherence: add state bits to cache lines to track state of the line.
- Exclusive state: the block is cached only in this cache and has not been modified, but it can be modified without asking permission (a silent upgrade to the Modified state).
Week 13
Cache Coherence Protocol Implementations
- Snooping: all cache controllers monitor all other caches' activities and maintain the state of their own lines.
- Information is shared on a common bus; the bus does not scale well.
- Each cache has a bus-side controller that monitors all transactions.
- Two types of snooping: 1. write invalidation 2. write update
- Directory: a central control device directly handles all cache activities.
- The directory acts as a serialization point to provide ordering.
MSI Protocol
- Invalid: block is not present. Need to fetch it from memory or other cache.
- Shared: in > 1 caches.
- Modified: in 1 cache. Processor can read/write directly.
MESI Protocol has one more Exclusive state.
Coherence misses: when a block is not in the cache because it was invalidated by a write from another processor.
- Hard to reduce due to communication and sharing of data in parallel application.
- False sharing: processors modify different words of the same cache block but end up invalidating the complete block (see the sketch below).
- False sharing coherence misses increase with larger cache line size.
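A minimal sketch of false sharing and the usual padding fix, assuming a 64-byte cache line; the counter names and the threads-per-field assignment are illustrative.

```c
#include <stdint.h>

/* Two counters updated by two different threads. In the packed layout they
   share one 64-byte cache line, so each write invalidates the other core's
   copy even though the threads never touch the same word (false sharing). */
struct counters_packed {
    uint64_t a;   /* written by thread 0 */
    uint64_t b;   /* written by thread 1 */
};

/* Padding each counter out to its own cache line removes the false sharing. */
struct counters_padded {
    _Alignas(64) uint64_t a;   /* thread 0's line */
    _Alignas(64) uint64_t b;   /* thread 1's line */
};
```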
Problems for snooping on a common shared bus
- When should memory provide data?
- What if we need to Write-back?
- Conflict when processor and bus-side controller check the cache
- State transitions may require several steps
- What to do if there are conflicting requests(race conditions) on the bus?
- Transient states
Problems for snooping with multi-level hierarchies
- Processor interacts with L1 while bus-side controller interacts with L2.
- With an inclusive cache, blocks in L1 (including those in M state) must also be present in L2
- Propagate all transactions to L1
Week 14
The snooping implementation has a bottleneck at the common bus, so we introduce snooping with split-transaction buses.
- Create a buffer in each cache to hold pending transactions.
- Send a negative acknowledgement (NACK) when the buffers are full.
- Snooping with a ring can enforce write serialization via a home node. If there are multiple racing writes, ties are broken by the home node.
Directory contains a line state and sharing bit-vector.
- Line state: invalid(00), shared(01), modified(10)
- Sharing vector: one bit per core indicating which cores have a copy (e.g., with 2 cores, 00 = not cached, 01 = cached by core 0).
Directory operation
- For a write to a block with multiple sharers, the directory must collect all invalidation acknowledgements (ACKs).
- State changes are complex; the directory must also receive ACKs.
Implementation difficulties for directory operation
- Operations have to be serialized locally.
- Operations have to be serialized at directory.
Directory Overhead grows with number of cores.
- Baseline overhead per cache block: (number of cores + 1 dirty bit) / (cache block size * 8 bits). For example, with 64 cores and 64-byte blocks: 65 / 512, roughly 13%.
- Cached Directories
- Limited Pointer Directories
Distributed Directories
- Local, Home, Remote nodes.
Memory Consistency is a specification of the allowed orderings of loads and stores.
- The memory consistency model is governed by 1. the core pipeline (memory reordering) 2. the coherence protocol
Sequential Consistency(SC): 1. Result should be the same in a time-shared multiprocessor 2. Relative order should be maintained in one thread
- Sequential Consistency captures the intuitive expectation of which value each load should really return.
- Threads issue memory operations in program order
- Before issuing next memory operation threads wait until last issued memory operation completes (i.e., performs w.r.t. all other processors)
- A read is allowed to complete only if the matching write (i.e., the one whose value is returned to the read) also completes
Issue: memory operation leaves the processor and becomes visible to the memory subsystem. Performed: memory operation appears to have taken place.
- Performed with respect to processor X: the operation appears to have taken place from X's point of view.
- Globally performed or complete: performed to all processors.
Example of reordering by a merging write buffer: the program issues these writes in order, and the buffer merges writes to nearby addresses:
- write foo(200)
- write A(400)
- write flag(204)
- write bar(404)
foo and flag fall into one buffer entry (addresses 200/204) and A and bar into another (400/404), so foo and flag are written to memory before A and bar; the write to flag overtakes the earlier write to A.
Write serialization: per location; writes to the same location by different processors are seen in the same order by all processors. Write atomicity: across processors; a write becomes visible to all processors at the same time.
In-window Speculation:
- Speculation: read from the cache before committing. If the value hasn't changed by commit time, commit; if it has changed, squash and replay.
- Write prefetching: obtain the block in read-exclusive state out of order or in parallel; the write itself is still performed in program order.
Week 15
Relaxed Memory Consistency Models:
- Total Store Ordering (TSO): relaxes W -> R ordering; a younger read may bypass an older write because of the store buffer.
- Partial Store Ordering (PSO): relaxes W -> R and W -> W.
- Relaxed Memory Ordering (RMO): relaxes all four memory orders.
- Release Consistency (RC): relaxes all four memory orders but provides release stores and acquire loads (see the sketch below).
- Reads and writes are allowed to bypass both reads and writes.
- Previous reads and writes must complete before release completes.
- No reads and writes can complete before acquire completes.
- IBM Power: relaxes all four memory orders and write atomicity. Provides 2 types of barriers.
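A minimal sketch of the release/acquire idiom using C11 atomics (the data/ready names are illustrative): the release store cannot complete before the earlier write to data, and the reader's later load cannot be reordered above the acquire.

```c
#include <stdatomic.h>
#include <stdbool.h>

int data;                        /* ordinary shared data                 */
atomic_bool ready = false;       /* flag used as the release/acquire pair */

/* Producer: the release store to 'ready' is ordered after the write to data. */
void producer(void) {
    data = 42;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: once the acquire load sees 'ready', the read of data below
   is guaranteed to observe 42 (it cannot move above the acquire). */
int consumer(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                        /* spin until the flag is set */
    return data;
}
```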
Every relaxed consistency model still preserves single-thread dependencies.
Release Consistency:
- Writer-initiated invalidation
- Without Writer-initiated invalidation
Out-of-thin-air problem
Progress Axiom: a store must eventually become visible to all processors.
Synchronization is necessary to ensure that operations in a parallel program happen in the correct order.
- Condition variables: signaling
- Mutual exclusion
Without a memory consistency model, we cannot correctly implement these types of synchronization.
Can Sequential Consistency implement mutual exclusion?
- Yes: Peterson's algorithm (see the sketch below), though it is not practical for many processors.
- Relaxed models need fences to make it work.
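A minimal sketch of Peterson's algorithm for two threads. It is only correct under sequential consistency; on relaxed models the flag/turn accesses would need fences or sequentially consistent atomics.

```c
#include <stdbool.h>

/* Shared state for two threads, id 0 and id 1 (SC is assumed). */
volatile bool flag[2] = {false, false};   /* flag[i]: thread i wants the lock */
volatile int  turn    = 0;                /* whose turn it is to wait         */

void lock(int id) {
    int other = 1 - id;
    flag[id] = true;          /* announce intent               */
    turn = other;             /* give priority to the other    */
    /* Wait while the other thread wants in and it is its turn. */
    while (flag[other] && turn == other)
        ;                     /* spin */
}

void unlock(int id) {
    flag[id] = false;
}
```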
Building blocks for synchronization: special atomic read-modify-write (RMW) instructions provided by the hardware to implement locks (see the spinlock sketch after this list).
- Test & set: atomically reads a memory location and sets it to 1.
- Compare & swap: the most frequently used; atomically checks the value and swaps in a new one if it equals the expected value.
- Fetch & add: atomically fetches the value from a memory location and increments it.
- Load Linked (or Load Locked) / Store Conditional (LL/SC): the store succeeds only if the location has not been written since the load; no need to retain exclusive access in between.
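A minimal sketch of a test-and-set spinlock built on C11 atomics; atomic_exchange plays the role of the hardware test & set, and the acquire/release orderings give the fence-like behavior mentioned in Week 16.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

void spin_lock(spinlock_t *l) {
    /* test & set: atomically write 1 and look at the old value;
       keep retrying while someone else already holds the lock. */
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 1)
        ;   /* spin */
}

void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```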
Week 16
Implementation of Read-Modify-Write (RMW) instructions:
- Lock the bus: disallow other processors' memory operations during the RMW.
- Cache line blocking: obtain the line in read-exclusive state, invalidating other processors' copies. While a processor holds exclusive access, other processors' requests receive a NACK from it.
Exclusive Access
- With a directory, the requester retries when it receives a NACK
- With snooping, processors retry when they receive a NACK
An RMW acts like a memory fence (the write buffer is flushed before the RMW).
Techniques for reducing test & set traffic (see the sketch below):
- Test-and-test-&-set relies on cache coherence: spin on an ordinary cached read and only attempt the atomic test & set when the lock looks free, so the bus is only used when actually grabbing the lock.
- Test & set with exponential back-off: retry the test & set only after an increasing pause.
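A minimal sketch combining both ideas on top of the spinlock above; the back-off constants and the crude delay loop are illustrative assumptions.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } ttas_lock_t;   /* 0 = free, 1 = held */

void ttas_lock(ttas_lock_t *l) {
    unsigned backoff = 1;                      /* illustrative starting delay */
    for (;;) {
        /* "test": spin on a plain read of the locally cached copy,
           generating no bus traffic while the lock is held. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) == 1)
            ;
        /* "test & set": one atomic attempt once the lock looks free. */
        if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
            return;                            /* acquired */
        /* Lost the race: exponential back-off before trying again. */
        for (volatile unsigned i = 0; i < backoff; i++)
            ;                                  /* crude delay loop */
        if (backoff < (1u << 16)) backoff <<= 1;
    }
}

void ttas_unlock(ttas_lock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```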