Week 1
Introduction and metrics
Week 2
Metrics and ISA
Week 3
To improve processor performance, we introduce a technique called pipelining. Pipelining splits instruction execution into multiple stages. Ideally, throughput increases by a factor equal to the number of stages.
A pipeline usually has 5 stages: IF, ID, EXE, MEM, WB.
- Instruction Fetch (IF): fetch the instruction from memory and update PC = PC + 4.
- Instruction Decode (ID): read source registers from the register file and sign-extend the immediate value.
- Execution (EXE): execute the instruction using the first source register and either the second source register or the immediate value. Branch conditions/targets can also be computed in this stage.
- Memory Access (MEM): perform the load or store.
- Write Back (WB): write the result back to the register file.
Each stage has a pipeline buffer that passes information to the next stage. These buffers are driven by control signals.
One problem for pipelining is balancing the delay of each stage, since the slowest stage (longest circuit delay) determines the clock period.
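A rough illustration of the two points above, with made-up stage delays (the numbers are assumptions, not from the notes): the slowest stage sets the pipelined clock period, and the ideal speedup only equals the number of stages when the stages are balanced.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical stage delays in picoseconds (illustrative numbers only). */
    int stage_ps[5] = {200, 250, 300, 280, 220};   /* IF, ID, EXE, MEM, WB */
    int n = 5, total = 0, max = 0;

    for (int i = 0; i < n; i++) {
        total += stage_ps[i];
        if (stage_ps[i] > max) max = stage_ps[i];
    }

    /* Unpipelined: one instruction every 'total' ps.
       Pipelined: clock period = slowest stage; ideally one instruction per cycle. */
    printf("unpipelined period = %d ps, pipelined period = %d ps\n", total, max);
    printf("ideal speedup = %.2f (equals #stages only if stages are balanced)\n",
           (double)total / max);
    return 0;
}
```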
Week 4
Pipeline Hazards are events that restrict the pipeline flow.
- Structural Hazard: resource conflicts. For instance, a processor with a single memory unit can have a structural hazard when an instruction fetch and a memory instruction execute at the same time.
- Data Hazard: an instruction depends on the result of an earlier instruction that is still in the pipeline.
- Control Hazard: the next PC is unknown until a branch or jump is resolved.
Static branch predictor: fixed prediction (e.g., always predict not-taken).
Week 5
Scoreboarding is a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependences. Instructions are still issued in order.
Scoreboarding limitations
- Structural hazard: the functional unit needed by the current instruction is busy.
- WAW, RAW, and WAR hazards are all resolved with stalls.
- Operands are only read when both are available.
Main idea:
- Split ID into two stages:
- Issue: decode the instruction, check for structural hazards and WAW hazards.
- Read operands: wait until there is no RAW hazard, then read the operands from the registers.
- Execution
- Write Back: check for WAR hazards.
Week 6
Tomasulo’s Algorithm
- Reservation stations: a buffer in front of every functional unit, so the processor doesn't stall at issue on a structural hazard.
- Register renaming: operands are obtained via tags/reservation stations rather than only from the register file. The renaming information is stored in the rename table.
- A common data bus (CDB) connects every reservation station and broadcasts results.
Tomasulo limitations
- Branches stall execution: basic Tomasulo does not support branch speculation; it waits until the branch is resolved.
- Loads and stores are performed in order.
How can we support branch prediction? Speculative execution.
Multi-issue processors
- Superscalar: instructions are chosen dynamically by the hardware.
- VLIW: instructions are chosen statically by the compiler (e.g., Intel Itanium).
With Tomasulo's algorithm, we cannot tell which instructions come after a branch due to out-of-order execution.
- Identify instructions after the branch.
- Exceptions in speculative code should be buffered before actually being raised.
- Precise exceptions: when an exception is raised, all instructions after the excepting instruction are squashed.
Add a reorder buffer (ROB) to keep track of the original program order when issuing instructions.
Issue in order -> Execute out of order -> Commit in order
Week 7
Without speculation, instructions can only be fetched once a branch is resolved.
Why do we need reservation stations when we have a reorder buffer?
The reorder buffer holds outputs (results waiting to commit), while reservation stations buffer inputs (operands waiting for execution).
Tomasulo with Hardware Speculation
Issue -> Execute -> Write Result (ROB) -> Commit
Trace cache
Midterm review: all until superscalars
Macro-op fusion: fuses simple instruction combinations into one to reduce instruction count, similar to peephole optimization.
Practical limitations to ILP: programs can only have a certain level of concurrency
Week 9
Midterm review + Midterm
Week 10
Temporal Locality: recently accessed locations have a higher chance of being accessed again.
Spatial Locality: locations near a recent memory access have a higher chance of being accessed.
SRAM: cache. DRAM: main memory.
Cache Block placement
- Fully Associative (one set): a block can go anywhere.
- Lower miss rate.
- Must search the whole cache to find a block.
- Direct Mapped: a block can only go to location (block address) mod (number of blocks in the cache).
- Simplest approach.
- Multiple blocks map to the same location, resulting in a higher miss rate.
- No choice of replacement policy (only one possible location).
- Set Associative: n-way associative; each set can hold at most n blocks.
- Higher level caches: 2- or 4-way common (faster search).
- Lower level caches: 8- to 32-way common.
Cache Block Identification: Tag - Index - Block Offset
- Example: Cache 32 KBytes, 2-way, 64 Bytes per line, Address 32 bits
- 0x000249F0 = (0000 0000 0000 0010 0100 1001 1111 0000)_2
- Block offset = log_2 64 = 6 bits
- Index = log_2(32K / 64 / 2 (2-way)) = 15 - 6 - 1 = 8 bits
- Tag = 32 - 8 - 6 = 18 bits
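A minimal sketch of how these fields would be extracted from the example address above (the constants mirror the bit widths just computed):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Cache from the example: 32 KB, 2-way, 64-byte lines, 32-bit address. */
    const uint32_t addr        = 0x000249F0;
    const int      offset_bits = 6;    /* log2(64)           */
    const int      index_bits  = 8;    /* log2(32K / 64 / 2) */

    uint32_t offset = addr & ((1u << offset_bits) - 1);                 /* low 6 bits  */
    uint32_t index  = (addr >> offset_bits) & ((1u << index_bits) - 1); /* next 8 bits */
    uint32_t tag    = addr >> (offset_bits + index_bits);               /* top 18 bits */

    /* Prints tag=0x9, index=0x27, offset=0x30 for this address. */
    printf("tag=0x%X index=0x%X offset=0x%X\n", tag, index, offset);
    return 0;
}
```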
Eviction Methods: which cache block to evict?
- Random
- Least-recently-used(LRU)
- Not-recently-used (NRU): evict any cache block other than the most-recently-used one.
Inclusive cache
- lower level cache has a copy of every block in higher-level caches.
- pros: in parallel systems, if a block is not present in the lower-level cache, there is no need to search the higher-level caches.
- cons: if a block is evicted from the lower level, it must also be evicted from the higher levels.
Exclusive cache
- each level holds distinct cache blocks.
- pros: efficient use of space, since there are no duplicate cache blocks.
- cons: cache coherence across different processors is more complex.
Average Memory Access Time (AMAT) = Hit time + Miss rate * Miss penalty = Hit rate * Hit time + Miss rate * Miss time, where Miss time = Hit time + Miss penalty.
- Hit time is always paid because, whether or not the block is in the cache, we still need to check the cache.
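A small worked example of the formula for a two-level hierarchy; the hit times and miss rates below are illustrative assumptions, not numbers from the notes.

```c
#include <stdio.h>

int main(void) {
    /* Illustrative numbers only. */
    double l1_hit  = 1.0;    /* cycles                 */
    double l1_miss = 0.05;   /* L1 miss rate           */
    double l2_hit  = 10.0;   /* cycles                 */
    double l2_miss = 0.20;   /* L2 local miss rate     */
    double mem     = 100.0;  /* cycles to main memory  */

    /* AMAT = Hit time + Miss rate * Miss penalty, applied recursively:
       the L1 miss penalty is itself the AMAT of the L2 cache. */
    double l2_amat = l2_hit + l2_miss * mem;       /* 10 + 0.2*100 = 30   */
    double amat    = l1_hit + l1_miss * l2_amat;   /* 1 + 0.05*30  = 2.5  */

    printf("L2 AMAT = %.2f cycles, overall AMAT = %.2f cycles\n", l2_amat, amat);
    return 0;
}
```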
Techniques for reducing hit time
- Victim Cache: stores blocks that are evicted from L1. (Also reduces Miss rate or Miss penalty)
Techniques for reducing miss penalty
- Early restart: request the words in normal order and send the requested word to the processor as soon as it arrives, instead of waiting until the whole cache line is filled.
- Critical word first: request the missed word from memory first and send it to the processor as soon as it arrives.
- Merging write buffer: the CPU only stalls on a write when the write buffer is full. (A later write may overtake an earlier write.)
Cache Miss Types
- Compulsory(Cold) miss: when a block is accessed for the first time.
- Capacity miss: the block was evicted because the cache was full.
- Conflict miss: the block was evicted because its set was full. (Fully associative caches do not have this miss type.)
Reducing Cold Miss Rates
- Large block size
- Prefetch: speculate on future instruction/data accesses and fetch them into the cache. Prefetches should be neither too late nor too early.
- Hardware prefetch: sequential and strided prefetching. A stream buffer (works well for instruction caches) prevents cache pollution.
- Software prefetch: explicit prefetch instructions, often combined with loop unrolling or software pipelining (see the sketch after this list).
- Restricted to loops with array accesses and it’s hard to get right.
- High associativity caches
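A minimal sketch of software prefetching in a loop, assuming a GCC/Clang-style __builtin_prefetch; the prefetch distance is an illustrative tuning parameter.

```c
#include <stddef.h>

/* Sum an array while prefetching ahead. DIST depends on memory latency and
   the work per iteration; the value here is illustrative only. */
double sum_with_prefetch(const double *a, size_t n) {
    const size_t DIST = 16;   /* elements to prefetch ahead */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}
```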
Basic Cache Optimization(+ improvement, - worse)
- Larger block size: + Miss rate, - Miss penalty. Doesn’t affect Hit time and power consumption.
- Bigger cache: + Miss rate(improves capacity misses), - Hit time, - Power.
- High associativity: + Miss rate(improves conflict misses), - Hit time, - Power.
- Multilevel caches: + Miss penalty(data might be found in L2 cache).
- Give priority to read misses over writes: if a read miss occurs while there are writes waiting in the write buffer, the read miss is handled first. + Miss penalty.
- Avoid virtual to physical address translation lookup: + Hit time.
Compiler Optimizations:
- Instruction reordering: reduce conflict misses
- Data reordering: merging two arrays into one array of structs, loop interchange (swap nested loops so memory is accessed sequentially), loop fusion, and blocking (operate on cache-sized blocks of data); see the sketch below.
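A minimal sketch of loop interchange on a row-major C array: the second version walks memory sequentially, which is the effect described above.

```c
#define N 1024
double a[N][N];

/* Poor locality: the inner loop strides by N doubles
   (column-order walk over a row-major array). */
double sum_column_order(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* After loop interchange: the inner loop touches consecutive addresses,
   so each fetched cache line is fully used. */
double sum_row_order(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```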
Week 11
Virtual Memory
- Programs may be too large to fit in physical memory; virtual memory is a method to share physical memory among a large number of processes.
- Each process has its own, full, address space.
- Virtual Memory: pages
- Physical Memory: frames
- Pages that have not been used recently may be swapped to disk (the swap disk).
- A virtual memory miss is a page fault: the page is not in memory, so the OS needs to retrieve it from disk (very slow).
Page Table
- OS maintains a table that maps all virtual pages to physical page frames.
- One page table per process (the page table register points to the page table of the current process) and one for the OS.
- Placement is fully associative: a virtual page can map to any physical frame.
- OS maintains a list of free frames.
Page Table stores info for translating virtual page number to physical page number.
Methods to make Page Tables space-efficient
- Inverted page table: one entry per physical frame, searched through a hash of the virtual page number.
- Hierarchical page table: an n-level page table; only the last level holds the physical frame numbers (see the walk sketch below).
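A minimal sketch of a two-level translation walk under a made-up toy layout (10 + 10 bits of virtual page number, 12-bit pages); the names and structures are assumptions for illustration only.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy 32-bit layout: 10-bit level-1 index, 10-bit level-2 index, 12-bit page offset. */
typedef struct {
    bool     valid;
    uint32_t frame;        /* physical frame number (meaningful at the last level) */
} pte_t;

typedef struct {
    pte_t *l2[1024];       /* level-1 table: each entry points to a level-2 table (or NULL) */
} toy_page_table_t;

/* Translate a virtual address; returns false on a page fault.
   Each level of the walk is one extra memory access before the data access itself. */
bool translate(const toy_page_table_t *pt, uint32_t vaddr, uint32_t *paddr) {
    uint32_t i1     = (vaddr >> 22) & 0x3FF;   /* level-1 index */
    uint32_t i2     = (vaddr >> 12) & 0x3FF;   /* level-2 index */
    uint32_t offset =  vaddr        & 0xFFF;   /* page offset   */

    pte_t *l2 = pt->l2[i1];                    /* memory access #1 */
    if (l2 == NULL) return false;              /* page fault */

    pte_t pte = l2[i2];                        /* memory access #2 */
    if (!pte.valid) return false;              /* page fault */

    *paddr = (pte.frame << 12) | offset;       /* the data access then uses this address */
    return true;
}
```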
Paging means that every memory access involves at least 2 memory accesses: 1. read the page table to get the physical address (one access per level) 2. get the data from the physical address.
What can we do to make paging faster?
Translation Lookaside Buffer
- A fully associative (Content Addressable Memory, CAM) cache of page table entries, looked up in parallel.
- Every entry has several status bits (Valid, R/W, User/Supervisor, Dirty, Accessed), a tag, and data (the physical page number).
Virtually tagged problems
- Synonyms (alias problem): different virtual addresses point to the same physical address.
- A write to copy 1 would not be reflected in copy 2.
- Homonyms: the same virtual address maps to different physical addresses due to process switching.
- Possible solutions: 1. flush the cache on a context switch (increases the miss rate) 2. add the PID to the cache tag.
Methods to address a cache in a virtual-memory system
- Physically Indexed, physically tagged: translation first, increasing L1 hit time.
- Virtually Indexed, virtually tagged: cannot distinguish synonyms/homonyms in cache.
- Virtually Indexed, physically tagged: the L1 cache is indexed with the virtual address, and tags can be checked after translation.
- Physically Indexed, virtually tagged: not practical.
Does physically indexed, physically tagged mean the TLB and the cache have to be accessed sequentially? Not if PageSize >= #Sets * BlockSize: the cache index then comes entirely from the page offset, so cache indexing can start in parallel with the TLB lookup.
Week 12
Motivation for multicores
- Power wall, ILP wall
Parallel architecture = computing model + communication model
- Computing model: organization of cores and how data is processed
- Communication model: how do cores communicate?
- Shared memory: communication is implicit (via loads and stores); synchronization is explicit.
- Message passing: communication is explicit (via messages); synchronization is implicit.
Multicore processors
- Uniform memory access (UMA): physically centralized memory -> Symmetric Multiprocessor(SMP)
- Non-Uniform memory access (NUMA): physically distributed memory -> Distributed Shared-Memory
Communication Model
- Thread communication is done through shared-memory variables
- Explicit data synchronization, done by the developers
The main goal for Cache Coherence is to make caches invisible.
Single Writer, Multiple Readers
- Write Propagation: writes are eventually visible in all processors.
- Write Serialization: writes are in the same order in all processors.
Cache Coherence Protocol: keep track of what processors have copies of what data.
- Invalidate protocols: get rid of data with old values, usually used with write-back caches.
- +: multiple writes to the cache block only require one invalidation.
- +: less bandwidth since it doesn’t need to send new value of the data.
- -: write-back data to memory when evicting a modified block.
- Update protocols: update every cache's copy of the data, usually used with write-through caches.
- +: new value can be re-used without the need to ask for it again.
- +: data can always be read from memory.
- -: possible multiple useless updates.
How can cache coherence protocols be implemented?
- Software coherence: programmer or compiler-controlled
- Hardware coherence: add state bits to cache lines to track state of the line.
- Exclusive state: the block is cached only in this cache and has not been modified, but it can be modified without asking permission (a silent upgrade to the Modified state).
Week 13
Cache Coherence Protocol Implementations
- Snooping: all cache controllers monitor all other caches' activities and maintain the state of their own lines.
- Information is shared on a common bus; the bus does not scale well.
- Each cache has a bus-side controller that monitors all transactions.
- Two types of snooping: 1. write invalidation 2. write update
- Directory: a central control device directly handles all cache activities.
- The directory acts as a serialization point to provide ordering.
MSI Protocol
- Invalid: block is not present. Need to fetch it from memory or other cache.
- Shared: in > 1 caches.
- Modified: in 1 cache. Processor can read/write directly.
MESI Protocol has one more Exclusive state.
Coherence misses: when a block is not in the cache because it was invalidated by a write from another processor.
- Hard to reduce due to communication and sharing of data in parallel application.
- False sharing: processors modify different words of the same cache block but end up invalidating the complete block (see the sketch below).
- False sharing coherence misses increase with larger cache line size.
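A minimal sketch of false sharing and the usual padding fix, assuming a 64-byte cache line; the counter names and the threads-per-field assignment are illustrative.

```c
#include <stdint.h>

/* Two counters updated by two different threads. In the packed layout they
   share one 64-byte cache line, so each write invalidates the other core's
   copy even though the threads never touch the same word (false sharing). */
struct counters_packed {
    uint64_t a;   /* written by thread 0 */
    uint64_t b;   /* written by thread 1 */
};

/* Padding each counter out to its own cache line removes the false sharing. */
struct counters_padded {
    _Alignas(64) uint64_t a;   /* thread 0's line */
    _Alignas(64) uint64_t b;   /* thread 1's line */
};
```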
Problems for snooping on a common shared bus
- When should memory provide data?
- What if we need to Write-back?
- Conflict when processor and bus-side controller check the cache
- State transitions may require several steps
- What to do if there are conflicting requests(race conditions) on the bus?
- Transient states
Problems for snooping with multi-level hierarchies
- Processor interacts with L1 while bus-side controller interacts with L2.
- With an inclusive cache, blocks in L1 (including those in M state) must also be present in L2
- Propagate all transactions to L1
Week 14
The snooping implementation has a bottleneck at the common bus, so we introduce snooping with split-transaction buses.
- Create a buffer in each cache to hold pending transactions.
- Send a negative acknowledgement (NACK) when the buffers are full.
- Snooping with a ring can enforce write serialization via a home node. If there are multiple racing writes, ties are broken by the home node.
Directory contains a line state and sharing bit-vector.
- Line state: invalid(00), shared(01), modified(10)
- Sharing vector: one bit per core indicating which cores have a copy (e.g., with 2 cores, 00 = not cached, 01 = cached by core 0).
Directory operation
- For a write to a block with multiple sharers, the directory must collect all invalidation acknowledgements (ACKs).
- State changes are complex; the directory must also receive ACKs.
Implementation difficulties for directory operation
- Operations have to be serialized locally.
- Operations have to be serialized at directory.
Directory Overhead grows with number of cores.
- Baseline overhead per cache block: (number of cores + 1 dirty bit) / (cache block size * 8 bits). For example, with 64 cores and 64-byte blocks: 65 / 512, roughly 13%.
- Cached Directories
- Limited Pointer Directories
Distributed Directories
- Local, Home, Remote nodes.
Memory Consistency is a specification of the allowed orderings of loads and stores.
- The memory consistency model is governed by 1. the core pipeline (memory reordering) 2. the coherence protocol
Sequential Consistency(SC): 1. Result should be the same in a time-shared multiprocessor 2. Relative order should be maintained in one thread
- Sequential Consistency captures the intuitive expectation of which value each load should really return.
- Threads issue memory operations in program order
- Before issuing next memory operation threads wait until last issued memory operation completes (i.e., performs w.r.t. all other processors)
- A read is allowed to complete only if the matching write (i.e., the one whose value is returned to the read) also completes
Issue: memory operation leaves the processor and becomes visible to the memory subsystem. Performed: memory operation appears to have taken place.
- Performed with respect to processor X: the operation appears to have taken place from X's point of view.
- Globally performed or complete: performed to all processors.
Example of reordering by a merging write buffer: the program issues these writes in order, and the buffer merges writes to nearby addresses:
- write foo(200)
- write A(400)
- write flag(204)
- write bar(404)
foo and flag fall into one buffer entry (addresses 200/204) and A and bar into another (400/404), so foo and flag are written to memory before A and bar; the write to flag overtakes the earlier write to A.
Write serialization: per location; writes to the same location by different processors are seen in the same order by all processors. Write atomicity: across processors; a write becomes visible to all processors at the same time.
In-window Speculation:
- Speculation: read from the cache before committing. If the value hasn't changed by commit time, commit; if it has changed, squash and replay.
- Write prefetching: obtain the block in read-exclusive state out of order or in parallel; the write itself is still performed in program order.
Week 15
Relaxed Memory Consistency Models:
- Total Store Ordering (TSO): relaxes W -> R ordering; a younger read may bypass an older write because of the store buffer.
- Partial Store Ordering (PSO): relaxes W -> R and W -> W.
- Relaxed Memory Ordering (RMO): relaxes all four memory orders.
- Release Consistency (RC): relaxes all four memory orders but provides release stores and acquire loads (see the sketch below).
- Reads and writes are allowed to bypass both reads and writes.
- Previous reads and writes must complete before release completes.
- No reads and writes can complete before acquire completes.
- IBM Power: relaxes all four memory orders and write atomicity. Provides 2 types of barriers.
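A minimal sketch of the release/acquire idiom using C11 atomics (the data/ready names are illustrative): the release store cannot complete before the earlier write to data, and the reader's later load cannot be reordered above the acquire.

```c
#include <stdatomic.h>
#include <stdbool.h>

int data;                        /* ordinary shared data                 */
atomic_bool ready = false;       /* flag used as the release/acquire pair */

/* Producer: the release store to 'ready' is ordered after the write to data. */
void producer(void) {
    data = 42;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: once the acquire load sees 'ready', the read of data below
   is guaranteed to observe 42 (it cannot move above the acquire). */
int consumer(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                        /* spin until the flag is set */
    return data;
}
```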
Every relaxed consistency model still preserves single-thread dependencies.
Release Consistency:
- Writer-initiated invalidation
- Without Writer-initiated invalidation
Out-of-thin-air problem
Progress Axiom: a store must eventually become visible to all processors.
Synchronization is necessary to ensure that operations in a parallel program happen in the correct order.
- Condition variables: signaling
- Mutual exclusion
Without a memory consistency model, we cannot correctly implement these types of synchronization.
Can Sequential Consistency implement mutual exclusion?
- Yes: Peterson's algorithm (see the sketch below), though it is not practical for many processors.
- Relaxed models need fences to make it work.
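A minimal sketch of Peterson's algorithm for two threads. It is only correct under sequential consistency; on relaxed models the flag/turn accesses would need fences or sequentially consistent atomics.

```c
#include <stdbool.h>

/* Shared state for two threads, id 0 and id 1 (SC is assumed). */
volatile bool flag[2] = {false, false};   /* flag[i]: thread i wants the lock */
volatile int  turn    = 0;                /* whose turn it is to wait         */

void lock(int id) {
    int other = 1 - id;
    flag[id] = true;          /* announce intent               */
    turn = other;             /* give priority to the other    */
    /* Wait while the other thread wants in and it is its turn. */
    while (flag[other] && turn == other)
        ;                     /* spin */
}

void unlock(int id) {
    flag[id] = false;
}
```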
Building blocks for synchronization: special atomic read-modify-write (RMW) instructions provided by the hardware to implement locks (see the spinlock sketch after this list).
- Test & set: atomically reads a memory location and sets it to 1.
- Compare & swap: the most frequently used; atomically checks the value and swaps in a new one if it equals the expected value.
- Fetch & add: atomically fetches the value from a memory location and increments it.
- Load Linked (or Load Locked) / Store Conditional (LL/SC): the store succeeds only if the location has not been written since the load; no need to retain exclusive access in between.
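A minimal sketch of a test-and-set spinlock built on C11 atomics; atomic_exchange plays the role of the hardware test & set, and the acquire/release orderings give the fence-like behavior mentioned in Week 16.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

void spin_lock(spinlock_t *l) {
    /* test & set: atomically write 1 and look at the old value;
       keep retrying while someone else already holds the lock. */
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 1)
        ;   /* spin */
}

void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```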
Week 16
Implementation of Read-Modify-Write (RMW) instructions:
- Lock the bus: disallow other processors' memory operations during the RMW.
- Cache line blocking: obtain the line in read-exclusive state, invalidating other processors' copies. While a processor holds exclusive access, other processors' requests receive a NACK from it.
Exclusive Access
- With a directory, the requester retries when it receives a NACK
- With snooping, processors retry when they receive a NACK
An RMW acts like a memory fence (the write buffer is flushed before the RMW).
Techniques for reducing test & set traffic (see the sketch below):
- Test-and-test-&-set relies on cache coherence: spin on an ordinary cached read and only attempt the atomic test & set when the lock looks free, so the bus is only used when actually grabbing the lock.
- Test & set with exponential back-off: retry the test & set only after an increasing pause.
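A minimal sketch combining both ideas on top of the spinlock above; the back-off constants and the crude delay loop are illustrative assumptions.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } ttas_lock_t;   /* 0 = free, 1 = held */

void ttas_lock(ttas_lock_t *l) {
    unsigned backoff = 1;                      /* illustrative starting delay */
    for (;;) {
        /* "test": spin on a plain read of the locally cached copy,
           generating no bus traffic while the lock is held. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) == 1)
            ;
        /* "test & set": one atomic attempt once the lock looks free. */
        if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
            return;                            /* acquired */
        /* Lost the race: exponential back-off before trying again. */
        for (volatile unsigned i = 0; i < backoff; i++)
            ;                                  /* crude delay loop */
        if (backoff < (1u << 16)) backoff <<= 1;
    }
}

void ttas_unlock(ttas_lock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```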