Fall 2024: CS 6810 Computer Architecture

Week 1

Introduction and metrics

Week 2

Metrics and ISA

Week 3

To improve processor performance, we introduce a technique called pipelining. Pipelining splits instruction execution into multiple stages so that stages of different instructions overlap. Ideally, throughput increases by a factor equal to the number of stages.

A pipeline usually has 5 stages: IF, ID, EXE, MEM, WB.

  1. Instruction Fetch (IF): fetch the instruction from memory and set PC = PC + 4.
  2. Instruction Decode (ID): read registers from the register file and sign-extend the immediate value.
  3. Execution (EXE): execute the instruction using source register 0 and either source register 1 or the immediate value. Branch outcomes can also be computed in this stage.
  4. Memory Access (MEM): perform the memory access for load or store instructions.
  5. Write Back (WB): write the result back to the register file.

Each stage has a buffer (pipeline register) that passes information to the next stage. These buffers are driven by control signals.

One problem for pipelining is balancing the circuit delay of each stage, since the slowest stage determines the clock period.
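As a small sketch of how stage imbalance limits the clock, the following uses made-up stage delays (not from the course) to compute the clock period and the resulting speedup over an unpipelined design:

```python
# Hypothetical stage delays in ns for the 5 stages: IF, ID, EXE, MEM, WB.
stage_delays = {"IF": 0.8, "ID": 0.6, "EXE": 1.0, "MEM": 1.2, "WB": 0.5}

# The slowest stage sets the clock period.
clock_period = max(stage_delays.values())          # 1.2 ns

# Unpipelined: one instruction takes the sum of all stage delays.
unpipelined_period = sum(stage_delays.values())    # 4.1 ns

# Actual speedup is limited by stage imbalance, so it falls short of
# the ideal factor of 5 (the number of stages).
speedup = unpipelined_period / clock_period
print(clock_period, speedup)
```

With perfectly balanced stages the speedup would approach 5; here the 1.2 ns MEM stage drags it down to about 3.4.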

Week 4

Pipeline Hazards are events that restrict the pipeline flow.

  1. Structural Hazard: resource conflicts. For instance, a processor with a single memory unit has a structural hazard when an instruction fetch and a memory instruction need the memory in the same cycle.
  2. Data Hazard: an instruction depends on a result that an earlier instruction has not yet produced (e.g., read-after-write).
  3. Control Hazard: the next instruction to fetch is unknown until a branch is resolved.

Static branch predictor: fixed prediction.

Week 5

Scoreboarding is a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependences. Instructions are still issued in order.

Scoreboarding limitation

Main idea:

Week 6

Tomasulo’s Algorithm

Tomasulo limitation

How can we support branch prediction - speculation execution

Multi-issue processors

With Tomasulo's algorithm, we cannot tell which instructions come after a branch instruction due to out-of-order execution.

  1. Identify instructions after the branch.
  2. Exceptions in speculative code should be buffered before actually being raised.
  3. Precise exception: when an exception is raised, all instructions after the excepting instruction are squashed.

Add a reorder buffer to keep track of the original program order when issuing instructions.

Issue in order -> Execute out of order -> Commit in order
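The issue/execute/commit discipline can be sketched with a minimal reorder buffer model (the entry format and function names here are illustrative, not from the course):

```python
from collections import deque

# Minimal reorder-buffer sketch: instructions issue in order, finish
# execution in any order, but commit strictly in program order.
rob = deque()  # each entry: {"op": name, "done": finished executing?}

def issue(op):
    rob.append({"op": op, "done": False})   # in-order issue (append at tail)

def finish(op):
    for entry in rob:                       # out-of-order completion
        if entry["op"] == op:
            entry["done"] = True
            return

def commit():
    committed = []
    while rob and rob[0]["done"]:           # commit only from the head
        committed.append(rob.popleft()["op"])
    return committed

issue("add"); issue("load"); issue("mul")
finish("mul")                               # mul finishes first...
assert commit() == []                       # ...but cannot commit past "add"
finish("add"); finish("load")
assert commit() == ["add", "load", "mul"]   # in-order commit
```

The head-of-queue check is what makes exceptions precise: an instruction that finished early still waits until everything before it has committed.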

Week 7

An instruction can only be fetched once the preceding branch is resolved.

Why do we need reservation stations when we have a reorder buffer?

The reorder buffer holds outputs (results awaiting commit); reservation stations buffer inputs (operands awaiting execution).

Tomasulo with Hardware Speculation

Issue -> Execute -> Write Result(ROB) -> Commit

Trace cache

Midterm review: all until superscalars

Macro-op fusion: fuses simple instruction combinations into one operation to reduce instruction count, similar to peephole optimization.

Practical limitations to ILP: programs only expose a certain level of concurrency.

Week 9

Midterm review + Midterm

Week 10

Temporal Locality: a recently accessed memory location is likely to be accessed again soon. Spatial Locality: locations near the current memory access are likely to be accessed soon.

SRAM: used for caches. DRAM: used for main memory.

Cache Block placement

Cache Block Identification: Tag - Index - Block Offset
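The Tag - Index - Block Offset split can be computed from the cache geometry; the sizes below are assumed for illustration:

```python
import math

# Splitting a 32-bit address for a hypothetical cache:
# 32 KiB capacity, direct-mapped, 64-byte blocks.
addr_bits  = 32
cache_size = 32 * 1024
block_size = 64
num_sets   = cache_size // block_size       # direct-mapped: 1 block per set

offset_bits = int(math.log2(block_size))    # selects a byte in the block
index_bits  = int(math.log2(num_sets))      # selects the set
tag_bits    = addr_bits - index_bits - offset_bits  # identifies the block
print(tag_bits, index_bits, offset_bits)    # 17 9 6
```

Adding associativity divides the number of sets, shrinking the index and growing the tag by the same number of bits.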

Eviction Methods: which cache block to evict?

Inclusive cache

Average Memory Access Time (AMAT) = Hit time + Miss rate * Miss penalty = Hit rate * Hit time + Miss rate * Miss time, where Miss time = Hit time + Miss penalty.
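A quick check that both forms of the AMAT formula agree, using assumed numbers (1-cycle hit, 5% miss rate, 100-cycle penalty):

```python
# Assumed cache parameters, in cycles.
hit_time, miss_rate, miss_penalty = 1.0, 0.05, 100.0

# Form 1: every access pays the hit time; misses pay the penalty on top.
amat = hit_time + miss_rate * miss_penalty               # 6.0 cycles

# Form 2: weight hits and misses separately; a miss still pays the hit
# check first, so miss time = hit time + miss penalty.
hit_rate  = 1.0 - miss_rate
miss_time = hit_time + miss_penalty
amat2 = hit_rate * hit_time + miss_rate * miss_time      # also 6.0
assert abs(amat - amat2) < 1e-9
```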

Techniques for reducing hit time

Techniques for reducing miss penalty

Cache Miss Types

Reducing Cold Miss Rates

Basic Cache Optimizations (+ improves, - worsens)

Compiler Optimizations:

Week 11

Virtual Memory

Page Table

Page Table stores info for translating virtual page number to physical page number.

Methods to make Page Tables space-efficient

Paging means that every memory access involves 2 memory accesses: 1. read the page table to get the physical address 2. get the data from the physical address.

What can we do to make paging faster?

Translation Lookaside Buffer
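A rough sketch of why the TLB pays off, with assumed latencies and hit rate (not from the course): caching translations avoids the extra page-table access on most references.

```python
# Assumed numbers: 100 ns per memory access, 1 ns TLB lookup, 98% TLB hit rate.
mem_access   = 100
tlb_time     = 1
tlb_hit_rate = 0.98

# Without a TLB: every access = page-table read + data read.
no_tlb = 2 * mem_access                      # 200 ns

# With a TLB: a hit skips the page-table read; a miss still pays both.
with_tlb = (tlb_time
            + tlb_hit_rate * mem_access
            + (1 - tlb_hit_rate) * 2 * mem_access)
print(no_tlb, with_tlb)                      # 200 vs 103 ns
```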

Virtually tagged problems

Methods to address a cache in a virtual-memory system

Does physically indexed, physically tagged mean the TLB and cache have to be accessed sequentially? Not if the cache index fits within the page offset, i.e., #Sets * BlockSize <= PageSize: then the cache can be indexed in parallel with the TLB lookup, and the tag compared once the translation arrives.

Week 12

Motivation for multicores

Parallel architecture = computing model + communication model

Multicore processors

Communication Model

The main goal of Cache Coherence is to make caches invisible: the memory system should behave as if caches were not there.

Single Writer, Multiple Readers: at any time, a block may have one writer or multiple readers, but not both.

Cache Coherence Protocol: keep track of what processors have copies of what data.

How can cache coherence protocols be implemented?

Week 13

Cache Coherence Protocol Implementations

MSI Protocol

MESI Protocol has one more Exclusive state.
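The MSI transitions can be sketched as a small per-line state machine; the event names below are illustrative, and MESI would add an "E" state entered on a read miss when no other cache holds the line:

```python
# Minimal MSI state machine for a single cache line.
# States: "M" (Modified), "S" (Shared), "I" (Invalid).
def msi_next(state, event):
    transitions = {
        ("I", "local_read"):  "S",  # read miss: fetch block, become Shared
        ("I", "local_write"): "M",  # write miss: fetch exclusive copy
        ("S", "local_write"): "M",  # upgrade: invalidate other sharers
        ("S", "bus_write"):   "I",  # another core writes: invalidate
        ("M", "bus_read"):    "S",  # supply dirty data, downgrade to Shared
        ("M", "bus_write"):   "I",  # supply dirty data, then invalidate
    }
    return transitions.get((state, event), state)  # otherwise stay put

state = "I"
state = msi_next(state, "local_read")   # I -> S
state = msi_next(state, "local_write")  # S -> M
state = msi_next(state, "bus_read")     # M -> S (another core reads)
```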

Coherence misses: when a block is not in the cache because it was invalidated by a write from another processor.

Problems for snooping on a common shared bus

Problems for snooping with multi-level hierarchies

Week 14

The snooping implementation has a bottleneck at the common data bus. Thus we introduce snooping with split-transaction buses.

Each directory entry contains a line state and a sharing bit-vector.

Directory operation

Implementation difficulties for directory operation

Directory Overhead grows with number of cores.

Distributed Directories

Memory Consistency is a specification that defines the allowed orderings of loads and stores.

Sequential Consistency (SC): 1. The result should be the same as some execution on a time-shared multiprocessor 2. The relative order of operations within one thread is maintained

Issued: a memory operation leaves the processor and becomes visible to the memory subsystem. Performed: the memory operation appears to have taken place.

Merging write buffer executes memory operations in the following sequence:

foo and flag will be written to memory before A and bar.

Write serialization (per variable): writes to the same location by different processors are seen in the same order by all processors. Write atomicity (across threads): a write becomes visible to all processors at the same time.

In-window Speculation:

Week 15

Relaxed Memory Consistency Models:

Every relaxed consistency model still preserves dependencies within a single thread.

Release Consistency:

Out-of-thin-air problem

Progress Axiom: a store should eventually become visible to all processors.

Synchronization is necessary to ensure that operations in a parallel program happen in the correct order.

Without memory consistency model, we cannot implement different types of synchronization.

Can Sequential Consistency implement mutual exclusion?

Building blocks for synchronization: special hardware instructions (atomic RMW) are used to implement locks.

Week 16

Implementation of Read-Modify-Write (RMW) instructions:

Exclusive Access

RMW acts like a memory fence (the write buffer is flushed before the RMW executes).

Techniques for reducing test-and-set traffic:
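One such technique is test-and-test-and-set: spin on an ordinary read (which hits in the local cache) and only attempt the expensive, invalidation-generating atomic RMW when the lock looks free. This is a sketch in Python, where a `threading.Lock` stands in for the hardware atomic; the class and method names are illustrative.

```python
import threading

class TTASLock:
    """Test-and-test-and-set spin lock sketch."""
    def __init__(self):
        self._held = False
        self._atomic = threading.Lock()  # models the hardware RMW primitive

    def acquire(self):
        while True:
            while self._held:            # "test": cheap read-only spin
                pass
            with self._atomic:           # "test-and-set": atomic attempt
                if not self._held:
                    self._held = True
                    return               # acquired; otherwise spin again

    def release(self):
        self._held = False

# Usage: 4 threads each increment a shared counter 1000 times.
lock = TTASLock()
counter = 0

def work():
    global counter
    for _ in range(1000):
        lock.acquire()
        counter += 1
        lock.release()

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 4000
```

On real hardware the read-only spin matters because it avoids bouncing the lock's cache line between cores on every iteration; only the final atomic attempt generates coherence traffic.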