# **Combinational Logic**

# Basic Building Blocks Transistors Metal (conductor), Oxide (insulator), Semiconductor

 $\begin{array}{ccc} n-type \ transistor: \ con- \\ nection \ source \leftrightarrow drain \Leftrightarrow \\ high \ voltage \\ Gate \ - \left[ \begin{array}{c} Drain \\ Gate \ - \end{array} \right] \\ Gate \ - \left[ \begin{array}{c} Source \\ Gate \ - \end{array} \right] \\ Source \\ Source \\ Gate \ - \end{array} \right] \\ \end{array}$ 

good at pulling down good at pulling up (elec-(eletrons don't flow well) trons flow well)

p-/n-type: opposite high voltage: 0.3V-3Vmodern system: n-&p-type  $\rightarrow$  C(omplementary)MOS Logic Gates





AND-gate/ $Y = A \cdot B = AB$ : NAND + inverter (not impossible: p pull up and n pull down - otherwise 4) — XOR: 1 $\leftrightarrow$ odd input 1

# Latency & Power Consumption

Transistors in serial slower than transistors in parallel (pseudo nMOS to alleviate latency) dynamic power: charge capacitors on signal change

 $(C \cdot V^2 \cdot f$ , capacitence, voltage, frequency) - static power: loss due to drain  $((V \cdot I_{leakage}))$  — energy = power  $\cdot$  time

Boolean Algebra

functional specification of logic gates NOT: $\overline{A}$ , AND: $A \cdot B$ , OR:A + B - ring axioms hold

NO1: A, AND:  $A \cdot B$ , OR: A + B - ring axioms hold duality: switching  $0/1 + i \rightarrow \text{still valid}$ idempotent law:  $X + X = X/X \cdot X = X$ , involution law:  $\overline{(x)} = X$ , law of complementarity:  $x + \overline{x} = 1/x \cdot \overline{x} = 0$ , commutative law:  $x + y = y + x/x \cdot y = y \cdot x$ DeMorgan's law:  $\overline{(X + Y + ...)} = \overline{XY} .../\overline{(XY...)} = \overline{X + Z} + ...$ 

boolean algebra to simplify: better implementation complement, literal: input or its complement, implicant: product of literals, minterm: product that includes all inputs, maxterm: sum that includes all inputs

# standardize function representations

truth table of function unique, equations not S(um)O(f)P(roducts)/D(isjunct)N(ormal)F(orm): sum of minterms with value 1 - may only specify rows:  $m3 + m4 + ... = \sum m(3, 4, ...)$  P(roduct)(O)FS(ums)/C(onjunctive)N(ormal)F(orm): product of maxterms with value 0 - may only specify rows:  $M0 \cdot M1 \cdot ... = \prod M(0, 1, ...)$ 

conversion: x rows  $\sum m(1, 2, ...) = \prod M(..., x-1, x)$ inversion:  $\sum m(1, ...) = \sum m(..., x) = \prod M(1, ...)$ logic simplification/minimization

trial & error - goal: reduce number of gates and inputs, number of transistors  $\frac{1}{2}$   $\frac{1}{$ 

use Unity Theorem:  $F = A\overline{B} + AB = A(\overline{B} + B) = A$ subsets of ON set w single varying variable  $\rightarrow$  eliminate ("don't care", denoted X)

# logic completeness

Anything implemented with {AND,OR,NOT} NAND and NOR also logically complete Combinational Logic

2:4

Decoder

 $Y = \overline{A \oplus B}$ 

 $\frac{\mathbf{decoder}}{\mathbf{decoder}}$ input pattern detector,  $n \to 2^n$  signals, input combination  $\leftrightarrow$  one output

implement logic functions gate-level implementation



**multiplexer** select one input as output based on  $\log_2 N$  controls (output always connected), as lookup tables (LUT) for logic



## full adder

add two bits with carry,  $S_i = A \oplus B \oplus C$ ,  $carry_{i+1} = a_i b_i + a_i \cdot carry_i + b \cdot carry_i$ 



more efficient/specialized carry lookahead adder exists programmable logic array



comparator



## arithmetic logic unit



000

001

010

011

100

101

110

111



# Sequential Logic Design



memory hierarchy: Latches/Flip-Flops, SRAM, DRAM Latches: level-triggered (capture data during entire high of write-enable), Flip-Flop: edge-triggered, captures only on rising (clock) edge

#### storing information Cross-coupled inverter, needs control mechanism



**R-S** Latch

Q data & S, R control - S = 1 = R quiescent (idle), S = 0, R = 1 set, S = 1, R = 0 reset, R = 0 = Sinvalid (osciallation, eventually settle unpredictably)





Memory with Gated D Latch Registers

unique locations in memory, read/written with unique address, x locations have  $\log_2 x$ -bit addresses, address-ability: bits in each location, address space, space of unique locations



| , b, c});                           |
|-------------------------------------|
| \$display(" outputs = %b (%b exp)   |
| ",y,yexpected);                     |
| errors = errors + 1;                |
| end                                 |
| vectornum = vectornum + 1;          |
| if (testvectors[vectornum] === 4'bx |
| ) begin                             |
| \$display("%d tests completed with  |
| %d errors", vectornum,              |
| errors);                            |
| <pre>\$finish;</pre>                |
| end end endmodule                   |

## Timing Verification & Testing

check: timing constraints recognized — high-level sim. with #time — circuit-level after synth.

# **Von Neumann Model**

V.N.M: fundamental computing model: memory(data,instructions),processing unit,input, output, contorl unit — instructions in linear memory array, sequential instruction processing

Von Neumann Model register file, memory, IP exposed to programmer Memory

grouped: bytes (8 bits), words (8,16,32,...) — address space - addressability: byte(MIPS,LC-3b), word(LC-3) — capacity=address space addressability — endianess: big(most-significant addr. last),little(leastsig.addr.last)

#### **Processing Unit**

comprises many F(unctional)U(nits) — ALUs/FUs process words — register file close to compute (word size) Control unit

conducts step-by-step execution of instructions with I(nstruction)R(egister),P(rogram)C(ounter)/I(nstruction) P(ointer) — IP increment by multiple bytes/word/...

#### Dataflow

other execution models exist - instructions fetched/executed in dataflow order/when operants ready - no program counter: notifies receivers, instruction executed when all received (tokens) — inherently more parallel

tradeoffs (from von Neumann difference) - partly integrated internally (usually not exposed to programmer)

# **Instruction Set Architectures (ISA)**

ISA: instruction set (opcodes, data types, addressing modes, length), memory (address space, addressability), register file (size, count)  $\Leftrightarrow$  everything programmer

instructions opcode + operands — three instruction types below — opcodes: different amounts  $\leftrightarrow$ hw/sw complexity — operative: do computation, datamovement: load/store memory, control: change control 'LD R2, 0x...'

| flo | W |  |  |  |
|-----|---|--|--|--|
|     |   |  |  |  |

data types one/several data types: 2's complement integers, unsigned integers, floating point numbers semantic gap hw/sw translator for ISA/semantic change **Addressing Modes** specify location of operands: LC-3 all those, MIPS not indirect but pseudo-direct 'memory[ PC(incremented) + **PC-relative**:

signext(offset) ]' — limit: target close to PC indirect: 'memory[ memory[ PC(incremented) + signext(offset) ]]' - can access everywhere **base+offset**: 'memory[ baseregister + signext(offset) ]' — can access everywhere immediate: hardcoded value register: NOT memory, just specify register address

#### Instructions MIPS

r-type, i-type (immediate), j-type (jump) (, f-type (float)) -r: additional func bits for operation (opcode 0) operative instructions

r-type: op,src,src,dst,shift,operation

| 0      | rs     | rt     | rd     | shamt  | funct  |  |
|--------|--------|--------|--------|--------|--------|--|
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |  |

i-type: op,src/dst(base),dst/src(register),immediate/literal TRAP: OS call (8bit vec)

| opcode | rs     | rt     | imm     |
|--------|--------|--------|---------|
| 6 bits | 5 bits | 5 bits | 16 bits |

#### data movement instructions

i-type: 'sw \$s3, 8(\$s0)' (base+offset), 'sw rt, imm(rs)' — 'lw ...'

read/write with M(emory)D(ata)R(egister)&M(A)ddressR lui (load immediate): 'lui \$s0, 255'/'lui rt, imm', 'ori \$s1, \$s0, 42'/'ori rt, rs, imm'

> MIPS byte-addressable: word x with offset 4x (32bit) control instructions

i-type: BRANCHES 'beq rs, rt, imm' (rs==rt  $\rightarrow$  PC = PC(incremented) + signext(offset) \* 4) — JUMP j-type

| 2      | target |           |  |  |  |  |
|--------|--------|-----------|--|--|--|--|
| 6 bits |        | 26 bits   |  |  |  |  |
|        |        | 11501 003 |  |  |  |  |

'j target' (PC = PC(incremented)[31:28] — targer \* 4), 'jr register', 'jal ...' (function calls)

## LC-3

## operative instructions

ADD, AND (no subtraction)

| 15 14 13 12 | 11 10 9 | 876    | 5 | 4 3 | 2 1 0  |  |
|-------------|---------|--------|---|-----|--------|--|
| OP DR       |         | SR1    | 0 | 00  | SR2    |  |
| 4 bits      | 3 bits  | 3 bits |   |     | 3 bits |  |

NOT: only consider OP. DR. SR1, everything else 1 — 'NOT R3. R5'

| OP     | DR     | SR1       | 1 | imm5   |
|--------|--------|-----------|---|--------|
| 4 bits | 3 bits | ts 3 bits |   | 5 bits |

#### data movement instructions

word-addressability-LD, LDR, LDI, LEA, ST, STR, STI LD/ST: 'memory[PC(incremented) + signext(offset)]' -

| 15 14 13 12 | 11 10 9 | 8 | 7 | 6 | 5  | 4    | 3   | 2 | 1 |
|-------------|---------|---|---|---|----|------|-----|---|---|
| OP          | DR/SR   |   |   | F | PC | offs | set | 9 |   |

#### 3 bits 4 bits 9 bits

LDI (indirect): 'memory[memory[PC(incremented) + signext(offset)]]'

LDR: 'LDR R1, R2, 0x1D', 'memory[BaseR + offset]'

| OP | DR | BaseR | offset6 |
|----|----|-------|---------|
| 6  | 1  | 2     | 0x1D    |

LEA: as LD but 'PC(incremented) + signext(offset)' (w/o memory)

## control instructions BRANCH

update to register: N(egative), Z(ero), P(ositive) condition codes set — BRn, BRz, BRp, Brzp, ... (if one true)

| 0000   | n | z | р | PCoffset9 |
|--------|---|---|---|-----------|
| 4 bits |   |   |   | 9 bits    |

# JUMP 'JMP BaseR'

| 1100   | 000 | BaseR  | 000000 |  |  |
|--------|-----|--------|--------|--|--|
| 4 bits |     | 3 bits |        |  |  |

#### Functions

calling: jal (MIPS), jsr (LC-3) — returning: jr (MIPS), ret (LC-3) — arguments: \$a0-\$a3 (MIPS) (stack \$sp if more) — return value \$v0 (MIPS) — only \$t regs not guaranteed to be preserved (MIPS)

# Microarchitecture/uarch

uarch: hw internal (must not confirm ISA) - uarch can change w/o ISA — uarch follows design points

## Basics

build system (hw(here) or hw with software translation) — instruction proces.: A(rchitectural)S(tate) $\rightarrow$ AS' define finite state machine det. AS' — state trans. atomic for program.

components: datapath (deal w./transform data signals, FU/ALU, reg./mem — hw for flow of data), control logic (hw for control signals, what datapath hw should do) — contorl signals: combinational vs. microcoded

#### **Performance Analysis**

 $C(ycles)P(er)I(instruction) \cdot C(lock)C(ycle)T(time) =$ instruction ex.time — #instructions-average CPI-clock cycle time=program ex.time — CPI: constant(single), varying/shorter(multi)

single: 'longest' instr./critical path det. CCT - determine by cons. all instr.types (control logic matters too)

## Single-cycle uarch

1 instruction/clock cycle — slowest instruction bot.neck many disadvantages: slowest is bot.neck, resource duplication, complex implement., diff.2optimize

#### Multi-cycle uarch

1 instruction/multiple clock cycle — only exe. req. IPCs - CCT not det. by 'slowest' instr. (CCT indep. instr.proc.time) - crit. path design, bread&butter, balanced design — allows for waiting (for memory, ...) adds seq. reg. overhead

 $AS \rightarrow step1 \rightarrow step2 \rightarrow ... \rightarrow AS' - design FSM: (mult.)$ 

states/IPC stage - states def. by control signals control logic implemented as FSM

# **Pipelining**

introduce concurrency to multi-cycle - fully separate stages: diff. instr. in diff. stages - easy to start with single cycle as resource layout more suitable divide IPC into distinct stages - req. suff. res./stage diff. intr./stage steps seq. dependent, consid. dep. between instr., consid. imbalance between stages

have lead-in/-out, full utilization middle

#### Pipelines

ideal: no inter-stage overhead, uniform partitioning uniform partitioning: S = seq.delay/stage. time/stage= $\frac{T}{L}$  + S, R cost/register, cost=G+Rk

# **Pipelining instruction processing**

simple 5 IPC stage stages — imbalanced stages: lower speedup — all must proceed all stages (wasted time) pipeline registers at level between stages (latch on trans.) control signals: same as single-cycle, must latch too/keep in phase with data — option 1: generate once, buffer until consumed - option 2: buffer instruction parts, generate on demand

problems: external (wasted stages)/internal (unbalanced staged) fragmentation, stalls (dependencies)

## Stalls & Dependencies

causes: contention, dependencies, long-latencies stalling: disable PC&IF, keep instr.in stages (add we signal to pipeilne registers f,d,e), synchronous reset ex. pipeline register, bubbles

### **Data Dependencies**

flow (true), anti (write after read), output (write after write)

anti/output: false, always same stage

flow: detect&wait, detect&bypass, detect&eliminate, detect&move, predict, sth.else - 1st half write,2nd read interlocking (detection, correctness guarantee): sw (static scheduling)/hw, (bubbles, find indep.instr.) - sw difficult: unknowns→profiling

1.hw scoreboarding: valid bit/register, stall when invalid 2.hw comb.dep.detection logic

detect&wait stop up-stream stages, drain down-stream detect&bypass forward data as soon as available, allow dependent instr. to execute until data needed, add bypass paths — still stalling when no bypass possible (latency) detect&eliminate compiler, reorder indep.instr., nops fine-grained multithreding instructions in pipeline from different indep. threads - no: add. logic, waste cycles — disady.: low single-thread, hw, contention, need threads - employed in GPUs/accelerators

#### **Control Dependencies**

next instruction depends on control flow instr., next address maybe not directly available

option branch prediction: speculative execution, flush pipeline on wrong prediction — flushing: misprediction penalty

option early branch resolution: check resolution in decode, additional logic/hw, less flushing/branch misprediction penalty

#### Remarks

combination difficult (timing)

#### **Precise exceptions**

idea: indep. instr. exec. on diff. hw (division, add, ...) — older instruction finishes with error after younger, younger's writeback sequentially invalid

exception: program-internal problem (prog. context) — interrupt: external event (syst. context)

upon(excpt. immed., intrpt delayed): stop, save, handle, return — precise AS, flush younger instr., save PC/registers, redirect to handler

precise: when handling, AS consistent (prev. retired, later !retired)

#### ensuring precise exceptions

single-/multi-cycle: easy — instr. for getting exception in handler (mfc0, MIPS)

pipelined: difficult — all worst case latency bad — approaches: reorder buffer (here), history buffer, future register file, checkpointing

R(e)O(rder)B(uffer): buffer out-of-order results, retire in-order — circular buffer — valid flag, destination register id, destination register value, store address, store data, pc, valid bits for registers control bits, exceptions decode: reserve in ROB, complete: write to ROB, ROB: oldest w/o exception complete: retire, restore: flushing ROB

dep. instr.: bypass paths, read from ROB, read from reg.file — ROB requires expensive search/content addressability, avoid w indirection — reg.file stores valid flag, ref.to ROB (access reg.file), maybe R(egister)A(lias)T(able)

# **Out-of-Order Execution**

out-of-order dispatch, out-of-order completion, in-order retirement — internal dataflow execution non-ready instructions in reservation stations, operands of reservation station entries ready: dispatch complete parallel execution: long-latency tolerance hw requirements: consumer-producer link (renaming/tagging), reservation stations, track readiness/broadcast operands, dispatch (schedule), complete, retire

#### Tomasulo's Algorithm

initial OoO execution at IBM by T. — no precise exceptions/ROB initially

decode: add w renamed operands to reservation station, add own output to RAT (stall if no res.station available) — in res.station: monitor C(ommon)D(ata)B(us) for source operands/tags — ready: dispatch to FU — FU completion: broadcast tag & value on CDB, register file also listens, free name tag

req. busses from each FU to all registers/table entries How it works

de-/allocate reservation stations, global/FU reservation station, ROB/res.stations/reg.file, broadcast simultane-ously/serial/...

#### **Avoiding Value Replication**

values replicated in tables: waste — introduce P(hysical)R(egister)F(ile) (centralized value storage) — references to PRF from ROB, ...

works because tags/references smaller than values

#### Remarks

internal dataflow graph limited to instruction window (determined by number of reservation stations) — exploits irregular parallelism¶

#### **Precise Exceptions**

| for ADD | Unit |
|---------|------|
|---------|------|

| Source 1             |       |      | Source 2 |            | _                  |          |     |       |      |       |     |      | Source | 1    |       | Source | 2     |
|----------------------|-------|------|----------|------------|--------------------|----------|-----|-------|------|-------|-----|------|--------|------|-------|--------|-------|
| ' 1                  | ag Va | iue  | V Ta     | ig Valu    | e                  |          |     |       |      |       |     | ٧    | Tag    | Valu | e V   | Tag    | Value |
|                      |       |      |          |            |                    |          |     |       |      |       | ×   |      |        |      |       |        |       |
|                      |       |      |          |            |                    |          |     |       |      |       | v   |      |        |      |       |        |       |
| +                    | -     | +    | -        | -          | _                  | Re       | ord | er Bu | ffer | (ROB) | 2   |      |        |      |       |        |       |
| _                    |       |      |          |            | Entry 0<br>Entry 1 |          |     |       |      |       |     |      |        |      |       |        |       |
| iste                 | Valie | i Ta | o Val    | ue         | Entry 2            | $\vdash$ |     |       | +    |       |     | - 1  | Regis  | ster | Value | 1      |       |
| 81                   | 1     |      | 1        |            | 2.10 / 2           | $\vdash$ |     |       | -    |       |     |      | RJ     | _    | 1     | 1      |       |
| 22                   | 1     |      | 2        |            |                    |          |     |       | 1    |       |     |      | R2     | 2    | 2     | 1      |       |
| в                    | 1     |      | 3        |            |                    |          |     |       |      |       |     |      | R3     | 3    | 3     |        |       |
| ۲4                   | 1     |      | 4        |            |                    |          |     |       | -    |       |     |      | R4     | 1    | 4     |        |       |
| 25                   | 1     |      | 5        |            | Entry 8            | $\vdash$ |     |       | -    |       |     |      | RS     | 5    | 5     |        |       |
| ۲6                   | 1     |      | 6        | i          | Linayo             | $\vdash$ |     |       | +    |       |     |      | Ré     | 5    | 6     |        |       |
| 27                   | 1     |      | 7        |            |                    | $\vdash$ |     | -     | -    |       |     |      | R7     | 7    | 7     |        |       |
| 88                   | 1     |      | 8        |            |                    | $\vdash$ |     | -     | -    |       |     |      | RE     | 3    | 8     |        |       |
| 19                   | 1     |      | 9        |            |                    |          |     | -     | -    |       |     |      | RS     | )    | 9     | 1      |       |
| 10                   | 1     |      | 10       | D          | Entry 13           | 3        |     |       |      |       |     |      | R1     | 0    | 10    |        |       |
| 11                   | 1     |      | 1        | 1          | Entry 1            |          |     |       |      |       |     |      | R1     | 1    | 11    | ]      |       |
| ontend Register File |       |      |          | e Entry 18 | 5                  |          |     |       |      | Arc   | hit | ectu | Iral   | Red  | iste  | er Fi  |       |

ntend Register File

Loads & Stores

static/dynamic dependencies, small/large state space, (in)visible to others

cor1: renaming useless during execute, cor2: dep. determined after (partial) exec., cor3: ready, dep. unknown if others with unknown addr.

scheduling approaches — conservative: wait until certainty — aggressive: assume independence — intelligent — latter 2 req. recovery

find dependency — opt1: all prev. committed — opt2: buffer of pending stores, check for match

**reordering** must buffer stores for seq. semantics — dependency: access buffer on load

L(oad)Q(ueue) & S(tore)Q(ueue), both in one queue possible — complex store-to-load forwarding logic to search for value

# **Dataflow & Superscalar Execution**

**Dataflow** — may directly mapped to hw — diff. on today's hw — diff. with today's concepts(excpt.,debug,...)

### **Superscalar Execution**

OoO-exec. horizontal parallelism — sup.scalar: vertical parallelism — also parallel fetch, decode, retire: Nwide sup.scalar  $\Leftrightarrow N$ /cycle

req.: dep.checking between N instr., duplicate datapaths dependencies: detect & wait, detect & bypass, detect & eliminate, detect & move, speculative, FGM

# **Branch Prediction**

always control dependencies — 15-25% control flow — N branch resolution latency:  $N \cdot W$  (sup.scalar) waste — want prediction during fetch (so that ready for next) types (direction, possibilities, when): conditional (unknown, 2, execution), unconditional (always, 1, decode), call ("), return (always, many, execution), indirect (")

goal: keep pipeline correct full: stall, prediction, branch delay slot, FGM, predicate execution, multipath execution

#### **Static (compile-time) Branch Prediction**

always (not) taken, BTFN, profile, program analysis req. ISA support (metadata/-bits) — reduce: penalty, mispredictions

Always (not) taken taken:  $\approx 60 - 70\%$  accuracy — sw/hw: increase probability (code layout) — penalty  $\leftarrow$  bubble/flush count

BTFN: backward taken (loops), forward not taken

**Profile**: likely direction from profile run, accuracy depends on representitiveness of profile run **Program Analysis**: use heuristics (take loop, not take

fp comparisons, not take leq), misprediction 20 % 1993 **Programmer:** language pragmas, programmer hints

## **Dynamic (runtime) Branch Prediction**

**BTB**: Branch Target Buffer, store branch target, access with PC later, content hints whether taken

**Last Time Predictor**: predict branch as last time, bit/branch, store in B(ranch)H(istory)T(able), corresponds to small FSM, loop:  $\frac{N-2}{N}$ , good long, bad short *N*-bit Counter: *N* instead of 1 bit for counting, less volatile/add hysteresis, 2: strongly/weakly (not) taken, 80-90% accuracy on average

**Two-level Prediction**: use P(attern)H(istory)T(able)/PAT index with history, get N-bit prediction counter global: G(lobal)H(istory)R(egister), 3-bit GHR reasonable, improved with GHRxorPC for indexing local: L(ocal)HR/BHR for each branch, one PHT general: BHR: G(lobal)/S(et of branches)/P(branch), PHT: G,S,P, PHT counters: A(adaptive), S(tatic) **loop branch** counts loop iterations, compares to limit **perceptron** simple ML, single neuron: weight\*x+bias > 0, 1 perceptron/branch, weight↔bit on GHR

IF sign $(y_{out}) \neq t$  or  $|y_{out}| \leq 0$ FOR i=0..=n:  $w_i = w_i + tx_i$ 

# hybrid history-length predictor

Hybrid multiple predictors, select best for branch

Branch Confidence Estimation useful for hybrid/choosing among predictors, simple: track past (in)correct at branches — index table with PCxorGHR & reduction on past correctness

### Branch Delay

delayed branch execution, N instructions after branch always executed — perfect branch resolution, diff. to find instructions for delay slots (esp. cond. branches) with squashing: not execute delay slots when not taken

# V(ery)L(ong)I(nstruction)W(word) Architectures

similar to superscalar, but compiler already packs in instr. bundles — true VLIW: instr. in bundle indep. (req. hw understanding), sometimes even indep. between seq. — lockstep execution

bundle has structure: maybe instr. type linked to position

**Philosophy** similar to RISC: compiler most hard work, hw simple (faster, energy efficient) — low power  $\neq$  low energy

tradeoffs: simple hw, compiler indep instr., recompilation for new uarch req., lockstep: stalling of indep. instr. trace scheduling (indep. of VLIW mostly) - basic block: single entry/exit point, after branch reordering for optimization — super block: combine frequent basic blocks to one single-entry/multiple-exit blocks, enables aggressive optimizations for superblock/common case ISA translation through hw/sw, to get better tradeoff

## **Systolic Array Architectures**

in ASICs, accelerates certain tasks/exploits regular parallelism, highly concurrent, balanced compute & I/O use regular array of P(rocessing)E(lements), collective compute, max. compute/data, PEs may be structured many-dimensional, PEs local memory & execute needs indep. data, carefully input data, buffer output flexibility possible through weights in PEs, may be changed on the fly, PEs may have own data/instr. memory

**Convolutions** as in CNN, CV, filtering, ... — can be implemented in systolic arrays (via matrix multiply), Google TPU, Cerebras WSE

**examples/use** warp computer: 10 programmable processors in linear array, programmable, extends general purpose computer

# **Decoupled Access and Execute**

mitigate Tomasulo's complexity — separate memory access & execute — separate memory/execute instruction streams, execute in separate parts, communication via FIFO queues — req. synchronization (branches, ...) can run ahead of each other/some OoO without reservation stations etc., compiler support, branch req. synchronization (loop unrolling to reduce) compiler instr. splitting vs. dynamic not done on ISA-level usually, but hw-internal approx. for latency tolerance

# S(ingle)I(nstruction)M(ultiple)D(data) Architectuers

exploits regular parallelism, same operation done on multiple data — SIS(ingle)D, M(ultiple)ISD, MIMD — MISD generalization of systolic arrays SIMD: same instruction to multiple processing engines

(with diff. data) — I(nstruction)L(evel)P(arallelism) amortization of fetch overhead — data parallel programming model

**Vector Register** holds N M-bit values, entire vector/register — also control registers: V(ector)LEN(gth) (upper bound M), VSTR(ide), VMASK (set with vector test instr.): limits proc. to certain vec. elements

operate on vector instead of scalars here — must: load/ store vectors, operate on diff. VLENs, vec. elements may be stored apart in memory (def. by VSTR) as row/column in matrix

**Vector Processors** instr. on mult. data in consecutive time steps — pipelined execution on hw, operation/element in consecutive cycles — very deep pipelines possible (indep., !control flow, easy (pre)loading) — VFUs can also be deeply pipelined

**Array Processors** instr. on mult. data in parallel (multiple PEs)

**combination** array & vector combined in practice pipeline with multiple resource instances — may partition vector register: partition linked to specific VFUs (via lanes) — may overlap execution of multiple vector instructions

## Memory (vector)

memory (bandwidth) easily bottleneck — load/store req. mult. memory accesses — get high throughput(1) with banking **memory banking** divided, banks accessed independently, share address/data busses (reduce pins),

RS for MUL Unit

start/complete 1 bank access/cycle - N concurrent access to N banks possible (relatively prime best) — sustain throughput when "#banks> #bankLatency" — bank conflicts: more banks, more ports/bank, better data layout, better address mapping to banks

#### Chaining (vector)

forwarding from one VFU to next VFU as soon as result available — still memory limits (banking, ports, ...)

## Remarks

stripmining enable vectors longer than vector register entry length — do multiple iterations (diff. VLEN last) scatter/gather vectors might not be stored in strided fashion, index vector defines references/indirection, can scatter/gather with index vector + base address

conditionals in loops can use VMASK if not want to execute operation on specific vector elements, encode condition in bitmaks/VMASK (predicate execution) hw simple: execute all, turn of write - hw density-time: scan VMASK, only execute necessary

automatic code vectorization compilers may unroll loops to vector operations etc.

modern ISAs modern systems not full ISAs, have extensions (packed arithmetics)

# **GPU** Architectures

like works SIMD underneath (execution model)/MIMD/SIMT(hread), programmed using multiple threads (programming model)/SP(rogram)MD: thread for each independent execution, threads run same program

threads executing same instruction grouped into warps (wavefront) by hw automatically, executed via SIMD at once (thread executes on SIMD line), each tread uses id to identify its task

SIMT advtgs.: threads separate (don't need GPU), flexible warp grouping (supports branches), long latency tolerance

each warp (thread even) indep .: no interlocking, no dep. checking, fine-grain multithread warps  $\rightarrow$  easy to keep pipeline full

A GPU comprises many S(treaming)M(ultiprocessors), SMs comprise many SP(rocessors) themselves. Each SM may execute multiple warps in a pipelined fashion. One instruction is executed as a SIMD instruction so that each thread is executed on a SP in the end. SIMD line  $\leftrightarrow$ SP+ — blocks: group of threads, can cooperate/share data — grid: comprises blocks, execute all the same kernel

dynamic warp regrouping (branch), with many threads efficient, can't move threads between lanes (separate registers, FUs)

underutilization: branch divergence, long latency ops. - two-level warp scheduling of smaller warp groups (one latency blocks ready, others can continue) — large warps, break into sub-warps

terminology generic, NVIDIA, AMD - vector length. warp size, wavefront size - pipelined FU/scalar pipeline, streaming processor/CUDA core, - SIMD function unit/SIMD pipeline, grou pof N streaming processors, Vector ALU - GPU core, streaming multiprocessor, compute unit

# Memory

SRAM: on-chip/in-core, DRAM: 'typical', storage Importance

most area is memory — memory is main nottleneck started 3D-stacking

wordload ML, AI, genomics, databases, datacenter  $\rightarrow$  near/in-memory computing or FPGAs performance memory stalls  $\approx 70\%$  stalls energy reliability/security

Fundamentals

virtual/physical system maps virtual memory addresses to physical memory — programmer sees 'infinite' memory/translation invisible - last PART - physical here

ideal memory 0 latency,  $\infty$  capacity, 0 cost,  $\infty$  bandwidth, 0 energy

memory arrays flip-flops/latches: fast, expensive (bit  $\leftrightarrow$  10s transistors) — S(tatic)RAM: relatively fast, volatile, expensive (6+) — D(ynamic)RAM: slower, refresh, volatile, special manufacturing, cheap (ltrans.+lcapacitor) — storage: much slower, nonvolatile, very cheap

memory: 2-dim. arary of bit cells - 1 bit or byte or .../cell — rows (=:depth) and columns (=:width)



reading: sensing amplificers — detect signal change — SRAM good: complementary - reset bitlines beforehand

in practice: multiple words/row  $\rightarrow$  squared memory (useful for caching)

dram: capacitor leaks, refreshing - sram: used on-chip simple scaling difficult: latencies

#### **DRAM Subsystem**

channel  $\rightarrow$  rank  $\rightarrow$  bank  $\rightarrow$  subarrays  $\rightarrow$  mats channel each channel has a memory controller, one interface to compute unit — everything to one DIMM **DIMM** dual in-line memory module, physical module/organizational structure, rank on front and back — everything to one rank rank has 8 chips on it (ex) — distributed to chips chip has 8 banks (ex) — everything to one bank **bank** has 32 rows with 8 bytes (ex) — accessed in independently in parallel, each bank is a memory array some buffers used/a type of cache - row address (part of address) to get row, sense amplifiers put in row buffer, column address (part of address) to get data, write back (read is descructive) — in practice: subarrays with local row buffer in array, row decoder forwards part of address to local decoder — MAT: usually refers to one row in a subarray, but may divide row into two subrows to not have to active the entire row, depending on selected column — read dominated by wordline, bitline drives



#### **DRAM Operation**

decode row address, drive word-lines, selected bit-cells drive bitlines, differential sensing, decode colum address, seelct subset of row, send to output, (rewrite as destructive), precharge bit-lines

cache blocks (blocks of memory) distributed across chips, accessing takes amount rankAccess, dist. across chips

#### Emerging technologies

P(hase)C(hange)M(emory), resistive memory, state (crystalline/high reflexivity/low resistivity vs amorphous/low reflexivity/high resistivity) changed using heat — higher density, lower cost, non-volatile, no refresh, slower, endurance problem, higher-energy SST-MRAM (magnet based, polarity)

Memristors: atomic structure

Flash, SSD common - doubtful for past two decades (flash durability problems)

#### **Hierarchy & Caches**

hierarchy: storage (off-chip), DRAM (off- or on-chip), SRAM (usually on-chip) as L3 cache, L2 cache, L1 cache (in-core), register file — bigger  $\rightarrow$  slower, faster  $\rightarrow$  expensive — cheaper with time

deal with opposing goals: levels of hierarchy/caches due to locality (past predicts future) appears fast & large - temporal locality: likely to use same data reference again, spatial locality: likely to operate on related/close data

manual cache management: programmer/compiler must do all, still for embedded, GPUs, accellerators (typically sacratchpad today, addition) - automatic: hardware manages across all levels, transparent to programmer, hides uarch details — today: usually only register file manual

extension: remote memory with remote nodes on lowlatency network

sharing/separating data/instruction caches - utilization vs. quality of service - 11 split, 12+ shared

**Latency** intrinsic access time  $t_i := get local data/learn$ miss — perceived access time  $T_i := t_i + \text{getting data}$ 

from higher level if miss — hit rate  $h_i := t_i$  + getting during the form higher level if miss — hit rate  $h_i := \frac{\#\text{hits}}{\#\text{hits} + \#\text{misses}}$ — miss rate  $m_i := \frac{\#\text{misses}}{\#\text{hits} + \#\text{misses}}$  —  $T_i = t_i + m_i T_{i+1}$ — A(verage)M(emory)A(ccess)T(ime): usually lower

better

optimize —  $m_i$  low: increase  $C_i \rightarrow$  increase  $t_i$ , better cache management — lower  $T_{i+1}$ : make faster (expensive), intermediate hierarchy level

#### **Cache Design**

cache: memorizes used/produced data - cache comprises cache blocks — cache block stores block of data from higher level — block: subsequent set of addresses with metadata (source, access patterns) - memory split

into fix blocks, cache can store blocks

questions: placement, replacement, granularity of management, write policy, instruction/data

#### storing cache blocks

have tag (valid, tag) and data stores (data), indexed by index bits

+associativity  $\leftrightarrow$  +hit rates, -speed, costly hw — return for +associativity $\rightarrow$ +hit rates diminishes

**direct-mapped** block only in one cache block — cache block determined by index bits of address — index cache with index bits, compare tag (remaining address part), miss or index data store — contention (conflict miss) if two blocks with same index bits in cache

set-associative block in certain set of cache blocks, less conflict misses — n-way/degree n: n cache blocks/set - more expensive hw: compare all cache blocks in set — bigger tag, smaller index

full-associativity any block in any cache block — one set — very expensive: many comparators

## Cache Management

cache full & new block: conflict miss - must evict before add - define priority, who to evict - insertion (new priorities), promotion (change priorities), replacement (who evict?) — eviction/replacement policy

L(east)R(ecently)U(sed) evict lru, difficulty: track order — minimize processing (+memory):  $\lceil \log_2 n \rceil$  bits/block (position) — minimize memory (+latency):  $\lceil \log_2(n!) \rceil$  bits (combination  $\leftrightarrow$  order) true LRU expensive: not in modern systems

locality approx.: not M(ost)R(ecently)U(sed) - hierarchical LRU (expl. MRU groups, LRU in groups) victim-victimNext

Random sometimes better than LRU/MRU --- set trashing if working set>associativity  $\rightarrow$  Random — set sampling: compare performance repeatedly, choose then Belady's OPT replace block furthest referenced -

provably minimizes miss rate — v.diff. to impl. — min. miss rate  $\neq$  min exec. time (diff. latencies, overlapping) **Multilevel Management** 

higher level: +size, +associativity, +latency, tag/date serially — access higher levels in parallel (speculative) vs. serially on miss — faster vs. energy — latter: different policies as different access patterns

data in all/only one cache?, bypass when loading?. evicted put where? - inclusive: block in inner also in outer (coherency), exclusive, inner never in outer (space utilization), non-inclusive: may/may not

## Writes

write through: write to higher level caches as soon as data is modified (+simple, +coherency, -bandwidth) write back: write back when evicted (combine, +bandwidth, -dirty bit) - dirty bit (in tag store): indicates modified

allocate cache block for write on miss? - not if overwritten — subblocks with indivd. dirty bits: load not entirely overwritten

## Performance

cache size (working set) for temporal locality - block size for spatial locality - for large blocks: critical word first (forward immed.), subblocks (load indep. supply faster) — associativity (-misses, +latency, +cost) — replacement policy

miss/hit rates - compulsory miss: first block reference  $\rightarrow$  +block size, prefetching — capacity miss: all misses in fully-associative, opt. replacement, same capacity  $\rightarrow$ software management of working set — conflict miss: all other misses (different organization could avoid)  $\rightarrow$ +cache size, +associativity (rand, indexing, software eviction hints, victim cache)

*improve performance* — -miss rate: associativity, better policies, sw opt. — -miss latency/cost: multi-lvl. caches, critical word first, subblocking, better replacement policy, non-blocking cache, sw opt. - - hit latency/cost

software opt.: restructuring data access patterns (loop interchange, loop fusion, array merging), restructuring data layout (data structure separation/merging, exmpl. outsource rarely-accessed fields of objects as reference). data reuse/blocking/tiling (expl. matrix multiply)

# Advanced Caching

M(emeory)L(evel)P(arallelism) memory access may have long latency — impact determined by: latency overlapped?, latency tolerance?, evicting longer-torefetch block?

MLP: prefer parallel misses to better utilize stall - min. miss rate, thus, not opt. — prefer elimination of isolated miss, higher-latency misses

Caches in Multi-Core Systems shared vs. private, OoS & predictability, allocation/thread, shared/limited bandwidth/space

shared caches +utilization, -comm. latencies, +shared mem.program. model/+coherency, +contention, -single thread performance, -QoS, -consistency, +capacity, speed — cache management to mitigate

Cache Coherence private caches, shared memory comm., cache incoherence  $\rightarrow$  cache coherency protocol broadcast based: invalidate/update elsewhere on write directory-based cache coherence: directory stores what caches have, caches consult directory — exmpl.: store P + 1 bits/cache block, 1/P indicates contains, 1 exclusive bit - read: set process bit, invalidate exclusivity before (directory + exclusive thread) — write: invalidate all other, exclusive to this

## Prefetching

reduce miss rate & latency (maybe eliminate misses) speculative, works with predictable patterns - nos misprediction penalty

What address address prediction algorithm — prefetch accuracy:= <u>used prefetches</u> patterns, compiler/programmer input — stride, repeating variable stride, rep. pat-

tern.

When (early, late, on-time) affects timeliness - earlier or aggressive  $\rightarrow$  timely

Where to place caches (today), separate buffer — role in replacement policy (equal to demand fetched, ...)

Where prefetcher all memory-lyls. (today)

**How** sw prefetching: prefetch instructions — hw pref.: finds access patterns, prefetches automatically — execution based: 'thread'/instr. flow from program to prefetch (sw or hw) — (cooperative, hybrid)

#### Prefetchers

**Stride** record stride, stable N: predict next M — perinstr. or per-memory-region - need last address referenced, stride, (confidence) — stream prefetching (hw for N = 1)

complex prefetchers multiple regular strides (multistride detection common today), linked data structure traversal, indirect array access, multiple data structures concurrently — bootstrapping: table various histories

self-optimizing memory (Pythia) get state (memory request features), prefetch, get reward (optimize) — features: PC, branch PC, last 3 PCs, cacheline address, physical page number .  $\Delta$ two cacheline addresses prefetch as offset - reward: usefulness (accuracy, timeliness), system-level (memory bandwidth, cache pollution, energy) — outperforms current solutions in always hybrid hw

multi-core prefetch shared data (avoid coherency misses), important as shared/limited res. - throttle prefetching: conflicts, contention (bus, RAM)

execution based pre-execute program part, initiation loads of data - separate core, FGMT, same thread context/during cache misses - word for branches too runahead execution ideally OoO w large instr. window

(expensive) — upon oldest instr. long-latency: runahead mode: speculative pre-execute (without stalling), generates misses/prefetches — upon long-latency return: flush, resume - no add. stalling, accurage, instr. prefetch, train other prefetchers, limited by branch prediction, limited by MLP, -energy - best in hybrid

| Performance                                                                                                                            |
|----------------------------------------------------------------------------------------------------------------------------------------|
| accuracy:= $\frac{\text{used prefetches}}{\text{sent prefetches}}$ — coverage:= $\frac{\text{prefetched misses}}{\text{all misses}}$ — |
| timeliness:= <u>on-time prefetches</u> — bandwidth consumption                                                                         |
| (best during idle) — cache pollution                                                                                                   |
| correlated                                                                                                                             |

# Virtual Memory

multiple programs without interference, authorization. unbound capacity — difficulties only physical: limited size, multiple programs, ISA  $\not\leftrightarrow$  physical memory, programmer burden, data relocation, sharing data

processes individual illusion or large address space (indirection to physical address space) — hw & system cooperatively manage mapping — part of ISA — benefit: program uarch indep.

#### Basics

physical memory as (fully-associative) cache for disk (block-page, block offset-page offset, miss-page fault, index-virtual apge number) - virtual/linear ad- $\stackrel{translation}{\longleftrightarrow}$ physical/real addresses - mapdresses

ping page-based (page:=block of memory, multiple KBs), frame:=page unit in memory — mapping via LUT/page-table (reference to memory or disk) managed by OS/hw, first reference arbitrary to memory - loading from disk/evicting from memory necessary by virtual memory system — virtual memory system: M(emeory)M(anagement)U(nit) + OS



bits, dirty bits, protection bits, disk, metadata - stored in physical memory, P(age)T(able)B(ase)R(egister) Address Translation

virtual address = V(irtual)P(age)N(umber) + Page Offset — offset unchanged, VPN→P(hysical)PN valid bit set: produce page — not set: OS page fault exception handler initiates I/O controller to load via D(irect)M(emory)A(ccess)

eviction: CLOCK algorithm - circular buffer, references all PTEs with physical addresses, pointer to last examined page — evict: traverse, evict first with 0. switch all 1s to 0s - ARC another algorithm (considers access frequency)

#### Multi-Level Paging

one PT/thread: huge! - multiple levels instead - VPN split into identifier for each table hierarchy, continuously index into lower-level tables — N-level page table, Npage table accesses — only first-level PT must be in memory

T(ranslation)L(ookaside)B(uffer) Cache to reduce long-latency of multi-level accesses - maybe only one access to TLB if hit — 16-512 entries in practice, 90-99% hit rates — consider TLB as L1 cache, memory as L2 cache

#### **Memory Protection**

memory protection/isolation between processes if correct mapping by OS — shared physical address for shared memory — page table stores protection bits (access rights: read, write, execute, kernel) — access control concurrent to translation — multi-level paging: multi-level access (ISA specifies access from bit combination) — if access violates bits, Access Protection Exception — x86: protection rings 0 (kernel/OS-only)-1/2(OS service)-3(user application)

RowHammer evades protection by exploiting hardware failure — induce bit-flips in neighboring rows by reading a row repeatedly — can gain kernel privileges



Space

can do hashed page tables — special treatment of virtualization and guest OSs: additional translation (guest virtual, host virtual/guest physical, host physical)