Week 1
Overview of this course.
Readings:
- The death of optimizing compilers
- On Proebsting’s Law
- Impact of Economics on Compiler Optimization
- Why Do Peephole Optimizations Work?
- (optional) Compiler Optimization Catalog
Assignment:
Week 2
Discussions:
- Is it really the death of optimizing compilers? No.
- On Proebsting’s Law. -O0 probably cannot stand in for an 18-year-old compiler.
- Impact of Economics on Compiler Optimization. Look for where the money goes.
- Refinement is very important!
Assignment:
- Pick a missing optimization in LLVM to implement.
- Build LLVM and Alive2 locally.
Week 3
Discussions:
- How is LLVM tested?
- unittest: LLVM API tests
- tests: regression tests
- llvm-test-suite: benchmark tests
- libFuzzer
- testing by users
- AVX-512 ternary logic (vpternlog). It’s difficult to decode by eye because its 8-bit immediate encodes an arbitrary three-input truth table.
- Superoptimization: to generate optimized code, a superoptimizer searches over candidate instruction sequences, does a refinement check, and assesses candidates with its cost model.
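To make the ternary-logic point concrete, here is a minimal scalar sketch of vpternlog's semantics (the function name `ternlog` is mine, not Intel's): each bit position's three input bits form a 3-bit index into the immediate, so the immediate is a raw truth table.

```c
#include <stdint.h>

/* Scalar model of AVX-512 vpternlog: for every bit position, the three
   input bits select one bit of the 8-bit immediate. This is why the
   immediate is hard to read: it is a packed three-input truth table. */
uint64_t ternlog(uint64_t a, uint64_t b, uint64_t c, uint8_t imm) {
    uint64_t r = 0;
    for (int bit = 0; bit < 64; bit++) {
        unsigned idx = (unsigned)((((a >> bit) & 1) << 2) |
                                  (((b >> bit) & 1) << 1) |
                                   ((c >> bit) & 1));
        r |= ((uint64_t)(imm >> idx) & 1) << bit;
    }
    return r;
}
```

For example, immediate 0x96 is three-way XOR and 0xE8 is bitwise majority; working those immediates out by hand shows how opaque the encoding is.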
Readings:
- SSA is Functional Programming
- Lecture Notes on Static Single Assignment Form
- System Under Test: LLVM
- Testing LLVM
Assignment:
- Implement a missing optimization in LLVM and add tests for it. Submit a GitHub PR.
Week 4
Discussions:
- A compiler without SSA is harder to get right, and working around that messes up the code base.
- Brainfuck language
- Game of Life in BF
- Speed up the interpreter with computed goto.
- Need to be cautious when using APInt. There are use cases in LLVM that need 80 or 320 bits.
- Intro to partial evaluation and its relation to staged computation.
- Abstract interpretation, approximate computations, the halting problem, tainted cells.
- General optimization approach for BF.
- Creating virtual instructions for BF is similar to x86-64 processors having internal instructions that are not exposed to developers.
- It’s hard to pass alias information to compilers, like using restrict correctly in C.
Readings:
- Simple and Efficient Construction of Static Single Assignment Form: Great paper that solves real problems!
- Whose baseline is it anyway?
- llvm-reduce
Assignment:
- Implement a BF interpreter.
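A minimal sketch of the assignment's starting point (my own code, not a reference solution): a direct BF interpreter over a fixed tape. It writes output into a caller-supplied buffer instead of stdout so it is easy to test, assumes balanced brackets, and omits the `,` input command.

```c
#include <stddef.h>

/* Minimal Brainfuck interpreter. Assumes balanced brackets; `,` omitted.
   Output characters are appended to `out` and NUL-terminated. */
void bf_run(const char *src, char *out) {
    unsigned char tape[30000] = {0};
    unsigned char *p = tape;
    size_t olen = 0;
    for (const char *ip = src; *ip; ip++) {
        switch (*ip) {
        case '>': p++; break;
        case '<': p--; break;
        case '+': (*p)++; break;
        case '-': (*p)--; break;
        case '.': out[olen++] = (char)*p; break;
        case '[':
            if (*p == 0) {      /* skip forward past the matching ']' */
                int depth = 1;
                while (depth) { ip++; if (*ip == '[') depth++; if (*ip == ']') depth--; }
            }
            break;
        case ']':
            if (*p != 0) {      /* jump back to the matching '[' */
                int depth = 1;
                while (depth) { ip--; if (*ip == ']') depth++; if (*ip == '[') depth--; }
            }
            break;
        }
    }
    out[olen] = '\0';
}
```

Scanning for the matching bracket on every `[`/`]` is quadratic in the worst case; precomputing a jump table is the obvious first optimization.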
Week 5
Discussions:
- Always optimize with a profiler (data). Example: turning a recursion into a closed form.
- C++ downcast optimizations probably can’t be caught by the profiler.
- The ABI doc tells developers how procedures communicate, e.g., which registers hold arguments.
- If a compiler doesn’t know a fact, sometimes we can rewrite the code to teach the compiler to optimize.
- How does a compiler recognize certain patterns, such as popcount, to optimize? The patterns are hardcoded, and some canonicalization runs before pattern matching.
- Approximation: tracking information precisely is hard. Instead, tools try to shrink the area of approximation.
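As an example of idiom recognition, this is the classic popcount loop that LLVM's loop-idiom recognition can replace with a single ctpop/popcnt (the function name is mine; the idiom itself is the standard Kernighan bit trick):

```c
/* Classic popcount idiom: clear the lowest set bit until zero.
   LLVM's LoopIdiomRecognize pass can turn this loop into llvm.ctpop,
   which lowers to a single popcnt instruction on x86-64. */
int popcount_loop(unsigned x) {
    int n = 0;
    while (x) {
        x &= x - 1;   /* clears the lowest set bit */
        n++;
    }
    return n;
}
```

Recognition is brittle: small rewrites of the loop body can fall outside the hardcoded pattern, which is exactly why canonicalization runs first.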
Readings:
Assignment:
- Implement a profiler on top of the BF interpreter.
- Simple loops: no I/O, no net pointer change, only +1/-1 cell updates.
- Implement a compiler for BF that emits x86-64 assembly.
Week 6
Discussions:
- Is it okay to remove infinite loop?
- Debugging: optimization fuel decreases by one per transformation until it runs out; binary search over the fuel, like bisecting commits.
- Scan memory tricks
- Fast programs handle aligned memory specially.
- Lattices and semilattices organize the abstract values that live in the compiler.
- Design goal: tight enough to not run forever; in other words, enough information for the compiler to run fast.
- Dataflow Analysis
- Top is the universal set (imprecise info). Starting from the top, we reach a point that has enough information.
- Bottom is the empty set (precise info). Starting from the bottom, we reach a fixed point.
- The least fixed point is where we get the most information.
Examples:
Dead Code Elimination (lattice of height 2):
- Top: maybe reachable.
- Bottom: provably unreachable.
Constant Propagation (lattice of height 3):
- Top: possibly not a constant.
- Middle: the constant integers.
- Bottom: unreachable.
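The constant-propagation lattice above can be sketched as a join function (names `LatVal`/`join` are mine, loosely in the style of SCCP): bottom is the identity, equal constants stay constant, and anything else collapses to top.

```c
/* Three-level constant-propagation lattice:
   TOP (possibly not constant) > constants > BOTTOM (unreachable). */
typedef enum { BOTTOM, CONST, TOP } Kind;
typedef struct { Kind kind; int value; } LatVal;

LatVal join(LatVal a, LatVal b) {
    if (a.kind == BOTTOM) return b;        /* bottom is the join identity */
    if (b.kind == BOTTOM) return a;
    if (a.kind == TOP || b.kind == TOP) return (LatVal){TOP, 0};
    if (a.value == b.value) return a;      /* agreeing constants */
    return (LatVal){TOP, 0};               /* conflicting constants: give up */
}
```

The analysis iterates this join over incoming edges until it reaches a fixed point; the finite lattice height (3) is what guarantees termination.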
Readings:
Assignment:
- Optimize simple and non-simple loops in BF; add a fast vector implementation.
- Start with a simple case: make a tape with a bunch of 1s and a single 0 in it, and print the position of the 0.
- Write a tiny BF program.
Week 7
Discussions:
- GCC’s IR: GIMPLE.
- LLVM’s IR is flat.
- In practice, we use worklist algorithms a lot. Which node in the worklist should be picked first? A node in an SCC, since it can affect other nodes.
- Transfer function table SPA
- How can we run better benchmarks?
- Setting up the conditions for the compilers to know more details.
- Be a better programmer by moving redundant and loop-invariant operations out of loops.
- Math is good for low-level programming.
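A standard example of hoisting a loop-invariant operation by hand (my illustration): `strlen` in a loop condition is re-evaluated every iteration unless the compiler can prove the string is not modified, which it often cannot.

```c
#include <string.h>
#include <ctype.h>

/* Manual loop-invariant code motion: strlen(s) is hoisted out of the
   loop instead of being called once per iteration in the condition. */
void upcase(char *s) {
    size_t n = strlen(s);             /* loop-invariant: computed once */
    for (size_t i = 0; i < n; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}
```

Writing `for (i = 0; i < strlen(s); i++)` instead turns an O(n) loop into O(n^2) whenever the compiler cannot hoist the call itself.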
References for writing x86-64 vector instructions:
- Intel Official Intrinsic manual
- Peter Cordes on Stackoverflow
- Felix Cloutier’s x86-64 manual
- One of my working examples
Readings:
Videos:
Week 9
Discussions:
- JIT compilers: how to encode instructions in a JIT.
- Partial evaluation
- Dead store elimination
- calloc
- global, stack, heap (a sticking point: people leave memory uninitialized on purpose)
- Common subexpression elimination -> available expressions
- Rematerialization
- Very busy expressions -> LICM
- Autotuning tries a lot of things; PGO runs once and tells the compiler extra information for the next build.
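The dead-store-elimination and calloc points combine nicely in one example (my sketch; LLVM's DSE does special-case calloc, though which pass fires can vary by version): a store of zeros into freshly calloc'd memory is dead, because calloc already returned zeroed memory.

```c
#include <stdlib.h>
#include <string.h>

/* The memset here is a dead store: calloc guarantees zeroed memory,
   so an optimizer that models calloc can delete the memset entirely. */
int *zero_buf(size_t n) {
    int *p = calloc(n, sizeof *p);
    if (p)
        memset(p, 0, n * sizeof *p);   /* redundant: already zero */
    return p;
}
```

The heap "sticking point" is the flip side: code that deliberately reads uninitialized malloc'd memory is undefined behavior, and optimizers exploit that.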
Week 11
Discussions:
- Linker script: tells the linker where memory sections map on a dev board.
- Demanded bits: the bits you cannot prove are unnecessary.
- LLVM Known bits.
- Trace based JIT
- Precompiled headers
- Timeouts are a real problem, e.g., for the Z3 solver behind AWS IAM analysis.
- Flaky tests: tests that sometimes fail and sometimes succeed.
- LLVM IR verifier
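Known bits can be sketched as a pair of masks per value, in the style of LLVM's KnownBits (the struct and function names are mine): a known-zero mask, a known-one mask, and a transfer function per operation. AND is the simplest case.

```c
#include <stdint.h>

/* Known-bits sketch: `zero` has a 1 for each bit known to be 0,
   `one` has a 1 for each bit known to be 1 (the masks never overlap). */
typedef struct { uint32_t zero, one; } Known;

/* Transfer function for AND: a result bit is 1 only if known 1 in both
   inputs, and 0 if known 0 in either input. */
Known known_and(Known a, Known b) {
    Known r;
    r.one  = a.one & b.one;
    r.zero = a.zero | b.zero;
    return r;
}
```

For `x & 0xFF` with `x` fully unknown, the result's top 24 bits become known zero; demanded bits is the dual question asked backwards from the uses.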
Readings:
Week 12
Discussions:
- Bugpoint was used in LLVM for reducing test cases. By default, its reductions can introduce instructions whose results are undefined behavior.
- Trusting trust attacks.
- Diverse double-compiling, a defense against trusting-trust attacks.
Readings:
Week 13
Discussions:
- Undefined behavior also exists in hardware chips.
- Register pairs for multiplication.
- undef breaks SSA.
- Trusting trust is really worth reading!
- Java type concurrency
- Type system is trying to help you not break the program.
- Union find
- HCI for expressing types
- bitblt (bit block transfer)
How to implement fast inverse square root on GCC and Clang (ways to type-pun the float’s bits):
- union
- memcpy (fast: compiles to a plain move)
- char *
- (C++) reinterpret_cast
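The memcpy option from the list above, sketched (my code, using the well-known Quake III constant): memcpy is fully defined C for the type pun, and both GCC and Clang compile the fixed-size copies down to register moves.

```c
#include <string.h>
#include <stdint.h>

/* Fast inverse square root with the float<->int pun done via memcpy,
   which avoids the aliasing UB of pointer casts and costs nothing:
   compilers lower these memcpys to plain moves. */
float fast_rsqrt(float x) {
    uint32_t i;
    float y;
    memcpy(&i, &x, sizeof i);            /* read the float's bit pattern */
    i = 0x5f3759df - (i >> 1);           /* magic initial estimate */
    memcpy(&y, &i, sizeof y);            /* back to float */
    y = y * (1.5f - 0.5f * x * y * y);   /* one Newton-Raphson step */
    return y;
}
```

The union approach is also defined in C (though not in C++); the plain pointer cast is a strict-aliasing violation in both languages.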
Readings:
Week 14
Discussions:
- SMT solvers: provide them constraints to solve. If a solution exists, the solver returns one (a model).
- Practice: a Z3 sudoku solver (Int vs. BitVector encodings).
- Using a solver for an unbounded loop is hard because we cannot unroll it.
- People compile languages to JavaScript -> asm.js -> WebAssembly (a stack machine).
- Baseline compiler
- Everyone needs to know at least one language in System Programming, Scripting, Parallel Programming, Math.
- Autovectorization can significantly improve performance for large legacy codebases.
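The "stack machine" remark about WebAssembly can be illustrated with a minimal evaluator (opcodes and `eval` are my invention, not Wasm's actual binary format): operands are pushed, operators pop their inputs and push the result.

```c
/* Minimal stack-machine evaluator, the execution model WebAssembly uses:
   no registers in the bytecode, just an operand stack. */
enum { PUSH, ADD, MUL, END };

int eval(const int *code) {
    int stack[64];
    int sp = 0;                               /* next free stack slot */
    for (;;) {
        switch (*code++) {
        case PUSH: stack[sp++] = *code++; break;
        case ADD:  sp--; stack[sp-1] += stack[sp]; break;
        case MUL:  sp--; stack[sp-1] *= stack[sp]; break;
        case END:  return stack[sp-1];
        }
    }
}
```

A baseline compiler for such bytecode maps stack slots to registers one opcode at a time, trading code quality for compile speed.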
Readings:
Week 15
Discussions:
- GPU compilers
- Which CPU compiler optimizations don’t work in GPU compilers? (GPUs have a limited number of registers per thread.)
- Inline
- Loop unrolling
- LICM
- InstCombine
- The divergence problem is critical: divergence matters more than latency.
- Rematerialization
Readings:
Week 16
Discussions:
- Compilation on the GPU
- Go’s generic rewrite rules compared with LLVM’s large C++ code base.
- Go’s generic rules should be checked by something like Z3.
- e-graphs are good, but probably not practical to adopt in existing compilers like LLVM or GCC; applying rewrites might blow up memory usage.
- Quine is fun.
Readings: