Week 1
Overview of this course.
Readings:
- The death of optimizing compilers
- On Proebsting’s Law
- Impact of Economics on Compiler Optimization
- Why Do Peephole Optimizations Work?
- (optional) Compiler Optimization Catalog
Assignment:
Week 2
Discussions:
- Is it really the death of optimizing compilers? No.
- On Proebsting’s Law. -O0 probably cannot stand in for an 18-year-old compiler.
- Impact of Economics on Compiler Optimization. Look for where the money goes.
- Refinement is very important!
Assignment:
- Pick a missing optimization in LLVM to implement.
- Build LLVM and Alive2 locally.
Week 3
Discussions:
- How is LLVM tested?
- unittest: LLVM API tests
- tests: regression tests
- llvm-test-suite: benchmark tests
- libFuzzer
- testing by users
- AVX-512 ternary logic (vpternlog). It’s difficult to decode by eye because its 8-bit immediate encodes an arbitrary three-input truth table.
- Superoptimization: to generate optimized code, a superoptimizer searches over candidate instruction sequences, does a refinement check, and assesses candidates with its cost model.
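To make the ternary-logic point concrete, here is a minimal scalar sketch of vpternlog's semantics (the function name `ternlog` is mine, not Intel's): each bit position's three input bits form a 3-bit index into the immediate, so the immediate is a raw truth table.

```c
#include <stdint.h>

/* Scalar model of AVX-512 vpternlog: for every bit position, the three
   input bits select one bit of the 8-bit immediate. This is why the
   immediate is hard to read: it is a packed three-input truth table. */
uint64_t ternlog(uint64_t a, uint64_t b, uint64_t c, uint8_t imm) {
    uint64_t r = 0;
    for (int bit = 0; bit < 64; bit++) {
        unsigned idx = (unsigned)((((a >> bit) & 1) << 2) |
                                  (((b >> bit) & 1) << 1) |
                                   ((c >> bit) & 1));
        r |= ((uint64_t)(imm >> idx) & 1) << bit;
    }
    return r;
}
```

For example, immediate 0x96 is three-way XOR and 0xE8 is bitwise majority; working those immediates out by hand shows how opaque the encoding is.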
Readings:
- SSA is Functional Programming
- Lecture Notes on Static Single Assignment Form
- System Under Test: LLVM
- Testing LLVM
Assignment:
- Implement a missing optimization in LLVM and add tests for it. Submit a GitHub PR.
Week 4
Discussions:
- A compiler without SSA is harder to get right, and working around that messes up the code base.
- Brainfuck language
- Game of Life in BF
- Speed up the interpreter with computed goto.
- Need to be cautious when using APInt. There are use cases in LLVM that need 80 or 320 bits.
- Intro to partial evaluation and its relation to staged computation.
- Abstract interpretation, approximate computations, the halting problem, tainted cells.
- General optimization approach for BF.
- Creating virtual instructions for BF is similar to x86-64 processors having internal instructions that are not exposed to developers.
- It’s hard to pass alias information to compilers, like using restrict correctly in C.
Readings:
- Simple and Efficient Construction of Static Single Assignment Form: Great paper that solves real problems!
- Whose baseline is it anyway?
- llvm-reduce
Assignment:
- Implement a BF interpreter.
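A minimal sketch of the assignment's starting point (my own code, not a reference solution): a direct BF interpreter over a fixed tape. It writes output into a caller-supplied buffer instead of stdout so it is easy to test, assumes balanced brackets, and omits the `,` input command.

```c
#include <stddef.h>

/* Minimal Brainfuck interpreter. Assumes balanced brackets; `,` omitted.
   Output characters are appended to `out` and NUL-terminated. */
void bf_run(const char *src, char *out) {
    unsigned char tape[30000] = {0};
    unsigned char *p = tape;
    size_t olen = 0;
    for (const char *ip = src; *ip; ip++) {
        switch (*ip) {
        case '>': p++; break;
        case '<': p--; break;
        case '+': (*p)++; break;
        case '-': (*p)--; break;
        case '.': out[olen++] = (char)*p; break;
        case '[':
            if (*p == 0) {      /* skip forward past the matching ']' */
                int depth = 1;
                while (depth) { ip++; if (*ip == '[') depth++; if (*ip == ']') depth--; }
            }
            break;
        case ']':
            if (*p != 0) {      /* jump back to the matching '[' */
                int depth = 1;
                while (depth) { ip--; if (*ip == ']') depth++; if (*ip == '[') depth--; }
            }
            break;
        }
    }
    out[olen] = '\0';
}
```

Scanning for the matching bracket on every `[`/`]` is quadratic in the worst case; precomputing a jump table is the obvious first optimization.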
Week 5
Discussions:
- Always optimize with a profiler (data). Example: turning a recursion into a closed form.
- C++ downcast optimizations probably can’t be caught by the profiler.
- The ABI doc tells developers how procedures communicate, e.g., which registers hold arguments.
- If a compiler doesn’t know a fact, sometimes we can rewrite the code to teach the compiler to optimize.
- How does a compiler recognize certain patterns, such as popcount, to optimize? The patterns are hardcoded, and some canonicalization runs before pattern matching.
- Approximation: tracking information precisely is hard. Instead, tools try to shrink the area of approximation.
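As an example of idiom recognition, this is the classic popcount loop that LLVM's loop-idiom recognition can replace with a single ctpop/popcnt (the function name is mine; the idiom itself is the standard Kernighan bit trick):

```c
/* Classic popcount idiom: clear the lowest set bit until zero.
   LLVM's LoopIdiomRecognize pass can turn this loop into llvm.ctpop,
   which lowers to a single popcnt instruction on x86-64. */
int popcount_loop(unsigned x) {
    int n = 0;
    while (x) {
        x &= x - 1;   /* clears the lowest set bit */
        n++;
    }
    return n;
}
```

Recognition is brittle: small rewrites of the loop body can fall outside the hardcoded pattern, which is exactly why canonicalization runs first.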
Readings:
Assignment:
- Implement a profiler on top of the BF interpreter.
- Simple loops: no I/O, no net pointer change, only +1/-1 cell updates.
- Implement a compiler for BF that emits x86-64 assembly.
Week 6
Discussions:
- Is it okay to remove infinite loop?
- Debugging: optimization fuel decreases by one per transformation until it runs out; binary search over the fuel, like bisecting commits.
- Scan memory tricks
- Fast programs handle aligned memory specially.
- Lattices and semilattices organize the abstract values that live in the compiler.
- Design goal: tight enough to not run forever; in other words, enough information for the compiler to run fast.
- Dataflow Analysis
- Top is the universal set (imprecise info). Starting from the top, we reach a point that has enough information.
- Bottom is the empty set (precise info). Starting from the bottom, we reach a fixed point.
- The least fixed point is where we get the most information.
Examples:
Dead Code Elimination (lattice of height 2):
- Top: maybe reachable.
- Bottom: provably unreachable.
Constant Propagation (lattice of height 3):
- Top: possibly not a constant.
- Middle: the constant integers.
- Bottom: unreachable.
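The constant-propagation lattice above can be sketched as a join function (names `LatVal`/`join` are mine, loosely in the style of SCCP): bottom is the identity, equal constants stay constant, and anything else collapses to top.

```c
/* Three-level constant-propagation lattice:
   TOP (possibly not constant) > constants > BOTTOM (unreachable). */
typedef enum { BOTTOM, CONST, TOP } Kind;
typedef struct { Kind kind; int value; } LatVal;

LatVal join(LatVal a, LatVal b) {
    if (a.kind == BOTTOM) return b;        /* bottom is the join identity */
    if (b.kind == BOTTOM) return a;
    if (a.kind == TOP || b.kind == TOP) return (LatVal){TOP, 0};
    if (a.value == b.value) return a;      /* agreeing constants */
    return (LatVal){TOP, 0};               /* conflicting constants: give up */
}
```

The analysis iterates this join over incoming edges until it reaches a fixed point; the finite lattice height (3) is what guarantees termination.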
Readings:
Assignment:
- Optimize simple and non-simple loops in BF; add a fast vector implementation.
- Start with a simple case: make a tape with a bunch of 1s and a single 0 in it, and print the position of the 0.
- Write a tiny BF program.
Week 7
Discussions:
- GCC’s IR: GIMPLE.
- LLVM’s IR is flat.
- In practice, we use worklist algorithms a lot. Which node in the worklist should be picked first? A node in an SCC, since it can affect other nodes.
- Transfer function table SPA
- How can we run better benchmarks?
- Setting up the conditions for the compilers to know more details.
- Be a better programmer by moving redundant and loop-invariant operations out of loops.
- Math is good for low-level programming.
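A standard example of hoisting a loop-invariant operation by hand (my illustration): `strlen` in a loop condition is re-evaluated every iteration unless the compiler can prove the string is not modified, which it often cannot.

```c
#include <string.h>
#include <ctype.h>

/* Manual loop-invariant code motion: strlen(s) is hoisted out of the
   loop instead of being called once per iteration in the condition. */
void upcase(char *s) {
    size_t n = strlen(s);             /* loop-invariant: computed once */
    for (size_t i = 0; i < n; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}
```

Writing `for (i = 0; i < strlen(s); i++)` instead turns an O(n) loop into O(n^2) whenever the compiler cannot hoist the call itself.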
References for writing x86-64 vector instructions:
- Intel Official Intrinsic manual
- Peter Cordes on Stackoverflow
- Felix Cloutier’s x86-64 manual
- One of my working examples
Readings:
Videos:
Week 9
Discussions:
- JIT compilers: how to encode instructions in a JIT.
- Partial evaluation
- Dead store elimination
- calloc
- global, stack, heap (a sticking point: people leave memory uninitialized on purpose)
- Common subexpression elimination -> available expressions
- Rematerialization
- Very busy expressions -> LICM
- Autotuning tries a lot of things; PGO runs once and tells the compiler extra information for the next build.
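The dead-store-elimination and calloc points combine nicely in one example (my sketch; LLVM's DSE does special-case calloc, though which pass fires can vary by version): a store of zeros into freshly calloc'd memory is dead, because calloc already returned zeroed memory.

```c
#include <stdlib.h>
#include <string.h>

/* The memset here is a dead store: calloc guarantees zeroed memory,
   so an optimizer that models calloc can delete the memset entirely. */
int *zero_buf(size_t n) {
    int *p = calloc(n, sizeof *p);
    if (p)
        memset(p, 0, n * sizeof *p);   /* redundant: already zero */
    return p;
}
```

The heap "sticking point" is the flip side: code that deliberately reads uninitialized malloc'd memory is undefined behavior, and optimizers exploit that.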
Week 11
Discussions:
- Linker script: tells the linker where memory sections map on a dev board.
- Demanded bits: the bits you cannot prove are unnecessary.
- LLVM Known bits.
- Trace based JIT
- Precompiled headers
- Timeouts are a real problem, e.g., for the Z3 solver behind AWS IAM analysis.
- Flaky tests: tests that sometimes fail and sometimes succeed.
- LLVM IR verifier
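Known bits can be sketched as a pair of masks per value, in the style of LLVM's KnownBits (the struct and function names are mine): a known-zero mask, a known-one mask, and a transfer function per operation. AND is the simplest case.

```c
#include <stdint.h>

/* Known-bits sketch: `zero` has a 1 for each bit known to be 0,
   `one` has a 1 for each bit known to be 1 (the masks never overlap). */
typedef struct { uint32_t zero, one; } Known;

/* Transfer function for AND: a result bit is 1 only if known 1 in both
   inputs, and 0 if known 0 in either input. */
Known known_and(Known a, Known b) {
    Known r;
    r.one  = a.one & b.one;
    r.zero = a.zero | b.zero;
    return r;
}
```

For `x & 0xFF` with `x` fully unknown, the result's top 24 bits become known zero; demanded bits is the dual question asked backwards from the uses.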
Readings:
Week 12
Discussions:
- Bugpoint was used in LLVM for reducing test cases. By default, its reductions can introduce instructions whose results are undefined behavior.
- Trusting trust attacks.
- Diverse double-compiling, a defense against trusting-trust attacks.
Readings:
Week 13
Discussions:
- Undefined behavior also exists in hardware chips.
- Register pairs for multiplication.
- undef breaks SSA.
- Trusting trust is really worth reading!
- Java type concurrency
- Type system is trying to help you not break the program.
- Union find
- HCI for expressing types
- bitblt (bit block transfer)
How to implement fast inverse square root on GCC and Clang (ways to type-pun the float’s bits):
- union
- memcpy (fast: compiles to a plain move)
- char *
- (C++) reinterpret_cast
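The memcpy option from the list above, sketched (my code, using the well-known Quake III constant): memcpy is fully defined C for the type pun, and both GCC and Clang compile the fixed-size copies down to register moves.

```c
#include <string.h>
#include <stdint.h>

/* Fast inverse square root with the float<->int pun done via memcpy,
   which avoids the aliasing UB of pointer casts and costs nothing:
   compilers lower these memcpys to plain moves. */
float fast_rsqrt(float x) {
    uint32_t i;
    float y;
    memcpy(&i, &x, sizeof i);            /* read the float's bit pattern */
    i = 0x5f3759df - (i >> 1);           /* magic initial estimate */
    memcpy(&y, &i, sizeof y);            /* back to float */
    y = y * (1.5f - 0.5f * x * y * y);   /* one Newton-Raphson step */
    return y;
}
```

The union approach is also defined in C (though not in C++); the plain pointer cast is a strict-aliasing violation in both languages.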
Readings:
Week 14
Discussions:
- SMT solvers: provide them constraints to solve. If a solution exists, the solver returns one (a model).
- Practice: a Z3 sudoku solver (Int vs. BitVector encodings).
- Using a solver for an unbounded loop is hard because we cannot unroll it.
- People compile languages to JavaScript -> asm.js -> WebAssembly (a stack machine).
- Baseline compiler
- Everyone needs to know at least one language in System Programming, Scripting, Parallel Programming, Math.
- Autovectorization can significantly improve performance for large legacy codebases.
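The "stack machine" remark about WebAssembly can be illustrated with a minimal evaluator (opcodes and `eval` are my invention, not Wasm's actual binary format): operands are pushed, operators pop their inputs and push the result.

```c
/* Minimal stack-machine evaluator, the execution model WebAssembly uses:
   no registers in the bytecode, just an operand stack. */
enum { PUSH, ADD, MUL, END };

int eval(const int *code) {
    int stack[64];
    int sp = 0;                               /* next free stack slot */
    for (;;) {
        switch (*code++) {
        case PUSH: stack[sp++] = *code++; break;
        case ADD:  sp--; stack[sp-1] += stack[sp]; break;
        case MUL:  sp--; stack[sp-1] *= stack[sp]; break;
        case END:  return stack[sp-1];
        }
    }
}
```

A baseline compiler for such bytecode maps stack slots to registers one opcode at a time, trading code quality for compile speed.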
Readings:
Week 15
Discussions:
- GPU compilers
- Which CPU compiler optimizations don’t work in GPU compilers? (GPUs have a limited number of registers per thread.)
- Inline
- Loop unrolling
- LICM
- InstCombine
- The divergence problem is critical: divergence matters more than latency.
- Rematerialization
Readings:
Week 16
Discussions:
- Compilation on the GPU
- Go’s generic rewrite rules compared with LLVM’s large C++ code base.
- Go’s generic rules should be checked by something like Z3.
- e-graphs are good, but probably not practical to adopt in existing compilers like LLVM or GCC; applying rewrites might blow up memory usage.
- Quine is fun.
Readings: