Review: The death of optimizing compilers
Original talk slides by D.J. Bernstein
(на русском; 2020-12-23)
Can't say that I agree with this talk completely, but it inspires thought. Original formatting is rather peculiar and, in my opinion, unreadable. Maybe presentation was originally in text format before translation to PDF? However, the original is definitely worth reading.
Primary points in my paraphrase.
Software code can be divided by 2 categories:
1. Small quantity of “hot” code which is executed repeatedly through almost all run-time of a program.
2. Much larger quantity of “cold” code, which is executed rarely (including, for example, at initialization — the start of execution — time) and doesn't affect overall performance of the software.
Nobody should really care about the performance of “cold” code. A lot of code today either is written in interpreted languages (Python) or is executed inside a web-browser with a huge overhead (and most people don't even notice this).
“Hot” code should be optimized manually by the algorithm designers, because the optimization requires the deep knowledge of both the algorithm (roughly speaking, what properties can be exploited for the sake of performance) and the target hardware platform. Note that Intel CPU instruction set manual spans 5 volumes (2a-2d) and 2.2 thousands pages.
Optimizing compilers are very complex programs which may introduce errors in optimized code. The set of optimizing stages influences build times and, for example, the convenience of debugging. However, optimizing the “cold” code is useless. And for the “hot” code compiler can't even come close to the performance gains achievable by the specialist. The methods which the specialist can employ include caching, changing the sequence of computation and formulae. Compiler can perform only semantics-preserving transformations. And it definitely can't change the program in a way the specialist can.
The conclusion is: further development of even more complicated optimizing compilers essentially makes no sense.
(In the final part of the talk the author cites Knuth about the perspectives of interactive program transformation systems, as one of prospective development areas. But in my opinion, the former thesis is more interesting.)
Speaking from my personal experience I can make one critical point. Thesis of “hot” code is valid mostly for a standalone piece of software or a software library. Bernstein himself is known as a cryptographer, and the thesis is definitely valid in cryptography. However, when we look at not a standalone piece of software but at a software system, especially distributed one, for example, on a traditional three-tier system “client” (nowadays usually web-browser) — server — database (not even speaking about microservices), the problem becomes much more compilcated.
In a mature if maintainers spend some time on profiling and optimizing bottlenecks, eventually the profile for each component flattens out. Usually the largest share of CPU time in such systems is spent on the interaction between components, with latency and the synchronization overhead contributing the most. On the other hand, indeed, the complex optimizing compilers will not improve the performance in this case significantly. Most performance gains for such systems are only possible with semantics-breaking changes. We can say that even this example confirms an unobvious (and contested on the wrong level) thesis of excellence of a human over the compiler at places where performance is essential.
To continue this topic closer to practice I recommend looking at LuaJIT compiler , a bit more complex Terra compiler  and the PhD thesis of Maximillian Bolingbroke on supercompilation in GHC Haskell compiler . LuaJIT  has for a long time demonstrated a near-optimal implementation approach for a dynamic programming language. With this implementation the programs written in Lua demonstrate performance close to the programs written in C programming language. I can't provide any valid benchmark, but from not-so-valid ones it looks that LuaJIT is still faster than V8 ECMAScript JIT-compiler powering Chromium web-browser and NodeJS application platform. Therefore, LuaJIT, measured at about 90 000 lines of code, reflects the state of the art from the optimizing compilers perspective.
The Terra compiler  is implemented on top of LuaJIT. On the other hand, it materializes Knuth's vision of program transformation system. In case of Terra compiler a typed subset of Lua language is transformed to LLVM bitcode and optimized the pipeline powering Clang C compiler. The remaining part of the typical program in Terra is written in Lua language in its LuaJIT implementation, stepping in as smart hybrid of preprocessor, linker and the glue language. In paper  the authors use Terra compiler to create auto-tuned high-performance image-processing kernels which perform better than their C versions on CPU and can be further translated to FPGA or ASIC. Another example of auto-tuning from my colleague Vladimir Roganov is an s-call (scaled-call or super-call) extension for Fortran language which was used for effecient solution of sparse linear system for nuclear power plant simulation in our paper . At the initialization stage the simulator runs benchmarks of a several libraries solving the sparse linear system on the distributed system. After the benchmarks the simulator automatically (or maybe automatized — with possible human intervention?) selects the best implementation for the specific platform on which the simulator is executed.
Finally, Maximillian Bolingbroke in his PhD thesis  shows that supercompilation can be essentially the only optimization in a compiler. We discussed something like this with Andrey Bannikov around our third year studying at the Faculty of Mechanics and Mathematics. His idea was that it would be cool to implement a compiler with only a single optimization “to rule them all” — but at the time our scope was not wide enough to include such an arcane topic. And nowadays starting from a supercompilation, possibly supplemented by some simple rule-based transformations on the output and rule-based pessimizations on the input looks like a feasible way for a new, ground-up compiler.
I will close this note with the following speculation: is it possible to make a maintainable implementation of Lua/LuaJIT in a primitive (i.e. trivially-implemented and close to the metal) language like Forth/machineForth/colorForth, and how will the performance of such implementation relate to LuaJIT? Of course, for long-running programs the tracing compiler will probably outperform less sophisticated Forth-based compiler. On the other hand, in the latter case there will be significantly less code, which is much better for the cache.
References (all to HTTP-resources):
Almost-classical notes on LuaJIT implementation:
[1a] Mike Pall. LuaJIT's interpreter is fast, because:…
[1b] Mike Pall. LuaJIT 2.0 intellectual property disclosure and research opportunities
 Terra compiler
 Darkroom: Compiling High-Level Image Processing Code into Hardware Pipelines
The following links on the page of the paper with short English abstract, but not-so-far at the web-site there is a full-text PDF (in Russian):
 Parallelization of Three-Dimensional Thermal-Hydraulic Best-Estimate Code “BAGIRA” for Simulation of Multiphase Flow as Part of Full-Scale Supercomputer Simulator “Virtual Nuclear Power Plant” / V. Vasenin, M. Krivchikov, A. Kroshilin et al. // Programmnaya Ingeneriya (Software Engineering). — 2012. — № 6. — P. 15–23.
There is also a paper in English:
[4a] Vasenin, V. A., Krivchikov, M. A., Kroshilin, V. E., Kroshilin, A. E., & Roganov, V. A. (2012). Scalable three-dimensional thermal-hydraulic best-estimate code BAGIRA. United States: American Nuclear Society - ANS.
 Max Bolingbroke. Call-by-need supercompilation
-- CC-BY-SA: maxxk (Maxim Krivchikov)