spo600:compiler_optimizations
Differences
This shows you the differences between two versions of the page.
Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
spo600:compiler_optimizations [2024/05/24 12:20] – [Machine-Code Level Optimization] chris | spo600:compiler_optimizations [2025/02/06 03:44] (current) – [Examples of Common Optimizations] chris | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ===== Compiler Optimizations | ||
+ | //Compiler optimizations// | ||
+ | |||
+ | ===== Why It's Important to Understand Compiler Optimizations ===== | ||
+ | |||
+ | As programmers, | ||
+ | |||
+ | ===== GCC Optimization Options | ||
+ | |||
+ | GCC offers a large number of optimization options, most of which can be controlled in combination using the '' | ||
+ | |||
+ | The '' | ||
+ | |||
+ | To see the individual features enabled by a particular '' | ||
+ | |||
+ | '' | ||
+ | |||
+ | The GCC documentation -- both the man page and even more so the online manuals -- have good information about the 200+ optimization features of the GCC compilers. | ||
+ | |||
+ | ===== LLVM Optimization Options | ||
+ | |||
+ | LLVM, an alternative open source compiler project, can perform an extensive set of analysis and transform operations (optimizations). See the llvm documentation for details. | ||
+ | |||
+ | ===== Examples of Common Optimizations | ||
+ | |||
+ | Here are some common C optimizations. For each optimization, | ||
+ | |||
+ | Important notes: | ||
+ | * There are many other kinds of optimizations, | ||
+ | * Many of these optimizations cannot be performed when code has a side effect -- for example, the direction of a loop can sometimes be reversed without impact if the loop contains only calculations, | ||
+ | * To make decisions on whether to apply some specific optimizations, | ||
+ | * Many optimization decisions are based on a cost-benefit analysis. For example, optimizations such as loop unrolling will often increase performance at the cost of larger code size. In some cases, the usage pattern of the code, such as loop counts and procedure call frequency, is relevant to optimization decisions. [[Profile-Guided Optimization|Profile-Guided Optimization (PGO)]] can assist in making these determinations. | ||
+ | |||
+ | ==== Code Rewriting Optimizations | ||
+ | |||
+ | These optimizations involve rewriting code for performance or space. | ||
+ | |||
+ | **Although these examples are shown using rewritten source code, the compiler actually applies these optimizations to an intermediate representation of the program which does not closely resemble the original source.** | ||
+ | |||
+ | === Dead Code Elimination === | ||
+ | |||
+ | Some code cannot be //reached// in order to be executed. For example: | ||
+ | |||
+ | < | ||
+ | int y=(int) x * 2; | ||
+ | if (x % 2 == 0) { | ||
+ | | ||
+ | | ||
+ | |||
+ | This " | ||
+ | |||
+ | Another example of dead code may include stores performed to variables that are never read. | ||
+ | |||
+ | Dead code is intentionally used in some situations. For example, in code for embedded devices, there may be debug statements like this: | ||
+ | |||
+ | < | ||
+ | #define DEBUG 1 | ||
+ | #define DEBUG_PRINTF if (DEBUG) serial.printf | ||
+ | ... | ||
+ | DEBUG_PRINTF(" | ||
+ | ... | ||
+ | DEBUG_PRINTF(" | ||
+ | | ||
+ | Each DEBUG_PRINTF statement in the code will output messages to the diagnostic serial port on the device. There may be hundreds of these statements in the code. Once the software has been thoroughly tested, the macro definition of DEBUG will be changed to 0, and the compiler will eliminate all of the DEBUG_PRINTF code when the program is compiled. | ||
+ | |||
+ | === Strength Reduction | ||
+ | |||
+ | Certain operation are [[Expensive|expensive]] in terms of the processor time they consume. Replacing these expensive operations with cheaper (simpler, faster) operations is called //strength reduction// | ||
+ | |||
+ | This is a simple loop that counts to ten; each iteration displays the loop index multiplied by 6: | ||
+ | |||
+ | < | ||
+ | int x; | ||
+ | for (x=0; x < 10; x++) { | ||
+ | | ||
+ | | ||
+ | |||
+ | By adding six to the loop index instead of one after each iteration, the slow multiplication can be eliminated from the loop; on most CPUs, the addition takes the same amount of time as the increment in the code above: | ||
+ | |||
+ | < | ||
+ | int x; | ||
+ | for (x=0; x < 60; x+=6) { | ||
+ | | ||
+ | | ||
+ | |||
+ | === Hoisting | ||
+ | |||
+ | // | ||
+ | |||
+ | == Hoisting I - Loop-Invariant Variable | ||
+ | |||
+ | Here is a simple loop: | ||
+ | |||
+ | < | ||
+ | int t, x; | ||
+ | | ||
+ | t = readtemp(); /* current temp in farenheit */ | ||
+ | for (x = 0; x < 200; x++) { | ||
+ | c = (t-32)/ | ||
+ | | ||
+ | | ||
+ | |||
+ | The value of c is invariant within the loop, so the calculation of its value may be safely moved outside the loop: | ||
+ | |||
+ | < | ||
+ | int t, x; | ||
+ | | ||
+ | t = readtemp(); /* current temp in farenheit */ | ||
+ | c = (t-32)/ | ||
+ | for (x = 0; x < 200; x++) { | ||
+ | | ||
+ | | ||
+ | |||
+ | == Hoisting II - Loop-Invariant Expression | ||
+ | |||
+ | Here is a similar example: | ||
+ | |||
+ | < | ||
+ | int t, x; | ||
+ | t = readtemp(); /* current temp in farenheit */ | ||
+ | for (x = 0; x < 200; x++) { | ||
+ | | ||
+ | | ||
+ | |||
+ | In this case, there is no variable with a loop-invariant value, but the second argument to the function '' | ||
+ | |||
+ | < | ||
+ | int t, x; | ||
+ | | ||
+ | t = readtemp(); /* current temp in farenheit */ | ||
+ | c = (t-32)/ | ||
+ | for (x = 0; x < 200; x++) { | ||
+ | | ||
+ | | ||
+ | |||
+ | == Hoisting III - Loop-Invariant Expression in Loop Condition | ||
+ | |||
+ | In this example, the invariant expression is in the loop control condition: | ||
+ | |||
+ | < | ||
+ | int x, y; | ||
+ | y = foo(); | ||
+ | for (x=0; x < y*12; x++) { | ||
+ | | ||
+ | | ||
+ | |||
+ | Hoisting this out of the loop will reduce the number of multiplications performed: | ||
+ | |||
+ | < | ||
+ | int x, y, c; | ||
+ | y = foo(); | ||
+ | c = y * 12; | ||
+ | for (x=0; x < c; x++) { | ||
+ | | ||
+ | | ||
+ | |||
+ | === Pre-calculation of Constants | ||
+ | |||
+ | Where possible, the compiler will evaluate constant expressions at compile time rather than runtime: | ||
+ | |||
+ | < | ||
+ | ff = (212-32)/ | ||
+ | conv = c * ff + 32;</ | ||
+ | |||
+ | The factor '' | ||
+ | |||
+ | < | ||
+ | conv = c * 1.800 + 32;</ | ||
+ | |||
+ | === Loop Unswitching | ||
+ | |||
+ | It is sometimes possible to swap a condition inside a loop for loops inside a condition. | ||
+ | |||
+ | == Loop Unswitching I - Inner/Outer Swap == | ||
+ | |||
+ | Consider this code: | ||
+ | |||
+ | < | ||
+ | int foo(float ctl) { | ||
+ | int x; | ||
+ | for (x=0; x < 10000; x++) { | ||
+ | if (ctl == 0) { | ||
+ | | ||
+ | } else { | ||
+ | | ||
+ | } | ||
+ | } | ||
+ | | ||
+ | |||
+ | This code evaluates the condition '' | ||
+ | |||
+ | < | ||
+ | int foo(float ctl) { | ||
+ | int x; | ||
+ | if (ctl == 0) { | ||
+ | for (x=0; x < 10000; x++) { | ||
+ | | ||
+ | } | ||
+ | } else { | ||
+ | for (x=0; x < 10000; x++) { | ||
+ | | ||
+ | } | ||
+ | } | ||
+ | | ||
+ | |||
+ | == Loop Unswitching II - Inner/Outer Swap with Code Repetition | ||
+ | |||
+ | This example is similar. Here, the loop contents are partially-conditional: | ||
+ | |||
+ | < | ||
+ | int foo(float ctl) { | ||
+ | int x; | ||
+ | for (x=0; x < 10000; x++) { | ||
+ | | ||
+ | if (ctl == 0) { | ||
+ | | ||
+ | } | ||
+ | } | ||
+ | | ||
+ | |||
+ | Constructing two loops and placing them in a conditional structure removes the condition evaluation from inside the loop -- again, at the expense of size: | ||
+ | |||
+ | < | ||
+ | int foo(float ctl) { | ||
+ | int x; | ||
+ | if (ctl == 0) { | ||
+ | for (x=0; x < 10000; x++) { | ||
+ | | ||
+ | | ||
+ | } | ||
+ | } else { | ||
+ | for (x=0; x < 10000; x++) { | ||
+ | | ||
+ | } | ||
+ | } | ||
+ | | ||
+ | |||
+ | === Loop Splitting | ||
+ | |||
+ | Similar to loop unswitching, | ||
+ | |||
+ | < | ||
+ | int foo(float ctl) { | ||
+ | int x; | ||
+ | for (x=0; x < 10000; x++) { | ||
+ | if (x < 3700) { | ||
+ | | ||
+ | } else { | ||
+ | | ||
+ | } | ||
+ | } | ||
+ | | ||
+ | |||
+ | Becomes: | ||
+ | |||
+ | < | ||
+ | int foo(float ctl) { | ||
+ | int x; | ||
+ | for (x=0; x < 3700; x++) { | ||
+ | | ||
+ | } | ||
+ | for (; x < 10000; x++) { | ||
+ | | ||
+ | } | ||
+ | | ||
+ | |||
+ | === Loop Interchange | ||
+ | |||
+ | Swapping an inner and outer loop can sometimes improve performance. This code iterates through a two-dimensional array: | ||
+ | |||
+ | < | ||
+ | int x, y, max=10000; | ||
+ | long double a[[max]][max]; | ||
+ | for (y=0; y < max; y++) { | ||
+ | for (x=0; x<max; x++) { | ||
+ | | ||
+ | } | ||
+ | | ||
+ | |||
+ | If the data is stored in memory in row-major format (all of row 0 is followed by all of row 1), then the memory accesses in the loop above are spread across memory. This reduces the effectiveness of pre-fetching and caching memory. | ||
+ | |||
+ | Swapping the inner and outer loops achieves the same result, but causes the memory access to be sequential: | ||
+ | |||
+ | < | ||
+ | int x, y, max=10000; | ||
+ | long double a[[max]][max]; | ||
+ | for (x=0; x < max; x++) { | ||
+ | for (y=0; y<max; y++) { | ||
+ | | ||
+ | } | ||
+ | | ||
+ | |||
+ | === Loop Unrolling | ||
+ | |||
+ | A loop can be unrolled so that the inner portion of the loop contains multiple copies of the code (corresponding to multiple iterations of the loop). This takes additional space, but reduces the number of times that the loop control condition is evaluated. | ||
+ | |||
+ | == Loop Unrolling I - Guaranteed-Multiple Iterations | ||
+ | |||
+ | This loop is guaranteed to execute an even number of times: | ||
+ | |||
+ | < | ||
+ | int x, y; | ||
+ | | ||
+ | for (x = 0; x < y*2; x++) { | ||
+ | | ||
+ | | ||
+ | |||
+ | It can be unrolled so that the loop expression is evaluated only every second time that the loop contents are executed: | ||
+ | |||
+ | < | ||
+ | int x, y; | ||
+ | | ||
+ | for (x = 0; x < y*2; ) { | ||
+ | | ||
+ | | ||
+ | | ||
+ | |||
+ | == Loop Unrolling II - Pairs-of-Iterations plus a Conditional Extra Iteration | ||
+ | |||
+ | The unrolling performed in the previous example can be performed even when the number of iterations is not guaranteed to be even, if you add an additional conditionally-executed iteration after the loop. | ||
+ | |||
+ | Consider this code: | ||
+ | |||
+ | < | ||
+ | int x, y; | ||
+ | | ||
+ | for (x=0; x < y; x++) { | ||
+ | | ||
+ | | ||
+ | |||
+ | If we unroll the loop so that it executes an even number of times, we can use an if statement to conditionally execute the loop contents one extra time if an odd number of iterations are required: | ||
+ | |||
+ | < | ||
+ | int x, y; | ||
+ | | ||
+ | for (x=0; x < y-1; ) { | ||
+ | | ||
+ | | ||
+ | } | ||
+ | if (x < y ) { | ||
+ | | ||
+ | | ||
+ | |||
+ | == Loop Unrolling III - Large Number of Iterations | ||
+ | |||
+ | You can extend this principle to a larger number of unrolled iterations: | ||
+ | |||
+ | < | ||
+ | int x, y; | ||
+ | | ||
+ | for (x=0; x < y; x++) { | ||
+ | | ||
+ | | ||
+ | |||
+ | This code evaluates the loop control condition only one-tenth as often as the original loop, up to the largest multiple of ten iterations that is less than the total iteration count: | ||
+ | |||
+ | < | ||
+ | int x, y; | ||
+ | | ||
+ | for (x=0; x < y-10; ) { | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | } | ||
+ | for (; x < y; ) { | ||
+ | | ||
+ | | ||
+ | |||
+ | Loop unrolling is often beneficial as long as the unrolled loop fits into cache; unrolling past that point will reduce performance. | ||
+ | |||
+ | === Inlining | ||
+ | |||
+ | // | ||
+ | |||
+ | < | ||
+ | int foo(int a, int b) { | ||
+ | | ||
+ | } | ||
+ | int x, i; | ||
+ | long ttl=0; | ||
+ | for (x = 0; x < 10000; x++) { | ||
+ | i = rand(); | ||
+ | ttl += foo(x, i); | ||
+ | | ||
+ | |||
+ | Using inlining, the calls to the function '' | ||
+ | |||
+ | < | ||
+ | int x, i; | ||
+ | long ttl=0; | ||
+ | for (x = 0; x < 10000; x++) { | ||
+ | i = rand(); | ||
+ | ttl += x + i*2; | ||
+ | | ||
+ | |||
+ | Exercise: Can you see an additional optimization (besides loop unrolling) that could be performed in the example above? | ||
+ | |||
+ | === Common Subexpression Elimination | ||
+ | |||
+ | Evaluating and saving the result for a common subexpression can reduce execution time. Consider this expression: | ||
+ | |||
+ | < | ||
+ | x = (a * c) - (a * c) / 12</ | ||
+ | |||
+ | The subexpression '' | ||
+ | |||
+ | < | ||
+ | t = (a * c) | ||
+ | x = t - t/ | ||
+ | |||
+ | === Jump Threading | ||
+ | |||
+ | **Jump threading** involves eliminating unnecessary condition evaluation by recognizing conditions that are in alternation. | ||
+ | |||
+ | < | ||
+ | int a, b; | ||
+ | | ||
+ | b=!a; | ||
+ | if (a) { | ||
+ | | ||
+ | } | ||
+ | if (b) { | ||
+ | | ||
+ | | ||
+ | |||
+ | The variable '' | ||
+ | |||
+ | < | ||
+ | int a, b; | ||
+ | | ||
+ | if (a) { | ||
+ | | ||
+ | } else { | ||
+ | | ||
+ | | ||
+ | |||
+ | === Short-Circuit Evaluation | ||
+ | |||
+ | == Short Circuit I - AND == | ||
+ | |||
+ | When evaluating a condition with a logical AND (&& | ||
+ | |||
+ | < | ||
+ | if (a * b > c && d > e) { | ||
+ | | ||
+ | | ||
+ | |||
+ | You could rewrite this as: | ||
+ | |||
+ | < | ||
+ | if (a * b > c) { | ||
+ | if (d > e) { | ||
+ | | ||
+ | } | ||
+ | | ||
+ | |||
+ | == Short Circuit II - OR == | ||
+ | |||
+ | In the case of a logical OR (||), the short-circuit permits you to skip evaluation of the second half of the condition if the first half is true. | ||
+ | |||
+ | < | ||
+ | if (a * b > c || d > e ) { | ||
+ | | ||
+ | | ||
+ | |||
+ | Could be rewritten as: | ||
+ | |||
+ | < | ||
+ | // This makes more sense in assembler than C! | ||
+ | if (a * b > c) { | ||
+ | | ||
+ | } elseif (d > e) { | ||
+ | | ||
+ | | ||
+ | |||
+ | === Test Reordering | ||
+ | |||
+ | In a short-circuit expression, you can perform a partial strength reduction by placing the least expensive expression first: | ||
+ | |||
+ | < | ||
+ | if (a * b > c && d > e) { | ||
+ | | ||
+ | | ||
+ | |||
+ | When rewriting this in short-circuit form, the expression '' | ||
+ | |||
+ | < | ||
+ | if (d > e) { | ||
+ | if (a * b > c) { | ||
+ | | ||
+ | } | ||
+ | | ||
+ | |||
+ | === Dead Store Elimination | ||
+ | |||
+ | A //dead store// occurs when a value is stored to a variable and never read again, often because the value is overwritten before being read. | ||
+ | |||
+ | === Dead Store Elimination I - Simple Case === | ||
+ | |||
+ | Here is a simple dead store example: | ||
+ | |||
+ | < | ||
+ | a = b * c + d / f; | ||
+ | a = c / f * 60;</ | ||
+ | |||
+ | This looks stupid, but is actually fairly common in programs, especially if there are many lines of code between the two assignments. | ||
+ | |||
+ | This can be written as: | ||
+ | |||
+ | < | ||
+ | a = c / f * 60;</ | ||
+ | |||
+ | == Dead Store Elimination II - Unused Initialized Value in Declaration | ||
+ | |||
+ | A common source of dead stores is unused initialized values from declarations: | ||
+ | |||
+ | < | ||
+ | int a = 0; | ||
+ | /* ...possibly many lines later, before a is read: */ | ||
+ | a = foo();</ | ||
+ | |||
+ | Eliminating the dead store saves a memory access: | ||
+ | |||
+ | < | ||
+ | int a; | ||
+ | a = foo();</ | ||
+ | |||
+ | == Dead Store Elimination III - Special Case == | ||
+ | |||
+ | Some dead stores are the result of overriding a variable value in a special case: | ||
+ | |||
+ | < | ||
+ | int a; | ||
+ | a = b * c + d / f; /* normal/ | ||
+ | if (foo()) { | ||
+ | a = c / f * 60; /* special case! */ | ||
+ | | ||
+ | |||
+ | This can be written so that only one store is ever performed: | ||
+ | |||
+ | < | ||
+ | int a; | ||
+ | if (foo()) { | ||
+ | a = c / f * 60; | ||
+ | } else { | ||
+ | a = b * c + d / f; | ||
+ | | ||
+ | |||
+ | ==== Machine-Code Level Optimization | ||
+ | |||
+ | In addition to code rewriting, there are optimizations that can be performed at the machine language level -- the actual instructions that are emitted by the compiler and the arrangement of these instructions in memory. Here are some examples: | ||
+ | |||
+ | === Block Rearrangement | ||
+ | |||
+ | In a block of code such as this: | ||
+ | |||
+ | < | ||
+ | if (condition) { | ||
+ | | ||
+ | } | ||
+ | following code</ | ||
+ | |||
+ | The compiler usually inverts the // | ||
+ | |||
+ | There are many cases where the // | ||
+ | |||
+ | In a normal cache-load and prefetch pattern, the code for the conditional //action// is loaded into cache and the early stages of the processor' | ||
+ | |||
+ | In order to do this, the compiler needs to know that the // | ||
+ | |||
+ | (Note that in the Linux Kernel, the '' | ||
+ | |||
+ | === Instruction Selection | ||
+ | |||
+ | There are many optimizations that occur at the object code level. Selecting an instruction that provides the same result as another instruction but which takes less time to execute, called // | ||
+ | |||
+ | In order to perform instruction selection, the compiler relies upon a //Cost Model//, which contain detailed information about the relative execution cost of each instruction which the target can execute. Execution costs can vary between different implementations of an architecture (e.g., cpu models) due to varying numbers and efficiencies of execution units, pipeline size, cache, memory interface details, and so forth, so most compilers accepts both a target architecture and a specific chip model (for example, '' | ||
+ | |||
+ | The x86 instruction set in particular has some quirks that the chip designers have maintained over the years, including [[https:// | ||
+ | |||
+ | ===== Debugging with Optimizations Enabled | ||
+ | |||
+ | Using low-level [[Debugger|debugging]] tools with code that has been highly optimized can be very challenging, | ||
+ | |||
+ | The gcc option '' | ||
+ | |||
+ | ===== Resources | ||
+ | |||
+ | * GCC | ||
+ | * [[http:// | ||
+ | * [[https:// | ||
+ | * LLVM (Clang) | ||
+ | * [[http:// | ||
+ | * [[https:// |