Loop Optimization
For loops with large numbers of iterations, finding ways to optimize them is essential. This example demonstrates pointer array-accessing, loop unrolling, and power-of-two exploitation.
Test Machine:
Operating System: |
Microsoft Windows XP Professional |
|
System Model: |
HP xw4400 Workstation |
|
Processor (x2): |
x86 Family 6 Model 15 Stepping 6 GenuineIntel ~2400 Mhz |
|
The Results:
The program was run with an outer loop that iterated 200,000 times and an inner loop that iterated 9,973 times for each iteration of the outer loop. The inner loop (with iterator j ) was optimized, while the outer loop (with iterator i ) was not. Processing time was measured in millions of clock cycles. A measurement was taken immediately preceding the outer loop and immediately following, then the two measurements were compared to calculate the actual processing time. Each method was tested in 20 separate trials, both with compiler optimization turned on as well as off. Interestingly, my optimizations even improved upon the compiler-optimized loop. The average time for each is shown below.
Compiler Optimization Turned Off | |
Normal: |
11,385 |
Optimized: |
4807 |
Compiler Optimization Turned On | |
Normal: |
2382 |
Optimized: |
2339 |
The Code:
The following code is the program's main loop. The array is first filled with random values, then the clock cycle count is obtained. Inside the outer loop (with iterator i ), we see our two test cases. The unimproved loop appears first, with the optimized loop below it. Immediately after the inner loop is a checksum routine that ensures the accuracy of the test. Finally, the clock cycle count is obtained again to calculate the actual processing time. To download the entire source code, click here.
/* Initialize the array with random values. */
srand(time(NULL));
for (j=0; j<ARRAY_SIZE; j++) {
x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
array[j] = x;
checksum += x;
}
/* Get our start time */
long long start = GetMachineCycleCount();
for (i = 0; i < N_TIMES; i++) {
#if !OPTIMIZED
int j;
for (j=0; j < ARRAY_SIZE; j+=2) {
sum += array[j] + array[j+1];
}
#else
int j;
/* Create a pointer to the array to avoid unnecessary adds when indexing the array.
* Example:
* Without pointer: n add instructions
* array[j] + array[j+1] + array[j+2] + ... + array[j+n];
*
* With pointer: 1 add instruction
* pointr[0] + pointr[1] + pointr[2] + ... + pointr[n];
* pointr+=n;
*/
int *p = array;
for (j=0; j < ARRAY_SIZE; ++j) {
/* Unroll the loop in large, power-of-two chunks. Check remaining size each iteration
* to taper it off at the end. */
if ((ARRAY_SIZE - j)%32 == 0) {
sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + p[8] + p[9] + p[10] +
p[11] + p[12] + p[13] + p[14] + p[15] + p[16] + p[17] + p[18] + p[19] + p[20] + p[21] +
p[22] + p[23] + p[24] + p[25] + p[26] + p[27] + p[28] + p[29] + p[30] + p[31];
j+=31;
p+=32;
}
else if ((ARRAY_SIZE - j)%16 == 0) {
sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + p[8] + p[9] + p[10] +
p[11] + p[12] + p[13] + p[14] + p[15];
j+=15;
p+=16;
}
else if ((ARRAY_SIZE - j)%8 == 0) {
sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7];
j+=7;
p+=8;
}
else if ((ARRAY_SIZE - j)%4 == 0) {
sum += p[0] + p[1] + p[2] + p[3];
j+=3;
p+=4;
}
else if ((ARRAY_SIZE - j)%2 == 0) {
sum += p[0] + p[1];
++j;
p+=2;
}
else
{
sum+= p[0];
++p;
}
}
#endif
/* Check each iteration */
if (sum != checksum) {
printf("Checksum error!\n");
}
sum = 0;
}
/* Get our end time */
long long end = GetMachineCycleCount();
printf("Cycle Count: %lld M\n", (end-start)/1000000);