Loop Optimization

For loops with large numbers of iterations, finding ways to optimize them is essential. This example demonstrates pointer array-accessing, loop unrolling, and power-of-two exploitation.

Test Machine:

Operating System:

Microsoft Windows XP Professional

System Model:

HP xw4400 Workstation

Processor (x2):

x86 Family 6 Model 15 Stepping 6 GenuineIntel ~2400 Mhz

The Results:

The program was run with an outer loop that iterated 200,000 times and an inner loop that iterated 9,973 times for each iteration of the outer loop. The inner loop (with iterator j ) was optimized, while the outer loop (with iterator i ) was not. Processing time was measured in millions of clock cycles. A measurement was taken immediately preceding the outer loop and immediately following, then the two measurements were compared to calculate the actual processing time. Each method was tested in 20 separate trials, both with compiler optimization turned on as well as off. Interestingly, my optimizations even improved upon the compiler-optimized loop. The average time for each is shown below.

Compiler Optimization Turned Off

Normal:

11,385

Optimized:

4807

Compiler Optimization Turned On

Normal:

2382

Optimized:

2339

The Code:

The following code is the program's main loop. The array is first filled with random values, then the clock cycle count is obtained. Inside the outer loop (with iterator i ), we see our two test cases. The unimproved loop appears first, with the optimized loop below it. Immediately after the inner loop is a checksum routine that ensures the accuracy of the test. Finally, the clock cycle count is obtained again to calculate the actual processing time. To download the entire source code, click here.

/* Initialize the array with random values. */
    srand(time(NULL));
    for (j=0; j<ARRAY_SIZE; j++) {
      x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
      array[j] = x;
      checksum += x;
    }

	/* Get our start time */
    long long start = GetMachineCycleCount();
    
	for (i = 0; i < N_TIMES; i++) {


#if !OPTIMIZED

      int j;
      for (j=0; j < ARRAY_SIZE; j+=2) {
        sum += array[j] + array[j+1];
      }
      
#else

	  int j;
	  /* Create a pointer to the array to avoid unnecessary adds when indexing the array.
	   * Example:
	   *	Without pointer:	n add instructions
	   *					array[j] + array[j+1] + array[j+2] + ... + array[j+n];
	   *
	   *	With pointer:		1 add instruction
	   *					pointr[0] + pointr[1] + pointr[2] + ... + pointr[n];
	   *					pointr+=n;
	   */
	  int *p = array;
      for (j=0; j < ARRAY_SIZE; ++j) {
		  /* Unroll the loop in large, power-of-two chunks.  Check remaining size each iteration
		   * to taper it off at the end. */
		if ((ARRAY_SIZE - j)%32 == 0) {
			sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + p[8] + p[9] + p[10] +
                   p[11] + p[12] + p[13] + p[14] + p[15] + p[16] + p[17] + p[18] + p[19] + p[20] + p[21] +
                   p[22] + p[23] + p[24] + p[25] + p[26] + p[27] + p[28] + p[29] + p[30] + p[31];
			j+=31;
			p+=32;
		  }
		else if ((ARRAY_SIZE - j)%16 == 0) {
			sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + p[8] + p[9] + p[10] +
            p[11] + p[12] + p[13] + p[14] + p[15];
			j+=15;
			p+=16;
		  }
		else if ((ARRAY_SIZE - j)%8 == 0) {
			sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7];
			j+=7;
			p+=8;
		  }
		else if ((ARRAY_SIZE - j)%4 == 0) {
			sum += p[0] + p[1] + p[2] + p[3];
			j+=3;
			p+=4;
		  }
		else if ((ARRAY_SIZE - j)%2 == 0) {
			sum += p[0] + p[1];
			++j;
			p+=2;
		  }
		else
		{
			sum+= p[0];
			++p;
		}
      }

#endif

      /* Check each iteration */ 
      if (sum != checksum) {
        printf("Checksum error!\n");
      }
      sum = 0;
    }

	/* Get our end time	*/
    long long end = GetMachineCycleCount();

    printf("Cycle Count: %lld M\n", (end-start)/1000000);