
Component bsls_performancehint
[Package bsls]

Provide performance hints for code optimization. More...

Namespaces

namespace  bsls

Defines

#define BSLS_PERFORMANCEHINT_PREDICT_LIKELY(expr)   (expr)
#define BSLS_PERFORMANCEHINT_PREDICT_UNLIKELY(expr)   (expr)
#define BSLS_PERFORMANCEHINT_PREDICT_EXPECT(expr, value)   (expr)
#define BSLS_PERFORMANCEHINT_ATTRIBUTE_COLD
#define BSLS_PERFORMANCEHINT_UNLIKELY_HINT

Detailed Description

Outline
Purpose:
Provide performance hints for code optimization.
Classes:
bsls::PerformanceHint namespace for performance optimization hints
Macros:
BSLS_PERFORMANCEHINT_PREDICT_LIKELY(X) X probably evaluates to non-zero
BSLS_PERFORMANCEHINT_PREDICT_UNLIKELY(X) X probably evaluates to zero
BSLS_PERFORMANCEHINT_PREDICT_EXPECT(X, Y) X probably evaluates to Y
BSLS_PERFORMANCEHINT_UNLIKELY_HINT annotate block unlikely to be taken
BSLS_PERFORMANCEHINT_OPTIMIZATION_FENCE prevent compiler optimizations
Description:
This component provides performance hints for the compiler or hardware. There are currently two types of hints that are supported:
  • branch prediction
  • data cache prefetching
Branch Prediction:
The three macros provided, BSLS_PERFORMANCEHINT_PREDICT_LIKELY, BSLS_PERFORMANCEHINT_PREDICT_UNLIKELY, and BSLS_PERFORMANCEHINT_PREDICT_EXPECT, can be used to optimize compiler-generated code for branch prediction. When given these hints in optimized mode (i.e., with BDE_BUILD_TARGET_OPT defined), the compiler will rearrange the assembly instructions it generates to minimize the number of jumps needed.
The following describes the macros provided by this component:
                Macro Name                          Description of Macro
 ----------------------------------------       -----------------------------
 BSLS_PERFORMANCEHINT_PREDICT_LIKELY(expr)      Hint to the compiler that the
                                                specified *integral* 'expr'
                                                expression is likely to
                                                evaluate to non-zero.
                                                Returns 'true' or 'false'
                                                depending on the result of
                                                the expression.

 BSLS_PERFORMANCEHINT_PREDICT_UNLIKELY(expr)    Hint to the compiler that the
                                                specified *integral* 'expr'
                                                expression is likely to
                                                evaluate to zero.  Returns
                                                'true' or 'false' depending
                                                on the result of the
                                                expression.

 BSLS_PERFORMANCEHINT_PREDICT_EXPECT(expr, value)
                                                Hint to the compiler that the
                                                specified *integral* 'expr'
                                                expression is likely to
                                                evaluate to the specified
                                                'value'.  Returns the result
                                                of the expression.

 BSLS_PERFORMANCEHINT_UNLIKELY_HINT             Hint to the compiler that the
                                                block containing this hint is
                                                unlikely to be taken.  Use it
                                                in conjunction with the
                                                'PREDICT_UNLIKELY' clause for
                                                maximum portability.
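Such hints are commonly implemented in terms of compiler builtins such as GCC's __builtin_expect. The following is a minimal sketch of one such mapping, assuming a GCC-compatible compiler and using hypothetical MY_* macro names; it is not the actual BDE implementation:
  #if defined(__GNUC__)
      // '!!' normalizes any non-zero value to 1 before it is compared with
      // the expected value.
      #define MY_PREDICT_LIKELY(expr)         __builtin_expect(!!(expr), 1)
      #define MY_PREDICT_UNLIKELY(expr)       __builtin_expect(!!(expr), 0)
      #define MY_PREDICT_EXPECT(expr, value)  __builtin_expect((expr), (value))
  #else
      // Fall back to the bare expression on compilers without the builtin.
      #define MY_PREDICT_LIKELY(expr)         (expr)
      #define MY_PREDICT_UNLIKELY(expr)       (expr)
      #define MY_PREDICT_EXPECT(expr, value)  (expr)
  #endif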
Warning:
Please use the macros provided in this component with caution. Always profile your code to get an idea of actual usage before attempting to optimize with these macros. Furthermore, these macros are merely hints to the compiler. Whether or not they will have visible effect on performance is not guaranteed. Note that one can perform similar optimization with a profile-based compilation. When compiled with the proper options, the compiler can collect usage information of the code, and such information can then be passed back to recompile the code in a more optimized form. Please refer to the compiler manual for more information.
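For example, a typical profile-guided build with gcc collects a profile from an instrumented run and then recompiles using that profile (a rough sketch; the file name and input are placeholders, and the exact options vary by compiler and version):
  $ g++ -O2 -fprofile-generate prog.cpp -o prog   # build instrumented binary
  $ ./prog typical_input                          # run to collect profile data
  $ g++ -O2 -fprofile-use prog.cpp -o prog        # recompile using the profile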
Limitations:
There is a bug in gcc 4.2, 4.3, and 4.4 such that when using the branch prediction macros with multiple conditions, the generated code might not be properly optimized. For example:
  if (BSLS_PERFORMANCEHINT_PREDICT_LIKELY(a && b)) {
      // ...
  }
The work-around is simply to split the conditions, as shown in the sketch below. This applies to all of the "likely", "unlikely", and "expect" macros defined in this component. Note that a bug report has been filed:
  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42233
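For example, the compound condition above can be rewritten as two separately hinted tests (a minimal sketch reusing the a and b variables from the snippet above):
  if (BSLS_PERFORMANCEHINT_PREDICT_LIKELY(a)
      && BSLS_PERFORMANCEHINT_PREDICT_LIKELY(b)) {
      // ...
  }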
Data Cache Prefetching:
The two functions provided in the bsls::PerformanceHint struct are prefetchForReading and prefetchForWriting. These functions cause the compiler to generate prefetch instructions that bring one cache line worth of data at the specified address into the cache, minimizing processor stalls.
       Function Name                       Description of Function
  ------------------------         ------------------------------------------
  prefetchForReading(address)      Prefetches one cache line worth of data at
                                   the specified 'address' for reading.

  prefetchForWriting(address)      Prefetches one cache line worth of data at
                                   the specified 'address' for writing.
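On GCC-compatible compilers, prefetch hints of this kind are commonly expressed with the __builtin_prefetch intrinsic. The following is a minimal sketch of such wrappers using hypothetical names; it is not the actual BDE implementation:
  inline void myPrefetchForReading(const void *address)
      // Prefetch the cache line containing the specified 'address' for
      // reading.
  {
  #if defined(__GNUC__)
      __builtin_prefetch(address, 0);   // second argument 0 means "read"
  #else
      (void)address;                    // no-op without compiler support
  #endif
  }

  inline void myPrefetchForWriting(void *address)
      // Prefetch the cache line containing the specified 'address' for
      // writing.
  {
  #if defined(__GNUC__)
      __builtin_prefetch(address, 1);   // second argument 1 means "write"
  #else
      (void)address;                    // no-op without compiler support
  #endif
  }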
Warning:
These functions must be used with caution. Inappropriate use of these functions degrades performance. Note that there should be sufficient time for the prefetch instruction to finish before the specified address is accessed, otherwise prefetching will be pointless. A profiler should be used to understand the program's behavior before attempting to optimize with these functions.
Optimization Fence:
The macro BSLS_PERFORMANCEHINT_OPTIMIZATION_FENCE prevents some compiler optimizations, particularly compiler instruction reordering. This fence does not map to a CPU instruction and has no impact on processor instruction re-ordering, and therefore should not be used to synchronize memory between threads. The fence may be useful in unusual contexts, like performing benchmarks, or working around bugs identified in the compiler's optimizer.
Warning:
This macro should be used with caution. The macro will generally decrease the performance of code on which it is applied, and is not implemented on all platforms.
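On GCC-compatible compilers, a compiler-only fence is commonly written as an empty asm statement with a "memory" clobber. This is a minimal sketch of that general technique (hypothetical macro name; not necessarily how BSLS_PERFORMANCEHINT_OPTIMIZATION_FENCE is implemented):
  // Tell the compiler that memory may be read or written here, preventing it
  // from moving loads and stores across this point; no CPU instruction is
  // emitted.
  #define MY_OPTIMIZATION_FENCE  __asm__ __volatile__("" ::: "memory")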
Usage:
The following series of examples illustrates use of the macros and functions provided by this component.
Example 1: Using the Branch Prediction Macros:
The following demonstrates the use of BSLS_PERFORMANCEHINT_PREDICT_LIKELY and BSLS_PERFORMANCEHINT_PREDICT_UNLIKELY to generate more efficient assembly instructions. Note the use of BSLS_PERFORMANCEHINT_UNLIKELY_HINT inside the unlikely else branch for maximum portability.
  // Includes added so that the example is self-contained.
  #include <bsls_performancehint.h>

  #include <cstdlib>    // 'std::atoi', 'std::rand'

  volatile int global;

  void foo()
  {
      global = 1;
  }

  void bar()
  {
      global = 2;
  }

  int main(int argc, char **argv)
  {
      argc = std::atoi(argv[1]);

      for (int x = 0; x < argc; ++x) {
          int y = std::rand() % 10;

          // Correct usage of 'BSLS_PERFORMANCEHINT_PREDICT_LIKELY' since
          // there is a nine in ten chance that this branch is taken.

          if (BSLS_PERFORMANCEHINT_PREDICT_LIKELY(8 != y)) {
              foo();
          }
          else {
              BSLS_PERFORMANCEHINT_UNLIKELY_HINT;
              bar();
          }
      }
      return 0;
  }
An excerpt of the assembly code generated from this small program using xlC Version 10 on AIX is:
  b8:   2c 00 00 08     cmpwi   r0,8
  bc:   41 82 00 38     beq-    f4 <.main+0xb4>
                         ^
                         Note that if register r0 (y) equals 8, branch to
                         instruction f4 (a jump).  The '-' after 'beq'
                         indicates that the branch is unlikely to be taken.
                         The predicted code path continues the 'if'
                         statement, which calls 'foo' below.

  c0:   4b ff ff 41     bl      0 <.foo__Fv>
  ...
  f4:   4b ff ff 2d     bl      20 <.bar__Fv>
Now, if BSLS_PERFORMANCEHINT_PREDICT_LIKELY is changed to BSLS_PERFORMANCEHINT_PREDICT_UNLIKELY, and the BSLS_PERFORMANCEHINT_UNLIKELY_HINT is moved to the first branch, the following assembly code will be generated:
  b8:   2c 00 00 08     cmpwi   r0,8
  bc:   40 c2 00 38     bne-    f4 <.main+0xb4>
                         ^
                         Note that the test became a "branch not equal"
                         test.  The predicted code path now continues to the
                         'else' statement, which calls 'bar' below.

  c0:   4b ff ff 61     bl      20 <.bar__Fv>
  ...
  f4:   4b ff ff 0d     bl      0 <.foo__Fv>
A timing analysis shows that effective use of branch prediction can have a material effect on code efficiency:
  $time ./unlikely.out 100000000

  real    0m2.022s
  user    0m2.010s
  sys     0m0.013s

  $time ./likely.out 100000000

  real    0m2.159s
  user    0m2.149s
  sys     0m0.005s
Example 2: Using BSLS_PERFORMANCEHINT_PREDICT_EXPECT:
This macro is essentially the same as the __builtin_expect(expr, value) intrinsic provided by some compilers. It allows the user to give more elaborate hints to the compiler, such as the expected value of the controlling expression of a switch statement. For example, given:
  int x = std::rand() % 4;
the following is incorrect usage of BSLS_PERFORMANCEHINT_PREDICT_EXPECT, since the probability of getting a 3 is the same as that of the other possibilities (0, 1, and 2):
  switch (BSLS_PERFORMANCEHINT_PREDICT_EXPECT(x, 3)) {
    case 1: // ...
            break;
    case 2: // ...
            break;
    case 3: // ...
            break;
    default: break;
  }
However, this is sufficient to illustrate the intent of this macro.
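By contrast, this macro could be appropriate when profiling shows that one value genuinely dominates. A minimal sketch, assuming x is 3 in the vast majority of calls (the handle* functions are hypothetical placeholders):
  switch (BSLS_PERFORMANCEHINT_PREDICT_EXPECT(x, 3)) {
    case 1:  handleRare1();    break;
    case 2:  handleRare2();    break;
    case 3:  handleCommon3();  break;  // the dominant case
    default:                   break;
  }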
Example 3: Cache Line Prefetching:
The following demonstrates use of prefetchForReading and prefetchForWriting to prefetch data cache lines:
  // Includes added so that the example is self-contained.
  #include <bsls_performancehint.h>
  #include <bsls_stopwatch.h>

  #include <stdio.h>    // 'printf'

  const int SIZE = 10 * 1024 * 1024;

  void add(int *arrayA, int *arrayB)
  {
      for (int i = 0; i < SIZE / 8; ++i){
          *arrayA += *arrayB; ++arrayA; ++arrayB;
          *arrayA += *arrayB; ++arrayA; ++arrayB;
          *arrayA += *arrayB; ++arrayA; ++arrayB;
          *arrayA += *arrayB; ++arrayA; ++arrayB;

          *arrayA += *arrayB; ++arrayA; ++arrayB;
          *arrayA += *arrayB; ++arrayA; ++arrayB;
          *arrayA += *arrayB; ++arrayA; ++arrayB;
          *arrayA += *arrayB; ++arrayA; ++arrayB;
      }
  }

  int array1[SIZE];
  int array2[SIZE];

  int main()
  {
      BloombergLP::bsls::Stopwatch timer;
      timer.start();
      for (int i = 0; i < 10; ++i) {
          add(array1, array2);
      }
      printf("time: %f\n", timer.elapsedTime());
      return 0;
  }
The above code simply adds two arrays together multiple times. Using bsls::Stopwatch, we recorded the running time and printed it to stdout:
  $./prefetch.sundev1.tsk
  time: 8.446806
Now, we can observe that in the add function, arrayA and arrayB are accessed sequentially for the majority of the program. arrayA is used for writing and arrayB is used for reading. Making use of prefetch, we add calls to prefetchForReading and prefetchForWriting:
  void add2(int *arrayA, int *arrayB)
  {
      for (int i = 0; i < SIZE / 8; ++i){
          using namespace BloombergLP; // Generally avoid 'using' in this TD.
          bsls::PerformanceHint::prefetchForWriting((int *) arrayA + 16);
          bsls::PerformanceHint::prefetchForReading((int *) arrayB + 16);

          *arrayA += *arrayB; ++arrayA; ++arrayB;
          *arrayA += *arrayB; ++arrayA; ++arrayB;
          *arrayA += *arrayB; ++arrayA; ++arrayB;
          *arrayA += *arrayB; ++arrayA; ++arrayB;

          *arrayA += *arrayB; ++arrayA; ++arrayB;
          *arrayA += *arrayB; ++arrayA; ++arrayB;
          *arrayA += *arrayB; ++arrayA; ++arrayB;
          *arrayA += *arrayB; ++arrayA; ++arrayB;
      }
  }
Adding the prefetch improves the program's efficiency:
  $./prefetch.sundev1.tsk
  time: 6.442100
Note that we prefetch the address 16 * sizeof(int) bytes ahead of arrayA and arrayB so that the prefetch instruction has sufficient time to finish before the data is actually accessed. To see the difference, if we change + 16 to + 4:
  $./prefetch.sundev1.tsk
  time: 6.835928
we get less of an improvement in speed. Similarly, if we prefetch too far ahead of where the data is used, the prefetched data might be evicted from the cache before it is accessed, and the prefetch is wasted.

Define Documentation

#define BSLS_PERFORMANCEHINT_PREDICT_LIKELY(expr)   (expr)
#define BSLS_PERFORMANCEHINT_PREDICT_UNLIKELY(expr)   (expr)
#define BSLS_PERFORMANCEHINT_PREDICT_EXPECT(expr, value)   (expr)
#define BSLS_PERFORMANCEHINT_ATTRIBUTE_COLD
#define BSLS_PERFORMANCEHINT_UNLIKELY_HINT