The dot product [1][2] (a.k.a. scalar product), whose generalization is known as the inner product [3][4], is a very common operation in many fields. Given two vectors \(\vec x\) and \(\vec y\), it can be depicted and defined as follows:

\[\vec x \cdot \vec y \equiv \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} \cdot \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \equiv x_1 y_1 + x_2 y_2 + \cdots + x_n y_n\]

Let's test the runtime speed of this simple operation using the Yeppp! [5] library and compare it with what we can achieve through compiler optimization (with both clang 3.4.190255 and gcc 4.8.1) of the generic C++ std::inner_product [6] function (libstdc++ source code):

/**
 *  @brief  Compute inner product of two ranges.
 *
 *  Starting with an initial value of @p __init, multiplies successive
 *  elements from the two ranges and adds each product into the accumulated
 *  value using operator+().  The values in the ranges are processed in
 *  order.
 *
 *  @param  __first1  Start of range 1.
 *  @param  __last1  End of range 1.
 *  @param  __first2  Start of range 2.
 *  @param  __init  Starting value to add other values to.
 *  @return  The final inner product.
 */
template<typename _InputIterator1, typename _InputIterator2, typename _Tp>
  inline _Tp
  inner_product(_InputIterator1 __first1, _InputIterator1 __last1,
  	  _InputIterator2 __first2, _Tp __init)
  {
    // concept requirements
    __glibcxx_function_requires(_InputIteratorConcept<_InputIterator1>)
    __glibcxx_function_requires(_InputIteratorConcept<_InputIterator2>)
    __glibcxx_requires_valid_range(__first1, __last1);

    for (; __first1 != __last1; ++__first1, ++__first2)
      __init = __init + (*__first1 * *__first2);
    return __init;
  }

The test will simply compute the dot product of two vectors of 1024 components, \(10^6\) times:

#include <memory>
#include <vector>
#include <chrono>
#include <numeric>
#include <iostream>
#include <functional>
#include <yepCore.h>
#include <yepLibrary.h>

int main() {
    using namespace std;
    using namespace chrono;
    // shared_ptr<void> as a scope guard: the custom deleter runs at scope
    // exit, and bind's call wrapper ignores the unused pointer argument.
    using defer = shared_ptr<void>;

    yepLibrary_Init();
    defer release(nullptr, bind(yepLibrary_Release));

    const auto size = 1024;
    const vector<double> v1(size, 0.0);
    const vector<double> v2(size, 0.0);
    double r;

    auto naive_start = high_resolution_clock::now();
    for(int i = 0; i < 1000000; ++i)
        r = inner_product(begin(v1), end(v1), begin(v2), 0.0);
    auto naive_elapsed = high_resolution_clock::now() - naive_start;

    auto yeppp_start = high_resolution_clock::now();
    for(int i = 0; i < 1000000; ++i)
        yepCore_DotProduct_V64fV64f_S64f(&v1[0], &v2[0], &r, size);
    auto yeppp_elapsed = high_resolution_clock::now() - yeppp_start;

    auto count_milliseconds = [](const high_resolution_clock::duration &elapsed) {
        return duration_cast<milliseconds>(elapsed).count();
    };

    cout << "Naive implementation:\n\tTime = " <<
        count_milliseconds(naive_elapsed) << " milliseconds" << endl;
    cout << "Yeppp! implementation:\n\tTime = " <<
        count_milliseconds(yeppp_elapsed) << " milliseconds" << endl;
    cout << "Yeppp!/Naive = " <<
        divides<double>{}(yeppp_elapsed.count(), naive_elapsed.count()) << endl;
}

First round, let's compare it with a generic optimized compilation:

$ clang++ basics.cpp -o basics -O3 -std=c++1y -lyeppp
$ ./basics
Naive implementation:
	Time = 943 milliseconds
Yeppp! implementation:
	Time = 177 milliseconds
Yeppp!/Naive = 0.188716
$ ./basics
Naive implementation:
	Time = 928 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.188878
$ ./basics
Naive implementation:
	Time = 920 milliseconds
Yeppp! implementation:
	Time = 176 milliseconds
Yeppp!/Naive = 0.191578
$ g++ basics.cpp -o basics -O3 -std=c++1y -lyeppp
$ ./basics
Naive implementation:
	Time = 917 milliseconds
Yeppp! implementation:
	Time = 178 milliseconds
Yeppp!/Naive = 0.194186
$ ./basics
Naive implementation:
	Time = 920 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.190795
$ ./basics
Naive implementation:
	Time = 922 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.19016

Second round, let's add the -Ofast optimization flag, which combines -O3 with -ffast-math [7][8][9]:

$ clang++ basics.cpp -o basics -Ofast -std=c++1y -lyeppp
$ ./basics
Naive implementation:
	Time = 246 milliseconds
Yeppp! implementation:
	Time = 179 milliseconds
Yeppp!/Naive = 0.727429
$ ./basics
Naive implementation:
	Time = 252 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.697511
$ ./basics
Naive implementation:
	Time = 238 milliseconds
Yeppp! implementation:
	Time = 174 milliseconds
Yeppp!/Naive = 0.733306
$ g++ basics.cpp -o basics -Ofast -std=c++1y -lyeppp
$ ./basics
Naive implementation:
	Time = 484 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.362857
$ ./basics
Naive implementation:
	Time = 474 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.370727
$ ./basics
Naive implementation:
	Time = 472 milliseconds
Yeppp! implementation:
	Time = 176 milliseconds
Yeppp!/Naive = 0.374784

Last round, let's do a processor-targeted compilation; my processor is an Intel® Core™ i7-2620M with AVX extensions [10]:

$ clang++ -march=corei7-avx basics.cpp -o basics -Ofast -std=c++1y -lyeppp
$ ./basics
Naive implementation:
	Time = 208 milliseconds
Yeppp! implementation:
	Time = 176 milliseconds
Yeppp!/Naive = 0.845582
$ ./basics
Naive implementation:
	Time = 205 milliseconds
Yeppp! implementation:
	Time = 186 milliseconds
Yeppp!/Naive = 0.904559
$ ./basics
Naive implementation:
	Time = 209 milliseconds
Yeppp! implementation:
	Time = 181 milliseconds
Yeppp!/Naive = 0.868061
$ g++ -march=corei7-avx basics.cpp -o basics -Ofast -std=c++1y -lyeppp
$ ./basics
Naive implementation:
	Time = 245 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.714875
$ ./basics
Naive implementation:
	Time = 247 milliseconds
Yeppp! implementation:
	Time = 176 milliseconds
Yeppp!/Naive = 0.714651
$ ./basics
Naive implementation:
	Time = 237 milliseconds
Yeppp! implementation:
	Time = 177 milliseconds
Yeppp!/Naive = 0.747752

Assembly Output [11]

Conclusion

Although a trivial test, it illustrates a point I want to make: most languages and virtual machines are still designed without much thought about whether we're stuck in an outdated processor model. Few try to offer abstract, general language constructs for the concepts embedded in SIMD [12], barriers [13][14], GPGPU [15] and other capabilities we can leverage from currently available hardware; there's a tendency to leave this to voodoo optimization, unsafe libraries and non-embedded DSLs [16]. I believe such constructs can be created, even with compile-time safety. Changes are happening in this respect in some well-known languages [17][18][19][20][21][22][23], but given the increasing number of languages being born every day, I know of none that targets this. Most of what I'm seeing is the adoption of higher-level concurrency models, which is a good thing nonetheless.

The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.

--Henry Petroski [24]

Point made, the obvious disadvantage of the C++ compilation model presented here, compared to using something like Yeppp!, is that to achieve performance on par with Yeppp! we had to build a monolithic, processor-specific executable, while with Yeppp! the processor-specific binary blob is selected at runtime [25] (for the common high-level operations and algorithms it provides, of course).


  1. http://en.wikipedia.org/wiki/Dot_product [return]
  2. http://mathworld.wolfram.com/DotProduct.html [return]
  3. http://mathworld.wolfram.com/InnerProduct.html [return]
  4. http://en.wikipedia.org/wiki/Inner_product_space [return]
  5. http://www.reddit.com/r/programming/comments/1mos20 [return]
  6. http://en.cppreference.com/w/cpp/algorithm/inner_product [return]
  7. http://gcc.gnu.org/ml/gcc/2001-07/msg02150.html [return]
  8. http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/Optimize-Options.html [return]
  9. https://twitter.com/pepper_chico/status/388595932788711424 [return]
  10. http://ark.intel.com/products/52231 [return]
  11. https://gist.github.com/oblitum/6938559 [return]
  12. http://en.wikipedia.org/wiki/SIMD [return]
  13. http://preshing.com/20130922/acquire-and-release-fences [return]
  14. http://en.wikipedia.org/wiki/Memory_barrier [return]
  15. http://en.wikipedia.org/wiki/Gpgpu [return]
  16. http://en.wikipedia.org/wiki/Domain-specific_language [return]
  17. https://www.dartlang.org/slides/2013/02/Bringing-SIMD-to-the-Web-via-Dart.pdf [return]
  18. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3571.pdf [return]
  19. http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3561.pdf [return]
  20. http://parfunk.blogspot.com.br/2012/05/how-to-write-hybrid-cpugpu-programs.html [return]
  21. http://community.haskell.org/~simonmar/slides/cadarache2012/7%20-%20accelerate.pdf [return]
  22. http://www.cse.unsw.edu.au/~pls/cuda-workshop09/slides/gpu-research.pdf [return]
  23. http://www.graphics.stanford.edu/~hanrahan/talks/dsl/dsl1.pdf [return]
  24. http://mechanical-sympathy.blogspot.com.br/2011/07/why-mechanical-sympathy.html [return]
  25. http://www.dmst.aueb.gr/dds/pubs/inbook/beautiful_code/html/Spi07g.html [return]