Yeppp! A trivial test

A dot product [1] [2] (a.k.a. scalar product), whose generalization is known as the inner product [3] [4], is a very common operation in many fields. Given two vectors \(\vec x\) and \(\vec y\), it can be defined as follows:
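\[
\vec x \cdot \vec y = \sum_{i=1}^{n} x_i\,y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n
\]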

Let’s test the runtime speed of this simple operation using the Yeppp! [5] library and compare it with what we can achieve from compiler optimization (with both clang 3.4.190255 and gcc 4.8.1) of the generic C++ std::inner_product [6] function (libstdc++ source code):

/**
 *  @brief  Compute inner product of two ranges.
 *
 *  Starting with an initial value of @p __init, multiplies successive
 *  elements from the two ranges and adds each product into the accumulated
 *  value using operator+().  The values in the ranges are processed in
 *  order.
 *
 *  @param  __first1  Start of range 1.
 *  @param  __last1  End of range 1.
 *  @param  __first2  Start of range 2.
 *  @param  __init  Starting value to add other values to.
 *  @return  The final inner product.
 */
template<typename _InputIterator1, typename _InputIterator2, typename _Tp>
  inline _Tp
  inner_product(_InputIterator1 __first1, _InputIterator1 __last1,
  	  _InputIterator2 __first2, _Tp __init)
  {
    // concept requirements
    __glibcxx_function_requires(_InputIteratorConcept<_InputIterator1>)
    __glibcxx_function_requires(_InputIteratorConcept<_InputIterator2>)
    __glibcxx_requires_valid_range(__first1, __last1);

    for (; __first1 != __last1; ++__first1, ++__first2)
      __init = __init + (*__first1 * *__first2);
    return __init;
  }

The test will just calculate the dot product of two vectors of 1024 components, \(10^6\) times:

#include <memory>
#include <vector>
#include <chrono>
#include <numeric>
#include <iostream>
#include <functional>
#include <yepCore.h>
#include <yepLibrary.h>

int main() {
    using namespace std;
    using namespace chrono;
    using defer = shared_ptr<void>;

    // Initialize Yeppp!; a null shared_ptr<void> with a custom deleter
    // schedules yepLibrary_Release for scope exit.
    yepLibrary_Init();
    defer release(nullptr, bind(yepLibrary_Release));

    const auto size = 1024;
    const vector<double> v1(size, 0.0);
    const vector<double> v2(size, 0.0);
    double r;

    // Time 10^6 dot products through the generic std::inner_product.
    auto naive_start = high_resolution_clock::now();
    for(int i = 0; i < 1000000; ++i)
        r = inner_product(begin(v1), end(v1), begin(v2), 0.0);
    auto naive_elapsed = high_resolution_clock::now() - naive_start;

    // Time 10^6 dot products through the Yeppp! primitive.
    auto yeppp_start = high_resolution_clock::now();
    for(int i = 0; i < 1000000; ++i)
        yepCore_DotProduct_V64fV64f_S64f(&v1[0], &v2[0], &r, size);
    auto yeppp_elapsed = high_resolution_clock::now() - yeppp_start;

    auto count_milliseconds = [](const high_resolution_clock::duration &elapsed) {
        return duration_cast<milliseconds>(elapsed).count();
    };

    cout << "Naive implementation:\n\tTime = " <<
        count_milliseconds(naive_elapsed) << " milliseconds" << endl;
    cout << "Yeppp! implementation:\n\tTime = " <<
        count_milliseconds(yeppp_elapsed) << " milliseconds" << endl;
    cout << "Yeppp!/Naive = " <<
        divides<double>{}(yeppp_elapsed.count(), naive_elapsed.count()) << endl;
}
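An aside on the defer trick above: a null shared_ptr<void> constructed with a custom deleter still invokes that deleter when it goes out of scope, and the call wrapper returned by std::bind discards arguments it has no placeholders for, so the nullary yepLibrary_Release can serve directly as the deleter. A minimal standalone sketch of the idiom (the puts calls are just for illustration):

#include <memory>
#include <functional>
#include <cstdio>

void release() { std::puts("released"); }

int main() {
    // The deleter runs at scope exit even though the stored pointer is null;
    // std::bind(release) simply ignores the void* argument it receives.
    std::shared_ptr<void> guard(nullptr, std::bind(release));
    std::puts("working");
}   // prints "working", then "released"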

First round, let’s compare it against a generic optimized compilation:

$ clang++ basics.cpp -o basics -O3 -std=c++1y -lyeppp
    $ ./basics
    Naive implementation:
    	Time = 943 milliseconds
    Yeppp! implementation:
    	Time = 177 milliseconds
    Yeppp!/Naive = 0.188716
    $ ./basics
    Naive implementation:
    	Time = 928 milliseconds
    Yeppp! implementation:
    	Time = 175 milliseconds
    Yeppp!/Naive = 0.188878
    $ ./basics
    Naive implementation:
    	Time = 920 milliseconds
    Yeppp! implementation:
    	Time = 176 milliseconds
    Yeppp!/Naive = 0.191578
$ g++ basics.cpp -o basics -O3 -std=c++1y -lyeppp
    $ ./basics
    Naive implementation:
    	Time = 917 milliseconds
    Yeppp! implementation:
    	Time = 178 milliseconds
    Yeppp!/Naive = 0.194186
    $ ./basics
    Naive implementation:
    	Time = 920 milliseconds
    Yeppp! implementation:
    	Time = 175 milliseconds
    Yeppp!/Naive = 0.190795
    $ ./basics
    Naive implementation:
    	Time = 922 milliseconds
    Yeppp! implementation:
    	Time = 175 milliseconds
    Yeppp!/Naive = 0.19016
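To put these numbers in perspective: each dot product does 1024 multiplications and 1024 additions, so the \(10^6\) iterations amount to roughly \(2 \times 1024 \times 10^6 \approx 2 \times 10^9\) floating-point operations. At ~920 milliseconds the -O3 build of the naive version sustains about 2.2 GFLOP/s, while Yeppp! at ~176 milliseconds sustains about 11.6 GFLOP/s.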

Second round, let’s add the -Ofast optimization flag, which combines -O3 with -ffast-math [7] [8] [9]:

$ clang++ basics.cpp -o basics -Ofast -std=c++1y -lyeppp
    $ ./basics
    Naive implementation:
    	Time = 246 milliseconds
    Yeppp! implementation:
    	Time = 179 milliseconds
    Yeppp!/Naive = 0.727429
    $ ./basics
    Naive implementation:
    	Time = 252 milliseconds
    Yeppp! implementation:
    	Time = 175 milliseconds
    Yeppp!/Naive = 0.697511
    $ ./basics
    Naive implementation:
    	Time = 238 milliseconds
    Yeppp! implementation:
    	Time = 174 milliseconds
    Yeppp!/Naive = 0.733306
$ g++ basics.cpp -o basics -Ofast -std=c++1y -lyeppp
    $ ./basics
    Naive implementation:
    	Time = 484 milliseconds
    Yeppp! implementation:
    	Time = 175 milliseconds
    Yeppp!/Naive = 0.362857
    $ ./basics
    Naive implementation:
    	Time = 474 milliseconds
    Yeppp! implementation:
    	Time = 175 milliseconds
    Yeppp!/Naive = 0.370727
    $ ./basics
    Naive implementation:
    	Time = 472 milliseconds
    Yeppp! implementation:
    	Time = 176 milliseconds
    Yeppp!/Naive = 0.374784
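The big jump for the naive version comes from -ffast-math allowing the compiler to reassociate the floating-point sum: std::inner_product specifies a strict left-to-right accumulation, which forms a single dependency chain and blocks vectorization; once reassociation is permitted, the sum can be split across SIMD lanes. A minimal sketch of what that splitting looks like when done by hand (the dot4 helper is hypothetical, not part of the benchmark), using four independent accumulators:

#include <cstddef>

double dot4(const double *x, const double *y, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // four independent dependency chains
        s0 += x[i + 0] * y[i + 0];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i)            // leftover elements
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}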

Last round, let’s do a processor-targeted compilation; my processor is an Intel® Core™ i7-2620M with AVX extensions [10]:

$ clang++ -march=corei7-avx basics.cpp -o basics -Ofast -std=c++1y -lyeppp
    $ ./basics
    Naive implementation:
    	Time = 208 milliseconds
    Yeppp! implementation:
    	Time = 176 milliseconds
    Yeppp!/Naive = 0.845582
    $ ./basics
    Naive implementation:
    	Time = 205 milliseconds
    Yeppp! implementation:
    	Time = 186 milliseconds
    Yeppp!/Naive = 0.904559
    $ ./basics
    Naive implementation:
    	Time = 209 milliseconds
    Yeppp! implementation:
    	Time = 181 milliseconds
    Yeppp!/Naive = 0.868061
$ g++ -march=corei7-avx basics.cpp -o basics -Ofast -std=c++1y -lyeppp
    $ ./basics
    Naive implementation:
    	Time = 245 milliseconds
    Yeppp! implementation:
    	Time = 175 milliseconds
    Yeppp!/Naive = 0.714875
    $ ./basics
    Naive implementation:
    	Time = 247 milliseconds
    Yeppp! implementation:
    	Time = 176 milliseconds
    Yeppp!/Naive = 0.714651
    $ ./basics
    Naive implementation:
    	Time = 237 milliseconds
    Yeppp! implementation:
    	Time = 177 milliseconds
    Yeppp!/Naive = 0.747752

Assembly Output [11]

Conclusion

Although a trivial test, it illustrates a point I want to make: most languages and virtual machines are still designed without much thought about whether we’re stuck in an outdated processor model. Few try to offer abstract, general language constructs for the concepts embedded in SIMD [12], barriers [13] [14], GPGPU [15] and other capabilities of currently available hardware; there’s a tendency to leave this stuff to voodoo optimization, unsafe libraries and non-embedded DSLs [16]. I believe such constructs can be created, even with compile-time safety. Changes are happening in this respect in some well-known languages [17] [18] [19] [20] [21] [22] [23], but given the increasing number of languages being born every day, I know of none that targets this. Most of what I’m seeing is the adoption of higher-level concurrency models, which is a good thing nonetheless.

The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.

–Henry Petroski [24]

Point made, the obvious disadvantage of the presented C++ compilation model compared to something like Yeppp! is that, to achieve performance on par with Yeppp!, we had to build a monolithic processor-specific executable, while with Yeppp! the processor-specific binary blob is selected at runtime [25] (for the common high-level operations and algorithms it provides, of course).
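For illustration, here is a minimal sketch of that kind of runtime dispatch. This is not Yeppp!’s actual mechanism; it assumes a GCC/clang toolchain on x86, using the __builtin_cpu_supports builtin (available since gcc 4.8) and the target function attribute:

#include <cstddef>

static double dot_generic(const double *x, const double *y, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

// Same loop, but compiled for AVX via the target attribute.
__attribute__((target("avx")))
static double dot_avx(const double *x, const double *y, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

// The implementation is chosen once, at runtime, from the actual CPU
// features, so a single binary carries both code paths.
double dot(const double *x, const double *y, std::size_t n) {
    static const auto impl =
        __builtin_cpu_supports("avx") ? dot_avx : dot_generic;
    return impl(x, y, n);
}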

  1. http://en.wikipedia.org/wiki/Dot_product

  2. http://mathworld.wolfram.com/DotProduct.html

  3. http://mathworld.wolfram.com/InnerProduct.html

  4. http://en.wikipedia.org/wiki/Inner_product_space

  5. http://www.reddit.com/r/programming/comments/1mos20

  6. http://en.cppreference.com/w/cpp/algorithm/inner_product

  7. http://gcc.gnu.org/ml/gcc/2001-07/msg02150.html

  8. http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/Optimize-Options.html

  9. https://twitter.com/pepper_chico/status/388595932788711424

  10. http://ark.intel.com/products/52231

  11. https://gist.github.com/oblitum/6938559

  12. http://en.wikipedia.org/wiki/SIMD

  13. http://preshing.com/20130922/acquire-and-release-fences

  14. http://en.wikipedia.org/wiki/Memory_barrier

  15. http://en.wikipedia.org/wiki/Gpgpu

  16. http://en.wikipedia.org/wiki/Domain-specific_language

  17. https://www.dartlang.org/slides/2013/02/Bringing-SIMD-to-the-Web-via-Dart.pdf

  18. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3571.pdf

  19. http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3561.pdf

  20. http://parfunk.blogspot.com.br/2012/05/how-to-write-hybrid-cpugpu-programs.html

  21. http://community.haskell.org/~simonmar/slides/cadarache2012/7%20-%20accelerate.pdf

  22. http://www.cse.unsw.edu.au/~pls/cuda-workshop09/slides/gpu-research.pdf

  23. http://www.graphics.stanford.edu/~hanrahan/talks/dsl/dsl1.pdf

  24. http://mechanical-sympathy.blogspot.com.br/2011/07/why-mechanical-sympathy.html

  25. http://www.dmst.aueb.gr/dds/pubs/inbook/beautiful_code/html/Spi07g.html