yeppp! a trivial test

by Francisco Lopes on Oct 11, 2013

6 minute read

A dot product¹ ² (a.k.a. scalar product), the generalization of which is known as inner product³ ⁴, is a very common operation in several fields. Given two vectors \(\vec x\) and \(\vec y\), it can be depicted and defined like the following:

\[\vec x \cdot \vec y \equiv \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} \cdot \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \equiv x_1 y_1 + x_2 y_2 + \cdots + x_n y_n\]

Let's test the runtime speed of this simple operation using the Yeppp!⁵ library and compare it with what we achieve from compiler optimization (both clang 3.4.190255 and gcc 4.8.1) of the generic C++ std::inner_product⁶ function (libstdc++ source code):

/**
 *  @brief  Compute inner product of two ranges.
 *
 *  Starting with an initial value of @p __init, multiplies successive
 *  elements from the two ranges and adds each product into the accumulated
 *  value using operator+().  The values in the ranges are processed in
 *  order.
 *
 *  @param  __first1  Start of range 1.
 *  @param  __last1  End of range 1.
 *  @param  __first2  Start of range 2.
 *  @param  __init  Starting value to add other values to.
 *  @return  The final inner product.
 */
template<typename _InputIterator1, typename _InputIterator2, typename _Tp>
  inline _Tp
  inner_product(_InputIterator1 __first1, _InputIterator1 __last1,
  	  _InputIterator2 __first2, _Tp __init)
  {
    // concept requirements
    __glibcxx_function_requires(_InputIteratorConcept<_InputIterator1>)
    __glibcxx_function_requires(_InputIteratorConcept<_InputIterator2>)
    __glibcxx_requires_valid_range(__first1, __last1);

    for (; __first1 != __last1; ++__first1, ++__first2)
  __init = __init + (*__first1 * *__first2);
    return __init;
  }

The test will just calculate the dot product of two vectors of 1024 components, \(10^6\) times:

#include <memory>
#include <vector>
#include <chrono>
#include <numeric>
#include <iostream>
#include <functional>
#include <yepCore.h>
#include <yepLibrary.h>

int main() {
    using namespace std;
    using namespace chrono;
    using defer = shared_ptr<void>;

    yepLibrary_Init();
    defer release(nullptr, bind(yepLibrary_Release));

    const auto size = 1024;
    const vector<double> v1(size, 0.0);
    const vector<double> v2(size, 0.0);
    double r;

    auto naive_start = high_resolution_clock::now();
    for(int i = 0; i < 1000000; ++i)
        r = inner_product(begin(v1), end(v1), begin(v2), 0.0);
    auto naive_elapsed = high_resolution_clock::now() - naive_start;

    auto yeppp_start = high_resolution_clock::now();
    for(int i = 0; i < 1000000; ++i)
        yepCore_DotProduct_V64fV64f_S64f(&v1[0], &v2[0], &r, size);
    auto yeppp_elapsed = high_resolution_clock::now() - yeppp_start;

    auto count_milliseconds = [](const high_resolution_clock::duration &elapsed) {
        return duration_cast<milliseconds>(elapsed).count();
    };

    cout << "Naive implementation:\n\tTime = " <<
        count_milliseconds(naive_elapsed) << " milliseconds" << endl;
    cout << "Yeppp! implementation:\n\tTime = " <<
        count_milliseconds(yeppp_elapsed) << " milliseconds" << endl;
    cout << "Yeppp!/Naive = " <<
        divides<double>{}(yeppp_elapsed.count(), naive_elapsed.count()) << endl;
}

First round, let's compare it with a generic optimized compilation:

$ clang++ basics.cpp -o basics -O3 -std=c++1y -lyeppp
$ ./basics
Naive implementation:
	Time = 943 milliseconds
Yeppp! implementation:
	Time = 177 milliseconds
Yeppp!/Naive = 0.188716
$ ./basics
Naive implementation:
	Time = 928 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.188878
$ ./basics
Naive implementation:
	Time = 920 milliseconds
Yeppp! implementation:
	Time = 176 milliseconds
Yeppp!/Naive = 0.191578

$ g++ basics.cpp -o basics -O3 -std=c++1y -lyeppp
$ ./basics
Naive implementation:
	Time = 917 milliseconds
Yeppp! implementation:
	Time = 178 milliseconds
Yeppp!/Naive = 0.194186
$ ./basics
Naive implementation:
	Time = 920 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.190795
$ ./basics
Naive implementation:
	Time = 922 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.19016

Second round, let's add the -Ofast optimization flag, that will combine -O3 and -ffast-math⁷ ⁸ ⁹:

$ clang++ basics.cpp -o basics -Ofast -std=c++1y -lyeppp
$ ./basics
Naive implementation:
	Time = 246 milliseconds
Yeppp! implementation:
	Time = 179 milliseconds
Yeppp!/Naive = 0.727429
$ ./basics
Naive implementation:
	Time = 252 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.697511
$ ./basics
Naive implementation:
	Time = 238 milliseconds
Yeppp! implementation:
	Time = 174 milliseconds
Yeppp!/Naive = 0.733306

$ g++ basics.cpp -o basics -Ofast -std=c++1y -lyeppp
$ ./basics
Naive implementation:
	Time = 484 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.362857
$ ./basics
Naive implementation:
	Time = 474 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.370727
$ ./basics
Naive implementation:
	Time = 472 milliseconds
Yeppp! implementation:
	Time = 176 milliseconds
Yeppp!/Naive = 0.374784

Last round, let's do a processor targeted compilation, my processor is an Intel® Core™ i7-2620M with AVX extensions¹⁰:

$ clang++ -march=corei7-avx basics.cpp -o basics -Ofast -std=c++1y -lyeppp
$ ./basics
Naive implementation:
	Time = 208 milliseconds
Yeppp! implementation:
	Time = 176 milliseconds
Yeppp!/Naive = 0.845582
$ ./basics
Naive implementation:
	Time = 205 milliseconds
Yeppp! implementation:
	Time = 186 milliseconds
Yeppp!/Naive = 0.904559
$ ./basics
Naive implementation:
	Time = 209 milliseconds
Yeppp! implementation:
	Time = 181 milliseconds
Yeppp!/Naive = 0.868061

$ g++ -march=corei7-avx basics.cpp -o basics -Ofast -std=c++1y -lyeppp
$ ./basics
Naive implementation:
	Time = 245 milliseconds
Yeppp! implementation:
	Time = 175 milliseconds
Yeppp!/Naive = 0.714875
$ ./basics
Naive implementation:
	Time = 247 milliseconds
Yeppp! implementation:
	Time = 176 milliseconds
Yeppp!/Naive = 0.714651
$ ./basics
Naive implementation:
	Time = 237 milliseconds
Yeppp! implementation:
	Time = 177 milliseconds
Yeppp!/Naive = 0.747752

Assembly Output¹¹

Conclusion

Although a trivial test, it illustrates a point I want to make: most languages and virtual machines still start being designed without much further thought about whether we're stuck in an outdated processor model, few will try to offer abstract and general language constructs for the concepts embedded into SIMD¹², barriers¹³ ¹⁴, GPGPU¹⁵ and other stuff that we can leverage from the currently generally available hardware, there's a tendency for letting this stuff for voodoo optimization, unsafe libraries and non embedded DSLs¹⁶. I believe constructs can be created and even providing compile-time safety. There're changes happening in this respect to some well known languages¹⁷ ¹⁸ ¹⁹ ²⁰ ²¹ ²² ²³, but given the increasing number of languages borning everyday, I know none that target this. Most of what I'm seeing is about adoption of higher level concurrency models, which is a good thing nonetheless.

The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.

--Henry Peteroski ²⁴

Point made, the obvious disadvantage of the presented C++ compilation model compared to using something like Yeeep! is that to achieve a performance on par to Yeppp! we had to build a monolithic processor-specific executable, while using Yeppp! the processor-specific binary blob is decided at runtime²⁵ (of course, for the common high level operations and algorithms it provides).

hpc, optimization, cpp, yeppp!

Spotted a mistake in this article? Why not suggest an edit!