Some operations were substantially accelerated (explicitly specialized) using metaprogramming,
explicit vectorisation (with SSE), parallel programming (with OpenMP), out-of-order optimization, and some inline assembly.
Below are some benchmarks on a two-core, 2200 MHz Core 2 Duo CPU:
array::sum()
A tick refers to one CPU clock tick, which is about 0.45 nanoseconds on this CPU.
The sum was computed over 100,000,000 floats with values {1.f, 2.f, 1.f, 2.f, 1.f, 2.f …}. The results are in the table below:
| Method | Ticks per element | Computed Value | Source |
|---|---|---|---|
| plain for-loop, double | 3.14 | 1.5e+08 | double sum=0; for (int i=0; i<N; i++) sum += A[i]; |
| plain for-loop, float | 3.06 | 3.35e+07 | float sum=0; for (int i=0; i<N; i++) sum += A[i]; |
| std::accumulate<float>() | 3.06 | 3.35e+07 | float sum = accumulate(A.begin(), A.end(), 0.f); |
| lvv::array | 1.74 | 1.5e+08 | float sum = A.sum(); |
The SSE method is selected (through metaprogramming) when no summation method is explicitly specified and the CPU supports SSE.
Note that the float plain for-loop and std::accumulate results are incorrect (3.35e+07 instead of 1.5e+08) due to rounding error.
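As a rough illustration of what an SSE summation kernel can look like, here is a minimal sketch. It is not the lvv::array source: `sum_sketch` is a hypothetical name, and the SSE path is chosen here with a plain preprocessor check rather than the metaprogramming the library uses.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics (_mm_add_ps and friends)
#endif

// Hypothetical helper, not part of lvv::array.
// Sums n floats, four at a time when SSE is enabled at compile time.
float sum_sketch(const float* a, std::size_t n) {
    assert(a != nullptr || n == 0);
    std::size_t i = 0;
    float total = 0.f;
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();                   // four partial sums
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));  // load works for any alignment
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    for (; i < n; ++i)                               // scalar tail, or the whole
        total += a[i];                               // array when SSE is unavailable
    return total;
}
```

Calling `sum_sketch(data, n)` on a plain float buffer returns the sum. Because the partial sums are still floats, this sketch alone does not avoid the rounding problem seen with very long inputs.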
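The rounding effect can be reproduced without the library. The sketch below (the name `pattern_sum` is hypothetical) accumulates the same 1.f, 2.f, 1.f, 2.f pattern into an accumulator of a chosen type, generating the values on the fly.

```cpp
#include <cstddef>

// Hypothetical helper: sums n values of the pattern 1.f, 2.f, 1.f, 2.f, ...
// into an accumulator of type Acc. No array is allocated.
template <typename Acc>
Acc pattern_sum(std::size_t n) {
    Acc sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += (i % 2 == 0) ? 1.f : 2.f;
    return sum;
}
```

With `pattern_sum<double>(100000000)` the result is exactly 1.5e+08. With `pattern_sum<float>(100000000)`, assuming the platform evaluates float arithmetic in IEEE single precision with round-to-nearest, the accumulator stalls at 2^25 = 33,554,432 (about 3.35e+07): at that magnitude adjacent floats are 4 apart, so adding 1.f or 2.f no longer changes the value.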
array::max()
The maximum search was done on 100,000,000 floats.
| Method | Ticks per element | Source |
|---|---|---|
| plain for-loop | 5.81 | float max=0; for (size_t i=0; i<N; i++) if (A[i] > max) max = A[i]; |
| std::max_element() | 5.81 | float max = *std::max_element (A.begin(), A.end()); |
| lvv::array | 1.63 | float max = A.max() |
So far I have implemented only the combinations needed for my work, so the set is quite incomplete.
If there is no specialization for a type, the generic implementation is used.
Table 1. Implemented optimized specialisations
| Type | sum | max | V1 OP= V2 | V1 OP V2 |
|---|---|---|---|---|
| generic | std:: | std:: | for-loop | for-loop |
| float | sse | sse | generic | generic |
| double | generic | generic | generic | generic |
| long double | generic | generic | generic | generic |
| int8_t | generic | generic | generic | generic |
| int16_t | generic | sse2 | generic | generic |
| int32_t | generic | generic | generic | generic |
| int64_t | generic | generic | generic | generic |
| uint8_t | generic | generic | generic | generic |
| uint16_t | generic | generic | generic | generic |
| uint32_t | generic | generic | generic | generic |
| uint64_t | generic | generic | generic | generic |
Though I’ve targeted only x86-64, some optimized specialisations (out-of-order,
metaprogramming, OpenMP) are platform independent.
The appropriate specialisation is selected automatically (but can be specified
explicitly) based on CPU capabilities, array size and array element type.
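One common platform-independent technique of this kind is to split a reduction across several independent accumulators, so that an out-of-order CPU can overlap the additions instead of waiting on one long dependency chain. The sketch below assumes that this is what "out-of-order" refers to here; `sum_unrolled` is a hypothetical name, not a library function.

```cpp
#include <cstddef>

// Hypothetical helper, not part of lvv::array.
// Four independent partial sums break the single add-after-add dependency chain.
double sum_unrolled(const float* a, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)  // leftover elements when n is not a multiple of 4
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

This version uses only standard C++, so it compiles on any platform. Whether it is actually faster depends on the compiler and the CPU.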
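Selection by element type can be sketched with ordinary template specialisation. The names below (`sum_impl`, `sum_dispatch`) are hypothetical and this is not the library's actual mechanism. The sketch covers only the element type, not CPU capabilities or array size, and the float kernel is a stand-in rather than real SSE code.

```cpp
#include <cstddef>
#include <numeric>

// Hypothetical names, not the lvv::array API.
// Primary template: the generic fallback, used for any type without a specialisation.
template <typename T>
struct sum_impl {
    static const char* name() { return "generic"; }
    static T apply(const T* a, std::size_t n) {
        return std::accumulate(a, a + n, T(0));
    }
};

// Explicit specialisation: the compiler picks this whenever T is float.
template <>
struct sum_impl<float> {
    static const char* name() { return "float-specialised"; }
    static float apply(const float* a, std::size_t n) {
        double s = 0.0;  // stand-in for an optimised kernel
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return static_cast<float>(s);
    }
};

// Front end: callers never name the implementation explicitly.
template <typename T>
T sum_dispatch(const T* a, std::size_t n) {
    return sum_impl<T>::apply(a, n);
}
```

A call such as `sum_dispatch(ints, n)` resolves to the generic version, while `sum_dispatch(floats, n)` resolves to the float specialisation. The choice is made at compile time with no run-time branch.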
- The index of the first element defaults to 0, but can be any number (third template parameter).
- The index passed to operator[] is checked against the valid range when the NDEBUG macro is not defined (non-optimized build).
- Basic linear algebra functions: norm2(A), distance_norm2(A1,A2), dot(A1,A2), etc.
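The three features above can be illustrated with a small stand-in type. Everything in this sketch is an assumption made for illustration: `based_array` is a hypothetical name, and the bodies of `dot`, `norm2` and `distance_norm2` assume the usual dot product, Euclidean norm and Euclidean distance. It is not the lvv::array source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in, not the lvv::array source.
// B is the index of the first element (third template parameter, default 0).
template <typename T, int N, int B = 0>
struct based_array {
    T elems[N];

    T& operator[](int i) {
        assert(B <= i && i < B + N);  // range check, compiled out when NDEBUG is defined
        return elems[i - B];
    }
    const T& operator[](int i) const {
        assert(B <= i && i < B + N);
        return elems[i - B];
    }
};

// Assumed meaning: ordinary dot product.
template <typename T, int N, int B>
T dot(const based_array<T, N, B>& a, const based_array<T, N, B>& b) {
    T s = 0;
    for (int i = B; i < B + N; ++i) s += a[i] * b[i];
    return s;
}

// Assumed meaning: Euclidean (L2) norm.
template <typename T, int N, int B>
T norm2(const based_array<T, N, B>& a) {
    return std::sqrt(dot(a, a));
}

// Assumed meaning: Euclidean distance between two arrays.
template <typename T, int N, int B>
T distance_norm2(const based_array<T, N, B>& a, const based_array<T, N, B>& b) {
    T s = 0;
    for (int i = B; i < B + N; ++i) {
        const T d = a[i] - b[i];
        s += d * d;
    }
    return std::sqrt(s);
}
```

With `based_array<double, 3, 1> v = {{3.0, 4.0, 0.0}};` the valid indices are 1 to 3, so `v[1]` is the first element, and `v[0]` trips the assertion in a build without NDEBUG.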
See also sample use in test files t-*.cc and unit test u-array.cc.
If you are on GitHub, it is even easier.
See github patch-submit HOWTOs: (1, 2, 3, 4).
There are no hard-and-fast style rules.