Most operations in Arraymancer are parallelized through OpenMP, including linear algebra functions, universal functions, and map-, reduce- and fold-based operations.
Arraymancer provides several constructs for the YOLO™ paradigm (You Only Loop Once).
A naïve logistic sigmoid implementation in Numpy would be:
```python
import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))
```
With Numpy broadcasting, all of those operations are applied to whole tensors using Numpy's C implementation. Pretty efficient, right?
Actually no. This creates lots of temporaries and four separate loops over the data:

- `temp1 = -x`
- `temp2 = np.exp(temp1)`
- `temp3 = 1 + temp2`
- `temp4 = 1 / temp3`
So you suddenly get an O(4n) algorithm: four full passes over memory instead of one.
Arraymancer can do the same using the explicit broadcast operators `/.` and `+.` (the logistic sigmoid is renamed to avoid a name conflict):
```nim
import arraymancer

proc customSigmoid[T: SomeFloat](t: Tensor[T]): Tensor[T] =
  result = 1 /. (1 +. exp(-t))
```
Well, unfortunately, the only thing we gain here is parallelism: we still implicitly have 4 loops over the data. Another way would be to use the loop-fusion template `map_inline`:
```nim
import arraymancer

proc customSigmoid2[T: SomeFloat](t: Tensor[T]): Tensor[T] =
  result = map_inline(t):
    1 / (1 + exp(-x))
```
Now, in a single loop over `t`, Arraymancer computes `1 / (1 + exp(-x))` for each element; `x` is a shorthand for the elements of the first tensor argument.
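For example (a minimal usage sketch, assuming the `customSigmoid2` proc above is in scope; the input values are arbitrary):

```nim
import arraymancer

let t = [[0.0, 1.0], [-1.0, 2.0]].toTensor()
# One fused, parallel pass: no temporary tensors for -t, exp(-t), 1 + ..., etc.
echo customSigmoid2(t)
```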
Here is another example with 3 tensors and an element-wise fused multiply-add `C += A *. B`:
```nim
import arraymancer

proc fusedMultiplyAdd[T: SomeNumber](c: var Tensor[T], a, b: Tensor[T]) =
  ## Implements C += A *. B, where *. is the element-wise multiply.
  apply3_inline(c, a, b):
    # The value of the block is written back into each element x of c.
    x + y * z
```
Since the tensors were given in the order (c, a, b):

- `x` corresponds to the elements of `c`
- `y` to the elements of `a`
- `z` to the elements of `b`
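A minimal usage sketch (assuming the `fusedMultiplyAdd` proc above is in scope; the values are arbitrary):

```nim
import arraymancer

var c = zeros[int](2, 2)
let a = [[1, 2], [3, 4]].toTensor()
let b = [[5, 6], [7, 8]].toTensor()

# c is updated in place, in a single fused parallel pass.
fusedMultiplyAdd(c, a, b)
echo c   # since c started at zero, this is just the element-wise product a *. b
```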
Today Arraymancer offers `map_inline`, `map2_inline`, `apply_inline`, `apply2_inline` and `apply3_inline`.
Those are also parallelized using OpenMP. In the future, this will be generalized to N inputs.
Similarly, `reduce_inline` and `fold_inline` are offered for parallel, custom, fused reduction operations.
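As an illustrative sketch (not from the documentation above), a custom fused reduction could look like this; it assumes that, as in the built-in reductions, `reduce_inline` injects `x` as the running accumulator (seeded with the first element) and `y` as the current element:

```nim
import arraymancer
import std/math

proc gcdAll[T: SomeInteger](t: Tensor[T]): T =
  ## Greatest common divisor of all elements, in a single parallel pass.
  t.reduce_inline():
    x = gcd(x, y)

let v = [12, 18, 30].toTensor()
echo v.gcdAll()   # 6
```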
For most operations in machine learning, memory and cache are the bottleneck. For example, taking the log of a tensor can use at most about 20% of your theoretical peak CPU throughput (in GFLOPS), while matrix multiplication can reach 70-90%+ of it with the best implementations (MKL, OpenBLAS). In the log case, the processor produces results faster than data can be loaded from memory into the cache; in the matrix multiplication case, each element of a matrix can be reused several times before new data has to be loaded.
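A rough back-of-the-envelope sketch of the difference in arithmetic intensity (the counts below are simplified, e.g. `log` is treated as a single operation):

```nim
# Illustrative only: operations per byte of memory traffic for float64 data.
let n = 1500.0

# Element-wise log: each element is read once and written once, no reuse.
let logOps = n * n
let logBytes = 2.0 * n * n * 8.0
echo "log:    ", logOps / logBytes, " ops/byte"        # ~0.06: memory-bound

# Matrix multiplication: ~2*n^3 operations on three n*n float64 matrices.
let matmulOps = 2.0 * n * n * n
let matmulBytes = 3.0 * n * n * 8.0
echo "matmul: ", matmulOps / matmulBytes, " ops/byte"  # ~125: compute-bound
```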
Arraymancer strives hard to limit memory allocations with the inline versions of map, apply, reduce and fold (`map_inline`, `apply_inline`, `reduce_inline`, `fold_inline`) mentioned above, which avoid intermediate results.
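For instance, an in-place update (a minimal sketch, not from the documentation above) allocates no intermediate tensor at all:

```nim
import arraymancer

var t = [[1.0, 2.0], [3.0, 4.0]].toTensor()
let mu = t.mean()

# Single fused, parallel pass that overwrites each element x in place;
# no temporary tensor is allocated for the subtraction.
t.apply_inline():
  x - mu
echo t
```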
Integers seem to be the abandoned children of ndarray and tensor libraries: everyone is optimising the hell out of floating points. Not so with Arraymancer:
- Archlinux, i9-9980XE (Skylake-X, 18 cores, overclocked: 4.1 GHz all-core turbo, 4.0 GHz all-AVX turbo, 3.5 GHz all-AVX512 turbo)
- Input: 1500x1500 random large int64 matrix
- Arraymancer 81777f0 (~v0.6.0, master branch 2020-01-09)
| Language | Speed | Memory |
|---|---|---|
| Nim 1.0.4 (`-d:danger`) + OpenMP | 0.14s | 22.7 MB |
| Julia v1.3.1 | 1.67s | 246.5 MB |
| Python 3.8.1 + Numpy-MKL 1.18 compiled from source | 5.69s | 75.9 MB |
Benchmark setup is in the `./benchmarks` folder and is similar to (stolen from) Kostya's.
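For reference, the operation being timed is a plain integer matrix product, along the lines of this condensed, hypothetical sketch (not the actual benchmark script; see `./benchmarks` for the real code):

```nim
import arraymancer

let n = 1500
# Integer matmul cannot be dispatched to BLAS (which only covers floats),
# so this exercises Arraymancer's own parallel integer path.
let a = ones[int64](n, n)
let b = ones[int64](n, n)
let c = a * b
echo c[0, 0]   # 1500
```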
Note: Arraymancer, Julia and Numpy all have the same speed on float matrix multiplication, as they all use assembly-based BLAS + OpenMP underneath. In the future, pure-Nim backends without assembly and/or OpenMP may be used to ease deployment, especially on Windows, and to be free of OpenMP's limitations with regard to nested parallelism and load balancing of generic algorithms. Speed will be competitive at least with OpenBLAS; see the Weave multithreading runtime benchmarks.