Macros
macro forEach(args: varargs[untyped]): untyped
-
Parallel iteration over one or more tensors
Format: forEach x in a, y in b, z in c: x += y * z
The iteration strategy is selected at runtime depending on the tensors' memory layout. If you know at compile time that the tensors are contiguous or strided, use forEachContiguous or forEachStrided instead. Runtime selection requires duplicating the code body.
In the contiguous case: the default threshold for parallelization is OMP_MEMORY_BOUND_GRAIN_SIZE = 1024 elementwise operations to process per core.
The compiler is also hinted to unroll the loop for SIMD vectorization.
Otherwise, if the tensor is strided: the default threshold for parallelization is OMP_MEMORY_BOUND_GRAIN_SIZE div OMP_NON_CONTIGUOUS_SCALE_FACTOR = 1024/4 = 256 elementwise operations to process per core.
Use forEachStaged to fine-tune this default.
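The note above that runtime selection duplicates the code body can be illustrated with a small, self-contained sketch. It does not use the library: `View`, `isContiguous`, `contiguousItems`, `stridedItems` and `forEachSketch` are hypothetical names introduced here, and the strided path only handles 2-D views. The point is only that a layout check done at run time forces the loop body to be emitted once per branch.

```nim
# Hypothetical stand-in for a tensor view; not the library's own type.
type View = object
  shape, strides: seq[int]   # row-major shape and element strides
  offset: int                # index of the first element in the buffer

proc isContiguous(v: View): bool =
  ## True when the strides describe a dense row-major layout.
  var expected = 1
  for i in countdown(v.shape.high, 0):
    if v.shape[i] != 1 and v.strides[i] != expected:
      return false
    expected *= v.shape[i]
  result = true

iterator contiguousItems(data: var seq[float], v: View): var float =
  ## Fast path: a single flat loop over the buffer.
  var n = 1
  for d in v.shape: n *= d
  for i in v.offset ..< v.offset + n:
    yield data[i]

iterator stridedItems(data: var seq[float], v: View): var float =
  ## General path, limited to 2-D views to keep the sketch short.
  for r in 0 ..< v.shape[0]:
    for c in 0 ..< v.shape[1]:
      yield data[v.offset + r * v.strides[0] + c * v.strides[1]]

template forEachSketch(x, data, v, body: untyped) =
  ## The layout is only known at run time, so `body` appears in both branches.
  if isContiguous(v):
    for x in contiguousItems(data, v):
      body
  else:
    for x in stridedItems(data, v):
      body

var buf = @[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
let dense = View(shape: @[2, 3], strides: @[3, 1], offset: 0)
let cols02 = View(shape: @[2, 2], strides: @[3, 2], offset: 0)  # columns 0 and 2

forEachSketch(x, buf, dense):
  x *= 2.0   # takes the contiguous branch
forEachSketch(x, buf, cols02):
  x += 1.0   # takes the strided branch

echo buf
```

After both calls `buf` holds 3, 4, 7, 9, 10, 13: every element was doubled through the dense view, then the elements in columns 0 and 2 were incremented through the strided view.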
macro forEachContiguous(args: varargs[untyped]): untyped
-
Parallel iteration over one or more contiguous tensors.
Format: forEachContiguous x in a, y in b, z in c: x += y * z
The default threshold for parallelization is OMP_MEMORY_BOUND_GRAIN_SIZE = 1024 elementwise operations to process per core.
The compiler is also hinted to unroll the loop for SIMD vectorization.
Use forEachStaged to fine-tune these defaults.
macro forEachContiguousSerial(args: varargs[untyped]): untyped
-
Serial iteration over one or more contiguous tensors.
Format: forEachContiguousSerial x in a, y in b, z in c: x += y * z
macro forEachSerial(args: varargs[untyped]): untyped
-
Serial iteration over one or more tensors
Format: forEachSerial x in a, y in b, z in c: x += y * z
OpenMP parameters are ignored.
The iteration strategy is selected at runtime depending on the tensors' memory layout. If you know at compile time that the tensors are contiguous or strided, use forEachContiguousSerial or forEachStridedSerial instead. Runtime selection requires duplicating the code body.
macro forEachStrided(args: varargs[untyped]): untyped
-
Parallel iteration over one or more tensors of unknown strides, for example those resulting from most slices.
Format: forEachStrided x in a, y in b, z in c: x += y * z
The default threshold for parallelization is OMP_MEMORY_BOUND_GRAIN_SIZE div OMP_NON_CONTIGUOUS_SCALE_FACTOR = 1024/4 = 256 elementwise operations to process per core.
Use forEachStaged to fine-tune this default.
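The threshold arithmetic can be written out as a short sketch. The two constants and their values are taken from the text above. The `maxUsefulThreads` helper is hypothetical: it is one possible reading of "elementwise operations to process per core", not the library's actual scheduling rule.

```nim
# Values quoted from the documentation above.
const
  OMP_MEMORY_BOUND_GRAIN_SIZE = 1024
  OMP_NON_CONTIGUOUS_SCALE_FACTOR = 4
  stridedGrainSize = OMP_MEMORY_BOUND_GRAIN_SIZE div OMP_NON_CONTIGUOUS_SCALE_FACTOR

proc maxUsefulThreads(numElems, grainSize, availableCores: int): int =
  ## Assumed rule, for illustration only: give each core at least
  ## `grainSize` elements, and never drop below one thread.
  max(1, min(availableCores, numElems div grainSize))

echo stridedGrainSize                                              # 1024 div 4
echo maxUsefulThreads(4096, OMP_MEMORY_BOUND_GRAIN_SIZE, 8)        # contiguous grain
echo maxUsefulThreads(4096, stridedGrainSize, 8)                   # strided grain
```

Under this assumed rule, a 4096-element workload would be spread over 4 threads with the contiguous grain size and over all 8 available cores with the smaller strided grain size, which reflects the higher per-element cost of strided access.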
macro forEachStridedSerial(args: varargs[untyped]): untyped
-
Serial iteration over one or more tensors of unknown strides, for example those resulting from most slices.
Format: forEachStridedSerial x in a, y in b, z in c: x += y * z
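To see why most slices end up with non-unit strides, here is a minimal sketch that does not depend on the library. `View`, `everyOtherColumn` and `sumSerial` are hypothetical names used only for illustration. Slicing changes the view metadata while the buffer is shared, so the traversal has to compute each element's position from the strides.

```nim
# Hypothetical view type for illustration; not the library's tensor type.
type View = object
  shape, strides: seq[int]
  offset: int

proc everyOtherColumn(v: View): View =
  ## Slice a 2-D view to keep columns 0, 2, 4, ...
  ## No data is copied: the inner stride simply doubles.
  View(shape: @[v.shape[0], (v.shape[1] + 1) div 2],
       strides: @[v.strides[0], v.strides[1] * 2],
       offset: v.offset)

proc sumSerial(data: seq[float], v: View): float =
  ## Serial strided traversal of a 2-D view.
  for r in 0 ..< v.shape[0]:
    for c in 0 ..< v.shape[1]:
      result += data[v.offset + r * v.strides[0] + c * v.strides[1]]

let buf = @[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
let full = View(shape: @[2, 4], strides: @[4, 1], offset: 0)
let sliced = full.everyOtherColumn()

echo sliced.strides
echo sumSerial(buf, sliced)
```

The sliced view has strides 4 and 2, so it visits buffer positions 0, 2, 4 and 6 and sums 1 + 3 + 5 + 7 = 16. A flat loop over the buffer would read the wrong elements, which is why a stride-aware traversal is needed here.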