Imports

foreach_common, ast_utils, openmp, compiler_optim_hints

Macros

macro forEach(args: varargs[untyped]): untyped: Parallel iteration over one or more tensors

Format: forEach x in a, y in b, z in c: x += y * z

The iteration strategy is selected at runtime depending on the tensors' memory layout. If you know at compile-time that the tensors are contiguous or strided, use forEachContiguous or forEachStrided instead. Runtime selection requires duplicating the code body.
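To make the cost of runtime selection concrete, here is a minimal sketch using only the standard library. ToyTensor, toyForEach and the two iterators are hypothetical stand-ins, not the library's types or macros; the sketch is restricted to 2-D tensors and assumes a row-major layout. It only illustrates why the loop body ends up in the generated code twice.

```nim
# Toy 2-D tensor; ToyTensor and toyForEach are hypothetical, not library API.
type ToyTensor = object
  shape, strides: array[2, int]
  offset: int
  data: seq[float]

func isContiguous(t: ToyTensor): bool =
  # Dense row-major: unit column stride, row stride equal to the column count.
  t.strides[1] == 1 and t.strides[0] == t.shape[1]

iterator contiguousItems(t: var ToyTensor): var float =
  # Flat walk over the backing buffer.
  for flat in t.offset ..< t.offset + t.shape[0] * t.shape[1]:
    yield t.data[flat]

iterator stridedItems(t: var ToyTensor): var float =
  # Index arithmetic through the strides.
  for i in 0 ..< t.shape[0]:
    for j in 0 ..< t.shape[1]:
      yield t.data[t.offset + i * t.strides[0] + j * t.strides[1]]

template toyForEach(x: untyped, t: var ToyTensor, body: untyped) =
  # Layout is checked once at runtime; `body` is pasted into both branches.
  if t.isContiguous:
    for x in t.contiguousItems: body
  else:
    for x in t.stridedItems: body

var dense = ToyTensor(shape: [2, 3], strides: [3, 1], offset: 0,
                      data: @[1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
toyForEach(v, dense):
  v *= 2.0

# Every other column of an identical buffer: a strided view.
var view = ToyTensor(shape: [2, 2], strides: [3, 2], offset: 0,
                     data: @[1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
toyForEach(v, view):
  v = 0.0

echo dense.data
echo view.data
```

When the layout is known ahead of time, calling the contiguous or strided variant directly avoids both the runtime check and the second copy of the body.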

In the contiguous case: the default threshold for parallelization is OMP_MEMORY_BOUND_GRAIN_SIZE = 1024 elementwise operations to process per core.

The compiler is also hinted to unroll the loop for SIMD vectorization.

Otherwise, if a tensor is strided: the default threshold for parallelization is OMP_MEMORY_BOUND_GRAIN_SIZE div OMP_NON_CONTIGUOUS_SCALE_FACTOR = 1024 div 4 = 256 elementwise operations to process per core.

Use forEachStaged to fine-tune this default.
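The two default thresholds can be worked through with a short sketch. The constant names and values are the ones quoted above; grainSize and worthParallelizing are hypothetical helpers, and the comparison they use (at least one grain of work per core) is only an assumed reading of "per core", not the library's actual test.

```nim
const
  OMP_MEMORY_BOUND_GRAIN_SIZE = 1024     # value quoted in the text above
  OMP_NON_CONTIGUOUS_SCALE_FACTOR = 4    # value quoted in the text above

func grainSize(contiguous: bool): int =
  # Strided iteration gets a smaller grain than contiguous iteration.
  if contiguous: OMP_MEMORY_BOUND_GRAIN_SIZE
  else: OMP_MEMORY_BOUND_GRAIN_SIZE div OMP_NON_CONTIGUOUS_SCALE_FACTOR

func worthParallelizing(numElems, numCores: int, contiguous: bool): bool =
  # Assumed rule: parallelize only if every core gets at least one grain.
  numElems >= grainSize(contiguous) * numCores

echo grainSize(contiguous = true)
echo grainSize(contiguous = false)
echo worthParallelizing(1000, 4, contiguous = false)
echo worthParallelizing(1000, 4, contiguous = true)
```

Under this assumed rule, a strided tensor reaches the parallel path with a quarter of the elements a contiguous one needs, which matches the lower default above.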
macro forEachContiguous(args: varargs[untyped]): untyped: Parallel iteration over one or more contiguous tensors.

Format: forEachContiguous x in a, y in b, z in c: x += y * z

The default threshold for parallelization is OMP_MEMORY_BOUND_GRAIN_SIZE = 1024 elementwise operations to process per core.

The compiler is also hinted to unroll the loop for SIMD vectorization.

Use forEachStaged to fine-tune those defaults.
macro forEachContiguousSerial(args: varargs[untyped]): untyped: Serial iteration over one or more contiguous tensors.

Format: forEachContiguousSerial x in a, y in b, z in c: x += y * z
macro forEachSerial(args: varargs[untyped]): untyped: Serial iteration over one or more tensors

Format: forEachSerial x in a, y in b, z in c: x += y * z

OpenMP parameters are ignored.

The iteration strategy is selected at runtime depending on the tensors' memory layout. If you know at compile-time that the tensors are contiguous or strided, use forEachContiguousSerial or forEachStridedSerial instead. Runtime selection requires duplicating the code body.
macro forEachStrided(args: varargs[untyped]): untyped: Parallel iteration over one or more tensors of unknown strides, for example those resulting from most slices.

Format: forEachStrided x in a, y in b, z in c: x += y * z

The default threshold for parallelization is OMP_MEMORY_BOUND_GRAIN_SIZE div OMP_NON_CONTIGUOUS_SCALE_FACTOR = 1024 div 4 = 256 elementwise operations to process per core.

Use forEachStaged to fine-tune this default.
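As an illustration of why most slices end up with non-contiguous strides, the sketch below tracks only layout metadata. Layout, rowMajor, sliceCols and isContiguous are hypothetical helpers written for this example and assume a 2-D row-major tensor; they are not part of the library.

```nim
type Layout = object
  shape, strides: array[2, int]
  offset: int

func rowMajor(rows, cols: int): Layout =
  # Dense layout: stepping one row skips `cols` elements.
  Layout(shape: [rows, cols], strides: [cols, 1], offset: 0)

func sliceCols(l: Layout, first, last: int): Layout =
  # Keep columns first..last (inclusive). No data moves: only the shape
  # and the offset change, the strides stay those of the parent.
  Layout(shape: [l.shape[0], last - first + 1],
         strides: l.strides,
         offset: l.offset + first * l.strides[1])

func isContiguous(l: Layout): bool =
  # Dense row-major: unit column stride and no gap between rows.
  l.strides[1] == 1 and (l.shape[0] == 1 or l.strides[0] == l.shape[1])

let full = rowMajor(4, 6)
let view = full.sliceCols(1, 3)

echo full.isContiguous   # the parent is dense
echo view.isContiguous   # row stride 6 no longer matches 3 columns
```

The sliced view keeps the parent's row stride while covering fewer columns, so a flat walk over its buffer would touch elements outside the slice; a stride-aware loop is needed instead.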
macro forEachStridedSerial(args: varargs[untyped]): untyped: Serial iteration over one or more tensors of unknown strides, for example those resulting from most slices.

Format: forEachStridedSerial x in a, y in b, z in c: x += y * z

Exports

omp_suffix