Consts
OMP_MEMORY_BOUND_GRAIN_SIZE {.intdefine.} = 1024
-
This is the minimum amount of work per physical core for memory-bound processing.
- "copy" and "addition" are considered memory-bound
- "float division" is roughly 2x~4x more expensive, and the grain size should be scaled down accordingly
- "exp" and "sin" operations are compute-bound and show a perf boost even when processing only 1000 items on 28 cores
Launching 2 threads per core (HyperThreading) is probably desirable.
Raising the following parameters can have the following impact:
- number of sockets: higher, more overhead per memory fetch
- number of memory channels: lower, less overhead per memory fetch
- RAM speed: lower, less overhead per memory fetch
- private L2 cache: higher, feeds more data per core
- hyperthreading and cache associativity
- cores and shared L3 cache: memory contention
Note that setting num_threads manually might impact performance negatively:
- http://studio.myrian.fr/openmp-et-num_threads/ reports a 2x2 ms overhead when changing num_threads from 16 -> 6 -> 16
OMP_NON_CONTIGUOUS_SCALE_FACTOR {.intdefine.} = 4
- Due to strided computation, we can use a lower grain size for non-contiguous tensors
Procs
proc omp_suffix(genNew: static bool = false): string {.compileTime.}
- genNew: if false, return the last suffix; otherwise return a fresh one
Templates
template attachGC(): untyped
-
If you are allocating reference types, sequences or strings in a parallel section, you need to attach and detach a GC for each thread. Those should be thread-local temporaries.
This attaches the GC.
Note: this creates strange error messages when --threads is not on: https://github.com/nim-lang/Nim/issues/9489
template detachGC(): untyped
-
If you are allocating reference types, sequences or strings in a parallel section, you need to attach and detach a GC for each thread. Those should be thread-local temporaries.
This detaches the GC.
Note: this creates strange error messages when --threads is not on: https://github.com/nim-lang/Nim/issues/9489
template omp_barrier(): untyped
-
template omp_chunks(omp_size: Natural; chunk_offset, chunk_size: untyped; body: untyped): untyped
- Internal template. This is the chunking part of omp_parallel_chunks. omp_size should be an lvalue (an assigned variable) and not the result of a routine, otherwise the routine and its side effects will be called multiple times.
template omp_critical(body: untyped): untyped
-
template omp_for(index: untyped; length: Natural; use_simd, nowait: static bool; body: untyped)
-
OpenMP for loop (not parallel)
This must be used in an omp_parallel block for parallelization.
Inputs:
- index, the iteration index, similar to for index in 0 ..< length: doSomething(index)
- length, the number of elements to iterate on
- use_simd, instructs the compiler to unroll the loops for SIMD use. For example, for float32:
  for i in 0 ..< 16: x[i] += y[i]
  will be unrolled to process 128, 256 or 512 bits at a time using SSE, AVX or AVX512. For 256-bit AVX:
  for i in countup(0, 15, 8): # Step 8 by 8
    x[i]   += y[i]
    x[i+1] += y[i+1]
    x[i+2] += y[i+2]
    ...
template omp_get_max_threads(): cint
-
template omp_get_nested(): cint
-
template omp_get_num_threads(): cint
-
template omp_get_thread_num(): cint
-
template omp_master(body: untyped): untyped
-
template omp_parallel(body: untyped): untyped
-
Starts an openMP parallel section
Don't forget to use attachGC and detachGC if you are allocating sequences, strings, or reference types. Those should be thread-local temporaries.
template omp_parallel_chunks(length: Natural; chunk_offset, chunk_size: untyped; omp_grain_size: static Natural; body: untyped): untyped
-
Create a chunk for each thread. You can use: for index in chunk_offset ..< chunk_offset + chunk_size: or zeroMem(foo[chunk_offset].addr, chunk_size)
Splits the input length into chunks and does a parallel loop over each chunk. The number of chunks depends on the number of cores at runtime. chunk_offset and chunk_size should be passed as undeclared identifiers; within the template scope they will contain the start offset and the length of the current thread's chunk, i.e. their value is thread-specific.
Use omp_get_thread_num() to get the current thread number
This is useful for non-contiguous processing as a replacement for omp_parallel_for, or when operating on (contiguous) ranges, for example for memset or memcpy.
Do not forget to use attachGC and detachGC if you are allocating sequences, strings, or reference types. Those should be thread-local temporaries.
template omp_parallel_chunks_default(length: Natural; chunk_offset, chunk_size: untyped; body: untyped): untyped
-
This will be renamed omp_parallel_chunks once https://github.com/nim-lang/Nim/issues/9414 is solved. Compared to omp_parallel_chunks the following are set by default:
- omp_grain_size: The default OMP_MEMORY_BOUND_GRAIN_SIZE is suitable for contiguous copy or add operations. It's 1024 and can be changed by passing -d:OMP_MEMORY_BOUND_GRAIN_SIZE=123456 during compilation. A value of 1 will always parallelize the loop.
template omp_parallel_for(index: untyped; length: Natural; omp_grain_size: static Natural; use_simd: static bool; body: untyped)
-
Parallel for loop
Do not forget to use attachGC and detachGC if you are allocating sequences, strings, or reference types. Those should be thread-local temporaries.
Inputs:
- index, the iteration index, similar to for index in 0 ..< length: doSomething(index)
- length, the number of elements to iterate on
- omp_grain_size, the minimal amount of work per thread. Below this threshold we don't start threads. Note that we always start as many hardware threads as available, because starting a varying number of threads over the lifetime of the program adds overhead.
- use_simd, instructs the compiler to unroll the loops for SIMD use. For example, for float32:
  for i in 0 ..< 16: x[i] += y[i]
  will be unrolled to process 128, 256 or 512 bits at a time using SSE, AVX or AVX512. For 256-bit AVX:
  for i in countup(0, 15, 8): # Step 8 by 8
    x[i]   += y[i]
    x[i+1] += y[i+1]
    x[i+2] += y[i+2]
    ...
template omp_parallel_for_default(index: untyped; length: Natural; body: untyped)
-
This will be renamed omp_parallel_for once https://github.com/nim-lang/Nim/issues/9414 is solved. Compared to omp_parallel_for the following are set by default
- omp_grain_size: The default OMP_MEMORY_BOUND_GRAIN_SIZE is suitable for contiguous copy or add operations. It's 1024 and can be changed by passing -d:OMP_MEMORY_BOUND_GRAIN_SIZE=123456 during compilation. A value of 1 will always parallelize the loop.
- simd is used by default
template omp_parallel_if(condition: bool; body: untyped)
-
template omp_set_nested(x: cint)
-
template omp_set_num_threads(x: cint)
-
template omp_single(body: untyped): untyped
-
template omp_single_nowait(body: untyped): untyped
-
template omp_taskloop(index: untyped; length: Natural; annotation: static string; body: untyped)
- OpenMP taskloop
template omp_taskwait(): untyped
-