Consts
OMP_MEMORY_BOUND_GRAIN_SIZE {.intdefine.} = 1024
-
This is the minimum amount of work per physical core for memory-bound processing.
- "copy" and "addition" are considered memory-bound
- "float division" is roughly 2x~4x more expensive, and the grain size should be scaled down accordingly
- "exp" and "sin" operations are compute-bound and show a perf boost even when processing only 1000 items on 28 cores
Launching 2 threads per core (HyperThreading) is probably desirable.
Raising the following parameters can have the following impact:
- number of sockets: higher, more overhead per memory fetch
- number of memory channels: lower, less overhead per memory fetch
- RAM speed: lower, less overhead per memory fetch
- private L2 cache: higher, feeds more data per core
- hyperthreading and cache associativity
- cores and shared L3 cache: memory contention
Note that setting num_threads manually might impact performance negatively:
- http://studio.myrian.fr/openmp-et-num_threads/ reports a 2x2 ms overhead when changing num_threads from 16 -> 6 -> 16
OMP_NON_CONTIGUOUS_SCALE_FACTOR {.intdefine.} = 4
- Due to strided computation, we can use a lower grain size for non-contiguous tensors
Procs
proc omp_suffix(genNew: static bool = false): string {.compileTime.}
- genNew: if false, return the last suffix; otherwise return a fresh one
Templates
template attachGC(): untyped
-
If you are allocating reference types, sequences or strings in a parallel section, you need to attach and detach a GC for each thread. Those should be thread-local temporaries.
This attaches the GC.
Note: this creates strange error messages when --threads is not on: https://github.com/nim-lang/Nim/issues/9489
template detachGC(): untyped
-
If you are allocating reference types, sequences or strings in a parallel section, you need to attach and detach a GC for each thread. Those should be thread-local temporaries.
This detaches the GC.
Note: this creates strange error messages when --threads is not on: https://github.com/nim-lang/Nim/issues/9489
template omp_barrier(): untyped
-
template omp_chunks(omp_size: Natural; chunk_offset, chunk_size: untyped; body: untyped): untyped
- Internal template. This is the chunking part of omp_parallel_chunks. omp_size should be an lvalue (an assigned variable) and not the result of a routine, otherwise the routine and its side effects will be called multiple times.
template omp_critical(body: untyped): untyped
-
template omp_for(index: untyped; length: Natural; use_simd, nowait: static bool; body: untyped)
-
OpenMP for loop (not parallel)
This must be used in an omp_parallel block for parallelization.
Inputs:
- index, the iteration index, similar to for index in 0 ..< length: doSomething(index)
- length, the number of elements to iterate on
- use_simd, instructs the compiler to unroll the loops for SIMD use. For example, for float32:
  for i in 0 ..< 16: x[i] += y[i]
  will be unrolled to process 128, 256 or 512 bits at a time using SSE, AVX or AVX512. For 256-bit AVX:
  for i in countup(0, 15, 8): # Step 8 by 8
    x[i]   += y[i]
    x[i+1] += y[i+1]
    x[i+2] += y[i+2]
    ...
template omp_get_max_threads(): cint
-
template omp_get_nested(): cint
-
template omp_get_num_threads(): cint
-
template omp_get_thread_num(): cint
-
template omp_master(body: untyped): untyped
-
template omp_parallel(body: untyped): untyped
-
Starts an openMP parallel section
Don't forget to use attachGC and detachGC if you are allocating sequences, strings, or reference types. Those should be thread-local temporaries.
template omp_parallel_chunks(length: Natural; chunk_offset, chunk_size: untyped; omp_grain_size: static Natural; body: untyped): untyped
-
Create a chunk for each thread. You can use: for index in chunk_offset ..< chunk_offset + chunk_size: or zeroMem(foo[chunk_offset].addr, chunk_size)
Splits the input length into chunks and does a parallel loop over each chunk. The number of chunks depends on the number of cores at runtime. chunk_offset and chunk_size should be passed as undeclared identifiers; within the template scope they will contain the start offset and the length of the current thread's chunk, i.e. their value is thread-specific.
Use omp_get_thread_num() to get the current thread number
This is useful for non-contiguous processing as a replacement for omp_parallel_for, or when operating on (contiguous) ranges, for example for memset or memcpy.
Do not forget to use attachGC and detachGC if you are allocating sequences, strings, or reference types. Those should be thread-local temporaries.
template omp_parallel_chunks_default(length: Natural; chunk_offset, chunk_size: untyped; body: untyped): untyped
-
This will be renamed omp_parallel_chunks once https://github.com/nim-lang/Nim/issues/9414 is solved. Compared to omp_parallel_chunks the following are set by default:
- omp_grain_size: The default OMP_MEMORY_BOUND_GRAIN_SIZE is suitable for contiguous copy or add operations. It's 1024 and can be changed by passing -d:OMP_MEMORY_BOUND_GRAIN_SIZE=123456 during compilation. A value of 1 will always parallelize the loop.
template omp_parallel_for(index: untyped; length: Natural; omp_grain_size: static Natural; use_simd: static bool; body: untyped)
-
Parallel for loop
Do not forget to use attachGC and detachGC if you are allocating sequences, strings, or reference types. Those should be thread-local temporaries.
Inputs:
- index, the iteration index, similar to for index in 0 ..< length: doSomething(index)
- length, the number of elements to iterate on
- omp_grain_size, the minimal amount of work per thread. Below this threshold we don't start threads. Note that we always start as many hardware threads as available, because starting a varying number of threads over the lifetime of the program adds overhead.
- use_simd, instructs the compiler to unroll the loops for SIMD use. For example, for float32:
  for i in 0 ..< 16: x[i] += y[i]
  will be unrolled to process 128, 256 or 512 bits at a time using SSE, AVX or AVX512. For 256-bit AVX:
  for i in countup(0, 15, 8): # Step 8 by 8
    x[i]   += y[i]
    x[i+1] += y[i+1]
    x[i+2] += y[i+2]
    ...
template omp_parallel_for_default(index: untyped; length: Natural; body: untyped)
-
This will be renamed omp_parallel_for once https://github.com/nim-lang/Nim/issues/9414 is solved. Compared to omp_parallel_for the following are set by default
- omp_grain_size: The default OMP_MEMORY_BOUND_GRAIN_SIZE is suitable for contiguous copy or add operations. It's 1024 and can be changed by passing -d:OMP_MEMORY_BOUND_GRAIN_SIZE=123456 during compilation. A value of 1 will always parallelize the loop.
- simd is used by default
template omp_parallel_if(condition: bool; body: untyped)
-
template omp_set_nested(x: cint)
-
template omp_set_num_threads(x: cint)
-
template omp_single(body: untyped): untyped
-
template omp_single_nowait(body: untyped): untyped
-
template omp_taskloop(index: untyped; length: Natural; annotation: static string; body: untyped)
- OpenMP taskloop
template omp_taskwait(): untyped
-