
src/arraymancer/laser/openmp


Consts

OMP_MEMORY_BOUND_GRAIN_SIZE {.intdefine.} = 1024
This is the minimum amount of work per physical core for memory-bound processing.
  • "copy" and "addition" are considered memory-bound
  • "float division" is roughly 2x~4x more expensive per element, so its grain size should be scaled down accordingly
  • "exp" and "sin" operations are compute-bound and show a performance boost even when processing only 1000 items on 28 cores

Launching 2 threads per core (HyperThreading) is probably desirable.

Raising the following parameters can have the following impact:

  • number of sockets: higher overhead per memory fetch
  • number of memory channels: lower overhead per memory fetch
  • RAM speed: lower overhead per memory fetch
  • private L2 cache: more data fed per CPU
  • hyperthreading and cache associativity
  • cores and shared L3 cache: more memory contention

Note that setting num_threads manually might impact performance negatively.

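Since the constant is declared with {.intdefine.}, the threshold can be overridden at compile time. A minimal sketch, assuming the module is importable as arraymancer/laser/openmp and the project is built with OpenMP enabled (-d:openmp); the file name app.nim is illustrative:

  # Build with an overridden grain size (flags other than the -d define
  # are assumptions):
  #   nim c -d:openmp -d:OMP_MEMORY_BOUND_GRAIN_SIZE=512 app.nim
  import arraymancer/laser/openmp

  # The const reflects whatever was passed via -d (1024 if unset).
  echo OMP_MEMORY_BOUND_GRAIN_SIZE
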
OMP_NON_CONTIGUOUS_SCALE_FACTOR {.intdefine.} = 4
Because non-contiguous tensors require stride computations, a lower grain size can be used for them.

Procs

proc omp_suffix(genNew: static bool = false): string {.compileTime.}
genNew: if false, return the last suffix; otherwise return a fresh one.

Macros

macro omp_flush(variables: varargs[untyped]): untyped

Templates

template attachGC(): untyped

If you are allocating reference types, sequences or strings in a parallel section, you need to attach and detach a GC for each thread. Those should be thread-local temporaries.

This attaches the GC.

Note: this produces confusing error messages when --threads:on is not set: https://github.com/nim-lang/Nim/issues/9489

template detachGC(): untyped

If you are allocating reference types, sequences or strings in a parallel section, you need to attach and detach a GC for each thread. Those should be thread-local temporaries.

This detaches the GC.

Note: this produces confusing error messages when --threads:on is not set: https://github.com/nim-lang/Nim/issues/9489

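As a usage illustration of the attach/detach pair, the sketch below allocates a thread-local seq inside a parallel section. It assumes the module is importable as arraymancer/laser/openmp and that the program is compiled with --threads:on and -d:openmp; buffer names are illustrative:

  import arraymancer/laser/openmp

  omp_parallel:
    # Each thread needs its own GC before touching seqs, strings or refs.
    attachGC()
    var localBuf = newSeq[float64](256)          # thread-local temporary
    localBuf[0] = float64(omp_get_thread_num())
    detachGC()
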
template omp_barrier(): untyped
template omp_chunks(omp_size: Natural; chunk_offset, chunk_size: untyped;
                    body: untyped): untyped
Internal template. This is the chunking part of omp_parallel_chunks. omp_size should be an lvalue (an already-assigned value) and not the result of a routine call, otherwise the routine and its side effects may be evaluated multiple times.
template omp_critical(body: untyped): untyped
template omp_for(index: untyped; length: Natural; use_simd, nowait: static bool;
                 body: untyped)

OpenMP for loop (not parallel)

This must be used in an omp_parallel block for parallelization.

Inputs:

  • index, the iteration index, similar to for index in 0 ..< length: doSomething(index)
  • length, the number of elements to iterate on
  • use_simd, instruct the compiler to unroll the loops for SIMD use. For example, for float32 the loop
        for i in 0 ..< 16:
          x[i] += y[i]
    will be unrolled to use 128-, 256- or 512-bit registers (SSE, AVX or AVX512). For 256-bit AVX:
        for i in countup(0, 15, 8): # step 8 by 8
          x[i]   += y[i]
          x[i+1] += y[i+1]
          x[i+2] += y[i+2]
          # ...
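
A minimal sketch of the pattern described above: the loop is distributed across the threads of an enclosing omp_parallel section. Import path, buffer names and sizes are assumptions; arguments are positional (index, length, use_simd, nowait):

  import arraymancer/laser/openmp

  var
    x = newSeq[float32](100_000)
    y = newSeq[float32](100_000)
  let n = x.len

  omp_parallel:
    # use_simd = true, nowait = true
    omp_for(i, n, true, true):
      x[i] += y[i]
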
template omp_get_max_threads(): cint
template omp_get_nested(): cint
template omp_get_num_threads(): cint
template omp_get_thread_num(): cint
template omp_master(body: untyped): untyped
template omp_parallel(body: untyped): untyped

Starts an openMP parallel section

Don't forget to use attachGC and detachGC if you are allocating sequences, strings, or reference types. Those should be thread-local temporaries.

template omp_parallel_chunks(length: Natural; chunk_offset, chunk_size: untyped;
                             omp_grain_size: static Natural; body: untyped): untyped

Create a chunk for each thread. You can use, for example: for index in chunk_offset ..< chunk_offset + chunk_size: or zeroMem(foo[chunk_offset].addr, chunk_size)

Splits the input length into chunks and does a parallel loop on each chunk. The number of chunks depends on the number of cores at runtime. chunk_offset and chunk_size should be passed as undeclared identifiers. Within the template scope they will contain the start offset and the length of the current thread's chunk, i.e. their values are thread-specific.

Use omp_get_thread_num() to get the current thread number.

This is useful for non-contiguous processing as a replacement for omp_parallel_for, or when operating on (contiguous) ranges, for example for memset or memcpy.

Do not forget to use attachGC and detachGC if you are allocating sequences, strings, or reference types. Those should be thread-local temporaries.

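A minimal sketch of chunked processing on a contiguous buffer, here zeroing each thread's chunk. Import path and names are assumptions; note that zeroMem takes a size in bytes, hence the sizeof factor:

  import arraymancer/laser/openmp

  var buf = newSeq[float32](1_000_000)
  let size = buf.len

  # chunkOffset and chunkSize are injected by the template and are
  # thread-specific.
  omp_parallel_chunks(size, chunkOffset, chunkSize, OMP_MEMORY_BOUND_GRAIN_SIZE):
    if chunkSize > 0:
      zeroMem(buf[chunkOffset].addr, chunkSize * sizeof(float32))
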
template omp_parallel_chunks_default(length: Natural;
                                     chunk_offset, chunk_size: untyped;
                                     body: untyped): untyped
This will be renamed omp_parallel_chunks once https://github.com/nim-lang/Nim/issues/9414 is solved. Compared to omp_parallel_chunks, the following is set by default:
  • omp_grain_size: The default OMP_MEMORY_BOUND_GRAIN_SIZE is suitable for contiguous copy or add operations. It's 1024 and can be changed by passing -d:OMP_MEMORY_BOUND_GRAIN_SIZE=123456 during compilation. A value of 1 will always parallelize the loop.
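
The defaults-only variant reads the same way, minus the grain-size argument. A short sketch on a byte buffer (import path and names are assumptions):

  import arraymancer/laser/openmp

  var buf = newSeq[byte](1_000_000)

  omp_parallel_chunks_default(buf.len, chunkOffset, chunkSize):
    if chunkSize > 0:
      zeroMem(buf[chunkOffset].addr, chunkSize)   # byte buffer: length == bytes
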
template omp_parallel_for(index: untyped; length: Natural;
                          omp_grain_size: static Natural; use_simd: static bool;
                          body: untyped)

Parallel for loop

Do not forget to use attachGC and detachGC if you are allocating sequences, strings, or reference types. Those should be thread-local temporaries.

Inputs:

  • index, the iteration index, similar to for index in 0 ..< length: doSomething(index)
  • length, the number of elements to iterate on
  • omp_grain_size, the minimal amount of work per thread. If there is less work than this per thread, the loop is not parallelized. Note that we always start as many hardware threads as are available, because starting a varying number of threads over the lifetime of the program adds overhead.
  • use_simd, instruct the compiler to unroll the loops for SIMD use. For example, for float32 the loop
        for i in 0 ..< 16:
          x[i] += y[i]
    will be unrolled to use 128-, 256- or 512-bit registers (SSE, AVX or AVX512). For 256-bit AVX:
        for i in countup(0, 15, 8): # step 8 by 8
          x[i]   += y[i]
          x[i+1] += y[i+1]
          x[i+2] += y[i+2]
          # ...
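
A minimal sketch with an explicit grain size, following the guidance on OMP_MEMORY_BOUND_GRAIN_SIZE: the body is a float division, so the grain size is scaled down relative to the memory-bound default. Import path, names and sizes are assumptions; arguments are positional (index, length, omp_grain_size, use_simd):

  import arraymancer/laser/openmp

  var
    dst = newSeq[float32](100_000)
    src = newSeq[float32](100_000)
  for k in 0 ..< src.len:
    src[k] = float32(k + 1)

  omp_parallel_for(i, dst.len, OMP_MEMORY_BOUND_GRAIN_SIZE div 2, true):
    dst[i] = src[i] / 3.0'f32
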
template omp_parallel_for_default(index: untyped; length: Natural; body: untyped)
This will be renamed omp_parallel_for once https://github.com/nim-lang/Nim/issues/9414 is solved. Compared to omp_parallel_for, the following are set by default:
  • omp_grain_size: The default OMP_MEMORY_BOUND_GRAIN_SIZE is suitable for contiguous copy or add operations. It's 1024 and can be changed by passing -d:OMP_MEMORY_BOUND_GRAIN_SIZE=123456 during compilation. A value of 1 will always parallelize the loop.
  • simd is used by default
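
For a contiguous, memory-bound operation the defaults are usually enough. A short sketch (import path and names are assumptions):

  import arraymancer/laser/openmp

  var
    dst = newSeq[float32](100_000)
    src = newSeq[float32](100_000)

  # Grain size OMP_MEMORY_BOUND_GRAIN_SIZE and SIMD unrolling by default.
  omp_parallel_for_default(i, dst.len):
    dst[i] += src[i]
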
template omp_parallel_if(condition: bool; body: untyped)
template omp_set_nested(x: cint)
template omp_set_num_threads(x: cint)
template omp_single(body: untyped): untyped
template omp_single_nowait(body: untyped): untyped
template omp_task(annotation: static string; body: untyped): untyped
template omp_taskloop(index: untyped; length: Natural;
                      annotation: static string; body: untyped)
OpenMP taskloop
template omp_taskwait(): untyped