I'm currently using gcc version 6.2.1 20161117 and running a very simple program that uses 32 gangs.
I compile it with gcc -fopenacc -o3 -o piexample.x piexample.c and run it with the following environment:
export GOMP_DEBUG=1
export ACC_DEVICE_TYPE=nvidia
The problem is that no matter how many gangs I request, the result is always the same: only 1 gang is launched.
GOACC_parallel_keyed: mapnum=1, hostaddrs=0x7fffc872dff0, size=0x6012c0, kinds=0x6012b8
nvptx_exec: prepare mappings
nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=1, vectors=32
nvptx_exec: kernel main$_omp_fn$0: finished
GOACC_data_end: restore mappings
GOACC_data_end: mappings restored
If I compile it with the PGI compiler, the number of gangs is set correctly:
main:
12, Generating copy(pi)
14, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
14, #pragma acc loop gang(32), worker(8), vector(8) /*blockIdx.x threadIdx.y threadIdx.x */
16, Generating implicit reduction(+:pi)
Any ideas about what is causing this?