opencl optimal group size -


i'm running mandelbrot generator (2d image static params) on opencl. program straightforward:

__kernel void mandelbrot(__global uchar * output,                  const float xstep,                 const float xoffset,                 const float ystep,                 const float yoffset,                 const int maxiter) {     int gid_y = get_global_id(1);     int gid_x = get_global_id(0);      //calculate x , y on fly every pixel.      //this fast reading precalculated rulers global memory.     float x = gid_x * xstep + xoffset;     float y = gid_y * ystep + yoffset;      float real = 0;     float imag = 0;      int out = 0;      for(int curiter = 0; curiter < maxiter; curiter++) {         float nreal = real*real - imag*imag + x;         imag = 2* real*imag + y;         real = nreal;          if (real*real + imag*imag > 4.0f) {             out = curiter;             break;         }     }      //normalize output     out *= 256.0 / (float)maxiter;     output[gid_y * get_global_size(0) + gid_x] = out; 

}

[edit] [posted full kernel, , swapped rows , columns suggested. way gained 18% performance on amd, 0% on nvidia. original code was

output[get_global_id(0) * get_global_size(1) + get_global_id(1)] = out; 

[/edit]

i'm running on nvidia quadro 1000m, has 2 compute units , 96 cuda cores (48 cores per compute unit).

i'm playing around changing local group size when enqueuing kernel. these performance results different sizes when generating 400mpixel image. numbers opencl profiler , exclude final memory copy os. image 40992x10272 - both height , width divisible 48.

rows x columns 8x8: 397 mpixel/s 8x12: 505 mpixel/s 8x16: 523 mpixel/s 8x24: 521 mpixel/s 8x32: 520 mpixel/s 8x48: 520 mpixel/s  1x48: 321 mpixel/s 2x32: 424 mpixel/s 2x48: 523 mpixel/s 4x24: 519 mpixel/s 3x32: 525 mpixel/s 4x32: 525 mpixel/s 4x48: 525 mpixel/s  12x8: 490 mpixel/s 12x12:464 mpixel/s 12x24:505 mpixel/s 12x32:508 mpixel/s 12x48:433 mpixel/s  16x8: 499 mpixel/s 16x12:499 mpixel/s 16x16:472 mpixel/s 16x24:450 mpixel/s 16x32:440 mpixel/s 16x48:418 mpixel/s 

some of these numbers leave me baffled. while clear why best results 48 columns (thanks how simd operations work), don't understand:

  1. why performance degrade dramatically when use 16 rows per group?
  2. why poor performance 1x48?
  3. why in heaven top performance 3x32, 4x32, , 8x32?!? have expected 33% of simd processors sit idle, , instead looks workgroup sitting in between 2 compute units?!?
  4. why preferred_work_group_size_multiple return 32 instead of 48?
  5. is there non-empirical way figure out geometry top performance on gpu (ati/nvidia/intel hd), given acquire opencl info structures?

thanks in advance

i answered similar question here might find interesting before reading following.

why performance degrade dramatically when use 16 rows per group?

actually degrades when use 12 rows. memory access works transaction. transaction fetch number of bytes in 1 shot. if several workitems try access several contiguous elements in array means 1 transaction might enough serve them all.

because access memory in way:

output[get_global_id(0) * get_global_size(1) + get_global_id(1)] = out;

it means bigger local size in dimension 0, bigger number of transaction since have access non contiguous elements (separated get_global_size(1) elements). , global memory access expensive.

so in case of 12/16 rows, have @ least 12/16 transactions needed. lead second question:

why poor performance 1x48?

based on i've said before, seems performance should great, since number of transactions minimal.

but here comes problem of idling threads. information got regarding 48 cores per sm wrong pointed out others. threads executed in group (called warp nvidia) of 32 on nvidia hardware. note these groups called wavefront , can 64 threads amd. since have in case workgroup composed of 48 threads (1 48), means 64 threads scheduled. number of threads multiple of 32 scheduled because can't execute fraction of warp.

therefore in case have fourth of threads nothing. , when compare result obtained 2x32 (still 64 threads - 2 warps, utilized) 321 mpixel/s pretty 3/4 of 424 mpixel/s.

it worth noting result: 2x48: 523 mpixel/s. in case workgroup size 96 multiple of 32. no idling threads.

why in heaven top performance 3x32, 4x32, , 8x32?!?

well, answer comes 2 previous ones: use multiple of 32, , keep number of threads in dimension 0 relatively small. let's have closer results:

2x32:  424 mpixel/s 3x32:  525 mpixel/s 4x32:  525 mpixel/s 8x32:  520 mpixel/s 16x32: 440 mpixel/s 

the decrease of performance 2 last lines explained said. however, increase of performance between first , second line not.

the increase of performance comes form somewhere else in case. fact in second case enough warps run on same sm hide access memory latency. see referred_work_group_size_multiple value says should try use multiple of value best performance. several warps can scheduled on same sm @ same time.

so, how work? let's take 3x32 case. have workgroup composed of 3 warps. since belong same workgroup scheduled on same sm required opencl standard (if wasn't case, sync between threads within workgroup wouldn't possible).

the first warp starts run until gets stall because memory access needed. meanwhile warp 1 waits memory transactions complete, warp 2 can start run. since there lot of registers on sm, sm can , switch context run others warps. variables of warp 1 stay on registers allocated warp 1. warp 2 hits line memory access required , gets stall. @ moment, next ready run warp can start running. warp 3 warp 1 if memory access completed. in case seems warp 3 runs, since have difference between 2x32 , 3x32. in first case there not enough warps scheduled hide memory accesses though in second case there are.

as mater of fact, influence bad performance 1x48 size question 2.

why preferred_work_group_size_multiple return 32 instead of 48?

already answered.

is there non-empirical way figure out geometry top performance on gpu (ati/nvidia/intel hd), given acquire opencl info structures?

it's other languages. when know how works under hood, helps produce first code. you'll still have benchmark it, , go through process of trial , errors tweak it. keeping in mind i've written small part of things matter performance. querying info opencl combined understanding of cpu/gpu help... that's it.

because lot of parameters influencing performance antagonists, you'll gain inone side, lost in other.

therefore keep benchmarking ;).


Comments

Popular posts from this blog

Need help in packaging app using TideSDK on Windows -

java - Why does my date parsing return a weird date? -