CUDA "convolution" as slow as OpenMP version -


i'm trying "convolve" featwidth * featheight * 31 cube modelwidth * modelheight * 31 cube. problem kernel quite slow (well, manage quicker sequential cpu code, slow openmp version). i'm using quadro fx 1800 (yeah, 64 cuda cores...).

__constant__ float d_model[31*22*22];

#define imul(a,b)   ( __mul24((a), (b)) )
#define imad(a,b,c) ( __mul24((a), (b)) + (c) )

__global__ void dmatch(float *score, const int featwidth, const int featheight,
                       const int modelwidth, const int modelheight,
                       const int scorewidth, const int scoreheight)
{
    const int x = imad(blockIdx.x, blockDim.x, threadIdx.x);
    const int y = imad(blockIdx.y, blockDim.y, threadIdx.y);
    if(x < scorewidth && y < scoreheight)
    {
        const int scoreidx = imad(x, scoreheight, y);
        score[scoreidx] = 0.f;
        const int basefeatidx = imul(x, scoreheight) + imad(modelheight-1, x, y);
        for(int z = 0; z < 31; ++z)
        {
            // index positioning
            int featidx  = imad(z, imul(featwidth, featheight), basefeatidx);
            int modelidx = imul(z, imul(modelwidth, modelheight));

            float value = 0.f;

            // filter
            for(int xx = 0; xx < modelwidth; xx++)
            {
                const int xxmodelidx = imad(xx, modelheight, modelidx);
                const int xxfeatidx  = imad(xx, featheight, featidx);
                for(int yy = 0; yy < modelheight; yy++)
                {
                    value += d_model[xxmodelidx + yy] * tex1Dfetch(texfeatures, xxfeatidx + yy);
                }
            }
            score[scoreidx] += value;
        }
    }
}
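For reference, the kernel relies on a texture reference texfeatures and on d_model being filled from the host; that setup isn't shown in the post. A minimal sketch of what it could look like (the setup function, d_features and h_* names are assumptions, and in real code the texture declaration must appear before the kernel that calls tex1Dfetch):

// Host-side setup sketch (assumed; not part of the original post).
texture<float, 1, cudaReadModeElementType> texfeatures;

void setup(const float *h_model, const float *h_features,
           int featwidth, int featheight,
           int modelwidth, int modelheight)
{
    // Copy the model into constant memory (it must fit in the 31*22*22 floats
    // declared above).
    cudaMemcpyToSymbol(d_model, h_model,
                       31 * modelwidth * modelheight * sizeof(float));

    // Allocate the feature cube in global memory and bind it to the texture,
    // so the kernel can read it through tex1Dfetch.
    float *d_features;
    size_t featbytes = 31 * featwidth * featheight * sizeof(float);
    cudaMalloc(&d_features, featbytes);
    cudaMemcpy(d_features, h_features, featbytes, cudaMemcpyHostToDevice);
    cudaBindTexture(NULL, texfeatures, d_features, featbytes);
}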

Anyway, I launch the kernel with 8*8 threads per block and a grid size of (scorewidth/8)*(scoreheight/8) (scorewidth and scoreheight being the sizes of the resulting matrix). I'd like to know if you have any clue about what is wrong or rather slow in my code.
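A minimal sketch of that launch, assuming the sizes are multiples of 8 as the integer division suggests, and with d_score standing for the (assumed) output buffer:

// Launch sketch matching the description above: 8*8 threads per block,
// one thread per element of the scorewidth x scoreheight output matrix.
dim3 threads(8, 8);
dim3 grid(scorewidth / 8, scoreheight / 8);   // assumes sizes divisible by 8
dmatch<<<grid, threads>>>(d_score, featwidth, featheight,
                          modelwidth, modelheight,
                          scorewidth, scoreheight);
cudaDeviceSynchronize();                      // so errors and timings are visible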

Edit:

A faster version (a 150 ms drop on a 480 ms process!) thanks to tera:

__global__ void dmatch(float *score, const int featwidth, const int featheight,
                       const int modelwidth, const int modelheight,
                       const int scorewidth, const int scoreheight)
{
    const int y = imul(4, imad(blockIdx.x, blockDim.x, threadIdx.x));
    const int x = imad(blockIdx.y, blockDim.y, threadIdx.y);
    if(x < scorewidth && y < scoreheight)
    {
        const int scoreidx = imad(x, scoreheight, y);
        const int basefeatidx = imul(x, scoreheight) + imad(modelheight-1, x, y);
        float value = 0.f, value1 = 0.f, value2 = 0.f, value3 = 0.f;
        float feat, feat1, feat2, feat3;

        // index positioning
        int featidx = 0;
        int modelidx = 0;
        int xxmodelidx;
        int xxfeatidx;
        float val;

        for(int z = 0; z < 31; ++z)
        {
            featidx  = imad(z, imul(featwidth, featheight), basefeatidx);
            modelidx = imul(z, imul(modelwidth, modelheight));

            // filter
            for(int xx = 0; xx < modelwidth; xx++)
            {
                xxmodelidx = imad(xx, modelheight, modelidx);
                xxfeatidx  = imad(xx, featheight, featidx);
                feat  = tex1Dfetch(texfeatures, xxfeatidx + 0);
                feat1 = tex1Dfetch(texfeatures, xxfeatidx + 1);
                feat2 = tex1Dfetch(texfeatures, xxfeatidx + 2);
                feat3 = tex1Dfetch(texfeatures, xxfeatidx + 3);
                for(int yy = 0; yy < modelheight; yy++)
                {
                    val = d_model[xxmodelidx + yy];
                    value  += val * feat;
                    value1 += val * feat1;
                    value2 += val * feat2;
                    value3 += val * feat3;
                    feat  = feat1;
                    feat1 = feat2;
                    feat2 = feat3;
                    feat3 = tex1Dfetch(texfeatures, xxfeatidx + yy + 4);
                }
            }
        }
        score[scoreidx] = value;
        if(y + 1 < scoreheight)
            score[scoreidx + 1] = value1;
        if(y + 2 < scoreheight)
            score[scoreidx + 2] = value2;
        if(y + 3 < scoreheight)
            score[scoreidx + 3] = value3;
    }
}

Launched with dim3 threads(16,16); dim3 grid(divup(scoreheight,64), divup(scorewidth,16));.
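divup is not defined in the post; presumably it is the usual rounded-up integer division, as in this sketch (d_score again being the assumed output buffer):

// Assumed definition of divup: integer division rounded up, so the grid
// covers the whole output even when a size is not a multiple of the block.
static inline int divup(int a, int b)
{
    return (a + b - 1) / b;
}

// Launch as described: each thread now computes 4 consecutive y values,
// hence the division of scoreheight by 16*4 = 64.
dim3 threads(16, 16);
dim3 grid(divup(scoreheight, 64), divup(scorewidth, 16));
dmatch<<<grid, threads>>>(d_score, featwidth, featheight,
                          modelwidth, modelheight,
                          scorewidth, scoreheight);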

What does the profiler say? NVIDIA Nsight (a plugin for Visual Studio on Windows and for Eclipse on Linux) allows you to see the stalls and provides various hints for optimizing performance.

My guess (without looking at the profiler) is that your blocks are too small. There are 32 threads inside a warp, which is the basic scheduling element. An NVIDIA GPU is fast because it can hide latency by operating on other threads while the current one is still executing its previous instruction. While it is possible to have 8 blocks per SM (on Tesla and Fermi) or 16 (on Kepler), with such small blocks you still have only 16-32 warps at peak, which can be quite limiting (and I may be wrong, but launching a block also has some latency). Consider using larger blocks.
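As an illustration (a sketch, not from the original post, with d_score standing for the output buffer): a 16*16 block gives 256 threads, i.e. 8 warps per block instead of the 2 warps of an 8*8 block:

// Sketch: 16*16 = 256 threads per block (8 warps) instead of 8*8 = 64 threads
// (2 warps), giving the scheduler more warps per SM to hide latency with.
// Grid sizes are rounded up so partial blocks at the border are still covered.
dim3 threads(16, 16);
dim3 grid((scorewidth  + threads.x - 1) / threads.x,
          (scoreheight + threads.y - 1) / threads.y);
dmatch<<<grid, threads>>>(d_score, featwidth, featheight,
                          modelwidth, modelheight,
                          scorewidth, scoreheight);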

The texture fetch is sub-optimal, if I understand the code correctly - threads within a warp differ by modelheight - 1 in basefeatidx, and therefore in featidx and xxfeatidx. Therefore the texture fetches are entirely random and do not exploit data locality. Reversing x and y would make this more efficient.
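Concretely, "reversing x and y" means taking y (the index that moves fastest through the feature cube) from threadIdx.x, which is essentially what the edited kernel above now does; the two relevant index lines, with the effect spelled out in comments:

// With y taken from threadIdx.x, adjacent threads in a warp get adjacent y
// values, so basefeatidx (= x*scoreheight + (modelheight-1)*x + y) differs
// by 1 between neighbours and tex1Dfetch can hit the texture cache.
const int y = imad(blockIdx.x, blockDim.x, threadIdx.x);
const int x = imad(blockIdx.y, blockDim.y, threadIdx.y);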

However, the rule is to check the profiler first - if your problem is compute bound on the GPU, you should concentrate on the computing side; if it is memory bound, you should work on the memory access pattern. There might be several other parts that look like spots for optimization, but you won't know until you see where the bottleneck is. Once you know, you may want to read the specific chapter of the Best Practices Guide.
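Independently of the profiler, a cheap way to check whether a change actually helps is to time the kernel with CUDA events; a minimal sketch, reusing the grid, threads and d_score names from the launch code above:

// Timing sketch with CUDA events, useful for comparing kernel variants
// while you investigate where the time goes.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
dmatch<<<grid, threads>>>(d_score, featwidth, featheight,
                          modelwidth, modelheight,
                          scorewidth, scoreheight);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop);
printf("dmatch took %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);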

