My home-made CUDA kernel for convolutions

Post by Rémi Coulom » Sat Nov 09, 2019 10:19 pm

Hi,

I wrote a CUDA kernel for convolution, and it outperforms cuDNN for small tensors.

cuDNN is efficient for large batches, but slow for small ones. Here are some numbers for my Go program: a network with 80 layers of 128 channels on a 19x19 board, running on a V100 (PCI Express) in half precision with the NHWC tensor format:


cuDNN:  Time      Boards/s
N =   8: 0.434614  1840.71
N =  16: 0.44284   3613.04
N =  32: 0.61234   5225.85
N =  64: 1.17127   5464.16
N = 128: 1.77406   7215.07
N = 256: 2.98986   8562.28
N = 512: 5.92637   8639.35
And here are the numbers for my code:


N =  8: 0.35589   2247.88
N = 16: 0.521692  3066.94
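So my improvement is very modest: about 22% more boards per second than cuDNN at N = 8 (2247.88 vs 1840.71), and about 15% fewer at N = 16 (3066.94 vs 3613.04). Still, that is nice for a first try.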
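The post does not state the unit of the Time column. The figures are consistent with Boards/s = 100 × N / Time, so the sketch below assumes that Time is the number of seconds for 100 batches. That reading is an assumption, and the function and constant names are made up for illustration.

```cpp
#include <cassert>

// Assumption: "Time" is the wall-clock time in seconds for 100 batches.
// The post does not say so; this reading reproduces the Boards/s column.
constexpr int kAssumedBatchesPerMeasurement = 100;

// Throughput in boards per second for one row of the tables.
double boards_per_second(int batch_size, double time_s) {
    assert(batch_size > 0 && time_s > 0.0);
    return kAssumedBatchesPerMeasurement * static_cast<double>(batch_size) / time_s;
}

// Ratio of two timings for the same batch size.
// A value above 1 means the first timing (time_a) is the faster one.
double speedup(double time_a, double time_b) {
    assert(time_a > 0.0 && time_b > 0.0);
    return time_b / time_a;
}
```

For example, `speedup(0.35589, 0.434614)` compares the two N = 8 rows, and `speedup(0.521692, 0.44284)` compares the two N = 16 rows.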

Here is how I do it:
  • Direct convolution: no GEMM
  • Tensor in HWCN format
  • Blocks of 8 warps, each warp computes the output for one single point of the board
More precisely, the matrix multiplication performed by the tensor cores is Output = Weights * Input. Weights has dimensions output_channels x input_channels, Input has dimensions input_channels x batch_size, and Output has dimensions output_channels x batch_size. With the HWCN tensor format, there is no need to rearrange data to do the matrix multiplication: for each point of the board, the input_channels x batch_size matrix is already stored contiguously in memory.
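To make the layout argument concrete, here is a minimal CPU sketch of the indexing and of the per-point product Output = Weights * Input. It is not the actual kernel: it uses plain `float` instead of half precision and tensor cores, and the names `hwcn_offset`, `point_slice`, and `accumulate_point` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (h, w, c, n) in an HWCN tensor: n varies fastest.
std::size_t hwcn_offset(int h, int w, int c, int n, int W, int C, int N) {
    return ((static_cast<std::size_t>(h) * W + w) * C + c) * N + n;
}

// Hypothetical helper: pointer to the C x N input matrix of board point (h, w).
// The matrix is row-major with leading dimension N, so no copy is needed.
const float* point_slice(const std::vector<float>& tensor, int h, int w,
                         int W, int C, int N) {
    assert(h >= 0 && w >= 0 && w < W);
    return tensor.data() + hwcn_offset(h, w, 0, 0, W, C, N);
}

// output (c_out x N) += weights (c_out x c_in) * input (c_in x N).
// All three matrices are row-major.
void accumulate_point(const float* weights, const float* input, float* output,
                      int c_out, int c_in, int N) {
    for (int o = 0; o < c_out; ++o)
        for (int i = 0; i < c_in; ++i)
            for (int n = 0; n < N; ++n)
                output[o * N + n] += weights[o * c_in + i] * input[i * N + n];
}
```

The point of the sketch is that `point_slice` only computes an address: in HWCN order, the channel and batch dimensions of one board point already form the right-hand matrix of the product.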

For half precision, CUDA allows 3 geometries of matrix multiplication: 16x16 * 16x16, 8x16 * 16x32, and 32x16 * 16x8. So for N = 8, I use the 32x16 * 16x8 geometry: my code handles input channels in groups of 16 and output channels in groups of 32. I pad with zeros if the number of channels is not a multiple of the group size.
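The grouping and padding arithmetic can be sketched as follows. This is only an illustration of the rounding, assuming the (m x k) * (k x n) reading of the three geometries with m = output channels, k = input channels, and n = batch size. `TileShape`, `padded_channels`, and `tiles_per_product` are hypothetical names.

```cpp
#include <cassert>

// Shape of one tensor-core multiplication: (m x k) * (k x n).
struct TileShape { int m, k, n; };

// The three geometries listed in the post, in the same order.
constexpr TileShape kShapes[] = {{16, 16, 16}, {8, 16, 32}, {32, 16, 8}};

// Smallest multiple of `group` that is greater than or equal to `channels`.
int padded_channels(int channels, int group) {
    assert(channels > 0 && group > 0);
    return (channels + group - 1) / group * group;
}

// Number of tile multiplications needed for one Weights * Input product
// when the batch fits in a single tile (batch size <= s.n).
int tiles_per_product(int c_out, int c_in, const TileShape& s) {
    return (padded_channels(c_out, s.m) / s.m) * (padded_channels(c_in, s.k) / s.k);
}
```

With the 32x16 * 16x8 shape (`kShapes[2]`) and 128 channels on both sides, no padding is needed and one product takes 4 x 8 = 32 tile multiplications. A layer with 100 output channels would be padded to 128.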

I do not use shared memory for the matrix multiplications, and read all the data directly from memory. I use shared memory only for writing the final result. My kernel uses 16-bit floats, but performs accumulation in 32-bit. The 32-bit accumulators are buffered in shared memory in order to be converted to half-precision and written to main memory.

I profiled the code, and the profiler says I am at about 25-30% of the speed of light, both in terms of computation and memory bandwidth. I am not sure how to improve this. I guess it indicates that I do not run enough blocks to hide the latency.
