### My home-made CUDA kernel for convolutions

Posted: **Sat Nov 09, 2019 10:19 pm**

Hi,


I wrote a CUDA kernel for convolution, and it outperforms cuDNN for small tensors.

cuDNN is efficient for large batches but slow for small ones. Here are some numbers from my Go program. The network has 80 layers of 128 channels on a 19x19 board, running on a V100 (PCI Express) in half precision with the NHWC tensor format:

```
cuDNN:    Time      Boards/s
N = 8:    0.434614   1840.71
N = 16:   0.44284    3613.04
N = 32:   0.61234    5225.85
N = 64:   1.17127    5464.16
N = 128:  1.77406    7215.07
N = 256:  2.98986    8562.28
N = 512:  5.92637    8639.35
```

and here are the numbers for my code:

```
Mine:     Time      Boards/s
N = 8:    0.35589    2247.88
N = 16:   0.521692   3066.94
```

So my improvement is very modest, and only for N = 8, but that is nice for a first try.

Here is how I do it:

- Direct convolution: no GEMM
- Tensors in HWCN format
- Blocks of 8 warps; each warp computes the output for one single point of the board
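As a rough sketch of the launch layout the list above implies (the names and the exact point-to-warp mapping are my assumptions, not the original code):

```cuda
#include <cuda_fp16.h>

// Each block holds 8 warps and each warp produces the outputs for one
// board point, so a 19x19 board needs ceil(361 / 8) = 46 blocks per
// group of output channels.
constexpr int WARPS_PER_BLOCK = 8;
constexpr int BOARD_POINTS    = 19 * 19;  // 361

__global__ void conv_direct(const __half* input,    // HWCN layout
                            const __half* weights,  // 3x3 filters
                            __half* output) {
    int warp_id = threadIdx.x / warpSize;                  // 0..7
    int point   = blockIdx.x * WARPS_PER_BLOCK + warp_id;  // board point
    if (point >= BOARD_POINTS) return;
    int row = point / 19, col = point % 19;
    (void)row; (void)col;
    // ... each warp accumulates over the 3x3 neighbourhood of
    // (row, col), reading directly from global memory ...
}
```

A launch would then look like `conv_direct<<<46, WARPS_PER_BLOCK * 32>>>(...)`, with the channel-group loop either inside the kernel or as a second grid dimension.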

CUDA allows three geometries of warp-level matrix multiplication: 16x16 * 16x16, 8x16 * 16x32, and 32x16 * 16x8. So for N = 8, my code handles inputs in groups of 16 and outputs in groups of 32. I use padding when the number of channels is not divisible by these sizes.
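These geometries correspond to the m16n16k16, m8n32k16, and m32n8k16 shapes of the warp-level `wmma` API. A minimal sketch of the 8x16 * 16x32 case used for N = 8 (the function name and the surrounding tiling are my assumptions):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies an 8x16 half tile by a 16x32 half tile,
// accumulating into an 8x32 float tile (m8n32k16 geometry).
__device__ void mma_8x16_16x32(const __half* a, int lda,  // 8x16 inputs
                               const __half* b, int ldb,  // 16x32 weights
                               wmma::fragment<wmma::accumulator,
                                              8, 32, 16, float>& acc) {
    wmma::fragment<wmma::matrix_a, 8, 32, 16, __half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 8, 32, 16, __half, wmma::row_major> fb;

    wmma::load_matrix_sync(fa, a, lda);  // straight from global memory,
    wmma::load_matrix_sync(fb, b, ldb);  // no shared-memory staging
    wmma::mma_sync(acc, fa, fb, acc);    // acc += a * b, in fp32
}
```

The accumulator fragment stays in registers across the loop over input-channel groups; only the final result touches shared memory.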

I do not use shared memory for the matrix multiplications; I read all the data directly from global memory. I use shared memory only for writing the final result. My kernel uses 16-bit floats but performs accumulation in 32-bit. The 32-bit accumulators are buffered in shared memory so they can be converted to half precision and written to main memory.
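The epilogue described here, with fp32 accumulators staged through shared memory and converted to fp16 on the way out, might look like this (a sketch under my own naming; a real kernel would give each warp its own staging slice):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Epilogue sketch: dump the 8x32 float accumulator into shared memory,
// then convert to half and write it to global memory.
__device__ void store_tile(wmma::fragment<wmma::accumulator,
                                          8, 32, 16, float>& acc,
                           __half* out, int ldo) {
    __shared__ float staging[8 * 32];  // one slice per warp in reality
    wmma::store_matrix_sync(staging, acc, 32, wmma::mem_row_major);
    __syncwarp();  // make the staged values visible to all lanes

    int lane = threadIdx.x % warpSize;
    for (int i = lane; i < 8 * 32; i += warpSize) {
        int r = i / 32, c = i % 32;
        out[r * ldo + c] = __float2half(staging[i]);  // 32-bit -> 16-bit
    }
}
```

Going through shared memory here lets the half-precision stores come out coalesced, which a direct per-fragment store would not guarantee.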

I profiled the code, and the profiler reports that I am at about 25-30% of the speed of light, in terms of both computation and memory bandwidth. I am not sure how to improve this. I suspect it means I do not launch enough blocks for the latency to be hidden.