# Performance Optimizations in CUDA

IOANNIS E. VENETIS

Assistant Professor Department of Informatics University of Piraeus, Greece

## Achieving high performance on GPUs

Modern GPUs have the potential of achieving high performance

- ≈33.5 TFLOPS for Double Precision arithmetic (H100 and H200 GPUs)
- ≈90.5 TFLOPS for Single Precision arithmetic (L40 GPU)
- ≈989.4 TFLOPS for Half Precision arithmetic (H100 GPU)

Programming in a GPU hardware-agnostic manner will not unleash the full potential of GPUs

Programming for high performance on a GPU is not an easy task

- Good knowledge of the GPU hardware is required
- Good knowledge of possible performance bottlenecks is required
- Good knowledge of how to map programming constructs to the GPU hardware to avoid bottlenecks is required

#### Purpose of this presentation

Present the common bottlenecks in achieving high performance on GPUs

Present strategies to overcome these bottlenecks

Provide a starting point on where to look at, in case direct programming in CUDA and achieving high performance on the GPU is required

- Achieving high performance on the GPU does not always require direct programming in CUDA
- Highly optimized libraries are available that provide commonly required operations in a range of scientific domains
  - cuBLAS, cuSPARSE, CUDA Graphs, cuSOLVER, cuFFT, cuRAND, AmgX, ...

## Simple Matrix Multiplication



#### Simple Matrix Multiplication

We will use this algorithm to exploit usage of shared memory

• One of the strategies to improve performance in programs that execute on the GPU

#### Preliminaries

A common strategy when programming for the GPU is to assign calculations for each element of the result to a single CUDA thread

Matrix multiplication

• Each thread performs calculations for one element of the output matrix

How to assign elements of the output matrix to threads?

#### Correlating threads to matrix elements



EuroCC@Greece

x =

V =

#### Threads exceeding matrix limits



EuroCC@Greece

#### Row-major representation of matrices in C/C++

2-D matrices are stored in memory row-after-row

- Position in this 1-D representation:
  - Row \* Width + Column





#### Multiplication of square matrices

Assume the multiplication of square matrices: P = M \* N

- Matrices assumed square for simplicity in the example
- Size of matrices is WIDTH x WIDTH
- Our strategy will be for each CUDA thread to calculate one element of P
- Notice that each row of M is loaded WIDTH times from global memory
- Notice that each column of N is loaded WIDTH times from global memory



#### 2-D matrix multiplication CPU code



Ν

#### How to organize calculations

We will create blocks of threads of size TILE\_WIDTH × TILE\_WIDTH

Each block will calculate a tile of size TILE\_WIDTH × TILE\_WIDTH of the result matrix P

The 2-D grid must have size (WIDTH/TILE\_WIDTH) × (WIDTH/TILE\_WIDTH)



#### A larger example

| P <sub>0,0</sub>                     | P <sub>0,1</sub>                     | P <sub>0,2</sub>                     | P <sub>0,3</sub>                     | P <sub>0,4</sub> | P <sub>0,5</sub>                     | P <sub>0,6</sub>                     | P <sub>0,7</sub>                     |
|--------------------------------------|--------------------------------------|--------------------------------------|--------------------------------------|------------------|--------------------------------------|--------------------------------------|--------------------------------------|
|                                      |                                      |                                      |                                      | P <sub>1,4</sub> |                                      |                                      |                                      |
| P <sub>2,0</sub>                     | P <sub>2,1</sub>                     | P <sub>2,2</sub>                     | P <sub>2,3</sub>                     | P <sub>2,4</sub> | P <sub>2,5</sub>                     | P <sub>2,6</sub>                     | P <sub>2,7</sub>                     |
|                                      |                                      | P <sub>3,2</sub>                     |                                      |                  |                                      |                                      |                                      |
|                                      |                                      |                                      |                                      |                  |                                      |                                      |                                      |
|                                      |                                      |                                      |                                      | P <sub>4,4</sub> |                                      |                                      |                                      |
| P <sub>4,0</sub>                     | P <sub>4,1</sub>                     | P <sub>4,2</sub>                     | P <sub>4,3</sub>                     |                  | P <sub>4,5</sub>                     | P <sub>4,6</sub>                     | P <sub>4,7</sub>                     |
| P <sub>4,0</sub><br>P <sub>5,0</sub> | P <sub>4,1</sub><br>P <sub>5,1</sub> | P <sub>4,2</sub><br>P <sub>5,2</sub> | P <sub>4,3</sub><br>P <sub>5,3</sub> | P <sub>4,4</sub> | P <sub>4,5</sub><br>P <sub>5,5</sub> | P <sub>4,6</sub><br>P <sub>5,6</sub> | P <sub>4,7</sub><br>P <sub>5,7</sub> |

WIDTH = 8; TILE\_WIDTH = 2; Each block has 2 \* 2 = 4 threads

WIDTH / TILE\_WIDTH = 4; We will use 4 \* 4 = 16 blocks

#### Using a different block size

| P <sub>0,0</sub> | P <sub>0,1</sub>                     | P <sub>0,2</sub> | P <sub>0,3</sub> | P <sub>0,4</sub> | P <sub>0,5</sub> | P <sub>0,6</sub> | P <sub>0,7</sub> |
|------------------|--------------------------------------|------------------|------------------|------------------|------------------|------------------|------------------|
| P <sub>1,0</sub> | P <sub>1,1</sub>                     | P <sub>1,2</sub> | P <sub>1,3</sub> | P <sub>1,4</sub> | P <sub>1,5</sub> | P <sub>1,6</sub> | P <sub>1,7</sub> |
| P <sub>2,0</sub> | P <sub>2,1</sub>                     | P <sub>2,2</sub> | P <sub>2,3</sub> | P <sub>2,4</sub> | P <sub>2,5</sub> | P <sub>2,6</sub> | P <sub>2,7</sub> |
| P <sub>3,0</sub> | P <sub>3,1</sub>                     | P <sub>3,2</sub> | P <sub>3,3</sub> | P <sub>3,4</sub> | P <sub>3,5</sub> | P <sub>3,6</sub> | P <sub>3,7</sub> |
|                  |                                      |                  |                  |                  |                  |                  |                  |
| P <sub>4,0</sub> | P <sub>4,1</sub>                     | P <sub>4,2</sub> | P <sub>4,3</sub> | P <sub>4,4</sub> | P <sub>4,5</sub> | P <sub>4,6</sub> | P <sub>4,7</sub> |
|                  | P <sub>4,1</sub><br>P <sub>5,1</sub> |                  |                  |                  |                  |                  |                  |
|                  | P <sub>5,1</sub>                     |                  | P <sub>5,3</sub> | P <sub>5,4</sub> | P <sub>5,5</sub> | P <sub>5,6</sub> | P <sub>5,7</sub> |

WIDTH = 8; TILE\_WIDTH = 4; Each block has 4 \* 4 = 16 threads

WIDTH / TILE\_WIDTH = 2; We will use 2 \* 2 = 4 blocks

### Calculations on block (0, 0) for TILE\_WIDTH = 2





#### EuroCC@Greece

### Calculations on block (0, 1) for TILE WIDTH = 2

Col ω N



| Row | = | 0 |
|-----|---|---|
| Row | = | 1 |

| M <sub>0,0</sub> | M <sub>0,1</sub> | M <sub>0,2</sub> | М <sub>0,3</sub> |  |
|------------------|------------------|------------------|------------------|--|
| M <sub>1,0</sub> | M <sub>1,1</sub> | M <sub>1,2</sub> | M <sub>1,3</sub> |  |
| M <sub>2,0</sub> | M <sub>2,1</sub> | M <sub>2,2</sub> | M <sub>2,3</sub> |  |
| M <sub>3,0</sub> | M <sub>3,1</sub> | M <sub>3,2</sub> | M <sub>3,3</sub> |  |

| N <sub>0,0</sub> | N <sub>0,1</sub> | N <sub>0</sub>        | ,2 | Ν | <b>I</b> <sub>0,</sub> | 3 |
|------------------|------------------|-----------------------|----|---|------------------------|---|
| N <sub>1,0</sub> | N <sub>1,1</sub> | <b>N</b> <sub>1</sub> | ,2 | Ν | I <sub>1,</sub>        | 3 |
| N <sub>2,0</sub> | N <sub>2,1</sub> | N <sub>2</sub>        | ,2 | Ν | I <sub>2,</sub>        | 3 |
| N <sub>3,0</sub> | N <sub>3,1</sub> | N <sub>3</sub>        | ,2 | Ν | I <sub>3,</sub>        | 3 |
|                  |                  |                       |    |   |                        |   |
|                  |                  |                       |    |   |                        |   |

| P <sub>0,0</sub> | P <sub>0,1</sub> | P <sub>0,2</sub> | Р <sub>0,В</sub> |
|------------------|------------------|------------------|------------------|
| P <sub>1,0</sub> | P <sub>1,1</sub> | P <sub>1,2</sub> | <b>P</b> ,3      |
| P <sub>2,0</sub> | P <sub>2,1</sub> | P <sub>2,2</sub> | P <sub>2,3</sub> |
| P <sub>3,0</sub> | P <sub>3,1</sub> | P <sub>3,2</sub> | P <sub>3,3</sub> |

# A first, simple computational kernel for 2-D matrix multiplication

```
__global__ void MatrixMulKernel(float *M_d, float *N_d, float *P_d, int Width)
{
 // Calculate the row index of the P d element and M d
  int Row = blockIdx.y * blockDim.y + threadIdx.y;
 // Calculate the column index of P d and N d
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  if ((Row < Width) && (Col < Width)) {</pre>
   float Pvalue = 0.0;
   // Each thread computes one element of the block sub-matrix
   for (int k = 0; k < Width; k++) {
     Pvalue += M_d[Row * Width + k] * N_d[k * Width + Col];
   P d[Row * Width + Col] = Pvalue;
```

## Shared memory



#### Purpose

Understand how to use the memory hierarchy in CUDA

- Registers, shared memory, global memory
- Algorithms using "tiles"
- How to use barriers

# Memory hierarchy from the programmers point of view

Each thread:

- Reads/Writes to registers that belong to it (per thread registers) (~1 cycle)
- Reads/Writes to shared memory that belongs to a block (per block shared memory) (~30-50 cycles)
- Reads/Writes to global memory that belongs to the grid (per grid global memory) (~80-2750 cycles)
- Read from constant memory that belongs to the grid (per grid constant memory) (~1-7 cycles if data is in cache)



### Shared memory in CUDA

Special type of memory

- Its use must be explicitly specified in the source code
- Resides in each Streaming Multiprocessor (SM)
- Very fast access, compared to global memory
- Accessed in the same way that global memory is accessed

#### Data type specifiers in CUDA

| Variable declaration |           |                             | Memory   | Scope  | Lifetime    |
|----------------------|-----------|-----------------------------|----------|--------|-------------|
| int LocalVar;        |           | register                    | thread   | thread |             |
| device               | _shared   | <pre>int SharedVar;</pre>   | shared   | block  | block       |
| device               |           | int GlobalVar;              | global   | grid   | application |
| device               | _constant | <pre>int ConstantVar;</pre> | constant | grid   | application |

\_\_\_\_device\_\_\_ is optional when \_\_\_\_shared\_\_\_ or \_\_\_constant\_\_\_ is used

Automatic variables reside in registers

• Except arrays, which reside in global memory

# Declaring variables that reside in shared memory

\_global\_\_ void MatrixMulKernel(float \*M\_d, float \*N\_d, float \*P\_d, int Width)
{
 \_\_shared\_\_ float M\_ds[TILE\_WIDTH][TILE\_WIDTH];
 \_\_shared\_\_ float N\_ds[TILE\_WIDTH][TILE\_WIDTH];

#### Programming strategy

DRAM memory modules are used to implement global memory– slow access

Better strategy: Divide input data into smaller parts (tiles) in order to exploit shared memory

- Decide how input data will be divided into tiles, so that each tile fits into shared memory
- Handle each tile with a block of threads:
  - Copy tile from global to shared memory
    - Use multiple threads to exploit parallelism at the memory level
  - Perform computations on the tile that resides in shared memory
    - Each thread can (and must!) use multiple times each copied data element
  - Copy results from the shared to the global memory

#### When not to use shared memory

#### Assume that:

- N data elements must be read from memory
- Each element will be used only 1 time
- Access time per element
  - If in global memory: t<sub>g</sub>
  - $\,\circ\,\,$  If in shared memory:  $t_s\,{<<}\,t_g$

#### Total time to access elements

- From global memory
  - $T_g = N \cdot t_g$
- From shared memory
  - $\circ \quad \mathsf{T}_{\mathsf{s}} = \mathsf{N} \cdot \mathsf{t}_{\mathsf{g}} + \mathsf{N} \cdot \mathsf{t}_{\mathsf{s}} > \mathsf{T}_{\mathsf{g}}$

#### More time is required!

#### When to use shared memory

Assume now that:

- N elements must be read from memory
- Each element will be used K > 1 times

Total time to access elements

- From global memory
  - $T_g = N \cdot K \cdot t_g$
- From shared memory
  - $T_s = N \cdot t_g + N \cdot (K-1) \cdot t_s$

But because  $t_s \ll t_g$  there is a huge gain!

# 2-D matrix multiplication using shared memory



# Performance issues on the Fermi architecture

All threads access global memory to read input matrix elements

- 2 accesses (8 bytes) for one multiplication and one addition of single precision numbers
- 4 Bytes of memory transfers / FLOP (Floating-Point Operation)
- Max. performance of the architecture is 1000 GFLOPS
  - 4\*1000 = 4000 GB/sec memory bandwidth required to achieve max. performance
- But Fermi provides 150 GB/s
  - 150 / 4 = 37.5 GFLOPS can be achieved
  - Real code achieves only about 25 GFLOPS

Accesses to global memory have to be reduced drastically to get close to the max. of 1000 GFLOPS



#### Overview of the technique

Repeat

- Find a tile in global memory that is accessed by multiple threads
- Copy the tile from global to shared memory
- Make the threads access the data they need from shared memory

while more tiles are available

## Main idea: Use shared memory to reuse data

Every input element is read from WIDTH threads

Copy each element into shared memory

- Multiple threads will read it from there
- Reduces required bandwidth to global memory
  - Tiled algorithms





#### Tiled multiplication

Divide the execution of the computational kernel into phases

 Data access during each phase will be focused on a single tile of each of M and N

0

1

2

by

- bx, by: Block index on x and y dimensions
- tx, ty: Thread index on x and y dimensions

#### First step: Copy tile to shared memory

All threads of the block will participate

• Each thread will copy a single element from M\_d and a single element from N\_d

Try to make memory accesses to the global memory coalesced within a warp during the copy

- Better bandwidth
- More in a while

#### Second step: Perform partial multiplication

Using the elements copied into shared memory perform as many operations as possible towards the final result

 In the case of matrix multiplication, to calculate partially each element within the current block of threads

#### Third step: Write back final result

After a block of threads passes over all tiles, the final result is copied to global memory

#### Processing block (0,0)



P<sub>0,3</sub>

 $P_{1,3}$ 

P<sub>2,3</sub>

, P<sub>3,3</sub> ,

#### Processing block (0,0)

| M <sub>0,0</sub> | M <sub>0,1</sub> | M <sub>0,2</sub> | M <sub>0,3</sub> |
|------------------|------------------|------------------|------------------|
| M <sub>1,0</sub> | M <sub>1,1</sub> | M <sub>1,2</sub> | M <sub>1,3</sub> |
| M <sub>2,0</sub> | M <sub>2,1</sub> | M <sub>2,2</sub> | M <sub>2,3</sub> |
| M <sub>3,0</sub> | M <sub>3,1</sub> | M <sub>3,2</sub> | M <sub>3,3</sub> |

|                                                       | N <sub>0</sub><br>N <sub>1</sub> |                  | ,1               |                  |
|-------------------------------------------------------|----------------------------------|------------------|------------------|------------------|
| Shared memory                                         | Π                                |                  |                  |                  |
| <b>№1</b> <sub>0.0</sub> <b>№1</b> <sub>0.1</sub>     | <b>₽</b> <sub>0.0</sub>          | P <sub>0,1</sub> | P <sub>0,2</sub> | P <sub>0,3</sub> |
| <b>1</b> v1 <sub>1.0</sub> <b>1</b> v1 <sub>1.1</sub> | <b>₽</b> <sub>1.0</sub>          | P <sub>1,1</sub> | P <sub>1,2</sub> | P <sub>1,3</sub> |
|                                                       | P <sub>2,0</sub>                 | P <sub>2,1</sub> | P <sub>2,2</sub> | P <sub>2,3</sub> |
|                                                       | P <sub>3,0</sub>                 | P <sub>3,1</sub> | P <sub>3,2</sub> | P <sub>3,3</sub> |

Shared memory

#### EuroCC@Greece

## Processing block (0,0)

| M <sub>0,0</sub> | M <sub>0,1</sub> | M <sub>0,2</sub> | M <sub>0,3</sub> |
|------------------|------------------|------------------|------------------|
| M <sub>1,0</sub> | M <sub>1,1</sub> | M <sub>1,2</sub> | M <sub>1,3</sub> |
| M <sub>2,0</sub> | M <sub>2,1</sub> | M <sub>2,2</sub> | M <sub>2,3</sub> |
| M <sub>3,0</sub> | M <sub>3,1</sub> | M <sub>3,2</sub> | M <sub>3,3</sub> |



# Processing block (0,0)

| N <sub>0,0</sub> | N <sub>0,1</sub> | N <sub>0,2</sub> | N <sub>0,3</sub> | Shared memory                     |
|------------------|------------------|------------------|------------------|-----------------------------------|
| N <sub>1,0</sub> | N <sub>1,1</sub> | N <sub>1,2</sub> | N <sub>1,3</sub> |                                   |
| N <sub>2,0</sub> | N <sub>2,1</sub> | N <sub>2,2</sub> | N <sub>2,3</sub> | N <sub>0,0</sub> N <sub>0,1</sub> |
| N <sub>3,0</sub> | N <sub>3,1</sub> | N <sub>3,2</sub> | N <sub>3,3</sub> | N <sub>1,0</sub> N <sub>1,1</sub> |

Shared memory

| M <sub>0,0</sub> | M <sub>0.1</sub> | M <sub>0.2</sub> | М <sub>0.3</sub> |       | M <sub>0.1</sub> |
|------------------|------------------|------------------|------------------|-------|------------------|
| M <sub>1,0</sub> |                  |                  |                  | M_1,0 |                  |
| M <sub>2,0</sub> |                  |                  |                  |       |                  |
| M <sub>3,0</sub> |                  |                  |                  |       |                  |

| P <sub>0,0</sub> | P <sub>0,1</sub> | P <sub>0,2</sub> | P <sub>0,3</sub> |
|------------------|------------------|------------------|------------------|
| P <sub>1,0</sub> | P <sub>1,1</sub> | P <sub>1,2</sub> | P <sub>1,3</sub> |
| P <sub>2,0</sub> | P <sub>2,1</sub> | P <sub>2,2</sub> | P <sub>2,3</sub> |
| P <sub>3,0</sub> | P <sub>3,1</sub> | P <sub>3,2</sub> | P <sub>3,3</sub> |

## Processing block (0,0)

| N <sub>0,0</sub> | N <sub>0,1</sub> | N <sub>0,2</sub> | N <sub>0,3</sub> |
|------------------|------------------|------------------|------------------|
| N <sub>1,0</sub> | N <sub>1,1</sub> | N <sub>1,2</sub> | N <sub>1,3</sub> |
| N <sub>2,0</sub> | N <sub>2,1</sub> | N <sub>2,2</sub> | N <sub>2,3</sub> |
| N <sub>3,0</sub> | N <sub>3,1</sub> | N <sub>3,2</sub> | N <sub>3,3</sub> |

| M <sub>0,0</sub> | M <sub>0,1</sub> | M <sub>0,2</sub> | M <sub>0,3</sub> |
|------------------|------------------|------------------|------------------|
| M <sub>1,0</sub> | M <sub>1,1</sub> | M <sub>1,2</sub> | M <sub>1,3</sub> |
| M <sub>2,0</sub> | M <sub>2,1</sub> | M <sub>2,2</sub> | M <sub>2,3</sub> |
| M <sub>3,0</sub> | M <sub>3,1</sub> | M <sub>3,2</sub> | M <sub>3,3</sub> |

Shared memory N<sub>0.0</sub> '  $N_{01}$ Shared memory P<sub>0,3</sub> M<sub>0,0</sub> M<sub>0,</sub> P<sub>0,2</sub> P<sub>1,2</sub>  $P_{1,3}$ D P. P<sub>2,1</sub> P<sub>2,2</sub> P<sub>2,0</sub> P<sub>2,3</sub> ( P<sub>3,2</sub> ) <sub>1</sub> P<sub>3,3</sub> • P<sub>3,1</sub> P P<sub>3,0</sub>

## Required synchronization construct: Barrier

CUDA API function call

\_\_\_\_syncthreads()

All threads of a block must call \_\_\_\_\_syncthreads() before they can continue execution

Required in algorithms that use tiles

- Ensures that all elements of the tile have been loaded into shared memory
- Ensures that all elements of the tile have been used in computations



#### How to access tile 0

0

1

2

ty

by

Accessing tile 0 with 2-D addressing:

- M[Row][tx]
  - Row = by \* TILE\_WIDTH + ty
- N[ty][Col]
  - Col = bx \* TILE\_WIDTH + tx



#### How to access tile 1

0

1

2

ty

by

Accessing tile 1 with 2D addressing:

- M[Row][1 \* TILE\_WIDTH + tx]
  - Row = by \* TILE\_WIDTH + ty
- N[1 \* TILE\_WIDTH + ty][Col]
  - Col = bx \* TILE\_WIDTH + tx



#### EuroCC@Greece

bx

# 2-D matrix multiplication with tiles

```
__global__ void MatrixMulKernel(float *M_d, float *N_d, float* P_d, int Width)
{
```

```
__shared__ float M_ds[TILE_WIDTH][TILE_WIDTH];
__shared__ float N_ds[TILE_WIDTH][TILE_WIDTH];
```

```
int bx = blockIdx.x;
int by = blockIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;
```

```
// Identify the row and column of the Pd element to work on
int Row = by * TILE_WIDTH + ty;
int Col = bx * TILE_WIDTH + tx;
float Pvalue = 0.0;
```

# 2-D matrix multiplication with tiles

```
// Loop over the M_ds and N_ds tiles required to compute the P_ds element
for (int m = 0; m < Width / TILE_WIDTH; m++) {
    // Colaborative loading of M_ds and N_ds tiles into shared memory
    M_ds[ty][tx] = M_d[Row * Width + m * TILE_WIDTH + tx];
    N_ds[ty][tx] = N_d[Col + (m * TILE_WIDTH + ty) * Width];
    ___syncthreads();</pre>
```

```
for (int k = 0; k < TILE_WIDTH; k++) {
    Pvalue += M_ds[ty][k] * N_ds[k][tx];
}
synchthreads();</pre>
```

```
______synchrineaus();
```

```
}
P_d[Row * Width + Col] = Pvalue;
}
```

# Size of tiles

Each block of threads should have as many threads as possible

- TILE\_WIDTH = 16 gives 16\*16 = 256 threads per block
- TILE\_WIDTH = 32 gives 32\*32 = 1024 threads per block

For 16, each block performs 2\*256 = 512 transfers of float values from global memory and 256 \* (2\*16) = 8192 operations

• 16 operations/transfer

For 32, each block performs 2\*1024 = 2048 transfers of float values from global memory and 1024 \* (2\*32) = 65536 operations

• 32 operations/transfer

# Shared memory and threads

Each SM in Fermi has 16KB or 48KB shared memory

• The size depends on the exact model of GPU

For TILE\_WIDTH = 16, each thread uses 2\*16\*16\*4B = 2KB shared memory

 We can have up to 8 blocks of threads active (maximum allowed by Fermi)

For TILE\_WIDTH = 32, each thread uses 2\*32\*32\*4B= 8KB shared memory

• We can have up to 2 or 6 blocks of threads active

# What have we gained?

Using TILE\_WIDTH = 16 we manage to reduce the number of accesses to the global memory 16 times

- The available bandwidth of 150GB/s can now support (150/4)\*16 = 600 GFLOPS!
- Compare that to the 37.5 GFLOPS of the initial computational kernel

## More results – 8800 GT GPU

#### Matrix multiplication

- Single precision
- Size of 4096x4096



# Implications

Algorithm and source code has to be modified to take advantage of shared memory

Complicates writing and understanding of source code

Need to know the memory hierarchy of the GPU to achieve high performance



## Global memory bandwidth

What we need



What we get



## A small 64x4 DRAM bank

A word is 4 bits

Each row has  $4 (= 2^2)$  words

The bank has 16 (= 2<sup>4</sup>) rows

Memory addresses are 6 bits

- First 4 bits identify the row
- Last 2 bits identify word within row



## Accessing memory



## Accessing memory











## More results – 8800 GT GPU

#### Matrix multiplication

- Single precision
- Size of 4096x4096



# Implications

Algorithm and source code has to be modified to take advantage of coalesced memory accesses

Might complicate understanding of source code

Might complicate understanding of rationale behind the decision of writing the source code in that specific way

• Need to know the design of modern DRAM used for the GPU

# Bank conflicts in shared memory

# Shared memory

Shared memory is divided into banks

- 16 for Compute Capability 1.x
- 32 for Compute Capability  $\geq$  2.0

Continuous words of 32-bits are stored in continuous banks

Different banks can be accessed simultaneously

If multiple memory addresses being accessed belong to the same bank

- Bank conflict
- Access is serialized
- Bank conflicts exist only for accesses within the same bank
  - Or a half-warp for Compute Capability 1.x

#### For Compute Capability $\geq 2.0$

- There is no bank conflict if the memory address accessed is the same for a number of threads of the warp
  - Does not hold for Compute Capability 1.x

# Example

#### Left

- Linear addressing with a step of 1 for 32-bit words
- No bank conflict

#### Center

- Linear addressing with a step of 2 for 32-bit words
- 2-way bank conflict

#### Right

- Linear addressing with a step of 3 for 32-bit words
- No bank conflict



28

29

30

# Example

#### Left

- Random permutation
- No bank conflict

#### Center

- Threads 3, 4, 6, 7, 9 access the same 32-bit word in bank 5
- No bank conflict

#### Right

- All threads access the same 32-bit word in each bank
- No bank conflict



## More results – 8800 GT GPU

#### Matrix multiplication

- Single precision
- Size of 4096x4096



# Flow control



## Flow control

'if...else' instruction

- Threads are executed in warps
- Within each warp, the hardware cannot execute at the same time instructions of the 'if' and of the 'else' block
  - If there is no else', neither instructions that follow the 'if'

```
__global__ void function()
{
    ...
    if (condition) {
        ...
    } else {
        ...
    }
}
```



# How the hardware handles the situation

The hardware serializes execution of the different execution paths

Recommendations

- All threads within a warp should execute the same instructions
  - No divergence occurs if different execution paths are followed among different warps
- If divergence cannot be completely avoided
  - Try multiple continuous threads within a warp to execute the same instructions

# Summing all elements of a vector

A first implementation

- Add together the closest neighboring elements that contain a partial sum
- Continue until the final result is calculated

```
__shared__ float partialSum[];
int t = threadIdx.x;
```



#### Observations

During each repetition

- Two execution paths per warp
  - Threads that perform the addition and threads that don't

At most half of the threads in a warp contribute towards calculating the result

- All odd numbered threads don't contribute already from the first iteration!
- After the 5<sup>th</sup> repetition
  - Whole warps don't contribute
    - Although there is no branch diversion, there is poor exploitation of computational resources

#### A better implementation

```
__shared__ float partialSum[];
unsigned int t = threadIdx.x;
```

```
for (int stride = blockDim.x; stride > 1; stride >> 1) {
    _____syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}</pre>
```





#### Example for blocks of 512 threads

| Repetition | Threads that calculate | Warps |                 |
|------------|------------------------|-------|-----------------|
| 1          | 256                    | 16    |                 |
| 2          | 128                    | 8     |                 |
| 3          | 64                     | 4     | Threads > Warp  |
| 4          | 32                     | 2     |                 |
| 5          | 16                     | 1     |                 |
| 6          | 8                      | 1     | Threads < Warp  |
| 7          | 4                      | 1     |                 |
| 8          | 2                      | 1     |                 |
| 9          | 1                      | 1     |                 |
|            |                        |       | Warp divergence |

### Optimizing Parallel Reduction in CUDA

**Optimizing Parallel Reduction in CUDA** 

Goes over a total of 7(!) refinements for such a simple problem

• Refinements are tied to the architecture of the GPU

But leads to a 30x total speedup in execution time

# Synchronous and asynchronous execution



## Blocking and non-blocking functions

#### Synchronous vs. Asynchronous execution

- Synchronous:
  - Calling a function is blocking
  - Execution is performed serially
- Asynchronous:
  - Calling a function is non-blocking
  - The control of execution immediately returns to the host

#### Advantages of asynchronous execution

- Overlapping of processing and data transfer on different devices
  - Not only GPU and CPU
  - Accessing hard drive, transfer of data over the network, etc.

### Asynchronous execution in CUDA

Most functions of the CUDA API that are called from the host are blocking

Exceptions

- Calling computational kernels
- cudaMemcpy() within the same device (cudaMemcpyDeviceToDevice())
- cudaMemcpy() from the host to the device for up to 64kB of data
- Asynchronous copying of data

#### Asynchronous execution



| Time             |                  |         |         |                  |  |
|------------------|------------------|---------|---------|------------------|--|
| cudaMemcpy (H2D) | cudaMemcpy (H2D) | kernelA | kernelB | cudaMemcpy (D2H) |  |
|                  |                  |         |         |                  |  |

#### Asynchronous execution



# Simultaneous execution through pipelining

Latest GPU generations include subsystems for asynchronous execution of calculations and copying of data

- Allows data transfers while processing
- GPU architectures Maxwell and Kepler even have double subsystems for copying data
  - PCIe upstream (Device to Host D2H)
  - PCIe downstream (Host to Device H2D)
- Allows simultaneous execution of:
  - Copying part 'n-1' of results from the device to the host
  - Execution of computational kernel for part 'n' of the data
  - Copying part 'n+1' of data from the host to the device for next invocation of the computational kernel

Best strategy to achieve high performance is to overlap data transfers with computations

All GPUs with Compute Capability  $\geq$  2.0 have the ability to execute simultaneously multiple computational kernels

• Especially useful for problems with a relatively small problem size

#### **CUDA** Streams

Put on a queue the tasks to be performed on the GPU

- All calls to computational kernels are asynchronous
  - Control returns immediately to host for execution of next instructions
  - Can be another computational kernel invocation
- The GPU extracts task from streams when it has resources to execute them

Tasks within the same stream are executed in order(FIFO, no overlap)

Tasks among different streams don't have a global order of execution and can overlap

```
//create a handle for the stream
cudaStream_t stream;
```

```
//create the stream
cudaStreamCreate(&stream);
```

```
//do some work in the stream...
```

```
//destroy the stream (blocks host until stream is complete)
cudaStreamDestroy(&stream);
```

#### Assigning tasks to streams

When calling a computational kernel, a 4<sup>th</sup> parameter defines the stream to be used

The default stream requires special attention

• It is synchronous with all other streams

//execute kernel on device in specified stream
fooKernel<<<blocks, threads, 0, stream>>>();

fooKernel<<<blocks, threads, 0>>>();CPUbarKernel<<<blocks, threads, 0>>>();default streamfooKernelbarKernelbarKernelbarKernel



#### Levels of overlap

