CUDA Matrix Multiplication: Shared Memory
This code demonstrates matrix multiplication in CUDA, leveraging shared memory for optimization. It includes two examples: a tiled kernel that stages data in shared memory, and a host-side matrix-vector multiplication built on the Thrust library.
CUDA Kernel with Shared Memory
The following CUDA kernel performs tiled matrix multiplication, staging sub-blocks of the inputs in shared memory to reduce global-memory traffic:
```cpp
#define TILE_SIZE 16  // Tile width; must match the block dimensions used at launch

__global__ void matMulShared(int *A, int *B, int *C, int rowsA, int colsA, int colsB) {
    __shared__ int tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ int tile_B[TILE_SIZE][TILE_SIZE];
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    int temp = 0;
    // Walk across the tiles of A and B that contribute to this output element
    for (int i = 0; i < (colsA + TILE_SIZE - 1) / TILE_SIZE; ++i) {
        // Load one tile of A into shared memory, zero-padding out-of-bounds entries
        if (row < rowsA && (i * TILE_SIZE + threadIdx.x) < colsA)
            tile_A[threadIdx.y][threadIdx.x] = A[row * colsA + i * TILE_SIZE + threadIdx.x];
        else
            tile_A[threadIdx.y][threadIdx.x] = 0;
        // Load the corresponding tile of B
        if (col < colsB && (i * TILE_SIZE + threadIdx.y) < colsA)
            tile_B[threadIdx.y][threadIdx.x] = B[(i * TILE_SIZE + threadIdx.y) * colsB + col];
        else
            tile_B[threadIdx.y][threadIdx.x] = 0;
        __syncthreads();  // Wait until both tiles are fully loaded
        for (int j = 0; j < TILE_SIZE; ++j)
            temp += tile_A[threadIdx.y][j] * tile_B[j][threadIdx.x];
        __syncthreads();  // Ensure all threads are done before the tiles are overwritten
    }
    if (row < rowsA && col < colsB)
        C[row * colsB + col] = temp;
}
```
Thrust Implementation
This example uses the Thrust library to perform matrix-vector multiplication:
```cpp
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <iostream>

int main() {
    const int rows = 100;
    const int cols = 100;

    // Initialize the matrix and vector
    thrust::device_vector<float> d_matrix(rows * cols, 1.0f); // Matrix filled with 1.0
    thrust::device_vector<float> d_vector(cols, 1.0f);        // Vector filled with 1.0
    thrust::device_vector<float> d_result(rows, 0.0f);        // Result vector initialized to 0

    // Matrix-vector multiplication: each result element is the dot product
    // of one matrix row with the vector
    for (int i = 0; i < rows; ++i) {
        d_result[i] = thrust::inner_product(d_matrix.begin() + i * cols,
                                            d_matrix.begin() + (i + 1) * cols,
                                            d_vector.begin(),
                                            0.0f);
    }

    // Display the first few elements for verification
    std::cout << "Result (first 10 elements): ";
    for (int i = 0; i < 10; ++i) {
        std::cout << d_result[i] << " ";
    }
    std::cout << std::endl;
    return 0;
}
```

Note that calling `thrust::inner_product` once per row launches a separate reduction for each row, which is simple but not the fastest approach for large matrices.