CUDA Matrix Multiplication: Shared Memory
This code demonstrates matrix multiplication in CUDA, leveraging shared memory for optimization. It includes two examples: a tiled kernel that stages data in shared memory, and a host-side matrix-vector multiplication built on the Thrust library.
CUDA Kernel with Shared Memory
The following CUDA kernel performs tiled matrix multiplication, staging sub-blocks of the inputs in shared memory to reduce global-memory traffic:
```cpp
#define TILE_SIZE 16  // Tile width; must match the block dimensions used at launch

__global__ void matMulShared(int *A, int *B, int *C, int rowsA, int colsA, int colsB) {
    __shared__ int tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ int tile_B[TILE_SIZE][TILE_SIZE];
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    int temp = 0;
    // Walk across the tiles of A and B that contribute to this output element
    for (int i = 0; i < (colsA + TILE_SIZE - 1) / TILE_SIZE; ++i) {
        // Load one tile of A into shared memory, zero-padding out-of-bounds entries
        if (row < rowsA && (i * TILE_SIZE + threadIdx.x) < colsA)
            tile_A[threadIdx.y][threadIdx.x] = A[row * colsA + i * TILE_SIZE + threadIdx.x];
        else
            tile_A[threadIdx.y][threadIdx.x] = 0;
        // Load the corresponding tile of B
        if (col < colsB && (i * TILE_SIZE + threadIdx.y) < colsA)
            tile_B[threadIdx.y][threadIdx.x] = B[(i * TILE_SIZE + threadIdx.y) * colsB + col];
        else
            tile_B[threadIdx.y][threadIdx.x] = 0;
        __syncthreads();  // Wait until both tiles are fully loaded
        for (int j = 0; j < TILE_SIZE; ++j)
            temp += tile_A[threadIdx.y][j] * tile_B[j][threadIdx.x];
        __syncthreads();  // Ensure all threads are done before the tiles are overwritten
    }
    if (row < rowsA && col < colsB)
        C[row * colsB + col] = temp;
}
```
Thrust Implementation
This example uses the Thrust library to perform matrix-vector multiplication:
```cpp
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <iostream>

int main() {
    const int rows = 100;
    const int cols = 100;

    // Initialize the matrix and vector
    thrust::device_vector<float> d_matrix(rows * cols, 1.0f); // Matrix filled with 1.0
    thrust::device_vector<float> d_vector(cols, 1.0f);        // Vector filled with 1.0
    thrust::device_vector<float> d_result(rows, 0.0f);        // Result vector initialized to 0

    // Matrix-vector multiplication: each result element is the dot product
    // of one matrix row with the vector
    for (int i = 0; i < rows; ++i) {
        d_result[i] = thrust::inner_product(d_matrix.begin() + i * cols,
                                            d_matrix.begin() + (i + 1) * cols,
                                            d_vector.begin(),
                                            0.0f);
    }

    // Display the first few elements for verification
    std::cout << "Result (first 10 elements): ";
    for (int i = 0; i < 10; ++i) {
        std::cout << d_result[i] << " ";
    }
    std::cout << std::endl;
    return 0;
}
```

Note that calling `thrust::inner_product` once per row launches a separate reduction for each row, which is simple but not the fastest approach for large matrices.