CUDA
Kernel + grid + explicit memory & sync
What you change
- Choose thread mapping (iteration → thread)
- Rewrite loops as GPU kernel(s)
- Manage GPU memory transfers
- Launch grid + synchronize
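The four steps above can be sketched end to end as follows. This is a minimal illustration, not the slide's actual code: the SAXPY loop, the names `saxpy_kernel` and `saxpy_on_gpu`, and the block size of 256 are all assumptions chosen for the example. Only CUDA runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaGetLastError`, `cudaDeviceSynchronize`, `cudaFree`) are used.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel replacing the CPU loop
//   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
// Thread mapping: one loop iteration per thread.
__global__ void saxpy_kernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard the last, partly filled block
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical host wrapper: returns true if every CUDA step succeeded.
bool saxpy_on_gpu(float a, const std::vector<float>& x, std::vector<float>& y) {
    if (x.size() != y.size()) return false;
    const int n = static_cast<int>(x.size());
    if (n == 0) return true;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // Manage GPU memory: allocate device buffers and copy the inputs over.
    float* d_x = nullptr;
    float* d_y = nullptr;
    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_y, bytes) != cudaSuccess) { cudaFree(d_x); return false; }
    cudaError_t err = cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        // Launch grid: enough blocks of 256 threads to cover all n iterations.
        const int threads = 256;  // assumed block size, not tuned
        const int blocks = (n + threads - 1) / threads;
        saxpy_kernel<<<blocks, threads>>>(n, a, d_x, d_y);
        err = cudaGetLastError();                               // launch errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();  // wait for the kernel
    }

    // Copy the result back to the host, then release device memory.
    if (err == cudaSuccess)
        err = cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return err == cudaSuccess;
}
```

The explicit `cudaDeviceSynchronize()` is kept here to mirror the "synchronize" step; a blocking `cudaMemcpy` on the default stream would also wait for the kernel to finish.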
CUDA kernel artifact
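__global__ void k(...) {
    // Global thread index: block offset plus position within the block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = tid;        // thread mapping: one loop iteration per thread
    if (i < N) { … }    // guard: the last block may extend past N
}
Execution model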
Grid → Blocks → Threads → GPU
Tuning: occupancy, shared memory, coalescing
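The three tuning terms can be illustrated with a per-block sum, sketched below under stated assumptions. The names `block_sum`, `sum_on_gpu`, `active_blocks_per_sm` and the block size `kBlock = 256` are hypothetical. The kernel reads global memory with consecutive threads touching consecutive addresses (coalescing), reduces inside a `__shared__` tile, and the helper asks the runtime how many such blocks can be resident per multiprocessor (one input to occupancy).

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 256;  // assumed block size; must be a power of two here

// Each block sums kBlock consecutive elements into partial[blockIdx.x].
__global__ void block_sum(const float* in, float* partial, int n) {
    __shared__ float tile[kBlock];                    // fast on-chip storage per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;       // coalesced: thread t reads in[base + t]
    __syncthreads();                                  // tile fully written before reading
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction in shared memory
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];
}

// Hypothetical helper: resident blocks per SM for this kernel, or 0 on error.
int active_blocks_per_sm() {
    int blocks = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, block_sum, kBlock, 0)
            != cudaSuccess) {
        return 0;
    }
    return blocks;
}

// Hypothetical host wrapper: writes the total to *out, returns true on success.
bool sum_on_gpu(const std::vector<float>& x, float* out) {
    const int n = static_cast<int>(x.size());
    *out = 0.0f;
    if (n == 0) return true;
    const int blocks = (n + kBlock - 1) / kBlock;

    float* d_in = nullptr;
    float* d_partial = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_partial, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaError_t err = cudaMemcpy(d_in, x.data(), n * sizeof(float),
                                 cudaMemcpyHostToDevice);
    std::vector<float> partial(blocks, 0.0f);
    if (err == cudaSuccess) {
        block_sum<<<blocks, kBlock>>>(d_in, d_partial, n);
        err = cudaGetLastError();
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    if (err != cudaSuccess) return false;

    for (float p : partial) *out += p;  // finish the few per-block sums on the CPU
    return true;
}
```

Changing `kBlock` trades threads per block against shared memory per block, which is what moves the value reported by `active_blocks_per_sm()`; the best setting depends on the device and is not claimed here.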