In this lab you can work from the local computer in Cygwin, or do all
the work from any other computer on campus by logging in remotely
using ssh.
You will explore parallel implementation using MPI and/or CUDA on GPUs.
This lab spans three weeks. Solve two projects. For those less
advanced in computing, I can count the following pair as one project:
i) a (theoretical) discussion/implementation of the handshake
algorithm, and ii) a summary/overview of all the CUDA demos that you
installed and viewed.
Read background information
There will be a demonstration of both technologies in class. Please wait for it.
During the lab, I encourage all to log in to the Linux machines in the
back with NVIDIA CUDA and to follow the instructions to get all the demos.
Project 1:
Modify your mypi.f program and submission scripts to test its behavior
as the number of subintervals myn changes, from the value given up to a
very large number. Collect the timings.
Test your code on 2, 4, and 8 processors. Do you get the same results?
Do you expect to get the same results? How does the error behave?
Comment on
- accuracy of pi computed
- scalability of your code
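As a serial reference for the accuracy question, here is a sketch in Python (a check only; your lab code is mypi.f) of the midpoint-rule computation of pi that mypi.f presumably performs, integrating 4/(1+x^2) over [0,1]. The choice of integrand and rule is an assumption; adjust to match your mypi.f. The error should shrink roughly like 1/n^2:

```python
import math

def pi_midpoint(n):
    """Approximate pi = integral of 4/(1+x^2) over [0,1] by the
    composite midpoint rule with n subintervals."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h              # midpoint of subinterval i
        total += 4.0 / (1.0 + x * x)
    return h * total

for n in (10, 100, 1000):
    approx = pi_midpoint(n)
    print(n, approx, abs(approx - math.pi))
```

Each tenfold increase in n should cut the error by roughly a factor of 100, which is the behavior worth commenting on above.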
To get the timings for your program (and to test scalability), use the
MPI_Wtime function.
Optional: design your own computation of pi using the algorithm from
Lab1. Compare it with this one.
Project 2: Overlapping domain decomposition.
You may want to follow this introductory example on send/receive
before you go on.
- Get handshake.f
- Look at the code and study it. As is, it will produce unpleasant results.
- Rig a submission script for it, compile the code and run it on the cluster.
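Before moving on, it may help to see the exchange pattern in plain serial form. The Python sketch below simulates the data movement that (I assume) handshake.f performs with blocking MPI_SEND/MPI_RECV between chain neighbors; in the real MPI code the calls must be ordered (for example, even ranks send first while odd ranks receive first) so every send meets a posted receive and the code cannot deadlock:

```python
def exchange(values):
    """Simulate each rank r swapping its value with its left and right
    neighbors in a 1-D chain of len(values) ranks.

    Returns (from_left, from_right): from_left[r] is what rank r got
    from rank r-1, from_right[r] what it got from rank r+1; boundary
    ranks have no neighbor on that side and get None.  In MPI this is
    a pair of send/recv calls per rank, ordered even/odd to avoid
    deadlock with blocking sends."""
    p = len(values)
    from_left = [None] * p
    from_right = [None] * p
    for r in range(p):
        if r + 1 < p:
            from_left[r + 1] = values[r]   # "send" my value rightward
        if r - 1 >= 0:
            from_right[r - 1] = values[r]  # "send" my value leftward
    return from_left, from_right
```

The even/odd ordering detail is exactly the "unpleasant result" handshake.f is meant to teach: with the wrong ordering, blocking sends can wait on each other forever.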
Now that you know how to do a global reduce operation (as in
mypi.f) and send-receives (as in handshake.f), you are ready to create
an algorithm for myjacobi_parallel.f, which should be an extension of
the algorithm discussed in class.
Make sure first that your single-processor (serial)
version runs. Your parallel version should run the same way, with
exactly the same results. Here is a template: myjacobi_parallel.m
Implement the algorithm incrementally (in Fortran) and make
sure each step is correct before you proceed to the next:
- put the handshake-like message passing in first; set maxiter
to 3 or 5 (so the code finishes quickly), and test the code.
- put in Aleft, Aright, uleft, uright in both the matrix-vector
product and the Jacobi loop
- put in the global reduce to compute global norm of the residual
- run it with an appropriate value of maxiter
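For checking the parallel runs, a serial Jacobi reference can be sketched as below. I am assuming the model problem is the 1-D one, -u'' = f on (0,1) with u(0)=u(1)=0 (the same equation Project 3 uses), discretized by standard 3-point finite differences; your myjacobi code may differ:

```python
import math

def jacobi_1d(f, n, maxiter, tol=1e-12):
    """Jacobi iteration for -u'' = f on (0,1), u(0)=u(1)=0, using
    3-point differences on n interior points.  Serial reference: the
    parallel version splits the index range over ranks, exchanges
    uleft/uright ghost values each sweep (handshake pattern), and
    computes the residual norm with a global reduce."""
    h = 1.0 / (n + 1)
    b = [f((i + 1) * h) * h * h for i in range(n)]   # right-hand side
    u = [0.0] * n
    for _ in range(maxiter):
        unew = [0.0] * n
        for i in range(n):
            ul = u[i - 1] if i > 0 else 0.0          # uleft in parallel
            ur = u[i + 1] if i < n - 1 else 0.0      # uright in parallel
            unew[i] = 0.5 * (ul + ur + b[i])
        # residual norm ||b - A u||: the globally reduced quantity
        res = 0.0
        for i in range(n):
            ul = unew[i - 1] if i > 0 else 0.0
            ur = unew[i + 1] if i < n - 1 else 0.0
            r = b[i] - (2.0 * unew[i] - ul - ur)
            res += r * r
        u = unew
        if math.sqrt(res) < tol:
            break
    return u
```

With f(x) = pi^2 sin(pi x) the exact solution is u(x) = sin(pi x), so the computed interior values should match sin at the grid points up to the O(h^2) discretization error; the parallel code should reproduce these values exactly.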
Project 3: Newton-CG using the CUDA implementation of BLAS.
CUBLAS documentation
I am assuming that you have settled in on the machines with Tesla C1060
cards and have run some demos. Now follow the simpleCUBLAS example to morph it
eventually into your Newton-CG code.
- copy the simpleCUBLAS project to your own folder, remove the other projects (keep simpleCUDA for comparison), make the executable, and run it.
- modify it
- print actual errors between host and device computations
- try some other BLAS functions: sdot, saxpy, snrm2
- load the matrices with something else besides random values
- try matrix vector product sgemv
- now implement CG for -u'' = f ... you can use the mycg.m template
- now implement Newton-CG for -u'' = f(u) as in LAB3.html
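The steps above boil down to textbook conjugate gradients in which every vector operation is one of the BLAS kernels already listed. Since I do not have the mycg.m template in front of me, here is a hedged serial sketch in Python for the same 1-D model problem; the labeled operations are the ones that become CUBLAS calls (sdot, saxpy, snrm2) plus the matrix-vector product:

```python
import math

def cg_1d(f, n, tol=1e-10, maxiter=None):
    """Conjugate gradients for the 3-point discretization of -u'' = f
    on (0,1), u(0)=u(1)=0.  Each commented step maps to one BLAS
    kernel when ported to CUBLAS."""
    h = 1.0 / (n + 1)

    def amul(v):  # A*v for the tridiagonal (-1, 2, -1) matrix (sgemv or a custom kernel)
        return [2.0 * v[i]
                - (v[i - 1] if i > 0 else 0.0)
                - (v[i + 1] if i < n - 1 else 0.0) for i in range(n)]

    b = [f((i + 1) * h) * h * h for i in range(n)]
    x = [0.0] * n
    r = b[:]                                   # residual; x = 0 initially
    p = r[:]
    rs = sum(ri * ri for ri in r)              # sdot(r, r)
    for _ in range(maxiter or n):
        ap = amul(p)
        alpha = rs / sum(pj * aj for pj, aj in zip(p, ap))   # sdot(p, A p)
        x = [xj + alpha * pj for xj, pj in zip(x, p)]        # saxpy
        r = [rj - alpha * aj for rj, aj in zip(r, ap)]       # saxpy
        rs_new = sum(rj * rj for rj in r)
        if math.sqrt(rs_new) < tol:            # snrm2 convergence test
            break
        beta = rs_new / rs
        p = [rj + beta * pj for rj, pj in zip(r, p)]         # sscal + saxpy
        rs = rs_new
    return x
```

For Newton-CG on -u'' = f(u), this CG solve becomes the inner loop applied to the Jacobian system at each Newton step, as described in LAB3.html.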
Project 4:
Implement all the various ways to compute pi from scratch in CUDA.
- write the code so it can work with different numbers of thread blocks.
- write the code so it can work with different GPUs.
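The structure that makes the thread-block version work can be modeled serially first. The Python sketch below (an illustration only; your solution must be actual CUDA) splits the midpoint-rule sum the way a CUDA grid would: each "block" owns a chunk of subintervals and produces a partial sum (the per-block shared-memory reduction), and the partials are combined at the end (the host-side or second-kernel reduction):

```python
import math

def pi_blocked(n, num_blocks):
    """Midpoint-rule pi over n subintervals, partitioned into
    num_blocks contiguous chunks, mimicking how a CUDA grid of
    num_blocks thread blocks would divide the work.  The result
    should be independent of num_blocks up to rounding."""
    h = 1.0 / n
    partials = []
    for blk in range(num_blocks):
        lo = blk * n // num_blocks          # this block's first subinterval
        hi = (blk + 1) * n // num_blocks    # one past its last subinterval
        s = 0.0
        for i in range(lo, hi):
            x = (i + 0.5) * h
            s += 4.0 / (1.0 + x * x)
        partials.append(s)                  # per-block reduction result
    return h * sum(partials)                # final reduction of partials
```

Checking that the answer does not change with num_blocks is exactly the test you want for the CUDA code when you vary the number of thread blocks or move between GPUs.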