In this lab you can work from the local computer in Cygwin, or do all
the work from any other computer on campus by logging in remotely
using ssh.
You will explore parallel implementation using MPI and/or CUDA on GPUs.
This lab spans three weeks. Solve two projects. For those less
advanced in computing, I can count the following pair as one project:
i) a (theoretical) discussion/implementation of the handshake
algorithm, and ii) a summary/overview of all the CUDA demos that you
installed and viewed.
Read background information
There will be a demonstration of both technologies in class. Please wait for it.
During the lab, I encourage all to log in to the Linux machines in the
back with NVIDIA CUDA and to follow the instructions to get all the demos.
Project 1:
Modify your mypi.f program and submission scripts to test its behavior
as the number of subintervals myn changes, from the value given up to a
very large number. Collect the timings.
Test your code on 2, 4, and 8 processors. Do you get the same results?
Do you expect to get the same results? How does the error behave?
Comment on
- accuracy of pi computed
- scalability of your code
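As a serial reference for the accuracy question, here is a sketch in Python (a check only; your lab code is mypi.f) of the midpoint-rule computation of pi that mypi.f presumably performs, integrating 4/(1+x^2) over [0,1]. The choice of integrand and rule is an assumption; adjust to match your mypi.f. The error should shrink roughly like 1/n^2:

```python
import math

def pi_midpoint(n):
    """Approximate pi = integral of 4/(1+x^2) over [0,1] by the
    composite midpoint rule with n subintervals."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h              # midpoint of subinterval i
        total += 4.0 / (1.0 + x * x)
    return h * total

for n in (10, 100, 1000):
    approx = pi_midpoint(n)
    print(n, approx, abs(approx - math.pi))
```

Each tenfold increase in n should cut the error by roughly a factor of 100, which is the behavior worth commenting on above.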
To get the timings for your program (and to test scalability), use the
MPI_Wtime function.
Optional: design your own computation of pi using the algorithm from
Lab1. Compare it with this one.
Project 2: Overlapping domain decomposition.
You may want to follow this introductory example on send/receive
before you go on.
- Get handshake.f
- Look at the code and study it. As is, it will produce unpleasant results.
- Rig a submission script for it, compile the code and run it on the cluster.
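Before moving on, it may help to see the exchange pattern in plain serial form. The Python sketch below simulates the data movement that (I assume) handshake.f performs with blocking MPI_SEND/MPI_RECV between chain neighbors; in the real MPI code the calls must be ordered (for example, even ranks send first while odd ranks receive first) so every send meets a posted receive and the code cannot deadlock:

```python
def exchange(values):
    """Simulate each rank r swapping its value with its left and right
    neighbors in a 1-D chain of len(values) ranks.

    Returns (from_left, from_right): from_left[r] is what rank r got
    from rank r-1, from_right[r] what it got from rank r+1; boundary
    ranks have no neighbor on that side and get None.  In MPI this is
    a pair of send/recv calls per rank, ordered even/odd to avoid
    deadlock with blocking sends."""
    p = len(values)
    from_left = [None] * p
    from_right = [None] * p
    for r in range(p):
        if r + 1 < p:
            from_left[r + 1] = values[r]   # "send" my value rightward
        if r - 1 >= 0:
            from_right[r - 1] = values[r]  # "send" my value leftward
    return from_left, from_right
```

The even/odd ordering detail is exactly the "unpleasant result" handshake.f is meant to teach: with the wrong ordering, blocking sends can wait on each other forever.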
Now that you know how to do a global reduce operation (as in
mypi.f) and send-receives (as in handshake.f), you are ready to create
an algorithm for myjacobi_parallel.f, which should be an extension of
the algorithm discussed in class.
Make sure first that your single-processor (serial)
version runs. Your parallel version should run the same way, with
exactly the same results. Here is a template: myjacobi_parallel.m
Implement the algorithm incrementally (in Fortran) and make
sure each step is correct before you proceed to the next:
- put the handshake-like message passing in first; set maxiter
to 3 or 5 (so the code finishes quickly), and test the code.
- put in Aleft, Aright, uleft, uright in both the matrix-vector
product and the Jacobi loop
- put in the global reduce to compute global norm of the residual
- run it with an appropriate value of maxiter
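For checking the parallel runs, a serial Jacobi reference can be sketched as below. I am assuming the model problem is the 1-D one, -u'' = f on (0,1) with u(0)=u(1)=0 (the same equation Project 3 uses), discretized by standard 3-point finite differences; your myjacobi code may differ:

```python
import math

def jacobi_1d(f, n, maxiter, tol=1e-12):
    """Jacobi iteration for -u'' = f on (0,1), u(0)=u(1)=0, using
    3-point differences on n interior points.  Serial reference: the
    parallel version splits the index range over ranks, exchanges
    uleft/uright ghost values each sweep (handshake pattern), and
    computes the residual norm with a global reduce."""
    h = 1.0 / (n + 1)
    b = [f((i + 1) * h) * h * h for i in range(n)]   # right-hand side
    u = [0.0] * n
    for _ in range(maxiter):
        unew = [0.0] * n
        for i in range(n):
            ul = u[i - 1] if i > 0 else 0.0          # uleft in parallel
            ur = u[i + 1] if i < n - 1 else 0.0      # uright in parallel
            unew[i] = 0.5 * (ul + ur + b[i])
        # residual norm ||b - A u||: the globally reduced quantity
        res = 0.0
        for i in range(n):
            ul = unew[i - 1] if i > 0 else 0.0
            ur = unew[i + 1] if i < n - 1 else 0.0
            r = b[i] - (2.0 * unew[i] - ul - ur)
            res += r * r
        u = unew
        if math.sqrt(res) < tol:
            break
    return u
```

With f(x) = pi^2 sin(pi x) the exact solution is u(x) = sin(pi x), so the computed interior values should match sin at the grid points up to the O(h^2) discretization error; the parallel code should reproduce these values exactly.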
Project 3: Newton-CG using the CUDA implementation of BLAS.
CUBLAS documentation
I am assuming that you have settled in on the machines with Tesla C1060
cards and have run some demos. Now follow the simpleCUBLAS example to morph it
eventually into your Newton-CG code.
- copy the simpleCUBLAS project to your own folder, remove the other projects (keep simpleCUDA for comparison), make the executable, and run it.
- modify it
- print actual errors between host and device computations
- try some other BLAS functions: sdot, saxpy, snrm2
- load the matrices with something else besides random values
- try matrix vector product sgemv
- now implement CG for -u'' = f ... you can use the mycg.m template
- now implement Newton-CG for -u'' = f(u) as in LAB3.html
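The steps above boil down to textbook conjugate gradients in which every vector operation is one of the BLAS kernels already listed. Since I do not have the mycg.m template in front of me, here is a hedged serial sketch in Python for the same 1-D model problem; the labeled operations are the ones that become CUBLAS calls (sdot, saxpy, snrm2) plus the matrix-vector product:

```python
import math

def cg_1d(f, n, tol=1e-10, maxiter=None):
    """Conjugate gradients for the 3-point discretization of -u'' = f
    on (0,1), u(0)=u(1)=0.  Each commented step maps to one BLAS
    kernel when ported to CUBLAS."""
    h = 1.0 / (n + 1)

    def amul(v):  # A*v for the tridiagonal (-1, 2, -1) matrix (sgemv or a custom kernel)
        return [2.0 * v[i]
                - (v[i - 1] if i > 0 else 0.0)
                - (v[i + 1] if i < n - 1 else 0.0) for i in range(n)]

    b = [f((i + 1) * h) * h * h for i in range(n)]
    x = [0.0] * n
    r = b[:]                                   # residual; x = 0 initially
    p = r[:]
    rs = sum(ri * ri for ri in r)              # sdot(r, r)
    for _ in range(maxiter or n):
        ap = amul(p)
        alpha = rs / sum(pj * aj for pj, aj in zip(p, ap))   # sdot(p, A p)
        x = [xj + alpha * pj for xj, pj in zip(x, p)]        # saxpy
        r = [rj - alpha * aj for rj, aj in zip(r, ap)]       # saxpy
        rs_new = sum(rj * rj for rj in r)
        if math.sqrt(rs_new) < tol:            # snrm2 convergence test
            break
        beta = rs_new / rs
        p = [rj + beta * pj for rj, pj in zip(r, p)]         # sscal + saxpy
        rs = rs_new
    return x
```

For Newton-CG on -u'' = f(u), this CG solve becomes the inner loop applied to the Jacobian system at each Newton step, as described in LAB3.html.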
Project 4:
Implement all the various ways to compute pi from scratch in CUDA.
- write the code so it can work with different numbers of thread blocks.
- write the code so it can work with different GPUs.
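The structure that makes the thread-block version work can be modeled serially first. The Python sketch below (an illustration only; your solution must be actual CUDA) splits the midpoint-rule sum the way a CUDA grid would: each "block" owns a chunk of subintervals and produces a partial sum (the per-block shared-memory reduction), and the partials are combined at the end (the host-side or second-kernel reduction):

```python
import math

def pi_blocked(n, num_blocks):
    """Midpoint-rule pi over n subintervals, partitioned into
    num_blocks contiguous chunks, mimicking how a CUDA grid of
    num_blocks thread blocks would divide the work.  The result
    should be independent of num_blocks up to rounding."""
    h = 1.0 / n
    partials = []
    for blk in range(num_blocks):
        lo = blk * n // num_blocks          # this block's first subinterval
        hi = (blk + 1) * n // num_blocks    # one past its last subinterval
        s = 0.0
        for i in range(lo, hi):
            x = (i + 0.5) * h
            s += 4.0 / (1.0 + x * x)
        partials.append(s)                  # per-block reduction result
    return h * sum(partials)                # final reduction of partials
```

Checking that the answer does not change with num_blocks is exactly the test you want for the CUDA code when you vary the number of thread blocks or move between GPUs.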