Math 481/581 Intro to Parallel Computing



Overview

This is a very cursory introduction to Parallel Computing. These notes are based on material appearing in Ian Foster's "Designing and Building Parallel Programs", published by Addison Wesley.

Parallel computing is covered in some detail in a companion course, Math 687A Advanced Scientific Computing; here, however, we will limit ourselves to the four most important concepts in good parallel program design:
Concurrency: using many processors to accomplish a task.
Scalability: a good program design should run on any number of processors and should not be penalized in performance for doing so.
Locality: the program should exploit the local nature of stored information in order to optimize speed.
Modularity: modular program design makes codes more portable and easier to interface and maintain.


Parallel Computer

A parallel computer is a set of processors able to work cooperatively to solve a computational problem. Parallelism offers the potential to concentrate the processing, memory, and I/O capabilities of many machines on a single computational task.

Uses

Generally, wherever large problems need to be tackled. Example: in 3D video, a typical object currently has about 1024 cubed, or roughly 10^9, elements; processing it at 30 frames per second with a coding/decoding cost of 200 operations per element requires roughly 10^12 operations per second. Another example: running a 10-year simulation of global climate dynamics in ten days requires about 10^20 floating-point operations and generates about 10^11 bytes of data.
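
For concreteness, here is the back-of-the-envelope arithmetic behind the video estimate, using the figures quoted above (the 200-operation coding/decoding cost per element is the assumption stated in the text):

    \[
      \underbrace{1024^{3}}_{\approx 10^{9}\ \text{elements}}
      \times
      \underbrace{30}_{\text{frames/s}}
      \times
      \underbrace{200}_{\text{ops/element}}
      \;\approx\; 6 \times 10^{12},
    \]

that is, on the order of $10^{12}$ operations per second.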

Trends in Computer Design

Between 1945 and today, the speed of computers has increased roughly tenfold every five years. (See: peak performance of some of the fastest supercomputers.)

The speed with which a problem is solved depends on the time required to execute a single operation and on the number of concurrent operations. While computers are getting faster, it is clear that concurrency must be exploited to attain greater speeds, since the speed of a basic operation cannot exceed the clock cycle. Even if a machine's signals traveled at the fastest possible speed, the speed of light, the time required for a basic operation would be T = D/c, where D is the distance on the chip that a signal must travel and c is the speed of light. Since D is proportional to A^(1/2), where A is the surface area of the chip, the only way to decrease the time of computation is to make chips smaller; for example, doubling the speed requires shrinking the chip area by a factor of four.
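
In symbols, the scaling argument reads:

    \[
      T = \frac{D}{c}, \qquad D \propto A^{1/2}
      \quad\Longrightarrow\quad
      \frac{T_{\text{new}}}{T_{\text{old}}}
      = \frac{D_{\text{new}}}{D_{\text{old}}}
      = \left(\frac{A_{\text{new}}}{A_{\text{old}}}\right)^{1/2},
    \]

so halving $T$ requires halving $D$, which in turn means shrinking the chip area $A$ by a factor of four.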

An alternative:

While chips are getting smaller, another way to increase the speed of a computation is to put more processors to work concurrently. (See: clock cycle times.)

Trends in Networking

Concurrent computing invariably requires that processors have access to data stored remotely. To make distributed computing a feasible paradigm we need high-speed communication, i.e. high-speed networks. Not long ago, communication was achieved at a rate of 1.5 Mbits per second; in the near future this number will be close to 1000 Mbits per second. Ideally, the time to communicate a message would depend only on its length, but in practice it also depends on factors such as latency and network traffic. Beyond speed, other challenges of networking are reliability and security. Currently, the slowest part of concurrent computing is communication: if there are many processors and the code requires a lot of communication, you can reach a point of diminishing returns.
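
To see how communication can erode the benefit of adding processors, consider the following toy model (not from Foster's text; the constants are invented purely for illustration), in which the serial work is split evenly across p processors but each additional processor adds a fixed communication overhead:

    #include <stdio.h>

    /* Toy model of diminishing returns: parallel time on p processors is
     * the serial work divided by p plus a communication overhead that
     * grows with p.  All constants are made up for illustration. */
    int main(void)
    {
        const double compute_time = 100.0;  /* seconds of serial work (assumed)                 */
        const double overhead     = 0.5;    /* communication cost per extra processor (assumed) */

        for (int p = 1; p <= 64; p *= 2) {
            double parallel_time = compute_time / p + overhead * (p - 1);
            double speedup       = compute_time / parallel_time;
            printf("p = %2d   time = %6.2f s   speedup = %5.2f\n",
                   p, parallel_time, speedup);
        }
        return 0;
    }

With these made-up constants the speedup grows at first, peaks around p = 16, and then falls as the communication term dominates.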

A defining attribute of parallel machines is that local access to memory is faster than remote access, in a ratio of 10:1 to 1000:1. So locality is very desirable, in addition to concurrency and scalability.

Parallel programming and architecture are necessarily complex due to synchronization requirements and inter-node communication. Abstraction is essential in order to design robust algorithms, and an important vehicle for abstraction is modularity. Abstraction is also the primary motivation for developing object-oriented languages, which by design already have a certain amount of modularity in place. Modularity is good design practice in its own right, since it tends to produce codes that are easily diagnosed and linked to other programs. In addition, if codes are made portable, they will tend to run on many types of computers without requiring large changes. The following describes a typical parallel programming model:

  • A parallel computation consists of one or more tasks. Tasks execute concurrently.
  • A task encapsulates a sequential program and its local memory. It interfaces with the outside world through inports and outports. In addition to reading and writing its local memory, a task can send and receive messages, create new tasks, or terminate.
  • Sends are asynchronous; receives are synchronous (a code sketch illustrating send/receive follows this list).
  • Inports and outports are connected by message queues called channels. Channels can be created, deleted, or referenced.
  • Tasks are mapped to physical processors in such a way that performance, measured in speed and/or storage use, is optimized.
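
As a concrete illustration of this model, here is a minimal sketch using MPI (one of the message-passing libraries discussed below). This is not Foster's task/channel notation itself: in MPI the "channel" is implicit in the source and destination ranks plus a message tag, and a plain MPI_Send may block rather than behaving as a fully asynchronous send. The sketch assumes an MPI implementation such as MPICH is installed.

    #include <stdio.h>
    #include <mpi.h>

    /* Two tasks, each with its own local memory, exchange a message:
     * task 0 sends an integer, task 1 receives it and prints it. */
    int main(int argc, char **argv)
    {
        int rank, size, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which task am I?      */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many tasks total? */

        if (rank == 0 && size > 1) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to task 1, tag 0 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* blocking receive */
            printf("task 1 received %d from task 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

With MPICH, for example, such a program is typically compiled with mpicc and launched on two or more processes with mpirun or mpiexec.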

In summary, the four important aspects of parallel computing are concurrency, scalability, locality, and modularity.

Parallel Machine Models

A processor is composed of a CPU and its memory storage device. This is the von Neumann computer.

MIMD (multiple instruction, multiple data) machines come in two basic types: the multicomputer architecture and the multiprocessor architecture.

The multicomputer (distributed-memory machine) is such that each node is a processor that can execute a separate stream of instructions on its own local data. Distributed memory means that data is distributed among many processors rather than held in a central memory device. The cost of sending/receiving data then depends on the node location and on network traffic. Machines of this type include the IBM SP, Cray T3D, Meiko CS-2, and nCUBE. (See: schematic of a multicomputer.)

The multiprocessor (shared-memory machine) is such that all nodes share a common, centrally located memory. Here, cache (the smallest and most local form of memory as far as the CPU is concerned) is exploited to keep frequently used data close to each processor. Examples are the SGI Power Challenge and Sequent Symmetry.

(See: comparison of a distributed-memory machine, a shared-memory machine, and a local area network.)

SIMD (single instruction, multiple data): all processors execute the same instruction stream on different pieces of data. This has the potential to reduce considerably the complexity of both hardware and software, but it is usually appropriate only for specific problems, e.g. certain image processing and numerical calculations. Examples are the MasPar MP and Thinking Machines CM-1. These machines are not as popular as they once were.
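
The SIMD idea can be pictured with an ordinary data-parallel loop: the same operation is applied to every element of an array, so the iterations are independent and could in principle be carried out simultaneously by processors (or vector lanes) executing a single instruction stream. A sketch in C, for illustration only:

    #include <stdio.h>

    #define N 8

    /* Data-parallel (SIMD-style) computation: every element undergoes
     * the same operation on different data, so the iterations could be
     * executed simultaneously under a single instruction stream. */
    int main(void)
    {
        double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) {   /* set up some sample data */
            a[i] = i;
            b[i] = 2.0 * i;
        }

        for (int i = 0; i < N; i++)     /* same instruction, different data */
            c[i] = a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("c[%d] = %g\n", i, c[i]);

        return 0;
    }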

Parallel Networks

Fast networks that are commonly used are:
  • LAN (local area network): all machines attached to the network are local.
  • WAN (wide area network): machines may be geographically distributed.

In both of these instances, technologies such as Ethernet and ATM (asynchronous transfer mode) are exploited, and in both cases speed, reliability, and security are issues. Hence, a heterogeneous network of workstations running a parallel code can be used to tackle parallel computing tasks. For example, to check out the Beowulf (or commodity) machine housed in the math department, click here. Parallel machine vendors usually have their own processor communication networks (or switches), built specifically for their own products; these tend to be the fastest.

Parallel Computing Software

Such software offers a minimal instruction set with which parallel codes that run on MIMD machines can be built. The most widely used packages are MPI, p4, and PVM.

Further Sources and Tools

  • Designing and Building Parallel Programs, I. Foster
  • Advanced Scientific Computing Course
  • Designing and building parallel programs
  • Globus
  • MPI
  • MPICH
  • p4
  • Ptools
  • Upshot
  • PETSc