Math 481/581 Intro to Parallel Computing



Overview

This is a very cursory introduction to Parallel Computing. These notes are based on material appearing in Ian Foster's "Designing and Building Parallel Programs", published by Addison Wesley.

Parallel computing is covered in some detail in a companion course, Math 687A Advanced Scientific Computing; here, however, we will limit ourselves to the four most important concepts in good parallel program design:
Concurrency: using many processors to accomplish a task.
Scalability: a good program design should run on any number of processors and should not be penalized in performance for doing so.
Locality: the program should exploit the local nature of stored information in order to optimize speed.
Modularity: modular program design makes codes more portable and easier to interface and maintain.


Parallel Computer

A parallel computer is a set of processors able to work cooperatively to solve a computational problem. Parallelism offers the potential to concentrate the processing, memory, and I/O capabilities of many machines on a single computational task.

Uses

Generally, wherever large problems need to be tackled. Example: in 3D video, a typical object currently has about 1024 cubed, or roughly 10^9, elements; processing it at 30 frames per second with a coding/decoding cost of 200 operations per element requires roughly 10^12 operations per second. Another example: running a 10-year simulation of global climate dynamics in ten days requires about 10^20 floating-point operations and generates about 10^11 bytes of data.
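
For concreteness, here is the back-of-the-envelope arithmetic behind the video estimate, using the figures quoted above (the 200-operation coding/decoding cost per element is the assumption stated in the text):

    \[
      \underbrace{1024^{3}}_{\approx 10^{9}\ \text{elements}}
      \times
      \underbrace{30}_{\text{frames/s}}
      \times
      \underbrace{200}_{\text{ops/element}}
      \;\approx\; 6 \times 10^{12},
    \]

that is, on the order of $10^{12}$ operations per second.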

Trends in Computer Design

Between 1945 and today, the speed of computers has increased roughly tenfold every five years. (See: peak performance of some of the fastest supercomputers.)

The speed with which a problem is solved depends on the time required to execute a single operation and on the number of concurrent operations. While computers are getting faster, it is clear that concurrency must be exploited to attain greater speeds, since the speed of a basic operation cannot exceed the clock cycle. Even if a machine's signals traveled at the fastest possible speed, the speed of light, the time required for a basic operation would be T = D/c, where D is the distance on the chip that a signal must travel and c is the speed of light. Since D is proportional to A^(1/2), where A is the surface area of the chip, the only way to decrease the time of computation is to make chips smaller; for example, doubling the speed requires shrinking the chip area by a factor of four.
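
In symbols, the scaling argument reads:

    \[
      T = \frac{D}{c}, \qquad D \propto A^{1/2}
      \quad\Longrightarrow\quad
      \frac{T_{\text{new}}}{T_{\text{old}}}
      = \frac{D_{\text{new}}}{D_{\text{old}}}
      = \left(\frac{A_{\text{new}}}{A_{\text{old}}}\right)^{1/2},
    \]

so halving $T$ requires halving $D$, which in turn means shrinking the chip area $A$ by a factor of four.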

An alternative:

While chips are getting smaller, another way to increase the speed of a computation is to put more processors to work concurrently. (See: clock cycle times.)

Trends in Networking

Concurrent computing invariably requires that processors have access to data stored remotely. To make distributed computing a feasible paradigm we need high-speed communication, i.e. high-speed networks. Not long ago, communication was achieved at a rate of 1.5 Mbits per second; in the near future this number will be close to 1000 Mbits per second. Ideally, the time to communicate a message would depend only on its length, but in practice it also depends on factors such as latency and network traffic. Beyond speed, other challenges of networking are reliability and security. Currently, the slowest part of concurrent computing is communication: if there are many processors and the code requires a lot of communication, you can reach a point of diminishing returns.
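
To see how communication can erode the benefit of adding processors, consider the following toy model (not from Foster's text; the constants are invented purely for illustration), in which the serial work is split evenly across p processors but each additional processor adds a fixed communication overhead:

    #include <stdio.h>

    /* Toy model of diminishing returns: parallel time on p processors is
     * the serial work divided by p plus a communication overhead that
     * grows with p.  All constants are made up for illustration. */
    int main(void)
    {
        const double compute_time = 100.0;  /* seconds of serial work (assumed)                 */
        const double overhead     = 0.5;    /* communication cost per extra processor (assumed) */

        for (int p = 1; p <= 64; p *= 2) {
            double parallel_time = compute_time / p + overhead * (p - 1);
            double speedup       = compute_time / parallel_time;
            printf("p = %2d   time = %6.2f s   speedup = %5.2f\n",
                   p, parallel_time, speedup);
        }
        return 0;
    }

With these made-up constants the speedup grows at first, peaks around p = 16, and then falls as the communication term dominates.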

A defining attribute of parallel machines is that local access to memory is faster than remote access, in a ratio of 10:1 to 1000:1. So locality is very desirable, in addition to concurrency and scalability.

Parallel programming and architecture are necessarily complex due to synchronization requirements and inter-node communication. Abstraction is essential in order to design robust algorithms, and an important vehicle for abstraction is modularity. Abstraction is also the primary motivation for developing object-oriented languages, which by design already have a certain amount of modularity in place. Modularity is good design practice in its own right, since it tends to produce codes that are easily diagnosed and linked to other programs. In addition, if codes are made portable, they will tend to run on many types of computers without requiring large changes. The following describes a typical parallel programming model:

  • A parallel computation consists of one or more tasks. Tasks execute concurrently.
  • A task encapsulates a sequential program and its local memory. It interfaces with the outside world through inports and outports. In addition to reading and writing its local memory, a task can send and receive messages, create new tasks, or terminate.
  • Sends are asynchronous; receives are synchronous (a code sketch illustrating send/receive follows this list).
  • Inports and outports are connected by message queues called channels. Channels can be created, deleted, or referenced.
  • Tasks are mapped to physical processors in such a way that performance, measured in speed and/or storage use, is optimized.
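
As a concrete illustration of this model, here is a minimal sketch using MPI (one of the message-passing libraries discussed below). This is not Foster's task/channel notation itself: in MPI the "channel" is implicit in the source and destination ranks plus a message tag, and a plain MPI_Send may block rather than behaving as a fully asynchronous send. The sketch assumes an MPI implementation such as MPICH is installed.

    #include <stdio.h>
    #include <mpi.h>

    /* Two tasks, each with its own local memory, exchange a message:
     * task 0 sends an integer, task 1 receives it and prints it. */
    int main(int argc, char **argv)
    {
        int rank, size, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which task am I?      */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many tasks total? */

        if (rank == 0 && size > 1) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to task 1, tag 0 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* blocking receive */
            printf("task 1 received %d from task 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

With MPICH, for example, such a program is typically compiled with mpicc and launched on two or more processes with mpirun or mpiexec.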

In summary, the four important aspects of parallel computing are concurrency, scalability, locality, and modularity.

Parallel Machine Models

A processor is composed of a CPU and its memory storage device. This is the von Neumann computer.

MIMD (multiple instruction, multiple data) machines come in two basic types: the multicomputer architecture and the multiprocessor architecture.

The multicomputer (distributed-memory machine) is such that each node is a processor that can execute a separate stream of instructions on its own local data. Distributed memory means that data is distributed among many processors rather than held in a central memory device. The cost of sending/receiving data then depends on the node location and on network traffic. Machines of this type include the IBM SP, Cray T3D, Meiko CS-2, and nCUBE. (See: schematic of a multicomputer.)

The multiprocessor (shared-memory machine) is such that all nodes share a common, centrally located memory. Here, cache (the smallest and most local form of memory as far as the CPU is concerned) is exploited to keep frequently used data close to each processor. Examples are the SGI Power Challenge and Sequent Symmetry.

(See: comparison of a distributed-memory machine, a shared-memory machine, and a local area network.)

SIMD (single instruction, multiple data): all processors execute the same instruction stream on different pieces of data. This has the potential to reduce considerably the complexity of both hardware and software, but it is usually appropriate only for specific problems, e.g. certain image processing and numerical calculations. Examples are the MasPar MP and Thinking Machines CM-1. These machines are not as popular as they once were.
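
The SIMD idea can be pictured with an ordinary data-parallel loop: the same operation is applied to every element of an array, so the iterations are independent and could in principle be carried out simultaneously by processors (or vector lanes) executing a single instruction stream. A sketch in C, for illustration only:

    #include <stdio.h>

    #define N 8

    /* Data-parallel (SIMD-style) computation: every element undergoes
     * the same operation on different data, so the iterations could be
     * executed simultaneously under a single instruction stream. */
    int main(void)
    {
        double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) {   /* set up some sample data */
            a[i] = i;
            b[i] = 2.0 * i;
        }

        for (int i = 0; i < N; i++)     /* same instruction, different data */
            c[i] = a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("c[%d] = %g\n", i, c[i]);

        return 0;
    }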

Parallel Networks

Fast networks that are commonly used are:
  • LAN (local area network): all machines attached to the network are local.
  • WAN (wide area network): machines may be geographically distributed.

In both of these instances, technologies such as Ethernet and ATM (asynchronous transfer mode) are exploited, and in both cases speed, reliability, and security are issues. Hence, a heterogeneous network of workstations running a parallel code can be used to tackle parallel computing tasks. For example, to check out the Beowulf (or commodity) machine housed in the math department, click here. Parallel machine vendors usually have their own processor communication networks (or switches), built specifically for their own products; these tend to be the fastest.

Parallel Computing Software

Such software offers a minimal instruction set with which parallel codes that run on MIMD machines can be built. The most widely used packages are MPI, p4, and PVM.

Further Sources and Tools

  • Designing and Building Parallel Programs, I. Foster
  • Advanced Scientific Computing Course
  • Designing and building parallel programs
  • Globus
  • MPI
  • MPICH
  • p4
  • Ptools
  • Upshot
  • PETSc