
Lecture 21 Notes



These are the notes on compilers and netlib


Math 481/581 Lecture 21: Netlib, C and FORTRAN Compilers

© 1998 by Mark Hays <hays@math.arizona.edu>. All rights reserved.


Today we'll cover Netlib and the usage of C and FORTRAN compilers.


Compilers versus Interpreters

UNIX systems contain a large number of programs. Some of these programs are shell scripts, some are awk scripts, and some are native machine executables.

If you run emacs on a script, you see the entire source code for the program. You can copy the script into your account, modify it, run it, and immediately see the effects of your changes.

If you run emacs on a native machine executable (hereafter referred to as a "binary" [because it ain't ASCII a'tall]), you will see gibberish. If you alter some of the gibberish and save your changes, the program will almost certainly cease to function --- when you run it, you will probably get a "core dumped" message.

One of the differences between these two types of programs is that scripts are interpreted and binaries are compiled.

The executable file for a program such as "fgrep" contains raw CPU instructions. When you run the fgrep program, the contents of the executable file are loaded into memory (well, not really, but close enough for us) and the CPU directly executes the in-memory copy as a sequence of CPU instructions.

In other words, the gibberish you see when you type emacs /usr/bin/fgrep has meaning to the computer's CPU. It is important to know that different types of CPUs interpret the gibberish differently -- so a machine executable that works on a Sun will not work on a PC or SGI. Different operating systems or even different versions of the same OS may also have different binary executables.
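If you're ever curious which kind of program you're looking at, the file command will tell you: it reports whether a file is a shell script, plain text, or a machine executable, and for executables it names the kind of CPU/OS the binary was built for (the exact wording of the output varies from system to system). For example:

   file /usr/bin/fgrep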

So how do "they" come up with machine executables? On UNIX systems, most binaries come from code written in the C programming language. There is a program called a C compiler that translates C source code into binary executables that can run on the local CPU.

In other words, the compiler translates source code for some language into raw CPU instructions. C, C++, and FORTRAN all work this way. In a minute, we'll see why you might want to do this.
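As a minimal example, here is the classic "hello, world" program in C (the file name hello.c used later is just for illustration); the following sections show how to turn a file like this into an executable:

	#include <stdio.h>

	int main(int argc, char *argv[])
	{
	/* print a greeting and exit successfully */
	printf("hello, world\n");
	return 0;
	}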

Shell scripts and things like maple, mathematica, and matlab programs are interpreted as opposed to compiled. For example, your login shell prompts you to enter a command. When you do so, the shell figures out what to do and then does it for you. In other words, interpreters execute program statements directly. Another example: when you invoke the svd function in matlab, matlab ends up calling a C/FORTRAN routine on your behalf that computes the singular value decomposition for you. This routine is probably very long and horribly complex, but you don't care -- all you have to say is "svd".


When to Use What

In the preceding section we saw that interpreters are very nice. They generally implement fairly high level constructs, like "svd" in matlab. Also, you can directly execute the source code for an interpreted language, so you can quickly see the effects of your changes. In particular, this means that debugging your code is fairly straightforward.

So why in the world would you want to write code in C or FORTRAN? There are a number of reasons:

- Compiled code usually runs much faster than interpreted code.
- Compiled code usually uses much less memory.

The "much" factors here depend on what your code does and what interpreted language it's written in. There are no hard and fast rules.

Of course, there are downsides to using compiled languages:

- Development is slower: every change means another edit-compile-link-run cycle.
- Debugging compiled code is harder than debugging an interpreted script.
- You give up high level constructs (like matlab's "svd") and have to find or write library routines to replace them.

You may have heard the word "RAD" (in a sentence not containing the words "dude" and/or "gnarly"), which stands for "Rapid Application Development". This term is usually used to describe specific programming languages, but it also embodies a program development philosophy that's best explained by example.

If you are going to write a code to solve the 3D Navier-Stokes equation, you could start by scribbling an outline of your program on a piece of toilet paper, and then fire up emacs and begin:

	int main(int argc, char *argv[])
	{
	/* your code here */
	return 0;
	}
In a couple of months, you might have a working code that consists of several thousand lines. Or you might end up with several thousand lines of junk.

If you use the RAD approach, you will probably prototype your code in something like matlab. You can develop pieces of the code as small, independent scripts that can be tuned and debugged separately. You'll usually run "small" test cases to check yourself as you go along.

Once your algorithm is working properly (e.g., near the parameter values of interest), you need to decide if the code you have:

- does everything you need it to do; and
- runs fast enough to be useful.

If the answer to both of these questions is "yes", STOP. You have a code that works and can compute what you need in a reasonable amount of time. There is no reason to translate it to C or FORTRAN.

If the answer to either question is "no", you need to decide whether it is worthwhile translating your code into C or FORTRAN (or something else).

As always, you have to make intelligent choices. For example, writing your own SVD in C probably isn't going to gain you any speed over matlab's SVD. Your best bet when choosing between a compiled and an interpreted program is to weigh the speed you stand to gain against the effort it will take to gain it.

A word on the speed issue: if you decide to recode your application in a compiled language because it runs too slowly, remember to factor in your own programming time. For example, you might have a code that you need to run five times. If it takes two days per run, this works out to ten days of runtime, plus the week it took you to develop it, for seventeen days total.

If you recode it in C, you might get it down to six hours of runtime, which appears to be a Big Win, until you add in the three months of effort it took to recode the thing!

On many systems, you enjoy a large gain in speed if you write your code in FORTRAN instead of C. By "large", I mean a factor of two to three (based on personal experience).


C and FORTRAN Compilers

On many UNIX systems, the compilers have the following names:

   Language       Executable Name
   C              cc
   C++            CC, c++, C++, or cxx
   FORTRAN 77     f77
   FORTRAN 90     f90

If your system has the FSF's GNU compilers installed, you can reach the compilers at:

   Language         Executable Name
   GNU C            gcc
   GNU C++          g++
   GNU FORTRAN 77   g77

You'll need to consult your system's manpages for details on your particular platform.

Sometimes, you'll have the GNU compilers available to you and sometimes not. For example, cc is gcc on Linux systems.

Some evil code requires gcc in order to compile, but this is pretty rare. As a rule of thumb, you'll want to use your vendor-supplied compilers instead of gcc on DEC UNIX and SGI IRIX systems.

I'm not sure about Sun Solaris 2+, IBM AIX, or HP/UX, as I have never used these operating systems. On SunOS 4.x (aka Solaris 1.x), you definitely want to use gcc if it is available.

My experience has been that following the above recommendations will gain you about a factor of two in speed (i.e., code compiled with gcc on DEC UNIX systems runs half as fast as if you had compiled it with cc; on SunOS 4.x, the opposite is true).

Finally, it appears to be the case that gcc version 2.8.x produces binaries that run about half as fast as those compiled with gcc version 2.7.x. You can type gcc -v to find out what version you have. I have not had an opportunity to compare different versions of vendor-supplied compilers.

The important thing is that choosing the right compiler for the language your code is written in can gain you a significant speedup. We'll touch on this briefly in the section on the optimizer.


Phases of Compilation

Compilers usually recognize dozens of command line options. The purpose and syntax of these options are described in excruciating detail in the compiler's online manpage. In this section we'll cover a small subset of these options that are common to all compilers I know of.

Before getting into this, let's take a brief look at the two phases of compilation. First, you run the compiler on your source code to produce object files. Next you run the compiler again to combine all the object files and support libraries into the executable.

During the first compilation phase, your source code is translated into machine instructions by the compiler. For example, if you have a file of C program source called prog.c, you can achieve this with the following command:

   cc -c prog.c
The "-c" flag tells the compiler to translate prog.c to machine instructions and place these instructions in the object file prog.o. The command for FORTRAN, etc. is analogous.

When you have created an object file for each source file, you are ready to build the final executable. Although it is possible to invoke the linker directly (available via the ld command on many systems), it is best to let the compiler invoke it for you. Continuing the above example, we'd link prog.o to produce an executable named prog with the following:

   cc -o prog prog.o -lm
The "-o" flag tells the compiler what to name the output file. If you don't use this flag, the compiler picks a default output filename, usually a.out.

The "-lm" bit tells the compiler to also link in the math runtime library. Basically, a library is an archive of plain old object files that perform a bunch of operations for you. In the C language, the math library contains the trig and transcendental functions such as sin, cos, log, etc. If your source code calls any of these functions and you forget to add in the "-lm", the linker will be unable to produce the final executable and will issue an error message similar to "unresolved symbol" or "unresolved external".

You can link against your own libraries, too. For example, either of the following commands will link your code against a library called libjunk.a in the mystuff subdirectory in your account:

   cc -o prog prog.o $HOME/mystuff/libjunk.a -lm
   cc -o prog prog.o -L$HOME/mystuff -ljunk -lm

You need to know two commands to create your own library: ar and ranlib.

Making a library archive is pretty simple. If you want to build a library called libjunk.a from the object files funcs1.o, funcs2.o, and funcs3.o (these files presumably contain a bunch of functions or subroutines that you commonly use), type:

   ar crv libjunk.a funcs1.o funcs2.o funcs3.o
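If you want to double-check what ended up in the archive, ar can also print its table of contents (the t flag lists the members, and v makes the listing verbose):

   ar tv libjunk.a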

On some systems, you need to "bless" the library you just built with ranlib. Move the library to its final resting place (for example, the directory $HOME/mystuff) and type:

   ranlib $HOME/mystuff/libjunk.a
If your system has a ranlib command, you probably need to use it; otherwise, you can skip this step.

You don't have to name your libraries lib<whatever>.a; you can, in fact, call them whatever you like. The advantage of using the lib<whatever>.a scheme is that you can use the -L<path> -l<whatever> syntax (which is considered Good UNIX Form).

There are a couple of good reasons to make your own libraries. First, you save a little compilation time because you don't have to recompile the files funcs1.c, funcs2.c, and funcs3.c every time you rebuild one of your programs.

Second, there is an even nastier problem with having separate copies of the three "funcs" files in each of your source code directories. If you find a bug in, say, funcs2.c, you will have to make changes in every copy of this file. It is very easy to forget to do this --- the end result is that you'll get bitten by the same bug six months from now!

By putting your functions into a library and maintaining the library's source code in a single location, you largely avoid this problem. If you modify the library's source code, you simply need to relink each program that uses the library.
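For example, if the bug was in funcs2.c and the library lives in $HOME/mystuff as above, the update might look something like this:

   cc -c funcs2.c
   ar rv $HOME/mystuff/libjunk.a funcs2.o
   ranlib $HOME/mystuff/libjunk.a
   cc -o prog prog.o -L$HOME/mystuff -ljunk -lm

The r flag tells ar to replace the old copy of funcs2.o in the archive with the new one; run ranlib again if your system requires it, and then relink each program that uses the library.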


The Optimizer

Compilers have the ability to reorganize your code in such a way that the resulting program will execute more quickly. Unless your code tickles a bug in the compiler, you should always make use of these options (the odds of finding a compiler bug are very small).

To compile a code with optimization, use the "-O" flag. Most compilers support various levels of optimization; the respective manpage will give you all the details. Optimization only affects the compile stage; it has no effect when linking.

Here's how to turn on the optimizer:

   cc -O -c prog.c
   cc -o prog prog.o -lm
In general, your code will run twice as fast if you compile it with optimization.
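Most compilers also let you ask for a specific optimization level; higher levels enable more aggressive transformations. The exact spelling and meaning of the levels differ from compiler to compiler (this is where the manpage comes in), but something along these lines usually works:

   cc -O2 -c prog.c
   cc -o prog prog.o -lm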


The Core Dump and Debugging

Sometimes things go terribly wrong when your code runs. There are at least two types of problems that can occur: logic errors in your code and the ubiquitous "core dump".

Finding and correcting logic errors is beyond the scope of this course. Normally code containing such errors will execute but give the wrong answer. Logic errors usually reflect a problem with your algorithm or your implementation of the algorithm. Going over your code with a fine-toothed comb is usually your only recourse in this case.

Core dumps occur when your code performs some illegal operation, such as attempting to access memory that is outside your process' address space.

When this happens, the operating system causes your program to abort and write the process' entire virtual address space into a file called "core".

This file can be very useful for debugging. In particular, it can help you figure out exactly where your code performed the fatal operation.

For most compilers, you need to turn off the optimizer and compile with the "-g" flag in order to make use of the core file. Our ongoing example looks like:

   cc -g -c prog.c
   cc -o prog prog.o -lm
When your code dumps core, you can figure out where it died by typing dbx ./prog core. This will get you to the dbx debugger prompt. If you type "where", a stack trace will be printed. Type "quit" to get out of dbx. Dbx has online help available in its manpage. You can also type "help" at the prompt for detailed help.
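To see the whole procedure in action, here is a tiny program (call it crash.c --- the name and contents are purely for illustration) that dereferences a null pointer, which is an illegal memory access and should dump core on most systems:

	int main(int argc, char *argv[])
	{
	int *p = 0;	/* p points at address zero, which we may not touch... */
	*p = 42;	/* ...so this store is illegal and aborts the program */
	return 0;
	}

Build it with -g, run it, and then point dbx at the core file (assuming your shell allows core files to be written):

   cc -g -c crash.c
   cc -o crash crash.o
   ./crash
   dbx ./crash core

Typing "where" at the dbx prompt should point right at the offending line of crash.c.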

If you compile your code with the GNU compilers, you'll want to use the GNU debugger "gdb" instead of dbx.
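The invocation is analogous. Assuming the same prog and core files:

   gdb ./prog core

At the (gdb) prompt, "bt" prints a stack trace (gdb's equivalent of dbx's "where") and "quit" exits.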

Finally, a word of warning: NEVER attempt to link object files produced with your vendor's C compiler against object files produced with gcc. Sometimes it works, but sometimes it doesn't. When it doesn't work, it isn't at all clear what's wrong -- in other words, it will probably take you a long, long time to figure out what the problem is.


The Netlib Software Repository

The Netlib software repository at:

http://www.netlib.org

is a fairly comprehensive resource for all sorts of freely available mathematical software. Most of the code is written in FORTRAN-77, but some things have been ported to C, C++, and FORTRAN-9x.

The quality of the code is generally very high (from a numerical analysis standpoint). Unfortunately, the code tends to be a little difficult to use. If you need the best available implementation of an algorithm (e.g., your problem is numerically sensitive), Netlib is a good place to look.

If you're just getting started, or want to get something up and running quickly, I'd recommend prototyping in something like Matlab. If you need extra speed, you can port the Matlab code to C or FORTRAN and use something like Numerical Recipes. If Numerical Recipes doesn't supply the algorithm you want, or if you decide that their implementation is not robust enough, you can probably find some code on Netlib that'll do what you need.

The Guide to Available Mathematical Software, or GAMS, is at:

http://gams.cam.nist.gov

GAMS can be extremely helpful when you are trying to locate something on Netlib.