Wednesday, December 26, 2007

2.2 Functions and Global Variables
The program expand processes the files named as its arguments (or its standard input if no file arguments are specified) by expanding hard tab characters (\t, ASCII character 9) to a number of spaces. The default behavior is to set tab stops every eight characters; this can be overridden by a comma- or space-separated numeric list specified using the -t option. An interesting aspect of the program's implementation, and the reason we are examining it, is that it uses all of the control flow statements available in the C family of languages. Figure 2.2 contains the variable and function declarations of expand,[10] Figure 2.3 contains the main code body,[11] and Figure 2.5 (in Section 2.5) contains the two supplementary functions used.[12]
[10] netbsdsrc/usr.bin/expand/expand.c:36–62
[11] netbsdsrc/usr.bin/expand/expand.c:64–151
[12] netbsdsrc/usr.bin/expand/expand.c:153–185
When examining a nontrivial program, it is useful to first identify its major constituent parts. In our case, these are the global variables (Figure 2.2:1) and the functions main (Figure 2.3), getstops (see Figure 2.5:1), and usage (see Figure 2.5:8).
The integer variable nstops and the array of integers tabstops are declared as global variables, outside the scope of function blocks. They are therefore visible to all functions in the file we are examining.
The three function declarations that follow (Figure 2.2:2) declare functions that will appear later within the file. Since some of these functions are used before they are defined, in C/C++ programs the declarations allow the compiler to verify the arguments passed to the function and their return values and generate correct corresponding code. When no forward declarations are given, the C compiler will make assumptions about the function return type and the arguments when the function is first used; C++ compilers will flag such cases as errors. If the following function definition does not match these assumptions, the compiler will issue a warning or error message. However, if the wrong declaration is supplied for a function defined in another file, the program may compile without a problem and fail at runtime.
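As a minimal sketch of the point about forward declarations, consider the fragment below. The names half and average_of_two are hypothetical and do not come from expand; the fragment only shows a function being called before its definition, with the prototype giving the compiler the information it needs.

```c
/* Forward declaration: gives the compiler the argument and return
 * types before the first call appears. (Hypothetical helper.) */
static double half(int n);

/* Calls half() before its definition. Thanks to the prototype the
 * compiler knows that the result is a double, not an int. */
double
average_of_two(int a, int b)
{
	return half(a) + half(b);
}

/* The definition must match the declaration above; a mismatch in
 * the same file is reported at compile time. */
static double
half(int n)
{
	return n / 2.0;
}
```

If the prototype were removed, a C++ compiler (and a modern C compiler) would reject the call in average_of_two, while an older C compiler would assume that half returns int.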
Figure 2.2 Expanding tab stops (declarations).
<-- a
#include
#include
#include
#include
#include

int nstops;
int tabstops[100];

static void getstops(char *);
int main(int, char *[]);
static void usage (void);
(a) Header files
Global variables
Forward function declarations
Notice how the two functions are declared as static while the variables are not. This means that the two functions are visible only within the file, while the variables are potentially visible to all files comprising the program. Since expand consists only of a single file, this distinction is not important in our case. Most linkers that combine compiled C files are rather primitive; variables that are visible to all program files (that is, not declared as static) can interact in surprising ways with variables with the same name defined in other files. It is therefore a good practice when inspecting code to ensure that all variables needed only in a single file are declared as static.
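The visibility rule can be illustrated with a small sketch. All names here (nhits, bump, record_hit, hit_count) are hypothetical; the assumption is a file that wants to export two functions and keep everything else private.

```c
/* Visible only inside this file: another file may define its own
 * nhits without any clash at link time. */
static int nhits;

/* Also file-local; code in other files cannot call it. */
static void
bump(void)
{
	nhits++;
}

/* External linkage: these two functions are the only names
 * this file makes available to the rest of the program. */
void
record_hit(void)
{
	bump();
}

int
hit_count(void)
{
	return nhits;
}
```

Had nhits been declared without static, a second file defining a variable of the same name could silently share or clash with it, which is the kind of surprising interaction described above.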
Let us now look at the functions comprising expand. To understand what a function (or method) is doing you can employ one of the following strategies.
Guess, based on the function name.
Read the comment at the beginning of the function.
Examine how the function is used.
Read the code in the function body.
Consult external program documentation.
In our case we can safely guess that the function usage will display program usage information and then exit; many command-line programs have a function with the same name and functionality. When you examine a large body of code, you will gradually pick up names and naming conventions for variables and functions. These will help you correctly guess what they do. However, you should always be prepared to revise your initial guesses following new evidence that your code reading will inevitably unravel. In addition, when modifying code based on guesswork, you should plan the process that will verify your initial hypotheses. This process can involve checks by the compiler, the introduction of assertions, or the execution of appropriate test cases.
Figure 2.3 Expanding tab stops (main part).
int
main(int argc, char *argv[])
{
int c, column;
int n;

while ((c = getopt (argc, argv, "t:")) != -1) {
switch (c) {
case 't':
getstops(optarg);
break;
case '?': default: <-- a
usage();
}
}
argc -= optind;
argv += optind;
do {

if (argc > 0) {
if (freopen(argv[0], "r", stdin) == NULL) {
perror(argv[0]);
exit(1);
}
argc--, argv++;
}

column = 0;
while ((c = getchar()) != EOF) {
switch (c) {
case '\t': <-- b
if (nstops == 0) {
do {
putchar(' ');
column++;
} while (column & 07);
continue;
}
if (nstops == 1) {
do {
putchar(' ');
column++;
} while (((column - 1) % tabstops[0]) != (tabstops[0] - 1));
continue;
}
for (n = 0; n < nstops; n++)
if (tabstops[n] > column)
break;
if (n == nstops) {
putchar(' ');
column++;
continue;
}
while (column < tabstops[n]) {
putchar(' ');
column++;
}
continue;
case '\b': <-- c
if (column)
column--;
putchar('\b');
continue;
default: <-- d
putchar(c);
column++;
continue;
case '\n': <-- e
putchar(c);
column = 0;
continue;
} <-- f
} <-- g
} while (argc > 0); <-- h
exit(0);
}
Variables local to main
Argument processing using getopt
Process the -t option
(a) Switch labels grouped together
End of switch block
At least once
(7) Process remaining arguments
Read characters until EOF
(b) Tab character
Process next character
(c) Backspace
(d) All other characters
(e) Newline
(f) End of switch block
(g) End of while block
(h) End of do block
The role of getstops is more difficult to understand. There is no comment, the code in the function body is not trivial, and its name can be interpreted in different ways. Noting that it is used in a single part of the program (Figure 2.3:3) can help us further. The program part where getstops is used is the part responsible for processing the program's options (Figure 2.3:2). We can therefore safely (and correctly in our case) assume that getstops will process the tab stop specification option. This form of gradual understanding is common when reading code; understanding one part of the code can make others fall into place. Based on this form of gradual understanding you can employ a strategy for understanding difficult code similar to the one often used to combine the pieces of a jigsaw puzzle: start with the easy parts.
Exercise 2.7 Examine the visibility of functions and variables in programs in your environment. Can it be improved (made more conservative)?
Exercise 2.8 Pick some functions or methods from the book's CD-ROM or from your environment and determine their role using the strategies we outlined. Try to minimize the time you spend on each function or method. Order the strategies by their success rate.

2.4 switch Statements
The normal return values of getopt are handled by a switch statement. You will find switch statements used when a number of discrete integer or character values are being processed. The code to handle each value is preceded by a case label. When the value of the expression in the switch statement matches the value of one of the case labels, the program will start to execute statements from that point onward. If none of the label values match the expression value and a default label exists, control will transfer to that point; otherwise, no code within the switch block will get executed. Note that additional labels encountered after transferring execution control to a label will not terminate the execution of statements within the switch block; to stop processing code within the switch block and continue with statements outside it, a break statement must be executed. You will often see this feature used to group case labels together, merging common code elements. In our case, when getopt returns 't', the statements that handle -t are executed, with break causing a transfer of execution control immediately after the closing brace of the switch block (Figure 2.3:4). In addition, we can see that the code for the default switch label and the error return value '?' is common since the two corresponding labels are grouped together.
When the code for a given case or default label does not end with a statement that transfers control out of the switch block (such as break, return, or continue), the program will continue to execute the statements following the next label. When examining code, look out for this error. In rare cases the programmer might actually want this behavior. To alert maintainers to that fact, it is common to mark these places with a comment, such as FALLTHROUGH, as in the following example.[17]

[17] netbsdsrc/bin/ls/ls.c:173–178
case 'a':
fts_options |= FTS_SEEDOT;
/* FALLTHROUGH */
case 'A':
f_listdot = 1;
break;

The code above comes from the option processing of the Unix ls command, which lists files in a directory. The option -A will include in the list files starting with a dot (which are, by convention, hidden), while the option -a modifies this behavior by adding to the list the two directory entries . and .. as well. Programs that automatically verify source code against common errors, such as the Unix lint command, can use the FALLTHROUGH comment to suppress spurious warnings.
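The following sketch puts grouped labels and a deliberate fall-through in one compilable function. It is only loosely modeled on the ls excerpt: the structure listing_opts and the function apply_option are hypothetical names introduced for illustration.

```c
#include <stdbool.h>

/* Hypothetical option flags, loosely modeled on the ls excerpt. */
struct listing_opts {
	bool show_dot_files;	/* list names starting with '.' */
	bool show_dot_dirs;	/* also list "." and ".." */
};

/* Apply one option letter; return false for an unknown letter. */
bool
apply_option(struct listing_opts *o, int c)
{
	switch (c) {
	case 'a':
		o->show_dot_dirs = true;
		/* FALLTHROUGH */
	case 'A':
		o->show_dot_files = true;
		break;		/* continue after the switch block */
	case '?': default:	/* labels grouped: same code for both */
		return false;
	}
	return true;
}
```

Passing 'a' sets both flags because execution runs from the first label into the second; passing 'A' sets only show_dot_files.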
A switch statement lacking a default label will silently ignore unexpected values. Even when one knows that only a fixed set of values will be processed by a switch statement, it is good defensive programming practice to include a default label. Such a default label can catch programming errors that yield unexpected values and alert the program maintainer, as in the following example.[18]

[18] netbsdsrc/usr.bin/at/at.c:535–561
switch (program) {
case ATQ:
[...]
case BATCH:
writefile(time(NULL), 'b');
break;
default:
panic("Internal error");
break;
}
In our case the switch statement can handle two getopt return values.
't' is returned to handle the -t option. The variable optarg will point to the argument of -t. The processing is handled by calling the function getstops with the tab specification as its argument.
'?' is returned when an unknown option or another error is found by getopt. In that case the usage function will print program usage information and exit the program.
A switch statement is also used as part of the program's character-processing loop (Figure 2.3:7). Each character is examined and some characters (the tab, the newline, and the backspace) receive special processing.
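To make the tab case of that loop concrete, the sketch below isolates the default behavior (no -t option, tab stops every eight columns). The function names are hypothetical; the first function mirrors the do ... while (column & 07) loop of Figure 2.3 but counts the spaces instead of printing them, and the second gives an equivalent closed form, assuming a nonnegative column.

```c
/* Number of spaces emitted for a tab encountered at the given
 * column when tab stops are set every eight columns. */
int
spaces_for_default_tab(int column)
{
	int n = 0;

	do {
		n++;		/* stands in for putchar(' ') */
		column++;
	} while (column & 07);	/* stop once column is a multiple of 8 */
	return n;
}

/* Closed form of the same computation (column must be >= 0). */
int
spaces_closed_form(int column)
{
	return 8 - (column % 8);
}
```

Because the body of a do loop always runs at least once, a tab at a column that is already a multiple of eight still advances by a full eight spaces.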
Exercise 2.13 The code body of switch statements in the source code collection is formatted differently from the other statements. Express the formatting rule used, and explain its rationale.
Exercise 2.14 Examine the handling of unexpected values in switch statements in the programs you read. Propose changes to detect errors. Discuss how these changes will affect the robustness of programs in a production environment.
Exercise 2.15 Is there a tool or a compiler option in your environment for detecting missing break statements in switch code? Use it, and examine the results on some sample programs.
2.5 for Loops
To complete our understanding of how expand processes its command-line options, we now need to examine the getstops function. Although the role of its single cp argument is not obvious from its name, it becomes apparent when we examine how getstops is used. getstops is passed the argument of the -t option, which is a list of tab stops, for example, 4, 8, 16, 24. The strategies outlined for determining the roles of functions (Section 2.2) can also be employed for their arguments. Thus a pattern for reading code slowly emerges. Code reading involves many alternative strategies: bottom-up and top-down examination, the use of heuristics, and review of comments and external documentation should all be tried as the problem dictates.

After setting nstops to 0, getstops enters a for loop. Typically a for loop is specified by an expression to be evaluated before the loop starts, an expression to be evaluated before each iteration to determine if the loop body will be entered, and an expression to be evaluated after the execution of the loop body. for loops are often used to execute a body of code a specific number of times.[19]

[19] cocoon/src/java/org/apache/cocoon/util/StringUtils.java:85
for (i = 0; i < len; i++) {
Loops of this type appear very frequently in programs; learn to read them as "execute the body of code len times." On the other hand, any deviation from this style, such as an initial value other than 0 or a comparison operator other than <, should alert you to carefully reason about the loop's behavior. Consider the number of times the loop body is executed in the following examples.
Loop extrknt + 1 times:[20]

[20] netbsdsrc/usr.bin/fsplit/fsplit.c:173
for (i = 0; i <= extrknt; i++)
Loop month - 1 times:[21]

[21] netbsdsrc/usr.bin/cal/cal.c:332
for (i = 1; i < month; i++)
Loop nargs times:[22]

[22] netbsdsrc/usr.bin/apply/apply.c:130
for (i = 1; i <= nargs; i++)
Note that the last expression need not be an increment operator. The following line will loop 256 times, decrementing code in the process:[23]

[23] netbsdsrc/usr.bin/compress/zopen.c:510
for (code = 255; code >= 0; code--) {
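One way to check the iteration counts claimed above is to count the executions of the loop body directly. The functions below are a hypothetical test aid, not code from the cited programs; the counts for the loops starting at 1 assume a positive bound.

```c
/* Each function returns the number of times its loop body executes.
 * The counter n exists only to make that number observable. */
int
runs_from0_lt(int len)		/* for (i = 0; i < len; i++) */
{
	int i, n = 0;

	for (i = 0; i < len; i++)
		n++;
	return n;
}

int
runs_from0_le(int len)		/* for (i = 0; i <= len; i++) */
{
	int i, n = 0;

	for (i = 0; i <= len; i++)
		n++;
	return n;
}

int
runs_from1_lt(int len)		/* for (i = 1; i < len; i++) */
{
	int i, n = 0;

	for (i = 1; i < len; i++)
		n++;
	return n;
}

int
runs_from1_le(int len)		/* for (i = 1; i <= len; i++) */
{
	int i, n = 0;

	for (i = 1; i <= len; i++)
		n++;
	return n;
}

int
runs_down_to0(int start)	/* for (i = start; i >= 0; i--) */
{
	int i, n = 0;

	for (i = start; i >= 0; i--)
		n++;
	return n;
}
```

With a bound of 10 the first four functions return 10, 11, 9, and 10 respectively, and runs_down_to0(255) returns 256.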
In addition, you will find for statements used to loop over result sets returned by library functions. The following loop is performed for all files in the directory dir.[24]

[24] netbsdsrc/usr.bin/ftp/complete.c:193–198
if ((dd = opendir(dir)) == NULL)
return (CC_ERROR);
for (dp = readdir(dd); dp != NULL; dp = readdir(dd)) {

The call to opendir returns a value that can be passed to readdir to sequentially access each directory entry of dir. When there are no more entries in the directory, readdir will return NULL and the loop will terminate.
The three parts of the for specification are expressions and not statements. Therefore, if more than one operation needs to be performed when the loop begins or at the end of each iteration, the expressions cannot be grouped together using braces. You will, however, often find expressions grouped together using the expression-sequencing comma (,) operator.[25]

[25] netbsdsrc/usr.bin/vi/vi/vs_smap.c:389
for (cnt = 1, t = p; cnt <= cnt_orig; ++t, ++cnt) {
The value of two expressions joined with the comma operator is just the value of the second expression. In our case the expressions are evaluated only for their side effects: before the loop starts, cnt will be set to 1 and t to p, and after every loop iteration t and cnt will be incremented by one.
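A self-contained sketch of the same idiom follows. The function reverse is a hypothetical example, unrelated to the vi source; it uses the comma operator to initialize and to advance two indices in a single for statement.

```c
#include <string.h>

/* Reverse the string s in place. Two indices are set up before the
 * loop and moved after each iteration by means of the comma operator. */
void
reverse(char *s)
{
	size_t i, j;
	char tmp;

	if (*s == '\0')
		return;		/* keep strlen(s) - 1 from wrapping around */
	for (i = 0, j = strlen(s) - 1; i < j; i++, j--) {
		tmp = s[i];
		s[i] = s[j];
		s[j] = tmp;
	}
}
```

As in the excerpt above, the values of the comma expressions are discarded; only their side effects on i and j matter.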
Any expression of a for statement can be omitted. When the second expression is missing, it is taken as true. Many programs use a statement of the form for (;;) to perform an "infinite" loop. Very seldom are such loops really infinite. The following example—taken out of init, the program that continuously loops, controlling all Unix processes—is an exception.[26]

[26] netbsdsrc/sbin/init/init.c:540–545

Figure 2.5 Expanding tab stops (supplementary functions).

static void
getstops(char *cp)
{
int i;

nstops = 0;
for (;;) {
i = 0;
while (*cp >= '0' && *cp <= '9')
i = i * 10 + *cp++ - '0';
if (i <= 0 || i > 256) {
bad:
fprintf(stderr, "Bad tab stop spec\n");
exit(1);
}
if (nstops > 0 && i <= tabstops[nstops-1])
goto bad;
tabstops[nstops++] = i;
if (*cp == 0)
break;
if (*cp != ',' && *cp != ' ')
goto bad;
cp++;
}
}

static void
usage(void)
{
(void)fprintf (stderr, "usage: expand [-t tablist] [file ...]\n");
exit(1);
}
Parse tab stop specification
Convert string to number
Complain about unreasonable specifications
Verify ascending order
Break out of the loop
Verify valid delimiters
break will transfer control here
Print program usage and exit
for (;;) {
s = (state_t) (*s)();
quiet = 0;
}
In most cases an "infinite" loop is a way to express a loop whose exit condition(s) cannot be specified at its beginning or its end. These loops are typically exited either by a return statement that exits the function, a break statement that exits the loop body, or a call to exit or a similar function that exits the entire program. C++, C#, and Java programs can also exit such loops through an exception (see Section 5.2).
A quick look through the code of the loop in Figure 2.5 provides us with the possible exit routes.
A bad stop specification will cause the program to terminate with an error message (Figure 2.5:3).
The end of the tab specification string will break out of the loop.
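The exit routes become easier to see in a variant that reports failure through its return value instead of calling exit. The sketch below is an illustration under assumptions, not the NetBSD code: the name parse_stops, its parameters, the early check for values above 256, and the check against the array capacity are all additions made for this example.

```c
/* Parse a tab stop specification such as "4,8,16" into stops[],
 * which has room for max entries. Returns the number of stops
 * stored, or -1 on a bad specification. (Hypothetical interface.) */
int
parse_stops(const char *cp, int *stops, int max)
{
	int n = 0;
	int i;

	for (;;) {
		i = 0;
		while (*cp >= '0' && *cp <= '9') {
			i = i * 10 + *cp++ - '0';
			if (i > 256)
				return -1;	/* exit: value too large */
		}
		if (i <= 0)
			return -1;		/* exit: missing or zero value */
		if (n > 0 && i <= stops[n - 1])
			return -1;		/* exit: not in ascending order */
		if (n == max)
			return -1;		/* exit: no room left in stops[] */
		stops[n++] = i;
		if (*cp == '\0')
			break;			/* exit: end of the specification */
		if (*cp != ',' && *cp != ' ')
			return -1;		/* exit: invalid delimiter */
		cp++;
	}
	return n;				/* break transfers control here */
}
```

Every return inside the loop corresponds to the "bad specification" route, and the single break corresponds to reaching the end of the string.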
Exercise 2.16 The for statement in the C language family is very flexible. Examine the source code provided to create a list of ten different uses.
Exercise 2.17 Express the examples in this section using while instead of for. Which of the two forms do you find more readable?
Exercise 2.18 Devise a style guideline specifying when while loops should be used in preference to for loops. Verify the guideline against representative examples from the book's CD-ROM.
2.6 break and continue Statements
A break statement will transfer the execution to the statement after the innermost loop or switch statement (Figure 2.5:7). In most cases you will find break used to exit early out of a loop. A continue statement will continue the iteration of the innermost loop without executing the statements to the end of the loop. A continue statement will reevaluate the conditional expression of while and do loops. In for loops it will evaluate the third expression and then the conditional expression. You will find continue used where a loop body is split to process different cases; each case typically ends with a continue statement to cause the next loop iteration. In the program we are examining, continue is used after processing each different input character class (Figure 2.3:8).
Note when you are reading Perl code that break and continue are correspondingly named last and next.[27]

[27] perl/lib/unicode/mktables.PL:415–425
while () {
chomp;
if (s/0x[\d\w]+\s+\((.*?)\)// and $wanted eq $1) {
[...]
last;
}
}

To determine the effect of a break statement, start reading the program upward from break until you encounter the first while, for, do, or switch block that encloses the break statement. Locate the first statement after that loop; this will be the place where control will transfer when break is executed. Similarly, when examining code that contains a continue statement, start reading the program upward from continue until you encounter the first while, for, or do loop that encloses the continue statement. Locate the last statement of that loop; immediately after it (but not outside the loop) will be the place where control will transfer when continue is executed. Note that continue ignores switch statements and that neither break nor continue affects the operation of if statements.
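The difference between the two statements is easiest to see when a switch is nested inside a loop, as in the character-processing loop of expand. The function below is a hypothetical sketch written for this purpose; it counts the nonblank characters on the first line of a string.

```c
/* Count the characters on the first line of s, skipping blanks
 * and tabs. Shows where break and continue transfer control when
 * a switch statement is nested inside a for loop. */
int
count_first_line(const char *s)
{
	int n = 0;

	for (; *s != '\0'; s++) {
		switch (*s) {
		case ' ':
		case '\t':
			continue;	/* ignores the switch: goes to s++ */
		case '\n':
			break;		/* leaves only the switch block */
		default:
			n++;
			continue;	/* next character */
		}
		/* Reached only through the break in the '\n' case. */
		break;			/* this break leaves the for loop */
	}
	return n;
}
```

Reading upward from each statement, as described above, shows that the first break is enclosed by the switch, whereas both continue statements and the final break are governed by the for loop.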

There are situations where a loop is executed only for the side effects of its controlling expressions. In such cases continue is sometimes used as a placeholder instead of the empty statement (expressed by a single semicolon). The following example illustrates such a case.[28]

[28] netbsdsrc/usr.bin/error/pi.c:174–175
for (; *string && isdigit(*string); string++)
continue;
In Java programs break and continue can be followed by a label identifier. The same identifier, followed by a colon, is also used to label a loop statement. The labeled form of the continue statement is then used to skip an iteration of a nested loop; the label identifies the loop statement that the corresponding continue will skip. Thus, in the following example, the continue skip; statement will skip one iteration of the outermost for statement.[29]

[29] jt4/jasper/src/share/org/apache/jasper/compiler/JspReader.java:472–482
skip:
for ( [...]) {
if ( ch == limit.charAt(0)) {
for (int i = 1 ; i < limlen ; i++) {
if ( [...] )
continue skip;
}
return ret;
}
}

Similarly, the labeled form of the break statement is used to exit from nested loops; the label identifies the statement that the corresponding break will terminate. In some cases a labeled break or continue statement is used, even when there are no nested loops, to clarify the corresponding loop statement.[30]

[30] cocoon/src/scratchpad/src/org/apache/cocoon/treeprocessor/MapStackResolver.java:201–244
comp : while(prev < length) {
[...]
if (pos >= length || pos == -1) {
[...]
break comp;
}
}
Exercise 2.19 Locate ten occurrences of break and continue in the source code provided with the book. For each case indicate the point where execution will transfer after the corresponding statement is executed, and explain why the statement is used. Do not try to understand in full the logic of the code; simply provide an explanation based on the statement's use pattern.
