Thursday, December 27, 2007

Boolean Expressions , goto Statements

2.7 Character and Boolean Expressions
The body of the for loop in the getstops function starts with a block of code that can appear cryptic at first sight (Figure 2.5:2). To understand it we need to dissect the expressions that comprise it. The first, the condition in the while loop, is comparing *cp (the character cp points to) against two characters: ´0´ and ´9´. Both comparisons must be true and both of them involve *cp combined with a different inequality operator and another expression. Such a test can often be better understood by rewriting the comparisons to bring the value being compared in the middle of the expression and to arrange the other two values in ascending order. This rewriting in our case would yield
while ( '\'0' <= *cp && *cp <= '9')
This can then be read as a simple range membership test for a character c.
0 c 9
Note that this test assumes that the digit characters are arranged sequentially in ascending order in the underlying character set. While this is true for the digits in all character sets we know, comparisons involving alphabetical characters may yield surprising results in a number of character sets and locales. Consider the following typical example.[31]

[31] netbsdsrc/games/hack/hack.objnam.c:352–253
if ('a' <= *s && *s <= 'z')
*s -= ( 'a' - 'A');
the code attempts to convert lowercase characters to uppercase by subtracting from each character found to be lowercase (as determined by the if test) the character set distance from ´a´ to ´A´. This fragment will fail to work when there are lowercase characters in character set positions outside the range a...z, when the character set range a...z contains non lowercase characters, and when the code of each lower case character is not a fixed distance away from the corresponding uppercase character. Many non-ASCII character sets exhibit at least one of these problems.
The next line in the block (Figure 2.5:2) can also appear daunting. It modifies the variable i based on the values of i and *cp and two constants: 10 and ´0´ while at the same time incrementing cp. The variable names are not especially meaningful, and the program author has not used macro or constant definitions to document the constants; we have to make the best of the information available.
We can often understand the meaning of an expression by applying it on sample data. In our case we can work based on the initial value of i (0) and assume that cp points to a string containing a number (for example, 24) based on our knowledge of the formatting specifications that expand accepts. We can then create a table containing the values of all variables and expression parts as each expression part is evaluated. We use the notation i' and *cp' to denote the variable value after the expression has been evaluated.

Iteration
i
i*10
*cp
*cp-'0'
i'
*cp'

0
0
0
'2'
2
2
'4'

1
2
20
'4'
4
24
0
The expression*cp - '0' uses a common idiom: by subtracting the ordinal value of '0' from *cp the expression yields the integer value of the character digit pointed to by *cp. Based on the table we can now see a picture emerging: after the loop terminates, i will contain the decimal value of the numeric string pointed to by cp at the beginning of the loop.
Armed with the knowledge of what i stands for (the integer value of a tab-stop specification), we can now examine the lines that verify the specification. The expression that verifies i for reasonable values is straightforward. It is a Boolean expression based on the logical OR (||) of two other expressions. Although this particular expression reads naturally as English text (print an error message if i is either less than or equal to zero, or greater than 255), it is sometimes useful to transform Boolean expressions to a more readable form. If, for example, we wanted to translate the expression into the range membership expression we used above, we would need to substitute the logical OR with a logical AND (&&). This can easily be accomplished by using De Morgan's rules.[32]

[32] We use the operator <=> to denote that two expressions are equivalent. This is not a C/C++/C#/Java operator.
!(a || b) <=> !a && !b
!(a && b) <=> !a || !b
We can thus transform the testing code as follows:
i <= 0 || i > 256 <=>
!(!(i <= 0) && !(i > 256)) <=>
!(i > 0 && i <= 256) <=>
!(0 < i && i <= 256) <=>
¬(0 < i 256)+

In our example we find both the initial and final expressions equally readable; in other cases you may find that De Morgan's rules provide you a quick and easy way to disentangle a complicated logical expression.
When reading Boolean expressions, keep in mind that in many modern languages Boolean expressions are evaluated only to the extent needed. In a sequence of expressions joined with the && operator (a conjunction), the first expression to evaluate to false will terminate the evaluation of the whole expression yielding a false result. Similarly, in a sequence of expressions joined with the || operator (a disjunction), the first expression to evaluate to true will terminate the evaluation of the whole expression yielding a true result. Many expressions are written based on this short-circuit evaluation property and should be read in the same way. When reading a conjunction, you can always assume that the expressions on the left of the expression you are examining are true; when reading a disjunction, you can similarly assume that the expressions on the left of the expression you are examining are false. As an example, the expression in the following if statement will become true only when all its constituent elements are true, and t->type will be evaluated only when t is not NULL.[33]

[33] netbsdsrc/bin/ksh/main.c:606–607
if (t != NULL && t->type != TEOF && interactive && really–exit)
really–exit = 0;

Conversely, in the following example argv[1] will be checked for being NULL only if argv is not NULL .[34]

[34] netbsdsrc/lib/libedit/term.c:1212–1213
if (argv == NULL || argv[1] == NULL || argv[2] == NULL)
return -1;
In both cases, the first check guards against the subsequent dereference of a NULL pointer. Our getstops function also uses short-circuit evaluation when checking that a delimiter specified (i) is larger than the previous one (tabstops[nstops-1]) (Figure 2.5:4). This test will be performed only if at least one additional delimiter specification has been processed (nstops > 0). You can depend on the short-circuit evaluation property in most C-derived languages such as C++, Perl, and Java; on the other hand, Fortran, Pascal, and most Basic dialects will always evaluate all elements of a Boolean expression.
Exercise 2.20 Locate expressions containing questionable assumptions about character code values in the book's CD-ROM. Read about the Java Character class test and conversion methods such as isUpper and toLowerCase or the corresponding ctype family of C functions (isupper, tolower , and so on). Propose changes to make the code less dependent on the target architecture character set.
Exercise 2.21 Find, simplify, and reason about five nontrivial Boolean expressions in the source code base. Do not spend time on understanding what the expression elements mean; concentrate on the conditions that will make the expression become true or false. Where possible, identify and use the properties of short-circuit evaluation.
Exercise 2.22 Locate and reason about five nontrivial integer or character expressions in the source code base. Try to minimize the amount of code you need to comprehend in order to reason about each expression.
Figure 2.6 The goto statement used for a common error handler.
static int
gen_init(void)
{
[...]
if ((sigaction(SIGXCPU, &n_hand, &o_hand) < 0) &&
(o_hand.sa_handler == SIG_IGN) &&
(sigaction(SIGXCPU, &o_hand, &o_hand) < 0))
goto out;
n_hand.sa_handler = SIG_IGN;
if ((sigaction(SIGPIPE, &n_hand, &o_hand) < 0) ||
(sigaction(SIGXFSZ, &n_hand, &o_hand) < 0))
goto out;
return(0);
out:
syswarn(1, errno, "Unable to set up signal handler");
return(-1);
}

Failure; exit with an error
Failure; exit with an error
Normal function exit (success)
Common error handling code
2.8 goto Statements
The code segment that complains about unreasonable tab specifications (Figure 2.5:3) begins with a word followed by a colon. This is a label: the target of a goto instruction. Labels and goto statements should immediately raise your defenses when reading code. They can be easily abused to create "spaghetti" code: code with a flow of control that is difficult to follow and figure out. Therefore, the designers of Java decided not to support the goto statement. Fortunately, most modern programs use the goto statement in a small number of specific circumstances that do not adversely affect the program's structure.
You will find goto often used to exit a program or a function after performing some actions (such as printing an error message or freeing allocated resources). In our example the exit(1) call at the end of the block will terminate the program, returning an error code (1) to the system shell. Therefore all goto statements leading to the bad label are simply a shortcut for terminating the program after printing the error message. In a similar manner, the listing in Figure 2.6[35] illustrates how a common error handler (Figure 2.6:4) is used as a common exit point in all places where an error is found (Figure 2.6:1, Figure 2.6:2). A normal exit route for the function, located before the error handler (Figure 2.6:3), ensures that the handler will not get called when no error occurs.

[35] netbsdsrc/bin/pax/pax.c:309–412
Figure 2.7 The use of goto to reexecute code.
again:
if ((p = fgets(line, BUFSIZ, servf)) == NULL) <-- a
return (NULL);

if (*p == '#') <-- b
goto again;
cp = strpbrk(p, "#\n");
if (cp == NULL) <-- c
goto again;
*cp = '\0'; <-- d
[...]
return (&serv);
(a) Read a line; return on EOF
(b) Comment? Retry
(c) Incomplete line? Retry
(d) Complete entry
You will also find the goto statement often used to reexecute a portion of code, presumably after some variables have changed value or some processing has been performed. Although such a construct can often be coded by using a structured loop statement (for example, for (;;)) together with break and continue, in practice the coder's intent is sometimes better communicated by using goto. A single label, almost invariably named again or retry, is used as the goto target. The example in Figure 2.7,[36] which locates the entry of a specific service in the system's database while ignoring comments and overly large lines, is a typical case. (Interestingly, the code example also seems to contain a bug. If a partial line is read, it continues by reading the remainder as if it were a fresh line, so that if the tail of a long line happened to look like a service definition it would be used. Such oversights are common targets for computer security exploits.)

[36] netbsdsrc/lib/libc/net/getservent.c:65–104
Finally, you will find the goto statement used to change the flow of control in nested loop and switch statements instead of using break and continue, which affect only the control flow in the innermost loop. Sometimes goto is used even if the nesting level would allow the use of a break or continue statement. This is used in large, complex loops to clarify where the flow of control will go and to avoid the possibility of errors should a nested loop be added around a particular break or continue statement. In the example in Figure 2.8[37] the statement goto have–msg is used instead of break to exit the for loop.

[37] netbsdsrc/sys/dev/ic/ncr5380sbc.c:1575–1654
Exercise 2.23 Locate five instances of code that use the goto statement in the code base. Categorize its use (try to locate at least one instance for every one of the possible uses we outlined), and argue whether each particular goto could and should be replaced with a loop or other statement.

Figure 2.8 Exiting a loop using the goto statement.
<-- a
for (;;) {
[...]
if ((sc->sc_state & NCR_DROP_MSGIN) == 0) {
if (n >= NCR_MAX_MSG_LEN) {
ncr_sched_msgout(sc, SEND_REJECT);
sc->sc_state |= NCR_DROP_MSGIN;
} else {
[...]
if (n == 1 && IS1BYTEMSG(sc->sc_imess[0]))
goto have_msg; <-- b
[...]
}
}
[...]
}

have_msg: <-- c
(a) for loop
(b) Exit the for loop
(c) goto target
Exercise 2.24 The function getstops produces the same error message for a number of different errors. Describe how you could make its error reporting more user-friendly while at the same time eliminating the use of the goto statement. Discuss when such source code changes are appropriate and when they should be avoided.

2.9 Refactoring in the Small
The rest of the getstops code is relatively straightforward. After checking that each tab stop is greater than the previous one (Figure 2.5:4), the tab stop offset is stored in the tabstops array. After a single tab stop number has been converted into an integer (Figure 2.5:2), cp will point to the first nondigit character in the string (that is, the loop will process all digits and terminate at the first nondigit). At that point, a series of checks specified by if statements control the program's operation. If cp points to the end of the tab stop specification string (the character with the value 0, which signifies the end of a C string), then the loop will terminate (Figure 2.5:5). The last if (Figure 2.5:6) will check for invalid delimiters and terminate the program operation (using the goto bad statement) if one is found.
The body of each one of the if statements will transfer control somewhere else via a goto or break statement. Therefore, we can also read the sequence as:

if (*cp == 0)
break;
else if (*cp != ',' && *cp != ' ')
goto bad;
else
cp++;

This change highlights the fact that only one of the three statements will ever get executed and makes the code easier to read and reason about. If you have control over a body of code (that is, it is not supplied or maintained by an outside vendor or an open-source group), you can profit by reorganizing code sections to make them more readable. This improvement of the code's design after it has been written is termed refactoring. Start with small changes such as the one we outlined—you can find more than 70 types of refactoring changes described in the relevant literature. Modest changes add up and often expose larger possible improvements.
As a further example, consider the following one-line gem.[38]
[38] netbsdsrc/games/worms/worms.c:419
op = &(!x ? (!y ? upleft : (y == bottom ? lowleft : left)) :
(x == last ? (!y ? upright : (y == bottom ? lowright : right)) :
(!y ? upper : (y == bottom ? lower : normal))))[w->orientation];
The code makes excessive use of the conditional operator ?:. Read expressions using the conditional operator like if code. As an example, read the expression[39]
[39] netbsdsrc/bin/csh/set.c:852
sign ? -n : n
as follows:
"If sign is true, then the value of the expression is -n; otherwise, the value of the expression is n".
Since we read an expression like an if statement, we can also format it like an if statement; one that uses x ? instead of if (x), parentheses instead of curly braces, and : instead of else. To reformat the expression, we used the indenting features of our editor in conjunction with its ability to show matching parentheses. You can see the result in Figure 2.9 (left).
Reading the conditional expression in its expanded form is certainly easier, but there is still room for improvement. At this point we can discern that the x and y variables that control the expression evaluation are tested for three different values:
0 (expressed as !x or !y)
bottom or last
All other values
Figure 2.9. A conditional expression formatted like an if statement (left) and like cascading if–else statements (right). op = &(
!x ? (
!y ?
upleft
: (
y == bottom ?
lowleft
:
left
)
) : (
x == last ? (
!y ?
upright
: (
y == bottom ?
lowright
:
right
)
) : (
!y ?
upper
: (
y == bottom ?
lower
:
normal
)
)
))[w->orientation];
op = &(
!x ? (
!y ?
upleft
: ( y == bottom ?
lowleft
:
left
)
) : ( x == last ? (
!y ?
upright
: ( y == bottom ?
lowright
:
right
)
) : (
!y ?
upper
: ( y == bottom ?
lower
:
normal
)
)
))[w->orientation];

We can therefore rewrite the expression formatted as a series of cascading if–else statements (expressed using the ?: operator) to demonstrate this fact. You can see the result in Figure 2.9 (right).
The expression's intent now becomes clear: the programmer is selecting one of nine different location values based on the combined values of x and y. Both alternative formulations, however, visually emphasize the punctuation at the expense of the semantic content and use an inordinate amount of vertical space. Nevertheless, based on our newly acquired insight, we can create a two-dimensional array containing these location values and index it using offsets we derive from the x and y values. You can see the new result in Figure 2.10. Notice how in the initialization of the array named locations, we use a two-dimensional textual structure to illustrate the two-dimensional nature of the computation being performed. The initializers are laid out two-dimensionally in the program text, the array is indexed in the normally unconventional order [y][x], and the mapping is to integers "0, 2, 1" rather than the more obvious "0, 1, 2", so as to make the two-dimensional presentation coincide with the semantic meanings of the words upleft, upper, and so on.
Figure 2.10 Location detection code replacing the conditional expression.
struct options *locations[3][3] = {
<-- a
{upleft, upper, upright},
{left, normal, right},
{lowleft, lower, lowright},
};
int xlocation, ylocation; <-- b
<-- c
if (x == 0)
xlocation = 0;
else if (x == last)
xlocation = 2;
else
xlocation = 1;
<-- d
if (y == 0)
ylocation = 0;
else if (y == bottom)
ylocation = 2;
else
ylocation = 1;
op = &(locations[ylocation][xlocation])[w->orientation];
(a) Location map
(b) To store the x, y map offsets
(c) Determine x offset
(d) Determine y offset
The code, at 20 lines, is longer than the original one-liner but still shorter by 7 lines from the one-liner's readable cascading-else representation. In our eyes it appears more readable, self-documenting, and easier to verify. One could argue that the original version would execute faster than the new one. This is based on the fallacy that code readability and efficiency are somehow incompatible. There is no need to sacrifice code readability for efficiency. While it is true that efficient algorithms and certain optimizations can make the code more complicated and therefore more difficult to follow, this does not mean that making the code compact and unreadable will make it more efficient. On our system and compiler the initial and final versions of the code execute at exactly the same speed: 0.6 ms. Even if there were speed differences, the economics behind software maintenance costs, programmer salaries, and CPU performance most of the time favor code readability over efficiency.
However, even the code in Figure 2.10 can be considered a mixed blessing: it achieves its advantages at the expense of two distinct disadvantages. First, it separates the code into two chunks that, while shown together in Figure 2.10, would necessarily be separated in real code. Second, it introduces an extra encoding (0, 1, 2), so that understanding what the code is doing requires two mental steps rather than one (map "0, last, other" to "0, 2, 1" and then map a pair of "0, 2, 1" values to one of nine items). Could we somehow directly introduce the two-dimensional structure of our computation into the conditional code? The following code fragment[40] reverts to conditional expressions but has them carefully laid out to express the computation's intent.

[40] Suggested by Guy Steele.
op =
&( !y ? (!x ? upleft : x!=last ? upper : upright) :
y!=bottom ? (!x ? left : x!=last ? normal : right) :
(!x ? lowleft : x!=last ? lower : lowright)
)[w->orientation];

The above formulation is a prime example on how sometimes creative code layout can be used to improve code readability. Note that the nine values are right-justified within their three columns, to make them stand out visually and to exploit the repetition of "left" and "right" in their names. Note also that the usual practice of putting spaces around operators is eschewed for the case of != in order to reduce the test expressions to single visual tokens, making the nine data values stand out more. Finally, the fact that the whole expression fits in five lines makes the vertical alignment of the first and last parentheses more effective, making it much easier to see that the basic structure of the entire statement is of the form
op = &( )[w->orientation];
The choice between the two new alternative representations is largely a matter of taste; however, we probably would not have come up with the second formulation without expressing the code in the initial, more verbose and explicit form.
The expression we rewrote was extremely large and obviously unreadable. Less extreme cases can also benefit from some rewriting. Often you can make an expression more readable by adding whitespace, by breaking it up into smaller parts by means of temporary variables, or by using parentheses to amplify the precedence of certain operators.
You do not always need to change the program structure to make it more readable. Often items that do not affect the program's operation (such as comments, the use of whitespace, and the choice of variable, function, and class names) can affect the program's readability. Consider the work we did to understand the code for the getstops function. A concise comment before the function definition would enhance the program's future readability.

/*
* Parse and verify the tab stop specification pointed to by cp
* setting the global variables nstops and tabstops[].
* Exit the program with an error message on bad specifications.
*/

When reading code under your control, make it a habit to add comments as needed.
In Sections 2.2 and 2.3 we explained how names and indentation can provide hints for understanding code functionality. Unfortunately, sometimes programmers choose unhelpful names and indent their programs inconsistently. You can improve the readability of poorly written code with better indentation and wise choice of variable names. These measures are extreme: apply them only when you have full responsibility and control over the source code, you are sure that your changes are a lot better than the original code, and you can revert to the original code if something goes wrong. Using a version management system such as the Revision Control System (RCS), the Source Code Control System (SCCS), the Concurrent Versions System (CVS), or Microsoft's Visual SourceSafe can help you control the code modifications. The adoption of a specific style for variable names and indentation can appear a tedious task. When modifying code that is part of a larger body to make it more readable, try to understand and follow the conventions of the rest of the code (see Chapter 7). Many organizations have a specific coding style; learn it and try to follow it. Otherwise, adopt one standard style (such as one of those used by the GNU[41] or BSD[42] groups) and use it consistently. When the code indentation is truly inconsistent and cannot be manually salvaged, a number of tools (such as indent) can help you automatically reindent it to make it more readable (see Section 10.7). Use such tools with care: the judicious use of whitespace allows programmers to provide visual clues that are beyond the abilities of automated formatting tools. Applying indent to the code example in Figure 2.10 would definitely make it less readable.
[41] http://www.gnu.org/prep/standardstoc.html
[42] netbsdsrc/share/misc/style:1–315
Keep in mind that although reindenting code may help readability, it also messes up the program's change history in the revision control system. For this reason it is probably best not to combine the reformatting with any actual changes to the program's logic. Do the reformat, check it in, and then make the other changes. In this way future code readers will be able to selectively retrieve and review your changes to the program's logic without getting overwhelmed by the global formatting changes. On the flip side of the coin, when you are examining a program revision history that spans a global reindentation exercise using the diff program, you can often avoid the noise introduced by the changed indentation levels by specifying the -w option to have diff ignore whitespace differences.
Exercise 2.25 Provide five examples from your environment or the book's CD-ROM where the code structure can be improved to make it more readable.
Exercise 2.26 You can find tens of intentionally unreadable C programs at the International Obfuscated C Code Contest Web site.[43] Most of them use several layers of obfuscation to hide their algorithms. See how gradual code changes can help you untangle their code. If you are not familiar with the C preprocessor, try to avoid programs with a large number of #define lines.

[43] http://www.ioccc.org
Exercise 2.27 Modify the position location code we examined to work on the mirror image of a board (interchange the right and left sides). Time yourself in modifying the original code and the final version listed in Figure 2.10. Do not look at the readable representations; if you find them useful, create them from scratch. Calculate the cost difference assuming current programmer salary rates (do not forget to add overheads). If the readable code runs at half the speed of the original code (it does not), calculate the cost of this slowdown by making reasonable assumptions concerning the number of times the code will get executed over the lifetime of a computer bought at a given price.
Exercise 2.28 If you are not familiar with a specific coding standard, locate one and adopt it. Verify local code against the coding standard.

No comments: