3.1 Pointers
Most people learning C initially view pointers with a mixture of awe and fear. It is true that pointer-based code is sometimes inscrutable and often a source of subtle errors and unsightly crashes. The main reason for this is that a pointer is a low-level, powerful, yet blunt tool. Learn to distinguish its uses and common coding patterns will emerge; view the remaining few code instances you cannot classify with suspicion.
You will find pointers used in C programs in the following ways:
To construct linked data structures
To reference dynamically allocated data structures
To implement call by reference
To access data elements and iterate through them
When passing arrays as arguments
For referring to functions
As an alias for another value
To represent character strings
For direct access to system memory
We will examine some representative cases for each use in turn. Where appropriate we will point toward sections that discuss a particular use in detail. The purpose of the following paragraphs is to provide a summary, taxonomy, and road map for reading pointer-based code.
3.1.1 Linked Data Structures
At the lowest level, a pointer is just a memory address. It is therefore uniquely suited for representing links between data structure elements. At the conceptual level, data structures such as linked lists, trees, and graphs consist of elements (nodes) joined together by vertices or edges. A pointer can represent a directed vertex between two nodes by storing in one node the address of another. In addition, pointers are used when processing such data structures for accessing their elements and performing house-keeping operations. We examine linked data structures in Section 4.7 and onward.
3.1.2 Dynamic Allocation of Data Structures
Vector-type data structures are often allocated dynamically to have their sizes correspond to runtime information about the number of elements required. We examine this in detail in Section 3.4. Pointers are used to store the starting addresses of such data structures. In addition, programs often dynamically construct C structures to pass them between functions or use them as building elements for linked data structures. In this case pointers are used to store the memory location of the allocated struct, as is the case in the following example.[1]
[1] netbsdsrc/sbin/fsck/preen.c:250–280
static struct diskentry *
finddisk(const char *name)
{
struct diskentry *d;
[...]
d = emalloc(sizeof(*d));
d->d_name = estrdup(name);
d->d_name[len] = '\0';
[...]
return d;
}
The code for allocating memory equal to the size of a struct and casting the result to an appropriate pointer is often delegated to a macro with a name such as new after the equivalent C++ operator.[2],[3]
[2] XFree86-3.3/xc/doc/specs/PEX5/PEX5.1/SI/xref.h:113
[3] XFree86-3.3/xc/doc/specs/PEX5/PEX5.1/SI/xref.c:268
#define new(type) (type *) calloc(sizeof(type), 1)
[...]
node = new(struct codeword_entry);
3.1.3 Call by Reference
Pointers are also used in functions that receive arguments passed by reference. Function arguments passed by reference are used for returning function results or for avoiding the overhead of copying the function's argument. When pointer-type arguments are used for returning function results, you will see these arguments assigned a value within the function body, as is the case with the following function that sets gid to the group identifier of the given group's name.[4]
[4] netbsdsrc/bin/pax/cache.c:430–490
int
gid_name(char *name, gid_t *gid)
{
[...]
*gid = ptr->gid = gr->gr_gid;
return(0);
}
The function's return value is used only to indicate an error condition. Many functions of the Windows API use such a convention. On the caller side you will typically see the respective argument passed to the function by applying the address-of operator (&) to a variable.
Pointer-type arguments are also used to avoid the overhead of copying large elements on each function call. Such arguments are typically structures; seldom, they are double-precision floating-point numbers. Arrays are always passed by reference, and in most architectures other primitive C types are more efficiently copied on a function call than passed by reference. The following function uses the two structure arguments (now and then) passed by reference only to calculate the results without modifying the structure contents. It was presumably coded in that way to avoid the timeval structure copy overhead on each call.[5]
[5] netbsdsrc/sbin/ping/ping.c:1028–1034
static double
diffsec(struct timeval *now,
struct timeval *then)
{
return ((now->tv_sec - then->tv_sec)*1.0
+ (now->tv_usec - then->tv_usec)/1000000.0);
}
In modern C and C++ programs you can often identify arguments passed by reference for efficiency purposes because they are identified by a const declarator.[6]
[6] netbsdsrc/bin/stty/print.c:56
static char *ccval (const struct cchar *, int);
The role of each argument is also sometimes identified using a comment marked IN or OUT, as is the case in the following (pre-ANSI C) function definition.[7]
[7] netbsdsrc/sys/netinet/if_atm.c:223–231
int
atmresolve(rt, m, dst, desten)
register struct rtentry *rt;
struct mbuf *m;
register struct sockaddr *dst;
register struct atm_pseudohdr *desten; /* OUT */
{
Table 3.1. Index and Pointer Code for Accessing an Array a with Elements of Type T Array Index Code
Pointer Code
int i;
T *p;
i = 0
p = a or p = &a[0]
a[i]
*p
a[i].f
p->f
i++
p++
i += K
p += K
i == N
p == &a[N] or p == a + N
Figure 3.1 Pointer access for an array-based stack.
stackp = de_stack; <-- a
[...]
*stackp++ = finchar; <-- b
[...]
do {
if (count-- == 0)
return (num);
*bp++ = *--stackp; <-- c
} while (stackp > de_stack); <-- d
(a) Initialize stack
(b) Push finchar into stack
(c) Pop from stack into *bp
(d) Check if stack is empty
3.1.4 Data Element Access
In Section 4.2 we present how to use a pointer as a database cursor to access table elements. The key ideas for reading pointer-based array access code are that a pointer to an array element address can be used to access the element at the specific position index and that arithmetic on array element pointers has the same semantics as arithmetic on the respective array indices. Table 3.1 shows examples for most common operations used when accessing an array a with elements of type T. Figure 3.1[8] provides an example of pointer-based code for accessing a stack.
[8] netbsdsrc/usr.bin/compress/zopen.c:523–555
3.1.5 Arrays as Arguments and Results
In C and C++ programs, pointers crop up when arrays are passed to functions and returned as results. In C code when the name of an array is used as a function argument, all that is passed to the function is the address of the array's first element. As a result, any modifications made to the array data while the function is executing will affect the elements of the array passed by the function's caller. This implicit call by reference behavior for arrays is different from the way all other C types are passed to functions and can therefore be a source of confusion.
Similarly, C functions can return only a pointer to an array element, not a complete array. Thus, when a function builds a result within an array and then returns an appropriate pointer, it is important to ensure that the array is not a local variable allocated on that function's stack. In such a case the array's space may be overwritten once the function exits, and consequently the function's result will be invalidated. One way to avoid this problem is to declare such arrays as static, as is the case in the following function that converts an Internet address into its dotted decimal representation.[9]
[9] netbsdsrc/libexec/identd/identd.c:120–137
char *inet_ntoa(struct in_addr ad)
{
unsigned long int s_ad;
int a, b, c, d;
static char addr[20];
s_ad = ad.s_addr;
d = s_ad % 256;
s_ad /= 256;
c = s_ad % 256;
s_ad /= 256;
b = s_ad % 256;
a = s_ad / 256;
sprintf(addr, "%d.%d.%d.%d", a, b, c, d);
return addr;
}
The function builds the result in the addr buffer. If that buffer were not declared as static, its contents (the readable representation of the Internet address) would be invalidated once the function returned. Even the above construct is not completely safe. Functions using global or static local variables are in most cases not reentrant. This means that the function cannot be called from another program execution thread while another instance of that function is executing. Even worse, in our case, the function's result has to be saved into another place (using, for example, a call to strdup) before the function is called again; otherwise it will be overwritten by the new result. As an example, the implementation of the inet_ntoa function we listed could not have been used in place of naddr_ntoa in the following context.[10]
[10] netbsdsrc/sbin/routed/rdisc.c:121–125
(void)fprintf(ftrace, "%s Router Ad"
" from %s to %s via %s life=%d\n",
act, naddr_ntoa(from), naddr_ntoa(to),
ifp ? ifp->int_name : "?",
ntohs(p->ad.icmp_ad_life));
In this case, to circumvent the problem we described, the function naddr_ntoa is used as a wrapper for inet_ntoa, storing its result into a circular list of four different temporary buffers.[11]
[11] netbsdsrc/sbin/routed/trace.c:123–139
char *
naddr_ntoa(naddr a)
{
#define NUM_BUFS 4
static int bufno;
static struct {
char str[16]; /* xxx.xxx.xxx.xxx\0 */
} bufs[NUM_BUFS];
char *s;
struct in_addr addr;
addr.s_addr = a;
s = strcpy(bufs[bufno].str, inet_ntoa(addr));
bufno = (bufno+1) % NUM_BUFS;
return s;
}
3.1.6 Function Pointers
It is often useful to parameterize a function by passing to it another function as an argument. The C language, however, does not allow functions to be passed to others as arguments. It does, however, allow passing as an argument a pointer to a function. In Figure 3.2[12] the function getfile, which is used to process files during a backup recovery process, receives as parameters fill and skip; these are used to indicate how to read or skip over data. The function is called (Figure 3.2:1) with different arguments (Figure 3.2:3) to perform an initial scan over the data or to recover or skip over data files.
[12] netbsdsrc/sbin/restore/tape.c:177–837
Figure 3.2 Parameterization using function arguments.
/*
* Verify that the tape drive can be accessed and
* that it actually is a dump tape.
*/
void
setup()
{
[...]
getfile(xtrmap, xtrmapskip);
[...]
}
/* Prompt user to load a new dump volume. */
void
getvol(int nextvol)
{
[...]
getfile(xtrlnkfile, xtrlnkskip);
[...]
getfile(xtrfile, xtrskip);
[...]
}
/*
* skip over a file on the tape
*/
void
skipfile()
{
curfile.action = SKIP;
getfile(xtrnull, xtrnull);
}
/* Extract a file from the tape. */
void
getfile(void (*fill)(char *, long), (*skip)(char *, long))
{
[...]
(*fill)((char *)buf, (long)(size > TP_BSIZE ?
fssize : (curblk - 1) * TP_BSIZE + size));
[...]
(*skip)(clearedbuf, (long)(size > TP_BSIZE ?
TP_BSIZE : size));
[...]
(*fill)((char *)buf, (long)((curblk * TP_BSIZE) + size));
[...]
}
/* Write out the next block of a file. */
static void
xtrfile(char *buf, long size)
{ [...] }
/* Skip over a hole in a file. */
static void
xtrskip(char *buf, long size)
{ [...] }
/* Collect the next block of a symbolic link. */
static void
xtrlnkfile(char *buf, long size)
{ [...] }
/* Skip over a hole in a symbolic link */
static void
xtrlnkskip(char *buf, long size)
{ [...] }
/* Collect the next block of a bit map. */
static void
xtrmap(char *buf, long size)
{ [...] }
/* Skip over a hole in a bit map */
static void
xtrmapskip(char *buf, long size)
{ [...] }
/* Noop, when an extraction function is not needed. */
void
xtrnull(char *buf, long size)
{ return; }
Call with function arguments
Call function argument
Functions to be passed as argument parameters
A number of C library files, such as qsort and bsearch, receive function arguments that specify how they operate.[13]
[13] netbsdsrc/usr.bin/gprof/gprof.c:216–536
getnfile()
{
[...]
qsort(nl, nname, sizeof(nltype), valcmp);
[...]
}
valcmp(nltype *p1, nltype *p2)
{
if ( p1 -> value < p2 -> value) {
return LESSTHAN;
}
if ( p1 -> value > p2 -> value) {
return GREATERTHAN;
}
return EQUALTO;
}
In the above example the way qsort will compare the elements of the array it sorts is specified by the valcmp function.
Finally, function pointers can be used to parameterize control within a body of code. In the example that follows, closefunc is used to store the function that will be called to close the fin stream, depending on how it was opened.[14]
[14] netbsdsrc/libexec/ftpd/ftpd.c:792–860
void
retrieve(char *cmd, char *name)
{
int (*closefunc)(FILE *) = NULL;
[...]
if (cmd == 0) {
fin = fopen(name, "r"), closefunc = fclose;
[...]
if (cmd) {
[...]
fin = ftpd_popen(line, "r", 1), closefunc = ftpd_pclose;
[...]
(*closefunc)(fin);
}
3.1.7 Pointers as Aliases
Pointers are often used to create an alias for a particular value.[15]
[15] netbsdsrc/bin/sh/output.c:81–84
struct output output = {NULL, 0, NULL, OUTBUFSIZ, 1, 0};
[...]
struct output *out1 = &output;
In the example code above, the dereferenced pointer out1 can be used in the place of the original value of the variable output. You will find aliasing used for a number of different reasons.
Efficiency Concerns
Assigning a pointer is more efficient than assigning a larger object. In the following example curt could have been a structure variable instead of a pointer to such a structure. However, the corresponding assignment statement would have been less efficient since it would have to copy the contents of the entire structure.[16]
[16] netbsdsrc/lib/libcurses/tty.c:66–171
static struct termios cbreakt, rawt, *curt;
[...]
curt = useraw ? &rawt : &cbreakt;
Reference to Statically Initialized Data
A variable is used to point to different values of static data. The most common example in this case involves character pointers set to point to different strings.[17]
[17] netbsdsrc/games/rogue/room.c:629–635
char *s;
[...]
s = *(opt->bval) ? "True" : "False";
Implementation of Variable Reference Semantics in a Global Context
The daunting title merely refers to the equivalent of the call-by-reference pointer usage, but using a global variable instead of a function argument. Thus, a global pointer variable is used to refer to data that is to be accessed and modified in another place. In the following random-number generator, fptr is initialized to point to an entry in a table of random-number seeds; it is also set in a similar fashion in srrandom(). Finally, in rrandom() the variable fptr is used to modify the value it points to.[18]
[18] netbsdsrc/games/rogue/random.c:62–109
static long rntb[32] = {
3, 0x9a319039, 0x32d9c024, 0x9b663182, 0x5da1f342,
[...]
};
static long *fptr = &rntb[4];
[...]
void
srrandom(int x)
{
[...]
fptr = &state[rand_sep];
[...]
}
long
rrandom()
{
[...]
*fptr += *rptr;
i = (*fptr >> 1) & 0x7fffffff;
The equivalent result could have been obtained by passing fptr as an argument to srrandom() and rrandom().
A similar approach is also often used to access a modifiable copy of global data.[19]
[19] netbsdsrc/games/rogue/curses.c:121–157
WINDOW scr_buf;
WINDOW *curscr = &scr_buf;
[...]
move(short row, short col)
{
curscr->_cury = row;
curscr->_curx = col;
screen_dirty = 1;
}
3.1.8 Pointers and Strings
In the C language string literals are represented by a character array terminated with the null character '\0'. As a consequence, character strings are represented by a pointer to the first character of a null-terminated sequence. In the following code excerpt you can see how the code of the strlen C library function advances a pointer through the string passed as the function's parameter until it reaches the string's end. It then subtracts the pointer to the string's start from the pointer to its terminating null character to calculate and return the string's length.[20]
[20] netbsdsrc/lib/libc/string/strlen.c:51–59
size_t
strlen(const char *str)
{
register const char *s;
for (s = str; *s; ++s)
;
return(s - str);
}
When reading string-based code, be careful to distinguish between character pointers and character arrays. Although you will find both often used to represent strings (since a character array is automatically converted into a pointer to the first character of the array when passed to a function), the underlying types and operations you can perform on them are different. As an example, the following code fragment defines pw_file as a character pointer pointing to a string constant that contains the sequence "/etc/passwd".[21]
[21] netbsdsrc/distrib/utils/libhack/getpwent.c:45
static char *pw_file = "/etc/passwd";
The size of the pw_file variable on our machine is 4, and it can be modified to point elsewhere, but trying to change the contents of the memory it points to will result in undefined behavior.
On the other hand, the following line defines line as an array of characters initialized to contain the sequence "/dev/XtyXX" followed by a null.[22]
[22] netbsdsrc/lib/libutil/pty.c:71
static char line[] = "/dev/XtyXX";
Applying the sizeof operator on line returns 11; line will always refer to the same storage area, and the elements it contains can be freely modified.[23]
[23] netbsdsrc/lib/libutil/pty.c:81
line[5] = 'p';
Keeping the distinction between character arrays and pointers in mind while reading code is important since they are both used, often interchangeably, for similar purposes. Crucially, all C compilers we are aware of will not warn you when the two are accidentally mixed up across different files. Consider the following two (unrelated) definitions and the associated declaration.[24],[25],[26]
[24] netbsdsrc/usr.bin/less/less/version.c:575
[25] netbsdsrc/libexec/identd/version.c:3
[26] netbsdsrc/libexec/identd/identd.c:77
char version[] = "332";
char *version = "2.1.2";
extern char *version;
Both definitions are used to define a version string that is probably subsequently printed by passing it to a printf function. However, the extern char *version declaration can be used only to access the variable defined as char *version. Although linking the source file with the file containing the char version[] definition will typically not produce an error, the resulting program will fail at runtime. The version variable, instead of pointing to a memory area that contains the version string, will probably point to the memory area whose address is represented by the bit pattern of the character string "332" (0x33333200 on some processor architectures).
3.1.9 Direct Memory Access
Since a pointer is internally represented as a memory address, you will find low-level code using pointers to access hardware-specific memory areas. Many peripherals, such as graphics and network adapters, use a commonly accessible memory area to interchange data with the system's main processor. A variable initialized to point to that area can be conveniently used to communicate with the given device. In the following example, the video variable is set to point to a memory area occupied by the screen adapter (0xe08b8000). Writing to that area at an offset determined by the specified row and column (vid_ypos and vid_xpos) will make the corresponding character (c) appear on the screen.[27]
[27] netbsdsrc/sys/arch/pica/pica/machdep.c:951–958
static void
vid_wrchar(char c)
{
volatile unsigned short *video;
video = (unsigned short *)(0xe08b8000) +
vid_ypos * 80 + vid_xpos;
*video = (*video & 0xff00) | 0x0f00 | (unsigned short)c;
}
Keep in mind that modern operating systems prevent user programs from direct hardware access without prior arrangements. You are likely to encounter such code only when examining embedded system code or kernel and device-driver internals.
Exercise 3.1 Reimplement the pointer-based stack in the compress library[28] by using an array index. Measure the speed difference and comment on the readability of the two implementations.
[28] netbsdsrc/usr.bin/compress/zopen.c
Exercise 3.2 Locate in the book's CD-ROM three instances of code using pointers for each one of the reasons we outlined.
Exercise 3.3 If you are familiar with C++ or Java, explain how it is possible to minimize (in C++) or avoid (in Java) the use of pointers.
Exercise 3.4 How does a C pointer differ from a memory address? How does that affect your understanding of code? Which tools take advantage of the difference?
Exercise 3.5 Should program version strings be represented as character arrays or as pointers to a string? Justify your answer.
3.2 Structures
A C struct is a grouping of data elements that makes it possible to use them as a whole. You will find structures used in C programs in the fo llowing ways:
To group together data elements typically used as a whole
To return multiple data elements from a function
To construct linked data structures (see Section 4.7)
To map the organization of data on hardware devices, network links, and storage media
To implement abstract data types (Section 9.3.5)
To program in an object-oriented fashion
The following paragraphs expand on uses of structures we do not cover in other sections.
3.2.1 Grouping Together Data Elements
Structures are often used to form a group out of related elements typically used as a whole. The prototypical example is the representation of the coordinates of a location on the screen.[29]
[29] netbsdsrc/games/snake/snake/snake.c:75–77
struct point {
int col, line;
};
Other cases you will encounter include complex numbers and fields forming table rows.
3.2.2 Returning Multiple Data Elements from a Function
When the result of a function is expressed using more than a single basic data type, the multiple result elements can be returned either through function arguments passed to the function by reference (see Section 3.1.3) or by grouping them together into a structure and returning that structure. In the following example difftv returns the difference between the two times passed in seconds (tv_sec) and microseconds (tv_usec) in the form of the timeval structure.[30]
[30] netbsdsrc/usr.sbin/named/dig/dig.c:1221–1233
static struct timeval
difftv(struct timeval a, struct timeval b)
{
static struct timeval diff;
diff.tv_sec = b.tv_sec - a.tv_sec;
if ((diff.tv_usec = b.tv_usec - a.tv_usec) < 0) {
diff.tv_sec--;
diff.tv_usec += 1000000;
}
return(diff);
}
3.2.3 Mapping the Organization of Data
When data is moved over a network or transferred to or from secondary storage, or when programs interface directly with hardware, structures are often used to represent how data is organized on that other medium. The following structure represents an Intel EtherExpress network interface card command block.[31]
[31] netbsdsrc/sys/dev/pci/if_fxpreg.h:102–107
struct fxp_cb_nop {
void *fill[2];
volatile u_int16_t cb_status;
volatile u_int16_t cb_command;
volatile u_int32_t link_addr;
};
The volatile qualifier is used to denote that the underlying memory fields are used by entities outside the control of the program (in this case, the network interface card). The compiler is thus prohibited from performing optimizations on these fields such as removing redundant references.
In certain cases a bit field may be declared to specify a precise range of bits used to hold a particular value on the given device.[32]
[32] netbsdsrc/sys/dev/pci/if_fxpreg.h:116–125
struct fxp_cb_config {
[...]
volatile u_int8_t byte_count:6,
:2;
volatile u_int8_t rx_fifo_limit:4,
tx_fifo_limit:3,
:1;
In the above example the byte_count to be transferred is specified to occupy 6 bits, while the receiver and the transmitter FIFO queue limits occupy 4 and 3 bits, respectively, on the hardware device.
Network data packets are also often encoded using a C structure to delineate their elements, as demonstrated by the following classic definition of the TCP packet header.[33]
[33] netbsdsrc/sys/netinet/tcp.h:43–47
struct tcphdr {
u_int16_t th_sport; /* source port */
u_int16_t th_dport; /* destination port */
tcp_seq th_seq; /* sequence number */
tcp_seq th_ack; /* acknowledgement number */
[...]
Finally, structures are also used to map how data is stored on peripheral media such as disks and tapes. As an example, the disk characteristics of MS-DOS disk partitions are determined by the so-called BIOS parameter block. Its fields are specified using the following structure.[34]
[34] netbsdsrc/sys/msdosfs/bpb.h:22–34
struct bpb33 {
u_int16_t bpbBytesPerSec; /* bytes per sector */
u_int8_t bpbSecPerClust; /* sectors per cluster */
u_int16_t bpbResSectors; /* number of reserved sectors */
u_int8_t bpbFATs; /* number of FATs */
u_int16_t bpbRootDirEnts; /* number of root directory entries */
u_int16_t bpbSectors; /* total number of sectors */
u_int8_t bpbMedia; /* media descriptor */
u_int16_t bpbFATsecs; /* number of sectors per FAT */
u_int16_t bpbSecPerTrack; /* sectors per track */
u_int16_t bpbHeads; /* number of heads */
u_int16_t bpbHiddenSecs; /* number of hidden sectors */
};
The way fields are ordered within a structure is architecture- and compiler-dependent. In addition, the representation of various elements within the structure is architecture- and operating system–dependent (the operating system may force a particular processor to adopt a specific byte ordering). Even simple data types such as integers can have their bytes stored in different ways. Therefore, the use of structures to map external data is inherently nonportable.
3.2.4 Programming in an Object-Oriented Fashion
In C programs structures are sometimes used to create object-like entities by grouping together data elements and function pointers to simulate a class's fields and methods. In the following example, the domain structure, representing different domains of network protocols (for example, Internet, SNA, IPX), groups together data about the particular domain, such as its family dom_family, and methods for operating on it, such as the routing table initialization method dom_rtattach.[35]
[35] netbsdsrc/sys/sys/domain.h:50–65
struct domain {
int dom_family; /* AF_xxx */
char *dom_name;
void (*dom_init)(void); /* initialize domain data structures */
[...]
int (*dom_rtattach)(void **, int);
/* initialize routing table */
int dom_rtoffset; /* an arg to rtattach, in bits */
int dom_maxrtkey; /* for routing layer */
}
Following such a declaration, variables of the specific structure type or, more often, pointers to the structure type can be treated—after being suitably initialized—in ways resembling the use of C++ or Java objects.[36]
[36] netbsdsrc/sys/kern/vfs_subr.c:1436–1440
for (dom = domains; dom; dom = dom->dom_next)
if (dom->dom_family == i && dom->dom_rtattach) {
dom->dom_rtattach((void **)&nep->ne_rtable[i],
dom->dom_rtoffset);
break;
Since "objects" can be initialized with different "methods" (function pointers) and yet be used through the same interface (calls through the same structure member names), the technique we outlined implements the virtual methods available in object-oriented languages and polymorphic programming. Infact, because objects of the same class share their methods (but not their fields), the method pointers are often shared between different objects by storing in the "objects" structure only a pointer to another structure containing pointers to the actual methods.[37]
[37] netbsdsrc/sys/sys/file.h:51–73
struct file {
[...]
short f_type; /* descriptor type */
short f_count; /* reference count */
short f_msgcount; /* references from message queue */
struct ucred *f_cred; /* credentials associated with descriptor */
struct fileops {
int (*fo_read)(struct file *fp, struct uio *uio,
struct ucred *cred);
int (*fo_write)(struct file *fp, struct uio *uio,
struct ucred *cred);
int (*fo_ioctl)(struct file *fp, u_long com,
caddr_t data, struct proc *p);
int (*fo_poll)(struct file *fp, int events,
struct proc *p);
int (*fo_close)(struct file *fp, struct proc *p);
} *f_ops;
off_t f_offset;
caddr_t f_data; /* vnode or socket */
};
In the above example, each object representing an open file or socket shares its read, write, ioctl, poll, and close methods through the structure member f_ops pointing to the shared fileops structure.
Exercise 3.6 An alternative way for passing array-type data to functions by value and returning it as a real value is to embed the array within a structure. Locate such examples in the book's CD-ROM. Suggest reasons explaining why this approach is not used more often.
Exercise 3.7 Locate 20 separate instances of structures in the book's CD-ROM and classify the reasons behind their use. Familiarize yourself with a general-purpose data structure library such as the C++ STL or java.util and indicate which instances could have been coded using library functionality. Minimize the time you spend on each instance.
Exercise 3.8 A number of technologies, libraries, and tools support the portable encoding of data for transfer between applications. Outline the technologies applicable in your environment, and compare them against using ad hoc approaches based on C structures.
3.3 Unions
A C union groups together items that share the same storage area. Only one of the items sharing the area can be accessed at the time. You will find unions used in C programs in the following ways:
To use storage efficiently
To implement polymorphism
To access data using different internal representations
3.3.1 Using Storage Efficiently
A common justification for unions you will encounter concerns sharing the same area of storage for two different purposes; the intent is to save a few precious memory bytes. There are cases where unions are used exactly for this purpose. However, in an industry where even embedded devices sport several megabytes of memory, this use is becoming increasingly rare. Apart from legacy code, you will also encounter cases where a very large number of objects based on a union justifies the extra effort needed to write the appropriate code. The following example from the standard C library memory allocation function malloc is a typical case.[38]
[38] netbsdsrc/lib/libc/stdlib/malloc.c:78–92
union overhead {
union overhead *ov_next; /* when free */
struct {
u_char ovu_magic; /* magic number */
u_char ovu_index; /* bucket # */
#ifdef RCHECK
u_short ovu_rmagic; /* range magic number */
u_long ovu_size; /* actual block size */
#endif
} ovu;
#define ov_magic ovu.ovu_magic
#define ov_index ovu.ovu_index
#define ov_rmagic ovu.ovu_rmagic
#define ov_size ovu.ovu_size
};
Memory blocks can be either free or occupied. Different values need to be stored in each case, and since a block cannot be at the same time both free and occupied these can share the same memory space. The extra programming cost for maintaining this arrangement is amortized over the trillions of items that will be allocated by calls to the C library over its lifetime.
Note the macro definitions following the definition of the ovu structure; such definitions are often used as shortcuts for referring directly to structure members within a union without prepending the union member name in front. Thus, in the code you see[39]
[39] netbsdsrc/lib/libc/stdlib/malloc.c:213
op->ov_index = bucket;
instead of more verbose references to op->ovu.ovu_index.
3.3.2 Implementing Polymorphism
By far the most common use of unions you will encounter concerns the implementation of polymorphism. Here the same object (typically represented by a C structure) is used to represent different types. The data for those different types is stored into separate union members. Polymorphic data stored in a union also saves space over alternative memory arrangements; however, the union is used in this case to express our intent of having an object store different data types rather than show off our frugality. Unions used in this way are typically embedded within a C structure containing a type field that denotes what kind of data is stored in the union. This field is often represented by using an enumerated type, and its name is often, predictably enough, type. The following example, part of the Sun RPC (Remote Procedure Call) library, contains the structure used for representing an RPC message. Two different message types are supported: a call and a reply. The msg_type enum distinguishes between the two types, while a union contains a structure with the data elements for each type.[40]
[40] netbsdsrc/include/rpc/rpc_msg.h:54–158
enum msg_type {
CALL=0,
REPLY=1
};
[...]
struct rpc_msg {
u_int32_t rm_xid;
enum msg_type rm_direction;
union {
struct call_body RM_cmb;
struct reply_body RM_rmb;
}ru;
};
3.3.3 Accessing Different Internal Representations
The final use of unions we will examine involves storing data into one union field and accessing another in order to transfer data between different internal representations. Although this use is inherently nonportable, there are some conversions that are safe to perform and others that are useful in specific machine-dependent ways. The following structure definition is used by the tar file archive program to represent the information about each file in the archive.[41]
[41] netbsdsrc/usr.bin/file/tar.h:36–54
union record {
char charptr[RECORDSIZE];
struct header {
char name[NAMSIZ];
char mode[8];
char uid[8];
char gid[8];
char size[12];
char mtime[12];
char chksum[8];
char linkflag;
char linkname[NAMSIZ];
char magic[8];
char uname[TUNMLEN];
char gname[TGNMLEN];
char devmajor[8];
char devminor[8];
} header;
};
To detect data corruption, a checksum field chksum contains the sum of all record bytes comprising a file (including the header record). Programs use the charptr union member to iterate over the header data byte-by-byte to calculate the checksum and the header member to access specific header fields. Since integral types in C (including characters) are valid over all possible bit patterns they can represent, accessing the internal representation of other C types (pointers, floating-point numbers, other integral types) as an integral type is guaranteed to be a legal operation. However, the converse operation—the generation of another type through its integral representation—will not always yield proper values. Under most architectures one safe operation you may encounter involves generating nonintegral types from data that was an older copy of their value.
The second, inherently nonportable use of structures is to access architecture-specific data elements in some other format. This can be useful for interpreting a data type based on its representation or for creating a data type from its representation. In the example in Figure 3.3[42] the union u is used to access the internal representation of the floating-point number v as a mantissa, an exponent, and a sign. The conversion is used to break floating-point numbers into a normalized fraction and an integral power of 2.
[42] netbsdsrc/lib/libc/arch/i386/gen/frexp.c:48–72
Exercise 3.9 Locate 20 separate instances of unions in the book's CD-ROM and classify the reasons behind their use. Minimize the time you spend on each instance. Create a graph depicting the frequency with which constructs are used for a particular reason.
Figure 3.3 Accessing internal type representation by using a union.
double
frexp(double value, int *eptr) <-- a
{
union {
double v; <-- b
<-- c
struct {
u_int u_mant2 : 32; <-- d
u_int u_mant1 : 20;
u_int u_exp : 11; <-- e
u_int u_sign : 1;
} s;
} u;
if (value) {
u.v = value; <-- f
*eptr = u.s.u_exp - 1022; <-- g
u.s.u_exp = 1022; <-- h
return(u.v); <-- i
} else {
*eptr = 0;
return((double)0);
}
}
(a) Exponent to return
(b) Store value in this field
(c) Access internal representation from this field
(d) Mantissa
(e) Exponent
(f) Store value
(g) Retrieve and set signed exponent
(h) Zero exponent
(i) Return normalized mantissa
Exercise 3.10 Propose portable alternatives for implementing the nonportable constructs used with structures and unions. Discuss your proposal in terms of implementation cost, maintainability, and efficiency.
3.4 Dynamic Memory Allocation
A data structure whose size is not known when the program is written, or one that grows while the program is running, is stored in memory allocated dynamically while the program is running. Programs refer to dynamically allocated memory through the use of pointers. In this section we examine how memory can be dynamically allocated for vector structures; however, the code sequences we list are very similar or identical to those used for storing other data structures.
Figure 3.4[43] is a typical example of how data space is dynamically allocated and used. In this case the code allocates space to store a sequence of integers as an array, so RRlen is first defined as a pointer to these integers (Figure 3.4:1). Pointer-type variables have to be initialized by making them point to a valid memory area. In this case the program uses the C library malloc function to obtain from the system the address of a memory area large enough to store c integers. The argument of malloc is the number of bytes to allocate. Its calculation here is typical: it consists of the product of the number of elements to allocate (c) multiplied by the size of each element (sizeof(int)). If a system's memory is exhausted the malloc function indicates this by returning NULL; robustly written programs should always check for such a case (Figure 3.4:2). From that point onward the program can dereference RRlen (using the [] or the * operator) as if it were an array consisting of c elements. Keep in mind, however, that the two are not equivalent: most notably sizeof applied to an array variable will return the size of the array (for example, 40 for an array of 10 four-byte integers), whereas the same operator when applied on a pointer will return only the storage required to store that pointer in memory (for example, 4 in many modern architectures). Finally, when the storage allocated is no longer needed, it must be freed by calling the free function. From that point onward trying to dereference the pointer will lead to undefined results. Failing to free the memory is also a mistake that can lead to programs exhibiting memory leaks that gradually appropriate and waste a system's memory resources.
[43] netbsdsrc/usr.sbin/named/named/ns_validate.c:871–1231
Figure 3.4 Dynamic storage allocation.
int
update_msg(uchar *msg, int *msglen, int Vlist[], int c)
{
[...]
int *RRlen;
[...]
RRlen = (int *)malloc((unsigned)c*sizeof(int)); <-- a <-- b
if (!RRlen)
panic(errno, "malloc(RRlen)");
[...]
for (i = 0; i < c; i++) { <-- c
[...]
RRlen[i] = dn_skipname(cp, msg + *msglen); <-- d
[...]
}
[...]
free((char *)RRlen); <-- e
return (n);
}
Pointer to an integer
(a) Number of elements
(b) Size of each element
Handle memory exhaustion
(c) Iterate over all elements
(d) Use RRlen as an array
(e) Free allocated memory
When memory is exhausted there is not a lot a program can do; in most cases printing an error message and exiting is the only viable strategy. For this reason, and to avoid the burden of checking the return value of malloc after every call, malloc is wrapped inside a function (typically named xmalloc) that performs this check and is always called instead of malloc, as in the following example.[44]
[44] netbsdsrc/usr.bin/sed/misc.c:63–107
Figure 3.5 Memory allocation readjustment.
void
remember_rup_data(char *host, struct statstime *st)
{
if (rup_data_idx >= rup_data_max) { <-- a
rup_data_max += 16; <-- b
rup_data = realloc (rup_data, <-- c
rup_data_max * sizeof(struct rup_data));
if (rup_data == NULL) {
err (1, "realloc");
}
}
rup_data[rup_data_idx].host = strdup(host); <-- d
rup_data[rup_data_idx].statstime = *st;
rup_data_idx++; <-- e
}
(a) Index larger than allocated size?
(b) New size
(c) Adjust allocation
(d) Store data
(e) New index
void *
xmalloc(u_int size)
{
void *p;
if ((p = malloc(size)) == NULL)
err(FATAL, "%s", strerror(errno));
return (p);
}
[...]
oe = xmalloc(s);
(void)regerror(errcode, preg, oe, s);
In some cases the size of an array is determined while elements are actually being stored in it. This is typically the case when processing input data: the data is stored in an array, but the number of elements that need to be stored may not be known until all elements have been read. Figure 3.5[45] demonstrates such a case. Before an element is stored, the current array index is checked against the size of the elements in the allocated memory block. If more space is needed, the realloc C library function is called to adjust the space pointed to by its first argument to the new amount specified by the function's second argument. The function returns a pointer to the adjusted memory block because its address may not be the same as that of the original block. In that case, the contents of the original block are copied to the new location. Keep in mind that any pointer variables pointing to locations in the original block will now point to undefined data. The example in Figure 3.5 linearly increments the size of the memory block by 16 bytes each time the allocated memory is exhausted. An exponential increase of the allocated memory space is also very common.[46]
[45] netbsdsrc/usr.bin/rup/rup.c:146–164
[46] netbsdsrc/usr.sbin/amd/hlfsd/homedir.c:521–525
if (cur_pwtab_num + 1 > max_pwtab_num) {
/* need more space in table */
max_pwtab_num *= 2;
pwtab = (uid2home_t *) xrealloc(pwtab,
sizeof(uid2home_t) * max_pwtab_num);
3.4.1 Managing Free Memory
Earlier we mentioned that failing to free a memory block leads to programs exhibiting memory leaks. However, you will often encounter programs that allocate memory and never free it.[47] Such programs get away with it when they are short-lived; when a program terminates, all memory it has allocated is automatically reclaimed by the operating system. Such is the case in the implementation of the skeyinit program.[48]
[47] netbsdsrc/bin/mv/mv.c:260
[48] netbsdsrc/usr.bin/skeyinit/skeyinit.c:34–233
int
main(int argc, char *argv[])
{
[...]
skey.val = (char *)malloc(16 + 1);
[... no call to free(skey.val) ]
exit(1);
}
The skeyinit program is used to change a password or add a user in Bellcore's S/Key authentication system. It performs its work and then immediately exits, thus relinquishing the memory it allocates. This careless coding practice, however, may create problems when the same code is reused in a program with a larger lifespan (for example, as part of a router software system). In such a case the program will appear to leak memory, a fact that can often be ascertained by using a process view command such as ps or top under Unix or the task manager under Windows. Note that in the example above skeyinit could have simply set skey.val to point to a fixed size array, allocated as a local variable on the stack.
main(int argc, char *argv[])
{
char valspace[16 + 1];
[...]
skey.val = valspace;
It is a common misconception of novice C/C++ programmers that all pointers must be initialized to malloc-allocated memory blocks. Although code written in this style is not wrong, the resulting programs are often more difficult to read and maintain.
In a few cases you may encounter programs that contain a garbage collector to automatically reclaim unused storage. Such a technique is often used when the memory blocks allocated are difficult to follow throughout their lifetime because, for example, they are shared between different variables. In such cases you will see a reference count associated with each block. This can be maintained by incrementing it every time a new reference for the block is created[49]
[49] XFree86-3.3/xc/lib/Xt/Selection.c:1540–1542
req.ctx = ctx;
req.event.time = time;
ctx->ref_count++;
and decrementing it every time a reference is destroyed.[50]
[50] XFree86-3.3/xc/lib/Xt/Selection.c:744–746
XtFree((char*)req);
ctx->req = NULL;
ctx->ref_count--;
When the reference count reaches 0, the block is no longer used and can be freed.[51]
[51] XFree86-3.3/xc/lib/Xt/Selection.c:563–564
if (--ctx->ref_count == 0 && ctx->free_when_done)
XtFree((char*)ctx);
Another, less frequently used approach involves using a conservative garbage collector that scans all the memory of a process looking for addresses that match existing allocated memory blocks. Any blocks not encountered in the scan are then freed.
Finally, some versions of the C library implement a nonstandard function called alloca. This function allocates a memory block using the same interface as malloc, but instead of allocating the block in the program's heap (the general-purpose memory area belonging to the program), it allocates it on the program's stack (the area used for storing function return addresses and local variables).[52]
[52] netbsdsrc/sys/dev/ofisa/ofisa.c:563–564
int
ofisa_intr_get(int phandle, struct ofisa_intr_desc *descp,
int ndescs)
{
char *buf, *bp;
[...]
buf = alloca(i);
The block returned by alloca is automatically reclaimed when the function it was allocated in returns; there is no need to call free to dispose the allocated block. Naturally, the address of alloca-allocated memory should never be passed to the callers of the function it was allocated in, as it will be invalid when the function returns. The use of alloca is discouraged in some programming environments like FreeBSD because it is considered nonportable and machine-dependent; in other cultures like GNU its use is encouraged because it reduces accidental memory leaks.
3.4.2 Structures with Dynamically Allocated Arrays
Sometimes a single dynamically allocated structure is used to store some fields and an array containing structure-specific, variable-length data. This construct is used to avoid the pointer indirection and memory overhead associated with having a structure element contain a pointer to the variable-length data. Thus instead of a definition like:[53]
[53] XFree86-3.3/xc/include/extensions/record.h:99–108
typedef struct {
XID id_base;
[...]
unsigned char *data;
unsigned long data_len; /* in 4-byte units */
} XRecordInterceptData;
you will see a definition like:[54]
[54] netbsdsrc/bin/ls/ls.h:69–74
typedef struct {
char *user;
char *group;
char *flags;
char data[1];
} NAMES;
The data array structure element is used as a placeholder for the actual data. When memory for holding the structure is allocated, its size is adjusted upward based on the number of elements in the data array. From that point onward the array element is used as if it contained space for those elements.[55]
[55] netbsdsrc/bin/ls/ls.c:470–475
if ((np = malloc(sizeof(NAMES) +
ulen + glen + flen + 3)) == NULL)
err(1, "%s", "");
np->user = &np->data[0];
(void)strcpy(np->user, user);
Notice how the example above apparently allocates one byte more than what is actually needed. The size of the memory block is calculated as the sum of the structure size and the associated data elements: the size of the three strings (ulen, glen, flen) and their corresponding null terminating characters (3). However, the structure size already includes one placeholder byte for the associated data that is not taken into account when calculating the memory block size. The micromanagement of memory storage structures is a perilous and error-fraught activity.
Exercise 3.11 Locate in the book's CD-ROM a medium-sized C program that depends on dynamic memory allocation. Measure the percentage of the program's code that is used for managing the dynamically allocated memory structures. Estimate the percentage again, assuming that memory disposal was automatically managed by a garbage collector, as is the case in Java programs.
Exercise 3.12 Most modern programming environments offer specialized libraries, compilation options, or other tools for detecting memory leaks. Identify the facility available in your environment, and use it on three different programs. Comment on the results you obtain.
Thursday, December 27, 2007
Boolean Expressions , goto Statements
2.7 Character and Boolean Expressions
The body of the for loop in the getstops function starts with a block of code that can appear cryptic at first sight (Figure 2.5:2). To understand it we need to dissect the expressions that comprise it. The first, the condition in the while loop, is comparing *cp (the character cp points to) against two characters: ´0´ and ´9´. Both comparisons must be true and both of them involve *cp combined with a different inequality operator and another expression. Such a test can often be better understood by rewriting the comparisons to bring the value being compared in the middle of the expression and to arrange the other two values in ascending order. This rewriting in our case would yield
while ( '\'0' <= *cp && *cp <= '9')
This can then be read as a simple range membership test for a character c.
0 c 9
Note that this test assumes that the digit characters are arranged sequentially in ascending order in the underlying character set. While this is true for the digits in all character sets we know, comparisons involving alphabetical characters may yield surprising results in a number of character sets and locales. Consider the following typical example.[31]
[31] netbsdsrc/games/hack/hack.objnam.c:352–253
if ('a' <= *s && *s <= 'z')
*s -= ( 'a' - 'A');
the code attempts to convert lowercase characters to uppercase by subtracting from each character found to be lowercase (as determined by the if test) the character set distance from ´a´ to ´A´. This fragment will fail to work when there are lowercase characters in character set positions outside the range a...z, when the character set range a...z contains non lowercase characters, and when the code of each lower case character is not a fixed distance away from the corresponding uppercase character. Many non-ASCII character sets exhibit at least one of these problems.
The next line in the block (Figure 2.5:2) can also appear daunting. It modifies the variable i based on the values of i and *cp and two constants: 10 and ´0´ while at the same time incrementing cp. The variable names are not especially meaningful, and the program author has not used macro or constant definitions to document the constants; we have to make the best of the information available.
We can often understand the meaning of an expression by applying it on sample data. In our case we can work based on the initial value of i (0) and assume that cp points to a string containing a number (for example, 24) based on our knowledge of the formatting specifications that expand accepts. We can then create a table containing the values of all variables and expression parts as each expression part is evaluated. We use the notation i' and *cp' to denote the variable value after the expression has been evaluated.
Iteration
i
i*10
*cp
*cp-'0'
i'
*cp'
0
0
0
'2'
2
2
'4'
1
2
20
'4'
4
24
0
The expression*cp - '0' uses a common idiom: by subtracting the ordinal value of '0' from *cp the expression yields the integer value of the character digit pointed to by *cp. Based on the table we can now see a picture emerging: after the loop terminates, i will contain the decimal value of the numeric string pointed to by cp at the beginning of the loop.
Armed with the knowledge of what i stands for (the integer value of a tab-stop specification), we can now examine the lines that verify the specification. The expression that verifies i for reasonable values is straightforward. It is a Boolean expression based on the logical OR (||) of two other expressions. Although this particular expression reads naturally as English text (print an error message if i is either less than or equal to zero, or greater than 255), it is sometimes useful to transform Boolean expressions to a more readable form. If, for example, we wanted to translate the expression into the range membership expression we used above, we would need to substitute the logical OR with a logical AND (&&). This can easily be accomplished by using De Morgan's rules.[32]
[32] We use the operator <=> to denote that two expressions are equivalent. This is not a C/C++/C#/Java operator.
!(a || b) <=> !a && !b
!(a && b) <=> !a || !b
We can thus transform the testing code as follows:
i <= 0 || i > 256 <=>
!(!(i <= 0) && !(i > 256)) <=>
!(i > 0 && i <= 256) <=>
!(0 < i && i <= 256) <=>
¬(0 < i 256)+
In our example we find both the initial and final expressions equally readable; in other cases you may find that De Morgan's rules provide you a quick and easy way to disentangle a complicated logical expression.
When reading Boolean expressions, keep in mind that in many modern languages Boolean expressions are evaluated only to the extent needed. In a sequence of expressions joined with the && operator (a conjunction), the first expression to evaluate to false will terminate the evaluation of the whole expression yielding a false result. Similarly, in a sequence of expressions joined with the || operator (a disjunction), the first expression to evaluate to true will terminate the evaluation of the whole expression yielding a true result. Many expressions are written based on this short-circuit evaluation property and should be read in the same way. When reading a conjunction, you can always assume that the expressions on the left of the expression you are examining are true; when reading a disjunction, you can similarly assume that the expressions on the left of the expression you are examining are false. As an example, the expression in the following if statement will become true only when all its constituent elements are true, and t->type will be evaluated only when t is not NULL.[33]
[33] netbsdsrc/bin/ksh/main.c:606–607
if (t != NULL && t->type != TEOF && interactive && really–exit)
really–exit = 0;
Conversely, in the following example argv[1] will be checked for being NULL only if argv is not NULL .[34]
[34] netbsdsrc/lib/libedit/term.c:1212–1213
if (argv == NULL || argv[1] == NULL || argv[2] == NULL)
return -1;
In both cases, the first check guards against the subsequent dereference of a NULL pointer. Our getstops function also uses short-circuit evaluation when checking that a delimiter specified (i) is larger than the previous one (tabstops[nstops-1]) (Figure 2.5:4). This test will be performed only if at least one additional delimiter specification has been processed (nstops > 0). You can depend on the short-circuit evaluation property in most C-derived languages such as C++, Perl, and Java; on the other hand, Fortran, Pascal, and most Basic dialects will always evaluate all elements of a Boolean expression.
Exercise 2.20 Locate expressions containing questionable assumptions about character code values in the book's CD-ROM. Read about the Java Character class test and conversion methods such as isUpper and toLowerCase or the corresponding ctype family of C functions (isupper, tolower , and so on). Propose changes to make the code less dependent on the target architecture character set.
Exercise 2.21 Find, simplify, and reason about five nontrivial Boolean expressions in the source code base. Do not spend time on understanding what the expression elements mean; concentrate on the conditions that will make the expression become true or false. Where possible, identify and use the properties of short-circuit evaluation.
Exercise 2.22 Locate and reason about five nontrivial integer or character expressions in the source code base. Try to minimize the amount of code you need to comprehend in order to reason about each expression.
Figure 2.6 The goto statement used for a common error handler.
static int
gen_init(void)
{
[...]
if ((sigaction(SIGXCPU, &n_hand, &o_hand) < 0) &&
(o_hand.sa_handler == SIG_IGN) &&
(sigaction(SIGXCPU, &o_hand, &o_hand) < 0))
goto out;
n_hand.sa_handler = SIG_IGN;
if ((sigaction(SIGPIPE, &n_hand, &o_hand) < 0) ||
(sigaction(SIGXFSZ, &n_hand, &o_hand) < 0))
goto out;
return(0);
out:
syswarn(1, errno, "Unable to set up signal handler");
return(-1);
}
Failure; exit with an error
Failure; exit with an error
Normal function exit (success)
Common error handling code
2.8 goto Statements
The code segment that complains about unreasonable tab specifications (Figure 2.5:3) begins with a word followed by a colon. This is a label: the target of a goto instruction. Labels and goto statements should immediately raise your defenses when reading code. They can be easily abused to create "spaghetti" code: code with a flow of control that is difficult to follow and figure out. Therefore, the designers of Java decided not to support the goto statement. Fortunately, most modern programs use the goto statement in a small number of specific circumstances that do not adversely affect the program's structure.
You will find goto often used to exit a program or a function after performing some actions (such as printing an error message or freeing allocated resources). In our example the exit(1) call at the end of the block will terminate the program, returning an error code (1) to the system shell. Therefore all goto statements leading to the bad label are simply a shortcut for terminating the program after printing the error message. In a similar manner, the listing in Figure 2.6[35] illustrates how a common error handler (Figure 2.6:4) is used as a common exit point in all places where an error is found (Figure 2.6:1, Figure 2.6:2). A normal exit route for the function, located before the error handler (Figure 2.6:3), ensures that the handler will not get called when no error occurs.
[35] netbsdsrc/bin/pax/pax.c:309–412
Figure 2.7 The use of goto to reexecute code.
again:
if ((p = fgets(line, BUFSIZ, servf)) == NULL) <-- a
return (NULL);
if (*p == '#') <-- b
goto again;
cp = strpbrk(p, "#\n");
if (cp == NULL) <-- c
goto again;
*cp = '\0'; <-- d
[...]
return (&serv);
(a) Read a line; return on EOF
(b) Comment? Retry
(c) Incomplete line? Retry
(d) Complete entry
You will also find the goto statement often used to reexecute a portion of code, presumably after some variables have changed value or some processing has been performed. Although such a construct can often be coded by using a structured loop statement (for example, for (;;)) together with break and continue, in practice the coder's intent is sometimes better communicated by using goto. A single label, almost invariably named again or retry, is used as the goto target. The example in Figure 2.7,[36] which locates the entry of a specific service in the system's database while ignoring comments and overly large lines, is a typical case. (Interestingly, the code example also seems to contain a bug. If a partial line is read, it continues by reading the remainder as if it were a fresh line, so that if the tail of a long line happened to look like a service definition it would be used. Such oversights are common targets for computer security exploits.)
[36] netbsdsrc/lib/libc/net/getservent.c:65–104
Finally, you will find the goto statement used to change the flow of control in nested loop and switch statements instead of using break and continue, which affect only the control flow in the innermost loop. Sometimes goto is used even if the nesting level would allow the use of a break or continue statement. This is used in large, complex loops to clarify where the flow of control will go and to avoid the possibility of errors should a nested loop be added around a particular break or continue statement. In the example in Figure 2.8[37] the statement goto have–msg is used instead of break to exit the for loop.
[37] netbsdsrc/sys/dev/ic/ncr5380sbc.c:1575–1654
Exercise 2.23 Locate five instances of code that use the goto statement in the code base. Categorize its use (try to locate at least one instance for every one of the possible uses we outlined), and argue whether each particular goto could and should be replaced with a loop or other statement.
Figure 2.8 Exiting a loop using the goto statement.
<-- a
for (;;) {
[...]
if ((sc->sc_state & NCR_DROP_MSGIN) == 0) {
if (n >= NCR_MAX_MSG_LEN) {
ncr_sched_msgout(sc, SEND_REJECT);
sc->sc_state |= NCR_DROP_MSGIN;
} else {
[...]
if (n == 1 && IS1BYTEMSG(sc->sc_imess[0]))
goto have_msg; <-- b
[...]
}
}
[...]
}
have_msg: <-- c
(a) for loop
(b) Exit the for loop
(c) goto target
Exercise 2.24 The function getstops produces the same error message for a number of different errors. Describe how you could make its error reporting more user-friendly while at the same time eliminating the use of the goto statement. Discuss when such source code changes are appropriate and when they should be avoided.
2.9 Refactoring in the Small
The rest of the getstops code is relatively straightforward. After checking that each tab stop is greater than the previous one (Figure 2.5:4), the tab stop offset is stored in the tabstops array. After a single tab stop number has been converted into an integer (Figure 2.5:2), cp will point to the first nondigit character in the string (that is, the loop will process all digits and terminate at the first nondigit). At that point, a series of checks specified by if statements control the program's operation. If cp points to the end of the tab stop specification string (the character with the value 0, which signifies the end of a C string), then the loop will terminate (Figure 2.5:5). The last if (Figure 2.5:6) will check for invalid delimiters and terminate the program operation (using the goto bad statement) if one is found.
The body of each one of the if statements will transfer control somewhere else via a goto or break statement. Therefore, we can also read the sequence as:
if (*cp == 0)
break;
else if (*cp != ',' && *cp != ' ')
goto bad;
else
cp++;
This change highlights the fact that only one of the three statements will ever get executed and makes the code easier to read and reason about. If you have control over a body of code (that is, it is not supplied or maintained by an outside vendor or an open-source group), you can profit by reorganizing code sections to make them more readable. This improvement of the code's design after it has been written is termed refactoring. Start with small changes such as the one we outlined—you can find more than 70 types of refactoring changes described in the relevant literature. Modest changes add up and often expose larger possible improvements.
As a further example, consider the following one-line gem.[38]
[38] netbsdsrc/games/worms/worms.c:419
op = &(!x ? (!y ? upleft : (y == bottom ? lowleft : left)) :
(x == last ? (!y ? upright : (y == bottom ? lowright : right)) :
(!y ? upper : (y == bottom ? lower : normal))))[w->orientation];
The code makes excessive use of the conditional operator ?:. Read expressions using the conditional operator like if code. As an example, read the expression[39]
[39] netbsdsrc/bin/csh/set.c:852
sign ? -n : n
as follows:
"If sign is true, then the value of the expression is -n; otherwise, the value of the expression is n".
Since we read an expression like an if statement, we can also format it like an if statement; one that uses x ? instead of if (x), parentheses instead of curly braces, and : instead of else. To reformat the expression, we used the indenting features of our editor in conjunction with its ability to show matching parentheses. You can see the result in Figure 2.9 (left).
Reading the conditional expression in its expanded form is certainly easier, but there is still room for improvement. At this point we can discern that the x and y variables that control the expression evaluation are tested for three different values:
0 (expressed as !x or !y)
bottom or last
All other values
Figure 2.9. A conditional expression formatted like an if statement (left) and like cascading if–else statements (right). op = &(
!x ? (
!y ?
upleft
: (
y == bottom ?
lowleft
:
left
)
) : (
x == last ? (
!y ?
upright
: (
y == bottom ?
lowright
:
right
)
) : (
!y ?
upper
: (
y == bottom ?
lower
:
normal
)
)
))[w->orientation];
op = &(
!x ? (
!y ?
upleft
: ( y == bottom ?
lowleft
:
left
)
) : ( x == last ? (
!y ?
upright
: ( y == bottom ?
lowright
:
right
)
) : (
!y ?
upper
: ( y == bottom ?
lower
:
normal
)
)
))[w->orientation];
We can therefore rewrite the expression formatted as a series of cascading if–else statements (expressed using the ?: operator) to demonstrate this fact. You can see the result in Figure 2.9 (right).
The expression's intent now becomes clear: the programmer is selecting one of nine different location values based on the combined values of x and y. Both alternative formulations, however, visually emphasize the punctuation at the expense of the semantic content and use an inordinate amount of vertical space. Nevertheless, based on our newly acquired insight, we can create a two-dimensional array containing these location values and index it using offsets we derive from the x and y values. You can see the new result in Figure 2.10. Notice how in the initialization of the array named locations, we use a two-dimensional textual structure to illustrate the two-dimensional nature of the computation being performed. The initializers are laid out two-dimensionally in the program text, the array is indexed in the normally unconventional order [y][x], and the mapping is to integers "0, 2, 1" rather than the more obvious "0, 1, 2", so as to make the two-dimensional presentation coincide with the semantic meanings of the words upleft, upper, and so on.
Figure 2.10 Location detection code replacing the conditional expression.
struct options *locations[3][3] = {
<-- a
{upleft, upper, upright},
{left, normal, right},
{lowleft, lower, lowright},
};
int xlocation, ylocation; <-- b
<-- c
if (x == 0)
xlocation = 0;
else if (x == last)
xlocation = 2;
else
xlocation = 1;
<-- d
if (y == 0)
ylocation = 0;
else if (y == bottom)
ylocation = 2;
else
ylocation = 1;
op = &(locations[ylocation][xlocation])[w->orientation];
(a) Location map
(b) To store the x, y map offsets
(c) Determine x offset
(d) Determine y offset
The code, at 20 lines, is longer than the original one-liner but still shorter by 7 lines from the one-liner's readable cascading-else representation. In our eyes it appears more readable, self-documenting, and easier to verify. One could argue that the original version would execute faster than the new one. This is based on the fallacy that code readability and efficiency are somehow incompatible. There is no need to sacrifice code readability for efficiency. While it is true that efficient algorithms and certain optimizations can make the code more complicated and therefore more difficult to follow, this does not mean that making the code compact and unreadable will make it more efficient. On our system and compiler the initial and final versions of the code execute at exactly the same speed: 0.6 ms. Even if there were speed differences, the economics behind software maintenance costs, programmer salaries, and CPU performance most of the time favor code readability over efficiency.
However, even the code in Figure 2.10 can be considered a mixed blessing: it achieves its advantages at the expense of two distinct disadvantages. First, it separates the code into two chunks that, while shown together in Figure 2.10, would necessarily be separated in real code. Second, it introduces an extra encoding (0, 1, 2), so that understanding what the code is doing requires two mental steps rather than one (map "0, last, other" to "0, 2, 1" and then map a pair of "0, 2, 1" values to one of nine items). Could we somehow directly introduce the two-dimensional structure of our computation into the conditional code? The following code fragment[40] reverts to conditional expressions but has them carefully laid out to express the computation's intent.
[40] Suggested by Guy Steele.
op =
&( !y ? (!x ? upleft : x!=last ? upper : upright) :
y!=bottom ? (!x ? left : x!=last ? normal : right) :
(!x ? lowleft : x!=last ? lower : lowright)
)[w->orientation];
The above formulation is a prime example on how sometimes creative code layout can be used to improve code readability. Note that the nine values are right-justified within their three columns, to make them stand out visually and to exploit the repetition of "left" and "right" in their names. Note also that the usual practice of putting spaces around operators is eschewed for the case of != in order to reduce the test expressions to single visual tokens, making the nine data values stand out more. Finally, the fact that the whole expression fits in five lines makes the vertical alignment of the first and last parentheses more effective, making it much easier to see that the basic structure of the entire statement is of the form
op = &( )[w->orientation];
The choice between the two new alternative representations is largely a matter of taste; however, we probably would not have come up with the second formulation without expressing the code in the initial, more verbose and explicit form.
The expression we rewrote was extremely large and obviously unreadable. Less extreme cases can also benefit from some rewriting. Often you can make an expression more readable by adding whitespace, by breaking it up into smaller parts by means of temporary variables, or by using parentheses to amplify the precedence of certain operators.
You do not always need to change the program structure to make it more readable. Often items that do not affect the program's operation (such as comments, the use of whitespace, and the choice of variable, function, and class names) can affect the program's readability. Consider the work we did to understand the code for the getstops function. A concise comment before the function definition would enhance the program's future readability.
/*
* Parse and verify the tab stop specification pointed to by cp
* setting the global variables nstops and tabstops[].
* Exit the program with an error message on bad specifications.
*/
When reading code under your control, make it a habit to add comments as needed.
In Sections 2.2 and 2.3 we explained how names and indentation can provide hints for understanding code functionality. Unfortunately, sometimes programmers choose unhelpful names and indent their programs inconsistently. You can improve the readability of poorly written code with better indentation and wise choice of variable names. These measures are extreme: apply them only when you have full responsibility and control over the source code, you are sure that your changes are a lot better than the original code, and you can revert to the original code if something goes wrong. Using a version management system such as the Revision Control System (RCS), the Source Code Control System (SCCS), the Concurrent Versions System (CVS), or Microsoft's Visual SourceSafe can help you control the code modifications. The adoption of a specific style for variable names and indentation can appear a tedious task. When modifying code that is part of a larger body to make it more readable, try to understand and follow the conventions of the rest of the code (see Chapter 7). Many organizations have a specific coding style; learn it and try to follow it. Otherwise, adopt one standard style (such as one of those used by the GNU[41] or BSD[42] groups) and use it consistently. When the code indentation is truly inconsistent and cannot be manually salvaged, a number of tools (such as indent) can help you automatically reindent it to make it more readable (see Section 10.7). Use such tools with care: the judicious use of whitespace allows programmers to provide visual clues that are beyond the abilities of automated formatting tools. Applying indent to the code example in Figure 2.10 would definitely make it less readable.
[41] http://www.gnu.org/prep/standardstoc.html
[42] netbsdsrc/share/misc/style:1–315
Keep in mind that although reindenting code may help readability, it also messes up the program's change history in the revision control system. For this reason it is probably best not to combine the reformatting with any actual changes to the program's logic. Do the reformat, check it in, and then make the other changes. In this way future code readers will be able to selectively retrieve and review your changes to the program's logic without getting overwhelmed by the global formatting changes. On the flip side of the coin, when you are examining a program revision history that spans a global reindentation exercise using the diff program, you can often avoid the noise introduced by the changed indentation levels by specifying the -w option to have diff ignore whitespace differences.
Exercise 2.25 Provide five examples from your environment or the book's CD-ROM where the code structure can be improved to make it more readable.
Exercise 2.26 You can find tens of intentionally unreadable C programs at the International Obfuscated C Code Contest Web site.[43] Most of them use several layers of obfuscation to hide their algorithms. See how gradual code changes can help you untangle their code. If you are not familiar with the C preprocessor, try to avoid programs with a large number of #define lines.
[43] http://www.ioccc.org
Exercise 2.27 Modify the position location code we examined to work on the mirror image of a board (interchange the right and left sides). Time yourself in modifying the original code and the final version listed in Figure 2.10. Do not look at the readable representations; if you find them useful, create them from scratch. Calculate the cost difference assuming current programmer salary rates (do not forget to add overheads). If the readable code runs at half the speed of the original code (it does not), calculate the cost of this slowdown by making reasonable assumptions concerning the number of times the code will get executed over the lifetime of a computer bought at a given price.
Exercise 2.28 If you are not familiar with a specific coding standard, locate one and adopt it. Verify local code against the coding standard.
The body of the for loop in the getstops function starts with a block of code that can appear cryptic at first sight (Figure 2.5:2). To understand it we need to dissect the expressions that comprise it. The first, the condition in the while loop, is comparing *cp (the character cp points to) against two characters: ´0´ and ´9´. Both comparisons must be true and both of them involve *cp combined with a different inequality operator and another expression. Such a test can often be better understood by rewriting the comparisons to bring the value being compared in the middle of the expression and to arrange the other two values in ascending order. This rewriting in our case would yield
while ( '\'0' <= *cp && *cp <= '9')
This can then be read as a simple range membership test for a character c.
0 c 9
Note that this test assumes that the digit characters are arranged sequentially in ascending order in the underlying character set. While this is true for the digits in all character sets we know, comparisons involving alphabetical characters may yield surprising results in a number of character sets and locales. Consider the following typical example.[31]
[31] netbsdsrc/games/hack/hack.objnam.c:352–253
if ('a' <= *s && *s <= 'z')
*s -= ( 'a' - 'A');
the code attempts to convert lowercase characters to uppercase by subtracting from each character found to be lowercase (as determined by the if test) the character set distance from ´a´ to ´A´. This fragment will fail to work when there are lowercase characters in character set positions outside the range a...z, when the character set range a...z contains non lowercase characters, and when the code of each lower case character is not a fixed distance away from the corresponding uppercase character. Many non-ASCII character sets exhibit at least one of these problems.
The next line in the block (Figure 2.5:2) can also appear daunting. It modifies the variable i based on the values of i and *cp and two constants: 10 and ´0´ while at the same time incrementing cp. The variable names are not especially meaningful, and the program author has not used macro or constant definitions to document the constants; we have to make the best of the information available.
We can often understand the meaning of an expression by applying it on sample data. In our case we can work based on the initial value of i (0) and assume that cp points to a string containing a number (for example, 24) based on our knowledge of the formatting specifications that expand accepts. We can then create a table containing the values of all variables and expression parts as each expression part is evaluated. We use the notation i' and *cp' to denote the variable value after the expression has been evaluated.
Iteration
i
i*10
*cp
*cp-'0'
i'
*cp'
0
0
0
'2'
2
2
'4'
1
2
20
'4'
4
24
0
The expression*cp - '0' uses a common idiom: by subtracting the ordinal value of '0' from *cp the expression yields the integer value of the character digit pointed to by *cp. Based on the table we can now see a picture emerging: after the loop terminates, i will contain the decimal value of the numeric string pointed to by cp at the beginning of the loop.
Armed with the knowledge of what i stands for (the integer value of a tab-stop specification), we can now examine the lines that verify the specification. The expression that verifies i for reasonable values is straightforward. It is a Boolean expression based on the logical OR (||) of two other expressions. Although this particular expression reads naturally as English text (print an error message if i is either less than or equal to zero, or greater than 255), it is sometimes useful to transform Boolean expressions to a more readable form. If, for example, we wanted to translate the expression into the range membership expression we used above, we would need to substitute the logical OR with a logical AND (&&). This can easily be accomplished by using De Morgan's rules.[32]
[32] We use the operator <=> to denote that two expressions are equivalent. This is not a C/C++/C#/Java operator.
!(a || b) <=> !a && !b
!(a && b) <=> !a || !b
We can thus transform the testing code as follows:
i <= 0 || i > 256 <=>
!(!(i <= 0) && !(i > 256)) <=>
!(i > 0 && i <= 256) <=>
!(0 < i && i <= 256) <=>
¬(0 < i 256)+
In our example we find both the initial and final expressions equally readable; in other cases you may find that De Morgan's rules provide you a quick and easy way to disentangle a complicated logical expression.
When reading Boolean expressions, keep in mind that in many modern languages Boolean expressions are evaluated only to the extent needed. In a sequence of expressions joined with the && operator (a conjunction), the first expression to evaluate to false will terminate the evaluation of the whole expression yielding a false result. Similarly, in a sequence of expressions joined with the || operator (a disjunction), the first expression to evaluate to true will terminate the evaluation of the whole expression yielding a true result. Many expressions are written based on this short-circuit evaluation property and should be read in the same way. When reading a conjunction, you can always assume that the expressions on the left of the expression you are examining are true; when reading a disjunction, you can similarly assume that the expressions on the left of the expression you are examining are false. As an example, the expression in the following if statement will become true only when all its constituent elements are true, and t->type will be evaluated only when t is not NULL.[33]
[33] netbsdsrc/bin/ksh/main.c:606–607
if (t != NULL && t->type != TEOF && interactive && really–exit)
really–exit = 0;
Conversely, in the following example argv[1] will be checked for being NULL only if argv is not NULL .[34]
[34] netbsdsrc/lib/libedit/term.c:1212–1213
if (argv == NULL || argv[1] == NULL || argv[2] == NULL)
return -1;
In both cases, the first check guards against the subsequent dereference of a NULL pointer. Our getstops function also uses short-circuit evaluation when checking that a delimiter specified (i) is larger than the previous one (tabstops[nstops-1]) (Figure 2.5:4). This test will be performed only if at least one additional delimiter specification has been processed (nstops > 0). You can depend on the short-circuit evaluation property in most C-derived languages such as C++, Perl, and Java; on the other hand, Fortran, Pascal, and most Basic dialects will always evaluate all elements of a Boolean expression.
Exercise 2.20 Locate expressions containing questionable assumptions about character code values in the book's CD-ROM. Read about the Java Character class test and conversion methods such as isUpper and toLowerCase or the corresponding ctype family of C functions (isupper, tolower , and so on). Propose changes to make the code less dependent on the target architecture character set.
Exercise 2.21 Find, simplify, and reason about five nontrivial Boolean expressions in the source code base. Do not spend time on understanding what the expression elements mean; concentrate on the conditions that will make the expression become true or false. Where possible, identify and use the properties of short-circuit evaluation.
Exercise 2.22 Locate and reason about five nontrivial integer or character expressions in the source code base. Try to minimize the amount of code you need to comprehend in order to reason about each expression.
Figure 2.6 The goto statement used for a common error handler.
static int
gen_init(void)
{
[...]
if ((sigaction(SIGXCPU, &n_hand, &o_hand) < 0) &&
(o_hand.sa_handler == SIG_IGN) &&
(sigaction(SIGXCPU, &o_hand, &o_hand) < 0))
goto out;
n_hand.sa_handler = SIG_IGN;
if ((sigaction(SIGPIPE, &n_hand, &o_hand) < 0) ||
(sigaction(SIGXFSZ, &n_hand, &o_hand) < 0))
goto out;
return(0);
out:
syswarn(1, errno, "Unable to set up signal handler");
return(-1);
}
Failure; exit with an error
Failure; exit with an error
Normal function exit (success)
Common error handling code
2.8 goto Statements
The code segment that complains about unreasonable tab specifications (Figure 2.5:3) begins with a word followed by a colon. This is a label: the target of a goto instruction. Labels and goto statements should immediately raise your defenses when reading code. They can be easily abused to create "spaghetti" code: code with a flow of control that is difficult to follow and figure out. Therefore, the designers of Java decided not to support the goto statement. Fortunately, most modern programs use the goto statement in a small number of specific circumstances that do not adversely affect the program's structure.
You will find goto often used to exit a program or a function after performing some actions (such as printing an error message or freeing allocated resources). In our example the exit(1) call at the end of the block will terminate the program, returning an error code (1) to the system shell. Therefore all goto statements leading to the bad label are simply a shortcut for terminating the program after printing the error message. In a similar manner, the listing in Figure 2.6[35] illustrates how a common error handler (Figure 2.6:4) is used as a common exit point in all places where an error is found (Figure 2.6:1, Figure 2.6:2). A normal exit route for the function, located before the error handler (Figure 2.6:3), ensures that the handler will not get called when no error occurs.
[35] netbsdsrc/bin/pax/pax.c:309–412
Figure 2.7 The use of goto to reexecute code.
again:
if ((p = fgets(line, BUFSIZ, servf)) == NULL) <-- a
return (NULL);
if (*p == '#') <-- b
goto again;
cp = strpbrk(p, "#\n");
if (cp == NULL) <-- c
goto again;
*cp = '\0'; <-- d
[...]
return (&serv);
(a) Read a line; return on EOF
(b) Comment? Retry
(c) Incomplete line? Retry
(d) Complete entry
You will also find the goto statement often used to reexecute a portion of code, presumably after some variables have changed value or some processing has been performed. Although such a construct can often be coded by using a structured loop statement (for example, for (;;)) together with break and continue, in practice the coder's intent is sometimes better communicated by using goto. A single label, almost invariably named again or retry, is used as the goto target. The example in Figure 2.7,[36] which locates the entry of a specific service in the system's database while ignoring comments and overly large lines, is a typical case. (Interestingly, the code example also seems to contain a bug. If a partial line is read, it continues by reading the remainder as if it were a fresh line, so that if the tail of a long line happened to look like a service definition it would be used. Such oversights are common targets for computer security exploits.)
[36] netbsdsrc/lib/libc/net/getservent.c:65–104
Finally, you will find the goto statement used to change the flow of control in nested loop and switch statements instead of using break and continue, which affect only the control flow in the innermost loop. Sometimes goto is used even if the nesting level would allow the use of a break or continue statement. This is used in large, complex loops to clarify where the flow of control will go and to avoid the possibility of errors should a nested loop be added around a particular break or continue statement. In the example in Figure 2.8[37] the statement goto have–msg is used instead of break to exit the for loop.
[37] netbsdsrc/sys/dev/ic/ncr5380sbc.c:1575–1654
Exercise 2.23 Locate five instances of code that use the goto statement in the code base. Categorize its use (try to locate at least one instance for every one of the possible uses we outlined), and argue whether each particular goto could and should be replaced with a loop or other statement.
Figure 2.8 Exiting a loop using the goto statement.
<-- a
for (;;) {
[...]
if ((sc->sc_state & NCR_DROP_MSGIN) == 0) {
if (n >= NCR_MAX_MSG_LEN) {
ncr_sched_msgout(sc, SEND_REJECT);
sc->sc_state |= NCR_DROP_MSGIN;
} else {
[...]
if (n == 1 && IS1BYTEMSG(sc->sc_imess[0]))
goto have_msg; <-- b
[...]
}
}
[...]
}
have_msg: <-- c
(a) for loop
(b) Exit the for loop
(c) goto target
Exercise 2.24 The function getstops produces the same error message for a number of different errors. Describe how you could make its error reporting more user-friendly while at the same time eliminating the use of the goto statement. Discuss when such source code changes are appropriate and when they should be avoided.
2.9 Refactoring in the Small
The rest of the getstops code is relatively straightforward. After checking that each tab stop is greater than the previous one (Figure 2.5:4), the tab stop offset is stored in the tabstops array. After a single tab stop number has been converted into an integer (Figure 2.5:2), cp will point to the first nondigit character in the string (that is, the loop will process all digits and terminate at the first nondigit). At that point, a series of checks specified by if statements control the program's operation. If cp points to the end of the tab stop specification string (the character with the value 0, which signifies the end of a C string), then the loop will terminate (Figure 2.5:5). The last if (Figure 2.5:6) will check for invalid delimiters and terminate the program operation (using the goto bad statement) if one is found.
The body of each one of the if statements will transfer control somewhere else via a goto or break statement. Therefore, we can also read the sequence as:
if (*cp == 0)
break;
else if (*cp != ',' && *cp != ' ')
goto bad;
else
cp++;
This change highlights the fact that only one of the three statements will ever get executed and makes the code easier to read and reason about. If you have control over a body of code (that is, it is not supplied or maintained by an outside vendor or an open-source group), you can profit by reorganizing code sections to make them more readable. This improvement of the code's design after it has been written is termed refactoring. Start with small changes such as the one we outlined—you can find more than 70 types of refactoring changes described in the relevant literature. Modest changes add up and often expose larger possible improvements.
As a further example, consider the following one-line gem.[38]
[38] netbsdsrc/games/worms/worms.c:419
op = &(!x ? (!y ? upleft : (y == bottom ? lowleft : left)) :
(x == last ? (!y ? upright : (y == bottom ? lowright : right)) :
(!y ? upper : (y == bottom ? lower : normal))))[w->orientation];
The code makes excessive use of the conditional operator ?:. Read expressions using the conditional operator like if code. As an example, read the expression[39]
[39] netbsdsrc/bin/csh/set.c:852
sign ? -n : n
as follows:
"If sign is true, then the value of the expression is -n; otherwise, the value of the expression is n".
Since we read an expression like an if statement, we can also format it like an if statement; one that uses x ? instead of if (x), parentheses instead of curly braces, and : instead of else. To reformat the expression, we used the indenting features of our editor in conjunction with its ability to show matching parentheses. You can see the result in Figure 2.9 (left).
Reading the conditional expression in its expanded form is certainly easier, but there is still room for improvement. At this point we can discern that the x and y variables that control the expression evaluation are tested for three different values:
0 (expressed as !x or !y)
bottom or last
All other values
Figure 2.9. A conditional expression formatted like an if statement (left) and like cascading if–else statements (right). op = &(
!x ? (
!y ?
upleft
: (
y == bottom ?
lowleft
:
left
)
) : (
x == last ? (
!y ?
upright
: (
y == bottom ?
lowright
:
right
)
) : (
!y ?
upper
: (
y == bottom ?
lower
:
normal
)
)
))[w->orientation];
op = &(
!x ? (
!y ?
upleft
: ( y == bottom ?
lowleft
:
left
)
) : ( x == last ? (
!y ?
upright
: ( y == bottom ?
lowright
:
right
)
) : (
!y ?
upper
: ( y == bottom ?
lower
:
normal
)
)
))[w->orientation];
We can therefore rewrite the expression formatted as a series of cascading if–else statements (expressed using the ?: operator) to demonstrate this fact. You can see the result in Figure 2.9 (right).
The expression's intent now becomes clear: the programmer is selecting one of nine different location values based on the combined values of x and y. Both alternative formulations, however, visually emphasize the punctuation at the expense of the semantic content and use an inordinate amount of vertical space. Nevertheless, based on our newly acquired insight, we can create a two-dimensional array containing these location values and index it using offsets we derive from the x and y values. You can see the new result in Figure 2.10. Notice how in the initialization of the array named locations, we use a two-dimensional textual structure to illustrate the two-dimensional nature of the computation being performed. The initializers are laid out two-dimensionally in the program text, the array is indexed in the normally unconventional order [y][x], and the mapping is to integers "0, 2, 1" rather than the more obvious "0, 1, 2", so as to make the two-dimensional presentation coincide with the semantic meanings of the words upleft, upper, and so on.
Figure 2.10 Location detection code replacing the conditional expression.
struct options *locations[3][3] = {
<-- a
{upleft, upper, upright},
{left, normal, right},
{lowleft, lower, lowright},
};
int xlocation, ylocation; <-- b
<-- c
if (x == 0)
xlocation = 0;
else if (x == last)
xlocation = 2;
else
xlocation = 1;
<-- d
if (y == 0)
ylocation = 0;
else if (y == bottom)
ylocation = 2;
else
ylocation = 1;
op = &(locations[ylocation][xlocation])[w->orientation];
(a) Location map
(b) To store the x, y map offsets
(c) Determine x offset
(d) Determine y offset
The code, at 20 lines, is longer than the original one-liner but still shorter by 7 lines from the one-liner's readable cascading-else representation. In our eyes it appears more readable, self-documenting, and easier to verify. One could argue that the original version would execute faster than the new one. This is based on the fallacy that code readability and efficiency are somehow incompatible. There is no need to sacrifice code readability for efficiency. While it is true that efficient algorithms and certain optimizations can make the code more complicated and therefore more difficult to follow, this does not mean that making the code compact and unreadable will make it more efficient. On our system and compiler the initial and final versions of the code execute at exactly the same speed: 0.6 ms. Even if there were speed differences, the economics behind software maintenance costs, programmer salaries, and CPU performance most of the time favor code readability over efficiency.
However, even the code in Figure 2.10 can be considered a mixed blessing: it achieves its advantages at the expense of two distinct disadvantages. First, it separates the code into two chunks that, while shown together in Figure 2.10, would necessarily be separated in real code. Second, it introduces an extra encoding (0, 1, 2), so that understanding what the code is doing requires two mental steps rather than one (map "0, last, other" to "0, 2, 1" and then map a pair of "0, 2, 1" values to one of nine items). Could we somehow directly introduce the two-dimensional structure of our computation into the conditional code? The following code fragment[40] reverts to conditional expressions but has them carefully laid out to express the computation's intent.
[40] Suggested by Guy Steele.
op =
&( !y ? (!x ? upleft : x!=last ? upper : upright) :
y!=bottom ? (!x ? left : x!=last ? normal : right) :
(!x ? lowleft : x!=last ? lower : lowright)
)[w->orientation];
The above formulation is a prime example on how sometimes creative code layout can be used to improve code readability. Note that the nine values are right-justified within their three columns, to make them stand out visually and to exploit the repetition of "left" and "right" in their names. Note also that the usual practice of putting spaces around operators is eschewed for the case of != in order to reduce the test expressions to single visual tokens, making the nine data values stand out more. Finally, the fact that the whole expression fits in five lines makes the vertical alignment of the first and last parentheses more effective, making it much easier to see that the basic structure of the entire statement is of the form
op = &(
The choice between the two new alternative representations is largely a matter of taste; however, we probably would not have come up with the second formulation without expressing the code in the initial, more verbose and explicit form.
The expression we rewrote was extremely large and obviously unreadable. Less extreme cases can also benefit from some rewriting. Often you can make an expression more readable by adding whitespace, by breaking it up into smaller parts by means of temporary variables, or by using parentheses to amplify the precedence of certain operators.
You do not always need to change the program structure to make it more readable. Often items that do not affect the program's operation (such as comments, the use of whitespace, and the choice of variable, function, and class names) can affect the program's readability. Consider the work we did to understand the code for the getstops function. A concise comment before the function definition would enhance the program's future readability.
/*
* Parse and verify the tab stop specification pointed to by cp
* setting the global variables nstops and tabstops[].
* Exit the program with an error message on bad specifications.
*/
When reading code under your control, make it a habit to add comments as needed.
In Sections 2.2 and 2.3 we explained how names and indentation can provide hints for understanding code functionality. Unfortunately, sometimes programmers choose unhelpful names and indent their programs inconsistently. You can improve the readability of poorly written code with better indentation and wise choice of variable names. These measures are extreme: apply them only when you have full responsibility and control over the source code, you are sure that your changes are a lot better than the original code, and you can revert to the original code if something goes wrong. Using a version management system such as the Revision Control System (RCS), the Source Code Control System (SCCS), the Concurrent Versions System (CVS), or Microsoft's Visual SourceSafe can help you control the code modifications. The adoption of a specific style for variable names and indentation can appear a tedious task. When modifying code that is part of a larger body to make it more readable, try to understand and follow the conventions of the rest of the code (see Chapter 7). Many organizations have a specific coding style; learn it and try to follow it. Otherwise, adopt one standard style (such as one of those used by the GNU[41] or BSD[42] groups) and use it consistently. When the code indentation is truly inconsistent and cannot be manually salvaged, a number of tools (such as indent) can help you automatically reindent it to make it more readable (see Section 10.7). Use such tools with care: the judicious use of whitespace allows programmers to provide visual clues that are beyond the abilities of automated formatting tools. Applying indent to the code example in Figure 2.10 would definitely make it less readable.
[41] http://www.gnu.org/prep/standardstoc.html
[42] netbsdsrc/share/misc/style:1–315
Keep in mind that although reindenting code may help readability, it also messes up the program's change history in the revision control system. For this reason it is probably best not to combine the reformatting with any actual changes to the program's logic. Do the reformat, check it in, and then make the other changes. In this way future code readers will be able to selectively retrieve and review your changes to the program's logic without getting overwhelmed by the global formatting changes. On the flip side of the coin, when you are examining a program revision history that spans a global reindentation exercise using the diff program, you can often avoid the noise introduced by the changed indentation levels by specifying the -w option to have diff ignore whitespace differences.
Exercise 2.25 Provide five examples from your environment or the book's CD-ROM where the code structure can be improved to make it more readable.
Exercise 2.26 You can find tens of intentionally unreadable C programs at the International Obfuscated C Code Contest Web site.[43] Most of them use several layers of obfuscation to hide their algorithms. See how gradual code changes can help you untangle their code. If you are not familiar with the C preprocessor, try to avoid programs with a large number of #define lines.
[43] http://www.ioccc.org
Exercise 2.27 Modify the position location code we examined to work on the mirror image of a board (interchange the right and left sides). Time yourself in modifying the original code and the final version listed in Figure 2.10. Do not look at the readable representations; if you find them useful, create them from scratch. Calculate the cost difference assuming current programmer salary rates (do not forget to add overheads). If the readable code runs at half the speed of the original code (it does not), calculate the cost of this slowdown by making reasonable assumptions concerning the number of times the code will get executed over the lifetime of a computer bought at a given price.
Exercise 2.28 If you are not familiar with a specific coding standard, locate one and adopt it. Verify local code against the coding standard.
Wednesday, December 26, 2007
Functions and Global Variables
2.2 Functions and Global Variables
The program expand processes the files named as its arguments (or its standard input if no file arguments are specified) by expanding hard tab characters(\t, ASCII character9) to a number of spaces. The default behavior is to set tab stops every eight characters; this can be overridden by a comma or space-separated numeric list specified using the -t option. An interesting aspect of the program's implementation, and the reason we are examining it, is that it uses all of the control flow statements available in the C family of languages. Figure 2.2 contains the variable and function declarations of expand,[10] Figure 2.3 contains the main code body,[11] and Figure 2.5 (in Section 2.5) contains the two supplementary functions used.[12]
[10] netbsdsrc/usr.bin/expand/expand.c:36–62
[11] netbsdsrc/usr.bin/expand/expand.c:64–151
[12] netbsdsrc/usr.bin/expand/expand.c:153–185
When examining a nontrivial program, it is useful to first identify its major constituent parts. In our case, these are the global variables (Figure 2.2:1) and the functions main (Figure 2.3), getstops (see Figure 2.5:1), and usage (see Figure 2.5:8).
The integer variable nstops and the array of integers tabstops are declared as global variables, outside the scope of function blocks. They are therefore visible to all functions in the file we are examining.
The three function declarations that follow (Figure 2.2:2) declare functions that will appear later within the file. Since some of these functions are used before they are defined, in C/C++ programs the declarations allow the compiler to verify the arguments passed to the function and their return values and generate correct corresponding code. When no forward declarations are given, the C compiler will make assumptions about the function return type and the arguments when the function is first used; C++ compilers will flag such cases as errors. If the following function definition does not match these assumptions, the compiler will issue a warning or error message. However, if the wrong declaration is supplied for a function defined in another file, the program may compile without a problem and fail at runtime.
Figure 2.2 Expanding tab stops (declarations).
<-- a
#include
#include
#include
#include
#include
int nstops;
int tabstops[100];
static void getstops(char *);
int main(int, char *);
static void usage (void);
(a) Header files
Global variables
Forward function declarations
Notice how the two functions are declared as static while the variables are not. This means that the two functions are visible only within the file, while the variables are potentially visible to all files comprising the program. Since expand consists only of a single file, this distinction is not important in our case. Most linkers that combine compiled C files are rather primitive; variables that are visible to all program files (that is, not declared as static) can interact in surprising ways with variables with the same name defined in other files. It is therefore a good practice when inspecting code to ensure that all variables needed only in a single file are declared as static.
Let us now look at the functions comprising expand. To understand what a function (or method) is doing you can employ one of the following strategies.
Guess, based on the function name.
Read the comment at the beginning of the function.
Examine how the function is used.
Read the code in the function body.
Consult external program documentation.
In our case we can safely guess that the function usage will display program usage information and then exit; many command-line programs have a function with the same name and functionality. When you examine a large body of code, you will gradually pick up names and naming conventions for variables and functions. These will help you correctly guess what they do. However, you should always be prepared to revise your initial guesses following new evidence that your code reading will inevitably unravel. In addition, when modifying code based on guesswork, you should plan the process that will verify your initial hypotheses. This process can involve checks by the compiler, the introduction of assertions, or the execution of appropriate test cases.
Figure 2.3 Expanding tab stops (main part).
int
main(int argc, char *argv)
{
int c, column;
int n;
while ((c = getopt (argc, argv, "t:")) != -1) {
switch (c) {
case 't':
getstops(optarg);
break;
case '?': default: <-- a
usage();
}
}
argc -= optind;
argv += optind;
do {
if (argc > 0) {
if (freopen(argv[0], "r", stdin) == NULL) {
perror(argv[0]);
exit(1);
}
argc--, argv++;
}
column = 0;
while ((c = getchar()) != EOF) {
switch (c) {
case '\t': <-- b
if (nstops == 0) {
do {
putchar(' ');
column++;
} while (column & 07);
continue;
}
if (nstops == 1) {
do {
putchar(' ');
column++;
} while (((column - 1) % tabstops[0]) != (tabstops[0] - 1));
continue;
}
for (n = 0; n < nstops; n++)
if (tabstops[n] > column)
break;
if (n == nstops) {
putchar(' ');
column++;
continue;
}
while (column < tabstops[n]) {
putchar(' ');
column++;
}
continue;
case '\b': <-- c
if (column)
column--;
putchar('\b');
continue;
default: <-- d
putchar(c);
column++;
continue;
case '\n': <-- e
putchar(c);
column = 0;
continue;
} <-- f
} <-- g
} while (argc > 0);) <-- h
exit(0);
}
Variables local to main
Argument processing using getopt
Process the -t option
(a) Switch labels grouped together
End of switch block
At least once
(7) Process remaining arguments
Read characters until EOF
(b) Tab character
Process next character
(c) Backspace
(d) All other characters
(e) Newline
(f) End of switch block
(g) End of while block
(h) End of do block
The role of getstops is more difficult to understand. There is no comment, the code in the function body is not trivial, and its name can be interpreted in different ways. Noting that it is used in a single part of the program (Figure 2.3:3) can help us further. The program part where getstops is used is the part responsible for processing the program's options (Figure 2.3:2). We can therefore safely (and correctly in our case) assume that getstops will process the tab stop specification option. This form of gradual understanding is common when reading code; understanding one part of the code can make others fall into place. Based on this form of gradual understanding you can employ a strategy for understanding difficult code similar to the one often used to combine the pieces of a jigsaw puzzle: start with the easy parts.
Exercise 2.7 Examine the visibility of functions and variables in programs in your environment. Can it be improved (made more conservative)?
Exercise 2.8 Pick some functions or methods from the book's CD-ROM or from your environment and determine their role using the strategies we outlined. Try to minimize the time you spend on each function or method. Order the strategies by their success rate.
2.4 switch Statements
The normal return values of getopt are handled by a switch statement. You will find switch statements used when a number of discrete integer or character values are being processed. The code to handle each value is preceded by a case label. When the value of the expression in the switch statement matches the value of one of the case labels, the program will start to execute statements from that point onward. If none of the label values match the expression value and a default label exists, control will transfer to that point; otherwise, no code within the switch block will get executed. Note that additional labels encountered after transferring execution control to a label will not terminate the execution of statements within the switch block; to stop processing code within the switch block and continue with statements outside it, a break statement must be executed. You will often see this feature used to group case labels together, merging common code elements. In our case when getopt returns 't', the statements that handle -t are executed, with break causing a transfer of execution control immediately after the closing brace of the switch block (Figure 2.3:4). In addition, we can see that the code for the default switch label and the error return value ´?´ is common since the two corresponding labels are grouped together.
When the code for a given case or default label does not end with a statement that transfers control out of the switch block (such as break, return, or continue), the program will continue to execute the statements following the next label. When examining code, look out for this error. In rare cases the programmer might actually want this behavior. To alert maintainers to that fact, it is common to mark these places with a comment, such as FALLTHROUGH, as in the following example.[17]
[17] netbsdsrc/bin/ls/ls.c:173–178
case 'a':
fts_options |= FTS–SEEDOT;
/* FALLTHROUGH */
case 'A':
f_listdot = 1;
break;
The code above comes from the option processing of the Unix ls command, which lists files in a directory. The option -A will include in the list files starting with a dot (which are, by convention, hidden), while the option -a modifies this behavior by adding to the list the two directory entries. Programs that automatically verify source code against common errors, such as the Unix lint command, can use the FALLTHROUGH comment to suppress spurious warnings.
A switch statement lacking a default label will silently ignore unexpected values. Even when one knows that only a fixed set of values will be processed by a switch statement, it is good defensive programming practice to include a default label. Such a default label can catch programming errors that yield unexpected values and alert the program maintainer, as in the following example.[18]
[18] netbsdsrc/usr.bin/at/at.c:535–561
switch (program) {
case ATQ:
[...]
case BATCH:
writefile(time(NULL), 'b');
break;
default:
panic("Internal error");
break;
}
In our case the switch statement can handle two getopt return values.
't' is returned to handle the -t option. Optind will point to the argument of -t. The processing is handled by calling the function getstops with the tab specification as its argument.
'?' is returned when an unknown option or another error is found by getopt. In that case the usage function will print program usage information and exit the program.
A switch statement is also used as part of the program's character-processing loop (Figure 2.3:7). Each character is examined and some characters (the tab, the newline, and the backspace) receive special processing.
Exercise 2.13 The code body of switch statements in the source code collection is formatted differently from the other statements. Express the formatting rule used, and explain its rationale.
Exercise 2.14 Examine the handling of unexpected values in switch statements in the programs you read. Propose changes to detect errors. Discuss how these changes will affect the robustness of programs in a production environment.
Exercise 2.15 Is there a tool or a compiler option in your environment for detecting missing break statements in switch code? Use it, and examine the results on some sample programs.
2.5 for Loops
To complete our understanding of how expand processes its command-line options, we now need to examine the getstops function. Although the role of its single cp argument is not obvious from its name, it becomes apparent when we examine how getstops is used. getstops is passed the argument of the -t option, which is a list of tab stops, for example, 4, 8, 16, 24. The strategies outlined for determining the roles of functions (Section 2.2) can also be employed for their arguments. Thus a pattern for reading code slowly emerges. Code reading involves many alternative strategies: bottom-up and top-down examination, the use of heuristics, and review of comments and external documentation should all be tried as the problem dictates.
After setting nstops to 0, getstops enters a for loop. Typically a for loop is specified by an expression to be evaluated before the loop starts, an expression to be evaluated before each iteration to determine if the loop body will be entered, and an expression to be evaluated after the execution of the loop body. for loops are often used to execute a body of code a specific number of times.[19]
[19] cocoon/src/java/org/apache/cocoon/util/StringUtils.java:85
for (i = 0; i < len; i++) {
Loops of this type appear very frequently in programs; learn to read them as "execute the body of code len times." On the other hand, any deviation from this style, such as an initial value other than 0 or a comparison operator other than <, should alert you to carefully reason about the loop's behavior. Consider the number of times the loop body is executed in the following examples.
Loop extrknt + 1 times:[20]
[20] netbsdsrc/usr.bin/fsplit/fsplit.c:173
for (i = 0; i <= extrknt; i++)
Loop month - 1 times:[21]
[21] netbsdsrc/usr.bin/cal/cal.c:332
for (i = 1; i < month; i++)
Loop nargs times:[22]
[22] netbsdsrc/usr.bin/apply/apply.c:130
for (i = 1; i <= nargs; i++)
Note that the last expression need not be an increment operator. The following line will loop 256 times, decrementing code in the process:[23]
[23] netbsdsrc/usr.bin/compress/zopen.c:510
for (code = 255; code >= 0; code--) {
In addition, you will find for statements used to loop over result sets returned by library functions. The following loop is performed for all files in the directory dir.[24]
[24] netbsdsrc/usr.bin/ftp/complete.c:193–198
if ((dd = opendir(dir)) == NULL)
return (CC_ERROR);
for (dp = readdir(dd); dp != NULL; dp = readdir(dd)) {
The call to opendir returns a value that can be passed to readdir to sequentially access each directory entry of dir. When there are no more entries in the directory, readdir will return NULL and the loop will terminate.
The three parts of the for specification are expressions and not statements. Therefore, if more than one operation needs to be performed when the loop begins or at the end of each iteration, the expressions cannot be grouped together using braces. You will, however, often find expressions grouped together using the expression-sequencing comma (,) operator.[25]
[25] netbsdsrc/usr.bin/vi/vi/vs smap.c:389
for (cnt = 1, t = p; cnt <= cnt–orig; ++t, ++cnt) {
The value of two expressions joined with the comma operator is just the value of the second expression. In our case the expressions are evaluated only for their side effects: before the loop starts, cnt will be set to 1 and t to p, and after every loop iteration t and cnt will be incremented by one.
Any expression of a for statement can be omitted. When the second expression is missing, it is taken as true. Many programs use a statement of the form for (;;) to perform an "infinite" loop. Very seldom are such loops really infinite. The following example—taken out of init, the program that continuously loops, controlling all Unix processes—is an exception.[26]
[26] netbsdsrc/sbin/init/init.c:540–545
Figure 2.5 Expanding tab stops (supplementary functions).
static void
getstops(char *cp)
{
int i;
nstops = 0;
for (;;) {
i = 0;
while (*cp >= '0' && *cp <= '9')
i = i * 10 + *cp++ - '0';
if (i <= 0 || i > 256) {
bad:
fprintf(stderr, "Bad tab stop spec\n");
exit(1);
}
if (nstops > 0 && i <= tabstops[nstops-1])
goto bad;
tabstops[nstops++] = i;
if (*cp == 0)
break;
if (*cp != ',' && *cp != ' ')
goto bad;
cp++;
}
}
static void
usage(void)
{
(void)fprintf (stderr, "usage: expand [-t tablist] [file ...]\n");
exit(1);
}
Parse tab stop specification
Convert string to number
Complain about unreasonable specifications
Verify ascending order
Break out of the loop
Verify valid delimiters
break will transfer control here
Print program usage and exit
for (;;) {
s = (state_t) (*s)();
quiet = 0;
}
In most cases an "infinite" loop is a way to express a loop whose exit condition(s) cannot be specified at its beginning or its end. These loops are typically exited either by a return statement that exits the function, a break statement that exits the loop body, or a call to exit or a similar function that exits the entire program. C++, C#, and Java programs can also exit such loops through an exception (see Section 5.2).
A quick look through the code of the loop in Figure 2.5 provides us with the possible exit routes.
A bad stop specification will cause the program to terminate with an error message (Figure 2.5:3).
The end of the tab specification string will break out of the loop.
Exercise 2.16 The for statement in the C language family is very flexible. Examine the source code provided to create a list of ten different uses.
Exercise 2.17 Express the examples in this section using while instead of for. Which of the two forms do you find more readable?
Exercise 2.18 Devise a style guideline specifying when while loops should be used in preference to for loops. Verify the guideline against representative examples from the book's CD-ROM.
2.6 break and continue Statements
A break statement will transfer the execution to the statement after the innermost loop or switch statement (Figure 2.5:7). In most cases you will find break used to exit early out of a loop. A continue statement will continue the iteration of the innermost loop without executing the statements to the end of the loop. A continue statement will reevaluate the conditional expression of while and do loops. In for loops it will evaluate the third expression and then the conditional expression. You will find continue used where a loop body is split to process different cases; each case typically ends with a continue statement to cause the next loop iteration. In the program we are examining, continue is used after processing each different input character class (Figure 2.3:8).
Note when you are reading Perl code that break and continue are correspondingly named last and next.[27]
[27] perl/lib/unicode/mktables.PL:415–425
while () {
chomp;
if (s/0x[\d\w]+\s+\((.*?)\)// and $wanted eq $1) {
[...]
last;
}
}
To determine the effect of a break statement, start reading the program upward from break until you encounter the first while, for, do,or switch block that encloses the break statement. Locate the first statement after that loop; this will be the place where control will transfer when break is executed. Similarly, when examining code that contains a continue statement, start reading the program upward from continue until you encounter the first while, for, or do loop that encloses the continue statement. Locate the last statement of that loop; immediately after it (but not outside the loop) will be the place where control will transfer when continue is executed. Note that continue ignores switch statements and that neither break nor continue affect the operation of if statements.
There are situations where a loop is executed only for the side effects of its controlling expressions. In such cases continue is sometimes used as a placeholder instead of the empty statement (expressed by a single semicolon). The following example illustrates such a case.[28]
[28] netbsdsrc/usr.bin/error/pi.c:174–175
for (; *string && isdigit(*string); string++)
continue;
In Java programs break and continue can be followed by a label identifier. The same identifier, followed by a colon, is also used to label a loop statement. The labeled form of the continue statement is then used to skip an iteration of a nested loop; the label identifies the loop statement that the corresponding continue will skip. Thus, in the following example, the continue skip; statement will skip one iteration of the outermost for statement.[29]
[29] jt4/jasper/src/share/org/apache/jasper/compiler/JspReader.java:472–482
skip:
for ( [...]) {
if ( ch == limit.charAt(0)) {
for (int i = 1 ; i < limlen ; i++) {
if ( [...] )
continue skip;
}
return ret;
}
}
Similarly, the labeled form of the break statement is used to exit from nested loops; the label identifies the statement that the corresponding break will terminate. In some cases a labeled break or continue statements is used, even when there are no nested loops, to clarify the corresponding loop statement.[30]
[30] cocoon/src/scratchpad/src/org/apache/cocoon/treeprocessor/MapStackResolver.java:201–244
comp : while(prev < length) {
[...]
if (pos >= length || pos == -1) {
[...]
break comp;
}
}
Exercise 2.19 Locate ten occurrences of break and continue in the source code provided with the book. For each case indicate the point where execution will transfer after the corresponding statement is executed, and explain why the statement is used. Do not try to understand in full the logic of the code; simply provide an explanation based on the statement's use pattern.
The program expand processes the files named as its arguments (or its standard input if no file arguments are specified) by expanding hard tab characters(\t, ASCII character9) to a number of spaces. The default behavior is to set tab stops every eight characters; this can be overridden by a comma or space-separated numeric list specified using the -t option. An interesting aspect of the program's implementation, and the reason we are examining it, is that it uses all of the control flow statements available in the C family of languages. Figure 2.2 contains the variable and function declarations of expand,[10] Figure 2.3 contains the main code body,[11] and Figure 2.5 (in Section 2.5) contains the two supplementary functions used.[12]
[10] netbsdsrc/usr.bin/expand/expand.c:36–62
[11] netbsdsrc/usr.bin/expand/expand.c:64–151
[12] netbsdsrc/usr.bin/expand/expand.c:153–185
When examining a nontrivial program, it is useful to first identify its major constituent parts. In our case, these are the global variables (Figure 2.2:1) and the functions main (Figure 2.3), getstops (see Figure 2.5:1), and usage (see Figure 2.5:8).
The integer variable nstops and the array of integers tabstops are declared as global variables, outside the scope of function blocks. They are therefore visible to all functions in the file we are examining.
The three function declarations that follow (Figure 2.2:2) declare functions that will appear later within the file. Since some of these functions are used before they are defined, in C/C++ programs the declarations allow the compiler to verify the arguments passed to the function and their return values and generate correct corresponding code. When no forward declarations are given, the C compiler will make assumptions about the function return type and the arguments when the function is first used; C++ compilers will flag such cases as errors. If the following function definition does not match these assumptions, the compiler will issue a warning or error message. However, if the wrong declaration is supplied for a function defined in another file, the program may compile without a problem and fail at runtime.
Figure 2.2 Expanding tab stops (declarations).
<-- a
#include
#include
#include
#include
#include
int nstops;
int tabstops[100];
static void getstops(char *);
int main(int, char *);
static void usage (void);
(a) Header files
Global variables
Forward function declarations
Notice how the two functions are declared as static while the variables are not. This means that the two functions are visible only within the file, while the variables are potentially visible to all files comprising the program. Since expand consists only of a single file, this distinction is not important in our case. Most linkers that combine compiled C files are rather primitive; variables that are visible to all program files (that is, not declared as static) can interact in surprising ways with variables with the same name defined in other files. It is therefore a good practice when inspecting code to ensure that all variables needed only in a single file are declared as static.
Let us now look at the functions comprising expand. To understand what a function (or method) is doing you can employ one of the following strategies.
Guess, based on the function name.
Read the comment at the beginning of the function.
Examine how the function is used.
Read the code in the function body.
Consult external program documentation.
In our case we can safely guess that the function usage will display program usage information and then exit; many command-line programs have a function with the same name and functionality. When you examine a large body of code, you will gradually pick up names and naming conventions for variables and functions. These will help you correctly guess what they do. However, you should always be prepared to revise your initial guesses following new evidence that your code reading will inevitably unravel. In addition, when modifying code based on guesswork, you should plan the process that will verify your initial hypotheses. This process can involve checks by the compiler, the introduction of assertions, or the execution of appropriate test cases.
Figure 2.3 Expanding tab stops (main part).
int
main(int argc, char *argv)
{
int c, column;
int n;
while ((c = getopt (argc, argv, "t:")) != -1) {
switch (c) {
case 't':
getstops(optarg);
break;
case '?': default: <-- a
usage();
}
}
argc -= optind;
argv += optind;
do {
if (argc > 0) {
if (freopen(argv[0], "r", stdin) == NULL) {
perror(argv[0]);
exit(1);
}
argc--, argv++;
}
column = 0;
while ((c = getchar()) != EOF) {
switch (c) {
case '\t': <-- b
if (nstops == 0) {
do {
putchar(' ');
column++;
} while (column & 07);
continue;
}
if (nstops == 1) {
do {
putchar(' ');
column++;
} while (((column - 1) % tabstops[0]) != (tabstops[0] - 1));
continue;
}
for (n = 0; n < nstops; n++)
if (tabstops[n] > column)
break;
if (n == nstops) {
putchar(' ');
column++;
continue;
}
while (column < tabstops[n]) {
putchar(' ');
column++;
}
continue;
case '\b': <-- c
if (column)
column--;
putchar('\b');
continue;
default: <-- d
putchar(c);
column++;
continue;
case '\n': <-- e
putchar(c);
column = 0;
continue;
} <-- f
} <-- g
} while (argc > 0);) <-- h
exit(0);
}
Variables local to main
Argument processing using getopt
Process the -t option
(a) Switch labels grouped together
End of switch block
At least once
(7) Process remaining arguments
Read characters until EOF
(b) Tab character
Process next character
(c) Backspace
(d) All other characters
(e) Newline
(f) End of switch block
(g) End of while block
(h) End of do block
The role of getstops is more difficult to understand. There is no comment, the code in the function body is not trivial, and its name can be interpreted in different ways. Noting that it is used in a single part of the program (Figure 2.3:3) can help us further. The program part where getstops is used is the part responsible for processing the program's options (Figure 2.3:2). We can therefore safely (and correctly in our case) assume that getstops will process the tab stop specification option. This form of gradual understanding is common when reading code; understanding one part of the code can make others fall into place. Based on this form of gradual understanding you can employ a strategy for understanding difficult code similar to the one often used to combine the pieces of a jigsaw puzzle: start with the easy parts.
Exercise 2.7 Examine the visibility of functions and variables in programs in your environment. Can it be improved (made more conservative)?
Exercise 2.8 Pick some functions or methods from the book's CD-ROM or from your environment and determine their role using the strategies we outlined. Try to minimize the time you spend on each function or method. Order the strategies by their success rate.
2.4 switch Statements
The normal return values of getopt are handled by a switch statement. You will find switch statements used when a number of discrete integer or character values are being processed. The code to handle each value is preceded by a case label. When the value of the expression in the switch statement matches the value of one of the case labels, the program will start to execute statements from that point onward. If none of the label values match the expression value and a default label exists, control will transfer to that point; otherwise, no code within the switch block will get executed. Note that additional labels encountered after transferring execution control to a label will not terminate the execution of statements within the switch block; to stop processing code within the switch block and continue with statements outside it, a break statement must be executed. You will often see this feature used to group case labels together, merging common code elements. In our case when getopt returns 't', the statements that handle -t are executed, with break causing a transfer of execution control immediately after the closing brace of the switch block (Figure 2.3:4). In addition, we can see that the code for the default switch label and the error return value ´?´ is common since the two corresponding labels are grouped together.
When the code for a given case or default label does not end with a statement that transfers control out of the switch block (such as break, return, or continue), the program will continue to execute the statements following the next label. When examining code, look out for this error. In rare cases the programmer might actually want this behavior. To alert maintainers to that fact, it is common to mark these places with a comment, such as FALLTHROUGH, as in the following example.[17]
[17] netbsdsrc/bin/ls/ls.c:173–178
case 'a':
fts_options |= FTS–SEEDOT;
/* FALLTHROUGH */
case 'A':
f_listdot = 1;
break;
The code above comes from the option processing of the Unix ls command, which lists files in a directory. The option -A will include in the list files starting with a dot (which are, by convention, hidden), while the option -a modifies this behavior by adding to the list the two directory entries. Programs that automatically verify source code against common errors, such as the Unix lint command, can use the FALLTHROUGH comment to suppress spurious warnings.
A switch statement lacking a default label will silently ignore unexpected values. Even when one knows that only a fixed set of values will be processed by a switch statement, it is good defensive programming practice to include a default label. Such a default label can catch programming errors that yield unexpected values and alert the program maintainer, as in the following example.[18]
[18] netbsdsrc/usr.bin/at/at.c:535–561
switch (program) {
case ATQ:
[...]
case BATCH:
writefile(time(NULL), 'b');
break;
default:
panic("Internal error");
break;
}
In our case the switch statement can handle two getopt return values.
't' is returned to handle the -t option. Optind will point to the argument of -t. The processing is handled by calling the function getstops with the tab specification as its argument.
'?' is returned when an unknown option or another error is found by getopt. In that case the usage function will print program usage information and exit the program.
A switch statement is also used as part of the program's character-processing loop (Figure 2.3:7). Each character is examined and some characters (the tab, the newline, and the backspace) receive special processing.
Exercise 2.13 The code body of switch statements in the source code collection is formatted differently from the other statements. Express the formatting rule used, and explain its rationale.
Exercise 2.14 Examine the handling of unexpected values in switch statements in the programs you read. Propose changes to detect errors. Discuss how these changes will affect the robustness of programs in a production environment.
Exercise 2.15 Is there a tool or a compiler option in your environment for detecting missing break statements in switch code? Use it, and examine the results on some sample programs.
2.5 for Loops
To complete our understanding of how expand processes its command-line options, we now need to examine the getstops function. Although the role of its single cp argument is not obvious from its name, it becomes apparent when we examine how getstops is used. getstops is passed the argument of the -t option, which is a list of tab stops, for example, 4, 8, 16, 24. The strategies outlined for determining the roles of functions (Section 2.2) can also be employed for their arguments. Thus a pattern for reading code slowly emerges. Code reading involves many alternative strategies: bottom-up and top-down examination, the use of heuristics, and review of comments and external documentation should all be tried as the problem dictates.
After setting nstops to 0, getstops enters a for loop. Typically a for loop is specified by an expression to be evaluated before the loop starts, an expression to be evaluated before each iteration to determine if the loop body will be entered, and an expression to be evaluated after the execution of the loop body. for loops are often used to execute a body of code a specific number of times.[19]
[19] cocoon/src/java/org/apache/cocoon/util/StringUtils.java:85
for (i = 0; i < len; i++) {
Loops of this type appear very frequently in programs; learn to read them as "execute the body of code len times." On the other hand, any deviation from this style, such as an initial value other than 0 or a comparison operator other than <, should alert you to carefully reason about the loop's behavior. Consider the number of times the loop body is executed in the following examples.
Loop extrknt + 1 times:[20]
[20] netbsdsrc/usr.bin/fsplit/fsplit.c:173
for (i = 0; i <= extrknt; i++)
Loop month - 1 times:[21]
[21] netbsdsrc/usr.bin/cal/cal.c:332
for (i = 1; i < month; i++)
Loop nargs times:[22]
[22] netbsdsrc/usr.bin/apply/apply.c:130
for (i = 1; i <= nargs; i++)
Note that the last expression need not be an increment operator. The following line will loop 256 times, decrementing code in the process:[23]
[23] netbsdsrc/usr.bin/compress/zopen.c:510
for (code = 255; code >= 0; code--) {
In addition, you will find for statements used to loop over result sets returned by library functions. The following loop is performed for all files in the directory dir.[24]
[24] netbsdsrc/usr.bin/ftp/complete.c:193–198
if ((dd = opendir(dir)) == NULL)
return (CC_ERROR);
for (dp = readdir(dd); dp != NULL; dp = readdir(dd)) {
The call to opendir returns a value that can be passed to readdir to sequentially access each directory entry of dir. When there are no more entries in the directory, readdir will return NULL and the loop will terminate.
The three parts of the for specification are expressions and not statements. Therefore, if more than one operation needs to be performed when the loop begins or at the end of each iteration, the expressions cannot be grouped together using braces. You will, however, often find expressions grouped together using the expression-sequencing comma (,) operator.[25]
[25] netbsdsrc/usr.bin/vi/vi/vs smap.c:389
for (cnt = 1, t = p; cnt <= cnt–orig; ++t, ++cnt) {
The value of two expressions joined with the comma operator is just the value of the second expression. In our case the expressions are evaluated only for their side effects: before the loop starts, cnt will be set to 1 and t to p, and after every loop iteration t and cnt will be incremented by one.
Any expression of a for statement can be omitted. When the second expression is missing, it is taken as true. Many programs use a statement of the form for (;;) to perform an "infinite" loop. Very seldom are such loops really infinite. The following example—taken out of init, the program that continuously loops, controlling all Unix processes—is an exception.[26]
[26] netbsdsrc/sbin/init/init.c:540–545
Figure 2.5 Expanding tab stops (supplementary functions).
static void
getstops(char *cp)
{
int i;
nstops = 0;
for (;;) {
i = 0;
while (*cp >= '0' && *cp <= '9')
i = i * 10 + *cp++ - '0';
if (i <= 0 || i > 256) {
bad:
fprintf(stderr, "Bad tab stop spec\n");
exit(1);
}
if (nstops > 0 && i <= tabstops[nstops-1])
goto bad;
tabstops[nstops++] = i;
if (*cp == 0)
break;
if (*cp != ',' && *cp != ' ')
goto bad;
cp++;
}
}
static void
usage(void)
{
(void)fprintf (stderr, "usage: expand [-t tablist] [file ...]\n");
exit(1);
}
Parse tab stop specification
Convert string to number
Complain about unreasonable specifications
Verify ascending order
Break out of the loop
Verify valid delimiters
break will transfer control here
Print program usage and exit
for (;;) {
s = (state_t) (*s)();
quiet = 0;
}
In most cases an "infinite" loop is a way to express a loop whose exit condition(s) cannot be specified at its beginning or its end. These loops are typically exited either by a return statement that exits the function, a break statement that exits the loop body, or a call to exit or a similar function that exits the entire program. C++, C#, and Java programs can also exit such loops through an exception (see Section 5.2).
A quick look through the code of the loop in Figure 2.5 provides us with the possible exit routes.
A bad stop specification will cause the program to terminate with an error message (Figure 2.5:3).
The end of the tab specification string will break out of the loop.
Exercise 2.16 The for statement in the C language family is very flexible. Examine the source code provided to create a list of ten different uses.
Exercise 2.17 Express the examples in this section using while instead of for. Which of the two forms do you find more readable?
Exercise 2.18 Devise a style guideline specifying when while loops should be used in preference to for loops. Verify the guideline against representative examples from the book's CD-ROM.
2.6 break and continue Statements
A break statement will transfer the execution to the statement after the innermost loop or switch statement (Figure 2.5:7). In most cases you will find break used to exit early out of a loop. A continue statement will continue the iteration of the innermost loop without executing the statements to the end of the loop. A continue statement will reevaluate the conditional expression of while and do loops. In for loops it will evaluate the third expression and then the conditional expression. You will find continue used where a loop body is split to process different cases; each case typically ends with a continue statement to cause the next loop iteration. In the program we are examining, continue is used after processing each different input character class (Figure 2.3:8).
Note when you are reading Perl code that break and continue are correspondingly named last and next.[27]
[27] perl/lib/unicode/mktables.PL:415–425
while (
chomp;
if (s/0x[\d\w]+\s+\((.*?)\)// and $wanted eq $1) {
[...]
last;
}
}
To determine the effect of a break statement, start reading the program upward from break until you encounter the first while, for, do,or switch block that encloses the break statement. Locate the first statement after that loop; this will be the place where control will transfer when break is executed. Similarly, when examining code that contains a continue statement, start reading the program upward from continue until you encounter the first while, for, or do loop that encloses the continue statement. Locate the last statement of that loop; immediately after it (but not outside the loop) will be the place where control will transfer when continue is executed. Note that continue ignores switch statements and that neither break nor continue affect the operation of if statements.
There are situations where a loop is executed only for the side effects of its controlling expressions. In such cases continue is sometimes used as a placeholder instead of the empty statement (expressed by a single semicolon). The following example illustrates such a case.[28]
[28] netbsdsrc/usr.bin/error/pi.c:174–175
for (; *string && isdigit(*string); string++)
continue;
In Java programs break and continue can be followed by a label identifier. The same identifier, followed by a colon, is also used to label a loop statement. The labeled form of the continue statement is then used to skip an iteration of a nested loop; the label identifies the loop statement that the corresponding continue will skip. Thus, in the following example, the continue skip; statement will skip one iteration of the outermost for statement.[29]
[29] jt4/jasper/src/share/org/apache/jasper/compiler/JspReader.java:472–482
skip:
for ( [...]) {
if ( ch == limit.charAt(0)) {
for (int i = 1 ; i < limlen ; i++) {
if ( [...] )
continue skip;
}
return ret;
}
}
Similarly, the labeled form of the break statement is used to exit from nested loops; the label identifies the statement that the corresponding break will terminate. In some cases a labeled break or continue statements is used, even when there are no nested loops, to clarify the corresponding loop statement.[30]
[30] cocoon/src/scratchpad/src/org/apache/cocoon/treeprocessor/MapStackResolver.java:201–244
comp : while(prev < length) {
[...]
if (pos >= length || pos == -1) {
[...]
break comp;
}
}
Exercise 2.19 Locate ten occurrences of break and continue in the source code provided with the book. For each case indicate the point where execution will transfer after the corresponding statement is executed, and explain why the statement is used. Do not try to understand in full the logic of the code; simply provide an explanation based on the statement's use pattern.
Subscribe to:
Posts (Atom)