The AWK Programming Language

Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger

1. An AWK Tutorial

1.1. Getting Started

Suppose you have a file called emp.data that contains the name, pay rate in dollars per hour, and number of hours worked for your employees, one employee record per line:

echo 'Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18' > emp.data

cat emp.data

Now you want to print the name and pay (rate times hours) for everyone who worked more than zero hours:

awk '$3 > 0 { print $1, $2 * $3 }' emp.data

To print the names of those employees who did not work:

awk '$3 == 0 { print $1 }' emp.data

1.2. Simple Output

Print every line

awk '{ print }' emp.data

awk '{ print $0 }' emp.data

Print certain fields

awk '{ print $1, $3 }' emp.data

NF, the Number of Fields

awk '{ print NF, $1, $NF }' emp.data

Computing and Printing

awk '{ print $1, $2 * $3 }' emp.data

Printing line numbers

awk '{ print NR, $0 }' emp.data

Putting Text in the Output

awk '{ print "total pay for", $1, "is", $2 * $3 }' emp.data

1.3. Fancier Output

Lining Up Fields

The printf statement has the form

printf(format, value1, value2, ..., valuen)

where format is a string that contains text to be printed verbatim, interspersed with specifications of how each of the values is to be printed. A specification is a % followed by a few characters that control the format of a value.

Here's a program that uses printf to print the total pay for every employee:

awk '{ printf("total pay for %s is $%.2f\n", $1, $2 * $3) }' emp.data

The %s says to print the first value, $1, as a string of characters; the second, %.2f, says to print the second value, $2 * $3, as a number with 2 digits after the decimal point. Everything else in the specification string, including the dollar sign, is printed verbatim.

Here's another program that prints each employee's name and pay:

awk '{ printf("%-8s $%6.2f\n", $1, $2 * $3) }' emp.data

The first specification, %-8s, prints a name as a string of characters left-justified in a field 8 characters wide. The second specification, %6.2f, prints the pay as a number with two digits after the decimal point, in a field 6 characters wide.
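
To make the effect of the field widths visible, the following sketch wraps each formatted value in square brackets. The brackets and the two piped-in records are our own choices for illustration; the records are copied from emp.data so that the example runs on its own.

```shell
# Two records copied from emp.data, piped in so the example is self-contained.
out=$(printf 'Beth 4.00 0\nMark 5.00 20\n' |
  awk '{ printf("[%-8s] [%6.2f]\n", $1, $2 * $3) }')
echo "$out"
# Beth is padded on the right to 8 characters; 0.00 is padded on the left to 6.
```

The name is padded on the right because of the minus sign in %-8s, while the number is padded on the left, which is the default.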

Sorting the output

awk '{ printf("%6.2f %s\n", $2 * $3, $0) }' emp.data | sort -n

1.4. Selection

Selection by Comparison

This program uses a comparison pattern to select the records of employees who earn $5.00 or more per hour, that is, lines in which the second field is greater than or equal to 5:

awk '$2 >= 5' emp.data

Selection by Computation

This program prints the pay of those employees whose total pay exceeds $50:

awk '$2 * $3 > 50 { printf("$%.2f for %s\n", $2 * $3, $1) }' emp.data

Selection by Text Content

Print all lines in which the first field is Susie:

awk '$1 == "Susie"' emp.data

The operator == tests for equality. You can also look for text containing any of a set of letters, words, and phrases by using patterns called regular expressions.

This program prints all lines that contain Susie anywhere:

awk '/Susie/' emp.data

Combinations of Patterns

Patterns can be combined with parentheses and the logical operators &&, || and !, which stand for AND, OR, and NOT.

Print the lines where $2 is at least 4 or $3 is at least 20:

awk '$2 >= 4 || $3 >= 20' emp.data

Lines that satisfy both conditions are printed only once.

The next one prints lines where it is not true that $2 is less than 4 and $3 is less than 20; this condition is equivalent to the first one above, though perhaps less readable.

awk '!($2 < 4 && $3 < 20)' emp.data
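
One way to convince yourself that the two patterns are equivalent is to run both on the same data and compare the results. The sketch below does this with a copy of the emp.data records held in a shell variable; the variable names data, a, and b are ours.

```shell
# A copy of the emp.data records, so the example does not depend on the file.
data='Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18'
a=$(echo "$data" | awk '$2 >= 4 || $3 >= 20')
b=$(echo "$data" | awk '!($2 < 4 && $3 < 20)')
# Report whether the two patterns selected exactly the same lines.
[ "$a" = "$b" ] && echo "same output"
```

Only Dan's line fails both halves of the first pattern, so it is the only line missing from either result.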

Data Validation

There are always errors in real data. Awk is an excellent tool for checking that data has reasonable values and is in the right format, a task that is often called data validation.

Data validation is essentially negative: instead of printing lines with desirable properties, one prints lines that are suspicious. The following commands use comparison patterns to apply five plausibility tests to each line of emp.data, one test per command:

awk 'NF != 3 { print $0, "number of fields is not equal to 3" }' emp.data

awk '$2 < 3.35 { print $0, "rate is below minimum wage" }' emp.data

awk '$2 > 10 { print $0, "rate exceeds $10 per hour" }' emp.data

awk '$3 < 0 { print $0, "negative hours worked" }' emp.data

awk '$3 > 60 { print $0, "too many hours worked" }' emp.data

If there are no errors, there's no output.
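
Since emp.data is clean, none of the tests above prints anything. As a sketch of what the tests look like when they do fire, the five patterns can be placed in one awk program and run on a few deliberately bad records. The records below are invented for this illustration and are not part of emp.data.

```shell
# Invented records: the first three each violate one test, the last is valid.
out=$(printf 'Beth 4.00 0 extra\nDan 2.00 10\nKathy 4.00 80\nMark 5.00 20\n' |
  awk 'NF != 3   { print $0, "number of fields is not equal to 3" }
       $2 < 3.35 { print $0, "rate is below minimum wage" }
       $2 > 10   { print $0, "rate exceeds $10 per hour" }
       $3 < 0    { print $0, "negative hours worked" }
       $3 > 60   { print $0, "too many hours worked" }')
echo "$out"
```

Each suspicious record is echoed with the reason it was flagged, and the valid record for Mark produces no output.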

BEGIN and END

The special pattern BEGIN matches before the first line of the first input file is read, and END matches after the last line of the last file has been processed. This program uses BEGIN to print a heading:

awk 'BEGIN { print "NAME RATE HOURS"; print "" }
           { print }' emp.data

1.5. Computing with AWK

An action is a sequence of statements separated by newlines or semicolons. You have already seen examples in which the action was a single print statement. This section provides examples of statements for performing simple numeric and string computations. In these statements you can use not only the built-in variables like NF, but you can create your own variables for performing calculations, storing data, and the like. In awk, user-created variables are not declared.

Counting

This program uses a variable emp to count employees who have worked more than 15 hours:

awk '$3 > 15 { emp = emp + 1 }
     END     { print emp, "employees worked more than 15 hours" }' emp.data

Awk variables used as numbers begin life with the value 0, so we didn't need to initialize emp.
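
One subtlety is worth a small experiment: a variable that is never assigned acts as 0 in arithmetic but prints as the null string. In the sketch below the threshold of 100 hours is our own choice, picked so that no record matches; adding 0 forces the numeric value.

```shell
# No record has more than 100 hours, so emp is never assigned.
out=$(printf 'Beth 4.00 0\nMark 5.00 20\n' |
  awk '$3 > 100 { emp = emp + 1 }
       END      { print "[" emp "]", "[" (emp + 0) "]" }')
echo "$out"
```

If a count of zero must appear as 0 in the output, printing emp + 0 instead of emp is one simple way to get it.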

Computing Sums and Averages

To count the number of employees, we can use the built-in variable NR, which holds the number of lines read so far; its value at the end of all input is the total number of lines read.

awk 'END { print NR, "employees" }' emp.data

Here is a program that uses NR to compute the average pay:

awk '{ pay = pay + $2 * $3 }
 END { print NR, "employees"
       print "total pay is", pay
       print "average pay is", pay/NR }' emp.data

Clearly, printf could be used to produce neater output. There's also a potential error: in the unlikely case that NR is zero, the program will attempt to divide by zero and thus will generate an error message.
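
Both remarks can be addressed in a few lines. The following is only a sketch of one possible version: it uses printf for the output and an if-else statement in the END action to avoid the division when there is no input. The message "no input" and the shell variable names are our own.

```shell
# The awk program is kept in a shell variable so it can be run twice.
prog='{ pay = pay + $2 * $3 }
END { if (NR > 0)
          printf("%d employees, total pay %.2f, average pay %.2f\n", NR, pay, pay / NR)
      else
          print "no input"
    }'
# Once with two records copied from emp.data, once with empty input.
with_data=$(printf 'Kathy 4.00 10\nMark 5.00 20\n' | awk "$prog")
no_data=$(awk "$prog" < /dev/null)
echo "$with_data"
echo "$no_data"
```

With empty input NR is zero, so the else branch runs and no division is attempted.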

Handling Text

One of the strengths of awk is its ability to handle strings of characters as conveniently as most languages handle numbers. Awk variables can hold strings of characters as well as numbers. This program finds the employee who is paid the most per hour:

awk '$2 > maxrate { maxrate = $2; maxemp = $1 }
     END { print "highest hourly rate:", maxrate, "for", maxemp }' emp.data

In this program the variable maxrate holds a numeric value, while the variable maxemp holds a string. (If there are several employees who all make the same maximum pay, this program finds only the first.)

String Concatenation

New strings may be created by combining old ones; this operation is called concatenation. This program collects all the employee names into a single string, by appending each name and a blank to the previous value in the variable names. The value of names is printed by the END action:

awk ' { names = names $1 " " }
  END { print names }' emp.data

The concatenation operation is represented in an awk program by writing string values one after the other. At every input line, the first statement in the program concatenates three strings: the previous value of names, the first field, and a blank; it then assigns the resulting string to names. Thus, after all input lines have been read, names contains a single string consisting of the names of all the employees, each followed by a blank. Variables used to store strings begin life holding the null string (that is, the string containing no characters), so in this program names did not need to be explicitly initialized.

Printing the Last Input Line

Although NR retains its value in an END action, $0 does not. This program is one way to print the last input line:

awk '{ last = $0 }
 END { print last }' emp.data

Built-in Functions

We have already seen that awk provides built-in variables that maintain frequently used quantities like the number of fields and the input line number. Similarly, there are built-in functions for computing other useful values. Besides arithmetic functions for square roots, logarithms, random numbers, and the like, there are also functions that manipulate text. One of these is length, which counts the number of characters in a string. For example, this program computes the length of each person's name:

awk '{ print $1, length($1)}' emp.data

Counting Lines, Words, and Characters

This program uses length, NF, and NR to count the number of lines, words, and characters in the input. For convenience, we'll treat each field as a word.

awk '{ nc = nc + length($0) + 1
       nw = nw + NF
     }
     END { print NR, "lines,", nw, "words,", nc, "characters" }' emp.data

We have added one for the newline character at the end of each input line, since $0 doesn't include it.

1.6. Control-Flow Statements

Awk provides an if-else statement for making decisions and several statements for writing loops, all modeled on those found in the C programming language. They can only be used in actions.

If-Else Statement

The following program computes the total and average pay of employees making more than $6.00 an hour. It uses an if to defend against division by zero in computing the average pay.

awk '$2 > 6 { n = n + 1; pay = pay + $2 * $3 }
     END    { if (n > 0)
                  print n, "employees, total pay is", pay,
                           "average pay is", pay/n
              else
                  print "no employees are paid more than $6/hour"
            }' emp.data

In the if-else statement, the condition following the if is evaluated. If it is true, the first print statement is performed. Otherwise, the second print statement is performed. Note that we can continue a long statement over several lines by breaking it after a comma.

While Statement

A while statement has a condition and a body. The statements in the body are performed repeatedly while the condition is true. This program shows how the value of an amount of money invested at a particular interest rate grows over a number of years, using the formula value = amount (1 + rate)^years.

echo '# interest1 - compute compound interest
#   input: amount rate years
#   output: compounded value at the end of each year

{   i = 1
    while (i <= $3) {
        printf("\t%.2f\n", $1 * (1 + $2) ^ i)
        i = i + 1
    }
}' > interest1

cat interest1

The condition is the parenthesized expression after the while; the loop body is the two statements enclosed in braces after the condition. The \t in the printf specification string stands for a tab character; the ^ is the exponentiation operator. Text from a # to the end of the line is a comment, which is ignored by awk but should be helpful to readers of the program who want to understand what is going on.

You can type triplets of numbers at this program to see what various amounts, rates, and years produce. For example, this transaction shows how $1000 grows at 6% and 12% compound interest for five years:

echo "1000 .06 5" | awk -f interest1

echo "1000 .12 5" | awk -f interest1

For Statement

Another statement, for, compresses into a single line the initialization, test, and increment that are part of most loops. Here is the previous interest computation with a for:

echo '# interest2 - compute compound interest
#   input: amount rate years
#   output: compounded value at the end of each year

{   for (i = 1; i <= $3; i = i + 1)
        printf("\t%.2f\n", $1 * (1 + $2) ^ i)
}
' > interest2

cat interest2

echo "1000 .06 5" | awk -f interest2

echo "1000 .12 5" | awk -f interest2

The initialization i = 1 is performed once. Next, the condition i <= $3 is tested; if it is true, the printf statement, which is the body of the loop, is performed. Then the increment i = i + 1 is performed after the body, and the next iteration of the loop begins with another test of the condition. The code is more compact, and since the body of the loop is only a single statement, no braces are needed to enclose it.
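
As a check on the arithmetic, the same for loop can be run as a one-off command without the interest2 file. This sketch also prints the year number in front of each value, which is our addition; the computation amount * (1 + rate) ^ year is unchanged.

```shell
# $1000 at 6% for 5 years; each output line is "year <tab> value".
out=$(echo "1000 .06 5" |
  awk '{ for (i = 1; i <= $3; i = i + 1)
             printf("%d\t%.2f\n", i, $1 * (1 + $2) ^ i) }')
echo "$out"
# The first line shows 1060.00 and the fifth shows 1338.23.
```

After one year the value is 1000 * 1.06, and each later year multiplies the previous value by 1.06 again.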

1.7. Arrays

Awk provides arrays for storing groups of related values. Although arrays give awk considerable power, we will show only a simple example here. The following program prints its input in reverse order by line. The first action puts the input lines into successive elements of the array line; that is, the first line goes into line[1], the second line into line[2], and so on. The END action uses a while statement to print the lines from the array from last to first:

# reverse - print input in reverse order by line

awk '{ line[NR] = $0 } # remember each input line
 END {i = NR           # print lines in reverse order
      while (i > 0) {
          print line[i]
          i = i - 1
       }
    }' emp.data

Here is the same example with a for statement:

# reverse - print input in reverse order by line

awk '{ line[NR] = $0 } # remember each input line
 END { for (i = NR; i > 0; i = i - 1)
           print line[i]
     }' emp.data

1.8. A Handful of Useful "One-liners"

Although awk can be used to write programs of some complexity, many useful programs are not much more complicated than what we've seen so far. Here is a collection of short programs that you might find handy and/or instructive. Most are variations on material already covered.

Print the total number of input lines:

awk 'END { print NR }' emp.data

Print the third input line:

awk 'NR == 3' emp.data

Print the last field of every input line:

awk '{ print $NF }' emp.data

Print the last field of the last input line:

awk '{ field = $NF }
  END { print field }' emp.data

Print every input line with more than two fields:

awk 'NF > 2' emp.data

Print every input line in which the last field is more than 2:

awk '$NF > 2' emp.data

Print the total number of fields in all input lines:

awk '{ nf = nf + NF }
 END { print nf }' emp.data

Print the total number of lines that contain Beth:

awk '/Beth/ { nlines = nlines + 1 }
     END    { print nlines }' emp.data

Print the largest second field and the line that contains it (assumes some $2 is positive):

awk '$2 > max { max = $2; maxline = $0 }
     END      { print max, maxline }' emp.data

Print every line that has at least one field:

awk 'NF > 0' emp.data

Print every line longer than 10 characters:

awk 'length($0) > 10' emp.data

Print the number of fields in every line followed by the line itself:

awk '{ print NF, $0 }' emp.data

Print the first two fields, in opposite order, of every line:

awk '{ print $2, $1 }' emp.data

Exchange the first two fields of every line and then print the line:

awk '{ temp = $1; $1 = $2; $2 = temp; print }' emp.data

Print every line with the first field replaced by the line number:

awk '{ $1 = NR; print }' emp.data

Print every line after erasing the second field:

awk '{ $2 = ""; print }' emp.data

Print in reverse order the fields of every line:

awk '{for (i = NF; i > 0; i = i - 1) printf("%s ", $i)
      printf("\n")
     }' emp.data

Print the sums of the fields of every line:

awk '{ sum = 0
       for (i = 1; i <= NF; i = i + 1) sum = sum + $i
       print sum
     }' emp.data

Add up all fields in all lines and print the sum:

awk '{ for (i = 1; i <= NF; i = i + 1) sum = sum + $i }
 END { print sum }' emp.data

Print every line after replacing each field by its absolute value:

awk '{ for (i = 1; i <= NF; i = i + 1) if ($i < 0) $i = -$i
       print
     }' emp.data
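
Because emp.data contains no negative numbers, the absolute-value program leaves its lines unchanged. A line with negative fields shows the effect; the input values in this sketch are made up for the purpose.

```shell
# Made-up input with two negative fields.
out=$(echo '-1 2 -3.5' |
  awk '{ for (i = 1; i <= NF; i = i + 1) if ($i < 0) $i = -$i
         print
       }')
echo "$out"
# prints: 1 2 3.5
```

Assigning to a field changes $0 as well, which is why a plain print shows the modified line.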

1.9. What Next?

You have now seen the essentials of awk. Each program in this chapter has been a sequence of pattern-action statements. Awk tests every input line against the patterns, and when a pattern matches, performs the corresponding action. Patterns can involve numeric and string comparisons, and actions can include computation and formatted printing. Besides reading through your input files automatically, awk splits each input line into fields. It also provides a number of built-in variables and functions, and lets you define your own as well. With this combination of features, quite a few useful computations can be expressed by short programs - many of the details that would be needed in another language are handled implicitly in an awk program.

The rest of the book elaborates on these basic ideas. Since some of the examples are quite a bit bigger than anything in this chapter, we encourage you strongly to begin writing programs as soon as possible. This will give you familiarity with the language and make it easier to understand the larger programs. Furthermore, nothing answers questions so well as some simple experiments. You should also browse through the whole book; each example conveys something about the language, either about how to use a particular feature, or how to create an interesting program.

2. The AWK Language

This chapter explains, mostly with examples, the constructs that make up awk programs. Because it's a description of the complete language, the material is detailed, so we recommend that you skim it, then come back as necessary to check up on details.

The simplest awk program is a sequence of pattern-action statements:

pattern { action }

. . .

In some statements, the pattern may be missing; in others, the action and its enclosing braces may be missing. After awk has checked your program to make sure there are no syntactic errors, it reads the input a line at a time, and for each line, evaluates the patterns in order. For each pattern that matches the current input line, it executes the associated action. A missing pattern matches every input line, so every action with no pattern is performed at each line. A pattern-action statement consisting only of a pattern prints each input line matched by the pattern. Throughout most of this chapter, the terms "input line" and "record" are used synonymously. In Section 2.5, we will discuss multiline records, where a record may contain several lines.

The first section of this chapter describes patterns in detail. The second section begins the description of actions by describing expressions, assignments, and control-flow statements. The remaining sections cover function definitions, output, input, and how awk programs can call other programs. Most sections contain summaries of major features.

The Input File

As input for many of the awk programs in this chapter, we will use a file called countries. Each line contains the name of a country, its area in thousands of square miles, its population in millions, and the continent it is in. The data is from 1984; the USSR has been arbitrarily placed in Asia. In the file, the four columns are separated by tabs; a single blank separates North and South from America.

echo 'USSR	8649	275	Asia
Canada	3852	25	North America
China	3705	1032	Asia
USA	3615	237	North America
Brazil	3286	134	South America
India	1267	746	Asia
Mexico	762	78	North America
France	211	55	Europe
Japan	144	120	Asia
Germany	96	61	Europe
England	94	56	Europe' > countries

cat countries

Pattern-action statements and the statements within an action are usually separated by newlines, but several statements may appear on one line if they are separated by semicolons. A semicolon may be put at the end of any statement.

The opening brace of an action must be on the same line as the pattern it accompanies; the remainder of the action, including the closing brace, may appear on the following lines.

Blank lines are ignored; they may be inserted before or after any statement to improve the readability of a program. Blanks and tabs may be inserted around operators and operands, again to enhance readability.

Comments may be inserted at the end of any line. A comment starts with the character # and finishes at the end of the line, as in

awk '{ print $1, $3 } # print country name and population' countries

A long statement may be spread over several lines by inserting a backslash and newline at each break:

awk '{ print \
$1, # country name
$2, # area in thousands of square miles
$3 } # population in millions' countries

As this example shows, statements may also be broken after commas, and a comment may be inserted at the end of each broken line.

In this book we have used several formatting styles, partly to illustrate different ones, and partly to keep programs from occupying too many lines. For short programs like those in this chapter, format doesn't much matter, but consistency and readability will help to keep longer programs manageable.

2.1. Patterns

Patterns control the execution of actions: when a pattern matches, its associated action is executed. This section describes the six types of patterns and the conditions under which they match.

Summary of Patterns

BEGIN { statements } The statements are executed once before any input has been read.
END { statements } The statements are executed once after all input has been read.
expression { statements } The statements are executed at each input line where the expression is true, that is, nonzero or nonnull.
/regular expression/ { statements } The statements are executed at each input line that contains a string matched by the regular expression.
compound pattern { statements } A compound pattern combines expressions with && (AND), || (OR), ! (NOT), and parentheses; the statements are executed at each input line where the compound pattern is true.
pattern1, pattern2 { statements } A range pattern matches each input line from a line matched by pattern1 to the next line matched by pattern2, inclusive; the statements are executed at each matching line.

BEGIN and END do not combine with other patterns. A range pattern cannot be part of any other pattern. BEGIN and END are the only patterns that require an action.

BEGIN and END

The BEGIN and END patterns do not match any input lines. Rather, the statements in the BEGIN action are executed before awk reads any input; the statements in the END action are executed after all input has been read. BEGIN and END thus provide a way to gain control for initialization and wrap up. BEGIN and END do not combine with other patterns. If there is more than one BEGIN the associated actions are executed in the order in which they appear in the program, and similarly for multiple END's. Although it's not mandatory, we put BEGIN first and END last.

One common use of a BEGIN action is to change the default way that input lines are split into fields. The field separator is controlled by a built-in variable called FS. By default, fields are separated by blanks and/or tabs; this behavior occurs when FS is set to a blank. Setting FS to any character other than a blank makes that character the field separator.
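
As a minimal sketch of the effect of FS, the command below sets the separator to a comma so that a blank inside a field is no longer a field boundary. The two comma-separated records are invented for this illustration.

```shell
# Invented records: the first field contains a blank but no comma.
out=$(printf 'North America,Canada\nSouth America,Brazil\n' |
  awk 'BEGIN { FS = "," }
             { print $2, "is in", $1 }')
echo "$out"
```

With the default separator, North and America would have been separate fields; with FS set to a comma, $1 holds the whole continent name.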

The following program uses the BEGIN action to set the field separator to a tab character \t and to put column headings on the output. The second printf statement, which is executed at each input line, formats the output into a table, neatly aligned under the column headings. The END action prints the totals. (Variables and expressions are discussed in Section 2.2.)

awk '# print countries with column headers and totals
 
BEGIN { FS = "\t" # make tab the field separator
    printf("%10s %6s %5s	%s\n\n",
    "COUNTRY", "AREA", "POP", "CONTINENT")
    }
    { printf("%10s %6d %5d	%s\n", $1, $2, $3, $4)
    area = area + $2
    pop = pop + $3
    }
END { printf("\n%10s %6d %5d\n", "TOTAL", area, pop) }' countries

Expressions as Patterns

Like most programming languages, awk is rich in expressions for describing numeric computations. Unlike many languages, awk also has expressions for describing operations on strings. Throughout this book, the term string means a sequence of zero or more characters. These may be stored in variables, or appear literally as string constants like "" or "Asia". The string "", which contains no characters, is called the null string. The term substring means a contiguous sequence of zero or more characters within a string. In every string, the null string appears as a substring of length zero before the first character, between every pair of adjacent characters, and after the last character.

Any expression can be used as an operand of any operator. If an expression has a numeric value but an operator requires a string value, the numeric value is automatically transformed into a string; similarly, a string is converted into a number when an operator demands a numeric value.

Any expression can be used as a pattern. If an expression used as a pattern has a nonzero or nonnull value at the current input line, then the pattern matches that line. The typical expression patterns are those involving comparisons between numbers or strings. A comparison expression contains one of the six relational operators, or one of the two string-matching operators ~ (tilde) and !~ that will be discussed in the next section. These operators are listed in Table 2-1.

If the pattern is a comparison expression like NF > 10, then it matches the current input line when the condition is satisfied, that is, when the number of fields in the line is greater than ten. If the pattern is an arithmetic expression like NF, it matches the current input line when its numeric value is nonzero. If the pattern is a string expression, it matches the current input line when the string value of the expression is nonnull.

In a relational comparison, if both operands are numeric, a numeric comparison is made; otherwise, any numeric operand is converted to a string, and then the operands are compared as strings. The strings are compared character by character using the ordering provided by the machine, most often the ASCII character set. One string is said to be "less than" another if it would appear before the other according to this ordering, e.g., "Canada" < "China" and "Asia" < "Asian".
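
The difference between the two kinds of comparison is easy to see with constants. In this sketch the variable names n, s, c, and a are ours; each holds 1 if the comparison is true and 0 if it is false.

```shell
out=$(awk 'BEGIN {
    n = (10 < 9)              # both operands numeric: numeric comparison
    s = ("10" < "9")          # both operands strings: character by character
    c = ("Canada" < "China")  # a comes before h
    a = ("Asia" < "Asian")    # a string precedes any longer string it begins
    print n, s, c, a
}')
echo "$out"
# prints: 0 1 1 1
```

As numbers, 10 is not less than 9; as strings, "10" precedes "9" because the character 1 precedes the character 9.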

The pattern

awk '$3/$2 >= 0.5' countries

selects lines where the value of the third field divided by the second is numerically greater than or equal to 0.5, while

awk '$0 >= "M"' countries

selects lines that begin with an M, N, O, etc.

Sometimes the type of a comparison operator cannot be determined solely by the syntax of the expression in which it appears. The program

awk '$1 < $4' countries

could compare the first and fourth fields of each input line either as numbers or as strings. Here, the type of the comparison depends on the values of the fields, and it may vary from line to line. In the countries file, the first and fourth fields are always strings, so string comparisons are always made.

Only if both fields are numbers is the comparison done numerically; this would be the case with

awk '$2 > $3' countries

Section 2.2 contains a more complete discussion of strings, numbers, and expressions.
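
A small experiment shows the type of a field comparison changing from line to line. The two input lines below are invented for the purpose: in the first, both fields look like numbers; in the second, the second field does not. The variable r is ours.

```shell
out=$(printf '10 9\n10 9x\n' |
  awk '{ r = ($1 < $2) ? "less" : "not less"
         print $1, $2, r }')
echo "$out"
```

On the first line 10 and 9 are compared as numbers, so the first field is not less. On the second line 9x is not a number, so the fields are compared as strings, and "10" precedes "9x".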

String-Matching Patterns

/regexpr/ Matches when the current input line contains a substring matched by regexpr.
expression ~ /regexpr/ Matches if the string value of expression contains a substring matched by regexpr.
expression !~ /regexpr/ Matches if the string value of expression does not contain a substring matched by regexpr.

Any expression may be used in place of /regexpr/ in the context of ~ and !~.

Awk provides a notation called regular expressions for specifying and matching strings of characters. Regular expressions are widely used in Unix programs, including its text editors and shell. Restricted forms of regular expressions also occur in systems like MS-DOS as "wild-card characters" for specifying sets of filenames.

A string-matching pattern tests whether a string contains a substring matched by a regular expression.

The simplest regular expression is a string of letters and numbers, like Asia, that matches itself. To turn a regular expression into a string-matching pattern, just enclose it in slashes:

awk '/Asia/' countries

This pattern matches when the current input line contains the substring Asia, either as Asia by itself or as some part of a larger word like Asian or Pan-Asiatic. Note that blanks are significant within regular expressions: the string-matching pattern

awk '/ Asia /' countries

matches only when Asia is surrounded by blanks.

The pattern above is one of three types of string-matching patterns. Its form is a regular expression r enclosed in slashes:

/r/

This pattern matches an input line if the line contains a substring matched by r.

The other two types of string-matching patterns use an explicit matching operator:

expression ~ /r/

expression !~ /r/

The matching operator ~ means "is matched by" and !~ means "is not matched by." The first pattern matches when the string value of expression contains a substring matched by the regular expression r; the second pattern matches if there is no such substring.

The left operand of a matching operator is often a field: the pattern

awk '$4 ~ /Asia/' countries

matches all input lines in which the fourth field contains Asia as a substring, while

awk '$4 !~ /Asia/' countries

matches if the fourth field does not contain Asia anywhere.

Note that the string-matching pattern /Asia/ is a shorthand for $0 ~ /Asia/.
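
The following sketch compares the shorthand with the explicit form and contrasts both with a match against a single field. The three records, including the name Asiago, are invented so that Asia occurs in different fields; the variable names are ours.

```shell
# Invented records: "Asia" occurs in field 2 of the first and field 1 of the second.
data='India Asia
Asiago Europe
France Europe'
a=$(echo "$data" | awk '/Asia/')        # shorthand form
b=$(echo "$data" | awk '$0 ~ /Asia/')   # explicit match against the whole line
c=$(echo "$data" | awk '$2 ~ /Asia/')   # match against the second field only
[ "$a" = "$b" ] && echo "shorthand and explicit forms agree"
echo "$c"
```

The first two commands select the same two lines, while the field-specific pattern selects only the line whose second field contains Asia.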

Regular Expressions

A regular expression is a notation for specifying and matching strings. Like an arithmetic expression, a regular expression is a basic expression or one created by applying operators to component expressions. To understand the strings matched by a regular expression, we need to understand the strings matched by its components.

The basic regular expressions are summarized in the table above. The characters

\ ^ $ . [ ] | ( ) * + ?

are called metacharacters because they have special meanings. A regular expression consisting of a single nonmetacharacter matches itself. Thus, a single letter or digit is a basic regular expression that matches itself. To preserve the literal meaning of a metacharacter in a regular expression, precede it by a backslash. Thus, the regular expression

\$

matches the character $. If a character is preceded by a single \ we'll say that character is quoted.

In a regular expression, an unquoted caret ^ matches the beginning of a string, an unquoted dollar-sign $ matches the end of a string, and an unquoted period . matches any single character. Thus,

^C matches a C at the beginning of a string
C$ matches a C at the end of a string
^C$ matches the string consisting of the single character C
^.$ matches any string containing exactly one character
^...$ matches any string containing exactly three characters
... matches any three consecutive characters
\.$ matches a period at the end of a string
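
Several of these expressions can be tried on a handful of short strings. The five test strings and the variable names in this sketch are our own; each command keeps the lines matched by one expression from the list above.

```shell
# Five test strings, one per line.
words='C
Cat
abC
a.
abc'
begins=$(echo "$words" | awk '/^C/')     # C at the beginning
ends=$(echo "$words" | awk '/C$/')       # C at the end
only=$(echo "$words" | awk '/^C$/')      # exactly the string C
three=$(echo "$words" | awk '/^...$/')   # exactly three characters
period=$(echo "$words" | awk '/\.$/')    # a period at the end
# Unquoted expansions put each result on a single line.
echo "^C    :" $begins
echo "C\$    :" $ends
echo "^C\$   :" $only
echo "^...\$ :" $three
echo "\\.\$   :" $period
```

Note that the unquoted period in ^...$ matches any character, so it selects Cat, abC, and abc alike, while the quoted period in \.$ matches only a literal period.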

A regular expression consisting of a group of characters enclosed in brackets is called a character class; it matches any one of the enclosed characters. For example, [AEIOU] matches any of the characters A, E, I, O, or U.

Ranges of characters can be abbreviated in a character class by using a hyphen. The character immediately to the left of the hyphen defines the beginning of the range; the character immediately to the right defines the end. Thus, [0-9] matches any digit, and [a-zA-Z][0-9] matches a letter followed by a digit. Without both a left and right operand, a hyphen in a character class denotes itself, so the character classes [+-] and [-+] match either a + or a -. The character class [A-Za-z-]+ matches words that include hyphens.

A complemented character class is one in which the first character after the [ is a ^. Such a class matches any character not in the group following the caret. Thus, [^0-9] matches any character except a digit; [^a-zA-Z] matches any character except an upper or lower-case letter.

^[ABC] matches an A, B or C at the beginning of a string
^[^ABC] matches any character at the beginning of a string, except A, B or C
[^ABC] matches any character other than an A, B or C
^[^a-z]$ matches any single-character string, except a lower-case letter
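
Character classes can be tried out in the same way. In this sketch the five test strings and the variable names are invented for illustration; the patterns combine classes, a complemented class, and the anchors ^ and $.

```shell
# Five test strings, one per line.
items='A1
b2
c-3
+7
x'
ld=$(echo "$items" | awk '/^[a-zA-Z][0-9]$/')   # a letter followed by a digit
nd=$(echo "$items" | awk '/^[^0-9]+$/')         # no digits anywhere in the line
sign=$(echo "$items" | awk '/^[+-]/')           # begins with + or -
echo "letter then digit:" $ld
echo "no digits:" $nd
echo "leading sign:" $sign
```

The hyphen in [+-] has no right operand, so it stands for itself; c-3 is not selected by that pattern because its hyphen is not at the beginning of the line.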

Inside a character class, all characters have their literal meaning, except for the quoting character \, ^ at the beginning, and - between two characters. Thus, [.] matches a period and ^[^^] matches any character except a caret at the beginning of a string.

Parentheses are used in regular expressions to specify how components are grouped. There are two binary regular expression operators: alternation and concatenation. The alternation operator | is used to specify alternatives: if r1 and r2 are regular expressions, then r1|r2 matches any string matched by r1 or by r2.

There is no explicit concatenation operator. If r1 and r2 are regular expressions, then (r1)(r2) (with no blank between (r1) and (r2)) matches any string of the form xy where r1 matches x and r2 matches y. The parentheses around r1 or r2 can be omitted, if the contained regular expression does not contain the alternation operator. The regular expression

(Asian|European|North American) (male|female) (black|blue)bird

matches twelve strings ranging from