The AWK Programming Language
Alfred V. Aho, Peter J. Weinberger, Brian W. Kernighan
1. An AWK Tutorial
1.1. Getting Started
Suppose you have a file called emp.data that contains the name, pay rate in dollars per hour, and number of hours worked for your employees, one employee record per line:
echo 'Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18' > emp.data
cat emp.data
Now you want to print the name and pay (rate times hours) for everyone who worked more than zero hours:
awk '$3 > 0 { print $1, $2 * $3 }' emp.data
To print the names of those employees who did not work:
awk '$3 == 0 { print $1 }' emp.data
1.2. Simple Output
Print every line
awk '{ print }' emp.data
awk '{ print $0 }' emp.data
Print certain fields
awk '{ print $1, $3 }' emp.data
NF, the Number of Fields
awk '{ print NF, $1, $NF }' emp.data
Computing and Printing
awk '{ print $1, $2 * $3 }' emp.data
Printing line numbers
awk '{ print NR, $0 }' emp.data
Putting Text in the Output
awk '{ print "total pay for", $1, "is", $2 * $3 }' emp.data
1.3. Fancier output
Lining Up Fields
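The printf statement has the form
printf(format, value1, value2, ..., valuen)
where format is a string that contains text to be printed verbatim, interspersed with specifications of how each of the values is to be printed. A specification is a % followed by a few characters that control the format of a value.
Here's a program that uses printf to print the total pay for every employee: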
awk '{ printf("total pay for %s is $%.2f\n", $1, $2 * $3) }' emp.data
The first specification, %s, says to print the first value, $1, as a string of characters; the second, %.2f, says to print the second value, $2 * $3, as a number with 2 digits after the decimal point. Everything else in the specification string, including the dollar sign, is printed verbatim.
Here's another program that prints each employee's name and pay:
awk '{ printf("%-8s $%6.2f\n", $1, $2 * $3) }' emp.data
The first specification, %-8s, prints a name as a string of characters left-justified in a field 8 characters wide. The second specification, %6.2f, prints the pay as a number with two digits after the decimal point, in a field 6 characters wide.
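To make the effect of the width and the minus sign visible, here is a small sketch that brackets each formatted value. The two name and pay pairs are made up for illustration and are not taken from emp.data.

```shell
# Hypothetical values, chosen only to show the padding.
out=$(awk 'BEGIN {
    printf("[%-8s] [%6.2f]\n", "Al", 7.5)        # short name is padded on the right
    printf("[%-8s] [%6.2f]\n", "Kathleen", 120)  # both values fill their fields exactly
}')
echo "$out"
```

The first line shows six blanks after Al and two blanks before 7.50; in the second line neither field needs any padding.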
Sorting the output
awk '{ printf("%6.2f %s\n", $2 * $3, $0) }' emp.data | sort -n
1.4. Selection
Selection by Comparison
This program uses a comparison pattern to select the records of employees who earn $5.00 or more per hour, that is, lines in which the second field is greater than or equal to 5:
awk '$2 >= 5' emp.data
Selection by Computation
This program prints the pay of those employees whose total pay exceeds $50:
awk '$2 * $3 > 50 { printf("$%.2f for %s\n", $2 * $3, $1) }' emp.data
Selection by Text Content
Print all lines in which the first field is Susie:
awk '$1 == "Susie"' emp.data
The operator == tests for equality. You can also look for text containing any of a set of letters, words, and phrases by using patterns called regular expressions.
This program prints all lines that contain Susie anywhere:
awk '/Susie/' emp.data
Combinations of Patterns
Patterns can be combined with parentheses and the logical operators &&, ||, and !, which stand for AND, OR, and NOT.
Print the lines where $2 is at least 4 or $3 is at least 20:
awk '$2 >= 4 || $3 >= 20' emp.data
Lines that satisfy both conditions are printed only once.
The next one prints lines where it is not true that $2 is less than 4 and $3 is less than 20; this condition is equivalent to the first one above, though perhaps less readable.
awk '!($2 < 4 && $3 < 20)' emp.data
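One way to convince yourself that the two patterns are equivalent is to run both and compare the results. The sketch below writes its own copy of the sample data to a file named emp_demo.data (a name chosen here for illustration) so that it can be run on its own.

```shell
# Self-contained copy of the sample data from Section 1.1.
printf '%s\n' 'Beth 4.00 0' 'Dan 3.75 0' 'Kathy 4.00 10' \
              'Mark 5.00 20' 'Mary 5.50 22' 'Susie 4.25 18' > emp_demo.data

first=$(awk '$2 >= 4 || $3 >= 20' emp_demo.data)
second=$(awk '!($2 < 4 && $3 < 20)' emp_demo.data)

# Both patterns select the same lines: everyone except Dan.
[ "$first" = "$second" ] && echo "same output"
echo "$first"
```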
Data Validation
There are always errors in real data. Awk is an excellent tool for checking that data has reasonable values and is in the right format, a task that is often called data validation.
Data validation is essentially negative: instead of printing lines with desirable properties, one prints lines that are suspicious. The following commands use comparison patterns to apply five plausibility tests to each line of emp.data, one test per command:
awk 'NF != 3 { print $0, "number of fields is not equal to 3" }' emp.data
awk '$2 < 3.35 { print $0, "rate is below minimum wage" }' emp.data
awk '$2 > 10 { print $0, "rate exceeds $10 per hour" }' emp.data
awk '$3 < 0 { print $0, "negative hours worked" }' emp.data
awk '$3 > 60 { print $0, "too many hours worked" }' emp.data
If there are no errors, there's no output.
BEGIN and END
The special pattern BEGIN matches before the first line of the first input file is read, and END matches after the last line of the last file has been processed. This program uses BEGIN to print a heading:
awk 'BEGIN { print "NAME RATE HOURS"; print "" } { print }' emp.data
1.5. Computing with AWK
An action is a sequence of statements separated by newlines or semicolons. You have already seen examples in which the action was a single print statement. This section provides examples of statements for performing simple numeric and string computations. In these statements you can use not only the built-in variables like NF, but you can create your own variables for performing calculations, storing data, and the like. In awk, user-created variables are not declared.
Counting
This program uses a variable emp to count employees who have worked more than 15 hours:
awk '$3 > 15 { emp = emp + 1 } END { print emp, "employees worked more than 15 hours" }' emp.data
Awk variables used as numbers begin life with the value 0, so we didn't need to initialize emp.
Computing Sums and Averages
To count the number of employees, we can use the built-in variable NR, which holds the number of lines read so far; its value at the end of all input is the total number of lines read.
awk 'END { print NR, "employees" }' emp.data
Here is a program that uses NR to compute the average pay:
awk '{ pay = pay + $2 * $3 }
END { print NR, "employees"
      print "total pay is", pay
      print "average pay is", pay/NR
    }' emp.data
Clearly, printf could be used to produce neater output. There's also a potential error: in the unlikely case that NR is zero, the program will attempt to divide by zero and thus will generate an error message.
Handling Text
One of the strengths of awk is its ability to handle strings of characters as conveniently as most languages handle numbers. Awk variables can hold strings of characters as well as numbers. This program finds the employee who is paid the most per hour:
awk '$2 > maxrate { maxrate = $2; maxemp = $1 } END { print "highest hourly rate:", maxrate, "for", maxemp }' emp.data
In this program the variable maxrate holds a numeric value, while the variable maxemp holds a string. (If there are several employees who all make the same maximum pay, this program finds only the first.)
String Concatenation
New strings may be created by combining old ones; this operation is called concatenation. This program collects all the employee names into a single string, by appending each name and a blank to the previous value in the variable names. The value of names is printed by the END action:
awk ' { names = names $1 " " } END { print names }' emp.data
The concatenation operation is represented in an awk program by writing string values one after the other. At every input line, the first statement in the program concatenates three strings: the previous value of names, the first field, and a blank; it then assigns the resulting string to names. Thus, after all input lines have been read, names contains a single string consisting of the names of all the employees, each followed by a blank. Variables used to store strings begin life holding the null string (that is, the string containing no characters), so in this program names did not need to be explicitly initialized.
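The two initial values can be checked directly. In this minimal sketch the variable names n and s are arbitrary; neither is assigned before it is used.

```shell
out=$(awk 'BEGIN {
    print n + 0          # unset variable used as a number: 0
    print "[" s "]"      # unset variable used as a string: nothing between the brackets
    print length(s)      # so its length is 0
}')
echo "$out"
```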
Printing the Last Input Line
Although NR retains its value in an END action, $0 does not. This program is one way to print the last input line:
awk '{ last = $0 } END { print last }' emp.data
Built-in Functions
We have already seen that awk provides built-in variables that maintain frequently used quantities like the number of fields and the input line number. Similarly, there are built-in functions for computing other useful values. Besides arithmetic functions for square roots, logarithms, random numbers, and the like, there are also functions that manipulate text. One of these is length, which counts the number of characters in a string. For example, this program computes the length of each person's name:
awk '{ print $1, length($1)}' emp.data
Counting Lines, Words, and Characters
This program uses length, NF, and NR to count the number of lines, words, and characters in the input. For convenience, we'll treat each field as a word.
awk '{ nc = nc + length($0) + 1
       nw = nw + NF
     }
END  { print NR, "lines,", nw, "words,", nc, "characters" }' emp.data
We have added one for the newline character at the end of each input line, since $0 doesn't include it.
1.6. Control-Flow Statements
Awk provides an if-else statement for making decisions and several statements for writing loops, all modeled on those found in the C programming language. They can only be used in actions.
If-Else Statement
The following program computes the total and average pay of employees making more than $6.00 an hour. It uses an if to defend against division by zero in computing the average pay.
awk '$2 > 6 { n = n + 1; pay = pay + $2 * $3 }
END { if (n > 0)
          print n, "employees, total pay is", pay,
                   "average pay is", pay/n
      else
          print "no employees are paid more than $6/hour"
    }' emp.data
In the if-else statement, the condition following the if is evaluated. If it is true, the first print statement is performed. Otherwise, the second print statement is performed. Note that we can continue a long statement over several lines by breaking it after a comma.
While Statement
A while statement has a condition and a body. The statements in the body are performed repeatedly while the condition is true. This program shows how the value of an amount of money invested at a particular interest rate grows over a number of years, using the formula
value = amount (1 + rate)^years
echo '# interest1 - compute compound interest
#   input: amount rate years
#   output: compounded value at the end of each year
{   i = 1
    while (i <= $3) {
        printf("\t%.2f\n", $1 * (1 + $2) ^ i)
        i = i + 1
    }
}' > interest1
cat interest1
The condition is the parenthesized expression after the while; the loop body is the two statements enclosed in braces after the condition. The \t in the printf specification string stands for a tab character; the ^ is the exponentiation operator. Text from a # to the end of the line is a comment, which is ignored by awk but should be helpful to readers of the program who want to understand what is going on.
You can type triplets of numbers at this program to see what various amounts, rates, and years produce. For example, this transaction shows how $1000 grows at 6% and 12% compound interest for five years:
echo "1000 .06 5" | awk -f interest1
echo "1000 .12 5" | awk -f interest1
For Statement
Another statement, for, compresses into a single line the initialization, test, and increment that are part of most loops. Here is the previous interest computation with a for:
echo '# interest2 - compute compound interest
#   input: amount rate years
#   output: compounded value at the end of each year
{   for (i = 1; i <= $3; i = i + 1)
        printf("\t%.2f\n", $1 * (1 + $2) ^ i)
}' > interest2
cat interest2
echo "1000 .06 5" | awk -f interest2
echo "1000 .12 5" | awk -f interest2
The initialization i = 1 is performed once. Next, the condition i <= $3 is tested; if it is true, the printf statement, which is the body of the loop, is performed. Then the increment i = i + 1 is performed after the body, and the next iteration of the loop begins with another test of the condition. The code is more compact, and since the body of the loop is only a single statement, no braces are needed to enclose it.
1.7. Arrays
Awk provides arrays for storing groups of related values. Although arrays give awk considerable power, we will show only a simple example here. The following program prints its input in reverse order by line. The first action puts the input lines into successive elements of the array line; that is, the first line goes into line[1], the second line into line[2], and so on. The END action uses a while statement to print the lines from the array from last to first:
# reverse - print input in reverse order by line
awk '{ line[NR] = $0 }   # remember each input line
END { i = NR             # print lines in reverse order
      while (i > 0) {
          print line[i]
          i = i - 1
      }
    }' emp.data
Here is the same example with a for statement:
# reverse - print input in reverse order by line
awk '{ line[NR] = $0 }   # remember each input line
END { for (i = NR; i > 0; i = i - 1)
          print line[i]
    }' emp.data
1.8. A Handful of Useful "One-liners"
Although awk can be used to write programs of some complexity, many useful programs are not much more complicated than what we've seen so far. Here is a collection of short programs that you might find handy and/or instructive. Most are variations on material already covered.
Print the total number of input lines:
awk 'END { print NR }' emp.data
Print the third input line:
awk 'NR == 3' emp.data
Print the last field of every input line:
awk '{ print $NF }' emp.data
Print the last field of the last input line:
awk '{ field = $NF } END { print field }' emp.data
Print every input line with more than two fields:
awk 'NF > 2' emp.data
Print every input line in which the last field is more than 2:
awk '$NF > 2' emp.data
Print the total number of fields in all input lines:
awk '{ nf = nf + NF } END { print nf }' emp.data
Print the total number of lines that contain Beth:
awk '/Beth/ { nlines = nlines + 1 } END { print nlines }' emp.data
Print the largest second field and the line that contains it (assumes some $2 is positive):
awk '$2 > max { max = $2; maxline = $0 } END { print max, maxline }' emp.data
Print every line that has at least one field:
awk 'NF > 0' emp.data
Print every line longer than 10 characters:
awk 'length($0) > 10' emp.data
Print the number of fields in every line followed by the line itself:
awk '{ print NF, $0 }' emp.data
Print the first two fields, in opposite order, of every line:
awk '{ print $2, $1 }' emp.data
Exchange the first two fields of every line and then print the line:
awk '{ temp = $1; $1 = $2; $2 = temp; print }' emp.data
Print every line with the first field replaced by the line number:
awk '{ $1 = NR; print }' emp.data
Print every line after erasing the second field:
awk '{ $2 = ""; print }' emp.data
Print in reverse order the fields of every line:
awk '{ for (i = NF; i > 0; i = i - 1) printf("%s ", $i)
       printf("\n")
     }' emp.data
Print the sums of the fields of every line:
awk '{ sum = 0
       for (i = 1; i <= NF; i = i + 1) sum = sum + $i
       print sum
     }' emp.data
Add up all fields in all lines and print the sum:
awk '{ for (i = 1; i <= NF; i = i + 1) sum = sum + $i }
END { print sum }' emp.data
Print every line after replacing each field by its absolute value:
awk '{ for (i = 1; i <= NF; i = i + 1) if ($i < 0) $i = -$i
       print
     }' emp.data
1.9. What Next?
You have now seen the essentials of awk. Each program in this chapter has been a sequence of pattern-action statements. Awk tests every input line against the patterns, and when a pattern matches, performs the corresponding action. Patterns can involve numeric and string comparisons, and actions can include computation and formatted printing. Besides reading through your input files automatically, awk splits each input line into fields. It also provides a number of built-in variables and functions, and lets you define your own as well. With this combination of features, quite a few useful computations can be expressed by short programs; many of the details that would be needed in another language are handled implicitly in an awk program.
The rest of the book elaborates on these basic ideas. Since some of the examples are quite a bit bigger than anything in this chapter, we encourage you strongly to begin writing programs as soon as possible. This will give you familiarity with the language and make it easier to understand the larger programs. Furthermore, nothing answers questions so well as some simple experiments. You should also browse through the whole book; each example conveys something about the language, either about how to use a particular feature, or how to create an interesting program.
2. The AWK Language
This chapter explains, mostly with examples, the constructs that make up awk programs. Because it's a description of the complete language, the material is detailed, so we recommend that you skim it, then come back as necessary to check up on details.
The simplest awk program is a sequence of pattern-action statements:
pattern { action }
pattern { action }
. . .
In some statements, the pattern may be missing; in others, the action and its enclosing braces may be missing. After awk has checked your program to make sure there are no syntactic errors, it reads the input a line at a time, and for each line, evaluates the patterns in order. For each pattern that matches the current input line, it executes the associated action. A missing pattern matches every input line, so every action with no pattern is performed at each line. A pattern-action statement consisting only of a pattern prints each input line matched by the pattern. Throughout most of this chapter, the terms "input line" and "record" are used synonymously. In Section 2.5, we will discuss multiline records, where a record may contain several lines.
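The three forms of a pattern-action statement can be compared side by side. This is a minimal sketch; the file name fruit_demo.txt and its two lines are made up for illustration.

```shell
# Two made-up input lines, used only for illustration.
printf 'apple 3\npear 0\n' > fruit_demo.txt

pattern_only=$(awk '$2 > 0' fruit_demo.txt)        # no action: matching lines are printed
action_only=$(awk '{ print $1 }' fruit_demo.txt)   # no pattern: action runs at every line
both=$(awk '$2 > 0 { print $1 }' fruit_demo.txt)   # action runs only at matching lines

printf '%s\n---\n%s\n---\n%s\n' "$pattern_only" "$action_only" "$both"
```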
The first section of this chapter describes patterns in detail. The second section begins the description of actions by describing expressions, assignments, and control-flow statements. The remaining sections cover function definitions, output, input, and how awk programs can call other programs. Most sections contain summaries of major features.
The Input File
As input for many of the awk programs in this chapter, we will use a file called countries. Each line contains the name of a country, its area in thousands of square miles, its population in millions, and the continent it is in. The data is from 1984; the USSR has been arbitrarily placed in Asia. In the file, the four columns are separated by tabs; a single blank separates North and South from America.
printf '%s\t%s\t%s\t%s\n' \
    USSR 8649 275 Asia \
    Canada 3852 25 'North America' \
    China 3705 1032 Asia \
    USA 3615 237 'North America' \
    Brazil 3286 134 'South America' \
    India 1267 746 Asia \
    Mexico 762 78 'North America' \
    France 211 55 Europe \
    Japan 144 120 Asia \
    Germany 96 61 Europe \
    England 94 56 Europe > countries
cat countries
Pattern-action statements and the statements within an action are usually separated by newlines, but several statements may appear on one line if they are separated by semicolons. A semicolon may be put at the end of any statement.
The opening brace of an action must be on the same line as the pattern it accompanies; the remainder of the action, including the closing brace, may appear on the following lines.
Blank lines are ignored; they may be inserted before or after any statement to improve the readability of a program. Blanks and tabs may be inserted around operators and operands, again to enhance readability.
Comments may be inserted at the end of any line. A comment starts with the character # and finishes at the end of the line, as in
awk '{ print $1, $3 } # I print country name and population' countries
A long statement may be spread over several lines by inserting a backslash and newline at each break:
awk '{ print \
       $1,     # country name
       $2,     # area in thousands of square miles
       $3 }    # population in millions' countries
As this example shows, statements may also be broken after commas, and a comment may be inserted at the end of each broken line.
In this book we have used several formatting styles, partly to illustrate different ones, and partly to keep programs from occupying too many lines. For short programs like those in this chapter, format doesn't much matter, but consistency and readability will help to keep longer programs manageable.
2.1. Patterns
Patterns control the execution of actions: when a pattern matches, its associated action is executed. This section describes the six types of patterns and the conditions under which they match.
Summary of Patterns
BEGIN { statements }
The statements are executed once before any input has been read.
END { statements }
The statements are executed once after all input has been read.
expression { statements }
The statements are executed at each input line where the expression is true, that is, nonzero or nonnull.
/regular expression/ { statements }
The statements are executed at each input line that contains a string matched by the regular expression.
compound pattern { statements }
A compound pattern combines expressions with && (AND), || (OR), ! (NOT), and parentheses; the statements are executed at each input line where the compound pattern is true.
pattern1, pattern2 { statements }
A range pattern matches each input line from a line matched by pattern1 to the next line matched by pattern2, inclusive; the statements are executed at each matching line.
BEGIN and END do not combine with other patterns. A range pattern cannot be part of any other pattern. BEGIN and END are the only patterns that require an action.
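Of the pattern types summarized above, the range pattern is the one that has not yet appeared in an example. Here is a minimal sketch with five made-up input lines; the range starts at the line containing two and ends, inclusively, at the next line containing four.

```shell
out=$(printf '%s\n' one two three four five | awk '/two/, /four/')
echo "$out"
```

Because the pattern has no action, each line in the range is simply printed.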
BEGIN and END
The BEGIN and END patterns do not match any input lines. Rather, the statements in the BEGIN action are executed before awk reads any input; the statements in the END action are executed after all input has been read. BEGIN and END thus provide a way to gain control for initialization and wrap up. BEGIN and END do not combine with other patterns. If there is more than one BEGIN, the associated actions are executed in the order in which they appear in the program, and similarly for multiple ENDs. Although it's not mandatory, we put BEGIN first and END last.
One common use of a BEGIN action is to change the default way that input lines are split into fields. The field separator is controlled by a built-in variable called FS. By default, fields are separated by blanks and/or tabs; this behavior occurs when FS is set to a blank. Setting FS to any character other than a blank makes that character the field separator.
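The effect of FS is easiest to see on a line that splits differently under the two settings. In this sketch the colon-separated input line is hypothetical and was chosen only because it also contains a blank.

```shell
line='alpha:beta gamma:delta'   # hypothetical input line

# Default FS: the blank separates the line into two fields.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# FS set to a colon: three fields, and the blank is now part of field 2.
colon_out=$(printf '%s\n' "$line" | awk 'BEGIN { FS = ":" } { print NF; print $2 }')

echo "fields with default FS: $default_nf"
echo "$colon_out"
```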
The following program uses the BEGIN action to set the field separator to a tab character (\t) and to put column headings on the output. The second printf statement, which is executed at each input line, formats the output into a table, neatly aligned under the column headings. The END action prints the totals. (Variables and expressions are discussed in Section 2.2.)
awk '# print countries with column headers and totals
BEGIN { FS = "\t"   # make tab the field separator
        printf("%10s %6s %5s %s\n\n",
               "COUNTRY", "AREA", "POP", "CONTINENT")
      }
      { printf("%10s %6d %5d %s\n", $1, $2, $3, $4)
        area = area + $2
        pop = pop + $3
      }
END   { printf("\n%10s %6d %5d\n", "TOTAL", area, pop) }' countries
Expressions as Patterns
Like most programming languages, awk is rich in expressions for describing numeric computations. Unlike many languages, awk also has expressions for describing operations on strings. Throughout this book, the term string means a sequence of zero or more characters. These may be stored in variables, or appear literally as string constants like "" or "Asia". The string "", which contains no characters, is called the null string. The term substring means a contiguous sequence of zero or more characters within a string. In every string, the null string appears as a substring of length zero before the first character, between every pair of adjacent characters, and after the last character.
Any expression can be used as an operand of any operator. If an expression has a numeric value but an operator requires a string value, the numeric value is automatically transformed into a string; similarly, a string is converted into a number when an operator demands a numeric value.
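Both directions of this automatic conversion can be seen in a few lines. The values in this sketch are arbitrary; the last line relies on the usual awk rule that a string that does not look like a number has the numeric value 0.

```shell
out=$(awk 'BEGIN {
    print 2 " apples"    # concatenation wants strings, so 2 becomes "2"
    print "3" + 4        # addition wants numbers, so "3" becomes 3
    print "abc" + 1      # a non-numeric string converts to 0
}')
echo "$out"
```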
Any expression can be used as a pattern. If an expression used as a pattern has a nonzero or nonnull value at the current input line, then the pattern matches that line. The typical expression patterns are those involving comparisons between numbers or strings. A comparison expression contains one of the six relational operators, or one of the two string-matching operators ~ (tilde) and !~ that will be discussed in the next section. These operators are listed in Table 2-1.
If the pattern is a comparison expression like NF > 10, then it matches the current input line when the condition is satisfied, that is, when the number of fields in the line is greater than ten. If the pattern is an arithmetic expression like NF, it matches the current input line when its numeric value is nonzero. If the pattern is a string expression, it matches the current input line when the string value of the expression is nonnull.
In a relational comparison, if both operands are numeric, a numeric comparison is made; otherwise, any numeric operand is converted to a string, and then the operands are compared as strings. The strings are compared character by character using the ordering provided by the machine, most often the ASCII character set. One string is said to be "less than" another if it would appear before the other according to this ordering, e.g., "Canada" < "China" and "Asia" < "Asian".
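The comparison rules can be tried out in a BEGIN action, where a true comparison prints as 1 and a false one as 0. This is a small sketch that assumes the usual ASCII ordering; the last line shows a numeric operand being converted to a string because the other operand is a string constant.

```shell
out=$(awk 'BEGIN {
    print ("Canada" < "China")   # strings: "a" comes before "h"
    print ("Asia" < "Asian")     # a prefix comes before the longer string
    print (10 < 9)               # both numeric: numeric comparison, false
    print (10 < "9")             # one string operand: "10" is compared with "9"
}')
echo "$out"
```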
The pattern
awk '$3/$2 >= 0.5' countries
selects lines where the value of the third field divided by the second is numerically greater than or equal to 0.5, while
awk '$0 >= "M"' countries
selects lines that begin with an M, N, O, etc.
Sometimes the type of a comparison operator cannot be determined solely by the syntax of the expression in which it appears. The program
awk '$1 < $4' countries
could compare the first and fourth fields of each input line either as numbers or as strings. Here, the type of the comparison depends on the values of the fields, and it may vary from line to line. In the countries file, the first and fourth fields are always strings, so string comparisons are always made.
Only if both fields are numbers is the comparison done numerically; this would be the case with
awk '$2 > $3' countries
Section 2.2 contains a more complete discussion of strings, numbers, and expressions.
String-Matching Patterns
/regexpr/
Matches when the current input line contains a substring matched by regexpr.
expression ~ /regexpr/
Matches if the string value of expression contains a substring matched by regexpr.
expression !~ /regexpr/
Matches if the string value of expression does not contain a substring matched by regexpr.
Any expression may be used in place of /regexpr/ in the context of ~ and !~.
Awk provides a notation called regular expressions for specifying and matching strings of characters. Regular expressions are widely used in Unix programs, including its text editors and shell. Restricted forms of regular expressions also occur in systems like MS-DOS as "wild-card characters" for specifying sets of filenames.
A string-matching pattern tests whether a string contains a substring matched by a regular expression.
The simplest regular expression is a string of letters and numbers, like Asia, that matches itself. To turn a regular expression into a string-matching pattern, just enclose it in slashes:
awk '/Asia/' countries
This pattern matches when the current input line contains the substring Asia, either as Asia by itself or as some part of a larger word like Asian or Pan-Asiatic. Note that blanks are significant within regular expressions: the string-matching pattern
awk '/ Asia /' countries
matches only when Asia is surrounded by blanks.
The pattern above is one of three types of string-matching patterns. Its form is a regular expression r enclosed in slashes:
awk '/r/' countries
This pattern matches an input line if the line contains a substring matched by r.
The other two types of string-matching patterns use an explicit matching operator:
expression ~ /r/
expression !~ /r/
The matching operator ~ means "is matched by" and !~ means "is not matched by." The first pattern matches when the string value of expression contains a substring matched by the regular expression r; the second pattern matches if there is no such substring.
The left operand of a matching operator is often a field: the pattern
awk '$4 ~ /Asia/' countries
matches all input lines in which the fourth field contains Asia as a substring, while
awk '$4 !~ /Asia/' countries
matches if the fourth field does not contain Asia anywhere.
Note that the string-matching pattern /Asia/ is a shorthand for $0 ~ /Asia/.
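The difference between matching the whole line and matching a single field shows up when the text occurs in only some fields. This sketch uses a made-up two-line file, match_demo.txt, in which Asia appears in the second field of one line and inside the first field of the other.

```shell
# Two made-up lines; the file name is chosen here for illustration.
printf 'Japan Asia\nAsiago Europe\n' > match_demo.txt

whole=$(awk '/Asia/' match_demo.txt)           # same as $0 ~ /Asia/: both lines match
field=$(awk '$2 ~ /Asia/' match_demo.txt)      # only the second field is tested
nofield=$(awk '$2 !~ /Asia/' match_demo.txt)   # lines whose second field lacks Asia

printf '%s\n---\n%s\n---\n%s\n' "$whole" "$field" "$nofield"
```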
Regular Expressions
A regular expression is a notation for specifying and matching strings. Like an arithmetic expression, a regular expression is a basic expression or one created by applying operators to component expressions. To understand the strings matched by a regular expression, we need to understand the strings matched by its components.
The basic regular expressions are summarized in the table above. The characters
\ ^ $ . [ ] | ( ) * + ?
are called metacharacters because they have special meanings. A regular expression consisting of a single nonmetacharacter matches itself. Thus, a single letter or digit is a basic regular expression that matches itself. To preserve the literal meaning of a metacharacter in a regular expression, precede it by a backslash. Thus, the regular expression \$ matches the character $. If a character is preceded by a single \ we'll say that character is quoted.
In a regular expression, an unquoted caret ^ matches the beginning of a string, an unquoted dollar-sign $ matches the end of a string, and an unquoted period . matches any single character. Thus,
^C      matches a C at the beginning of a string
C$      matches a C at the end of a string
^C$     matches the string consisting of the single character C
^.$     matches any string containing exactly one character
^...$   matches any string containing exactly three characters
...     matches any three consecutive characters
\.$     matches a period at the end of a string
A regular expression consisting of a group of characters enclosed in brackets is called a character class; it matches any one of the enclosed characters. For example, [AEIOU] matches any of the characters A, E, I, O, or U.
Ranges of characters can be abbreviated in a character class by using a hyphen. The character immediately to the left of the hyphen defines the beginning of the range; the character immediately to the right defines the end. Thus, [0-9] matches any digit, and [a-zA-Z][0-9] matches a letter followed by a digit. Without both a left and right operand, a hyphen in a character class denotes itself, so the character classes [+-] and [-+] match either a + or a -. The character class [A-Za-z-]+ matches words that include hyphens.
A complemented character class is one in which the first character after the [ is a ^. Such a class matches any character not in the group following the caret. Thus, [^0-9] matches any character except a digit; [^a-zA-Z] matches any character except an upper or lower-case letter.
^[ABC]     matches an A, B or C at the beginning of a string
^[^ABC]    matches any character at the beginning of a string, except A, B or C
[^ABC]     matches any character other than an A, B or C
^[^a-z]$   matches any single-character string, except a lower-case letter
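Anchors and character classes are often used together. The sketch below runs two such patterns over six made-up one-word input lines and labels each line that matches; the labels are only for illustration.

```shell
out=$(printf '%s\n' a1 B7 9z x12 - 7 | awk '
    /^[a-zA-Z][0-9]$/ { print $0, "letter-digit" }
    /^[^a-z]$/        { print $0, "single non-lowercase character" }
')
echo "$out"
```

The lines 9z and x12 match neither pattern: the first does not start with a letter, and the second has three characters where the anchored pattern allows exactly two.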
Inside a character class, all characters have their literal meaning, except for the quoting character \, ^ at the beginning, and - between two characters. Thus, [.] matches a period and ^[^^] matches any character except a caret at the beginning of a string.
Parentheses are used in regular expressions to specify how components are grouped. There are two binary regular expression operators: alternation and concatenation. The alternation operator | is used to specify alternatives: if r1 and r2 are regular expressions, then r1|r2 matches any string matched by r1 or by r2.
There is no explicit concatenation operator. If r1 and r2 are regular expressions, then (r1)(r2) (with no blank between (r1) and (r2)) matches any string of the form xy where r1 matches x and r2 matches y. The parentheses around r1 or r2 can be omitted, if the contained regular expression does not contain the alternation operator. The regular expression
(Asian|European|North American) (male|female) (black|blue)bird
matches twelve strings ranging from