
Last modified: Thu Apr 23 16:37:47 EDT 2015

Table of Contents

  • Why learn AWK?
  • Basic Structure
  • Executing an AWK script
  • Which shell to use with AWK?
  • Dynamic Variables
  • The Essential Syntax of AWK
  • Arithmetic Expressions
    • Unary arithmetic operators
    • The Autoincrement and Autodecrement Operators
    • Assignment Operators
    • Conditional expressions
    • Regular Expressions
    • And/Or/Not
  • Summary of AWK Commands
  • AWK Built-in Variables
    • FS - The Input Field Separator Variable
    • OFS - The Output Field Separator Variable
    • NF - The Number of Fields Variable
    • NR - The Number of Records Variable
    • RS - The Record Separator Variable
    • ORS - The Output Record Separator Variable
    • FILENAME - The Current Filename Variable
  • Associative Arrays
    • Multi-dimensional Arrays
    • Example of using AWK's Associative Arrays
      • Output of the script
  • Picture Perfect PRINTF Output
    • PRINTF - formatting output
    • Escape Sequences
    • Format Specifiers
    • Width - specifying minimum field size
    • Left Justification
    • The Field Precision Value
    • Explicit File output
  • Flow Control with next and exit
  • AWK Numerical Functions
    • Trigonometric Functions
    • Exponents, logs and square roots
    • Truncating Integers
    • Random Numbers
    • The Lotto script
  • String Functions
    • The Length function
    • The Index Function
    • The Substr function
    • The Split function
    • GAWK's Tolower and Toupper function
    • NAWK's string functions
      • The Match function
      • The System function
      • The Getline function
      • The systime function
      • The Strftime function
  • User Defined Functions
  • AWK patterns
  • Formatting AWK programs
  • Environment Variables
    • ARGC - Number of arguments (NAWK/GAWK)
    • ARGV - Array of arguments (NAWK/GAWK)
    • ARGIND - Argument Index (GAWK only)
    • RSTART, RLENGTH and match (NAWK/GAWK)
    • SUBSEP - Multi-dimensional array separator (NAWK/GAWK)
    • ENVIRON - environment variables (GAWK only)
    • IGNORECASE (GAWK only)
    • CONVFMT - conversion format (GAWK only)
    • ERRNO - system errors (GAWK only)
    • FIELDWIDTHS - fixed width fields (GAWK only)
Copyright 1994,1995 Bruce Barnett and General Electric Company
Copyright 2001, 2004, 2013, 2014 Bruce Barnett
All rights reserved

You are allowed to print copies of this tutorial for your personal use, and link to this page, but you are not allowed to make electronic copies, or redistribute this tutorial in any form without permission.

Original version written in 1994 and published in the Sun Observer

Awk is an extremely versatile programming language for working on files. We'll teach you just enough to understand the examples in this page, plus a smidgen.

The examples given below have the extensions of the executing script as part of the filename. Once you download it, and make it executable, you can rename it anything you want.

Why learn AWK?

In the past I have covered grep and sed. This section discusses AWK, another cornerstone of UNIX shell programming. There are three variations of AWK:

AWK - the (very old) original from AT&T
NAWK - A newer, improved version from AT&T
GAWK - The Free Software Foundation's version
Originally, I didn't plan to discuss NAWK, but several UNIX vendors have replaced AWK with NAWK, and there are several incompatibilities between the two. It would be cruel of me to not warn you about the differences, so I will highlight those when I come to them. It is important to know that all of AWK's features are in NAWK and GAWK. Most, if not all, of NAWK's features are in GAWK. NAWK ships as part of Solaris. GAWK does not. However, many sites on the Internet have the sources freely available. If you use Linux, you have GAWK. But in general, assume that I am talking about the classic AWK unless otherwise noted. And now there is talk about MAWK, TAWK, and JAWK.

Why is AWK so important? It is an excellent filter and report writer. Many UNIX utilities generate rows and columns of information. AWK is an excellent tool for processing these rows and columns, and it is easier to use than most conventional programming languages. It can be considered a pseudo-C interpreter, as it understands the same arithmetic operators as C. AWK also has string manipulation functions, so it can search for particular strings and modify the output. AWK also has associative arrays, which are incredibly useful, and are a feature most computing languages lack. Associative arrays can make a complex problem a trivial exercise.

I'll try to cover the essential parts of AWK, and mention the extensions/variations. The "new AWK," or "nawk", comes on the Sun system, and you may find it superior to the old AWK in many ways. In particular, it has better diagnostics, and won't print out the infamous "bailing out near line ..." message the original AWK is prone to do. Instead, "nawk" prints out the line it didn't understand, and highlights the bad parts with arrows. GAWK does this as well, and this really helps a lot.
If you find yourself needing a feature that is very difficult or impossible to do in AWK, I suggest you either use NAWK or GAWK, or convert your AWK script into PERL using the "a2p" conversion program which comes with PERL. PERL is a marvelous language, and I use it all the time, but I do not plan to cover PERL in these tutorials. Having made my intention clear, I can continue with a clear conscience.

Many UNIX utilities have strange names. AWK is one of those utilities. It is not an abbreviation for awkward. In fact, it is an elegant and simple language. The word "AWK" is derived from the initials of the language's three developers: A. Aho, B. W. Kernighan and P. Weinberger.

Basic Structure

The essential organization of an AWK program follows the form:

pattern { action }
The pattern specifies when the action is performed. Like most UNIX utilities, AWK is line oriented. That is, the pattern specifies a test that is performed with each line read as input. If the condition is true, then the action is taken. The default pattern is something that matches every line. This is the blank or null pattern. Two other important patterns are specified by the keywords "BEGIN" and "END". As you might expect, these two words specify actions to be taken before any lines are read, and after the last line is read. The AWK program below:

BEGIN { print "START" }
      { print         }
END   { print "STOP"  }

adds one line before and one line after the input file. This isn't very useful, but with a simple change, we can make this into a typical AWK program:

BEGIN { print "File\tOwner"}
{ print $8, "\t", $3}
END { print " - DONE -" }
I'll improve the script in the next sections, but we'll call it "FileOwner". But let's not put it into a script or file yet. I will cover that part in a bit. Hang on and follow with me so you get the flavor of AWK.

The characters "\t" indicate a tab character, so the output lines up on even boundaries. The "$8" and "$3" have a meaning similar to a shell script. Instead of the eighth and third argument, they mean the eighth and third field of the input line. You can think of a field as a column, and the action you specify operates on each line or row read in.

There are two differences between AWK and a shell processing the characters within double quotes. AWK understands special characters that follow the "\" character, like "t". The Bourne and C UNIX shells do not. Also, unlike the shell (and PERL), AWK does not evaluate variables within strings. To explain, the second line could not be written like this:

{print "$8\t$3" }
That example would print the literal text "$8", a tab, and "$3". Inside the quotes, the dollar sign is not a special character. Outside, it corresponds to a field. What do I mean by the third and eighth field? Consider the Solaris "/usr/bin/ls -l" command, which has eight columns of information. The System V version (similar to the Linux version), "/usr/5bin/ls -l", has nine columns. The third column is the owner, and the eighth (or ninth) column is the name of the file. This AWK program can be used to process the output of the "ls -l" command, printing out the filename, then the owner, for each file. I'll show you how.

Update: On a Linux system, change "$8" to "$9".

One more point about the use of a dollar sign. In scripting languages like Perl and the various shells, a dollar sign means the word following is the name of the variable. AWK is different. The dollar sign means that we are referring to a field or column in the current line. When switching between Perl and AWK you must remember that "$" has a different meaning. So the following piece of code prints two "fields" to standard out. The first field printed is the number "5", the second is the fifth field (or column) on the input line.

BEGIN { x=5 }
{ print x, $x}
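
To see the difference between a variable and a field reference, here is a minimal sketch that feeds one made-up line to the program above. The function name show_field and the sample input are illustrative assumptions, not part of the original scripts.

```shell
#!/bin/sh
# show_field is a hypothetical wrapper around the two-line program above.
show_field() {
    awk 'BEGIN { x=5 }
    { print x, $x }'
}

# Six whitespace-separated fields; the fifth is "e".
echo "a b c d e f" | show_field     # prints: 5 e
```

The first value is the variable x itself; the second is whatever sits in field number x of the current line.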
Executing an AWK script

So let's start writing our first AWK script. There are a couple of ways to do this. Assuming the first script is called "FileOwner", the invocation would be

ls -l | FileOwner
This might generate the following if there were only two files in the current directory: File Owner

- DONE -
There are two problems with this script. Both problems are easy to fix, but I'll hold off on this until I cover the basics.

The script itself can be written in many ways. I've shown both the C shell (csh/tcsh) and the Bourne/Bash/POSIX shell versions. The C shell version would look like this:

#!/bin/csh -f
# Linux users have to change $8 to $9
awk '\
BEGIN { print "File\tOwner" } \
{ print $8, "\t", $3} \
END { print " - DONE -" } \
'

And of course, once you create this script, you need to make this script executable by typing

chmod +x awk_

As you can see in the above script, each line of the AWK script must end with a backslash if it is not the last line of the script. This is necessary as the C shell doesn't, by default, allow strings to be longer than a line. I have a long list of complaints about using the C shell. See Top Ten reasons not to use the C shell.
The Bourne shell (as do most shells) allows quoted strings to span several lines:

#!/bin/sh
# Linux users have to change $8 to $9
awk '
BEGIN { print "File\tOwner" }
{ print $8, "\t", $3}
END { print " - DONE -" }
'

And again, once it is created, it has to be made executable:

chmod +x awk_
By the way, I give example scripts in the tutorial, and use an extension on the filename to indicate the type of script. You can, of course, "install" the script in your home "bin" directory by typing

cp awk_ $HOME/bin/awk_example1
chmod +x $HOME/bin/awk_example1

A third type of AWK script is a "native" AWK script, where you don't use the shell. You can write the commands in a file, and execute

awk -f filename

Since AWK is also an interpreter, like the shell, you can save yourself a step and make the file executable by adding one line at the beginning of the file:

#!/bin/awk -f
BEGIN { print "File\tOwner" }
{ print $8, "\t", $3}
END { print " - DONE -" }

Then execute "chmod +x" and use this file as a new UNIX command.

Notice the "-f" option following "#!/bin/awk" above. It is the same option used in the third format, where you use AWK to execute the file directly, i.e. "awk -f filename". The "-f" option specifies the AWK file containing the instructions. As you can see, AWK considers lines that start with a "#" to be a comment, just like the shell. To be precise, anything from the "#" to the end of the line is a comment (unless it's inside an AWK string). However, I always comment my AWK scripts with the "#" at the start of the line, for reasons I'll discuss later.

Which format should you use? I prefer the last format when possible. It's shorter and simpler. It's also easier to debug problems. If you need to use a shell, and want to avoid using too many files, you can combine them as we did in the first and second example.

Which shell to use with AWK?

The format of the original AWK is not free-form. You cannot put new line breaks just anywhere. They must go in particular locations. To be precise, in the original AWK you can insert a new line character after the curly braces, and at the end of a command, but not elsewhere. If you wanted to break a long line into two lines at any other place, you had to use a backslash:

#!/bin/awk -f
BEGIN { print "File\tOwner" }
{ print $8, "\t", \
$3}
END { print " - DONE -" }

The Bourne shell version would be

#!/bin/sh
awk '
BEGIN { print "File\tOwner" }
{ print $8, "\t", \
$3}
END { print "done"}
'

while the C shell would be

#!/bin/csh -f
awk '
BEGIN { print "File\tOwner" }\
{ print $8, "\t", \\
$3}\
END { print "done"}\
'

As you can see, this demonstrates how awkward the C shell is when enclosing an AWK script. Not only are backslashes needed for every line, some lines need two when the old original AWK is used. Newer AWKs are more flexible about where newlines can be added. Many people, like me, will warn you about the C shell. Some of the problems are subtle, and you may never see them. Try to include an AWK or sed script within a C shell script, and the backslashes will drive you crazy. This is what convinced me to learn the Bourne shell years ago, when I was starting out (before the Korn shell or Bash shell were available). Even if you insist on using the C shell, you should at least learn enough of the Bourne/POSIX shell to set variables, which by some strange coincidence is the subject of the next section.

Dynamic Variables

Since you can make an AWK script executable by putting "#!/bin/awk -f" on the first line, including an AWK script inside a shell script isn't needed unless you want to either eliminate the need for an extra file, or pass a variable to the insides of an AWK script. Since this is a common problem, now is as good a time as any to explain the technique. I'll do this by showing a simple AWK program that will only print one column. NOTE: there will be a bug in the first version. The number of the column will be specified by the first argument. The first version of the program, which we will call "Column", looks like this:

#!/bin/sh
#NOTE - this script does not work!
column=$1
awk '{print $column}'

A suggested use is:

ls -l | Column 3

This would print the third column from the ls command, which would be the owner of the file. You can change this into a utility that counts how many files are owned by each user by adding a few commands. The extra sort is there because uniq only merges adjacent identical lines:

ls -l | Column 3 | sort | uniq -c | sort -nr
Only one problem: the script doesn't work. The value of the "column" variable is not seen by AWK. Change "awk" to "echo" to check. You need to turn off the quoting when the variable is seen. This can be done by ending the quoting, and restarting it after the variable:

#!/bin/sh
column=$1
awk '{print $'$column'}'

This is a very important concept, and throws experienced programmers a curve ball. In many computer languages, a string has a start quote, an end quote, and the contents in between. If you want to include a special character inside the quote, you must prevent the character from having the typical meaning. In the C language, this is done by putting a backslash before the character. In other languages, there is a special combination of characters to do this. In the C and Bourne shell, the quote is just a switch. It turns the interpretation mode on or off. There is really no such concept as "start of string" and "end of string". The quotes toggle a switch inside the interpreter. The quote character is not passed on to the application. This is why there are two pairs of quotes above. Notice there are two dollar signs. The first one is quoted, and is seen by AWK. The second one is not quoted, so the shell evaluates the variable, and replaces "$column" by the value. If you don't understand, either change "awk" to "echo", or change the first line to read "#!/bin/sh -x".

Some improvements are needed, however. The Bourne shell has a mechanism to provide a value for a variable if the value isn't set, or is set and the value is an empty string. This is done by using the format:

${variable:-defaultvalue}
This is shown below, where the default column will be one:

#!/bin/sh
column=${1:-1}
awk '{print $'$column'}'

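
The quote switching and the default value are easier to follow when you can see the text the shell actually builds. The following is a minimal sketch that uses echo in place of awk; the function name show_program is hypothetical and exists only for this illustration.

```shell
#!/bin/sh
# show_program is a hypothetical helper: it uses echo in place of awk
# so you can see the program text the shell would hand to AWK.
show_program() {
    column=${1:-1}              # first argument, or 1 if missing or empty
    echo '{print $'$column'}'   # quoting ends before $column and restarts after it
}

show_program 3    # prints: {print $3}
show_program      # prints: {print $1}
```

The first dollar sign sits inside the single quotes and reaches AWK untouched, while $column sits outside them and is replaced by the shell.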
We can save a line by combining these two steps:

awk '{print $'${1:-1}'}'

It is hard to read, but it is compact. There is one other method that can be used. If you execute an AWK command and include on the command line

variable=value
this variable will be set when the AWK script starts. An example of this use would be:

awk '{print $c}' c=${1:-1}

This last variation does not have the problems with quoting the previous example had. You should master the earlier example, however, because you can use it with any script or command. The second method is special to AWK. Modern AWKs have other options as well. See the FAQ.

The Essential Syntax of AWK

Earlier I discussed ways to start an AWK script. This section will discuss the various grammatical elements of AWK.

Arithmetic Expressions

There are several arithmetic operators, similar to C. These are the binary operators, which operate on two variables:

AWK Table 1
Binary Operators

Operator   Type         Meaning
+          Arithmetic   Addition
-          Arithmetic   Subtraction
*          Arithmetic   Multiplication
/          Arithmetic   Division
%          Arithmetic   Modulo
<space>    String       Concatenation

Using variables with the value of "7" and "3", AWK returns the following results for each operator when using the print command:

Expression   Result
7+3          10
7-3          4
7*3          21
7/3          2.33333
7%3          1
7 3          73

There are a few points to make. The modulus operator finds the remainder after an integer divide. The print command outputs a floating point number on the divide, but an integer for the rest. The string concatenation operator is confusing, since it isn't even visible. Place a space between two variables and the strings are concatenated together. This also shows that numbers are converted automatically into strings when needed. Unlike C, AWK doesn't have "types" of variables. There is one type only, and it can be a string or number. The conversion rules are simple. A number can easily be converted into a string. When a string needs to be used as a number, AWK converts it. The string "123" will be converted into the number 123. However, the string "123X" will be converted into the number 0. (NAWK will behave differently, and converts the string into the integer 123, which is found at the beginning of the string.)

Unary arithmetic operators

The "+" and "-" operators can be used before variables and numbers. If x equals 4, then the statement:

print -x;
will print "-4".

The Autoincrement and Autodecrement Operators

AWK also supports the "++" and "--" operators of C. Both increment or decrement the variables by one. The operator can only be used with a single variable, and can be before or after the variable. The prefix form modifies the value, and then uses the result, while the postfix form gets the value of the variable, and afterwards modifies the variable. As an example, if x has the value of 3, then the AWK statement

print x++, " ", ++x;

would print the numbers 3 and 5. These operators are also assignment operators, and can be used by themselves on a line:

x++;
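
A minimal sketch of the prefix and postfix forms follows. It uses separate statements so the result does not depend on the order in which the arguments of a single print are evaluated; the function name incr_demo is hypothetical.

```shell
#!/bin/sh
# incr_demo is a hypothetical name; the program reads no input,
# so everything happens in the BEGIN block.
incr_demo() {
    awk 'BEGIN {
        x = 3
        a = x++     # postfix: a gets the old value 3, then x becomes 4
        b = ++x     # prefix: x becomes 5 first, then b gets 5
        print a, b, x
    }'
}

incr_demo    # prints: 3 5 5
```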
Assignment Operators

Variables can be assigned new values with the assignment operators. You know about "++" and "--". The other assignment statement is simply:

variable = arithmetic_expression

Certain operators have precedence over others; parentheses can be used to control grouping. The statement

x=1+2*3 4;

is the same as

x = (1 + (2 * 3)) "4";

Both set x to "74". Notice spaces can be added for readability. AWK, like C, has special assignment operators, which combine a calculation with an assignment. Instead of saying

x=x+2;

you can more concisely say:

x+=2;

The complete list follows:

AWK Table 2
Assignment Operators

Operator   Meaning
+=         Add result to variable
-=         Subtract result from variable
*=         Multiply variable by result
/=         Divide variable by result
%=         Apply modulo to variable

Conditional expressions

The second type of expression in AWK is the conditional expression. This is used for certain tests, like the if or while. Boolean conditions evaluate to true or false. In AWK, there is a definite difference between a boolean condition and an arithmetic expression. You cannot convert a boolean condition to an integer or string. You can, however, use an arithmetic expression as a conditional expression. A value of 0 is false, while anything else is true. Undefined variables have the value of 0. Unlike AWK, NAWK lets you use booleans as integers.

Arithmetic values can also be converted into boolean conditions by using relational operators:

AWK Table 3
Relational Operators

Operator   Meaning
==         Is equal
!=         Is not equal to
>          Is greater than
>=         Is greater than or equal to
<          Is less than
<=         Is less than or equal to

These operators are the same as the C operators. They can be used to compare numbers or strings. With respect to strings, lower case letters are greater than upper case letters.

Regular Expressions

Two operators are used to compare strings to regular expressions:

AWK Table 4
Regular Expression Operators

Operator   Meaning
~          Matches
!~         Doesn't match

The order in this case is particular. The regular expression must be enclosed by slashes, and comes after the operator. AWK supports extended regular expressions, so the following are examples of valid tests:

word !~ /START/
lawrence_welk ~ /(one|two|three)/
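
The two regular expression operators can be tried out with a small filter. This is only a sketch: the function name classify, the sample words, and the messages are made up for illustration.

```shell
#!/bin/sh
# classify is a hypothetical filter that labels the first field of each line.
classify() {
    awk '{
        if ($1 ~ /^(one|two|three)$/) {
            print $1, "is a small number word"
        } else if ($1 !~ /START/) {
            print $1, "does not contain START"
        } else {
            print $1, "contains START"
        }
    }'
}

printf 'two\nRESTART\nfour\n' | classify
```

The string being tested is on the left of the operator and the slash-delimited expression is on the right, as the text above requires.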
And/Or/Not

There are two boolean operators that can be used with conditional expressions. That is, you can combine two conditional expressions with the "and" or "or" operators: "&&" and "||". There is also the unary not operator: "!".

Summary of AWK Commands

There are only a few commands in AWK. The list and syntax follows:

if ( conditional ) statement [ else statement ]
while ( conditional ) statement
for ( expression ; conditional ; expression ) statement
for ( variable in array ) statement
{ [ statement ] ...}
variable = expression
print [ expression-list ] [ > expression ]
printf format [ , expression-list ] [ > expression ]
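
A few of the commands listed above can be combined in a short program. The following is a minimal sketch, not one of the tutorial's own scripts; the function name squares and the output wording are assumptions.

```shell
#!/bin/sh
# squares is a hypothetical name. A program with only a BEGIN block
# runs without reading any input.
squares() {
    awk 'BEGIN {
        # for loop with an if/else inside it
        for (i = 1; i <= 5; i++) {
            if (i % 2 == 0) {
                kind = "even"
            } else {
                kind = "odd"
            }
            printf "%d squared is %d (%s)\n", i, i * i, kind
        }
    }'
}

squares
```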
At this point, you can use AWK as a language for simple calculations. If you wanted to calculate something, and not read any lines for input, you could use the BEGIN keyword discussed earlier, combined with an exit command:

#!/bin/awk -f
BEGIN {
# Print the squares from 1 to 10 the first way
i=1; while (i 100) { print NR, $0; }
RS - The Record Separator Variable

Normally, AWK reads one line at a time, and breaks up the line into fields. You can set the "RS" variable to change AWK's definition of a "line". If you set it to an empty string, blank lines become the record separator, so a file that contains no blank lines is read into memory as a single record. You can combine this with changing the "FS" variable. This example treats each line as a field, and prints out the second and third line:

#!/bin/awk -f
BEGIN {
# change the record separator from newline to nothing
    RS=""
# change the field separator from whitespace to newline
    FS="\n"
}
{
# print the second and third line of the file
    print $2, $3;
}
The two lines are printed with a space between. Also this will only work if the input file is less than 100 lines, therefore this technique is limited. You can use it to break words up, one word per line, using this:

#!/bin/awk -f
BEGIN {
    RS=" ";
}
{
    print ;
}
but this only works if all of the words are separated by a space. If there is a tab or punctuation inside, it would not.
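
Both uses of RS can be tried on a few bytes of made-up input. This is a sketch under the assumption of a reasonably modern AWK; the function names second_and_third and one_word_per_line are hypothetical.

```shell
#!/bin/sh
# second_and_third is a hypothetical name for the RS/FS example above.
second_and_third() {
    awk 'BEGIN {
        RS = ""      # empty record separator
        FS = "\n"    # each line of the record becomes one field
    }
    { print $2, $3 }'
}

# one_word_per_line is likewise hypothetical; it splits records on spaces.
one_word_per_line() {
    awk 'BEGIN { RS = " " } { print }'
}

printf 'first\nsecond\nthird\n' | second_and_third    # prints: second third
printf 'alpha beta gamma' | one_word_per_line
```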
ORS - The Output Record Separator Variable

The default output record separator is a newline, like the input. This can be set to be a carriage return and newline, if you need to generate a text file for a non-UNIX system.

#!/bin/awk -f
# this filter adds a carriage return to all lines
# before the newline character
BEGIN {
    ORS="\r\n"
}
{ print }
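
One way to confirm that the carriage return really is added is to look at the bytes. A minimal sketch, with the hypothetical function name add_cr:

```shell
#!/bin/sh
# add_cr is a hypothetical name for the ORS filter above.
add_cr() {
    awk 'BEGIN { ORS = "\r\n" } { print }'
}

# "hi" plus carriage return plus newline is 4 bytes; od shows each one.
printf 'hi\n' | add_cr | od -c
```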
FILENAME - The Current Filename Variable

The last variable known to regular AWK is "FILENAME", which tells you the name of the file being read.

#!/bin/awk -f
# reports which file is being read
BEGIN {
    f="";
}
{
    if (f != FILENAME) {
        print "reading", FILENAME;
        f=FILENAME;
    }
    print;
}
This can be used if several files need to be parsed by AWK. Normally you use standard input to provide AWK with information. You can also specify the filenames on the command line. If the above script was called "testfilter", and if you executed it with

testfilter file1 file2 file3

it would print out the filename before each change. An alternate way to specify this on the command line is

testfilter file1 - file3 <file2

In this case, the second file will be called "-", which is the conventional name for standard input. I have used this when I want to put some information before and after a filter operation. The prefix and postfix files contain special data before and after the real data. By checking the filename, you can parse the information differently. This is also useful to report syntax errors in particular files:

#!/bin/awk -f
{
    if (NF == 6) {
        # do the right thing
    } else {
        if (FILENAME == "-" ) {
            print "SYNTAX ERROR, Wrong number of fields,",
            "in STDIN, line #:", NR, "line: ", $0;
        } else {
            print "SYNTAX ERROR, Wrong number of fields,",
            "Filename: ", FILENAME, "line # ", NR, "line: ", $0;
        }
    }
}
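
The filename-tracking idea can be tried without touching real data. The sketch below creates two throwaway files in a temporary directory; the function name report_files, the file names, and their contents are all assumptions made for the illustration.

```shell
#!/bin/sh
# report_files is a hypothetical name; it announces each file once,
# the first time a record from it is seen.
report_files() {
    awk '{
        if (f != FILENAME) {
            print "reading", FILENAME
            f = FILENAME
        }
        print
    }' "$@"
}

# Two throwaway files in a temporary directory, for illustration only.
dir=$(mktemp -d)
printf 'a\n' > "$dir/file1"
printf 'b\n' > "$dir/file2"
report_files "$dir/file1" "$dir/file2"
rm -r "$dir"
```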
Associative Arrays

As a programmer in the 1980's, I had used several programming languages, such as BASIC, FORTRAN, COBOL, Algol, PL/1, DG/L, C, and Pascal. AWK was the first language I found that has associative arrays. (The perl language was released later, and had hash arrays, which are the same thing. But I will use the term associative arrays because that is how the AWK manual describes them.) This term may be meaningless to you, but believe me, these arrays are invaluable, and simplify programming enormously. Let me describe a problem, and show you how associative arrays can be used to reduce coding time, giving you more time to explore another stupid problem you don't want to deal with in the first place.

Let's suppose you have a directory overflowing with files, and you want to find out how many files are owned by each user, and perhaps how much disk space each user owns. You really want someone to blame; it's hard to tell who owns what file. A filter that processes the output of ls would work:

ls -l | filter

But this doesn't tell you how much space each user is using. It also doesn't work for a large directory tree. This requires find and xargs:

find . -type f -print | xargs ls -l | filter
The third column of "ls" is the username. The filter has to count how many times it sees each user. The typical program would have an array of usernames and another array that counts how many times each username has been seen. The index to both arrays is the same; you use one array to find the index, and the second to keep track of the count. I'll show you one way to do it in AWK, the wrong way:

#!/bin/awk -f
# bad example of AWK programming
# this counts how many files each user owns.
BEGIN {
    number_of_users=0;
}
{
# must make sure you only examine lines with 8 or more fields
    if (NF> 7) {
        user=0;
# look for the user in our list of users
        for (i=1; i 7) { username[$3]++; } } END { for (i in username) { print username[i], i; } }
This fixes the problem of counting the line with the total. However, it still generates an error when an empty file is read as input. To fix this problem, a common technique is to make sure the array always exists, and has a special marker value which specifies that the entry is invalid. Then when reporting the results, ignore the invalid entry.

#!/bin/awk -f
BEGIN {
    username[""]=0;
}
{
    username[$3]++;
}
END {
    for (i in username) {
        if (i != "") {
            print username[i], i;
        }
    }
}
This happens to fix the other problem. Apply this technique and you will make your AWK programs more robust and easier for others to use.

Multi-dimensional Arrays

Some people ask if AWK can handle multi-dimensional arrays. It can. However, you don't use conventional two-dimensional arrays. Instead you use associative arrays. (Did I even mention how useful associative arrays are?) Remember, you can put anything in the index of an associative array. It requires a different way to think about problems, but once you understand, you won't be able to live without it. All you have to do is to create an index that combines two other indices. Suppose you wanted to effectively execute

a[1,2] = y;

This is invalid in AWK. However, the following is perfectly fine:

a[1 "," 2] = y;

Remember: the AWK string concatenation operator is the space. It combines the three strings into the single string "1,2". Then it uses it as an index into the array. That's all there is to it. There is one minor problem with associative arrays, especially if you use the for command to output each element: you have no control over the order of output. You can create an algorithm to generate the indices to an associative array, and control the order this way. However, this is difficult to do. Since UNIX provides an excellent sort utility, most programmers separate the information processing from the sorting. I'll show you what I mean.

Example of using AWK's Associative Arrays

I often find myself using certain techniques repeatedly in AWK. This example will demonstrate these techniques, and illustrate the power and elegance of AWK. The program is simple and common. The disk is full. Who's gonna be blamed? I just hope you use this power wisely. Remember, you may be the one who filled up the disk.

Having resolved my moral dilemma, by placing the burden squarely on your shoulders, I will describe the program in detail. I will also discuss several tips you will find useful in large AWK programs. First, initialize all arrays used in a for loop. There will be four arrays for this purpose. Initialization is easy:

u_count[""]=0;
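
The marker entry and the hand-off to sort can be seen together in a small filter. This is a minimal sketch rather than the tutorial's own program: the function name count_owners and the sample listing are made up, and the input is assumed to look like "ls -l" output with the owner in the third field.

```shell
#!/bin/sh
# count_owners is a hypothetical name. It expects "ls -l" style input
# in which the third field is the owner.
count_owners() {
    awk 'BEGIN { username[""] = 0 }      # marker entry, so the array always exists
    NF > 7 { username[$3]++ }            # skips short lines such as the "total" line
    END {
        for (i in username) {
            if (i != "") {               # ignore the marker
                print username[i], i
            }
        }
    }' | sort -nr                        # ordering is left to sort
}

# A made-up listing: alice owns two files, bob owns one.
printf '%s\n' \
    'total 12' \
    '-rw-r--r-- 1 alice staff 10 Jan 1 00:00 a.txt' \
    '-rw-r--r-- 1 bob staff 20 Jan 1 00:00 b.txt' \
    '-rw-r--r-- 1 alice staff 30 Jan 1 00:00 c.txt' | count_owners
```

With this input the filter reports "2 alice" followed by "1 bob", and with empty input it prints nothing.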
The second tip is to pick a convention for arrays. Selecting the names of the arrays, and the indices for each array is very important. In a complex program, it can become confusing to remember which array contains what. I suggest you clearly identify the indices and contents of each array. To demonstrate, I will use a "_count" to indicate the number of files, and "_sum" to indicate the sum of the file sizes. In addition, the part before the "_" specifies the index used for the array, which will be either "u" for user, "g" for group, "ug" for the user and group combination, and "all" for the total for all files. In other programs, I have used names like

username_to_directory[username]=directory;

Follow a convention like this, and it will be hard for you to forget the purpose of each associative array. Even when a quick hack comes back to haunt you three years later. I've been there.

The third suggestion is to make sure your input is in the correct form. It's generally a good idea to be pessimistic, but I will add a simple but sufficient test in this example.

if (NF != 10) {
# ignore
} else {

etc. I placed the test and error clause up front, so the rest of the code won't be cluttered. AWK doesn't have user defined functions. NAWK, GAWK and PERL do.

The next piece of advice for complex AWK scripts is to define a name for each field used. In this case, we want the user, group and size in disk blocks. We could use the file size in bytes, but the block size corresponds to the blocks on the disk, a more accurate measurement of space. Disk blocks can be found by using "ls -s". This adds a column, so the username becomes the fourth column, etc. Therefore the script will contain:

size=$1;

This will allow us to easily adapt to changes in input. We could use "$1" throughout the script, but if we changed the number of fields, which the "-s" option does, we'd have to change each field reference. You don't want to go through an AWK script, and change all the "$1" to "$2", and also change the "$2" to "$3" because those are really the "$1" that you just changed to "$2". Of course this is confusing. That's why it's a good idea to assign names to the fields. I've been there too.

Next the AWK script will count how many times each combination of users and groups occur. That is, I am going to construct a two-part index that contains the username and groupname. This will let me count up the number of times each user/group combination occurs, and how much disk space is used.

Consider this: how would you calculate the total for just a user, or for just a group? You could rewrite the script. Or you could take the user/group totals, and total them with a second script. You could do it, but it's not the AWK way to do it. If you had to examine a bazillion files, and it takes a long time to run that script, it would be a waste to repeat this task. It's also inefficient to require two scripts when one can do everything. The proper way to solve this problem is to extract as much information as possible in one pass through the files. Therefore this script will find the number and size for each category:

Each user
Each group
Each user/group combination
All users and groups
This is why I have 4 arrays to count up the number of files. I don't really need 4 arrays, as I can use the format of the index to determine which array is which. But this does make the program easier to understand for now.

The next tip is subtle, but you will see how useful it is. I mentioned the indices into the array can be anything. If possible, select a format that allows you to merge information from several arrays. I realize this makes no sense right now, but hang in there. All will become clear soon. I will do this by constructing a universal index of the form

<user> <group>
This index will be used for all arrays. There is a space between the two values. This covers the total for the user/group combination. What about the other three arrays? I will use a "*" to indicate the total for all users or groups. Therefore the index for all files would be "* *" while the index for all of the files owned by user daemon would be "daemon *".

The heart of the script totals up the number and size of each file, putting the information into the right category. I will use 8 arrays; 4 for file sizes, and 4 for counts:

u_count[user " *"]++;
g_count["* " group]++;
ug_count[user " " group]++;
all_count["* *"]++;

u_size[user " *"]+=size;
g_size["* " group]+=size;
ug_size[user " " group]+=size;
all_size["* *"]+=size;
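To see the universal index at work before the full script, here is a minimal sketch run on three made-up records. The sample data and its three-column layout (size, user, group) are assumptions for illustration only; real ls output has more fields, and the variable name totals is mine.

```shell
#!/bin/sh
# Hypothetical input: "size user group" (invented, not real ls output).
totals=$(printf '%s\n' '10 root staff' '5 root daemon' '7 bin staff' |
awk '
{
    size = $1; user = $2; group = $3
    # one pass, four categories, one index format: "<user> <group>"
    u_count[user " *"]++;        u_size[user " *"] += size
    g_count["* " group]++;       g_size["* " group] += size
    ug_count[user " " group]++;  ug_size[user " " group] += size
    all_count["* *"]++;          all_size["* *"] += size
}
END {
    # for-in order is unspecified, so the line order may vary
    for (i in u_count)   print u_size[i], u_count[i], i
    for (i in g_count)   print g_size[i], g_count[i], i
    for (i in ug_count)  print ug_size[i], ug_count[i], i
    for (i in all_count) print all_size[i], all_count[i], i
}')
printf '%s\n' "$totals"
```

Every line has the same four-column shape regardless of which array produced it, which is what lets a single sort command order the whole report.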
This particular universal index will make sorting easier, as you will see. Also important is to sort the information in an order that is useful. You can try to force a particular output order in AWK, but why work at this, when it's a one line command for sort? The difficult part is finding the right way to sort the information. This script will sort information using the size of the category as the first sort field. The largest total will be the one for all files, so this will be one of the first lines output. However, there may be several ties for the largest number, and care must be used. The second field will be the number of files. This will help break a tie. Still, I want the totals and sub-totals to be listed before the individual user/group combinations. The third and fourth fields will be generated by the index of the array. This is the tricky part I warned you about. The script will output one string, but the sort utility will not know this. Instead, it will treat it as two fields. This will unify the results, and information from all 4 arrays will look like one array. The sort of the third and fourth fields will be dictionary order, and not numeric, unlike the first two fields. The "*" was used so these sub-total fields will be listed before the individual user/group combination.

The arrays will be printed using the following format:

for (i in u_count) {
    if (i != "") {
        print u_size[i], u_count[i], i;
    }
}

I only showed you one array, but all four are printed the same way. That's the essence of the script. The results are sorted, and I converted the space into a tab for cosmetic reasons.

Output of the script

I changed my directory to /usr/ucb and used the script in that directory. The following is the output:

size    count   user    group
3173    81      *       *
3173    81      root    *
2973    75      *       staff
2973    75      root    staff
88      3       *       daemon
88      3       root    daemon
64      2       *       kmem
64      2       root    kmem
48      1       *       tty
48      1       root    tty

This says there are 81 files in this directory, which take up 3173 disk blocks.
All of the files are owned by root. 2973 disk blocks belong to group staff. There are 3 files with group daemon, which takes up 88 disk blocks. As you can see, the first line of information is the total for all users and groups. The second line is the sub-total for the user "root". The third line is the sub-total for the group "staff". Therefore the order of the sort is useful, with the sub-totals before the individual entries. You could write a simple AWK or grep script to obtain information from just one user or one group, and the information will be easy to sort. There is only one problem. The /usr/ucb directory on my system only uses 1849 blocks; at least that's what du reports. Where's the discrepancy? The script does not understand hard links. This may not be a problem on most disks, because many users do not use hard links. Still, it does generate inaccurate results. In this case, the program vi is also e , ex , edit , view , and 2 other names. The program only exists once, but has 7 names. You can tell because the link count (field 2) reports 7. This causes the file to be counted 7 times, which causes an inaccurate total. The fix is to only count multiple links once. Examining the link count will determine if a file has multiple links. However, how can you prevent counting a link twice? There is an easy solution: all of these files have the same inode number. You can find this number with the -i option to ls . To save memory, we only have to remember the inodes of files that have multiple links. This means we have to add another column to the input, and have to renumber all of the field references. It's a good thing there are only three. Adding a new field will be easy, because I followed my own advice. The final script should be easy to follow. I have used variations of this hundreds of times and find it demonstrates the power of AWK as well as provide insight to a powerful programming paradigm. 
AWK solves these types of problems more easily than most languages. But you have to use AWK the right way.

Note - this version was written for a Solaris box. You have to verify if ls is generating the right number of arguments. The -g argument may need to be deleted, and the check for the number of fields may have to be modified. Update: I added a Linux version below, to be downloaded.

A fully working version of the program, which accurately counts disk space, appears below:

#!/bin/sh
find . -type f -print | xargs /usr/bin/ls -islg |
awk '
BEGIN {
# initialize all arrays used in for loop
    u_count[""]=0;
    g_count[""]=0;
    ug_count[""]=0;
    all_count[""]=0;
}
{
# validate your input
    if (NF != 11) {
#       ignore
    } else {
# assign field names
        inode=$1;
        size=$2;
        linkcount=$4;
        user=$5;
        group=$6;

# should I count this file?
        doit=0;
        if (linkcount == 1) {
# only one copy - count it
            doit++;
        } else {
# a hard link - only count first one
            seen[inode]++;
            if (seen[inode] == 1) {
                doit++;
            }
        }
# if doit is true, then count the file
        if (doit ) {
# total up counts in one pass
# use description array names
# use array index that unifies the arrays

# first the counts for the number of files
            u_count[user " *"]++;
            g_count["* " group]++;
            ug_count[user " " group]++;
            all_count["* *"]++;

# then the total disk space used
            u_size[user " *"]+=size;
            g_size["* " group]+=size;
            ug_size[user " " group]+=size;
            all_size["* *"]+=size;
        }
    }
}
END {
# output in a form that can be sorted
    for (i in u_count) {
        if (i != "") {
            print u_size[i], u_count[i], i;
        }
    }
    for (i in g_count) {
        if (i != "") {
            print g_size[i], g_count[i], i;
        }
    }
    for (i in ug_count) {
        if (i != "") {
            print ug_size[i], ug_count[i], i;
        }
    }
    for (i in all_count) {
        if (i != "") {
            print all_size[i], all_count[i], i;
        }
    }
} ' |
# numeric sort - biggest numbers first
# sort fields 0 and 1 first (sort starts with 0)
# followed by dictionary sort on fields 2 + 3
sort +0nr -2 +2d |
# add header
(echo "size count user group";cat -) |
# convert space to tab - makes it nice output
# the second set of quotes contains a single tab character
tr ' ' '	'
# done - I hope you like it
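The sort +0nr -2 +2d syntax used in the script is the old zero-based form, which many current sort implementations no longer accept. A rough equivalent using the POSIX -k option is sketched below. The sample lines are copied from the output shown earlier; forcing the C locale is my own assumption, made so the "*" entries reliably sort ahead of names.

```shell
#!/bin/sh
# Unsorted lines in the "size count user group" form the awk stage emits.
# Fields 1 and 2: numeric, largest first. Field 3 onward: plain byte order.
sorted=$(printf '%s\n' \
    '2973 75 root staff' \
    '88 3 * daemon' \
    '3173 81 root *' \
    '2973 75 * staff' \
    '3173 81 * *' \
    '88 3 root daemon' |
LC_ALL=C sort -k1,1nr -k2,2nr -k3)
printf '%s\n' "$sorted"
```

The totals and sub-totals come out ahead of the individual user/group lines because "*" sorts before any letter in the C locale.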
Remember when I said I didn't need to use 4 different arrays? I can use just one. This is more confusing, but more concise:

#!/bin/sh
find . -type f -print | xargs /usr/bin/ls -islg |
awk '
BEGIN {
# initialize all arrays used in for loop
    count[""]=0;
}
{
# validate your input
    if (NF != 11) {
#       ignore
    } else {
# assign field names
        inode=$1;
        size=$2;
        linkcount=$4;
        user=$5;
        group=$6;

# should I count this file?
        doit=0;
        if (linkcount == 1) {
# only one copy - count it
            doit++;
        } else {
# a hard link - only count first one
            seen[inode]++;
            if (seen[inode] == 1) {
                doit++;
            }
        }
# if doit is true, then count the file
        if (doit ) {
# total up counts in one pass
# use description array names
# use array index that unifies the arrays

# first the counts for the number of files
            count[user " *"]++;
            count["* " group]++;
            count[user " " group]++;
            count["* *"]++;

# then the total disk space used
# (the array is called sum because AWK cannot use
# the name size for both a variable and an array)
            sum[user " *"]+=size;
            sum["* " group]+=size;
            sum[user " " group]+=size;
            sum["* *"]+=size;
        }
    }
}
END {
# output in a form that can be sorted
    for (i in count) {
        if (i != "") {
            print sum[i], count[i], i;
        }
    }
} ' |
# numeric sort - biggest numbers first
# sort fields 0 and 1 first (sort starts with 0)
# followed by dictionary sort on fields 2 + 3
sort +0nr -2 +2d |
# add header
(echo "size count user group";cat -) |
# convert space to tab - makes it nice output
# the second set of quotes contains a single tab character
tr ' ' '	'
# done - I hope you like it

Here is a version that works with modern Linux systems, but assumes you have well-behaved filenames (without spaces, etc.): count_users_
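The hard-link test used in both scripts can be tried in isolation. This is only a sketch: the three-column input (inode, link count, name) is invented for the demonstration and is much simpler than real ls -i output.

```shell
#!/bin/sh
# Hypothetical input: "inode linkcount name"; inode 202 has three names.
counted=$(printf '%s\n' '101 1 a' '202 3 vi' '202 3 ex' '202 3 view' '303 1 b' |
awk '
{
    inode = $1; linkcount = $2; name = $3
    # count single-link files always, multi-link files only the first time
    if (linkcount == 1 || ++seen[inode] == 1) {
        sep = (names == "") ? "" : " "
        names = names sep name
    }
}
END { print names }')
printf '%s\n' "$counted"
```

Only the first name seen for inode 202 is kept, so the file is counted once no matter how many links point to it.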
Picture Perfect PRINTF Output

So far, I described several simple scripts that provide useful information, in a somewhat ugly output format. Columns might not line up properly, and it is often hard to find patterns or trends without this unity. As you use AWK more, you will be desirous of crisp, clean formatting. To achieve this, you must master the printf function.

PRINTF - formatting output

The printf function is very similar to the C function with the same name. C programmers should have no problem using the printf function. Printf has one of these syntactical forms:

printf ( format);
printf ( format, arguments...);
printf ( format) >expression;
printf ( format, arguments...) > expression;
The parentheses and semicolon are optional. I only use the first format to be consistent with other nearby printf statements. A print statement would do the same thing. Printf reveals its real power when formatting commands are used. The first argument to the printf function is the format. This is a string, or variable whose value is a string. This string, like all strings, can contain special escape sequences to print control characters.

Escape Sequences

The character "\" is used to "escape" or mark special characters. The list of these characters is in the table below:

AWK Table 5
Escape Sequences

Sequence                Description
\a                      ASCII bell (NAWK/GAWK only)
\b                      Backspace
\f                      Formfeed
\n                      Newline
\r                      Carriage Return
\t                      Horizontal tab
\v                      Vertical tab (NAWK only)
\ddd                    Character (1 to 3 octal digits) (NAWK only)
\xdd                    Character (hexadecimal) (NAWK only)
\<Any other character>  That character

It's difficult to explain the differences without being wordy. Hopefully I'll provide enough examples to demonstrate the differences. With NAWK, you can print three tab characters using these three different representations:

printf("\t\11\x9\n");
A tab character is decimal 9, octal 11, or hexadecimal 09. See the man page ascii(7) for more information. Similarly, you can print three double-quote characters (decimal 34, hexadecimal 22, or octal 42 ) using printf("\"\x22\42\n"); You should notice a difference between the printf function and the print function. Print terminates the line with the ORS character, and divides each field with the OFS separator. Printf does nothing unless you specify the action. Therefore you will frequently end each line with the newline character "\n", and you must specify the separating characters explicitly. Format Specifiers The power of the printf statement lies in the format specifiers, which always start with the character "%". The format specifiers are described in table 6: AWK Table 6
Format Specifiers

Specifier  Meaning
%c         ASCII Character
%d         Decimal integer
%e         Floating Point number (engineering format)
%f         Floating Point number (fixed point format)
%g         The shorter of e or f, with trailing zeros removed
%o         Octal
%s         String
%x         Hexadecimal
%%         Literal %

Again, I'll cover the differences quickly. Table 7 illustrates the differences. The first line states that printing the number 100 with the "%c" format prints a "d".

AWK Table 7
Example of format conversions

Format  Value   Results
%c              d
%c      ""      1 (NAWK?)
%c      42      "
%d              100
%e              +02
%f
%g              100
%o              144
%s
%s      "13f"   13f
%d      "13f"   0 (AWK)
%d      "13f"   13 (NAWK)
%x              64

This table reveals some differences between AWK and NAWK. When a string with numbers and letters is converted into an integer, AWK will return a zero, while NAWK will convert as much as possible. The second example, marked with "NAWK?", will return "d" on some earlier versions of NAWK, while later versions will return "1".

Using format specifiers, there is another way to print a double quote with NAWK. This demonstrates Octal, Decimal and Hexadecimal conversion. As you can see, it isn't symmetrical. Decimal conversions are done differently.

printf("%s%s%s%c\n", "\"", "\x22", "\42", 34);

Between the "%" and the format character can be four optional pieces of information. It helps to visualize these fields as:

%<sign><zero><width>.<precision>format
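As a quick preview of these optional fields, the sketch below prints one value per field type, with brackets added so the padding is visible. The sample values and the variable name out are mine, chosen only for illustration.

```shell
#!/bin/sh
out=$(awk 'BEGIN {
    printf("[%10s]\n",  "awk")     # width 10: padded on the left
    printf("[%-10s]\n", "awk")     # minus sign: padded on the right
    printf("[%05d]\n",  42)        # leading zero: zero padding
    printf("[%.3f]\n",  3.14159)   # precision: digits after the point
    printf("[%.2s]\n",  "awk")     # precision on %s: at most 2 characters
}')
printf '%s\n' "$out"
```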
I'll discuss each one separately.

Width - specifying minimum field size

If there is a number after the "%", this specifies the minimum number of characters to print. This is the width field. Spaces are added so the number of printed characters equals this number. Note that this is the minimum field size. If the field becomes too large, it will grow, so information will not be lost. Spaces are added to the left. This format allows you to line up columns perfectly. Consider the following format:

printf("%s\t%d\n", s, d);
If the string "s" is longer than 8 characters, the columns won't line up. Instead, use printf("%20s%d\n", s, d);
As long as the string is less than 20 characters, the number will start on the 21st column. If the string is too long, then the two fields will run together, making it hard to read. You may want to consider placing a single space between the fields, to make sure you will always have one space between the fields. This is very important if you want to pipe the output to another program.

Adding informational headers makes the output more readable. Be aware that changing the format of the data may make it difficult to get the columns aligned perfectly. Consider the following script:

#!/usr/bin/awk -f
BEGIN {
    printf("String Number\n");
}
{
    printf("%10s %6d\n", $1, $2);
}
It would be awkward (forgive the choice of words) to add a new column and retain the same alignment. More complicated formats would require a lot of trial and error. You have to adjust the first printf to agree with the second printf statement. I suggest

#!/usr/bin/awk -f
BEGIN {
    printf("%10s %6s\n", "String", "Number");
}
{
    printf("%10s %6d\n", $1, $2);
}
or even better

#!/usr/bin/awk -f
BEGIN {
    format1 ="%10s %6s\n";
    format2 ="%10s %6d\n";
    printf(format1, "String", "Number");
}
{
    printf(format2, $1, $2);
}
The last example, by using string variables for formatting, allows you to keep all of the formats together. This may not seem like it's very useful, but when you have multiple formats and multiple columns, it's very useful to have a set of templates like the above. If you have to add an extra space to make things line up, it's much easier to find and correct the problem with a set of format strings that are together, and the exact same width. Changing the first column from 10 characters to 11 is easy.

Left Justification

The last example places spaces before each field to make sure the minimum field width is met. What do you do if you want the spaces on the right? Add a negative sign before the width:

printf("%-10s %-6d\n", $1, $2);
This will move the printing characters to the left, with spaces added to the right.

The Field Precision Value

The precision field, which is the number between the decimal point and the format character, is more complex. Most people use it with the floating point format (%f), but surprisingly, it can be used with any format character. With the octal, decimal or hexadecimal format, it specifies the minimum number of characters. Zeros are added to meet this requirement. With the %e and %f formats, it specifies the number of digits after the decimal point. The %e "e+00" is not included in the precision. The %g format combines the characteristics of the %d and %f formats. The precision specifies the number of digits displayed, before and after the decimal point. The precision field has no effect on the %c field. The %s format has an unusual, but useful effect: it specifies the maximum number of significant characters to print.

If the first number after the "%", or after the "%-", is a zero, then the system adds zeros when padding. This includes all format types, including strings and the %c character format. This means "%010d" and "%.10d" both add leading zeros, giving a minimum of 10 digits. The format "%" is therefore redundant. Table 8 gives some examples:

AWK Table 8
Examples of complex formatting

Format Variable Results
%c 100 "d"
%10c 100 " d"
%010c 100 "000000000d"
%d 10 "10"
%10d 10 " 10"
% " 0010"
% " 00000010"
%.8d "00000010"
%010d "0000000010"
%e "+02"
% "+02"
% "+02"
%f ""
% " "
% ""
% ""
%g ""
%10g " "
% " "
% ""
%.8g ""
%o "1733"
%10o " 1733"
%010o "0000001733"
%.8o "00001733"
%s ""
%10s " "
% " 987."
% ""
%x "3db"
%10x " 3db"
%010x "00000003db"
%.8x "000003db"

There is one more topic needed to complete this lesson on printf.

Explicit File output

Instead of sending output to standard output, you can send output to a named file. The format is

printf("string\n") > "/tmp/file";
You can append to an existing file, by using ">>":

printf("string\n") >> "/tmp/file";
Like the shell, the double angle brackets indicate output is appended to the file, instead of written to an empty file. Appending to the file does not delete the old contents. However, there is a subtle difference between AWK and the shell. Consider the shell program:

#!/bin/sh
while x=`line`
do
    echo got $x >>/tmp/a
    echo got $x >/tmp/b
done

This will read standard input, and copy the standard input to files "/tmp/a" and "/tmp/b". File "/tmp/a" will grow larger, as information is always appended to the file. File "/tmp/b", however, will only contain one line. This happens because each time the shell sees the ">" or ">>" characters, it opens the file for writing, choosing the truncate/create or appending option at that time.

Now consider the equivalent AWK program:

#!/usr/bin/awk -f
{
    print $0 >>"/tmp/a"
    print $0 >"/tmp/b"
}

This behaves differently. AWK chooses the create/append option the first time a file is opened for writing. Afterwards, the use of ">" or ">>" is ignored. Unlike the shell, AWK copies all of standard input to file "/tmp/b".

Instead of a string, some versions of AWK allow you to specify an expression:

# [note to self] check this one - it might not work
printf("string\n") > FILENAME ".out";
The following uses a string concatenation expression to illustrate this:

#!/usr/bin/awk -f
END {
    for (i=0;i<30;i++) {
        printf("i=%d\n", i) > "/tmp/a" i;
    }
}
This script never finishes, because AWK can have 10 additional files open, and NAWK can have 20. If you find this to be a problem, look into PERL.

I hope this gives you the skill to make your AWK output picture perfect.

Flow Control with next and exit

You can exit from an awk script using the exit command.

#!/usr/bin/awk -f
{
# lots of code here, where you may find an error
    if ( numberOfErrors > 0 ) {
        exit
    }
}

If you want to exit with an error condition, so you can use the shell to distinguish between normal and error exits, you can include an optional integer value. Let's say you expect all lines of a file to be 60 characters, and you want to use an awk program as a filter to exit if the number of characters is not 60. Some sample code could be

#!/usr/bin/awk -f
{
    if ( length($0) > 60) {
        exit 1
    } else if ( length($0) < 60) {
        exit 2
    }
    print
}

There is a special case if you use a newer version of awk that can have multiple END commands. If one of the END commands executes the "exit" command, the other END command does not execute.

#!/usr/bin/awk -f
{
# .... some code here
}
END { print "EXIT1";exit}
END { print "EXIT2"}

Because of the "exit" command, only the first END command will execute.

The "next" command will also change the flow of the program. It causes the current processing of the pattern space to stop. The program reads in the next line, and starts executing the commands again with the new line.

AWK Numerical Functions

In previous tutorials, I have shown how useful AWK is in manipulating information, and generating reports. When you add a few functions, AWK becomes even more, mmm, functional. There are three types of functions: numeric, string and whatever's left. Table 9 lists all of the numeric functions:

AWK Table 9
Numeric Functions

Name    Function     Variant
cos     cosine       GAWK,AWK,NAWK
exp     Exponent     GAWK,AWK,NAWK
int     Integer      GAWK,AWK,NAWK
log     Logarithm    GAWK,AWK,NAWK
sin     Sine         GAWK,AWK,NAWK
sqrt    Square Root  GAWK,AWK,NAWK
atan2   Arctangent   GAWK,NAWK
rand    Random       GAWK,NAWK
srand   Seed Random  GAWK,NAWK

Trigonometric Functions

Oh joy. I bet millions, if not dozens, of my readers have been waiting for me to discuss trigonometry. Personally, I don't use trigonometry much at work, except when I go off on a tangent. Sorry about that. I don't know what came over me. I don't usually resort to puns. I'll write a note to myself, and after I sine the note, I'll have my boss cosine it. Now stop that! I hate arguing with myself. I always lose. Thinking about math I learned in the year 2 B.C. (Before Computers) seems to cause flashbacks of high school, pimples, and (shudder) times best left forgotten. The stress of remembering those days must have made me forget the standards I normally set for myself. Besides, no-one appreciates obtuse humor anyway, even if I find acute way to say it. I better change the subject fast. Combining humor and computers is a very serious matter.

Here is a NAWK script that calculates the trigonometric functions for all degrees between 0 and 360. It also shows why there is no tangent, secant or cosecant function. (They aren't necessary). If you read the script, you will learn of some subtle differences between AWK and NAWK. All this in a thin veneer of demonstrating why we learned trigonometry in the first place. What more can you ask for? Oh, in case you are wondering, I wrote this in the month of December.

#!/usr/bin/nawk -f
#
# A smattering of trigonometry...
#
# This AWK script plots the values from 0 to 360
# for the basic trigonometry functions
# but first - a review:
#
# (Note to the editor - the following diagram assumes
# a fixed width font, like Courier.
# otherwise, the diagram looks very stupid, instead of slightly stupid)
#
# Assume the following right triangle
#
#       Angle Y
#
#       |
#       |
#       |
#   a   |   c
#       |
#       |
#       +-------  Angle X
#           b
#
# since the triangle is a right angle, then
# X+Y=90
#
# Basic Trigonometric Functions. If you know the length
# of 2 sides, and the angles, you can find the length of the third side.
# Also - if you know the length of the sides, you can calculate
# the angles.
#
# The formulas are
#
# sine(X) = a/c
# cosine(X) = b/c
# tangent(X) = a/b
#
# reciprocal functions
# cotangent(X) = b/a
# secant(X) = c/b
# cosecant(X) = c/a
#
# Example 1)
# if an angle is 30, and the hypotenuse (c) is 10, then
# a = sine(30) * 10 = 5
# b = cosine(30) * 10 =
#
# The second example will be more realistic:
#
# Suppose you are looking for a Christmas tree, and
# while talking to your family, you smack into a tree
# because your head was turned, and your kids were arguing over who
# was going to put the first ornament on the tree.
#
# As you come to, you realize your feet are touching the trunk of the tree,
# and your eyes are 6 feet from the bottom of your frostbitten toes.
# While counting the stars that spin around your head, you also realize
# the top of the tree is located at a 65 degree angle, relative to your eyes.
# You suddenly realize the tree is feet high! After all,
# tangent(65 degrees) * 6 feet = feet
# All right, it isn't realistic. Not many people memorize the
# tangent table, or can estimate angles that accurately.
# I was telling the truth about the stars spinning around the head, however.
#
BEGIN {
# assign a value for pi.
    PI=;
# select an "Ed Sullivan" number - really really big
    BIG=999999;
# pick two formats
# Keep them close together, so when one column is made larger
# the other column can be adjusted to be the same width
    fmt1="%7s %8s %8s %8s %10s %10s %10s %10s\n";
# print out the title of each column
    fmt2="%7d % % % % % % %";
# old AWK wants a backslash at the end of the next line
# to continue the print statement
# new AWK allows you to break the line into two, after a comma
    printf(fmt1, "Degrees", "Radians", "Cosine", "Sine",
        "Tangent", "Cotangent", "Secant", "Cosecant");
    for (i=0;i

0) {
        line = line $0;
    } else {
        printf("missing continuation on line %d\n", NR);
    }
    }
    print line;
}
Instead of reading into the standard variables, you can specify the variable to set: getline a_line
print a_line;
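A minimal sketch of this form, assuming four invented input lines: the main loop reads one line into $0, and getline reads its partner into a variable named second (my name), leaving $0 untouched.

```shell
#!/bin/sh
out=$(printf '%s\n' one two three four | awk '
{
    first = $0
    # getline returns 1 on success, 0 at end of file, -1 on error
    if ((getline second) > 0)
        print first "+" second
    else
        print first " (no partner)"
}')
printf '%s\n' "$out"
```

Testing the return value against 0 matters here: an odd number of input lines would otherwise pair the last line with a stale value.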
NAWK and GAWK allow the getline function to be given an optional filename or string containing a filename. An example of a primitive file preprocessor, that looks for lines of the format #include filename
and substitutes that line for the contents of the file:

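#!/usr/bin/nawk -f
{
# a primitive include preprocessor
    if (($1 == "#include") && (NF == 2)) {
# found the name of the file
        filename = $2;
# stop at end of file (0) and also on a read error (-1),
# otherwise a missing file would loop forever
        while ((getline < filename) > 0) {
            print;
        }
    } else {
        print;
    }
}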
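To try the include idea end to end, the sketch below builds two throwaway files and runs a variation of the preprocessor over them. The file names, the use of mktemp -d, and the extra close() call are my assumptions for the demonstration, not part of the original script.

```shell
#!/bin/sh
# Build a scratch directory with a main file that includes a second file.
tmp=$(mktemp -d)
printf 'included line\n' > "$tmp/part.txt"
printf 'before\n#include %s\nafter\n' "$tmp/part.txt" > "$tmp/main.txt"

out=$(awk '
($1 == "#include") && (NF == 2) {
    filename = $2
    # read the named file line by line until end of file or error
    while ((getline line < filename) > 0)
        print line
    close(filename)     # allows the same file to be included again later
    next
}
{ print }' "$tmp/main.txt")

rm -rf "$tmp"
printf '%s\n' "$out"
```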
NAWK's getline can also read from a pipe. If you have a program that generates a single line, you can use

"command" | getline;
print $0;
or "command" | getline abc;
print abc;
If you have more than one line, you can loop through the results:

while (("command" | getline) > 0) {
    cmd[i++] = $0;
}
for (i in cmd) {
    printf("%s=%s\n", i, cmd[i]);
}

Only one pipe can be open at a time. If you want to open another pipe, you must execute

close("command");
This is necessary even if the end of file is reached.

The systime function

The systime() function returns the current time of day as the number of seconds since Midnight, January 1, 1970. It is useful for measuring how long portions of your GAWK code take to execute.

#!/usr/local/bin/gawk -f
# how long does it take to do a few loops?
BEGIN {
    LOOPS=100;
# do the test twice
    start=systime();
    for (i=0;i<LOOPS;i++) {
    }
    end = systime();
# calculate how long it takes to do a dummy test
    do_nothing = end-start;
# now do the test again with the *IMPORTANT* code inside
    start=systime();
    for (i=0;i<LOOPS;i++) {
# How long does this take?
        while ("date" | getline) {
            date = $0;
        }
        close("date");
    }
    end = systime();
    newtime = (end - start) - do_nothing;
    if (newtime
