- Slides: 40
Review of Awk Principles
- Awk's purpose: to give Unix a general-purpose programming language that handles text (strings) as easily as numbers
  - This makes Awk one of the most powerful of the Unix utilities
- Awk processes fields, while ed/sed process lines
- nawk (new awk) is the new standard for Awk
  - Designed to facilitate large awk programs
- Awk gets its input from
  - files
  - redirection and pipes
  - directly from standard input

- Other enhancements in nawk include:
  - Dynamic regular expressions
    - Text substitution and pattern matching functions
  - Additional built-in functions and variables
  - New operators and statements
  - Input from more than one file
  - Access to command line arguments
- nawk also improved error messages, which makes debugging considerably easier under nawk than under awk
- On most systems, nawk has replaced awk
  - On ours, both exist

Running an AWK Program
- There are several ways to run an Awk program
  - awk 'program' input_file(s)
    - program and input files are provided as command-line arguments
  - awk 'program'
    - program is a command-line argument; input is taken from standard input (yes, awk is a filter!)
  - awk -f program_file_name input_files
    - program is read from a file

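The three invocation styles can be tried side by side. This is only an illustrative sketch: the sample `emp.data` rows (name, hourly rate, hours worked), the temporary directory, and the file name `worked.awk` are assumptions, not part of the slides.

```shell
# Hypothetical sample data: name, hourly rate, hours worked.
tmp=$(mktemp -d)
printf 'Beth 4.00 0\nKathy 4.00 10\nMark 5.00 20\n' > "$tmp/emp.data"

# 1. Program and input file are both command-line arguments.
way1=$(awk '$3 > 0 { print $1 }' "$tmp/emp.data")

# 2. Program is a command-line argument; input arrives on standard input.
way2=$(awk '$3 > 0 { print $1 }' < "$tmp/emp.data")

# 3. Program is read from a file with -f.
printf '$3 > 0 { print $1 }\n' > "$tmp/worked.awk"
way3=$(awk -f "$tmp/worked.awk" "$tmp/emp.data")

printf '%s\n' "$way1"
rm -r "$tmp"
```

All three runs should select the same records, since only the way the program and input are supplied differs.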
Awk as a Filter
- Since Awk is a filter, you can also use pipes with other filters to massage its output even further
- Suppose you want to print the data for each employee along with their pay and have it sorted in order of increasing pay

    awk '{ printf("%6.2f %s\n", $2 * $3, $0) }' emp.data | sort

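A small end-to-end run of the pipeline might look like the sketch below. The sample rows are made up for illustration, and `sort -n` is a substitution of mine: the slide uses plain `sort`, which gives the same order only while every pay value fits the six-character column.

```shell
# Hypothetical sample data: name, hourly rate, hours worked.
tmp=$(mktemp -d)
printf 'Beth 4.00 0\nKathy 4.00 10\nMark 5.00 20\nSusie 4.25 18\n' > "$tmp/emp.data"

# Prefix each record with the computed pay, then sort numerically on that prefix.
sorted=$(awk '{ printf("%6.2f %s\n", $2 * $3, $0) }' "$tmp/emp.data" | sort -n)

printf '%s\n' "$sorted"
rm -r "$tmp"
```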
Errors
- If you make an error, Awk will provide a diagnostic error message

    awk '$3 == 0 [ print $1 }' emp.data
    awk: syntax error near line 1
    awk: bailing out near line 1

- Or, if you are using nawk

    nawk '$3 == 0 [ print $1 }' emp.data
    nawk: syntax error at source line 1
     context is
            $3 == 0 >>> [ <<<
            1 extra }
            1 extra [
    nawk: bailing out at source line 1
            1 extra }
            1 extra [

Structure of an AWK Program
- An Awk program consists of:
  - An optional BEGIN segment
    - For processing to execute prior to reading input
  - pattern-action pairs
    - Processing for input data
    - For each pattern matched, the corresponding action is taken
  - An optional END segment
    - Processing after end of input data

    BEGIN   { action }
    pattern { action }
    ...
    pattern { action }
    END     { action }

BEGIN and END
- Special pattern BEGIN matches before the first input line is read; END matches after the last input line has been read
- This allows for initial and wrap-up processing

    BEGIN { print "NAME RATE HOURS"; print "" }
          { print }
    END   { print "total number of employees is", NR }

Pattern-Action Pairs
- Both are optional, but one or the other is required
  - Default pattern is match every record
  - Default action is print record
- Patterns
  - BEGIN and END
  - expressions
    - $3 < 100
    - $4 == "Asia"
  - string-matching
    - /regex/, e.g. /^.*$/
    - string, e.g. /abc/ or $0 ~ "abc" (a bare "abc" used as a pattern is a non-empty string, so it is true for every record)
    - matches the first occurrence of regex or string in the record
  - compound
    - $3 < 100 && $4 == "Asia"
      - && is a logical AND
      - || is a logical OR
  - range
    - NR == 10, NR == 20
      - matches records 10 through 20 inclusive
- Patterns can take any of these forms; /regex/ and string patterns match the first instance in the record

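Range and compound patterns are easy to check against generated input. In this sketch the input is simply the numbers 1 to 30, one per line; that input and the variable names are assumptions chosen for illustration.

```shell
# Generate the numbers 1..30, one per line, as stand-in input records.
numbers=$(awk 'BEGIN { for (i = 1; i <= 30; i++) print i }')

# Range pattern: records 10 through 20 inclusive (the default action prints them).
range=$(printf '%s\n' "$numbers" | awk 'NR == 10, NR == 20')

# Compound pattern: even values greater than 25.
compound=$(printf '%s\n' "$numbers" | awk '$1 % 2 == 0 && $1 > 25')

printf '%s\n' "$range"
printf '%s\n' "$compound"
```

Neither program has an action, so each relies on the default action of printing the matching record.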
Selection
- Awk patterns are good for selecting specific lines from the input for further processing
- Selection by Comparison
  - $2 >= 5 { print }
- Selection by Computation
  - $2 * $3 > 50 { printf("%6.2f for %s\n", $2 * $3, $1) }
- Selection by Text Content
  - $1 == "Susie"
  - /Susie/
- Combinations of Patterns
  - $2 >= 4 || $3 >= 20

Data Validation
- Validating data is a common operation
- Awk is excellent at data validation
  - NF != 3 { print $0, "number of fields not equal to 3" }
  - $2 < 3.35 { print $0, "rate is below minimum wage" }
  - $2 > 10 { print $0, "rate exceeds $10 per hour" }
  - $3 < 0 { print $0, "negative hours worked" }
  - $3 > 60 { print $0, "too many hours worked" }

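The validation rules can be kept in a program file and run over a data file in one pass. The sketch below assumes a made-up `emp.data` with a few deliberately bad records; the file names `checks.awk` and `emp.data` are illustrative only.

```shell
tmp=$(mktemp -d)

# Hypothetical input: one clean record followed by four with problems.
cat > "$tmp/emp.data" <<'EOF'
Beth 4.00 0
Dan 3.00 10
Kathy 4.00 10 extra
Mark 5.00 -5
Mary 12.50 70
EOF

# Each rule is a pattern with a print action; a record may trigger several rules.
cat > "$tmp/checks.awk" <<'EOF'
NF != 3   { print $0, "number of fields not equal to 3" }
$2 < 3.35 { print $0, "rate is below minimum wage" }
$2 > 10   { print $0, "rate exceeds $10 per hour" }
$3 < 0    { print $0, "negative hours worked" }
$3 > 60   { print $0, "too many hours worked" }
EOF

report=$(awk -f "$tmp/checks.awk" "$tmp/emp.data")
printf '%s\n' "$report"
rm -r "$tmp"
```

A record that passes every test produces no output, so an empty report means the data is clean under these rules.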
Regular Expressions in Awk
- Awk uses the same regular expressions we've been using
  - ^ $ - beginning of/end of the string being matched (the record by default, or a field when matching against a field)
  - . - any character
  - [abcd] - character class
  - [^abcd] - negated character class
  - [a-z] - range of characters
  - (regex1|regex2) - alternation
  - * - zero or more occurrences of preceding expression
  - + - one or more occurrences of preceding expression
  - ? - zero or one occurrence of preceding expression
  - NOTE: the min/max {m,n} syntax and its variations {m}, {m,} are NOT supported by older versions of awk and nawk

Awk Variables
- $0, $1, $2, ..., $NF - the current record ($0) and its fields
- NR - Number of records read
- FNR - Number of records read from current file
- NF - Number of fields in current record
- FILENAME - name of current input file
- FS - Field separator, space or TAB by default
- OFS - Output field separator, space by default
- ARGC/ARGV - Argument Count, Argument Value array
  - Used to get arguments from the command line

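A short sketch of how FS, OFS, NR, NF and $NF interact; the colon-separated input lines are invented for the example.

```shell
# Split on ":" when reading, join with "-" when printing comma-separated items.
out=$(printf 'alice:x:1000\nbob:x:1001\n' |
      awk 'BEGIN { FS = ":"; OFS = "-" } { print NR, $1, $NF, NF }')

printf '%s\n' "$out"
```

Setting FS in BEGIN matters: it has to be in place before the first record is read and split.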
Arrays
- Awk provides arrays for storing groups of related data values

    # reverse - print input in reverse order by line
        { line[NR] = $0 }      # remember each line
    END { i = NR               # print lines in reverse order
          while (i > 0) {
              print line[i]
              i = i - 1
          }
        }

Operators
- = assignment operator; sets a variable equal to a value or string
- == equality operator; returns TRUE if both sides are equal
- != inequality operator
- && logical AND
- || logical OR
- ! logical NOT
- <, >, <=, >= relational operators
- +, -, /, *, %, ^ arithmetic operators
- String concatenation (written by placing two expressions next to each other; there is no explicit operator)

Control Flow Statements
- Awk provides several control flow statements for making decisions and writing loops
- If-Else

    if (expression is true or non-zero) {
        statement1
    } else {
        statement2
    }

  where statement1 and/or statement2 can be multiple statements enclosed in curly braces { }
  - the else and associated statement2 are optional

Loop Control
- While

    while (expression is true or non-zero) {
        statement1
    }

- For

    for (expression1; expression2; expression3) {
        statement1
    }

  - This has the same effect as:

    expression1
    while (expression2) {
        statement1
        expression3
    }

  - for (;;) is an infinite loop
- Do While

    do {
        statement1
    } while (expression)

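The loop forms above can be exercised with two tiny programs. This is a sketch with invented input: the first sums the fields of each record with a for loop, the second counts down with do-while, which always runs its body at least once.

```shell
# for loop: visit every field of the record and add it up.
sums=$(printf '1 2 3\n10 20\n' | awk '{
    sum = 0
    for (i = 1; i <= NF; i++)
        sum = sum + $i
    print sum
}')

# do-while loop: the test comes after the body.
countdown=$(awk 'BEGIN {
    i = 3
    do {
        printf("%d ", i)
        i = i - 1
    } while (i > 0)
    printf("\n")
}')

printf '%s\n' "$sums"
printf '%s\n' "$countdown"
```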
Computing with AWK
- Counting is easy to do with Awk

    $3 > 15 { emp = emp + 1 }
    END     { print emp, "employees worked more than 15 hrs" }

- Computing Sums and Averages is also simple

        { pay = pay + $2 * $3 }
    END { print NR, "employees"
          print "total pay is", pay
          print "average pay is", pay/NR
        }

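One caveat with the averaging program is that pay/NR divides by zero when the input is empty. The variant below is a sketch that guards the division; the sample rows and the two-decimal formatting are my own choices, not part of the slides.

```shell
summary=$(printf 'Beth 4.00 0\nKathy 4.00 10\nMark 5.00 20\n' | awk '
    { pay = pay + $2 * $3 }
END {
    print NR, "employees"
    print "total pay is", pay
    if (NR > 0)                      # skip the average when no records were read
        printf("average pay is %.2f\n", pay / NR)
}')

printf '%s\n' "$summary"
```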
Handling Text
- One major advantage of Awk is its ability to handle strings as easily as many languages handle numbers
- Awk variables can hold strings of characters as well as numbers, and Awk conveniently translates back and forth as needed
- This program finds the employee who is paid the most per hour

    $2 > maxrate { maxrate = $2; maxemp = $1 }
    END { print "highest hourly rate:", maxrate, "for", maxemp }

- String Concatenation
  - New strings can be created by combining old ones

        { names = names $1 " " }
    END { print names }

- Printing the Last Input Line
  - Although NR retains its value after the last input line has been read, $0 does not in traditional awk, so save it as you go

        { last = $0 }
    END { print last }

Command Line Arguments
- Accessed via built-ins ARGC and ARGV
- ARGC is set to the number of command line arguments
- ARGV[ ] contains each of the arguments
  - For the command line
  - awk 'script' filename
    - ARGC == 2
    - ARGV[0] == "awk"
    - ARGV[1] == "filename"
    - the script is not considered an argument
- ARGC and ARGV can be used like any other variable
- They can be assigned, compared, used in expressions, printed
- They are commonly used for verifying that the correct number of arguments was provided

ARGC/ARGV in Action

    # argv.awk - get a cmd line argument and display
    BEGIN {
        if (ARGC != 2) {
            print "Not enough arguments!"
        } else {
            print "Good evening,", ARGV[1]
        }
    }

    BEGIN {
        if (ARGC != 3) {
            print "Not enough arguments!"
            print "Usage is awk -f script in_file field_separator"
            exit
        } else {
            FS = ARGV[2]
            delete ARGV[2]
        }
    }
    $1 ~ /..3/ { print $1 "'s name in real life is", $5; ++nr }
    END { print; print "There are", nr, "students registered in your class." }

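The argument check can be tried without any input file, because a program made up only of a BEGIN action never reads input. In this sketch the shell variable `greet` and the argument `World` are illustrative names.

```shell
# The awk program is held in a shell variable so it can be run twice.
greet='BEGIN {
    if (ARGC != 2) {
        print "Not enough arguments!"
    } else {
        print "Good evening,", ARGV[1]
    }
}'

with_arg=$(awk "$greet" World)    # ARGC == 2: ARGV[0] is awk, ARGV[1] is World
without_arg=$(awk "$greet")       # ARGC == 1: only ARGV[0]

printf '%s\n' "$with_arg"
printf '%s\n' "$without_arg"
```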
getline
- How do you get input into your awk script other than on the command line?
- The getline function provides input capabilities
- getline is used to read input from either the current input or from a file or pipe
- getline returns 1 if a record was present, 0 if an end-of-file was encountered, and -1 if some error occurred

getline Function

    Expression              Sets
    getline                 $0, NF, NR, FNR
    getline var             var, NR, FNR
    getline <"file"         $0, NF
    getline var <"file"     var
    "cmd" | getline         $0, NF
    "cmd" | getline var     var

getline from stdin

    # getline.awk - demonstrate the getline function
    BEGIN {
        print "What is your first name and major? "
        while (getline > 0)
            print "Hi", $1 ", your major is", $2 "."
    }

getline From a File

    # getline1.awk - demo getline with a file
    BEGIN {
        # parentheses make sure > 0 is read as a comparison, not a redirection
        while ((getline < "emp.data") > 0)
            print $0
    }

getline From a Pipe

    # getline2.awk - show using getline with a pipe
    BEGIN {
        # testing > 0 stops the loop on an error (-1) as well as at end of input (0)
        while (("who" | getline) > 0)
            nr++
        print "There are", nr, "people logged on clyde right now."
    }

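Both the file and pipe forms of getline can be combined in one BEGIN action. The sketch below is hedged: the file contents, the `file` variable passed with `-v`, and the `echo` command are all assumptions chosen so the result is predictable.

```shell
tmp=$(mktemp -d)
printf 'Beth 4.00 0\nKathy 4.00 10\nMark 5.00 20\n' > "$tmp/emp.data"

result=$(awk -v file="$tmp/emp.data" 'BEGIN {
    # Read from a file into a variable; the loop ends on
    # end-of-file (0) and also on error (-1).
    while ((getline line < file) > 0)
        n++
    close(file)
    print n, "records"

    # Read one line from the output of a command.
    cmd = "echo hello from a pipe"
    if ((cmd | getline greeting) > 0)
        print greeting
    close(cmd)
}')

printf '%s\n' "$result"
rm -r "$tmp"
```

Calling close() on the file and the command releases them, which matters in longer programs that open many streams.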
Simple Output From AWK
- Printing Every Line
  - If an action has no pattern, the action is performed for all input lines
    - { print } will print all input lines on stdout
    - { print $0 } will do the same thing
- Printing Certain Fields
  - Multiple items can be printed on the same output line with a single print statement
  - { print $1, $3 }
  - Expressions separated by a comma are, by default, separated by a single space when output
- NF, the Number of Fields
  - Any valid expression can be used after a $ to indicate a particular field
  - One built-in expression is NF, or Number of Fields
  - { print NF, $1, $NF } will print the number of fields, the first field, and the last field in the current record
- Computing and Printing
  - You can also do computations on the field values and include the results in your output
  - { print $1, $2 * $3 }
- Printing Line Numbers
  - The built-in variable NR can be used to print line numbers
  - { print NR, $0 } will print each line prefixed with its line number
- Putting Text in the Output
  - You can also add other text to the output besides what is in the current record
  - { print "total pay for", $1, "is", $2 * $3 }
  - Note that the inserted text needs to be surrounded by double quotes

Formatted Output
- printf provides formatted output
- Syntax is printf("format string", var1, var2, ...)
- Format specifiers
  - %c - single character
  - %d - number
  - %f - floating point number
  - %s - string
  - \n - NEWLINE
  - \t - TAB
- Format modifiers
  - - left justify in column
  - n column width
  - .n number of decimal places to print

printf Examples
- printf("I have %d %s\n", how_many, animal_type)
  - formats a number (%d) followed by a string (%s)
- printf("%-10s has $%6.2f in their account\n", name, amount)
  - prints a left justified string in a 10 character wide field and a float with 2 decimal places in a six character wide field
- printf("%10s %-4.2f %-6d\n", name, interest_rate, account_number) > "account_rates"
  - prints a right justified string in a 10 character wide field, a left justified float with 2 decimal places in a 4 digit wide field and a left justified decimal number in a 6 digit wide field to a file
  - the redirection goes after the closing parenthesis; inside the parentheses, > would be read as a comparison
- printf("\t%d\t%6.2f\t%s\n", id_no, balance, name) >> "account"
  - appends a TAB separated number, 6.2 float and a string to a file

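The effect of the width and justification modifiers is easier to see with visible delimiters. In this sketch the square brackets, the sample values, and the output file name passed in as `out` are assumptions made for illustration.

```shell
tmp=$(mktemp -d)

shown=$(awk -v out="$tmp/account_rates" 'BEGIN {
    printf("[%-10s] [%6.2f]\n", "Susie", 76.5)   # left justified string, 6 wide float
    printf("[%10s] [%-6d]\n", "Mark", 42)        # right justified string, left justified integer
    printf("%s %.2f\n", "Susie", 4.25) > out     # redirection follows the closing parenthesis
    close(out)
}')
saved=$(cat "$tmp/account_rates")

printf '%s\n' "$shown"
printf '%s\n' "$saved"
rm -r "$tmp"
```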
Built-In Functions
- Arithmetic
  - sin, cos, atan2, exp, int, log, rand, sqrt
- String
  - length, substitution, find substrings, split strings
- Output
  - print, printf, print and printf to file
- Special
  - system - executes a Unix command
    - system("clear") to clear the screen
    - Note double quotes around the Unix command
  - exit - stop reading input and go immediately to the END pattern-action pair if it exists, otherwise exit the script

Built-In Arithmetic Functions

    Function      Return Value
    atan2(y, x)   arctangent of y/x (-π to π)
    cos(x)        cosine of x, with x in radians
    sin(x)        sine of x, with x in radians
    exp(x)        exponential of x, e^x
    int(x)        integer part of x
    log(x)        natural (base e) logarithm of x
    rand()        random number between 0 and 1
    srand(x)      new seed for rand()
    sqrt(x)       square root of x

Built-In String Functions

    Function                 Description
    gsub(r, s)               substitute s for r globally in $0, return number of substitutions made
    gsub(r, s, t)            substitute s for r globally in string t, return number of substitutions made
    index(s, t)              return first position of string t in s, or 0 if t is not present
    length(s)                return number of characters in s
    match(s, r)              test whether s contains a substring matched by r, return index or 0
    sprintf(fmt, expr-list)  return expr-list formatted according to format string fmt
    split(s, a)              split s into array a on FS, return number of fields
    split(s, a, fs)          split s into array a on field separator fs, return number of fields
    sub(r, s)                substitute s for the leftmost longest substring of $0 matched by r
    sub(r, s, t)             substitute s for the leftmost longest substring of t matched by r
    substr(s, p)             return suffix of s starting at position p
    substr(s, p, n)          return substring of s of length n starting at position p

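Most of the string functions in the table can be tried from a single BEGIN action. The strings below are invented for the example, and RSTART and RLENGTH, which match() sets as a side effect, are not mentioned on the slides.

```shell
strings=$(awk 'BEGIN {
    s = "banana"
    print length(s)                      # number of characters
    print index(s, "nan")                # position of the first "nan"
    print substr(s, 2, 3)                # 3 characters starting at position 2
    n = gsub(/a/, "A", s)                # replace every "a" in s
    print n, s
    k = split("2024-01-15", parts, "-")  # split on "-" into the array parts
    print k, parts[1]
    print sprintf("%03d", 7)             # format into a string, zero padded
    if (match("foobar", /o+/))           # match sets RSTART and RLENGTH
        print RSTART, RLENGTH
}')

printf '%s\n' "$strings"
```

Positions in awk strings start at 1, which is why index() and match() can use 0 to mean "not found".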