26 Mar 2019, tagged: awk, unix, cheat sheet

awk Cheat Sheet

I needed to crunch some data quickly and decided awk was the right tool to do so. But every time I use awk, I have to go read the manual, so I decided it’s time for a cheat sheet.

Structure of an awk script

# Comments begin with a pound sign
BEGIN {
  # Instructions run before the main loop
  FS = ";" # Set a Field Separator
}

# Each line of input is matched against all of the following
# patterns. For every pattern that matches, the instructions
# in its block are run:

/^$/ { print "An empty line" }

END {
  # Instructions run after the main loop
}

Invoke awk with a script like so:

$ awk -f script <input file>
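
To see the whole structure in action, here is a minimal sketch that writes a tiny script to a temporary file and runs it with -f. The temporary file, the sample input, and the blank counter are made up for illustration.

```shell
# Write a small awk script to a temporary file (hypothetical location)
script=$(mktemp)
cat > "$script" <<'EOF'
BEGIN { FS = ";" }                        # runs once, before any input is read
/^$/  { blank++ }                         # runs for every empty line
END   { print "blank lines: " (blank + 0) }  # runs once, after all input
EOF

# Feed three lines to the script, one of them empty
out=$(printf 'a;b\n\nc;d\n' | awk -f "$script")
echo "$out"
rm -f "$script"
```

This should print "blank lines: 1". The `+ 0` only makes sure a number is printed even if no blank line was seen.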

Matching

Match every line: awk checks each record against every pattern in the script and executes the block of each pattern that matches. A block without a pattern matches every record.

{ print $0 }           # print every single line

Match blank lines:

/^$/ { print "blank" } # print "blank" for every blank line

Match on columns:

$2 ~ /[0-9]+/ { print $2 } # print column 2 if it contains a number

Relational operators to match columns:

$2 < 3 { print "less than three" } # print "less than three" if column two's value is less than three

Negate match:

$1 !~ /[0-9]+/ { print "no number" } # print "no number" if the first column contains no numbers
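
Since every matching block runs, a single record can trigger several of them. A small sketch with made-up input that shows this:

```shell
# Three input lines: a number, an empty line, and a word
out=$(printf '42\n\nfoo\n' | awk '
  /^$/     { print "blank" }               # only the empty line
  /[0-9]+/ { print "has digits: " $0 }     # only lines containing a digit
           { print "seen: [" $0 "]" }      # no pattern: every line
')
echo "$out"
```

The blocks run in the order they appear in the script, so for the first line "has digits: 42" is printed before "seen: [42]".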

Input and Output

Awk splits the input into records on the RS variable (record separator). Each input record is split into fields via the FS variable (field separator) or via the -F command line flag. Individual fields can be addressed with $<field index>, for example $1 returns the first field, $2 the second, and so on. $0 returns the whole record.

$ echo 'a;b;c' | awk -F';' '{ print $1 " - " $2 " - " $3 }'
a - b - c

Similarly to RS and FS, awk supports record and field separators for output formatting, called ORS (output record separator) and OFS (output field separator).
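
A quick sketch of the output separators, using made-up input. OFS is inserted wherever print arguments are separated by a comma, and ORS is appended after every print:

```shell
out=$(printf 'a;b;c\nd;e;f\n' | awk '
  BEGIN { FS = ";"; OFS = ", "; ORS = ";\n" }
  { print $1, $3 }    # the comma is replaced by OFS in the output
')
echo "$out"
```

This should print "a, c;" and "d, f;" on two lines. Note that plain concatenation, as in print $1 $3, does not use OFS.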

The printf function allows more control over formatting:

$ echo '3.1415, hello' | awk '{ printf("a float: %f, a string: %s\n", $1, $2) }'
a float: 3.141500, a string: hello

Variables

Variables can simply be assigned by a name, the assignment operator, and an expression:

variable_name = 1 + 2

Variables have both a numeric and a string value, and awk uses whichever is appropriate in the context. Strings that do not look like a number have a numeric value of 0.
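
A small sketch of how the same variable behaves as a number or as a string depending on the context (the variable name n is arbitrary):

```shell
out=$(awk 'BEGIN {
  n = "10"          # assigned as a string
  print n + 5       # arithmetic context: treated as the number 10
  print n 5         # concatenation context: treated as the string "10"
  print "abc" + 0   # a string that does not look like a number counts as 0
}')
echo "$out"
```

This should print 15, 105, and 0.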

Variables can also be passed to awk on the command line, after the script. Note that awk still needs some input for the main loop to run:

$ echo | awk '{ print foo }' foo=bar
bar

These variables are not available in BEGIN blocks, but you can specify variable bindings at startup with -v var=value:

$ awk -v foo=bar 'BEGIN { print foo }'
bar

Arrays can be used just like variables and don’t require initialization. Arrays are associative, i.e. both numbers and strings can be used as indexes.
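
A minimal sketch of arrays with string and numeric indexes. The array names count and line and the sample input are made up:

```shell
out=$(printf 'apple\npear\napple\n' | awk '
  { count[$1]++; line[NR] = $1 }    # arrays spring into existence on first use
  END {
    print "apple seen " count["apple"] " times"
    print "line 2 was " line[2]
    if (!("kiwi" in count)) print "no kiwi"   # "in" tests for an index without creating it
  }
')
echo "$out"
```

Merely reading count["kiwi"] would create that element, which is why the membership test uses the in operator.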

Predefined Variables

RS: Record separator

FS: Field separator

NR: number of records in input processed so far, aka line number

NF: number of fields in current record

ORS: Output record separator

OFS: Output field separator
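
As a quick illustration of NR and NF, here is a sketch that reports the record number and field count for each line of some made-up input. Since NF holds the number of fields, $NF addresses the last field:

```shell
out=$(printf 'a b c\nd e\n' | awk '{ print NR ": " NF " fields, last is " $NF }')
echo "$out"
```

This should print "1: 3 fields, last is c" and "2: 2 fields, last is e".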

Control Flow

Awk supports if, if-else, and if-else-if-else statements, as well as the ternary operator condition ? expression : other expression. The condition of an if must be enclosed in parentheses:

if ($1 > 20) { print "many!" }
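
A slightly longer sketch combining an if-else chain with the ternary operator. The thresholds, labels, and variable names are arbitrary:

```shell
out=$(printf '5\n25\n50\n' | awk '{
  if ($1 > 40) {
    label = "lots"
  } else if ($1 > 20) {
    label = "many"
  } else {
    label = "few"
  }
  parity = ($1 % 2 == 0) ? "even" : "odd"   # the ternary yields a value, not an action
  print $1, label, parity
}')
echo "$out"
```

This should print "5 few odd", "25 many odd", and "50 lots even".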

In terms of loops awk has while, do-while, and for loops. The for loop can be used like a traditional C style for loop:

for (i = 1; i <= NF; i++)
  print $i # prints each field in the current record

or in a simplified form for traversing an array’s indexes:

for (x in my_array) { print x ": " my_array[x] }
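
The while and do-while loops work as in C. A minimal sketch with arbitrary variable names:

```shell
out=$(awk 'BEGIN {
  i = 1; sum = 0
  while (i <= 3) { sum += i; i++ }   # condition is checked before each pass
  print "sum: " sum

  n = 0
  do { n++ } while (n < 0)           # body runs once before the condition is checked
  print "n = " n
}')
echo "$out"
```

This should print "sum: 6" and "n = 1".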

Furthermore, awk has the continue and break keywords, which do exactly what you would think. There are also the exit and next keywords. exit stops processing input and exits the script, though END blocks will still be executed. next skips the remaining blocks for the current record and moves on to the next record.
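
A small sketch of next and exit with made-up input: lines starting with # are skipped, and the line "stop" ends processing early.

```shell
out=$(printf 'keep\n#skip\nstop\nnever\n' | awk '
  /^#/     { next }                      # skip this record, later blocks do not run
  /^stop$/ { exit }                      # stop reading input, END still runs
           { n++; print "line: " $0 }
  END      { print "printed " n " lines" }
')
echo "$out"
```

This should print "line: keep" followed by "printed 1 lines". The line "never" is not read at all.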