6 Text Filtering Uses for the Linux awk Command

The AWK command is a versatile text-processing tool in Linux. It filters and manipulates files using patterns, conditions, and actions. It supports a wide range of scenarios, which makes it easy to extract specific data from log files, configuration files, or any text file.
Here are some common ways to filter text with AWK. They include regular expression searches, line and field selection, numeric filtering, and more.
Match text using regular expressions
Although regex has a notoriously steep learning curve, it is the Swiss Army knife of text manipulation. When you use it with tools like AWK, you can search for patterns, from simple to complex, across entire files in milliseconds.
Let’s start with the basic syntax:
awk '/pattern/ {print}' filename
The forward slashes tell AWK that you are using a regular expression, and the curly braces contain the action to perform on a match. For example, you can print all log entries containing “error” from a log file with this:
awk '/error/ {print}' syslog.log
This grabs every line containing the word “error”, but regex lets you go further. For example, to find lines starting with INFO or WARN, use this command:
awk '/^(INFO|WARN)/ {print}' syslog.log
Here, the pipe | acts as an OR operator, and the caret ^ matches the beginning of a line.
What I like about regex is that it goes beyond exact words, letting you use wildcard constructs for more complex searches. For example, you can find email addresses in a text file with this pattern:
awk '/[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]+/ {print}' contacts.txt
This pattern matches a sequence of letters and digits, followed by an @ symbol, more letters and digits, a dot, and finally more letters. Although it is not perfect for every email format, it catches the most common ones.
You can also use the tilde operator (~) for more precise matching in specific fields. This approach is very effective for structured data. For example, if you want to search for errors only in the third field of each line, you can use this:
awk '$3 ~ /ERROR/ {print}' system.log
This makes the search more precise than matching the whole line. In addition, if case sensitivity is a problem, simply use the tolower() function like this:
awk 'tolower($0) ~ /error/ {print}' mixed_case.log
This command converts each line to lowercase before searching, so it catches Error, ERROR, error, and any other capitalization.
Select lines by line number, length, number of fields or last field
Sometimes you need specific lines based on their position or characteristics, not their content. AWK handles these scenarios beautifully with built-in variables that filter lines by position or structure: NR, NF, and $NF, plus the length() function.
Let’s start with the NR variable. It stands for Number of Records (or, simply, the line number). AWK increments it for each line it reads. You can use it to grab specific lines or ranges.
To display the first 10 lines of a file, use:
awk 'NR <=10 {print}' largefile.txt
Likewise, to get all the even-numbered lines, use the modulo operator %:
awk 'NR % 2 == 0 {print}' file.txt
For ranges, you can combine several NR conditions with the operator &&, like this:
awk 'NR >= 50 && NR <= 100 {print}' file.txt
This prints lines 50 through 100. You can also bring in the length() function when you deal with files that have inconsistent formatting, or when you want to filter out empty lines.
For example, to count characters with length() and display only lines longer than 80 characters, use:
awk 'length($0) > 80 {print}' code.py
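The empty-line case works the same way. Here is a minimal sketch on made-up sample input:

```shell
# Hypothetical sample input with blank lines mixed in.
printf 'first line\n\nsecond line\n\n' |
  awk 'length($0) > 0 {print}'
```

Note that length($0) > 0 still keeps lines containing only spaces; a condition like NF > 0 would skip those too.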
Now let’s talk about another commonly used variable, NF, which holds the number of fields. AWK automatically splits each line into fields based on a delimiter (by default, spaces or tabs).
For example, for CSV files or structured data, you can use the NF variable to count the fields (columns). You can filter according to the number of fields that each line contains:
awk -F, 'NF == 5 {print}' customer_data.csv
This retains only lines with exactly 5 fields, helping you spot incomplete records. To do the opposite and print only the lines that do not have exactly 5 fields, use the != operator instead of ==. Similarly, to print lines with at least 5 fields (columns), use the >= operator.
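A quick sketch of those variants, using a small hypothetical sample file:

```shell
# Hypothetical sample CSV: one complete record, one short, one long.
printf 'a,b,c,d,e\na,b,c\na,b,c,d,e,f\n' > /tmp/customer_data.csv

# Lines that do NOT have exactly 5 fields (the short and long records):
awk -F, 'NF != 5 {print}' /tmp/customer_data.csv

# Lines with at least 5 fields:
awk -F, 'NF >= 5 {print}' /tmp/customer_data.csv
```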
While NF tells you how many fields there are, $NF gives you the value of the last field of each line, regardless of how many fields it has. This is incredibly useful when working with files where the number of columns varies.
Take this command:
awk '$NF == "COMPLETED" {print}' task_list.txt
This finds all the lines ending with COMPLETED, even if some lines have 3 fields and others have 7.
You can also combine all these conditions together:
awk 'length($0) > 50 && NF >= 4 {print}' mixed_data.txt
This command prints only lines that are longer than 50 characters and contain at least 4 fields, ensuring the output excludes short or incomplete lines.
Compare and filter numeric field values
AWK is smart enough to treat fields containing numbers as real numbers, not just text. This means you can make mathematical comparisons without any special conversion function, opening up a whole new world of filtering possibilities.
Now, let’s say you have a CSV of student scores, and you want to return only the students who scored above 80 points. You can do that with:
awk -F, '$2 > 80 {print $1, $2}' scores.csv
Here we use -F, to tell AWK that the fields are separated by commas. $2 refers to the second field or column (the score), and $1 is the name.
When working with text files, percentages can cause problems if the % symbol is included in a field. You can strip it out with gsub():
awk '{gsub(/%/, "", $4); if($4 > 75) print}' performance.txt
This command removes the % symbol from the fourth field, then checks whether the remaining number exceeds 75.
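Here is how that behaves on a small hypothetical sample, where the fourth field holds a percentage:

```shell
# Hypothetical sample: the fourth field is a percentage.
printf 'cpu web01 ok 80%%\ncpu web02 ok 60%%\n' > /tmp/performance.txt

# Strip the % from field 4, then keep rows above 75.
awk '{gsub(/%/, "", $4); if ($4 > 75) print}' /tmp/performance.txt
```

Note that assigning to $4 rebuilds the whole line, so the printed output no longer contains the % sign.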
You can also chain conditions with logical operators. For example, to extract the sales values between $500 and $2000 from a CSV file, run:
awk -F, '$3 > 500 && $3 < 2000 {print}' sales.csv
You can combine conditions with the || (OR) operator too. For example, to find records where sales are very high or very low, use:
awk -F, '$3 > 5000 || $3 < 100 {print}' sales.csv
Beyond numeric comparisons, you can also mix conditions to match specific text, check dates, or exclude unwanted records.
You can also combine && and || in one expression. However, parentheses are essential to control the logic properly. Suppose you want to find active or pending accounts, but only if their balance exceeds $1,000. You would write:
awk -F, '($2 == "ACTIVE" || $2 == "PENDING") && $3 > 1000' accounts.csv
Without parentheses, operator precedence would change the meaning of the filter.
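To see why the parentheses matter, compare both forms on a small hypothetical accounts file (status in field 2, balance in field 3):

```shell
# Hypothetical sample data.
printf 'alice,ACTIVE,500\nbob,PENDING,2000\ncarol,CLOSED,3000\n' > /tmp/accounts.csv

# With parentheses: (ACTIVE or PENDING) and balance > 1000 -> only bob.
awk -F, '($2 == "ACTIVE" || $2 == "PENDING") && $3 > 1000' /tmp/accounts.csv

# Without parentheses, && binds tighter than ||, so this means:
# ACTIVE, or (PENDING and balance > 1000) -> alice and bob.
awk -F, '$2 == "ACTIVE" || $2 == "PENDING" && $3 > 1000' /tmp/accounts.csv
```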
Capture ranges of lines between matching patterns
Most Linux users reach for grep to capture specific information. However, AWK can also select a range of lines, starting at a line that matches a start_pattern and ending at a line that matches an end_pattern.
The syntax is elegantly simple:
awk '/start_pattern/,/end_pattern/' filename
This captures the start line, every line in between, and the end line. If you want to exclude the marker lines themselves, you can use a flag-based approach like this:
awk '/BEGIN/{flag=1; next} /END/{flag=0} flag' logfile.txt
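For example, on a hypothetical file where BEGIN and END appear as literal marker lines in the data, only the lines between them survive:

```shell
# Prints "one" and "two"; the BEGIN and END marker lines are skipped.
printf 'header\nBEGIN\none\ntwo\nEND\nfooter\n' |
  awk '/BEGIN/{flag=1; next} /END/{flag=0} flag'
```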
In addition, you can capture complete error blocks that start with specific text and end at the next empty line:
awk '/ERROR:/,/^$/ {print}' application.log
Here, AWK starts printing lines when it sees a line containing “ERROR:” and continues until it finds the first empty line.
Exclude unwanted or duplicate entries
Cleaning data often means deleting what you don’t want. AWK provides simple means to filter noise and delete redundant data.
The not operator (!) is your main exclusion tool. Place it before any pattern or condition to invert the match, telling AWK to act on every line that does not match.
For example, let’s say you want to exclude all DEBUG messages from your application log. Use this:
awk '!/DEBUG/' application.log
This command prints every line of application.log unless it contains the word DEBUG.
You can also combine this with field conditions. For example, to display all traffic that does not come from your internal monitoring IP address:
awk '$1 != "10.0.0.5"' access.log
Similarly, another common AWK operation is filtering out comments (lines starting with #) and empty lines from a configuration file. To do this, use:
awk '!/^#/ && NF > 0 {print}' config.ini
This skips lines starting with # and also removes empty lines.
You can also remove duplicates based on a specific field. For example, say you have a contact_list.csv file with three columns: ID, name, and email. To keep only the first record for each unique email address (the third field), use:
awk -F, '!seen[$3]++ {print}' contact_list.csv
In addition, you can make the deduplication case-insensitive with the tolower() function.
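A sketch of that, assuming a hypothetical contact list where the same address appears with different capitalization:

```shell
# Hypothetical sample: the same address twice with different casing.
printf '1,Ann,ann@example.com\n2,Bob,ANN@EXAMPLE.COM\n3,Cal,cal@example.com\n' > /tmp/contact_list.csv

# Lower-case the email before using it as the dedup key, so
# ann@example.com and ANN@EXAMPLE.COM count as the same address.
awk -F, '!seen[tolower($3)]++ {print}' /tmp/contact_list.csv
```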
Reformat the text by choosing specific columns
Raw data often comes in a structure you don’t really need, and AWK makes it easy to reorganize it into something more useful. You can pick specific columns, change their order, or even combine them into new fields.
For example, if you only want to print the second and fifth columns from a file, you can use:
awk -F, '{print $2, $5}' sales.csv
If you want to change the order of the columns, just swap $5 and $2.
By default, AWK separates the output fields with spaces, but you can insert your own separator, such as a pipe symbol:
awk '{print $1 "|" $3}' data.txt
When you work with CSV files, the first line is often a header. You can skip it with NR > 1 and reorder the columns to display the email followed by the name:
awk -F, 'NR > 1 {print $3, "->", $1, $2}' contacts.csv
You can also combine fields to create new ones, such as joining the first and last name into a full name before printing the email:
awk -F, '{print $1 " " $2, $3}' contacts.csv
Finally, if you want clearer output, you can add your own header before the data using a BEGIN block:
awk 'BEGIN {print "Name,Email"} {print $1 "," $2}' data.txt
These simple examples are enough to start reformatting text into a layout that better suits your needs.
Remember that AWK excels at processing structured text, anything with consistent patterns or field separators. The more regular your data format, the more powerful AWK becomes. And when you combine these filtering techniques with other Linux tools via pipes, you create incredibly effective text-processing workflows.
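As a final sketch (on made-up sample data), here is AWK sitting in the middle of a pipeline, filtering on a numeric field before handing the survivors to sort:

```shell
# Keep only names whose score exceeds 10, then sort them.
printf 'alice 42\nbob 7\ncarol 99\n' |
  awk '$2 > 10 {print $1}' |
  sort
```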




