12.4. Text Processing Commands

Commands affecting text and text files

sort

File sorter, often used as a filter in a pipe. This command sorts a text stream or file forwards or backwards, or according to various keys or character positions. Using the -m option, it merges presorted input files. The info page lists its many capabilities and options. See Example 10-9, Example 10-10, and Example A-9.
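
A few representative invocations (the filenames here are hypothetical):
sort -n numbers.lst              # Numerical sort, so that 10 sorts after 9.
sort -r names.lst                # Reverse (descending) sort.
sort -k 2 datafile               # Sort, starting with the second field.
sort -m sorted1.lst sorted2.lst  # Merge two already-sorted files.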

tsort

Topological sort. It reads in pairs of whitespace-separated strings, interprets each pair as a precedence constraint (the first item must come before the second), and outputs an ordering of the items consistent with all the pairs.
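
A brief sketch: each input pair "X Y" asserts that X must precede Y, and tsort prints one ordering consistent with all the pairs (the exact ordering may vary between implementations):
tsort <<EOF
socks shoes
pants shoes
underwear pants
EOF
# One possible output:
#   socks
#   underwear
#   pants
#   shoes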

uniq

This filter removes duplicate lines from a sorted file. It is often seen in a pipe coupled with sort.
cat list-1 list-2 list-3 | sort | uniq > final.list
# Concatenates the list files,
# sorts them,
# removes duplicate lines,
# and finally writes the result to an output file.

The useful -c option prefixes each line of the input file with its number of occurrences.

bash$ cat testfile
This line occurs only once.
This line occurs twice.
This line occurs twice.
This line occurs three times.
This line occurs three times.
This line occurs three times.


bash$ uniq -c testfile
      1 This line occurs only once.
      2 This line occurs twice.
      3 This line occurs three times.


bash$ sort testfile | uniq -c | sort -nr
      3 This line occurs three times.
      2 This line occurs twice.
      1 This line occurs only once.

The command string sort INPUTFILE | uniq -c | sort -nr produces a frequency-of-occurrence listing for INPUTFILE (the -nr options to sort give a reverse numerical sort). This template finds use in analysis of log files and dictionary lists, and wherever the lexical structure of a document needs to be examined, as the word-frequency script wf.sh demonstrates below.

bash$ cat testfile
This line occurs only once.
This line occurs twice.
This line occurs twice.
This line occurs three times.
This line occurs three times.
This line occurs three times.


bash$ ./wf.sh testfile
      6 this
      6 occurs
      6 line
      3 times
      3 three
      2 twice
      1 only
      1 once

expand, unexpand

The expand filter converts tabs to spaces. It is often used in a pipe.

The unexpand filter converts spaces to tabs. This reverses the effect of expand.
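
For example (the filenames are hypothetical):
expand -t 4 tabbed.txt > spaced.txt        # Each tab becomes up to 4 spaces.
unexpand -a spaced.txt > tabbed-again.txt  # -a converts all runs of spaces,
                                           # not just the leading ones.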

cut

A tool for extracting fields from files. It is similar to the print $N command set in awk, but more limited. It may be simpler to use cut in a script than awk. Particularly important are the -d (delimiter) and -f (field specifier) options.

Using cut to obtain a listing of the mounted filesystems:
cat /etc/mtab | cut -d ' ' -f1,2

Using cut to list the OS and kernel version:
uname -a | cut -d" " -f1,3,11,12

Using cut to extract message headers from an e-mail folder:
bash$ grep '^Subject:' read-messages | cut -c10-80
Re: Linux suitable for mission-critical apps?
MAKE MILLIONS WORKING AT HOME!!!
Spam complaint
Re: Spam complaint

Using cut to parse a file:
# List all the users in /etc/passwd.

FILENAME=/etc/passwd

for user in $(cut -d: -f1 $FILENAME)
do
  echo "$user"
done

# Thanks, Oleg Philon for suggesting this.

cut -d ' ' -f2,3 filename is equivalent to awk -F'[ ]' '{ print $2, $3 }' filename

See also Example 12-33.

paste

Tool for merging together different files into a single, multi-column file. In combination with cut, useful for creating system log files.
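
A minimal sketch, assuming two hypothetical one-column files:
# names.lst holds one name per line; phones.lst one number per line.
paste names.lst phones.lst       # Two tab-separated columns.
paste -d: names.lst phones.lst   # The same, with ":" as the delimiter.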

join

Consider this a special-purpose cousin of paste. This powerful utility allows merging two files in a meaningful fashion, which essentially creates a simple version of a relational database.

The join command operates on exactly two files, but pastes together only those lines with a common tagged field (usually a numerical label), and writes the result to stdout. The files to be joined should be sorted according to the tagged field for the matchups to work properly.

File: 1.data

100 Shoes
200 Laces
300 Socks

File: 2.data

100 $40.00
200 $1.00
300 $2.00

bash$ join 1.data 2.data
100 Shoes $40.00
200 Laces $1.00
300 Socks $2.00

Note

The tagged field appears only once in the output.

head

Lists the beginning of a file to stdout (the default is 10 lines, but this can be changed). It has a number of interesting options.

Example 12-10. Generating 10-digit random numbers

#!/bin/bash
# rnd.sh: Outputs a 10-digit random number

# Script by Stephane Chazelas.

head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'


# =================================================================== #

# Analysis
# --------

# head:
# -c4 option takes first 4 bytes.

# od:
# -N4 option limits output to 4 bytes.
# -tu4 option selects unsigned decimal format for output.

# sed: 
# -n option, in combination with "p" flag to the "s" command,
# outputs only matched lines.



# The author of this script explains the action of 'sed', as follows.

# head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'
# ----------------------------------> |

# Assume output up to "sed" --------> |
# is 0000000 1198195154\n

# sed begins reading characters: 0000000 1198195154\n.
# Here it finds a newline character,
# so it is ready to process the first line (0000000 1198195154).
# It looks at its <range><action>s. The first and only one is

#   range     action
#   1         s/.* //p

# The line number is in the range, so it executes the action:
# tries to substitute the longest string ending with a space in the line
# ("0000000 ") with nothing (//), and if it succeeds, prints the result
# ("p" is a flag to the "s" command here, this is different from the "p" command).

# sed is now ready to continue reading its input. (Note that before
# continuing, if the -n option had not been passed, sed would have printed
# the line once again.)

# Now, sed reads the remainder of the characters, and finds the end of the file.
# It is now ready to process its 2nd line (which is also numbered '$' as
# it's the last one).
# It sees it is not matched by any <range>, so its job is done.

# In a few words, this sed command means:
# "On the first line only, remove every character up to the right-most space,
# then print what remains."

# A better way to do this would have been:
#           sed -e 's/.* //;q'

# Here, two <range><action>s (could have been written
#           sed -e 's/.* //' -e q):

#   range                    action
#   nothing (matches line)   s/.* //
#   nothing (matches line)   q (quit)

# Here, sed only reads its first line of input.
# It performs both actions, and prints the line (substituted) before quitting
# (because of the "q" action) since the "-n" option is not passed.

# =================================================================== #

# A simpler alternative to the above one-line script would be:
#           head -c4 /dev/urandom | od -An -tu4

exit 0

See also Example 12-30.

tail

Lists the end of a file to stdout (the default is 10 lines). Commonly used to keep track of changes to a system logfile, using the -f option, which outputs lines appended to the file.
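
For example (the logfile path varies from system to system):
tail -n 20 datafile          # The last 20 lines (filename hypothetical).
tail -f /var/log/messages    # Follow the log as it grows; Control-C to quit.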

See also Example 12-4, Example 12-30 and Example 30-6.

grep

A multi-purpose file search tool that uses regular expressions. It was originally a command/filter in the venerable ed line editor, g/re/p, that is, global - regular expression - print.

grep pattern [file...]

Search the target file(s) for occurrences of pattern, where pattern may be literal text or a regular expression.

bash$ grep '[rst]ystem.$' osinfo.txt
The GPL governs the distribution of the Linux operating system.

If no target file(s) are specified, grep works as a filter on stdin, as in a pipe.

bash$ ps ax | grep clock
765 tty1     S      0:00 xclock
901 pts/1    S      0:00 grep clock

The -i option causes a case-insensitive search.

The -w option matches only whole words.

The -l option lists only the files in which matches were found, but not the matching lines.

The -r (recursive) option searches files in the current working directory and all subdirectories below it.

The -n option lists the matching lines, together with line numbers.

bash$ grep -n Linux osinfo.txt
2:This is a file containing information about Linux.
6:The GPL governs the distribution of the Linux operating system.

The -v (or --invert-match) option filters out matches.
grep pattern1 *.txt | grep -v pattern2

# Matches all lines in "*.txt" files containing "pattern1",
# but ***not*** "pattern2".

The -c (--count) option gives a numerical count of matches, rather than actually listing the matches.
grep -c txt *.sgml   # (number of occurrences of "txt" in "*.sgml" files)


#   grep -cz .
#            ^ dot
# means count (-c) zero-separated (-z) items matching "."
# that is, non-empty ones (containing at least 1 character).
# 
printf 'a b\nc  d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz .     # 4
printf 'a b\nc  d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '$'   # 5
printf 'a b\nc  d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '^'   # 5
#
printf 'a b\nc  d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -c '$'    # 9
# By default, newline chars (\n) separate items to match. 

# Note that the -z option is GNU "grep" specific.


# Thanks, S.C.

When invoked with more than one target file, grep specifies which file contains each match.

bash$ grep Linux osinfo.txt misc.txt
osinfo.txt:This is a file containing information about Linux.
osinfo.txt:The GPL governs the distribution of the Linux operating system.
misc.txt:The Linux operating system is steadily gaining in popularity.

Tip

To force grep to show the filename when searching only one target file, simply give /dev/null as the second file.

bash$ grep Linux osinfo.txt /dev/null
osinfo.txt:This is a file containing information about Linux.
osinfo.txt:The GPL governs the distribution of the Linux operating system.

If there is a successful match, grep returns an exit status of 0, which makes it useful in a condition test in a script, especially in combination with the -q option to suppress output.
SUCCESS=0                      # if grep lookup succeeds
word=Linux
filename=data.file

grep -q "$word" "$filename"    # The "-q" option causes nothing to echo to stdout.

if [ $? -eq $SUCCESS ]
then
  echo "$word found in $filename"
else
  echo "$word not found in $filename"
fi
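
Since if tests the exit status of the command given to it directly, the intermediate check of $? may be dropped:
if grep -q "$word" "$filename"
then
  echo "$word found in $filename"
else
  echo "$word not found in $filename"
fi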

Example 30-6 demonstrates how to use grep to search for a word pattern in a system logfile.

Note

egrep is the same as grep -E. This uses a somewhat different, extended set of regular expressions, which can make the search somewhat more flexible.

fgrep is the same as grep -F. It does a literal string search (no regular expressions), which allegedly speeds things up a bit.

agrep extends the capabilities of grep to approximate matching. The search string may differ by a specified number of characters from the resulting matches. This utility is not part of the core Linux distribution.

Tip

To search compressed files, use zgrep, zegrep, or zfgrep. These also work on non-compressed files, though slower than plain grep, egrep, fgrep. They are handy for searching through a mixed set of files, some compressed, some not.

To search bzipped files, use bzgrep.

look

The command look works like grep, but does a lookup on a "dictionary", a sorted word list. By default, look searches for a match in /usr/dict/words, but a different dictionary file may be specified.
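
For example (on many current systems the dictionary actually lives in /usr/share/dict/words):
look fract                        # Dictionary words beginning with "fract",
                                  # e.g. fraction, fractional, fracture, ...
look fract /usr/share/dict/words  # The same, naming the word list explicitly.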

sed, awk

Scripting languages especially suited for parsing text files and command output. May be embedded singly or in combination in pipes and shell scripts.

sed

Non-interactive "stream editor", permits using many ex commands in batch mode. It finds many uses in shell scripts.
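
A one-line illustration (the filename is hypothetical):
sed 's/Windows/Linux/g' myfile.txt   # Replace every "Windows" with "Linux"
                                     # and write the result to stdout.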

awk

Programmable file extractor and formatter, good for manipulating and/or extracting fields (columns) in structured text files. Its syntax is similar to C.
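
A one-line illustration:
awk -F: '{ print $1, $6 }' /etc/passwd   # Print the login name and home
                                         # directory of every account.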

wc

wc gives a "word count" on a file or I/O stream:
bash$ wc /usr/doc/sed-3.02/README
20     127     838 /usr/doc/sed-3.02/README
[20 lines  127 words  838 characters]

wc -w gives only the word count.

wc -l gives only the line count.

wc -c gives only the character count.

wc -L gives only the length of the longest line.

Using wc to count how many .txt files are in the current working directory:
$ ls *.txt | wc -l
# Will work as long as none of the "*.txt" files have a linefeed in their name.

# Alternative ways of doing this are:
#      find . -maxdepth 1 -name \*.txt -print0 | grep -cz .
#      (shopt -s nullglob; set -- *.txt; echo $#)

# Thanks, S.C.

Using wc to total up the size of all the files whose names begin with letters in the range d - h:
bash$ wc [d-h]* | grep total | awk '{print $3}'
71832

Using wc to count the instances of the word "Linux" in the main source file for this book:
bash$ grep Linux abs-book.sgml | wc -l
50

See also Example 12-30 and Example 16-7.

Certain commands include some of the functionality of wc as options.
... | grep foo | wc -l
# This frequently used construct can be more concisely rendered.

... | grep -c foo
# Just use the "-c" (or "--count") option of grep.

# Thanks, S.C.

tr

Character translation filter.

Caution

Must use quoting and/or brackets, as appropriate. Quotes prevent the shell from reinterpreting the special characters in tr command sequences. Brackets should be quoted to prevent expansion by the shell.

Either tr "A-Z" "*" <filename or tr A-Z \* <filename changes all the uppercase letters in filename to asterisks (writes to stdout). On some systems this may not work, but tr A-Z '[**]' will.

The -d option deletes a range of characters.
echo "abcdef"                 # abcdef
echo "abcdef" | tr -d b-d     # aef


tr -d 0-9 <filename
# Deletes all digits from the file "filename".

The --squeeze-repeats (or -s) option deletes all but the first instance of a string of consecutive characters. This option is useful for removing excess whitespace.
bash$ echo "XXXXX" | tr --squeeze-repeats 'X'
X

The -c "complement" option inverts the character set to match. With this option, tr acts only upon those characters not matching the specified set.

bash$ echo "acfdeb123" | tr -c b-d +
+c+d+b++++

Note that tr recognizes POSIX character classes. [1]

bash$ echo "abcd2ef1" | tr '[:alpha:]' -
----2--1

fold

A filter that wraps lines of input to a specified width. This is especially useful with the -s option, which breaks lines at word spaces (see Example 12-19 and Example A-2).
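
For example (the filename is hypothetical):
fold -w 60 longlines.txt      # Wrap each line at 60 characters.
fold -s -w 60 longlines.txt   # The same, but break only at spaces.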

fmt

Simple-minded file formatter, used as a filter in a pipe to "wrap" long lines of text output.
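
For example (the filename is hypothetical):
fmt -w 60 ragged.txt   # Reflow the text into lines at most 60 characters wide.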

See also Example 12-4.

Tip

A powerful alternative to fmt is Kamil Toman's par utility, available from http://www.cs.berkeley.edu/~amc/Par/.

col

This deceptively named filter removes reverse line feeds from an input stream. It also attempts to replace whitespace with equivalent tabs. The chief use of col is in filtering the output from certain text processing utilities, such as groff and tbl.

column

Column formatter. This filter transforms list-type text output into a "pretty-printed" table by inserting tabs at appropriate places.
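
A minimal illustration, using the -t option found in current versions of column:
ls -l | column -t   # Realign the "ls -l" listing into tidy columns.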

colrm

Column removal filter. This removes columns (characters) from a file and writes the file, lacking the range of specified columns, back to stdout. colrm 2 4 <filename removes the second through fourth characters from each line of the text file filename.

Warning

If the file contains tabs or nonprintable characters, this may cause unpredictable behavior. In such cases, consider using expand and unexpand in a pipe preceding colrm.

nl

Line numbering filter. nl filename lists filename to stdout, but inserts consecutive numbers at the beginning of each non-blank line. If filename is omitted, nl operates on stdin.

The output of nl is very similar to that of cat -n; however, by default nl does not number blank lines.
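
For example (the filename is hypothetical):
nl myscript.sh      # Number the non-blank lines of the script.
nl -ba myscript.sh  # Number every line, blank or not, like "cat -n".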

pr

Print formatting filter. This will paginate files (or stdout) into sections suitable for hard copy printing or viewing on screen. Various options permit row and column manipulation, joining lines, setting margins, numbering lines, adding page headers, and merging files, among other things. The pr command combines much of the functionality of nl, paste, fold, column, and expand.

pr -o 5 --width=65 fileZZZ | more gives a nice paginated listing to screen of fileZZZ with margins set at 5 and 65.

A particularly useful option is -d, forcing double-spacing (same effect as sed -G).

gettext

A GNU utility for localization and translating the text output of programs into foreign languages. While primarily intended for C programs, gettext also finds use in shell scripts. See the info page.

iconv

A utility for converting file(s) to a different encoding (character set). Its chief use is for localization.
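
A minimal sketch (the filenames are hypothetical):
iconv -f ISO-8859-1 -t UTF-8 latin1.txt > utf8.txt   # Latin-1 to UTF-8.
iconv -l                                             # List known encodings.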

recode

Consider this a fancier version of iconv, above. This very versatile utility for converting a file to a different encoding is not part of the standard Linux installation.

TeX, gs

TeX and PostScript are text markup languages used for preparing copy for printing or formatted video display.

TeX is Donald Knuth's elaborate typesetting system. It is often convenient to write a shell script encapsulating all the options and arguments passed to one of these markup languages.

Ghostscript (gs) is a GPL-ed PostScript interpreter.

groff, tbl, eqn

Yet another text markup and display formatting language is groff. This is the enhanced GNU version of the venerable UNIX roff/troff display and typesetting package. Manpages use groff (see Example A-1).

The tbl table processing utility is considered part of groff, as its function is to convert table markup into groff commands.

The eqn equation processing utility is likewise part of groff, and its function is to convert equation markup into groff commands.

lex, yacc

The lex lexical analyzer produces programs for pattern matching. This has been replaced by the nonproprietary flex on Linux systems.

The yacc utility creates a parser based on a set of specifications. This has been replaced by the nonproprietary bison on Linux systems.

Notes

[1]

This is only true of the GNU version of tr, not the generic version often found on commercial UNIX systems.