Chapter 2

Linux programs, irrespective of whether they are compiled, Bash scripts, or scripts written in another language, receive several inputs and send several outputs. As shown in the previous chapter, programs receive command-line arguments and inherit environment variables from their parent processes. Aside from reading input files, they can also read from the standard input, or stdin. Additionally, they can also receive signals such as terminate (from kill) or interrupt (from C-c). For outputs, programs report their exit status (by updating $? upon exit), write to output files, and print to standard output, or stdout. Additionally, programs can also print to standard error, or stderr.

Standard input is what the user enters at the terminal when a program prompts for user input. Standard output is what is output to and displayed on the terminal. Standard error is also displayed on the terminal by default, but this data stream is different from standard output.

We will use the following utility programs in this chapter:

Command
Use

head

Prints the first part of files.

tail

Prints the last part of files.

wc

Counts lines, words, and bytes.

tr

Translates characters.

sort

Sorts lines of text.

shuf

Shuffles lines of text.

diff

Finds differences between two files line by line.

uniq

Reports unique lines in a file.

cut

Cuts out a column of a table.

paste

Pastes lines from files side by side.

join

Joins two tables on a common field.

sed

Performs basic text transformations.

Note on MacOS / OpenBSD

There is a common misconception that MacOS is Linux. It is not. Instead, MacOS came from OpenBSD. Both Linux and OpenBSD sought to emulate the Unix operating system, but they do have important differences, particularly in the utility programs. Therefore, if you are using a MacOS, you will want to install the Linux utility programs (coreutils) using the installer brew. If you run into problems, make sure you are using the Linux version of the programs.

Boilerplate

Let's start with a short template for writing your programs. Just a few lines of set up will make your life so much easier down the road.

#!/bin/bash
# Description of program

set -euo pipefail
IFS=$'\n\t'

The first line is the shebang line that you've seen before. Recall that it simply tells Linux how to execute this script: using the /bin/bash interpreter. This line is important because the default interpreter is sh, which may refer to different Shell script interpreters on different systems, and by specifying this line, we are certain that we will be using the correct interpreter.

The last two lines set up the so-called strict mode for Bash. It will make your debugging much easier so that you are hunting for the cause of some silent error. The -e flag ensures that the script stops on the first encountered error. The -u flag disallows reading from undefined variables, which helps prevent spelling mistakes in variable names. For example, suppose you defined variable language=Bash, but when you later try to use it, you misspelled it as langage. Under the set -u setting, bash will generate an error, because you are trying to read a variable whose value is still undefined: langage. Under the default mode, your script will proceed until some non-silent error occurs. Lastly, the -o pipefail flag prevents error in a series of pipes from being masked. We will discuss IFS in a later chapter.

So, we now segue into one of Bash's best features: pipes.

Help!

In the last chapter, we discussed how command line arguments allow you to change program settings and influence its behaviour. Command line arguments can take a short form (e.g. -h) or the long form (e.g. --help), where the former has one dash (-) and the latter has two (--).

Some programs support only short-form arguments, some support only long-form arguments, and some support both. Therefore, it is important to consult the help or manual documentation for the program.

All core Linux programs come with a manual entry, which can be viewed using the man command:

man head

Not all programs will have a manual entry, but most of them should have a help output, which you request using the --help or the -h argument. (Sadly, some programs have neither.) Programs such as ps may treat -h as something else, and other programs don't recognize --help. It would be nice if all software programmers follow the same standards, but eccentric programmers exist.

head -h
tail --help

Now, you should pause here and examine the help page of the utility programs in the above table.

We'll wait.

Pipes

Now that you have some idea of what the utility programs do, we'll proceed to join the programs together with pipes in order to form a pipeline.

Suppose we want to get the 5th line of the following poem:

sonnet104.txt

To me, fair friend, you never can be old,
For as you were when first your eye I eyed,
Such seems your beauty still. Three winters cold
Have from the forests shook three summers’ pride,
Three beauteous springs to yellow autumn turned
In process of the seasons have I seen,
Three April perfumes in three hot Junes burned,
Since first I saw you fresh, which yet are green.
Ah, yet doth beauty, like a dial-hand,
Steal from his figure, and no pace perceived;
So your sweet hue, which methinks still doth stand,
Hath motion, and mine eye may be deceived:
For fear of which, hear this, thou age unbred:
Ere you were born was beauty’s summer dead.

One way is to set up a pipe using head and tail. First, we print the first 5 lines using head:

head -n 5 sonnet104.txt

Then, we can keep just the last line of this output by piping the results to tail, which in turn gives us what we want: the 5th line of the poem.

head -n 5 sonnet104.txt | tail -n 1

What the pipe | operator does is to connect the stdout (standard output stream) from the first program (head) to the stdin (standard input stream of the second program (tail). In other words, the output of head is provided as the input to tail. These two programs combined together form a pipeline.

For the second task, we would like to see how many times the word "you" appears in the poem. We can build this up one program at a time.

  1. Print the text file to stdout.

    cat sonnet104.txt

    Now, we want to build on this first line. Instead of typing it again, we can simply press the up arrow key (or C-p) to get the previous command. We can then type the next part of the pipeline.

  2. Put each word in its own line.

    cat sonnet104.txt |
    tr -cs A-Za-z '\n'

    Bash is generally fussy about line breaks. Normally, we would want to break a command into multiple lines by using the \ operator in order to continue a command into the next line. In the case of the | operator without a right-hand side, Bash expects the next line to be a continuation.

  3. Convert all uppercase letters to lower case.

    cat sonnet104.txt |
    tr -cs A-Za-z '\n' |
    tr A-Z a-z
  4. Get lines containing just the word of interest.

    cat sonnet104.txt |
    tr -cs A-Za-z '\n' |
    tr A-Z a-z |
    grep '^you$'
  5. Create a frequency table.

    cat sonnet104.txt |
    tr -cs A-Za-z '\n' |
    tr A-Z a-z |
    grep '^you$' |
    uniq -c

Now, we have our answer.

So far, we have only worked with a single stream of data: from the stdout of one program to the stdin of the next. Now, what if you want to output multiple streams of data that will be processed by downstream programs. For this, we'll want to give a name to each of the many pipes, so that we can refer to them unambiguously. This is where named pipes come in, but that is for another chapter.

Summary

Here, we saw how to achieve a task by breaking it down into individual subtasks, and we combine these subtasks together using pipes in order to form a pipeline. By focusing on one subtask at a time, we can turn a long, complicated task into a series of easy subtasks. This is precisely what makes Bash so powerful!

Last updated