Beginning Perl Lesson 14

Perl idioms

In this course, we’ve made an effort to write Perl that is easy to understand by not taking common Perl shortcuts. But as you gain more experience, and as you read more scripts, you’ll come across commonly used (idiomatic) pieces of code that are terse or even obfuscated. Perl, like C, is a programming language that provides so many shortcuts that it’s easy to write code that no one else can understand.

In the first part of today’s class, we’re going to talk about some common Perl idioms that you’ll need to be aware of.

The $_ variable

The special variable $_ is used in Perl as an all-purpose default argument in many situations.

$_ and reading lines from a file

Here is the way we have been reading lines from a file, where $ARGV[ 0 ] is the name of the file as given in the first argument from the command line:

    if ( open( IN, "<$ARGV[ 0 ]" ) )
    {
        my( $dataLine );

        while ( defined( $dataLine = <IN> ) )
        {
            print( STDOUT $dataLine );
        }
    }

After we open the file, we explicitly create a variable, $dataLine, into which we place each line obtained from the file by the line input (angle, diamond) operator <>. We have to check that $dataLine contains a defined value each time we attempt to read from the file, since the value will be undefined when we reach the end of the file.

Here is another way to write the same code; it is easier to write but harder to understand:

    if ( open( IN, "<$ARGV[ 0 ]" ) )
    {
        while ( <IN> )
        {
            print( STDOUT $_ );
        }
    }

The difference here is that we didn’t create a variable for holding each line as it is read from the file. When the line input operator <> is the only thing inside the parentheses of a while loop’s condition, Perl assigns each line to the $_ variable. In addition, Perl checks whether $_ contains a defined value; if it does, Perl executes the code inside the curly braces of the while loop.
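
To see why the defined test matters, here is a minimal sketch that reads from an in-memory filehandle instead of a real file. The contents of $text and the @lines array are made up for this illustration. The last line is a lone 0 with no newline after it; that is a false value in Perl, but it is still a defined value, so the loop should not stop there.

```perl
use strict;
use warnings;

#   Hypothetical file contents; the last line is "0" with no trailing newline.
my( $text ) = "first\nsecond\n0";
my( @lines );

#   Opening a reference to a scalar lets us read the string as if it were a file.
if ( open( IN, '<', \$text ) )
{
    #   Perl treats this as: while ( defined( $_ = <IN> ) )
    while ( <IN> )
    {
        push( @lines, $_ );
    }
    close( IN );
}

print( STDOUT scalar( @lines ), " lines read\n" );
```

All three lines are collected, including the final 0, because the implied test is for a defined value rather than a true one.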

You’ll probably see this a lot in other people’s code because programmers want to save time entering code, so they tend to use the shortest code, even if it’s a bit obscure. (For more information, see Programming Perl, 3rd ed., pp. 80-81.)

$_ as the default argument for the print() function

There’s another simplification we can make to the above code. Where we had:

        print( STDOUT $_ );

we can instead use:

        print( STDOUT );

This is because when no argument other than the filehandle is given to the print() function, the function takes as its argument the $_ variable.

Our code now looks like this:

    if ( open( IN, "<$ARGV[ 0 ]" ) )
    {
        while ( <IN> )
        {
            print( STDOUT );
        }
    }

$_ as the default argument for the chomp() function

chomp() is the function you use to remove a newline character from the end of a value. Most commonly, chomp() is used when reading lines of input from the keyboard or from a file. Here’s how we have used it when reading lines from a file:

    while ( defined( $dataLine = <IN> ) )
    {
        chomp( $dataLine );
    }

If we change the while loop so that it uses the $_ variable, we have to apply chomp() to $_, like this:

    while ( <IN> )
    {
        chomp( $_ );
    }

As a convenience (and to save our aching fingers from having to enter too much code), if we don’t give chomp() an argument, it takes $_ as its default argument. This allows us to write our loop this way:

    while ( <IN> )
    {
        chomp();
    }
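
As a quick check that the bare chomp() really works on $_, here is a small sketch that reads two lines from an in-memory filehandle. The contents of $text and the @sequences array are assumptions made up for this example.

```perl
use strict;
use warnings;

#   Hypothetical two-line input, read from a string instead of a real file.
my( $text ) = "ACGT\nTTGA\n";
my( @sequences );

if ( open( IN, '<', \$text ) )
{
    while ( <IN> )
    {
        chomp();                    #   same as chomp( $_ )
        push( @sequences, $_ );
    }
    close( IN );
}

#   If the newlines are gone, the values join cleanly on one line.
print( STDOUT join( '|', @sequences ), "\n" );
```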

$_ as the default operand of the m// regular expression matching operator

Here is an example of using a regular expression to scan a fasta file for the sequence IDs, which follow immediately after a ">" character at the beginning of a line. The way we have written this code is:

    while ( defined( $dataLine = <IN> ) )
    {
        if ( $dataLine =~ m/^>(\S+)/ )
        {
            print( $1 );
        }
    }

If we change our while loop so that it uses the $_ variable, we have to match the regular expression against $_, like this:

    while ( <IN> )
    {
        if ( $_ =~ m/^>(\S+)/ )
        {
            print( $1 );
        }
    }

As a convenience, when we omit the operand of the regular expression match operator, Perl performs the match against $_, like this:

    while ( <IN> )
    {
        if ( m/^>(\S+)/ )
        {
            print( $1 );
        }
    }
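
Here is a self-contained sketch of the same loop that you can run without a fasta file on disk. The contents of $fasta and the @ids array are invented for this example; the loop collects each ID and then prints them one per line.

```perl
use strict;
use warnings;

#   Hypothetical fasta data with two records.
my( $fasta ) = ">seq1 first record\nACGT\n>seq2 second record\nTTGA\n";
my( @ids );

if ( open( IN, '<', \$fasta ) )
{
    while ( <IN> )
    {
        #   With no =~ operand, m// matches against $_.
        if ( m/^>(\S+)/ )
        {
            push( @ids, $1 );
        }
    }
    close( IN );
}

print( STDOUT join( "\n", @ids ), "\n" );
```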

$_ in a foreach loop

We can use a foreach loop when we want to examine each item in a list or array for some purpose, as in this example:

    @colors = ( 'red', 'yellow', 'blue' );
    foreach $color ( @colors )
    {
        print( "$color\n" );
    }

Each time through the loop, an element from the @colors array variable is assigned to the $color scalar variable. If we don’t specify the scalar variable to which each element is to be assigned, then Perl assigns the element to the $_ variable, letting us simplify our code to:

    @colors = ( 'red', 'yellow', 'blue' );
    foreach ( @colors )
    {
        print( "$_\n" );
    }
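
One detail worth knowing when you rely on $_ in a foreach loop: the loop variable is an alias for the current element, not a copy of it. The following sketch, which reuses the @colors example, shows that assigning to $_ inside the loop changes the array itself.

```perl
use strict;
use warnings;

my( @colors ) = ( 'red', 'yellow', 'blue' );

foreach ( @colors )
{
    #   $_ is an alias for the current element, not a copy of it,
    #   so this assignment changes the array itself.
    $_ = uc( $_ );
}

print( STDOUT "@colors\n" );
```

After the loop, @colors holds the upper-case versions of the three color names.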

Obscure uses of $_

For other, more obscure uses of $_, see Programming Perl, 3rd ed., pp. 658-659.

Using || as a short-circuit

Perl provides the C programming language-style logical && (and) and || (or) operators. Here are the truth tables for these two operators, which we first introduced in Lesson 4:

Truth table for && operator

    1st value   2nd value   Result
    T           T           T
    T           F           F
    F           T           F
    F           F           F

Truth table for || operator

    1st value   2nd value   Result
    T           T           T
    T           F           T
    F           T           T
    F           F           F

When we look at the truth table for the || operator, we can see that either one or the other value has to be true, or both values have to be true, for the result to be true. We can also see that if the first value is true, then we don’t have to check the second value, because we know that if the first value is true, then the result is always true. But if the first value is false, then we have to check the second value to find out what the result will be.

Note: when we work with true and false values, we perform Boolean logic. Variables that can take only true or false values are called Boolean variables. All Perl variables are Boolean in the sense that all Perl variables evaluate to either a true value or a false value.

Here’s an example of how to use the && and || operators:

    my( $first );
    my( $second );

    #   In Perl, a variable is false if its value is 0 (zero), the string '0',
    #   the empty string (''), or undefined. Otherwise, a variable is true.

    $first = 1;     # a true value
    $second = 0;    # a false value

    if ( $first && $second )
    {
        #   execute this code if both $first and $second are true
    }
    else
    {
        #   execute this code if either or both of $first and $second are false
        #   in this example, this code would be executed
    }

    if ( $first || $second )
    {
        #   execute this code if either or both of $first and $second are true
        #   in this example, this code would be executed
    }
    else
    {
        #   execute this code if neither $first nor $second is true
    }

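In Perl code, these operators work on expressions, and an expression can be any piece of code that results in a value. Knowing this, we can splice together two pieces of code with the || operator, and Perl handles them the same way as in the truth table: it evaluates the left side first, skips the right side if the left side is true, and evaluates the right side only if the left side is false.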
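
A minimal sketch of this behavior follows. The noteAndReturn() function and the @log array are hypothetical helpers invented for this example; the function records that it was called and then returns the value it was given, so we can see which side of the || was actually evaluated.

```perl
use strict;
use warnings;

my( @log );

#   Hypothetical helper: records that it was called, then returns the
#   value it was given.
sub noteAndReturn
{
    my( $name, $value ) = @_;

    push( @log, $name );
    return $value;
}

#   Left side is true, so the right side is skipped.
noteAndReturn( 'left', 1 ) || noteAndReturn( 'right', 1 );

#   Left side is false, so the right side is evaluated.
noteAndReturn( 'left', 0 ) || noteAndReturn( 'right', 1 );

print( STDOUT "@log\n" );
```

The log shows the right side being called only once, for the second line, where the left side returned a false value.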

This short-circuit behavior is applied in a tricky way to cause Perl to die() when it can’t open a file. Remember that the open() function returns a true value when it opens a file successfully. Up to this point, we have been writing something like this:

    if ( open( IN, "<$ARGV[ 0 ]" ) )
    {
        #   Do something with the file...
    }
    else
    {
        die( "Can't open file $ARGV[ 0 ]: $!." );
    }

But here’s how we can use the || operator as a code short-circuit. On the left side of the operator, we attempt to open the file as normal. If the open is successful, then the left side evaluates to a true value, and the expression on the right side of the || is ignored. But if the open fails, then the left side evaluates to a false value, and Perl must evaluate the expression on the right side of the || operator to determine the result.

So if we put the open() function on the left and a die() function on the right of the || operator, we get what we want. If the file is opened successfully, the open() function returns true, and Perl ignores the die() on the right. If the file is not opened successfully, the open() function on the left returns false, and Perl must evaluate the die() function on the right to determine the result. (And of course die() causes Perl to die, and Perl never finds out the result, but we don’t care because we made Perl do what we want it to do.)

The result is a tricky way to open a file and check the result of the open all on one line of code:

    open( IN, "<$ARGV[ 0 ]" ) || die( "Can't open $ARGV[ 0 ]: $!.\n\n" );
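
To watch the one-line form fail without ending the script, here is a sketch that wraps it in an eval block. The file name is hypothetical and is assumed not to exist; the eval is there only so that this example can trap the die() and show its message.

```perl
use strict;
use warnings;

#   Hypothetical name of a file that is assumed not to exist.
my( $fileName ) = 'no_such_file_for_this_example.txt';
my( $message ) = '';

#   eval traps the die() so that this example can carry on and show the message.
eval
{
    open( IN, "<$fileName" ) || die( "Can't open $fileName: $!.\n" );
    close( IN );
};
$message = $@;

print( STDOUT "die() said: $message" );
```

Note that the parentheses around the arguments to open() matter here. The || operator binds more tightly than the comma, so without the parentheses the die() would be attached to the file name string rather than to the open() itself. Many Perl programmers use the lower-precedence or operator in this idiom for that reason.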

For more information, see Programming Perl, 3rd ed., pp. 102-103.

unless, the reverse of if

if is used like this:

    if ( $value )
    {
        #   execute this code if $value is true
    }
    else
    {
        #   execute this code if $value is false
    }

Oftentimes, you’ll want to reverse this. One way to do this is:

    if ( ! $value )
    {
        #   execute this code if ! $value is true, that is, $value is false
    }
    else
    {
        #   execute this code if ! $value is false, that is, $value is true
    }

Perl supplies unless as the reverse of if, and unless can be used in place of the code immediately above, like this:

    unless ( $value )
    {
        #   execute this code if $value is false
    }
    else
    {
        #   execute this code if $value is true
    }
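
If you want to convince yourself that the two forms behave the same way, here is a small sketch. The functions withIfNot() and withUnless() are hypothetical names made up for this example; each one reports which branch runs for the value it is given.

```perl
use strict;
use warnings;

#   Hypothetical helper written with if and the ! (not) operator.
sub withIfNot
{
    my( $value ) = @_;

    if ( ! $value )
    {
        return 'false branch';
    }
    else
    {
        return 'true branch';
    }
}

#   The same logic written with unless.
sub withUnless
{
    my( $value ) = @_;

    unless ( $value )
    {
        return 'false branch';
    }
    else
    {
        return 'true branch';
    }
}

#   Two false values and two true values.
foreach ( 0, '', 1, 'ACGT' )
{
    print( STDOUT "value [$_] gives ", withIfNot( $_ ), ' / ', withUnless( $_ ), "\n" );
}
```

For every value, both functions report the same branch.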

When I think about this too hard, my brain starts to hurt. But I use unless all the time in my code.

Obfuscated Perl

Obfuscation is the act of making something obscure or difficult to understand. A famous phrase from the 1970s was "eschew obfuscation", a difficult-to-understand phrase that means "avoid making things difficult to understand".

Above, we showed a simplified piece of code that reads a file line by line and prints the lines to the standard output:

    if ( open( IN, "<$ARGV[ 0 ]" ) )
    {
        while ( <IN> )
        {
            print( STDOUT );
        }
    }

Since the default filehandle for the print() function is STDOUT, we can simplify the line to:

        print();

And since the parentheses after the name of a built-in function are optional in Perl, we can shorten this line even further to:

        print;

Here’s the newly shortened piece of code:

    if ( open( IN, "<$ARGV[ 0 ]" ) )
    {
        while ( <IN> )
        {
            print;
        }
    }

Did you know that the final line in a block, where a block is lines of code surrounded by curly braces ("{}"), doesn’t require a semicolon after it? (Programming Perl, 3rd ed., p. 111.) Let’s remove the semicolon after print, giving:

    if ( open( IN, "<$ARGV[ 0 ]" ) )
    {
        while ( <IN> )
        {
            print
        }
    }

This can get even shorter. If you use the line input operator <> without putting a filehandle inside, Perl will automagically open the file indicated by the first argument from the command line, then read lines from that file. Furthermore, once it’s done reading from that file, it will open the next file indicated by the second argument from the command line, and so on. (If no arguments are provided, then Perl reads input from the keyboard, STDIN.) And if a file can’t be opened, Perl prints a warning and goes on to the next one. (See Programming Perl, 3rd ed., pp. 82-83 for more information.)

This means we can shorten our code even more. Think of all the time we’ve wasted entering all that redundant code. Now we can have:

    while ( <> )
    {
        print
    }

Now we’re down to only four lines of code. But Perl doesn’t care if we make our code easy to read, so we can reduce the four lines to a single line, like this:

    while ( <> ) { print }

Furthermore, Perl doesn’t care if the white space is in the code, so let’s take that out, too, like this:

    while(<>){print}

This one-liner is the equivalent of:

    my( $fileName );

    foreach $fileName ( @ARGV )
    {
        if ( open( IN, "<$fileName" ) )
        {
            my( $dataLine );

            while ( defined( $dataLine = <IN> ) )
            {
                print( STDOUT $dataLine );
            }
            close( IN );
        }
    }
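
To check that the empty <> really walks through every file named on the command line, here is a sketch that does not need real arguments. It creates two temporary files with the File::Temp module that comes with Perl, then sets @ARGV by hand to pretend they were given on the command line. The file contents and the @lines array are made up for this example.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

#   Create two small temporary files to stand in for command-line arguments.
my( $fh1, $file1 ) = tempfile( UNLINK => 1 );
my( $fh2, $file2 ) = tempfile( UNLINK => 1 );
print( $fh1 "line from file 1\n" );
print( $fh2 "line from file 2\n" );
close( $fh1 );
close( $fh2 );

#   Pretend the script was run with the two file names as its arguments.
@ARGV = ( $file1, $file2 );

my( @lines );
while ( <> )
{
    push( @lines, $_ );
}

print( STDOUT @lines );
```

The loop reads the line from the first file and then the line from the second file, with no open() or close() in sight.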

This sort of simplification can be a marvelous feature that saves the programmer coding time. Unfortunately, code of this type is impossible to understand unless you have a deep understanding of the programming language.

So when should you use the short form and when should you use the long form? If you’re writing a script that you’re going to use once and then throw away, use the short form. If you’re writing a script that’s going to grow complicated or that you’re going to give to someone else, then use the long form.