Beginning Perl Lesson 9

How the shell passes commands to a Perl script

The shell is a program that runs in the terminal window. It is the shell that gives you the prompt and interprets the commands you enter at the prompt.

When you enter a command, the shell parses it into its components. The first item in a command is the name of a program or the name of a shell command that you want to run. Any other items that you enter on the command line are called arguments. Here are some examples:

> ls /Users/challing/BeginningPerl

This command is the standard UNIX list command. The first item of the command, ls, is the name of the program to be run. The second item of the command, /Users/challing/BeginningPerl, is an argument giving the full path of the directory to be listed.

> cp lesson09.html lesson10.html

This command is the standard UNIX copy command. The first item of the command, cp, is the name of the program to be run. The second and third items are the arguments, the second argument being the name of the source file and the third argument being the name of the destination file.

> perl countSequences.pl bacillus_subtilis.pep

This command starts the perl program. The first item of the command, perl, is the name of the program to be run. The two arguments, countSequences.pl and bacillus_subtilis.pep, are passed to the perl program.

When the perl program starts up, it gets the arguments on the command line that appear after the perl command. (In this example, the perl program gets the arguments countSequences.pl and bacillus_subtilis.pep.) Perl assumes that the first argument is the name of a Perl script, and it attempts to start running that script.

When Perl starts the script, it passes any remaining arguments on the command line to the script. This means that the script will not get the perl command or the name of the script in its argument list. In this example, the countSequences.pl script would receive only a single argument, bacillus_subtilis.pep.

Getting the name of the script from within the script

When Perl starts running a script, it places the name of the script it’s running into a special variable, $0. As we will see soon, this is helpful for formatting a help message for the user. Here’s a short script that shows how to use it:

#!/usr/bin/perl
#
#   printScriptName.pl
#   15-Jan-2002
#
#   Conrad Halling
#   conrad.halling@sphaerula.com
#
#   This script simply prints the name of the script itself, which is stored in
#   the special Perl variable $0.
#
#   Use: perl printScriptName.pl

use warnings;

    print( "The name of this script is '$0'.\n" );

If we experiment with this script, we can see that $0 contains the path exactly as it was entered at the prompt, whether relative or full.

> perl printScriptName.pl
The name of this script is 'printScriptName.pl'.

> perl /Users/challing/BeginningPerl/printScriptName.pl
The name of this script is '/Users/challing/BeginningPerl/printScriptName.pl'.

> perl ../BeginningPerl/printScriptName.pl
The name of this script is '../BeginningPerl/printScriptName.pl'.
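
If you would rather show only the file name in a help message, one option is to strip the directory part from $0. The sketch below assumes the standard File::Basename module, which provides basename(); the helper scriptName() is a hypothetical name of my own, not something from this lesson.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;

#   scriptName() is a hypothetical helper. It strips any directory
#   part from a path such as the one stored in $0.

sub scriptName
{
    my( $path ) = @_;

    return basename( $path );
}

print( "The full value of \$0 is '$0'.\n" );
print( "The name alone is '", scriptName( $0 ), "'.\n" );
```

Whether to show the full path or only the name is a matter of taste. The full path tells the user exactly which copy of the script ran, while the short name keeps a help message tidy.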

Getting command line arguments from @ARGV

In Perl, the command line arguments that the script receives are placed into a special array variable, @ARGV. It’s very easy to write a Perl script that prints the arguments from the command line. We’ll use a foreach loop to step through the array elements one at a time.

#!/usr/bin/perl
#
#   printArgs.pl
#   08-Jan-2002
#
#   Conrad Halling
#   conrad.halling@sphaerula.com
#
#   This script prints the arguments from the command line.
#
#   Use:
#       perl printArgs.pl arg1 arg2 ...
#   Example:
#       perl printArgs.pl these are some arguments

use warnings;

    my( $arg );
    my( $count );

    $count = 0;
    foreach $arg ( @ARGV )
    {
        $count++;
        print( STDOUT "Argument $count is '$arg'.\n" );
    }

I experimented with various styles of arguments, and the results are shown below. Notice that the shell separates the arguments using white space (one or more space or tab characters). You can pass an argument containing white space characters by surrounding it with quotation marks.

> perl printArgs.pl these are some arguments
Argument 1 is 'these'.
Argument 2 is 'are'.
Argument 3 is 'some'.
Argument 4 is 'arguments'.

> perl printArgs.pl "these are some arguments"
Argument 1 is 'these are some arguments'.

> perl printArgs.pl these         are some arguments
Argument 1 is 'these'.
Argument 2 is 'are'.
Argument 3 is 'some'.
Argument 4 is 'arguments'.
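
Because @ARGV is an ordinary array, you can also reach individual arguments by index instead of looping over all of them. Here is a minimal sketch; describeArgs() is a hypothetical helper written only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

#   describeArgs() is a hypothetical helper. It takes a list of
#   arguments (normally @ARGV) and returns a one-line description.

sub describeArgs
{
    my( @args ) = @_;

    if ( 0 == scalar( @args ) )
    {
        return "No arguments.";
    }

    #   Index 0 is the first element; index -1 is the last element.

    return "First is '$args[0]', last is '$args[-1]', total "
        . scalar( @args ) . ".";
}

print( STDOUT describeArgs( @ARGV ), "\n" );
```

Running this sketch with the arguments these are some arguments should report 'these' as the first argument, 'arguments' as the last, and a total of 4.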

Checking the number of arguments

Your script often requires a certain number of arguments. For example, a copy script would require two arguments, the name of the source file and the name of the destination file. When you know how many arguments you require, you should check to see that the user provided exactly that many arguments.

You can determine the number of arguments with the scalar() function. Here’s how to check if you don’t have exactly two arguments:

    if ( 2 != scalar( @ARGV ) )
    {
        die( "Two arguments are required.\n" );
    }
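
Once you know there are exactly two arguments, it is convenient to copy them into named variables. The following is only a sketch for the copy-script example: checkCopyArgs() is a hypothetical helper, and the script reports what it would do rather than copying anything. It prints the error message instead of calling die() so that you can run it with any number of arguments and watch both branches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

#   checkCopyArgs() is a hypothetical helper. It returns the source and
#   destination names when exactly two arguments are given, and an
#   empty list otherwise.

sub checkCopyArgs
{
    my( @args ) = @_;

    if ( 2 != scalar( @args ) )
    {
        return ();
    }

    my( $source, $destination ) = @args;
    return ( $source, $destination );
}

my( $source, $destination ) = checkCopyArgs( @ARGV );

if ( defined( $source ) )
{
    print( STDOUT "Would copy '$source' to '$destination'.\n" );
}
else
{
    print( STDERR "Two arguments are required.\n" );
}
```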

Using one or more arguments

Sometimes you will write a script that will process an indeterminate number of files, but it will always process at least one file. An example might be a script that counts the number of sequences in one or more fasta files. Sample commands might be:

> perl countSequences.pl bacillus_subtilis.pep

and:

> perl countSequences.pl bacillus_subtilis.pep escherichia_coli.pep

In this case, you know scalar( @ARGV ) should be at least 1. You would check this in your code like this:

    if ( scalar( @ARGV ) < 1 )
    {
        die( "At least one argument is required.\n" );
    }

Providing a help message to the user

It is good form for a script to provide an error message to the user if the number of arguments is incorrect. Here’s an example:

    if ( scalar( @ARGV ) < 1 )
    {
        die( "\nUse: perl $0 fastaFileName\n" );
    }

The output looks like this:

> perl countSequences.pl

Use: perl countSequences.pl fastaFileName

The help message can provide a brief reminder of the arguments that a script expects. You could also provide an example of how to use the script, like this:

    if ( scalar( @ARGV ) < 1 )
    {
        die( "\n",
             "      Use: perl $0 fastaFileName\n",
             "  Example: perl $0 bacillus_subtilis.pep\n" );
    }

The output looks like this:

> perl countSequences.pl

      Use: perl countSequences.pl fastaFileName
  Example: perl countSequences.pl bacillus_subtilis.pep

Note that Perl’s die() subroutine can take more than one argument. This means you can pass several strings to the die() subroutine, as above. So you could provide a very long help message if you wanted to:

    if ( scalar( @ARGV ) < 1 )
    {
        die( "\n",
             "      Use: perl $0 fastaFileName\n",
             "  Example: perl $0 bacillus_subtilis.pep\n",
             "\n",
             "  This script counts the number of sequences in a fasta\n",
             "  file. It also checks the format of the fasta file and\n",
             "  warns you if it is not formatted correctly.\n",
             "\n" );
    }

The output looks like this:

> perl countSequences.pl

      Use: perl countSequences.pl fastaFileName
  Example: perl countSequences.pl bacillus_subtilis.pep

  This script counts the number of sequences in a fasta
  file. It also checks the format of the fasta file and
  warns you if it is not formatted correctly.

Use your own judgement as to what an appropriate help message should be.
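
For a longer message, a here-document can be easier to read and edit than a list of quoted strings. This is an alternative I am suggesting, not the style used in the rest of this lesson, and usageMessage() is a hypothetical helper. The sketch prints the message to STDERR so that it runs to completion; in a real script you would more likely pass the returned text to die().

```perl
#!/usr/bin/perl
use strict;
use warnings;

#   usageMessage() is a hypothetical helper that builds the help text
#   from the script name, using a here-document instead of a list of
#   strings. Everything up to the END_USAGE line is part of the string.

sub usageMessage
{
    my( $scriptName ) = @_;

    return <<"END_USAGE";

      Use: perl $scriptName fastaFileName
  Example: perl $scriptName bacillus_subtilis.pep

END_USAGE
}

if ( scalar( @ARGV ) < 1 )
{
    print( STDERR usageMessage( $0 ) );
}
```

Because the here-document is introduced with double quotes, variables such as $scriptName are interpolated just as they are in an ordinary double-quoted string.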

Printing error messages to STDERR

Many scripts write their output to STDOUT and their error messages to STDERR. For all operating systems that I’m familiar with, the default is for both the STDOUT and the STDERR streams to be directed to the terminal window. Here is a simple script that prints to both STDOUT and STDERR:

#!/usr/bin/perl
#
#   outputStreams.pl
#   28-Mar-2002
#
#   Conrad Halling
#   conrad.halling@bfix.org
#
#   This script provides an example of sending error messages to STDERR and
#   output to STDOUT.

use warnings;

    #   Check a list of integers.
    #   Print even integers to STDOUT.
    #   Print an error message if the integer is odd.

    my( $integer );

    foreach $integer ( 1, 4, 7, 23, 8, 14, 27 )
    {
        #   Use the modulo operator to obtain the remainder after dividing
        #   the integer by 2. If the remainder is 0, then the integer is
        #   even.

        if ( $integer % 2 == 0 )
        {
            print( STDOUT "$integer\n" );
        }
        else
        {
            print( STDERR "$integer is not even\n" );
        }
    }

When we run this script, we see the following on the screen:

1 is not even
4
7 is not even
23 is not even
8
14
27 is not even

But you can redirect the STDOUT stream to a file using the > symbol, like this:

> perl outputStreams.pl > results.txt

When we run the script using the above command, only the error messages, which are printed to STDERR, appear on the screen. The results, which are printed to STDOUT, are written to the file results.txt. Here’s what appears on the screen:

1 is not even
7 is not even
23 is not even
27 is not even

And here’s what appears in the file results.txt:

4
8
14

You should always write your error messages to STDERR so error messages can be separated from other output. (Note that Perl’s die() subroutine sends its output to STDERR.)
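
Perl’s built-in warn() function also writes to STDERR, but unlike die() it lets the script keep running. Here is a sketch of the same even/odd check using warn() in place of print( STDERR ... ); isEven() is a hypothetical helper added only to make the test easy to reuse.

```perl
#!/usr/bin/perl
use strict;
use warnings;

#   isEven() is a hypothetical helper. It returns TRUE when the
#   remainder after dividing by 2 is 0.

sub isEven
{
    my( $integer ) = @_;

    return ( $integer % 2 == 0 );
}

my( $integer );

foreach $integer ( 1, 4, 7, 23, 8, 14, 27 )
{
    if ( isEven( $integer ) )
    {
        print( STDOUT "$integer\n" );
    }
    else
    {
        #   warn() sends its message to STDERR and then continues.
        #   The trailing newline stops Perl from appending the
        #   script name and line number to the message.

        warn( "$integer is not even\n" );
    }
}
```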

A summary example script

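This script counts sequences in one or more fasta files. It brings together the techniques from this lesson: it requires at least one argument in @ARGV, builds its usage message from $0, keeps going when a file can’t be opened, writes error and progress messages to STDERR, and prints the final count to STDOUT.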

#!/usr/bin/perl
#
#   countSequences.pl
#   27-Mar-2002
#
#   Conrad Halling
#   conrad.halling@sphaerula.com
#
#   This script counts the number of sequences in one or more fasta files.
#
#   Use: perl countSequences.pl fastaFile1 [fastaFile2...]

use warnings;

    my( $fastaFileName );
    my( $sequenceCount );

    #   Check arguments, which are stored in the special Perl variable @ARGV.
    #   The script assumes that all arguments are the names of fasta files.
    #   There must be at least one argument.

    if ( 0 == scalar( @ARGV ) )
    {
        #   In our error message, we get the name of this script from the
        #   special Perl variable $0.

        die( "\n  Use: perl $0 fastaFile1 [fastaFile2...]\n\n" );
    }

    #   Set our counter to 0.

    $sequenceCount = 0;

    #   Use a foreach loop to get each argument in order.

    foreach $fastaFileName ( @ARGV )
    {
        my( $inLine );
        my( $result );

        #   Attempt to open each file. If open() succeeds, then $result is
        #   TRUE. If open() fails, then $result is FALSE, and ! $result is
        #   TRUE.

        $result = open( FASTA, "<", $fastaFileName );
        if ( ! $result )
        {
            #   The open failed. Print an error message to STDERR, but don't
            #   die(), because there might be other files to process.
            #   The reason open() failed is stored in the special Perl variable
            #   $!.

            print( STDERR
                "  Can't open file '$fastaFileName' for reading: $!.\n" );
        }

        else
        {
            #   The open succeeded. Count the sequences.
            #   Show progress.

            print( STDERR "  Processing file $fastaFileName...\n" );

            #   This is the standard loop we always use for reading
            #   lines from a file.

            while ( defined( $inLine = <FASTA> ) )
            {
                #   Later on we'll learn how to use a regular expression to
                #   identify the beginning of a sequence entry in a fasta
                #   file.
                #
                #   For now, we simply check that the first character is a
                #   '>' character using the substr() function.

                if ( ">" eq substr( $inLine, 0, 1 ) )
                {
                    ++$sequenceCount;
                }
            }

            #   We have finished with this file, so close it.

            close( FASTA );
        }
    }

    #   Print the results to STDOUT.

    print( STDOUT "  Counted $sequenceCount sequences.\n" );

Provide yourself with some short fasta files, and experiment with this script. See what happens when you ask the script to count sequences from several fasta files, including one that doesn’t exist.

Here’s the output I got when I ran the script on three files, fasta1.seq, fasta4.seq (which doesn’t exist), and fasta2.seq:

> perl countSequences.pl

  Use: perl countSequences.pl fastaFile1 [fastaFile2...]

> perl countSequences.pl fasta1.seq fasta4.seq fasta2.seq
  Processing file fasta1.seq...
  Can't open file 'fasta4.seq' for reading: No such file or directory.
  Processing file fasta2.seq...
  Counted 94 sequences.
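
If you have no fasta files at hand, you can try the counting logic on a small file that the script writes for itself. The sketch below assumes the standard File::Temp module for the temporary file. The helper countHeaders() and the two sequences are made up for illustration, and the helper uses a lexical filehandle rather than the FASTA filehandle used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw( tempfile );

#   countHeaders() is a hypothetical helper. It opens the named file
#   and counts the lines whose first character is '>'.

sub countHeaders
{
    my( $fileName ) = @_;
    my( $inLine );
    my( $count );

    $count = 0;
    open( my $fh, "<", $fileName )
        or die( "Can't open file '$fileName' for reading: $!.\n" );

    while ( defined( $inLine = <$fh> ) )
    {
        if ( ">" eq substr( $inLine, 0, 1 ) )
        {
            ++$count;
        }
    }

    close( $fh );
    return $count;
}

#   Write a test file containing two made-up sequences. The file is
#   removed automatically when the script exits.

my( $out, $testFileName ) = tempfile( UNLINK => 1 );
print( $out ">seq1 made-up example\nMKTAYIAK\nQRQISFVK\n" );
print( $out ">seq2 made-up example\nMSDNELTR\n" );
close( $out );

print( STDOUT "  Counted ", countHeaders( $testFileName ), " sequences.\n" );
```

The test file contains two lines that begin with '>', so the sketch should report 2 sequences.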

Homework assignment

Write a script that counts the number of characters and the number of lines of one or more files. The file names are provided as command line arguments to the script. If the user doesn’t include any arguments, then the script should print a help message and quit.

To count the number of lines in the file, simply increment a counter variable each time a line is read successfully from the file.

To count the number of characters in the file, use Perl’s length() function on each line after you read it from the file, like this:

    $numberOfCharacters = length( $dataLine );

Add the length of each line to a variable containing the total length so far. Do not chomp() each line after you read it, since chomp() removes the newline character from the end of the line, and the newline should be included in the character count.
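
You can see the effect of chomp() on the count with a single line. This sketch uses a made-up string in place of a line read from a file; the variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

#   A made-up line, ending in a newline as it would when read from a file.

my( $dataLine ) = "ACGT\n";

#   Four letters plus the newline.

my( $lengthBefore ) = length( $dataLine );

#   chomp() removes the trailing newline.

chomp( $dataLine );
my( $lengthAfter ) = length( $dataLine );

print( STDOUT "Before chomp(): $lengthBefore characters.\n" );
print( STDOUT "After chomp():  $lengthAfter characters.\n" );
```

The first count is 5 and the second is 4, so calling chomp() on every line would make your total fall short by one character per line.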

Print the name of each file (which you get from @ARGV), the number of lines, and the number of characters. If you want, you can print a grand total of the number of characters and lines for all the files at the end. You can check your results by using the UNIX wc (word count) utility, which counts the number of lines, words, and characters in a file. (We can’t count words until we learn how to use regular expressions.)