Beginning Perl Lesson 8

Hash variables

We have already learned about scalar variables; a scalar variable can hold exactly one value. And we have learned about array variables, which hold many items in order.

Perl provides another variable type that can hold many items, the hash variable type. Elements are added to a hash as key-value pairs. The key serves as a unique index to the hash, and the value is what is associated with the key.

Here are metaphors that might help you remember the characteristics of the scalar, array, and hash variable types.

The name of a hash variable always begins with the percent (%) symbol.

Suppose we are writing a program that handles dates, and we need to convert the number of the month (where January is month 0 and December is month 11) to the name of the month. One way we could do this would be to put the names of the months into an array variable, like this:

@months = ( 'January',   'February', 'March',    'April',
            'May',       'June',     'July',     'August',
            'September', 'October',  'November', 'December' );

This is equivalent to the following initialization, which makes clearer the relationship between the array index and the value at that index:

my( @months );

$months[  0 ] = 'January';
$months[  1 ] = 'February';
$months[  2 ] = 'March';
$months[  3 ] = 'April';
$months[  4 ] = 'May';
$months[  5 ] = 'June';
$months[  6 ] = 'July';
$months[  7 ] = 'August';
$months[  8 ] = 'September';
$months[  9 ] = 'October';
$months[ 10 ] = 'November';
$months[ 11 ] = 'December';

Then when we wanted to convert the number of the month to the name of the month, we could simply use the number of the month as an index into the array to get the name, like this:

$month = $months[ 3 ];      #   gets 'April'

But now suppose we wanted to do the opposite, to convert the name of the month into the number of the month. What we need is some kind of array-type variable where the index is the name of the month and the value is the number of the month. We can’t do this with a standard array variable because the index of an array must be a number. So Perl provides a second type of array-type variable, the associative array or hash variable. Its use looks like this:

my( %months );

$months{ 'January'   } =  0;
$months{ 'February'  } =  1;
$months{ 'March'     } =  2;
$months{ 'April'     } =  3;
$months{ 'May'       } =  4;
$months{ 'June'      } =  5;
$months{ 'July'      } =  6;
$months{ 'August'    } =  7;
$months{ 'September' } =  8;
$months{ 'October'   } =  9;
$months{ 'November'  } = 10;
$months{ 'December'  } = 11;

Notice that the index of the hash variable, which is called the key of the hash variable, goes between curly braces ({}), whereas the index of an array variable goes between square brackets ([]). Also notice that any string can serve as a key to a hash (a number used as a key is converted to a string), whereas the index of an array must be a number.

When we want to convert the name of the month to the number of the month, we simply use the name of the month as a key into the hash to get the number, like this:

$number = $months{ 'April' };    #   puts 3 into $number
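
Here is a small sketch that puts the two lookups side by side. Rather than typing twelve assignments, it builds the hash from the array with a loop. The names %monthNumbers and $i are chosen for this illustration only.

```perl
use strict;
use warnings;

my @months = ( 'January',   'February', 'March',    'April',
               'May',       'June',     'July',     'August',
               'September', 'October',  'November', 'December' );

#   Build the reverse lookup: month name (key) => month number (value).
my %monthNumbers;
for my $i ( 0 .. $#months )
{
    $monthNumbers{ $months[ $i ] } = $i;
}

print( $months[ 3 ], "\n" );                #   number to name: prints April
print( $monthNumbers{ 'April' }, "\n" );    #   name to number: prints 3
```

The array answers "which name has this number?" and the hash answers "which number has this name?".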

Initializing a hash

Hashes can be initialized just like arrays, except that we have to remember to use key-value pairs in the initialization. Here’s how we could initialize a hash with the names of the months as the keys and the numbers of the months as the values:

%months = ( 'January',  0, 'February',  1, 'March',      2,
            'April',    3, 'May',       4, 'June',       5,
            'July',     6, 'August',    7, 'September',  8,
            'October',  9, 'November', 10, 'December',  11 );

The trouble with initializing a hash this way is that the correspondence between the keys and the values is not obvious. So Perl provides another way to initialize a hash:

%months = ( 'January'   => 0, 'February' =>  1, 'March'     =>  2,
            'April'     => 3, 'May'      =>  4, 'June'      =>  5,
            'July'      => 6, 'August'   =>  7, 'September' =>  8,
            'October'   => 9, 'November' => 10, 'December'  => 11 );

With this syntax, the key comes before the => operator and the value comes after it. This way, it’s clear which items in the list make up a key-value pair. A clearer way to code this is to put one pair on each line:

%months = ( 'January'   =>  0,
            'February'  =>  1,
            'March'     =>  2,
            'April'     =>  3,
            'May'       =>  4,
            'June'      =>  5,
            'July'      =>  6,
            'August'    =>  7,
            'September' =>  8,
            'October'   =>  9,
            'November'  => 10,
            'December'  => 11 );

So, why couldn’t we have done this with the comma-separated list that we used first, like this?

%months = ( 'January',    0,
            'February',   1,
            'March',      2,
            'April',      3,
            'May',        4,
            'June',       5,
            'July',       6,
            'August',     7,
            'September',  8,
            'October',    9,
            'November',  10,
            'December',  11 );

The answer is that we can. In Perl, the => operator is essentially a synonym for the , (comma) operator. One advantage to using the => operator is that it makes it a little clearer which is the key and which is the value (the key points the way to the value).

A second advantage to using the => operator is that when we use a simple word (made up of letters, digits, and underscores) as a key, we don’t have to put single or double quotes around the word. When we use the comma operator, we always have to put quotation marks around the word. So here’s our final example using the => operator:

%months = ( January   =>  0,
            February  =>  1,
            March     =>  2,
            April     =>  3,
            May       =>  4,
            June      =>  5,
            July      =>  6,
            August    =>  7,
            September =>  8,
            October   =>  9,
            November  => 10,
            December  => 11 );
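
As a quick check of the claim that the two forms build the same hash, here is a minimal sketch that initializes one hash with commas and another with =>, then looks up the same key in each. The hash names %withCommas, %withArrows, and %holidayMonths are made up for this illustration.

```perl
use strict;
use warnings;

#   Comma form: every key must be quoted.
my %withCommas = ( 'January', 0, 'February', 1, 'March', 2 );

#   => form: simple words to the left of => are quoted automatically.
my %withArrows = ( January  => 0,
                   February => 1,
                   March    => 2 );

#   A key containing a space is not a simple word, so it still needs quotes.
my %holidayMonths = ( 'New Year' => 0 );

print( $withCommas{ 'February' }, "\n" );     #   prints 1
print( $withArrows{ 'February' }, "\n" );     #   prints 1
print( $holidayMonths{ 'New Year' }, "\n" );  #   prints 0
```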

Adding key-value pairs to a hash

Above, we showed how to initialize an entire hash. Of course, we often don’t know the key-value pairs of a hash beforehand. For example, we’ll often determine the key-value pairs of a hash when we’re parsing a data file.

We can add key-value pairs to a hash using a syntax that requires us to add them one pair at a time, like this:

my( %months );

$months{ 'January'   } =  0;
$months{ 'February'  } =  1;
$months{ 'March'     } =  2;
$months{ 'April'     } =  3;
$months{ 'May'       } =  4;
$months{ 'June'      } =  5;
$months{ 'July'      } =  6;
$months{ 'August'    } =  7;
$months{ 'September' } =  8;
$months{ 'October'   } =  9;
$months{ 'November'  } = 10;
$months{ 'December'  } = 11;

As we can see above, the key goes inside the braces, and the value comes after the equals sign.

Now, what’s with the $ in front of $months? Why isn’t it a % symbol? The answer is that a key provides an index to exactly one item in a hash, and when we refer to exactly one item, we’re referring to a scalar. So Perl syntax requires that we put a $ symbol in front of the name of the variable. It’s only when we refer to the hash as a whole that we put a % in front.
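
The difference between the two symbols can be seen in a short sketch. The variables $number and %copy are introduced here only for illustration.

```perl
use strict;
use warnings;

my %months = ( January => 0, February => 1 );

#   One element: a single scalar value, so the symbol is $.
my $number = $months{ 'February' };

#   The hash as a whole: the symbol is %.
my %copy = %months;

print( $number, "\n" );               #   prints 1
print( $copy{ 'January' }, "\n" );    #   prints 0
```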

Using a key to get a value from a hash

When we want to get a value out of a hash, we have to use a key, like this:

$monthNumber = $months{ 'August' };

Since the value that is associated with the key 'August' is 7, $monthNumber will contain 7.

Removing hash elements

Perl provides the delete() function for removing a key-value pair from a hash, like this:

delete( $months{ 'August' } );

This removes the 'August' key and its associated value, 7, from the %months hash.
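
A minimal sketch of delete() in action, using a three-month hash to keep it short. It relies on two details not covered above: delete() returns the value it removed, and keys() (introduced later in this lesson) gives the number of keys when assigned to a scalar. The names $removed and $count are illustrative.

```perl
use strict;
use warnings;

my %months = ( June => 5, July => 6, August => 7 );

#   delete() returns the value that was removed.
my $removed = delete( $months{ 'August' } );
print( "Removed value: $removed\n" );       #   prints Removed value: 7

#   The hash now holds two pairs instead of three.
my $count = keys( %months );
print( "Pairs remaining: $count\n" );       #   prints Pairs remaining: 2
```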

The simplest way to remove everything from a hash is to reinitialize it with an empty list, like this:

%months = ();

Testing for the existence of a key in a hash

If we want to see whether a particular key exists in a hash, we can use Perl’s exists() function. The function returns a true value if the key exists in the hash or a false value if the key does not exist in the hash. Keys are case-sensitive.

Here’s an example of testing to see if a key exists in a hash, using our %months hash:

$name = 'August';
if ( exists( $months{ $name } ) )
{
    print( STDOUT "$name is month number $months{ $name }.\n" );
}
else
{
    print( STDERR "Key $name not found in hash \%months!\n" );
}
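
To see what case-sensitive keys mean in practice, here is a small sketch that tests three spellings of the same month against a two-month hash. The array @found is a hypothetical helper that records which spellings matched.

```perl
use strict;
use warnings;

my %months = ( July => 6, August => 7 );

my @found;
foreach my $name ( 'August', 'august', 'AUGUST' )
{
    if ( exists( $months{ $name } ) )
    {
        print( $name, " found\n" );
        push( @found, $name );
    }
    else
    {
        print( $name, " not found\n" );
    }
}
```

Only 'August' is found. If the names may arrive in mixed case, one option is to normalize each name before the lookup, for example with ucfirst( lc( $name ) ).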

Arithmetic manipulation of values in a hash

Hashes are extremely useful for counting the number of different items of something. For example, suppose we wanted to go through a column of names from a spreadsheet and count how many times we have seen each name. A good way to do this is to use a hash, where the keys are the names and the values are the number of times we have seen each name.

Here’s how we do it. We define a hash variable called %names. Then we read each line from the file to get each name, which will serve as the key to the hash. We use the exists() function to determine whether the key is already present in the hash. If it isn’t, then we assign a new key-value pair to the hash, where the key is the new name and the value is 1 (because we have seen this name one time). On the other hand, if the exists() function tells us that the key is already in the hash, then we want to increment (add 1 to) the value.

So how do we change the value associated with a key? Assuming that we have a hash variable named %names and a key in the scalar variable $name, we know that we can access the value associated with the key and place it into the $value variable this way:

$value = $names{ $name };   #   extract the value associated with key $name

Then we could add 1 to $value, and then associate the new value with the key. Since keys are unique, the new key-value pair replaces the old key-value pair.

$value++;                   #   increment $value
$names{ $name } = $value;   #   provide a new value for the key

Now that we’ve seen the long way to increment a value, here are four equivalent ways to increment a value, beginning with our example above and using progressively shorter code, until we finish by using the ++ operator for incrementing.

#   Examples of ways to increment the value associated with a key.

#   The long way, using an intermediate variable.

$value = $names{ $name };   #   extract the value associated with key $name
$value++;                   #   increment $value
$names{ $name } = $value;   #   provide a new value for the key

#   The short way, without using an intermediate variable.

$names{ $name } = $names{ $name } + 1;

#   Using the += operator.

$names{ $name } += 1;

#   Using the ++ operator.

$names{ $name }++;

The point here is that we can access values directly via their keys and perform mathematical operations on the values. Here are some examples:

#   Add two values together.

$combinedLength = $seqLengths{ $seqName1 } + $seqLengths{ $seqName2 };

#   Subtract 5 from a value. This changes the value associated with the
#   key.

$seqLengths{ $seqName1 } -= 5;

#   Take the square root of the area of a square to determine the length of a side
#   of the square.
#
#   %areas is the hash that contains the names of squares as keys and the areas of
#   the squares as values.
#
#   $square is a variable that contains the name of a square and thus serves as a key
#   to the hash.
#
#   $areas{ $square } is the value associated with the name $square. We take the
#   square root, using the sqrt() function, to determine the length of each side of
#   the square, and put the result into the variable $length.

$length = sqrt( $areas{ $square } );
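
Returning to the name-counting task described at the start of this section, here is a sketch of the whole procedure. To keep it self-contained, the names come from a made-up array, @column, standing in for lines read from a file. The printing loop uses keys(), which is introduced in the next section.

```perl
use strict;
use warnings;

#   Stand-in for the column of names read from a file (made-up data).
my @column = ( 'Ann', 'Bob', 'Ann', 'Cy', 'Bob', 'Ann' );
my %names;

foreach my $name ( @column )
{
    if ( exists( $names{ $name } ) )
    {
        #   We have seen this name before; add 1 to its count.
        $names{ $name }++;
    }
    else
    {
        #   First time we have seen this name.
        $names{ $name } = 1;
    }
}

#   Print each name and its count, in alphabetical order by name.
foreach my $name ( sort( keys( %names ) ) )
{
    print( $name, "\t", $names{ $name }, "\n" );
}
```

This prints Ann with 3, Bob with 2, and Cy with 1. In practice the exists() test can be dropped, because ++ treats a missing value as 0, but the longer form matches the steps described above.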

Getting and using all hash keys

We’ll often load hash variables with data obtained from a file. For example, the keys could be sequence names, and the values could be the lengths of the sequences. Once we have loaded the data into the hash, we might want to get all the items back out to work with them.

Perl provides the keys() function for obtaining all the keys from a hash. The keys() function returns a list containing the hash keys. For example, if we wanted the keys from our %months hash variable, we could get them like this:

@monthNames = keys( %months );

Here’s a short script where we initialize the %months variable as above. Then we extract all the keys and print them.

#!/usr/bin/perl
#
#   hashKeys1.pl
#   20-Jun-2004
#
#   Conrad Halling
#   conrad.halling@sphaerula.com
#
#   This script demonstrates how to initialize a hash from a list and extract
#   the keys and values from the hash.

use warnings;

    my( $monthName );
    my( @monthNames );
    my( %months );

    #   Initialize the hash. The keys are the names of the months and the
    #   values are the numbers of the months, where January is month 0 and
    #   December is month 11.

    %months = ( January   =>  0,
                February  =>  1,
                March     =>  2,
                April     =>  3,
                May       =>  4,
                June      =>  5,
                July      =>  6,
                August    =>  7,
                September =>  8,
                October   =>  9,
                November  => 10,
                December  => 11 );

    #   Use the keys() function to get all the keys, and assign the keys
    #   to an array.

    @monthNames = keys( %months );

    #   Use a foreach loop to get each item from the array. Each item is
    #   a key to the %months hash; we use the item to obtain the value
    #   associated with that key.

    foreach $monthName ( @monthNames )
    {
        print( "$monthName => $months{ $monthName }\n" );
    }

When I run this script, I get the following output:

> perl hashKeys1.pl
September => 8
January => 0
November => 10
April => 3
August => 7
March => 2
June => 5
July => 6
December => 11
May => 4
October => 9
February => 1

Why didn’t the elements come out in the same order as they went in? The answer is that a hash does not remember the order in which its pairs were added. The keys of a hash are not stored in any particular order, so when you extract the keys of a hash with the keys() function, they don’t come out in any particular order. You will find that, if you run this script, the keys and values will probably come out in a different order than the one given above. If you want the keys in order, then you have to sort them.

Sorting hash keys

When we use Perl’s keys() function on a hash variable, we get back a list containing the keys to the hash. We have already learned how to sort a list alphabetically or numerically. This means we have all the necessary pieces for sorting the keys of a hash.

Let’s revise our script so that we sort the keys alphabetically before we print them. The change is the added call to sort(), just before the foreach loop:

#!/usr/bin/perl
#
#   hashKeys2.pl
#   20-Jun-2004
#
#   Conrad Halling
#   conrad.halling@sphaerula.com
#
#   This script demonstrates how to initialize a hash from a list and extract
#   the keys and values from the hash, with the keys sorted in alphabetical
#   order.

use warnings;

    my( $monthName );
    my( @monthNames );
    my( %months );

    #   Initialize the hash. The keys are the names of the months and the
    #   values are the numbers of the months, where January is month 0 and
    #   December is month 11.

    %months = ( January   =>  0,
                February  =>  1,
                March     =>  2,
                April     =>  3,
                May       =>  4,
                June      =>  5,
                July      =>  6,
                August    =>  7,
                September =>  8,
                October   =>  9,
                November  => 10,
                December  => 11 );

    #   Use the keys() function to get all the keys, and assign the keys
    #   to an array.

    @monthNames = keys( %months );

    #   Sort the array into order by the name of the month.

    @monthNames = sort( @monthNames );

    #   Use a foreach loop to get each item from the array. Each item is
    #   a key to the %months hash; we use the item to obtain the value
    #   associated with that key.

    foreach $monthName ( @monthNames )
    {
        print( "$monthName => $months{ $monthName }\n" );
    }

The output is given below:

> perl hashKeys2.pl
April => 3
August => 7
December => 11
February => 1
January => 0
July => 6
June => 5
March => 2
May => 4
November => 10
October => 9
September => 8

This shows you can sort the keys alphabetically, but in this case it doesn’t really get the names of the months back into the order we want.

It looks like what we want to do is sort the keys in the numeric order of their values. The magic formula for doing this is the sort statement with the braces ({}) in the script below. (I call this a magic formula because I’m not going to explain now how this works. For more information, see Perl Cookbook, pp. 144-145.)

#!/usr/bin/perl
#
#   hashKeys3.pl
#   20-Jun-2004
#
#   Conrad Halling
#   conrad.halling@sphaerula.com
#
#   This script demonstrates how to initialize a hash from a list and extract
#   the keys and values from the hash, with the keys sorted in order
#   according to their values.

use warnings;

    my( $monthName );
    my( @monthNames );
    my( %months );

    #   Initialize the hash. The keys are the names of the months and the
    #   values are the numbers of the months, where January is month 0 and
    #   December is month 11.

    %months = ( January   =>  0,
                February  =>  1,
                March     =>  2,
                April     =>  3,
                May       =>  4,
                June      =>  5,
                July      =>  6,
                August    =>  7,
                September =>  8,
                October   =>  9,
                November  => 10,
                December  => 11 );

    #   Use the keys() function to get all the keys, and assign the keys
    #   to an array.

    @monthNames = keys( %months );

    #   Sort the array into order by the number of each month, that is, in
    #   order by the values associated with the keys.

    @monthNames = sort { $months{ $a } <=> $months{ $b } } @monthNames;

    #   Use a foreach loop to get each item from the array. Each item is
    #   a key to the %months hash; we use the item to obtain the value
    #   associated with that key.

    foreach $monthName ( @monthNames )
    {
        print( "$monthName => $months{ $monthName }\n" );
    }

Now the output is:

> perl hashKeys3.pl
January => 0
February => 1
March => 2
April => 3
May => 4
June => 5
July => 6
August => 7
September => 8
October => 9
November => 10
December => 11

If our values were not numeric but were words, then we would sort the keys by the alphabetical order of their values using the following magic formula, where we have substituted the cmp operator in place of the <=> operator:

@monthNames = sort { $months{ $a } cmp $months{ $b } } @monthNames;
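
Since the values in %months are numbers, here is a sketch of the cmp version using a made-up hash, %colors, whose values are words. The data and the names %colors and @fruits are hypothetical.

```perl
use strict;
use warnings;

#   Made-up data: the keys are fruits and the values are color names.
my %colors = ( apple  => 'red',
               banana => 'yellow',
               lime   => 'green',
               plum   => 'purple' );

#   Sort the keys by the alphabetical order of their values.
my @fruits = sort { $colors{ $a } cmp $colors{ $b } } keys( %colors );

foreach my $fruit ( @fruits )
{
    print( $fruit, " => ", $colors{ $fruit }, "\n" );
}
```

The keys come out as lime, plum, apple, banana, because their values sort as green, purple, red, yellow.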

A note about code style

It’s important to write your code so that it’s easy to read. An example of this was given above, where I demonstrated several different ways to initialize a hash variable.

This initialization is difficult to read; visually, the keys and the values are jumbled together.

%months = ( 'January' => 0, 'February' => 1, 'March' => 2,
            'April' => 3, 'May' => 4, 'June' => 5,
            'July' => 6, 'August' => 7, 'September' => 8,
            'October' => 9, 'November' => 10, 'December' => 11 );

It is much kinder to yourself (when you reread your code after time has passed) -- and to those who maintain your code -- to format the initialization so that the relationships between the keys and values are clear.

%months = ( January   =>  0,
            February  =>  1,
            March     =>  2,
            April     =>  3,
            May       =>  4,
            June      =>  5,
            July      =>  6,
            August    =>  7,
            September =>  8,
            October   =>  9,
            November  => 10,
            December  => 11 );

A summary example script

Here’s a script that puts together what we learned today. This script reads a data file of alternating lines: the name of a human NCBI reference sequence (refSeq) appears on one line, and the length of that refSeq as aligned against NCBI assembly 34 of the human genome appears on the next line. (The data were extracted from the UCSC Genome Bioinformatics web site, specifically from file refGene.txt.gz available from the UCSC human genome database downloads page.)

The script loads each refSeq name and its associated length value into a hash variable. Then the script extracts all the keys and sorts them in order by their values, the sequence lengths. Finally, the script prints out the refSeq names and lengths in order from the longest refSeq to the shortest.

A short data file follows the script.

#!/usr/bin/perl
#
#   refSeqSizes.pl
#   20-Jun-2004
#
#   Conrad Halling
#   conrad.halling@sphaerula.com
#
#   This sample script reads a data file that contains alternating lines
#   with the name of a human refSeq and then the length of the refSeq. This
#   script reads the data into a hash, where the key is the name of the
#   refSeq and the value is its length. The script then sorts the hash keys
#   in order by length, from longest to shortest, and prints the names and
#   lengths to STDOUT.

use warnings;
use strict;

    my( $dataLine );
    my( $fileName );
    my( $refSeq );
    my( %refSeqs );
    my( $success );

    #   Ask the user for the name of the data file.

    print( STDERR "\n" );
    print( STDERR "  Enter the name of the data file:\n" );

    #   Get the name of the data file.

    $fileName = <STDIN>;
    chomp( $fileName );

    #   Open the file.

    $success = open( IN, '<', $fileName );
    if ( ! $success )
    {
        die( "\n  Can't open file $fileName for reading: $!.\n\n" );
    }

    #   Loop, reading two lines on each loop. The first line contains the name
    #   of a refSeq, the second the length of that refSeq.

    while ( defined( $dataLine = <IN> ) )
    {
        my( $name );
        my( $size );

        #   Process the line to get the name of the refSeq.

        chomp( $dataLine );
        $name = $dataLine;

        #   Read the next line, which contains the size of the refSeq.

        if ( defined( $dataLine = <IN> ) )
        {
            #   Process the line to get the size of the refSeq.

            chomp( $dataLine );
            $size = $dataLine;

            #   Store the name and size of the refSeq into the hash.
            #   The name is the key, and the size is the value.

            $refSeqs{ $name } = $size;
        }
    }

    #   Close the input file.

    close( IN );

    #   Get the keys, which are sorted by their values.
    #   See Perl Cookbook, pp. 144-145 for more information.
    #   The commented code sorts the keys from smallest value to largest.
    #   The active code sorts the keys from largest value to smallest.

#   foreach $refSeq ( sort { $refSeqs{ $a } <=> $refSeqs{ $b } }
#       keys( %refSeqs ) )

    foreach $refSeq ( sort { $refSeqs{ $b } <=> $refSeqs{ $a } }
        keys( %refSeqs ) )
    {
        #   Print the name of the refSeq (the key) and the size of the refSeq
        #   (the value).

        print( STDOUT "$refSeq\t$refSeqs{ $refSeq }\n" );
    }

Here is a short data file containing the lengths of ten human refSeq sequences as aligned against assembly 34 of the human genome. Copy these lines and paste them into a text file.

NM_024796
1300
NM_145291
2885
NM_153254
2111
NM_004195
1199
NM_148901
988
NM_148902
1178
NM_003327
1067
NM_016176
1955
NM_016547
2078
NM_016453
2962

The output of the script, when run on these data saved into file refSeqLengths.txt, is this:

> perl refSeqSizes.pl

  Enter the name of the data file:
refSeqLengths.txt
NM_016453       2962
NM_145291       2885
NM_153254       2111
NM_016547       2078
NM_016176       1955
NM_024796       1300
NM_004195       1199
NM_148902       1178
NM_003327       1067
NM_148901       988

Homework assignment

I have provided 20 lines of data below. The first of each pair of lines is the name of a bacterium. The second of each pair of lines is the genome size of that bacterium, in megabases. (The data were obtained from TIGR’s Comprehensive Microbial Resource.) Copy the data and save it into a text file.

Streptomyces coelicolor A3(2)
9.054
Escherichia coli K12-MG1655
4.639
Bacillus subtilis 168
4.214
Nostoc sp. PCC 7120
7.211
Myxococcus xanthus DK 1622
9.139
Synechocystis sp. PCC6803
3.573
Agrobacterium tumefaciens C58 Cereon
5.673
Synechococcus sp. WH8102
2.434
Staphylococcus aureus MW2
2.82
Helicobacter pylori 26695
1.667

Write a script that reads the data from the text file and loads it into a hash, where the keys are the names of the bacteria and the values are the genome sizes. Sort the hash by the keys and print the list in alphabetical order by species. Then sort the hash by the values and print the list in size order from largest genome to smallest genome.

The output should look like this:

> perl sort.pl

  Enter the name of the data file:
data.txt

Alphabetical order:

Agrobacterium tumefaciens C58 Cereon
5.673
Bacillus subtilis 168
4.214
Escherichia coli K12-MG1655
4.639
Helicobacter pylori 26695
1.667
Myxococcus xanthus DK 1622
9.139
Nostoc sp. PCC 7120
7.211
Staphylococcus aureus MW2
2.82
Streptomyces coelicolor A3(2)
9.054
Synechococcus sp. WH8102
2.434
Synechocystis sp. PCC6803
3.573

Size order:

Myxococcus xanthus DK 1622
9.139
Streptomyces coelicolor A3(2)
9.054
Nostoc sp. PCC 7120
7.211
Agrobacterium tumefaciens C58 Cereon
5.673
Escherichia coli K12-MG1655
4.639
Bacillus subtilis 168
4.214
Synechocystis sp. PCC6803
3.573
Staphylococcus aureus MW2
2.82
Synechococcus sp. WH8102
2.434
Helicobacter pylori 26695
1.667