Hash variables
We have already learned about scalar variables; a scalar variable can hold exactly one value. And we have learned about array variables, which hold many items in order.
Perl provides another variable type that can hold many items, the hash variable type. Elements are added to a hash as key-value pairs. The key serves as a unique index to the hash, and the value is what is associated with the key.
Here are metaphors that might help you remember the characteristics of the scalar, array, and hash variable types.
- A scalar variable is a box that can hold exactly one item. The label on the box is the name of the variable, and if we raise the lid and look inside, we can see the item in the box, the value of the variable.
- An array variable is a box with compartments. Each compartment has a number, beginning with 0, and the compartments are numbered in order. This box can expand to hold as many compartments as are needed. To get something out of the array box, we need to know the name of the box and the number of the compartment. The name of the box is equivalent to the name of the array variable, and the number of the compartment is equivalent to the index of the array.
- A hash variable is a filing cabinet. The filing cabinet contains folders, and each folder has a label with the name of the folder. In this filing cabinet, a folder can hold exactly one item. The folders can be in any order, but each folder must have a unique name. To get something out of the filing cabinet, you need to know the name of the filing cabinet and the name of the folder. The name of the filing cabinet is equivalent to the name of the hash variable. The name of the folder is equivalent to the key used to obtain a value from a hash variable. And the item in the folder is equivalent to the value associated with the key.
The name of a hash variable always begins with the percent (%) symbol.
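As a minimal sketch of the three metaphors, the following lines declare one variable of each type and fetch a single item from each. The variable names and the values stored in them are made up for illustration.

```perl
use strict;
use warnings;

# A scalar is a box that holds exactly one value.
my $gene = 'BRCA1';

# An array is a box with compartments numbered from 0.
my @bases = ( 'A', 'C', 'G', 'T' );

# A hash is a filing cabinet; each folder (key) has a unique name.
my %complement = ( 'A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A' );

print( $gene, "\n" );                # the one value in the scalar
print( $bases[ 2 ], "\n" );          # the item in compartment 2: G
print( $complement{ 'A' }, "\n" );   # the item in the folder labeled 'A': T
```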
Suppose we are writing a program that handles dates, and we need to convert the number of the month (where January is month 0 and December is month 11) to the name of the month. One way we could do this would be to put the names of the months into an array variable, like this:
@months = ( 'January', 'February', 'March', 'April',
'May', 'June', 'July', 'August',
'September', 'October', 'November', 'December' );
This is equivalent to the following initialization, which makes clearer the relationship between the array index and the value at that index:
my( @months );
$months[ 0 ] = 'January';
$months[ 1 ] = 'February';
$months[ 2 ] = 'March';
$months[ 3 ] = 'April';
$months[ 4 ] = 'May';
$months[ 5 ] = 'June';
$months[ 6 ] = 'July';
$months[ 7 ] = 'August';
$months[ 8 ] = 'September';
$months[ 9 ] = 'October';
$months[ 10 ] = 'November';
$months[ 11 ] = 'December';
Then when we wanted to convert the number of the month to the name of the month, we could simply use the number of the month as an index into the array to get the name, like this:
$month = $months[ 3 ]; # gets 'April'
But now suppose we wanted to do the opposite, to convert the name of the month into the number of the month. What we need is some kind of array-type variable where the index would be the name of the month and the value would be the number of the month. We can’t do this with a standard array variable because the index of an array must be a number. So Perl provides a second type of array-type variable, the associative array or hash variable. Its use looks like this:
my( %months );
$months{ 'January' } = 0;
$months{ 'February' } = 1;
$months{ 'March' } = 2;
$months{ 'April' } = 3;
$months{ 'May' } = 4;
$months{ 'June' } = 5;
$months{ 'July' } = 6;
$months{ 'August' } = 7;
$months{ 'September' } = 8;
$months{ 'October' } = 9;
$months{ 'November' } = 10;
$months{ 'December' } = 11;
Notice that the index of the hash variable, which is called the key of the
hash variable, goes between curly braces ({}), whereas the index of an
array variable goes between square brackets ([]). Also notice that
any string can serve as a key to a hash (Perl converts other scalar values,
such as numbers, to strings when they are used as keys), whereas the index
of an array must be a number.
When we want to convert the name of the month to the number of the month, we simply use the name of the month as a key into the hash to get the number, like this:
$number = $months{ 'April' }; # puts 3 into $number
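The two lookups are mirror images of each other, so the hash can also be built from the array instead of being typed in by hand. The following is only a sketch of that idea. It relies on two things the text above does not cover: $#months, which is the last index of the @months array, and the fact that @months and %months are separate variables even though they share a name.

```perl
use strict;
use warnings;

my @months = ( 'January', 'February', 'March', 'April',
               'May', 'June', 'July', 'August',
               'September', 'October', 'November', 'December' );

my %months;

# Walk the array by index; each name becomes a key and its index the value.
foreach my $index ( 0 .. $#months )
{
    $months{ $months[ $index ] } = $index;
}

print( $months[ 3 ], "\n" );          # number to name: April
print( $months{ 'April' }, "\n" );    # name to number: 3
```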
Initializing a hash
Hashes can be initialized just like arrays, except that we have to remember to use key-value pairs in the initialization. Here’s how we could initialize a hash with the names of the months as the keys and the numbers of the months as the values:
%months = ( 'January', 0, 'February', 1, 'March', 2,
'April', 3, 'May', 4, 'June', 5,
'July', 6, 'August', 7, 'September', 8,
'October', 9, 'November', 10, 'December', 11 );
The trouble with initializing a hash this way is that the correspondence between the keys and the values is not obvious. So Perl provides another way to initialize a hash:
%months = ( 'January' => 0, 'February' => 1, 'March' => 2,
'April' => 3, 'May' => 4, 'June' => 5,
'July' => 6, 'August' => 7, 'September' => 8,
'October' => 9, 'November' => 10, 'December' => 11 );
With this syntax, the key comes before the => operator and the
value comes after it. This way, it’s clear which items in the list make up a
key-value pair. A clearer way to code this:
%months = ( 'January' => 0,
'February' => 1,
'March' => 2,
'April' => 3,
'May' => 4,
'June' => 5,
'July' => 6,
'August' => 7,
'September' => 8,
'October' => 9,
'November' => 10,
'December' => 11 );
So, why couldn’t we have done this with the comma-separated list that we used first, like this?
%months = ( 'January', 0,
'February', 1,
'March', 2,
'April', 3,
'May', 4,
'June', 5,
'July', 6,
'August', 7,
'September', 8,
'October', 9,
'November', 10,
'December', 11 );
The answer is that we can. In Perl, the => operator is a synonym
for the , (comma) operator, with one extra convenience described below.
One advantage to using the => operator is that it makes it a little
clearer which is the key and which is the value (the key points the way
to the value).
A second advantage to
using the => operator is that when we use a simple word (made up of
letters, digits, and underscores) as a key, we don’t have to put single or
double quotes around the word. When we use the comma operator, we always
have to put quotation marks around the word. So here’s our final example
using the => operator:
%months = ( January => 0,
February => 1,
March => 2,
April => 3,
May => 4,
June => 5,
July => 6,
August => 7,
September => 8,
October => 9,
November => 10,
December => 11 );
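Here is a short sketch that compares the two spellings and shows where quotation marks are still needed. The hash names %withCommas, %withArrows, and %genomeSizes are made up for illustration; the two genome sizes are taken from the homework data at the end of this section.

```perl
use strict;
use warnings;

# The same three pairs, written first with commas and then with =>.
my %withCommas = ( 'January', 0, 'February', 1, 'March', 2 );
my %withArrows = ( January => 0, February => 1, March => 2 );

print( $withCommas{ 'March' }, "\n" );   # prints 2
print( $withArrows{ 'March' }, "\n" );   # prints 2

# Keys that are not simple words (here they contain spaces and periods)
# still need quotation marks, even with =>.
my %genomeSizes = ( 'Bacillus subtilis 168' => 4.214,
                    'Nostoc sp. PCC 7120'   => 7.211 );

print( $genomeSizes{ 'Nostoc sp. PCC 7120' }, "\n" );   # prints 7.211
```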
Adding key-value pairs to a hash
Above, we showed how to initialize an entire hash. Of course, we often don’t know the key-value pairs of a hash beforehand. For example, we’ll often determine the key-value pairs of a hash when we’re parsing a data file.
We can add key-value pairs to a hash using a syntax that requires us to add them one pair at a time, like this:
my( %months );
$months{ 'January' } = 0;
$months{ 'February' } = 1;
$months{ 'March' } = 2;
$months{ 'April' } = 3;
$months{ 'May' } = 4;
$months{ 'June' } = 5;
$months{ 'July' } = 6;
$months{ 'August' } = 7;
$months{ 'September' } = 8;
$months{ 'October' } = 9;
$months{ 'November' } = 10;
$months{ 'December' } = 11;
As we can see above, the key goes inside the braces, and the value comes after the equals sign.
Now, what’s with the $ in front of $months? Why
isn’t it a % symbol? The answer is that a key provides an index to
exactly one item in a hash, and when we refer to exactly one item, we’re
referring to a scalar. So Perl syntax requires that we put a $ symbol
in front of the name of the variable. It’s only when we refer to the hash as a
whole that we put a % in front.
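The difference between the two symbols can be seen in a few lines. This is a small sketch; the %copy hash is a made-up name used only to show the hash being handled as a whole.

```perl
use strict;
use warnings;

my %months = ( January => 0, February => 1, March => 2 );

# One element of the hash is a scalar, so it takes the $ symbol.
my $number = $months{ 'February' };
print( $number, "\n" );             # prints 1

# The hash as a whole takes the % symbol, for example when we copy it.
my %copy = %months;
print( $copy{ 'March' }, "\n" );    # prints 2
```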
Using a key to get a value from a hash
When we want to get a value out of a hash, we have to use a key, like this:
$monthNumber = $months{ 'August' };
Since the value that is associated with the key 'August' is 7,
$monthNumber will contain 7.
Removing hash elements
Perl provides the delete() function for removing a key-value pair from
a hash, like this:
delete( $months{ 'August' } );
This removes the 'August' key and its associated value, 7, from the
%months hash.
The simplest way to remove everything from a hash is to reinitialize it with an empty list, like this:
%months = ();
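The following sketch puts both kinds of removal together on a three-month hash. It also shows one detail not mentioned above: delete() returns the value that it removed. The variables $removed and @remaining are made up for illustration.

```perl
use strict;
use warnings;

my %months = ( June => 5, July => 6, August => 7 );

# delete() removes the pair and returns the value that was removed.
my $removed = delete( $months{ 'August' } );
print( $removed, "\n" );               # prints 7

# The other pairs are untouched.
print( $months{ 'July' }, "\n" );      # prints 6

# Assigning an empty list removes every remaining pair.
%months = ();
my @remaining = %months;
print( scalar( @remaining ), "\n" );   # prints 0
```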
Testing for the existence of a key in a hash
If we want to see whether a particular key exists in a hash, we can use Perl’s
exists() function. The function returns a true value if the key exists
in the hash or a false value if the key does not exist in the hash. Keys are
case-sensitive.
Here’s an example of testing to see if a key exists in a hash, using our
%months hash:
$name = 'August';
if ( exists( $months{ $name } ) )
{
    print( STDOUT "$name is month number $months{ $name }.\n" );
}
else
{
    print( STDERR "Key $name not found in hash \%months!\n" );
}
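It is tempting to skip exists() and test the value itself, but with our %months hash that goes wrong: the value for January is 0, and Perl treats 0 as false. The sketch below, which uses a shortened two-month hash, shows the difference and also shows that keys are case-sensitive.

```perl
use strict;
use warnings;

my %months = ( January => 0, February => 1 );

# Testing the value: 0 is false, so January appears to be missing.
if ( $months{ 'January' } )
{
    print( "value test: found\n" );
}
else
{
    print( "value test: not found\n" );    # this branch runs
}

# exists() tests for the key, regardless of the value.
if ( exists( $months{ 'January' } ) )
{
    print( "exists test: found\n" );       # this branch runs
}

# Keys are case-sensitive: 'january' is not the same key as 'January'.
if ( ! exists( $months{ 'january' } ) )
{
    print( "january is not a key\n" );     # this branch runs
}
```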
Arithmetic manipulation of values in a hash
Hashes are extremely useful for counting how many times each distinct item occurs. For example, suppose we wanted to go through a column of names from a spreadsheet and count how many times we have seen each name. A good way to do this is to use a hash, where the keys are the names and the values are the number of times we have seen each name.
Here’s how we do it. We define a hash variable called %names. Then
we read each line from the file to get each name, which will serve as the key to
the hash. We use the exists() function to determine whether the key is
already present in the hash. If it isn’t, then we assign a new key-value pair
to the hash, where the key is the new name and the value is 1 (because we have seen
this name one time). On the other hand, if the exists() function tells
us that the key is already in the hash, then we want to increment (add 1 to) the
value.
So how do we change the value associated with a key? Assuming that we have a hash
variable named %names and a key in the scalar variable
$name, we know that we can access the value associated with the key
and place it into the $value variable
this way:
$value = $names{ $name }; # extract the value associated with key $name
Then we could add 1 to $value, and then associate the new value with the
key. Since keys are unique, the new key-value pair replaces the old key-value pair.
$value++; # increment $value
$names{ $name } = $value; # provide a new value for the key
Now that we’ve seen the long way to increment a value, here are four equivalent
ways to increment a value, beginning with our example above and using
progressively shorter code, until we finish by using the ++
operator for incrementing.
# Examples of ways to increment the value associated with a key.
# The long way, using an intermediate variable.
$value = $names{ $name }; # extract the value associated with key $name
$value++; # increment $value
$names{ $name } = $value; # provide a new value for the key
# The short way, without using an intermediate variable.
$names{ $name } = $names{ $name } + 1;
# Using the += operator.
$names{ $name } += 1;
# Using the ++ operator.
$names{ $name }++;
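Putting the pieces together, the counting procedure described at the start of this topic might be sketched as follows. The @column array is a made-up stand-in for the names that would be read from a file. The final loop uses sort() and keys(), which are covered later in this section, only to print the counts in a predictable order.

```perl
use strict;
use warnings;

# Hypothetical list of names standing in for the column read from a file.
my @column = ( 'Ann', 'Bob', 'Ann', 'Cy', 'Bob', 'Ann' );

my %names;

foreach my $name ( @column )
{
    if ( exists( $names{ $name } ) )
    {
        # We have seen this name before: add 1 to its count.
        $names{ $name }++;
    }
    else
    {
        # This is the first time we have seen this name.
        $names{ $name } = 1;
    }
}

# Print each name and its count, in alphabetical order by name.
foreach my $name ( sort( keys( %names ) ) )
{
    print( "$name\t$names{ $name }\n" );
}
```

With this list, Ann is counted three times, Bob twice, and Cy once.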
The point here is that we can access values directly via their keys and perform mathematical operations on the values. Here are some examples:
# Add two values together.
$combinedLength = $seqLengths{ $seqName1 } + $seqLengths{ $seqName2 };
# Subtract 5 from a value. This changes the value associated with the
# key.
$seqLengths{ $seqName1 } -= 5;
# Take the square root of the area of a square to determine the length of a side
# of the square.
#
# %areas is the hash that contains the names of squares as keys and the areas of
# the squares as values.
#
# $square is a variable that contains the name of a square and thus serves as a key
# to the hash.
#
# $areas{ $square } is the value associated with the name $square. We take the
# square root, using the sqrt() function, to determine the length of each side of
# the square, and put the result into the variable $length.
$length = sqrt( $areas{ $square } );
Getting and using all hash keys
We’ll often load hash variables with data obtained from a file. For example, the keys could be sequence names, and the values could be the lengths of the sequences. Once we have loaded the data into the hash, we might want to get all the items back out to work with them.
Perl provides the keys() function for obtaining all the keys from a
hash. The keys() function returns a list containing the hash keys.
For example, if we wanted the keys from our %months hash variable, we
could get them like this:
@monthNames = keys( %months );
Here’s a short script where we initialize the %months variable as
above. Then we extract all the keys and print them.
#!/usr/bin/perl
#
# hashKeys1.pl
# 20-Jun-2004
#
# Conrad Halling
# conrad.halling@sphaerula.com
#
# This script demonstrates how to initialize a hash from a list and extract
# the keys and values from the hash.
use warnings;
my( $monthName );
my( @monthNames );
my( %months );
# Initialize the hash. The keys are the names of the months and the
# values are the numbers of the months, where January is month 0 and
# December is month 11.
%months = ( January => 0,
February => 1,
March => 2,
April => 3,
May => 4,
June => 5,
July => 6,
August => 7,
September => 8,
October => 9,
November => 10,
December => 11 );
# Use the keys() function to get all the keys, and assign the keys
# to an array.
@monthNames = keys( %months );
# Use a foreach loop to get each item from the array. Each item is
# a key to the %months hash; we use the item to obtain the value
# associated with that key.
foreach $monthName ( @monthNames )
{
    print( "$monthName => $months{ $monthName }\n" );
}
When I run this script, I get the following output:
> perl hashKeys1.pl
September => 8
January => 0
November => 10
April => 3
August => 7
March => 2
June => 5
July => 6
December => 11
May => 4
October => 9
February => 1
Why didn’t the elements come out in the same order as
they went in? The answer is that a hash does not keep track of the order in
which its pairs were added. The keys of a hash are not stored
in any particular order, so when you extract the keys of a hash with the keys()
function, they don’t come out in any particular order. You will find that, if
you run this script, the keys and values will probably come out in a different
order than the one given above. If you want the keys in order,
then you have to sort them.
Sorting hash keys
When we use Perl’s keys() function on a hash variable, we get back
a list containing the keys to the hash. We have already learned how to sort a
list alphabetically or numerically. This means we have all the necessary
pieces for sorting the keys of a hash.
Let’s revise our script so that we sort the keys alphabetically before we print them. The change is the new line that calls sort() on @monthNames:
#!/usr/bin/perl
#
# hashKeys2.pl
# 20-Jun-2004
#
# Conrad Halling
# conrad.halling@sphaerula.com
#
# This script demonstrates how to initialize a hash from a list and extract
# the keys and values from the hash, with the keys sorted in alphabetical
# order.
use warnings;
my( $monthName );
my( @monthNames );
my( %months );
# Initialize the hash. The keys are the names of the months and the
# values are the numbers of the months, where January is month 0 and
# December is month 11.
%months = ( January => 0,
February => 1,
March => 2,
April => 3,
May => 4,
June => 5,
July => 6,
August => 7,
September => 8,
October => 9,
November => 10,
December => 11 );
# Use the keys() function to get all the keys, and assign the keys
# to an array.
@monthNames = keys( %months );
# Sort the array into order by the name of the month.
@monthNames = sort( @monthNames );
# Use a foreach loop to get each item from the array. Each item is
# a key to the %months hash; we use the item to obtain the value
# associated with that key.
foreach $monthName ( @monthNames )
{
    print( "$monthName => $months{ $monthName }\n" );
}
The output is given below:
> perl hashKeys2.pl
April => 3
August => 7
December => 11
February => 1
January => 0
July => 6
June => 5
March => 2
May => 4
November => 10
October => 9
September => 8
This shows you can sort the keys alphabetically, but in this case it doesn’t really get the names of the months back into the order we want.
It looks like what we want to do is sort the keys in the numeric order of their values. The magic formula for doing this is the sort line in the script below. (I call this a magic formula because I’m not going to explain now how this works. For more information, see Perl Cookbook, pp. 144-145.)
#!/usr/bin/perl
#
# hashKeys3.pl
# 20-Jun-2004
#
# Conrad Halling
# conrad.halling@sphaerula.com
#
# This script demonstrates how to initialize a hash from a list and extract
# the keys and values from the hash, with the keys sorted in order
# according to their values.
use warnings;
my( $monthName );
my( @monthNames );
my( %months );
# Initialize the hash. The keys are the names of the months and the
# values are the numbers of the months, where January is month 0 and
# December is month 11.
%months = ( January => 0,
February => 1,
March => 2,
April => 3,
May => 4,
June => 5,
July => 6,
August => 7,
September => 8,
October => 9,
November => 10,
December => 11 );
# Use the keys() function to get all the keys, and assign the keys
# to an array.
@monthNames = keys( %months );
# Sort the array into order by the number of each month, that is, in
# order by the values associated with the keys.
@monthNames = sort { $months{ $a } <=> $months{ $b } } @monthNames;
# Use a foreach loop to get each item from the array. Each item is
# a key to the %months hash; we use the item to obtain the value
# associated with that key.
foreach $monthName ( @monthNames )
{
    print( "$monthName => $months{ $monthName }\n" );
}
Now the output is:
> perl hashKeys3.pl
January => 0
February => 1
March => 2
April => 3
May => 4
June => 5
July => 6
August => 7
September => 8
October => 9
November => 10
December => 11
If our values were not numeric but were words, then we would sort the keys by the
alphabetical order of their values using the following magic
formula, where we have substituted the cmp operator in place of
the <=> operator:
@monthNames = sort { $months{ $a } cmp $months{ $b } } @monthNames;
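Since the values in %months are numbers, here is a sketch of the cmp version using a hash whose values are words. The %genus hash is made up for illustration: its keys are strain labels and its values are genus names, paired as they appear in the homework data at the end of this section.

```perl
use strict;
use warnings;

# Hypothetical hash: strain label => genus name.
my %genus = ( 'K12-MG1655' => 'Escherichia',
              '168'        => 'Bacillus',
              'MW2'        => 'Staphylococcus',
              '26695'      => 'Helicobacter' );

# Sort the keys by the alphabetical order of their values.
my @strains = sort { $genus{ $a } cmp $genus{ $b } } keys( %genus );

foreach my $strain ( @strains )
{
    print( "$strain\t$genus{ $strain }\n" );
}
```

The strains come out in the order 168, K12-MG1655, 26695, MW2, because their genus names sort as Bacillus, Escherichia, Helicobacter, Staphylococcus.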
A note about code style
It’s important to write your code so that it’s easy to read. An example of this was given above, where I demonstrated several different ways to initialize a hash variable.
This initialization is difficult to read; visually, the keys and the values are jumbled together.
%months = ( 'January' => 0, 'February' => 1, 'March' => 2,
'April' => 3, 'May' => 4, 'June' => 5,
'July' => 6, 'August' => 7, 'September' => 8,
'October' => 9, 'November' => 10, 'December' => 11 );
It is much kinder to yourself (when you reread your code after time has passed) and to those who maintain your code to format the initialization so that the relationships between the keys and values are clear.
%months = ( January => 0,
February => 1,
March => 2,
April => 3,
May => 4,
June => 5,
July => 6,
August => 7,
September => 8,
October => 9,
November => 10,
December => 11 );
A summary example script
Here’s a script that puts together what we learned today. This script reads a data file of alternating lines: the name of a human NCBI reference sequence (refSeq) appears on one line, and the length of that refSeq as aligned against NCBI assembly 34 of the human genome appears on the next line. (The data were extracted from the UCSC Genome Bioinformatics web site, specifically from file refGene.txt.gz available from the UCSC human genome database downloads page.)
The script loads each refSeq name and its associated length value into a hash variable. Then the script extracts all the keys and sorts them in order by their values, the sequence lengths. Finally, the script prints out the refSeq names and lengths in order from the longest refSeq to the shortest.
A short data file follows the script.
#!/usr/bin/perl
#
# refSeqSizes.pl
# 20-Jun-2004
#
# Conrad Halling
# conrad.halling@sphaerula.com
#
# This sample script reads a data file, named by the user, which contains
# alternating lines with the name of a refSeq and then the length of the
# refSeq. This script reads the data into a hash, where the key is the
# name of the refSeq and the value is the length of the refSeq. The script
# then sorts the hash keys in order by length, and prints the names and
# lengths to STDOUT.
use warnings;
use strict;
my( $dataLine );
my( $fileName );
my( $refSeq );
my( %refSeqs );
my( $success );
# Ask the user for the name of the data file.
print( STDERR "\n" );
print( STDERR " Enter the name of the data file:\n" );
# Get the name of the data file.
$fileName = <>;
chomp( $fileName );
# Open the file.
$success = open( IN, "<$fileName" );
if ( ! $success )
{
    die( "\n Can't open file $fileName for reading: $!.\n\n" );
}
# Loop, reading two lines on each loop. The first line contains the name
# of a refSeq, the second the length of that refSeq.
while ( defined( $dataLine = <IN> ) )
{
    my( $name );
    my( $size );
    # Process the line to get the name of the refSeq.
    chomp( $dataLine );
    $name = $dataLine;
    # Read the next line, which contains the size of the refSeq.
    if ( defined( $dataLine = <IN> ) )
    {
        # Process the line to get the size of the refSeq.
        chomp( $dataLine );
        $size = $dataLine;
        # Store the name and size of the refSeq into the hash.
        # The name is the key, and the size is the value.
        $refSeqs{ $name } = $size;
    }
}
# Close the input file.
close( IN );
# Get the keys, which are sorted by their values.
# See Perl Cookbook, pp. 144-145 for more information.
# The commented code sorts the keys from smallest value to largest.
# The used code sorts the keys from largest value to smallest.
# foreach $refSeq ( sort { $refSeqs{ $a } <=> $refSeqs{ $b } }
#                   keys( %refSeqs ) )
foreach $refSeq ( sort { $refSeqs{ $b } <=> $refSeqs{ $a } }
                  keys( %refSeqs ) )
{
    # Print the name of the refSeq (the key) and the size of the refSeq
    # (the value).
    print( STDOUT "$refSeq\t$refSeqs{ $refSeq }\n" );
}
Here is a short data file containing the lengths of ten human refSeq sequences as aligned against assembly 34 of the human genome. Copy these lines and paste them into a text file.
NM_024796
1300
NM_145291
2885
NM_153254
2111
NM_004195
1199
NM_148901
988
NM_148902
1178
NM_003327
1067
NM_016176
1955
NM_016547
2078
NM_016453
2962
The output of the script, when run on these data saved into file refSeqLengths.txt, is this:
> perl refSeqSizes.pl
Enter the name of the data file:
refSeqLengths.txt
NM_016453	2962
NM_145291	2885
NM_153254	2111
NM_016547	2078
NM_016176	1955
NM_024796	1300
NM_004195	1199
NM_148902	1178
NM_003327	1067
NM_148901	988
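For experimenting with the read-two-lines-per-record loop without creating a data file, here is a shorter sketch of the same idea. It is not the script above: it assumes a few things that script does not use, namely a lexical filehandle ($in), the three-argument form of open(), and a filehandle opened on a string ($data) that holds three of the records from the data file.

```perl
use strict;
use warnings;

# Three of the records from the data file, held in a string so that the
# sketch needs no external file.
my $data = "NM_024796\n1300\nNM_145291\n2885\nNM_148901\n988\n";

# Open a lexical filehandle for reading from the string.
open( my $in, '<', \$data ) or die( "Can't open in-memory file: $!\n" );

my %refSeqs;

# Read two lines on each loop: first the name, then the size.
while ( defined( my $name = <$in> ) )
{
    my $size = <$in>;

    # Stop if a name has no size line after it.
    last if ( ! defined( $size ) );

    chomp( $name );
    chomp( $size );
    $refSeqs{ $name } = $size;
}

close( $in );

# Print from the longest refSeq to the shortest.
foreach my $refSeq ( sort { $refSeqs{ $b } <=> $refSeqs{ $a } }
                     keys( %refSeqs ) )
{
    print( "$refSeq\t$refSeqs{ $refSeq }\n" );
}
```

With these three records, the names print in the order NM_145291, NM_024796, NM_148901.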
Homework assignment
I have provided 20 lines of data below. The first of each pair of lines is the name of a bacterium. The second of each pair of lines is the genome size of that bacterium, in megabases. (The data was obtained from TIGR’s Comprehensive Microbial Resource.) Copy the data and save it into a text file.
Streptomyces coelicolor A3(2)
9.054
Escherichia coli K12-MG1655
4.639
Bacillus subtilis 168
4.214
Nostoc sp. PCC 7120
7.211
Myxococcus xanthus DK 1622
9.139
Synechocystis sp. PCC6803
3.573
Agrobacterium tumefaciens C58 Cereon
5.673
Synechococcus sp. WH8102
2.434
Staphylococcus aureus MW2
2.82
Helicobacter pylori 26695
1.667
Write a script that reads the data from the text file and loads it into a hash, where the keys are the names of the bacteria and the values are the genome sizes. Sort the hash by the keys and print the list in alphabetical order by species. Then sort the hash by the values and print the list in size order from largest genome to smallest genome.
The output should look like this:
> perl sort.pl
Enter the name of the data file:
data.txt
Alphabetical order:
Agrobacterium tumefaciens C58 Cereon	5.673
Bacillus subtilis 168	4.214
Escherichia coli K12-MG1655	4.639
Helicobacter pylori 26695	1.667
Myxococcus xanthus DK 1622	9.139
Nostoc sp. PCC 7120	7.211
Staphylococcus aureus MW2	2.82
Streptomyces coelicolor A3(2)	9.054
Synechococcus sp. WH8102	2.434
Synechocystis sp. PCC6803	3.573
Size order:
Myxococcus xanthus DK 1622	9.139
Streptomyces coelicolor A3(2)	9.054
Nostoc sp. PCC 7120	7.211
Agrobacterium tumefaciens C58 Cereon	5.673
Escherichia coli K12-MG1655	4.639
Bacillus subtilis 168	4.214
Synechocystis sp. PCC6803	3.573
Staphylococcus aureus MW2	2.82
Synechococcus sp. WH8102	2.434
Helicobacter pylori 26695	1.667