SSAHA

Summary

This page describes how to patch the SSAHA 3.1 source code so it can be compiled using the GCC 4.0.1 or 4.0.2 C++ compiler under Linux or Mac OS X.

Introduction

SSAHA (Sequence Search and Alignment by Hashing Algorithm) is a sequence alignment program developed at the Wellcome Trust Sanger Institute. SSAHA holds a hash table in memory, which makes sequence alignment very fast. SSAHA is written in C++, and the source code for version 3.1 is available under the GNU General Public License (GPL). The code uses advanced features of C++, including templates, operator overloading, containers, inheritance, and polymorphism.

The SSAHA programmers have provided binary executables of SSAHA 3.1 on the SSAHA web site. Unfortunately, the Linux binary executable is not compatible with the Fedora Core 4 distribution of Linux. The provided binary requires libstdc++.so.5, but Fedora Core 4 comes with a later version of this library, libstdc++.so.6. No binary executable of SSAHA 3.1 is provided for Macintosh computers running Mac OS X.
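One way to confirm this kind of library mismatch on Linux is to ask the dynamic linker which shared libraries a binary cannot resolve. The sketch below is an illustration only and is not part of the SSAHA distribution: list_missing_libs is a hypothetical helper name, and it assumes that the ldd command marks unresolved libraries with the text "=> not found".

```shell
# Hypothetical helper (not part of SSAHA): print the names of shared
# libraries that ldd reports as "not found".
# Reads ldd output on standard input, one library per line.
list_missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

Running "ldd ./ssaha | list_missing_libs" against the downloaded Linux binary would be expected to print libstdc++.so.5 on a system that lacks that library. The ldd command is specific to Linux; it is not available on Mac OS X.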

This means that if you want to run SSAHA 3.1 on a Linux computer running a recent distribution or on a Mac OS X computer, you must compile SSAHA 3.1 from the source code. However, SSAHA 3.1 was written for the GCC 2.95 C++ compiler and will not compile using the GCC 4.0.1 C++ compiler available with Mac OS X 10.4, nor will it compile using the GCC 4.0.2 C++ compiler available with the Fedora Core 4 distribution of Linux.

I have made the minor changes to the code that were required to get SSAHA to compile and run under Fedora Core 4 and Mac OS X. The remainder of this page describes how you can obtain the code changes, apply them to the source code, and compile SSAHA on your Linux or Mac OS X computer.

Note: The SSAHA developers at the Sanger Institute released a new version of SSAHA, SSAHA2, in October 2005. This version combines the SSAHA algorithm with the cross_match algorithm developed by Phil Green. The source code is not available, presumably because cross_match is not open source software.

How to Patch SSAHA 3.1

My changes to the SSAHA 3.1 source code are collected in a single patch file. The steps below show how to obtain the patch, apply it to the original source code, and build SSAHA with GCC 4.0.1 or 4.0.2.

Step 1: Obtain the Source Code and the Patch File

Obtain the source code for SSAHA 3.1 from the Sanger Institute’s SSAHA web page; the code archive is named ssaha_v31c.tar.gz.

The patch file is available at this link. Right-click the link and save it to your computer as ssaha_v31c_patch.diff.

Step 2: Move the Files to Your Working Directory

Create a working directory. These directions assume the working directory is named /home/challing/SSAHA; substitute the name of your own working directory wherever this appears. Move the ssaha_v31c.tar.gz and ssaha_v31c_patch.diff files into this directory. The commands used below assume you have downloaded these two files to your desktop.

$ mkdir /home/challing/SSAHA
$ mv /home/challing/Desktop/ssaha_v31c.tar.gz /home/challing/SSAHA/
$ mv /home/challing/Desktop/ssaha_v31c_patch.diff /home/challing/SSAHA/
$ cd /home/challing/SSAHA

Step 3: Extract the Code Archive

Create a subdirectory named ssaha-patched where you will extract the archive, patch the source code, and build the patched version of SSAHA. Copy the ssaha_v31c.tar.gz and ssaha_v31c_patch.diff files into this directory, preserving the originals.

$ mkdir ssaha-patched
$ cp ssaha_v31c.tar.gz ssaha-patched/
$ cp ssaha_v31c_patch.diff ssaha-patched/

Change directory to the subdirectory, extract the archive, and delete the ssaha_v31c.tar.gz file.

$ cd ssaha-patched
$ tar -zxf ssaha_v31c.tar.gz
$ rm ssaha_v31c.tar.gz

Step 4: Apply the Patch

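The ssaha-patched directory should now contain the extracted source directories, including Binary, Global, HashTable, QueryManager, and SequenceReader, along with the ssaha_v31c_patch.diff file.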

Execute the patch procedure as a dry run to make sure it will run correctly. The correct output is shown below the command. Note that the argument is -p1 (dash, letter p, digit one), not -pl (dash, letter p, letter ell).

$ patch --dry-run -p1 -i ssaha_v31c_patch.diff
patching file Binary/makefile
patching file Global/GlobalDefinitions.cpp
patching file Global/GlobalDefinitions.h
patching file Global/SSAHAMain.cpp
patching file Global/SSAHAMain.h
patching file HashTable/HashTable.cpp
patching file HashTable/HashTableGeneric.cpp
patching file HashTable/HashTablePacked.cpp
patching file HashTable/HashTablePacked.h
patching file HashTable/HashTableTranslated.cpp
patching file HashTable/testHashTableNoOverlap.cpp
patching file QueryManager/MatchAligner.cpp
patching file QueryManager/MatchAligner.h
patching file QueryManager/MatchStore.h
patching file QueryManager/MatchStoreGapped.h
patching file QueryManager/MatchStoreUngapped.h
patching file QueryManager/QueryManager.cpp
patching file QueryManager/QueryManager.h
patching file QueryManager/testQueryManager.cpp
patching file SequenceReader/SequenceEncoder.cpp
patching file SequenceReader/SequenceReader.cpp
patching file SequenceReader/SequenceReader.h
patching file SequenceReader/SequenceReaderFasta.cpp
patching file SequenceReader/SequenceReaderFasta.h
patching file SequenceReader/SequenceReaderFilter.h
patching file SequenceReader/SequenceReaderLocal.cpp
patching file SequenceReader/SequenceReaderMulti.cpp
patching file SequenceReader/SequenceReaderMulti.h
patching file SequenceReader/SequenceReaderString.h
patching file SequenceReader/testSequenceReaderFasta.cpp

If the dry run gives no errors, execute the patch.

$ patch -p1 -i ssaha_v31c_patch.diff
patching file Binary/makefile
patching file Global/GlobalDefinitions.cpp
patching file Global/GlobalDefinitions.h
patching file Global/SSAHAMain.cpp
patching file Global/SSAHAMain.h
patching file HashTable/HashTable.cpp
patching file HashTable/HashTableGeneric.cpp
patching file HashTable/HashTablePacked.cpp
patching file HashTable/HashTablePacked.h
patching file HashTable/HashTableTranslated.cpp
patching file HashTable/testHashTableNoOverlap.cpp
patching file QueryManager/MatchAligner.cpp
patching file QueryManager/MatchAligner.h
patching file QueryManager/MatchStore.h
patching file QueryManager/MatchStoreGapped.h
patching file QueryManager/MatchStoreUngapped.h
patching file QueryManager/QueryManager.cpp
patching file QueryManager/QueryManager.h
patching file QueryManager/testQueryManager.cpp
patching file SequenceReader/SequenceEncoder.cpp
patching file SequenceReader/SequenceReader.cpp
patching file SequenceReader/SequenceReader.h
patching file SequenceReader/SequenceReaderFasta.cpp
patching file SequenceReader/SequenceReaderFasta.h
patching file SequenceReader/SequenceReaderFilter.h
patching file SequenceReader/SequenceReaderLocal.cpp
patching file SequenceReader/SequenceReaderMulti.cpp
patching file SequenceReader/SequenceReaderMulti.h
patching file SequenceReader/SequenceReaderString.h
patching file SequenceReader/testSequenceReaderFasta.cpp
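
The dry run and the real run can also be chained so that the patch is applied only when the dry run succeeds. The following is a minimal sketch, not part of the SSAHA distribution. The name apply_patch_checked is hypothetical, and the sketch relies on patch returning a nonzero exit status when any part of the patch cannot be applied.

```shell
# Hypothetical wrapper: apply a patch only if a dry run reports no errors.
# $1 = directory to patch
# $2 = patch file (absolute path, or a path relative to $1)
# File names inside the patch are interpreted with -p1, as in the steps above.
apply_patch_checked() {
    dir=$1
    patchfile=$2
    (
        cd "$dir" || exit 1
        # The dry run changes nothing; its exit status is nonzero on failure.
        if patch --dry-run -p1 -i "$patchfile" >/dev/null; then
            patch -p1 -i "$patchfile"
        else
            echo "dry run failed; no files were changed" >&2
            exit 1
        fi
    )
}
```

With the directory layout used on this page, the call would be "apply_patch_checked /home/challing/SSAHA/ssaha-patched ssaha_v31c_patch.diff".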

Step 5: Compile and Install SSAHA

Build the software as normal from the Binary directory. See the Binary/README and Binary/makefile files for details. Begin by changing the directory to the Binary directory.

$ cd Binary

Set the environment variable required for building the software. (Substitute the name of your directory for /home/challing/SSAHA/ssaha-patched.) For the sh, ksh, and bash shells, the command is:

$ export CURRENT_SSAHA_VERSION=/home/challing/SSAHA/ssaha-patched

For the csh and tcsh shells, the command is:

$ setenv CURRENT_SSAHA_VERSION /home/challing/SSAHA/ssaha-patched
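
Before building, it can help to confirm that the variable really points at the patched source tree. The check below is a sketch for sh-compatible shells only (it will not run under csh or tcsh). The function name check_ssaha_env is hypothetical, and the only thing it assumes about the tree is that it contains the Binary/makefile file named in the patch output above.

```shell
# Hypothetical pre-build check: confirm that CURRENT_SSAHA_VERSION is set
# and points at a source tree that contains Binary/makefile.
check_ssaha_env() {
    if [ -z "${CURRENT_SSAHA_VERSION:-}" ]; then
        echo "CURRENT_SSAHA_VERSION is not set" >&2
        return 1
    fi
    if [ ! -f "$CURRENT_SSAHA_VERSION/Binary/makefile" ]; then
        echo "no Binary/makefile under $CURRENT_SSAHA_VERSION" >&2
        return 1
    fi
    echo "CURRENT_SSAHA_VERSION=$CURRENT_SSAHA_VERSION"
}
```

The function prints the setting and returns success when the check passes, so it can be combined with the build as "check_ssaha_env && make ssaha".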

Build the software.

$ make ssaha

Once the software has been patched, it will report its version as:

This is SSAHA Version 3.2, released 1st March 2004,
patched 26 January 2006.

The unpatched software reports its version as:

This is SSAHA Version 3.2, released 1st March 2004.

Install the ssaha executable by copying it to an appropriate directory, or simply run it from the build directory.

Notes

Testing SSAHA

SSAHA also comes with test programs. Build the test binaries with the command:

$ make test

Run the tests with the following commands:

$ ./testSSAHA.csh

$ ./testHashTable

$ ./testHashTableNoOverlap

$ ./testQueryManager

$ ./testSequenceReaderFasta

$ ./testTimeStamp

I compiled the patched version of SSAHA on a Linux computer running Fedora Core 4 and on a Macintosh computer running Mac OS X 10.4.4 and ran all the test programs. All tests ran successfully on both computers.

Test Program Status

Test Program               Fedora Core 4    Mac OS X
testSSAHA.csh              succeeded        succeeded
testHashTable              succeeded        succeeded
testHashTableNoOverlap     succeeded        succeeded
testQueryManager           succeeded        succeeded
testSequenceReaderFasta    succeeded        succeeded
testTimeStamp              succeeded        succeeded
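
The test programs listed above can also be run in one pass with a small loop. This is a sketch under a stated assumption: it treats a zero exit status as success and anything else as failure, which the SSAHA documentation quoted on this page does not confirm. The function name run_ssaha_tests is hypothetical.

```shell
# Hypothetical helper: run each named test program from the current
# directory and print "succeeded" or "failed" based on its exit status.
# ASSUMPTION: the test programs signal failure with a nonzero exit status.
run_ssaha_tests() {
    failures=0
    for t in "$@"; do
        if "./$t" >/dev/null 2>&1; then
            echo "$t succeeded"
        else
            echo "$t failed"
            failures=$((failures + 1))
        fi
    done
    # Overall status: success only if no test failed.
    [ "$failures" -eq 0 ]
}
```

From the Binary directory the call would be "run_ssaha_tests testSSAHA.csh testHashTable testHashTableNoOverlap testQueryManager testSequenceReaderFasta testTimeStamp". Because the output of each program is discarded, rerun any failing test by hand to see its messages.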

Bug Fixes

At first, SSAHA failed three tests on my PowerBook, which uses a PowerPC G4 processor. The PowerPC processor stores the most significant byte first (big-endian), whereas a Pentium processor stores the least significant byte first (little-endian). I corrected a bug on line 221 of the file SequenceEncoder.cpp, where the original code did not account for byte ordering. After this fix, all tests succeeded on my PowerBook.
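
The byte-order difference itself is easy to observe from the shell. The sketch below does not reproduce the fix in SequenceEncoder.cpp. It only shows how the same two bytes are read as different integers on the two kinds of processor. The function name host_byte_order is hypothetical.

```shell
# Report the host byte order by reading the two bytes 0x01 0x02 as one
# 16-bit unsigned integer in the machine's native order.
host_byte_order() {
    value=$(printf '\001\002' | od -An -tu2 | tr -d ' \n')
    case "$value" in
        513) echo "little-endian (least significant byte first)" ;;  # 0x0201
        258) echo "big-endian (most significant byte first)" ;;      # 0x0102
        *)   echo "unknown" ;;
    esac
}
```

Code that packs sequence data into integers and then reads it back byte by byte has to account for this difference, which is the kind of problem described above.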

When working with very large sequences (e.g., human chromosome 2, which is more than 243,000,000 bp long) and small word lengths (e.g., -wl 5), an integer variable overflowed, and as a result no alignments were reported. The symptom is the negative hit limit in the output below.

$ ssaha alu1.fa chr2.fa -sl 1 -wl 5 -pf -mp 25 -da 0
[... output omitted ...]
Info: would expect 57984.2 hits per word for a random database of this size.
Info: will ignore hits on words that occur more than -2147483648 times
 in the database.

This bug occurred on line 785 of SSAHAMain.cpp. The original code was:

  queryParams.maxStore=1+(int)(expectedNumHits*queryParams.maxStore);

The corrected code is:

  queryParams.maxStore=(int)(expectedNumHits*queryParams.maxStore);

I also corrected the initialization of the defaultParams.maxToStore value from 100000 to 10000 on line 134 of SSAHAMain.h so that it would match the documentation, which states that the default value is 10000.
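
The size of the numbers involved can be checked with a little arithmetic. The sketch below is an illustration only: product_fits_int32 is a hypothetical helper, and the multipliers used in the example are assumptions chosen for illustration, since this page does not state which value of queryParams.maxStore was in effect for the run shown above.

```shell
# Hypothetical helper: succeed if the product of two numbers fits in a
# 32-bit signed int (range -2147483648 to 2147483647), fail otherwise.
product_fits_int32() {
    awk -v a="$1" -v b="$2" \
        'BEGIN { p = a * b; exit !(p <= 2147483647 && p >= -2147483648) }'
}
```

With the expected hit count of 57984.2 from the output above, "product_fits_int32 57984.2 100000" fails, because the product is 5798420000, while "product_fits_int32 57984.2 10000" succeeds, because the product is 579842000.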

Compiler Warnings

When I updated the software, I eliminated nearly all warnings produced by the -Wall flag for the GCC 4.0.1 C++ compiler available with Mac OS X 10.4. These included warnings about unused variables, a warning about a variable that was possibly used before it was initialized, warnings about the initialization order of member variables in objects, warnings about missing virtual destructors for base classes, and those annoying warnings about the comparison of signed with unsigned variables.

There is only one type of warning that remains. Some of the code files use the deprecated header file strstream. I did not attempt to fix this problem since I discovered it could involve considerable changes in the code for handling character strings.
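
To see which files are affected, the source tree can be searched for the include line. This is a sketch rather than something shipped with SSAHA. The function name find_strstream_includes is hypothetical, and it relies on the recursive -r option, which common grep implementations provide.

```shell
# Hypothetical helper: list source files under a directory that include
# the deprecated <strstream> (or <strstream.h>) header.
find_strstream_includes() {
    grep -rlE '^[[:space:]]*#[[:space:]]*include[[:space:]]*<strstream(\.h)?>' "$1"
}
```

For the layout used on this page, the call would be "find_strstream_includes /home/challing/SSAHA/ssaha-patched".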

The warning produced by GCC is:

In file included from /usr/include/c++/4.0.0/backward/strstream:51,
                 from ssaha-patched/SequenceReader/SequenceReaderFilter.cpp:34:
/usr/include/c++/4.0.0/backward/backward_warning.h:32:2: warning:
#warning This file includes at least one deprecated or antiquated header.
Please consider using one of the 32 headers found in section 17.4.1.2 of
the C++ standard. Examples include substituting the <X> header for the <X.h>
header for C++ includes, or <iostream> instead of the deprecated header
<iostream.h>. To disable this warning use -Wno-deprecated.

EnsemblServer

I did not modify the code in the EnsemblServer directory.

Using the GNU diff and patch Utilities

For more information on using the GNU diff and patch utilities for creating and using patch files, see http://www.network-theory.co.uk/docs/diff/diff_84.html.

I set up two code trees: ssaha-3.1c, containing the original code as distributed from the Sanger Institute web site, and ssaha-3.1d, containing the modified code. I created the patch file with the following command:

$ diff -ur ssaha-3.1c/ ssaha-3.1d/ > ssaha_v31c_patch.diff
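
Because the diff was taken from the directory above the two trees, the file names recorded in the patch begin with ssaha-3.1c/ or ssaha-3.1d/. The -p1 argument tells patch to strip that first component so the remaining path is relative to the directory where patch is run. The sketch below imitates that step for a single file name. The function name strip_first_component is hypothetical and is not part of patch.

```shell
# Illustration of what -p1 does to a file name found in the patch:
# it removes everything up to and including the first slash.
strip_first_component() {
    printf '%s\n' "${1#*/}"
}
```

For example, "strip_first_component ssaha-3.1d/Binary/makefile" prints Binary/makefile, which matches the names shown in the patch output in Step 4.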