Archive for the 'Bioinformatics' Category
The 20 June 2008 issue of Science contains a paper by Ari Löytynoja and Nick Goldman titled “Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis.” The authors tackled the difficult problem of how to place insertions and deletions correctly in a multiple sequence alignment. A multiple sequence alignment is typically the input into phylogeny tools that attempt to determine the evolutionary relationship among the sequences. The misplacement of insertions and deletions in a multiple sequence alignment can result in misinterpretations of the relationships among the sequences.
Typical multiple sequence alignment tools, such as CLUSTAL W, MUSCLE, MAFFT, and T-COFFEE, do not handle indels accurately. The authors developed new tools, PRANK and PRANK+F, that take into account the computed phylogeny of the sequences when placing insertions and deletions into the multiple alignment.
In this paper, the authors describe their refinements of the multiple alignment algorithm, and they provide theory and results that demonstrate that their algorithm improves the quality of multiple sequence alignments in a biologically meaningful way. The implications of their results are strongest for nucleotide sequence alignments, but the authors contend that their results are important also for peptide sequence alignments.
Source:
Löytynoja A, Goldman N. 2008. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320:1632-1635. DOI: 10.1126/science.1158395.
Other bloggers who have already commented on this paper include:
The Goldman group maintains a web page listing its publications; there are many other interesting papers given there.
June 29 2008 | Bioinformatics | Comments Off
Utterly bewitched by this dramatic and thrilling panel from an xkcd.com comic strip, I have decided that it is time I learned the Python programming language.

I come to Python as a bioinformatics scientist who is experienced in C, C++, PHP, Perl, and R. For the past two years, I have used Perl for 95% of the software I have written. But there are days when I grow weary of trying to write object-oriented Perl code in a clear and maintainable way. Perl 5 just isn’t a language designed to support object-oriented programming in a natural way. Hence, I thought I would try Python.
This post is Part 1 of what I plan to be a series of posts about the book Learning Python, 3rd Edition, which was written by Mark Lutz and published in October, 2007, by O’Reilly. This part of my review covers “Part I: Getting Started”, the first 62 pages of the book.
In the preface, Mr. Lutz states that Learning Python is intended to be an introduction to the Python programming language for programmers and that it deliberately covers only the core Python language without providing guidance on application programming. Mr. Lutz encourages readers to continue on to Programming Python, 3rd Edition (of which he is also the author).
The preface also contains a three-page section, “Preparing for Python 3.0,” which contains a brain-numbing list of features and changes anticipated in Python 3.0. I believe it would have been more appropriate to place this material in an appendix.
Learning Python is organized into eight major parts:
- Getting Started
- Types and Operations
- Statements and Syntax
- Functions
- Modules
- Classes and OOP (Object Oriented Programming)
- Exceptions and Tools
- Appendixes
The book provides a deliberately bottom-up introduction to Python. This means the reader has to be patient, because Mr. Lutz teaches about each piece of the Python language without showing right away how to use the pieces together to write scripts. Consequently, I have found that I need to jump ahead in the book because I want to begin scripting right away.
Each chapter contains a short quiz, with the answers provided with each quiz. Each part of the book ends with a set of exercises, for which the solutions are provided in Appendix B.
In Chapter 1, Mr. Lutz discusses the strengths and weaknesses of Python. The strengths are many and significant:
- Python is designed to produce readable code, which makes the code easy to maintain and reuse.
- Python is designed up front to support object-oriented programming.
- Python code is more succinct than C++ or Java code.
- Python is portable and runs on Linux/Unix, Mac OS X, and Windows.
- Python comes with a large standard library.
- Python works well with other languages.
- Python is free.
The only weakness that Mr. Lutz acknowledges is that Python sometimes doesn’t run as quickly as compiled C and C++ code. A weakness mentioned by many others is that, unlike in languages with C-like syntax (C, C++, Java, C#, Perl, and PHP), white space in Python code has syntactic significance. Many programmers object to this, and it does seem to generate problems with parsing the code.
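To make the white space point concrete, here is a minimal sketch written for Python 3 rather than the Python 2.5 that the book covers. The function name `classify` and the sample lengths are hypothetical, chosen only for illustration. Indentation alone decides which statements belong to the loop and to each branch, and a block that is not indented is rejected before any code runs.

```python
def classify(lengths, cutoff):
    """Split sequence lengths into short and long lists."""
    short, long_ = [], []
    for n in lengths:
        if n < cutoff:
            short.append(n)   # indented under "if": runs only for short items
        else:
            long_.append(n)   # indented under "else": runs for everything else
    return short, long_       # dedented to function level: runs once, after the loop


# A loop body that is not indented is a compile-time error, not a style problem.
bad_source = "for n in [1, 2]:\nprint(n)\n"
try:
    compile(bad_source, "<example>", "exec")
    indentation_ok = True
except IndentationError:
    indentation_ok = False

print(classify([120, 900, 45], cutoff=100))
print(indentation_ok)
```

In a brace-delimited language, the same mistake in layout would compile and simply look wrong; here the layout is the structure.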
Chapter 1 devotes a lot of space to discussing the philosophy behind Python.
In philosophy, Python adopts a somewhat minimalist approach. This means that although there are usually multiple ways to accomplish a coding task, there is usually just one obvious way…. In the Python way of thinking, explicit is better than implicit, and simple is better than complex.
This discussion provides a sharp contrast to Perl, for which the motto is, “There is more than one way to do it” (TIMTOWTDI). Perl contains many language constructs that make commands implicit rather than explicit, and this can make Perl code difficult to understand.
Mr. Lutz devotes nearly two pages (pp. 16–17) to a sidebar titled “Python is Engineering, Not Art,” in which the author discusses the relative merits of Python and Perl.
The short story is this: you can do everything in Python that you can in Perl, but you can read your code after you do it…. [author’s italics]
The somewhat longer story reflects the backgrounds of the designers of the two languages…. Python’s creator [Guido van Rossum] is a mathematician by training; as such, he produced a language with a high degree of uniformity—its syntax and toolset are remarkably coherent….
By contrast, the creator of the Perl language [Larry Wall] is a linguist, and its design reflects this heritage. There are many ways to accomplish the same tasks in Perl, and language constructs interact in context-sensitive and sometimes quite subtle ways—much like natural language….
But as anyone who has done any substantial code maintenance should be able to attest, freedom of expression is great for art, but lousy for engineering.
I am not going to delve into this debate. Since I have studied both linguistics and mathematics, I can easily see the strengths and weaknesses of both points of view, and this will help me decide when I should use Perl and when I should use Python.
Much of the material in Chapter 1, although interesting, is in my opinion unnecessary for a learning book. The entire section, “Is Python a ‘Scripting Language’?”, seems to me more appropriate for the Programming Python book. In my opinion, a learning book should get to writing code as quickly as possible, with philosophical meditations saved for another book.
Chapter 2 discusses how the Python interpreter works to run Python scripts. The chapter directs the reader to Appendix A, which contains instructions for obtaining and installing Python for Windows, Linux, or Unix. (Mac OS X 10.5 comes with Python 2.5.1 already installed.) The remainder of the chapter provides background information about the Python interpreter and can be omitted by the reader in a hurry.
Chapter 3 describes how to run Python scripts. It begins by teaching the reader how to start Python from the shell prompt in order to use Python’s interactive command line. The chapter patiently explains how to do this for each of the major operating systems and points out problems the reader needs to watch for. Then the chapter explains how to write a short Python script and invoke the script from the command line or by double-clicking its icon, and it explains in detail how to overcome the difficulties with double-clicking an icon in Windows. I found these details in Chapter 3 very useful, and it is clear that Mr. Lutz has learned from his training courses what difficulties the novice user will encounter when first using Python.
Chapter 3 continues with a brief introduction on importing modules using import and the differences between import and reload. The chapter concludes with an introduction to the IDLE user interface and other integrated development environments (IDEs).
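The difference between import and reload can be sketched as follows. This is an illustration under stated assumptions, not code from the book: it targets Python 3, where reload lives in the standard `importlib` module (in the Python 2.5 covered by the book, `reload` is a built-in function), and the module name `demo_settings` and its contents are hypothetical. The script writes a tiny module to a temporary directory, imports it, edits the file, and shows that a second import changes nothing while a reload picks up the edit.

```python
import importlib
import os
import sys
import tempfile

# Skip .pyc caching so the reload below always recompiles the edited source file.
sys.dont_write_bytecode = True

workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "demo_settings.py")  # hypothetical module

with open(module_path, "w") as handle:
    handle.write("greeting = 'hello'\n")

sys.path.insert(0, workdir)
importlib.invalidate_caches()

import demo_settings                 # first import: the file is executed
first = demo_settings.greeting

# Edit the module's source on disk while the program is still running.
with open(module_path, "w") as handle:
    handle.write("greeting = 'goodbye, Perl'\n")

import demo_settings                 # repeated import: reuses the loaded module
after_import = demo_settings.greeting

importlib.reload(demo_settings)      # reload: re-executes the file in place
after_reload = demo_settings.greeting

print(first, after_import, after_reload)
```

The repeated import is cheap precisely because it does nothing new, which is why an interactive session needs reload to see changes made in an editor.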
Part I concludes with a set of very good exercises. The exercises get the reader started with using the Python command line, the IDLE IDE, and the Python documentation available from the command line and from the Python web site.
April 05 2008 | Bioinformatics and Computing | Comments Off
Web Database Applications with PHP and MySQL, 2nd Edition, was written by Hugh E. Williams and David Lane and published by O’Reilly in May, 2004.
I purchased this book in 2005 when I was doing some consulting for a microarray company in Massachusetts, where I was adapting the BioArray Software Environment (BASE 1.2), which was written in PHP, to their process. I had set the book aside until recently, when I was motivated to review PHP so I could modify the WordPress theme that I use for this blog.
PHP is the programming language usually referred to by the P in the acronym LAMP, which stands for Linux Apache MySQL PHP. PHP was designed from the beginning to work closely with the Apache httpd web server and with the MySQL database management system. PHP code is easily embedded into HTML, and this makes it easy for relative novices to use a three-tier architecture for their web sites.
I found Web Database Applications with PHP and MySQL an excellent introduction to PHP and MySQL for someone who is skilled at programming using another language. The introductory material on PHP was just enough to get me started, and I quickly learned to refer to the PHP documentation (which is also excellent) when I needed enlightenment. The book also includes several chapters that give a solid introduction to MySQL 4.1.
The last five chapters of the book are devoted to a complete working example of an online wine store. All source code is available at the authors’ web site, http://www.webdatabasebook.com/. This is an invaluable resource that can serve as the basis for many other PHP projects.
In my opinion, this book is not appropriate for someone who is learning to program or use databases for the first time. Good alternatives might be Learning PHP & MySQL, 2nd Edition, by Michele E. Davis and Jon A. Phillips (published August, 2007, by O’Reilly), and Learning SQL, by Alan Beaulieu (published August, 2005, by O’Reilly). (I read Learning SQL recently, and I recommend it highly. Watch for a review at a later date.)
Since the book was published nearly four years ago, some of the material is dated. In 2004, PHP 5.0 was still in beta; PHP has reached 5.2 by now, and support for PHP 4 is about to end. Similarly, the book recommends using MySQL 4.0 or 4.1, whereas today MySQL 5.0 is very stable. The book provides an excellent set of appendices that explain how to install and configure Apache httpd, MySQL, and PHP for Linux, Windows, and Mac OS X. Again, these instructions are now dated, but they should still provide a useful guide.
I found the index somewhat frustrating to use; some items I was interested in are not found there. These include:
- @, the error control operator
- & and =&, operators used for working with references to variables
- reference, the term referring to a variable that is not passed by value
- define, the function used to create constants
- magic constants, such as __FILE__, __LINE__, __FUNCTION__, __CLASS__, and __METHOD__
I don’t use PHP in my daily work; rather, I use Perl. (Python is the next language I plan to learn, but that’s another story.) I like how PHP has combined arrays and hashes into a single array type. I like PHP’s Boolean values, true and false. And I like how easy it is to embed PHP into HTML; it makes creating a web page dynamically more intuitive than Perl CGI does. But I dislike the inconsistent and ugly naming of many standard functions (for example, strtoupper, the PHP equivalent of Perl’s uc), and this seems to be a common complaint about PHP. I also don’t like the large number of global variables and constants that PHP uses.
There is a BioPHP project (PHP for Bioinformatics), at http://biophp.org/, but it is not as well developed as the BioPerl or BioPython projects. The only large bioinformatics project that I’m aware of that uses PHP is the BioArray Software Environment (BASE) 1.2. Version 2.0 of BASE has been completely rewritten in Java.
March 23 2008 | Bioinformatics and Books and Computing | Comments Off
This is a story of how clueless I can be, but how sometimes, given a sufficient number of opportunities, I can become clueful again.
On 13 March 2007, Bosco Ho wrote a post entitled “Notes to a Young Computational Biologist” on his Trapped in the USA blog. I don’t remember how I happened to come across this post the first time, because I wasn’t reading blogs systematically then, but something about this topic clicked in me, and on 25 March I posted a long comment about some things I thought Dr. Ho had omitted.
Time passed, and in October or November, I received an email from a recruiter who wanted to interest me in a bioinformatics or programming job on the West Coast. I couldn’t figure out where she had heard of me, except that she mentioned in the email that she had seen my name on the boscoh.com web site. This mystified me, because I had forgotten all about the events in March. I did a little poking around on the web site, but I couldn’t find my own name. So I concluded that she was completely mistaken—that she actually wanted to recruit Bosco Ho but had sent me the email in error—and I decided not to respond.
The benefit of this apparent error was that I learned (for what I thought was the first time) about Dr. Ho’s site, which is full of great writing and useful information. I now work with quite a few scientists who are experts at protein structure, but this is a new field for me because I was trained as a DNA jockey (you know—molecular biology, cloning, sequencing, and all that). Trapped in the USA provided me with additional background reading on protein structure.
At the beginning of this year, I decided it was time to start another web site, so I registered sphaerula.com. One of the things my hosting provider offers is an installation of WordPress, and that’s how I got started blogging. Since mid-February, I have been immersed in reading biology and bioinformatics blogs and writing posts of my own. I’ve discovered many wonderful blogs, and I discovered Trapped in the USA for the third time.
On weekends, I have been methodically reading blogs from beginning to end. (This is going to take me a long time as I follow the branches from the blogrolls.) When I began reading Trapped in the USA, I rediscovered Dr. Ho’s “Notes to a Young Computational Biologist” post, and I rediscovered my comments.
So in the course of a little more than 11 months, I’ve made a discovery, a rediscovery, and a rerediscovery. The difference is that now I am fully engaged in blogging and in reading blogs, and this time the discovery will stick with me. I know now that Dr. Ho is a widely respected blogger in the bioinformatics blogging community, and I won’t forget it this time.
By the way, if you happen to be that recruiter, I apologize for not responding. I love the West Coast, having grown up in Oregon and earned my Ph.D. at Berkeley. But I have a good job, I love Boston, and I’m not inclined to move. But I’m happy to recommend a talented computational biologist who is named Bosco Ho.
March 06 2008 | Bioinformatics | 1 Comment »