Beginning Perl Lesson 10

Introduction to regular expressions

A regular expression is a combination of characters that defines a pattern used for searching text. Regular expressions are a powerful method for finding and extracting information.

In this lesson, we’ll learn the basics of regular expressions and begin to use them to parse the output produced by NCBI’s BLAST sequence alignment software. Our goal is to begin extracting information from BLAST output. The extracted information might be used to report a result, build a web page, or populate a database.

I used BLASTP to find protein sequence alignments among a small number of proteins. First, let’s look at the structure of the output file, bsub1.blastp.txt. Here is the first part of the output:

BLASTP 2.2.13 [Nov-27-2005]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= gi|143816|gb|AAA20869.1| AroE [Bacillus subtilis]
         (428 letters)

Database: bsub1.fasta
           7 sequences; 2750 total letters

Searching.......done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

gi|143816|gb|AAA20869.1| AroE [Bacillus subtilis]                     799   0.0
gi|74491472|gb|EAO54687.1| 3-phosphoshikimate 1-carboxyvinyltran...   544   e-158
gi|15614230|ref|NP_242533.1| 3-phosphoshikimate 1-carboxyvinyltr...   532   e-155
gi|49241782|emb|CAG40473.1| 3-phosphoshikimate 1-carboxyvinyltra...   336   6e-96

>gi|143816|gb|AAA20869.1| AroE [Bacillus subtilis]
          Length = 428

 Score =  799 bits (2064), Expect = 0.0
 Identities = 415/428 (96%), Positives = 415/428 (96%)

Query: 1   MKRDKVQTLHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVH 60
           MKRDKVQTLHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVH
Sbjct: 1   MKRDKVQTLHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVH 60

Query: 61  IEQSSSDVVIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRP 120
           IEQSSSDVVIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRP
Sbjct: 61  IEQSSSDVVIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRP 120

Query: 121 MKRVTEPLKKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEG 180
           MKRVTEPLKKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEG
Sbjct: 121 MKRVTEPLKKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEG 180

Query: 181 TTTVTEPHKSRDHTERMLSAFGVKLSEDQTSVSIAGGQKLTAADIFVPGDISSAAFFLAA 240
           TTTVTEPHKSRDHTERMLSAFGVKLSEDQTSVSIAGGQKLTAADIFVPGDISSAAFFLAA
Sbjct: 181 TTTVTEPHKSRDHTERMLSAFGVKLSEDQTSVSIAGGQKLTAADIFVPGDISSAAFFLAA 240

Query: 241 GAMVPNSRIVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIETSSLKAV 300
           GAMVPNSRIVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIETSSLKAV
Sbjct: 241 GAMVPNSRIVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIETSSLKAV 300

Query: 301 EIGGXXXXXXXXXXXXXALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEP 360
           EIGG             ALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEP
Sbjct: 301 EIGGDIIPRLIDEIPIIALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEP 360

Query: 361 TADGMKVYGKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEH 420
           TADGMKVYGKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEH
Sbjct: 361 TADGMKVYGKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEH 420

Query: 421 LNKLSKKS 428
           LNKLSKKS
Sbjct: 421 LNKLSKKS 428


>gi|74491472|gb|EAO54687.1| 3-phosphoshikimate
           1-carboxyvinyltransferase [Bacillus thuringiensis
           serovar israelensis ATCC 35646]
          Length = 432

 Score =  544 bits (1401), Expect = e-158
 Identities = 274/418 (65%), Positives = 330/418 (78%), Gaps = 1/418 (0%)

Query: 9   LHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVHIEQSSSDV 68
           L+G I IPGDKSISHR+VMFG++A G TT+K FL GADCLSTI CF++MGV I Q+  +V
Sbjct: 16  LNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISCFKEMGVEITQNGDEV 75

Query: 69  VIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRPMKRVTEPL 128
            + GKG++ L+EP+++LDVGNSGTTIRLM GILA  PF+S V GDESIAKRPMKRVT PL
Sbjct: 76  TVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGDESIAKRPMKRVTNPL 135

Query: 129 KKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEGTTTVTEPH 188
           K+MGA IDGR  G FTPL++ G  LK I+Y+SPVASAQ+KSA+LLAGL+AEG T VTEPH
Sbjct: 136 KQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILLAGLRAEGVTAVTEPH 195

Query: 189 KSRDHTERMLSAFGVKLSEDQTSVSIAGGQKLTAADIFVPGDISSAAFFLAAGAMVPNSR 248
            SRDHTERML AFGVK++ +  +V ++GGQKLTA DI VPGD+SSAAFFL AGA++PNS+
Sbjct: 196 ISRDHTERMLEAFGVKVTREGKTVKLSGGQKLTATDIQVPGDVSSAAFFLVAGAIIPNSK 255

Query: 249 IVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIETSSLKAVEIGGXXXX 308
           ++L+NVG+NPTRTGIIDVL+ MGA   I+P  +  +EP  ++ IETSSLK +EIGG
Sbjct: 256 LILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIETSSLKGIEIGGDIIP 315

Query: 309 XXXXXXXXXALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEPTADGMKVY 368
                    AL ATQAEG TVI+DA ELKVKETNRIDTVV+EL KLGA IE T DGM +Y
Sbjct: 316 RLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTKLGARIEATDDGMIIY 375

Query: 369 GKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEHLNKLSK 426
           GK  LKG   V+S+GDHRIGMML IA C+ E  I IE  +A+ VSYPTFF+ L KL+K
Sbjct: 376 GKSALKGN-TVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSYPTFFDELQKLAK 432


>gi|15614230|ref|NP_242533.1| 3-phosphoshikimate
           1-carboxyvinyltransferase [Bacillus halodurans C-125]
          Length = 431

 Score =  532 bits (1371), Expect = e-155
 Identities = 273/419 (65%), Positives = 320/419 (76%)

Query: 9   LHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVHIEQSSSDV 68
           L G I +PGDKSISHR+VMFGALA GTTTV+ FLPGADCLSTI CF+K+GV IEQ+   V
Sbjct: 13  LKGTIKVPGDKSISHRAVMFGALAKGTTTVEGFLPGADCLSTISCFQKLGVSIEQAEERV 72

Query: 69  VIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRPMKRVTEPL 128
            + GKG D L+EP  +LDVGNSGTT RL+LGIL+  PF+S + GDESI KRPMKRVTEPL
Sbjct: 73  TVKGKGWDGLREPSDILDVGNSGTTTRLILGILSTLPFHSVIIGDESIGKRPMKRVTEPL 132

Query: 129 KKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEGTTTVTEPH 188
           K MGA+IDGR  G  TPLS+ G  LKGID+ SPVASAQ+KSA+LLAGL+AEG T+VTEP
Sbjct: 133 KSMGAQIDGRDHGNLTPLSIRGGQLKGIDFHSPVASAQMKSAILLAGLRAEGKTSVTEPA 192

Query: 189 KSRDHTERMLSAFGVKLSEDQTSVSIAGGQKLTAADIFVPGDISSAAFFLAAGAMVPNSR 248
           K+RDHTERML AFGV + +D  +VSI GGQ LT   + VPGDISSAAFFL AGAMVP+SR
Sbjct: 193 KTRDHTERMLEAFGVNIEKDGLTVSIEGGQMLTGQHVVVPGDISSAAFFLVAGAMVPHSR 252

Query: 249 IVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIETSSLKAVEIGGXXXX 308
           I L NVG+NPTR GI++VL+ MGA L ++     G EP  DL IETS L+ VEIGG
Sbjct: 253 ITLTNVGINPTRAGILEVLKQMGATLAMENERVQGGEPVADLTIETSVLQGVEIGGDIIP 312

Query: 309 XXXXXXXXXALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEPTADGMKVY 368
                    A+LATQA G TVIKDA ELKVKETNRIDTVVSEL KLGA I  T DGM +
Sbjct: 313 RLIDEIPIIAVLATQASGRTVIKDAEELKVKETNRIDTVVSELTKLGASIHATDDGMIIE 372

Query: 369 GKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEHLNKLSKK 427
           G   LKGG  VSSHGDHRIGM + IA+ + E+P+ +E T+AI VSYP+FF+HL++L  +
Sbjct: 373 GPTPLKGGVTVSSHGDHRIGMAMAIAALLAEKPVTVEGTEAIAVSYPSFFDHLDRLKSE 431


>gi|49241782|emb|CAG40473.1| 3-phosphoshikimate
           1-carboxyvinyltransferase [Staphylococcus aureus subsp.
           aureus MRSA252]
          Length = 432

 Score =  336 bits (861), Expect = 6e-96
 Identities = 185/422 (43%), Positives = 261/422 (61%), Gaps = 6/422 (1%)

Query: 9   LHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVHIEQSSSDV 68
           L GEI +PGDKS++HR++M  +LA G +T+   L G DC  T+D FR +GV I++    +
Sbjct: 13  LKGEIEVPGDKSMTHRAIMLASLAEGVSTIYKPLLGEDCRRTMDIFRLLGVDIKEDEDKL 72

Query: 69  VIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRPMKRVTEPL 128
           V++  G  A K P  +L  GNSGTT RL+ G+L+G    S ++GD SI KRPM RV  PL
Sbjct: 73  VVNSPGYKAFKTPHQVLYTGNSGTTTRLLAGLLSGLGIESVLSGDVSIGKRPMDRVLRPL 132

Query: 129 KKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEGTTTVTEPH 188
           K M A I+G     +TPL +  + +KGI+Y   VASAQ+KSA+L A L ++  T + E
Sbjct: 133 KSMNANIEG-IEDNYTPLIIKPSVIKGINYKMEVASAQVKSAILFASLFSKEATIIKELD 191

Query: 189 KSRDHTERMLSAFGVKLSEDQTSVSI--AGGQKLTAADIFVPGDISSAAFFLAAGAMVPN 246
            SR+HTE M   F + +  +  S++      + +  AD  VPGDISSAAFF+ A  + P
Sbjct: 192 VSRNHTETMFRHFNIPIEAEGLSITTIPEAIRYIKPADFHVPGDISSAAFFIVAALITPG 251

Query: 247 SRIVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIE-TSSLKAVEIGGX 305
           S + + NVG+NPTR+GIID+++ MG  +++  +  +GAEP   + I+ T  L+ ++I G
Sbjct: 252 SDVTIHNVGINPTRSGIIDIVEKMGGNIQLF-NQTTGAEPTASIRIQYTPMLQPIKIEGE 310

Query: 306 XXXXXXXXXXXXALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEPTADGM 365
                       ALL TQA GT+ IKDA ELKVKETNRIDT    L  LG E++PT DG+
Sbjct: 311 LVPKAIDELPVIALLCTQAVGTSTIKDAEELKVKETNRIDTTADMLNLLGFELQPTNDGL 370

Query: 366 KVYGKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEHLNKLS 425
            ++  +  K  A V S  DHRIGMML +AS ++ EP++I+  DA++VS+P F   L  L
Sbjct: 371 IIHPSE-FKTNATVDSLTDHRIGMMLAVASLLSSEPVKIKQFDAVNVSFPGFLPKLKLLE 429

Query: 426 KK 427
            +
Sbjct: 430 NE 431


  Database: bsub1.fasta
    Posted date:  Jan 1, 2006  8:45 PM
  Number of letters in database: 2750
  Number of sequences in database:  7

Lambda     K      H
   0.314    0.132    0.364

Gapped
Lambda     K      H
   0.267   0.0410    0.140


Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 3470
Number of Sequences: 7
Number of extensions: 133
Number of successful extensions: 12
Number of sequences better than 1.0e-20: 4
Number of HSP's better than  0.0 without gapping: 4
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 0
Number of HSP's gapped (non-prelim): 4
length of query: 428
length of database: 2750
effective HSP length: 45
effective length of query: 383
effective length of database: 2435
effective search space:   932605
effective search space used:   932605
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 42 (22.0 bits)
S2: 212 (86.3 bits)
BLASTP 2.2.13 [Nov-27-2005]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= gi|74491472|gb|EAO54687.1| 3-phosphoshikimate
1-carboxyvinyltransferase [Bacillus thuringiensis serovar israelensis
ATCC 35646]
         (432 letters)

Database: bsub1.fasta
           7 sequences; 2750 total letters

Searching.......done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

gi|74491472|gb|EAO54687.1| 3-phosphoshikimate 1-carboxyvinyltran...   835   0.0
gi|143816|gb|AAA20869.1| AroE [Bacillus subtilis]                     573   e-167
gi|15614230|ref|NP_242533.1| 3-phosphoshikimate 1-carboxyvinyltr...   557   e-162
gi|49241782|emb|CAG40473.1| 3-phosphoshikimate 1-carboxyvinyltra...   353   e-101

>gi|74491472|gb|EAO54687.1| 3-phosphoshikimate
           1-carboxyvinyltransferase [Bacillus thuringiensis
           serovar israelensis ATCC 35646]
          Length = 432

 Score =  835 bits (2158), Expect = 0.0
 Identities = 432/432 (100%), Positives = 432/432 (100%)

Query: 1   MKNVKERTIQPVNNGLNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISC 60
           MKNVKERTIQPVNNGLNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISC
Sbjct: 1   MKNVKERTIQPVNNGLNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISC 60

Query: 61  FKEMGVEITQNGDEVTVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGD 120
           FKEMGVEITQNGDEVTVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGD
Sbjct: 61  FKEMGVEITQNGDEVTVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGD 120

Query: 121 ESIAKRPMKRVTNPLKQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILL 180
           ESIAKRPMKRVTNPLKQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILL
Sbjct: 121 ESIAKRPMKRVTNPLKQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILL 180

Query: 181 AGLRAEGVTAVTEPHISRDHTERMLEAFGVKVTREGKTVKLSGGQKLTATDIQVPGDVSS 240
           AGLRAEGVTAVTEPHISRDHTERMLEAFGVKVTREGKTVKLSGGQKLTATDIQVPGDVSS
Sbjct: 181 AGLRAEGVTAVTEPHISRDHTERMLEAFGVKVTREGKTVKLSGGQKLTATDIQVPGDVSS 240

Query: 241 AAFFLVAGAIIPNSKLILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIE 300
           AAFFLVAGAIIPNSKLILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIE
Sbjct: 241 AAFFLVAGAIIPNSKLILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIE 300

Query: 301 TSSLKGIEIGGDIIPRLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTK 360
           TSSLKGIEIGGDIIPRLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTK
Sbjct: 301 TSSLKGIEIGGDIIPRLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTK 360

Query: 361 LGARIEATDDGMIIYGKSALKGNTVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSY 420
           LGARIEATDDGMIIYGKSALKGNTVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSY
Sbjct: 361 LGARIEATDDGMIIYGKSALKGNTVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSY 420

Query: 421 PTFFDELQKLAK 432
           PTFFDELQKLAK
Sbjct: 421 PTFFDELQKLAK 432


>gi|143816|gb|AAA20869.1| AroE [Bacillus subtilis]
          Length = 428

 Score =  573 bits (1477), Expect = e-167
 Identities = 286/418 (68%), Positives = 343/418 (82%), Gaps = 1/418 (0%)

Query: 16  LNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISCFKEMGVEITQNGDEV 75
           L+G I IPGDKSISHR+VMFG++A G TT+K FL GADCLSTI CF++MGV I Q+  +V
Sbjct: 9   LHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVHIEQSSSDV 68

Query: 76  TVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGDESIAKRPMKRVTNPL 135
            + GKG++ L+EP+++LDVGNSGTTIRLM GILA  PF+S V GDESIAKRPMKRVT PL
Sbjct: 69  VIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRPMKRVTEPL 128

Query: 136 KQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILLAGLRAEGVTAVTEPH 195
           K+MGA IDGR  G FTPL++ G  LK I+Y+SPVASAQ+KSA+LLAGL+AEG T VTEPH
Sbjct: 129 KKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEGTTTVTEPH 188

Query: 196 ISRDHTERMLEAFGVKVTREGKTVKLSGGQKLTATDIQVPGDVSSAAFFLVAGAIIPNSK 255
            SRDHTERML AFGVK++ +  +V ++GGQKLTA DI VPGD+SSAAFFL AGA++PNS+
Sbjct: 189 KSRDHTERMLSAFGVKLSEDQTSVSIAGGQKLTAADIFVPGDISSAAFFLAAGAMVPNSR 248

Query: 256 LILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIETSSLKGIEIGGDIIP 315
           ++L+NVG+NPTRTGIIDVL+ MGA   I+P  +  +EP  ++ IETSSLK +EIGGDIIP
Sbjct: 249 IVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIETSSLKAVEIGGDIIP 308

Query: 316 RLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTKLGARIEATDDGMIIY 375
           RLIDEIP+IAL ATQAEG TVI+DA ELKVKETNRIDTVV+EL KLGA IE T DGM +Y
Sbjct: 309 RLIDEIPIIALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEPTADGMKVY 368

Query: 376 GKSALKGN-TVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSYPTFFDELQKLAK 432
           GK  LKG   V+S+GDHRIGMML IA C+ E  I IE  +A+ VSYPTFF+ L KL+K
Sbjct: 369 GKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEHLNKLSK 426


>gi|15614230|ref|NP_242533.1| 3-phosphoshikimate
           1-carboxyvinyltransferase [Bacillus halodurans C-125]
          Length = 431

 Score =  557 bits (1436), Expect = e-162
 Identities = 282/428 (65%), Positives = 343/428 (80%), Gaps = 1/428 (0%)

Query: 4   VKERTIQPVNNGLNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISCFKE 63
           ++ +T+ P   GL G I +PGDKSISHRAVMFG++A+G TT++GFL GADCLSTISCF++
Sbjct: 1   MENKTVIPHAKGLKGTIKVPGDKSISHRAVMFGALAKGTTTVEGFLPGADCLSTISCFQK 60

Query: 64  MGVEITQNGDEVTVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGDESI 123
           +GV I Q  + VTV GKG +GL+EP  +LDVGNSGTT RL+ GIL+  PF S + GDESI
Sbjct: 61  LGVSIEQAEERVTVKGKGWDGLREPSDILDVGNSGTTTRLILGILSTLPFHSVIIGDESI 120

Query: 124 AKRPMKRVTNPLKQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILLAGL 183
            KRPMKRVT PLK MGA IDGR+ G  TPL+IRGG LK I++ SPVASAQ+KSAILLAGL
Sbjct: 121 GKRPMKRVTEPLKSMGAQIDGRDHGNLTPLSIRGGQLKGIDFHSPVASAQMKSAILLAGL 180

Query: 184 RAEGVTAVTEPHISRDHTERMLEAFGVKVTREGKTVKLSGGQKLTATDIQVPGDVSSAAF 243
           RAEG T+VTEP  +RDHTERMLEAFGV + ++G TV + GGQ LT   + VPGD+SSAAF
Sbjct: 181 RAEGKTSVTEPAKTRDHTERMLEAFGVNIEKDGLTVSIEGGQMLTGQHVVVPGDISSAAF 240

Query: 244 FLVAGAIIPNSKLILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIETSS 303
           FLVAGA++P+S++ L NVG+NPTR GI++VL++MGAT  +E       EP A++TIETS
Sbjct: 241 FLVAGAMVPHSRITLTNVGINPTRAGILEVLKQMGATLAMENERVQGGEPVADLTIETSV 300

Query: 304 LKGIEIGGDIIPRLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTKLGA 363
           L+G+EIGGDIIPRLIDEIP+IA+ ATQA G TVI+DA ELKVKETNRIDTVV+ELTKLGA
Sbjct: 301 LQGVEIGGDIIPRLIDEIPIIAVLATQASGRTVIKDAEELKVKETNRIDTVVSELTKLGA 360

Query: 364 RIEATDDGMIIYGKSALKGN-TVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSYPT 422
            I ATDDGMII G + LKG  TV+S+GDHRIGM +AIA  LAE  + +E  EA+ VSYP+
Sbjct: 361 SIHATDDGMIIEGPTPLKGGVTVSSHGDHRIGMAMAIAALLAEKPVTVEGTEAIAVSYPS 420

Query: 423 FFDELQKL 430
           FFD L +L
Sbjct: 421 FFDHLDRL 428


>gi|49241782|emb|CAG40473.1| 3-phosphoshikimate
           1-carboxyvinyltransferase [Staphylococcus aureus subsp.
           aureus MRSA252]
          Length = 432

 Score =  353 bits (905), Expect = e-101
 Identities = 199/430 (46%), Positives = 273/430 (63%), Gaps = 6/430 (1%)

Query: 4   VKERTIQPVNNGLNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISCFKE 63
           V E+ I  ++  L G I +PGDKS++HRA+M  S+AEG +TI   L G DC  T+  F+
Sbjct: 2   VNEQIID-ISGPLKGEIEVPGDKSMTHRAIMLASLAEGVSTIYKPLLGEDCRRTMDIFRL 60

Query: 64  MGVEITQNGDEVTVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGDESI 123
           +GV+I ++ D++ V   G +  + P  VL  GNSGTT RL++G+L+     S + GD SI
Sbjct: 61  LGVDIKEDEDKLVVNSPGYKAFKTPHQVLYTGNSGTTTRLLAGLLSGLGIESVLSGDVSI 120

Query: 124 AKRPMKRVTNPLKQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILLAGL 183
            KRPM RV  PLK M ANI+G E+  +TPL I+   +K I Y   VASAQVKSAIL A L
Sbjct: 121 GKRPMDRVLRPLKSMNANIEGIEDN-YTPLIIKPSVIKGINYKMEVASAQVKSAILFASL 179

Query: 184 RAEGVTAVTEPHISRDHTERMLEAFGVKVTREGKTVKL--SGGQKLTATDIQVPGDVSSA 241
            ++  T + E  +SR+HTE M   F + +  EG ++       + +   D  VPGD+SSA
Sbjct: 180 FSKEATIIKELDVSRNHTETMFRHFNIPIEAEGLSITTIPEAIRYIKPADFHVPGDISSA 239

Query: 242 AFFLVAGAIIPNSKLILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIE- 300
           AFF+VA  I P S + + NVG+NPTR+GIID++EKMG    +     GA EP A+I I+
Sbjct: 240 AFFIVAALITPGSDVTIHNVGINPTRSGIIDIVEKMGGNIQLFNQTTGA-EPTASIRIQY 298

Query: 301 TSSLKGIEIGGDIIPRLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTK 360
           T  L+ I+I G+++P+ IDE+PVIAL  TQA G + I+DA ELKVKETNRIDT    L
Sbjct: 299 TPMLQPIKIEGELVPKAIDELPVIALLCTQAVGTSTIKDAEELKVKETNRIDTTADMLNL 358

Query: 361 LGARIEATDDGMIIYGKSALKGNTVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSY 420
           LG  ++ T+DG+II+        TV+S  DHRIGMMLA+A  L+   + I+  +AV VS+
Sbjct: 359 LGFELQPTNDGLIIHPSEFKTNATVDSLTDHRIGMMLAVASLLSSEPVKIKQFDAVNVSF 418

Query: 421 PTFFDELQKL 430
           P F  +L+ L
Sbjct: 419 PGFLPKLKLL 428


  Database: bsub1.fasta
    Posted date:  Jan 1, 2006  8:45 PM
  Number of letters in database: 2750
  Number of sequences in database:  7

Lambda     K      H
   0.315    0.134    0.370

Gapped
Lambda     K      H
   0.267   0.0410    0.140


Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 3539
Number of Sequences: 7
Number of extensions: 176
Number of successful extensions: 9
Number of sequences better than 1.0e-20: 4
Number of HSP's better than  0.0 without gapping: 4
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 0
Number of HSP's gapped (non-prelim): 4
length of query: 432
length of database: 2750
effective HSP length: 45
effective length of query: 387
effective length of database: 2435
effective search space:   942345
effective search space used:   942345
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 42 (22.0 bits)
S2: 213 (86.7 bits)

Simple pattern matching

If we look at the structure of a BLAST output file, we can see that for each query sequence, the output begins with a line containing the name of the BLAST program used to perform the alignment (here, BLASTP), the version string (here, 2.2.13), and the version date (here, [Nov-27-2005]). The entire line looks like this:

BLASTP 2.2.13 [Nov-27-2005]

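Simple pattern matching requires three things: a text string to search, the binding operator =~, and the match operator m// containing the pattern to search for.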

Let’s search the string "BLASTP 2.2.13 [Nov-27-2005]" for "BLASTP". We do it like this:

    #   Create a text string in which to find a pattern.

    my( $textString ) = 'BLASTP 2.2.13 [Nov-27-2005]';

    #   Search the string for a pattern.

    if ( $textString =~ m/BLASTP/ )
    {
        print( STDOUT "Pattern found!\n" );
    }
    else
    {
        print( STDOUT "Pattern not found!\n" );
    }

You can see from this example that the pattern we’re searching for goes inside the // part of the m// operator. The line of code

    if ( $textString =~ m/BLASTP/ )

can be read as:

if $textString contains a match with the regular expression "BLASTP"
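
As a small sketch of how the match behaves, the example below wraps the same test in a helper subroutine and tries it on a few strings. The name isBlastpLine is invented here for illustration and is not part of Perl. The sketch relies on three standard features of Perl: matching is case-sensitive by default, the /i modifier makes it case-insensitive, and !~ is the negated form of =~.

```perl
use strict;
use warnings;

#   isBlastpLine is a hypothetical helper invented for this sketch.
#   It returns 1 if the text contains a match for "BLASTP", otherwise 0.

sub isBlastpLine
{
    my( $text ) = @_;

    return ( $text =~ m/BLASTP/ ) ? 1 : 0;
}

my( $textString ) = 'BLASTP 2.2.13 [Nov-27-2005]';

#   The pattern matches the uppercase program name.

print( STDOUT isBlastpLine( $textString ), "\n" );          # prints 1

#   Matching is case-sensitive, so lowercase text does not match.

print( STDOUT isBlastpLine( 'blastp 2.2.13' ), "\n" );      # prints 0

#   The /i modifier asks for a case-insensitive match.

if ( 'blastp 2.2.13' =~ m/BLASTP/i )
{
    print( STDOUT "Case-insensitive match found!\n" );
}

#   The !~ operator is true when the pattern does NOT match.

if ( $textString !~ m/BLASTN/ )
{
    print( STDOUT "No BLASTN here.\n" );
}
```

Putting the test in a subroutine is only a convenience for trying several strings; the if statement shown earlier does the same job inline.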

Now that we know the basic building blocks for regular expression searching, here’s a script that can serve as the basis for scanning a text file in order to extract information.

#!/usr/bin/perl
#
#   parseBlast1.pl
#   02-Jan-2006
#
#   Conrad Halling
#   conrad.halling@sphaerula.com

use strict;
use warnings;

    my( $dataLine );

    #   Check for an argument containing the name of the BLAST output file.

    if ( scalar( @ARGV ) != 1 )
    {
        die(
            "\n",
            "      Use: perl $0 blastOutputFileName\n",
            "  Example: perl $0 bsub1.blastp.txt\n",
            "\n" );
    }

    #   Open the BLASTP output file. The file name comes from the first
    #   command-line argument.

    if ( ! open( BLASTFILE, '<', $ARGV[ 0 ] ) )
    {
        die( "\n  Can't open file '$ARGV[ 0 ]' for reading: $!.\n" );
    }

    #   Read the file line by line. If we find a line that contains
    #   "BLASTP" in it, then print that line to STDOUT.

    while ( defined( $dataLine = <BLASTFILE> ) )
    {
        if ( $dataLine =~ m/BLASTP/ )
        {
            print( STDOUT $dataLine );
        }
    }

    close( BLASTFILE );

The output from the script, when run against the bsub1.blastp.txt output file, is:

> perl parseBlast1.pl bsub1.blastp.txt
BLASTP 2.2.13 [Nov-27-2005]
BLASTP 2.2.13 [Nov-27-2005]
BLASTP 2.2.13 [Nov-27-2005]
BLASTP 2.2.13 [Nov-27-2005]
BLASTP 2.2.13 [Nov-27-2005]
BLASTP 2.2.13 [Nov-27-2005]
BLASTP 2.2.13 [Nov-27-2005]
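
Because the report for each query begins with one BLASTP line, counting the matching lines is one way to count the query reports in a file; the seven lines printed above suggest seven reports. The following is a minimal sketch of that idea, using the same line-by-line loop. The subroutine name countBlastpLines is invented for this example, and the input is a short in-memory string standing in for bsub1.blastp.txt so that the sketch runs on its own.

```perl
use strict;
use warnings;

#   countBlastpLines is a hypothetical helper invented for this sketch.
#   It reads lines from an open file handle and returns the number of
#   lines that contain "BLASTP".

sub countBlastpLines
{
    my( $fileHandle ) = @_;
    my( $count ) = 0;
    my( $dataLine );

    while ( defined( $dataLine = <$fileHandle> ) )
    {
        if ( $dataLine =~ m/BLASTP/ )
        {
            $count++;
        }
    }

    return $count;
}

#   A tiny stand-in for the BLAST output file (an assumption made so
#   that the example is self-contained).

my( $sampleText ) =
    "BLASTP 2.2.13 [Nov-27-2005]\n" .
    "Query= gi|143816|gb|AAA20869.1| AroE [Bacillus subtilis]\n" .
    "BLASTP 2.2.13 [Nov-27-2005]\n" .
    "Query= gi|74491472|gb|EAO54687.1| 3-phosphoshikimate\n";

#   Perl can open a reference to a string as if it were a file.

open( my $sampleHandle, '<', \$sampleText )
    or die( "Can't open in-memory file: $!.\n" );

print( STDOUT "BLASTP lines found: ",
    countBlastpLines( $sampleHandle ), "\n" );

close( $sampleHandle );
```

To use the same subroutine on a real file, open the file as parseBlast1.pl does and pass the file handle to countBlastpLines instead of the in-memory handle.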