Introduction to regular expressions
A regular expression is a combination of characters that defines a pattern used for searching text. Regular expressions are a powerful method for finding and extracting information.
In this lesson, we’ll learn the basics of regular expressions and begin to use them to parse the output produced by NCBI’s BLAST sequence alignment software. Our goal is to begin extracting information from BLAST output. The extracted information might be used to report a result, build a web page, or populate a database.
I used BLASTP to find protein sequence alignments among a small number of proteins.
First, let’s look at the structure of the output file, bsub1.blastp.txt. Here is the first part of the output:
BLASTP 2.2.13 [Nov-27-2005]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= gi|143816|gb|AAA20869.1| AroE [Bacillus subtilis]
(428 letters)
Database: bsub1.fasta
7 sequences; 2750 total letters
Searching.......done
Score E
Sequences producing significant alignments: (bits) Value
gi|143816|gb|AAA20869.1| AroE [Bacillus subtilis] 799 0.0
gi|74491472|gb|EAO54687.1| 3-phosphoshikimate 1-carboxyvinyltran... 544 e-158
gi|15614230|ref|NP_242533.1| 3-phosphoshikimate 1-carboxyvinyltr... 532 e-155
gi|49241782|emb|CAG40473.1| 3-phosphoshikimate 1-carboxyvinyltra... 336 6e-96
>gi|143816|gb|AAA20869.1| AroE [Bacillus subtilis]
Length = 428
Score = 799 bits (2064), Expect = 0.0
Identities = 415/428 (96%), Positives = 415/428 (96%)
Query: 1 MKRDKVQTLHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVH 60
MKRDKVQTLHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVH
Sbjct: 1 MKRDKVQTLHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVH 60
Query: 61 IEQSSSDVVIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRP 120
IEQSSSDVVIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRP
Sbjct: 61 IEQSSSDVVIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRP 120
Query: 121 MKRVTEPLKKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEG 180
MKRVTEPLKKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEG
Sbjct: 121 MKRVTEPLKKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEG 180
Query: 181 TTTVTEPHKSRDHTERMLSAFGVKLSEDQTSVSIAGGQKLTAADIFVPGDISSAAFFLAA 240
TTTVTEPHKSRDHTERMLSAFGVKLSEDQTSVSIAGGQKLTAADIFVPGDISSAAFFLAA
Sbjct: 181 TTTVTEPHKSRDHTERMLSAFGVKLSEDQTSVSIAGGQKLTAADIFVPGDISSAAFFLAA 240
Query: 241 GAMVPNSRIVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIETSSLKAV 300
GAMVPNSRIVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIETSSLKAV
Sbjct: 241 GAMVPNSRIVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIETSSLKAV 300
Query: 301 EIGGXXXXXXXXXXXXXALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEP 360
EIGG ALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEP
Sbjct: 301 EIGGDIIPRLIDEIPIIALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEP 360
Query: 361 TADGMKVYGKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEH 420
TADGMKVYGKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEH
Sbjct: 361 TADGMKVYGKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEH 420
Query: 421 LNKLSKKS 428
LNKLSKKS
Sbjct: 421 LNKLSKKS 428
>gi|74491472|gb|EAO54687.1| 3-phosphoshikimate
1-carboxyvinyltransferase [Bacillus thuringiensis
serovar israelensis ATCC 35646]
Length = 432
Score = 544 bits (1401), Expect = e-158
Identities = 274/418 (65%), Positives = 330/418 (78%), Gaps = 1/418 (0%)
Query: 9 LHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVHIEQSSSDV 68
L+G I IPGDKSISHR+VMFG++A G TT+K FL GADCLSTI CF++MGV I Q+ +V
Sbjct: 16 LNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISCFKEMGVEITQNGDEV 75
Query: 69 VIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRPMKRVTEPL 128
+ GKG++ L+EP+++LDVGNSGTTIRLM GILA PF+S V GDESIAKRPMKRVT PL
Sbjct: 76 TVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGDESIAKRPMKRVTNPL 135
Query: 129 KKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEGTTTVTEPH 188
K+MGA IDGR G FTPL++ G LK I+Y+SPVASAQ+KSA+LLAGL+AEG T VTEPH
Sbjct: 136 KQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILLAGLRAEGVTAVTEPH 195
Query: 189 KSRDHTERMLSAFGVKLSEDQTSVSIAGGQKLTAADIFVPGDISSAAFFLAAGAMVPNSR 248
SRDHTERML AFGVK++ + +V ++GGQKLTA DI VPGD+SSAAFFL AGA++PNS+
Sbjct: 196 ISRDHTERMLEAFGVKVTREGKTVKLSGGQKLTATDIQVPGDVSSAAFFLVAGAIIPNSK 255
Query: 249 IVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIETSSLKAVEIGGXXXX 308
++L+NVG+NPTRTGIIDVL+ MGA I+P + +EP ++ IETSSLK +EIGG
Sbjct: 256 LILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIETSSLKGIEIGGDIIP 315
Query: 309 XXXXXXXXXALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEPTADGMKVY 368
AL ATQAEG TVI+DA ELKVKETNRIDTVV+EL KLGA IE T DGM +Y
Sbjct: 316 RLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTKLGARIEATDDGMIIY 375
Query: 369 GKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEHLNKLSK 426
GK LKG V+S+GDHRIGMML IA C+ E I IE +A+ VSYPTFF+ L KL+K
Sbjct: 376 GKSALKGN-TVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSYPTFFDELQKLAK 432
>gi|15614230|ref|NP_242533.1| 3-phosphoshikimate
1-carboxyvinyltransferase [Bacillus halodurans C-125]
Length = 431
Score = 532 bits (1371), Expect = e-155
Identities = 273/419 (65%), Positives = 320/419 (76%)
Query: 9 LHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVHIEQSSSDV 68
L G I +PGDKSISHR+VMFGALA GTTTV+ FLPGADCLSTI CF+K+GV IEQ+ V
Sbjct: 13 LKGTIKVPGDKSISHRAVMFGALAKGTTTVEGFLPGADCLSTISCFQKLGVSIEQAEERV 72
Query: 69 VIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRPMKRVTEPL 128
+ GKG D L+EP +LDVGNSGTT RL+LGIL+ PF+S + GDESI KRPMKRVTEPL
Sbjct: 73 TVKGKGWDGLREPSDILDVGNSGTTTRLILGILSTLPFHSVIIGDESIGKRPMKRVTEPL 132
Query: 129 KKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEGTTTVTEPH 188
K MGA+IDGR G TPLS+ G LKGID+ SPVASAQ+KSA+LLAGL+AEG T+VTEP
Sbjct: 133 KSMGAQIDGRDHGNLTPLSIRGGQLKGIDFHSPVASAQMKSAILLAGLRAEGKTSVTEPA 192
Query: 189 KSRDHTERMLSAFGVKLSEDQTSVSIAGGQKLTAADIFVPGDISSAAFFLAAGAMVPNSR 248
K+RDHTERML AFGV + +D +VSI GGQ LT + VPGDISSAAFFL AGAMVP+SR
Sbjct: 193 KTRDHTERMLEAFGVNIEKDGLTVSIEGGQMLTGQHVVVPGDISSAAFFLVAGAMVPHSR 252
Query: 249 IVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIETSSLKAVEIGGXXXX 308
I L NVG+NPTR GI++VL+ MGA L ++ G EP DL IETS L+ VEIGG
Sbjct: 253 ITLTNVGINPTRAGILEVLKQMGATLAMENERVQGGEPVADLTIETSVLQGVEIGGDIIP 312
Query: 309 XXXXXXXXXALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEPTADGMKVY 368
A+LATQA G TVIKDA ELKVKETNRIDTVVSEL KLGA I T DGM +
Sbjct: 313 RLIDEIPIIAVLATQASGRTVIKDAEELKVKETNRIDTVVSELTKLGASIHATDDGMIIE 372
Query: 369 GKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEHLNKLSKK 427
G LKGG VSSHGDHRIGM + IA+ + E+P+ +E T+AI VSYP+FF+HL++L +
Sbjct: 373 GPTPLKGGVTVSSHGDHRIGMAMAIAALLAEKPVTVEGTEAIAVSYPSFFDHLDRLKSE 431
>gi|49241782|emb|CAG40473.1| 3-phosphoshikimate
1-carboxyvinyltransferase [Staphylococcus aureus subsp.
aureus MRSA252]
Length = 432
Score = 336 bits (861), Expect = 6e-96
Identities = 185/422 (43%), Positives = 261/422 (61%), Gaps = 6/422 (1%)
Query: 9 LHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVHIEQSSSDV 68
L GEI +PGDKS++HR++M +LA G +T+ L G DC T+D FR +GV I++ +
Sbjct: 13 LKGEIEVPGDKSMTHRAIMLASLAEGVSTIYKPLLGEDCRRTMDIFRLLGVDIKEDEDKL 72
Query: 69 VIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRPMKRVTEPL 128
V++ G A K P +L GNSGTT RL+ G+L+G S ++GD SI KRPM RV PL
Sbjct: 73 VVNSPGYKAFKTPHQVLYTGNSGTTTRLLAGLLSGLGIESVLSGDVSIGKRPMDRVLRPL 132
Query: 129 KKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEGTTTVTEPH 188
K M A I+G +TPL + + +KGI+Y VASAQ+KSA+L A L ++ T + E
Sbjct: 133 KSMNANIEG-IEDNYTPLIIKPSVIKGINYKMEVASAQVKSAILFASLFSKEATIIKELD 191
Query: 189 KSRDHTERMLSAFGVKLSEDQTSVSI--AGGQKLTAADIFVPGDISSAAFFLAAGAMVPN 246
SR+HTE M F + + + S++ + + AD VPGDISSAAFF+ A + P
Sbjct: 192 VSRNHTETMFRHFNIPIEAEGLSITTIPEAIRYIKPADFHVPGDISSAAFFIVAALITPG 251
Query: 247 SRIVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIE-TSSLKAVEIGGX 305
S + + NVG+NPTR+GIID+++ MG +++ + +GAEP + I+ T L+ ++I G
Sbjct: 252 SDVTIHNVGINPTRSGIIDIVEKMGGNIQLF-NQTTGAEPTASIRIQYTPMLQPIKIEGE 310
Query: 306 XXXXXXXXXXXXALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEPTADGM 365
ALL TQA GT+ IKDA ELKVKETNRIDT L LG E++PT DG+
Sbjct: 311 LVPKAIDELPVIALLCTQAVGTSTIKDAEELKVKETNRIDTTADMLNLLGFELQPTNDGL 370
Query: 366 KVYGKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEHLNKLS 425
++ + K A V S DHRIGMML +AS ++ EP++I+ DA++VS+P F L L
Sbjct: 371 IIHPSE-FKTNATVDSLTDHRIGMMLAVASLLSSEPVKIKQFDAVNVSFPGFLPKLKLLE 429
Query: 426 KK 427
+
Sbjct: 430 NE 431
Database: bsub1.fasta
Posted date: Jan 1, 2006 8:45 PM
Number of letters in database: 2750
Number of sequences in database: 7
Lambda K H
0.314 0.132 0.364
Gapped
Lambda K H
0.267 0.0410 0.140
Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 3470
Number of Sequences: 7
Number of extensions: 133
Number of successful extensions: 12
Number of sequences better than 1.0e-20: 4
Number of HSP's better than 0.0 without gapping: 4
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 0
Number of HSP's gapped (non-prelim): 4
length of query: 428
length of database: 2750
effective HSP length: 45
effective length of query: 383
effective length of database: 2435
effective search space: 932605
effective search space used: 932605
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 42 (22.0 bits)
S2: 212 (86.3 bits)
BLASTP 2.2.13 [Nov-27-2005]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= gi|74491472|gb|EAO54687.1| 3-phosphoshikimate
1-carboxyvinyltransferase [Bacillus thuringiensis serovar israelensis
ATCC 35646]
(432 letters)
Database: bsub1.fasta
7 sequences; 2750 total letters
Searching.......done
Score E
Sequences producing significant alignments: (bits) Value
gi|74491472|gb|EAO54687.1| 3-phosphoshikimate 1-carboxyvinyltran... 835 0.0
gi|143816|gb|AAA20869.1| AroE [Bacillus subtilis] 573 e-167
gi|15614230|ref|NP_242533.1| 3-phosphoshikimate 1-carboxyvinyltr... 557 e-162
gi|49241782|emb|CAG40473.1| 3-phosphoshikimate 1-carboxyvinyltra... 353 e-101
>gi|74491472|gb|EAO54687.1| 3-phosphoshikimate
1-carboxyvinyltransferase [Bacillus thuringiensis
serovar israelensis ATCC 35646]
Length = 432
Score = 835 bits (2158), Expect = 0.0
Identities = 432/432 (100%), Positives = 432/432 (100%)
Query: 1 MKNVKERTIQPVNNGLNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISC 60
MKNVKERTIQPVNNGLNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISC
Sbjct: 1 MKNVKERTIQPVNNGLNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISC 60
Query: 61 FKEMGVEITQNGDEVTVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGD 120
FKEMGVEITQNGDEVTVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGD
Sbjct: 61 FKEMGVEITQNGDEVTVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGD 120
Query: 121 ESIAKRPMKRVTNPLKQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILL 180
ESIAKRPMKRVTNPLKQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILL
Sbjct: 121 ESIAKRPMKRVTNPLKQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILL 180
Query: 181 AGLRAEGVTAVTEPHISRDHTERMLEAFGVKVTREGKTVKLSGGQKLTATDIQVPGDVSS 240
AGLRAEGVTAVTEPHISRDHTERMLEAFGVKVTREGKTVKLSGGQKLTATDIQVPGDVSS
Sbjct: 181 AGLRAEGVTAVTEPHISRDHTERMLEAFGVKVTREGKTVKLSGGQKLTATDIQVPGDVSS 240
Query: 241 AAFFLVAGAIIPNSKLILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIE 300
AAFFLVAGAIIPNSKLILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIE
Sbjct: 241 AAFFLVAGAIIPNSKLILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIE 300
Query: 301 TSSLKGIEIGGDIIPRLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTK 360
TSSLKGIEIGGDIIPRLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTK
Sbjct: 301 TSSLKGIEIGGDIIPRLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTK 360
Query: 361 LGARIEATDDGMIIYGKSALKGNTVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSY 420
LGARIEATDDGMIIYGKSALKGNTVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSY
Sbjct: 361 LGARIEATDDGMIIYGKSALKGNTVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSY 420
Query: 421 PTFFDELQKLAK 432
PTFFDELQKLAK
Sbjct: 421 PTFFDELQKLAK 432
>gi|143816|gb|AAA20869.1| AroE [Bacillus subtilis]
Length = 428
Score = 573 bits (1477), Expect = e-167
Identities = 286/418 (68%), Positives = 343/418 (82%), Gaps = 1/418 (0%)
Query: 16 LNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISCFKEMGVEITQNGDEV 75
L+G I IPGDKSISHR+VMFG++A G TT+K FL GADCLSTI CF++MGV I Q+ +V
Sbjct: 9 LHGEIHIPGDKSISHRSVMFGALAAGTTTVKNFLPGADCLSTIDCFRKMGVHIEQSSSDV 68
Query: 76 TVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGDESIAKRPMKRVTNPL 135
+ GKG++ L+EP+++LDVGNSGTTIRLM GILA PF+S V GDESIAKRPMKRVT PL
Sbjct: 69 VIHGKGIDALKEPESLLDVGNSGTTIRLMLGILAGRPFYSAVAGDESIAKRPMKRVTEPL 128
Query: 136 KQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILLAGLRAEGVTAVTEPH 195
K+MGA IDGR G FTPL++ G LK I+Y+SPVASAQ+KSA+LLAGL+AEG T VTEPH
Sbjct: 129 KKMGAKIDGRAGGEFTPLSVSGASLKGIDYVSPVASAQIKSAVLLAGLQAEGTTTVTEPH 188
Query: 196 ISRDHTERMLEAFGVKVTREGKTVKLSGGQKLTATDIQVPGDVSSAAFFLVAGAIIPNSK 255
SRDHTERML AFGVK++ + +V ++GGQKLTA DI VPGD+SSAAFFL AGA++PNS+
Sbjct: 189 KSRDHTERMLSAFGVKLSEDQTSVSIAGGQKLTAADIFVPGDISSAAFFLAAGAMVPNSR 248
Query: 256 LILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIETSSLKGIEIGGDIIP 315
++L+NVG+NPTRTGIIDVL+ MGA I+P + +EP ++ IETSSLK +EIGGDIIP
Sbjct: 249 IVLKNVGLNPTRTGIIDVLQNMGAKLEIKPSADSGAEPYGDLIIETSSLKAVEIGGDIIP 308
Query: 316 RLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTKLGARIEATDDGMIIY 375
RLIDEIP+IAL ATQAEG TVI+DA ELKVKETNRIDTVV+EL KLGA IE T DGM +Y
Sbjct: 309 RLIDEIPIIALLATQAEGTTVIKDAAELKVKETNRIDTVVSELRKLGAEIEPTADGMKVY 368
Query: 376 GKSALKGN-TVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSYPTFFDELQKLAK 432
GK LKG V+S+GDHRIGMML IA C+ E I IE +A+ VSYPTFF+ L KL+K
Sbjct: 369 GKQTLKGGAAVSSHGDHRIGMMLGIASCITEEPIEIEHTDAIHVSYPTFFEHLNKLSK 426
>gi|15614230|ref|NP_242533.1| 3-phosphoshikimate
1-carboxyvinyltransferase [Bacillus halodurans C-125]
Length = 431
Score = 557 bits (1436), Expect = e-162
Identities = 282/428 (65%), Positives = 343/428 (80%), Gaps = 1/428 (0%)
Query: 4 VKERTIQPVNNGLNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISCFKE 63
++ +T+ P GL G I +PGDKSISHRAVMFG++A+G TT++GFL GADCLSTISCF++
Sbjct: 1 MENKTVIPHAKGLKGTIKVPGDKSISHRAVMFGALAKGTTTVEGFLPGADCLSTISCFQK 60
Query: 64 MGVEITQNGDEVTVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGDESI 123
+GV I Q + VTV GKG +GL+EP +LDVGNSGTT RL+ GIL+ PF S + GDESI
Sbjct: 61 LGVSIEQAEERVTVKGKGWDGLREPSDILDVGNSGTTTRLILGILSTLPFHSVIIGDESI 120
Query: 124 AKRPMKRVTNPLKQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILLAGL 183
KRPMKRVT PLK MGA IDGR+ G TPL+IRGG LK I++ SPVASAQ+KSAILLAGL
Sbjct: 121 GKRPMKRVTEPLKSMGAQIDGRDHGNLTPLSIRGGQLKGIDFHSPVASAQMKSAILLAGL 180
Query: 184 RAEGVTAVTEPHISRDHTERMLEAFGVKVTREGKTVKLSGGQKLTATDIQVPGDVSSAAF 243
RAEG T+VTEP +RDHTERMLEAFGV + ++G TV + GGQ LT + VPGD+SSAAF
Sbjct: 181 RAEGKTSVTEPAKTRDHTERMLEAFGVNIEKDGLTVSIEGGQMLTGQHVVVPGDISSAAF 240
Query: 244 FLVAGAIIPNSKLILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIETSS 303
FLVAGA++P+S++ L NVG+NPTR GI++VL++MGAT +E EP A++TIETS
Sbjct: 241 FLVAGAMVPHSRITLTNVGINPTRAGILEVLKQMGATLAMENERVQGGEPVADLTIETSV 300
Query: 304 LKGIEIGGDIIPRLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTKLGA 363
L+G+EIGGDIIPRLIDEIP+IA+ ATQA G TVI+DA ELKVKETNRIDTVV+ELTKLGA
Sbjct: 301 LQGVEIGGDIIPRLIDEIPIIAVLATQASGRTVIKDAEELKVKETNRIDTVVSELTKLGA 360
Query: 364 RIEATDDGMIIYGKSALKGN-TVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSYPT 422
I ATDDGMII G + LKG TV+S+GDHRIGM +AIA LAE + +E EA+ VSYP+
Sbjct: 361 SIHATDDGMIIEGPTPLKGGVTVSSHGDHRIGMAMAIAALLAEKPVTVEGTEAIAVSYPS 420
Query: 423 FFDELQKL 430
FFD L +L
Sbjct: 421 FFDHLDRL 428
>gi|49241782|emb|CAG40473.1| 3-phosphoshikimate
1-carboxyvinyltransferase [Staphylococcus aureus subsp.
aureus MRSA252]
Length = 432
Score = 353 bits (905), Expect = e-101
Identities = 199/430 (46%), Positives = 273/430 (63%), Gaps = 6/430 (1%)
Query: 4 VKERTIQPVNNGLNGNITIPGDKSISHRAVMFGSIAEGKTTIKGFLSGADCLSTISCFKE 63
V E+ I ++ L G I +PGDKS++HRA+M S+AEG +TI L G DC T+ F+
Sbjct: 2 VNEQIID-ISGPLKGEIEVPGDKSMTHRAIMLASLAEGVSTIYKPLLGEDCRRTMDIFRL 60
Query: 64 MGVEITQNGDEVTVVGKGLEGLQEPKAVLDVGNSGTTIRLMSGILANTPFFSCVQGDESI 123
+GV+I ++ D++ V G + + P VL GNSGTT RL++G+L+ S + GD SI
Sbjct: 61 LGVDIKEDEDKLVVNSPGYKAFKTPHQVLYTGNSGTTTRLLAGLLSGLGIESVLSGDVSI 120
Query: 124 AKRPMKRVTNPLKQMGANIDGREEGTFTPLTIRGGDLKAIEYISPVASAQVKSAILLAGL 183
KRPM RV PLK M ANI+G E+ +TPL I+ +K I Y VASAQVKSAIL A L
Sbjct: 121 GKRPMDRVLRPLKSMNANIEGIEDN-YTPLIIKPSVIKGINYKMEVASAQVKSAILFASL 179
Query: 184 RAEGVTAVTEPHISRDHTERMLEAFGVKVTREGKTVKL--SGGQKLTATDIQVPGDVSSA 241
++ T + E +SR+HTE M F + + EG ++ + + D VPGD+SSA
Sbjct: 180 FSKEATIIKELDVSRNHTETMFRHFNIPIEAEGLSITTIPEAIRYIKPADFHVPGDISSA 239
Query: 242 AFFLVAGAIIPNSKLILQNVGMNPTRTGIIDVLEKMGATFTIEPINEGASEPAANITIE- 300
AFF+VA I P S + + NVG+NPTR+GIID++EKMG + GA EP A+I I+
Sbjct: 240 AFFIVAALITPGSDVTIHNVGINPTRSGIIDIVEKMGGNIQLFNQTTGA-EPTASIRIQY 298
Query: 301 TSSLKGIEIGGDIIPRLIDEIPVIALAATQAEGITVIRDAHELKVKETNRIDTVVAELTK 360
T L+ I+I G+++P+ IDE+PVIAL TQA G + I+DA ELKVKETNRIDT L
Sbjct: 299 TPMLQPIKIEGELVPKAIDELPVIALLCTQAVGTSTIKDAEELKVKETNRIDTTADMLNL 358
Query: 361 LGARIEATDDGMIIYGKSALKGNTVNSYGDHRIGMMLAIAGCLAEGKIIIEDAEAVGVSY 420
LG ++ T+DG+II+ TV+S DHRIGMMLA+A L+ + I+ +AV VS+
Sbjct: 359 LGFELQPTNDGLIIHPSEFKTNATVDSLTDHRIGMMLAVASLLSSEPVKIKQFDAVNVSF 418
Query: 421 PTFFDELQKL 430
P F +L+ L
Sbjct: 419 PGFLPKLKLL 428
Database: bsub1.fasta
Posted date: Jan 1, 2006 8:45 PM
Number of letters in database: 2750
Number of sequences in database: 7
Lambda K H
0.315 0.134 0.370
Gapped
Lambda K H
0.267 0.0410 0.140
Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 3539
Number of Sequences: 7
Number of extensions: 176
Number of successful extensions: 9
Number of sequences better than 1.0e-20: 4
Number of HSP's better than 0.0 without gapping: 4
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 0
Number of HSP's gapped (non-prelim): 4
length of query: 432
length of database: 2750
effective HSP length: 45
effective length of query: 387
effective length of database: 2435
effective search space: 942345
effective search space used: 942345
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 42 (22.0 bits)
S2: 213 (86.7 bits)
Simple pattern matching
If we look at the structure of a BLAST output file, we can see that for each query
sequence, the output begins with a line containing the name of the BLAST program used to
perform the alignment (here, BLASTP), the version string (here,
2.2.13), and the version date (here, [Nov-27-2005]).
The entire line looks like this:
BLASTP 2.2.13 [Nov-27-2005]
Simple pattern matching requires three things:
- a string of text in which we want to search for the pattern
- the operator m//, which contains the pattern to search for in the form of a regular expression placed between the two / characters
- the operator =~, which returns a true value when the pattern is present
Let’s search the string "BLASTP 2.2.13 [Nov-27-2005]" for
"BLASTP". We do it like this:
# Create a text string in which to find a pattern.
my( $textString ) = 'BLASTP 2.2.13 [Nov-27-2005]';

# Search the string for a pattern.
if ( $textString =~ m/BLASTP/ )
{
    print( STDOUT "Pattern found!\n" );
}
else
{
    print( STDOUT "Pattern not found!\n" );
}
You can see from this example that the pattern we’re searching for goes inside
the // part of the m// operator. The line of code
if ( $textString =~ m/BLASTP/ )
can be read as:
if $textString contains a match with the regular expression "BLASTP"
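To see both outcomes of the match, here is a minimal sketch that applies the same m/BLASTP/ test to a few lines copied from the BLAST output above. The helper name containsBlastp is hypothetical and exists only for this illustration. Note that the reference line contains "BLAST" and "PSI-BLAST" but never "BLASTP", so it should not match.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the lesson's scripts):
# returns 1 if the line contains "BLASTP", otherwise 0.
sub containsBlastp
{
    my( $line ) = @_;
    return ( $line =~ m/BLASTP/ ) ? 1 : 0;
}

# Sample lines copied from the BLAST output shown earlier.
my( @sampleLines ) = (
    'BLASTP 2.2.13 [Nov-27-2005]',
    '"Gapped BLAST and PSI-BLAST: a new generation of protein database search',
    'Database: bsub1.fasta',
);

foreach my $line ( @sampleLines )
{
    # Choose a label based on whether the pattern was found.
    my( $label ) = containsBlastp( $line ) ? 'match:    ' : 'no match: ';
    print( STDOUT $label, $line, "\n" );
}
```

Only the first sample line is reported as a match. The pattern is compared character by character and is case-sensitive, so a string has to contain exactly "BLASTP" somewhere within it.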
Now that we know the basic building blocks for regular expression searching, here’s a script that can serve as the basis for scanning a text file in order to extract information.
#!/usr/bin/perl
#
# parseBlast1.pl
# 02-Jan-2006
#
# Conrad Halling
# conrad.halling@sphaerula.com
use strict;
use warnings;

my( $dataLine );

# Check for an argument containing the name of the BLAST output file.
if ( scalar( @ARGV ) != 1 )
{
    die(
        "\n",
        " Use: perl $0 blastOutputFileName\n",
        " Example: perl $0 bsub1.blastp.txt\n",
        "\n" );
}

# Open the BLASTP output file named by the first command-line argument.
if ( ! open( BLASTFILE, '<', $ARGV[ 0 ] ) )
{
    die( "\n Can't open file '$ARGV[ 0 ]' for reading: $!.\n" );
}

# Read the file line by line. If we find a line that contains
# "BLASTP", then print that line to STDOUT.
while ( defined( $dataLine = <BLASTFILE> ) )
{
    if ( $dataLine =~ m/BLASTP/ )
    {
        print( STDOUT $dataLine );
    }
}

close( BLASTFILE );
The output from the script, when run against the bsub1.blastp.txt output file, is:
> perl parseBlast1.pl bsub1.blastp.txt
BLASTP 2.2.13 [Nov-27-2005]
BLASTP 2.2.13 [Nov-27-2005]
BLASTP 2.2.13 [Nov-27-2005]
BLASTP 2.2.13 [Nov-27-2005]
BLASTP 2.2.13 [Nov-27-2005]
BLASTP 2.2.13 [Nov-27-2005]
BLASTP 2.2.13 [Nov-27-2005]
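Because the report for each query begins with one such line, counting the matching lines gives the number of query reports in the file. The following is a minimal sketch of that idea, not part of the original script. The helper name countBlastpLines is hypothetical, and it reads from a lexical filehandle opened on an in-memory string (a few lines excerpted from the output above) so that it runs without the data file.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines read from a filehandle
# that contain "BLASTP".
sub countBlastpLines
{
    my( $fileHandle ) = @_;
    my( $count ) = 0;

    while ( defined( my $dataLine = <$fileHandle> ) )
    {
        if ( $dataLine =~ m/BLASTP/ )
        {
            $count++;
        }
    }

    return $count;
}

# A short excerpt of the BLAST output, standing in for the real file.
my( $sample ) = <<'END';
BLASTP 2.2.13 [Nov-27-2005]
Query= gi|143816|gb|AAA20869.1| AroE [Bacillus subtilis]
Database: bsub1.fasta
BLASTP 2.2.13 [Nov-27-2005]
Query= gi|74491472|gb|EAO54687.1| 3-phosphoshikimate
END

# Open a read-only filehandle on the string instead of on a file.
open( my $sampleHandle, '<', \$sample )
    or die( "Can't open in-memory sample: $!\n" );

print( STDOUT "Reports found: ", countBlastpLines( $sampleHandle ), "\n" );

close( $sampleHandle );
```

For this excerpt the sketch prints "Reports found: 2". To use it on the real output, open the handle on bsub1.blastp.txt instead of on the string. The count should then equal the number of lines printed by parseBlast1.pl.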