Hello
I would be grateful for any help with this.
I want to pull an id number (UniProt protein accession number) from a file using a regex. This works OK.
I then want to use the number as part of a URL to fetch the relevant page, so that I can parse some information about the protein from it.
The code is very basic.
My Perl script:
#!/usr/bin/perl
# A script to pull out an id number from a file using a regex.
#The id number(s) are put into an array @accnumber.
#The file I read in is html_test2.txt (attached to this mail).
#Then use the id number as part of a url to get and store a webpage.
#In this case to simplify things I just want to take the first
#element of the @accnumber array and use that in the url
use LWP::Simple;
$a = 0;
#ask for the file name
print "please enter file name", "\n";
#open and read the file
$filename1 = <>;
open fileone, "$filename1"
or die;
while (!eof(fileone))
{
my $line = <fileone>;
if ( $line =~/UNIPROT:?\w+\s(\w{6})\s/)
{
@accnumber[$a]= $1."\n";
$a++;
}
}
close fileone;
$query_number = @accnumber[0];
#as a sanity check I print the number to STDOUT
print $query_number;
#I call the subroutine to return the webpage
get_page($query_number);
sub get_page {
my $address = $_[0];
my $url = 'http://www.ebi.uniprot.org/uniprot-srv/xmlView.do?proteinId='
.$address
.'_ORYSA&pager.offset=0';
my $html_file = 'page.html';
my $status = getstore($url, $html_file);
die "No URL::Error" unless is_success($status);
}
exit;
And this is the text file I parse with the regex to get my id number:
BLASTP 2.0MP-WashU [13-Dec-2004] [decunix5.0a-ev6-IP32LF64 2004-12-15T17:03:39]
Copyright (C) 1996-2004 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.
Reference: Gish, W. (1996-2004) http://blast.wustl.edu
Query= 24061 17154533 emb|CAC80823.1 (AJ251791) putative IAA1 protein [Oryza
sativa] 1e-130 235 236 99.5% top hit
(237 letters; record 1)
Database: uniprot
1,880,849 sequences; 604,459,357 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done
                                                                   Smallest
                                                                      Sum
                                                             High Probability
Sequences producing High-scoring Segment Pairs:              Score P(N)     N
UNIPROT:Q75KX3_ORYSA Q84PD9 Putative auxin-responsive pro... 1203 1.2e-121 1
##################################################################
Thanks for any help.
Comments
: @accnumber[$a]= $1."\n";
: $a++;
:
See how here you put a newline character after what you capture...
: $query_number = @accnumber[0];
:
...which ends up in $query_number...
: get_page($query_number);
:
...and passed to get_page...
: my $address = $_[0];
:
:
: my $url = 'http://www.ebi.uniprot.org/uniprot-srv/xmlView.do?proteinId='
: .$address
: .'_ORYSA&pager.offset=0';
:
...and ends up in the URL, where it puts a line break between the id number and '_ORYSA'. I think that newline character making its way into the URL is the problem.
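To make that concrete, here is a minimal sketch of one way to keep the newline out of the URL. The helper name build_url is made up for illustration and is not part of the original script; the URL pieces are copied from the script as posted. It only builds the string and does not fetch anything.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): builds the query URL
# from an accession number, removing any stray whitespace such as the
# trailing "\n" first.
sub build_url {
    my ($accession) = @_;
    $accession =~ s/\s+//g;    # drop newlines and spaces anywhere in the id
    return 'http://www.ebi.uniprot.org/uniprot-srv/xmlView.do?proteinId='
         . $accession
         . '_ORYSA&pager.offset=0';
}

my $stored = "Q84PD9\n";       # what the original script keeps in @accnumber
my $url    = build_url($stored);
print "$url\n";                # the newline is added only for display
```

Calling chomp on the value before passing it to get_page would probably do the same job; the substitution is just a little more forgiving if other whitespace sneaks in.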
Hope this helps,
Jonathan
###
for(74,117,115,116){$::a.=chr};(($_.='qwertyui')&&
(tr/yuiqwert/her anot/))for($::b);for($::c){$_.=$^X;
/(p.{2}l)/;$_=$1}$::b=~/(..)$/;print("$::a$::b $::c hack$1.");
I'll try to modify the code in line with your suggestion.
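For what it's worth, a modified extraction loop might look something like the sketch below. It assumes the regex from the script is otherwise doing what is wanted; the sub name extract_accessions and the in-memory test string are only there for illustration. The one real change is that the captured id is stored without a newline, and the newline is added when printing.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every accession number found on the given
# filehandle, storing only the captured text (no "\n" appended).
sub extract_accessions {
    my ($fh) = @_;
    my @accnumber;
    while (my $line = <$fh>) {
        if ($line =~ /UNIPROT:?\w+\s(\w{6})\s/) {
            push @accnumber, $1;    # bare id only
        }
    }
    return @accnumber;
}

# Demonstration on an in-memory copy of the hit line from the BLAST report.
my $report = "UNIPROT:Q75KX3_ORYSA Q84PD9 Putative auxin-responsive pro... 1203 1.2e-121 1\n";
open my $fh, '<', \$report or die "Cannot open in-memory report: $!";
my @ids = extract_accessions($fh);
close $fh;

print "$ids[0]\n";    # sanity check; prints Q84PD9
```

With a real file, opening it with a lexical filehandle (open my $fh, '<', $filename) and passing that to the sub should work the same way.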