HELP: parsing unicode web sites - Programmers Heaven


HELP: parsing unicode web sites

andrewwan1980 Posts: 16 Member
I need help parsing Unicode web pages and downloading JPEG image files with Perl scripts.

I read http://www.cs.utk.edu/cs594ipm/perl/crawltut.html, which covers fetching pages with LWP, HTTP, and the get($url) function. When I use get($url) on a non-Unicode web page, the content comes back as clean ASCII, but for Unicode pages the content returned is always garbled.

Now I want to parse http://www.tom365.com/movie_2004/html/5507.html, and the page I get back is garbled. I have read about the Encode module but don't know how to use it.
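Since the question is how to use Encode, here is a minimal sketch of its two core functions. It assumes the page is served as GB2312 (which Encode calls `euc-cn`); that charset is a guess for illustration, so check the page's Content-Type header or meta tag for the real one. The sample bytes are hard-coded rather than fetched.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they might arrive over the network: two Chinese characters
# encoded as GB2312 (assumed charset, hard-coded sample data).
my $bytes = "\xB5\xE7\xD3\xB0";

# decode() turns encoded bytes into a Perl character string.
my $text = decode('euc-cn', $bytes);

printf "%d bytes became %d characters\n", length($bytes), length($text);

# encode() goes the other way, e.g. to UTF-8 before printing or saving.
my $utf8 = encode('UTF-8', $text);

# Decoding the UTF-8 bytes gives back the same character string.
my $round_trip = decode('UTF-8', $utf8);
print "Round trip OK\n" if $round_trip eq $text;
```

Regular expressions should be run against the decoded string (`$text` here), not against the raw bytes.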

I need a Perl script that parses the page above and extracts the URL of the image that matches this pattern:

image

If anyone knows how to parse Unicode web pages like this, I'd be very grateful.
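A rough sketch of the extraction step, for illustration only. The markup around the image did not survive in the post, so the `<img src="...">` shape, the example.com URL, and the variable names below are all hypothetical stand-ins, not the real tom365.com layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the decoded page; the real tag shape is unknown.
my $html = <<'HTML';
<html><body>
<div class="poster"><img src="http://www.example.com/pics/5507.jpg" alt="poster"></div>
</body></html>
HTML

# Capture the src attribute of the first <img> tag that points at a .jpg file.
# The parentheses are what fill $1; without a capture group $1 stays undefined.
my $image_url;
if ($html =~ /<img\b[^>]*\bsrc\s*=\s*["']([^"']+\.jpe?g)["']/i) {
    $image_url = $1;
    print "Found image: $image_url\n";
} else {
    print "No image found\n";
}
```

A regex like this is fragile against markup changes. HTML::LinkExtor, a real HTML parser, is likely the sturdier choice if the page layout varies.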

Thank you

Comments

  • Jonathan Posts: 2,914 Member
    Hi,

    Here is a Unicode advice page:
    http://juerd.nl/site.plp/perluniadvice

    It notes that recent versions of LWP are Unicode-aware. Are you running a fairly recent version of the module and of Perl? It also suggests looking at HTTP::Response::Charset.

    Thanks,

    Jonathan
    ###
    for(74,117,115,116){$::a.=chr};(($_.='qwertyui')&&
    (tr/yuiqwert/her anot/))for($::b);for($::c){$_.=$^X;
    /(p.{2}l)/;$_=$1}$::b=~/(..)$/;print("$::a$::b $::c hack$1.");
  • andrewwan1980 Posts: 16 Member
    Thanks to those who helped. Here's my working script:

    [code]#!/usr/bin/perl
    # tom365crawl2.pl
    # http://www.cs.utk.edu/cs594ipm/perl/crawltut.html
    # http://perldoc.perl.org/Encode.html
    # http://juerd.nl/site.plp/perluniadvice
    # http://www.perlmonks.org/?node_id=620068

    use warnings;
    use strict;

    use File::stat;
    use Tie::File;

    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::LinkExtor; # Allows you to extract the links off of an HTML page.
    #use File::Slurp;

    use Encode;

    my $site1 = "http://www.tom365.com/"; # Full url like http://www.tom365.com/movie_2004/html/????.html
    my $delim1a = "
    image";
    my $folder1 = "movie_2004/html/";
    my $url1;
    my $start1 = 1000;
    my $end1 = 1000;
    my $contents1;
    my $image1;

    my $browser1 = LWP::UserAgent->new();
    $browser1->timeout(10);
    my $request1;
    my $response1;

    my $count;
    for ($count=$start1; $count<=$end1; $count++) {
    $url1 = $site1 . $folder1 . $count . ".html";
    printf "Downloading %s\n", $url1;

    # Method 1
    #$contents1 = get($url1);

    # Method 2
    $request1 = HTTP::Request->new(GET => $url1);
    $response1 = $browser1->request($request1);
    if ($response1->is_error()) {
    printf "%s\n", $response1->status_line;
    next; # skip pages that failed to download
    }
    $contents1 = $response1->decoded_content();

    #open(NEWFILE1, "> Debug.txt");
    #(print NEWFILE1 $contents1) or die "Can't write to Debug.txt: $!";
    #close(NEWFILE1);

    #print $contents1;

    if ($contents1 =~ /
    image/m) {
    $image1 = "$1";
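    printf "Downloading %s\n", $image1;
    # List form of system() bypasses the shell, so characters such as & or ? in the URL are safe
    system('wget', '-q', '-O', "$count.jpg", $image1);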

    #if ($image1 =~ /\/([^\/]*)$/m) {
    # printf "Renaming %s to $count.jpg\n", $1;
    #} else {
    # printf "Could not rename %s to $count.jpg\n", $image1;
    #}
    #}
    } else {
    #open(NEWFILE1, "> $count.txt");
    #(print NEWFILE1 "Download failed.\n") or die "Can't write to $image1: $!";
    #close(NEWFILE1);
    }
    }[/code]
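The step that fixes the garbling in the script above is decoded_content(), which applies the charset declared in the response headers. Here is a small offline sketch of the difference between it and content(). The response is built by hand so nothing is fetched, and the gb2312 charset and the sample bytes are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Build a response by hand so the example runs without network access.
# The charset in the Content-Type header is an assumed value.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=gb2312' ],
    "<title>\xB5\xE7\xD3\xB0</title>",
);

my $raw     = $response->content;          # undecoded bytes, as sent by the server
my $decoded = $response->decoded_content;  # character string, charset applied

printf "raw: %d bytes, decoded: %d characters\n", length($raw), length($decoded);

# Declare an output encoding before printing decoded text, otherwise Perl
# warns about wide characters.
binmode(STDOUT, ':encoding(UTF-8)');
print $decoded, "\n";
```

The same applies when writing the decoded page to a debug file: open the file with an encoding layer such as `'>:encoding(UTF-8)'`.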