HELP: parsing unicode web sites

I need help in parsing unicode webpages & downloading jpeg image files via Perl scripts.

I have read about using LWP, HTTP, and the get($url) function, but for Unicode pages the content returned is always garbled. When I use get($url) on a non-Unicode webpage, the content comes back as perfect ASCII.

But the page I now want to parse comes back garbled. I have read about Encode but don't know how to use it.
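
For reference, a minimal sketch of the usual Encode pattern: decode incoming bytes into a character string, work on the characters, and encode again on output. The sample string here is made up for illustration; a real page body would come from an HTTP response, and its encoding may not be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample data: the UTF-8 bytes for "café".
my $bytes = "caf\xC3\xA9";

# decode() turns raw bytes into a Perl character string.
my $text = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($text);

# encode() goes the other way, e.g. before writing to a file or socket.
my $out = encode('UTF-8', $text);
print "round trip ok\n" if $out eq $bytes;
```

The byte string and the character string have different lengths because the accented letter takes two bytes in UTF-8 but counts as one character once decoded.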

I need a Perl script to parse that page and extract the URL for the image in this pattern:


If anyone knows how to parse Unicode webpages, I'd be very grateful.

Thank you


  • Hi,

    A unicode advice page:

    It notes that recent versions of LWP are Unicode-aware. Are you running a fairly recent version of the module and of Perl? It also suggests looking at HTTP::Response::Charset.
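
    A rough sketch of what "unicode aware" means in practice, assuming a version of LWP where HTTP::Response provides decoded_content: content() returns the raw bytes, while decoded_content() applies the charset declared in the Content-Type header. The response below is built by hand with made-up data so the sketch runs without a network connection; normally it would come from LWP::UserAgent.

    ```perl
    use strict;
    use warnings;
    use HTTP::Response;

    # Hand-built stand-in for a fetched page (hypothetical sample data).
    # The body holds the UTF-8 bytes for "naïve" inside a <p> tag.
    my $response = HTTP::Response->new(
        200, 'OK',
        [ 'Content-Type' => 'text/html; charset=UTF-8' ],
        "<p>na\xC3\xAFve</p>",
    );

    my $raw  = $response->content;          # undecoded bytes
    my $text = $response->decoded_content;  # characters, decoded per the charset header

    printf "raw length: %d, decoded length: %d\n", length($raw), length($text);
    ```

    If the two lengths differ, the body contained multi-byte characters and decoded_content() has turned them into single characters.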


  • Thanks to those who helped. Here's my working script:


    use warnings;
    use strict;

    use File::stat;
    use Tie::File;

    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::LinkExtor; # Allows you to extract the links off of an HTML page.
    #use File::Slurp;

    use Encode;

    my $site1 = ""; # Full url like
    my $delim1a = "
    my $folder1 = "movie_2004/html/";
    my $url1;
    my $start1 = 1000;
    my $end1 = 1000;
    my $contents1;
    my $image1;

    my $browser1 = LWP::UserAgent->new();
    my $request1;
    my $response1;

    my $count;
    for ($count=$start1; $count<=$end1; $count++) {
        $url1 = $site1 . $folder1 . $count . ".html";
        printf "Downloading %s\n", $url1;

        # Method 1
        #$contents1 = get($url1);

        # Method 2
        $request1 = HTTP::Request->new(GET => $url1);
        $response1 = $browser1->request($request1);
        if ($response1->is_error()) {
            printf "%s\n", $response1->status_line;
        }
        # decoded_content() returns the body as characters rather than raw bytes
        $contents1 = $response1->decoded_content();

        #open(NEWFILE1, "> Debug.txt");
        #(print NEWFILE1 $contents1) or die "Can't write to Debug.txt: $!";

        #print $contents1;

        if ($contents1 =~ /
        image/m) {
            $image1 = "$1";
            printf "Downloading %s\n", $image1;
            `wget -q -O $count.jpg $image1`;

            #if ($image1 =~ /\/([^\/]*)$/m) {
            #    printf "Renaming %s to $count.jpg\n", $1;
            #} else {
            #    printf "Could not rename %s to $count.jpg\n", $image1;
            #}
        } else {
            #open(NEWFILE1, "> $count.txt");
            #(print NEWFILE1 "Download failed.\n") or die "Can't write to $image1: $!";
        }
    }
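
    As an alternative to matching the image URL with a regular expression, here is a minimal sketch using HTML::LinkExtor, which the script loads but never calls. The HTML snippet and the base URL are placeholders invented for illustration; in the script above the input would be $contents1 and the base would be the page's own URL.

    ```perl
    use strict;
    use warnings;
    use HTML::LinkExtor;

    # Hypothetical page content standing in for $contents1.
    my $html = '<html><body><a href="next.html">next</a>'
             . '<img src="images/1000.jpg" alt="poster"></body></html>';

    my @images;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            # Keep only <img> tags that carry a src attribute.
            push @images, $attr{src} if $tag eq 'img' && defined $attr{src};
        },
        'http://www.example.com/movie_2004/html/',   # placeholder base URL
    );

    $parser->parse($html);
    $parser->eof;

    print "$_\n" for @images;
    ```

    Passing a base URL makes the parser return absolute links, so each entry in @images can be handed straight to a downloader.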