Perl script to check doc_b against doc_a for inconsistence

satimis Posts: 21 Member
Hi folks,

I want to write a script that checks two documents, say doc_a and doc_b, for inconsistencies, and I have no idea how to start.

doc_b is reproduced from doc_a (the original document) by retyping, not with 'copy and paste'.

To keep it simple at first, take a one-line document as an example:

1)
Original document "doc_a":[code]Check this link to sea what scannars are supported by SANE[/code]
It already has 2 typing mistakes:
sea
scannars

2)
The reproduced document "doc_b" must keep these 2 mistakes for consistency.
[code]check thes link to sea what scannars are suppurted by SeNE[/code]
Unfortunately, 3 further typing mistakes were made:
thes
suppurted
SeNE

What I expect to have in the printout is:[code]
Original    Mistake     Line No.  Word No.
this        thes        1         2
supported   suppurted   1         9
SANE        SeNE        1         11[/code]
rather than just printing out the contents and saying "differ".

Kindly advise how to start. TIA

B.R.
satimis

Comments

  • Weirdofreak Posts: 439 Member
    Well, you may want to check out diff first, if you're doing this because you need a tool rather than because you feel like it. It's more suited to large documents with big differences though, rather than trying to catch individual words.

    First you'd want to split each line into separate words. If the words at the same position are different, print them out with the line/word number. You may want to use tabstops to align them, or pad with [grey]" " x (10 - length $word)[/grey]. If the array from the original file is longer than the other one, print out [grey]splice @arr1, scalar @arr2[/grey] with a message saying that that's what's missing, and if @arr2 is longer, do [grey]splice @arr2, scalar @arr1[/grey] and say that it shouldn't be there. It won't be very good if you add a word in the middle of a line ("foo bar baz quux" becoming "foo bar baz blech quux" will tell you that quux shouldn't be there, rather than blech), but it should suffice.
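
    As a rough sketch of the word-by-word check described above: the sub name compare_words and the wording of the report lines are made up for illustration, and it uses sprintf padding instead of tabstops.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: compare two lines word by word and return a list
    # of report strings. $line_no is the 1-based line number for the report.
    sub compare_words {
        my ($line_a, $line_b, $line_no) = @_;
        my @arr1 = split ' ', $line_a;    # split ' ' ignores leading/trailing whitespace
        my @arr2 = split ' ', $line_b;
        my @report;

        # Compare only the positions that both lines have.
        my $shared = @arr1 < @arr2 ? @arr1 : @arr2;
        for my $j (0 .. $shared - 1) {
            next if $arr1[$j] eq $arr2[$j];
            # Pad the first two columns to 12 characters so they line up.
            push @report, sprintf '%-12s %-12s %4d %4d',
                $arr1[$j], $arr2[$j], $line_no, $j + 1;
        }

        # Whatever is left over in the longer array is missing or extra.
        if (@arr1 > @arr2) {
            my @missing = splice @arr1, scalar @arr2;
            push @report, "line $line_no: missing '@missing'";
        }
        elsif (@arr2 > @arr1) {
            my @extra = splice @arr2, scalar @arr1;
            push @report, "line $line_no: should not have '@extra'";
        }
        return @report;
    }

    # The one-line example from the first post.
    print "$_\n" for compare_words(
        'Check this link to sea what scannars are supported by SANE',
        'check thes link to sea what scannars are suppurted by SeNE',
        1,
    );
    ```

    The comparison is case-sensitive, so for the example line it also reports Check/check at word 1, in addition to the three mistakes listed in the first post.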

    You'll probably want to avoid storing the file in an array, to save memory. Instead, do [grey]while (<$file>) { ... }[/grey] unless you need to keep it for some reason.
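
    A minimal sketch of reading both files in step with a while loop, so that only one line from each file is held in memory at a time. The sub name each_line_pair and its callback interface are assumptions for illustration; the loop stops when the shorter file ends, so a length check would still have to be added separately.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: read two files in step, one line from each per
    # iteration, and call $callback->($line_a, $line_b, $line_no) for every
    # pair. Returns the number of line pairs processed.
    sub each_line_pair {
        my ($path_a, $path_b, $callback) = @_;
        open my $fh_a, '<', $path_a or die "Cannot open $path_a: $!";
        open my $fh_b, '<', $path_b or die "Cannot open $path_b: $!";

        my $line_no = 0;
        while (defined(my $line_a = <$fh_a>)) {
            my $line_b = <$fh_b>;
            last unless defined $line_b;    # file b ended early
            chomp($line_a, $line_b);
            $callback->($line_a, $line_b, ++$line_no);
        }
        close $fh_a;
        close $fh_b;
        return $line_no;
    }

    # Usage, assuming doc_a.txt and doc_b.txt exist in the current directory.
    if (-e 'doc_a.txt' && -e 'doc_b.txt') {
        each_line_pair('doc_a.txt', 'doc_b.txt', sub {
            my ($line_a, $line_b, $line_no) = @_;
            print "line $line_no differs\n" if $line_a ne $line_b;
        });
    }
    ```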
  • satimis Posts: 21 Member
    Hi

    Tks for your advice.

    I'm only a newbie at Perl. I'll use the following as a starting point.

    Script:-
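    # Read both documents into memory, one array element per line.
    open (FILE, "doc_a.txt") or die "Cannot open doc_a.txt: $!";
    chomp(@doc_a = <FILE>);
    close FILE;

    open (FILE, "doc_b.txt") or die "Cannot open doc_b.txt: $!";
    chomp(@doc_b = <FILE>);
    close FILE;

    $n_a = @doc_a;
    $n_b = @doc_b;

    if ($n_a != $n_b) {
        print "Error: documents are not the same length\n";
        exit(1);
    }

    else {
        for my $i (0 .. $#doc_a) {
            my @line_a = split(/ /,$doc_a[$i]);
            my @line_b = split(/ /,$doc_b[$i]);
            # Pass references so the two arrays are not flattened into one list.
            &compare(\@line_a,\@line_b,$i);
        }
        &print_results();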

    .....
    .....
    etc.
    - End -

    I have not worked out how to print the mistakes (words in doc_b mistyped relative to doc_a) in a table with line number and word number, as shown in my first post. Could you please give me some suggestions? Tks.


    : You'll probably want to avoid storing the file in an array, to save memory. Instead, do [grey]while (<$file>) { ... }[/grey] unless you need to keep it for some reason.
    :

    Could you please advise in more detail how to achieve this? TIA

    B.R.
    satimis
  • Weirdofreak Posts: 439 Member
    Untested, but try this.

    [code]print "Original Mistake Line Word\n";
    my $short = ($#line_a > $#line_b ? $#line_b : $#line_a);
    for my $j (0 .. $short) {
        my ($a, $b, $l, $w) = ($line_a[$j], $line_b[$j], $i + 1, $j + 1); # redundant, but looks better
        print "$a $b $l $w\n" if $a ne $b;
    }
    my $l = $i + 1; # the $l inside the loop is out of scope here, so work out the line number again
    if ($#line_a > $#line_b) {
        # $#line_b + 1 is the first index past the end of @line_b, so nothing needs scalarising
        print "File b is missing '", join(' ', @line_a[$#line_b + 1 .. $#line_a]), "' on line $l\n";
    } elsif ($#line_b > $#line_a) {
        # same again with the two arrays swapped
        print "File b should not have '", join(' ', @line_b[$#line_a + 1 .. $#line_b]), "' on line $l\n";
    }[/code]

    The formatting will probably get messed up with that method, but it looks the nicest in code form :-). You may want to look at formats - this sort of thing is what I think they were made for, although they're slightly archaic. I think you'd want something like
    [code]format STDOUT =
    @<<<<<<<<<<<< @<<<<<<<<<<<< @<<<<<< @<<<<<<
    $a, $b, $l, $w
    .[/code]
    But I don't know much about them at all, including how to actually use them, so you're on your own there.
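
    One possible way to try out that picture line without declaring a format is formline, the built-in that formats use internally; it appends its output to the accumulator variable $^A. The sub name format_row is made up for illustration, and the rows are the expected output from the first post.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: lay out one table row with the same picture line
    # as the format above and return it as a string.
    sub format_row {
        my @fields = @_;
        local $^A = '';    # formline appends to $^A, so start from empty
        # Single quotes keep the @ signs literal; the newline ends the row.
        formline('@<<<<<<<<<<<< @<<<<<<<<<<<< @<<<<<< @<<<<<<' . "\n", @fields);
        my $row = $^A;
        return $row;
    }

    print format_row('Original', 'Mistake', 'Line', 'Word');
    print format_row(@$_) for (
        [ 'this',      'thes',      1, 2  ],
        [ 'supported', 'suppurted', 1, 9  ],
        [ 'SANE',      'SeNE',      1, 11 ],
    );
    ```

    Each @<<< field is left-justified and has a fixed width, so a word longer than 13 characters would be cut off in the first two columns.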