I get a Unicode error while using the SGML parser

I get a Unicode error while extracting running text from a Hindi (Devanagari) HTML page. Can anybody help me here?

Code:

[code]
import sys, re
from sgmllib import SGMLParser, SGMLParseError

def handle_data(self, x):
    try:
        x = unicode(x)
        print "\n*****unicode***\n", x
    except UnicodeError:
        print "unicode error"
        return
    if not self.ignore:
        self.s += x
    return
[/code]


Please reply.

Comments

  • Without knowing what page it is you're trying to extract, or what data is being passed to the handle_data() function, I can't help very much. Try changing the except block to this:

    [code]
    except UnicodeError, ex:
        print ex
        return
    [/code]

    See if that provides any more useful information than just "unicode error".
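For anyone following along, here is a minimal sketch of what the exception object can tell you. It is written for Python 3 (the thread itself uses Python 2), and the sample bytes and the helper name `describe_decode_failure` are made up for illustration; only the attributes of `UnicodeDecodeError` come from the standard library.

```python
# Hypothetical sample: 0x96 stands in for whatever byte the real page contains.
data = b"plain ascii text \x96 more text"


def describe_decode_failure(raw, encoding="ascii"):
    """Return details about the first undecodable byte, or None if decoding works."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as ex:
        # ex.start and ex.end delimit the offending bytes inside ex.object.
        return {
            "encoding": ex.encoding,
            "position": ex.start,
            "bad_bytes": ex.object[ex.start:ex.end],
            "context": ex.object[max(0, ex.start - 10):ex.end + 10],
            "reason": ex.reason,
        }
    return None


report = describe_decode_failure(data)
print(report)
```

Looking at the bytes around the failing position is usually enough to guess which encoding the page really uses.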


    [size=5][italic][blue][RED]i[/RED]nfidel[/blue][/italic][/size]

    [code]
    $ select * from users where clue > 0
    no rows returned
    [/code]

  • First of all, thanks for responding.

    I have changed the code as you said:

    except UnicodeError, ex:
        print ex
        return

    It now gives this message:

    'ascii' codec can't decode byte 0x96 in position 35: ordinal not in range(128)

    I'm trying to extract Hindi text from an HTML page (URL). The text content (i.e. the output of html2txt) is the input to handle_data().

    Code:

    def html2txt(contents_of_file):
        p = HTMLConverter()

        try:
            p.feed(contents_of_file)
        except UnicodeError:
            print >>sys.stderr, "warning: skipped:"
        except SGMLParseError:
            print >>sys.stderr, "SGMLparse error"
        textcontent = p.return_output()
        title = p.get_title()
        p.close()

    Please reply, or ask if you have any further questions.
  • [b][red]This message was edited by Gregry2 at 2006-7-18 21:22:18[/red][/b][hr]
    [red]
    'ascii' codec can't decode byte 0x96 in position 35: ordinal not in range(128)

    It's fairly self-explanatory: the data is being decoded with the ASCII codec, and the ASCII codec is complaining because you're giving it bytes outside the ASCII range (hence the "ordinal not in range(128)").

    As for how to use a different codec, well, I'm no Python buff and I'd need to read up on the library, so let infidel answer that :)

    {2}rIng
    [/red]
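To make the point about the ASCII codec concrete, here is a small sketch in Python 3 syntax (the thread uses Python 2, where `unicode(x)` without an encoding argument falls back to ASCII). The sample string is my own choice, not data from the pages in question.

```python
# "Hindi" written in Devanagari, spelled with escapes so the source stays ASCII.
text = "\u0939\u093f\u0928\u094d\u0926\u0940"
utf8_bytes = text.encode("utf-8")   # what a UTF-8 page would send over the wire

# Every byte of this UTF-8 form is >= 0x80, so ASCII cannot decode any of it.
try:
    utf8_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the right codec recovers the original text.
decoded = utf8_bytes.decode("utf-8")
print(ascii_ok, decoded == text)
```

The fix in practice is to name the page's real encoding when decoding, rather than relying on the default.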

  • I know it's not able to decode Unicode. What is a codec? And what is the entitydefs below?
    Can anybody help me overcome this Unicode problem? Please.

    entitydefs = {
        'nbsp':u' ', 'thinsp':u' ', 'emsp':u' ', 'ensp':u' ',
        'amp':u'&', 'lt':u'<', 'gt':u'>', 'quot':u'"', 'apos':u"'"
    }

  • "Codec" is short for coder-decoder (or compressor-decompressor): the former encodes and decodes data, the latter compresses and decompresses it. It's probably the former in this case, but I can't really tell you what's going on, since in the languages I mostly use (C/C++) you don't really "encode/decode" characters; they are just numbers anyway (it only matters how you use them).

    And the entitydefs look like HTML entities, right? Only some are missing... or are they?

    You must know HTML entities (since you're doing HTML processing), like "&nbsp;", "&apos;"...
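As a rough illustration of what such an entity table maps to, the sketch below uses Python 3's standard `html` module (the thread's code is Python 2, where the table is written by hand). The sample string is invented.

```python
import html
from html.entities import name2codepoint

# Entity names taken from the entitydefs table quoted in this thread.
names = ["nbsp", "thinsp", "emsp", "ensp", "amp", "lt", "gt", "quot"]
table = {name: chr(name2codepoint[name]) for name in names}

for name, char in table.items():
    print("&%s; -> U+%04X" % (name, ord(char)))

# html.unescape() applies the same kind of mapping to a whole string.
sample = "Tom &amp; Jerry &lt;b&gt;&quot;hi&quot;&lt;/b&gt;"
plain = html.unescape(sample)
print(plain)
```

So each entry in entitydefs simply says which character should replace a named entity when the parser meets it.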

    Anyhow, as for how to get the Unicode codec, the only thing I know is that there's a unicode package or something in the library that you can use, but for specifics, wait for infidel.

    I've changed the title, so it'll get his attention.
    {2}rIng
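For what it's worth, the library piece in question is most likely the standard `codecs` module. Here is a hedged sketch, in Python 3 syntax rather than the Python 2 used in this thread, of a codec as a pair of encode and decode functions. The sample word is my own.

```python
import codecs

# "namaste" in Devanagari, spelled with escapes so the source stays ASCII.
text = "\u0928\u092e\u0938\u094d\u0924\u0947"

info = codecs.lookup("utf-8")            # CodecInfo object for the named codec
encoded, consumed = info.encode(text)    # text -> bytes, plus characters consumed
decoded, _ = info.decode(encoded)        # bytes -> text

print(info.name, len(text), len(encoded), decoded == text)
```

Each Devanagari character here takes three bytes in UTF-8, which is why the encoded form is three times as long as the text.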
  • : I've changed the title, so it'll get his attention.

    LOL. I've been in training this week, so I haven't been able to respond. If I get a chance I'll try, but otherwise it might have to wait until next week before I can look at it. It would really help to have the URL of the page you are trying to process, so I can see which characters are causing the problem.


    [size=5][italic][blue][RED]i[/RED]nfidel[/blue][/italic][/size]


  • These are some of the URLs:
    http://www.hindinest.com/
    http://www.hindimilap.com/
    http://www.radioaustralia.net.au/australia/hd/
    Code:

    import sys, re
    from sgmllib import SGMLParser, SGMLParseError

    debug = 1

    class HTMLConverter(SGMLParser):

        RE1 = re.compile(u"([\u0900-\u0dff])[\r\n]+([\u0900-\u0dff])")
        RE2 = re.compile(r"[\n\r\s]+")

        entitydefs = {
            'nbsp':u' ', 'thinsp':u' ', 'emsp':u' ', 'ensp':u' ',
            'amp':u'&', 'lt':u'<', 'gt':u'>', 'quot':u'"', 'apos':u"'"
        }

        def __init__(self):
            SGMLParser.__init__(self)
            self.ignore = 0
            self.pre = 0
            self.is_title = 0
            self.title = u""
            self.title_not_processed = 1
            self.s = u""
            self.output = []
            return

        def close(self):
            try:
                SGMLParser.close(self)
            except SGMLParseError:
                print "SGMLParseError occurred"
            self.newline()
            return

        def convstr(self, s):
            # remove all newlines between two zenkaku characters.
            s = HTMLConverter.RE1.sub(r"\1\2", s)
            # print "\n****after removing newline b/w 2 zenkaku character****\n",s
            # replace all contiguous blanks into a single space.
            s = HTMLConverter.RE2.sub(r" ", s)
            return s.strip()

        def handle_data(self, x):
            try:
                x = unicode(x)
                # print "\n*****unicode***\n",x
            except UnicodeError:
                return
            if not self.ignore:
                self.s += x
                # print "\n$$$$$$$$$$$$$$$$$$$handle data$$$$$$$$$$$$$$$\n",self.s
            return

        def newline(self, attrs=[]):
            if self.title_not_processed:
                self.process_title(self.convstr(self.title))
                self.title_not_processed = 0
            if self.s:
                if self.pre:
                    self.process_text(self.s)
                else:
                    s = self.convstr(self.s)
                    if s: self.process_text(s)
            self.s = u""
            return

        def begin_ignore(self, attrs):
            self.ignore += 1
            return

        def end_ignore(self):
            self.ignore -= 1
            return

        # uncomment if you want to extract formatted texts as they are.
        def start_pre(self, attrs):
            self.pre = 1
            self.newline()
            return

        def end_pre(self):
            self.newline()
            self.pre = 0
            return

        def start_title(self, attrs):
            self.is_title = 1
            return

        def end_title(self):
            self.is_title = 0
            self.title = self.s
            self.s = ""
            return

        def get_title(self):
            doc_title = self.title
            return doc_title.strip()

        start_body = newline
        start_p = end_p = newline
        do_br = newline
        do_hr = newline
        start_th = end_th = newline
        start_td = end_td = newline
        start_li = end_li = newline
        start_dt = end_dt = newline
        start_dd = end_dd = newline
        start_h1 = end_h1 = newline
        start_h2 = end_h2 = newline
        start_h3 = end_h3 = newline
        start_h4 = end_h4 = newline
        start_h5 = end_h5 = newline
        start_h6 = end_h6 = newline
        start_pre = end_pre = newline
        start_div = end_div = newline

        def process_title(self, t):
            self.output.append(t)
            return

        def process_text(self, s):
            self.output.append(s)
            # print "\n***processing text***\n",s,"\nsself&&&&&&&\n",self
            return

        def return_output(self):
            # print "\n***return output***\n",self.output
            return "\n".join(self.output)

    def html2txt(contents_of_file):

        """ This function takes the contents in html format and returns the text equivalent."""

        #global output
        #output = []
        textcontent = None

        if debug:
            print "############################",contents_of_file
            print "!!!!!!!!!!!!!!!!!!!!!!!!!!!!"

        contents_of_file = re.sub("<[a|A] ","__LNK_STRT__<a ",contents_of_file)
        contents_of_file = re.sub("</[a|A]>","__LNK_END__",contents_of_file)
        print "\nafter inserting LNK_STR and before HTMLConverter\n",contents_of_file

        p = HTMLConverter()

        try:
            p.feed(contents_of_file)
        except UnicodeError:
            print >>sys.stderr, "warning: skipped:"
        except SGMLParseError:
            print >>sys.stderr, "SGMLparse error"
        textcontent = p.return_output()
        title = p.get_title()
        p.close()

        if debug:
            print "@@@@@@@@@@@@@@@@@@@@@@@@@@@@",textcontent
            print "^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^"

        textcontent = RemoveUnnecessaryLinks(textcontent)
        return (textcontent,title)
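The two regular expressions in convstr() are easier to follow in isolation. The sketch below is my reading of what RE1 and RE2 are meant to do, rewritten for Python 3; the names `JOIN_LINES` and `BLANKS`, the exact patterns, and the sample string are assumptions, since the posted patterns lost their backslashes.

```python
import re

# Assumed reconstruction of RE1: a line break between two characters in the
# U+0900..U+0DFF range (Devanagari and the other Indic blocks up to Sinhala).
JOIN_LINES = re.compile("([\u0900-\u0dff])[\r\n]+([\u0900-\u0dff])")
# Assumed reconstruction of RE2: any run of whitespace.
BLANKS = re.compile(r"[\n\r\s]+")


def convstr(s):
    """Join lines split inside Indic text, then collapse whitespace runs."""
    s = JOIN_LINES.sub(r"\1\2", s)   # keep both characters, drop the break
    s = BLANKS.sub(" ", s)           # one space per whitespace run
    return s.strip()


# Invented sample: a Devanagari word broken across two lines, plus stray blanks.
sample = "  \u0939\u093f\n\u0928\u094d\u0926\u0940   text\n\nmore  "
result = convstr(sample)
print(repr(result))
```

The replacement string has to be `\1\2` (two group references); a plain "12" would replace the matched characters with the digits 1 and 2.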
  • Thanks. That should help. I don't have Python available today, I'm still in a training class. I'll check it out on Monday.


    [size=5][italic][blue][RED]i[/RED]nfidel[/blue][/italic][/size]


  • : i get unicode error while extracting running text from a Hindi(Devnagri)html page.can any body help me here....

    Ok, I think I got it working. The code that follows has a few minor changes from yours. First, I took out all of the exception handlers so I could see exactly where things were breaking. I also re-indented it because two-space indents annoy me :-)

    I couldn't figure out how to change the codec of the SGML parser, so I used the unicode() function on the HTML before feeding it to the class (see function main() below). I also used the Latin-1 encoding (ISO-8859-1) because using UTF-8 gave me errors.

    [code]
    ##Sample URLs:
    ## http://www.hindinest.com/
    ## http://www.hindimilap.com/
    ## http://www.radioaustralia.net.au/australia/hd/

    import sys, re
    from sgmllib import SGMLParser, SGMLParseError

    debug = 0

    class HTMLConverter(SGMLParser):

        RE1 = re.compile(u"([\u0900-\u0dff])[\r\n]+([\u0900-\u0dff])")
        RE2 = re.compile(r"[\n\r\s]+")

        entitydefs = {
            'nbsp':u' ', 'thinsp':u' ', 'emsp':u' ', 'ensp':u' ',
            'amp':u'&', 'lt':u'<', 'gt':u'>', 'quot':u'"', 'apos':u"'"
        }

        def __init__(self):
            SGMLParser.__init__(self)
            self.ignore = 0
            self.pre = 0
            self.is_title = 0
            self.title = u""
            self.title_not_processed = 1
            self.s = u""
            self.output = []
            return

        def close(self):
            try:
                SGMLParser.close(self)
            except SGMLParseError, ex:
                print ex
                raise
            self.newline()
            return

        def convstr(self, s):
            # remove all newlines between two zenkaku characters.
            s = HTMLConverter.RE1.sub(r"\1\2", s)
            #print "\n****after removing newline b/w 2 zenkaku character****\n",s
            # replace all contiguous blanks into a single space.
            s = HTMLConverter.RE2.sub(r" ", s)
            return s.strip()

        def handle_data(self, x):
            # x is already unicode here because main() decodes the page first
            if not self.ignore:
                self.s += x
            return

        def newline(self, attrs=[]):
            if self.title_not_processed:
                self.process_title(self.convstr(self.title))
                self.title_not_processed = 0
            if self.s:
                if self.pre:
                    self.process_text(self.s)
                else:
                    s = self.convstr(self.s)
                    if s: self.process_text(s)
            self.s = u""
            return

        def begin_ignore(self, attrs):
            self.ignore += 1
            return

        def end_ignore(self):
            self.ignore -= 1
            return

        # uncomment if you want to extract formatted texts as they are.
        def start_pre(self, attrs):
            self.pre = 1
            self.newline()
            return

        def end_pre(self):
            self.newline()
            self.pre = 0
            return

        def start_title(self, attrs):
            self.is_title = 1
            return

        def end_title(self):
            self.is_title = 0
            self.title = self.s
            self.s = u""
            return

        def get_title(self):
            doc_title = self.title
            return doc_title.strip()

        start_body = newline
        start_p = end_p = newline
        do_br = newline
        do_hr = newline
        start_th = end_th = newline
        start_td = end_td = newline
        start_li = end_li = newline
        start_dt = end_dt = newline
        start_dd = end_dd = newline
        start_h1 = end_h1 = newline
        start_h2 = end_h2 = newline
        start_h3 = end_h3 = newline
        start_h4 = end_h4 = newline
        start_h5 = end_h5 = newline
        start_h6 = end_h6 = newline
        # note: this overrides the start_pre/end_pre methods defined above
        start_pre = end_pre = newline
        start_div = end_div = newline

        def process_title(self, t):
            self.output.append(t)
            return

        def process_text(self, s):
            self.output.append(s)
            # print "\n***processing text***\n",s,"\nsself&&&&&&&\n",self
            return

        def return_output(self):
            # print "\n***return output***\n",self.output
            return "\n".join(self.output)


    def html2txt(contents_of_file):

        """ This function takes the contents in html format and returns the text equivalent."""

        #global output
        #output = []
        textcontent = None

        if debug:
            print "############################", contents_of_file
            print "!!!!!!!!!!!!!!!!!!!!!!!!!!!!"

        contents_of_file = re.sub("<[a|A] ","__LNK_STRT__<a ",contents_of_file)
        contents_of_file = re.sub("</[a|A]>","__LNK_END__",contents_of_file)

        if debug:
            print "\nafter inserting LNK_STR and before HTMLConverter\n",contents_of_file

        p = HTMLConverter()

        p.feed(contents_of_file)
        # close() flushes any pending text, so call it before reading the output
        p.close()

        textcontent = p.return_output()
        title = p.get_title()

        if debug:
            print "@@@@@@@@@@@@@@@@@@@@@@@@@@@@", textcontent
            print "^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^"

        #textcontent = RemoveUnnecessaryLinks(textcontent)
        return (textcontent,title)


    def main(url):
        import urllib
        html = urllib.urlopen(url).read()
        uhtml = unicode(html, 'iso-8859-1')
        txt, title = html2txt(uhtml)
        print txt


    if __name__ == '__main__':
        main("http://www.hindinest.com/")
    [/code]
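The same decode-first idea can be sketched for Python 3, where sgmllib no longer exists and `html.parser.HTMLParser` is the standard-library replacement. This is only an illustration under my own assumptions: the class name `TextExtractor`, the helper names, the list of block tags, and the encoding fallback order are all made up here, and the sample HTML is invented rather than taken from the sites above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect running text, starting a new chunk at block-level tags."""

    BLOCK_TAGS = {"title", "p", "br", "hr", "div", "li", "td", "th",
                  "dt", "dd", "pre", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.chunks = []
        self.buffer = []

    def flush(self):
        # Collapse whitespace runs and store the chunk if anything is left.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.chunks.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self.buffer.append(data)

    def close(self):
        super().close()
        self.flush()


def decode_html(raw, encodings=("utf-8", "iso-8859-1")):
    """Decode bytes to text before parsing, trying each encoding in turn."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


def html_to_text(raw):
    parser = TextExtractor()
    parser.feed(decode_html(raw))
    parser.close()
    return "\n".join(parser.chunks)


raw = ("<html><body><h1>\u0939\u093f\u0928\u094d\u0926\u0940</h1>"
       "<p>first &amp; second</p></body></html>").encode("utf-8")
text = html_to_text(raw)
print(text)
```

One caveat on the fallback: ISO-8859-1 accepts every byte value, so decoding with it never raises an error, but the resulting characters are only right if the page really is Latin-1.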


    [size=5][italic][blue][RED]i[/RED]nfidel[/blue][/italic][/size]

