I get a Unicode error while extracting running text from a Hindi (Devanagari) HTML page. Can anybody help me here?
code...........
import sys, re
from sgmllib import SGMLParser, SGMLParseError
def handle_data(self, x):
  try:
    x = unicode(x)
    print "\n*****unicode***\n", x
  except UnicodeError:
    print "unicode error"
    return
  if not self.ignore:
    self.s += x
  return
Please reply.
Comments
Without knowing what page it is you're trying to extract, or what data is being passed to the handle_data() function, I can't help very much. Try changing the except block to this:
[code]
except UnicodeError, ex:
    print ex
    return
[/code]
See if that provides any more useful information than just "unicode error".
[size=5][italic][blue][RED]i[/RED]nfidel[/blue][/italic][/size]
[code]
$ select * from users where clue > 0
no rows returned
[/code]
First of all, thanks for responding.
I have changed the code as you said:
code..........
except UnicodeError, ex:
  print ex
  return
It now gives this message:
'ascii' codec can't decode byte 0x96 in position 35: ordinal not in range(128)
I am trying to extract Hindi text from an HTML page (URL). textcontent (i.e. the output of html2txt) is built from the data passed to handle_data().
code......
def html2txt(contents_of_file):
  p = HTMLConverter()
  try:
    p.feed(contents_of_file)
  except UnicodeError:
    print >>sys.stderr, "warning: skipped:"
  except SGMLParseError:
    print >>sys.stderr, "SGMLparse error"
  textcontent = p.return_output()
  title = p.get_title()
  p.close()
Please reply, or ask if you have any further questions.
[red]
'ascii' codec can't decode byte 0x96 in position 35: ordinal not in range(128)
It's self-explanatory: the data is being decoded with the ASCII codec, which only covers byte values 0-127. Your input contains bytes outside that range (here 0x96), hence the "ordinal not in range(128)".
As for how to decode it with a codec that does cover your text, well, I'm no Python buff and I need to read up on the library, so let infidel answer that.
{2}rIng
[/red]
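For anyone reading along, the error can be reproduced in a few lines. This is only a sketch in Python 3 syntax (the thread's code is Python 2, where unicode(x) with no encoding argument does the same ASCII decode). The bytes in raw are made up for illustration, and cp1252 is just one codec that happens to define byte 0x96; it is not a claim about what the Hindi pages actually use.

```python
# Hypothetical data: 35 ASCII bytes followed by the offending byte 0x96.
raw = b"A" * 35 + b"\x96"

try:
    raw.decode("ascii")          # ASCII only defines byte values 0-127
    message = None
except UnicodeDecodeError as ex:
    message = str(ex)            # same wording as the message in the thread

print(message)

# Naming a codec that defines 0x96 avoids the error.
# cp1252 is an assumption for the demo, not the pages' known encoding.
text = raw.decode("cp1252")
print(repr(text[-1]))
```

The point is that the fix is not "switching to Unicode" as such, but telling Python which encoding the incoming bytes are in.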
I know it is not able to decode the Unicode text. What is a codec? And what is the entitydefs below?
Can anybody help me overcome this Unicode problem, please?
entitydefs = {
  'nbsp':u' ', 'thinsp':u' ', 'emsp':u' ', 'ensp':u' ',
  'amp':u'&', 'lt':u'<', 'gt':u'>', 'quot':u'"', 'apos':u"'"
}
Codec is a portmanteau of coder-decoder (or compressor-decompressor): in the first sense it encodes and decodes data, in the second it compresses and decompresses it. It's probably the first sense here, since a character codec converts between raw bytes and characters. I can't really tell you what's going on in your code, though, because in the languages I mostly use (C/C++) characters are just numbers anyway and it only matters how you interpret them.
As for entitydefs, it looks like a table of HTML entities, right? Only some seem to be missing... or are they?
You must know HTML entities (since you're doing HTML processing), like &nbsp; and &apos;.
Anyhow, as for how to get a Unicode-capable codec, the only thing I know is that there's a unicode package or something in the library that you can use, but for specifics, wait for infidel.
I've changed the title so it'll get his attention.
{2}rIng
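To make the two ideas above concrete, here is a small sketch in Python 3 syntax (an assumption on my part; the thread itself uses Python 2). It shows a codec turning text into bytes and back, and the standard library's entity table, which plays the same role as the hand-written entitydefs dictionary. The sample word is chosen only for illustration.

```python
import codecs
from html.entities import name2codepoint

# "Hindi" written in Devanagari, spelled with escapes to keep the source ASCII.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

encoded = word.encode("utf-8")       # codec as encoder: characters -> bytes
decoded = encoded.decode("utf-8")    # codec as decoder: bytes -> characters
codec_name = codecs.lookup("utf8").name   # canonical name of the codec

# An entity table maps names such as "nbsp" to the character they stand for.
nbsp = chr(name2codepoint["nbsp"])
amp = chr(name2codepoint["amp"])

print(len(word), len(encoded), codec_name, repr(nbsp), amp)
```

Each Devanagari code point here takes three bytes in UTF-8, which is why the byte string is three times as long as the text.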
LOL. I've been in training this week, so I haven't been able to respond. If I get a chance I'll try, but otherwise it might have to wait until next week before I can look at it. It would really help to have the URL of the page you are trying to process, so I can see what characters are causing the problem.
[size=5][italic][blue][RED]i[/RED]nfidel[/blue][/italic][/size]
These are some URLs:
http://www.hindinest.com/
http://www.hindimilap.com/
http://www.radioaustralia.net.au/australia/hd/
code..........
import sys, re
from sgmllib import SGMLParser, SGMLParseError
debug = 1;
class HTMLConverter(SGMLParser):
  RE1 = re.compile(u"([\u0900-\u0dff])[\n]+([\u0900-\u0dff])")
  RE2 = re.compile(r"[\n\s]+")
  entitydefs = {
    'nbsp':u' ', 'thinsp':u' ', 'emsp':u' ', 'ensp':u' ',
    'amp':u'&', 'lt':u'<', 'gt':u'>', 'quot':u'"', 'apos':u"'"
  }
  def __init__(self):
    SGMLParser.__init__(self)
    self.ignore = 0
    self.pre = 0
    self.is_title = 0
    self.title = u""
    self.title_not_processed = 1
    self.s = u""
    self.output = []
    return
  def close(self):
    try:
      SGMLParser.close(self)
    except SGMLParseError:
      print "SGMLParseError occurred"
    self.newline()
    return
  def convstr(self, s):
    # remove all newlines between two zenkaku characters.
    s = HTMLConverter.RE1.sub(r"\1\2", s)
    # print "\n****after removing newline b/w 2 zenkaku character****\n",s
    # replace all contiguous blanks into a single space.
    s = HTMLConverter.RE2.sub(r" ", s)
    return s.strip()
  def handle_data(self, x):
    try:
      x=unicode(x)
      # print "\n*****unicode***\n",x
    except UnicodeError:
      return
    if not self.ignore:
      self.s += x
      # print "\n$$$$$$$$$$$$$$$$$$$handle data$$$$$$$$$$$$$$$\n",self.s
    return
  def newline(self, attrs=[]):
    if self.title_not_processed:
      self.process_title(self.convstr(self.title))
      self.title_not_processed = 0
    if self.s:
      if self.pre:
        self.process_text(self.s)
      else:
        s = self.convstr(self.s)
        if s: self.process_text(s)
      self.s = u""
    return
  def begin_ignore(self, attrs):
    self.ignore += 1
    return
  def end_ignore(self):
    self.ignore -= 1
    return
  # uncomment if you want to extract formatted texts as they are.
  def start_pre(self, attrs):
    self.pre = 1
    self.newline()
    return
  def end_pre(self):
    self.newline()
    self.pre = 0
    return
  def start_title(self, attrs):
    self.is_title = 1
    return
  def end_title(self):
    self.is_title = 0
    self.title = self.s
    self.s = ""
    return
  def get_title(self):
    doc_title = self.title
    return doc_title.strip()
  start_body = newline
  start_p = end_p = newline
  do_br = newline
  do_hr = newline
  start_th = end_th = newline
  start_td = end_td = newline
  start_li = end_li = newline
  start_dt = end_dt = newline
  start_dd = end_dd = newline
  start_h1 = end_h1 = newline
  start_h2 = end_h2 = newline
  start_h3 = end_h3 = newline
  start_h4 = end_h4 = newline
  start_h5 = end_h5 = newline
  start_h6 = end_h6 = newline
  start_pre = end_pre = newline
  start_div = end_div = newline
  def process_title(self, t):
    self.output.append(t)
    return
  def process_text(self, s):
    self.output.append(s)
    # print "\n***processing text***\n",s,"\nsself&&&&&&&\n",self
    return
  def return_output(self) :
    # print "\n***return output***\n",self.output
    return "\n".join(self.output)
def html2txt( contents_of_file) :
  """ This function takes the contents in html format and returns the text equivalent."""
  #global output
  #output = []
  textcontent = None
  if debug :
    print "############################",contents_of_file
    print "!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
  contents_of_file = re.sub("<[a|A] ","__LNK_STRT__<a ",contents_of_file);
  contents_of_file = re.sub("</[a|A]>","__LNK_END__",contents_of_file);
  print "\nafter insertin LNK_STR and befor HTMLConverter\n",contents_of_file
  p = HTMLConverter()
  try:
    p.feed(contents_of_file)
  except UnicodeError:
    print >>sys.stderr, "warning: skipped:"
  except SGMLParseError :
    print >>sys.stderr, "SGMLparse error"
  textcontent = p.return_output()
  title = p.get_title()
  p.close()
  if debug :
    print "@",textcontent;
    print "^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^"
  textcontent = RemoveUnnecessaryLinks(textcontent);
  return (textcontent,title)
code..................
Ok, I think I got it working. The code that follows has a few minor changes from yours. First, I took out all of the exception handlers so I could see exactly where things were breaking. I also re-indented it because two-space indents annoy me :-)
I couldn't figure out how to change the codec of the SGML parser, so I used the unicode() function on the HTML before feeding it to the class (see function main() below). I also used the Latin-1 encoding (ISO-8859-1) because using UTF-8 gave me errors.
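A side note on why Latin-1 "works" where UTF-8 gives errors, sketched in Python 3 syntax (the code in this thread is Python 2). Latin-1 assigns a character to every byte value, so decoding with it can never raise; that does not by itself mean the result is the right text. The byte strings below are invented for the demonstration and are not taken from the sample sites.

```python
# Hypothetical bytes that are not valid UTF-8 (0xE9 and 0x96 are stray bytes).
raw = b"caf\xe9 \x96 text"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False              # UTF-8 rejects malformed sequences

# Latin-1 maps every byte 0-255 to a code point, so this never raises
# and always round-trips back to the same bytes.
latin1_text = raw.decode("iso-8859-1")
round_trip = latin1_text.encode("iso-8859-1") == raw

# The flip side: UTF-8 encoded Devanagari decoded as Latin-1 gives
# three unrelated characters instead of one letter, and no error.
devanagari_bytes = "\u0939".encode("utf-8")
wrong = devanagari_bytes.decode("iso-8859-1")

print(utf8_ok, round_trip, len(wrong))
```

So if the output looks like accented Latin letters rather than Devanagari, the page is probably in some other encoding, and the charset declared in the page or its HTTP headers is worth checking.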
[code]
##Sample URLs:
## http://www.hindinest.com/
## http://www.hindimilap.com/
## http://www.radioaustralia.net.au/australia/hd/
import sys, re
from sgmllib import SGMLParser, SGMLParseError
debug = 0;
class HTMLConverter(SGMLParser):
    RE1 = re.compile(u"([\u0900-\u0dff])[\n]+([\u0900-\u0dff])")
    RE2 = re.compile(r"[\n\s]+")
    entitydefs = {
        'nbsp':u' ', 'thinsp':u' ', 'emsp':u' ', 'ensp':u' ',
        'amp':u'&', 'lt':u'<', 'gt':u'>', 'quot':u'"', 'apos':u"'"
    }
    def __init__(self):
        SGMLParser.__init__(self)
        self.ignore = 0
        self.pre = 0
        self.is_title = 0
        self.title = u""
        self.title_not_processed = 1
        self.s = u""
        self.output = []
        return
    def close(self):
        try:
            SGMLParser.close(self)
        except SGMLParseError, ex:
            print ex
            raise
        self.newline()
        return
    def convstr(self, s):
        # remove all newlines between two zenkaku characters.
        s = HTMLConverter.RE1.sub(r"\1\2", s)
        #print "\n****after removing newline b/w 2 zenkaku character****\n",s
        # replace all contiguous blanks into a single space.
        s = HTMLConverter.RE2.sub(r" ", s)
        return s.strip()
    def handle_data(self, x):
        if not self.ignore:
            self.s += x
        return
    def newline(self, attrs=[]):
        if self.title_not_processed:
            self.process_title(self.convstr(self.title))
            self.title_not_processed = 0
        if self.s:
            if self.pre:
                self.process_text(self.s)
            else:
                s = self.convstr(self.s)
                if s: self.process_text(s)
            self.s = u""
        return
    def begin_ignore(self, attrs):
        self.ignore += 1
        return
    def end_ignore(self):
        self.ignore -= 1
        return
    # uncomment if you want to extract formatted texts as they are.
    def start_pre(self, attrs):
        self.pre = 1
        self.newline()
        return
    def end_pre(self):
        self.newline()
        self.pre = 0
        return
    def start_title(self, attrs):
        self.is_title = 1
        return
    def end_title(self):
        self.is_title = 0
        self.title = self.s
        self.s = ""
        return
    def get_title(self):
        doc_title = self.title
        return doc_title.strip()
    start_body = newline
    start_p = end_p = newline
    do_br = newline
    do_hr = newline
    start_th = end_th = newline
    start_td = end_td = newline
    start_li = end_li = newline
    start_dt = end_dt = newline
    start_dd = end_dd = newline
    start_h1 = end_h1 = newline
    start_h2 = end_h2 = newline
    start_h3 = end_h3 = newline
    start_h4 = end_h4 = newline
    start_h5 = end_h5 = newline
    start_h6 = end_h6 = newline
    start_pre = end_pre = newline
    start_div = end_div = newline
    def process_title(self, t):
        self.output.append(t)
        return
    def process_text(self, s):
        self.output.append(s)
        # print "\n***processing text***\n",s,"\nsself&&&&&&&\n",self
        return
    def return_output(self) :
        # print "\n***return output***\n",self.output
        return "\n".join(self.output)
def html2txt( contents_of_file) :
    """ This function takes the contents in html format and returns the text equivalent."""
    #global output
    #output = []
    textcontent = None
    if debug:
        print "############################", contents_of_file
        print "!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
    contents_of_file = re.sub("<[a|A] ","__LNK_STRT__<a ",contents_of_file);
    contents_of_file = re.sub("</[a|A]>","__LNK_END__",contents_of_file);
    if debug:
        print "\nafter inserting LNK_STR and before HTMLConverter\n",contents_of_file
    p = HTMLConverter()
    p.feed(contents_of_file)
    textcontent = p.return_output()
    title = p.get_title()
    p.close()
    if debug:
        print "@", textcontent;
        print "^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^"
    #textcontent = RemoveUnnecessaryLinks(textcontent);
    return (textcontent,title)
def main(url):
    import urllib
    html = urllib.urlopen(url).read()
    uhtml = unicode(html, 'iso-8859-1')
    txt, title = html2txt(uhtml)
    print txt
if __name__ == '__main__':
    main("http://www.hindinest.com/")
[/code]
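For readers on Python 3, where sgmllib no longer exists, the same decode-first idea can be sketched with html.parser from the standard library. This is not a port of the class above, only a minimal illustration: TextExtractor, BLOCK_TAGS and html_to_text are names made up here, the tag list is abbreviated, and the caller is assumed to know the page's encoding.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect page text, starting a new chunk at block-level tags (hypothetical helper)."""
    BLOCK_TAGS = {"p", "br", "hr", "div", "td", "th", "li", "h1", "h2", "h3", "pre"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []     # finished lines of text
        self.current = []    # pieces of the line being built

    def flush(self):
        # Collapse runs of whitespace and store the line if it is not empty.
        text = " ".join("".join(self.current).split())
        if text:
            self.chunks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self.current.append(data)

def html_to_text(raw_bytes, encoding):
    """Decode the bytes first, then feed text (never raw bytes) to the parser."""
    parser = TextExtractor()
    parser.feed(raw_bytes.decode(encoding))
    parser.close()
    parser.flush()
    return "\n".join(parser.chunks)

# Made-up sample page, UTF-8 encoded for the demonstration.
sample = "<p>\u0928\u092e\u0938\u094d\u0924\u0947</p><p>Hello &amp; welcome</p>".encode("utf-8")
result = html_to_text(sample, "utf-8")
print(result)
```

As in main() above, the decoding happens once, before feed(), so handle_data() only ever sees text and needs no conversion of its own.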
[size=5][italic][blue][RED]i[/RED]nfidel[/blue][/italic][/size]