parse HTML to Document - Programmers Heaven

parse HTML to Document

Josh Code, Posts: 675, Member
Parsing XML using the standard API is easy enough, but parsing the virtually unstandardized HTML from most websites is a challenge.

Do you know how to parse most websites, including ones with coding errors, into an org.w3c.dom.Document object?


Comments

  • zibadian, Posts: 6,349, Member
    : Parsing XML using the standard API is easy enough, but parsing the virtually unstandardized HTML from most websites is a challenge.
    :
    : Do you know how to parse most websites, including ones with coding errors, into an org.w3c.dom.Document object?
    You need to parse it as if it conformed to standard HTML. When the parser hits an error, the code has to decide what to do based on the severity of the error and on the parser's current state. For example:
    [code]
    Bad HTML Example Still as bad
    Continuing example
    [/code]
    Here the parser should ignore the stray closing italic tag (unless the text is italic, of course) and continue to treat "Still as bad" as bold. The next tag should be treated as being preceded by the missing closing bold tag. In other words, the parser should treat the input as if it were this correct HTML:
    [code]
    Bad HTML Example Still as bad
    Continuing example
    [/code]
    Because of this error-handling decision tree, browsers are quite slow at parsing.
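The two recovery rules described in the reply above (ignore a closing tag that matches nothing currently open, and treat an open inline tag as closed when the next block starts) can be sketched in Java as a very small tag-soup reader that builds an org.w3c.dom.Document. This is only an illustration under stated assumptions: the class name `LenientHtmlSketch`, its tag lists, and the sample input are hypothetical, and the markup in the sample (`<b>`, `</i>`, `<p>`) is a guess at what the thread's example looked like before its tags were lost. Attributes, entities, comments, and scripts are ignored, so a real crawler would want a proper lenient HTML parser instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of any library: a deliberately tiny tag-soup reader.
class LenientHtmlSketch {
    // Matches an opening or closing tag; attributes are matched but discarded.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");
    // Assumed, incomplete tag classifications used only for this sketch.
    private static final List<String> INLINE = Arrays.asList("b", "i", "u", "em", "strong", "span", "a");
    private static final List<String> BLOCK = Arrays.asList("p", "div", "ul", "ol", "li", "table", "h1", "h2", "h3");
    private static final List<String> VOID = Arrays.asList("br", "hr", "img", "input", "meta", "link");

    static Document parse(String html) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException("No DOM implementation available", ex);
        }
        Element root = doc.createElement("body"); // synthetic root for all content
        doc.appendChild(root);
        Element current = root;                   // innermost element still open
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            appendText(doc, current, html.substring(pos, m.start()));
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (!m.group(1).isEmpty()) {
                // Closing tag: look for a matching open ancestor.
                Node n = current;
                while (n != root && !n.getNodeName().equals(name)) {
                    n = n.getParentNode();
                }
                if (n != root) {
                    current = (Element) n.getParentNode(); // closes it and anything nested inside
                }
                // Otherwise nothing matches: the stray closing tag is ignored.
            } else {
                if (BLOCK.contains(name)) {
                    // A new block implicitly closes inline elements left open.
                    while (current != root && INLINE.contains(current.getNodeName())) {
                        current = (Element) current.getParentNode();
                    }
                }
                Element el = doc.createElement(name);
                current.appendChild(el);
                if (!VOID.contains(name)) {
                    current = el; // void elements such as <br> never get children
                }
            }
        }
        appendText(doc, current, html.substring(pos)); // trailing text; open tags close at end of input
        return doc;
    }

    private static void appendText(Document doc, Element parent, String text) {
        if (!text.isEmpty()) {
            parent.appendChild(doc.createTextNode(text));
        }
    }

    public static void main(String[] args) {
        Document d = parse("<b>Bad HTML Example</i> Still as bad<p>Continuing example");
        // The stray </i> is dropped, so both text runs stay inside <b>.
        System.out.println(d.getElementsByTagName("b").item(0).getTextContent());
    }
}
```

With the assumed sample input, the resulting tree is a `body` root holding a `b` element followed by a sibling `p` element, which is the "as if it were correct HTML" reading from the reply. The decision logic lives in the two branches of the loop, which is where a fuller parser would grow the error-handling decision tree mentioned above.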