parse HTML to Document

Josh Code Posts: 675 Member
Parsing XML using the standard API is easy enough, but parsing the virtually unstandardized HTML found on most websites is a challenge.

Do you know how to parse most websites, including ones with coding problems, into an org.w3c.dom.Document object?


Comments

  • zibadian Posts: 6,349 Member
    : Parsing XML using the standard API is easy enough, but parsing the virtually unstandardized HTML found on most websites is a challenge.
    :
    : Do you know how to parse most websites, including ones with coding problems, into an org.w3c.dom.Document object?
    You need to parse it as if it conforms to standard HTML. When the parser hits an error, the code needs to decide what to do based on the severity of the error and on the parser's current state. For example:
    [code]
    Bad HTML Example Still as bad
    Continuing example
    [/code]
    The parser should ignore the stray closing italic tag (unless the text is italic, of course) and continue to treat "Still as bad" as bold. The next tag should be treated as being preceded by a closing bold tag. In correct HTML this code would be written as:
    [code]
    Bad HTML Example Still as bad
    Continuing example
    [/code]
    Because every error has to go through this kind of decision tree, browser HTML parsers tend to be considerably more complex, and slower, than a strict XML parser.
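
The approach described above (parse as if the HTML were valid and recover from errors along the way) can be sketched with classes that ship with the JDK. This is only a minimal sketch under some assumptions: it leans on the lenient Swing HTML parser (javax.swing.text.html.parser.ParserDelegator), which does its own error recovery but is based on an old HTML DTD, so it may not cope well with modern pages. The class name LenientHtmlToDom and the method parse are made up for this example, attributes are not copied, and unknown tags whose names are not valid XML names would make createElement throw.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper: builds an org.w3c.dom.Document from sloppy HTML
// by replaying the events of the JDK's lenient Swing HTML parser.
public class LenientHtmlToDom {

    public static Document parse(String html) throws Exception {
        final Document doc =
            DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        // Stack of currently open nodes; the Document itself sits at the bottom.
        final Deque<Node> open = new ArrayDeque<>();
        open.push(doc);

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // A Document may hold only one root element, so anything that
            // arrives at document level after the root exists goes into the root.
            private Node parent() {
                Node top = open.peek();
                return (top == doc && doc.getDocumentElement() != null)
                    ? doc.getDocumentElement() : top;
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                Node element = doc.createElement(tag.toString());
                parent().appendChild(element);
                open.push(element);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                // Never pop the Document itself, even if end tags are unbalanced.
                if (open.size() > 1) {
                    open.pop();
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Skip events flagged as the end tag of an unknown element.
                if (attrs.isDefined(HTML.Attribute.ENDTAG)) {
                    return;
                }
                parent().appendChild(doc.createElement(tag.toString()));
            }

            @Override
            public void handleText(char[] data, int pos) {
                Node p = parent();
                // Text is not allowed directly under a Document, so drop it there.
                if (p != doc) {
                    p.appendChild(doc.createTextNode(new String(data)));
                }
            }
        };

        // "true" tells the parser to ignore charset declarations in the markup.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<b>Bad HTML Example</i> Still as bad<p>Continuing example");
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The resulting tree depends on how the Swing parser chooses to repair the input, so it will not necessarily match what a browser builds from the same page. For real-world sites a dedicated lenient HTML parser library is probably a better fit. The callback-to-DOM wiring shown here would stay roughly the same.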