Removing HTML tags

tsholo Posts: 11 Member
This message was edited by tsholo at 2006-7-27 6:23:49
Hi,

I am trying to remove the HTML tags from index.html.

It mostly works, but my code does not remove comments, links, and other irrelevant content.

Here is my code:

[grey]
#!/usr/bin/python
import re
from urllib import urlopen

# Download the page (Python 2: urlopen lives in urllib)
page = urlopen("http://www.ee.uct.ac.za").read()

# Save a copy of the raw HTML; close the file so the data is flushed
myfile = open('testfile.txt', 'w')
myfile.write(page)
myfile.close()

# Remove HTML tags, except those starting with "as", "/a" or "!"
text = re.sub('<(?!(?:as|/a|!))[^>]*>', '', page)
print text
[/grey]


And this is what I get:

[grey]










Welcome - Department of Electrical Engineering









Department of Electrical Engineering | University of Cape Town

























Department of Electrical Engineering







Welcome to the Department of Electrical Engineering.





Undergraduate Programmes: The basic undergraduate degree is a four year program. It allows a number of

specializations. These include

Electronic Engineering,

Power Engineering, and

Computer Engineering



There are two additional four year programmes that are combinations of

Electrical and Mechanical Engineering courses.

Electro-mechanical Engineering (under Mechanical Engineering) and

Mechatronics Engineering

(under Electrical Engineering)



Postgraduate Research: The Department has active research programmes in many areas of electrical and electronic engineering.



Geography: The University of Cape Town is situated in Cape Town, a city at

the Southern tip of Africa. Our department is on the upper campus at the foot of Devil's

Peak which is relatively close to Table Bay. The department is

situated in University Avenue on the Groote Schuur

campus and occupies space in the old Engineering Building with all laboratory facilities

located in new building extensions.



History: The University of Cape Town evolved from the south African College founded in 1829.

The faculty of engineering at the South African College was created in 1903.

UCT was given independence by an act on parliament in 1918.

The first engineering graduates with University of Cape Town

degrees were in the same year.









































Thinking of studying Electrical Engineering?













Find out more about what Electrical Engineering is all about.









































Electrical Engineering















People

Vacant Posts

Research

Courses

Contacts

Sponsors

Useful Links

























University of Cape Town















University of Cape Town

Engineering and Built Environment

Monday Paper (news)



























Today's date is

Thu, 27 Jul 2006





 





[/grey]


Can someone please help me?

Comments

  • infidel Posts: 2,900 Member
    This message was edited by Moderator at 2006-7-27 10:27:28

    I'm not good yet with complex regex patterns like the one you have there, but clearly you need to either add more complexity to that regex or add at least one more regex substitution to catch what this one misses.
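For what it's worth, the "more than one substitution" idea could look roughly like the sketch below. It assumes Python 3 and a page already read into a string; the names `COMMENT_RE`, `SCRIPT_STYLE_RE`, `TAG_RE` and `strip_html` are made up for illustration and are not from the original post. Regexes will not cope with every malformed page, so treat this as a starting point rather than a complete answer.

```python
import re

# Illustrative patterns only; they are not a full HTML grammar.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE
)
TAG_RE = re.compile(r"<[^>]*>")


def strip_html(page):
    """Remove comments, then script/style blocks, then any remaining tags."""
    text = COMMENT_RE.sub("", page)       # comments first, they may contain '>'
    text = SCRIPT_STYLE_RE.sub("", text)  # drop the code or CSS between the tags too
    text = TAG_RE.sub("", text)           # finally every leftover <...> tag
    return text


if __name__ == "__main__":
    sample = ('<p>Hello <!-- hidden --><a href="x.html">world</a></p>'
              '<script>var a = 1;</script>')
    print(strip_html(sample))  # prints: Hello world
```

Note that this removes the link tags but keeps the link text ("world" above). If the link text should go as well, that would need yet another pattern.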

    Or you could use an XML parser and just output the text parts. If the HTML might not be well-formed XML, though, there is a class called BeautifulSoup that can parse pages that contain mistakes: http://www.crummy.com/software/BeautifulSoup/download/ (though I am unable to access that page currently).
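As a rough illustration of the parser route, here is a sketch that swaps in Python 3's standard-library `html.parser` for BeautifulSoup, purely so that nothing has to be installed. The class name `TextExtractor` and the function `html_to_text` are hypothetical. The parser sends comments to a separate callback that is not overridden here, so they never end up in the output, and the contents of `<script>` and `<style>` are skipped by hand.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text fragments, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments go to handle_comment (left as a no-op), so they never arrive here.
        fragment = data.strip()
        if self._skip_depth == 0 and fragment:
            self._chunks.append(fragment)

    def text(self):
        # One text fragment per line.
        return "\n".join(self._chunks)


def html_to_text(page):
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return parser.text()
```

Calling `html_to_text(page)` on the downloaded string returns each piece of visible text on its own line, which also gets rid of the long runs of blank lines left behind by a plain tag-stripping regex.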





  • PatrickMc2008 Posts: 11 Member
    I use the sal command in biterScripting as follows. Let's assume the name of the file is xyz.html and I want output into xyz.txt.

    # Read file from .html
    var str content
    cat xyz.html > $content

    # Remove all <>.
    while ( { sen -r "^<&>^" $content } > 0 )
    sal -r "^<&>^" "" $content

    # Write updated content to .txt
    repro $content > xyz.txt

    Works for me!

    Patrick
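For anyone who wants to stay in Python, the script above could be approximated as follows. This is only a sketch under two assumptions: that the `^<&>^` pattern is meant to match each span from `<` to the next `>`, and that Python 3 is available. The function names `strip_angle_spans` and `convert` are made up here. Squeezing runs of blank lines is an extra step that the script above does not do; it is added because tag stripping leaves a lot of empty lines behind.

```python
import re
from pathlib import Path


def strip_angle_spans(text):
    """Delete every '<...>' span, then squeeze runs of blank lines."""
    no_tags = re.sub(r"<[^>]*>", "", text)
    kept = []
    for line in (raw.rstrip() for raw in no_tags.splitlines()):
        # Keep a blank line only if the previously kept line was not blank.
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip() + "\n"


def convert(src, dst):
    """Read HTML from src and write the stripped text to dst."""
    html = Path(src).read_text(encoding="utf-8", errors="replace")
    Path(dst).write_text(strip_angle_spans(html), encoding="utf-8")
```

With the file names used above, the call would be `convert("xyz.html", "xyz.txt")`. Like the original script, this does nothing special for comments or script contents.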
