How to read SGML files using Java

I'm doing my project on Text Categorization.I've got a text categorisation test collection called Reuters-21578 for my Information Retrieval project. It is distributed in 22 files. Each of the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents. The files are in SGML format. Each of the 22 files begins with a document type declaration line:
<!DOCTYPE lewis SYSTEM "lewis.dtd"> The DTD file lewis.dtd is included in the distribution. Following the document type declaration line are individual Reuters articles marked up with SGML tags.

I need help to write a java program to read those 21578 documents or transform them into 21578 seperated text files.



  • how far are you with the implementation? as far as i know sgml is a readable format. so you can use the java IO classes to read and write the documents.
Sign In or Register to comment.

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!


In this Discussion