Searching idea

jtmuzix Member Posts: 15
Hello,
I'm a digital subscriber to Scientific American, and I have downloaded every issue that has been in print from their website. Each issue is its own PDF file. I'd like to write a program that searches these PDF files for keywords and returns the names of the files that match. This is probably more complex than I imagine, especially since there is no DBMS involved. Is something like this feasible, or would it take too much CPU time without a DBMS and an index or primary key? If it is feasible, could you give me a rough idea of how to go about it? I also wonder whether there are applications that can already do this if you just point them at a directory full of PDF files. Any information would be great.
Thanks in advance.
Jason T.

Comments

  • ultimage Member Posts: 119
    This message was edited by ultimage at 2005-5-4 10:45:25

    First, I have a question for you: are these PDF files text-searchable?

    If they are, I would recommend this:

    Create a database and use VB.NET's built-in SQL functionality to store EVERY word found in each PDF file, along with the files that contain it. Building this index will probably take a while, since every file has to be parsed and every word stored, but the keyword searches you run afterwards won't touch the files themselves, just the database. Exactly how to do the parsing is beyond me at the moment, but I'm sure you will need some way to parse each PDF file one at a time. You may also want to store only capitalized words (acronyms) and words longer than 3 characters. What I mean is, don't store words like: a, an, the, them, you, who, what, etc. (Note that "them" and "what" are longer than 3 characters, so a length rule alone won't drop them; an explicit list of words to skip would.)
    But do store words like ACLU, PDF, TXT, etc.
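
A minimal sketch of the word-to-file index described above, written in Python with the standard-library sqlite3 module rather than VB.NET. It assumes the text of each PDF has already been extracted by some other tool (extraction is not shown here). The table name word_index, the function names, and the STOP_WORDS list are all hypothetical, chosen only for illustration.

```python
import re
import sqlite3

# Hypothetical list of common words to skip; extend as needed.
STOP_WORDS = {"a", "an", "the", "them", "you", "who", "what"}


def keep_word(word):
    """Keep all-caps acronyms, plus longer words that are not stop words."""
    if word.isupper() and len(word) > 1:
        return True  # e.g. ACLU, PDF, TXT
    return len(word) > 3 and word.lower() not in STOP_WORDS


def build_index(conn, texts):
    """Store (word, filename) pairs.

    texts maps each filename to the already-extracted text of that PDF.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_index ("
        "word TEXT NOT NULL, filename TEXT NOT NULL, "
        "PRIMARY KEY (word, filename))"
    )
    for filename, text in texts.items():
        # A set, so each word is stored once per file, lowercased.
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", text) if keep_word(w)}
        conn.executemany(
            "INSERT OR IGNORE INTO word_index (word, filename) VALUES (?, ?)",
            [(w, filename) for w in words],
        )
    conn.commit()


def search(conn, keyword):
    """Return the sorted filenames whose text contains the keyword."""
    rows = conn.execute(
        "SELECT filename FROM word_index WHERE word = ? ORDER BY filename",
        (keyword.lower(),),
    )
    return [row[0] for row in rows]
```

The slow part (reading every file) happens once in build_index; after that, search only queries the database, and the primary key on (word, filename) lets SQLite look a word up without scanning the whole table.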

    If this helps, let me know. If not, then I'm sorry I wasted your time.

    Ultimage
    ultimage@insightbb.com


