Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Categories

Fuzzy Matching in Access

OpnSeasonOpnSeason Member Posts: 12
[b][red]This message was edited by OpnSeason at 2006-5-11 2:37:41[/red][/b][hr]
[b][red]This message was edited by OpnSeason at 2006-3-17 17:22:45[/red][/b][hr]
Hello All,

For those who are interested in Approximate String Matching or those who could use these algorithms; I have a complete suite of Approximate String Matching algorithms written in Visual Basic in an Access database.

In 2004 I decided to jump into the world of Fuzzy Matching with both feet.

As it is, I am working for a company that deals with names, addresses, etc. very intensely. It is a fair sized company that
uses Access on a grand scale. Since I am an Access programmer, I work in an Access gold mine!

I knew that if I could get a good handle on Fuzzy Matching, that when I hit the right person at the right time, the company could greatly benefit from my research on Fuzzy Matching. The right time and the right person are not here yet.

Nevertheless, since I have reaped much free source code and information from the Web, it is now time to return the favor.

I developed a package that is sort of a demo/tutorial on Approximate String Matching algorithms in Access that is very
robust in Fuzzy Matching. It would overtax the post in this forum for me to include it in a post.

To summarize, it works with the basic name - Last, First, and Middle. It has a user interface that allows a user to type in
what would be a good name and what would be a questionable name to resemble the good name. The weighted results of all the various algorithms can be chosen, or an individual algorithm can be chosen to display how closely the names match.

In addition, it has a table of 17,295 known good names with unique ID numbers as a reference table, and table of 1200
morphed names that are typical of names entered in a database with no input conventions. These morphed names have typos, transpositions, variations on maiden names, etc. 1200 good names were selected for alteration and the unique ID of each original good name was stored in the table with the altered names to determine the accuracy of the matching process.

The morphed names were compared to the known good names in a query with an approximate join using the suite of algorithms to determine match percentage. The altered names, the ID number of the original good name, the ID number of the name it matched to, and the match percentage were stored in a results table to determine the results of the matching run.

These tables were used to test and tweak the algorithms by comparing the morphed names with the known good names. The results of 1322 names were saved to a results table with match scores.

The matching process was executed in a query with an approximate join using the suite of algorithms.


The match results:

Total Approximate Matches: 1188
(Recall) Precision Pct: 99.00%

Total Unmatched Names: 12
Unmatched Pct: 1.00%

Total Other Matches: 134
Other Matches Pct: .77%


The tables are accessible in the database, so anyone can run their own tests. The interface is set up to accommodate this
as well.

The algorithms used: Dice coefficient as a threshold algorithm, Levenshtein Distance algorithm, Longest Common Subsequence, and the DoubleMetaphone. The names were passed to the algorithms by way of the bigram model.

I will email it to anyone who requests it.

It is in two platforms, Office 97 and Office 2000 as FuzzyMatching97.zip (692 KB) and FuzzyMatching2k.zip (721 KB).
The zip files include ApprxStrMatchingEngine97.pps or ApprxStrMatchingEngine2k.pps respectively, StrMatching97.mde or StrMatching2k.mde respectively, IEEESoundexV5.pdf, and VBAlgorithms.txt.

IEEESoundexV5.pdf is an abstract about Approximate Sting Matching that fired my curiosity about the subject, and pertains to the package.

VBAlgorithms.txt contains the entire suite of algorithms in Visual Basic extracted from the MDB modules.

The PowerPoint presentations describe the workings of the MDE and give a good overview of Fuzzy Matching.

Matching is divine....

It can be downloaded at http://www.kdkeys.net/forums/thread/6450.aspx




«134

Comments

  • EPEREZEPEREZ Member Posts: 2
    Can you send FuzzyMatching for Office 2000 to this email, please?
    winbatch67@hotmail.com

    Thanks,

    Eusebio
  • OpnSeasonOpnSeason Member Posts: 12
    : Can you send FuzzyMatching for Office 2000 to this email, please?
    : winbatch67@hotmail.com
    :
    : Thanks,
    :
    : Eusebio
    :

    Hello Eusebio,

    It is on its way to you.

    I am open to discuss any questions about it.

    Thanks,

    OpnSeason

  • turkeyboyturkeyboy Member Posts: 1
    [b][red]This message was edited by turkeyboy at 2006-3-16 16:55:42[/red][/b][hr]
    Hi,

    I'm trying to match company names from two databases. The names are not always exactly the same. Would your program be helpful? If so, if you could send it to me at scott_dyreng@unc.edu I'd appreciate it. All data is currently in Excel.

    Thanks,

    Scott
  • OpnSeasonOpnSeason Member Posts: 12
    : [b][red]This message was edited by turkeyboy at 2006-3-16 16:55:42[/red][/b][hr]
    : Hi,
    :
    : I'm trying to match company names from two databases. The names are not always exactly the same. Would your program be helpful? If so, if you could send it to me at scott_dyreng@unc.edu I'd appreciate it. All data is currently in Excel.
    :
    : Thanks,
    :
    : Scott
    :

    Hello Scott,

    It should handle that situation nicely.
    You will need to make some modifications, though.
    I will send instructions with the attached zip file.

    Thanks,

    OpnSeason

  • EPEREZEPEREZ Member Posts: 2
    [b][red]This message was edited by EPEREZ at 2006-3-17 3:37:58[/red][/b][hr]
    Hi OpnSeason,
    My problem is:
    I've a database with 35.000.000 records in SQL.
    Fields: REFNUM,NAME,STREET,CITY,PC,PHONE,FAX,EMAIL,URL

    The problem is that I cannot compare all records (Ref1 with Ref2,Ref1 with Ref3....Refn with Ref1,...,Refn with Refn-1).

    I must find possible candidates for compare.

    Can you help me, please.

    Thanks,

    Eusebio.


  • cjbell9zcjbell9z Member Posts: 1
    Hello,

    I am trying to match Facility names from two seperate db that are sometimes not spelled exactly the same. It seem like your program might be able to do just that. I was wondering if you could email me the program and directions to izz3@cdc.gov.

    Thanks,
    C-
  • phans912phans912 Member Posts: 1
    Could I please get a copy of your process.
    Thanks!

    phansen@lillypulitzer.com

  • OpnSeasonOpnSeason Member Posts: 12
    : Hello,
    :
    : I am trying to match Facility names from two seperate db that are sometimes not spelled exactly the same. It seem like your program might be able to do just that. I was wondering if you could email me the program and directions to izz3@cdc.gov.
    :
    : Thanks,
    : C-
    :

    Hello C-,

    The files are on the way.
    Post back with any questions or comments.

    Thanks,

    OpnSeason
  • OpnSeasonOpnSeason Member Posts: 12
    : Could I please get a copy of your process.
    : Thanks!
    :
    : phansen@lillypulitzer.com
    :
    :

    Hello phansen,

    The files are on the way.
    Post back with any questions or comments.

    Thanks,

    OpnSeason
  • jimboliciousjimbolicious Member Posts: 10
    [b][red]This message was edited by jimbolicious at 2006-3-21 19:22:59[/red][/b][hr]
    Hi Opn,

    can I get a copy?


    Thks,

    jim


«134
Sign In or Register to comment.