Rigsby Posted September 9, 2001

Well folks, I've now posted over 30 answers but only asked one question, so I guess I'm entitled to a little help now! LOL. I don't have a problem as such, as I've already found an answer. However, if someone could help me do this a better way, I would be very happy.

The situation: I have an address file with over 130,000 company names and addresses. I get these names and addresses from CDs, which are imported into the DB. The problem is keeping out the duplicate records. If every company address CD manufacturer used the same formats, this would be easy, but they don't. So for example:

Mueller & Mueller GmbH
Ring Strasse 6
46789 Munich
Deutschland

is identical with:

M
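What this approach boils down to is collapsing each record to a normalized match key and comparing keys instead of raw text, so two records that differ only in legal form, street spelling, or punctuation produce the same key. A minimal Python sketch of the idea, with made-up substitution rules and field layout, since the original FileMaker calculation isn't shown:

```python
# A minimal sketch of the idea, not the original FileMaker calculation:
# collapse each record to a normalized match key so that differently
# formatted duplicates compare equal. The substitution rules below are
# made up for illustration; real ones depend on the CD formats.

CANONICAL = [
    ("gmbh", ""),         # drop the legal form
    ("strasse", "str"),   # unify street spellings
    ("str.", "str"),
    ("&", "und"),
    ("deutschland", ""),  # drop the country name
]

def match_key(name: str, street: str, zip_code: str, city: str) -> str:
    """Collapse one record to a key used only for duplicate comparison."""
    text = " ".join([name, street, zip_code, city]).lower()
    for old, new in CANONICAL:
        text = text.replace(old, new)
    # keep only letters and digits so punctuation and spacing can't differ
    return "".join(ch for ch in text if ch.isalnum())

a = match_key("Mueller & Mueller GmbH", "Ring Strasse 6", "46789", "Munich")
b = match_key("Mueller und Mueller", "Ringstr. 6", "46789", "Munich")
print(a == b)  # True: both collapse to the same key
```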
Vaughan Posted September 10, 2001

Instead of using a hi-tech self-join, why not just perform a lo-tech find on the calc field for duplicates (search for the exclamation character)? It can be scripted: perform the find, sort, then start sifting through the records. Because you're no longer using the relationship, the calculation based on the global field will now work.
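For anyone reading along outside FileMaker: the exclamation character in a find request matches records whose field value occurs more than once. A rough, self-contained Python analogue of "perform the find, sort, then sift" might look like the following; the naive_key function is a crude stand-in for the real substitution-based calc field:

```python
from collections import defaultdict

# Rough analogue of scripting FileMaker's "!" duplicate find plus a sort:
# bucket records by a calculated key, keep only buckets holding two or
# more records, and return them sorted for manual sifting.

def naive_key(record):
    # crude stand-in key: lowercase, letters and digits only
    return "".join(ch for ch in " ".join(record).lower() if ch.isalnum())

def duplicate_groups(records, key_func=naive_key):
    buckets = defaultdict(list)
    for rec in records:
        buckets[key_func(rec)].append(rec)
    # only values occurring more than once count as duplicates, as with "!"
    return sorted(group for group in buckets.values() if len(group) > 1)

records = [
    ("Schmidt AG", "Hauptstr. 1", "10115", "Berlin"),
    ("SCHMIDT AG", "Hauptstr 1", "10115", "Berlin"),
    ("Mueller und Mueller", "Ringstr. 6", "46789", "Munich"),
]

for group in duplicate_groups(records):
    print(group)  # each printed group is one cluster to review by hand
```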
Rigsby (Author) Posted September 18, 2001

I think you missed my point a little! My problem isn't finding the duplicate records; the system of taking out specific characters works great for that. The problem is telling FM which letters, numbers, words, and other characters to remove. If I set these up permanently as part of the calculation, the user has no way of changing them, which is important for my client.

An example: there are 4 fields:

<CompanyName> <Sub1> <Sub1With> <CompanyNameResult>

The user enters in field <Sub1> what should be removed from <CompanyName>, and enters in <Sub1With> what the text in <Sub1> should be replaced with, so it would look like this:

<CompanyName> = "Boris M
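A single pair like this is what FileMaker's Substitute() function handles; the point here is that the search and replacement texts live in fields the user can edit rather than being hard-coded inside the calculation. A hypothetical Python sketch of the same data-driven idea, extended to a list of pairs (the pairs shown are placeholders for whatever the user would enter):

```python
# Hypothetical sketch of the data-driven substitution idea: the pairs live
# in data the user can edit (the <Sub1>/<Sub1With> fields), not inside the
# calculation itself. The pairs below are placeholders for user input.

substitutions = [
    ("GmbH", ""),
    ("&", "und"),
    ("Strasse", "Str"),
]

def company_name_result(company_name: str, pairs) -> str:
    """Apply each user-entered (Sub1, Sub1With) pair in order."""
    result = company_name
    for sub1, sub1_with in pairs:
        result = result.replace(sub1, sub1_with)
    return " ".join(result.split())  # tidy any doubled spaces left behind

print(company_name_result("Mueller & Mueller GmbH", substitutions))
# -> "Mueller und Mueller"
```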
Vaughan Posted September 19, 2001

I don't feel like wrapping my brain around your problem at the moment (it's 10:00 am here, just waking up), but my intuition suggests that a problem like this might be solved by making a database of "substitutions" and looking through all the records to search and replace. Time-consuming and complicated, yes, but at least adding and removing substitutions will be clean-cut. Having said that, maybe you are asking for the impossible. Most databases of addresses acknowledge that a certain percentage of the records will be duplicates and duds.
LiveOak Posted September 19, 2001

This really does spiral off into the area of AI. What do you do about incorrectly typed abbreviations you want to remove, etc.? Remember the short-lived AI stock boom of the early '80s, sold on the basis that artificial intelligence was only a few years away? We sure kicked the heck out of that problem (maybe in another 100 years)! What it boils down to is that for the price of all the engineering required to build a poor-to-marginal solution, you can hire an army of low-cost clerical workers who can solve it just fine. Not as much fun as tinkering with databases, but a lot faster and cheaper.

-bd
Rigsby (Author) Posted September 19, 2001

OK guys, thanks for your comments. To be honest, I thought I had already found the best way (read my first posting on this matter), but I just thought I'd get your ideas first in case any of you had a way of speeding this up. I'll stick to the related file, using a lookup and forcing the user to run a script if any values are changed. I'm also going to put a limit on the number of users for this command, so that it can only be used when there is little network activity. If I come up with a better way, I'll let you know.

Rigsby