How to perform Find with partial data (text)?


K3lso

This topic is 6635 days old. Please don't post here. Open a new topic instead.

It can be translated, but it won't be fast. Because it calculates a match value between the search text and every trademark in the database, it must repeat the calculation for every record each time you enter a new search text value. So, if you have 100,000 records in the file and you enter a new search text value, it must perform 100,000 calculations and then index the results so it can find the best matches. That's the main difference between these algorithms and algorithms like Soundex.

Soundex-type algorithms calculate an index value that depends only on the single trademark text. So it only has to be calculated once per record, and can then be indexed just once. Hence it is fast.
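
To make the contrast concrete, here is a minimal sketch in Python rather than FileMaker. It assumes a simplified American Soundex as the per-record key; the names soundex and build_index and the sample marks are made up for illustration. The key depends only on the stored word, so it is calculated once per record and looked up afterwards, whereas a pairwise score has to be recalculated against every record for each new search text.

```python
from collections import defaultdict

# Letter groups of classic Soundex; vowels, H, W and Y carry no digit.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word):
    """Return a 4-character Soundex key (simplified sketch)."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "HW":      # H and W do not separate equal codes
            prev = code
    return (key + "000")[:4]


def build_index(names):
    """Compute each record's key once and group the records by key."""
    index = defaultdict(list)
    for name in names:
        index[soundex(name)].append(name)
    return index


marks = ["Robert", "Rupert", "Starox", "Starbucks"]   # hypothetical data
index = build_index(marks)            # one key calculation per stored record
print(index[soundex("Rubert")])       # a search needs only one more key
```

A new search text costs one key calculation plus a lookup, no matter how many records are stored.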

I've been quietly fascinated by this discussion, first by Bob's introduction of SoundEx which I encountered ages ago, and then Comment provided links to a couple of interesting sites.

I was intrigued by the Levenshtein Distance, although I thought at the time it was a wee bit off-topic. Now I'm not so sure. Perhaps the solution to this problem lies in a combination of Levenshtein Distance AND SoundEx. Between the two, you may be able to develop a score that will satisfy 79% of lawyers. That said, there will always be somebody who thinks Starox = Atrox = Starbucks = flummox, so there's truly no such thing as a perfect solution.
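
For anyone who hasn't met it, Levenshtein Distance counts the fewest single-character insertions, deletions and substitutions needed to turn one string into another. Below is a minimal Python sketch of the standard dynamic-programming version. The similarity helper that scales the distance to a 0..1 score is my own assumption, and how to weight it against a phonetic key is left open.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Scale the distance to 0..1 (an illustrative choice, not a standard)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("STAROX", "ATROX"))
print(round(similarity("STAROX", "STARBUCKS"), 2))
```

Unlike a Soundex key, this score depends on both strings, so it has to be recalculated for every record whenever the search text changes.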

Further, I'll add that I just came off a project that required studying etymology and how words evolved, and I can clearly see it illustrated in the given explications. For example, the letters K, G, and Q evolved from the letter C. The letter J is related to G, but it is also a cognate of I and Y.

You guys have gone above and beyond and you're probably exhausted, but I applaud the hard work and let you know it was appreciated beyond just the contributors in the left column.

"there will always be somebody who thinks Starox = Atrox = Starbucks = flummox"

"Perhaps the solution to this problem lies in a combination of Levenshtein Distance AND SoundEx"

Exactly. What I'm doing now is separating the syllables (e.g., the word STAROX into STAR and OX). Then I perform an ordinary search like this: *STAR* and *OX* and it

"Is it possible to modify the algorithm to perform a phonetic search no matter where the "SURE" is (beginning, middle or end)?"

No, not with Soundex. That's where these other algorithms come into play. If no one else wants to take up the challenge, I will have a look at the Oliver code this weekend and see what I can do with it.

I had a brief look at the code for the Similar_Text function. That's as far as I got because I suddenly realized that today is Feb 28, and I had a pile of month-end and year-end bookkeeping to get done. So, that took care of my weekend.

However, from my brief look, I noticed that the Similar_Text function is recursive, so that throws yet another twist into the mix. I did some more reading up on the function, and a bit more thinking about it. Similar_Text is the least computationally efficient of the algorithms described in the resources I searched. Even if it could be made to run fast (which I'm sure it can't), I don't think Similar_Text is very well suited for what you are doing. It looks to me like it was designed to detect plagiarism, rather than to detect similar-sounding or similar-looking words. But that has given me an idea about a variation on the Metaphone function that may be more suitable and would be lightning fast. Give me a couple of days though. I'm busier than a lint picker in a blue serge suit factory.
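
To show where the recursion and the cost come from, here is a rough Python sketch in the spirit of Similar_Text. It assumes the scheme is "find the first longest common substring, then repeat on the text to its left and to its right", and it is not guaranteed to return the same numbers as any particular implementation. The function name and the sample words are only for illustration.

```python
def similar_text(a, b):
    """Count matching characters via recursive longest-common-substring."""
    if not a or not b:
        return 0
    best_len = best_i = best_j = 0
    # Brute-force search for the first longest common substring.
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_len, best_i, best_j = k, i, j
    if best_len == 0:
        return 0
    # Recurse on what lies to the left and to the right of the match.
    left = similar_text(a[:best_i], b[:best_j])
    right = similar_text(a[best_i + best_len:], b[best_j + best_len:])
    return best_len + left + right


print(similar_text("World", "Word"))
print(similar_text("test", "wert"), similar_text("wert", "test"))
```

The nested loops run at every level of the recursion, and the second line shows that the result in this sketch can depend on the order of the two arguments.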

While surfing I found out about the Panorama database. Its site says: "Even though the indexes are large they actually don't contain all of the information in the database (most of the index space is taken up with hints to make searching faster). Since FileMaker is searching the index, not the database, this means that many useful search queries are impossible. Since Panorama doesn't use indexes it can perform any search you can think of, including PHONETIC SEARCHES (sounds like "alan"), PARTIAL MATCHES, comparisons between fields (Price is more than twice the P/E ratio), searching for fields that contain only letters, only numbers, or some other combination, searching all fields at once, even live keystroke-by-keystroke searches (like iTunes)." It seems it's an old application, but what do you think about its search function?

Panorama is a memory resident database. So, it is limited by the amount of RAM in your computer. I used it years ago, and liked it very much. I'm surprised to hear that it's still around. You don't hear much about it nowadays. It was one of the very first Macintosh applications to come out in 1984 (under the original name OverVue), and it was way ahead of its time. I believe it was the very first application to use clairvoyance.

As I recall, it had pretty good search capabilities, but it was a long time ago.

Meanwhile, I came up with a variation on Soundex/Metaphone last night, which might be useful to you. Here is the formula:

Let ([
    // Wrap the word in periods so that word-initial and word-final patterns can be matched
    Input = "." & Upper(TrimAll(TextField; 0; 0)) & ".";

    // Stage 1: rewrite silent letters and common digraphs as simpler spellings
    Norm1 = Substitute(Input;
        [".X";".S"];[".KN";".N"];[".PN";".N"];[".GN";".N"];
        [".WR";".R"];[".WH";".W"];
        ["MB.";"M."];["GH";""];
        ["SCH";"SK"];
        ["TIA";"SHA"];["TIO";"SHO"];["CH";"SH"];
        ["CE";"SI"];["CI";"SI"];["CY";"SI"];["CK";"K"];
        ["DGE";"JE"];["DGI";"JI"];["DGY";"JI"];
        ["GE";"JE"];["GI";"JI"];["GY";"JI"];
        ["PH";"F"];["Q";"K"];["V";"F"];["X";"KS"]
    );

    // Stage 2: treat all vowels as A; keep W before A or Y and Y before A, drop any other W or Y
    Norm2 = Upper(Substitute(Norm1;
        ["E";"A"];["I";"A"];["O";"A"];["U";"A"];
        ["WA";"wA"];["WY";"wY"];["YA";"yA"];["Y";""];["W";""]));

    // Stage 3: merge similar consonants, collapse repeated letters,
    // mark a leading vowel as 0, then drop the remaining vowels and the period markers
    Norm3 = Substitute(Norm2;
        ["G";"K"];["C";"K"];
        ["TH";"S"];["J";"S"];["Z";"S"];
        ["F";"B"];["P";"B"];["V";"B"];
        ["H";"A"];["N";"M"];["T";"D"];
        ["AAA";"A"];["AA";"A"];["AA";"A"];
        ["BBB";"B"];["BB";"B"];["BB";"B"];
        ["DDD";"D"];["DD";"D"];["DD";"D"];
        ["KKK";"K"];["KK";"K"];["KK";"K"];
        ["LLL";"L"];["LL";"L"];["LL";"L"];
        ["MMM";"M"];["MM";"M"];["MM";"M"];
        ["RRR";"R"];["RR";"R"];["RR";"R"];
        [".A";".0"];["A";""];[".";""]
    );

    KeyLen = Length(Norm3)
];
    Norm3 &
    Case(KeyLen>4;
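
For readers without FileMaker at hand, the three Substitute stages of the formula can be tried out with this rough Python transcription. It covers only Norm1 through Norm3 and not the Case(...) part that follows. It assumes that Substitute applies its pairs one after another and is case-sensitive, which is how str.replace is used here, and it stands in for TrimAll with a plain strip for single words. The names phonetic_key and substitute are mine.

```python
NORM1 = [(".X", ".S"), (".KN", ".N"), (".PN", ".N"), (".GN", ".N"),
         (".WR", ".R"), (".WH", ".W"), ("MB.", "M."), ("GH", ""),
         ("SCH", "SK"), ("TIA", "SHA"), ("TIO", "SHO"), ("CH", "SH"),
         ("CE", "SI"), ("CI", "SI"), ("CY", "SI"), ("CK", "K"),
         ("DGE", "JE"), ("DGI", "JI"), ("DGY", "JI"),
         ("GE", "JE"), ("GI", "JI"), ("GY", "JI"),
         ("PH", "F"), ("Q", "K"), ("V", "F"), ("X", "KS")]

NORM2 = [("E", "A"), ("I", "A"), ("O", "A"), ("U", "A"),
         ("WA", "wA"), ("WY", "wY"), ("YA", "yA"), ("Y", ""), ("W", "")]

NORM3 = [("G", "K"), ("C", "K"), ("TH", "S"), ("J", "S"), ("Z", "S"),
         ("F", "B"), ("P", "B"), ("V", "B"),
         ("H", "A"), ("N", "M"), ("T", "D")]
# Collapse runs of the same letter, in the same order as the formula.
for ch in "ABDKLMR":
    NORM3 += [(ch * 3, ch), (ch * 2, ch), (ch * 2, ch)]
NORM3 += [(".A", ".0"), ("A", ""), (".", "")]


def substitute(text, pairs):
    """Apply each (search, replace) pair in order, case-sensitively."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


def phonetic_key(word):
    """Return the Norm3 value for a single word (sketch only)."""
    text = "." + word.strip().upper() + "."
    text = substitute(text, NORM1)
    text = substitute(text, NORM2).upper()
    return substitute(text, NORM3)


print(phonetic_key("knight"), phonetic_key("nite"))
print(phonetic_key("Starox"), phonetic_key("Atrox"))
```

Words that share a key are treated as sounding alike, so "knight" and "nite" land on the same value while "Starox" and "Atrox" do not.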

Well, Spanish should be a fairly simple conversion from English. There is some difference in the pronunciation of vowels, but this function ignores the difference between vowels anyway, so you won't need to do anything about those. There aren't many differences between the consonants except for G, J and H. The H is ignored in the English version, so that should be okay. Most of the substitutions like WH, KN and GN are for exceptions to English pronunciation which don't seem to occur in Spanish, so they can be removed.

So, I would add these substitutions:

J -> H

GE -> HE

GI -> HI

The H is later deleted in the existing function.

The R sound is different from English, but distinct from everything else, so it should be okay as is.

The K and W occur only in words borrowed from other languages such as English, so they don't need any special treatment. I'm not sure how foreign words with W are pronounced. My dictionary says W is pronounced like V. If so, you could convert W to V.

So the Spanish version would probably look something like this:

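Let ([
    // Wrap the word in periods so that word-initial and word-final patterns can be matched
    Input = "." & Upper(TrimAll(SearchText; 0; 0)) & ".";

    // Stage 1: English-only exceptions removed; GE, GI, GY and J now map to the H sound
    Norm1 = Substitute(Input;
        [".X";".S"];
        ["TIA";"SHA"];["TIO";"SHO"];["CH";"SH"];
        ["CE";"SI"];["CI";"SI"];["CY";"SI"];["CK";"K"];
        ["GE";"HE"];["GI";"HI"];["GY";"HI"];["J";"H"];
        ["Q";"K"];["V";"F"];["X";"KS"]
    );

    // Stage 2: treat all vowels as A; keep W before A or Y and Y before A, drop any other W or Y
    Norm2 = Upper(Substitute(Norm1;
        ["E";"A"];["I";"A"];["O";"A"];["U";"A"];
        ["WA";"wA"];["WY";"wY"];["YA";"yA"];["Y";""];["W";""]));

    // Stage 3: merge similar consonants (W joins the B group), collapse repeated letters,
    // mark a leading vowel as 0, then drop the remaining vowels and the period markers
    Norm3 = Substitute(Norm2;
        ["G";"K"];["C";"K"];
        ["TH";"S"];["Z";"S"];
        ["F";"B"];["P";"B"];["V";"B"];["W";"B"];
        ["H";"A"];["N";"M"];["T";"D"];
        ["AAA";"A"];["AA";"A"];["AA";"A"];
        ["BBB";"B"];["BB";"B"];["BB";"B"];
        ["DDD";"D"];["DD";"D"];["DD";"D"];
        ["KKK";"K"];["KK";"K"];["KK";"K"];
        ["LLL";"L"];["LL";"L"];["LL";"L"];
        ["MMM";"M"];["MM";"M"];["MM";"M"];
        ["RRR";"R"];["RR";"R"];["RR";"R"];
        [".A";".0"];["A";""];[".";""]
    );

    KeyLen = Length(Norm3)
];
    Norm3 &
    Case(KeyLen>4;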

Would that be another script? If so, would there be two search scripts, one in Spanish and another in English? Or do I only need to change the previous algorithm, so that a single script performs the phonetic search in both Spanish and English?

P.S: In Spanish there is a letter that doesn

I'm thinking...

Could it be possible to somehow include the pronunciation of the letters "H", "J" and "Z" in Spanish?

"H" is pronounced "HACHE"

"J" is pronounced "JOTA"

"Z" is pronounced "ZETA"

The substitutions would be:

HACHE = H

JOTA = J

ZETA = Z

Am I right?

I've attached another example which does both English and Spanish at the same time without requiring any script.

By the way, you didn't mention whether this is for European Spanish or Latin American Spanish. I understand there are a couple of subtle differences.

Here are the rules I used. Please correct me if I'm wrong:

Spanish J is pronounced as an English H. This is already handled in the function I gave.

Spanish H is essentially silent. Again this is already handled.

As for Z, my understanding is that European Spanish pronounces it like English 'TH' and Latin American Spanish pronounces it the same as English Z. Either way, the phonetic simplification that is done converts them both to 'S,' so it is already handled too.

For LL, I believe the European Spanish pronunciation is LY, while Latin American Spanish pronounces it as Y. I missed that one, so it has been added in.

CE and CI convert to SE and SI in both English and Spanish, so no changes are needed here.

I was aware of

Phonetic.fp7.zip

I forgot: it's for Latin American Spanish, but there are only a few differences in pronunciation. The sound of the "Z" is all I can think of for now. In European Spanish it sounds like "C" and in Latin American Spanish like "S". But you mentioned that's already handled by the "S" substitution.

Spanish "H" is silent.

I think "

Oh, and... is it possible to make a single NeoPhoneSearch and a single NeoPhone, so there are only 2 fields for the English and Spanish phonetic searches instead of 4 (NeoPhoneSearchEspan, NeoPhoneEspan, NeoPhoneSearch and NeoPhone)? That way I won't have to search in English and then search again in Spanish (double work).

Matchcount gives the number of matching records that are found. Its value appears at the top of each portal.

The new functions produce a multikey (several values, each on a separate line). If any one of these matches any of the multikey values of one of the textfield words, it is considered a match and the word will appear in the portal. The tolerance value determines how many key values are created in the search key field.

Here is how the multikeys are created. The main key value is created from the function as previously described. Each character in the key corresponds to a consonant sound in the original word. Then, if the key value is longer than 4 characters, two more key values are created: the main key minus the first character, and the main key minus the last character. If the main key is longer than 5 characters, an additional two keys are created: the main key minus the first two characters, and the main key minus the last two characters.

If the tolerance value is set to 0, only the main key is created. If it is set to 1, 3 keys are created (if the word is long enough), and if it is set to 2, all 5 keys are created (again assuming that the word is long enough). This allows you to match smaller parts of words. For example, if you enter "mathematic" for the search text, it will return both "mathematic" and "mathematical."
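
The key-building rules described above can be sketched in a few lines of Python. This is only an illustration of the description, not the calculation in the attached file. The function names, the sample key values "MSMDK" and "MSMDKL", and the assumption that the stored word always carries all five keys are mine.

```python
def multikeys(main_key, tolerance):
    """Return the list of key values for one word.

    tolerance 0: main key only
    tolerance 1: plus the key minus its first / last character (length > 4)
    tolerance 2: plus the key minus its first / last two characters (length > 5)
    """
    keys = [main_key]
    if tolerance >= 1 and len(main_key) > 4:
        keys += [main_key[1:], main_key[:-1]]
    if tolerance >= 2 and len(main_key) > 5:
        keys += [main_key[2:], main_key[:-2]]
    return keys


def is_match(search_keys, field_keys):
    """A multikey match: any search key equals any key of the stored word."""
    return not set(search_keys).isdisjoint(field_keys)


stored = multikeys("MSMDKL", 2)   # illustrative key for a longer stored word
search = multikeys("MSMDK", 0)    # illustrative key for a shorter search word
print(stored)
print(is_match(search, stored))
```

Because the stored word also carries its shortened keys, a shorter search word can match it even when the search tolerance is 0.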

The 4 fields could be combined into two fields with the English and Spanish version forming two sets of multikey values, 5 for English and 5 for Spanish for a total of 10 key values.

You may want to read up on multikey relationships if you are not familiar with them.
