
FM as a translation memory DB? (requires "fuzzy" matching)


Wintermute101

This topic is 4397 days old. Please don't post here. Open a new topic instead.


Greetings, programs.

I have been using FileMaker since version 3 for all my daily database needs: contacts, invoices, mail archives etc. I always get the results I want, but I usually "stay on the surface", using basic features and simple scripts. Now I am trying to solve a problem that may be too complex either for FM or for me; but I suppose it's the latter. :) Maybe an experienced developer can tell me if what I have in mind here is possible.

My main business is (technical) translation from English to German. Like most translators today, I'm using Computer-assisted Translation software (see http://en.wikipedia.org/wiki/Computer-assisted_translation ).

A translation memory system is basically a database that stores segments (usually sentence level) as pairs ("translation units").

E.g., "The cat is black" - "Die Katze ist schwarz" would be one English/German TU.

When you open a new sentence/segment for translation, the CAT software will go to this translation memory database (which usually has tens of thousands, maybe even millions of records) to see if that particular sentence has been translated before.

Full (100 %) matches are easy - the respective translation will immediately be placed in the target segment.

But usually, the database will only hold "fuzzy" matches - sentences that are "similar" to the current one.

The whole trick (the "secret sauce" for good CAT software, similar to a search engine's algorithm) is to bring up relevant matches first. "Relevant" meaning: translation units where...

- as many words as possible from the current segment are present (if possible, stop words such as "the", "a" etc. should be ignored in this context).

- submatches occur (i.e., several words in the same sequence as in the current segment - so you'd look for translation units that have "The cat is" or "is black", because these are structurally closer to the current segment than segments where these words appear in a different order).
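
The two criteria above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any CAT product actually scores matches: the stop-word list, the function names and the simple "one point per shared word, one point per shared word pair" weighting are all assumptions made for the example.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "on", "in", "to"}

def tokenize(sentence):
    """Lower-case a sentence and split it into words."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def relevance(current, candidate):
    """Score a candidate: shared content words plus shared word pairs."""
    cur, cand = tokenize(current), tokenize(candidate)
    # Criterion 1: words present in both segments, ignoring stop words.
    shared_words = (set(cur) - STOP_WORDS) & (set(cand) - STOP_WORDS)
    # Criterion 2: two-word sequences present in both segments. Stop words
    # are kept here so that "the cat" or "is black" count as submatches.
    cur_pairs = set(zip(cur, cur[1:]))
    cand_pairs = set(zip(cand, cand[1:]))
    shared_pairs = cur_pairs & cand_pairs
    return len(shared_words) + len(shared_pairs)
```

With "The cat is black" as the current segment, "The cat is black today" scores higher than "A black cat sleeps": both share the words "cat" and "black", but only the first also shares word pairs in the same order.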

I guess you can see where this is heading:

I'm wondering if and how this kind of fuzzy search is possible in FM. In that case, I would have my translation memory (ca. 80,000 English/German translation unit pairs) in one table and the current translation in another. Going from one record (sentence) to the next in the current translation would query the translation memory table and bring up all relevant (i.e., "similar") sentences.

Is this possible? Frankly, I don't even know where to start.

Of course, whenever I open a segment for translation, I could have a script look up every word combination from my "source" field ("cat is", "is black" etc.) and assign a match value to all sentences in the TM table found for each combination. At the end, the records with the highest match rates would be on top of the result list. But with long sentences (20 to 40 words), this would take forever.

How would you approach this problem - if at all?

I'm not expecting anyone to do the hard work for me, but a push in the right direction would be appreciated.

Thank you.

Link to comment
Share on other sites

I have seen an example of someone producing a "relevance" rating for a search, which measured how many characters would have to be changed to make the search term equal the actual record (or something to that effect).

I can't seem to find it right now, but that may be something worth looking into. (I'm hoping someone else will read this and will be able to post a link to this example file)
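
The rating described here sounds like an edit distance (often called Levenshtein distance), although that is a guess, since the example file itself is not at hand. A minimal Python sketch of that measure, with a function name chosen just for this illustration:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]
```

A smaller result means a closer match, so records could be sorted by this value. Note that the cost grows with the product of the two string lengths, and it has to be computed once per candidate record.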

Another thing worth mentioning is that FileMaker will only index the first X characters of a field (I forget the exact number), so if you are finding records in your 'TU' table via a relationship, this could affect your results. I don't think it would affect your results if you perform a find, though.

Link to comment
Share on other sites

How would you approach this problem - if at all?

Probably not at all, for two reasons:

1. Let's take this example:

- as many words as possible from the current segment are present (if possible

It is quite easy to show records that have one or more words common with the current segment: you break the sentences into individual words (eliminating stop words along the way) and use the resulting list as the matchfield.

However, when it comes to counting how many words are being matched, there is no way to perform this calculation in advance - you must compare the current segment with each one of the matching sentences. If there are 10,000 matching records, you must perform 10,000 calculations in order to filter/rank the matches by relevance. This will be slow - and it will get slower as the database grows.

2. The algorithms required here are pretty complex, and it would take a long time to perfect them. You have mentioned only two rather simple examples, but what about matching "cat" with "cats", "is" with "are" - not to mention matching "cat" with "tomcat" but NOT with "catalog".

Link to comment
Share on other sites

Hello consultant,

thank you for your comment. I had prepared a long response, but you seem to indicate that FileMaker is simply not the "weapon of choice" here, and you're probably right.

According to what you suggested, the only approach I can think of is:

1.) a (stop-word-filtered) "word list" calculation field for the source field ("the black cat is on the roof" > "black cat roof") in Table A with my translation units (English/German sentence pairs) and

2.) a script that will split the current source sentence in Table B into separate (OR) searches ("orange cat roof").

That would give a lot of results, and as you say, prioritizing them would take a long time even if you don't consider the remaining problems (looking for submatches, composites and word inflections).

I had hoped that there was a simple solution which I had simply not seen.

Anyway - this is not the end of the world for me; I'm using several CAT products (OmegaT, memoQ, Wordfast), and they all work (more or less). It's just that I like FileMaker so much, and the idea of building a "home-brew" translation software with a custom interface that I could access from every Mac, PC, iPad or even browser was tempting.

PS:

If someone (maybe even a FM plugin developer) wants to give this a try: Translation is a huge business, and both freelance translators and agencies are willing to spend serious money on good CAT software. If someone came up with a FileMaker-based solution for fuzzy matching, which is the key feature of every CAT product (see http://en.wikipedia.org/wiki/Fuzzy_matching), he could probably earn good money. Just saying. :)

Thank you, guys.

Link to comment
Share on other sites

Just a small correction:

"the black cat is on the roof"

should produce:

"black¶cat¶roof"

i.e. a return-separated list of non-stop(?) words. This can be calculated using a custom function. If you do the same thing with the current segment and define a relationship matching the two results, you can show the matching sentences in a portal - no scripting or finding necessary.

This is the fast part, because the calculation results can be stored and indexed. However, you cannot pre-calculate the number of matching words, since this depends on the current segment.
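
Outside FileMaker, the same split between a stored part and a query-time part can be sketched in Python. This is only an analogy to the relationship described above, not FileMaker code: the stop-word list, the function names and the sample sentences are assumptions, and punctuation handling is left out to keep the sketch short.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "is", "on"}  # hypothetical, far from complete

def content_words(sentence):
    """Return the set of non-stop words in a sentence."""
    return {w for w in sentence.lower().split() if w not in STOP_WORDS}

def build_index(sentences):
    """Stored part: map each word to the ids of the sentences containing it."""
    index = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for word in content_words(sentence):
            index[word].add(sentence_id)
    return index

def rank(index, current_segment):
    """Query-time part: count the shared words for each matching sentence."""
    counts = Counter()
    for word in content_words(current_segment):
        for sentence_id in index.get(word, ()):
            counts[sentence_id] += 1
    # (sentence_id, number_of_shared_words) pairs, best match first.
    return counts.most_common()
```

The index can be built once and kept, like the stored calculation field. The counting in `rank` has to run for every new segment, and its cost grows with the number of matching records, which is the limitation pointed out above.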

Link to comment
Share on other sites

... you seem to indicate that FileMaker is simply not the "weapon of choice" here, and you're probably right.

I cannot speak for Comment, but I would say the issue is not necessarily whether FileMaker is the right tool for the job. Some problems are "hard" regardless of the technology. Breakthroughs are often the re-definition of the original problem into a simpler one that can be solved, or one that leads to a reasonably optimal solution. An oft quoted example is the travelling salesman problem.

Link to comment
Share on other sites

I cannot speak for Comment, but I would say the issue is not necessarily whether FileMaker is the right tool for the job. Some problems are "hard" regardless of the technology. Breakthroughs are often the re-definition of the original problem into a simpler one that can be solved, or one that leads to a reasonably optimal solution. An oft quoted example is the travelling salesman problem.

I played with doing this off and on for a few years. Then I found that most of the folks doing translations at my clients' sites were using Google Translate. So we added a pop-up window with a simple web viewer that loads the Google Translate web page, and we pull the results back from it as well. Works like a charm.

Link to comment
Share on other sites

Some problems are "hard" regardless of the technology. Breakthroughs are often the re-definition of the original problem into a simpler one that can be solved, or one that leads to a reasonably optimal solution. An oft quoted example is the travelling salesman problem.

And this is true for many (interesting) problems.

Now I'm not a developer (a few script lines don't count, I'm afraid :) ) - so it's hard for me to say how "hard" a given IT problem is.

But as far as "fuzzy matching" in computer-assisted translation is concerned: This has been done/solved by many teams around the world, some of them very small (1-3 devs). Some of these projects are under the GNU General Public License.

I don't know what the fuzzy matching algorithms actually do "behind the curtain", but from a user perspective, what you can expect is this:

- You provide a translation memory, e.g. as a TMX file (= a relatively simple, XML-based translation memory format), which in my case has around 80,000 records.

- The software (e.g. OmegaT) seems to build its own index based on this TMX file.

- Once you start a translating session, "fuzzy" matches for a sentence are found and sorted within fractions of a second. Usually, one of the three top results in such a query is good (= close) enough to save a lot of thinking/typing.

The result is incredibly useful: You get "matching" records, even if some terms in your query (= the sentence to translate) don't match. If you are working with large TMs on technical documents, a 50-page document can be pre-translated in a minute, leaving only the new/radically different sentences for translation.
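
To make the user-side behaviour described above concrete, here is a small Python sketch that ranks translation units by similarity to the current sentence. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; I am not claiming this is what OmegaT or any other CAT tool does internally. The function name, the threshold of 0.5 and the limit of three results are assumptions made for the example.

```python
from difflib import SequenceMatcher

def fuzzy_matches(segment, memory, limit=3, threshold=0.5):
    """Return up to `limit` (score, source, target) tuples, best first.

    `memory` is a list of (source, target) translation units.
    """
    words = segment.lower().split()
    scored = []
    for source, target in memory:
        # Compare word sequences rather than characters, so that words
        # appearing in the same order raise the score.
        score = SequenceMatcher(None, words, source.lower().split()).ratio()
        if score >= threshold:
            scored.append((score, source, target))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:limit]
```

This sketch compares the sentence against every unit in the memory, so it would not by itself reach the "fractions of a second" mentioned above on a large TM; the index that the software builds is presumably what narrows down the candidates first.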

I think (but I may be alone here) that solving this particular problem within FileMaker (= providing relevant search results even if some search terms are misspelled or do not appear in your records) would benefit not only translators, but other applications/businesses as well.

Look at what Google does today. Ten years ago, even a minor spelling error in a search would result in zero results. A search for "lotus cars" would not bring up a given page if it only contained "Lotus Car".

Today, you can "shoot from the hip" (not worry about inflections, spelling etc.), and you'll still get relevant results from Google, Yahoo, Bing & Co.

(Let's not talk about SEO/spamming here, which is a different problem).

What I probably want to say:

If somebody (a plug-in developer, the FileMaker team, a clever FM solutions developer) could do what you suggest (= redefine/solve the problem of fuzzy matching) in FileMaker, it would probably result in a user experience that is close to what 99% of computer users expect these days when they open their web browser.

Just my $0.02.

Link to comment
Share on other sites

Then found that most of the folks that were doing translations at my clients sites were using google translate. So we added a pop up window that is a simple web viewer integrated to the google translate web page, and then pull the results from it also.

That is the most pragmatic approach (and some CAT tools already have such an option for Google machine translation). However, many professional translators...

a) work on confidential documents, so sending "raw" sentences around the world to Google is out of the question (I'm not paranoid, just stating what you'll find in most NDAs).

b) rely on "local", high-quality translation memories, i.e. their own (or the company's) intellectual property.

With complex documents/topics, Google Translate still isn't too useful. This will definitely change one day, but right now I am/was looking for a way to get "fuzzy" results from my own databases/translation memories.

Which, as it seems, isn't a trivial problem, at least here and now.

Link to comment
Share on other sites
