
This topic is 3705 days old. Please don't post here. Open a new topic instead.


As the title implies, I need a script step that checks whether a tag is in fact a real word.

 

Context:

I'm creating a script to import PDF documents into a FileMaker database. Using AppleScript, it selects the PDF's text, converts it to a list of all words, and sends that list to a FileMaker record along with the PDF's name and location in Finder. The script then imports the PDF into a container field for future reference. The list of words is then set as the document's tags. After import, I can search the tags and show related documents.

 

Problem:

The problem is that OCR is not perfect. I can get strings like greksfhqr[oif or wordsmashedtogether, both of which are treated as words. I need to check whether the string is in fact a word. If true, set it as a tag; if false, >>future magic I have not yet even tried to figure out<<. The goal here is just to solve the first part of the If: if true, set as tag.

 

Possible Solutions Tried/Considered:

1) Send $tag to a field and have the script execute "Check Spelling" on that field. This requires user interaction; I want to run the check without any UI.

 

2) Create a web portal with the word and check the result of a query to a website. This has potential if the query and response can happen programmatically. Running the query through a web browser is VERY SLOW. Imagine checking every word of a 20,000-word document.

 

3) Use a shell script in conjunction with AppleScript. This could work, except that Dictionary.app does not support AppleScript. There is a plain-text word list at /usr/share/dict/words, but the words are from the 1930s and the file may not be easy to update. I could try importing these words into a subtable, creating the DB's own dictionary. Wondering if there is an easier way, though.
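
For what it's worth, option 3 can be sketched outside FileMaker in a few lines. This is only a rough illustration in Python, not something from the original script: it assumes the system word list lives at /usr/share/dict/words (one word per line), and the names `load_wordlist` and `is_word` are made up for the example.

```python
from pathlib import Path

# Assumed location of the system word list on macOS; adjust if yours differs.
WORDLIST_PATH = Path("/usr/share/dict/words")


def load_wordlist(lines):
    """Build a lowercase set of words from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def is_word(token, words):
    """Return True if the token, lowercased, appears in the word set."""
    return token.strip().lower() in words


if __name__ == "__main__":
    # Only runs the demo if the assumed word list file is actually present.
    if WORDLIST_PATH.exists():
        with WORDLIST_PATH.open(encoding="utf-8", errors="ignore") as handle:
            words = load_wordlist(handle)
        for token in ["document", "greksfhqr[oif", "wordsmashedtogether"]:
            print(token, is_word(token, words))
```

Loading the list into a set once means each lookup is a hash check rather than a scan of the file, so 20,000 tokens should not be a problem. The same functions would work with any downloaded word list that has one word per line.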

 

Plea:

Any suggestions?  How would you resolve this issue?

 

Cheers,

Marcus


My suggestion would be to download an English word list (there are many online; search around and find one you like) that lists its words one per line, with no definitions or other values on each line. Then import those words into a new table (e.g. Wordlist) in your database, and create a relationship from YourTable::GlobalTag to Wordlist::Word. For each word, if IsValid(Wordlist::Word), then you have a match!

 

The drawback of this approach is that you'd still have to loop through all 20,000 words in your document, set each one to the global, and then check the relationship. It's totally automated, and much faster than the web, but still not exactly speedy. If that's OK with you, though, give it a try.

 

Edit:

 

Alternatively, upon rereading your post, it seems you might already be converting your PDFs into a list of words. If that is the case, simply make a relationship between YourTable::File_Converted_To_Text and Wordlist::Word, then apply List(Wordlist::Word) to get a list of all the valid tag words. That should be much faster!
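
To illustrate why matching the whole list at once beats checking word by word, here is a rough Python sketch of the same idea. It is an analogy only, not FileMaker code: the relationship plus List() is modelled as a set membership filter, and `valid_tags` is a hypothetical name.

```python
def valid_tags(tokens, wordlist):
    """Return the tokens found in the word list, lowercased, deduplicated, in first-seen order."""
    known = {word.lower() for word in wordlist}
    seen = set()
    tags = []
    for token in tokens:
        key = token.lower()
        # Keep a token only if it is a known word and has not been kept already.
        if key in known and key not in seen:
            seen.add(key)
            tags.append(key)
    return tags


if __name__ == "__main__":
    ocr_tokens = ["Invoice", "greksfhqr[oif", "total", "invoice", "wordsmashedtogether"]
    print(valid_tags(ocr_tokens, ["invoice", "total", "paid"]))
```

The OCR junk simply never matches anything in the word list, so it drops out without any per-word branching in the script.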


then simply make a relationship between YourTable::File_Converted_To_Text and Wordlist::Word, then apply List(Wordlist::Word) to get a list of all the valid tag words.

Ooohh ... that is elegant. The automated script is looping already; it's slower but works OK, fast enough. I could also SQL-search the "good" wordlist for each word, I guess. Thanks for the input.
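
The per-word SQL lookup mentioned here could look roughly like the sketch below. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in, so the table name `wordlist`, the column `word`, and both function names are assumptions for illustration; in FileMaker the equivalent query would go through its own SQL function instead.

```python
import sqlite3


def build_wordlist_db(words):
    """Create an in-memory table of lowercase words; the primary key gives an indexed lookup."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wordlist (word TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO wordlist (word) VALUES (?)",
        ((word.lower(),) for word in words),
    )
    conn.commit()
    return conn


def is_known_word(conn, token):
    """Return True if the lowercased token exists in the wordlist table."""
    row = conn.execute(
        "SELECT 1 FROM wordlist WHERE word = ? LIMIT 1",
        (token.lower(),),
    ).fetchone()
    return row is not None


if __name__ == "__main__":
    conn = build_wordlist_db(["invoice", "total", "paid"])
    for token in ["Invoice", "greksfhqr[oif"]:
        print(token, is_known_word(conn, token))
```

Passing the token as a bound parameter (the `?` placeholder) matters here, since OCR output can contain quotes and brackets that would break a query built by string concatenation.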
