Data Parser

aldipalo · April 6, 2008

Has anyone ever seen a script that can parse the name and address out of a data field?

And, of course, if you have, can you direct me to it?

Such as:

Name: Moira Westerner, M.D., Ph.D.

Home Address: 35 Pinky St., apt. 2, Boston, MA 02114

Office Address: Neuroendocrine Unit, Massachusetts General Hospital,

55 Fruit Street, Bulfinch 457B, Boston, MA 02114, 617-726-1347

E-mail: [email protected]

Fax: 617-627-2705

So that I can pull Prefix, FName, LName, Suffix, Address, City, State, Zip, Phone1, Phone2, emailaddress.

Of course the problem is no one ever puts it the same way.

Fenton · April 6, 2008

We've seen quite a few, over the years. But each tends to have its own idiosyncrasies. Yours for example has some seriously multi-valued chunks, Office Address in particular.

Otherwise, it looks like you've got 2 returns between form value chunks. There are various methods to parse out the data. But I think I'd just go for a Custom Function, built by one wiser :-]

I used the BetweenNext one, by Fabrice Nordmann (who has several variations of text parsing CFs), at:

http://www.briandunning.com/filemaker-custom-functions/list.php

That gets your form chunks out. Then you need to go through the multi-valued ones like Office Address and pull out the pieces. That one is kind of tricky because your example has:

2 pieces to do with the company (department & company)

2 pieces to do with the "street address" (though what the heck "Bulfinch 457B" is I have no idea; some archaic East Coast kind of thing I expect).

However, other Office Addresses may only have 1 piece for each of these. So your calculations must be able to handle either situation.

I did several calculations, using Position(), etc. But I think that I might also try first busting it into separate lines, then using GetValue() instead; I think it would be less tedious, and easier to get the values.
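To illustrate the "separate lines" idea, here is a minimal sketch in Python rather than FileMaker calculation syntax. It splits a chunk on commas and reads the predictable parts from the right (phone, state and zip, city), which is roughly what busting the text into values and picking them by position would do. The function name `parse_address`, the field names, and the "first part starting with a digit begins the street address" rule are all assumptions made for this example; they are not taken from the attached file.

```python
import re

# Assumed formats: US-style phone "617-726-1347" and "MA 02114" state/zip.
PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")
STATE_ZIP = re.compile(r"([A-Z]{2})\s+(\d{5}(?:-\d{4})?)")

def parse_address(chunk):
    """Split an address chunk on commas and read fixed parts from the right."""
    parts = [p.strip() for p in chunk.split(",") if p.strip()]
    result = {"company": [], "street": [], "city": "",
              "state": "", "zip": "", "phone": ""}
    # A trailing phone number, if present, is the last value.
    if parts and PHONE.fullmatch(parts[-1]):
        result["phone"] = parts.pop()
    # The next value from the right should look like "MA 02114".
    match = STATE_ZIP.fullmatch(parts[-1]) if parts else None
    if match is None:
        result["street"] = parts  # give up and leave the rest for review
        return result
    result["state"], result["zip"] = match.groups()
    parts.pop()
    if parts:
        result["city"] = parts.pop()
    # Heuristic (an assumption): the first value starting with a digit
    # begins the street address; anything before it is company/department.
    first_street = next((i for i, p in enumerate(parts) if p[0].isdigit()), 0)
    result["company"] = parts[:first_street]
    result["street"] = parts[first_street:]
    return result

office = parse_address(
    "Neuroendocrine Unit, Massachusetts General Hospital, "
    "55 Fruit Street, Bulfinch 457B, Boston, MA 02114, 617-726-1347"
)
print(office["company"], office["street"], office["city"])
```

Because the company and street lists can each hold one or two values, the same code handles an Office Address with a single piece for each. It will still misread addresses that do not follow this layout.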

Either way would work. Though I couldn't guarantee that anything will parse 100% of addresses correctly. And I am certainly not the best calculator around.

The script, when you get all the pieces, will not be difficult. But I'd try and build the calculations to get the pieces first, so you can tinker until they're correct (99% of the time).

Parse_CF.fp7.zip

aldipalo · April 7, 2008

Thanks, Fenton, that's excellent.

What if the user highlights the data to be parsed first?

I put this together to just move blocks of highlighted data into a specific field, but thought it was a little clunky since there are multiple fields. If the user highlights the Name/Address block and then parses the data, would the CF work on that through a script?

Field_Data.zip

Fenton · April 7, 2008

Well, you could do that. It would work pretty much the same, except the calculations would be based on the data from the highlighted text instead. I'd recommend either using a global Variable syntax, $$Data, or setting it into a global field. Unless you're a calculation genius and write everything correctly the first time, at some point you're going to want to see the results, in a view that's better than the Data Viewer.

In fact I don't think I'd use a "user selection," for a couple of reasons. First, unless the data block is huge, or has several people's data mixed up in it, it will make almost no difference to the speed of processing. Second, I would want to see the results of a lot of records at once. Because you really need to see how your calculations work on all of them, and, if not, whether you can tweak them to work. Of course, you could do that with a script also. But I find writing calculation fields easier, as I can see the result immediately (I used Unstored fields).

For example, your data does not have a "phone2" in the office data. But, if it did, my calculation would get some of it incorrectly. You would need some more advanced logic to accurately parse that block of data.
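One way to cope with a possible second phone number, sketched here in Python as an assumption rather than as what the attached calculations do, is to pull every phone-like value out of the chunk first and only then parse what remains. The helper name `pull_phones` and the phone pattern are hypothetical and cover only common US formats.

```python
import re

# Assumed phone shapes: 617-726-1347, 617.726.1347, (617) 726-1347
PHONE_RE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]\d{4}")

def pull_phones(chunk):
    """Return (chunk without phone values, list of phone values)."""
    parts = [p.strip() for p in chunk.split(",")]
    phones = [p for p in parts if PHONE_RE.fullmatch(p)]
    rest = ", ".join(p for p in parts if p and not PHONE_RE.fullmatch(p))
    return rest, phones

rest, phones = pull_phones(
    "55 Fruit Street, Boston, MA 02114, 617-726-1347, (617) 726-0000"
)
print(rest)    # 55 Fruit Street, Boston, MA 02114
print(phones)  # ['617-726-1347', '(617) 726-0000']
```

The first entry in the list would map to Phone1 and the second, if any, to Phone2. The second number in the usage example is made up for illustration.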

Or you could do all the figuring out in another test file, then do the real parsing via script only. That's probably the best solution, since it would not clog up the real file with calculation fields.

But I still don't really know why the user would need to select the data to parse. If it is because the data is coming from various sources, and is pasted in, then you've bigger problems.

Name parsing is a whole 'nother problem. I've yet to see a solution that can accurately parse all the variations of people's names. You can get most, but not all. I've used this (old) file of Lynn Bradford (which I doubt you'll find anymore). It uses repeating fields, in each record, to enter exceptions. If the exception applies to multiple records, then you need to Replace to apply to them. A little awkward perhaps, but it works. Do not Delete All Records, or you'll lose the exceptions.
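As a rough illustration of why names only parse "most, but not all" of the time, here is a minimal Python sketch that strips known prefixes and suffixes and then treats the first remaining word as the first name. The prefix and suffix lists, the `parse_name` function, and the `exceptions` lookup are assumptions for this example. The lookup is only a loose analogue of entering exceptions by hand; it is not how the attached file works internally.

```python
PREFIXES = {"dr", "mr", "mrs", "ms", "prof"}              # assumed list
SUFFIXES = {"md", "phd", "jr", "sr", "ii", "iii", "esq"}  # assumed list

def _key(token):
    """Normalize 'Ph.D.' to 'phd' for comparison."""
    return token.replace(".", "").lower()

def parse_name(raw, exceptions=None):
    """Split a full name into prefix, first, last, suffix."""
    # Hand-entered exceptions win over the general rules.
    if exceptions and raw in exceptions:
        return exceptions[raw]
    tokens = raw.replace(",", " ").split()
    prefix, suffix = [], []
    while tokens and _key(tokens[0]) in PREFIXES:
        prefix.append(tokens.pop(0))
    while tokens and _key(tokens[-1]) in SUFFIXES:
        suffix.insert(0, tokens.pop())
    return {
        "prefix": " ".join(prefix),
        "first": tokens[0] if tokens else "",
        "last": " ".join(tokens[1:]),
        "suffix": ", ".join(suffix),
    }

print(parse_name("Moira Westerner, M.D., Ph.D."))
```

A name with a two-word first name or a middle name would come out wrong under the "first word is the first name" rule, which is the kind of case an exception entry has to cover.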

NameParseLynn.fp7.zip

aldipalo · April 7, 2008

Fenton:

Usually a CV will have from a bare minimum of 3 pages up to 75+ pages. A CV/Resume is probably one of the worst text blocks to parse because no one lays out their resume the same way.

We recruiters usually get resumes one at a time. If we are looking to gather (harvest) resumes from a job board, there are commercial resume parsers out there, for a few hundred dollars, that will do the job quite nicely. It's the single CV that we receive, where we need to create a new record and cut & paste the contact data, that is the time consumer. That's why I felt that if the user could highlight the name/address block and just parse it out, they would still save a good deal of time.

I like the last example you sent and will see if I can understand it and make it work. Thanks for your help.

Al

Fenton · April 7, 2008

I see why you'd want to pull out just some of the data. You seem to have that covered in your file, using the active selection. But I would still use a global field, or a global Variable, then calculation fields to at least build the tool. Because, as I said, I would want to see if it worked in every case. If you have no control over the source, it is going to be difficult or impossible to build something that works flawlessly for everything. But you can try and get most of it. You may find that you cannot finally parse out multi-value chunks of data. But you should be able to at least get the labeled chunks.
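Getting the labeled chunks can be sketched like this in Python (not FileMaker syntax). It assumes the labels are known in advance and each appears at the start of a line followed by a colon, as in the sample data. The `LABELS` list and the function name `split_labeled_chunks` are hypothetical.

```python
import re

# Assumed label set, taken from the sample block in the first post.
LABELS = ["Name", "Home Address", "Office Address", "E-mail", "Fax"]

def split_labeled_chunks(text, labels=LABELS):
    """Return {label: value}, where a value runs up to the next label."""
    alternatives = "|".join(re.escape(label) for label in labels)
    pattern = re.compile(r"^(%s):[ \t]*" % alternatives, re.MULTILINE)
    matches = list(pattern.finditer(text))
    chunks = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Collapse line breaks so a wrapped Office Address becomes one line.
        chunks[m.group(1)] = " ".join(text[m.end():end].split())
    return chunks

sample = """Name: Moira Westerner, M.D., Ph.D.

Home Address: 35 Pinky St., apt. 2, Boston, MA 02114

Office Address: Neuroendocrine Unit, Massachusetts General Hospital,

55 Fruit Street, Bulfinch 457B, Boston, MA 02114, 617-726-1347

Fax: 617-627-2705
"""
for label, value in split_labeled_chunks(sample).items():
    print(label, "->", value)
```

Text that uses different labels, or none at all, yields missing keys, so it would need to be flagged for manual entry rather than guessed at.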

I would do this in a test file, not in the real file, because it's going to be messy, and there will be different approaches to try. As I said at the end of my first post, I think I might try the "multiple lines" approach, instead of the comma-separated one. Once you've got the calculations as good as you can get them, then you can move what you need into the real file, putting the calculations into script steps.

I think I would consider moving these resumes into another table, separate from the main contact table, probably a 1-to-1 relationship on the Contact ID. This is easy to do; the user doesn't have to know. That will keep any extra fields you need for processing out of the contact table.

The last file I uploaded is simply for separating a single name field into its parts. Yeah, anything to do with you humans is messy :-]
