Cleaning up content across a large number of fields, across many records



I've posted this at the official FM forums, but the scripting section here seemed more robust. Please forgive the repost.

I've seen recommendations on how to do this with fields that are blank, but I wasn't able to modify that scripting to apply to my case, although there's probably someone out there who could point out where I went wrong.

In summary though, I have a data file (.csv) with values assigned in a random order to a constituent record. The record will have anywhere from 4 to 100 values associated with it.

Many of these values are database-generated and not relevant to me, and I would like to strip them out.

I was trying to find a way to blank/replace the contents of any field that matches specific criteria. The two key criteria are cases where the field value is entirely numeric (random location, length), or where the leftwords of any field match specific criteria (after the first few characters, the field name/length is random).

For the life of me though I can't get it to perform those searches across all 100 random attribute fields.

I tried setting the current field value as a $$ variable within the script, but it didn't work the way I've written it. I'm sure someone has done something similar in the past, and if I could be pointed in the right direction I think I can go from there. At the moment, though, I'm having a hard time getting a script that correctly navigates and searches through all 100 fields before moving on to the next record and starting again from the top.

I'll attach my attempt at a version that replaces any all-numeric field contents with "NA"; please don't laugh. :)

My goal in all of this is to export the cleaned data.

Screen shot 2011-04-07 at 12.04.43 PM.png


I am rather confused by this: you import this data from a .csv file? How is it possible for the attributes to be "in a random order"?

The record will have anywhere from 4 to 100 values associated with it.

But it will still have 100 fields, won't it? I can't understand why SOME values (in the same field) are worth keeping, while others are not.


I am rather confused by this: you import this data from a .csv file? How is it possible for the attributes to be "in a random order"?

But it will still have 100 fields, won't it? I can't understand why SOME values (in the same field) are worth keeping, while others are not.

This is an odd case in that I don't need to keep track of which values are associated with a record. I'm just trying to remove all extraneous information from this giant export in order to get a comprehensive list of user-assigned (rather than DB-assigned) values on any of these records. The goal is to extract all of the relevant field values and then use those to clean up the overall DB. Once I have a list of user-assigned values, I can export records that contain that overlay data, and then re-upload and re-assign the values in a better way.

The data was originally imported from a variety of sources over the course of many years, and associated with a constituent's record as an overlay table in SQL. So Constituent 1 might have 30 values associated with their record, Constituent 2 might have 5, and Constituent 3 might have 74 (I only include 5 or 6 for each in the example below though)

The 5th value typically won't be the same information for any two records, because data was assigned chronologically to one giant field. I can export the contents of that single field as .csv values, and create artificial headers ("value1, value2, value3" etc.), but the actual data is just a single string of comma-separated text, one line per constituent, and the data string is in whatever random order the data was assigned to the record over the record's lifetime.

I can only export the data as a .csv from the SQL field, and when doing that, there isn't an option to sort or otherwise order the data.

So I might have:

Record 1 2008595, Congressional District=28, County=Monroe, Outreach=took action in 2008, State House District=132, State Senate District=56,

Record 2 2008633, Med. Med. Profs, Dr. Joe Example, Joe Example,

Record 3 2008638, Med. Med. Profs, Med. Med. Doctor, Dr. XYZ B. ABC M.D. 800,

In this case, the only information I'd really care about would be "Med. Med. Doctor" and "Med. Med. Profs". We'd want to search for that overlay data against all records, export records that have that criteria then upload them properly into a DB so that the information is consistent in labeling, location etc.

The only way I can think to do that with data that's basically in a random order across 100 fields in the FMP document is to have a script that starts with the first field and searches for specific criteria: if it finds a match it removes the value and moves on, and if it doesn't match it leaves the data alone and goes on to the next field. At the end of the last field, I want it to go to the next record and start from the top, and rinse/repeat through all the data. (To make it easier to see, in the sample data above, with names changed/modified, the values I wouldn't want are things like the ID numbers and district entries; the plain values such as "Med. Med. Profs" and "Med. Med. Doctor" are the ones I'm trying to retain/export.)

At the end of the day, I then want to export the remaining data that I haven't cleared out, dedupe it to get a comprehensive list of how user-uploaded data is labeled in the original file, then use that information to extract/re-import the information one piece at a time instead of having this mishmash all assigned in a single field.


Looking at your example, if you import this into Filemaker you'll get a table with 3 rows and 7 columns.

before.png

If you delete some of the imported data, you will still have a table of 3 rows by 7 columns.

after.png

When you export this, you must export every column that has data in at least one record - {f2, f3, f4} in the current example - and you will have empty fields both in the export and in the re-import.

I believe this could be solved by exporting to XML and using an "intelligent" custom XSLT stylesheet during the export or re-import. But I still don't know what is the final format of the remaining data that you wish to achieve.

Anyway, in order to delete the unwanted data, you must specify what the exact criteria for keeping/deleting the contents of a field are. I can't find that in your script, and the following description is not quite clear, especially the second part.

The two key criteria are cases where the field value is entirely numeric (random location, length), or where the leftwords of any field match specific criteria (after the first few characters, the field name/length is random).


When you export this, you must export every column that has data in at least one record - {f2, f3, f4} in the current example - and you will have empty fields both in the export and in the re-import.

I believe this could be solved by exporting to XML and using an "intelligent" custom XSLT stylesheet during the export or re-import. But I still don't know what is the final format of the remaining data that you wish to achieve.

I don't mind having empty cells; I just want to strip out all the data that matches specific criteria, no matter where it's located in one of the 100 fields in the record.

Anyway, in order to delete the unwanted data, you must specify what the exact criteria for keeping/deleting the contents of a field are. I can't find that in your script, and the following description is not quite clear, especially the second part.

There are two distinct cases where I want to remove data. In the example, I was trying to find a way to have it remove data from any data field that was entirely numeric. None of the data that I want to keep is made up of all numbers, so I thought I could do the division on the field, and based on the true/false value only remove data from fields that were entirely made up of numbers.

The second condition, which I didn't include any example of, would be a series of variations involving an if/then condition using a calculation with Left (text; number of characters). All of the thousands of DB-generated values share similar starting text strings, so I can remove anything that begins with "Congressional District" no matter which field it's in, or which district number it is. The problem I'm having is that I don't know how to get it to loop: navigate to the "next" field in the record, do the same conditional check over and over until the end, then go to the next record.


I thought I could do the division on the field, and based on the true/false value only remove data from fields that were entirely made up of numbers.

This is based on a false assumption. A field that contains digits along with text is going to be reduced to a number when you apply a numerical operation to it. For example,

Field / 100

will return 2.5 when Field contains "abc2de5f0g", and 0 when Field contains "0" (all numerical).

The correct test would be:

Filter ( Field ; "0123456789." ) = Field
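
For example, using two values from the sample data posted earlier (just an illustration):

Filter ( "2008595" ; "0123456789." )                    // returns "2008595", equal to the original, so the value counts as all-numeric
Filter ( "Congressional District=28" ; "0123456789." )  // returns "28", not equal to the original, so the value would be kept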

The problem I'm having is that I don't know how to get it to loop: navigate to the "next" field in the record, do the same conditional check over and over until the end, then go to the next record.

You don't need to go to the next record. Replace Field Contents[] works on the entire found set. See the attached file for a simple example (it clears all field values except where Field > 700).

ReplaceAllFields.zip

---

Note that fields must be on the layout, and within tab order.
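
For reference, the general pattern is roughly as follows (a sketch only, not necessarily the exact steps in the attached file; it assumes the current layout holds all of the fields to be processed, in tab order):

# With no field active, Go to Next Field enters the first field in the layout's tab order
Go to Next Field
Set Variable [ $firstField ; Value: Get ( ActiveFieldName ) ]
Loop
  Set Variable [ $currentField ; Value: Get ( LayoutTableName ) & "::" & Get ( ActiveFieldName ) ]
  # With no target field specified, Replace Field Contents acts on the active field, across the whole found set
  Replace Field Contents [ No dialog ; Calculated result: Case ( GetField ( $currentField ) > 700 ; GetField ( $currentField ) ) ]
  Go to Next Field
  Exit Loop If [ Get ( ActiveFieldName ) = $firstField ]
End Loop

Swapping the Case ( ) test for a different criterion changes what gets kept; the field-to-field navigation stays the same.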


This is based on a false assumption. A field that contains digits along with text is going to be reduced to a number when you apply a numerical operation to it. [...] The correct test would be: Filter ( Field ; "0123456789." ) = Field

Ahh, I didn't know that.

You don't need to go to the next record. Replace Field Contents[] works on the entire found set. [...]

That is AWESOME! Thank you. I think I can use this script in a lot of other places too!



Two questions if you don't mind a follow-up.

1. I wasn't able to modify what you had written to work properly with a specific set of "Left (text; # of chars)" definitions to be cleared, along the lines of the numeric values. (edit: see the new edit below, I found something that works.) Would you be willing to write up an example of that, or at least the line or two that would need to be tweaked? The variations I tried either end up with all fields blank regardless of input, leave the fields untouched, or replace the field contents universally with the field names, none of which is what I'm intending.

Ex. Changing the line that defines removing a value greater than 700 to *try* and say "remove a field that matches the following text criteria" (in this case, that the first 22 characters spell out "Congressional District"):

Case ( GetField ( $currentField ) & Left ($currentField; 22) = "Congressional District" ; GetField ( $currentField ) )




Leads to all fields just being blanked out.





2. Is there something else I can do to modify the script to be smarter about removing number fields? Since they're stored as text, using terms like > 700 doesn't work in all instances. (In the sample file, it will leave 70, 9, 84, etc. because they're evaluated as "larger than 700"; since the fields are all text-based, I assume it's going off of the leading character.) Changing the file to just say "greater than 1" ended up with a script that removed nothing. I'm not sure why that is, based on how your script is written.



It was easy enough to just reverse that and tell it anything smaller than 999999999999 should be removed, but I like to know why something didn't work as intended when possible, so I don't make the same mistake a second time. Are there specific things I can do when scripting to handle occasional purely numeric values in random text fields, or do they always have to be evaluated as text?

Thanks again for your help.



edit: I believe I've found the correct line to use for the leftwords removals; see below if anyone else is trying to solve a similar problem down the line:




Case ( Left (GetField ( $currentField ) ; 22)  ≠ "Congressional District" ; GetField ( $currentField ) )
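
(On question 2: one possibility, untested and just a guess on my part, would be to wrap the value in GetAsNumber to force a true numeric comparison instead of a text one, something like:

Case ( GetAsNumber ( GetField ( $currentField ) ) > 700 ; GetField ( $currentField ) )

though with mixed values like "Congressional District=28", GetAsNumber just pulls out the digits (28), so the all-numeric Filter test above still seems like the safer check for this data.)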


Is there something else I can do to modify the script to be smarter about removing number fields?

I thought I already answered that in post #6 above. In the context of my script, it would be:

Let (
  f = GetField ( $currentField ) ;
  Case ( Filter ( f ; "0123456789." ) ≠ f ; f )
)
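
If both cleanup rules need to run in the same pass, clearing purely numeric values and also clearing anything that starts with "Congressional District", the two tests can be combined along these lines (a sketch in the same style; additional prefixes would just add further conditions):

Let (
  f = GetField ( $currentField ) ;
  // keep the value only if it is not purely numeric and does not start with the DB-generated prefix
  Case ( Filter ( f ; "0123456789." ) ≠ f and Left ( f ; 22 ) ≠ "Congressional District" ; f )
)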
