dmontano Posted August 23, 2008

I have a "table" of data (an Excel spreadsheet). I am trying to build tables that will normalize the data from the spreadsheet. The spreadsheet consists of these columns:

continents
countries
states
counties
cities
zip codes
latitude
longitude

I am building separate tables in FileMaker as such:

continents
countries
states
counties
cities
zipcodes (will hold latitude and longitude as fields)

I have been able to re-establish the relationships by a somewhat hackish method, up until I got to the "cities" table. In other words, I established: countries to their continents; states to their countries; counties to their states. But I do not know how to get: cities to their counties; zip codes to their cities. The sheer number of these (40,000) makes my hack technique worthless.

So, two questions:

1. Is there a typical method to re-establish relationships like these when the data comes in as one table and you are breaking it out into multiple tables? In other words, should I have started at the most distant child (zip code) and worked my way up to the parent?

2. What techniques / methods could I use to re-establish what I have done so far?

Any help appreciated. Thanks.
comment Posted August 23, 2008

I am not sure there are significant benefits to be gained by normalizing this type of data. However, if you wish, I think the general method would be something like this:

1. Import all of the data into a table.
2. Sort by continent, country, state, county, city and zip code.
3. Assign a serial number (ZipCodeID) to each record.
4. Loop through the records and assign ContinentID, CountryID, StateID, CountyID and CityID, raising each number by 1 at the corresponding break*.
5. Import the records into their respective tables, after defining the ID field to be unique (validate always).

(*) A break must be checked at all upper levels too, since (at least in theory) it's possible for the city name to remain the same while we have moved to another county, etc.
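To make the footnote concrete, here is a tiny invented sample (the place names are hypothetical) after sorting, with the IDs such a loop should assign. Note the last row: the city name does not change, but the county break must still bump CityID:

State   County   City           CountyID   CityID
OH      Clark    New Carlisle   7          20
OH      Clark    Springfield    7          21
OH      Greene   Springfield    8          22

A loop that watched only the City column would miss that last break and treat two different cities as one.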
dmontano (Author) Posted August 24, 2008

Hi Comment, sorry to trouble you; I do not understand some of what you have said, but I am trying...

All values are in a single table. I assigned a unique ID to all of the zip codes in the table. Now I am unclear on your instructions from there (loop through, etc.). I am assuming I need to create one field for each of the following in the same table:

ContinentID
CountryID
StateID
CountyID
CityID

I am unclear on what the break is, and on how to implement the assigning. I know this is fundamental; I am ignorant though.
Fenton Posted August 25, 2008

I think what comment is saying is to set each of the values into a variable as you loop. Then on the next record, check each one to see if its corresponding field's value has changed. If so, increment that entity's ID by 1. The result when you're done would be the same ID numbers as if you had created the whole thing in a relational system. So then you just need to import the ParentID (if any), ID and Name for each entity to its own table. Remember to set/check that the IDs auto-enter the next value.
comment Posted August 25, 2008

Yes, you need those fields. The break is where the value changes. A quick pseudocode to populate one of the fields:

Go to Record [ First ]
Loop
  If [ City ≠ $city ]
    Set Variable [ $cityID ; $cityID + 1 ]
    Set Variable [ $city ; City ]
  End If
  Set Field [ CityID ; $cityID ]
  Go to Record [ Next ; Exit after last ]
End Loop

This is just a primitive sketch. It could be made more elegant and faster by employing Mikhail Edoshin's Fast Summaries method - but for a one-time conversion it should do quite well.
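Extending that sketch to all five levels in a single pass might look like the following (an untested sketch; the field names are assumptions, so substitute your own). A $break flag cascades each break down to the levels below it, which is what the footnote above is about:

Go to Record [ First ]
Loop
  Set Variable [ $break ; 0 ]
  # Continent level
  If [ Continent ≠ $continent ]
    Set Variable [ $continent ; Continent ]
    Set Variable [ $continentID ; $continentID + 1 ]
    Set Variable [ $break ; 1 ]
  End If
  # Country level - a break at any level above forces a break here too
  If [ $break or Country ≠ $country ]
    Set Variable [ $country ; Country ]
    Set Variable [ $countryID ; $countryID + 1 ]
    Set Variable [ $break ; 1 ]
  End If
  # State level
  If [ $break or State ≠ $state ]
    Set Variable [ $state ; State ]
    Set Variable [ $stateID ; $stateID + 1 ]
    Set Variable [ $break ; 1 ]
  End If
  # County level
  If [ $break or County ≠ $county ]
    Set Variable [ $county ; County ]
    Set Variable [ $countyID ; $countyID + 1 ]
    Set Variable [ $break ; 1 ]
  End If
  # City level
  If [ $break or City ≠ $city ]
    Set Variable [ $city ; City ]
    Set Variable [ $cityID ; $cityID + 1 ]
  End If
  Set Field [ ContinentID ; $continentID ]
  Set Field [ CountryID ; $countryID ]
  Set Field [ StateID ; $stateID ]
  Set Field [ CountyID ; $countyID ]
  Set Field [ CityID ; $cityID ]
  Go to Record [ Next ; Exit after last ]
End Loop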
dmontano (Author) Posted August 25, 2008

Fenton and Comment, thanks to both. I have been staring at my screen for, literally, hours. I keep reverting to hacking some way because ignorance reigns supreme in my head. I do understand the concept of at least getting the keys to be unique per value, and that the association between the keys of the other tables is what makes the relationship happen. I was just really lost on how to implement it. I will take a stab at what has been advised and let you know how far I have gotten, or at least I can demonstrate further how deep my ignorance runs. Truly, thanks... now I "think" I know which way to start swimming.
Fenton Posted August 25, 2008

Alternatively, you could import the parent name AND the unique instances of each table entity's names into a new table (likely produced by Export summarized), allowing FileMaker to generate the auto-enter ID for the table. Then use the names in temporary relationships to import the correct parent ID into each table; then switch the relationships to IDs.

Comment's method, though it requires a script with a loop incorporating several tests, is a lot less manual labor; you run it once on the single table you have, to populate the IDs for all the tables, then import/create the appropriate tables, create the ID relationships, and you're done. You know what the end result should be.

One has to consider whether normalization is really worth it. But once it's done, it will be. :-]
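To spell out one level of that alternative (a sketch only; the table and field names here are invented for illustration, and nothing below comes verbatim from the thread): suppose the unique state/county name pairs have been imported into a Counties table, where an auto-enter serial has filled in CountyID. A temporary name-based relationship can then pull each county's parent ID across:

# Temporary relationship, defined on names:
#   Counties::StateName = States::StateName
# Then, from a layout based on Counties:
Show All Records
Replace Field Contents [ Counties::StateID ; Replace with calculation: States::StateID ]
# Once every table holds its parent IDs, redefine the
# relationships to match on the ID fields instead of the names.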
dmontano (Author) Posted August 25, 2008

I had thought of taking the table as a whole and relating it into the individual tables. I got started and got it to work on one table. The values I was joining were easy: country to state. Then it got sticky. I realized that state to county was not the same thing, as more than one state can have the same named county. Comment pointed this dynamic out, and it continues down further into cities. Since this is my first go at doing this sort of thing, I was having doubts about how to make multiple criteria fit. I guess that would be a multi-predicate relationship?

In any event, I am working on the scripted process. Your comment also made me realize that while I am running it now, I am doing it incorrectly. You stated that this can all be set in one script pass; I am running it once per "new entity". For example, I am running it on continents, then I was going to run it again on countries. Sounds like I took Comment's advice too literally and did not realize I should be writing the script to contain ALL of the key assignments. Still not sure.

Also, I have yet another question... On the first Set Variable, is it possible to use the actual start value that I assign in the serial string in the field specification, instead of "$cityID + 1"? For example, I have a habit - whether it be good or bad is another thread - where I prefix my serials with the table name. It has helped me see if I have my keys in the right place (seeing only numbers at my level would give me no real clue if I have jumped the tracks). In this case, since I am assigning keys in one table to fields that will eventually become their own tables, I would like to be able to set the keys with the appropriate future table prefix. For instance, I would like to set the start keys like:

continents to a start value of: continent_pk_001
countries to a start value of: country_pk_0001

Please bear in mind, this is for my benefit - however lame it may be. As mentioned, it gives me insight into whether I have keys in the right place. In the script posted, the first Set Variable is using the local variable's value + 1. I am afraid if I tamper with the script as outlined I will lose all ability to see what is going on.

Thanks, David
dmontano (Author) Posted August 25, 2008

I think I figured out the "table" prefixing...

Set Variable [$continent_pk; Value: "continent_pk_" & $continent_pk + 1]

I created each pk assignment as a script, then created a Main script that includes each pk-setting script as a subscript. Does this sound alright? Or is it faster to place them all in one script, if possible?
comment Posted August 25, 2008

"this can all be set in one script pass - I am running it once per "new entity"."

It doesn't really matter; the result is the same (or at least it should be).

"Set Variable [$continent_pk; Value: "continent_pk_" & $continent_pk + 1]"

It should be:

Set Variable [ $continent_pk ; Value: "continent_pk" ]

the first time (before starting the loop), and after that:

Set Variable [ $continent_pk ; Value: SerialIncrement ( $continent_pk ; 1 ) ]
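As an aside (my example, not code from the thread): SerialIncrement keeps the text portion of the value intact and increments only the trailing number, and leading zeros should be preserved, so prefixed keys stay readable:

SerialIncrement ( "continent_pk_001" ; 1 )   // returns continent_pk_002
SerialIncrement ( "country_pk_0001" ; 1 )    // returns country_pk_0002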
dmontano (Author) Posted August 25, 2008

I am really making progress with the help provided. Thanks so much. I do have some additional questions, and I think they may improve the speed or functionality of the "routine" I am trying to create. First off, to state a purpose to frame what I want: create a routine to implement table normalization when I encounter instances where I wish to normalize the data. (Once again, I am keeping in mind the observation Comment made about whether this data is worthy of normalization or not, and also Fenton's observation that once it is done, it is often worth it. As time goes by and my experience level goes up, I hopefully will have the capacity to see more clearly and choose appropriately based on the task at hand.)

Question 1: Is this a reasonable routine when I encounter a table of data (say an Excel spreadsheet)?

1. See if the data CAN be normalized;
2. Decide if normalization makes good sense based on the purpose of the data;
3. If so, import the data into a FileMaker table;
4. Create one additional field for each "new entity" I want to extract from the data;
5. Each additional field will become (via assignment) the primary key for that new entity;
6. In order to retain the built-in relationship from the composite table, keys are set in each of these new fields using the technique described by Comment;
7. Then create empty tables that will be populated from the composite table, with the new keys assigned to their respective fields/tables;
8. Establish the relationships to the separate tables as needed.

This is basically what I think was advised by Comment, but I am trying to clarify.

Question 2: It took more than a couple of hours to run the script over 42,000 records. I was setting 5 different keys per record, so that would be a total of 210,000 fields that got set with new primary keys. How can I speed that up?

A. Use Freeze Window? I do want to be able to stop the script myself if I notice something is going wrong. I would hate to lock myself out of stopping the script. Obviously this is an admin-only routine.

B. Is it faster if I set each of the 5 keys in one record before looping to the next record, or loop through all records setting one field "column" at a time? For example: set all continent keys, then go back and loop through all country keys; OR set each of the five different keys for the one record and then loop to the next? I ask because it seems FM would have to visit 210,000 records one way, versus 42,000 records the other.

C. Is it possible to use a script step that uses "Replace Field Contents" and use the calculation box to perform the same function of setting key values? If so, is this faster? I did create a "replace field contents" script because I was not able to get the "Set Keys" script to work for a while. In that script, I clear the values to ( "" ) with a "Replace Field Contents" script step, and it is way faster than the loop I had initially created. I understand these are different tasks and are different in scope, but I thought I would ask.

Thanks again, and I hope to post the end result in case anyone else is in a similar situation and may find it helpful to set new keys to retain the existing relationship.
dmontano (Author) Posted August 25, 2008

Update: screenshots attached. I have noticed a tremendous speed increase and do not know what it is attributable to. The only thing I can think of is that the SerialIncrement used in the Set Variable that Comment suggested earlier in the thread made the difference. Could this result in such a drastic improvement - from over 2 hours to 11 1/2 minutes? The only other thing that changed is computers: from a G4 to a G5. I really doubt this is the sole reason for the time improvement.

Lastly, would a Freeze Window improve the performance further? While the script is running, I see no window content changes as it is, except for the small scroll position icon along the right side of the window. Other than that, the screen stays static.

I am a novice, so the scripts may be plagued with some error I have unknowingly introduced - so use with caution. Any input is appreciated.
comment Posted August 25, 2008

"Is this a reasonable routine"

LOL, I think by now you know what my answer is going to be: it depends. The truth is that there are several ways to do this (Fenton mentioned another possible approach). I just picked the one that seemed the easiest to explain and to implement FOR THIS TYPE OF DATA.

Re speed: usually, you want to go for reliability over speed in one-time conversions. Freezing the window while looping does speed things up (and so does switching to Form view), and it doesn't prevent you from aborting the script. Applying Fast Summaries would speed it up even further - but again, the more sophisticated the technique, the easier it is to make a mistake, so you may end up spending more time perfecting the process than the time it takes to just run it in "brute force" mode.

"Is it possible to use a script step that uses "Replace Field Contents""

Maybe, but you'd still need to check the previous record AND bump the variable up, so I am not sure it would be significantly faster. And I am not going to test this, for the reasons explained above.

"Could this result in such a drastic improvement?"

Who can tell? You have broken the first rule of controlled testing: never change two things at once.
Fenton Posted August 25, 2008

Freeze Window and View as Form are well known to greatly speed up loop operations. You can switch the view back when you're done.
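For instance (my sketch, not from the thread), the earlier loop could simply be bracketed like this:

Freeze Window
View As [ View as Form ]
# ... the looping Set Variable / Set Field steps from the earlier posts ...
View As [ View as List ]
Refresh Window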
comment Posted August 25, 2008

OK, I know I said I wouldn't test this, but I got curious about how Replace Field Contents would handle incrementing the variable. As it turns out, it handles it quite well - and at least twice as fast as a loop. The calculation I have used is:

Case (
  Table::Category ≠ $category ;
  Let ( [
    $category = Table::Category ;
    $categoryID = $categoryID + 1
  ] ;
    $categoryID
  ) ;
  $categoryID
)
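A minimal sketch of how that calculation might be wired up (my illustration; the names mirror comment's generic Table::Category example rather than anything from the actual file): the found set must be sorted exactly as before, and since $category and $categoryID are script variables that must start out empty, the Replace should run from within a script:

Show All Records
Sort Records [ Restore ; No dialog ]
# Same sort order the loop version relied on.
Replace Field Contents [ Table::CategoryID ;
  Replace with calculation:
    Case ( Table::Category ≠ $category ;
      Let ( [ $category = Table::Category ; $categoryID = $categoryID + 1 ] ; $categoryID ) ;
      $categoryID ) ;
  No dialog ]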
dmontano (Author) Posted August 26, 2008

Hi Comment, I started to respond yesterday and was mentally drained. Still am. When I can formulate an attempt to understand where and how to use the calc you posted, I will try to reply with the results. That is great news that you noticed a speed improvement. I hope I can apply that myself. By the way, normalizing a seemingly simple table as I described is quite the exercise for me - I hope I am learning something from the effort...

For now, I am going to post a "theoretical" topic, and I hope you provide some feedback. It is regarding the "party" concept; for example, people and organizations are "parties", etc.