
Posted

I have a database that contains an Ancient Greek dictionary in XML. The Greek is encoded in Betacode (an ASCII encoding system for Ancient Greek), and I'm converting the Betacode Greek into Unicode Greek. My script works, but it takes more than 12 hours on a very capable MacBook Pro (mid-2010, 2.66 GHz Intel Core i7). I would like to optimize the script, but perhaps the sheer size of the file makes such a long processing time unavoidable. The features of my db are as follows:

 

1) 118,115 records.

2) One field per record containing the XML: from 1 to 7,595 tags per record.

3) The script contains two loops:

 

3.1. The first loop goes from the first record (1) to the last (118,115), record by record.

3.2. The second loop creates a variable and parses every single tag within each record.

 

4) When the second loop parses the XML and finds a tag that contains the attribute lang="greek", a Set Variable script step turns the Betacode into Unicode Greek using the Substitute function. There are more than 315 substitutions, applied through the Evaluate function: the correspondences between Betacode and Unicode live in another table, and the script builds the formula dynamically, once for the whole run, before the first loop (this takes less than a second); the formula is stored in a variable and applied to each XML element that contains the "greek" attribute (see "Evaluate($Substitute & Quote(TEXTOGRIEGO) & $Datos)" in the code; a rough sketch of how these two variables are built is included after the code below). The Unicode Greek is colored blue in order to check the changes in the XML more easily.

 

5) All the formulas are based on variables: instead of operating on fields, I've created variables that contain the field values, and the script operates on these variables (this change from fields to variables has sped up the whole process a lot). When all the transformations are finished (at the end of the second loop), the script puts each variable's value back into its field before going on to the next record.

 

6) While building the script I've created many steps that are now inactive. Can inactive steps slow down a script?

 

I know there are many operations (118,115 records x 1 to 7,595 tags per record x 315 substitutions on certain tags), but is it reasonable that it takes so many hours (more than 12!)? Most probably either my code is poor, or Filemaker is slow, or both.

Any suggestion is welcome. 

 

This is the code of the Set Variable step that parses the XML within the second loop; it is the step that consumes most of the time:

 

 

=============================KEY OF THE VARIABLES USED=============================

 

a) ETIQUETA = opening xml tag: e.g. <entry>

b) TEXTOGRIEGO = text between the xml tags (written in Betacode at the beginning): e.g. a)/nqrwpos (= Unicode ἄνθρωπος)

c) ETIQUETAFINAL = ending xml tag:  e.g. </entry>

d) $Texto_HTML_02 = variable containing the xml text that is being parsed;

e) $Substitute contains the first part of the Substitute expression to be evaluated: e.g. "Substitute(Substitute(Substitute( …";

f) $Datos contains the second part of the Substitute expression (the correspondences between Betacode and Unicode Greek): e.g. "; "*a)/|" ; "ᾌ") ; "*a)|" ; "ᾊ")  ; "*a)=|" ; "ᾎ") ; "*a(/|" ; "ᾍ") …")

 

=============================BEGINNING OF CODE (SET VARIABLE STEP)=============================

 
Let ( [
// opening tag number $Contador, e.g. <orth extent="full" lang="greek" opt="n">
ETIQUETA = Middle ( $Texto_HTML_02 ; Position ( $Texto_HTML_02 ; "<" ; 0 ; $Contador ) ; ( Position ( $Texto_HTML_02 ; ">" ; 0 ; $Contador ) + 1 ) - Position ( $Texto_HTML_02 ; "<" ; 0 ; $Contador ) ) ;

// text between the opening and the closing tag (Betacode Greek when lang="greek")
TEXTOGRIEGO = Middle ( $Texto_HTML_02 ; Position ( $Texto_HTML_02 ; ">" ; 0 ; $Contador ) + 1 ; Position ( $Texto_HTML_02 ; "<" ; 0 ; $Contador + 1 ) - ( Position ( $Texto_HTML_02 ; ">" ; 0 ; $Contador ) + 1 ) ) ;

// closing tag, e.g. </orth>
ETIQUETAFINAL = Middle ( $Texto_HTML_02 ; Position ( $Texto_HTML_02 ; "<" ; 0 ; $Contador + 1 ) ; ( Position ( $Texto_HTML_02 ; ">" ; 0 ; $Contador + 1 ) + 1 ) - Position ( $Texto_HTML_02 ; "<" ; 0 ; $Contador + 1 ) )
] ;

If (
// only convert elements whose opening tag carries lang="greek"
PatternCount ( ETIQUETA ; "lang=" & Quote ( "greek" ) ) ≥ 1 ;

// replace the whole element with the same element, its Greek converted by the
// dynamically built Substitute (see $Substitute and $Datos above) and colored blue
Substitute ( $Texto_HTML_02 ;
ETIQUETA & TEXTOGRIEGO & ETIQUETAFINAL ;
ETIQUETA & TextColor ( Evaluate ( $Substitute & Quote ( TEXTOGRIEGO ) & $Datos ) ; RGB ( 0 ; 0 ; 255 ) ) & ETIQUETAFINAL ) ;

// otherwise leave the variable unchanged
$Texto_HTML_02
)
)

 =============================END OF CODE=============================
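
For completeness, this is roughly how $Substitute and $Datos are built once, before the first loop. The sketch below uses simplified placeholder names (a "Correspondences" table with "Betacode" and "Unicode" fields), not the real names in my file:

=============================SKETCH: BUILDING $Substitute AND $Datos=============================

# Build the nested Substitute expression once, before the main loop.
# "Correspondences", "Betacode" and "Unicode" are placeholder names.
Go to Layout [ "Correspondences" ]
Show All Records
Go to Record/Request/Page [ First ]
Set Variable [ $Substitute ; Value: "" ]
Set Variable [ $Datos ; Value: "" ]
Loop
  # one more level of nesting per correspondence record
  Set Variable [ $Substitute ; Value: $Substitute & "Substitute(" ]
  # one more pair "betacode" ; "unicode", plus its closing parenthesis
  Set Variable [ $Datos ; Value: $Datos & " ; " & Quote ( Correspondences::Betacode ) & " ; " & Quote ( Correspondences::Unicode ) & ")" ]
  Go to Record/Request/Page [ Next ; Exit after last ]
End Loop
Go to Layout [ original layout ]
# Later, inside the second loop, Evaluate ( $Substitute & Quote ( TEXTOGRIEGO ) & $Datos )
# expands to Substitute(Substitute( … "greek text" ; "b1" ; "u1") ; "b2" ; "u2") …

=============================END OF SKETCH=============================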

Posted

It's very difficult to read your code. Speaking in general, in order of decreasing importance:

 

1. Manipulating XML using Filemaker text functions is not a good idea. XSLT is a much better tool for this  - and it can be used even inside Filemaker, with the help of the (free) BaseElements plugin;

 

2. Evaluate() slows down things - I couldn't figure out why you need to use it;

 

3. Using multiple substitutions within a single Substitute() function is probably faster than nesting multiple Substitutes (haven't tested this).

 

 

Also it sounds like this is a one-time conversion. If so, why bother with optimizing?

 

 

---

P.S. Please update your profile to indicate your version and OS.

Posted

Thank you for your reply, Comment. I reply to your remarks below. I would appreciate it if you could point me to an example of well-commented and well-documented code. I've never given much importance to it, but since I would like to share my code with others, and I find it harder to understand my own code as it grows more complex, I think it's time to document it properly.

It's very difficult to read your code. Speaking in general, in order of decreasing importance:

 

1. Manipulating XML using Filemaker text functions is not a good idea. XSLT is a much better tool for this  - and it can be used even inside Filemaker, with the help of the (free) BaseElements plugin;

 

Answer: I don't know more than the basics of XSLT, and I don't know how to use it within Filemaker. Could you point me to some references?

 

2. Evaluate() slows down things - I couldn't figure out why you need to use it;

 

Answer: I'm thinking right now about how to avoid Evaluate.

 

3. Using multiple substitutions within a single Substitute() function is probably faster than nesting multiple Substitutes (haven't tested this).

 

Answer: I've always nested them. I didn't know it wasn't necessary to do so.

 

4. Also it sounds like this is a one-time conversion. If so, why bother with optimizing?

 

Answer: It is not. First, I'm not sure about the exact correspondence between Betacode and Unicode Greek: I've established just the basic ones. Secondly, turning one encoding into the other is just one of the many parsing passes I need to do. My purpose is to extract all the grammatical information contained in the dictionary and put it into a database format that is useful for my research (I'm a linguist, and databases are just a tool for me). I'm using this conversion to learn how to extract information from XML, and now I'm trying to optimize the process for the future extractions. For example, the dictionary records which cases each verb governs, and the like, and that information is encoded through XML elements and attributes; I must learn how to parse them.

 

So far I've reduced the time to 4 h 26 min. Right now I'm working on a modification that should reduce it to 1 h 28 min, if my calculations are right. As soon as I have verified this, I will post the code.

 

I would appreciate it if you (or any other reader of this topic) could point me to plug-ins or custom functions that could help me parse XML more efficiently.

 

Thanks again for your time.

 

A.

 

 


Posted
I would appreciate it if you (or any other reader of this topic) could point me to plug-ins or custom functions that could help me parse XML more efficiently.

 

That's kinda difficult, because you started in the middle. What would help here, I think, is seeing the original XML (or XMLs? - how did you manage to end up with 118,115 records, each containing an unparsed XML?), and understanding what your final goal is here.

 

 

It is my impression - mainly based on this part:

 

My purpose is to extract all grammatical information contained in the dictionary and to put it in a useful db format for my research purposes

 

 

that you should concentrate on getting the data from the XML source/s into Filemaker in a structured manner. Usually, in a dictionary context, that would mean getting a record for each lexeme (and probably multiple records in a related table for inflections). That is the primary task here, IMHO, and once you get that done, everything else will be much easier. The conversion from Betacode to Unicode, for example, seems rather trivial - but I would postpone it until the basic parsing is done.

Posted

That's kinda difficult, because you started in the middle. What would help here, I think, is seeing the original XML (or XMLs? - how did you manage to end up with 118,115 records, each containing an unparsed XML?), and understanding what your final goal is here.

 

 

It is my impression - mainly based on this part:

 

 

that you should concentrate on getting the data from the XML source/s into Filemaker in a structured manner. Usually, in a dictionary context, that would mean getting a record for each lexeme (and probably multiple records in a related table for inflections). That is the primary task here, IMHO, and once you get that done, everything else will be much easier. The conversion from Betacode to Unicode, for example, seems rather trivial - but I would postpone it until the basic parsing is done.

 

I think I've got the message. Now I'm going to create as many related records as there are elements in each XML record. Once I have each element in a different field in a different record, I'll need almost no parsing, and it will be much easier to extract the information from the dictionary. Roughly, I have in mind something like the sketch below. Is this what you meant by "getting the data from the XML source/s into Filemaker in a structured manner"? It will be a lot of records (some millions, I suppose), but later it will be easier to deal with the XML text.
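
This is only a rough outline of what I have in mind (the layout, table and field names - Dictionary, Elements, EntryID, XML, Tag - are placeholders, and in the real script I would also have to deal with closing tags and with the text between the tags):

Go to Record/Request/Page [ First ]
Loop
  Set Variable [ $xml ; Value: Dictionary::XML ]
  Set Variable [ $id ; Value: Dictionary::EntryID ]
  Set Variable [ $total ; Value: PatternCount ( $xml ; "<" ) ]
  Set Variable [ $i ; Value: 0 ]
  Loop
    Set Variable [ $i ; Value: $i + 1 ]
    Exit Loop If [ $i > $total ]
    # extract the $i-th tag, from its "<" to its ">"
    Set Variable [ $tag ; Value: Middle ( $xml ; Position ( $xml ; "<" ; 1 ; $i ) ; Position ( $xml ; ">" ; 1 ; $i ) + 1 - Position ( $xml ; "<" ; 1 ; $i ) ) ]
    # one related record per element
    Go to Layout [ "Elements" ]
    New Record/Request
    Set Field [ Elements::EntryID ; $id ]
    Set Field [ Elements::Tag ; $tag ]
    Go to Layout [ original layout ]
  End Loop
  Go to Record/Request/Page [ Next ; Exit after last ]
End Loop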

 

Posted

 

3. Using multiple substitutions within a single Substitute() function is probably faster than nesting multiple Substitutes (haven't tested this).

 

 

I've tried this just now, and contrary to your assertion and to Filemaker support (http://help.filemaker.com/app/answers/detail/a_id/726/~/nested-substitutes), nested Substitute (a) is faster than the non-nested form (b), at least within the Evaluate function:

 

a) Substitute(Substitute( text; "a"; "A"); "b"; "B" )

b) Substitute( text; ["a"; "A"]; ["b"; "B"] )

 

 

Thanks a lot for your help. Just exchanging these comments with you since yesterday has helped me change a lot of things and improve the performance considerably. Now I'm going to reconsider the whole database.

 

Here is an example of an entry from my XML:

<entryFree id="n57632" key="ki/xlh" type="main" opt="n"><orth extent="full" lang="greek" opt="n">ki/xlh</orth> [<pron extent="full" lang="greek" opt="n">i^</pron> by nature], <gen lang="greek" opt="n">h(</gen>, <sense id="n57632.0" n="A" level="1" opt="n"><tr opt="n">thrush</tr> (a generic term, including various <pb n="955" />  species, <bibl n="Perseus:abo:tlg,0086,014:617a:18" default="NO"><author>Arist.</author><title>HA</title><biblScope>617a18</biblScope></bibl>), <cit><quote lang="greek">k. tanusi/pteroi</quote> <bibl n="Perseus:abo:tlg,0012,002:22:468" default="NO" valid="yes"><author>Od.</author><biblScope>22.468</biblScope></bibl></cit>, cf. <bibl n="Perseus:abo:tlg,0019,006:591" default="NO" valid="yes"><author>Ar.</author><title>Av.</title><biblScope>591</biblScope></bibl>, etc.:—<gramGrp opt="n"><gram type="dialect" opt="n">Dor.</gram></gramGrp> <orth extent="full" lang="greek" opt="n">kixh/la</orth> <bibl n="Perseus:abo:tlg,0521,001:157" default="NO"><author>Epich.</author><biblScope>157</biblScope></bibl>, <bibl n="Perseus:abo:tlg,0019,003:339" default="NO" valid="yes"><author>Ar.</author><title>Nu.</title><biblScope>339</biblScope></bibl>:—late Gr. <orth extent="full" lang="greek" opt="n">ki/xla</orth> <bibl default="NO"><author>Alex.</author></bibl> Tralk.<bibl n="Perseus:abo:tlg,0402,001:1:10" default="NO"><biblScope>1.10</biblScope></bibl>, <bibl default="NO"><title>Gp.</title><biblScope>15.1.19</biblScope></bibl>. </sense><sense n="II" id="n57632.1" level="2" opt="n"> sea-fish, a species of <tr opt="n">wrasse</tr>, <bibl n="Perseus:abo:tlg,0521,001:60" default="NO"><author>Epich.</author> <biblScope>60</biblScope></bibl>, <bibl default="NO"><author>Antim.</author></bibl> ap. <bibl n="Perseus:abo:tlg,0008,001:7:304e" default="NO"><author>Ath.</author><biblScope>7.304e</biblScope></bibl> ('Antiphanes' codd.), <bibl n="Perseus:abo:tlg,0664,001:135" default="NO"><author>Diocl.Fr.</author><biblScope>135</biblScope></bibl>, <bibl n="Perseus:abo:tlg,0086,014:598a:11" default="NO"><author>Arist.</author> <title>HA</title><biblScope>598a11</biblScope></bibl>, <bibl default="NO"><author>Nic.</author><title>Fr.</title><biblScope>59</biblScope></bibl>, <bibl default="NO"><author>Numen.</author></bibl> ap. <bibl n="Perseus:abo:tlg,0008,001:7:305c" default="NO"><author>Ath.</author><biblScope>7.305c</biblScope></bibl>, <bibl n="Perseus:abo:tlg,0023,001:1:126" default="NO"><author>Opp.</author><title>H.</title><biblScope>1.126</biblScope></bibl>, <bibl n="Perseus:abo:tlg,0023,001:4:173" default="NO"><biblScope>4.173</biblScope></bibl>: later <orth extent="full" lang="greek" opt="n">ki/xla</orth>, <bibl n="Perseus:abo:tlg,0744,001:1:15" default="NO"><author>Alex.Trall.</author><biblScope>1.15</biblScope></bibl>.</sense></entryFree>

And here is the same fragment after the conversion of Betacode into Unicode Greek:

<entryFree id="n57632" key="ki/xlh" type="main" opt="n"><orth extent="full" lang="greek" opt="n">κίχλη</orth> [<pron extent="full" lang="greek" opt="n">ῐ</pron> by nature], <gen lang="greek" opt="n">ἡ</gen>, </br><sense id="n57632.0" n="A" level="1" opt="n"><trad opt="n">thrush</trad> (a generic term, including various <pb n="955" />  species, <bibl n="Perseus:abo:tlg,0086,014:617a:18" default="NO"><author>Arist.</author><work>HA</work><biblScope>617a18</biblScope></bibl>), <cit><quote lang="greek">κ. τανυσίπτεροι</quote> <bibl n="Perseus:abo:tlg,0012,002:22:468" default="NO" valid="yes"><author>Od.</author><biblScope>22.468</biblScope></bibl></cit>, cf. <bibl n="Perseus:abo:tlg,0019,006:591" default="NO" valid="yes"><author>Ar.</author><work>Av.</work><biblScope>591</biblScope></bibl>, etc.:—<gramGrp opt="n"><gram type="dialect" opt="n">Dor.</gram></gramGrp> <orth extent="full" lang="greek" opt="n">κιχήλα</orth> <bibl n="Perseus:abo:tlg,0521,001:157" default="NO"><author>Epich.</author><biblScope>157</biblScope></bibl>, <bibl n="Perseus:abo:tlg,0019,003:339" default="NO" valid="yes"><author>Ar.</author><work>Nu.</work><biblScope>339</biblScope></bibl>:—late Gr. <orth extent="full" lang="greek" opt="n">κίχλα</orth> <bibl default="NO"><author>Alex.</author></bibl> Tralk.<bibl n="Perseus:abo:tlg,0402,001:1:10" default="NO"><biblScope>1.10</biblScope></bibl>, <bibl default="NO"><work>Gp.</work><biblScope>15.1.19</biblScope></bibl>. </sense></br><sense n="II" id="n57632.1" level="2" opt="n"> sea-fish, a species of <trad opt="n">wrasse</trad>, <bibl n="Perseus:abo:tlg,0521,001:60" default="NO"><author>Epich.</author> <biblScope>60</biblScope></bibl>, <bibl default="NO"><author>Antim.</author></bibl> ap. <bibl n="Perseus:abo:tlg,0008,001:7:304e" default="NO"><author>Ath.</author><biblScope>7.304e</biblScope></bibl> ('Antiphanes' codd.), <bibl n="Perseus:abo:tlg,0664,001:135" default="NO"><author>Diocl.Fr.</author><biblScope>135</biblScope></bibl>, <bibl n="Perseus:abo:tlg,0086,014:598a:11" default="NO"><author>Arist.</author> <work>HA</work><biblScope>598a11</biblScope></bibl>, <bibl default="NO"><author>Nic.</author><work>Fr.</work><biblScope>59</biblScope></bibl>, <bibl default="NO"><author>Numen.</author></bibl> ap. <bibl n="Perseus:abo:tlg,0008,001:7:305c" default="NO"><author>Ath.</author><biblScope>7.305c</biblScope></bibl>, <bibl n="Perseus:abo:tlg,0023,001:1:126" default="NO"><author>Opp.</author><work>H.</work><biblScope>1.126</biblScope></bibl>, <bibl n="Perseus:abo:tlg,0023,001:4:173" default="NO"><biblScope>4.173</biblScope></bibl>: later <orth extent="full" lang="greek" opt="n">κίχλα</orth>, <bibl n="Perseus:abo:tlg,0744,001:1:15" default="NO"><author>Alex.Trall.</author><biblScope>1.15</biblScope></bibl>.</sense></entryFree>

Posted

Is this what you meant by "getting the data from the XML source/s into Filemaker in a structured manner"? It will be a lot of records (some millions, I suppose), but later it will be easier to deal with the XML text.

 

I meant that your data should be stored in fields, records and tables, in the form of text, numbers, dates or timestamps - not as XML. For example, you would probably have a record in a Bibl table with the following fields:

EntryID:    n57632

Author:    Arist.

Title:    HA

Scope:    617a18

 

--

I say probably because I don't have a "map" of the XML snippet you posted.
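
Just to give you the flavor, here is a rough, untested sketch of a stylesheet that pulls every bibl element into such a Bibl table, using Filemaker's FMPXMLRESULT import grammar. The element names come from your snippet and the field names from the example above; I'm writing the FMPXMLRESULT skeleton from memory, so check its details (PRODUCT, DATABASE attributes, etc.) against a sample XML export from Filemaker:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.filemaker.com/fmpxmlresult">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <xsl:template match="/">
    <FMPXMLRESULT>
      <ERRORCODE>0</ERRORCODE>
      <PRODUCT BUILD="" NAME="" VERSION=""/>
      <DATABASE DATEFORMAT="M/d/yyyy" TIMEFORMAT="h:mm:ss a" LAYOUT="" NAME="" RECORDS="{count(//bibl)}"/>
      <METADATA>
        <FIELD NAME="EntryID" TYPE="TEXT" EMPTYOK="YES" MAXREPEAT="1"/>
        <FIELD NAME="Author" TYPE="TEXT" EMPTYOK="YES" MAXREPEAT="1"/>
        <FIELD NAME="Title" TYPE="TEXT" EMPTYOK="YES" MAXREPEAT="1"/>
        <FIELD NAME="Scope" TYPE="TEXT" EMPTYOK="YES" MAXREPEAT="1"/>
      </METADATA>
      <RESULTSET FOUND="{count(//bibl)}">
        <!-- one ROW per bibl element; EntryID taken from the enclosing entryFree -->
        <xsl:for-each select="//bibl">
          <ROW MODID="0" RECORDID="{position()}">
            <COL><DATA><xsl:value-of select="ancestor::entryFree/@id"/></DATA></COL>
            <COL><DATA><xsl:value-of select="author"/></DATA></COL>
            <COL><DATA><xsl:value-of select="title"/></DATA></COL>
            <COL><DATA><xsl:value-of select="biblScope"/></DATA></COL>
          </ROW>
        </xsl:for-each>
      </RESULTSET>
    </FMPXMLRESULT>
  </xsl:template>
</xsl:stylesheet>

You would then use Import Records > XML Data Source, choose your XML file as the data source and this file as the XSL stylesheet, and map the four columns onto the fields of the Bibl table. The same pattern, with more templates, extends to orth, gen, sense and the rest - but again, map the whole document structure first.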

 

Posted

I'm doing EXACTLY what you've just mentioned. It's a lot of records, but it's worth it: a) operations are much faster; b) everything is much clearer and easier to understand. I really thank you for your help. In case you ever need someone who knows Spanish (my mother tongue) and Greek (Ancient and Modern, my area of expertise), just let me know.

Posted
It's a lot of records, but it's worth it:

 

Yes, it's bound to be. I am not sure what your starting point is - let me remind you that Filemaker can import XML data directly, given a suitable XSLT stylesheet (along the lines of the sketch I posted above). You may not be familiar enough with XSLT, but (a) learning the necessary basics can be less work than trying to parse XML within Filemaker, and (b) help is available, both here and in more specialized venues.

 
