
large transaction solution




We have a database solution that is basically an accounting system. A single transaction table stores all the transactions, and it now contains millions of records.

We have been dealing with speed issues for a few years: because the reports run over such a large set of records, they take hours to complete. The transaction table contains no unstored fields, but the sheer size of the table alone makes reporting a bear.

We have tried storing the calculated values overnight so that users don't have to wait for the reports to run, but sadly the stored values occasionally become inaccurate and cause obvious issues. The transaction table is always accurate; the stored values often are not. After years of dealing with this and adjusting code to try to eliminate the problem, we are thinking the best solution is to find a system that can handle the immense table and the reporting: use FileMaker as a front end for the users, but some other system (SQL or Python or ???) for the heavy calculations / reporting.

The problem is not going away and in fact is getting worse as more and more records get added on a daily basis.

I feel like we can keep bandaiding the solution, but the real long-term solution is to realize that maybe we have outgrown FileMaker's capabilities and move the data storage and calculations to a different platform. 

I'm hoping others can shed some light on similar situations they have run into and on what solutions I should consider to solve this frustrating issue.

Thank you in advance!


I am not aware of any system that can summarize millions of records on demand. The general idea is to denormalize some aspects of the solution. For example, you could pre-summarize the transactions from previous years, and supplement this by a "live" summary of this year's records. Hopefully, the reason why "sadly the stored values occasionally become inaccurate" is not modification of history records - otherwise you will need more help than any human can provide.
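To make that pattern concrete, here is a minimal sketch in plain Python (field names and figures are made up for illustration; this is not FileMaker code): closed years exist only as stored totals, and only the current year's much smaller record set is summed on demand.

def report_total(product_id, stored_yearly_totals, current_year_txns):
    # stored_yearly_totals: {(product_id, year): total}, written once when a year is closed
    history = sum(total for (pid, _year), total in stored_yearly_totals.items()
                  if pid == product_id)
    # only the current year's records are summarized live
    live = sum(t["amount"] for t in current_year_txns
               if t["product_id"] == product_id)
    return history + live

stored = {("P-100", 2014): 91250.00, ("P-100", 2015): 87300.50}
current = [{"product_id": "P-100", "amount": 1200.00},
           {"product_id": "P-100", "amount": -150.00}]
print(report_total("P-100", stored, current))  # 179600.5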

 

 

Edited by comment

The point is that history data should be frozen (posted) and no further modification be allowed.  Once that takes place, and is enforced via Security settings, its aggregate totals can be written to records in a Statistics table.  As Comment indicates, you then just combine the history total with the 'unposted' current totals for faster totals.
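A rough sketch of that posting step, again in plain Python with hypothetical names (in the real solution the freeze would be enforced through FileMaker security settings, not code like this): a closed period's total is rolled into a Statistics record exactly once, and the detail records are flagged so live summaries can skip them.

def post_period(txns, statistics, period):
    # txns: list of dicts with 'period', 'amount', 'posted'; statistics: {period: frozen total}
    to_post = [t for t in txns if t["period"] == period and not t["posted"]]
    statistics[period] = statistics.get(period, 0) + sum(t["amount"] for t in to_post)
    for t in to_post:
        t["posted"] = True  # the "no further edits" rule itself lives in security settings
    return statistics

def current_total(txns, statistics):
    # frozen history plus a live summary of whatever is still unposted
    return sum(statistics.values()) + sum(t["amount"] for t in txns if not t["posted"])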


We have implemented this, but it still isn't enough. We have a historical table that holds data older than two years, but the last two years of data are still crazy. Basically, the historical table contains an archive of all the historical records, and in their place the archiving process creates one record in the transaction table per month of similar transactions. So maybe 100,000 records get deleted from the transaction table and replaced by one record totaling their values.

This 'archiving' helps, but isn't enough. The problem is that each month we import about half a million transactions (records in the transaction table), so each monthly report we run takes about 30 minutes to create (about 24 hours for the entire batch). The reports are fairly complex, showing multiple products over multiple months, but I have to think that a more enterprise level system would be able to summarize this data within minutes. I'm hugely loyal to FMP, but I am thinking it's time to entertain a more robust system just for reporting???


You're storing transactions; that's good. But if your accounting system doesn't store the account balances, you're doing it wrong.

Also: are you using PSoS (Perform Script on Server) to run your report scripts?


When we "archive" a month, it sums the balance into a number field, so it is stored at that point. BUT until we archive the transactions, we use unstored calculations to summarize. So if the report is showing the total revenue for July for one product, we have an unstored calculation that finds all the transactions for that one product for that one month and displays the total (via a relationship). Super slow to generate the reports, but I don't see any other way while these transactions are potentially still in flux. After roughly a year, they get archived with stored values.

We are NOT using PSOS to generate these reports since they are run by employees on the fly. This is something we could entertain changing to speed things up, but in all honesty PSOS hasn't proven too impressive to me. I have many processes running with PSOS, but the speed increase still hasn't wowed me in most cases (maybe a 20% speed boost if I were to guess). AND I've definitely had situations where PSOS has bogged down the entire server with too many processes running at once ~ so that worries me in this case. I would entertain moving things to the server if I felt the speed boost would be sizable. I'm really looking to take these reports from 30 minutes down to just a few minutes. Storing the transactions DID this (de-normalizing), but it created inaccuracies that could potentially be debugged into perfection; I have been trying for years to get it right and am still dealing with issues...


I am shocked that you say that about PSOS. It has been my experience that it decreases the runtime substantially. A report which normally would take 20 minutes takes 1 minute with PSOS! All the processes I've switched to PSOS for various clients show the same speed savings. You say you can't use it because your reports are run by employees on the fly, but that makes no difference; they can still be run via PSOS. You aren't confusing server-side scheduled scripts with PSOS (Perform Script on Server), are you? BIG difference. PSOS can be run by staff on the fly, and they won't even know the server is handling it.

I would also suggest that you have your file reviewed for optimization. Even good developers can use other developers' eyes on their solution; you would be surprised what can be found. The first thing I would review are the extensive calculations you mention: consider writing totals statically using script triggers (with tight controls) during data entry to eliminate some of those calculations. And I would question why you are using a relationship to display totals instead of generating the report from the child table directly.

In all, there are many tricks for handling data in a more optimized manner. But yes, there is also a point where FileMaker might not be the best solution; exhausting all options first is recommended, though, since FM is still the best bang-for-your-buck software.


I appreciate your thoughts, thank you.

Maybe giving an example of one report would be helpful to see how PSOS could help. A typical report shows one year of all transactions for a product. It is broken down with columns that display each month: column A is January, column B is February, etc. In each column there are rows for the various revenue and expense items related to that product, so you might have a line "revenue from xx", another "marketing expense", and another "royalty paid". Then at the bottom of each column we have totals that add up the pluses and minuses above. It's a bit more complicated than this, but more or less this is a good description.

The fields are calculation fields that are located in the product table. These fields summarize (Sum is used primarily) the totals for each revenue / expense type by month, using a relationship from Product->Transaction with match fields like month, year, expense type, revenue type, etc. (all number fields). Bottom line: when you open the layout, it immediately starts calculating the fields on the layout and forces you to wait while they run. There is very little scripting going on; the user selects some global values (year, product, and one or more revenue types), then the unstored calculations run while the user waits.

One idea, I suppose, is to use PSOS to "load" the layout on the server, then push the unstored-but-calculated values into a new temporary reporting table that the user then sees. So the user never sees the unstored calculations; instead, the server resolves the unstored values and imports them into a temp table that the user views. Or maybe loading the whole layout is arcane and there is a better way to load the data.
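Something along those lines could look like the sketch below, shown in Python/SQLite purely as a stand-in (table and column names are invented, and a transactions table with product_id, year, month, revenue_type, and amount columns is assumed): the server does one aggregation pass and materializes the answers into a small cache table the client only has to read.

import sqlite3

def build_report_cache(conn, product_id, year):
    # server-side job: one GROUP BY pass, results stored where the layout can read them
    conn.execute("""CREATE TABLE IF NOT EXISTS report_cache
                    (product_id TEXT, year INTEGER, month INTEGER,
                     revenue_type TEXT, total REAL)""")
    conn.execute("DELETE FROM report_cache WHERE product_id = ? AND year = ?",
                 (product_id, year))
    conn.execute("""INSERT INTO report_cache (product_id, year, month, revenue_type, total)
                    SELECT product_id, year, month, revenue_type, SUM(amount)
                    FROM transactions
                    WHERE product_id = ? AND year = ?
                    GROUP BY product_id, year, month, revenue_type""", (product_id, year))
    conn.commit()
    # the user's layout then displays stored numbers only -- no unstored calcs to wait on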

One problem related to this report is that there are tens of thousands of products, and each product has about 20 different revenue types which the user selects when running the report. So it would be next to impossible to run the reports after hours so they are completed before the user walks in each day. I realize you aren't suggesting this, but I certainly thought it would be the answer early on; it won't work here.


3 hours ago, happymac said:

but I have to think that a more enterprise level system would be able to summarize this data within minutes.

Nope. Well, yes, but not using techniques any different from what we could do in FileMaker for the same effect. Any time you see a report showing summarized values over large quantities of records showing up super fast, it's because the expensive calculations were done before any user asked for the report (even for reports where the user thinks they're doing something ad hoc). There are two ways to do this: batching and streaming. The idea to run a report after hours that you've already mentioned is the batching option. I think you should consider streaming. Rather than accumulating your summary values in reports after the fact, each transaction updates the appropriate summaries at the time of the transaction continuously over the course of the period. The processing time isn't hours and hours at a time when you're building reports; it's milliseconds at a time as each transaction is processed.

As you mentioned, this can be a problem if your updates to the summaries get out of sync with your transaction data. 2 things about that: (1) That's a bug that you need to track down and fix, and (2) you can re-normalize the data periodically while you're doing your archiving (which you mentioned you're already doing) until you do track down that bug.
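A bare-bones sketch of that streaming idea (plain Python, data structures invented for illustration, not FileMaker code): the running summary is adjusted as each transaction is recorded, and a periodic rebuild from the transactions themselves serves as the safety net described above.

def record_transaction(transactions, summaries, txn):
    transactions.append(txn)
    key = (txn["product_id"], txn["year"], txn["month"], txn["stream"])
    summaries[key] = summaries.get(key, 0) + txn["amount"]  # milliseconds, not hours

def rebuild_summaries(transactions):
    # periodic re-normalization: recompute everything from the source of truth
    # and compare against the maintained summaries to flush out the sync bug
    rebuilt = {}
    for t in transactions:
        key = (t["product_id"], t["year"], t["month"], t["stream"])
        rebuilt[key] = rebuilt.get(key, 0) + t["amount"]
    return rebuilt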

Edited by jbante

46 minutes ago, happymac said:

The fields are calculation fields that are located in the product table

I hesitate whether to go into this. This thread seems to revolve around three different topics simultaneously:

  1. an alternative to Filemaker;
  2. how to denormalize a part of your data (and which part);
  3. how to summarize the normalized part,

and is in serious danger of eventually missing out on all three.

So let me just say briefly that the method you have described is patently unsuitable for summarizing large amounts of records (topic #3). You should be using summary fields (basically, a single summary field if all you're interested in is the revenue amounts), and a technique known as FastSummaries if you must display the results in columns. Alternatively, the ExecuteSQL() function might be worth a try (in a script, not in a calculation field).

Note that the more you denormalize your solution (topic #2), the less critical the technique of summarizing the normalized part (topic #3) becomes.
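For what it's worth, the shape of the single-pass aggregate that a scripted ExecuteSQL() call would run looks roughly like this (shown in Python/sqlite3 only as an illustration, with invented table and field names): one GROUP BY query returns every cell of the month-by-type grid, instead of one unstored calculation per cell.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions
                (product_id TEXT, year INT, month INT, revenue_type TEXT, amount REAL)""")
conn.executemany("INSERT INTO transactions VALUES (?,?,?,?,?)", [
    ("P-100", 2016, 7, "royalty", 1200.00),
    ("P-100", 2016, 7, "marketing", -150.00),
    ("P-100", 2016, 8, "royalty", 980.00),
])
rows = conn.execute("""SELECT month, revenue_type, SUM(amount)
                       FROM transactions
                       WHERE product_id = ? AND year = ?
                       GROUP BY month, revenue_type
                       ORDER BY month, revenue_type""", ("P-100", 2016)).fetchall()
print(rows)  # [(7, 'marketing', -150.0), (7, 'royalty', 1200.0), (8, 'royalty', 980.0)]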

 

 

 

Edited by comment


Again, thank you to ALL for weighing in! I REALLY appreciate the suggestions.

I think in a perfect world we would achieve a denormalized state that would be sufficient, and not shop around for non fmp solutions. With that said, I have really struggled with this particular situation where the records get out of sync, and I think it has to do with the quantity of records. Denormalizing is easy for smaller sets, but when dealing with millions of records, inevitably the "sync" doesn't complete or the user ends up having to wait for it to complete.

Obviously we should be able to solve this, but it seems like one of two things happens with lots of records: 1) the records get out of sync, which is an error and can be solved, or 2) the user needs to wait while the denormalization occurs. A silly example to illustrate number 2 ... user imports transactions (roughly 30,000 records) through an import script that does lots of things. They then immediately want to run a report for that product. The user has to wait for this denormalization to occur. The way we solve this currently is that when the import script completes, it triggers a server-side script to start updating the records. Great in theory, but this can take a lot of time to complete, and in the meantime, if the user tries to pull a report, they get incorrect data.

It sounds like my techniques are correct, I just need to find ways to denormalize quicker.

jbante also suggested denormalizing at the time of import or input, which I agree is ideal, but since in this case we have so many possible reporting outcomes, I'm not sure we can achieve a perfect solution here. I think we could severely cut down on the number of records by having a summary for each product for each month for each revenue / expense stream, and then use summary fields (unstored calcs) to summarize that data. Since this would probably reduce the number of transactions that need to be summarized by 90% - 95%, it would help drastically, but it wouldn't perfectly solve the issue. AND yes, we "re-normalize" on the first of every month to make certain things are accurate, but it takes days (literally) to force an update of all the records.

I don't know ... maybe I'm going down the right path and just need to perfect the code so it is error-proof, but it certainly is frustrating, and getting calls from a panicked customer because the report says one thing and the database says something different is definitely no bueno.


18 minutes ago, happymac said:

I think in a perfect world we would achieve a denormalized state that would be sufficient, and not shop around for non fmp solutions

In a perfect world, the hardware would be fast enough to handle a fully normalized solution, and no denormalization would be necessary...

 

20 minutes ago, happymac said:

It sounds like my techniques are correct

I don't think so - as I tried to tell you earlier.

 

22 minutes ago, happymac said:

A silly example to illustrate number 2 ... user imports transactions (roughly 30,000 records) through an import script that does lots of things.  They then immediately want to run a report for that product. 

No, that actually illustrates #3. 30k records is not that many by today's standards, and I would expect a summary report to be produced in a reasonable time.


Comment, I may not have been clear in my first posts: I AM using summary fields, but a typical report has roughly 10-15 summary fields, each having to look at potentially millions of records. I have not tried FastSummaries, and that might be a good solution to the performance problem; of course ExecuteSQL might be effective here as well. I guess before I go down those rabbit holes, the question is: are you suggesting that it is quite reasonable with millions of records that we could keep the data normalized and use better summary techniques to get the reports to load reasonably quickly, OR is normalization never going to work sufficiently when we are talking about 5-10 million records?


9 minutes ago, happymac said:

are you suggesting that it is quite reasonable with millions of records that we could keep the data normalized and use better summary techniques to get the reports to load reasonably quickly, OR is normalization never going to work sufficiently when we are talking about 5-10 million records?

It's the latter. You should denormalize as much of your data as you possibly can. LaRetta gave you the criteria: determine a point in time (or a point in the process workflow) beyond which records cannot be legitimately modified.


2 hours ago, happymac said:

jbante also suggested denormalizing at the time of import or input, which I agree is ideal, but since in this case we have so many possible reporting outcomes, I'm not sure we can achieve a perfect solution here. I think we could severely cut down on the number of records by having a summary for each product for each month for each revenue / expense stream, and then use summary fields (unstored calcs) to summarize that data. Since this would probably reduce the number of transactions that need to be summarized by 90% - 95%, it would help drastically, but it wouldn't perfectly solve the issue. AND yes, we "re-normalize" on the first of every month to make certain things are accurate, but it takes days (literally) to force an update of all the records.

I suppose you could say I suggested denormalizing, but not quite in the same sense as how I usually hear that word used. Those stored "summary" records? (Summary of the data, not a FileMaker summary or calculation field.) That's exactly what I'm talking about. (Except it may make sense to store the summaries of those, too.) I suggest you do some research into how data warehouses are structured, star schemas, etc.
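As a loose illustration of that warehouse-style shape (SQLite via Python; the schema and names are made up for the example, not a prescription): a fact table of raw transactions plus pre-aggregated summary tables keyed by the reporting dimensions, including a summary of the summaries.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transaction_fact (
    product_id TEXT, txn_date TEXT, stream TEXT, amount REAL);

-- one row per product x month x revenue/expense stream
CREATE TABLE transaction_summary (
    product_id TEXT, year INT, month INT, stream TEXT, total REAL,
    PRIMARY KEY (product_id, year, month, stream));

-- and, as suggested, possibly a summary of the summaries (per product x year)
CREATE TABLE transaction_summary_yearly (
    product_id TEXT, year INT, stream TEXT, total REAL,
    PRIMARY KEY (product_id, year, stream));
""")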


15 minutes ago, jbante said:

I suppose you could say I suggested denormalizing, but not quite in the same sense as how I usually hear that word used. Those stored "summary" records? (Summary of the data, not a FileMaker summary or calculation field.) That's exactly what I'm talking about. (Except it may make sense to store the summaries of those, too.) I suggest you do some research into how data warehouses are structured, star schemas, etc.

Yes, I was thinking the same thing: that we put data in a number field that is essentially a summary of all records in the transaction table for a specific product, specific revenue/expense stream and specific month. So ultimately, if a product has 1,000 transactions for July for royalties paid ... the transaction table would retain the 1,000 transactions, but we would also have a separate table called "TRANSACTIONSSUMMARIZED" that would contain 1 record with one number field entry summarizing the 1,000 transactions. So if the 1,000 transactions total 3,085.13 ... this one record would have a number field with 3085.13 in it. This way the reports would in a sense have one record to evaluate per month instead of 1,000.

Is this more or less what you were implying? 


18 hours ago, happymac said:

I've definitely had situations where PSOS has bogged down the entire server with too many processes running at once ~ so that worries me in this case.  I would entertain moving things to the server if I felt the speed boost would be sizable. 

The speed boost CAN be sizable, but only if you design your solution AND deployment to make it so. You can't expect to just throw PSoS sessions at the database server and hope that it will be faster. Sounds like your server is not up to the task, which is not a failure of the product but one of design/architecture.

The key difference here is that in the blink of an eye you are expecting your database server to become an application server....


11 hours ago, happymac said:

the transaction table would retain the 1,000 transactions, but we would also have a separate table called "TRANSACTIONSSUMMARIZED" that would contain 1 record with one number field entry summarizing the 1,000 transactions. ...

Is this more or less what you were implying? 

I don't think anyone who contributed to this thread meant anything else.

 

2 minutes ago, Wim Decorte said:

Sounds like your server is not up to the task

And now we have topic #4 ...


15 hours ago, happymac said:

the transaction table would retain the 1,000 transactions, but we would also have a separate table called "TRANSACTIONSSUMMARIZED" that would contain 1 record with one number field entry summarizing the 1,000 transactions.  So if the 1,000 transactions total 3,085.13 ... this one record would have a number field with 3085.13 in it.  This way the reports would in a sense have one record to evaluate per month instead of 1,000.

Is this more or less what you were implying? 

That's a start. Yes to having a "TRANSACTIONSSUMMARIZED" table. No to having one record per month (or other period) to "evaluate". In my suggestion, there should be nothing to evaluate (on the client or server) at the time of the report except a find for the target TransactionSummary records. No calculation fields to evaluate; no summary fields (the FileMaker field type) to evaluate. (Or at least only over some negligibly small found set.) I'm suggesting you achieve this by updating the applicable TransactionSummary records at the time of each transaction: TransactionSummary::total = TransactionSummary::total + Transaction::amount, not TransactionSummary::total = Sum ( Transaction::amount ). If the TransactionSummary results are getting out of sync with the actual transactions, start by looking into transactional scripting (database transaction, not business transaction).
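A sketch of that update pattern, using Python/SQLite only as a stand-in for the idea (the transactions and transaction_summary tables and their fields are hypothetical; in FileMaker this would be a transactional script rather than SQL): both writes happen inside one database transaction, and a delete subtracts exactly the amount it once added, so nothing else needs re-totaling.

import sqlite3

def add_transaction(conn, product_id, year, month, stream, amount):
    # assumes transactions(product_id, year, month, stream, amount) and a
    # transaction_summary table with a unique key on (product_id, year, month, stream)
    with conn:  # one atomic database transaction covering both writes
        conn.execute("INSERT INTO transactions VALUES (?,?,?,?,?)",
                     (product_id, year, month, stream, amount))
        conn.execute("""INSERT INTO transaction_summary (product_id, year, month, stream, total)
                        VALUES (?,?,?,?,?)
                        ON CONFLICT (product_id, year, month, stream)
                        DO UPDATE SET total = total + excluded.total""",
                     (product_id, year, month, stream, amount))

def delete_transaction(conn, txn_rowid):
    with conn:
        pid, yr, mo, stream, amt = conn.execute(
            """SELECT product_id, year, month, stream, amount
               FROM transactions WHERE rowid = ?""", (txn_rowid,)).fetchone()
        conn.execute("DELETE FROM transactions WHERE rowid = ?", (txn_rowid,))
        conn.execute("""UPDATE transaction_summary SET total = total - ?
                        WHERE product_id = ? AND year = ? AND month = ? AND stream = ?""",
                     (amt, pid, yr, mo, stream))

The same delta approach covers edits: adjust the summary by (new amount - old amount) in the same transaction as the edit.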


  • 1 month later...

 

On 8/18/2016 at 9:21 AM, jbante said:

That's a start. Yes to having a "TRANSACTIONSSUMMARIZED" table. No to having one record per month (or other period) to "evaluate". In my suggestion, there should be nothing to evaluate (on the client or server) at the time of the report except a find for the target TransactionSummary records. No calculation fields to evaluate; no summary fields (the FileMaker field type) to evaluate. (Or at least only over some negligibly small found set.) I'm suggesting you achieve this by updating the applicable TransactionSummary records at the time of each transaction: TransactionSummary::total = TransactionSummary::total + Transaction::amount, not TransactionSummary::total = Sum ( Transaction::amount ). If the TransactionSummary results are getting out of sync with the actual transactions, start by looking into transactional scripting (database transaction, not business transaction).

I've been pondering this and feel it is the best way to proceed, but I'm still struggling with how to deal with updates or deletes. Let me pose a scenario: imagine that for August of this year we have 100 transactions for the month, and a user deletes two of the transactions in the middle of the month, so we now have 98 transactions. Would all 98 of these need to be 're-totaled', or only the ones AFTER the removed transactions (roughly transactions 51-98)? OR do we leave transactions 51-97 out of sync and only update the last entry in the month with the summary totals? Also, I assume this would mean that all transactions in September, October, and beyond would need re-totaling as well, since we also want to display the totals for all time on the report.

Probably a better solution would be to use a related 'summary' table and have one record in the 'summary' table for each month, so that any change in ANY of the transactions in the month would result in this one 'summary' field getting re-calc'd?

Your insight would be really helpful!

 


3 minutes ago, normanicus said:

Consultant wrote "I am not aware of any system that can summarize millions of records on demand". 

Google?

Surely that was a dig, was it not?  Listen carefully, normanicus: Comment does not deserve a dig of any kind. I will skip his credentials and expertise, which are obvious to anyone with half a mind, but he devotes a LOT of his valuable time for free to help folks here.

Tell me I'm wrong in my perception of your response ... tell me quickly please lest I come back for a second discussion.  


19 minutes ago, LaRetta said:

Surely that was a dig, was it not?

I understood this as saying that Google is capable of summarizing millions of records on demand.

I am not sure that's an accurate description of what they do, but since I don't know exactly how they do it, I am willing to accept it as a possibility. The question then becomes whether in the given context it's fair to exclude systems that require the output of an entire nuclear power plant to run. The context being:

Quote

I feel like we can keep bandaiding the solution, but the real long-term solution is to realize that maybe we have outgrown FileMaker's capabilities and move the data storage and calculations to a different platform. 

 

Edited by comment

That wasn't meant as a dig at all. So, Consultant, if I've written something that could be taken badly, it was not intended. Ever since trying to wade through Marx, I have thought that one must be responsible for the consequences of what one writes, so I apologise.

No, being on a Filemaker forum and involved in relational thinking we do tend to try to crack the nut with the hammer to hand. My reply was to suggest that there are solutions to searching massive data sets.

I might be a bit odd, but I almost never use Filemaker summary stuff. Without claiming to be an expert, the answer to speed and large data sets seems to be:

data-only fields (no calculation fields)

Find

Script to produce the report.

I emphasise, I don't claim to be an authority.

Norman


8 hours ago, normanicus said:

No, being on a Filemaker forum and involved in relational thinking we do tend to try to crack the nut with the hammer to hand. My reply was to suggest that there are solutions to searching massive data sets.

In general, there's more to how Google does what it does than just having smart people and good software. Google doesn't work "smarter, not harder"; Google works "smarter AND harder". In particular, you're right that solutions to searching massive data sets quickly (i.e. in sub-linear time with respect to the size of the data set) exist. Indexing, for example. FileMaker has indexes. (There are other types of indexes that might be nice to add, but not ones that would help the subject of this thread.)

However, there are no solutions for quickly making massive data sets searchable, at least in terms of doing it with few CPU cycles. Building indexes is a super-linear affair: the time it takes to build an index grows faster than the amount of data to index. This is due to computing constraints more fundamental than gravity. Google's solution is to throw billions of dollars of hardware at the problem (with some non-trivial intelligence going into how to parallelize the indexing), but the total compute time is still super-linear. Search is fast when you ask for it because some computer has spent a lot of time preparing for the search. "Big iron" has some different techniques at its disposal, but the governing principles of computing are no different.

There are problems big enough that using FileMaker for them is impractical (by which I mean either impossible or requiring substantially more resources than a solution built with competing tools), but those problems are rare, and that threshold is very high. Very few people are trying to index the entire internet, and those who are will have to spend monumental resources to do it, no matter how smart they are.
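A toy illustration of that "pay before the query" point (plain Python, made-up data): building the index touches every record up front; each later lookup is then nearly free because the work was already done.

records = [(i, f"term{i % 1000}") for i in range(100_000)]   # toy data set

index = {}                       # term -> list of record ids
for rec_id, term in records:     # paid once, ahead of any query
    index.setdefault(term, []).append(rec_id)

print(len(index.get("term42", [])))   # 100 -- answered without rescanning the records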

Edited by jbante
