Jump to content
Server Maintenance This Week. ×

FMS14v3 Failed Consistency Check - False Negative


xochi

This topic is 3087 days old. Please don't post here. Open a new topic instead.

Recommended Posts

I've updated 3 servers to FM14v3, and two of them are fine, but the one that sees the highest use (and has the biggest database) is now throwing "failed consistency check" errors which prove to be False Negatives (e.g. the files are actually fine).   FileMaker has a tech note that this is a "known issue" http://help.filemaker.com/app/answers/detail/a_id/10787/~/consistency-check-error-for-scheduled-backups  but I'm wondering : anyone else seen this starting with their FMS14v3 upgrade?

 

Any way to fix it?   I hate ignoring these warning emails.

Link to comment
Share on other sites

Hahahaha.   Oh dear.

Note to self : never EVER post on the internet about how something is giving you a "bogus" error message but really, everything is really OK and fine and dandy.

Not an hour after I posted this, my server died in a flaming heap of rubble - filemaker reporting that the file was corrupt, force-disconnecting all clients, and the server itself locked up so bad that it had to be force-powered-off.

 

Running "Check consistency" on the file now gives a single block (out of almost 900,000) which is corrupt - so something is actually going wrong clearly.  Will try rebuilding and/or going back to prior backups as needed.

 

 

Link to comment
Share on other sites

I feel your pain.

But equally sad is that the link you presented was created March, 2012 and last updated December, 2014 so it still hasn't been fixed?   If it has been fixed, that post should indicate it has been fixed.  If not, it is shameful that it has been 2 years 8 months still broken.

Edited by LaRetta
Link to comment
Share on other sites

9 hours ago, xochi said:

 

Running "Check consistency" on the file now gives a single block (out of almost 900,000) which is corrupt - so something is actually going wrong clearly.  Will try rebuilding and/or going back to prior backups as needed.

And scrutinize your deployments.  Something is making those files sick so you need to find out what... AV, external backup, OS level file sharing,....

Link to comment
Share on other sites

3 hours ago, Wim Decorte said:

And scrutinize your deployments.  Something is making those files sick so you need to find out what... AV, external backup, OS level file sharing,....

None of those are an issue - we have 3 servers with over 30 years of FMS service between them, and this is the first time this has happened.   The fact that it happened immediately after a FMS14v3 upgrade is of course suspicious, but if nobody else is reporting similar problems then it might just be coincidental. 

Given that it was only a single block that was corrupted out of nearly 1 Million blocks, this could be some sort of "bit rot" - perhaps a cosmic ray flipped a bit in RAM or in the SSD.

I'm still intrigued by why the backup consistency checker was reporting problems, but when I manually checked the backups they were fine.   The tech note I linked above from FileMaker says that the FMS and FMP use the same algorithm for checking.

 

 

 

 

Link to comment
Share on other sites

The problem has returned.  After putting a known good backup back into service around midnight, the server ran happily along doing hourly backups. 

At 0800 today it again reported the Consistency Check error (805).  However, just like before, copying both the backup file and the live files from the server and running Consistency Check and Recover on both from FMPA 14v3 reveals zero problems.

Re Checking my notes as to the timing of all this:

  • I upgraded to FMS14v3 around October 8th - I had earlier said the problems started "immediately after a FMS14v3 upgrade" but that's wrong.
  • On November 1st I did OS X "Security Update 2015-004" 
  • The first failed consistency check happened November 2 at 4am.
  • The consistency checks do not fail on every hourly backup, but are sporadic - e.g. when this started, the failures were at 7pm, 8pm, then nothing until 1pm and 2pm the next day.

Based on the history, now it makes it seem that "Security Update 2015-004" is more likely to blame.

I have another clue:  This server has Time Machine set to back up, and, as per FM Inc's recommendations, the live Data/Databases/ folder is excluded from backups.   However, the Data/Backups folder is not excluded from Time Machine - this is a technique we've been using for 5+ years on multiple servers w/o problem, and seems to follow the guidelines from FileMaker :  http://help.filemaker.com/app/answers/detail/a_id/6760/~/filemaker-and-time-machine

However, if I look at the Time Machine backups log, I see this:

11/5/15 12:44:49.035 AM com.apple.backupd[5571]: Backup completed successfully.
11/5/15 1:47:51.051 AM com.apple.backupd[6636]: Backup completed successfully.
11/5/15 2:49:11.531 AM com.apple.backupd[7705]: Backup completed successfully.
11/5/15 3:50:49.520 AM com.apple.backupd[8736]: Backup completed successfully.
11/5/15 4:54:35.958 AM com.apple.backupd[9785]: Backup completed successfully.
11/5/15 5:56:25.828 AM com.apple.backupd[10805]: Backup completed successfully.
11/5/15 6:57:47.132 AM com.apple.backupd[11793]: Backup completed successfully.
11/5/15 8:03:31.528 AM com.apple.backupd[12954]: Backup completed successfully.

Aha!  The backup that failed consistency check was the one that happened at 8am, when TimeMachine backup FMS Hourly backup script both ran at the same time.  All the other TimeMachine backups did not overlap with the FMS backup times.  

Could it be that the combination of OS X 10.10.5, Security Update 2015-004, FMS14v3, and TimeMachine backups of the FM Server Backups folder is now problematic? 

 

Other ideas:

  • This could be as simple as a failing hard drive : the symptoms (random data corruption under heavy use) would be consistent.  I'm currently running a Samsung 840 Pro (which is not the problematic 840 EVO model, but who knows - maybe it's dying?)   It looks like there's a firmware update for the SSD I may try, or maybe I'll just swap it out to be safe.   

 

 

 

 

Link to comment
Share on other sites

Time Machine and FMS backups at the same is likely the issue... the fact that it worked well for a long time is just a fluke I think.  You do not want anything to interfere with the FMS live files, temp files, backup locations and progressive backup locations.

Link to comment
Share on other sites

Wim, you certainly could be right, but that would mean that FileMaker's own recommendations are incorrect, and that my literally 30+ years of combined FM Server operation w/o problems was all just dumb luck, and that the problem happening soon after upgrades was also just coincidence.  

I've restored a known good backup and am letting the server run.  I've also decided to enable Progressive Backups as well (my logic: if it's a failing SSD, the more work it does the more likely it will be to fail, and if it's a software bug with FMS backups, then perhaps changing the backup settings will make the bug more consistent or even possibly go away).     I've also got a brand new SSD ready to be swapped in if things go south again.

Edited by xochi
spelling
Link to comment
Share on other sites

More info:

  • The "soft" problem had recurred : although some hourly backups are fine, some are being marked as inconsistent.  However, this time there was no overlap between the FileMaker backup and the Time Machine backup - the failure happened when Time Machine was idle.
  • Enabling Progressive backups didn't cure the problem (also, I'm now showing two "IncrementalBackup_YYYY-MM-DD_HHMM" folders, but documentation suggests there should only be one.  Is that a bug?)
  • As a test, I've set up another hourly backup folder, but this one is excluded from Time Machine.
  • I also realized that Spotlight indexing was enabled for this drive, so I've disabled it entirely.

 

22 hours ago, LaRetta said:

I feel your pain.

But equally sad is that the link you presented was created March, 2012 and last updated December, 2014 so it still hasn't been fixed?   If it has been fixed, that post should indicate it has been fixed.  If not, it is shameful that it has been 2 years 8 months still broken.

Yeah, it's hard to know what to make of that statement from FM Inc.

 

Also, would it kill them to put in a *little* more error logging?   The FMPro Recover command is great: it shows you in great detail for every file, table, field, layout and index what did (or did not) verify.    

 

FMS on the other hand doesn't even tell you which database failed the consistency check in the email message (you can get the database name if you open up Event.log, but that's it, no more details).   I've browsed /Library/FileMaker/Logs folder but I don't see anything with more info about why FMS thinks my backups are corrupted.  Sigh.

 

Edited by xochi
Link to comment
Share on other sites

11 hours ago, xochi said:

More info:

  •  
  • Enabling Progressive backups didn't cure the problem (also, I'm now showing two "IncrementalBackup_YYYY-MM-DD_HHMM" folders, but documentation suggests there should only be one.  Is that a bug?)

You should always see two full sets of "incrementalBackup".  Sometimes 3 if a progressive backup is actually in progress.

Not sure where you saw documentation that would suggest there is only one folder?

Link to comment
Share on other sites

 

Quote

 

How many copies of my files does FileMaker Server keep?

Progressive backup essentially keeps two copies of the backup files: a time stamped copy that is available for you to use as a backup, and an in-progress copy that gets updated with the accumulated changes.

 

This is from http://help.filemaker.com/app/answers/detail/a_id/10240/~/frequently-asked-questions-about-progressive-backups and suggests that there will be One time-stamped backup folder, but I see Two, in addition to the In Progress folder as well.

 

Edit:

 To be clear, what I actually see on my FMS14v3 server with Progressive Backups is:

Changes_FMS

Copies_FMS

IncrementalBackup_2015-MM-DD_HHMM

IncrementalBackup_2015-MM-DD_HHMM  (a second copy with a different timestamp)

Seems like FM needs to update their documentation a bit?

Edited by xochi
Link to comment
Share on other sites

I tried a few more diagnostics last night:

  • I created a new second hourly backup folder, running :30 after the hour, and excluded it from TimeMachine backups.  Did not help.
  • I turned off Spotlight for the entire hard drive.  Did not help.
  • I turned off Time Machine altogether.  Did not help.

None of these solved the issue .    I will try turning some more tests and report back.

Link to comment
Share on other sites

More Info:

I tried restoring to older software versions: OS 10.10.5 prior to Security Update 2015-004, and FMS to 14v1:  No change.

I tried putting in a brand new SSD:  No change

I tried swapping out RAM:  No change.

 

On the latest round, i did eventually get an actual crash of fmserverd :

Thread 30 Crashed:
0   com.filemaker.dbengine.framework	0x000000010a180457 Draco::HPage::FromDiskFmt(bool) + 215
1   com.filemaker.dbengine.framework	0x000000010a18ebf2 Draco::HCacheMgr::ReadPage(Draco::HFrame*, Draco::HBFile*, unsigned int, int) + 492
2   com.filemaker.dbengine.framework	0x000000010a18f7b1 Draco::HCacheMgr::PinPage(int, Draco::HBFile*, unsigned int, int, Draco::HFrame*) + 323
3   com.filemaker.dbengine.framework	0x000000010a1891f9 Draco::HPage::VerifyIndexPage(Draco::HFrame*, Draco::HPage::HFirstLast*) + 2807
4   com.filemaker.dbengine.framework	0x000000010a1ad8a4 Draco::HPosInfo::VerifyAllPages(Draco::HBFile*, Draco::HVerifyStats*) + 2558
5   com.filemaker.dbengine.framework	0x000000010a1a44cd Draco::HBAMSlot::S_Verify(Draco::HVerifyStats*) + 17
6   com.filemaker.dbengine.framework	0x000000010a198924 Draco::HBFile::Verify(Draco::HVerifyStats*) + 60
7   com.filemaker.dbengine.framework	0x000000010a337987 Draco::DBDamagedFile::Verify(Draco::HVerifyStats*, Draco::DBLogger*) + 241
8   com.filemaker.EngineMgr.framework	0x000000010b4cae5b BackupData::Verify() + 59
9   com.filemaker.EngineMgr.framework	0x000000010b4ca5a1 BackupData::VerifyAndOrCloneSet() + 193
10  com.filemaker.EngineMgr.framework	0x000000010b4c81b4 BackupData::BackupSet() + 2468
11  com.filemaker.EngineMgr.framework	0x000000010b4cc718 EngineMgr::EngineBackup(Draco::unistring const&, FMSTypes::DatabaseIDList const&, Draco::UniversalPath const&, bool, bool, unsigned int, FMSTypes::PathList const&) + 1048
12  com.filemaker.EngineMgr.framework	0x000000010b496b2a EngineMgr::BackupFileRequest(FMS::EventHelper&) + 154
13  com.filemaker.EngineMgr.framework	0x000000010b49013b EngineMgr::ProcessEvent(FMS::EventHelper&, bool) + 1467
14  com.filemaker.EngineMgr.framework	0x000000010b4c1e43 EngineMgr::DatabaseThread::Main() + 131
15  com.filemaker.support.framework	0x000000010b24134f Draco::Thread::RunThread() + 59
16  com.filemaker.support.framework	0x000000010b2403d0 Draco::Thread::ThreadUnixProc(void*) + 54
17  libsystem_pthread.dylib       	0x000000010caa005a _pthread_body + 131
18  libsystem_pthread.dylib       	0x000000010ca9ffd7 _pthread_start + 176
19  libsystem_pthread.dylib       	0x000000010ca9d3ed thread_start + 13

 

Given that this problem is only happening to a single file, on a single server, and multiple changes to hardware and software are having no effect, my conclusion is that there must be some sort of subtle corruption of the file itself.   

Even though the file will pass recovery and consistency checks most of the time, there is some circumstance in which it won't, and this can (at its worst) actually crash the server.

What caused this?  I have no idea - the timing was about a month after upgrading FMS to 14v3, and a few days after some of the clients upgraded from 14v1 to 14v3.    

I have multiple backups including cloned backups of the structure only - is the best next move is to find an older one of these that's in good shape (predating the crashes), then write scripts that re-import every table from a current recovered file?

 

 

Link to comment
Share on other sites

Another update:

After running extensive testing with different hardware, software, and configuration settings (e.g. Database RAM Cache size, frequency of backup scripts, etc.) and running consistency checks & recovery on the backups which fail consistency check, I've discovered the following pattern:

  • I still can not tell what causes FMS to declare a backup has failed consistency check, although it seems to occur more often when the server is under stress (e.g. with only 4GB RAM in the server it's more likely, with 16GB it happens less often).  It only happens on my largest file (which is 3.5GB and contains over 10 million records).  
  • Every time I recover or check one of these "failed" backups, it checks out 100% clean.
  • Thus, these failed backups are probably examples of the "false negative" bug which FileMaker acknowledges but hasn't fixed:  http://help.filemaker.com/app/answers/detail/a_id/10787/~/consistency-check-error-for-scheduled-backups
  • There does not seem to be any relationship with TimeMachine backups (of backups folders) or Spotlight or FileSharing or having WebDirect running.   The False Negatives were seen with all of these services disabled.

Additional info:

  • Rarely, fmserverd itself will crash and it appears to crash in the middle of a consistency check based on the stack trace (see example above).
  • When fmserverd crashes, this can and does cause actual database corruption in the live hosted file.  I've seen this happen twice now.

Conclusions and advice:

  • It therefore appears that the very operation of running consistency checks is dangerous, and running them frequently may cause more harm then good.
  • Until FM fixes this bug, I've decided to stop running consistency checks on Hourly backups, running them only on Weekly ones.
  • In theory, this sounds like a bad idea as it would let an entire week go by before discovering corruption; however in practice it seems like it's actually much less dangerous as the chance of causing corruption is much lower.
Edited by xochi
Link to comment
Share on other sites

3 hours ago, xochi said:

Another update:

  • There does not seem to be any relationship with TimeMachine backups (of backups folders) or Spotlight or FileSharing or having WebDirect running.   The False Negatives were seen with all of these services disabled.

 

I would disagree strongly here.  The absence of evidence is not evidence of absence.

Since some of these services had been running before, the damage to the databases could have happened at the time that those services were enabled.   That kind of damage happens at very specific cross-sections of those services interfering with FMS and you can go years without getting damage.

Link to comment
Share on other sites

5 hours ago, Wim Decorte said:

I would disagree strongly here.  The absence of evidence is not evidence of absence.

Since some of these services had been running before, the damage to the databases could have happened at the time that those services were enabled.   That kind of damage happens at very specific cross-sections of those services interfering with FMS and you can go years without getting damage.

You could be right, but I think the weight of evidence is not in your favor:

  1. FM Inc says, officially, that running TimeMachine is fine (see links above), so you are arguing against official FM Inc. dogma.
  2. FM Inc. admits there is a "false negative" bug (see links above), so you are again arguing against official FM Inc. dogma.
  3. All methods of validating the "inconsistent" database files return zero error.   (I also ran it through FMDiff as well and got the same results: the supposedly bad files are fine.   Audits of the data outputs of the 'corrupt' file are fine).
  4. The 'corruption' is not consistent :  if you are arguing that an earlier Time Machine backup corrupted the live file, then you have to explain why 9 times out of 10, the backups pass consistency checks by fmserver, and always pass consistency checks by FMPro).  
  5. There was absolutely zero relationship between any of my tests with TimeMachine, Spotlight, etc. and the appearance of the issue; nor did the problem go away when these services were stopped.

The only way you could be right would be if there is some invisible, undetectable corruption that can afflict a FMP12 database, and this invisible (dare I say "magical") corruption has only one effect: occasionally triggering FMS to notice that it's corrupt, and to crash fmserverd on occasion, but to be 100% undetectable by any other means including FMPro's Recover command.

Now, I've been working in computer software long enough to know I'd be a fool to say your scenario was impossible ; It certainly could be true.  

But I also tend to believe in the scientific method: claims that approach magic require extraordinary evidence.

Occam's razor suggests a much simpler answer:   The bug that FileMaker Inc. says exists, does exist, and (according to my tests) is actually more severe than they realize (with the actual crashing of fmserverd and database corruption of the live database being a possibility ).

 

 

Link to comment
Share on other sites

This topic is 3087 days old. Please don't post here. Open a new topic instead.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...

Important Information

By using this site, you agree to our Terms of Use.