Database crash

Message boards : News : Database crash

Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,750,238
RAC: 287,952
Message 1936 - Posted: 5 Jan 2018, 1:19:20 UTC

We had a database crash. We are working hard to get it fixed.
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,750,238
RAC: 287,952
Message 1937 - Posted: 8 Jan 2018, 1:59:07 UTC - in response to Message 1936.  

We are finally back up and running.

You will probably need to hit the update button to upload pending results. Several of my machines gave a "permanent http error" the first time I connected but eventually recovered.

All database tables were intact except for the workunit and result tables. These were corrupt and needed to be rebuilt; in the end we lost about 200 rows. Not a big deal considering how bad it could have been.

Sorry for any inconvenience! Please let me know if you notice any problems. Thanks!
Forretrio

Joined: 26 Jun 13
Posts: 11
Credit: 8,735,592
RAC: 0
Message 1938 - Posted: 8 Jan 2018, 7:30:28 UTC

Good to know it's back. I had trouble finding other desirable projects :P

Is it possible to have a 2017 review as well?
Richard Haselgrove

Joined: 28 Oct 11
Posts: 179
Credit: 220,390,202
RAC: 128,088
Message 1939 - Posted: 8 Jan 2018, 11:33:17 UTC

Yes, things are getting back to normal here too - I reported completed work and got a few new tasks.

I'm getting some of those "permanent http error" messages too. They seem to be caused by data files missing from the download storage area (HTTP 404), so of course the downloads fail for all replications of the workunit. You'll have a few jobs to re-issue when all this is over.
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,750,238
RAC: 287,952
Message 1940 - Posted: 8 Jan 2018, 15:56:06 UTC - in response to Message 1938.  

Yes, I will do that. It may take me a few days to get to it; I am still catching up on other duties after spending all weekend on this fricken database.
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,750,238
RAC: 287,952
Message 1941 - Posted: 8 Jan 2018, 16:08:14 UTC - in response to Message 1939.  

I just discovered something else... although all the project daemons are working as expected, the create_work script cannot connect to the database, so it looks like we missed something when reinstalling MySQL.

So just a heads up, we may run out of work until I get this fixed.
Profile Vitaly

Joined: 5 Jan 13
Posts: 43
Credit: 37,966,315
RAC: 38,512
Message 1943 - Posted: 9 Jan 2018, 19:48:12 UTC - in response to Message 1941.  
Last modified: 9 Jan 2018, 19:48:36 UTC

Some WUs load successfully, but others fail with "Error while loading".

For example,

https://numberfields.asu.edu/NumberFields/workunit.php?wuid=20780713

I have two dozen such WUs.
Profile Daniel Liebmann

Joined: 9 Jan 16
Posts: 6
Credit: 11,031,524
RAC: 0
Message 1944 - Posted: 9 Jan 2018, 20:26:21 UTC



It looks like there are more and more of them.
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,750,238
RAC: 287,952
Message 1945 - Posted: 10 Jan 2018, 7:42:34 UTC - in response to Message 1944.  

You will notice that all those errors are "download errors". As Richard mentioned above, this means the file no longer exists on the server. I also checked a few of these, and the server already has the result data for them, which means the results were returned at some point. My best theory is that while the database was crashing (which could have taken hours), these results were returned, validated, assimilated, and then marked for deletion. The file_deleter did its job and deleted the files. During restoration of the database, some WUs/results were put back to an earlier state, so the server is resending these results even though the corresponding data files have already been deleted.
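
The theory above suggests a simple consistency check: for each result the restored database still wants to send, see whether its data file is still on disk. Here is a minimal Python sketch of that idea. It is not the project's actual tooling; the function name, the (result name, file name) pair layout, and the flat download directory are all assumptions made for illustration, and a real server may lay its files out differently.

```python
from pathlib import Path


def find_orphaned_results(pending, download_dir):
    """Return names of pending results whose data file is missing on disk.

    `pending` is an iterable of (result_name, file_name) pairs.
    Both the pair layout and the flat `download_dir` are assumptions
    made for this sketch.
    """
    root = Path(download_dir)
    return [name for name, file_name in pending
            if not (root / file_name).is_file()]
```

Any result this returns would hit the same download error on every host it is sent to, because the file it needs is gone.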

Also note that the server has to go through 8 of these download errors before the WU gets flagged as bad and removed from the queue. I would reduce the 8 if I could, but it's part of the WU template, which means the value is already set in the database.
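
The retry behaviour described here can be illustrated with a toy model. The sketch below is Python, not BOINC code: the class, its field names, and the default of 8 are stand-ins for the per-WU error limit mentioned above. Whether the real server flags a WU at exactly the limit or one error past it is not something this thread states, so the comparison used here is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    """Toy stand-in for a workunit row; field names are hypothetical."""
    name: str
    max_errors: int = 8      # limit fixed when the WU was created
    error_count: int = 0
    flagged_bad: bool = False

    def record_download_error(self):
        """Count one failed result; flag the WU once the limit is reached."""
        if self.flagged_bad:
            return
        self.error_count += 1
        if self.error_count >= self.max_errors:
            self.flagged_bad = True


def attempts_until_flagged(wu):
    """Resend the WU until it is flagged; return the number of failed sends."""
    sends = 0
    while not wu.flagged_bad:
        wu.record_download_error()
        sends += 1
    return sends
```

In this model, every WU with a missing file costs a fixed number of wasted sends before it leaves the queue, which is why the errors keep showing up for a while.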

I hope that helps explain things.

Also, I was able to fix the create_work script, so new work is being generated.
Profile Vitaly

Joined: 5 Jan 13
Posts: 43
Credit: 37,966,315
RAC: 38,512
Message 1947 - Posted: 10 Jan 2018, 14:14:21 UTC - in response to Message 1945.  

Does this mean that such tasks are lost forever?
Or will they be recalculated somehow?

Regards.
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,750,238
RAC: 287,952
Message 1948 - Posted: 10 Jan 2018, 16:43:36 UTC - in response to Message 1947.  

I will wait until all the dust settles. My collation scripts will tell me which WUs are missing. If it's a small enough number, I will rerun them offline; otherwise I will generate new WUs to fill in any gaps. So no worries!
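
The kind of gap check described here can be sketched in a few lines of Python. This is not the project's collation script; the function name, the integer WU indices, and the cutoff of 50 for "small enough" are all made up for illustration.

```python
def plan_recovery(expected_ids, collated_ids, offline_limit=50):
    """Find missing WUs and choose how to recover them.

    Returns (missing, action), where action is "none", "rerun_offline",
    or "regenerate". `offline_limit` is an arbitrary illustrative cutoff.
    """
    missing = sorted(set(expected_ids) - set(collated_ids))
    if not missing:
        return missing, "none"
    action = "rerun_offline" if len(missing) <= offline_limit else "regenerate"
    return missing, action
```

For example, `plan_recovery(range(1, 6), [1, 2, 4])` reports WUs 3 and 5 as missing and, with the default cutoff, picks the offline rerun.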

Copyright © 2024 Arizona State University