Database crash
Author | Message |
---|---|
Joined: 8 Jul 11 Posts: 1346 Credit: 554,056,272 RAC: 659,224 |
We had a database crash. We are working hard to get it fixed. |
Joined: 8 Jul 11 Posts: 1346 Credit: 554,056,272 RAC: 659,224 |
We are finally back up and running. You will probably need to hit the update button to upload pending results. Several of my machines gave a "permanent http error" the first time I connected but eventually recovered. All database tables were intact except for the workunit and result tables. These were corrupt and needed to be rebuilt; in the end we lost about 200 rows. Not a big deal considering how bad it could have been. Sorry for any inconvenience! Please let me know if you notice any problems. Thanks! |
Joined: 26 Jun 13 Posts: 11 Credit: 8,735,592 RAC: 0 |
Good to know it's back. I had trouble finding other desirable projects :P Is it possible to have a 2017 review as well? |
Joined: 28 Oct 11 Posts: 180 Credit: 260,241,417 RAC: 218,365 |
Yes, things are getting back to normal here too: I reported completed work and got a few new tasks. I'm getting some of those "permanent http error" messages too. They seem to be caused by data files missing from the download storage area (HTTP 404), so of course the downloads fail for all replications of the workunit. You'll have a few jobs to re-issue when all this is over. |
Joined: 8 Jul 11 Posts: 1346 Credit: 554,056,272 RAC: 659,224 |
"Good to know it's back. I had trouble finding other desirable projects :P" Yes, I will do that. It may take me a few days to get to it; I am still catching up on other duties after spending all weekend on this fricken database. |
Joined: 8 Jul 11 Posts: 1346 Credit: 554,056,272 RAC: 659,224 |
I just discovered something else... Although all the project daemons are working as expected, the create_work script cannot connect to the database, so it looks like we missed something when reinstalling MySQL. Just a heads up: we may run out of work until I get this fixed. |
Joined: 5 Jan 13 Posts: 44 Credit: 44,666,539 RAC: 75,419 |
"I just discovered something else... Although all the project daemons are working as expected, the create_work script cannot connect to the database, so it looks like we missed something when reinstalling MySQL." Some WUs load successfully, but others fail with "Error while loading". For example: https://numberfields.asu.edu/NumberFields/workunit.php?wuid=20780713 I have two dozen such WUs. |
Joined: 9 Jan 16 Posts: 6 Credit: 11,031,524 RAC: 0 |
It looks like there are more and more of them. |
Joined: 8 Jul 11 Posts: 1346 Credit: 554,056,272 RAC: 659,224 |
You will notice that all those errors are "download errors". As Richard mentioned above, this means the file no longer exists on the server. I also checked a few of these, and the server already has their results, which means they were returned at some point. My best theory as to what happened is that while the database was crashing (which could have been over a period of hours), these results got returned, validated, assimilated, and then marked for deletion. The file_deleter did its job and deleted the files. During restoration of the database, some WUs/results got put back to an earlier state. So the server is resending these results, but the corresponding data files have already been deleted. Also note that a WU has to go through 8 of these download errors before it gets flagged as bad and subsequently removed from the queue. I would reduce the 8 if I could, but it's part of the WU template, which means the value is already set in the database. I hope that helps explain things. Also, I was able to fix the create_work script, so new work is being generated. |
Joined: 5 Jan 13 Posts: 44 Credit: 44,666,539 RAC: 75,419 |
"You will notice that all those errors are 'download errors'. As Richard mentioned above, this means the file no longer exists on the server. I also checked a few of these, and the server already has their results, which means they were returned at some point. My best theory as to what happened is that while the database was crashing (which could have been over a period of hours), these results got returned, validated, assimilated, and then marked for deletion. The file_deleter did its job and deleted the files. During restoration of the database, some WUs/results got put back to an earlier state. So the server is resending these results, but the corresponding data files have already been deleted." Does this mean that such tasks are lost forever? Or will they be recalculated somehow? Regards. |
Joined: 8 Jul 11 Posts: 1346 Credit: 554,056,272 RAC: 659,224 |
I will wait until all the dust settles. My collation scripts will tell me which WUs are missing. If it's a small enough number I will rerun them offline; otherwise I will generate new WUs to fill in any gaps. So no worries! |
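
For readers wondering what the gap check in the last post might look like, here is a minimal sketch. It is not the project's actual collation script. The function name `plan_recovery`, the `RecoveryPlan` type, and the cutoff of 50 are all hypothetical; the post only says "a small enough number". It assumes each WU can be identified by an integer index and that the set of expected indices is known.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical cutoff: the thread does not say what counts as "small enough".
OFFLINE_RERUN_LIMIT = 50


@dataclass(frozen=True)
class RecoveryPlan:
    missing: List[int]  # WU indices with no collated result
    action: str         # "none", "rerun_offline", or "regenerate_wus"


def plan_recovery(expected_ids: Iterable[int],
                  collated_ids: Iterable[int],
                  offline_limit: int = OFFLINE_RERUN_LIMIT) -> RecoveryPlan:
    """Find WUs that never produced a collated result and pick a recovery path."""
    # Anything expected but not collated is a gap left by the crash.
    missing = sorted(set(expected_ids) - set(collated_ids))
    if not missing:
        action = "none"
    elif len(missing) <= offline_limit:
        # Few enough gaps to recompute by hand, outside the project queue.
        action = "rerun_offline"
    else:
        # Too many gaps: create fresh WUs and let volunteers redo them.
        action = "regenerate_wus"
    return RecoveryPlan(missing=missing, action=action)


if __name__ == "__main__":
    # Ten expected WUs, two of which (4 and 7) never made it into the collated set.
    plan = plan_recovery(range(1, 11), [1, 2, 3, 5, 6, 8, 9, 10])
    print(plan.missing, plan.action)
```

Either path gets the missing work redone, which is why the affected tasks are not lost for good.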