Minor server overload problems



Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 704
Credit: 47,078,492
RAC: 26,507
Message 1925 - Posted: 30 Dec 2017, 17:55:35 UTC

Over the last couple of days we have been having some overload problems on the server. To help reduce the server load, I have temporarily disabled the batch_status and server_status cron jobs. I will be manually generating those about once a day. Sorry for the inconvenience. I also reduced "max_wus_in_progress" from 24 to 6, so you may see fewer WUs in your job queues.

For those who are interested in the cause of the overload, it was a combination of things:
1. One of the drives in our RAID is failing, causing HD performance issues. We are working to get this fixed.
2. There were 10k+ WUs that all timed out at about the same time. The server then generated new results for all of these, causing a backlog. This is the reason for reducing max_wus_in_progress.
3. The transitioner had a DB timeout which caused it to crash, increasing the backlog further.
4. Not understanding why WUs were not transitioning, I stupidly ran the "transition_all" admin script since it said this would "unstick" jobs. Big mistake - this script changed the transition time of all 600k WUs in the DB, forcing the transitioner to now reprocess every WU. That only made the backlog worse.
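As a rough illustration of items 2 and 4, here is a back-of-the-envelope sketch of how the per-host cap bounds a burst of simultaneous timeouts, and how long a backlog takes to clear at a fixed processing rate. The host count and the transitioner rate are hypothetical numbers chosen only for the example, and the cap is treated as a simple per-host limit (the real setting may scale with the number of CPUs).

```python
def worst_case_resends(hosts_going_silent: int, per_host_cap: int) -> int:
    """Upper bound on tasks that can time out together if these hosts all vanish.

    Assumes each host holds at most `per_host_cap` tasks (a simplification).
    """
    return hosts_going_silent * per_host_cap


def hours_to_drain(backlog: int, processed_per_hour: int) -> float:
    """Time to clear a backlog at a constant rate with no new arrivals."""
    if processed_per_hour <= 0:
        raise ValueError("processed_per_hour must be positive")
    return backlog / processed_per_hour


if __name__ == "__main__":
    hosts = 500            # hypothetical number of hosts that stop reporting
    rate_per_hour = 20_000  # hypothetical transitioner throughput

    for cap in (24, 6):
        burst = worst_case_resends(hosts, cap)
        print(f"cap={cap}: up to {burst} WUs could time out together")

    # Item 4: every WU in the DB was flagged, versus only the timed-out ones.
    for backlog in (10_000, 600_000):
        print(f"{backlog} WUs: about {hours_to_drain(backlog, rate_per_hour):.1f} h")
```

Whatever the real numbers are, going from a cap of 24 to 6 cuts the worst-case burst by a factor of four, while flagging all 600k WUs makes the queue about sixty times larger than the 10k that had actually timed out.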

The good news is the server is almost caught up with the backlog. I will keep you posted.
Richard Haselgrove

Joined: 28 Oct 11
Posts: 101
Credit: 76,870,592
RAC: 41,319
Message 1926 - Posted: 31 Dec 2017, 9:33:15 UTC - in response to Message 1925.  

Probably related to the other problems, I'm currently unable to report completed tasks. The messages I get are:

31/12/2017 09:27:54 | NumberFields@home | Server error: feeder not running
31/12/2017 09:27:54 | NumberFields@home | Project requested delay of 3600 seconds

Not a problem - they can sit here until you're ready for them.
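For anyone who wants to work out when their client will try again, here is a minimal sketch that reads a line in the format quoted above and adds the requested delay to the timestamp. It assumes the `date time | project | message` layout and the day/month/year date order shown in those two lines; other clients or locales may print the log differently. The function name `next_contact_time` is made up for this example.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Matches the message text of the second quoted log line.
DELAY_RE = re.compile(r"Project requested delay of (\d+) seconds")


def next_contact_time(log_line: str) -> Optional[datetime]:
    """Return timestamp + requested delay, or None if the line has no delay."""
    parts = log_line.split(" | ", 2)
    if len(parts) != 3:
        return None
    stamp, _project, message = parts
    match = DELAY_RE.search(message)
    if match is None:
        return None
    # Assumed date format: day/month/year, 24-hour clock.
    logged_at = datetime.strptime(stamp.strip(), "%d/%m/%Y %H:%M:%S")
    return logged_at + timedelta(seconds=int(match.group(1)))


if __name__ == "__main__":
    lines = [
        "31/12/2017 09:27:54 | NumberFields@home | Server error: feeder not running",
        "31/12/2017 09:27:54 | NumberFields@home | Project requested delay of 3600 seconds",
    ]
    for line in lines:
        print(next_contact_time(line))
```

The first line carries no delay, so it yields `None`; the second yields a time one hour after the logged timestamp.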
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 704
Credit: 47,078,492
RAC: 26,507
Message 1927 - Posted: 31 Dec 2017, 16:39:42 UTC - in response to Message 1926.  

That's a new error I haven't seen before. Maybe that was a temporary glitch because I can't find anything in the feeder log. It just shows it constantly adding results to slots.

At the time you posted this there was a backlog of about 15k WUs needing validation. Now it shows only about 1000 needing validation. Is it possible the reason it couldn't report was a validation backlog? I ask because I don't see how a feeder problem would prevent you from reporting completed tasks.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 704
Credit: 47,078,492
RAC: 26,507
Message 1928 - Posted: 31 Dec 2017, 16:50:40 UTC - in response to Message 1927.  

Now we are having another problem... The /tmp drive is full, which is causing some of the daemons to crash, including the feeder. All the files on the /tmp drive are owned by root, so there is nothing I can do about this and I will have to get IT to look into it... it might be related to the failing hard drive.
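A full temp drive like this can be caught early with a periodic free-space check. Below is a minimal sketch using Python's standard `shutil.disk_usage`; the 5% threshold and the function names are assumptions for illustration, not anything the project server actually runs.

```python
import shutil
import tempfile


def free_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def is_nearly_full(path: str, min_free_fraction: float = 0.05) -> bool:
    """True when less than `min_free_fraction` of the filesystem is free.

    The 5% default is an arbitrary example threshold.
    """
    return free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    tmp = tempfile.gettempdir()  # usually /tmp on Linux
    print(f"{tmp}: {free_fraction(tmp):.1%} free, nearly full: {is_nearly_full(tmp)}")
```

Run from cron, a check like this could send a warning before the daemons start failing; it only reports space and does not need permission to read the root-owned files.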
Richard Haselgrove

Joined: 28 Oct 11
Posts: 101
Credit: 76,870,592
RAC: 41,319
Message 1929 - Posted: 31 Dec 2017, 20:25:04 UTC

It's fairly well known that a stopped feeder puts the server into a form of maintenance mode - nothing gets done, volunteer hosts are backed off for 1 hour (as my log shows). I presume this is deliberate to stop the situation getting worse until it can be inspected.

What that doesn't say is why the feeder stopped in the first place - your second post about the lack of temp file space seems as good an explanation as any.

At intervals throughout the day, I've seen the project come back up fully (so I could report and refill), then go into full maintenance mode with these boards down as well, then return to normal working (the current state: just reported and received new tasks). Which sounds like a good moment to go out and start celebrating the new year...
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 704
Credit: 47,078,492
RAC: 26,507
Message 1930 - Posted: 31 Dec 2017, 20:35:48 UTC - in response to Message 1929.  

Oh, I wasn't aware of that.

Anyways, we got some space cleaned up on the /tmp drive by deleting some old log files. The server appears to be back up and running again.
James Waddington

Joined: 12 Dec 17
Posts: 1
Credit: 18,429
RAC: 312
Message 1934 - Posted: 4 Jan 2018, 21:57:59 UTC

The Project still won't update and I can't upload the work I have.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 704
Credit: 47,078,492
RAC: 26,507
Message 1935 - Posted: 5 Jan 2018, 0:04:32 UTC - in response to Message 1934.  

I'm surprised you were even able to post... The database crashed two days ago and we have been working hard to restore it. I am at work now, but the sys admin and a grad student are still working on the issue.

The fact that the message boards are back up is a good sign. Hopefully we can get the workunit and results tables back up too.



Copyright © 2018 Arizona State University