Minor server overload problems

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,708,838
RAC: 288,154
Message 1925 - Posted: 30 Dec 2017, 17:55:35 UTC

Over the last couple days we have been having some overload problems on the server. To help reduce the server load I temporarily disabled the batch_status and server_status cron jobs. I will be manually generating those about once a day. Sorry for the inconvenience. I also reduced the "max_wus_in_progress" from 24 to 6, so you may see fewer WUs in your job queues.

For those who are interested in the cause of the overload, it was a combination of things:
1. One of the drives in our RAID is failing, causing HD performance issues. We are working to get this fixed.
2. There were 10k+ WUs that all timed out at about the same time. The server then proceeds to generate new results for all these, causing a backlog. This is the reason for reducing max_wus_in_progress.
3. The transitioner had a DB timeout which caused it to crash, increasing the backlog further.
4. Not understanding why WUs were not transitioning, I stupidly ran the "transition_all" admin script since it said this would "unstick" jobs. Big mistake - this script changed the transition time of all 600k WUs in the DB, forcing the transitioner to now reprocess every WU. That only made the backlog worse.
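As a rough illustration of why item 4 hurt so much more than item 2, the sketch below compares the two backlog sizes quoted above. The transitioner throughput (`ASSUMED_RATE`) is a made-up number chosen only for the arithmetic; it is not a measured value from this server.

```python
def drain_time_hours(backlog_wus: int, wus_per_second: float) -> float:
    """Hours needed to clear a backlog at a constant processing rate."""
    if wus_per_second <= 0:
        raise ValueError("processing rate must be positive")
    return backlog_wus / wus_per_second / 3600.0


# Backlog sizes quoted in the post.
TIMED_OUT_WUS = 10_000   # "10k+ WUs that all timed out"
ALL_WUS = 600_000        # every WU touched by transition_all

# Hypothetical transitioner throughput in WUs per second (an assumption).
ASSUMED_RATE = 20.0

ratio = ALL_WUS / TIMED_OUT_WUS
print(f"transition_all queued roughly {ratio:.0f}x as many WUs as the timeouts")
print(f"timeouts only:        {drain_time_hours(TIMED_OUT_WUS, ASSUMED_RATE):.2f} h")
print(f"after transition_all: {drain_time_hours(ALL_WUS, ASSUMED_RATE):.2f} h")
```

Whatever the real rate is, the ratio between the two cases stays the same, which is the point of the comparison.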

The good news is the server is almost caught up with the backlog. I will keep you posted.

Richard Haselgrove
Joined: 28 Oct 11
Posts: 179
Credit: 220,369,502
RAC: 127,991
Message 1926 - Posted: 31 Dec 2017, 9:33:15 UTC - in response to Message 1925.  

Probably related to the other problems, I'm currently unable to report completed tasks: the messages I get are

31/12/2017 09:27:54 | NumberFields@home | Server error: feeder not running
31/12/2017 09:27:54 | NumberFields@home | Project requested delay of 3600 seconds
Not a problem - they can sit here until you're ready for them.

Eric Driver
Message 1927 - Posted: 31 Dec 2017, 16:39:42 UTC - in response to Message 1926.  

That's a new error I haven't seen before. Maybe that was a temporary glitch because I can't find anything in the feeder log. It just shows it constantly adding results to slots.

At the time you posted this there was a backlog of about 15k WUs needing validation. Now it shows only about 1000 needing validation. Is it possible that the validation backlog was the reason you couldn't report? I don't see how a feeder problem would prevent you from reporting completed tasks.

Eric Driver
Message 1928 - Posted: 31 Dec 2017, 16:50:40 UTC - in response to Message 1927.  

Now we are having another problem... The /tmp drive is full, which is causing some of the daemons to crash, including the feeder. All the files on the /tmp drive are owned by root, so there is nothing I can do about this myself and I will have to get IT to look into it... it might be related to the failing hard drive.

Richard Haselgrove
Message 1929 - Posted: 31 Dec 2017, 20:25:04 UTC

It's fairly well known that a stopped feeder puts the server into a form of maintenance mode - nothing gets done, volunteer hosts are backed off for 1 hour (as my log shows). I presume this is deliberate to stop the situation getting worse until it can be inspected.

What that doesn't say is why the feeder stopped in the first place - your second post about the lack of temp file space seems as good an explanation as any.

At intervals throughout the day, I've seen the project come back up fully (so I could report and refill), then go into full maintenance mode with these boards down as well, then return to normal working (the current state: I've just reported and received new tasks). Which sounds like a good moment to go out and start celebrating the new year...

Eric Driver
Message 1930 - Posted: 31 Dec 2017, 20:35:48 UTC - in response to Message 1929.  

Oh, I wasn't aware of that.

Anyways, we got some space cleaned up on the /tmp drive by deleting some old log files. The server appears to be back up and running again.
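A full /tmp can be caught before the daemons start crashing. The following is a minimal sketch of such a check using only the Python standard library; the 10% threshold and the function name `low_on_space` are illustrative assumptions, not part of BOINC or of this project's setup.

```python
import shutil
import tempfile


def low_on_space(path: str, min_free_fraction: float = 0.10) -> bool:
    """Return True when the filesystem holding `path` has less than
    `min_free_fraction` of its capacity free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # Nothing can be written to a zero-sized filesystem; treat as full.
        return True
    return usage.free / usage.total < min_free_fraction


tmp_dir = tempfile.gettempdir()  # usually /tmp on Linux
if low_on_space(tmp_dir):
    print(f"WARNING: less than 10% free on the filesystem holding {tmp_dir}")
else:
    print(f"{tmp_dir}: free space OK")
```

Run periodically (for example from cron), a check like this would give some warning before the drive fills up completely.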

James Waddington
Joined: 12 Dec 17
Posts: 1
Credit: 20,713
RAC: 0
Message 1934 - Posted: 4 Jan 2018, 21:57:59 UTC

The Project still won't update and I can't upload the work I have.

Eric Driver
Message 1935 - Posted: 5 Jan 2018, 0:04:32 UTC - in response to Message 1934.  

I'm surprised you were even able to post... The database crashed two days ago and we have been working hard to restore it. I am at work now, but the sys admin and a grad student are still working on the issue.

The fact that the message boards are back up is a good sign. Hopefully we can get the workunit and results tables back up too.

Richard Haselgrove
Message 1971 - Posted: 27 Feb 2018, 12:34:46 UTC
Last modified: 27 Feb 2018, 12:51:27 UTC

The server has gone into the 'feeder not running' maintenance mode again - 1 hour backoff on scheduler requests, no reporting and no new work.

Edit - sorry, working now. Panic over.

Eric Driver
Message 1972 - Posted: 27 Feb 2018, 15:53:12 UTC - in response to Message 1971.  

For the most part, things have been very good since we got the new SSD installed. BOINC is highly dependent on file I/O (the create-work scripts copy files to the download directory, incoming results are copied to the upload directory, assimilated results are copied to a final staging area, etc.), so the new drive has sped things up remarkably. For example, I have noticed the create-work scripts are at least 10 times faster with the new drive.

However, every now and then I see an error in the various logs: "Lost connection to MySQL server during query".
The MySQL connection problem causes the associated daemon to shut down. This is probably what caused the feeder to shut down.
Since the daemons are run as cron jobs, they automatically restart within 5 minutes, so the project quickly recovers (Richard - I'm sure you already know this, so this info is for others reading this post).
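The restart behaviour described above can be pictured with a small stand-alone sketch. This is not BOINC code: the real mechanism is cron re-launching the daemons, and the names `run_with_restart` and `FlakyFeeder` are hypothetical, used here only to imitate a daemon that dies on a lost database connection and is then started again.

```python
import time
from typing import Callable


def run_with_restart(daemon: Callable[[], None], max_restarts: int,
                     delay_s: float = 0.0) -> int:
    """Run `daemon`; if it dies with ConnectionError, start it again.

    Returns the number of restarts that were needed. Re-raises the error
    once `max_restarts` is exhausted.
    """
    restarts = 0
    while True:
        try:
            daemon()
            return restarts
        except ConnectionError:
            if restarts >= max_restarts:
                raise
            restarts += 1
            time.sleep(delay_s)  # with cron, this wait would be up to 5 minutes


class FlakyFeeder:
    """Stand-in daemon that loses its DB connection a fixed number of times."""

    def __init__(self, failures: int) -> None:
        self.failures_left = failures
        self.runs = 0

    def __call__(self) -> None:
        self.runs += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("Lost connection to MySQL server during query")


feeder = FlakyFeeder(failures=2)
print("restarts needed:", run_with_restart(feeder, max_restarts=5))
```

The sketch also shows why volunteers only see short outages: each failure costs one restart interval, after which work continues as normal.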

I am not a database expert, so I am not sure what the root cause of the database connection problem is. Maybe someone reading this has a suggestion. Maybe I can adjust a timeout value or some other database parameter?

Richard Haselgrove
Message 1973 - Posted: 27 Feb 2018, 22:54:08 UTC

I was in a conference call this evening with a couple of *very* experienced BOINC server administrators. One thought that on a lightly-loaded project, a database connection might well time out between active requests: the other had never seen such a thing. On reflection, both thought there might be some useful information about a possible database server stoppage in the MySQL logs.

They suggested I copy your question to the boinc_projects mailing list, both to remind them to look again, and to get some broader responses from the community. OK if I do that in the morning?

Eric Driver
Message 1974 - Posted: 28 Feb 2018, 3:58:51 UTC - in response to Message 1973.  

Sure, no problem.

In the meantime I'll look at the MySQL logs.

Eric Driver
Message 1989 - Posted: 9 Mar 2018, 18:40:14 UTC - in response to Message 1974.  

Just to follow up on this MySQL topic.

First of all, the MySQL logs didn't give any clues, but that could have been because the verbosity level was not set appropriately.

I followed some information I found online and set the connect_timeout to 10 sec (had been 5). I did this on March 2nd, and it's now a week later, and there have been no lost connections. So I think the problem has been resolved.
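For anyone wanting to make the same change, the sketch below renders the setting in my.cnf (INI-style) syntax. `connect_timeout` is a standard MySQL server variable measured in seconds; how the change was actually applied on this server is not stated in the post, so the `[mysqld]` placement and the runtime `SET GLOBAL` alternative are assumptions about one common way to do it.

```python
import configparser
import io


def render_mysqld_option(name: str, value: int) -> str:
    """Render a single [mysqld] option in my.cnf (INI-style) syntax."""
    cfg = configparser.ConfigParser()
    cfg["mysqld"] = {name: str(value)}
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


# Persistent form: a fragment for the server's option file.
fragment = render_mysqld_option("connect_timeout", 10)
print(fragment)

# Runtime form: takes effect for new connections without a server restart,
# but does not survive one unless the option file is also updated.
RUNTIME_STATEMENT = "SET GLOBAL connect_timeout = 10;"
print(RUNTIME_STATEMENT)
```

The current value can be checked from a MySQL prompt with `SHOW VARIABLES LIKE 'connect_timeout';`.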

Igor
Joined: 10 Jan 18
Posts: 1
Credit: 24,497
RAC: 0
Message 1990 - Posted: 13 Mar 2018, 15:23:49 UTC
Last modified: 13 Mar 2018, 15:43:50 UTC

I have run into a problem sending already-processed tasks to the server. More than a dozen tasks are sitting in the queue and are not being sent. This has been going on for several days. Some of the tasks have already passed their reporting deadline. What should I do? Will tasks whose deadline has passed still be credited?


12.03.2018 12:00:23 | | Internet access OK - project servers may be temporarily down.
12.03.2018 16:00:41 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
12.03.2018 16:00:41 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
12.03.2018 16:01:03 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
12.03.2018 16:01:03 | NumberFields@home | Backing off 04:35:43 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
12.03.2018 16:01:03 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0: connect() failed
12.03.2018 16:01:03 | NumberFields@home | Backing off 05:37:15 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
12.03.2018 16:01:05 | | Project communication failed: attempting access to reference site
12.03.2018 16:01:06 | | Internet access OK - project servers may be temporarily down.
12.03.2018 16:55:01 | | Contacting account manager at http://www.grcpool.com/
12.03.2018 16:55:04 | | Account manager: DrugDiscovery@Home is not in your pool account, but was found in your client.Universe@Home is not in your pool account, but was found in your client.
12.03.2018 19:52:05 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0
12.03.2018 19:52:05 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp28707of120193_0_r307572109_0
12.03.2018 19:52:28 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0: connect() failed
12.03.2018 19:52:28 | NumberFields@home | Backing off 01:50:35 on upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0
12.03.2018 19:52:28 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp28707of120193_0_r307572109_0: connect() failed
12.03.2018 19:52:28 | NumberFields@home | Backing off 00:04:03 on upload of wu_sf4_DS-13x271-1_Grp28707of120193_0_r307572109_0
12.03.2018 19:52:28 | | Project communication failed: attempting access to reference site
12.03.2018 19:52:29 | | Internet access OK - project servers may be temporarily down.
12.03.2018 23:58:04 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
12.03.2018 23:58:04 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
12.03.2018 23:58:28 | | Project communication failed: attempting access to reference site
12.03.2018 23:58:28 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
12.03.2018 23:58:28 | NumberFields@home | Backing off 03:03:16 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
12.03.2018 23:58:28 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0: connect() failed
12.03.2018 23:58:28 | NumberFields@home | Backing off 03:03:12 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
12.03.2018 23:58:30 | | Internet access OK - project servers may be temporarily down.
13.03.2018 1:42:44 | NumberFields@home | Computation for task wu_sf4_DS-13x271-1_Grp77121of120193_0 finished
13.03.2018 1:42:46 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp77121of120193_0_r1616414774_0
13.03.2018 1:43:09 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp77121of120193_0_r1616414774_0: connect() failed
13.03.2018 1:43:09 | NumberFields@home | Backing off 00:03:00 on upload of wu_sf4_DS-13x271-1_Grp77121of120193_0_r1616414774_0
13.03.2018 1:43:10 | | Project communication failed: attempting access to reference site
13.03.2018 1:43:12 | | Internet access OK - project servers may be temporarily down.
13.03.2018 5:31:57 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 5:31:57 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 5:32:20 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
13.03.2018 5:32:20 | NumberFields@home | Backing off 03:02:34 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 5:32:20 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0: connect() failed
13.03.2018 5:32:20 | NumberFields@home | Backing off 04:07:43 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 5:32:21 | | Project communication failed: attempting access to reference site
13.03.2018 5:32:23 | | Internet access OK - project servers may be temporarily down.
13.03.2018 10:02:35 | | Contacting account manager at http://www.grcpool.com/
13.03.2018 10:02:39 | | Account manager: DrugDiscovery@Home is not in your pool account, but was found in your client.Universe@Home is not in your pool account, but was found in your client.
13.03.2018 10:55:53 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 10:55:53 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 10:56:16 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
13.03.2018 10:56:16 | NumberFields@home | Backing off 05:56:49 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 10:56:16 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0: connect() failed
13.03.2018 10:56:16 | NumberFields@home | Backing off 04:38:34 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 10:56:17 | | Project communication failed: attempting access to reference site
13.03.2018 10:56:20 | | Internet access OK - project servers may be temporarily down.
13.03.2018 16:14:58 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 16:14:59 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0
13.03.2018 16:15:22 | | Project communication failed: attempting access to reference site
13.03.2018 16:15:22 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0: connect() failed
13.03.2018 16:15:22 | NumberFields@home | Backing off 05:31:25 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 16:15:22 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0: connect() failed
13.03.2018 16:15:22 | NumberFields@home | Backing off 02:55:26 on upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0
13.03.2018 16:15:23 | | Internet access OK - project servers may be temporarily down.
13.03.2018 17:25:27 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 17:25:27 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 17:25:50 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
13.03.2018 17:25:50 | NumberFields@home | Backing off 04:18:28 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 17:25:50 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0: connect() failed
13.03.2018 17:25:50 | NumberFields@home | Backing off 04:43:47 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 17:25:51 | | Project communication failed: attempting access to reference site
13.03.2018 17:25:53 | | Internet access OK - project servers may be temporarily down.
13.03.2018 17:37:13 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 17:37:13 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0
13.03.2018 17:37:35 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
13.03.2018 17:37:35 | NumberFields@home | Backing off 04:17:40 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 17:37:35 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0: connect() failed
13.03.2018 17:37:35 | NumberFields@home | Backing off 04:19:36 on upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0
13.03.2018 17:37:35 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp28707of120193_0_r307572109_0
13.03.2018 17:37:35 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp28824of120193_0_r2053227961_0
13.03.2018 17:37:36 | | Project communication failed: attempting access to reference site
13.03.2018 17:37:38 | | Internet access OK - project servers may be temporarily down.
13.03.2018 17:37:57 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp28707of120193_0_r307572109_0: connect() failed
13.03.2018 17:37:57 | NumberFields@home | Backing off 00:10:24 on upload of wu_sf4_DS-13x271-1_Grp28707of120193_0_r307572109_0
13.03.2018 17:37:57 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp28824of120193_0_r2053227961_0: connect() failed
13.03.2018 17:37:57 | NumberFields@home | Backing off 00:06:20 on upload of wu_sf4_DS-13x271-1_Grp28824of120193_0_r2053227961_0
13.03.2018 17:37:58 | | Project communication failed: attempting access to reference site
13.03.2018 17:37:59 | | Internet access OK - project servers may be temporarily down.
And it is the same for all 18 tasks.
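A long event log like the one above is easier to read once it is summarised. The following is a minimal sketch that counts failed upload attempts per result file; it assumes the `timestamp | project | message` layout seen in this log, and the function name `count_failed_uploads` is hypothetical rather than part of the BOINC client.

```python
import re
from collections import Counter

# Matches the message part of a failed-upload line: file name, then reason.
FAILED_RE = re.compile(r"Temporarily failed upload of (\S+): (.+)")


def count_failed_uploads(log_text: str) -> Counter:
    """Count 'Temporarily failed upload' events per result file."""
    failures = Counter()
    for line in log_text.splitlines():
        # Layout assumed: "<timestamp> | <project> | <message>"
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue
        match = FAILED_RE.match(parts[2].strip())
        if match:
            failures[match.group(1)] += 1
    return failures


# A few lines copied from the log above.
SAMPLE = """\
12.03.2018 16:00:41 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
12.03.2018 16:01:03 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
12.03.2018 16:01:03 | NumberFields@home | Backing off 04:35:43 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
12.03.2018 16:01:06 | | Internet access OK - project servers may be temporarily down.
12.03.2018 23:58:28 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
"""

for name, count in count_failed_uploads(SAMPLE).most_common():
    print(f"{count} failed attempt(s): {name}")
```

Fed the full log, the same function would show at a glance which uploads keep failing and how often.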

Eric Driver
Message 1991 - Posted: 13 Mar 2018, 21:53:03 UTC - in response to Message 1990.  

I don't have access to the server right now. I will look into this when I get home later.

Eric Driver
Message 1992 - Posted: 14 Mar 2018, 5:22:30 UTC - in response to Message 1991.  

Igor,

I'm not sure how you have BOINC set up, but something looks strange. Those WUs in your post were successfully returned earlier today by user ID 77798 with username grcpool.com-3. It also turns out that both you (ID 81451) and grcpool.com-3 have the same IP address. Maybe the log you posted was for the grcpool.com-3 user and the connection problem has resolved itself? If that log was for the Igor account, then your two accounts got their wires crossed.

Copyright © 2024 Arizona State University