Minor server overload problems

Author	Message
Eric Driver (Project administrator, Project developer, Project tester, Project scientist) Joined: 8 Jul 11 Posts: 1472 Credit: 1,284,625,089 RAC: 1,167,978	Message 1925 - Posted: 30 Dec 2017, 17:55:35 UTC
Over the last couple of days we have been having some overload problems on the server. To help reduce the server load, I temporarily disabled the batch_status and server_status cron jobs. I will be generating those manually about once a day. Sorry for the inconvenience. I also reduced "max_wus_in_progress" from 24 to 6, so you may see fewer WUs in your job queues.
For those who are interested in the cause of the overload, it was a combination of things:
1. One of the drives in our RAID is failing, causing HD performance issues. We are working to get this fixed.
2. There were 10k+ WUs that all timed out at about the same time. The server then proceeded to generate new results for all of them, causing a backlog. This is the reason for reducing max_wus_in_progress.
3. The transitioner had a DB timeout which caused it to crash, increasing the backlog further.
4. Not understanding why WUs were not transitioning, I stupidly ran the "transition_all" admin script, since it said this would "unstick" jobs. Big mistake - this script changed the transition time of all 600k WUs in the DB, forcing the transitioner to reprocess every WU. That only made the backlog worse.
The good news is that the server has almost caught up with the backlog. I will keep you posted.
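As a rough illustration of why lowering max_wus_in_progress limits how much work can time out at once, the sketch below computes the ceiling on tasks in the field before and after the change. The fleet size of 1000 hosts is a made-up number, and the sketch assumes the cap is applied once per host; BOINC may scale it by CPU count, which the post above does not specify.

```python
def max_in_progress(hosts: int, cap_per_host: int) -> int:
    """Upper bound on tasks that can be out in the field at the same time."""
    return hosts * cap_per_host

# Hypothetical fleet size, chosen only for illustration.
HOSTS = 1000

before = max_in_progress(HOSTS, 24)  # old max_wus_in_progress value
after = max_in_progress(HOSTS, 6)    # new max_wus_in_progress value
print(before, after, before // after)
```

Whatever the real host count is, the ratio between the two ceilings is the same: the new setting allows a quarter as many outstanding tasks per host as the old one.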

Richard Haselgrove Joined: 28 Oct 11 Posts: 182 Credit: 373,150,859 RAC: 225,122	Message 1926 - Posted: 31 Dec 2017, 9:33:15 UTC - in response to Message 1925.
Probably related to the other problems, I'm currently unable to report completed tasks. The messages I get are:
31/12/2017 09:27:54 | NumberFields@home | Server error: feeder not running
31/12/2017 09:27:54 | NumberFields@home | Project requested delay of 3600 seconds
Not a problem - they can sit here until you're ready for them.

Eric Driver	Message 1927 - Posted: 31 Dec 2017, 16:39:42 UTC - in response to Message 1926.
That's a new error I haven't seen before. Maybe it was a temporary glitch, because I can't find anything in the feeder log; it just shows the feeder constantly adding results to slots. At the time you posted, there was a backlog of about 15k WUs needing validation. Now it shows only about 1000. Is it possible that a validation backlog was the reason you couldn't report? I don't see how a feeder problem would prevent you from reporting completed tasks.

Eric Driver	Message 1928 - Posted: 31 Dec 2017, 16:50:40 UTC - in response to Message 1927.
Now we are having another problem... The tmp drive is full, which is causing some of the daemons to crash, including the feeder. All the files on the tmp drive are owned by root, so there is nothing I can do about this myself and I will have to get IT to look into it. It might be related to the failing hard drive.

Richard Haselgrove	Message 1929 - Posted: 31 Dec 2017, 20:25:04 UTC
It's fairly well known that a stopped feeder puts the server into a form of maintenance mode: nothing gets done, and volunteer hosts are backed off for 1 hour (as my log shows). I presume this is deliberate, to stop the situation getting worse until it can be inspected. What that doesn't say is why the feeder stopped in the first place; your second post about the lack of temp file space seems as good an explanation as any.
At intervals throughout the day, I've seen the project come back up fully (so I could report and refill), then go into full maintenance mode with these boards down as well, then return to normal working (the current state: I have just reported and received new tasks). Which sounds like a good moment to go out and start celebrating the new year...
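The one-hour backoff mentioned above can be read straight out of the client event log. The following is a minimal sketch, assuming the three-field `timestamp | project | message` layout seen in the log lines quoted earlier in this thread; the function name is made up for illustration and is not part of BOINC.

```python
import re
from datetime import timedelta

DELAY_RE = re.compile(r"Project requested delay of (\d+) seconds")

def requested_delays(log_lines):
    """Yield (project, timedelta) for each server-requested delay in the log."""
    for line in log_lines:
        # Tolerate escaped pipes, then split into timestamp | project | message.
        parts = [p.strip() for p in line.replace("\\|", "|").split("|", 2)]
        if len(parts) != 3:
            continue  # not an event-log line
        _timestamp, project, message = parts
        match = DELAY_RE.search(message)
        if match:
            yield project, timedelta(seconds=int(match.group(1)))

sample = [
    "31/12/2017 09:27:54 | NumberFields@home | Server error: feeder not running",
    "31/12/2017 09:27:54 | NumberFields@home | Project requested delay of 3600 seconds",
]
delays = list(requested_delays(sample))
for project, delay in delays:
    print(project, delay)
```

Only the second sample line carries a delay, so the first is skipped; 3600 seconds is the one-hour backoff described in the post.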

Eric Driver	Message 1930 - Posted: 31 Dec 2017, 20:35:48 UTC - in response to Message 1929.
Oh, I wasn't aware of that. Anyway, we got some space cleaned up on the /tmp drive by deleting some old log files. The server appears to be back up and running again.
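A full temporary filesystem is easy to miss until daemons start crashing. Below is a small sketch of a free-space check using Python's standard library; the 10% threshold is an arbitrary assumption, and nothing here reflects how the project's server is actually monitored.

```python
import shutil
import tempfile

def free_fraction(path: str) -> float:
    """Fraction of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # treat a filesystem that reports no capacity as full
    return usage.free / usage.total

def is_nearly_full(path: str, min_free: float = 0.10) -> bool:
    """True when less than `min_free` of the space remains (10% is a hypothetical threshold)."""
    return free_fraction(path) < min_free

tmp_dir = tempfile.gettempdir()  # usually /tmp on Linux
fraction = free_fraction(tmp_dir)
print(f"{tmp_dir}: {fraction:.1%} free, nearly full: {is_nearly_full(tmp_dir)}")
```

Run from cron, a check like this could send a warning before the drive fills completely.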

James Waddington Joined: 12 Dec 17 Posts: 1 Credit: 20,713 RAC: 0	Message 1934 - Posted: 4 Jan 2018, 21:57:59 UTC
The project still won't update and I can't upload the work I have.

Eric Driver	Message 1935 - Posted: 5 Jan 2018, 0:04:32 UTC - in response to Message 1934.
I'm surprised you were even able to post... The database crashed two days ago and we have been working hard to restore it. I am at work now, but the sys admin and a grad student are still working on the issue. The fact that the message boards are back up is a good sign. Hopefully we can get the workunit and results tables back up too.

Richard Haselgrove	Message 1971 - Posted: 27 Feb 2018, 12:34:46 UTC Last modified: 27 Feb 2018, 12:51:27 UTC
The server has gone into the 'feeder not running' maintenance mode again: 1 hour backoff on scheduler requests, no reporting and no new work.
Edit - sorry, working now. Panic over.

Eric Driver	Message 1972 - Posted: 27 Feb 2018, 15:53:12 UTC - in response to Message 1971.
For the most part, things have been very good since we got the new SSD installed. BOINC is highly dependent on file I/O (create-work scripts copy files to the download directory, incoming results are copied to the upload directory, assimilated results are copied to a final staging area, etc.), so the new drive has sped things up remarkably. For example, I have noticed the create-work scripts are at least 10 times faster with the new drive.
However, every now and then I see an error in the various logs: "Lost connection to MySQL server during query". The MySQL connection problem causes the associated daemon to shut down. This is probably what caused the feeder to shut down. Since the daemons are run as cron jobs, they automatically restart within 5 minutes, so the project quickly recovers (Richard - I'm sure you already know this, so this info is for others reading this post).
I am not a database expert, so I am not sure what the root cause of the database connection problem is. Maybe someone reading this has a suggestion. Maybe I can adjust a timeout value or some other database parameter?
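The post describes daemons that exit on a lost connection and rely on cron to restart them. For comparison, here is a sketch of the other common pattern, reconnecting and retrying in-process. This is not how the BOINC daemons behave; `LostConnection`, `run_with_reconnect`, and the fake query are all hypothetical stand-ins so the example runs without a database.

```python
import time

class LostConnection(Exception):
    """Stand-in for a driver's 'Lost connection to MySQL server' error."""

def run_with_reconnect(query, connect, retries=3, delay=0.0):
    """Run query(conn); on LostConnection, reconnect and retry up to `retries` times."""
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return query(conn)
        except LostConnection:
            if attempt == retries:
                raise  # give up; an external supervisor such as cron can restart us
            time.sleep(delay)
            conn = connect()  # open a fresh connection before the next attempt

# Fake query that fails twice and then succeeds.
calls = {"n": 0}

def flaky_query(conn):
    calls["n"] += 1
    if calls["n"] < 3:
        raise LostConnection("Lost connection to MySQL server during query")
    return "ok"

result = run_with_reconnect(flaky_query, connect=lambda: object())
print(result, calls["n"])
```

The trade-off is that in-process retries hide short outages but can also mask a persistent database problem, whereas exit-and-restart leaves a clear trace in the logs.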

Richard Haselgrove	Message 1973 - Posted: 27 Feb 2018, 22:54:08 UTC
I was in a conference call this evening with a couple of very experienced BOINC server administrators. One thought that on a lightly-loaded project, a database connection might well time out between active requests; the other had never seen such a thing. On reflection, both thought there might be some useful information about a possible database server stoppage in the MySQL logs.
They suggested I copy your question to the boinc_projects mailing list, both to remind them to look again, and to get some broader responses from the community. OK if I do that in the morning?

Eric Driver	Message 1974 - Posted: 28 Feb 2018, 3:58:51 UTC - in response to Message 1973.
Sure, no problem. In the meantime I'll look at the MySQL logs.

Eric Driver	Message 1989 - Posted: 9 Mar 2018, 18:40:14 UTC - in response to Message 1974.
Just to follow up on this MySQL topic: first of all, the MySQL logs didn't give any clues, but that could be because the verbosity level was not set appropriately. I followed some information I found online and set connect_timeout to 10 sec (it had been 5). I did this on March 2nd; it's now a week later, and there have been no lost connections. So I think the problem has been resolved.
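The post does not say where connect_timeout was changed. Assuming it lives in the `[mysqld]` section of a my.cnf-style option file, the sketch below shows one way to read the configured value back with Python's standard library. The sample text is invented, and real option files can contain directives (such as `!includedir`) that this simple parser does not handle.

```python
import configparser

# Invented sample; a real server's option file will contain more settings.
SAMPLE_CNF = """
[mysqld]
connect_timeout = 10
"""

def read_connect_timeout(cnf_text: str, default=None):
    """Return connect_timeout from the [mysqld] section, or `default` if unset."""
    parser = configparser.ConfigParser(
        allow_no_value=True,  # option files may list bare flags without a value
        strict=False,         # tolerate repeated sections or options
        interpolation=None,   # do not treat '%' in values specially
    )
    parser.read_string(cnf_text)
    if parser.has_option("mysqld", "connect_timeout"):
        return parser.getint("mysqld", "connect_timeout")
    return default

print(read_connect_timeout(SAMPLE_CNF))
```

A value of `None` means the file does not set the option, in which case the server falls back to its built-in default.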

Igor Joined: 10 Jan 18 Posts: 1 Credit: 24,497 RAC: 0	Message 1990 - Posted: 13 Mar 2018, 15:23:49 UTC Last modified: 13 Mar 2018, 15:43:50 UTC
I have run into a problem uploading completed tasks to the server. More than a dozen tasks are sitting in the queue and will not upload. This has been going on for several days. Some tasks have already passed their reporting deadline. What should I do? Will tasks that have missed their deadline still be credited?
12.03.2018 12:00:23 | | Internet access OK - project servers may be temporarily down.
12.03.2018 16:00:41 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
12.03.2018 16:00:41 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
12.03.2018 16:01:03 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
12.03.2018 16:01:03 | NumberFields@home | Backing off 04:35:43 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
12.03.2018 16:01:03 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0: connect() failed
12.03.2018 16:01:03 | NumberFields@home | Backing off 05:37:15 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
12.03.2018 16:01:05 | | Project communication failed: attempting access to reference site
12.03.2018 16:01:06 | | Internet access OK - project servers may be temporarily down.
12.03.2018 16:55:01 | | Contacting account manager at http://www.grcpool.com/
12.03.2018 16:55:04 | | Account manager: DrugDiscovery@Home is not in your pool account, but was found in your client. Universe@Home is not in your pool account, but was found in your client.
12.03.2018 19:52:05 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0
12.03.2018 19:52:05 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp28707of120193_0_r307572109_0
12.03.2018 19:52:28 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0: connect() failed
12.03.2018 19:52:28 | NumberFields@home | Backing off 01:50:35 on upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0
12.03.2018 19:52:28 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp28707of120193_0_r307572109_0: connect() failed
12.03.2018 19:52:28 | NumberFields@home | Backing off 00:04:03 on upload of wu_sf4_DS-13x271-1_Grp28707of120193_0_r307572109_0
12.03.2018 19:52:28 | | Project communication failed: attempting access to reference site
12.03.2018 19:52:29 | | Internet access OK - project servers may be temporarily down.
12.03.2018 23:58:04 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
12.03.2018 23:58:04 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
12.03.2018 23:58:28 | | Project communication failed: attempting access to reference site
12.03.2018 23:58:28 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
12.03.2018 23:58:28 | NumberFields@home | Backing off 03:03:16 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
12.03.2018 23:58:28 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0: connect() failed
12.03.2018 23:58:28 | NumberFields@home | Backing off 03:03:12 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
12.03.2018 23:58:30 | | Internet access OK - project servers may be temporarily down.
13.03.2018 1:42:44 | NumberFields@home | Computation for task wu_sf4_DS-13x271-1_Grp77121of120193_0 finished
13.03.2018 1:42:46 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp77121of120193_0_r1616414774_0
13.03.2018 1:43:09 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp77121of120193_0_r1616414774_0: connect() failed
13.03.2018 1:43:09 | NumberFields@home | Backing off 00:03:00 on upload of wu_sf4_DS-13x271-1_Grp77121of120193_0_r1616414774_0
13.03.2018 1:43:10 | | Project communication failed: attempting access to reference site
13.03.2018 1:43:12 | | Internet access OK - project servers may be temporarily down.
13.03.2018 5:31:57 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 5:31:57 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 5:32:20 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
13.03.2018 5:32:20 | NumberFields@home | Backing off 03:02:34 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 5:32:20 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0: connect() failed
13.03.2018 5:32:20 | NumberFields@home | Backing off 04:07:43 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 5:32:21 | | Project communication failed: attempting access to reference site
13.03.2018 5:32:23 | | Internet access OK - project servers may be temporarily down.
13.03.2018 10:02:35 | | Contacting account manager at http://www.grcpool.com/
13.03.2018 10:02:39 | | Account manager: DrugDiscovery@Home is not in your pool account, but was found in your client. Universe@Home is not in your pool account, but was found in your client.
13.03.2018 10:55:53 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 10:55:53 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 10:56:16 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
13.03.2018 10:56:16 | NumberFields@home | Backing off 05:56:49 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 10:56:16 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0: connect() failed
13.03.2018 10:56:16 | NumberFields@home | Backing off 04:38:34 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 10:56:17 | | Project communication failed: attempting access to reference site
13.03.2018 10:56:20 | | Internet access OK - project servers may be temporarily down.
13.03.2018 16:14:58 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 16:14:59 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0
13.03.2018 16:15:22 | | Project communication failed: attempting access to reference site
13.03.2018 16:15:22 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0: connect() failed
13.03.2018 16:15:22 | NumberFields@home | Backing off 05:31:25 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 16:15:22 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0: connect() failed
13.03.2018 16:15:22 | NumberFields@home | Backing off 02:55:26 on upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0
13.03.2018 16:15:23 | | Internet access OK - project servers may be temporarily down.
13.03.2018 17:25:27 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 17:25:27 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 17:25:50 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
13.03.2018 17:25:50 | NumberFields@home | Backing off 04:18:28 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 17:25:50 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0: connect() failed
13.03.2018 17:25:50 | NumberFields@home | Backing off 04:43:47 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
13.03.2018 17:25:51 | | Project communication failed: attempting access to reference site
13.03.2018 17:25:53 | | Internet access OK - project servers may be temporarily down.
13.03.2018 17:37:13 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 17:37:13 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0
13.03.2018 17:37:35 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0: connect() failed
13.03.2018 17:37:35 | NumberFields@home | Backing off 04:17:40 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
13.03.2018 17:37:35 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0: connect() failed
13.03.2018 17:37:35 | NumberFields@home | Backing off 04:19:36 on upload of wu_sf4_DS-13x271-1_Grp28743of120193_0_r2082510547_0
13.03.2018 17:37:35 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp28707of120193_0_r307572109_0
13.03.2018 17:37:35 | NumberFields@home | Started upload of wu_sf4_DS-13x271-1_Grp28824of120193_0_r2053227961_0
13.03.2018 17:37:36 | | Project communication failed: attempting access to reference site
13.03.2018 17:37:38 | | Internet access OK - project servers may be temporarily down.
13.03.2018 17:37:57 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp28707of120193_0_r307572109_0: connect() failed
13.03.2018 17:37:57 | NumberFields@home | Backing off 00:10:24 on upload of wu_sf4_DS-13x271-1_Grp28707of120193_0_r307572109_0
13.03.2018 17:37:57 | NumberFields@home | Temporarily failed upload of wu_sf4_DS-13x271-1_Grp28824of120193_0_r2053227961_0: connect() failed
13.03.2018 17:37:57 | NumberFields@home | Backing off 00:06:20 on upload of wu_sf4_DS-13x271-1_Grp28824of120193_0_r2053227961_0
13.03.2018 17:37:58 | | Project communication failed: attempting access to reference site
13.03.2018 17:37:59 | | Internet access OK - project servers may be temporarily down.
And it is the same for all 18 tasks.
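A log this long is easier to reason about when condensed. The sketch below counts the "Backing off" entries per result file and records the longest backoff for each; it relies only on the message wording visible in the log above, and the function name is invented for illustration.

```python
import re
from collections import defaultdict
from datetime import timedelta

BACKOFF_RE = re.compile(r"Backing off (\d+):(\d+):(\d+) on upload of (\S+)")

def summarize_backoffs(log_text: str):
    """Map each result file to (number of backoffs, longest backoff)."""
    summary = defaultdict(lambda: [0, timedelta(0)])
    for hours, minutes, seconds, name in BACKOFF_RE.findall(log_text):
        wait = timedelta(hours=int(hours), minutes=int(minutes), seconds=int(seconds))
        entry = summary[name]
        entry[0] += 1
        entry[1] = max(entry[1], wait)
    return {name: tuple(entry) for name, entry in summary.items()}

# Three entries copied from the log above.
sample_log = """
12.03.2018 16:01:03 | NumberFields@home | Backing off 04:35:43 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
12.03.2018 16:01:03 | NumberFields@home | Backing off 05:37:15 on upload of wu_sf4_DS-13x271-1_Grp26963of120193_1_r1272416287_0
12.03.2018 23:58:28 | NumberFields@home | Backing off 03:03:16 on upload of wu_sf4_DS-13x271-1_Grp20482of120193_0_r1285223353_0
"""
summary = summarize_backoffs(sample_log)
for name, (count, longest) in summary.items():
    print(name, count, longest)
```

Because the pattern matches on the message text alone, it works whether the log is one entry per line or pasted as a single run of text.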

Eric Driver	Message 1991 - Posted: 13 Mar 2018, 21:53:03 UTC - in response to Message 1990.
I don't have access to the server right now. I will look into this when I get home later.

Eric Driver	Message 1992 - Posted: 14 Mar 2018, 5:22:30 UTC - in response to Message 1991.
Igor, I'm not sure how you have BOINC set up, but something looks strange. Those WUs in your post were successfully returned earlier today by user ID 77798 with username grcpool.com-3. It also turns out that both you (ID 81451) and grcpool.com-3 have the same IP address. Maybe the log you posted was for the grcpool.com-3 user and the connection problem has resolved itself? If that posting was for the Igor account, then your two accounts got their wires crossed.