Had to abort some WUs

Author	Message
Chris Granger Joined: 24 Aug 11 Posts: 4 Credit: 121,206 RAC: 0	Message 300 - Posted: 31 Oct 2011, 11:05:57 UTC I had a batch of work units that were taking a lot longer than usual, and noticed this in my event log: Mon 31 Oct 2011 04:39:13 AM CDT | NumberFields@home | [error] wu_12E10_SF161-1_Idx6_Grp41251of65023_0: negative FLOPs left -2.000000 System Monitor showed that the processes were sleeping, so I aborted them.

Eric Driver Project administrator Project developer Project tester Project scientist Joined: 8 Jul 11 Posts: 1388 Credit: 696,763,942 RAC: 827,745	Message 301 - Posted: 31 Oct 2011, 16:27:19 UTC - in response to Message 300. hmmm. I haven't seen that one before. Let me know if it continues to happen.

Chris Granger Joined: 24 Aug 11 Posts: 4 Credit: 121,206 RAC: 0	Message 304 - Posted: 31 Oct 2011, 18:54:19 UTC - in response to Message 301. Will do. I've suspended all but one of the remaining WUs, to see if that one will complete as expected.

Chris Granger Joined: 24 Aug 11 Posts: 4 Credit: 121,206 RAC: 0	Message 305 - Posted: 1 Nov 2011, 3:28:04 UTC Things seem to be working normally now. I'm not sure what the problem was. My latest batch of WUs completed and validated.

Eric Driver Project administrator Project developer Project tester Project scientist Joined: 8 Jul 11 Posts: 1388 Credit: 696,763,942 RAC: 827,745	Message 306 - Posted: 1 Nov 2011, 8:44:10 UTC - in response to Message 305. Good to hear that.

Greg Tucker Project administrator Project developer Project tester Joined: 8 Jul 11 Posts: 46 Credit: 7,144,042 RAC: 0	Message 311 - Posted: 2 Nov 2011, 15:46:15 UTC - in response to Message 310. wu_12E10_SF161-1_Idx3_Grp57438of59290 getting stuck. aborted after 70 hours Looks like this wu has some problems. http://numberfields.asu.edu/NumberFields/result.php?resultid=258591

Richard Haselgrove Joined: 28 Oct 11 Posts: 181 Credit: 300,550,013 RAC: 233,587	Message 312 - Posted: 2 Nov 2011, 16:12:29 UTC - in response to Message 311. Looks even worse like this: http://numberfields.asu.edu/NumberFields/workunit.php?wuid=172008 :-(

frankhagen Joined: 19 Aug 11 Posts: 76 Credit: 2,002,860 RAC: 0	Message 313 - Posted: 2 Nov 2011, 17:19:52 UTC - in response to Message 312. It's probably not a bright idea to run ancient P4s on a project where you might hit heavy stuff like that.

Dagorath Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0	Message 314 - Posted: 2 Nov 2011, 19:45:27 UTC - in response to Message 312. One of the sad things about that WU is that it bounced around from one slow computer to another before ending up on a host with a fairly fast processor and short turnaround time. With the possibility of 100-hour run times and 3-day deadlines, it might be wise to start using the "issue task resends to fast-reliable hosts" mechanism but not reduce the deadline for resends. It might even make sense to extend the deadline for resends, because the initial deadline is rather short and resends seem to happen either because the user aborts a task that has no hope of finishing on time or because the task simply times out. Unfortunately the server documentation pages are unavailable at this time. I'll post a link later when the pages are back online. BOINC FAQ Service Official BOINC wiki Installing BOINC on Linux

Eric Driver Project administrator Project developer Project tester Project scientist Joined: 8 Jul 11 Posts: 1388 Credit: 696,763,942 RAC: 827,745	Message 317 - Posted: 3 Nov 2011, 15:59:02 UTC - in response to Message 314. I am currently out of town and have limited access to the internet. When I return I will look into the "resend to fast-reliable hosts" mechanism, which sounds like a good idea. Unless Greg wants to take a stab at it while I am gone... Interesting that this seems to be the only WU having the problem out of almost 60,000 for that set. Of course there were other slow ones, but they managed to make it through and get assimilated without drawing any attention.

Greg Tucker Project administrator Project developer Project tester Joined: 8 Jul 11 Posts: 46 Credit: 7,144,042 RAC: 0	Message 318 - Posted: 3 Nov 2011, 16:50:11 UTC - in response to Message 317. I'll try to hold down the fort, Eric. I'll look into it tonight.

Dagorath Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0	Message 319 - Posted: 3 Nov 2011, 18:14:09 UTC - in response to Message 318. Greg, here are the docs regarding resends (retries). I would suggest setting <reliable_reduced_delay_bound> to 1 rather than less than 1. A value less than 1 reduces the deadline on the resend to something less than the normal 3-day deadline. If the task is being resent because it's a 100-hour task that missed its deadline, it might not be wise to reduce the deadline, even if it will be sent to a fast-reliable host. You might even consider setting <reliable_reduced_delay_bound> to 1.33 to increase the resend's deadline from 3 days to 3.99.
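
For anyone following along, the deadline arithmetic in the post above can be sketched in a few lines of Python. This is only an illustration of the reading given in the post, namely that <reliable_reduced_delay_bound> acts as a plain multiplier on the normal deadline. The function name is made up for the example and is not part of BOINC.

```python
def resend_deadline_days(base_deadline_days: float,
                         reliable_reduced_delay_bound: float) -> float:
    """Deadline of a resend, ASSUMING the option simply scales the
    normal deadline (the interpretation given in the post above).
    Hypothetical helper, not BOINC server code."""
    return base_deadline_days * reliable_reduced_delay_bound


if __name__ == "__main__":
    normal_deadline = 3  # days, the project's normal deadline per the thread
    for factor in (0.5, 1.0, 1.33):
        days = resend_deadline_days(normal_deadline, factor)
        print(f"factor {factor}: resend deadline {days:.2f} days")
```

A factor below 1 shortens the resend deadline, 1 leaves it at 3 days, and 1.33 stretches it to about 3.99 days, which matters when a task can run for 100 hours.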

Greg Tucker Project administrator Project developer Project tester Joined: 8 Jul 11 Posts: 46 Credit: 7,144,042 RAC: 0	Message 320 - Posted: 4 Nov 2011, 4:49:56 UTC - in response to Message 319. Thanks, this looks like a reasonable thing to do. I'm a little confused by the documentation though. What would I put for <reliable_on_priority> to enable: 1? If so I will add the following. <reliable_on_priority>1</reliable_on_priority> <reliable_reduced_delay_bound>1.33</reliable_reduced_delay_bound>

Dagorath Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0	Message 321 - Posted: 4 Nov 2011, 8:12:15 UTC - in response to Message 320. "Thanks, this looks like a reasonable thing to do. I'm a little confused by the documentation though. What would I put for <reliable_on_priority> to enable: 1?" I think the number depends on the base priority, which I assume would be the priority assigned to all the WUs in any given batch of WUs. Exactly where (in which script) the base priority is assigned I don't know. Let's assume base priority is 0. If you set <reliable_on_priority> = 0 then all tasks in the batch would be sent only to fast-reliable hosts, which would prevent slow hosts from receiving any work from that batch, and that is not what you want. So you would set <reliable_on_priority> to at least 1. If it is set to 1, then you need to set <reliable_priority_on_over> to at least 1, because when a task fails, the value of <reliable_priority_on_over> is added to the base priority. Suppose base priority is 10. If you set <reliable_priority_on_over> = 4, then when a task fails the resend would be assigned priority 14. In that case you would set <reliable_on_priority> greater than 10 but not greater than 14. Mind you, I have never configured this mechanism before; this is just my interpretation of the docs. To test whether your configuration has the desired effect, just abort a task and watch the resend to see if it goes to a host that meets the criteria for fast-reliable.
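
The priority reasoning in the post above can be written out as a small Python sketch. It models only the interpretation described there (a resend gets base priority plus <reliable_priority_on_over>, and a task is restricted to fast-reliable hosts once its priority reaches <reliable_on_priority>). The function names are invented for the example, and none of this is taken from the BOINC scheduler source.

```python
def resend_priority(base_priority: int, reliable_priority_on_over: int) -> int:
    """Priority of a resend, ASSUMING the boost is added to the base
    priority when a task fails (interpretation from the post above)."""
    return base_priority + reliable_priority_on_over


def only_reliable_hosts(priority: int, reliable_on_priority: int) -> bool:
    """True if a task at this priority would be restricted to
    fast-reliable hosts, ASSUMING the threshold is inclusive."""
    return priority >= reliable_on_priority


if __name__ == "__main__":
    base = 10        # example base priority used in the post
    boost = 4        # <reliable_priority_on_over>
    threshold = 11   # <reliable_on_priority>: above 10, not above 14

    first_send = base
    resend = resend_priority(base, boost)

    print("first send restricted:", only_reliable_hosts(first_send, threshold))
    print("resend restricted:", only_reliable_hosts(resend, threshold))
```

With these example values the first send stays open to every host and only the resend (priority 14) is restricted. Setting the threshold equal to the base priority would restrict every task in the batch, which is the case the post warns against.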