Had to abort some WUs
Author | Message |
---|---|
Joined: 24 Aug 11 Posts: 4 Credit: 121,206 RAC: 0 |
I had a batch of work units that were taking a lot longer than usual, and noticed this in my event log: `Mon 31 Oct 2011 04:39:13 AM CDT \| NumberFields@home \| [error] wu_12E10_SF161-1_Idx6_Grp41251of65023_0: negative FLOPs left -2.000000` System Monitor showed that the processes were sleeping, so I aborted them. |
Joined: 8 Jul 11 Posts: 1366 Credit: 607,887,742 RAC: 669,566 |
hmmm. I haven't seen that one before. Let me know if it continues to happen. |
Joined: 24 Aug 11 Posts: 4 Credit: 121,206 RAC: 0 |
Will do. I've suspended all but one of the remaining WUs, to see if that one will complete as expected. |
Joined: 24 Aug 11 Posts: 4 Credit: 121,206 RAC: 0 |
Things seem to be working normally now. I'm not sure what the problem was. My latest batch of WUs completed and validated. |
Joined: 8 Jul 11 Posts: 1366 Credit: 607,887,742 RAC: 669,566 |
Good to hear that. |
Joined: 8 Jul 11 Posts: 46 Credit: 7,144,042 RAC: 0 |
wu_12E10_SF161-1_Idx3_Grp57438of59290 Looks like this WU has some problems. http://numberfields.asu.edu/NumberFields/result.php?resultid=258591 |
Joined: 28 Oct 11 Posts: 181 Credit: 277,426,388 RAC: 239,480 |
wu_12E10_SF161-1_Idx3_Grp57438of59290 Looks even worse like this: http://numberfields.asu.edu/NumberFields/workunit.php?wuid=172008 :-( |
Joined: 19 Aug 11 Posts: 76 Credit: 2,002,860 RAC: 0 |
It's probably not a bright idea to run ancient P4s on a project where you might hit heavy stuff like that. |
Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0 |
wu_12E10_SF161-1_Idx3_Grp57438of59290 One of the sad things about that WU is that it bounced from one slow computer to another before ending up on a host with a fairly fast processor and a short turnaround time. With the possibility of 100-hour run times and 3-day deadlines, it might be wise to start using the "issue task resends to fast-reliable hosts" mechanism, but without reducing the deadline for resends. It might even make sense to extend the deadline for resends: the initial deadline is rather short, and resends seem to happen either because the user aborts a task that has no hope of finishing on time or because the task simply times out. Unfortunately, the server documentation pages are unavailable at this time. I'll post a link later when the pages are back online. BOINC FAQ Service Official BOINC wiki Installing BOINC on Linux |
Joined: 8 Jul 11 Posts: 1366 Credit: 607,887,742 RAC: 669,566 |
I am currently out of town and have limited access to the internet. When I return I will look into the "resend to fast-reliable hosts" mechanism, which sounds like a good idea. Unless Greg wants to take a stab at it while I am gone... Interesting that this seems to be the only wu having the problem out of almost 60000 for that set. Of course there were other slow ones, but they managed to make it through and get assimilated without drawing any attention. |
Joined: 8 Jul 11 Posts: 46 Credit: 7,144,042 RAC: 0 |
I'll try to hold down the fort, Eric. I'll look into it tonight. |
Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0 |
Greg, here are the docs regarding resends (retries). I would suggest setting <reliable_reduced_delay_bound> to 1 rather than to a value less than 1. A value less than 1 reduces the deadline on the resend to something shorter than the normal 3-day deadline. If the task is being resent because it's a 100-hour task that missed its deadline, it might not be wise to reduce the deadline, even if the resend goes to a fast-reliable host. You might even consider setting <reliable_reduced_delay_bound> to 1.33, which would increase the resend's deadline from 3 days to 3.99 days. |
Joined: 8 Jul 11 Posts: 46 Credit: 7,144,042 RAC: 0 |
Greg, Thanks, this looks like a reasonable thing to do. I'm a little confused by the documentation though. What would I put for <reliable_on_priority> to enable: 1? |
Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0 |
"Thanks, this looks like a reasonable thing to do. I'm a little confused by the documentation though. What would I put for <reliable_on_priority> to enable: 1?" I think the number depends on the base priority, which I assume is the priority assigned to all the WUs in any given batch. Exactly where (in which script) the base priority is assigned, I don't know. Let's assume the base priority is 0. If you set <reliable_on_priority> to 0, then all tasks in the batch would be sent only to fast-reliable hosts, which would prevent slow hosts from receiving any work from that batch, and that is not what you want. So you would set <reliable_on_priority> to at least 1. If it is set to 1, then you need to set <reliable_priority_on_over> to at least 1, because when a task fails, the value of <reliable_priority_on_over> is added to the base priority. Suppose instead that the base priority is 10. If you set <reliable_priority_on_over> to 4, then when a task fails the resend would be assigned priority 14. In that case you would set <reliable_on_priority> greater than 10 but not greater than 14. Mind you, I have never configured this mechanism before; this is just my interpretation of the docs. To test whether your configuration has the desired effect, just abort a task and watch the resend to see if it goes to a host that meets the criteria for fast-reliable. |
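
The priority and deadline arithmetic described in the last few posts can be sketched in a few lines of Python. This is only a toy model of the poster's interpretation of the docs, not the BOINC scheduler's actual code: the `ReliableConfig` class and the three helper functions are hypothetical names invented for illustration, and the `>=` comparison is an assumption taken from the "greater than 10 but not greater than 14" reasoning above.

```python
from dataclasses import dataclass


@dataclass
class ReliableConfig:
    """Hypothetical container mirroring the three options named in the thread."""
    reliable_on_priority: int            # threshold at which a task needs a reliable host
    reliable_priority_on_over: int       # amount added to the base priority on a resend
    reliable_reduced_delay_bound: float  # multiplier applied to the resend's deadline


def resend_priority(base_priority: int, cfg: ReliableConfig) -> int:
    """Priority of a resend: base priority plus the configured bump."""
    return base_priority + cfg.reliable_priority_on_over


def needs_reliable_host(priority: int, cfg: ReliableConfig) -> bool:
    """Assumed rule: a task at or above the threshold goes to fast-reliable hosts."""
    return priority >= cfg.reliable_on_priority


def resend_deadline_days(normal_days: float, cfg: ReliableConfig) -> float:
    """Deadline of a resend: the normal deadline scaled by the multiplier."""
    return normal_days * cfg.reliable_reduced_delay_bound


if __name__ == "__main__":
    # Values from the thread's example: base priority 10, bump of 4,
    # a threshold between 11 and 14, and a 1.33 deadline multiplier.
    cfg = ReliableConfig(
        reliable_on_priority=11,
        reliable_priority_on_over=4,
        reliable_reduced_delay_bound=1.33,
    )
    base = 10
    print("first send needs reliable host:", needs_reliable_host(base, cfg))
    print("resend priority:", resend_priority(base, cfg))
    print("resend needs reliable host:",
          needs_reliable_host(resend_priority(base, cfg), cfg))
    print("resend deadline (days):", round(resend_deadline_days(3, cfg), 2))
```

With a threshold of 10 or lower, the first send of every task in the batch would already count as needing a reliable host, which is the situation the post warns against. With a threshold above 14, the resend would never qualify.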