Had to abort some WUs

Chris Granger

Joined: 24 Aug 11
Posts: 4
Credit: 121,206
RAC: 0
Message 300 - Posted: 31 Oct 2011, 11:05:57 UTC

I had a batch of work units that were taking a lot longer than usual, and noticed this in my event log:

Mon 31 Oct 2011 04:39:13 AM CDT | NumberFields@home | [error] wu_12E10_SF161-1_Idx6_Grp41251of65023_0: negative FLOPs left -2.000000

System Monitor showed that the processes were sleeping, so I aborted them.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,956,778
RAC: 289,625
Message 301 - Posted: 31 Oct 2011, 16:27:19 UTC - in response to Message 300.  

Hmmm, I haven't seen that one before. Let me know if it continues to happen.
Chris Granger

Joined: 24 Aug 11
Posts: 4
Credit: 121,206
RAC: 0
Message 304 - Posted: 31 Oct 2011, 18:54:19 UTC - in response to Message 301.  

Will do. I've suspended all but one of the remaining WUs, to see if that one will complete as expected.
Chris Granger

Joined: 24 Aug 11
Posts: 4
Credit: 121,206
RAC: 0
Message 305 - Posted: 1 Nov 2011, 3:28:04 UTC

Things seem to be working normally now. I'm not sure what the problem was. My latest batch of WUs completed and validated.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,956,778
RAC: 289,625
Message 306 - Posted: 1 Nov 2011, 8:44:10 UTC - in response to Message 305.  

Good to hear that.
Greg Tucker
Project administrator
Project developer
Project tester

Joined: 8 Jul 11
Posts: 46
Credit: 7,144,042
RAC: 0
Message 311 - Posted: 2 Nov 2011, 15:46:15 UTC - in response to Message 310.  

wu_12E10_SF161-1_Idx3_Grp57438of59290
getting stuck. aborted after 70 hours


Looks like this wu has some problems.
http://numberfields.asu.edu/NumberFields/result.php?resultid=258591

Richard Haselgrove

Joined: 28 Oct 11
Posts: 179
Credit: 220,483,122
RAC: 128,944
Message 312 - Posted: 2 Nov 2011, 16:12:29 UTC - in response to Message 311.  

Looks even worse like this:

http://numberfields.asu.edu/NumberFields/workunit.php?wuid=172008

:-(
frankhagen

Joined: 19 Aug 11
Posts: 76
Credit: 2,002,860
RAC: 0
Message 313 - Posted: 2 Nov 2011, 17:19:52 UTC - in response to Message 312.  

It's probably not a bright idea to run ancient P4s on a project where you might hit heavy stuff like that.
Dagorath
Joined: 2 Sep 11
Posts: 57
Credit: 1,274,345
RAC: 0
Message 314 - Posted: 2 Nov 2011, 19:45:27 UTC - in response to Message 312.  

One of the sad things about that WU is that it bounced from one slow computer to another before ending up on a host with a fairly fast processor and short turnaround time. With the possibility of 100-hour run times and 3-day deadlines, it might be wise to start using the "issue task resends to fast-reliable hosts" mechanism, but without reducing the deadline for resends. It might even make sense to extend the deadline for resends: the initial deadline is rather short, and resends seem to happen either because the user aborts a task that has no hope of finishing on time or because the task simply times out.

Unfortunately the server documentation pages are unavailable at this time. I'll post a link later when the pages are back online.

BOINC FAQ Service
Official BOINC wiki
Installing BOINC on Linux
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,956,778
RAC: 289,625
Message 317 - Posted: 3 Nov 2011, 15:59:02 UTC - in response to Message 314.  


I am currently out of town and have limited access to the internet. When I return I will look into the "resend to fast-reliable hosts" mechanism, which sounds like a good idea. Unless Greg wants to take a stab at it while I am gone...

Interesting that this seems to be the only WU having the problem out of almost 60,000 in that set. Of course there were other slow ones, but they managed to make it through and get assimilated without drawing any attention.
Greg Tucker
Project administrator
Project developer
Project tester

Joined: 8 Jul 11
Posts: 46
Credit: 7,144,042
RAC: 0
Message 318 - Posted: 3 Nov 2011, 16:50:11 UTC - in response to Message 317.  


I'll try to hold down the fort, Eric. I'll look into it tonight.
Dagorath
Joined: 2 Sep 11
Posts: 57
Credit: 1,274,345
RAC: 0
Message 319 - Posted: 3 Nov 2011, 18:14:09 UTC - in response to Message 318.  

Greg,

Here are the docs regarding resends (retries).

I would suggest setting <reliable_reduced_delay_bound> to 1 rather than to a value less than 1. A value less than 1 reduces the deadline on the resend to something shorter than the normal 3-day deadline. If the task is being resent because it's a 100-hour task that missed its deadline, it might not be wise to reduce the deadline, even if the resend goes to a fast-reliable host. You might even consider setting <reliable_reduced_delay_bound> to 1.33 to increase the resend's deadline from 3 days to 3.99.
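
The arithmetic can be sketched roughly as follows. This is only an illustration of the multiplication described above, assuming the resend deadline is simply the normal deadline times the factor; the function name is hypothetical and is not part of BOINC.

```python
def resend_deadline_days(normal_deadline_days, reduced_delay_bound):
    """Assumed model: resend deadline = normal deadline * factor."""
    return normal_deadline_days * reduced_delay_bound


# Compare a reduced, an unchanged, and an extended resend deadline
# for the project's normal 3-day deadline.
for factor in (0.5, 1.0, 1.33):
    days = resend_deadline_days(3.0, factor)
    print(f"factor {factor}: {days:.2f} days ({days * 24:.1f} hours)")
```

Under this assumption a factor of 1.33 gives 3.99 days, while a factor below 1 shortens the resend's deadline.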

Greg Tucker
Project administrator
Project developer
Project tester

Joined: 8 Jul 11
Posts: 46
Credit: 7,144,042
RAC: 0
Message 320 - Posted: 4 Nov 2011, 4:49:56 UTC - in response to Message 319.  

Thanks, this looks like a reasonable thing to do. I'm a little confused by the documentation, though. What would I put for <reliable_on_priority> to enable: 1? If so, I will add the following.

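<reliable_on_priority>1</reliable_on_priority>
<reliable_reduced_delay_bound>1.33</reliable_reduced_delay_bound>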

Dagorath
Joined: 2 Sep 11
Posts: 57
Credit: 1,274,345
RAC: 0
Message 321 - Posted: 4 Nov 2011, 8:12:15 UTC - in response to Message 320.  

Thanks, this looks like a reasonable thing to do. I'm a little confused by the documentation though. What would I put for <reliable_on_priority> to enable: 1?


I think the number depends on the base priority, which I assume is the priority assigned to all the WUs in a given batch. Exactly where (in which script) the base priority is assigned, I don't know.

Let's assume the base priority is 0. If you set <reliable_on_priority> = 0, then all tasks in the batch would be sent only to fast-reliable hosts. That would prevent slow hosts from receiving any work from that batch, which is not what you want. So you would set <reliable_on_priority> to at least 1. If it is set to 1, then you need to set <reliable_priority_on_over> to at least 1, because when a task fails, the value of <reliable_priority_on_over> is added to the base priority.

Suppose base priority is 10. If you set <reliable_priority_on_over> = 4 then when a task fails the resend would be assigned priority 14. In that case you would set <reliable_on_priority> greater than 10 but not greater than 14.
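
That reading of the docs can be sketched like this. It is a minimal model of the interpretation above, not the scheduler's actual code: the three option names come from this thread, while the `<config>` wrapper, the example values, and the helper functions are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical config fragment: option names are from the thread,
# the wrapper element and the values are illustrative assumptions.
CONFIG = """
<config>
  <reliable_on_priority>1</reliable_on_priority>
  <reliable_priority_on_over>1</reliable_priority_on_over>
  <reliable_reduced_delay_bound>1.33</reliable_reduced_delay_bound>
</config>
"""


def load_options(xml_text):
    """Read each option in the fragment as a float, keyed by tag name."""
    root = ET.fromstring(xml_text)
    return {child.tag: float(child.text) for child in root}


def resend_priority(base_priority, options):
    """Assumed rule: a failed task's resend gets base + reliable_priority_on_over."""
    return base_priority + options["reliable_priority_on_over"]


def needs_reliable_host(priority, options):
    """Assumed rule: tasks at or above reliable_on_priority go to fast-reliable hosts only."""
    return priority >= options["reliable_on_priority"]


options = load_options(CONFIG)
base = 0  # assumed base priority of the batch
print("first issue restricted:", needs_reliable_host(base, options))
print("resend restricted:", needs_reliable_host(resend_priority(base, options), options))
```

With a base priority of 0 and both options set to 1, only the resend is restricted to fast-reliable hosts in this model. Setting <reliable_on_priority> to 0 would restrict the first issue as well.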

Mind you, I have never configured this mechanism before; this is just my interpretation of the docs. To test whether your configuration has the desired effect, just abort a task and watch the resend to see if it goes to a host that meets the criteria for fast-reliable.



Copyright © 2024 Arizona State University