Deadline too short ?

Message boards : Number crunching : Deadline too short ?
JohnMD
Joined: 21 Oct 11
Posts: 4
Credit: 1,305,128
RAC: 1,682
Message 255 - Posted: 22 Oct 2011, 18:02:07 UTC

After 3 hours 36 minutes, BOINC Manager says http://stat.la.asu.edu/NumberFields/workunit.php?wuid=171455 is 5.162% finished. That means another 66 hours of processing, and there are only 63 hours until the deadline.
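
The 66-hour figure above is a straight-line extrapolation from the progress bar. A minimal sketch of that arithmetic, reading the reported 5,162% as 5.162% (decimal comma); the function name is illustrative and not part of BOINC:

```python
def remaining_hours(elapsed_hours: float, percent_done: float) -> float:
    """Linear extrapolation: assumes every percent takes the same time."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    total_hours = elapsed_hours * 100.0 / percent_done
    return total_hours - elapsed_hours


elapsed = 3 + 36 / 60                    # 3 h 36 min, from the post
left = remaining_hours(elapsed, 5.162)   # assumed reading of "5,162%"
hours_to_deadline = 63                   # from the post

print(f"{left:.1f} hours remaining, {hours_to_deadline} hours to deadline")
```

This only holds if progress is linear in time, which is not guaranteed for this application.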
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,680,778
RAC: 287,801
Message 258 - Posted: 23 Oct 2011, 19:23:43 UTC - in response to Message 255.  

The grace period is set to 150 hours, so you should be fine.

Also, the search time is not uniform. The percent complete is the percentage of the way through the loop, but each pass through the loop takes a different amount of time. So it could actually finish earlier than expected.
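
To illustrate why a loop-based percentage can mislead, here is a small sketch with made-up per-pass run times (the numbers are hypothetical, not measured from any work unit). The progress bar advances by one equal step per pass, so when early passes are the slow ones, the extrapolated total starts high and falls as the task proceeds:

```python
from itertools import accumulate

# Hypothetical hours spent in each pass of the search loop.
pass_hours = [10.0, 8.0, 5.0, 2.0, 1.0]
n_passes = len(pass_hours)
actual_total = sum(pass_hours)

naive_totals = []
for done, elapsed in enumerate(accumulate(pass_hours), start=1):
    # Progress bar shows done / n_passes; extrapolate linearly from it.
    naive_total = elapsed * n_passes / done
    naive_totals.append(naive_total)
    print(f"{100 * done // n_passes:3d}% shown, {elapsed:4.1f} h elapsed, "
          f"extrapolated total {naive_total:5.2f} h")

print(f"actual total: {actual_total} h")
```

With the opposite ordering (fast passes first), the same extrapolation would underestimate instead, which matches an estimate that keeps rising.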
JohnMD
Joined: 21 Oct 11
Posts: 4
Credit: 1,305,128
RAC: 1,682
Message 263 - Posted: 24 Oct 2011, 16:55:15 UTC - in response to Message 258.  

After about 13 hours, BOINC Manager said that http://stat.la.asu.edu/NumberFields/workunit.php?wuid=171455 was about 13% finished, so the total time was going to be about 100 hours. With only 53 hours left for the remaining 87 hours of work, and the estimate constantly rising, I cancelled.
The next cruncher on this WU got "Error while computing", as I have also experienced on these long WUs (e.g. after 43 hours):
http://stat.la.asu.edu/NumberFields/workunit.php?wuid=170828
Is there anything we can do about these 'alpha teething troubles'?
Dagorath
Joined: 2 Sep 11
Posts: 57
Credit: 1,274,345
RAC: 0
Message 264 - Posted: 24 Oct 2011, 21:02:51 UTC - in response to Message 263.  
Last modified: 24 Oct 2011, 21:10:05 UTC

I think there is quite a bit that can be done. When Eric posted this message a few days ago, I knew that what has now happened would happen. Most crunchers probably don't know the grace period was extended, and most probably don't even know what the grace period is. Most crunchers don't have time to read forums regularly or learn much about BOINC. What they know is what they see in front of them. JohnMD saw what he saw, did a little math and came to the logical conclusion that his task wasn't going to return before the deadline. Now, if the news about the grace period extension had been put in the news section on the front page, he might have spotted it, but most people likely would not. Anyway, much to his credit, he asked a question in the forum, but the answer came too late; he had already aborted the task. I likely would have done the same thing in his situation.

When a batch of WUs is known to have tasks that can run 100+ hours, the real deadline should be 100+ hours plus 24 hours of buffer time, because it can take the scheduler 24 hours (worst-case scenario) to report a completed task. That way volunteers can see the deadline and make an informed decision. Extending the grace period to make up for a too-short deadline is not wise, as we have seen. You can bet your next paycheque that others will be aborting long tasks for the same reason.
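
The rule of thumb above can be sketched as a small calculation. The helper name is made up for illustration; the 100-hour run time, 24-hour reporting buffer and 3-day deadline are the figures from this thread, and the result is in seconds on the assumption that the project sets its deadline that way (as with BOINC's `delay_bound`):

```python
SECONDS_PER_HOUR = 3600


def min_deadline_seconds(max_runtime_hours: float,
                         report_buffer_hours: float = 24.0) -> int:
    """Longest expected run time plus a buffer for reporting the result."""
    return int((max_runtime_hours + report_buffer_hours) * SECONDS_PER_HOUR)


current_deadline = 3 * 24 * SECONDS_PER_HOUR   # the 3-day deadline
needed = min_deadline_seconds(100)             # tasks that can run 100 hours

print(f"current: {current_deadline} s, needed: at least {needed} s")
```

For tasks that run longer than 100 hours, the same formula applies with the larger run time.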

The fact that the maximum elapsed time was also too short is another matter and should not be confused with the too short deadline.

Eric, if you're worried about the length of time it takes to finish a batch of WUs, then please consider using other available server-side mechanisms rather than leaving the deadline at 3 days when some tasks will take 100+ hours. For example, you can use task priority flags and the fast-reliable host settings to make sure resent tasks (tasks sent to replace tasks that return an error or miss the deadline) go to hosts that meet user-configurable criteria for task turnaround time and reliability (reliable = few compute errors and results that validate). I believe you can even specify that *all* tasks go to fast-reliables, but I'm not sure. You can read about it here.

BTW, even slow computers can be deemed fast-reliable if they keep a small cache, return very few errors and produce results that nearly always validate. Fast does not mean fast CPU in this case. Fast means short turnaround time, which is not hard to achieve if one keeps a small cache.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,680,778
RAC: 287,801
Message 265 - Posted: 24 Oct 2011, 21:15:19 UTC - in response to Message 263.  

Hi John,

In theory, the long work units should not pose a problem, as I set the rsc_fpops_bound to something like 500 hours (assuming a host with 2 GFLOPS). I believe the BOINC client is trying to be smart and is modifying this value based on historical run times. I think the very fast WUs are skewing the calculation, causing the timeouts on some machines. I explain this more in this thread:

http://stat.la.asu.edu/NumberFields/forum_thread.php?id=35

This is the best explanation I can come up with. It's only happening for a handful of users. WUs that time out on one machine run just fine on other machines. I did increase the fpops_bound by another factor of 10 (now it's an obscene 5000 hours).
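
For anyone wondering how a bound in floating-point operations turns into hours, here is a rough sketch. It assumes the elapsed-time limit is simply the bound divided by the host speed the client believes it has; the helper names and the inflated speed figure are hypothetical, chosen only to show how an overestimated speed could shrink the limit:

```python
SECONDS_PER_HOUR = 3600.0


def fpops_bound(hours: float, flops: float) -> float:
    """Operation count corresponding to `hours` on a host doing `flops`."""
    return hours * SECONDS_PER_HOUR * flops


def allowed_hours(bound: float, estimated_flops: float) -> float:
    """Run time allowed if the limit is bound / estimated speed (assumed)."""
    return bound / estimated_flops / SECONDS_PER_HOUR


bound = fpops_bound(500, 2e9)           # 500 hours at 2 GFLOPS
nominal = allowed_hours(bound, 2e9)     # speed estimate matches
skewed = allowed_hours(bound, 2e11)     # hypothetical 100x overestimate

print(f"bound = {bound:.1e} fpops")
print(f"allowed: {nominal:.0f} h nominal, {skewed:.0f} h with skewed estimate")
```

Under these assumptions, a speed estimate inflated by the very fast WUs would cut the 500-hour allowance to a few hours, which is consistent with timeouts appearing on only some machines.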
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,680,778
RAC: 287,801
Message 266 - Posted: 24 Oct 2011, 21:24:50 UTC - in response to Message 265.  

Thanks Dagorath. That's a good point. I will also increase the deadline.



Copyright © 2024 Arizona State University