Deadline too short ?
Author | Message |
---|---|
Joined: 21 Oct 11 Posts: 4 Credit: 1,802,421 RAC: 1,497 |
After 3 hours 36 minutes, BOINC Manager says http://stat.la.asu.edu/NumberFields/workunit.php?wuid=171455 is 5.162% finished. That means another 66 hours of processing, and there are only 63 hours to the deadline. |
Joined: 8 Jul 11 Posts: 1366 Credit: 607,838,917 RAC: 668,000 |
Quote: "After 3 hours 36 minutes, boinc manager says http://stat.la.asu.edu/NumberFields/workunit.php?wuid=171455 is 5.162% finished. That means another 66 hours processing and there's only 63 to the deadline." The grace period is set to 150 hours, so you should be fine. Also, the search time is not uniform. The percent complete measures how far through the loop the search is, but each pass through the loop takes a different amount of time, so the task could actually finish earlier than the estimate suggests. |
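For anyone who wants to redo the arithmetic from the posts above, here is a rough sketch. It assumes progress is linear (which, as noted, it is not for these searches) and that the server accepts results until the deadline plus the grace period. The function names are made up for illustration and are not part of BOINC.

```python
def remaining_hours(elapsed_hours: float, percent_done: float) -> float:
    """Linear extrapolation: assumes every percent takes the same time."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    total_hours = elapsed_hours * 100.0 / percent_done
    return total_hours - elapsed_hours


def meets_cutoff(remaining: float, hours_to_deadline: float,
                 grace_hours: float = 0.0) -> bool:
    """Assumption: results are accepted until deadline + grace period."""
    return remaining <= hours_to_deadline + grace_hours


# Figures from the first post: 3 h 36 min elapsed, 5.162% done, 63 h to deadline.
left = remaining_hours(3.6, 5.162)
print(round(left, 1))                            # 66.1
print(meets_cutoff(left, 63))                    # False
print(meets_cutoff(left, 63, grace_hours=150))   # True
```

The same function gives 87 hours remaining for a task that is 13% done after 13 hours. Since each loop pass takes a different amount of time, treat the result as a rough guide only.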
Joined: 21 Oct 11 Posts: 4 Credit: 1,802,421 RAC: 1,497 |
After about 13 hours, BOINC Manager said that http://stat.la.asu.edu/NumberFields/workunit.php?wuid=171455 was about 13% finished, so the total time was going to be about 100 hours. With only 53 hours left for the remaining 87 hours of work, and the estimate constantly rising, I cancelled. The next cruncher on this WU got "Error while computing", as I have also experienced on these long WUs (e.g. after 43 hours: http://stat.la.asu.edu/NumberFields/workunit.php?wuid=170828). Is there anything we can do about these 'alpha teething troubles'? |
Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0 |
I think there is quite a bit that can be done. When Eric posted this message a few days ago, I knew this would happen. Most crunchers probably don't know the grace period was extended, and most probably don't even know what the grace period is. Most crunchers don't have time to read forums regularly or learn much about BOINC. What they know is what they see in front of them. JohnMD saw what he saw, did a little math and came to the logical conclusion that his task wasn't going to return before the deadline. If the news about the grace period extension had been put in the news section on the front page, he might have spotted it, but most people likely would not. Anyway, much to his credit, he asked a question in the forum, but the answer came too late: he had already aborted the task. I likely would have done the same thing in his situation. When a batch of WUs is known to have tasks that can run 100+ hours, the real deadline should be 100+ hours plus 24 hours of buffer time, because it can take the scheduler 24 hours (worst-case scenario) to report a completed task. That way volunteers can see the deadline and make an informed decision. Extending the grace period to make up for a too-short deadline is not wise, as we have seen. You can bet your next paycheque that others will be aborting long tasks for the same reason. The fact that the maximum elapsed time was also too short is another matter and should not be confused with the too-short deadline. Eric, if you're worried about the length of time it takes to finish a batch of WUs, then please consider using other available server-side mechanisms rather than leaving the deadline at 3 days when some tasks will take 100+ hours.
For example, you can use task priority flags and the fast-reliable host settings to make sure resent tasks (tasks that replace ones that returned an error or missed the deadline) are sent to hosts that meet user-configurable criteria for task turnaround time and reliability (reliable = few compute errors, and results validate). I believe you can even specify that *all* tasks go to fast-reliables, but I'm not sure. You can read about it here. BTW, even slow computers can be deemed fast-reliable if they keep a small cache, return very few errors and produce results that nearly always validate. Fast does not mean fast CPU in this case. Fast means short turnaround time, which is not hard to achieve if one keeps a small cache. |
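The deadline rule suggested above (longest expected runtime plus 24 hours for reporting) can be written down as a small check. This is only an illustration of that rule of thumb, not anything the BOINC server does itself, and the function names are hypothetical.

```python
from datetime import timedelta

# Worst-case reporting delay cited in the post above.
REPORT_BUFFER = timedelta(hours=24)


def suggested_deadline(longest_runtime: timedelta,
                       report_buffer: timedelta = REPORT_BUFFER) -> timedelta:
    """Rule of thumb from the post: longest expected runtime plus a buffer."""
    return longest_runtime + report_buffer


def deadline_is_sufficient(deadline: timedelta,
                           longest_runtime: timedelta,
                           report_buffer: timedelta = REPORT_BUFFER) -> bool:
    """True if the visible deadline covers the longest task plus the buffer."""
    return deadline >= suggested_deadline(longest_runtime, report_buffer)


# A batch whose longest tasks run about 100 hours, with a 3-day deadline.
print(suggested_deadline(timedelta(hours=100)))                          # 5 days, 4:00:00
print(deadline_is_sufficient(timedelta(days=3), timedelta(hours=100)))   # False
```

With these numbers, a 3-day deadline falls 52 hours short of what the rule asks for.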
Joined: 8 Jul 11 Posts: 1366 Credit: 607,838,917 RAC: 668,000 |
Quote: "After about 13 hours, boinc manager said that http://stat.la.asu.edu/NumberFields/workunit.php?wuid=171455 was about 13% finished. So the total time was going to be about 100 hours. With only 53 hours left for the remaining 87 hours work, and the estimate constantly rising, I cancelled." Hi John, in theory the long work units should not pose a problem, as I set the rsc_fpops_bound to something like 500 hours (assuming a host with 2 GFLOPS). I believe the BOINC client is trying to be smart and is modifying this value based on historical run times. I think the very fast WUs are skewing the calculation, causing the time-out on some machines. I explain this more in this thread: http://stat.la.asu.edu/NumberFields/forum_thread.php?id=35 This is the best explanation I can come up with. It's only happening for a handful of users. WUs that time out on one machine run just fine on other machines. I did increase the fpops_bound by another factor of 10 (now it's an obscene 5000 hours). |
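To put numbers on the rsc_fpops_bound setting mentioned above, here is a rough sketch. It assumes the client's elapsed-time limit is the bound divided by its own estimate of how fast the host runs the application. The inflated 40 GFLOPS estimate is a made-up figure to illustrate the suspected skew from very fast WUs, not a measured value, and the function names are hypothetical.

```python
SECONDS_PER_HOUR = 3600


def fpops_bound(hours: float, assumed_flops: float) -> float:
    """Operation count equal to `hours` of work on a host of `assumed_flops`."""
    return hours * SECONDS_PER_HOUR * assumed_flops


def elapsed_limit_hours(bound: float, estimated_flops: float) -> float:
    """Assumption: the client stops a task once elapsed time exceeds
    bound / estimated_flops."""
    return bound / estimated_flops / SECONDS_PER_HOUR


# 500 hours on a 2 GFLOPS host, as described in the post.
bound = fpops_bound(500, 2e9)
print(bound)                                   # 3.6e+15
print(elapsed_limit_hours(bound, 2e9))         # 500.0

# Hypothetical: the speed estimate is inflated 20x by very fast WUs.
print(elapsed_limit_hours(bound, 40e9))        # 25.0

# After raising the bound by another factor of 10 (5000 hours at 2 GFLOPS).
print(elapsed_limit_hours(bound * 10, 40e9))   # 250.0
```

Under that assumption, an inflated speed estimate would explain why the same WU times out on one machine and runs fine on another, and why a larger bound gives more headroom.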
Joined: 8 Jul 11 Posts: 1366 Credit: 607,838,917 RAC: 668,000 |
Thanks Dagorath. That's a good point. I will also increase the deadline. |