Need more Time!
Author | Message |
---|---|
Send message Joined: 2 Apr 16 Posts: 2 Credit: 3,389 RAC: 0 |
Hello, I need more time for a task. Is that possible? |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 509,701,206 RAC: 554,220 |
"Hello, I need more time for a task. Is that possible?" It looks like you only have 2 in progress and the earliest is due April 20th. Or are you referring to some other WU that has already timed out? |
Send message Joined: 23 Feb 13 Posts: 29 Credit: 21,480,710 RAC: 0 |
I have to make the same announcement due to the following two WUs: http://numberfields.asu.edu/NumberFields/workunit.php?wuid=15161036 http://numberfields.asu.edu/NumberFields/workunit.php?wuid=15161091 Both WUs are still running, 57% and 71% done after ~272 hours. |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 509,701,206 RAC: 554,220 |
I can't extend the deadline after it has already passed, but the server still accepts late results, so it doesn't hurt to let it continue. If someone manages to return it before you, let me know and I will get you the credit. |
Send message Joined: 23 Feb 13 Posts: 29 Credit: 21,480,710 RAC: 0 |
I got the first one (#15161091) of my "fat" WUs down. Unfortunately, it found nothing and might have blown the credit cap once again. # The search is complete. Stats: The second fat WU is still going at 57% and 322 hours of running time. |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 509,701,206 RAC: 554,220 |
"I got the first one (#15161091) of my 'fat' WUs down." Ok, thanks for the report. I took care of the credit cap issue. |
Send message Joined: 3 Sep 12 Posts: 2 Credit: 16,239,835 RAC: 0 |
I finished another long runner after 32 days, but it had already been completed by another user: wu_Qsqrt421_DS3x8_CV1_S815_N2_-194088_N1_805269to806765 Could you please have a look at it? |
Send message Joined: 28 Oct 11 Posts: 180 Credit: 246,429,494 RAC: 161,118 |
Easier to find as WU 12731776 |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 509,701,206 RAC: 554,220 |
"I finished another long runner after 32 days, but it was done before by another user:" I took care of that. Anyone else reading this... if you were denied credit on a long runner because of missing a deadline, let me know and I can adjust your credit. Thanks! |
Send message Joined: 2 Apr 16 Posts: 2 Credit: 3,389 RAC: 0 |
I am a bit disappointed. I have been running a task for 3 days now and it just does not finish. I am at 99.683% and I gain 0.100% per 12 hours. I am sorry, but your programming seems to be bad. |
Send message Joined: 11 Apr 15 Posts: 4 Credit: 5,058,905 RAC: 18 |
"[...] I am running a task for 3 days now and it just does not finish." Welcome to the project! This is not unusual with some work units. I have processed several such units, as have many other volunteers. The first 80% or so goes pretty fast, then progress slows greatly. I have spent a day or two on the last 1% of a unit. In almost all cases, the built-in deadline grace period is long enough to cover the completion of the work unit, and when it's not, the project administrators are very accommodating about awarding appropriate credit for lost time. I predict your work unit will complete normally on its own. If it goes on for several more days, you might want to mention it again here. |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 509,701,206 RAC: 554,220 |
"I am a bit disappointed, I am running a task for 3 days now and it just does not finish. I am at 99.683% and I get 0.100% per 12 hours. I am sorry, but your programming seems to be bad." It's not a matter of programming. The cases that are exhibiting the strange behavior are the Qsqrt421 cases. This is a special search that reuses the GetDecics app so that I don't have to write a new app. The GetDecics app was not originally designed for such a large discriminant base field (i.e. disc=421), and unfortunately the timing has strange side effects. The basic issue is this. If you plot timing as a function of loop index you get something that vaguely resembles a Gaussian. The lower discriminant base fields have a nice large variance, which gives something closer to a uniform timing distribution when you break the region into pieces (we use larger pieces over the tail of the Gaussian and smaller pieces around the mean). The problem with the large base field cases (Qsqrt421) is that the variance of the Gaussian gets very small and the timing plot starts to resemble a delta function. The code that breaks up the search space estimates the mean of the Gaussian, and when the Gaussian has a very small variance, the error in the estimate has a larger impact on timing. When the estimate of the mean is off, the end of a WU starts climbing up the delta function, and you end up spending 99% of your time in the last 1% of the search. There is nothing I can do about this, other than writing a new app for this specific case, and perhaps using trickle-up WUs (which somebody suggested in another thread). I increased the deadline and the grace period to help alleviate the problems, which is a much easier solution than writing a whole new app, especially since this "special" search will be coming to an end soon. If I have some time later I will try to post some plots of timing so you can see what I am talking about. |
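The timing argument in the post above can be checked with a small numerical sketch. This is not the project's partitioning code: it assumes the cost per loop index is exactly Gaussian, and the function names, the chunk boundaries, and the values of `mean` and `sigma` are all made up for illustration. It compares how much of a chunk's run time falls in the last 1% of its loop indices when the curve is wide versus when it is nearly a delta function.

```python
import math

def gaussian_cdf(x, mean, sigma):
    """Cumulative cost up to loop index x for a Gaussian-shaped cost curve."""
    # erfc keeps precision far out in the lower tail, where erf would round to -1.
    return 0.5 * math.erfc((mean - x) / (sigma * math.sqrt(2.0)))

def chunk_time(lo, hi, mean, sigma):
    """Share of the total search time spent on loop indices in [lo, hi]."""
    return gaussian_cdf(hi, mean, sigma) - gaussian_cdf(lo, mean, sigma)

def last_percent_share(lo, hi, mean, sigma):
    """Fraction of a chunk's run time spent in the last 1% of its indices."""
    cut = hi - 0.01 * (hi - lo)
    return chunk_time(cut, hi, mean, sigma) / chunk_time(lo, hi, mean, sigma)

# Hypothetical chunk of 1000 loop indices whose upper end stops a little
# short of the true peak (mean = 1030), e.g. because the mean was misestimated.
lo, hi = 0.0, 1000.0
wide = last_percent_share(lo, hi, mean=1030.0, sigma=400.0)   # large variance
narrow = last_percent_share(lo, hi, mean=1030.0, sigma=5.0)   # near delta function

print(f"wide curve:   {100 * wide:.1f}% of the time in the last 1% of indices")
print(f"narrow curve: {100 * narrow:.1f}% of the time in the last 1% of indices")
```

With the wide curve the last 1% of indices costs only a few percent of the chunk's time; with the narrow curve almost all of the time lands there, which matches the "99% of your time in the last 1% of the search" behavior described in the post.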
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 509,701,206 RAC: 554,220 |
Something else to be aware of... I have seen the progress meter go to 100.000% and the WU still continues processing for another few hours. I believe what is happening is that the progress is really 99.9995% and the client is rounding it up to 100. No need to worry that it's stuck; the WU will eventually finish. |
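The rounding effect suggested in the post above is easy to reproduce. The sketch below assumes a display that formats progress to three decimal places; `displayed_progress` is a hypothetical helper, not a BOINC function.

```python
def displayed_progress(fraction_done, decimals=3):
    """Format a fraction in [0, 1] as a percentage with a fixed number of decimals."""
    return f"{100.0 * fraction_done:.{decimals}f}%"

not_quite = displayed_progress(0.99999)     # true progress 99.999%
almost_done = displayed_progress(0.999996)  # true progress 99.9996%

print(not_quite)
print(almost_done)
```

Any true progress above 99.9995% is shown as 100.000% at this precision, even though work remains.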
Send message Joined: 28 Oct 11 Posts: 180 Credit: 246,429,494 RAC: 161,118 |
"Something else to be aware of..." It's even worse than that. I have a current task which has been running over 5 days, and is displaying 100%. (Don't worry, I've seen several like that, and most of them have completed already - it can take as long as it wants. I think this one has already been showing 100% for well over a day.) The key file to investigate is boinc_task_state.xml in the task's slot directory. It says: <active_task> <project_master_url>http://numberfields.asu.edu/NumberFields/</project_master_url> <result_name>wu_Qsqrt421_DS3x8_CV1_S815_N2_-194161_N1_805982_k2_-1_0</result_name> <checkpoint_cpu_time>421149.300000</checkpoint_cpu_time> <checkpoint_elapsed_time>499853.935949</checkpoint_elapsed_time> <fraction_done>0.000000</fraction_done> <peak_working_set_size>7725056</peak_working_set_size> <peak_swap_size>304140288</peak_swap_size> <peak_disk_usage>13256</peak_disk_usage> </active_task> <fraction_done> would normally be filled in by your application - I think you've said that the overheads of adding reporting at the very innermost loop would be too high, so you've left it a little further out. But that means that in this particular parameter space, all the progress comes from the BOINC client's attempt to reassure the user that all is well, by inventing its own pseudo-progress to report. By design, pseudo-progress tends asymptotically to a fraction of 1 (100%), but never reaches it. Because this task has run so far beyond its initial estimate (probably 7 hours on this machine), the asymptotic limit has become indistinguishable from 1 (to three decimal places). It's not the first time that BOINC coding has failed to cope transparently with extreme cases. The checkpoint (state) file for this task contains 0 -194161 805982 -1 91 3518541053 0 0 0 499851 and stderr's report on the Martinet search has reached Now starting the targeted Martinet search: N2_L = -194161. N2_U = -194161. N2 = -194161. N1_L = 805982. N1_U = 805982. N1 = 805982. k2 range: -1 => -1.
k2 = -1. k1 range: 76 => 136. k1 = 76. k1 = 77. k1 = 78. k1 = 79. k1 = 80. k1 = 81. k1 = 82. k1 = 83. k1 = 84. k1 = 85. k1 = 86. k1 = 87. k1 = 88. k1 = 89. k1 = 90. k1 = 91. if that helps you track down what it's up to. |
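The two observations in the post above (the app reports `fraction_done` as 0, yet the display shows 100%) can be sketched as follows. The XML is an excerpt of the state file quoted in the post. The `pseudo_progress` function is an assumed stand-in with the described shape (monotonic, asymptotic to 1); the exact formula used by the BOINC client is not given in this thread. The 7-hour estimate is the poster's own guess.

```python
import math
import xml.etree.ElementTree as ET

# Excerpt of boinc_task_state.xml as quoted in the post above.
STATE_XML = """<active_task>
    <project_master_url>http://numberfields.asu.edu/NumberFields/</project_master_url>
    <result_name>wu_Qsqrt421_DS3x8_CV1_S815_N2_-194161_N1_805982_k2_-1_0</result_name>
    <checkpoint_cpu_time>421149.300000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>499853.935949</checkpoint_elapsed_time>
    <fraction_done>0.000000</fraction_done>
</active_task>"""

task = ET.fromstring(STATE_XML)
reported = float(task.findtext("fraction_done"))
elapsed = float(task.findtext("checkpoint_elapsed_time"))

def pseudo_progress(elapsed_s, estimate_s):
    """Assumed stand-in: rises toward 1 but stays below it mathematically."""
    return 1.0 - math.exp(-elapsed_s / estimate_s)

estimate = 7 * 3600.0  # "probably 7 hours on this machine", per the post
shown = f"{100.0 * pseudo_progress(elapsed, estimate):.3f}%"

print("fraction_done reported by the app:", reported)
print("pseudo-progress shown to three decimals:", shown)
```

After roughly 20 times the estimated run time, the assumed curve is within a few parts per billion of 1, so a three-decimal display cannot distinguish it from 100%.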
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 509,701,206 RAC: 554,220 |
Richard, Thanks for that information; it helped me to track down the problem. My formula for fraction_done had one term that was the ratio of 2 integers - I needed to cast them to float, otherwise the ratio got truncated to 0. The issue was more noticeable on the WUs where the outer loops only process 1 value and the truncated term was the dominant term in the formula (all other terms go to zero). I originally thought things were working because the progress meter was going up; I didn't realize that BOINC was doing its own progress estimation. This also explains why my Linux machines, which use an older client, go directly from 0 to 100% on these types of WUs. This is not too critical, as it only affects the progress meter, but I will try to get updated apps out there soon. |
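The truncation described in the post above can be illustrated in a few lines. The app's real formula is not shown in this thread, so the term below is hypothetical; it uses the `k1` range (76 to 136) and the current `k1` value (91) from the stderr output quoted earlier. Python's `//` stands in for integer division as performed in C-like languages on nonnegative operands.

```python
def fraction_truncated(done, total):
    """Integer division, as an int / int expression behaves in C-like languages."""
    return done // total

def fraction_fixed(done, total):
    """Convert to floating point before dividing, as the post describes."""
    return float(done) / float(total)

# Hypothetical progress term built from the k1 loop in the quoted stderr.
k1_lo, k1_hi, k1 = 76, 136, 91
done = k1 - k1_lo            # k1 values already passed
total = k1_hi - k1_lo + 1    # k1 values in the range

print("truncated:", fraction_truncated(done, total))
print("fixed:    ", round(fraction_fixed(done, total), 4))
```

Because `done` is always smaller than `total` until the loop ends, the truncated term stays at 0 for the entire run, which is consistent with the `<fraction_done>0.000000</fraction_done>` seen in the state file.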
Send message Joined: 28 Oct 11 Posts: 180 Credit: 246,429,494 RAC: 161,118 |
Always glad to help. We tracked that one down just in time - task 16717469 has just finished and reported all by itself, taking the evidence with it. |
Send message Joined: 23 Feb 13 Posts: 29 Credit: 21,480,710 RAC: 0 |
The second fat WU has finally given up the ghost after 16 days of working time, but again with no noticeable outcome. Credit cap? # Inspected 2017069449 polynomials. |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 509,701,206 RAC: 554,220 |
"The second fat WU has finally given up the ghost after 16 days of working time, but again with no noticeable outcome. Credit cap?" Took care of the credit cap. The Qsqrt421 searches are looking for an extremely rare field, so most of the time nothing will be found, but many polynomials will be tested. |
Send message Joined: 23 Feb 13 Posts: 29 Credit: 21,480,710 RAC: 0 |
Is user rwild reading here? What about WU #15040289? Are you still on it or is it up to me? I'm participating in the PrimeGrid race over the next few days. I'll be back! |