Super long estimated times
Author | Message |
---|---|
Send message Joined: 1 Jul 12 Posts: 13 Credit: 2,099,843 RAC: 0 |
For the last day or so, my active tasks have seen their estimated times inflate to hundreds of hours. I'm used to seeing big leaps in progress %, but tasks that were previously estimated to take 11 hours (and did, more or less) are now estimated to take 138 hours. No biggie, tasks complete as usual, credit is as usual. Just thought I'd mention it in case it's an indicator of something else breaking. |
Send message Joined: 24 Jun 12 Posts: 5 Credit: 25,698,953 RAC: 0 |
Hi Steve, A modification was just made to the estimated FP operations parameter for the newest batch of work units. Are these high estimated times occurring in the Decics app? More than likely that modification is the cause, but Eric will be able to confirm. The modification was made to try to solve queue problems on some hosts that were not able to build a queue of tasks to complete. Out of curiosity, is your host able to maintain a queue of work units, or does it download as needed? ~Jack |
Send message Joined: 1 Jul 12 Posts: 13 Credit: 2,099,843 RAC: 0 |
Hey Jack, Thanks! These are in Decics. I received 150 WUs and noticed the behavior right after that, so I've not had the chance to see whether the queue is working. With much higher estimated times, I'd suggest that my host - based on how other projects seem to work, particularly PrimeGrid - will be less likely to maintain a queue (assuming I've got my understanding straight). If the client holds a full buffer based on inflated estimated times, then there will be fewer WUs in that buffer, and a much bigger hole in the buffer will need to open up before another WU is downloaded. Right? Also, I checked all my WUs and I see now that some have smaller estimates than I would expect (like seconds instead of hours). Confused? You will be, after the next installment.... |
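To make the buffer reasoning above concrete, here is a minimal sketch in Python. It assumes a deliberately simplified model in which the client fills a fixed buffer (in hours per core) using the per-task estimate; this is not BOINC's actual work-fetch logic, and the 24-hour buffer and 4 cores are hypothetical values chosen only for illustration.

```python
import math


def tasks_to_fill_buffer(buffer_hours: float, est_hours_per_task: float, cores: int = 1) -> int:
    """Number of tasks whose summed estimates cover the buffer on all cores.

    Simplified model (an assumption, not BOINC's real scheduler):
    total buffered work = buffer_hours * cores, split into equal-sized tasks.
    """
    return math.ceil(buffer_hours * cores / est_hours_per_task)


BUFFER_HOURS = 24.0  # hypothetical buffer setting
CORES = 4            # hypothetical host

# The two estimates come from the first post: 11 hours before, 138 hours after.
for est in (11.0, 138.0):
    n = tasks_to_fill_buffer(BUFFER_HOURS, est, CORES)
    print(f"estimate {est:>5.0f} h -> {n} task(s) fill the buffer")
```

Under these assumptions the inflated estimate shrinks the queue from several tasks to a single one, which is the effect described in the post.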
Send message Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
Hi Steve, Sorry, I probably should have posted something about this in the news items. As Jack mentioned, I did change the estimated FP ops to what should be a more realistic number. I think the old value was the equivalent of 50 hours per WU on a 2GHz computer (assuming 1 flop per clock cycle). Typical WUs on my 2GHz computer were averaging less than 5 hours, so I reduced the estimated flops by a factor of 10. The original problem we were seeing is that some users were not able to maintain a queue, primarily because at 50 hours per WU, the client thought it had enough work. The client is smart enough to maintain statistics, and over time will compensate for the inaccurate flop estimate. This is why I was unaware of the problem for so long: my clients had already compensated for it. After "fixing" the flop estimate, I noticed that my clients downloaded a ton of new WUs, and I think this is what you are also seeing. This is to be expected, as the client now thinks the new WUs will be 10 times as fast as the old ones. Over time the client should readjust to this new flop estimate. In fact, after several days my clients now have a more reasonable queue. Eric |
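The arithmetic in this post can be sketched as follows. The 2 GHz clock, the 1 flop per cycle assumption, and the 50-hour and 5-hour figures come from the post; the "learned correction" step is only a toy model of the client compensating over time, not BOINC's actual algorithm, and all function and variable names are illustrative.

```python
def estimated_hours(fpops_est: float, host_flops: float) -> float:
    """Runtime estimate in hours: estimated operations / host speed."""
    return fpops_est / host_flops / 3600.0


HOST_FLOPS = 2e9  # 2 GHz at 1 flop per clock cycle, as assumed in the post

old_fpops_est = 50 * 3600 * HOST_FLOPS  # old estimate, equivalent to 50 hours
new_fpops_est = old_fpops_est / 10      # reduced by a factor of 10

old_raw = estimated_hours(old_fpops_est, HOST_FLOPS)  # 50 hours
new_raw = estimated_hours(new_fpops_est, HOST_FLOPS)  # 5 hours

# Toy model (assumption): a client that has watched 50-hour estimates finish
# in about 5 hours has learned to scale estimates by actual / estimated.
learned_correction = 5.0 / old_raw

# Right after the change the stale correction is still applied, so the new
# tasks look ten times shorter than they are and the client fetches more.
new_corrected = new_raw * learned_correction

print(f"old raw estimate:        {old_raw:.1f} h")
print(f"new raw estimate:        {new_raw:.1f} h")
print(f"new, stale correction:   {new_corrected:.1f} h")
```

In this toy model the queue settles again once the learned correction drifts back toward 1, which matches the behavior reported after several days.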
Send message Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
Just one more thing to add. I am making the same adjustment to the bounded app. But this time, I will slowly reduce the estimated flops by factors of 2, over a period of several weeks. This should reduce the large spike of WUs in the client queue. |
Send message Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0 |
We rarely use the bounded property so a quick reduction doesn't hurt much. A tenfold decrease in the flops estimate, however, throws a big wobbly into the scheduler. On the other other hand... when are people gonna learn to keep a small cache <sigh> BOINC FAQ Service Official BOINC wiki Installing BOINC on Linux |
Send message Joined: 19 Aug 11 Posts: 76 Credit: 2,002,860 RAC: 0 |
You'd better go for 10% a day! Whatever you are doing there, take small steps, not big jumps. |
Send message Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0 |
Doh! My mistake. frankhagen is right, 10% per day or less. Eric, I thought you were talking about the <fpops_bound> (sp?) tasks property when you said "bounded app". |
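For comparison, the two schedules discussed here (halving the estimate in steps versus cutting it by 10% per day) can be sized with a short calculation. This is a sketch only: it assumes the goal is the same tenfold total reduction that was applied to the Decics app, and the helper name is hypothetical.

```python
import math


def steps_needed(total_factor: float, per_step_factor: float) -> int:
    """Smallest n such that per_step_factor ** n >= total_factor."""
    return math.ceil(math.log(total_factor) / math.log(per_step_factor))


TOTAL_REDUCTION = 10.0  # assumed target: same tenfold cut as for Decics

# Halving: each step divides the estimate by 2.
halving_steps = steps_needed(TOTAL_REDUCTION, 2.0)

# 10% per day: each step multiplies the estimate by 0.9, i.e. divides by 1/0.9.
daily_steps = steps_needed(TOTAL_REDUCTION, 1.0 / 0.9)

print(f"halving steps needed:    {halving_steps}")
print(f"10%-per-day steps needed: {daily_steps}")
```

Four full halvings would overshoot to a factor of 16, so the last halving step would have to be smaller. The 10% schedule takes about three weeks, which is in the same range as the "several weeks" proposed for the halving approach, but each individual jump the client sees is much smaller.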
Send message Joined: 25 Feb 13 Posts: 216 Credit: 9,899,302 RAC: 0 |
Hm, long WU, see my picture on DA: http://aurel51.deviantart.com/art/Ehh-yes-416655839 I think that WU needs some time. :) |
Send message Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
That does seem odd. Keep an eye on it and let me know when it finally finishes. Any chance it's been periodically interrupted? One explanation is that it keeps getting interrupted before it can checkpoint. Checkpointing is supposed to happen every few minutes, but sometimes it takes 20+ minutes to reach that part of the code. |
Send message Joined: 25 Feb 13 Posts: 216 Credit: 9,899,302 RAC: 0 |
Yeah, the WU has already been computed to 100%. More than 34 hours and 1,151.74 points. See: http://numberfields.asu.edu/NumberFields/result.php?resultid=5308617 |
Send message Joined: 25 Feb 13 Posts: 216 Credit: 9,899,302 RAC: 0 |
Well, not the right thread... see for yourself: I'm reporting two long-running tasks: one at 60% after 27 hours of work, the second at 40.911% after 54.5 hours. I'm keeping an eye on them. |
Send message Joined: 25 Feb 13 Posts: 216 Credit: 9,899,302 RAC: 0 |
I am two days past the deadline. 209310 needs some more days. ;) |
Send message Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
Don't worry, there is a grace period. That WU won't be reissued until later today. And you should be good as long as you return it before someone else does. It looks like your host is 3.3 GHz, so you should be fine as long as you are not over-committing your CPU. |
Send message Joined: 25 Feb 13 Posts: 216 Credit: 9,899,302 RAC: 0 |
Here are two long-running tasks: http://numberfields.asu.edu/NumberFields/result.php?resultid=10054642 17 hours CPU time http://numberfields.asu.edu/NumberFields/result.php?resultid=10012391 2 days and 18 hours CPU time! [3.6k points] |
Send message Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
As a test, I reran result #10012391, and it took only 19 hours, about 3.5 times faster than yours. Our clock speeds are comparable. The biggest difference is that you used the Windows version and I used the 64-bit Linux version. I am a little puzzled by this; I remember the Windows version being slower, but not 3.5 times slower. Maybe you were running something else simultaneously? |
Send message Joined: 25 Feb 13 Posts: 216 Credit: 9,899,302 RAC: 0 |
Maybe it was an application running in the background? [FiND@Home: the Vina application has trouble with CPU settings. If suspended, the app needs 10% CPU time per unit.] 3.5 times faster... that's a lot. In the next few weeks I need to buy a new computer; the main OS will be Linux. ;) [Debian or Ubuntu...] My mainboard is not working right; CPU and GPU are okay. Maybe I'll buy a used 2-year-old mainboard and make a local RasPi computing grid for it. ;) Now I'm going to catch the 1 million! :) |
Send message Joined: 14 Aug 11 Posts: 5 Credit: 10,096,779 RAC: 0 |
Something similar is going on in several of my Get Decics with Bounded Discriminant v3.00 WUs. The last one was showing 10:36:09 worked - not bad in itself, but only 26.233% complete, and that represented just a 0.1% increment in completion over its last 26 CPU minutes (i7-3770 @3.4GHz). ETC inflated by 63 minutes in the same time period. I normally wouldn't mind letting a WU crank for as long as it takes (I also participate in CPDN). This one I thought it best to abort, because keeping it around put the other WUs waiting in the queue at risk. That is the cumulative result of two other v3.00 WUs running to completion in the 25-30 CPU hr range in the last week - although some are much shorter - coupled with a lot of Decic Fields 1.02 WUs running 10-25 hrs lately. Not sure what can be done; maybe increase the due date for all WUs by some further percent until the tasks return to a more moderate part of the problem space. |
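As a rough check on the figures in this post, the recent progress rate can be extrapolated to the rest of the task. This sketch assumes progress continues at the most recent rate, which is only a guess given how unevenly these tasks advance; the function name is illustrative.

```python
def hours_remaining_at_recent_rate(percent_done: float,
                                   percent_gained: float,
                                   minutes_taken: float) -> float:
    """Extrapolate remaining hours from the most recent progress rate.

    Assumes (for illustration only) that the rest of the task advances
    at the same rate as the last observed interval.
    """
    remaining_percent = 100.0 - percent_done
    minutes_left = remaining_percent / percent_gained * minutes_taken
    return minutes_left / 60.0


# Figures from the post: 26.233% done, 0.1% gained in the last 26 CPU minutes.
remaining = hours_remaining_at_recent_rate(26.233, 0.1, 26.0)
print(f"about {remaining:.0f} CPU hours left at the recent rate")
```

At that rate the task would need on the order of 320 more CPU hours, which is consistent with the decision to abort it.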
Send message Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
Thanks for reporting. The current deadline is 7 days with a 3-day grace period, giving a total of 10 days. So I could raise it a little bit, but I wouldn't want to go much beyond 10 days. When I find some time I will analyze the data from the latest subfield (which recently completed). Several weeks ago I did a spot check of the long-running results; the vast majority were from the Windows version of the app. Of course this could be because the vast majority of users use Windows, but it's something else to look into. |
Send message Joined: 28 Oct 11 Posts: 180 Credit: 253,378,167 RAC: 178,862 |
This sounds like a good moment to mention WU 9446590. The first two copies both exceeded the deadline, and I have the third. It's been running slowly but steadily for 114 hours so far, and has reached 37.317%. I get the impression that it sometimes moves on by 1% or 2% quite quickly, but usually progress moves on in 0.001% increments. Other tasks have come and gone on the other CPU cores while this one has been running, so I don't think it's an issue with the computer it's running on. The Martinet search report for the task which has finished shows 23 cases, from "a5 = -8 + 6w" to "a5 = 14 + 6w". Mine has reached case 0, which seems about right for 37.319% - it's moved on while I've been typing! I'll keep it running to see what happens, although I'm expecting it will far exceed the local deadline tomorrow night, and also exceed the grace period on 3 March. |
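The expectation that this task will overrun can be checked with a simple linear projection. The 114 hours and 37.317% come from this post, and the 7-day deadline plus 3-day grace period from the reply above; the projection assumes the average rate so far holds for the rest of the task, and it treats run hours as if they were calendar hours, so it is only a loose comparison.

```python
def projected_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear projection of total run time from cumulative progress."""
    return elapsed_hours / fraction_done


ELAPSED_HOURS = 114.0
FRACTION_DONE = 0.37317         # 37.317% as reported
DEADLINE_DAYS = 7               # stated deadline
GRACE_DAYS = 3                  # stated grace period

total = projected_total_hours(ELAPSED_HOURS, FRACTION_DONE)
limit_hours = (DEADLINE_DAYS + GRACE_DAYS) * 24

print(f"projected total: {total:.0f} h, limit: {limit_hours} h")
print("overruns the grace period" if total > limit_hours else "fits within the grace period")
```

Even running around the clock, a projected total of just over 300 hours is well beyond the 240 hours available.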