Super long estimated times

Profile Steve Hawker*

Joined: 1 Jul 12
Posts: 13
Credit: 2,099,843
RAC: 0
Message 775 - Posted: 22 Jan 2013, 0:03:12 UTC

Over the last day or so, my active tasks have seen their estimated times inflate to hundreds of hours.

I'm used to seeing big leaps in progress % but now tasks that were previously estimated to take 11 hours (and did, more or less) are now being estimated to take 138 hours.

No biggie, tasks complete as usual, credit is as usual. Just thought I'd mention it in case it's an indicator of something else breaking.
Profile Jack Lightholder
Volunteer moderator
Project administrator

Joined: 24 Jun 12
Posts: 5
Credit: 25,698,953
RAC: 0
Message 776 - Posted: 22 Jan 2013, 1:22:20 UTC - in response to Message 775.  

Hi Steve,
A modification was just made to the estimated FP operations parameter for the newest batch of work units. Are these high estimated times occurring in the Decics app? More than likely that modification is the cause, but Eric will be able to confirm. The modification was made to try to solve queue problems on some hosts that were not able to build a queue of tasks to complete.

Out of curiosity is your host able to maintain a queue of work units or does it download as needed?

~Jack
Profile Steve Hawker*

Joined: 1 Jul 12
Posts: 13
Credit: 2,099,843
RAC: 0
Message 777 - Posted: 22 Jan 2013, 2:30:45 UTC - in response to Message 776.  

Hey Jack,

Thanks! These are in Decics.

I received 150 WUs and noticed the behavior right after that, so I've not had the chance to see whether the queue is working or not.

With much higher estimation times, I'd suggest that my host - based on how other projects seem to work, particularly Primegrid - will be less likely to maintain a queue (assuming I got my understanding all straight). If the client holds a full buffer, based on inflated estimated times, then there will be fewer WUs in that buffer and a much bigger hole in the buffer will need to be created before another WU is downloaded. Right?
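
As a rough illustration of the buffer reasoning above, here is a minimal sketch. It assumes a deliberately simplified client that keeps fetching tasks until their summed estimates cover the work buffer; the 48-hour buffer and the function name `tasks_to_fill_buffer` are hypothetical, and the real BOINC scheduler is more involved.

```python
import math

def tasks_to_fill_buffer(buffer_hours: float, est_hours_per_task: float) -> int:
    """Number of tasks whose summed estimates first cover the buffer."""
    return math.ceil(buffer_hours / est_hours_per_task)

BUFFER_HOURS = 48.0  # hypothetical 2-day work buffer on a single core

tasks_at_11h = tasks_to_fill_buffer(BUFFER_HOURS, 11.0)    # earlier estimate
tasks_at_138h = tasks_to_fill_buffer(BUFFER_HOURS, 138.0)  # inflated estimate

print(tasks_at_11h, tasks_at_138h)
```

Under these assumptions, a single task with the inflated estimate already covers the whole buffer, so the client would hold far fewer tasks.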

Also, I checked all my WUs and I see now that some have smaller estimates than I would expect (like seconds instead of hours).

Confused? You will be, after the next installment....
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1344
Credit: 532,708,184
RAC: 547,640
Message 778 - Posted: 22 Jan 2013, 3:46:04 UTC - in response to Message 777.  

Hi Steve,

Sorry, I probably should have posted something about this in the news items.

As Jack mentioned, I did change the estimated FP ops to what should be a more realistic number. I think it was the equivalent of 50 hours per WU on a 2GHz computer (assuming 1 flop per clock cycle). Typical WUs on my 2GHz computer were averaging less than 5 hours. So I reduced the estimated flops by a factor of 10.
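
The arithmetic behind those numbers can be sketched as follows. This only restates the figures in the post (a 2GHz host, 1 flop per clock cycle, 50 hours reduced by a factor of 10); the function name `estimated_hours` is made up for illustration.

```python
def estimated_hours(fpops_est: float, host_flops: float) -> float:
    """Estimated runtime in hours for a task needing fpops_est operations."""
    return fpops_est / host_flops / 3600.0

HOST_FLOPS = 2e9  # 2 GHz at 1 flop per clock cycle (assumption from the post)

old_fpops_est = 50 * 3600 * HOST_FLOPS  # the old estimate: 50 hours of work
new_fpops_est = old_fpops_est / 10      # reduced by a factor of 10

old_hours = estimated_hours(old_fpops_est, HOST_FLOPS)
new_hours = estimated_hours(new_fpops_est, HOST_FLOPS)

print(old_fpops_est, old_hours, new_hours)
```

The new figure of 5 hours is in line with the "less than 5 hours" typical runtime mentioned above.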

The original problem we were seeing is that some users were not able to maintain a queue, primarily because at 50 hours per WU, the client thought it had enough work.

The client is smart enough to maintain statistics, and over time will compensate for the inaccurate flop estimate. This is why I was unaware of the problem for so long, because my clients had already compensated for it.

After "fixing" the flop estimate, I noticed that my clients downloaded a ton of new WUs, and I think this is what you are also seeing. This is to be expected, as the client now thinks the new WUs will be 10 times as fast compared to the old ones. Over time the client should readjust to this new flop estimate. In fact, after several days my clients now have a more reasonable queue.
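
To make the compensation effect concrete, here is a toy model. It is not the BOINC client's actual algorithm: it assumes the client multiplies the raw estimate by a learned correction factor and nudges that factor toward the observed ratio after each task (simple exponential smoothing with a hypothetical weight of 0.1).

```python
def update_factor(factor: float, raw_est_hours: float,
                  actual_hours: float, weight: float = 0.1) -> float:
    """Move the correction factor a little toward actual/raw (toy model)."""
    observed = actual_hours / raw_est_hours
    return (1 - weight) * factor + weight * observed

# Factor learned while the raw estimate was 50 h and tasks really took ~5 h.
factor = 5.0 / 50.0

RAW_NEW_EST = 5.0   # raw estimate after the tenfold reduction
ACTUAL = 5.0        # tasks still take about 5 hours

first_estimate = factor * RAW_NEW_EST  # what the client believes right after the change

for _ in range(50):                    # 50 completed tasks
    factor = update_factor(factor, RAW_NEW_EST, ACTUAL)

final_estimate = factor * RAW_NEW_EST

print(first_estimate, final_estimate)
```

In this sketch the first corrected estimate is a tenth of the true runtime, which is why the queue balloons, and it drifts back toward 5 hours as results come in.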

Eric
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1344
Credit: 532,708,184
RAC: 547,640
Message 779 - Posted: 22 Jan 2013, 3:57:56 UTC - in response to Message 778.  

Just one more thing to add.

I am making the same adjustment to the bounded app. But this time, I will slowly reduce the estimated flops by factors of 2, over a period of several weeks. This should reduce the large spike of WUs in the client queue.
Dagorath
Joined: 2 Sep 11
Posts: 57
Credit: 1,274,345
RAC: 0
Message 780 - Posted: 22 Jan 2013, 9:47:36 UTC - in response to Message 779.  

We rarely use the bounded property so a quick reduction doesn't hurt much. A tenfold decrease in the flops estimate, however, throws a big wobbly into the scheduler. On the other other hand... when are people gonna learn to keep a small cache <sigh>

BOINC FAQ Service
Official BOINC wiki
Installing BOINC on Linux
frankhagen

Joined: 19 Aug 11
Posts: 76
Credit: 2,002,860
RAC: 0
Message 781 - Posted: 22 Jan 2013, 14:55:46 UTC - in response to Message 779.  
Last modified: 22 Jan 2013, 14:57:16 UTC

you better go for 10% a day!

whatever you are doing there, small steps not big jumps..
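
For comparison, a small sketch of how many steps each schedule would need. It assumes the overall goal for the bounded app is the same tenfold reduction that was applied to Decics, which the thread does not state explicitly; `steps_to_reach` is a hypothetical helper.

```python
def steps_to_reach(target_ratio: float, step_factor: float) -> int:
    """Count multiplicative steps until the estimate is at or below target_ratio."""
    ratio, steps = 1.0, 0
    while ratio > target_ratio:
        ratio *= step_factor
        steps += 1
    return steps

TARGET = 0.1  # assumed overall goal: one tenth of the current estimate

halving_steps = steps_to_reach(TARGET, 0.5)  # reduce by factors of 2
daily_steps = steps_to_reach(TARGET, 0.9)    # reduce by 10% per step

print(halving_steps, daily_steps)
```

Halving gets there in a handful of large jumps, while 10% steps take about three weeks of daily changes, each of which the client can absorb more easily.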
Dagorath
Joined: 2 Sep 11
Posts: 57
Credit: 1,274,345
RAC: 0
Message 782 - Posted: 22 Jan 2013, 21:07:01 UTC - in response to Message 781.  

Doh! My mistake. frankhagen is right, 10% per day or less. Eric, I thought you were talking about the <fpops_bound> (sp?) tasks property when you said "bounded app".

Profile Aurel
Joined: 25 Feb 13
Posts: 216
Credit: 9,899,302
RAC: 0
Message 924 - Posted: 29 Nov 2013, 14:38:45 UTC

Hm, long WU, see my picture on Da: http://aurel51.deviantart.com/art/Ehh-yes-416655839

I think that wu needs some time. :)
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1344
Credit: 532,708,184
RAC: 547,640
Message 925 - Posted: 29 Nov 2013, 18:41:40 UTC - in response to Message 924.  

That does seem odd. Keep an eye on it and let me know when it finally finishes.

Any chance it's been periodically interrupted? One explanation is that it keeps getting interrupted before it can checkpoint. Checkpointing is supposed to happen every few minutes, but sometimes it takes 20+ minutes to reach that part of the code.
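
A minimal simulation of that failure mode, under simplifying assumptions: work is only saved at fixed checkpoint intervals, every interruption throws away the work done since the last checkpoint, and interruptions arrive at a fixed spacing. All names and numbers other than the 20-minute checkpoint gap are hypothetical.

```python
def wall_minutes_to_finish(total_work, checkpoint_every, interrupt_every,
                           max_wall=100_000):
    """Return wall-clock minutes to finish, or None if it never gets there."""
    saved = 0.0  # work preserved by the last checkpoint
    wall = 0.0
    while wall < max_wall:
        remaining = total_work - saved
        if remaining <= interrupt_every:
            return wall + remaining  # finishes before the next interruption
        # Only whole checkpoint intervals survive the interruption.
        saved += (interrupt_every // checkpoint_every) * checkpoint_every
        wall += interrupt_every
    return None

uninterrupted = wall_minutes_to_finish(300, 20, 1000)
interrupted_rarely = wall_minutes_to_finish(300, 20, 50)
interrupted_often = wall_minutes_to_finish(300, 20, 15)

print(uninterrupted, interrupted_rarely, interrupted_often)
```

When the interruptions come more often than the checkpoints (15 minutes versus 20 here), the task restarts from the same point every time and never completes in this model.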
Profile Aurel
Joined: 25 Feb 13
Posts: 216
Credit: 9,899,302
RAC: 0
Message 926 - Posted: 29 Nov 2013, 21:54:08 UTC - in response to Message 925.  

Yeah, the WU has already been computed to 100%. More than 34 hours and 1,151.74 points.
See: http://numberfields.asu.edu/NumberFields/result.php?resultid=5308617
Profile Aurel
Joined: 25 Feb 13
Posts: 216
Credit: 9,899,302
RAC: 0
Message 1220 - Posted: 16 Jan 2015, 17:33:40 UTC

Well, not the right thread... see for yourself:

I'm reporting two long-running tasks:

One with 27 hours of work, at 60%.
The second with 54.5 hours of work, at 40.911%.

I'm keeping an eye on them.
Profile Aurel
Joined: 25 Feb 13
Posts: 216
Credit: 9,899,302
RAC: 0
Message 1221 - Posted: 18 Jan 2015, 16:44:16 UTC - in response to Message 1220.  

I am two days past the deadline. 209310 needs some more days. ;)
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1344
Credit: 532,708,184
RAC: 547,640
Message 1222 - Posted: 18 Jan 2015, 17:33:15 UTC - in response to Message 1221.  

Don't worry, there is a grace period. That WU won't be reissued until later today. And you should be good as long as you return it before someone else does. It looks like your host is 3.3 GHz, so you should be fine as long as you are not over-committing your CPU.
Profile Aurel
Joined: 25 Feb 13
Posts: 216
Credit: 9,899,302
RAC: 0
Message 1223 - Posted: 19 Jan 2015, 18:11:40 UTC - in response to Message 1222.  

Here are two long running tasks:


http://numberfields.asu.edu/NumberFields/result.php?resultid=10054642 17 hours CPU-time

http://numberfields.asu.edu/NumberFields/result.php?resultid=10012391 2 days and 18 hours CPU-time! [3.6k points]
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1344
Credit: 532,708,184
RAC: 547,640
Message 1224 - Posted: 21 Jan 2015, 6:08:28 UTC - in response to Message 1223.  

As a test, I reran result #10012391, and it took only 19 hours, about 3.5 times faster than yours. Our clock speeds are comparable. The biggest difference is that you used the Windows version, and I the 64-bit Linux version. I am a little puzzled by this; I remember the Windows version being slower, but not 3.5 times slower. Maybe you were running something else simultaneously?
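
For reference, the ratio works out as follows, using the "2 days and 18 hours" CPU time reported for that result and the 19-hour rerun:

```python
reported_hours = 2 * 24 + 18  # 2 days and 18 hours of CPU time
rerun_hours = 19              # the rerun on 64-bit Linux

speedup = reported_hours / rerun_hours

print(reported_hours, round(speedup, 2))
```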
Profile Aurel
Joined: 25 Feb 13
Posts: 216
Credit: 9,899,302
RAC: 0
Message 1225 - Posted: 21 Jan 2015, 15:11:21 UTC - in response to Message 1224.  

Maybe it was an application running in the background? [FiND@Home: the Vina application has trouble with CPU settings. Even when suspended, the app needs 10% CPU time per unit.]

3.5 times faster... that's a lot, really a lot.

In the next few weeks I need to buy a new computer; the main OS will be Linux. ;) [Debian or Ubuntu...]
My mainboard is not working right; the CPU and GPU are okay. Maybe I'll buy a used two-year-old mainboard and build a local RasPi computing grid for it. ;)

Now I'm going to catch the 1 million! :)
Traills

Joined: 14 Aug 11
Posts: 5
Credit: 10,096,779
RAC: 0
Message 1232 - Posted: 26 Feb 2015, 9:07:31 UTC

Something similar is going on in several of my Get Decics with Bounded Discriminant v3.00 WUs. The last one was showing 10:36:09 worked, which is not bad in itself, but it was only 26.233% complete, and that represented just a 0.1% increment in completion over its last 26 CPU minutes (i7-3770 @3.4GHz). The ETC inflated by 63 minutes in the same time period. I normally wouldn't mind letting a WU crank for as long as it takes (I also participate in CPDN). This one I thought it best to abort, because keeping it around was putting other WUs waiting in the queue at risk. That risk is the cumulative result of two other v3.00 WUs running to completion in the 25-30 CPU hour range in the last week (although some are much shorter), coupled with a lot of Decic Fields 1.02 WUs running 10-25 hours lately. Not sure what can be done; maybe increase the due date for all WUs by some further percentage until the tasks return to a more moderate part of the problem space.
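
To put a number on that, here is a back-of-the-envelope projection from the figures above. It assumes the task keeps crawling at its most recent rate (0.1% per 26 CPU minutes), which may well be too pessimistic since progress in these WUs is uneven; the function name is hypothetical.

```python
def hours_left_at_recent_rate(percent_done: float, recent_percent: float,
                              recent_minutes: float) -> float:
    """Hours remaining if progress continues at the most recent rate."""
    remaining_percent = 100.0 - percent_done
    return remaining_percent / recent_percent * recent_minutes / 60.0

elapsed_hours = 10 + 36 / 60 + 9 / 3600  # 10:36:09 worked so far
hours_left = hours_left_at_recent_rate(26.233, 0.1, 26.0)
projected_total = elapsed_hours + hours_left

print(round(hours_left), round(projected_total))
```

At that rate the task would need roughly another 320 CPU hours, well beyond a deadline measured in days.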
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1344
Credit: 532,708,184
RAC: 547,640
Message 1233 - Posted: 27 Feb 2015, 3:28:59 UTC - in response to Message 1232.  

Thanks for reporting. The current deadline is 7 days with a 3 day grace period, giving a total of 10 days. So I could raise it a little bit, but I wouldn't want to go much beyond 10 days.

When I find some time I will analyze the data from the latest subfield (which recently completed). Several weeks ago I did a spot check of the long-running results: the vast majority were from the Windows version of the app. Of course this could be because the vast majority of users use Windows, but it's something else to look into.
Richard Haselgrove

Joined: 28 Oct 11
Posts: 180
Credit: 253,378,167
RAC: 178,862
Message 1234 - Posted: 27 Feb 2015, 9:35:32 UTC - in response to Message 1233.  

This sounds like a good moment to mention WU 9446590.

The first two copies both exceeded the deadline, and I have the third. It's been running slowly but steadily for 114 hours so far, and has reached 37.317%. I get the impression that it sometimes moves on by 1% or 2% quite quickly, but usually progress moves on in 0.001% increments. Other tasks have come and gone on the other CPU cores while this one has been running, so I don't think it's an issue with the computer it's running on.

The Martinet search report for the task which has finished shows 23 cases, from "a5 = -8 + 6w" to "a5 = 14 + 6w". Mine has reached case 0, which seems about right for 37.319% - it's moved on while I've been typing!

I'll keep it running to see what happens, although I'm expecting it will far exceed the local deadline tomorrow night, and also exceed the grace period on 3 March.
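
A quick linear extrapolation supports that expectation. This is only a sketch: it assumes progress stays proportional to run time (the jumps described above suggest it does not, exactly), and it compares run time against the 7-day deadline plus 3-day grace period quoted earlier in the thread, ignoring when the task was actually issued.

```python
def projected_total_hours(elapsed_hours: float, percent_done: float) -> float:
    """Total run time if the average rate so far holds until 100%."""
    return elapsed_hours / (percent_done / 100.0)

total_hours = projected_total_hours(114.0, 37.317)
limit_hours = (7 + 3) * 24  # deadline plus grace period, in hours

exceeds_limit = total_hours > limit_hours

print(round(total_hours), limit_hours, exceeds_limit)
```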

Copyright © 2024 Arizona State University