large swing in ETA

Nick

Joined: 23 Oct 18
Posts: 5
Credit: 7,397,389
RAC: 0
Message 2548 - Posted: 6 Sep 2019, 15:58:04 UTC

Since yesterday, the estimated completion time grew by 10 days. I'm wondering if this might indicate a problem with the settings on the project. Is there a large number of WUs whose deadline expired?
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,811,878
RAC: 288,810
Message 2549 - Posted: 6 Sep 2019, 20:05:50 UTC - in response to Message 2548.  

The increase in ETA coincides with the drop in GFLOPS shown on the server status page. Recall that the ETA is computed using the average number of returned results in the previous 24 hours. So when a large number of volunteers migrate to another project, you will see the ETA go up. On the flip side, when a large number of volunteers migrate back, you will see the ETA drop. We've seen this before with competitions such as the BOINC Pentathlon.

But you are correct, when there are server problems and people can't connect, then fewer results get returned and the ETA goes up. But the server seems to be fine, so I think we can chalk it up to the ebbs and flows of volunteers.
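
To make the relationship concrete, here is a minimal sketch, assuming the ETA is simply the number of results still outstanding divided by the number returned in the previous 24 hours. The function name and the sample figures are hypothetical and are not taken from the project's server code.

```python
def eta_days(results_remaining: int, results_last_24h: int) -> float:
    """Estimated days to completion from the trailing 24-hour return rate.

    Assumption: ETA = remaining results / results returned in the last day.
    """
    if results_last_24h <= 0:
        return float("inf")  # nothing returned, so no finite estimate
    return results_remaining / results_last_24h


remaining = 2_400_000  # hypothetical backlog of results

# Normal day: 100,000 results returned.
print(eta_days(remaining, 100_000))

# Volunteers move away and only 70,000 results come back: the ETA jumps
# by roughly 10 days although the backlog itself has not changed.
print(round(eta_days(remaining, 70_000), 1))
```

With these made-up numbers, a 30% fall in daily returns is enough to move the estimate from 24 days to about 34.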
Richard Haselgrove

Joined: 28 Oct 11
Posts: 179
Credit: 220,416,882
RAC: 128,367
Message 2550 - Posted: 6 Sep 2019, 22:19:51 UTC - in response to Message 2549.  

More to it than that. The DS16x271 sequence of tasks is much slower (longer runtime) than its immediate predecessors. Ever since it started, I've been noticing a much higher proportion than usual of '_1' (or higher) resent tasks. And because you have accelerated processing (shorter deadlines) for resent tasks, they are under even greater time pressure.

My suspicion is that if you examine the database, you will find a higher than usual proportion of tasks aborted by the client for "not started by deadline" - I can't remember the error number offhand, but I can look it up tomorrow.

And that is compounded by the very slow adjustment of Estimated Runtime under CreditNew. We had a good run of 'loose ends', which mostly ran quickly: estimates had sufficient time to adapt, and with the short runtime, work requests resulted in a large number of tasks allocated and cached. When these became the later, slower, tasks, caches were overfilled, and couldn't be processed in time.
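
A rough sketch of how a cache can end up overfilled, assuming a deliberately simplified work-fetch model in which the client requests enough tasks to fill a fixed number of days based on the estimated runtime. The function names, the 3-day cache, and the 2-hour and 6-hour runtimes are hypothetical; the real BOINC scheduler is considerably more involved.

```python
def tasks_fetched(cache_days: float, cores: int, est_runtime_hours: float) -> int:
    """Tasks needed to fill the cache if every task took the estimated time."""
    return int(cache_days * 24 * cores / est_runtime_hours)


def days_to_clear(tasks: int, cores: int, actual_runtime_hours: float) -> float:
    """Days needed to finish the cached tasks at the real runtime."""
    return tasks * actual_runtime_hours / (cores * 24)


cores = 32
# Estimate learned from the quick 'loose ends' tasks (hypothetical: 2 hours).
cached = tasks_fetched(cache_days=3, cores=cores, est_runtime_hours=2.0)

# The later, slower tasks actually take 6 hours each (also hypothetical).
needed = days_to_clear(cached, cores, actual_runtime_hours=6.0)

print(cached)  # number of tasks sitting in the cache
print(needed)  # days required to work through them
```

In this toy case a 3-day cache turns into 9 days of work, which is longer than a one-week deadline, so some tasks would never start in time.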
Nick

Message 2551 - Posted: 6 Sep 2019, 22:45:49 UTC - in response to Message 2550.  

I already suspected there was a problem with the deadline here. Thanks for pointing out that _1 means it's a retry; I hadn't known that. On my 32-core machine, I have 11 _1 WUs running right now, so about a third of the work I'm doing is retry work. That seems high to me.

What exactly does the deadline apply to: starting the task, finishing the task, or the task being sent out to another user? If the deadline passes and I'm almost done (I have seen this countless times on this sf4-16x271), will I continue to compute it unless the retry beats me to it, or does it get cancelled, either some fixed time later or on the next sync with the server?
Eric Driver
Message 2552 - Posted: 7 Sep 2019, 5:05:10 UTC - in response to Message 2551.  

The deadline is 1 week, and then there is a 3-day grace period before the task is reissued (or maybe the grace period only applies to receiving credit?). The client is usually pretty good at aborting tasks that haven't started before the deadline. If you're contacting the server regularly (once per day or more), I don't think you have anything to worry about.

The "reduced delay bound" is set to 0.5, which means reissued tasks have 3.5 days to complete. I would assume the 3-day grace period still applies to accelerated retries, so in effect you would have 6.5 days on retries.
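
The arithmetic can be sketched as follows, assuming (as guessed above, not confirmed) that the 3-day grace period is added on top of the shortened deadline for retries. The constant and function names are made up for illustration and are not BOINC configuration keys.

```python
DELAY_BOUND_DAYS = 7.0     # normal deadline quoted in this post
REDUCED_DELAY_BOUND = 0.5  # "reduced delay bound" factor for reissued tasks
GRACE_DAYS = 3.0           # grace period; assumed to apply to retries as well


def deadlines(is_resend: bool) -> tuple[float, float]:
    """Return (deadline in days, deadline plus grace period in days)."""
    factor = REDUCED_DELAY_BOUND if is_resend else 1.0
    base = DELAY_BOUND_DAYS * factor
    return base, base + GRACE_DAYS


print(deadlines(is_resend=False))  # first issue of a task
print(deadlines(is_resend=True))   # accelerated retry (_1 or higher)
```

Under these assumptions a first-issue task gets 7 days (10 with grace) and a retry gets 3.5 days (6.5 with grace).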

Hope that explains things.
Eric Driver
Message 2553 - Posted: 7 Sep 2019, 5:18:37 UTC - in response to Message 2550.  

You are correct, there is a higher number of resent tasks than usual. So, getting back to Nick's original question, are you suggesting that this abundance of resent tasks is responsible for the decrease in GFLOPS (= increase in ETA)? Note that it has been this way for several weeks now, and only recently have the GFLOPS dropped, which is why I attributed it to a drop in volunteers.
Richard Haselgrove

Message 2554 - Posted: 7 Sep 2019, 8:18:54 UTC - in response to Message 2553.  

Eric Driver wrote: "You are correct, there is a higher number of resent tasks than usual. So, getting back to Nick's original question, are you suggesting that this abundance of resent tasks is responsible for the decrease in GFLOPS (= increase in ETA)? Note that it has been this way for several weeks now, and only recently have the GFLOPS dropped, which is why I attributed it to a drop in volunteers."

Not directly, but I think they may be separate symptoms of the same underlying cause.

I've looked at a few of my _1 tasks, to see why they needed to be resent. There is the usual range of sporadic errors and problems, but also a non-trivial number of the ones I was talking about. They show up in task lists as

    Not started by deadline - canceled

and in individual task pages as

    Exit status    200 (0x000000C8) EXIT_UNSTARTED_LATE

They're also described as 'Aborted by user', which is false - they are aborted by the client, with no intervention by the user.

This is actually a very minor problem, which is why I haven't mentioned it before. No processing time (no CPU cycles) has been wasted - the machines have probably been busy on other jobs. It just looks a bit ugly.

So, what's the underlying common cause? BOINC is not well instrumented: there are very few things it can measure. It can't measure speed (GFlops) directly, because that depends on the hardware (which can be benchmarked - though not for GPUs) and also on the programming efficiency (as you found when you ditched the library in favour of custom code). BOINC ignores the programming efficiency - no way of measuring that.

So the only measure of speed that BOINC has is 'work done per unit time'. BOINC can measure time - it's one of the few things it's good at - but (again), it can't measure work done. So, the fallback 'work done' figure is our old friend <rsc_fpops_est> - declared for each workunit. And, as we've discussed before, the only way of measuring fpops here is post hoc - by running the workunit. Which defeats the object of the project.

The only possible mitigating action would be to sample each new batch of workunits before distribution starts and declare <rsc_fpops_est> for that batch as some sort of ballpark kludge. But that's a lot of work, still not very accurate, and I can't realistically recommend it. We're muddling through well enough.

In that scenario, GFlops is falling because "same work in longer time" means we're working more slowly. It's taken weeks for BOINC to notice that (which is far too slow): it's taken the same number of weeks for BOINC to adjust our runtime estimates (so we don't fetch more work than we can handle), because that's based on the same underlying data. I think we should just buckle in and enjoy the ride. It might be bumpy at times.
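
A toy illustration of "same work in longer time", assuming speed is inferred purely as the declared <rsc_fpops_est> divided by elapsed time. The function name, the fpops figure, and the runtimes are hypothetical, and BOINC's real per-host averaging is more elaborate than this.

```python
def apparent_gflops(rsc_fpops_est: float, elapsed_seconds: float) -> float:
    """Speed as inferred from declared work, not from work actually done."""
    return rsc_fpops_est / elapsed_seconds / 1e9


FPOPS_EST = 3.6e13  # hypothetical value, declared identically for every workunit

# Earlier batch: tasks finish in 2 hours.
fast = apparent_gflops(FPOPS_EST, 2 * 3600)

# Slower batch: same declared work, but each task now takes 6 hours.
slow = apparent_gflops(FPOPS_EST, 6 * 3600)

print(fast)            # apparent speed on the quick tasks
print(round(slow, 2))  # apparent speed on the slow tasks, same hardware
```

The hardware has not changed between the two cases; only the gap between the declared estimate and the real work has, yet the reported speed falls to a third.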
Nick

Message 2555 - Posted: 8 Sep 2019, 6:59:54 UTC - in response to Message 2552.  

Thanks for the explanation!

I note that the ETA has swung back the other way and is now back at 24 days. I haven't documented it, but I have noticed that this swing happens every week and expect it will go back up by next Friday night and back down again as the weekend draws to a close.
Richard Haselgrove

Message 2556 - Posted: 8 Sep 2019, 9:53:08 UTC - in response to Message 2552.  

Workunit 58533257 shows the timetabling quite well.

On this website, deadlines are displayed with the additional 3 day grace period included. The server uses these extended deadlines for sending replacement tasks after a timeout.

On our home computers, BOINC Manager displays the short deadline, and the BOINC client acts on the short deadline. The machine that I'm processing that resend on displays the deadline as 10 September, 23:51 in my local timezone (UTC+1).
Eric Driver
Message 2558 - Posted: 8 Sep 2019, 17:24:48 UTC - in response to Message 2555.  

I too have noticed the cyclical nature of the ETA (and GFLOPS). One possible theory is that a volunteer with a large number of resources is running multiple projects - if all their clients are in sync, they could be switching projects at about the same time. The only other possibility is that a large number of independent users are switching projects at the same time; I could see this happening during a competition, but I don't think the frequency of the competitions matches that of the ETA swings.
Eric Driver
Message 2559 - Posted: 10 Sep 2019, 16:08:36 UTC - in response to Message 2558.  

I have been watching the top participants page for the last several days. The grcpool contribution has been dropping, and I believe that is the best explanation for the recent drop in GFLOPS. I also went to the Gridcoin site and found that NumberFields has been greylisted, although I don't know when that started. I will look into getting whitelisted again.

Copyright © 2024 Arizona State University