Maximum Elapsed Time Exceeded (Win 32bit)

Author	Message
Conan Joined: 3 Sep 11 Posts: 43 Credit: 20,713,270 RAC: 41,824	Message 251 - Posted: 21 Oct 2011, 10:40:57 UTC Getting this error a few times now and losing processor time because of it: WU 178732, WU 178680, WU 178652, WU 176592. Conan

Eric Driver Project administrator Project developer Project tester Project scientist Joined: 8 Jul 11 Posts: 1460 Credit: 1,222,352,389 RAC: 2,647,576	Message 252 - Posted: 21 Oct 2011, 17:05:08 UTC - in response to Message 251. "Getting this error a few times now and losing processor time because of it WU 178732 WU 178680 wu 178652 WU 176592 Conan" It looks like these all errored out in less than 2 hours, which should not exceed the limits that I set in the project. Many users have returned results with run times of 30+ hours without any problems. I suspect something might be fouled up with the manager. Could you try to restart the manager and/or reset the project to see if that fixes the problem? Thanks!

Conan Joined: 3 Sep 11 Posts: 43 Credit: 20,713,270 RAC: 41,824	Message 253 - Posted: 21 Oct 2011, 22:31:34 UTC Thanks Eric, I will try that. It is only affecting one computer; the other has not had that trouble, or has not had that WU type. Got another 3 as well: WU 183513, WU 183501, WU 183497. Conan

zombie67 [MM] Joined: 19 Aug 11 Posts: 31 Credit: 112,163,136 RAC: 0	Message 254 - Posted: 21 Oct 2011, 23:17:47 UTC Last modified: 21 Oct 2011, 23:18:02 UTC Yep, I have the same problem, but like you, with only one of my machines. They are all xp64 and win7 64. But only this one has the problem. Bizarre. http://stat.la.asu.edu/NumberFields/results.php?hostid=14&offset=0&show_names=0&state=5&appid= Reno, NV Team: SETI.USA

Conan Joined: 3 Sep 11 Posts: 43 Credit: 20,713,270 RAC: 41,824	Message 257 - Posted: 23 Oct 2011, 13:26:22 UTC - in response to Message 254. "Yep, I have the same problem, but like you, with only one of my machines. They are all xp64 and win7 64. But only this one has the problem. Bizarre. http://stat.la.asu.edu/NumberFields/results.php?hostid=14&offset=0&show_names=0&state=5&appid=" Whoa there zombie67 [MM], mine is not as bad as yours; ALL of yours are failing with this error. Maybe a bad batch and you got most of them? Conan

Eric Driver Project administrator Project developer Project tester Project scientist Joined: 8 Jul 11 Posts: 1460 Credit: 1,222,352,389 RAC: 2,647,576	Message 259 - Posted: 23 Oct 2011, 19:31:56 UTC - in response to Message 257. Did either of you try restarting the manager or resetting the project? Cases that have failed on your machines have been resent and finished successfully on other machines. This suggests it's most likely a problem with the manager. But you never know, it could be some weird bug in the app that only surfaces on a few machines.

Conan Joined: 3 Sep 11 Posts: 43 Credit: 20,713,270 RAC: 41,824	Message 261 - Posted: 24 Oct 2011, 2:21:43 UTC Last modified: 24 Oct 2011, 2:23:27 UTC Yes Eric, I did a detach and re-attach on the first computer I reported having trouble with; it hasn't stopped the errors, just reduced them. Now the problem has gone across to my other computer as well. Host 415 (the one I first reported) has had another one: see WU 159119, which ran for 13,292 seconds. First one on Host 416: see WU 155878, which ran for 31,308 seconds. This WU has also failed on another computer with the same error message. Conan

Eric Driver Project administrator Project developer Project tester Project scientist Joined: 8 Jul 11 Posts: 1460 Credit: 1,222,352,389 RAC: 2,647,576	Message 262 - Posted: 24 Oct 2011, 4:39:20 UTC - in response to Message 261. I dug around a little bit and found something interesting. It appears the client uses a parameter called the duration_correction_factor to estimate how long it thinks a given wu should take. This parameter can be found in the client_state.xml file. My two hosts had values of 7 and 20. This link explains how to check the value and modify it: http://boincfaq.mundayweb.com/index.php?view=301&language=1 For anybody who's having this problem, I'd be curious to know what value your client is using. I'm guessing this parameter is set badly on those machines having the problem. It appears that the client continuously updates this correction factor based on historical run times on the given host, so the large variance in run times may be screwing up the client's calculation of this factor. Either way, I increased the rsc_fpops_bound by a factor of 10, so hopefully that will fix the problem on the new wus.
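
For anyone who would rather not open client_state.xml by hand, here is a minimal Python sketch of the check Eric describes. It assumes each `<project>` element carries `<project_name>` and `<duration_correction_factor>` children; the sample XML and project names below are hypothetical and heavily trimmed, so treat this as an illustration rather than a tested tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a real client_state.xml.
SAMPLE = """<client_state>
  <project>
    <project_name>Example Project A</project_name>
    <duration_correction_factor>55.0</duration_correction_factor>
  </project>
  <project>
    <project_name>Example Project B</project_name>
    <duration_correction_factor>0.96</duration_correction_factor>
  </project>
</client_state>"""


def read_dcfs(xml_text):
    """Return {project name: DCF} for every <project> element that has one."""
    root = ET.fromstring(xml_text)
    dcfs = {}
    for project in root.iter("project"):
        name = project.findtext("project_name", default="(unnamed)")
        dcf = project.findtext("duration_correction_factor")
        if dcf is not None:
            dcfs[name] = float(dcf)
    return dcfs


for name, dcf in read_dcfs(SAMPLE).items():
    print(f"{name}: DCF = {dcf}")
```

To try it on a real host, read the file contents into a string and pass that to `read_dcfs` instead of `SAMPLE`; stop the BOINC client first if you intend to edit the file afterwards.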

Conan Joined: 3 Sep 11 Posts: 43 Credit: 20,713,270 RAC: 41,824	Message 268 - Posted: 26 Oct 2011, 9:34:48 UTC G'Day Eric, I have another one for you: Result 222012. Over 44,000 seconds, then "Maximum Elapsed Time Exceeded". My Duration Correction Factor for Host 415 is 55 (I reset it from over 70 to 0 just 2 days ago). My Duration Correction Factor for Host 416 is 60 (I have not reset this one). Thanks Conan

Eric Driver Project administrator Project developer Project tester Project scientist Joined: 8 Jul 11 Posts: 1460 Credit: 1,222,352,389 RAC: 2,647,576	Message 271 - Posted: 27 Oct 2011, 1:53:24 UTC - in response to Message 268. "G'Day Eric I have another one for you Result 222012 Over 44,000 seconds then 'Maximum Elapsed Time Exceeded'. My Duration Correction Factor for Host 415 is 55 (I reset it from over 70 to 0 just 2 days ago). My Duration Correction Factor for Host 416 is 60 (I have not reset this one). Thanks Conan" Yeah, I'm still seeing the occasional error. But they do seem to have dropped somewhat. I'm looking into another way I can bypass this correction factor. Stay tuned.

frankhagen Joined: 19 Aug 11 Posts: 76 Credit: 2,002,860 RAC: 0	Message 273 - Posted: 27 Oct 2011, 6:02:39 UTC - in response to Message 271. Last modified: 27 Oct 2011, 6:06:56 UTC now it's going totally berserk! predicted runtime has jumped to 5.200 HOURS :(

Dagorath Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0	Message 274 - Posted: 27 Oct 2011, 7:24:12 UTC - in response to Message 273. Last modified: 27 Oct 2011, 7:55:25 UTC What is your DCF for NumberFields? I bet it's close to 40. Mine is 3.2 on one computer, 31 on the other. The reason our DCFs are crazy is that for weeks the tasks were quite short. Now the tasks are much longer, so BOINC has increased the DCF to avoid downloading work we cannot complete in time. The algorithm that adjusts the DCF increases it very quickly and decreases it very slowly. My computer that has DCF = 31 had DCF = 52 a few days ago, so it is decreasing, as it should, as BOINC "learns" that the long tasks are the new norm. If your DCF is over 100 you might consider adjusting it downward manually. On the other hand, if BOINC is keeping all your cores busy and your tasks are returning before deadline then it's basically doing its job. A super high estimate of completion time might be shocking but isn't necessarily harmful. @ Eric, When we started the long tasks, did you increase the <flops> or <rsc_fpops_est> figures? BOINC FAQ Service Official BOINC wiki Installing BOINC on Linux
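
The "up fast, down slow" behaviour Dagorath describes can be sketched in a few lines of Python. This is not the BOINC client's code: the function name, the 1.1 and 0.1 thresholds, and the 1%/10% weights are assumptions chosen only to illustrate the asymmetry.

```python
def update_dcf(dcf, raw_ratio):
    """Return the new DCF after one finished task (illustrative sketch only).

    raw_ratio is actual runtime divided by the uncorrected estimate.
    """
    adj_ratio = raw_ratio / dcf  # actual runtime / DCF-corrected estimate
    if adj_ratio > 1.1:
        # The task ran clearly longer than predicted: jump straight up.
        return raw_ratio
    if adj_ratio < 0.1:
        # The task finished far earlier than predicted: move only 1% of the way.
        return 0.99 * dcf + 0.01 * raw_ratio
    # Roughly on target: move 10% of the way.
    return 0.9 * dcf + 0.1 * raw_ratio


dcf = 1.0
dcf = update_dcf(dcf, 40.0)   # one task runs 40x longer than estimated
print("after one long task:", dcf)
for _ in range(5):            # then five tasks that match the raw estimate
    dcf = update_dcf(dcf, 1.0)
print("after five normal tasks:", round(dcf, 2))
```

Under these assumed weights a single long task pushes the DCF up in one step, while getting back down takes many completed tasks, which fits the slow decline from 52 to 31 mentioned above.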

frankhagen Joined: 19 Aug 11 Posts: 76 Credit: 2,002,860 RAC: 0	Message 276 - Posted: 27 Oct 2011, 13:01:27 UTC - in response to Message 274. "What is your DCF for NumberFields? I bet it's close to 40. Mine is 3.2 on one computer, 31 on the other." it was around 12 before i decided to reset the project - then estimated runtime was around 500h. then i got a bunch of very short WU's and the craziness turned the other way around.. mixing WU's with 2 seconds and up to way more than 10.000 must drive the clients crazy - no matter what eric does to estimated flops.

Eric Driver Project administrator Project developer Project tester Project scientist Joined: 8 Jul 11 Posts: 1460 Credit: 1,222,352,389 RAC: 2,647,576	Message 277 - Posted: 27 Oct 2011, 16:33:42 UTC - in response to Message 274. "@ Eric, When we started the long tasks, did you increase the <flops> or <rsc_fpops_est> figures?" Yes, I increased both the <rsc_fpops_est> and the <rsc_fpops_bound>. The bound is now on the order of 10^19, which is way higher than it needs to be. Even with this, users are still getting the "maximum elapsed time exceeded" error. Some users error out after just 1 hour, some after 10 hours, some return results after 30+ hours without any problem. There must be another way to control this, as the fpops est/bound doesn't seem to be helping.

frankhagen Joined: 19 Aug 11 Posts: 76 Credit: 2,002,860 RAC: 0	Message 278 - Posted: 27 Oct 2011, 17:01:11 UTC - in response to Message 277. "There must be another way to control this, as the fpops est/bound doesn't seem to be helping." if there is any way to separate ultra-short from extra-long WU's... set up two separate sub-projects for opt in. BOINC in no way can cope with what you are sending out now!

Paul Joined: 19 Aug 11 Posts: 1 Credit: 1,124,422 RAC: 0	Message 279 - Posted: 27 Oct 2011, 23:50:16 UTC One more wu with the error "Maximum elapsed time exceeded"; all wingmen report the same error on this one: http://stat.la.asu.edu/NumberFields/workunit.php?wuid=152727 This one, I errored out, wingman completed: http://stat.la.asu.edu/NumberFields/workunit.php?wuid=185662

Dagorath Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0	Message 280 - Posted: 28 Oct 2011, 0:41:14 UTC - in response to Message 277. "'@ Eric, When we started the long tasks, did you increase the <flops> or <rsc_fpops_est> figures?' Yes, I increased both the <rsc_fpops_est> and the <rsc_fpops_bound>. The bound is now on the order of 10^19, which is way higher than it needs to be. Even with this, users are still getting the 'maximum elapsed time exceeded' error. Some users error out after just 1 hour, some after 10 hours, some return results after 30+ hours without any problem. There must be another way to control this, as the fpops est/bound doesn't seem to be helping." I'm not an expert on this but I think the crazy DCF values and the maximum elapsed time exceeded errors are unrelated. In other words, I doubt the high DCFs are causing tasks to exceed the max time. I'll ask Richard Haselgrove and Ageless to stop in and have a look at what's going on. Both of them know BOINC inside out. I agree with frank, the variation in run times is driving BOINC crazy, but I can't see that causing excessive run times. At worst, crazy DCFs can screw up the scheduler and cause tasks to go into high priority mode and such, but if all cores are being used, tasks are not missing deadline and resource shares are being honored then there's no real problem, IMHO. I got some tasks that started with completion time of ~6,000 hours with DCF at 35 but the time remaining plummeted like a manhole cover tossed out of an airplane. They went to high priority mode but no problem, they ran to completion.

Eric Driver Project administrator Project developer Project tester Project scientist Joined: 8 Jul 11 Posts: 1460 Credit: 1,222,352,389 RAC: 2,647,576	Message 281 - Posted: 28 Oct 2011, 8:12:39 UTC - in response to Message 280. I agree with Dagorath. I have spent the last several hours poking around and doing some experiments of my own. Here is what I came up with. First, I found this thread from malariacontrol: http://www.malariacontrol.net/forum_thread.php?id=1137 They had the exact same problem. To summarize, the equation for maximum elapsed time is: <duration_correction_factor> * <rsc_fpops_bound> / <flops> where all 3 parameters can be found in client_state.xml (DCF is in the project section, <flops> is in the app_version section, and <rsc_fpops_bound> is in the workunit section). Using the parameters from my client_state.xml file, I computed a value around 400 hours, which explains why I never have the problem. As an experiment, I suspended all tasks currently running (all slow tasks) and I let only fast tasks through. If a task was taking too long I would either suspend it or abort it. What I noticed is that the DCF went down, but even more importantly the <flops> went up by almost a factor of 10! My maximum allowed time dropped from 400 hours to about 40. It appears that the driving factor here is the <flops> value, which is computed by the client. I think the very fast workunits are causing this "measurement" to be very inaccurate. One way to fix this is to increase the <rsc_fpops_bound>, which I have already done. We just need to wait for the old wus to finish before the newer ones get handed out. If a user wants to fix this on their own host immediately, all they need to do is edit their client_state.xml file and either 1. increase <duration_correction_factor> to a very large value, say 10000 OR 2. decrease <flops> to a smaller value, say 1000000000 (this is 1 gflop). If you run other projects, be careful to only change the entries corresponding to the NumberFields project. Thanks for your patience! -- Eric
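
As a rough illustration of the limit Eric describes, here is the arithmetic in Python, taking the three parameters to be the project's DCF, the workunit's `<rsc_fpops_bound>`, and the app version's `<flops>`, as the surrounding posts suggest. The numeric values are hypothetical, picked only so that the result lands near the 400-hour figure he quotes; they are not read from any real host.

```python
def max_elapsed_hours(dcf, rsc_fpops_bound, flops):
    """Maximum allowed run time in hours, per the formula quoted in this thread."""
    seconds = dcf * rsc_fpops_bound / flops
    return seconds / 3600.0


# Hypothetical values, not taken from any real client_state.xml.
dcf = 10.0
rsc_fpops_bound = 3.6e14   # floating-point operations allowed for the workunit
flops = 2.5e9              # speed the client assumes for this app version

print(max_elapsed_hours(dcf, rsc_fpops_bound, flops))       # about 400 hours
# If the assumed speed rises tenfold (DCF held fixed), the limit shrinks tenfold.
print(max_elapsed_hours(dcf, rsc_fpops_bound, flops * 10))  # about 40 hours
```

The second call mirrors the experiment above: a stream of very fast tasks inflates the assumed speed, and the allowed run time drops in proportion.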

Richard Haselgrove Joined: 28 Oct 11 Posts: 182 Credit: 363,534,959 RAC: 181,663	Message 283 - Posted: 28 Oct 2011, 12:10:04 UTC Hi everyone. Dagorath asked me to come along and try to make some sense of this problem, because I've had some experience of it at projects like AQUA and SETI. Unfortunately, it dates back to some poor design decisions taken by BOINC when David Anderson introduced CreditNew. Note, I'm not the slightest bit interested in the 'credit' part of that paper, but the "new system for runtime estimation" is important - and buggy. The first thing to consider is the initial runtime estimate for tasks, before any correction factor is applied by either the BOINC client or server. I just joined the project this morning, with some Intel Core2-class processors. The initial task estimates I saw were around 190 - 200 hours: that is calculated by the client, based on the <rsc_fpops_est> field set on the server in the workunit definition, and the estimated speed of the computer, from the host benchmarks (<p_fpops>) in the first instance, and the default DCF of 1.0000. I had a few tasks finish (without error) in 2 seconds - I guess you know about those. Some have completed in between one and two hours, but it looks as if the 'natural' running time for the others is going to be around 4 - 5 hours. Let's say: initial estimate 200 hours, actual time 5 hours - an overestimate by a factor of about 40. That's the immediate cause of the problem, and it would be good to reduce <rsc_fpops_est> by that factor: but read on - now the project has started, you'll need to make the correction gradually. What happens as a new host works through the first few tasks? I'll refer to my host 1288 as we go along. First, the host DCF will start to decrease, to compensate for those task over-estimates. 
With a 40x overestimate to compensate for, DCF will try to head towards 0.0250 - gradually, 1% at a time, to start with (the BOINC client is very, very, cautious), then in '10% of the remaining gap' jumps once things start to converge. That host is showing a DCF of 0.960758 after four completed tasks. Secondly, the server will also start to adjust CreditNew's equivalent of DCF. The intention is that from now on all correction is done on the server, and that client DCF will settle somewhere around 1.0000, and not be needed any more. In the long term, with large numbers of hosts and tasks per host, it settles down quite well. But the initial boundary conditions can be painful - let's see how they work. The server equivalent of DCF relies on APR - 'average processing rate'. You can see it on the Application details page for each host. You can see that my humble Q9300 already has an APR of 64354 (at the time of writing). The units are Gigaflops, so apparently I'm running at 64 Teraflops (per core...). That's probably because my first two tasks on that host were of the 2-second variety, against the 200-hour estimate. There is, regrettably (and fatally), no cautious 'smoothing' in the server code equivalent to the 1% steps by which DCF is adjusted. The one safeguard is that APR is allowed to average over the first 10 validated tasks, before it is used to correct task runtime estimates. The 'validated' task requirement excludes any abnormally low runtimes resulting from errors, but until recently, valid but 'outlier' tasks like my 2-second runs were included. We're just starting to work with David on correcting that oversight - changeset 24225 - but I doubt it's active in your code here yet, unless you've modified your validator to make use of the new facility. In the meantime, all valid tasks will be averaged into the APR. 
And at the first work fetch after the tenth validation, the APR will be used as the host's speed for runtime estimation purposes: wham, bang, all in one go, without smoothing. That's when the -177 errors tend to start happening, because of course my CPU can deliver nowhere near 64 Teraflops, or whatever figure APR has reached by the time I've completed 10 tasks. The APR is (re-)transmitted to each host - as <flops> - every time new work is allocated, so I'm afraid Eric's suggestion of manually adjusting <flops> will be a temporary relief at best. The only way of making a fixed correction would be for affected users to wrap the project application in an app_info.xml file (to run under anonymous platform), and put a realistic <flops> figure into that, where the server can't interfere with it. Making <rsc_fpops_bound> manyfold larger than <rsc_fpops_est> will indeed prevent the -177 errors, though it will have to be applied for each new task until the new server-generated work reaches us through the queue. The other problem you need to watch out for is something I christened "DCF squared" some 18 months ago, when CreditNew was first tested (for a few weeks only) on SETI@home Beta. While the 10 validations are accruing before APR goes live, the client is still plodding along with its careful DCF correction. We're perhaps lucky that the initial overestimate was as large as 40x, because DCF will stay in the 'cautious' range, and probably only get down to ~0.90: if the estimate had been 'only' 10x too large, DCF would have reached something like 0.2 .... .... when APR is activated, and tells the host that it's the full 10x faster than it thought it was. For a short time, both corrections are active at the same time, and the runtime estimations made by the client are vastly too low. That can result in massive work-fetch requests. 
To put some figures on the '10x' example:
Actual runtime: 5 hours
Initial (over)estimate: 50 hours
DCF adjustment: to 10 hours
APR adjustment (on top): 1 hour estimate
Work fetch will be based on that 1-hour estimate until the next task completes, when (thank goodness) DCF will jump back up to 1.0000 in one jump, with no smoothing. At that point, your client realises that your 2-day cache will actually take 10 days to compute, and starts running everything at high priority to squeeze it in before the 7-day deadline. It's not pretty. Those are the main points, I think. The other volunteer developer who knows a great deal about this area is Josef W. Segur of SETI, and I'll ask him to come and proof-read what I've written. Assuming it passes that test, I'll draw it to David Anderson's attention as well.
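
Richard's "DCF squared" figures can be reproduced with a short Python sketch. The function names are hypothetical and the model is deliberately simplified to the numbers in his example: the client's DCF and the server's APR correction each scale the same estimate, so for a short time the two corrections stack.

```python
def stacked_estimate_hours(initial_estimate_h, dcf, apr_speedup):
    """Runtime estimate while both the client DCF and the server APR
    correction are applied at once (simplified sketch)."""
    return initial_estimate_h * dcf / apr_speedup


def real_cache_days(cache_days, actual_h, estimated_h):
    """Days of real computing hidden in a cache that was sized using the
    too-low estimate."""
    return cache_days * actual_h / estimated_h


actual_h = 5.0          # what a task really takes
initial_h = 50.0        # uncorrected estimate, 10x too high
dcf = 0.2               # client has already scaled estimates down to 10 hours
apr_speedup = 10.0      # server now tells the host it is 10x faster

estimate_h = stacked_estimate_hours(initial_h, dcf, apr_speedup)
print("estimate with both corrections:", estimate_h, "hours")
print("a 2-day cache really holds about",
      real_cache_days(2.0, actual_h, estimate_h), "days of work")
```

With these inputs the doubly corrected estimate comes out at about one hour per task, so a cache requested as two days of work is closer to ten days of actual computing.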

frankhagen Joined: 19 Aug 11 Posts: 76 Credit: 2,002,860 RAC: 0	Message 284 - Posted: 28 Oct 2011, 14:40:47 UTC - in response to Message 283. thank you richard! but it's much worse: i got a Q9450 which is currently showing an average processing rate of 722022 and of course runtime prediction is over 2.000 hours right now. :(