Maximum Elapsed Time Exceeded (Win 32bit)
Author | Message |
---|---|
Joined: 28 Oct 11 Posts: 179 Credit: 223,809,830 RAC: 112,967 |
Josef has checked over my previous post and given it a clean bill of health, but suggests a couple of additional points. Another document you should be aware of is Job runtime estimation. In his most recent post, Eric said: "To summarize, the equation for maximum elapsed time is:" Joe points out that this is a common misconception: in the function ACTIVE_TASK::init(RESULT* rp) (around line 230 of client/app.cpp in current sources), there is max_elapsed_time = rp->wup->rsc_fpops_bound/rp->avp->flops; with no mention of DCF, just <rsc_fpops_bound> and <flops>. As I noted before, <flops> is liable to be reset at every scheduler contact, so the only one it's safe to work with is <rsc_fpops_bound>. |
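The relationship Joe describes can be sketched in a few lines. This is only a Python illustration of the quoted line from client/app.cpp, not the BOINC client code itself; the function name and the sample numbers are assumptions invented for the example.

```python
def max_elapsed_time(rsc_fpops_bound: float, flops: float) -> float:
    """Seconds a task may run before it is aborted for exceeding its limit.

    Mirrors the single line quoted from ACTIVE_TASK::init():
    the bound is divided by the app version's flops; DCF does not appear.
    """
    if flops <= 0:
        raise ValueError("flops must be positive")
    return rsc_fpops_bound / flops


# Hypothetical values, chosen only to show the effect of an inflated <flops>.
bound = 1.0e15                                      # <rsc_fpops_bound>
sane_limit = max_elapsed_time(bound, 2.0e9)         # 500000 seconds
inflated_limit = max_elapsed_time(bound, 7.0e12)    # roughly 143 seconds
```

With the same bound, a wildly overestimated <flops> shrinks the allowed run time from days to minutes, which is why raising <rsc_fpops_bound> is the only lever a volunteer can rely on.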
Joined: 8 Jul 11 Posts: 1323 Credit: 410,841,106 RAC: 246,290 |
Thanks Richard! That helps a bunch. |
Joined: 8 Jul 11 Posts: 1323 Credit: 410,841,106 RAC: 246,290 |
"The one safeguard is that APR is allowed to average over the first 10 validated tasks, before it is used to correct task runtime estimates. The 'validated' task requirement excludes any abnormally low runtimes resulting from errors, but until recently, valid but 'outlier' tasks like my 2-second runs were included. We're just starting to work with David on correcting that oversight - changeset [24225] - but I doubt it's active in your code here yet, unless you've modified your validator to make use of the new facility." A few days ago I did upgrade the server from the "server stable" branch to the core release 6_12 version (I thought the newer server might help with this problem). So I guess I do have the new facility; I just need to modify the validator. |
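To see why a single valid but very short task matters, here is a rough Python sketch of the averaging described in the quote. It is not the BOINC server's actual statistics code: the dictionary keys, the plain arithmetic mean, and the sample values are all assumptions made for illustration.

```python
from statistics import mean

MIN_SAMPLES = 10  # the quote says APR averages over the first 10 validated tasks


def average_processing_rate(tasks, skip_outliers):
    """Mean of (fpops_est / elapsed) over validated tasks, or None if too few.

    tasks: list of dicts with the hypothetical keys
           'fpops_est', 'elapsed', 'validated' and 'outlier'.
    """
    rates = [
        t["fpops_est"] / t["elapsed"]
        for t in tasks
        if t["validated"] and not (skip_outliers and t["outlier"])
    ]
    if len(rates) < MIN_SAMPLES:
        return None  # not enough samples yet to correct estimates
    return mean(rates)


# Ten ordinary one-hour tasks plus one valid 2-second 'outlier' (made-up numbers).
normal = {"fpops_est": 3.6e12, "elapsed": 3600.0, "validated": True, "outlier": False}
short = {"fpops_est": 3.6e12, "elapsed": 2.0, "validated": True, "outlier": True}
history = [dict(normal) for _ in range(10)] + [short]

apr_with_outlier = average_processing_rate(history, skip_outliers=False)
apr_without_outlier = average_processing_rate(history, skip_outliers=True)
```

In this toy history the one 2-second task raises the averaged rate by more than a factor of 100, which is the kind of distortion that flagging outliers in the validator is meant to prevent.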
Joined: 28 Oct 11 Posts: 179 Credit: 223,809,830 RAC: 112,967 |
Eric, once you've got the code updated and the <rsc_fpops_est> value normalised, you may find it helpful to run app_reset.php (I believe that can be run from the admin web interface): "// script for resetting an app's credit and runtime estimation statistics" - though it might be a good idea to consult David Anderson for advice first. IIRC, that script appeared in the repository at changeset [23836] after AQUA had problems with job estimates earlier this summer. |
Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0 |
|
Joined: 8 Jul 11 Posts: 1323 Credit: 410,841,106 RAC: 246,290 |
"Eric, once you've got the code updated and the ..." Thanks again Richard! I wasn't even aware of the app_reset. |
Joined: 28 Oct 11 Posts: 179 Credit: 223,809,830 RAC: 112,967 |
LOL - this is what happens: "29/10/2011 19:08:07 | NumberFields@home | Scheduler request completed: got 21 new tasks" followed by "29/10/2011 19:08:07 | NumberFields@home | [sched_op] estimated total CPU task duration: 5348 seconds", so 4 minutes 14 seconds each. Let's see how that works out! That's on my host 1289 - an i5 laptop, currently rated (by APR) at over 7 teraflops per core. I'll check <rsc_fpops_bound>. [Edit - they all have a 'bound' 100x the 'est' - so a little over 7 hours. That may or may not be enough.] By the way, when I joined the project, I used the URL on the front page (http://numberfields.asu.edu/NumberFields), but now I see: "29/10/2011 19:08:07 | NumberFields@home | You used the wrong URL for this project. When convenient, remove this project, then add http://stat.la.asu.edu/NumberFields/". I'm not at home this weekend, but I'll sort that out when I get home. Perhaps the front page could be updated sometime? |
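The arithmetic behind those two figures can be checked quickly. The numbers come straight from the log lines and the edit note above; the variable names are only for illustration, and the calculation assumes <flops> stays unchanged between the estimate and the limit.

```python
total_estimate_s = 5348   # "[sched_op] estimated total CPU task duration"
tasks = 21                # "got 21 new tasks"

per_task_s = total_estimate_s / tasks            # about 254.7 seconds
minutes, seconds = divmod(int(per_task_s), 60)   # 4 minutes, 14 seconds

bound_factor = 100        # 'bound' is 100x the 'est', per the edit note
limit_hours = per_task_s * bound_factor / 3600   # a little over 7 hours
```

Because both the estimate and the limit are divided by the same <flops>, the limit is simply 100 times the per-task estimate, whatever rate the host is currently credited with.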
Joined: 8 Jul 11 Posts: 1323 Credit: 410,841,106 RAC: 246,290 |
That's my fault. After I created the project using the stat URL, it was decided to create an alias, numberfields.asu.edu (everyone thought it sounded better). Both URLs point to the same place, so I didn't think to correct them in all places. I'll modify the URLs in config.xml. I think that should fix the problem. |
Joined: 8 Jul 11 Posts: 1323 Credit: 410,841,106 RAC: 246,290 |
I upgraded the server to the latest trunk version to get the runtime_outlier capability and I modified the validator to make use of it. Everything appears to be working smoothly after the upgrade. |
Joined: 28 Oct 11 Posts: 179 Credit: 223,809,830 RAC: 112,967 |
"I upgraded the server to the latest trunk version to get the runtime_outlier capability and I modified the validator to make use of it. Everything appears to be working smoothly after the upgrade." Thanks for your hard work over the weekend - I'm sure the really important bits are under the hood, but I see you've picked up some of the interface improvements (like counts on task lists) along the way. I'm still working through some of the side effects of the early estimates - at one stage, my DCF on one host reached the limit of 100, which forces continual 'high priority' running and restricts work fetch to 1 second at a time - both to be expected. It'll be fun watching how the system recovers from here, and what figures I end up with. Currently DCF ranges from 1.68 to 41.27 across four (similar) machines, and APR from 228 to 1645. The good news is that I've only had to adjust <rsc_fpops_bound> on a very few tasks to remain error-free, and none of the tasks received since this morning have needed even that little helping hand. Since we have mathematicians round these parts, it will also be interesting, in an abstract number-theory sort of way, to see how CreditNew awards credit. To start with, I was getting ~60 credits/core/hour - a little high by historical standards, but near enough. Then it went up to ~600 cr/c/h, and then back down again to ~200 cr/c/h. I'll keep monitoring.... |
Joined: 2 Sep 11 Posts: 57 Credit: 1,274,345 RAC: 0 |
My credit/core/hour is ~100 for tasks with an elapsed time of ~2 hours and ~150 for tasks of 1 hour duration or less. My numbers have had more time to stabilize, so it's not surprising they're lower than Richard's. |