Maximum Elapsed Time Exceeded (Win 32bit)

Richard Haselgrove

Joined: 28 Oct 11
Posts: 179
Credit: 223,809,830
RAC: 112,967
Message 285 - Posted: 28 Oct 2011, 18:16:06 UTC

Josef has checked over my previous post and given it a clean bill of health, but suggests a couple of additional points.

Another document you should be aware of is Job runtime estimation.

In his most recent post, Eric said

To summarize, the equation for maximum elapsed time is:

<duration_correction_factor> * <rsc_fpops_bound> / <flops>

where all 3 parameters can be found in client_state.xml (DCF is in the project section, <flops> is in the app_version section, and <rsc_fpops_bound> is in the workunit section)

Joe points out that this is a common misconception: in the function ACTIVE_TASK::init(RESULT* rp) (around line 230 of client/app.cpp in current sources), there is

max_elapsed_time = rp->wup->rsc_fpops_bound/rp->avp->flops;

- no mention of DCF, just <rsc_fpops_bound> and <flops>

As I noted before, <flops> is liable to be reset at every scheduler contact - the only one it's safe to work with is <rsc_fpops_bound>.
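A minimal sketch of the calculation Joe describes, mirroring the quoted line from client/app.cpp. The numeric values, and the helper name `max_elapsed_time`, are hypothetical and chosen only for illustration; they are not taken from any real client_state.xml.

```python
def max_elapsed_time(rsc_fpops_bound: float, flops: float) -> float:
    """Elapsed-time limit in seconds: rsc_fpops_bound / flops (no DCF term)."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    return rsc_fpops_bound / flops


# Hypothetical workunit bound and two hypothetical <flops> values,
# standing in for the figure before and after a scheduler contact resets it.
rsc_fpops_bound = 1.0e15
flops_before = 2.5e9
flops_after = 2.5e11  # inflated speed estimate (assumed for illustration)

limit_before = max_elapsed_time(rsc_fpops_bound, flops_before)
limit_after = max_elapsed_time(rsc_fpops_bound, flops_after)

# The common misconception multiplies by DCF; shown here only for contrast.
dcf = 2.0  # hypothetical
misconceived_limit = dcf * rsc_fpops_bound / flops_before

print(f"limit before: {limit_before:.0f} s")
print(f"limit after:  {limit_after:.0f} s")
print(f"with DCF (not what the client does): {misconceived_limit:.0f} s")
```

With these made-up numbers, a hundredfold jump in <flops> cuts the allowed time by the same factor, which is why <rsc_fpops_bound> is the only one of the two that is safe to adjust.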
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1323
Credit: 410,841,106
RAC: 246,290
Message 286 - Posted: 28 Oct 2011, 18:38:13 UTC - in response to Message 285.  

Thanks Richard! That helps a bunch.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1323
Credit: 410,841,106
RAC: 246,290
Message 287 - Posted: 28 Oct 2011, 18:42:52 UTC - in response to Message 286.  

The one safeguard is that APR is allowed to average over the first 10 validated tasks, before it is used to correct task runtime estimates. The 'validated' task requirement excludes any abnormally low runtimes resulting from errors, but until recently, valid but 'outlier' tasks like my 2-second runs were included. We're just starting to work with David on correcting that oversight - changeset [24225] - but I doubt it's active in your code here yet, unless you've modified your validator to make use of the new facility.


A few days ago I did upgrade the server from the "server stable" branch to the core release 6_12 version (I thought the newer server might help with this problem). So I guess I do have the new facility; I just need to modify the validator.
Richard Haselgrove

Joined: 28 Oct 11
Posts: 179
Credit: 223,809,830
RAC: 112,967
Message 289 - Posted: 28 Oct 2011, 20:55:57 UTC

Eric, once you've got the code updated and the <rsc_fpops_est> value normalised, you may find it helpful to run app_reset.php (I believe that can be run from the admin web interface):

// script for resetting an app's credit and runtime estimation statistics;
// use this if these got messed up because of bad FLOPs estimates

- though it might be a good idea to consult David Anderson for advice first.

IIRC, that script appeared in the repository at changeset [23836] after AQUA had problems with job estimates earlier this summer.
Dagorath
Joined: 2 Sep 11
Posts: 57
Credit: 1,274,345
RAC: 0
Message 290 - Posted: 28 Oct 2011, 23:24:24 UTC - in response to Message 289.  

Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1323
Credit: 410,841,106
RAC: 246,290
Message 291 - Posted: 28 Oct 2011, 23:51:34 UTC - in response to Message 289.  

Eric, once you've got the code updated and the <rsc_fpops_est> value normalised, you may find it helpful to run app_reset.php (I believe that can be run from the admin web interface):

// script for resetting an app's credit and runtime estimation statistics;
// use this if these got messed up because of bad FLOPs estimates

- though it might be a good idea to consult David Anderson for advice first.

IIRC, that script appeared in the repository at changeset [23836] after AQUA had problems with job estimates earlier this summer.



Thanks again Richard! I wasn't even aware of the app_reset.
Richard Haselgrove

Joined: 28 Oct 11
Posts: 179
Credit: 223,809,830
RAC: 112,967
Message 292 - Posted: 29 Oct 2011, 18:24:05 UTC
Last modified: 29 Oct 2011, 18:26:08 UTC

LOL - this is what happens:

29/10/2011 19:08:07 | NumberFields@home | Scheduler request completed: got 21 new tasks
29/10/2011 19:08:07 | NumberFields@home | [sched_op] estimated total CPU task duration: 5348 seconds

so - 4 minutes 14 seconds each. Let's see how that works out!

That's on my host 1289 - an i5 laptop, currently rated (by APR) at over 7 teraflops per core. I'll check <rsc_fpops_bound>. [Edit - they all have a 'bound' 100x the 'est' - so a little over 7 hours. That may or may not be enough]
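The arithmetic behind those two figures can be checked with a short sketch. It assumes, as the post does, that the elapsed-time limit is simply 100 times the per-task estimate; if the displayed estimate includes a duration correction factor other than 1, the real limit would differ. The variable names are illustrative only.

```python
# Figures taken from the log lines above.
total_estimate_s = 5348  # "estimated total CPU task duration"
tasks = 21               # "got 21 new tasks"

per_task_s = total_estimate_s / tasks
minutes, seconds = divmod(int(per_task_s), 60)

# Assumption from the post: each task's 'bound' is 100x its 'est'.
bound_factor = 100
limit_hours = per_task_s * bound_factor / 3600

print(f"per task: {minutes} min {seconds} s")
print(f"limit:    {limit_hours:.2f} hours")
```

Any task that genuinely needs longer than that limit on this host would be aborted, which is the risk the post is pointing at.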

By the way, when I joined the project, I used the url on the front page (http://numberfields.asu.edu/NumberFields), but now I see

29/10/2011 19:08:07 | NumberFields@home | You used the wrong URL for this project. When convenient, remove this project, then add http://stat.la.asu.edu/NumberFields/

I'm not at home this weekend, but I'll sort that out when I get home. Perhaps the front page could be updated sometime?
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1323
Credit: 410,841,106
RAC: 246,290
Message 293 - Posted: 29 Oct 2011, 19:37:41 UTC - in response to Message 292.  


By the way, when I joined the project, I used the url on the front page (http://numberfields.asu.edu/NumberFields), but now I see

29/10/2011 19:08:07 | NumberFields@home | You used the wrong URL for this project. When convenient, remove this project, then add http://stat.la.asu.edu/NumberFields/

I'm not at home this weekend, but I'll sort that out when I get home. Perhaps the front page could be updated sometime?


That's my fault. After I created the project using the stat url, it was decided to create an alias numberfields.asu.edu (everyone thought it sounded better). Both urls point to the same place, so I didn't think to correct them in all places. I'll modify the urls in the config.xml. I think that should fix the problem.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1323
Credit: 410,841,106
RAC: 246,290
Message 298 - Posted: 31 Oct 2011, 4:01:44 UTC - in response to Message 293.  

I upgraded the server to the latest trunk version to get the runtime_outlier capability and I modified the validator to make use of it. Everything appears to be working smoothly after the upgrade.
Richard Haselgrove

Joined: 28 Oct 11
Posts: 179
Credit: 223,809,830
RAC: 112,967
Message 302 - Posted: 31 Oct 2011, 17:11:18 UTC - in response to Message 298.  

I upgraded the server to the latest trunk version to get the runtime_outlier capability and I modified the validator to make use of it. Everything appears to be working smoothly after the upgrade.

Thanks for your hard work over the weekend - I'm sure the really important bits are under the hood, but I see you've picked up some of the interface improvements (like counts on task lists) along the way.

I'm still working through some of the side effects of the early estimates - at one stage, my DCF on one host reached the limit of 100, which forces continual 'high priority' running and restricts work fetch to 1 second at a time - both to be expected. It'll be fun watching how the system recovers from here, and what figures I end up with. Currently DCF ranges from 1.68 to 41.27 across four (similar) machines, and APR from 228 to 1645. The good news is that I've only had to adjust <rsc_fpops_bound> on a very few tasks to remain error-free, and none of the tasks received since this morning have needed even that little helping hand.

Since we have mathematicians round these parts, it will also be interesting, in an abstract number-theory sort of way, to see how CreditNew awards credit. To start with, I was getting ~60 credits/core/hour - a little high by historical standards, but near enough. Then it went up to ~600 cr/c/h, and then back down again to ~200 cr/c/h. I'll keep monitoring....
Dagorath
Joined: 2 Sep 11
Posts: 57
Credit: 1,274,345
RAC: 0
Message 303 - Posted: 31 Oct 2011, 17:52:31 UTC - in response to Message 302.  

My credit/core/hour is ~100 for tasks with an elapsed time of ~2 hours and ~150 for tasks of 1 hour or less. My numbers have had more time to stabilize, so it's not surprising they're lower than Richard's.

BOINC FAQ Service
Official BOINC wiki
Installing BOINC on Linux

Copyright © 2024 Arizona State University