Posts by Dagorath

21) Message boards : Number crunching : Use of HIGH PRIORITY (Message 365)
Posted 20 Nov 2011 by Dagorath
Post:
Other than that, I'm not sure what else could cause this.


In my case the cause is that the initial estimate of time to completion is way too high 99% of the time. All my WUs come in with estimates of 60-120 hours and often go into panic mode. Once they've crunched a bit, the estimated remaining run time drops to a sensible time and they drop out of hi pri (apart from a couple of WUs that really do look likely to run for 4 or 5 days).

Al.


That is exactly what *was* happening on both of my computers too. Then I stumbled upon a post from John McLeod VII that explained that the scheduler tends to run projects that have a very low resource share at high priority. Sure enough, I had given NumberFields and a few other projects very low resource shares and all those projects were *always* running at high priority. I increased the share for those projects a little and now they run at low priority unless, as you said, they get close to the deadline.

Yes, I know it doesn't make sense that a very low project share will induce high priority but it can (JM7 explained precisely why it can) and it did on my computers. Perhaps that is the problem on your computer(s) too.

@Eric,

I don't think the client knows about the grace period so I don't think extending that to 7 days had much effect.
22) Message boards : Number crunching : Use of HIGH PRIORITY (Message 357)
Posted 20 Nov 2011 by Dagorath
Post:
I got to thinking about your previous statement

Please restrict the use of HIGH PRIORITY mode for tasks, that are not in danger of missing the deadline.


How can the server reliably know whether any given task is in danger of missing the deadline? It can't. Only your client can know and that is why your client, not the server, makes the decision to run a task at high priority.

If SETI does not run at high priority then perhaps it does not have a very low resource share. There could be other reasons why SETI doesn't run at high priority; the reason I gave isn't the *only* reason.

If you would be so kind as to post the following information, perhaps someone can provide a better explanation:

1) your "Connect about every __ days" setting
2) your "Additional work buffer __ days" setting
3) a list of your projects and the resource shares assigned to each project
23) Message boards : Number crunching : Use of HIGH PRIORITY (Message 354)
Posted 20 Nov 2011 by Dagorath
Post:
The server does not and cannot designate tasks as high priority. It doesn't happen at this project or any other project. The BOINC client running on your computer, not the server, decides whether tasks run at high priority or normal priority.

Projects that you have given a very low resource share will tend to run at high priority when they should not. If you have projects that fit that description then try increasing their project share slightly.
24) Message boards : Number crunching : Massive drop of credits per CPU hour (Message 344)
Posted 17 Nov 2011 by Dagorath
Post:
They are not safe. I steal some every day.
25) Message boards : Number crunching : open source? (Message 337)
Posted 14 Nov 2011 by Dagorath
Post:
There was a fellow named akosf who was analyzing project binaries with a hex editor (he claimed) and optimizing. He was getting huge performance gains for some projects without even having the source. Amazing!!!

Another fellow, Crunch3r, was taking source from projects and recompiling it for them on his very expensive Intel compiler and getting huge performance gains too.

I haven't seen either of those guys around for a while.

26) Message boards : Number crunching : FATAL: Kernel too old (Message 326)
Posted 9 Nov 2011 by Dagorath
Post:
I don't know what the minimum is but 2.6.35 works here.
27) Message boards : Number crunching : Had to abort some WUs (Message 321)
Posted 4 Nov 2011 by Dagorath
Post:
Thanks, this looks like a reasonable thing to do. I'm a little confused by the documentation though. What would I put for <reliable_on_priority> to enable: 1?


I think the number depends on the base priority which I assume would be the priority assigned to all the WUs in any given batch of WUs. Exactly where (in which script) the base priority is assigned I don't know.

Let's assume base priority is 0. If you set <reliable_on_priority> = 0 then all tasks in the batch would be sent only to fast-reliable hosts which would prevent slow hosts from receiving any work from that batch which is not what you want to do. So you would set <reliable_on_priority> to at least 1. If set to 1 then you need to set <reliable_priority_on_over> to at least 1 because when a task fails, the value of <reliable_priority_on_over> is added to the base priority.

Suppose base priority is 10. If you set <reliable_priority_on_over> = 4 then when a task fails the resend would be assigned priority 14. In that case you would set <reliable_on_priority> greater than 10 but not greater than 14.

Mind you I have never configured this mechanism before, this is just my interpretation of the docs. To test if your configuration has the desired effect just abort a task and watch the resend to see if it goes to a host that meets the criteria for fast-reliable.
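
The priority arithmetic above can be sketched in a few lines of Python. This is only an illustration of the interpretation given in this post, not BOINC server code: both function names are hypothetical, and the rule that a task is restricted to fast-reliable hosts when its priority is greater than or equal to <reliable_on_priority> is an assumption inferred from the examples above.

```python
def resend_priority(base_priority, reliable_priority_on_over):
    """Priority of a resend: the post says the increment is added to the base."""
    return base_priority + reliable_priority_on_over


def goes_to_reliable_hosts(task_priority, reliable_on_priority):
    """Assumed rule: tasks at or above the threshold go only to fast-reliable hosts."""
    return task_priority >= reliable_on_priority


base = 10        # priority assigned to every task in the batch (example value)
increment = 4    # <reliable_priority_on_over>
threshold = 11   # <reliable_on_priority>: greater than 10, not greater than 14

resend = resend_priority(base, increment)
print("first send restricted:", goes_to_reliable_hosts(base, threshold))
print("resend priority:", resend)
print("resend restricted:", goes_to_reliable_hosts(resend, threshold))
```

With these example values a first send stays available to all hosts while a resend at priority 14 is restricted, which is the behaviour the configuration is aiming for.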
28) Message boards : Number crunching : Had to abort some WUs (Message 319)
Posted 3 Nov 2011 by Dagorath
Post:
Greg,

Here are the docs regarding resends (retries).

I would suggest setting the <reliable_reduced_delay_bound> to 1 rather than less than 1. A value less than 1 reduces the deadline on the resend to something less than the normal 3 day deadline. If the task is being resent because it's a 100 hour task that missed deadline it might not be wise to reduce the deadline, even if it will be sent to a fast-reliable host. You might even consider setting <reliable_reduced_delay_bound> to 1.33 to increase the resend's deadline from 3 days to 3.99.
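
As a quick check of that arithmetic, here is a minimal Python sketch. It assumes, as this post does, that <reliable_reduced_delay_bound> simply multiplies the normal deadline; the function name is made up for illustration.

```python
def resend_deadline_days(normal_deadline_days, reliable_reduced_delay_bound):
    """Deadline of a resend, assuming the setting is a plain multiplier."""
    return normal_deadline_days * reliable_reduced_delay_bound


# Compare a reduced, an unchanged and an extended resend deadline.
for factor in (0.5, 1.0, 1.33):
    days = resend_deadline_days(3, factor)
    print(f"factor {factor}: {days:.2f} days ({days * 24:.1f} hours)")
```

With a factor of 1.33 the 3 day deadline becomes 3.99 days, which is a little under 96 hours.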
29) Message boards : Number crunching : Had to abort some WUs (Message 314)
Posted 2 Nov 2011 by Dagorath
Post:
wu_12E10_SF161-1_Idx3_Grp57438of59290
getting stuck. aborted after 70 hours

Looks like this wu has some problems.
http://numberfields.asu.edu/NumberFields/result.php?resultid=258591

Looks even worse like this:

http://numberfields.asu.edu/NumberFields/workunit.php?wuid=172008

:-(


One of the sad things about that WU is that it bounced around from 1 slow computer to another before ending up on a host with a fairly fast processor and short turnaround time. With the possibility of 100 hour run times and 3 day deadlines, it might be wise to start using the "issue task resends to fast-reliable hosts" mechanism but not reduce the deadline for resends. It might even make sense to extend the deadline for resends because the initial deadline is rather short and resends seem to be happening because the user aborts the task because it has no hope of finishing on time or the task simply times out.

Unfortunately the server documentation pages are unavailable at this time. I'll post a link later when the pages are back online.
30) Message boards : Number crunching : Maximum Elapsed Time Exceeded (Win 32bit) (Message 303)
Posted 31 Oct 2011 by Dagorath
Post:
My credit/core/hour is ~100 for tasks with elapsed time of ~2 hours and ~150 for tasks of 1 hour duration or less. My numbers have had more time to stabilize so it's not surprising they're lower than Richard's.
31) Message boards : Number crunching : Maximum Elapsed Time Exceeded (Win 32bit) (Message 290)
Posted 28 Oct 2011 by Dagorath
Post:
Many thanks, Richard :-)
32) Message boards : Number crunching : Maximum Elapsed Time Exceeded (Win 32bit) (Message 280)
Posted 28 Oct 2011 by Dagorath
Post:
@ Eric,

When we started the long tasks, did you increase the <flops> or <rsc_fpops_est> figures?


Yes, I increased both the <rsc_fpops_est> and the <rsc_fpops_bound>. The bound is now on the order of 10^19, which is way higher than it needs to be. Even with this, users are still getting the "maximum elapsed time exceeded" error. Some users error out after just 1 hour, some after 10 hours, some return results after 30+ hours without any problem. There must be another way to control this, as the fpops est/bound doesn't seem to be helping.


I'm not an expert on this but I think the crazy DCF values and the maximum elapsed time exceeded errors are unrelated. In other words, I doubt the high DCFs are causing tasks to exceed the max time. I'll ask Richard Haselgrove and Ageless to stop in and have a look at what's going on. Both of them know BOINC inside out.

I agree with frank, the variation in run times is driving BOINC crazy but I can't see that causing excessive run times. At worst, crazy DCFs can screw up the scheduler and cause tasks to go into high priority mode and such but if all cores are being used, tasks are not missing deadline and resource shares are being honored then there's no real problem, IMHO.

I got some tasks that started with completion time of ~6,000 hours with DCF at 35 but the time remaining plummeted like a manhole cover tossed out of an airplane. They went to high priority mode but no problem, they ran to completion.

33) Message boards : Number crunching : Maximum Elapsed Time Exceeded (Win 32bit) (Message 274)
Posted 27 Oct 2011 by Dagorath
Post:
What is your DCF for NumberFields? I bet it's close to 40. Mine is 3.2 on one computer, 31 on the other.

The reason our DCFs are crazy is because for weeks the tasks were quite short. Now the tasks are much longer so BOINC has increased the DCF to avoid downloading work we cannot complete in time. The algorithm that increases the DCF increases it very quickly and decreases it very slowly.

My computer that has DCF = 31 had DCF = 52 a few days ago so it *is* decreasing, as it should, as BOINC "learns" that the long tasks are the new norm.
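
The "up fast, down slow" behaviour can be illustrated with a toy update rule in Python. This is a sketch under stated assumptions, not the BOINC client's actual algorithm: the function name, the 0.1 decay constant and the exact rule are invented here purely to show the asymmetry.

```python
def update_dcf(dcf, actual_hours, raw_estimate_hours, decay=0.1):
    """Toy duration correction factor update (illustrative constants only)."""
    ratio = actual_hours / raw_estimate_hours
    if ratio > dcf:
        # Task ran longer than the corrected estimate: jump straight up.
        return ratio
    # Task ran shorter: move only a small step toward the observed ratio.
    return dcf + decay * (ratio - dcf)


dcf = 1.0
# One unexpectedly long task pushes the DCF up in a single step.
dcf = update_dcf(dcf, actual_hours=52.0, raw_estimate_hours=1.0)
history = [dcf]
# Later tasks that run "only" 30x the raw estimate pull it down gradually.
for _ in range(5):
    dcf = update_dcf(dcf, actual_hours=30.0, raw_estimate_hours=1.0)
    history.append(round(dcf, 2))
print(history)
```

In this toy model a single long task is enough to raise the factor, while it takes many completed tasks to bring it back down, which matches the slow drift from 52 toward 31 described above.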

If your DCF is over 100 you might consider adjusting it downward manually. On the other hand, if BOINC is keeping all your cores busy and your tasks are returning before deadline then it's basically doing its job. A super high estimation of completion time might be shocking but isn't necessarily harmful.

@ Eric,

When we started the long tasks, did you increase the <flops> or <rsc_fpops_est> figures?
34) Message boards : Number crunching : Deadline too short ? (Message 264)
Posted 24 Oct 2011 by Dagorath
Post:
I think there is quite a bit that can be done. When Eric posted this message a few days ago I knew what has happened would happen. Most crunchers probably don't know the grace period was extended and most probably don't even know what the grace period is. Most crunchers don't have time to read forums regularly or learn much about BOINC. What they know is what they see in front of them. JohnMD saw what he saw, did a little math and came up with the logical conclusion that his task wasn't going to return before deadline. Now if the news about the grace period extension had been put in the news section on the front page, he might have spotted it but most people likely would not. Anyway, much to his credit, he asked a question in the forum but the answer came too late, he had already aborted the task. I likely would have done the same thing in his situation.

When a batch of WUs is known to have tasks that can run 100+ hours then the real deadline should be 100+ hours plus 24 hours buffer time because it can take the scheduler 24 hours (worst case scenario) to report a completed task. That way volunteers can see the deadline and make an informed decision. Extending the grace period to make up for a too short deadline is not wise, as we have seen. You can bet your next paycheque that others will be aborting long tasks for the same reason.
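
To put numbers on that suggestion, here is a small Python sketch. The function name is hypothetical and the 24 hour reporting buffer is the worst case figure quoted above, not a value read from any BOINC setting.

```python
def visible_deadline_hours(longest_run_hours, report_buffer_hours=24):
    """Deadline that covers the longest expected run plus the reporting delay."""
    return longest_run_hours + report_buffer_hours


current_deadline_hours = 3 * 24              # the existing 3 day deadline
needed_hours = visible_deadline_hours(100)   # 100 hour task plus 24 hour buffer

print(f"current deadline: {current_deadline_hours} hours")
print(f"needed deadline:  {needed_hours} hours ({needed_hours / 24:.2f} days)")
```

A 100 hour task needs a deadline of at least 124 hours, a bit more than 5 days, compared with the 72 hours that volunteers currently see.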

The fact that the maximum elapsed time was also too short is another matter and should not be confused with the too short deadline.

Eric, if you're worried about the length of time it takes to finish a batch of WUs then please consider using other available server side mechanisms rather than leaving the deadline at 3 days when some tasks will take 100+ hours. For example, you can use task priority flags and the fast-reliable host settings to make sure resent tasks (tasks to replace tasks that return an error or miss deadline) are sent to hosts that meet user configurable criteria for task turnaround time and reliability (reliable = few compute errors and results validate). I believe you can even specify that *all* tasks go to fast-reliables but I'm not sure. You can read about it here.

BTW, even slow computers can be deemed fast-reliable if they keep a small cache, return very few errors and produce results that nearly always validate. Fast does not mean fast CPU in this case. Fast means short turn around time which is not hard to do if one keeps a small cache.
35) Message boards : Number crunching : Maximum disk usage exceeded (Linux 64bit) (Message 237)
Posted 5 Oct 2011 by Dagorath
Post:
I haven't seen any "disk usage exceeded" errors since the fix. I haven't seen any "Error in the PARI system" errors lately either. Did you kill 2 birds with 1 stone or was there a fix for the PARI problem I didn't see?
36) Message boards : Number crunching : Maximum disk usage exceeded (Linux 64bit) (Message 229)
Posted 23 Sep 2011 by Dagorath
Post:
I suspended and resumed 4 NumberFields tasks several times as promised in my last post. None of those exceeded the disk limit.

I found 2 tasks that exceeded disk limit yesterday. Here are the telltale lines from BOINC manager's Event Log:

Tue 20 Sep 2011 10:39:42 PM MDT NumberFields@home Aborting task wu_12E10_SF-3-0_Idx6_Grp11214of13668_0: exceeded disk limit: 2470.40MB > 122.07MB
Tue 20 Sep 2011 10:39:42 PM MDT NumberFields@home Aborting task wu_12E10_SF-3-0_Idx6_Grp11215of13668_0: exceeded disk limit: 2523.01MB > 122.07MB
Those are tasks 85384 and 85383.
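
For anyone who wants to scan a saved Event Log for these aborts, a minimal Python sketch follows. The regular expression is written against the two lines quoted above and is only an assumption about the message format; other BOINC versions may word the message differently.

```python
import re

# Matches the "exceeded disk limit" abort message as it appears in this post.
DISK_ABORT = re.compile(
    r"Aborting task (?P<task>\S+): exceeded disk limit: "
    r"(?P<used>[\d.]+)MB > (?P<limit>[\d.]+)MB"
)


def parse_disk_abort(line):
    """Return (task name, MB used, MB allowed), or None if the line does not match."""
    match = DISK_ABORT.search(line)
    if match is None:
        return None
    return match.group("task"), float(match.group("used")), float(match.group("limit"))


line = (
    "Tue 20 Sep 2011 10:39:42 PM MDT NumberFields@home Aborting task "
    "wu_12E10_SF-3-0_Idx6_Grp11214of13668_0: exceeded disk limit: 2470.40MB > 122.07MB"
)
task, used_mb, limit_mb = parse_disk_abort(line)
print(task, used_mb, limit_mb, f"{used_mb / limit_mb:.1f}x over the limit")
```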

I doubt the above 2 were ever suspended. They were running when I shut down BOINC and they failed just 5 minutes after I restarted BOINC. Maybe restarting has the same negative effect as suspending/resuming? One other NumberFields task restarted at the same time as those two but it did not fail.

I received 7 fresh NumberFields tasks and tried to crash them with many suspends and resumes and BOINC restarts but they all completed error free.
37) Message boards : Cafe : Was Albert wrong? (Message 228)
Posted 22 Sep 2011 by Dagorath
Post:
CERN scientists think they have observed particles traveling faster than the speed of light.
38) Message boards : Number crunching : Maximum disk usage exceeded (Linux 64bit) (Message 227)
Posted 22 Sep 2011 by Dagorath
Post:
I have the "leave tasks in memory" option enabled and have just suspended and resumed a few NumberFields tasks. I'll let you know if any exceed the disk limit.

I've had my "switch between tasks" time set at 6 hours so I suspect NumberFields tasks have mostly been running start to finish with no suspension. Maybe that's why I've not seen this problem.
39) Message boards : Number crunching : Process got signal 11 (Message 221)
Posted 21 Sep 2011 by Dagorath
Post:
Nevertheless, I wish I could run the project now.
I don't even get BOINC running on 11.4 and I don't know what's wrong.
I always installed BOINC with the sh command on the desktop and then ran BOINC and BOINCmgr by clicking on the icons.
But it looks like this doesn't work anymore with 11.4, not even with a startup script which I already tried.
On the sys monitor I can see that BOINC remains in memory when I click it but the BOINCmanager itself simply doesn't start. Rights etc. are all set up correctly.
I don't get it...


If you installed BOINC 6.12.x you can run into this problem. The client and manager binaries are no longer static builds so you can be missing some shared libraries, depending on your Linux distro and version.

It sounds like BOINC client (boinc in sys monitor) is OK for you. It's just BOINC manager that needs some additional shared libraries. Do you know how to identify which libraries it needs and how to find and install them?

Another, easier, fix for this problem is to use the manager from BOINC 6.10.58. Or install BOINC from repos but you don't want to do that if you use a GPU for crunching.
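
One common way to identify missing shared libraries on Linux is the ldd command, which prints "not found" next to any library the loader cannot locate. The Python sketch below wraps that idea; the helper names and the sample output are made up for illustration (libexample.so.1 is not a real BOINC dependency), and the path to the manager binary depends on where you installed it.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on a binary and list its unresolved shared libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)


# Hypothetical ldd output, used here so the example runs without BOINC installed.
SAMPLE = (
    "\tlinux-vdso.so.1 (0x00007ffd)\n"
    "\tlibexample.so.1 => not found\n"
    "\tlibc.so.6 => /lib/libc.so.6 (0x00007f00)\n"
)
print(missing_libraries(SAMPLE))
```

To check a real installation you would call check_binary with the path to the manager binary and then install whichever packages provide the libraries it lists.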
40) Message boards : Number crunching : Maximum disk usage exceeded (Linux 64bit) (Message 220)
Posted 21 Sep 2011 by Dagorath
Post:
I am getting a similar problem here. It appears to only happen on long WUs

[NumberFields@home] Aborting task wu_12E10_SF-3-0_Idx6_Grp11671of13668_0: exceeded disk limit: 3361.05MB > 122.07MB


Any way to fix this on my side? I'm on Linux 64 bit and my BOINC has 50 GB at its disposal


From my client_state.xml:
<workunit>
<name>wu_12E10_SF-3-0_Idx7_Grp4311of14586</name>
<app_name>GetBoundedDecics</app_name>
<version_num>107</version_num>
<rsc_fpops_est>20000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>2000000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>128000000.000000</rsc_memory_bound>
<rsc_disk_bound>128000000.000000</rsc_disk_bound>
<file_ref>
<file_name>wu_12E10_SF-3-0_Idx7_Grp4311of14586.dat</file_name>
<open_name>in</open_name>
</file_ref>
</workunit>


The 122.07MB mentioned in the error message is the 128,000,000B rsc_disk_bound. The admins need to increase rsc_disk_bound to at least 3361.05MB (as implied by the error message). 2 X that would not be unreasonable unless this all points to a flaw in the application.
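
The conversion can be checked with a few lines of Python. The sketch assumes the client reports "MB" as 1024 x 1024 bytes, which is what makes 128,000,000 bytes come out as 122.07; the helper names are hypothetical.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: the log's "MB" means 2**20 bytes


def bytes_to_mb(num_bytes):
    return num_bytes / BYTES_PER_MB


def mb_to_bytes(megabytes):
    return megabytes * BYTES_PER_MB


# The rsc_disk_bound from client_state.xml, expressed the way the log shows it.
print(round(bytes_to_mb(128_000_000), 2))  # 122.07

# Smallest bound that would have covered the 3361.05MB reported in the error,
# and twice that for some headroom.
needed = mb_to_bytes(3361.05)
print(f"{needed:.0f} bytes, or {2 * needed:.0f} bytes with 2x headroom")
```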



Copyright © 2024 Arizona State University