Maximum disk usage exceeded (Linux 64bit)

nenym
Joined: 23 Aug 11
Posts: 2
Credit: 10,004,759
RAC: 0
Message 190 - Posted: 12 Sep 2011, 15:28:27 UTC

All my unfinished tasks errored out after a machine reboot. AMD X6 1090T, Ubuntu 10.04.3 64-bit.
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Maximum disk usage exceeded
</message>
<stderr_txt>

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,718,038
RAC: 288,078
Message 192 - Posted: 12 Sep 2011, 22:16:32 UTC - in response to Message 190.  

Hmm... I'm not sure what could have happened. If it keeps happening, you might have to reset the project.

microchip
Joined: 3 Sep 11
Posts: 11
Credit: 131,755
RAC: 0
Message 218 - Posted: 21 Sep 2011, 14:09:14 UTC
Last modified: 21 Sep 2011, 14:10:07 UTC

I am getting a similar problem here. It appears to happen only on long WUs.

[NumberFields@home] Aborting task wu_12E10_SF-3-0_Idx6_Grp11671of13668_0: exceeded disk limit: 3361.05MB > 122.07MB

Is there any way to fix this on my side? I'm on 64-bit Linux and my BOINC has 50 GB at its disposal.

Dagorath
Joined: 2 Sep 11
Posts: 57
Credit: 1,274,345
RAC: 0
Message 220 - Posted: 21 Sep 2011, 19:28:19 UTC - in response to Message 218.  

From my client_state.xml:
<workunit>
    <name>wu_12E10_SF-3-0_Idx7_Grp4311of14586</name>
    <app_name>GetBoundedDecics</app_name>
    <version_num>107</version_num>
    <rsc_fpops_est>20000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>2000000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>128000000.000000</rsc_memory_bound>
    <rsc_disk_bound>128000000.000000</rsc_disk_bound>
    <file_ref>
        <file_name>wu_12E10_SF-3-0_Idx7_Grp4311of14586.dat</file_name>
        <open_name>in</open_name>
    </file_ref>
</workunit>


The 122.07 MB in the error message is the 128,000,000-byte rsc_disk_bound, expressed in units of 1024 × 1024 bytes. The admins need to raise rsc_disk_bound to at least 3361.05 MB, as the error message implies. Twice that would not be unreasonable, unless all of this points to a flaw in the application.
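
As a rough check of that arithmetic, here is a minimal C sketch. It assumes the client reports "MB" in units of 1024 × 1024 bytes, which is what makes 128,000,000 bytes come out as 122.07. The function names are made up for illustration and are not part of BOINC.

```c
#include <stdio.h>

/* Assumption: "MB" in the client's message means 1024 * 1024 bytes. */
#define BYTES_PER_MB (1024.0 * 1024.0)

/* Hypothetical helper: convert a byte count to the MB figure shown in the log. */
double bytes_to_mb(double bytes)
{
    return bytes / BYTES_PER_MB;
}

/* Hypothetical helper: convert an MB figure from the log back to bytes. */
double mb_to_bytes(double mb)
{
    return mb * BYTES_PER_MB;
}

/* Print the configured bound and the disk usage reported at abort time. */
void print_disk_bound_check(void)
{
    printf("rsc_disk_bound: %.2f MB\n", bytes_to_mb(128000000.0));
    printf("usage at abort: %.0f bytes\n", mb_to_bytes(3361.05));
}
```

Calling print_disk_bound_check() should show the 122.07 MB bound from the log next to a usage of roughly 3.5 billion bytes, so rsc_disk_bound would have to be more than 27 times larger just to cover this one task.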

BOINC FAQ Service
Official BOINC wiki
Installing BOINC on Linux

Eric Driver
Message 222 - Posted: 21 Sep 2011, 21:19:04 UTC - in response to Message 218.  

I just looked at the stderr for that result, and something went badly wrong. It didn't report the standard messages; all it reported was the loop counter, which was counting off to infinity. It's a good thing the disk limit kicked in and stopped it.

I have noticed this before, but it only happened to me after I restarted the manager, so it may have something to do with checkpointing. Do you know if this WU was stopped and then restarted at some point?

microchip
Message 224 - Posted: 22 Sep 2011, 11:42:53 UTC - in response to Message 222.  
Last modified: 22 Sep 2011, 11:46:00 UTC

Hi Eric,

No, I didn't stop it manually, but BOINC did: I run other projects as well, so every hour BOINC switches to a WU from another project. In fact, I haven't touched BOINC in a week. It's crunching on my server, and my only interaction with it is reading the logs over ssh. That's all.

Eric Driver
Message 225 - Posted: 22 Sep 2011, 16:32:32 UTC - in response to Message 224.  

Hi microchip,

Ok. But it was stopped and restarted by BOINC, which is what I suspected. That at least points me to the cause.

In your preferences, do you happen to have "leave applications in memory while suspended" selected? It's just a hunch, but my computer that had the same problem has that option selected, while a different computer without it selected has never had the problem. That could just be a coincidence, though...

Thanks!

microchip
Message 226 - Posted: 22 Sep 2011, 17:10:37 UTC

Hi Eric,

Nope. I don't have that option enabled in BOINC.

Dagorath
Message 227 - Posted: 22 Sep 2011, 17:26:07 UTC

I have the "leave tasks in memory" option enabled and have just suspended and resumed a few NumberFields tasks. I'll let you know if any exceed the disk limit.

I've had my "switch between tasks" time set at 6 hours so I suspect NumberFields tasks have mostly been running start to finish with no suspension. Maybe that's why I've not seen this problem.

Dagorath
Message 229 - Posted: 23 Sep 2011, 5:09:18 UTC - in response to Message 227.  
Last modified: 23 Sep 2011, 5:15:18 UTC

I suspended and resumed 4 NumberFields tasks several times as promised in my last post. None of those exceeded the disk limit.

I found 2 tasks that exceeded disk limit yesterday. Here are the telltale lines from BOINC manager's Event Log:

Tue 20 Sep 2011 10:39:42 PM MDT NumberFields@home Aborting task wu_12E10_SF-3-0_Idx6_Grp11214of13668_0: exceeded disk limit: 2470.40MB > 122.07MB
Tue 20 Sep 2011 10:39:42 PM MDT NumberFields@home Aborting task wu_12E10_SF-3-0_Idx6_Grp11215of13668_0: exceeded disk limit: 2523.01MB > 122.07MB
Those are tasks 85384 and 85383.

I doubt the above two were ever suspended. They were running when I shut down BOINC, and they failed just 5 minutes after I restarted it. Maybe restarting has the same negative effect as suspending/resuming? One other NumberFields task restarted at the same time as those two, but it did not fail.

I received 7 fresh NumberFields tasks and tried to crash them with many suspends, resumes, and BOINC restarts, but they all completed error-free.

Eric Driver
Message 230 - Posted: 24 Sep 2011, 3:42:43 UTC - in response to Message 229.  

Thanks for testing it out! So it sounds like the problem only happens when the BOINC manager is restarted, and even then only some of the time. I'll investigate this further after I take care of a few other pressing issues.
Thanks again!

microchip
Message 231 - Posted: 30 Sep 2011, 11:26:14 UTC

Any progress on this?

Eric Driver
Message 232 - Posted: 1 Oct 2011, 9:18:20 UTC - in response to Message 231.  

No, I haven't been able to recreate the problem on my end. Looking through the results, it does still seem to happen, but infrequently. What I do know is that it only seems to happen on the Linux platform when the manager is shut down and restarted, and even then only some of the time. Other than stopping and restarting the manager several times, I haven't expended much energy trying to figure this out.

microchip
Message 233 - Posted: 1 Oct 2011, 11:47:35 UTC

Well, on my end, it happens on every 3rd WU that is sent. I will give it more time. Hope you can resolve the issue :)

Eric Driver
Message 234 - Posted: 1 Oct 2011, 23:54:34 UTC - in response to Message 233.  

Ok. Give us a couple more days to track this down.

Eric Driver
Message 235 - Posted: 2 Oct 2011, 18:33:28 UTC - in response to Message 234.  

I finally found the bug. For those who care, I was scanning in the checkpoint file using the %d format specifier for a long integer; this should have been %ld. This conversion issue only affects positive integers, which is why it only happened some of the time. On Windows, int and long are both 4 bytes, whereas on 64-bit Linux long is 8 bytes, which explains why the problem was never seen on Windows.
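
For anyone curious what that mismatch looks like, here is a hedged C sketch. It is not the project's actual checkpoint code, and the function names are invented for illustration. Passing a long * to %d is undefined behaviour, so the second function only simulates what typically happens on little-endian 64-bit Linux: the low 4 bytes of the 8-byte long are overwritten and the upper 4 bytes keep whatever they held before.

```c
#include <stdio.h>
#include <string.h>

/* Correct version: %ld matches a long destination. Returns 1 on success. */
int read_counter(const char *text, long *out)
{
    return sscanf(text, "%ld", out) == 1;
}

/*
 * Simulation of the mismatch (hypothetical, for illustration only).
 * The value is parsed as an int, then only sizeof(int) bytes are copied
 * into the long, leaving the remaining bytes of *out untouched.
 * Assumes sizeof(int) <= sizeof(long), which holds on common platforms.
 */
int read_counter_mismatch_sim(const char *text, long *out)
{
    int narrow;

    if (sscanf(text, "%d", &narrow) != 1)
        return 0;
    memcpy(out, &narrow, sizeof narrow); /* stale bytes of *out survive */
    return 1;
}
```

With a long that previously held -1, read_counter_mismatch_sim("12345", &n) leaves n far from 12345 on 64-bit Linux, while a negative input such as "-5" would still come out right, because its upper bytes are all ones anyway. That would fit the positive-only symptom, though the -1 starting value is purely an assumption here. Where long is 4 bytes, as on Windows, both functions agree.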

The new Linux version is 1.08. If you have WUs queued up with the old version and expect to be stopping/restarting, then you might want to abort those.

Thanks for your patience!
Eric

microchip
Message 236 - Posted: 3 Oct 2011, 12:54:09 UTC

I can report success on my side. Everything seems to work fine. Thanks for the fix :)

Dagorath
Message 237 - Posted: 5 Oct 2011, 21:47:46 UTC - in response to Message 236.  

I haven't seen any "disk usage exceeded" errors since the fix. I haven't seen any "Error in the PARI system" errors lately either. Did you kill 2 birds with 1 stone or was there a fix for the PARI problem I didn't see?

Eric Driver
Message 238 - Posted: 6 Oct 2011, 1:32:10 UTC - in response to Message 237.  

No, I haven't made a fix for the PARI problem. In fact, I am still waiting to hear back from them with the fix. Must just be a coincidence. There's usually a PARI error every 2 or 3 days.

microchip
Message 240 - Posted: 8 Oct 2011, 15:45:05 UTC

Yup, I got a PARI error on one of my tasks :(

Copyright © 2024 Arizona State University