Tasks stall while running

Message boards : Number crunching : Tasks stall while running
Richard Haselgrove

Message 1345 - Posted: 24 Jul 2015, 14:09:14 UTC
Last modified: 24 Jul 2015, 14:20:07 UTC

I'm seeing this as a very occasional problem.



Task has been running for a very long time, and is still using CPU cycles - but isn't going anywhere.

boinc_task_state.xml shows that the last checkpoint was a long time ago, too:

<checkpoint_cpu_time>1117.872000</checkpoint_cpu_time>
<checkpoint_elapsed_time>1124.736256</checkpoint_elapsed_time>
<fraction_done>0.773977</fraction_done>
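
For anyone who wants to check these values without opening the file by hand, here is a minimal Python sketch that reads the three fields quoted above. It rests on assumptions: the `<active_task>` wrapper element, the function names, and the one-hour `max_gap` threshold are illustrative and do not come from the post, and the stall check is only a rough heuristic, not how the BOINC client itself decides anything.

```python
import xml.etree.ElementTree as ET

# The three inner tags are quoted from the post; the <active_task> wrapper
# is an assumption added so that the fragment parses as a document.
SAMPLE = """<active_task>
    <checkpoint_cpu_time>1117.872000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>1124.736256</checkpoint_elapsed_time>
    <fraction_done>0.773977</fraction_done>
</active_task>"""

FIELDS = ("checkpoint_cpu_time", "checkpoint_elapsed_time", "fraction_done")


def read_checkpoint_state(xml_text):
    """Return the checkpoint fields as floats; a missing tag maps to None."""
    root = ET.fromstring(xml_text)
    state = {}
    for name in FIELDS:
        # Search the whole tree so the exact nesting does not matter.
        element = next(root.iter(name), None)
        has_value = element is not None and element.text
        state[name] = float(element.text) if has_value else None
    return state


def seconds_since_checkpoint(state, elapsed_now):
    """Run time in seconds accumulated since the last checkpoint."""
    return elapsed_now - state["checkpoint_elapsed_time"]


def looks_stalled(state, elapsed_now, max_gap=3600.0):
    """Heuristic only: True if the last checkpoint is older than max_gap."""
    return seconds_since_checkpoint(state, elapsed_now) > max_gap
```

To use it, pass the text of boinc_task_state.xml from the task's slot directory to `read_checkpoint_state`, then call `looks_stalled` with the elapsed time shown in the task's properties dialog. A suitable `max_gap` depends on how often the app normally checkpoints.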

This is task 11409701, and it's being done with the windows_x86_64 version of the app.

I suspended the task manually, without leaving it in memory - from previous experience, I'm expecting it to resume from the checkpoint values and finish normally.
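
The same suspend-and-resume step can be scripted rather than done through the manager. The sketch below only builds argument lists for `boinccmd --task <project URL> <task name> suspend|resume`; it does not run them. The helper names are hypothetical, and the project URL and task name must be supplied by the caller, since the post gives only the task ID. Whether a suspended task is actually removed from memory, and so restarts from its checkpoint, depends on the client's "leave tasks in memory while suspended" preference.

```python
import shlex

ALLOWED_OPERATIONS = ("suspend", "resume")


def task_command(project_url, task_name, operation):
    """Build the boinccmd argument list for one task operation."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation!r}")
    return ["boinccmd", "--task", project_url, task_name, operation]


def suspend_then_resume(project_url, task_name):
    """Return the two commands, in order, that force a restart from checkpoint."""
    return [
        task_command(project_url, task_name, "suspend"),
        task_command(project_url, task_name, "resume"),
    ]


def as_shell_lines(commands):
    """Render the argument lists as shell-quoted lines for copy and paste."""
    return [shlex.join(command) for command in commands]
```

Each list can be passed to `subprocess.run` once the real URL and task name are filled in. Allow a short pause between the two commands so the client has time to stop the process.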

Edit - yes, it restarted while I was typing and has now validated. I should have said that some 40 tasks were processed normally and reported by the same machine, while this task was stalled and occupying one core.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Message 1346 - Posted: 24 Jul 2015, 16:41:48 UTC - in response to Message 1345.  

Very strange.

I will run that same case on my test server and see if I can reproduce this. If it's not case-specific, then this will be very hard to debug.

Richard -
In your experience do you think this is purely an app problem or could it be partly a client problem?
Richard Haselgrove

Message 1347 - Posted: 24 Jul 2015, 17:00:54 UTC - in response to Message 1346.  

I've only seen it once or twice before - I think I possibly aborted the first, and worked out the 'restart from checkpoint' procedure later. This was the first time I dug into it as far as the properties dialog and boinc_task_state.

I've worked on a few client bugs recently, so I'm open-minded - but my gut instinct is that this is an application problem. Or, just possibly, the API code is missing an exception that could have been caught? I mentioned it in passing on the boinc_alpha list, so possibly David might take a look.

If/when I catch it again, I'll maybe take a look with Process Explorer. But my guess is that your copy will sail straight through without a hitch, and we'll be none the wiser.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Message 1349 - Posted: 26 Jul 2015, 0:11:15 UTC - in response to Message 1347.  

I've now run it 3 times on my 64-bit Windows machine with the exact same version of the executable, and I can't get it to happen.

So whatever the issue is, it's going to be very difficult to debug. I usually run the GD app and never see this, so it might be something specific to the GBD app.

I've been trying to debug my Android version of the apps and for some reason the GBD app crashes with a random seg fault about 75% of the time; the GD app does not. I'm starting to wonder if this is more than a coincidence.



Copyright © 2024 Arizona State University