Long running wu_Qsqrt421_DS1x5 units - how long to let them run?

fractal
Joined: 12 Jul 12
Posts: 9
Credit: 10,000,929
RAC: 0
Message 1424 - Posted: 1 Jan 2016, 2:38:42 UTC

I have three work units on one of my machines that have been running for over six days now. They are all currently 17 hours past deadline. They all have been stuck at around 35% complete for the past three days. All three are wu_Qsqrt421_DS1x5_CV2_S815_N2_-<this part varies>

stderr contains

Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_Qsqrt421_DS1x5_CV2_S815_N2_-61_N1_-613to551_0_0
Now starting the targeted Martinet search:
    N2_L = -61.
    N2_U = -61.
      N2 = -61.
        N1_L = -226.
        N1_U = 551.
          N1 = -226.
          N1 = -225.
          N1 = -224.
          N1 = -223.
          N1 = -222.
          N1 = -221.
          N1 = -220.
          N1 = -219.
          N1 = -218.
          N1 = -217.
          N1 = -216.
          N1 = -215.
          N1 = -214.
          N1 = -213.
          N1 = -212.
          N1 = -211.
          N1 = -210.
          N1 = -209.
          N1 = -208.
          N1 = -207.
          N1 = -206.
          N1 = -205.
          N1 = -204.
          N1 = -203.
          N1 = -202.
          N1 = -201.
          N1 = -200.
          N1 = -199.
          N1 = -198.
          N1 = -197.
          N1 = -196.

and occasionally has new numbers added to it.

The computer in question has been on the project for over a year now and has had no errors other than the work units I aborted the other day before they even started.

I don't mind letting them run if they might complete but figured I would ask if they are just wasting time.

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,667,438
RAC: 287,904
Message 1425 - Posted: 1 Jan 2016, 2:59:42 UTC - in response to Message 1424.  

It has recently come to my attention that the Qsqrt421 cases suffer from the same problem that the Bounded app did a couple weeks ago. I am currently looking into a similar fix for these WUs.

The stderr for this WU looks particularly bad. I suspect it could take at least another 6 days to finish. I won't feel bad if you decide to kill it. Either way I will get you manual credit for the lost CPU cycles.

If anyone else has one of these bad WUs please report, either by message board or private message, so I can try and remove them from the system.

Sorry for the inconvenience!

Aurel
Joined: 25 Feb 13
Posts: 216
Credit: 9,899,302
RAC: 0
Message 1426 - Posted: 1 Jan 2016, 2:59:53 UTC - in response to Message 1424.  

Sounds like what was predicted: http://numberfields.asu.edu/NumberFields/forum_thread.php?id=257

Just let the units run; you have additional time for long WUs. ;)

fractal
Joined: 12 Jul 12
Posts: 9
Credit: 10,000,929
RAC: 0
Message 1428 - Posted: 1 Jan 2016, 17:51:21 UTC - in response to Message 1426.  

They aren't hurting anything so I'll let them run.

It looks like 3 more points have been written to stderr since last night, when I hit the return key 4 times in the "tail -f stderr.txt" window:

          N1 = -197.
          N1 = -196.




          N1 = -195.
          N1 = -194.
          N1 = -193.


Am I guessing correctly that it will continue to run until N1 counts up from -226 to +551?

Richard Haselgrove
Joined: 28 Oct 11
Posts: 179
Credit: 220,353,862
RAC: 127,899
Message 1429 - Posted: 1 Jan 2016, 18:29:58 UTC - in response to Message 1425.  

If anyone else has one of these bad WUs please report.

Looks like I've got a couple:

wu_Qsqrt421_DS1x5_CV2_S815_N2_-55_N1_-518to462
That's taken over a week already to work down from

        N1_L = -518.
        N1_U = 462.
          N1 = -518.
          ...
          N1 = -131.

Looks like I've a way to go...

wu_Qsqrt421_DS1x8_CV1_S815_N2_-88_N1_-8044to-3281
The N1 range is even more frightening, but it's doing well - less than three days so far, and we're at

        N1_L = -8044.
        N1_U = -3281.
          N1 = -8044.
          ...
          N1 = -5838.

I have been having some problems recently with power outages related to the bad UK weather, but I'll keep plodding on...

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,667,438
RAC: 287,904
Message 1430 - Posted: 1 Jan 2016, 21:20:50 UTC - in response to Message 1428.  

Am I guessing correctly that it will continue to run until N1 counts up from -226 to +551?

Yes. And based on previous experience, it will start accelerating again after it gets past the slow region (hopefully soon).
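
A rough way to see where a task stands in its N1 loop is to read the bounds and the latest N1 value out of stderr.txt. The Python sketch below does that; the function name `n1_coverage` and the sample text are illustrative assumptions, and the line formats are only those visible in the stderr excerpts in this thread. It reports position in the N1 range, not time remaining: the pace per N1 value is far from uniform, so the result need not match the percentage BOINC shows.

```python
import re


def n1_coverage(stderr_text):
    """Return (n1_low, n1_high, n1_last, fraction) parsed from stderr text.

    fraction is the share of the N1 range already visited, counted in
    N1 values only; it says nothing about elapsed or remaining time.
    """
    low = re.search(r"N1_L\s*=\s*(-?\d+)", stderr_text)
    high = re.search(r"N1_U\s*=\s*(-?\d+)", stderr_text)
    # Matches "N1 = -193." lines; "N1_L" and "N1_U" are skipped because
    # the "_" prevents a match on "N1" followed by "=".
    visited = re.findall(r"\bN1\s*=\s*(-?\d+)", stderr_text)
    if low is None or high is None or not visited:
        raise ValueError("no N1 bounds or N1 values found")
    n1_low, n1_high = int(low.group(1)), int(high.group(1))
    n1_last = int(visited[-1])
    if n1_high == n1_low:
        return n1_low, n1_high, n1_last, 1.0
    fraction = (n1_last - n1_low) / (n1_high - n1_low)
    return n1_low, n1_high, n1_last, fraction


# Shortened excerpt in the format posted earlier in the thread.
sample = """
        N1_L = -226.
        N1_U = 551.
          N1 = -226.
          N1 = -195.
          N1 = -194.
          N1 = -193.
"""

if __name__ == "__main__":
    lo, hi, last, frac = n1_coverage(sample)
    print(f"N1 = {last} in [{lo}, {hi}]: {frac:.1%} of the range visited")
```

In practice the text would come from the task's stderr.txt in the BOINC slot directory rather than from a pasted string.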

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,667,438
RAC: 287,904
Message 1431 - Posted: 1 Jan 2016, 21:58:51 UTC - in response to Message 1429.  
Last modified: 1 Jan 2016, 22:01:58 UTC

I've done some analysis on the results already returned. The good news is that these long running WUs will eventually complete.

First for the DS1x5 work units:
The problem WUs are those with names ending in N1_*to* and where the N1 range includes 0. There will be 1 of these for each value of N2. The N2=-81 case finished in 8 days and the N2=-33 case finished in 6 days. There were a bunch of cases for N2<-81 and N2>-33 and all of these had better times, meaning the worst value of N2 will be somewhere between -81 and -33, and hopefully will not be too much worse than 8 days. So far nothing has been returned for N2 between -81 and -33.

Now for the DS1x8 work units:
The problem cases are again those with names ending in N1_*to*. This time the worst N2 will be between -142 and -66. The N2=-142 case took 5 days and the N2=-66 case took 2.6 days. Outside the interval (-142,-66) times get better.

The easiest thing (for me) would be to keep these WUs on the server and offer people double credits for letting them complete. Hunting down a set of over 100 non-contiguous WUs and individually cancelling them would be very tedious.
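
For anyone wanting to check their own task list against the two rules above, here is a minimal Python sketch. The regular expression is an assumption inferred from the WU names quoted in this thread, not an official naming specification, and `is_long_running` is a hypothetical helper name. It only flags candidates; it does not predict run time.

```python
import re

# Pattern inferred from names posted in this thread (an assumption).
# Trailing "_0" or "_0_0" suffixes are allowed and ignored.
WU_NAME = re.compile(
    r"^wu_Qsqrt421_(?P<ds>DS1x\d+)_CV\d+_S\d+"
    r"_N2_(?P<n2>-?\d+)_N1_(?P<lo>-?\d+)to(?P<hi>-?\d+)(?:_\d+)*$"
)


def is_long_running(name):
    """Apply the rules described above to a work unit name.

    DS1x5: name ends in N1_*to* and the N1 range includes 0.
    DS1x8: name ends in N1_*to*.
    Anything else is not flagged.
    """
    m = WU_NAME.match(name)
    if m is None:
        return False
    lo, hi = int(m["lo"]), int(m["hi"])
    if m["ds"] == "DS1x5":
        return lo <= 0 <= hi
    if m["ds"] == "DS1x8":
        return True
    return False


if __name__ == "__main__":
    names = [
        "wu_Qsqrt421_DS1x5_CV2_S815_N2_-61_N1_-613to551_0_0",
        "wu_Qsqrt421_DS1x8_CV1_S815_N2_-88_N1_-8044to-3281",
        # Hypothetical name, made up to show a DS1x5 range without 0.
        "wu_Qsqrt421_DS1x5_CV2_S815_N2_-61_N1_552to900",
    ]
    for n in names:
        print(n, "->", is_long_running(n))
```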

fractal
Joined: 12 Jul 12
Posts: 9
Credit: 10,000,929
RAC: 0
Message 1445 - Posted: 5 Jan 2016, 18:07:20 UTC

I wouldn't count on 8 days. The spreadsheet doesn't paste very well, but it shows the % complete for the past 4 days on three DS1x5 units.

DS1x5_CV2_S815   2-Jan   3-Jan   4-Jan   5-Jan
N2 = -61         36.4    36.8    37.6    38.2
N2 = -63         36.2    36.8    38.3    39.2
N2 = -73         46.1    50.8    54.7    56.2

So, the wu_Qsqrt421_DS1x5_CV2_S815_N2_-61_N1_-613to551_0 has been running 10d,23:41:37 and is 38.255% complete. It has completed 2% in 3 days. The -73 is doing better. This is on an i7-2700k/stock if that matters.

http://numberfields.asu.edu/NumberFields/workunit.php?wuid=12350747 timed out for me and has been given to someone else along with the others. I wonder if his opteron will catch up with the 10 day head start my i7 has ;)
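
The arithmetic behind the figures above can be sketched in Python. The percentages are copied from the table in this post; the function names are illustrative assumptions. The "days left" number is a naive straight-line extrapolation at the recent pace and should be read as a pessimistic guide only, since the tasks are expected to speed up again once past the slow region.

```python
# N2 value -> % complete on 2, 3, 4 and 5 Jan, copied from the post above.
progress = {
    -61: [36.4, 36.8, 37.6, 38.2],
    -63: [36.2, 36.8, 38.3, 39.2],
    -73: [46.1, 50.8, 54.7, 56.2],
}


def average_rate(samples):
    """Average percentage points gained per day, first to last sample."""
    return (samples[-1] - samples[0]) / (len(samples) - 1)


def naive_days_left(samples):
    """Days to reach 100% if the recent average pace never changed."""
    return (100.0 - samples[-1]) / average_rate(samples)


if __name__ == "__main__":
    for n2, samples in progress.items():
        print(
            f"N2 = {n2}: {average_rate(samples):.2f} points/day, "
            f"about {naive_days_left(samples):.0f} days left at that pace"
        )
```

For the N2 = -61 unit this gives about 0.6 percentage points per day, which is consistent with the "2% in 3 days" quoted above.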

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,667,438
RAC: 287,904
Message 1446 - Posted: 5 Jan 2016, 19:55:36 UTC - in response to Message 1445.  

You bring up a good point. After a WU times out, someone with a faster computer could eventually catch up, in which case the first host would get no credit. If this happens to anyone just let me know and I'll rectify it.

In your case, I would expect a 10 day head start would be very hard to overtake.

Aurel
Joined: 25 Feb 13
Posts: 216
Credit: 9,899,302
RAC: 0
Message 1447 - Posted: 5 Jan 2016, 20:56:23 UTC - in response to Message 1446.  

4 units left, all with runtimes over 3 hours (2 over 5 hours) on an Intel(R) Xeon.
No problems with them; there is a checkpoint every 10 minutes.

pschoefer
Joined: 10 Oct 15
Posts: 5
Credit: 38,148,839
RAC: 334
Message 1450 - Posted: 6 Jan 2016, 6:44:13 UTC
Last modified: 6 Jan 2016, 6:44:34 UTC

My Xeon completed one of the big WUs: wu_Qsqrt421_DS1x8_CV1_S815_N2_-71_N1_-8016to-3292

Progress was very slow at around N1 = -5300, but sped up again and the WU finished well before the deadline. :)

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,667,438
RAC: 287,904
Message 1451 - Posted: 6 Jan 2016, 16:43:11 UTC - in response to Message 1450.  

Just an update on the latest timing:

DS1x5: The N2 range has been narrowed to the interval (-78,-38) with the worst case timing so far of 12.7 days.

DS1x8: The N2 range has been narrowed to the interval (-133,-74) with the worst case timing of 11 days.

I have noticed swings in timing of +/-1 day on adjacent cases which I assume is due to differences in host computing power. The few hosts I checked had relatively fast processors. If you have an older computer (or speed < 2GHz) the time for these WUs could easily double and you may want to abort them.

I feel that run times on the order of 2 weeks are getting a little ridiculous, so I may have to resort to removing them from the server, especially if run times continue to rise. But I will wait for feedback from the users.

Aurel
Joined: 25 Feb 13
Posts: 216
Credit: 9,899,302
RAC: 0
Message 1452 - Posted: 6 Jan 2016, 19:24:50 UTC - in response to Message 1451.  

Accepting units from the decic search again. Let's see how many tasks I get.

Richard Haselgrove
Joined: 28 Oct 11
Posts: 179
Credit: 220,353,862
RAC: 127,899
Message 1454 - Posted: 6 Jan 2016, 19:47:41 UTC - in response to Message 1451.  

Just an update on the latest timing:

DS1x5: The N2 range has been narrowed to the interval (-78,-38) with the worst case timing so far of 12.7 days.

I think my wu_Qsqrt421_DS1x5_CV2_S815_N2_-55_N1_-518to462 (pretty well slap in the middle of that range) may well be worse than that.

Currently at N1 = -111. after 12.5 days, and that's on a relatively fast i5-4570 CPU @ 3.20GHz. I am having to do a lot of other testing at the moment, though, which has involved several BOINC restarts - can't be helped, I'm afraid. But it doesn't matter here.

jondi_hanluc
Joined: 10 Dec 12
Posts: 5
Credit: 22,083,545
RAC: 0
Message 1459 - Posted: 7 Jan 2016, 20:04:44 UTC

I have 5 of these still crunching after 13 days. They range from 35-56% progress, all expired 4 days ago, and all have since been sent to other people. I'm not sure whether I should contact the other people and tell them that I'm still crunching them. I don't want to waste anyone's time, including my own, if they complete them before me.
These are the WUs: http://numberfields.asu.edu/NumberFields/results.php?userid=9411&offset=0&show_names=0&state=6&appid=
I have noticed that one has been given to Richard in this thread.

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,667,438
RAC: 287,904
Message 1464 - Posted: 8 Jan 2016, 1:12:54 UTC - in response to Message 1459.  

I will increase the grace period when I get home later. That should relieve some of the worry about running out of time, and should reduce the chances of wasted resources (people crunching on the same long WU).

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,667,438
RAC: 287,904
Message 1470 - Posted: 8 Jan 2016, 21:10:36 UTC - in response to Message 1464.  
Last modified: 8 Jan 2016, 21:11:16 UTC

I forgot to mention I increased the grace period last night to 10 days, giving a total of 17 days for users to return results before the WU is reissued. If a user was already in the middle of a task, I'm not sure if they'll pick up the change.

fractal
Joined: 12 Jul 12
Posts: 9
Credit: 10,000,929
RAC: 0
Message 1474 - Posted: 10 Jan 2016, 20:18:35 UTC

Current status on i7-2700k/stock

wu_Qsqrt421_DS1x5_CV2_S815_N2_-61_N1_-613to551 44% after 16d01h. Wingman running.
wu_Qsqrt421_DS1x5_CV2_S815_N2_-63_N1_-645to581 47% after 16d01h. Wingman running.
wu_Qsqrt421_DS1x5_CV2_S815_N2_-72_N1_-775to702 finished today after 15 days 20 hours 16 min 47 sec. Was at 60% yesterday. Wingman still running.
wu_Qsqrt421_DS1x8_CV1_S815_N2_-72_N1_-8018to-3291_0 finished after 7 days 18 hours 51 min 2 sec. No wingman.
wu_Qsqrt421_DS1x8_CV1_S815_N2_-73_N1_-8020to-3290_0 finished after 8 days 7 hours 11 min 27 sec. No wingman.

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,667,438
RAC: 287,904
Message 1475 - Posted: 10 Jan 2016, 20:45:45 UTC - in response to Message 1474.  

Ok. Thanks for the update.

Aurel
Joined: 25 Feb 13
Posts: 216
Credit: 9,899,302
RAC: 0
Message 1495 - Posted: 18 Jan 2016, 11:32:30 UTC

Found one task on my server, http://numberfields.asu.edu/NumberFields/result.php?resultid=13633161

28k points. (WTF)

6 days and 6 hours runtime. :O

I didn't expect that I had received any decic tasks.


Copyright © 2024 Arizona State University