Long running wu_Qsqrt421_DS1x5 units - how long to let them run?
Author | Message |
---|---|
Send message Joined: 12 Jul 12 Posts: 9 Credit: 10,000,929 RAC: 0 |
I have three work units on one of my machines that have been running for over six days now. They are all currently 17 hours past deadline. They all have been stuck at around 35% complete for the past three days. All three are wu_Qsqrt421_DS1x5_CV2_S815_N2_-<this part varies>. stderr contains: Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_Qsqrt421_DS1x5_CV2_S815_N2_-61_N1_-613to551_0_0 Now starting the targeted Martinet search: N2_L = -61. N2_U = -61. N2 = -61. N1_L = -226. N1_U = 551. N1 = -226. N1 = -225. N1 = -224. N1 = -223. N1 = -222. N1 = -221. N1 = -220. N1 = -219. N1 = -218. N1 = -217. N1 = -216. N1 = -215. N1 = -214. N1 = -213. N1 = -212. N1 = -211. N1 = -210. N1 = -209. N1 = -208. N1 = -207. N1 = -206. N1 = -205. N1 = -204. N1 = -203. N1 = -202. N1 = -201. N1 = -200. N1 = -199. N1 = -198. N1 = -197. N1 = -196. It occasionally has new numbers added to it. The computer in question has been on the project for over a year now and has had no errors other than the work units I aborted the other day before they even started. I don't mind letting them run if they might complete but figured I would ask if they are just wasting time. |
Send message Joined: 8 Jul 11 Posts: 1323 Credit: 410,669,970 RAC: 246,632 |
It has recently come to my attention that the Qsqrt421 cases suffer from the same problem that the Bounded app did a couple of weeks ago. I am currently looking into a similar fix for these WUs. The stderr for this WU looks particularly bad. I suspect it could take at least another 6 days to finish. I won't feel bad if you decide to kill it. Either way I will get you manual credit for the lost CPU cycles. If anyone else has one of these bad WUs please report, either by message board or private message, so I can try to remove them from the system. Sorry for the inconvenience! |
Send message Joined: 25 Feb 13 Posts: 216 Credit: 9,899,302 RAC: 0 |
I have three work units on one of my machines that have been running for over six days now. They are all currently 17 hours past deadline. They all have been stuck at around 35% complete for the past three days. All three are wu_Qsqrt421_DS1x5_CV2_S815_N2_-<this part varies> Sounds like what was predicted: http://numberfields.asu.edu/NumberFields/forum_thread.php?id=257 Just let the units run; you have additional time for long WUs. ;) |
Send message Joined: 12 Jul 12 Posts: 9 Credit: 10,000,929 RAC: 0 |
They aren't hurting anything so I'll let them run. It looks like it got 3 more points written to stderr since last night when I hit the return key 4 times on the "tail -f stderr.txt" N1 = -197. N1 = -196. N1 = -195. N1 = -194. N1 = -193. Am I guessing correctly that it will continue to run until N1 counts up from -226 to +551? |
Send message Joined: 28 Oct 11 Posts: 179 Credit: 223,727,174 RAC: 112,761 |
If anyone else has one of these bad WUs please report. Looks like I've got a couple: wu_Qsqrt421_DS1x5_CV2_S815_N2_-55_N1_-518to462 That's taken over a week already to work down from N1_L = -518. N1_U = 462. N1 = -518. ... N1 = -131. Looks like I've a way to go... wu_Qsqrt421_DS1x8_CV1_S815_N2_-88_N1_-8044to-3281 The N1 range is even more frightening, but it's doing well - less than three days so far, and we're at N1_L = -8044. N1_U = -3281. N1 = -8044. ... N1 = -5838. I have been having some problems recently with power outages related to the bad UK weather, but I'll keep plodding on... |
Send message Joined: 8 Jul 11 Posts: 1323 Credit: 410,669,970 RAC: 246,632 |
They aren't hurting anything so I'll let them run. Yes. And based on previous experience, it will start accelerating again after it gets past the slow region (hopefully soon). |
Send message Joined: 8 Jul 11 Posts: 1323 Credit: 410,669,970 RAC: 246,632 |
I've done some analysis on the results already returned. The good news is that these long running WUs will eventually complete. First for the DS1x5 work units: The problem WUs are those with names ending in N1_*to* and where the N1 range includes 0. There will be 1 of these for each value of N2. The N2=-81 case finished in 8 days and the N2=-33 case finished in 6 days. There were a bunch of cases for N2<-81 and N2>-33 and all of these had better times, meaning the worst value of N2 will be somewhere between -81 and -33, and hopefully will not be too much worse than 8 days. So far nothing has been returned for N2 between -81 and -33. Now for the DS1x8 work units: The problem cases are again those with names ending in N1_*to*. This time the worst N2 will be between -142 and -66. The N2=-142 case took 5 days and the N2=-66 case took 2.6 days. Outside the interval (-142,-66) times get better. The easiest thing (for me) would be to keep these WUs on the server and offer people double credits for letting them complete (Hunting down a set of over 100 non-contiguous WUs and individually cancelling them would be very tedious). |
Send message Joined: 12 Jul 12 Posts: 9 Credit: 10,000,929 RAC: 0 |
I wouldn't count on 8 days. The spreadsheet doesn't paste very well but it shows the % complete for the past 4 days (2-Jan, 3-Jan, 4-Jan, 5-Jan) on three DS1x5_CV2_S815 units: N2=-61: 36.4, 36.8, 37.6, 38.2; N2=-63: 36.2, 36.8, 38.3, 39.2; N2=-73: 46.1, 50.8, 54.7, 56.2. So, the wu_Qsqrt421_DS1x5_CV2_S815_N2_-61_N1_-613to551_0 has been running 10d,23:41:37 and is 38.255% complete. It has completed about 2% in 3 days. The -73 is doing better. This is on an i7-2700k/stock if that matters. http://numberfields.asu.edu/NumberFields/workunit.php?wuid=12350747 timed out for me and has been given to someone else along with the others. I wonder if his Opteron will catch up with the 10-day head start my i7 has ;) |
Send message Joined: 8 Jul 11 Posts: 1323 Credit: 410,669,970 RAC: 246,632 |
I wouldn't count on 8 days. The spreadsheet doesn't paste very well but it shows the % complete for the past 4 days on three DS1x5 units. You bring up a good point. After a WU times out, someone with a faster computer could eventually catch up, in which case the first host would get no credit. If this happens to anyone just let me know and I'll rectify it. In your case, I would expect a 10-day head start would be very hard to overtake. |
Send message Joined: 25 Feb 13 Posts: 216 Credit: 9,899,302 RAC: 0 |
4 units left, all with runtimes over 3 hours (2 over 5 hours) on an Intel(R) Xeon. No problems with them; there is a checkpoint every 10 minutes. |
Send message Joined: 10 Oct 15 Posts: 5 Credit: 38,148,839 RAC: 21 |
My Xeon completed one of the big WUs: wu_Qsqrt421_DS1x8_CV1_S815_N2_-71_N1_-8016to-3292 Progress was very slow at around N1 = -5300, but sped up again and the WU finished well before the deadline. :) |
Send message Joined: 8 Jul 11 Posts: 1323 Credit: 410,669,970 RAC: 246,632 |
Just an update on the latest timing: DS1x5: The N2 range has been narrowed to the interval (-78,-38) with the worst case timing so far of 12.7 days. DS1x8: The N2 range has been narrowed to the interval (-133,-74) with the worst case timing of 11 days. I have noticed swings in timing of +/-1 day on adjacent cases which I assume is due to differences in host computing power. The few hosts I checked had relatively fast processors. If you have an older computer (or speed < 2GHz) the time for these WUs could easily double and you may want to abort them. I feel that run times on the order of 2 weeks are getting a little ridiculous. So I may have to resort to removing them from the server, especially if run times continue to rise. But I will wait for feedback from the users. |
Send message Joined: 25 Feb 13 Posts: 216 Credit: 9,899,302 RAC: 0 |
Accepting units from the decic search again. Let's see how many tasks I get. |
Send message Joined: 28 Oct 11 Posts: 179 Credit: 223,727,174 RAC: 112,761 |
Just an update on the latest timing: I think my wu_Qsqrt421_DS1x5_CV2_S815_N2_-55_N1_-518to462 (pretty well slap in the middle of that range) may well be worse than that. Currently at N1 = -111. after 12.5 days, and that's on a relatively fast i5-4570 CPU @ 3.20GHz. I am having to do a lot of other testing at the moment, though, which has involved several BOINC restarts - can't be helped, I'm afraid. But it doesn't matter here. |
Send message Joined: 10 Dec 12 Posts: 5 Credit: 22,083,545 RAC: 0 |
I have 5 of these still crunching after 13 days, they range from 35-56% progress, all expired 4 days ago and have since been sent to other people. I'm not sure whether I should contact the other people and tell them that I'm still crunching them, I don't want to waste anyone's time including my own if they're completed before me. These are the WUs http://numberfields.asu.edu/NumberFields/results.php?userid=9411&offset=0&show_names=0&state=6&appid= I have noticed that one has been given to Richard in this thread. |
Send message Joined: 8 Jul 11 Posts: 1323 Credit: 410,669,970 RAC: 246,632 |
I have 5 of these still crunching after 13 days, they range from 35-56% progress, all expired 4 days ago and have since been sent to other people. I'm not sure whether I should contact the other people and tell them that I'm still crunching them, I don't want to waste anyone's time including my own if they're completed before me. I will increase the grace period when I get home later. That should relieve some of the worry about running out of time, and should reduce the chances of wasted resources (people crunching on the same long WU). |
Send message Joined: 8 Jul 11 Posts: 1323 Credit: 410,669,970 RAC: 246,632 |
I have 5 of these still crunching after 13 days, they range from 35-56% progress, all expired 4 days ago and have since been sent to other people. I'm not sure whether I should contact the other people and tell them that I'm still crunching them, I don't want to waste anyone's time including my own if they're completed before me. I forgot to mention I increased the grace period last night to 10 days, giving a total of 17 days for users to return results before the WU is reissued. If a user was already in the middle of a task, I'm not sure if they'll pick up the change. |
Send message Joined: 12 Jul 12 Posts: 9 Credit: 10,000,929 RAC: 0 |
Current status on i7-2700k/stock wu_Qsqrt421_DS1x5_CV2_S815_N2_-61_N1_-613to551 44% after 16d01h. Wingman running. wu_Qsqrt421_DS1x5_CV2_S815_N2_-63_N1_-645to581 47% after 16d01h. Wingman running. wu_Qsqrt421_DS1x5_CV2_S815_N2_-72_N1_-775to702 finished today after 15 days 20 hours 16 min 47 sec. Was at 60% yesterday. Wingman still running. wu_Qsqrt421_DS1x8_CV1_S815_N2_-72_N1_-8018to-3291_0 finished after 7 days 18 hours 51 min 2 sec. No wingman. wu_Qsqrt421_DS1x8_CV1_S815_N2_-73_N1_-8020to-3290_0 finished after 8 days 7 hours 11 min 27 sec. No wingman. |
Send message Joined: 8 Jul 11 Posts: 1323 Credit: 410,669,970 RAC: 246,632 |
Current status on i7-2700k/stock Ok. Thanks for the update. |
Send message Joined: 25 Feb 13 Posts: 216 Credit: 9,899,302 RAC: 0 |
Found one task on my server, http://numberfields.asu.edu/NumberFields/result.php?resultid=13633161 28k points. (WTF) 6 days and 6 hours runtime. :O Didn't guess that I had gotten some decic tasks. |
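
For anyone who wants to check a task list against the rule of thumb given earlier in the thread (DS1x5 names ending in N1_*to* whose N1 range includes 0, and DS1x8 names ending in N1_*to*), here is a minimal Python sketch. The regular expression and the function names `parse_wu_name` and `is_slow_candidate` are assumptions inferred from the WU names quoted in this thread; none of this comes from the project's own code.

```python
import re

# Pattern inferred from names quoted in this thread, for example
# wu_Qsqrt421_DS1x5_CV2_S815_N2_-61_N1_-613to551_0 (an assumption, not project code).
WU_NAME = re.compile(
    r"wu_Qsqrt421_(?P<ds>DS1x\d+)_CV\d+_S\d+"
    r"_N2_(?P<n2>-?\d+)_N1_(?P<lo>-?\d+)to(?P<hi>-?\d+)"
)


def parse_wu_name(name):
    """Return the data set label, N2 and the N1 range, or None if the name does not fit."""
    m = WU_NAME.match(name)
    if m is None:
        return None
    return {
        "ds": m["ds"],
        "n2": int(m["n2"]),
        "n1_lo": int(m["lo"]),
        "n1_hi": int(m["hi"]),
    }


def is_slow_candidate(name):
    """Apply the rule of thumb from the thread to a single WU name."""
    info = parse_wu_name(name)
    if info is None:
        return False  # no N1_*to* part, or not a Qsqrt421 WU
    if info["ds"] == "DS1x5":
        # DS1x5: only ranges that include N1 = 0 were reported as slow
        return info["n1_lo"] <= 0 <= info["n1_hi"]
    # DS1x8: any N1_*to* name was reported as a problem case
    return info["ds"] == "DS1x8"


if __name__ == "__main__":
    for wu in (
        "wu_Qsqrt421_DS1x5_CV2_S815_N2_-61_N1_-613to551_0",
        "wu_Qsqrt421_DS1x8_CV1_S815_N2_-88_N1_-8044to-3281",
    ):
        print(wu, is_slow_candidate(wu))
```

This only flags candidates by name. It says nothing about how long a flagged WU will actually take, which according to the posts above depends on N2 and on the host.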