300day work unit

Message boards : Number crunching : 300day work unit

Daniel Liebmann
Joined: 9 Jan 16
Posts: 6
Credit: 10,999,642
RAC: 0
Message 1507 - Posted: 24 Jan 2016, 14:01:00 UTC

Take a look at the screenshot:



Should I abort that WU, or should I wait a few days? I will not wait 290 days :D

Richard Haselgrove
Joined: 28 Oct 11
Posts: 128
Credit: 106,820,689
RAC: 31,598
Message 1508 - Posted: 24 Jan 2016, 16:22:19 UTC - in response to Message 1507.  

It's really helpful when people post references to tasks as searchable (or better, clickable) text. But you showed enough of the task name for me to find:

Host 27281
WU 12731079

Since your two predecessors have both aborted the task before you, you have the honour, sir. The task is yours, should you choose to accept it.

I presume a nice fast Xeon server like that runs continuously and reliably: it should be an ideal platform for the task.

Have a look at the thread "Long running wu_Qsqrt421_DS1x5 units - how long to let them run?". The longest so far is 28 days, I think, but yours doesn't fall into the slowest category (those with N2 in the interval [-78,-38] and an N1 range spanning zero). I think you can be confident that the actual (rather than estimated) runtime will be well below 300 days, almost certainly below 30 days. And you should earn double credits at the end of it.
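Richard's rule of thumb can be checked from the task name alone. The following is a minimal C++ sketch, assuming the name layout seen in this thread (_N2_ followed by a value, then _N1_ followed by either a single value or a "lo"to"hi" range). WuRange, parse_wu_name and in_slowest_ds1x5_category are hypothetical names, not part of the NumberFields software, and the [-78,-38] band is Richard's observation for the DS1x5 batch rather than a project rule.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// N2 value and inclusive N1 range taken from a task name. The layout is
// inferred from task names quoted in this thread, not from a specification.
struct WuRange {
    long n2;
    long n1_min;
    long n1_max;  // equals n1_min when the name carries a single N1 value
};

// Extract the fields from names such as
// "wu_Qsqrt421_DS1x5_CV2_S815_N2_-44_N1_-343to298" or
// "wu_Qsqrt421_DS3x8_CV1_S815_N2_-194129_N1_805669_k2_-2".
std::optional<WuRange> parse_wu_name(const std::string& name) {
    static const std::regex pattern(R"(_N2_(-?\d+)_N1_(-?\d+)(?:to(-?\d+))?)");
    std::smatch m;
    if (!std::regex_search(name, m, pattern)) {
        return std::nullopt;  // not a name in the expected layout
    }
    WuRange r{};
    r.n2 = std::stol(m[1].str());
    r.n1_min = std::stol(m[2].str());
    r.n1_max = m[3].matched ? std::stol(m[3].str()) : r.n1_min;
    if (r.n1_min > r.n1_max) {
        return std::nullopt;  // reversed range: treat as malformed
    }
    return r;
}

// Richard's rule of thumb for the slowest DS1x5 tasks:
// N2 in [-78, -38] and an N1 range that spans zero.
bool in_slowest_ds1x5_category(const WuRange& r) {
    assert(r.n1_min <= r.n1_max);  // guaranteed by parse_wu_name
    const bool n2_in_band = r.n2 >= -78 && r.n2 <= -38;
    const bool n1_spans_zero = r.n1_min <= 0 && r.n1_max >= 0;
    return n2_in_band && n1_spans_zero;
}
```

For example, wu_Qsqrt421_DS1x5_CV2_S815_N2_-44_N1_-343to298 falls into that category, while the DS3x8 tasks discussed in this thread (N2 around -194000) do not.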

Daniel Liebmann
Joined: 9 Jan 16
Posts: 6
Credit: 10,999,642
RAC: 0
Message 1509 - Posted: 24 Jan 2016, 16:55:39 UTC - in response to Message 1508.  

Double credit sounds nice, and with 30 days of runtime I will be fine. I wasn't sure whether the 300 days shown was an accurate estimate or whether the task might be stuck in a loop or something like that. Thanks for your helpful answer; I will not abort that WU ;-)

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 926
Credit: 96,976,971
RAC: 54,496
Message 1510 - Posted: 24 Jan 2016, 21:07:23 UTC - in response to Message 1508.  

Richard - Thanks for responding in my absence.

I returned late last night from conference/vacation and am now getting caught up on project issues that I have neglected over the last 2 weeks.

Jesse Viviano
Joined: 20 Dec 14
Posts: 17
Credit: 3,506,373
RAC: 1
Message 1517 - Posted: 28 Jan 2016, 3:15:35 UTC
Last modified: 28 Jan 2016, 3:16:46 UTC

I have another long-running work unit that appears to be from the same batch as the original poster's. See wu_Qsqrt421_DS3x8_CV1_S815_N2_-194073_N1_805123to806926. The contents of my stderr.txt as of this writing are listed below:
Checkpoint Flag = 0.
Cvec Starting Index = 0.
N1 Start = 0.
N2 Start = 0.
k1 Start = 0.
k2 Start = 0.
PolyCount starting value = 0.
Stat Count 1 = 0.
Stat Count 2 = 0.
Stat Count 3 = 0.
Elapsed Time = 0 (sec).
Reading file ../../projects/numberfields.asu.edu_NumberFields/Qsqrt421_DS3x8_CV1_S815_N2_-194073_N1_805123to806926.dat
    K = x^2 - 421
    S = [2]
    Disc Bound = 256
    Skip = P^1
    Num Congruences = 1
    SCALE = 0.815000
    N2_MIN = -194073
    N2_MAX = -194073
    N1_MIN = 805123
    N1_MAX = 806926
    |dK| = 421
    Signature = [2,0]
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_Qsqrt421_DS3x8_CV1_S815_N2_-194073_N1_805123to806926_1_0
Now starting the targeted Martinet search:
    N2_L = -194073.
    N2_U = -194073.
      N2 = -194073.
        N1_L = 805123.
        N1_U = 806926.
          N1 = 805123.
          N1 = 805124.
          N1 = 805125.

This task has run for 1 day, 10 hours, and 56 minutes as of this writing. The time it took while processing N1 = 805123 was over 24 hours. After that long period of time, N1 = 805124 was written.

Another user who was assigned this work unit timed out. This work unit is causing this dual-core 32-bit Intel Core Duo T2400 computer to cancel work units it could not start before their deadlines because one of the cores in this computer is tied up on this mammoth work unit. (Don't criticize this user for not installing a 64-bit operating system or 64-bit BOINC because the CPU is incapable of running 64-bit code.)

I did not post in the other thread because this work unit belongs to a different batch from the one that thread is discussing, so it falls outside that thread's parameters.

jondi_hanluc
Joined: 10 Dec 12
Posts: 5
Credit: 22,083,545
RAC: 0
Message 1518 - Posted: 28 Jan 2016, 10:42:10 UTC - in response to Message 1509.  
Last modified: 28 Jan 2016, 10:43:03 UTC

Daniel, I was one of the ones who aborted that WU. I don't mind running long work units: I've completed about 3 so far, the longest taking 29 days, and I have half a dozen still going, including one currently at 31 days. But that one, along with another I got, showed progress of only a fraction of a percent after several hours, so I aborted it. Richard may be right about the estimated runtime being wrong, but the progress was unlike any of the other long WUs I've had.

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 926
Credit: 96,976,971
RAC: 54,496
Message 1521 - Posted: 28 Jan 2016, 20:35:34 UTC - in response to Message 1517.  

Hey Jesse -

Sorry for the late response, somehow I missed your message.

The DS3x8 batch should not suffer from the original problems that were discussed in the other threads, so I am sorry to hear this case is taking so long; maybe there is still an issue in my formula for breaking up the search space. My guess is that this WU will start to speed up as the N1 value increases. This should be the fast region, which is why a bunch of N1s are lumped into the same WU, but sometimes the edge values can take longer (though not 24 hours).
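The idea of lumping many cheap N1 values into one WU can be illustrated with a simple greedy split. This is only a sketch under stated assumptions: Eric's actual formula is not shown in this thread, and cost_of stands in for a hypothetical per-N1 cost estimate (split_n1_range and N1Range are likewise illustrative names). If that estimate is too low for some N1 values, such as the edge values mentioned above, the work unit containing them runs far longer than intended.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// One work unit: an inclusive range of N1 values.
using N1Range = std::pair<long, long>;

// Greedy split: walk N1 upward and close a work unit as soon as the summed
// estimated cost reaches target_cost. cost_of is a hypothetical cost model;
// any N1 it underestimates makes its work unit correspondingly longer.
std::vector<N1Range> split_n1_range(long n1_lo, long n1_hi, double target_cost,
                                    const std::function<double(long)>& cost_of) {
    assert(n1_lo <= n1_hi && target_cost > 0.0);
    std::vector<N1Range> units;
    long start = n1_lo;
    double acc = 0.0;
    for (long n1 = n1_lo; n1 <= n1_hi; ++n1) {
        acc += cost_of(n1);
        if (acc >= target_cost) {
            units.emplace_back(start, n1);  // this unit is "full"
            start = n1 + 1;
            acc = 0.0;
        }
    }
    if (start <= n1_hi) {
        units.emplace_back(start, n1_hi);  // leftover tail
    }
    return units;
}
```

A cheap region therefore yields wide N1 ranges per WU, and a single expensive N1 closes its unit early; the scheme only works as well as the cost estimate behind it.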

Something that might be happening, at least I've seen it when I run stand-alone, is that the output buffers sometimes are not flushed until they reach a certain size. When this happens, the output appears to be hung on 1 value of N1 for a long time and then all of a sudden the entire WU finishes, dumping the rest of the output to the screen at one time. I'm not sure if stderr acts the same way.
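One way to rule out buffering as the cause is to flush after every progress line, or to switch the stream to line buffering at startup. A minimal C++ sketch using the C stdio API; report_n1 and use_line_buffering are hypothetical helper names, and the line format only mimics the log shown in this thread.

```cpp
#include <cassert>
#include <cstdio>

// As initially opened, stderr is never fully buffered, but stdout is fully
// buffered whenever it does not refer to an interactive device (for example,
// when it is redirected to a file). Flushing after each progress line keeps
// the log in step with the computation either way.
bool report_n1(std::FILE* out, long n1) {
    assert(out != nullptr);
    if (std::fprintf(out, "N1 = %ld.\n", n1) < 0) {
        return false;
    }
    return std::fflush(out) == 0;
}

// Alternative: make a stream line buffered. setvbuf must be called before
// any other operation on the stream; it returns 0 on success.
bool use_line_buffering(std::FILE* out) {
    assert(out != nullptr);
    return std::setvbuf(out, nullptr, _IOLBF, BUFSIZ) == 0;
}
```

Explicit fflush costs a little per line, but with one line per N1 value (each taking seconds to hours) the overhead should be negligible.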

Richard Haselgrove
Joined: 28 Oct 11
Posts: 128
Credit: 106,820,689
RAC: 31,598
Message 1522 - Posted: 29 Jan 2016, 0:19:52 UTC - in response to Message 1521.  

It's not as simple as that, I'm afraid. I can see three different classes of long-running DS.x8 tasks.

Examples:

1) wu_Qsqrt421_DS3x8_CV1_S815_N2_-194129_N1_805669_k2_-2 (just finished). I had four of these - now three - which reached 97% without drawing attention to themselves, but are extremely slow to finish the last 3%. Overall, they seem to be finishing in under 2 days - so hardly 'long' at all.

2) The ones I mentioned in the adjacent thread a couple of days ago.

wu_Qsqrt421_DS1x8_CV1_S815_N2_-115_N1_-8062to-3289 (20 days)
wu_Qsqrt421_DS1x8_CV1_S815_N2_-92_N1_-8049to-3280 (19 days)
wu_Qsqrt421_DS1x8_CV1_S815_N2_-117_N1_-8062to-3291 (12 days)

All are spending their slow-running time in the 48% - 53% area (so far - I don't know when they speed up again).

3) 'Slow starters'

wu_Qsqrt421_DS3x8_CV1_S815_N2_-194088_N1_806767to808794 (2 days running, 0.275% progress)
wu_Qsqrt421_DS1x5_CV2_S815_N2_-44_N1_-343to298 (1 day 21 hours, 17.653%)

Jesse Viviano
Joined: 20 Dec 14
Posts: 17
Credit: 3,506,373
RAC: 1
Message 1523 - Posted: 29 Jan 2016, 3:38:07 UTC - in response to Message 1521.  

I have been taught that stderr in C is not buffered. If you are using C++, cerr is the C++ counterpart of stderr and is also not buffered. C++ additionally provides a buffered stream on stderr called clog. That is much easier on SSDs (without a buffer, each character can trigger its own read-modify-write cycle, which racks up wear on the SSD) and faster on all media, but it is inappropriate for time-sensitive error reports where any delay caused by buffering is unacceptable. clog is meant for log messages that do not belong in the program's regular output, for example when that output is redirected to a file.
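The standard library does set up cerr and clog differently, and the difference can be checked directly. A minimal C++ sketch; flushes_after_each_write, check_stderr_stream_setup and log_progress are hypothetical helper names. Note that with the default std::ios_base::sync_with_stdio(true), common implementations may pass both streams straight through to C's stderr, so clog output may in practice appear just as promptly as cerr output; the actual delay depends on that setting and on the library.

```cpp
#include <cassert>
#include <iostream>

// True if the stream flushes after every output operation (the unitbuf flag).
bool flushes_after_each_write(const std::ostream& os) {
    return (os.flags() & std::ios_base::unitbuf) != 0;
}

// The standard requires std::cerr to start with unitbuf set and to be tied
// to std::cout (cout is flushed before cerr writes). std::clog writes to the
// same C stderr stream but has no unitbuf flag.
void check_stderr_stream_setup() {
    assert(flushes_after_each_write(std::cerr));
    assert(!flushes_after_each_write(std::clog));
    assert(std::cerr.tie() == &std::cout);
}

// Route a progress message to the unbuffered or the buffered stream.
void log_progress(long n1, bool urgent) {
    if (urgent) {
        std::cerr << "N1 = " << n1 << ".\n";  // flushed because of unitbuf
    } else {
        std::clog << "N1 = " << n1 << ".\n";  // may wait in a buffer
        // Append "<< std::flush" here when timing matters.
    }
}
```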

By the way, the work unit has sped up since the first N1 completed, but I still have a long way to go. I still want to try to complete it, but after 2 days, 11 hours, and 18 minutes of running it is only at 0.340%. Here is my current stderr:
Checkpoint Flag = 0.
Cvec Starting Index = 0.
N1 Start = 0.
N2 Start = 0.
k1 Start = 0.
k2 Start = 0.
PolyCount starting value = 0.
Stat Count 1 = 0.
Stat Count 2 = 0.
Stat Count 3 = 0.
Elapsed Time = 0 (sec).
Reading file ../../projects/numberfields.asu.edu_NumberFields/Qsqrt421_DS3x8_CV1_S815_N2_-194073_N1_805123to806926.dat
    K = x^2 - 421
    S = [2]
    Disc Bound = 256
    Skip = P^1
    Num Congruences = 1
    SCALE = 0.815000
    N2_MIN = -194073
    N2_MAX = -194073
    N1_MIN = 805123
    N1_MAX = 806926
    |dK| = 421
    Signature = [2,0]
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_Qsqrt421_DS3x8_CV1_S815_N2_-194073_N1_805123to806926_1_0
Now starting the targeted Martinet search:
    N2_L = -194073.
    N2_U = -194073.
      N2 = -194073.
        N1_L = 805123.
        N1_U = 806926.
          N1 = 805123.
          N1 = 805124.
          N1 = 805125.
          N1 = 805126.
          N1 = 805127.
          N1 = 805128.
          N1 = 805129.
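Progress figures like these also show why the runtime estimates in this thread look so extreme. If one assumes the remaining work proceeds at the average rate seen so far, the remaining time is elapsed * (1 - f) / f for a fraction done f. The sketch below is that naive linear extrapolation in C++; it is not necessarily the formula the BOINC client uses, and naive_remaining_seconds and seconds_from are hypothetical names.

```cpp
#include <cassert>

// Naive remaining-time estimate that assumes progress keeps its average rate:
// remaining = elapsed * (1 - f) / f, for fraction done f in (0, 1].
// A simplification for illustration, not the BOINC client's own estimator.
double naive_remaining_seconds(double elapsed_seconds, double fraction_done) {
    assert(fraction_done > 0.0 && fraction_done <= 1.0);
    return elapsed_seconds * (1.0 - fraction_done) / fraction_done;
}

// Convert a "d days, h hours, m minutes" runtime to seconds.
constexpr double seconds_from(int days, int hours, int minutes) {
    return days * 86400.0 + hours * 3600.0 + minutes * 60.0;
}
```

With the figures above (0.340% after 2 days, 11 hours, and 18 minutes), naive_remaining_seconds(seconds_from(2, 11, 18), 0.0034) gives roughly 724 more days. Because progress through a WU is far from uniform (Richard's examples include tasks that crawl through the last 3% or linger in the 48% - 53% area), such an extrapolation can be off by a large factor in either direction.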

Jesse Viviano
Joined: 20 Dec 14
Posts: 17
Credit: 3,506,373
RAC: 1
Message 1528 - Posted: 3 Feb 2016, 4:35:27 UTC

I have aborted my result because the computer that had the tardy result managed to finish and report it. I therefore consider further processing of this work unit to be an inefficient use of energy and processor time which could be put to better use on another result from this project or another project.

Daniel Liebmann
Joined: 9 Jan 16
Posts: 6
Credit: 10,999,642
RAC: 0
Message 1529 - Posted: 3 Feb 2016, 14:38:10 UTC

wu_Qsqrt421_DS3x8_CV1_S815_N2_-194114_N1_805809to806198_1 is one of my longest-running tasks (and the last of that kind). It has run for about 7 days so far and should finish in about 2.5 days.

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 926
Credit: 96,976,971
RAC: 54,496
Message 1530 - Posted: 3 Feb 2016, 16:35:48 UTC - in response to Message 1528.  

Jesse -
That makes sense. I went ahead and granted you manual credit for the lost CPU cycles.

For anyone else in the same situation, let me know and I can do the same for you (private message works).

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 926
Credit: 96,976,971
RAC: 54,496
Message 1531 - Posted: 3 Feb 2016, 16:37:06 UTC - in response to Message 1529.  

Ok Daniel. Thanks for hanging in there!

Jesse Viviano
Joined: 20 Dec 14
Posts: 17
Credit: 3,506,373
RAC: 1
Message 1533 - Posted: 4 Feb 2016, 3:55:33 UTC - in response to Message 1530.  

Thanks!

Jesse Viviano
Joined: 20 Dec 14
Posts: 17
Credit: 3,506,373
RAC: 1
Message 1534 - Posted: 4 Feb 2016, 4:26:30 UTC - in response to Message 1530.  

By the way, I placed a suggestion in the wish list at http://numberfields.asu.edu/NumberFields/forum_thread.php?id=276 that could help you diagnose poorly split work units. These are either runts, which finish in a few seconds to a few minutes and tax your back end because it has to process loads of them instead of a reasonable number of regular-sized work units, or giants. Basically, I would like future versions of your programs to put a UTC timestamp on each line of the stderr output. This may help you figure out which numbers are slow and which are fast, and let you adjust your work unit generator to produce fewer runts and giants.
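A minimal C++ sketch of the timestamp idea, assuming ISO 8601 UTC stamps are wanted; stamp_utc and log_line are hypothetical helpers, not existing NumberFields or BOINC functions. It uses only std::time, std::gmtime and std::strftime from the standard library.

```cpp
#include <cassert>
#include <cstdio>
#include <ctime>
#include <string>

// Prefix a log line with a UTC timestamp in ISO 8601 form, for example
// "2016-02-04T04:26:30Z N1 = 805124.".
std::string stamp_utc(std::time_t t, const std::string& line) {
    // std::gmtime returns a pointer to shared static storage and need not be
    // thread-safe, so copy the result immediately.
    const std::tm* p = std::gmtime(&t);
    assert(p != nullptr);
    const std::tm utc = *p;
    char buf[32];
    const std::size_t n = std::strftime(buf, sizeof buf, "%Y-%m-%dT%H:%M:%SZ", &utc);
    assert(n > 0);  // 0 would mean the buffer was too small
    return std::string(buf, n) + " " + line;
}

// Write one timestamped line to stderr using the current time.
void log_line(const std::string& line) {
    const std::string out = stamp_utc(std::time(nullptr), line);
    std::fprintf(stderr, "%s\n", out.c_str());
}
```

Separating the formatting (stamp_utc, which takes an explicit time) from the logging (log_line, which reads the clock) keeps the formatting easy to test.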

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 926
Credit: 96,976,971
RAC: 54,496
Message 1535 - Posted: 4 Feb 2016, 7:12:30 UTC - in response to Message 1534.  

Yes, that's a good suggestion. I will add it to the list!

I do have some post-processing scripts which, given all the results from a batch of WUs, plot timing as a function of various loop indices. The data is a little "noisy" since each WU is processed by a different computer, but it has nonetheless helped me find better ways to break up the search space.
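Once each N1 line carries a timestamp, the per-N1 cost falls out of the differences between consecutive lines. A minimal C++ sketch, assuming the timestamps have already been parsed into seconds and that each "N1 = ..." line is written when work on that N1 begins (consistent with Jesse's observation that N1 = 805124 appeared only after N1 = 805123 had run for over 24 hours). N1Start, N1Cost and n1_durations are hypothetical names. As Eric notes above, such timings are only trustworthy if each line is flushed as soon as it is written.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One "N1 = ..." line from a timestamped stderr log (timestamps in seconds).
struct N1Start {
    long long t;  // time the N1 line was written
    long n1;
};

struct N1Cost {
    long n1;
    long long seconds;
};

// Each N1 runs from its own line until the next N1 line; the last N1 in the
// log is still in progress, so it gets no duration. Result: slowest first.
std::vector<N1Cost> n1_durations(const std::vector<N1Start>& starts) {
    std::vector<N1Cost> costs;
    for (std::size_t i = 0; i + 1 < starts.size(); ++i) {
        assert(starts[i + 1].t >= starts[i].t);  // log lines are in time order
        costs.push_back({starts[i].n1, starts[i + 1].t - starts[i].t});
    }
    std::stable_sort(costs.begin(), costs.end(),
                     [](const N1Cost& a, const N1Cost& b) { return a.seconds > b.seconds; });
    return costs;
}
```

Collected across many WUs, these per-N1 durations could feed the kind of timing-versus-index plots described above, although host-to-host speed differences would still add noise.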

Copyright © 2019 Arizona State University