Talk about your long ones

Matthias Lehmkuhl

Joined: 10 Jan 12
Posts: 8
Credit: 2,194,088
RAC: 1,277
Message 1320 - Posted: 26 Jun 2015, 12:36:58 UTC - in response to Message 1318.  

It looks like there is an issue with "Get Decics with Bounded Discriminant v3.03".
On one of my computers only one result finished OK.
Here is the good one:
http://numberfields.asu.edu/NumberFields/result.php?resultid=11107827

All the others ended in a calculation error, after varying run times.
Sample result with a calculation error:
http://numberfields.asu.edu/NumberFields/result.php?resultid=11107958

Host:
http://numberfields.asu.edu/NumberFields/results.php?hostid=22930&offset=0&show_names=0&state=0&appid=2

stderr is empty for the results with a calculation error.

CPU usage is less than 50% for the last result (that I've seen) and based on

The same happens on the other host:
http://numberfields.asu.edu/NumberFields/results.php?hostid=2794
Matthias
ID: 1320 · Rating: 0
Matthias Lehmkuhl
Message 1321 - Posted: 26 Jun 2015, 13:30:21 UTC

Here are the stderr.txt contents for the result
http://numberfields.asu.edu/NumberFields/result.php?resultid=11108092
I've copied some GetBoundedDecics_state updates from before the calculation error:

Checkpoint Flag = 0.
a5 Starting Index = 0.
a22 Starting Value = -1000000000.
a21 Starting Value = -1000000000.
a32 Starting Value = -1000000000.
a31 Starting Value = -1000000000.
PolyCount starting value = 0.
Stat Count 1 = 0.
Stat Count 2 = 0.
Stat Count 3 = 0.
Elapsed Time = 0 (sec).
Entering MartinetSearch routine...
Die Syntax fr den Dateinamen, Verzeichnisnamen oder die Datentr„gerbezeichnung ist falsch.
Disc Bound = 120000000000.00000000
Reading file ../../projects/numberfields.asu.edu_NumberFields/wu_12E10_SF73-0_Idx9_Grp37421of124454.dat:
K = y^2 - 73
TgtFlag = 0
a1 Index = 9
NumVals_a5 = 1
a5 values:
25 + -6w
a22_L = -3
a22_U = -3
a21_L = 41
a21_U = 41
a32_L = 23
a32_U = 46
|dK| = 73
Signature = [2,0]
a11 = -1
a12 = 2
sig1a1 = -10.544003745317531167871648326239706435
sig2a1 = 6.5440037453175311678716483262397064346
Ca1_pre = 30.800000
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_12E10_SF73-0_Idx9_Grp37421of124454_0_0
Now starting the Martinet search:

Doing case a5 = 25 + -6w...
2nd part of Martinet bound = 18.978846.
Martinet bound = 49.778846.
a22_L = -3.
a22_U = -3.
a22 = -3.
a21_L = 41.
a21_U = 41.
a21 = 41.
a32_L = 23.
a32_U = 46.

Hope this helps.
Matthias
ID: 1321 · Rating: 0
Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1329
Credit: 423,120,158
RAC: 485,094
Message 1322 - Posted: 26 Jun 2015, 16:20:54 UTC - in response to Message 1321.  

Matthias - since your results have no stderr, it's hard to determine exactly what happened; however, I noticed some other users were also failing and the error was related to the writing of a certain temp file. I uploaded a new app which should fix this issue. I'm hoping that will fix your problem too.
ID: 1322 · Rating: 0
Eric Driver
Message 1323 - Posted: 26 Jun 2015, 16:30:03 UTC - in response to Message 1322.  

I should have also mentioned, the problem with the temp file was an issue with multiple processes - one process was "cleaning up" and happened to clean up the temp file that another process was using.

So if you only run 1 NumberFields process at a time you wouldn't notice this (which is why I didn't catch it while testing!)
ID: 1323 · Rating: 0
Eric Driver
Message 1324 - Posted: 26 Jun 2015, 16:45:34 UTC - in response to Message 1319.  

"I see I've picked up a new v2.05 app already. I'll abort unstarted v2.01 tasks, but let the ones which are already running complete without interruption."


Hey Richard - I should have mentioned I duplicated your observation of the Qsqrt WUs completing really fast after a restart. A good test is to suspend the WU and then restart it - you should now see it continue normally. Ultimately, the problem was the way the checkpoint file was being written (%ld format in fprintf when it should have been %lld).
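
To make the format-specifier problem above more concrete, here is a minimal Python sketch of what can happen when a 64-bit value is written as if it were a 32-bit `long` (which is the size of `long` on 64-bit Windows). Strictly speaking, a mismatched `fprintf` format is undefined behavior in C, so the wraparound shown here is only one plausible outcome. The function name and the sample value are hypothetical and are not taken from the NumberFields source.

```python
def as_signed_32(value: int) -> int:
    """Keep only the low 32 bits of value and reinterpret them as a signed int."""
    low = value & 0xFFFFFFFF  # discard everything above bit 31
    return low - (1 << 32) if low >= (1 << 31) else low


# Hypothetical 64-bit counter that a checkpoint file is supposed to store.
poly_count = 5_000_000_000

# What a 32-bit "%ld" could end up writing instead of the real value.
written = as_signed_32(poly_count)
print(poly_count, "->", written)
```

A value that fits in 32 bits passes through unchanged, which would explain why such a bug only shows up on work units whose counters grow large before a restart.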
ID: 1324 · Rating: 0
Eric Driver
Message 1325 - Posted: 26 Jun 2015, 23:49:25 UTC - in response to Message 1324.  

It looks like some users are still having problems writing the temp file. It might be caused by an older version of the app running simultaneously, but it's hard to say. The temp file is created by the mpqs factoring algorithm in PARI, which I turned on within the last week (it's a faster algorithm, but requires temp files). To be on the safe side, I have turned mpqs back off for now. And yes, that means another version of the app to download. Sorry!
ID: 1325 · Rating: 0
Matthias Lehmkuhl
Message 1326 - Posted: 27 Jun 2015, 15:43:02 UTC - in response to Message 1322.  

Hi Eric,
Get Decics with Bounded Discriminant v3.04 was also getting calculation errors, but I also found some finished/valid results.

But the new one, "Get Decics with Bounded Discriminant v3.05", is running well. No errors found. All results are valid.
I could only check the uploaded results on one of my hosts.

One point regarding the temp files: why is there a problem?
Normally BOINC apps should use the slot directory for temp files. In that case two running apps using the same temp file name should not conflict.

One additional point: the run time of 3.05 is nearly the same as the CPU time.
http://numberfields.asu.edu/NumberFields/result.php?resultid=11110368
But on 3.04 the run time is twice the CPU time.
http://numberfields.asu.edu/NumberFields/result.php?resultid=11143093
To me it looks like the faster algorithm is not really faster in terms of run time.
Matthias
ID: 1326 · Rating: 0
Eric Driver
Message 1327 - Posted: 27 Jun 2015, 19:21:15 UTC - in response to Message 1326.  

The problem with the temp files is that the PARI code looks for "standard" temp locations, such as /tmp on Linux (I forget the standard Windows location). This in itself can be a problem, as some users were getting the error "could not find suitable temp directory". So the problem is that PARI knows nothing about the BOINC slots, and ends up using the same directory for all processes. I could probably hack the code to use the BOINC slots, but if I were to hack anything it would be to remove the mpqs dependence on file I/O (Not easy! If you know any CS student out there looking for a class project, this would be a challenging one).
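
As a rough illustration of the slot-directory idea discussed above, the following Python sketch gives each task its own uniquely named temp directory inside its slot, so that two tasks writing the same file name cannot collide and one task's cleanup cannot delete another task's file. The helper name, the `mpqs_tmp_` prefix, the `slot0`/`slot1` layout and the `relations.tmp` file name are all hypothetical, and the sketch does not show how PARI itself would be pointed at such a directory.

```python
import os
import tempfile


def make_private_tmpdir(slot_dir: str) -> str:
    """Create a uniquely named temp directory inside slot_dir and return its path."""
    return tempfile.mkdtemp(prefix="mpqs_tmp_", dir=slot_dir)


# Stand-ins for two BOINC slot directories (hypothetical layout).
with tempfile.TemporaryDirectory() as root:
    paths = []
    for i in range(2):
        slot = os.path.join(root, f"slot{i}")
        os.makedirs(slot)
        tmpdir = make_private_tmpdir(slot)
        path = os.path.join(tmpdir, "relations.tmp")  # same file name in both tasks
        with open(path, "w") as f:
            f.write(f"data from task {i}\n")
        paths.append(path)
    print("distinct files:", paths[0] != paths[1])
```

The design choice here is simply isolation by directory rather than by file name: the temp file names can stay the same as long as every process has a directory of its own.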

Regarding the difference in run time vs. CPU time, I can't say why that is. It could be the same issue discussed in this thread:
http://numberfields.asu.edu/NumberFields/forum_thread.php?id=217&postid=1194#1194

Ultimately, I would like to use the default PARI factoring algorithm, which utilizes mpqs. In some test cases, it was almost 50% faster. The few times it was slower, it was only by a small amount.
ID: 1327 · Rating: 0
GDLS
Joined: 15 Mar 15
Posts: 11
Credit: 113,280,935
RAC: 0
Message 1330 - Posted: 1 Jul 2015, 22:31:05 UTC

So... if there's a huge split between a task's CPU Time and its CPU Time at Last Checkpoint (say, several hours), is that an indication that all is not well and that the task should be aborted?

That seems to be a common thread of the tasks that get to a point where they're stuck for days.
ID: 1330 · Rating: 0
Eric Driver
Message 1331 - Posted: 2 Jul 2015, 2:57:33 UTC - in response to Message 1330.  

It's hard to say. The checkpointing occurs within the innermost loops, so if it's taking a long time to checkpoint that means it's spending a bunch of time on the tests at the heart of the loops - these involve computing irreducibility, computing the discriminant, and factoring the discriminant. It's usually the factoring that slows things down. This does not necessarily imply things are stuck, but it could be an indication that this will turn out to be one of the longer running WUs.
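
The relationship between a slow inner test and a late checkpoint can be sketched as follows. This is a simplified Python illustration, not the actual NumberFields code: the loop structure, the JSON checkpoint format, and the names `run_search` and `test` are all assumptions made for the example.

```python
import json
import os
import tempfile


def run_search(outer, inner, checkpoint_path, test):
    """Run test(a, b) over all pairs, checkpointing after every inner step."""
    start_i, start_j = 0, 0
    if os.path.exists(checkpoint_path):
        # Resume from the position recorded by an earlier run.
        with open(checkpoint_path) as f:
            state = json.load(f)
        start_i, start_j = state["i"], state["j"]

    done = []
    for i in range(start_i, len(outer)):
        first_j = start_j if i == start_i else 0
        for j in range(first_j, len(inner)):
            # The checkpoint below cannot be written until this call returns,
            # so one slow test (e.g. a hard factorization) delays it.
            done.append(test(outer[i], inner[j]))
            with open(checkpoint_path, "w") as f:
                json.dump({"i": i, "j": j + 1}, f)  # next pair to process
    return done


# Small demonstration with a trivial stand-in for the expensive test.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.json")
    print(run_search([1, 2], [10, 20], path, lambda a, b: a * b))
```

In a layout like this, a long gap since the last checkpoint only tells you that a single call to the test is taking a long time, not that the task has stopped making progress.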

Note that there should not be very many of the slow WUs, so if you are seeing this a lot, it could be a sign that something else is amiss. If your client hasn't downloaded the latest app, then that would most likely be the culprit.
(Latest Apps as of July 1st: Version 2.07 for GD, and version 3.05 for GBD)
ID: 1331 · Rating: 0
GDLS
Message 1332 - Posted: 3 Jul 2015, 13:18:44 UTC - in response to Message 1331.  

How about a huge gap between CPU time and Elapsed time (called CPU Time and Run Time on the web page)? Is that a cause for concern?
ID: 1332 · Rating: 0
Eric Driver
Message 1333 - Posted: 3 Jul 2015, 15:50:07 UTC - in response to Message 1332.  

How much of a gap are we talking about? The Run Time can easily exceed the CPU Time by a factor of two or more. For example, if you have a Core i7 with 4 physical cores and hyperthreading enabled, with 8 WUs running, then you will see Run = 2*CPU (approx). If you start running all kinds of computationally intensive processes in parallel with BOINC, the gap can get even bigger.

Now if you have N physical cores and are running N WUs, and that's all that your system is doing, then the two times should be close. If they are not, then yes, there is cause for concern. In Windows you could check the Task Manager for anything suspicious.
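
The rule of thumb in the two paragraphs above can be written down as a tiny Python helper. It is only a back-of-the-envelope model that follows the approximation given in this post (tasks beyond the number of physical cores share cores evenly, and nothing else is running); the function name is hypothetical and real systems will deviate from it.

```python
def expected_run_to_cpu_ratio(running_tasks: int, physical_cores: int) -> float:
    """Rough estimate of Run Time / CPU Time when tasks share physical cores."""
    if running_tasks <= 0 or physical_cores <= 0:
        raise ValueError("both arguments must be positive")
    # With no more tasks than cores, each task gets a core to itself.
    return max(1.0, running_tasks / physical_cores)


# 8 WUs on 4 physical cores (hyperthreaded) versus 4 WUs on 4 cores.
print(expected_run_to_cpu_ratio(8, 4))
print(expected_run_to_cpu_ratio(4, 4))
```

A measured ratio well above this estimate on an otherwise idle machine would be the situation described above as a cause for concern.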

The only time I saw this caused by the NumberFields app was a file contention problem, where multiple WUs were fighting over limited disk space. This actually happened to me when I had inadvertently filled up my tmp partition and PARI had no space for writing its temp files - instead of erroring out, the app became "stuck" waiting to write to the disk. HOWEVER, this can no longer happen, because all file I/O has been removed from the NumberFields app (except for the standard checkpointing and other BOINC file I/O).
ID: 1333 · Rating: 0
GDLS
Message 1334 - Posted: 4 Jul 2015, 4:59:38 UTC - in response to Message 1333.  

I've searched the BOINC documentation, but have found no indication of what "CPU time" and "Elapsed Time" actually mean. Or how they relate to the terms "CPU Time" and "Run Time" on your website. Can you explain? I could list my guesses but that seems pointless.

Also, I apologize if these inquiries should be in the "Questions & Answers" section of this forum, but this thread seems to be better supported.

Unrelated note: we should be hitting 3 million Numbers@Home credits sometime tonight...exciting.
ID: 1334 · Rating: 0
Eric Driver
Message 1335 - Posted: 4 Jul 2015, 8:05:52 UTC - in response to Message 1334.  

My understanding is that Run time is how long the WU runs for as measured on a clock, and CPU time is how much time would have been spent on the CPU if it had been at 100%. So, for example, if you were only using 50% of the CPU, the run time would be twice the CPU time.
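
The difference between the two times can be seen in a small Python sketch. It uses Python's own timers, not anything from BOINC, and the `measure` helper is a hypothetical name: wall-clock time stands in for Run time, and the process's CPU time stands in for CPU time.

```python
import time


def measure(work):
    """Return (run_time, cpu_time) in seconds for a single call to work()."""
    wall_start = time.perf_counter()  # wall-clock timer
    cpu_start = time.process_time()   # CPU time used by this process
    work()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


# Sleeping uses up wall-clock time but almost no CPU time, so the two diverge.
run_time, cpu_time = measure(lambda: time.sleep(0.2))
print(f"run time {run_time:.2f} s, CPU time {cpu_time:.2f} s")
```

A task that spends its time waiting (on disk, or for a turn on a shared core) behaves like the sleep here: its Run time keeps growing while its CPU time does not.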

Congrats on the 3 million mark!
ID: 1335 · Rating: 0