Talk about your long ones
Author | Message |
---|---|
Send message Joined: 10 Jan 12 Posts: 8 Credit: 2,406,752 RAC: 88 |
It looks like there is an issue with "Get Decics with Bounded Discriminant v3.03". On one of my computers only one result finished OK; here is the good one: http://numberfields.asu.edu/NumberFields/result.php?resultid=11107827 All the others end in a calculation error after varying run times. Sample result with a calculation error: http://numberfields.asu.edu/NumberFields/result.php?resultid=11107958 Host: http://numberfields.asu.edu/NumberFields/results.php?hostid=22930&offset=0&show_names=0&state=0&appid=2 stderr is empty for the results with a calculation error. CPU usage was below 50% for the last result I saw, and the same holds on the other host: http://numberfields.asu.edu/NumberFields/results.php?hostid=2794 Matthias |
Send message Joined: 10 Jan 12 Posts: 8 Credit: 2,406,752 RAC: 88 |
Here are the stderr.txt contents for the result http://numberfields.asu.edu/NumberFields/result.php?resultid=11108092 I copied some GetBoundedDecics_state updates from before the calculation error: Checkpoint Flag = 0. a5 Starting Index = 0. a22 Starting Value = -1000000000. a21 Starting Value = -1000000000. a32 Starting Value = -1000000000. a31 Starting Value = -1000000000. PolyCount starting value = 0. Stat Count 1 = 0. Stat Count 2 = 0. Stat Count 3 = 0. Elapsed Time = 0 (sec). Entering MartinetSearch routine... Die Syntax für den Dateinamen, Verzeichnisnamen oder die Datenträgerbezeichnung ist falsch. [English: "The filename, directory name, or volume label syntax is incorrect."] Disc Bound = 120000000000.00000000 Reading file ../../projects/numberfields.asu.edu_NumberFields/wu_12E10_SF73-0_Idx9_Grp37421of124454.dat: K = y^2 - 73 TgtFlag = 0 a1 Index = 9 NumVals_a5 = 1 a5 values: 25 + -6w a22_L = -3 a22_U = -3 a21_L = 41 a21_U = 41 a32_L = 23 a32_U = 46 |dK| = 73 Signature = [2,0] a11 = -1 a12 = 2 sig1a1 = -10.544003745317531167871648326239706435 sig2a1 = 6.5440037453175311678716483262397064346 Ca1_pre = 30.800000 Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_12E10_SF73-0_Idx9_Grp37421of124454_0_0 Now starting the Martinet search: Doing case a5 = 25 + -6w... 2nd part of Martinet bound = 18.978846. Martinet bound = 49.778846. a22_L = -3. a22_U = -3. a22 = -3. a21_L = 41. a21_U = 41. a21 = 41. a32_L = 23. a32_U = 46. Hope this helps. Matthias |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 492,109,891 RAC: 547,407 |
Matthias - since your results have no stderr, it's hard to determine exactly what happened; however, I noticed some other users were also failing and the error was related to the writing of a certain temp file. I uploaded a new app which should fix this issue. I'm hoping that will fix your problem too. |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 492,109,891 RAC: 547,407 |
Matthias - since your results have no stderr, it's hard to determine exactly what happened; however, I noticed some other users were also failing and the error was related to the writing of a certain temp file. I uploaded a new app which should fix this issue. I'm hoping that will fix your problem too. I should have also mentioned, the problem with the temp file was an issue with multiple processes - one process was "cleaning up" and happened to clean up the temp file that another process was using. So if you only run 1 NumberFields process at a time you wouldn't notice this (and hence why I didn't catch it while testing!) |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 492,109,891 RAC: 547,407 |
I see I've picked up a new v2.05 app already. I'll abort unstarted v2.01 tasks, but let the ones which are already running complete without interruption. Hey Richard - I should have mentioned I duplicated your observation of the Qsqrt WUs completing really fast after a restart. A good test is to suspend the WU and then restart it - you should now see it continue normally. Ultimately, the problem was the way the checkpoint file was being written (%ld format in fprintf when it should have been %lld). |
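For anyone curious about the %ld vs %lld mix-up, here is a minimal sketch of a checkpoint value being written and read back. It is not the app's actual checkpoint code: `format_checkpoint`, `parse_checkpoint` and the `poly_count` field are hypothetical names chosen for illustration. The point is only that a `long long` must be printed with `%lld`; on 64-bit Windows `long` is 32 bits, so `%ld` does not match a 64-bit argument.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical checkpoint writer: formats one 64-bit counter into buf.
   "%lld" matches the long long argument; "%ld" would be a mismatch. */
int format_checkpoint(char *buf, size_t len, long long poly_count)
{
    return snprintf(buf, len, "%lld\n", poly_count);
}

/* Hypothetical checkpoint reader: parses the counter back as a long long. */
long long parse_checkpoint(const char *buf)
{
    return strtoll(buf, NULL, 10);
}
```

A value above the 32-bit range, such as 5000000000, survives the round trip when both sides use the 64-bit format and parser.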
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 492,109,891 RAC: 547,407 |
It looks like some users are still having problems writing the temp file. It might be caused by an older version of the app running simultaneously, but it's hard to say. The temp file is created by the mpqs factoring algorithm in PARI, which I turned on within the last week (it's a faster algorithm, but requires temp files). To be on the safe side, I have turned mpqs back off for now. And yes, that means another version of the app to download. Sorry! |
Send message Joined: 10 Jan 12 Posts: 8 Credit: 2,406,752 RAC: 88 |
Hi Eric, Get Decics with Bounded Discriminant v3.04 was also getting calculation errors, although I also found some finished/valid results. The new one, "Get Decics with Bounded Discriminant v3.05", is running well: no errors found and all results are valid. I could only check the uploaded results on one of my hosts. One point regarding the temp files: why is there a problem? Normally BOINC apps should use the slot dir for temp files, in which case two running apps using the same temp file name should not conflict. One additional point: the run time of 3.05 is nearly the same as the CPU time: http://numberfields.asu.edu/NumberFields/result.php?resultid=11110368 but on 3.04 the run time is twice the CPU time: http://numberfields.asu.edu/NumberFields/result.php?resultid=11143093 To me it looks like the faster algorithm is not really faster in terms of run time. Matthias |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 492,109,891 RAC: 547,407 |
Hi Eric, The problem with the temp files is that the PARI code looks for "standard" temp locations, such as /tmp on Linux (I forget the standard Windows location). This in itself can be a problem, as some users were getting the error "could not find suitable temp directory". So the problem is that PARI knows nothing about the BOINC slots, and ends up using the same directory for all processes. I could probably hack the code to use the BOINC slots, but if I were to hack anything it would be to remove the mpqs dependence on file I/O (Not easy! If you know any CS student out there looking for a class project, this would be a challenging one). Regarding the difference in run time vs CPU time, I can't say why that is. It could be the same issue discussed in this thread: http://numberfields.asu.edu/NumberFields/forum_thread.php?id=217&postid=1194#1194 Ultimately, I would like to use the default PARI factoring algorithm, which utilizes mpqs. In some test cases, it was almost 50% faster. The few times it was slower, it was only by a small amount. |
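To picture the slot-directory idea raised above: if each process builds its temp file name from its own directory plus a per-process id, two processes cannot end up with the same file. The sketch below is not PARI's code or the NumberFields app's code. `make_temp_path`, the `mpqs` prefix and the `.tmp` suffix are made up for illustration, and it assumes each task is started in its own BOINC slot directory so that "." refers to that slot.

```c
#include <stdio.h>

/* Hypothetical helper: writes "<dir>/<prefix>_<id>.tmp" into out.
   Returns 0 on success, -1 if the path does not fit in len bytes. */
int make_temp_path(char *out, size_t len, const char *dir,
                   const char *prefix, long id)
{
    int n = snprintf(out, len, "%s/%s_%ld.tmp", dir, prefix, id);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Called with "." as the directory and the process id as `id`, each running task would get a distinct path inside its own slot, so one task's cleanup could not remove another task's file.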
Send message Joined: 15 Mar 15 Posts: 11 Credit: 113,280,935 RAC: 0 |
So...if there's a huge split between a task's CPU Time and its CPU Time at Last Checkpoint (say, several hours), is that an indication that all is not well and an abortion is in order? That seems to be a common thread of the tasks that get to a point where they're stuck for days. |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 492,109,891 RAC: 547,407 |
So...if there's a huge split between a task's CPU Time and its CPU Time at Last Checkpoint (say, several hours), is that an indication that all is not well and an abortion is in order? It's hard to say. The checkpointing occurs within the innermost loops, so if it's taking a long time to checkpoint, that means it's spending a lot of time on the tests at the heart of the loops - these involve computing irreducibility, computing the discriminant, and factoring the discriminant. It's usually the factoring that slows things down. This does not necessarily imply things are stuck, but it could be an indication that this will turn out to be one of the longer-running WUs. Note that there should not be very many of the slow WUs, so if you are seeing this a lot, it could be a sign that something else is amiss. If your client hasn't downloaded the latest app, then that would most likely be the culprit. (Latest apps as of July 1st: version 2.07 for GD, and version 3.05 for GBD) |
Send message Joined: 15 Mar 15 Posts: 11 Credit: 113,280,935 RAC: 0 |
Note that there should not be very many of the slow WUs, so if you are seeing this a lot, it could be a sign that something else is amiss. How about a huge gap between CPU time and Elapsed time (called CPU Time and Run Time on the web page)? Is that a cause for concern? |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 492,109,891 RAC: 547,407 |
Note that there should not be very many of the slow WUs, so if you are seeing this a lot, it could be a sign that something else is amiss. How much of a gap are we talking about? It's easy for the Run Time to be several times the CPU time. For example, if you have a Core i7 with 4 physical cores and you are hyperthreaded with 8 WUs running, then you will see Run = 2*CPU (approx). If you start running all kinds of computationally intensive processes in parallel with BOINC, the gap can get even bigger. Now if you have N physical cores and are running N WUs, and that's all that your system is doing, then the two times should be close. If they are not, then yes, there is cause for concern. In Windows you could check the Task Manager for anything suspicious. The only time I saw this caused by the NumberFields app was a file contention problem, where multiple WUs were fighting over limited disk space. This actually happened to me when I had inadvertently filled up my tmp partition and PARI had no space for writing its temp files - instead of erroring out, the app became "stuck" waiting to write to the disk. HOWEVER, this can no longer happen, because all file I/O has been removed from the NumberFields app (except for the standard checkpointing and other BOINC file I/O). |
Send message Joined: 15 Mar 15 Posts: 11 Credit: 113,280,935 RAC: 0 |
I've searched the BOINC documentation, but have found no indication of what "CPU time" and "Elapsed Time" actually mean. Or how they relate to the terms "CPU Time" and "Run Time" on your website. Can you explain? I could list my guesses but that seems pointless. Also, I apologize if these inquiries should be in the "Questions & Answers" section of this forum, but this thread seems to be better supported. Unrelated note: we should be hitting 3 million Numbers@Home credits sometime tonight...exciting. |
Send message Joined: 8 Jul 11 Posts: 1341 Credit: 492,109,891 RAC: 547,407 |
I've searched the BOINC documentation, but have found no indication of what "CPU time" and "Elapsed Time" actually mean. Or how they relate to the terms "CPU Time" and "Run Time" on your website. Can you explain? I could list my guesses but that seems pointless. My understanding is that Run time is how long the WU runs for as measured on a clock, and CPU time is how much time would have been spent on the CPU if it had been at 100%. So for example, if you were only using 50% of the CPU, the run time would be twice the CPU time. Congrats on the 3 million mark! |
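The relationship described in this reply can be written as a one-line calculation. This is only a sketch of that reading of the two numbers, not anything taken from BOINC itself; `cpu_utilization` is a hypothetical helper name, and the inputs are the CPU Time and Run Time shown on a result page, in seconds.

```c
/* Hypothetical helper: fraction of wall-clock (run) time that the task
   actually spent on the CPU. Returns 0 if run_seconds is not positive. */
double cpu_utilization(double cpu_seconds, double run_seconds)
{
    if (run_seconds <= 0.0)
        return 0.0;
    return cpu_seconds / run_seconds;
}
```

With a CPU time of 1800 s and a run time of 3600 s this gives 0.5, which is the "run time is twice the CPU time" case from the example above.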