Signal 4
Author | Message |
---|---|
Joined: 1 Jul 12 Posts: 13 Credit: 2,099,843 RAC: 0 |
Ran some Decics but after an hour or so, got a Signal 4. According to the interwebz this is an illegal instruction and I should inform you about it. But here's the weird thing - the Get Decics app is more than 2 years old. So surely someone would have seen this before. Unless this is another one of those Mavericks problems? |
Joined: 8 Jul 11 Posts: 46 Credit: 7,144,042 RAC: 0 |
Indeed unusual. Can you send the details of the system you are running on and perhaps stderr.txt? |
Joined: 1 Jul 12 Posts: 13 Credit: 2,099,843 RAC: 0 |
Machine: http://numberfields.asu.edu/NumberFields/show_host_detail.php?hostid=4811 stderr.txt for one wu: <core_client_version>7.2.42</core_client_version> <![CDATA[ <message> process got signal 4 </message> <stderr_txt> Checkpoint Flag = 0. Cvec Starting Index = 0. N1 Start = 0. N2 Start = 0. PolyCount starting value = 0. Stat Count 1 = 0. Stat Count 2 = 0. Stat Count 3 = 0. Elapsed Time = 0 (sec). Reading file ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-10x271_Grp11273of682667.dat K = x^2 - 2 S = [2, 5] Disc Bound = 15625000000000000 Skip = (P^2)*(Q^6) Num Congruences = 3 |dK| = 8 Signature = [2,0] Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-10x271_Grp11273of682667_0_0 Now starting the targeted Martinet search: N2_L = -143. N2_U = 142. </stderr_txt> ]]> |
Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
I remember getting some strange errors like that a long time ago. I couldn't figure out why it was happening, but they went away after I rebooted my system. How long has it been since you restarted your computer? |
Joined: 1 Jul 12 Posts: 13 Credit: 2,099,843 RAC: 0 |
Just now, so I'll try again. Thanks |
Joined: 1 Jul 12 Posts: 13 Credit: 2,099,843 RAC: 0 |
So far, so good. w00t!! |
Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
Good to hear! |
Joined: 8 Jul 11 Posts: 46 Credit: 7,144,042 RAC: 0 |
I don't see the undef in the stderr and there is really no reason we should get an undef so I'm going to assume it was in the client app. |
Joined: 1 Jul 12 Posts: 13 Credit: 2,099,843 RAC: 0 |
And now bad... I had some WUs that progressed nicely all the way to 80%+ and then they all crapped out at the same time. The stderr is below. I had left these WUs crunching all alone until I suspended all tasks to travel home. I restarted them and that seemed OK, but as I say, all 8 WUs crapped out at about the same time. STDERR: <core_client_version>7.2.42</core_client_version> <![CDATA[ <message> process got signal 4 </message> <stderr_txt> Checkpoint Flag = 0. Cvec Starting Index = 0. N1 Start = 0. N2 Start = 0. PolyCount starting value = 0. Stat Count 1 = 0. Stat Count 2 = 0. Stat Count 3 = 0. Elapsed Time = 0 (sec). Reading file ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-10x271_Grp18669of682667.dat K = x^2 - 2 S = [2, 5] Disc Bound = 15625000000000000 Skip = (P^2)*(Q^6) Num Congruences = 3 |dK| = 8 Signature = [2,0] Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-10x271_Grp18669of682667_0_0 Now starting the targeted Martinet search: N2_L = -143. N2_U = 143. Checkpoint Flag = 0. Cvec Starting Index = 0. N1 Start = 0. N2 Start = 0. PolyCount starting value = 0. Stat Count 1 = 0. Stat Count 2 = 0. Stat Count 3 = 0. Elapsed Time = 0 (sec). Reading file ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-10x271_Grp18669of682667.dat K = x^2 - 2 S = [2, 5] Disc Bound = 15625000000000000 Skip = (P^2)*(Q^6) Num Congruences = 3 |dK| = 8 Signature = [2,0] Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-10x271_Grp18669of682667_0_0 Now starting the targeted Martinet search: N2_L = -143. N2_U = 143. N2_L = -142. N2_U = 143. </stderr_txt> ]]> |
Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
The only thing that looks odd in the stderr is that it doesn't appear to be checkpointing. The program starts where you see the words "Checkpoint Flag = 0". So the 2nd time you see this is when it started up the 2nd time and it still says that the Checkpoint Flag is zero, which means the checkpoint file doesn't exist and the search is essentially starting over. Looking through your other results, it looks like sometimes checkpointing does work. Also, going back several days, you were returning successful results, but I noticed the following warning in some of the stderrs: *** Warning: TMPDIR is set (/var/folders/pv/pq1qjq2n37s7cxx9fkywtqsh0000gn/T/), but is not writable. Maybe this is a hint to what is going on. Could there be a problem writing to disk, which is causing the checkpoint to fail? |
Joined: 3 Jun 12 Posts: 9 Credit: 3,254,916 RAC: 54 |
Hi, I just got 5 WUs crashing at the same time on my iMac with that weird signal 4 error. Bad thing is that 4 of them had more than 14 hours of crunching... I have experienced that same crash with other boinc projects in the past, not very often but on a regular basis. I mentioned it on the projects' forums and never got a clear answer from any project admin (some of them did search). <core_client_version>7.4.36</core_client_version> The very weird thing is that when this happens, all WUs of one project crash at the same time, but it doesn't affect other projects running at the same time - I have an i7 with 8 WUs of various projects running at the same time, 24/7. Those who answered me on the forums said "it has something to do with your Mac, look in the system logs", which I did, trying to find in the Mac console some event happening at that precise moment. Unfortunately I could never see anything obvious, I'm not techy enough for that. For example, now I can see that before the crash I have many events like 06/03/2015 22:16:36,000 kernel[0]: RTS: Scan time-out on file '/Volumes/Macintosh HD/Users/Shared/BOINC Data/slots/7/GetDecics_state' but they also happen all the time on other boinc WUs running at the same time, and just before the crash I see 06/03/2015 22:18:54,868 com.apple.xpc.launchd[1]: (com.apple.ReportCrash.Root[45493]) Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash.DirectoryService then again some scan time-outs, and then 5 crash reports are generated in 2 seconds. Below is one of them, hope this can help in any way: Process: GetDecics_1.02_x86_64-apple-darwin [19605] Thanks for your help! |
Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
Sorry, nothing stands out from those logs. I did check your stderrs for any hints but those all looked normal (except for the "process got signal 4"). Maybe Greg will see something in those logs (he's a little better than me with those things). |
Joined: 3 Jun 12 Posts: 9 Credit: 3,254,916 RAC: 54 |
Thanks for your quick answer, hopefully Greg has some other idea. Just one more thing: I had another WU fail this morning, alone (it was the only numberfields WU running at that time). In the meantime I had some leiden, lattice, rosetta, citizen science grid, seti and WUProp WUs that terminated successfully. So the issue may be related to *my* machine and some other people's machines (since this thread was not started by me) BUT it has to do with the way the numberfields application is coded, some instruction it uses that interacts badly with my system and causes it to fail. If you google "process got signal 4" you'll find some entries with seti (2011), malariacontrol (2006), milkyway (2008), lattice (2014, me! they never answered anything)... I found a very interesting post on the eOn forum (2013) regarding a "process got signal 8" error and someone (admin?) answered "The signal 8 error means " SIGFPE 8 Core Floating point exception" and is project-related. I have edited my input parameters handle this error and have not had such an error has not occurred since." So maybe it's not the same issue, but I read "edited my input parameters handle this error". What does it mean? Maybe somehow the same kind of "parameter setup" can be done to "ignore" it? Or did the project admin mean that he dealt with this error in the eOn application code? |
Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
I'm not sure if this will help, but I just upgraded the applications. I actually built these versions about 6 weeks ago and since then have extensively tested the linux versions. These versions used the latest boinc libraries and the latest stable pari libraries. As you pointed out earlier, the regular decic app was over 2 years old. Maybe these newer executables will fix whatever was conflicting with your system. |
Joined: 3 Jun 12 Posts: 9 Credit: 3,254,916 RAC: 54 |
Hopefully! I'll see when I get the next WUs. I may not even notice if they work OK: I run many projects at the same time (as many as I can! :) ), so if all goes as expected it won't stand out. Thanks for the update. |
Joined: 27 Feb 15 Posts: 2 Credit: 200,215 RAC: 0 |
My Linux system (ver. 7.2.42 on Linux Mint 15) hasn't received any work since the applications upgrade. Here are the messages: With both applications checked: NumberFields@home 3-8-2015 1:44:44 PM Sending scheduler request: To fetch work. NumberFields@home 3-8-2015 1:44:44 PM Requesting new tasks for CPU NumberFields@home 3-8-2015 1:44:45 PM Scheduler request completed: got 0 new tasks NumberFields@home 3-8-2015 1:44:45 PM No tasks sent With just Get Decics checked: NumberFields@home 3-8-2015 1:47:04 PM Sending scheduler request: Requested by user. NumberFields@home 3-8-2015 1:47:04 PM Requesting new tasks for CPU NumberFields@home 3-8-2015 1:47:07 PM Scheduler request completed: got 0 new tasks NumberFields@home 3-8-2015 1:47:07 PM No tasks sent NumberFields@home 3-8-2015 1:47:07 PM No tasks are available for Get Decic Fields With just Get Bounded Decics checked: NumberFields@home 3-8-2015 1:48:43 PM Sending scheduler request: To fetch work. NumberFields@home 3-8-2015 1:48:43 PM Requesting new tasks for CPU NumberFields@home 3-8-2015 1:48:45 PM Scheduler request completed: got 0 new tasks NumberFields@home 3-8-2015 1:48:45 PM No tasks sent NumberFields@home 3-8-2015 1:48:45 PM No tasks are available for Get Decics with Bounded Discriminant |
Joined: 3 Jun 12 Posts: 9 Credit: 3,254,916 RAC: 54 |
When you look at any project's server status, you never actually know whether the available WUs (there are plenty at the moment on NF for both applications) are available for every platform, or just some... |
Joined: 28 Oct 11 Posts: 180 Credit: 253,381,573 RAC: 179,068 |
I got a number of WUs allocated and running with the new applications, up until 8 Mar 2015, 5:21:40 UTC. That was the last - nothing new allocated since then. |
Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
My hosts are having this same problem. I've been looking into this for the last hour. I have upgraded to the latest client and reattached to the project, but I am still getting "No tasks are available for Get Decic Fields". This seems to be a problem with the scheduler. Looking at the logs, I can see that it is giving work out to some users, and not others. The logs are not descriptive enough to determine why some users are not given work. I will continue to look into this... |
Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
Found the problem. I disabled the "need reliable" mechanism and now my hosts are getting work. I had been using my hosts to do some testing and needed to abort a bunch of NumberFields WUs. This flagged my hosts as "unreliable". There were plenty of normal WUs available, so I'm not sure why BOINC was only giving out need-reliable WUs. |
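
For anyone landing here with the same error, two of the checks discussed in this thread can be sketched in a few lines of Python: translating a signal number such as 4 or 8 into its symbolic name, and testing whether the directory named by TMPDIR is actually writable (the "TMPDIR is set ... but is not writable" warning quoted above). This is only a rough diagnostic sketch, not anything taken from BOINC or the NumberFields applications; the helper names `signal_name` and `is_writable_dir` are made up for illustration.

```python
import os
import signal
import tempfile


def signal_name(number: int) -> str:
    """Return the symbolic name for a signal number, e.g. 4 -> SIGILL."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known to this platform.
        return f"unknown signal {number}"


def is_writable_dir(path: str) -> bool:
    """Return True if a temporary file can actually be created in `path`."""
    try:
        # Creating (and immediately discarding) a real file is a more
        # reliable test than inspecting permission bits.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Signal numbers mentioned in this thread.
    for number in (4, 8):
        print(number, signal_name(number))

    # Fall back to Python's default temp directory if TMPDIR is unset.
    tmpdir = os.environ.get("TMPDIR", tempfile.gettempdir())
    state = "writable" if is_writable_dir(tmpdir) else "NOT writable"
    print(tmpdir, state)
```

On Linux and macOS, signals 4 and 8 should come back as SIGILL and SIGFPE. To match what the tasks see, the script would need to run as the same user that runs the BOINC client, since a directory writable by one account may not be writable by another.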