Signal 4

Steve Hawker*
Joined: 1 Jul 12
Posts: 13
Credit: 2,099,843
RAC: 0
Message 1161 - Posted: 17 Nov 2014, 19:03:48 UTC

Ran some Decics but after an hour or so, got a Signal 4.

According to the interwebz this is an illegal instruction and I should inform you about it. But here's the weird thing - the Get Decics app is more than 2 years old. So surely someone would have seen this before. Unless this is another one of those Mavericks problems?
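
For reference, the mapping from a signal number to its symbolic name can be checked with Python's standard `signal` module. This is only a sketch for illustration; the helper name `signal_name` is made up here and is not part of BOINC or the Get Decics app.

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name for a signal number, or 'UNKNOWN'."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to a signal on this platform.
        return "UNKNOWN"


# On Linux and macOS, signal 4 is SIGILL (illegal instruction).
print(signal_name(4))
```

Signal numbering is platform-specific in general, so it is safest to run the check on the same kind of machine that reported the error.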

Greg Tucker
Project administrator
Project developer
Project tester
Joined: 8 Jul 11
Posts: 46
Credit: 7,144,042
RAC: 0
Message 1163 - Posted: 17 Nov 2014, 20:19:48 UTC - in response to Message 1161.  

Indeed unusual. Can you send the details of the system you are running on and perhaps stderr.txt?

Steve Hawker*
Joined: 1 Jul 12
Posts: 13
Credit: 2,099,843
RAC: 0
Message 1164 - Posted: 17 Nov 2014, 20:43:16 UTC - in response to Message 1163.  

Machine:

http://numberfields.asu.edu/NumberFields/show_host_detail.php?hostid=4811

stderr.txt for one wu:

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
process got signal 4
</message>
<stderr_txt>
Checkpoint Flag = 0.
Cvec Starting Index = 0.
N1 Start = 0.
N2 Start = 0.
PolyCount starting value = 0.
Stat Count 1 = 0.
Stat Count 2 = 0.
Stat Count 3 = 0.
Elapsed Time = 0 (sec).
Reading file ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-10x271_Grp11273of682667.dat
K = x^2 - 2
S = [2, 5]
Disc Bound = 15625000000000000
Skip = (P^2)*(Q^6)
Num Congruences = 3
|dK| = 8
Signature = [2,0]
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-10x271_Grp11273of682667_0_0
Now starting the targeted Martinet search:
N2_L = -143.
N2_U = 142.

</stderr_txt>
]]>

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,802,678
RAC: 289,026
Message 1166 - Posted: 17 Nov 2014, 21:13:45 UTC - in response to Message 1164.  

I remember getting some strange errors like that a long time ago. I couldn't figure out why it was happening, but they went away after I rebooted my system. How long has it been since you restarted your computer?

Steve Hawker*
Joined: 1 Jul 12
Posts: 13
Credit: 2,099,843
RAC: 0
Message 1168 - Posted: 17 Nov 2014, 23:58:11 UTC - in response to Message 1166.  

Just now, so I'll try again.

Thanks

Steve Hawker*
Joined: 1 Jul 12
Posts: 13
Credit: 2,099,843
RAC: 0
Message 1169 - Posted: 18 Nov 2014, 1:59:57 UTC - in response to Message 1166.  

So far, so good. w00t!!

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,802,678
RAC: 289,026
Message 1170 - Posted: 18 Nov 2014, 2:28:32 UTC - in response to Message 1169.  

Good to hear!

Greg Tucker
Project administrator
Project developer
Project tester
Joined: 8 Jul 11
Posts: 46
Credit: 7,144,042
RAC: 0
Message 1172 - Posted: 19 Nov 2014, 4:47:00 UTC - in response to Message 1170.  

I don't see the undefined instruction in the stderr, and there is really no reason we should get one, so I'm going to assume it was in the client app.

Steve Hawker*
Joined: 1 Jul 12
Posts: 13
Credit: 2,099,843
RAC: 0
Message 1173 - Posted: 19 Nov 2014, 4:55:28 UTC - in response to Message 1170.  

And now bad...

I had some WUs that progressed nicely all the way to 80%+ and then they all crapped out at the same time. The stderr is below. I had left these WUs crunching all alone until I suspended all tasks to travel home. I restarted them and that seemed OK, but as I say, all 8 WUs crapped out at about the same time.

STDERR:

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
process got signal 4
</message>
<stderr_txt>
Checkpoint Flag = 0.
Cvec Starting Index = 0.
N1 Start = 0.
N2 Start = 0.
PolyCount starting value = 0.
Stat Count 1 = 0.
Stat Count 2 = 0.
Stat Count 3 = 0.
Elapsed Time = 0 (sec).
Reading file ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-10x271_Grp18669of682667.dat
K = x^2 - 2
S = [2, 5]
Disc Bound = 15625000000000000
Skip = (P^2)*(Q^6)
Num Congruences = 3
|dK| = 8
Signature = [2,0]
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-10x271_Grp18669of682667_0_0
Now starting the targeted Martinet search:
N2_L = -143.
N2_U = 143.
Checkpoint Flag = 0.
Cvec Starting Index = 0.
N1 Start = 0.
N2 Start = 0.
PolyCount starting value = 0.
Stat Count 1 = 0.
Stat Count 2 = 0.
Stat Count 3 = 0.
Elapsed Time = 0 (sec).
Reading file ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-10x271_Grp18669of682667.dat
K = x^2 - 2
S = [2, 5]
Disc Bound = 15625000000000000
Skip = (P^2)*(Q^6)
Num Congruences = 3
|dK| = 8
Signature = [2,0]
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-10x271_Grp18669of682667_0_0
Now starting the targeted Martinet search:
N2_L = -143.
N2_U = 143.
N2_L = -142.
N2_U = 143.

</stderr_txt>
]]>

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,802,678
RAC: 289,026
Message 1174 - Posted: 19 Nov 2014, 6:02:22 UTC - in response to Message 1173.  

The only thing that looks odd in the stderr is that it doesn't appear to be checkpointing. Each run of the program begins where you see the words "Checkpoint Flag = 0". The second time you see this is where it started up the second time, and it still says that the Checkpoint Flag is zero, which means the checkpoint file doesn't exist and the search is essentially starting over.
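
The check described above can be sketched in a few lines of Python. This is an illustration only: `summarize_restarts` is a made-up helper, and treating a nonzero flag as "resumed from a checkpoint" is an assumption based on the statement that zero means no checkpoint file was found.

```python
import re


def summarize_restarts(stderr_text: str) -> dict:
    """Count program starts and how many of them found a checkpoint."""
    # Every start of the app prints one "Checkpoint Flag = N." line.
    flags = [
        int(value)
        for value in re.findall(
            r"^Checkpoint Flag = (\d+)\.", stderr_text, flags=re.MULTILINE
        )
    ]
    return {
        "starts": len(flags),
        # Assumption: a nonzero flag means a checkpoint file was found.
        "resumed_from_checkpoint": sum(1 for flag in flags if flag != 0),
    }


# Shortened excerpt in the shape of the stderr posted in this thread.
sample = """Checkpoint Flag = 0.
Cvec Starting Index = 0.
N2_L = -143.
N2_U = 143.
Checkpoint Flag = 0.
Cvec Starting Index = 0.
N2_L = -143.
N2_U = 143.
"""

print(summarize_restarts(sample))
```

Two starts with zero resumes is the pattern that suggests the task restarted from scratch after being suspended.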

Looking through your other results, it looks like sometimes checkpointing does work. Also, going back several days, you were returning successful results, but I noticed the following warning in some of the stderrs:

*** Warning: TMPDIR is set (/var/folders/pv/pq1qjq2n37s7cxx9fkywtqsh0000gn/T/), but is not writable.

Maybe this is a hint to what is going on. Could there be a problem writing to disk, which is causing the checkpoint to fail?
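
One way to test the disk-write theory is to try creating a file in the directory named by TMPDIR. The sketch below is a generic Python check, not something taken from the app; `can_write` is a hypothetical helper, and the result is only meaningful if it is run as the same user account that the BOINC client runs under.

```python
import os
import tempfile


def can_write(directory: str) -> bool:
    """Return True if a temporary file can be created in `directory`."""
    try:
        # The file is removed automatically when the block exits.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory, missing permissions, a full disk, etc.
        return False


# Fall back to Python's default temp directory if TMPDIR is not set.
tmpdir = os.environ.get("TMPDIR", tempfile.gettempdir())
print(tmpdir, "writable:", can_write(tmpdir))
```

If this reports that the directory is not writable, it would be consistent with the warning, although it would not by itself prove that the checkpoint file is written there.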

[AF>Le_Pommier] Jerome_C2005
Joined: 3 Jun 12
Posts: 8
Credit: 3,053,184
RAC: 1,143
Message 1245 - Posted: 6 Mar 2015, 22:51:59 UTC

Hi

I just got 5 WUs crashing at the same time on my iMac with that weird signal 4 error. The bad thing is that 4 of them had more than 14 hours of crunching...

I have experienced that same crash with other boinc projects in the past, not very often but on a regular basis, I mentioned it on projects forums, and never got a clear answer from any project admin (some of them did search).

<core_client_version>7.4.36</core_client_version>
<![CDATA[
<message>
process got signal 4
</message>
<stderr_txt>
Checkpoint Flag = 0.
Cvec Starting Index = 0.
N1 Start = 0.
N2 Start = 0.
PolyCount starting value = 0.
Stat Count 1 = 0.
Stat Count 2 = 0.
Stat Count 3 = 0.
Elapsed Time = 0 (sec).
Reading file ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-10x271_Grp334013of682667.dat
K = x^2 - 2
S = [2, 5]
Disc Bound = 15625000000000000
Skip = (P^2)*(Q^6)
Num Congruences = 3
|dK| = 8
Signature = [2,0]
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-10x271_Grp334013of682667_0_0
Now starting the targeted Martinet search:
N2_L = -142.
N2_U = 143.

</stderr_txt>
]]>


The very weird thing is that when this happens, all of one project's WUs crash at the same time, but it doesn't affect other projects running alongside them. I have an i7 with 8 WUs of various projects running at the same time, 24/7.

Those who answered me on the forums said "it has something to do with your Mac, look in the system logs", which I did, trying to spot in the Mac console some event happening at that precise moment. Unfortunately I could never see anything obvious; I'm not techy enough for that.

For example now I can see that before the crash I have many events like

06/03/2015 22:16:36,000 kernel[0]: RTS: Scan time-out on file '/Volumes/Macintosh HD/Users/Shared/BOINC Data/slots/7/GetDecics_state'


but they also happen all the time on other boinc WUs running at the same time

and just before the crash I see

06/03/2015 22:18:54,868 com.apple.xpc.launchd[1]: (com.apple.ReportCrash.Root[45493]) Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash.DirectoryService


then again some scan time out and then 5 crash reports are generated in 2 seconds, below is one of them, hope this can help in any way :

Process: GetDecics_1.02_x86_64-apple-darwin [19605]
Path: /Volumes/VOLUME/*/GetDecics_1.02_x86_64-apple-darwin
Identifier: GetDecics_1.02_x86_64-apple-darwin
Version: ???
Code Type: X86-64 (Native)
Parent Process: boinc [39]
Responsible: boinc [39]
User ID: 505

Date/Time: 2015-03-06 22:18:55.650 +0100
OS Version: Mac OS X 10.10.2 (14C109)
Report Version: 11
Anonymous UUID: 99E98E42-3596-D6BD-3945-BC745EC5D3D6


Time Awake Since Boot: 170000 seconds

Crashed Thread: 0 Dispatch queue: com.apple.main-thread

Exception Type: EXC_CRASH (SIGILL)
Exception Codes: 0x0000000000000000, 0x0000000000000000

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 GetDecics_1.02_x86_64-apple-darwin 0x000000010aa82088 Flx_rem + 1656
1 GetDecics_1.02_x86_64-apple-darwin 0x000000010aa833ef Flx_resultant + 223
2 GetDecics_1.02_x86_64-apple-darwin 0x000000010ac90baf ZX_resultant_all + 2143
3 GetDecics_1.02_x86_64-apple-darwin 0x000000010ac9116c ZX_disc_all + 140
4 GetDecics_1.02_x86_64-apple-darwin 0x000000010ac7fdd8 poldisc0 + 104
5 GetDecics_1.02_x86_64-apple-darwin 0x000000010aa36c63 Mart52Engine_Tgt(long*, long*, long, long, int, long, long, long, long long*, std::basic_ofstream<char, std::char_traits<char> >&) + 18947
6 GetDecics_1.02_x86_64-apple-darwin 0x000000010aa37c46 TgtMartinet(char*, char*, int, long, long, long, long long*) + 2134
7 GetDecics_1.02_x86_64-apple-darwin 0x000000010aa30c82 main + 658
8 GetDecics_1.02_x86_64-apple-darwin 0x000000010aa309e8 start + 52

Thread 1:
0 libsystem_c.dylib 0x00007fff8c61ad3a tzload + 323
1 libsystem_c.dylib 0x00007fff8c61a691 tzsetwall_basic + 162
2 libsystem_c.dylib 0x00007fff8c61a880 _st_tzset_basic + 363
3 libsystem_c.dylib 0x00007fff8c61c00f localtime_r + 41
4 GetDecics_1.02_x86_64-apple-darwin 0x000000010aa37d66 boinc_msg_prefix(char*, int) + 96
5 GetDecics_1.02_x86_64-apple-darwin 0x000000010aa39750 timer_thread(void*) + 1390
6 libsystem_pthread.dylib 0x00007fff8d91c268 _pthread_body + 131
7 libsystem_pthread.dylib 0x00007fff8d91c1e5 _pthread_start + 176
8 libsystem_pthread.dylib 0x00007fff8d91a41d thread_start + 13

Thread 0 crashed with X86 Thread State (64-bit):
rax: 0x00000000000011fa rbx: 0x0000000000006bbf rcx: 0x0000000113dfd2c8 rdx: 0x0000000000002651
rdi: 0x0000000113dfd2e0 rsi: 0x0000000000000009 rbp: 0x000000000000000a rsp: 0x00007fff551cf530
r8: 0x0000000113dfd340 r9: 0x0000000000000001 r10: 0x0000000000000002 r11: 0x0000000000000000
r12: 0x0000000000000000 r13: 0x0000000000000009 r14: 0x0000000113dfd340 r15: 0x0000000000000009
rip: 0x000000010aa82088 rfl: 0x0000000000000246 cr2: 0x0000000000c67ce0

Logical CPU: 0
Error Code: 0x00000000
Trap Number: 222


Binary Images:
0x10aa2f000 - 0x10ae43fef +GetDecics_1.02_x86_64-apple-darwin (???) <F5AE0C31-27E1-7C5D-759D-169DA4965479> /Volumes/VOLUME/*/GetDecics_1.02_x86_64-apple-darwin
0x7fff687cd000 - 0x7fff68803837 dyld (353.2.1) <65DCCB06-339C-3E25-9702-600A28291D0E> /usr/lib/dyld
0x7fff8abeb000 - 0x7fff8ac1bfff libsystem_m.dylib (3086.1) <1E12AB45-6D96-36D0-A226-F24D9FB0D9D6> /usr/lib/system/libsystem_m.dylib
0x7fff8ac23000 - 0x7fff8ac23ff7 libunc.dylib (29) <5676F7EA-C1DF-329F-B006-D2C3022B7D70> /usr/lib/system/libunc.dylib
0x7fff8b828000 - 0x7fff8b82aff7 libsystem_coreservices.dylib (9) <41B7C578-5A53-31C8-A96F-C73E030B0938> /usr/lib/system/libsystem_coreservices.dylib
0x7fff8b82b000 - 0x7fff8b83cff7 libsystem_coretls.dylib (35.1.2) <BC691CD1-17B6-39A5-BD02-AF973695FD1D> /usr/lib/system/libsystem_coretls.dylib
0x7fff8b846000 - 0x7fff8b84efff libsystem_platform.dylib (63) <64E34079-D712-3D66-9CE2-418624A5C040> /usr/lib/system/libsystem_platform.dylib
0x7fff8b8dd000 - 0x7fff8b905fff libsystem_info.dylib (459) <B85A85D5-8530-3A93-B0C3-4DEC41F79478> /usr/lib/system/libsystem_info.dylib
0x7fff8b916000 - 0x7fff8b918ff7 libsystem_sandbox.dylib (358.1.1) <95312E09-DA28-324A-A084-F3E574D0210E> /usr/lib/system/libsystem_sandbox.dylib
0x7fff8c5c6000 - 0x7fff8c652ff7 libsystem_c.dylib (1044.10.1) <199ED5EB-77A1-3D43-AA51-81779CE0A742> /usr/lib/system/libsystem_c.dylib
0x7fff8d894000 - 0x7fff8d895fff libSystem.B.dylib (1213) <90B107BC-FF74-32CC-B1CF-4E02F544D957> /usr/lib/libSystem.B.dylib
0x7fff8d8cd000 - 0x7fff8d8d1fff libcache.dylib (69) <45E9A2E7-99C4-36B2-BEE3-0C4E11614AD1> /usr/lib/system/libcache.dylib
0x7fff8d919000 - 0x7fff8d922fff libsystem_pthread.dylib (105.10.1) <3103AA7F-3BAE-3673-9649-47FFD7E15C97> /usr/lib/system/libsystem_pthread.dylib
0x7fff8d95c000 - 0x7fff8d961ff7 libmacho.dylib (862) <126CA2ED-DE91-308F-8881-B9DAEC3C63B6> /usr/lib/system/libmacho.dylib
0x7fff8e420000 - 0x7fff8e421fff libsystem_secinit.dylib (18) <581DAD0F-6B63-3A48-B63B-917AF799ABAA> /usr/lib/system/libsystem_secinit.dylib
0x7fff8e74a000 - 0x7fff8e790ff7 libauto.dylib (186) <A260789B-D4D8-316A-9490-254767B8A5F1> /usr/lib/libauto.dylib
0x7fff8e791000 - 0x7fff8e807fe7 libcorecrypto.dylib (233.1.2) <E1789801-3985-3949-B736-6B3378873301> /usr/lib/system/libcorecrypto.dylib
0x7fff8e915000 - 0x7fff8e915ff7 liblaunch.dylib (559.10.3) <DFCDEBDF-8247-3DC7-9879-E7E497DDA4B4> /usr/lib/system/liblaunch.dylib
0x7fff8f9b3000 - 0x7fff8f9c9ff7 libsystem_asl.dylib (267) <F153AC5B-0542-356E-88C8-20A62CA704E2> /usr/lib/system/libsystem_asl.dylib
0x7fff8fa14000 - 0x7fff8fa1dff7 libsystem_notify.dylib (133.1.1) <61147800-F320-3DAA-850C-BADF33855F29> /usr/lib/system/libsystem_notify.dylib
0x7fff90f4f000 - 0x7fff90f55fff libsystem_trace.dylib (72.1.3) <A9E6B7D8-C327-3742-AC54-86C94218B1DF> /usr/lib/system/libsystem_trace.dylib
0x7fff90f6b000 - 0x7fff90fa3ffb libsystem_network.dylib (411.1) <2EC3A005-473F-3C36-A665-F88B5BACC7F0> /usr/lib/system/libsystem_network.dylib
0x7fff90fa4000 - 0x7fff90fa9ff7 libsystem_stats.dylib (163.10.18) <9B8CCF24-DDDB-399A-9237-4BEC225D2E8C> /usr/lib/system/libsystem_stats.dylib
0x7fff91155000 - 0x7fff9115bff7 libsystem_networkextension.dylib (167.1.10) <29AB225B-D7FB-30ED-9600-65D44B9A9442> /usr/lib/system/libsystem_networkextension.dylib
0x7fff91f21000 - 0x7fff91f28ff7 libcompiler_rt.dylib (35) <BF8FC133-EE10-3DA6-9B90-92039E28678F> /usr/lib/system/libcompiler_rt.dylib
0x7fff928db000 - 0x7fff92905ff7 libdispatch.dylib (442.1.4) <502CF32B-669B-3709-8862-08188225E4F0> /usr/lib/system/libdispatch.dylib
0x7fff92d72000 - 0x7fff92d72ff7 libkeymgr.dylib (28) <77845842-DE70-3CC5-BD01-C3D14227CED5> /usr/lib/system/libkeymgr.dylib
0x7fff92ea5000 - 0x7fff92ea7ff7 libquarantine.dylib (76) <DC041627-2D92-361C-BABF-A869A5C72293> /usr/lib/system/libquarantine.dylib
0x7fff92ea8000 - 0x7fff92ec4ff7 libsystem_malloc.dylib (53.1.1) <19BCC257-5717-3502-A71F-95D65AFA861B> /usr/lib/system/libsystem_malloc.dylib
0x7fff92f32000 - 0x7fff92f3dfff libcommonCrypto.dylib (60061) <D381EBC6-69D8-31D3-8084-5A80A32CB748> /usr/lib/system/libcommonCrypto.dylib
0x7fff94584000 - 0x7fff9458cffb libcopyfile.dylib (118.1.2) <0C68D3A6-ACDD-3EF3-991A-CC82C32AB836> /usr/lib/system/libcopyfile.dylib
0x7fff94e01000 - 0x7fff94e09fff libsystem_dnssd.dylib (561.1.1) <62B70ECA-E40D-3C63-896E-7F00EC386DDB> /usr/lib/system/libsystem_dnssd.dylib
0x7fff95301000 - 0x7fff95302ffb libremovefile.dylib (35) <3485B5F4-6CE8-3C62-8DFD-8736ED6E8531> /usr/lib/system/libremovefile.dylib
0x7fff95303000 - 0x7fff95320fff libsystem_kernel.dylib (2782.10.72) <97CD7ACD-EA0C-3434-BEFC-FCD013D6BB73> /usr/lib/system/libsystem_kernel.dylib
0x7fff95cf7000 - 0x7fff95d46ff7 libstdc++.6.dylib (104.1) <803F6AC8-87DC-3E24-9E80-729B551F6FFF> /usr/lib/libstdc++.6.dylib
0x7fff96716000 - 0x7fff9691046f libobjc.A.dylib (647) <759E155D-BC42-3D4E-869B-6F57D477177C> /usr/lib/libobjc.A.dylib
0x7fff96991000 - 0x7fff96992fff libDiagnosticMessagesClient.dylib (100) <2EE8E436-5CDC-34C5-9959-5BA218D507FB> /usr/lib/libDiagnosticMessagesClient.dylib
0x7fff96a11000 - 0x7fff96a13fff libsystem_configuration.dylib (699.1.5) <5E14864E-089A-3D84-85A4-980B776427A8> /usr/lib/system/libsystem_configuration.dylib
0x7fff96b47000 - 0x7fff96b6ffff libxpc.dylib (559.10.3) <876216DC-D5D3-381E-8AF9-49AE464E5107> /usr/lib/system/libxpc.dylib
0x7fff96be0000 - 0x7fff96c0bfff libc++abi.dylib (125) <88A22A0F-87C6-3002-BFBA-AC0F2808B8B9> /usr/lib/libc++abi.dylib
0x7fff97a26000 - 0x7fff97a7afff libc++.1.dylib (120) <1B9530FD-989B-3174-BB1C-BDC159501710> /usr/lib/libc++.1.dylib
0x7fff9866b000 - 0x7fff98670ff7 libunwind.dylib (35.3) <BE7E51A0-B6EA-3A54-9CCA-9D88F683A6D6> /usr/lib/system/libunwind.dylib
0x7fff99d21000 - 0x7fff99d22ff7 libsystem_blocks.dylib (65) <9615D10A-FCA7-3BE4-AA1A-1B195DACE1A1> /usr/lib/system/libsystem_blocks.dylib
0x7fff9a134000 - 0x7fff9a137ff7 libdyld.dylib (353.2.1) <4E33E416-F1D8-3598-B8CC-6863E2ECD0E6> /usr/lib/system/libdyld.dylib

External Modification Summary:
Calls made by other processes targeting this process:
task_for_pid: 5177
thread_create: 0
thread_set_state: 0
Calls made by this process:
task_for_pid: 0
thread_create: 0
thread_set_state: 0
Calls made by all processes on this machine:
task_for_pid: 6885315
thread_create: 0
thread_set_state: 0

VM Region Summary:
ReadOnly portion of Libraries: Total=80.6M resident=40.3M(50%) swapped_out_or_unallocated=40.3M(50%)
Writable regions: Total=224.9M written=2036K(1%) resident=2428K(1%) swapped_out=0K(0%) unallocated=222.5M(99%)

REGION TYPE VIRTUAL
=========== =======
Kernel Alloc Once 4K
MALLOC 216.4M
MALLOC (admin) 16K
STACK GUARD 56.0M
Stack 8216K
VM_ALLOCATE 8K
__DATA 940K
__LINKEDIT 70.4M
__TEXT 10.2M
mapped file 8K
shared memory 4K
=========== =======
TOTAL 362.0M

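For anyone who wants to pull the failing call chain out of a crash report like the one above, here is a minimal Python sketch. It assumes the layout shown in this post: a "Thread N Crashed" header followed by one frame per line, each ending in "symbol + offset". The function name `crashed_frames` is made up for illustration.

```python
import re

# One stack frame: index, image name, address, then "symbol + offset".
FRAME = re.compile(r"^(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(.+?) \+ \d+$")


def crashed_frames(report: str) -> list:
    """Return the symbol names of the crashed thread, innermost first."""
    frames = []
    in_crashed = False
    for line in report.splitlines():
        if re.match(r"^Thread \d+ Crashed", line):
            in_crashed = True
            continue
        if in_crashed:
            match = FRAME.match(line.strip())
            if not match:
                break  # a blank line or a new section ends the backtrace
            frames.append(match.group(4))
    return frames


# Shortened excerpt copied from the report in this post.
SAMPLE = """\
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 GetDecics_1.02_x86_64-apple-darwin 0x000000010aa82088 Flx_rem + 1656
1 GetDecics_1.02_x86_64-apple-darwin 0x000000010aa833ef Flx_resultant + 223

Thread 1:
0 libsystem_c.dylib 0x00007fff8c61ad3a tzload + 323
"""

print(crashed_frames(SAMPLE))  # ['Flx_rem', 'Flx_resultant']
```

Comparing the innermost frames across several reports would show whether the crash always lands in the same function, which is the kind of detail a developer can act on.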

Thanks for your help !

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,802,678
RAC: 289,026
Message 1246 - Posted: 7 Mar 2015, 5:08:27 UTC - in response to Message 1245.  

Sorry, nothing stands out from those logs. I did check your stderrs for any hints but those all looked normal (except for the "process got signal 4").

Maybe Greg will see something in those logs (he's a little better than me with those things).

[AF>Le_Pommier] Jerome_C2005
Joined: 3 Jun 12
Posts: 8
Credit: 3,053,184
RAC: 1,143
Message 1247 - Posted: 7 Mar 2015, 9:48:10 UTC

Thanks for your quick answer, hopefully Greg has some other idea.

Just one more thing: I had another WU failing this morning, alone (it was the only numberfields WU running at that time). In the meantime I had some leiden, lattice, rosetta, citizen science grid, seti and WUProp WUs that terminated successfully.

So the issue may be related to *my* machine and some other people's machines (since this thread was not started by me), BUT it has to do with the way the numberfields application is coded: some instruction it uses interacts badly with my system and causes its failure.

If you google "process got signal 4" you'll find some entries with seti (2011), malariacontrol (2006), milkyway (2008), lattice (2014, me ! they never answered anything)...

I found a very interesting post on the eOn forum (2013) regarding a "process got signal 8" error, where someone (an admin?) answered: "The signal 8 error means " SIGFPE 8 Core Floating point exception" and is project-related. I have edited my input parameters handle this error and have not had such an error has not occurred since." So maybe it's not the same issue, but I read "edited my input parameters handle this error": what does that mean? Could the same kind of "parameter setup" be done to "ignore" it here? Or did the project admin mean that he dealt with this error in the eOn application code?

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,802,678
RAC: 289,026
Message 1248 - Posted: 8 Mar 2015, 0:14:33 UTC - in response to Message 1247.  

I'm not sure if this will help, but I just upgraded the applications. I actually built these versions about 6 weeks ago and since then have extensively tested the Linux versions.

These versions use the latest BOINC libraries and the latest stable PARI libraries. As you pointed out earlier, the regular decic app was over 2 years old. Maybe these newer executables will fix whatever was conflicting with your system.

[AF>Le_Pommier] Jerome_C2005
Joined: 3 Jun 12
Posts: 8
Credit: 3,053,184
RAC: 1,143
Message 1249 - Posted: 8 Mar 2015, 12:07:24 UTC

Hopefully !

I'll see when I get the next WUs. I might not notice, though: I run many projects at the same time (as many as I can! :) ), so if everything works as expected it won't stand out.

Thanks for the update.

BobCat13
Joined: 27 Feb 15
Posts: 2
Credit: 200,215
RAC: 0
Message 1250 - Posted: 8 Mar 2015, 17:58:18 UTC - in response to Message 1248.  

My Linux system (BOINC client ver. 7.2.42 on Linux Mint 15) hasn't received any work since the applications upgrade. Here are the messages:

With both applications checked:
NumberFields@home	3-8-2015 1:44:44 PM	Sending scheduler request: To fetch work.	
NumberFields@home	3-8-2015 1:44:44 PM	Requesting new tasks for CPU	
NumberFields@home	3-8-2015 1:44:45 PM	Scheduler request completed: got 0 new tasks	
NumberFields@home	3-8-2015 1:44:45 PM	No tasks sent

With just Get Decics checked:
NumberFields@home	3-8-2015 1:47:04 PM	Sending scheduler request: Requested by user.	
NumberFields@home	3-8-2015 1:47:04 PM	Requesting new tasks for CPU	
NumberFields@home	3-8-2015 1:47:07 PM	Scheduler request completed: got 0 new tasks	
NumberFields@home	3-8-2015 1:47:07 PM	No tasks sent	
NumberFields@home	3-8-2015 1:47:07 PM	No tasks are available for Get Decic Fields

With just Get Bounded Decics checked:
NumberFields@home	3-8-2015 1:48:43 PM	Sending scheduler request: To fetch work.	
NumberFields@home	3-8-2015 1:48:43 PM	Requesting new tasks for CPU	
NumberFields@home	3-8-2015 1:48:45 PM	Scheduler request completed: got 0 new tasks	
NumberFields@home	3-8-2015 1:48:45 PM	No tasks sent	
NumberFields@home	3-8-2015 1:48:45 PM	No tasks are available for Get Decics with Bounded Discriminant

[AF>Le_Pommier] Jerome_C2005
Joined: 3 Jun 12
Posts: 8
Credit: 3,053,184
RAC: 1,143
Message 1251 - Posted: 8 Mar 2015, 20:51:52 UTC
Last modified: 8 Mar 2015, 20:54:25 UTC

When you look at a project's server status page, you never actually know whether the available WUs (there are plenty at the moment on NF for both applications) are available for every platform, or just some...

Richard Haselgrove
Joined: 28 Oct 11
Posts: 179
Credit: 220,414,582
RAC: 128,405
Message 1252 - Posted: 8 Mar 2015, 21:27:18 UTC

I got a number of WUs allocated and running with the new applications, up to 8 Mar 2015, 5:21:40 UTC. That was the last - nothing new allocated since then.

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,802,678
RAC: 289,026
Message 1253 - Posted: 8 Mar 2015, 21:50:46 UTC - in response to Message 1250.  

My hosts are having this same problem. I've been looking into this for the last hour. I have upgraded to the latest client and reattached to the project, but I am still getting "No tasks are available for Get Decic Fields".

This seems to be a problem with the scheduler. Looking at the logs, I can see that it is giving work out to some users, and not others. The logs are not descriptive enough to determine why some users are not given work.

I will continue to look into this...

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,802,678
RAC: 289,026
Message 1254 - Posted: 8 Mar 2015, 23:45:03 UTC - in response to Message 1253.  

Found the problem. I disabled the "need reliable" mechanism and now my hosts are getting work. I had been using my hosts to do some testing and needed to abort a bunch of NumberFields WUs. This flagged my hosts as "unreliable". There were plenty of normal WUs available, so I'm not sure why BOINC was only giving out need-reliable WUs.

Copyright © 2024 Arizona State University