SIGILL on Minimum Discriminant Septics WUs

mugnaio [TNAA]

Joined: 19 May 15
Posts: 4
Credit: 2,018,919
RAC: 34
Message 2035 - Posted: 7 May 2018, 18:12:14 UTC

Yesterday I noticed that many (>80) Septics WUs ended in error.
Looking at the stderr output of some of them, I saw that they all failed with the same error:

SIGILL: illegal instruction
Stack trace (4 frames):
[0x40a132]
[0xa7f370]
[0x7d84c2]
[0x7ffef7767c50]

(taken from https://numberfields.asu.edu/NumberFields/result.php?resultid=26488960).
I noticed that the last address in the stack trace changes, while the first three seem to be always the same.
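
One way to check this observation across many results is to pull the frame addresses out of each stderr dump and compare them position by position. The following is a minimal Python sketch, not a project tool: the function names are made up for illustration, the first sample is the trace quoted above, and the second sample is a hypothetical trace that differs only in its last frame.

```python
import re

# One bracketed hexadecimal address per line, e.g. "[0x40a132]".
FRAME_RE = re.compile(r"^\[(0x[0-9a-fA-F]+)\]\s*$", re.MULTILINE)

def parse_frames(stderr_text):
    """Return the bracketed frame addresses of a stderr dump as ints."""
    return [int(addr, 16) for addr in FRAME_RE.findall(stderr_text)]

def common_prefix(traces):
    """Return the leading frames that are identical in every trace."""
    shared = []
    for frames in zip(*traces):
        if len(set(frames)) != 1:
            break
        shared.append(frames[0])
    return shared

# Trace quoted in this post.
trace_a = """SIGILL: illegal instruction
Stack trace (4 frames):
[0x40a132]
[0xa7f370]
[0x7d84c2]
[0x7ffef7767c50]
"""

# Hypothetical second trace: same first three frames, different last one.
trace_b = trace_a.replace("0x7ffef7767c50", "0x7ffd12345678")

shared = common_prefix([parse_frames(trace_a), parse_frames(trace_b)])
print([hex(a) for a in shared])
```

If the shared frames are the same in every failed result, that would be consistent with fixed code addresses inside the application binary, while the changing last value is probably a stack address that moves from run to run.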

Sometimes (e.g. https://numberfields.asu.edu/NumberFields/result.php?resultid=26492750) the message points to a bug:

SIGILL: illegal instruction
*** bug in PARI/GP (Segmentation Fault), please report.
*** Error in the PARI system. End of program.

The run time at which the WUs fail ranges from 200 to 20000 seconds.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 945
Credit: 104,216,638
RAC: 68,245
Message 2037 - Posted: 7 May 2018, 20:13:51 UTC - in response to Message 2035.  

I will look into this when I get home later.

We successfully processed every one of the first 200000 WUs, and I have collated those results without problems. This tells me that if there is a bug, it must be confined to a specific platform (unless we just got lucky on the first 200000, which would mean it is a very rare bug).
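
To put a rough number on "very rare": if all of the first 200000 WUs succeeded and each WU is treated as an independent trial (an assumption, as is the 95% confidence level chosen here), the largest per-WU failure probability still consistent with that record can be bounded. A minimal Python sketch:

```python
def max_failure_rate(n_success, confidence=0.95):
    """Largest per-trial failure probability p for which zero failures in
    n_success independent trials is still plausible at the given
    confidence level, i.e. the p solving (1 - p) ** n_success = 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_success)

n = 200_000
p_max = max_failure_rate(n)

print(f"exact upper bound: {p_max:.3e}")
# The "rule of three" approximation, 3 / n, should land very close to it.
print(f"rule of three:     {3 / n:.3e}")
```

Under those assumptions a platform-independent bug would fail roughly 1.5 WUs per 100000 at most, so more than 80 failures concentrated on a single host fits a host-specific or platform-specific cause better than a rare general bug.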
mugnaio [TNAA]

Message 2039 - Posted: 7 May 2018, 21:02:11 UTC - in response to Message 2037.  

This same host successfully processed some of those WUs, and many others, over the last 2 days; its ID is 87320. Many of the WUs were resent and completed successfully, others are still in progress, and some failed for different reasons, so I think you may be right. If I'm not mistaken, this is the stock 7.6 client for Debian 9, running on an AMD Ryzen. Feel free to ask if you need more information.
Eric Driver
Message 2042 - Posted: 8 May 2018, 5:05:07 UTC - in response to Message 2039.  

So I looked at two of these WUs:
https://numberfields.asu.edu/NumberFields/workunit.php?wuid=24328194
https://numberfields.asu.edu/NumberFields/workunit.php?wuid=24332416
Both were returned successfully by other hosts that were running the same app (Linux). I also ran them offline and they ran just fine.

The error message you are getting is usually associated with memory violations. It's possibly a memory-related bug that I don't see because my system has plenty of free RAM. But I wonder if it could also be a glitch in your host? Could you reboot your system to see if the errors go away, before I go down a rabbit hole looking for obscure memory bugs?
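
When re-running a WU offline, it can help to record which signal actually terminated the process, since SIGILL and SIGSEGV are distinct signals. The sketch below is a generic Python illustration for Linux, not the project's tooling: it does not invoke the real application, whose command line is not given in this thread, and instead uses a stand-in child process that sends itself SIGILL.

```python
import signal
import subprocess
import sys

def describe_exit(returncode):
    """Translate a subprocess return code into a readable description.

    On POSIX, subprocess reports termination by signal N as return code -N.
    """
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"

# Stand-in for the real application (hypothetical): a child process that
# sends SIGILL to itself so the decoding path can be exercised.
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGILL)"
result = subprocess.run([sys.executable, "-c", child_code])

print(describe_exit(result.returncode))
```

If an application installs its own signal handler, prints a trace and then exits normally, the parent would see an ordinary exit status instead of a negative return code, so the stderr text remains the primary evidence in that case.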
mugnaio [TNAA]

Message 2043 - Posted: 8 May 2018, 17:26:47 UTC - in response to Message 2042.  

At the moment I'm processing Decic Fields WUs and they seem to run fine. I'll reboot and switch back to Septics in the next few days, when I can monitor them more closely.
mugnaio [TNAA]

Message 2055 - Posted: 18 May 2018, 15:57:09 UTC - in response to Message 2043.  

I rebooted a couple of days ago and ran a hundred Septics jobs without further errors...
Eric Driver
Message 2056 - Posted: 18 May 2018, 18:06:23 UTC - in response to Message 2055.  


Good to know. Thanks!

Copyright © 2019 Arizona State University