SIGILL on Minimum Discriminant Septics WUs
Author | Message |
---|---|
Joined: 19 May 15 Posts: 4 Credit: 2,018,919 RAC: 0 |
Yesterday I noticed that many (>80) Septics WUs ended in error. Looking at the stderr output of some of them, I saw that they all failed with the same error: "SIGILL: illegal instruction Stack trace (4 frames): [0x40a132] [0xa7f370] [0x7d84c2] [0x7ffef7767c50]" (taken from https://numberfields.asu.edu/NumberFields/result.php?resultid=26488960). The last address in the stack trace changes, while the first three seem to always be the same. Sometimes (e.g. https://numberfields.asu.edu/NumberFields/result.php?resultid=26492750) the message points to a bug: "SIGILL: illegal instruction *** bug in PARI/GP (Segmentation Fault), please report. *** Error in the PARI system. End of program." The run time at which a WU fails ranges from 200 to 20000 seconds. |
Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
I will look into this when I get home later. We successfully processed every one of the first 200000 WUs, and I have collated those results successfully. This tells me that if there is a bug, it must be confined to a specific platform (unless we just got lucky on the first 200000, which would mean it is a very rare bug). |
Joined: 19 May 15 Posts: 4 Credit: 2,018,919 RAC: 0 |
"I will look into this when I get home later." This same host successfully processed some of those and many others in the last 2 days; its ID is 87320. Many of the WUs were resent and completed successfully, others are still in progress, and some others ended in error for different reasons, so I think you may be right. If I'm not mistaken, this is the stock 7.6 client for Debian 9, running on an AMD Ryzen. Feel free to ask if you need more info. |
Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
So I looked at two of these WUs: https://numberfields.asu.edu/NumberFields/workunit.php?wuid=24328194 and https://numberfields.asu.edu/NumberFields/workunit.php?wuid=24332416. Both were returned successfully by other hosts running the same app (Linux). I also ran them offline and they ran just fine. The error message you are getting is usually associated with memory violations. It's possibly a memory-related bug that I don't see because my system has plenty of free RAM. But I wonder if it could also be a glitch in your host? Could you reboot your system to see if the errors go away, before I go down a rabbit hole looking for obscure memory bugs? |
Joined: 19 May 15 Posts: 4 Credit: 2,018,919 RAC: 0 |
"Could you reboot your system to see if the errors go away, before I go down a rabbit hole looking for obscure memory bugs?" At the moment I'm processing Decic Fields WUs and they seem to run fine. I'll reboot and switch back to Septics in the next few days, when I can watch them more closely. |
Joined: 19 May 15 Posts: 4 Credit: 2,018,919 RAC: 0 |
"Could you reboot your system to see if the errors go away, before I go down a rabbit hole looking for obscure memory bugs?" I rebooted a couple of days ago and ran a hundred Septics jobs without further errors... |
Joined: 8 Jul 11 Posts: 1344 Credit: 532,708,184 RAC: 547,640 |
Good to know. Thanks! |
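
For anyone who hits the same thing and wants to check the "first three frames are always the same" observation across many failed results, here is a minimal Python sketch. It assumes you have already saved the stderr text of each result yourself. The helper names `parse_frames` and `common_prefix` are made up for illustration, and the second sample trace is a hypothetical one (only the first is from this thread).

```python
import re

# Matches bracketed hex addresses such as [0x40a132]
FRAME_RE = re.compile(r"\[(0x[0-9a-fA-F]+)\]")


def parse_frames(stderr_text):
    """Return the frame addresses that follow 'Stack trace', or [] if absent."""
    marker = stderr_text.find("Stack trace")
    if marker == -1:
        return []
    return FRAME_RE.findall(stderr_text[marker:])


def common_prefix(traces):
    """Return the leading frames shared by every trace in the list."""
    if not traces:
        return []
    prefix = []
    for frames in zip(*traces):
        if len(set(frames)) != 1:
            break
        prefix.append(frames[0])
    return prefix


samples = [
    # Trace quoted in the first post of this thread
    "SIGILL: illegal instruction Stack trace (4 frames): "
    "[0x40a132] [0xa7f370] [0x7d84c2] [0x7ffef7767c50]",
    # Hypothetical second trace: same first three frames, different last one
    "SIGILL: illegal instruction Stack trace (4 frames): "
    "[0x40a132] [0xa7f370] [0x7d84c2] [0x7ffd1c2a9b10]",
]

traces = [parse_frames(s) for s in samples]
print(common_prefix(traces))
```

If the shared prefix stays the same across all failed results, the crash is being reported from the same place in the app each time. That alone does not say whether the cause is the app or the host.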