Linux + Intel ARC = GPU HANG GetDecics_4.02

Message boards : Number crunching : Linux + Intel ARC = GPU HANG GetDecics_4.02
Paul

Joined: 7 Jul 23
Posts: 3
Credit: 24,480
RAC: 0
Message 3532 - Posted: 7 Jul 2023, 19:22:42 UTC

Just thought I should report this. Not sure what I can do. I think this app is in beta/testing. Please let me know if I can do something to help. I only found one other project that even supports ARC on Linux, but their app seems to work fine on the same system.

Jul 06 18:05:21 <hostname> kernel: i915 0000:07:00.0: [drm] GPU HANG: ecode 12:10:85defffa, in GetDecics_4.02_ [149986]
Jul 06 18:05:21 <hostname> kernel: i915 0000:07:00.0: [drm] GetDecics_4.02_[149986] context reset due to GPU hang
Jul 06 18:11:36 <hostname> kernel: i915 0000:07:00.0: [drm] GPU HANG: ecode 12:10:85defffa, in GetDecics_4.02_ [155909]
Jul 06 18:11:36 <hostname> kernel: i915 0000:07:00.0: [drm] GetDecics_4.02_[155909] context reset due to GPU hang
ID: 3532 · Rating: 0
Paul

Joined: 7 Jul 23
Posts: 3
Credit: 24,480
RAC: 0
Message 3533 - Posted: 8 Jul 2023, 1:16:41 UTC

In case the error task goes away before someone gets to this, I'll copy it here:

<core_client_version>7.22.0</core_client_version>
<![CDATA[
<message>
aborted by user</message>
<stderr_txt>
GPU Summary String = [CAL|AMDRadeonRX6800XT|1|16368MB||200][INTEL|Intel(R)Arc(TM)A750Graphics|1|7721MB||300].
Loading GPU lookup table from file.
GPU was not found in the lookup table.  Using default values:
  numBlocks = 1024.
  threadsPerBlock = 32.
  polyBufferSize = 32768.
Successfully Built Program.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantInit.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB8.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantMpInit.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB7DegA9.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB7DegA8.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB6DegA9.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB6DegA8.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB6DegA7.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB5.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB4.

Successfully Created Stage 2 Kernel: pdtKernelDiv2.
Successfully Created Stage 2 Kernel: pdtKernelDiv5.
Successfully Created Stage 2 Kernel: pdtKernelDivP.

Successfully Created Stage 3 Kernel.

Successfully Created Polynomial Memory Buffer.
Successfully Created Output Flag Memory Buffer.
Successfully Created Discriminant Data Buffer.
Successfully Created PolyA Data Buffer.
Successfully Created PolyB Data Buffer.
Successfully Created DegA Data Buffer.
Successfully Created DegB Data Buffer.
Successfully Created G Data Buffer.
Successfully Created H Data Buffer.
Successfully Created mpA Data Buffer.
Successfully Created mpB Data Buffer.

OpenCL initialization was successful.
CHECKPOINT_FILE = wu_sf6_DS-11x11_Grp583191of1500000_checkpoint.
Checkpoint Flag = 0.
Reading file ../../projects/numberfields.asu.edu_NumberFields/sf6_DS-11x11_Grp583191of1500000.dat
    K = x^2 - 10
    S = [2, 5]
    Disc Bound = 10000000000000
    Skip = (P^4)*(Q^5)
    Num Congruences = 64
    SCALE = 1.000000
    |dK| = 40
    Signature = [2,0]
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf6_DS-11x11_Grp583191of1500000_0_r2076229659_0
Now starting the targeted Martinet search:
  Num Cvecs = 64.
    Doing Cvec 1.
    Doing Cvec 2.

</stderr_txt>
]]>


(Sorry for the typo in the original post, too.)
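The stderr above stops right after "Doing Cvec 2" out of 64. A small sketch like the following could report how far a task got before it stalled. It assumes the "Num Cvecs" and "Doing Cvec" lines keep the wording shown above; `cvec_progress` is a hypothetical helper name, not part of the app.

```python
import re


def cvec_progress(stderr_text):
    """Return (last_cvec_started, total_cvecs) from the app's stderr text.

    Either value is None if the corresponding line is absent.
    """
    total = re.search(r"Num Cvecs = (\d+)", stderr_text)
    started = re.findall(r"Doing Cvec (\d+)", stderr_text)
    return (
        int(started[-1]) if started else None,
        int(total.group(1)) if total else None,
    )
```

Comparing this across several failed tasks would show whether the hang always happens early in the search or at varying points.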
ID: 3533 · Rating: 0
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1323
Credit: 411,419,698
RAC: 248,318
Message 3534 - Posted: 8 Jul 2023, 3:38:19 UTC - in response to Message 3533.  

It looks like it passed the OpenCL build phase, but then the GPU hung while processing. In the past, that was usually caused by insufficient GPU RAM, an older card (not enough FLOPS), or a problem with the card's driver. You have plenty of RAM and the A750 should be comparable to the Nvidia 3060, so in theory this card should work great. That would point to a bad driver. So the obvious question is: do you have the latest driver installed? If you dual boot into Windows, you could also try it there.

Does Intel have an equivalent to the nvidia-smi command (or the AMD radeontop command) for viewing GPU utilization? If so, use that to make sure it's not running out of RAM, and when it hangs, check whether utilization goes to zero.
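One candidate is `intel_gpu_top` from the igt-gpu-tools package, though whether it reports useful numbers for an A750 on a given kernel would have to be checked. On a system with more than one GPU it helps to know which DRM card node is the Intel one first. The sketch below finds it by reading the PCI vendor ID from sysfs; it assumes the usual `/sys/class/drm/cardN/device/vendor` layout, and `find_intel_cards` is a name invented for this example.

```python
import os
import re

INTEL_VENDOR_ID = "0x8086"  # PCI vendor ID assigned to Intel


def find_intel_cards(drm_root="/sys/class/drm"):
    """Return names of DRM card nodes (e.g. 'card1') whose PCI vendor is Intel."""
    cards = []
    if not os.path.isdir(drm_root):
        return cards
    for name in sorted(os.listdir(drm_root)):
        if not re.fullmatch(r"card\d+", name):
            continue  # skip connectors such as card0-DP-1 and render nodes
        vendor_path = os.path.join(drm_root, name, "device", "vendor")
        try:
            with open(vendor_path) as f:
                vendor = f.read().strip()
        except OSError:
            continue  # node without a readable PCI vendor file
        if vendor.lower() == INTEL_VENDOR_ID:
            cards.append(name)
    return cards


if __name__ == "__main__":
    print(find_intel_cards())
```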

At one point in the past I had a bad driver for my AMD card and the GPU app would periodically hang. Radeontop showed me that utilization went to 0%, so it was acting like the GPU was turned off. From the client I would click "Suspend GPU" under Activity, and then click "Use GPU always" to turn it back on - this would cause the GPU to wake up and continue processing. It's a long shot, but maybe something similar is happening here.
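The same suspend/resume toggle can be scripted with `boinccmd --set_gpu_mode`, which accepts `always`, `auto`, or `never`. Here is a minimal sketch, assuming `boinccmd` is on the PATH and is allowed to talk to the running client (it may need to be run from the BOINC data directory or be given the RPC password). The helper names are hypothetical.

```python
import subprocess
import time

VALID_MODES = ("always", "auto", "never")


def gpu_mode_command(mode, boinccmd="boinccmd"):
    """Build the argument list for 'boinccmd --set_gpu_mode <mode>'."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return [boinccmd, "--set_gpu_mode", mode]


def toggle_gpu(pause_seconds=5, run=subprocess.run):
    """Suspend GPU work, wait briefly, then switch back to 'always'."""
    run(gpu_mode_command("never"), check=True)
    time.sleep(pause_seconds)
    run(gpu_mode_command("always"), check=True)
```

Note that this leaves the client in "always" mode afterwards; if you normally run in "auto", pass that mode back instead.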
ID: 3534 · Rating: 0
Paul

Joined: 7 Jul 23
Posts: 3
Credit: 24,480
RAC: 0
Message 3539 - Posted: 8 Jul 2023, 9:19:38 UTC - in response to Message 3534.  

Determining both the installed *and* the most current driver versions on Linux is somewhat difficult. I cannot even find a driver version listed on the official driver page, but of course that makes sense because consumer drivers are filed upstream and distro maintainers decide what to deploy. My distro, Fedora, is very current. Given that the drivers were last updated months ago, I'm sure I have the latest stable drivers, but not anything beta or rc. I'm completely unfamiliar with the Intel Arc, but I have considerable experience with AMDGPU, and the situation seems to be identical, FWICT.
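For a bug report, two version numbers are probably the most useful: the kernel release (the i915 driver ships with the kernel) and the version the userspace OpenCL runtime reports. A rough sketch follows, assuming the `clinfo` tool is installed and prints "Device Name" and "Driver Version" lines in its usual two-column layout. The parsing is a guess at that layout, not a documented format.

```python
import platform
import re
import shutil
import subprocess


def parse_clinfo(text):
    """Pair each 'Device Name' line with the next 'Driver Version' line."""
    devices = []
    current = None
    for line in text.splitlines():
        m = re.match(r"\s*(Device Name|Driver Version)\s{2,}(.+)", line)
        if not m:
            continue
        key, value = m.group(1), m.group(2).strip()
        if key == "Device Name":
            current = {"device": value, "driver": None}
            devices.append(current)
        elif current is not None and current["driver"] is None:
            current["driver"] = value
    return devices


def report():
    """Print the kernel release and the OpenCL driver version per device."""
    print("kernel:", platform.release())
    if shutil.which("clinfo") is None:
        print("clinfo not found on PATH")
        return
    out = subprocess.run(["clinfo"], capture_output=True, text=True).stdout
    for dev in parse_clinfo(out):
        print(dev["device"], "->", dev["driver"])


if __name__ == "__main__":
    report()
```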

In any case, I cannot easily get newer drivers.

What do you make of the fact that PrimeGrid runs fine on the same hardware & drivers? Unrelated?

I couldn't find any tasks completed on Linux + Arc, but that was a manual search; I didn't know how else to do it on the web interface.

I could try Win10/11 on VM. Been thinking about trying that; never done pass-through of PCI device with libvirt. Maybe I'll give that a shot.

No, it appears there isn't much support for monitoring or fan control for the Arc. This is also a problem with new AMD cards. I didn't find anything about it with a quick search, though. Some i2c bus sensors seem to work with `sensors`; I see temps, at least, but no power or fan info.

I'm interested in what you say about the bad AMD driver. I believe this is the situation I have now with AMD and an RDNA3 card. I get a similar kernel error, "GPU hang" or equivalent, but because it is the primary display adapter, the whole desktop/video crashes. That is accompanied by many other kernel messages, including "failed reset" and the like. With an RDNA2 AMD card on the same driver, by contrast, I occasionally get other BOINC app errors with similar initial kernel error messages, but they do not cascade, and the damage is isolated to that OpenCL process. In these cases, the GPU load *does not* remain pegged at 100%; it drops, power consumption drops too, and progress on that WU stalls. I can abort these WUs manually or, eventually, they get reaped somehow, and I don't notice until I see WUs reported as "error" in my project account's task list. I'm sure stopping and restarting GPU use with boincmgr/boinccmd would do the same as you suggest.

So, yes, it could be a driver issue. But I think it's unlikely to be hardware or system config. In that case, I'm not sure trying Win would tell me much more than what we already suspect. And any difference between the two apps for the two platforms, Win & Linux, could also explain such a differential result.

I could try to report this to the OSS driver maintainers. Unless someone here can help me provide more details, I doubt my report will be very helpful to them.
ID: 3539 · Rating: 0
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1323
Credit: 411,419,698
RAC: 248,318
Message 3540 - Posted: 8 Jul 2023, 16:19:14 UTC - in response to Message 3539.  

"What do you make of the fact that PrimeGrid runs fine on the same hardware & drivers? Unrelated?"


I think that's because the PrimeGrid code is much simpler than the NumberFields code. The NumberFields algorithm is very complex, including multiple nested loops with early exits when certain conditions are met. The code adheres to the OpenCL spec, but sometimes the vendor's OpenCL implementation is flawed. I saw this a couple of years ago with the AMD OpenCL implementation: every new driver would crash, and always in a different way, sometimes during the build phase and sometimes during execution. AMD eventually got their act together, and their drivers now work most of the time, with only the occasional hiccup. I think this is what's happening with the Intel drivers; however, there were some older Intel cards that worked last year when I first put the app out (if memory serves).
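To give a feel for what "nested loops with early exits" means on a GPU, here is a toy model in Python. It is not the NumberFields kernel: the Euclidean gcd loop is only a stand-in for any loop whose trip count depends on the data, and the group size of 32 is borrowed from the default threadsPerBlock in the log above. It compares the steps each work item actually needs with the steps spent if every group has to wait for its slowest member. It says nothing about why the driver hangs.

```python
def gcd_steps(a, b):
    """Count Euclidean steps; the loop exits early, depending on the data."""
    steps = 0
    while b:
        a, b = b, a % b
        steps += 1
    return steps


def lockstep_cost(inputs, group_size=32):
    """Return (steps_needed, steps_spent_in_lockstep) over all inputs.

    In lockstep, every lane in a group is charged for the group's slowest lane.
    """
    needed = 0
    lockstep = 0
    for start in range(0, len(inputs), group_size):
        counts = [gcd_steps(a, b) for a, b in inputs[start:start + group_size]]
        needed += sum(counts)
        lockstep += max(counts) * len(counts)
    return needed, lockstep
```

The gap between the two numbers grows with how uneven the trip counts are, which is one reason a heavily branching kernel exercises a vendor's compiler and driver differently from a simple one.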

I mentioned trying Windows because the drivers are usually different and the vendors usually put more resources into testing/fixing the Windows drivers. If the app works on Windows then that would be a starting point for the Intel developers to diagnose the problem with the Linux version of the drivers.
ID: 3540 · Rating: 0
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1323
Credit: 411,419,698
RAC: 248,318
Message 3542 - Posted: 8 Jul 2023, 16:50:55 UTC - in response to Message 3540.  

I did a quick search of the message boards and found the following, which is good to know:
The following cards have returned successful results: HD500, HD515, Gen9, UHD620, UHD630.

The following links are also interesting. The 2nd shows that there have been successful results from the Arc A750 GPUs (I believe on Windows).
https://numberfields.asu.edu/NumberFields/forum_thread.php?id=579
https://numberfields.asu.edu/NumberFields/forum_thread.php?id=501&postid=3421#3421
ID: 3542 · Rating: 0

Copyright © 2024 Arizona State University