Posts by Paul

1) Message boards : Number crunching : Linux + Intel ARC = GPU HANG GetDecics_4.02 (Message 3539)
Posted 8 Jul 2023 by Paul
Post:
Determining both the installed *and* the most current driver versions in Linux is somewhat difficult. I cannot even find a driver version listed on the official driver page, but of course that makes sense, because consumer drivers are filed upstream and distro maintainers decide what to deploy. My distro, Fedora, is very current. Given that drivers were last updated months ago, I'm sure I have the latest stable drivers, but not anything beta or rc. I'm completely unfamiliar with the Intel Arc, but I have considerable experience with AMDGPU, and the situation seems to be identical, FWICT.

In any case, I cannot easily get newer drivers.
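
For anyone who wants to check the same thing on their own box, here is a minimal sketch of what I mean by "installed driver version". It assumes the usual `/sys/module` layout, and it assumes that a loaded module without a `version` file is the in-tree driver, so its version is simply the kernel release. The function name `loaded_gpu_modules` and its parameters are made up for illustration.

```python
import platform
from pathlib import Path


def loaded_gpu_modules(names=("i915", "amdgpu"), sys_module="/sys/module",
                       kernel_release=None):
    """Map each module name to a version string, or None if it is not loaded."""
    release = kernel_release or platform.release()
    result = {}
    for name in names:
        mod = Path(sys_module) / name
        if not mod.is_dir():
            result[name] = None  # module not loaded (or not built)
            continue
        version_file = mod / "version"
        if version_file.is_file():
            # Only present when the module declares its own version.
            result[name] = version_file.read_text().strip()
        else:
            # Assumption: no version file means the in-tree driver,
            # which is versioned together with the kernel.
            result[name] = f"kernel {release}"
    return result


if __name__ == "__main__":
    for module, version in loaded_gpu_modules().items():
        print(module, "->", version or "not loaded")
```

This only covers the kernel side; the user-space OpenCL pieces come from separate distro packages and would need to be checked with the package manager.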

What do you make of the fact that PrimeGrid runs fine on the same hardware & drivers? Unrelated?

I couldn't find any tasks completed on Linux + Arc, but that was a manual search; I didn't know how else to do it on the web interface.

I could try Win10/11 on VM. Been thinking about trying that; never done pass-through of PCI device with libvirt. Maybe I'll give that a shot.

No, it appears there isn't much support for monitoring or fan control for the Arc. This is a problem with new AMD cards, too. I didn't find anything about it with a quick search, though. Some i2c bus sensors seem to work with `sensors`; I see temps, at least, but no power or fan info.
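
The readings that `sensors` shows come from the kernel's hwmon sysfs tree, so they can also be listed directly. A minimal sketch, assuming the standard `/sys/class/hwmon` layout; the function name `read_hwmon` is made up for illustration, and which files a given card exposes depends entirely on the driver:

```python
from pathlib import Path


def read_hwmon(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip_name": {sensor_file: raw_value}} per hwmon device.

    Raw units follow the kernel hwmon convention: temp*_input is in
    millidegrees Celsius, fan*_input in RPM, power*_input in microwatts.
    """
    readings = {}
    for dev in sorted(Path(root).glob("hwmon*")):
        try:
            name = (dev / "name").read_text().strip()
        except OSError:
            name = "unknown"  # no readable name file for this device
        values = {}
        for pattern in ("temp*_input", "fan*_input", "power*_input"):
            for sensor in sorted(dev.glob(pattern)):
                try:
                    values[sensor.name] = int(sensor.read_text().strip())
                except (OSError, ValueError):
                    continue  # some sensors return an error when read
        readings[f"{dev.name}:{name}"] = values
    return readings


if __name__ == "__main__":
    for chip, values in read_hwmon().items():
        print(chip, values or "(no temp/fan/power inputs)")
```

A device that shows up with only `temp*_input` entries would match what I see with `sensors`: temps, but no power or fan info.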

I'm interested in what you say about a bad AMD driver. I believe this is the situation I have now with AMD and an RDNA3 card. I get a similar kernel error, "GPU hang" or equivalent, but because that card is the primary display adapter, the whole desktop/video crashes. That is accompanied by many other kernel messages, including "failed reset" and the like. With an RDNA2 AMD card on the same driver, by contrast, I occasionally get other BOINC app errors with similar initial kernel error messages, but they do not cascade, and the damage is isolated to that opencl process. In these cases, the GPU load *does not* remain pegged at 100%; it drops, and power consumption drops too, but progress on that WU stalls. I can abort these WUs manually or, eventually, they get reaped somehow, and I don't notice until I see WUs reported as "error" in my project account's task list. I'm sure stopping and restarting GPU use with boincmgr/boinccmd would do the same as you suggest.
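
To be concrete about the stop/start idea, here is a rough sketch that wraps `boinccmd --set_gpu_mode`. The helper names `gpu_mode_command` and `bounce_gpu` and the 30-second pause are my own inventions; I have not checked whether this actually un-sticks a hung WU.

```python
import subprocess
import time


def gpu_mode_command(mode, duration=0):
    """Build a boinccmd argument list for --set_gpu_mode.

    mode is one of "always", "auto", "never"; a non-zero duration
    (in seconds) makes the change temporary.
    """
    if mode not in ("always", "auto", "never"):
        raise ValueError(f"unknown GPU mode: {mode!r}")
    cmd = ["boinccmd", "--set_gpu_mode", mode]
    if duration:
        cmd.append(str(int(duration)))
    return cmd


def bounce_gpu(pause_seconds=30, run=subprocess.run, sleep=time.sleep):
    """Suspend GPU work, wait, then return GPU use to the preferences."""
    run(gpu_mode_command("never"), check=True)
    sleep(pause_seconds)
    run(gpu_mode_command("auto"), check=True)
```

Calling `bounce_gpu()` needs `boinccmd` on the PATH and permission to talk to the client; as far as I know, that means running it from the BOINC data directory or passing the RPC password.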

So, yes, it could be a driver issue. But I think it's unlikely to be hardware or system config. In that case, I'm not sure trying Win would tell me much more than what we already suspect. And any difference between the apps for the two platforms, Win & Linux, could also explain a different result.

I could try to report this to the OSS driver maintainers. Unless someone here can help me provide more details, I doubt my report will be very helpful to them.
2) Message boards : Number crunching : Linux + Intel ARC = GPU HANG GetDecics_4.02 (Message 3533)
Posted 8 Jul 2023 by Paul
Post:
In case the error task goes away before someone gets to this, I'll copy it here:

<core_client_version>7.22.0</core_client_version>
<![CDATA[
<message>
aborted by user</message>
<stderr_txt>
GPU Summary String = [CAL|AMDRadeonRX6800XT|1|16368MB||200][INTEL|Intel(R)Arc(TM)A750Graphics|1|7721MB||300].
Loading GPU lookup table from file.
GPU was not found in the lookup table.  Using default values:
  numBlocks = 1024.
  threadsPerBlock = 32.
  polyBufferSize = 32768.
Successfully Built Program.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantInit.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB8.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantMpInit.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB7DegA9.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB7DegA8.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB6DegA9.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB6DegA8.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB6DegA7.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB5.
Successfully Created Stage 1 Kernel: pdtKernelSubResultantDegB4.

Successfully Created Stage 2 Kernel: pdtKernelDiv2.
Successfully Created Stage 2 Kernel: pdtKernelDiv5.
Successfully Created Stage 2 Kernel: pdtKernelDivP.

Successfully Created Stage 3 Kernel.

Successfully Created Polynomial Memory Buffer.
Successfully Created Output Flag Memory Buffer.
Successfully Created Discriminant Data Buffer.
Successfully Created PolyA Data Buffer.
Successfully Created PolyB Data Buffer.
Successfully Created DegA Data Buffer.
Successfully Created DegB Data Buffer.
Successfully Created G Data Buffer.
Successfully Created H Data Buffer.
Successfully Created mpA Data Buffer.
Successfully Created mpB Data Buffer.

OpenCL initialization was successful.
CHECKPOINT_FILE = wu_sf6_DS-11x11_Grp583191of1500000_checkpoint.
Checkpoint Flag = 0.
Reading file ../../projects/numberfields.asu.edu_NumberFields/sf6_DS-11x11_Grp583191of1500000.dat
    K = x^2 - 10
    S = [2, 5]
    Disc Bound = 10000000000000
    Skip = (P^4)*(Q^5)
    Num Congruences = 64
    SCALE = 1.000000
    |dK| = 40
    Signature = [2,0]
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf6_DS-11x11_Grp583191of1500000_0_r2076229659_0
Now starting the targeted Martinet search:
  Num Cvecs = 64.
    Doing Cvec 1.
    Doing Cvec 2.

</stderr_txt>
]]>


(Sorry for the typo in the original post, too.)
3) Message boards : Number crunching : Linux + Intel ARC = GPU HANG GetDecics_4.02 (Message 3532)
Posted 7 Jul 2023 by Paul
Post:
Just thought I should report this. Not sure what I can do. I think this app is in beta/testing. Please let me know if I can do something to help. I only found one other project that even supports ARC on Linux, but their app seems to work fine on the same system.

Jul 06 18:05:21 <hostname> kernel: i915 0000:07:00.0: [drm] GPU HANG: ecode 12:10:85defffa, in GetDecics_4.02_ [149986]
Jul 06 18:05:21 <hostname> kernel: i915 0000:07:00.0: [drm] GetDecics_4.02_[149986] context reset due to GPU hang
Jul 06 18:11:36 <hostname> kernel: i915 0000:07:00.0: [drm] GPU HANG: ecode 12:10:85defffa, in GetDecics_4.02_ [155909]
Jul 06 18:11:36 <hostname> kernel: i915 0000:07:00.0: [drm] GetDecics_4.02_[155909] context reset due to GPU hang
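
If it helps anyone put together a driver bug report, the hang lines above can be pulled out of the journal mechanically. A rough sketch, assuming the message format stays exactly as in the four lines above; the regex and the name `parse_gpu_hangs` are mine, not anything from i915 or BOINC:

```python
import re

# Matches kernel lines of the form:
#   i915 <pci address>: [drm] GPU HANG: ecode <code>, in <process> [<pid>]
HANG_RE = re.compile(
    r"i915 (?P<pci>[0-9a-f:.]+): \[drm\] GPU HANG: "
    r"ecode (?P<ecode>[0-9a-f:]+), in (?P<proc>\S+) \[(?P<pid>\d+)\]"
)


def parse_gpu_hangs(lines):
    """Return one dict per 'GPU HANG' line; all other lines are ignored."""
    hangs = []
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            hit = match.groupdict()
            hit["pid"] = int(hit["pid"])
            hangs.append(hit)
    return hangs


if __name__ == "__main__":
    sample = [
        "Jul 06 18:05:21 <hostname> kernel: i915 0000:07:00.0: [drm] GPU HANG: "
        "ecode 12:10:85defffa, in GetDecics_4.02_ [149986]",
        "Jul 06 18:05:21 <hostname> kernel: i915 0000:07:00.0: [drm] "
        "GetDecics_4.02_[149986] context reset due to GPU hang",
    ]
    for hang in parse_gpu_hangs(sample):
        print(hang)
```

Both hangs above carry the same ecode, which seems worth mentioning in a report.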







Copyright © 2024 Arizona State University