New GPU OpenCL versions available


Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,946,198
RAC: 288,840
Message 3045 - Posted: 6 Feb 2021, 7:18:39 UTC - in response to Message 3044.  

When you get a chance, could you look at the stderr.txt for a WU that's stuck and paste the output here (before you unstick it). That should give me an idea where in the processing it is before it hangs.


Don't worry about getting the stderr output. I gathered the information from the WUs that failed. The app builds the OpenCL code without error and doesn't hang until it gets to the GPU (which I suspected was the case). It appears to hang because it is waiting on the GPU to finish. I don't know whether the GPU is actually stuck or something is blocking the kernels from running.
ID: 3045 · Rating: 0
CosminZ

Joined: 20 Jun 12
Posts: 3
Credit: 51,772,719
RAC: 0
Message 3046 - Posted: 6 Feb 2021, 8:21:15 UTC

On Linux with an AMD Vega 64, it got stuck at 100% utilization and 80 W power usage (not doing anything) until I disabled the screen saver. After that it worked for 2 days without any problem.
For the Vega 64 the app is very inefficient: 7 minutes per workunit at 160-170 W, and the maximum stable setting is numBlocks 1024 with threadsPerBlock 64. When running 2 workunits at once, it hangs with the same 100% utilization and 80 W power usage. AMD driver, opencl=rocr.
A Linux laptop with an AMD RX 580 finishes in 8 to 8:10 minutes using 40 W with the 1024-64 settings, and in 9:30 minutes using 38-40 W with 1024-32. Many workunits error out with "SIGSEGV: segmentation violation" using the AMD driver with opencl=legacy,rocr.
With the ROCm 4.0.1 driver, both graphics cards have a memory problem: memory usage goes from 0 to 3200-3500 MB and back every 2-3 seconds, with desktop lag and running times measured in hours.
ID: 3046 · Rating: 0
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,946,198
RAC: 288,840
Message 3047 - Posted: 6 Feb 2021, 17:34:31 UTC - in response to Message 3046.  

Thanks for the feedback! Without it it's hard to know anything is wrong since everything runs smoothly on my machines. The only time I've had issues is when I stream video while the app is running, and I've learned to suspend GPU processing in the client before doing such things. I thought that was typical behavior with GPU processing, but maybe I am wrong?

The inefficiency on the Vega 64 might be due to the driver, since all apps use the exact same OpenCL code. The biggest source of problems is the OpenCL compiler, which comes bundled with the driver.

Regarding the SIGSEGV errors, I looked at the stderr and it appears these errors all occur before the GPU is acquired. In fact, they occur just before the polynomial buffers are allocated on the CPU. I don't see any memory allocation errors, but maybe the SIGSEGV is thrown before the malloc call returns. Whatever the cause, this is not a GPU or OpenCL problem. It could be a resource problem on the CPU side, since the OpenCL apps require bigger data buffers on the CPU. If you continue to see these errors, try reducing numBlocks to 256 or 512 (keep threadsPerBlock at 64 for best performance).

Not sure what to make of the memory usage jumping between 0 and 3.2GB. Memory usage on the GPU should never get that high with the NumberFields app, unless numBlocks was set way too high.
ID: 3047 · Rating: 0
CosminZ

Joined: 20 Jun 12
Posts: 3
Credit: 51,772,719
RAC: 0
Message 3048 - Posted: 6 Feb 2021, 22:31:11 UTC

The latest driver version from the AMD site changed OpenCL from "pal" to "rocr" for cards built after Vega, and this might be the problem. The AMD RX 580 uses opencl=legacy, the same version used for cards built before the Vega 64.
Radeon Open Compute (ROCm 4.0.1) is the open source driver. It comes with all the new libraries and compilers, and this can create problems: one BOINC project produces errors with this driver but works with the AMD driver. The app settings were standard, and it looked like it tried to fill the GPU memory, going from 0 to 700 MB to 1200 MB, all the way to 3200-3500 MB, and back to zero every 2-3 seconds.
ID: 3048 · Rating: 0
mg13 [HWU]
Joined: 24 May 19
Posts: 38
Credit: 1,165,927
RAC: 1,094
Message 3049 - Posted: 8 Feb 2021, 0:10:27 UTC - in response to Message 3044.  

Yes, I meant the Microsoft Edge browser, which has the option Settings > System > "Use hardware acceleration when available" enabled.
When I start it with that option active, it presumably requests help from the GPU, which somehow unlocks it; the app then creates the checkpoint and completes the WU.
If I do not use this trick and leave the PC running the BOINC client all day without intervening, as I normally do, in the evening I still find the WU unfinished: the progress percentage is stuck while the elapsed time keeps increasing.
Unless the system crashes first, in which case the WU ends early with an error.
I hope I have cleared up your doubts and explained myself well.
Hello.


This almost sounds like a Windows 10 issue, as if it's blocking computation until hardware acceleration is turned on. I'm not a Windows user anymore, but I found online that hardware acceleration can be disabled in Windows 10. Is there any chance this is disabled and that's what's blocking it until enabled by Edge?

But then again, if the old app used to work with your current Windows configuration, then I'm not sure what's going on.

When you get a chance, could you look at the stderr.txt for a WU that's stuck and paste the output here (before you unstick it). That should give me an idea where in the processing it is before it hangs.


I tried disabling hardware acceleration in the Edge browser and, to be safe, also restarted the system. As you would expect, when WU processing (and consequently the GPU) gets stuck, my trick no longer works: starting the browser does not unlock the GPU.
I also tried resetting the Edge browser and, to be safe, restarted the system. After that the trick works again when WU processing and the GPU get stuck, apparently because the default settings enable GPU hardware acceleration.
There was one WU that froze my system, which then recovered. The WU did not fail; on the contrary, it completed processing, and if I remember correctly it is the one that took the longest to process among yesterday's.
If it is useful to you, I noticed from the GPU metrics that when the WU progress percentage freezes, GPU usage jumps to 99% and video memory rises from about 800-900 MB to about 1900-2000 MB.
If you then pause WU processing or close the BOINC client, after about a minute the GPU unlocks and the video memory returns to its initial value of about 800-900 MB.
I also noticed in Task Manager that the app GetDecics_4.02_windows_x86_64__opencl_amd does not appear until the WU progress percentage freezes, that is, about one minute after the start. It then shows only 4-5% usage in the CPU column and 0% in the GPU column, even though the GPU metrics indicate 99% usage.
One curiosity: in the file "gpuLookupTable_v402.txt", for AMD GPUs, the "GPU Name" column says RX570, the "numBlocks" column says 2048, and the "threadsPerBlock" column says 64. Why does the app use these parameters for my GPU, which is an AMD Radeon RX 5700 XT 50th Anniversary and not an RX570?
ID: 3049 · Rating: 0
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,946,198
RAC: 288,840
Message 3050 - Posted: 8 Feb 2021, 3:58:04 UTC - in response to Message 3049.  

One curiosity: in the file "gpuLookupTable_v402.txt", for AMD GPUs, the "GPU Name" column says RX570, the "numBlocks" column says 2048, and the "threadsPerBlock" column says 64. Why does the app use these parameters for my GPU, which is an AMD Radeon RX 5700 XT 50th Anniversary and not an RX570?


That would be a flaw in the lookup table code. It checks whether the string in the lookup table is a substring of the BOINC-provided GPU string. In this case "RX570" is a substring of "RX5700". This method should be made more robust so that there is a one-to-one mapping between entries in the lookup table and actual GPUs. The RX5700 should be able to handle the 2048x64 setting, so it's not a big deal.
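To illustrate the substring problem and one possible way to tighten the match, here is a minimal Python sketch. It is not the app's actual lookup code (which is not shown in this thread). The table layout, the function names, the "RX5700" row, and the rule "a model name must not be followed by another digit" are all assumptions made for the example.

```python
import re

# Hypothetical lookup table: (GPU name, numBlocks, threadsPerBlock).
# Only the RX570 row is mentioned in this thread; the RX5700 row is
# invented here to show how a more specific entry could be selected.
LOOKUP_TABLE = [
    ("RX570", 2048, 64),
    ("RX5700", 2048, 64),  # hypothetical entry
]

# GPU model name as it appears in the stderr of the task discussed above.
GPU_STRING = "AMDRadeonRX5700XT50thAnniversary"


def naive_match(table_name, gpu_string):
    """Plain substring test, i.e. the behavior described in the post."""
    return table_name in gpu_string


def strict_match(table_name, gpu_string):
    """Substring test that rejects a match followed by another digit.

    Assumed rule: "RX570" must not match inside "RX5700", because the
    trailing "0" means it is a different model number.
    """
    pattern = re.escape(table_name) + r"(?!\d)"
    return re.search(pattern, gpu_string) is not None


def find_settings(gpu_string, table=LOOKUP_TABLE):
    """Return the first table row whose name strictly matches, or None."""
    for row in table:
        if strict_match(row[0], gpu_string):
            return row
    return None
```

With the plain substring test, "RX570" matches the RX 5700 XT string; with the digit check it does not, and the lookup falls through to the (hypothetical) "RX5700" row instead.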
ID: 3050 · Rating: 0
mg13 [HWU]
Joined: 24 May 19
Posts: 38
Credit: 1,165,927
RAC: 1,094
Message 3051 - Posted: 8 Feb 2021, 10:53:55 UTC - in response to Message 3050.  
Last modified: 8 Feb 2021, 10:54:48 UTC

I'm sorry for my ignorance, but where do I find the GPU string provided by BOINC?
However, I think the RX570 is not a substring of the RX5700, because they are two GPUs with different architectures. See the following links for more information:

AMD RADEON RX570 https://www.amd.com/en/products/graphics/radeon-rx-570
AMD RADEON RX5700 https://www.amd.com/en/products/graphics/amd-radeon-rx-5700-xt-50th-anniversary
ID: 3051 · Rating: 0
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1318
Credit: 403,946,198
RAC: 288,840
Message 3052 - Posted: 8 Feb 2021, 16:44:59 UTC - in response to Message 3051.  

This is a "string" from a software perspective, meaning just a series of raw characters. The characters "RX570" are part of "RX5700", which is what we mean when we say "RX570" is a substring of "RX5700". This has nothing to do with the actual graphics cards.

The string given by the BOINC API is printed at the top of the stderr, and the stderr is provided with each completed task. For example, here is one of your tasks:
https://numberfields.asu.edu/NumberFields/result.php?resultid=107841147
And your GPU string is:
GPU Summary String = [CAL|AMDRadeonRX5700XT50thAnniversary|1|8176MB||200]
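For anyone who wants to pull the model name out of that stderr line, a rough Python sketch follows. It only splits the bracketed part on "|". The meaning of each field is not documented in this thread, so treating the second field as the model name is an assumption based on the example above, and the function name is made up for illustration.

```python
def split_gpu_summary(line):
    """Split a 'GPU Summary String = [...]' line into its raw fields.

    Returns the text between the first '[' and the last ']' split on '|'.
    Empty fields are kept as empty strings.
    """
    start = line.index("[")
    end = line.rindex("]")
    return line[start + 1:end].split("|")


# The line quoted above, copied from the task's stderr.
EXAMPLE_LINE = (
    "GPU Summary String = "
    "[CAL|AMDRadeonRX5700XT50thAnniversary|1|8176MB||200]"
)

fields = split_gpu_summary(EXAMPLE_LINE)

# Assumption: the second field holds the GPU model name, as it appears
# to in the example above.
model_name = fields[1]
```

The model name obtained this way is the string that the lookup table entries are compared against.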
ID: 3052 · Rating: 0
mg13 [HWU]
Joined: 24 May 19
Posts: 38
Credit: 1,165,927
RAC: 1,094
Message 3053 - Posted: 8 Feb 2021, 22:41:33 UTC - in response to Message 3052.  

Thank you for your explanation.
I had noticed that string in the WU's stderr, but because you mentioned BOINC, I thought you meant the client and that the string was in one of its files.
ID: 3053 · Rating: 0
mg13 [HWU]
Joined: 24 May 19
Posts: 38
Credit: 1,165,927
RAC: 1,094
Message 3054 - Posted: 12 Feb 2021, 12:50:21 UTC

Update.
AMD has released a new optional driver version, 21.2.2 (Windows 10 64-bit, 20H2), and IT DOES NOT WORK.
OpenCL: AMD/ATI 0 GPU: AMD Radeon RX 5700 XT 50th Anniversary (driver version 3188.4 (PAL, LC), OpenCL 2.0 AMD-APP device version (3188.4), 8176MB, 8176MB available, peak GFLOPS 9370).
ID: 3054 · Rating: 0
mg13 [HWU]
Joined: 24 May 19
Posts: 38
Credit: 1,165,927
RAC: 1,094
Message 3055 - Posted: 23 Feb 2021, 12:27:26 UTC - in response to Message 3054.  

Update.
AMD has released a new optional driver version, 21.2.3 (Windows 10 64-bit, 20H2), and IT DOES NOT WORK.
OpenCL: AMD/ATI 0 GPU: AMD Radeon RX 5700 XT 50th Anniversary (driver version 3188.4 (PAL, LC), OpenCL 2.0 AMD-APP device version (3188.4), 8176MB, 8176MB available, peak GFLOPS 9370).
ID: 3055 · Rating: 0


Copyright © 2024 Arizona State University