New GPU OpenCL versions available

Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 8 Jul 11
Posts: 1386
Credit: 676,046,392
RAC: 817,726
Message 3045 - Posted: 6 Feb 2021, 7:18:39 UTC - in response to Message 3044.  

When you get a chance, could you look at the stderr.txt for a WU that's stuck and paste the output here (before you unstick it). That should give me an idea where in the processing it is before it hangs.


Don't worry about getting the stderr output. I gathered the information from the WUs that failed. It is building the opencl code without error and doesn't hang until it gets to the GPU (which I suspected was the case). It appears to hang because it is waiting on the GPU to finish. I don't know if the GPU is actually stuck, or something is blocking the kernels from running.
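
For reference, here is a rough sketch (not the actual GetDecics code; every name is illustrative) of what that kind of wait looks like at the OpenCL API level, first as the blocking call that hangs forever if the GPU stalls, then as a polling loop that at least lets the host notice the stall:

#include <CL/cl.h>
#include <stdio.h>
#include <unistd.h>

/* Enqueue one kernel launch and wait for it without blocking forever. */
int wait_for_kernel(cl_command_queue queue, cl_kernel kernel, size_t global, size_t local)
{
    cl_event evt;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &evt);
    if (err != CL_SUCCESS) return err;
    clFlush(queue);

    /* A plain clFinish(queue) here is the hang scenario: if the kernels never
     * actually run on the GPU, the host thread just sits in that call. */

    /* Polling alternative with a crude timeout: */
    for (int seconds = 0; seconds < 600; seconds++) {
        cl_int status;
        err = clGetEventInfo(evt, CL_EVENT_COMMAND_EXECUTION_STATUS,
                             sizeof(status), &status, NULL);
        if (err != CL_SUCCESS || status < 0) { clReleaseEvent(evt); return err ? err : status; }
        if (status == CL_COMPLETE) { clReleaseEvent(evt); return CL_SUCCESS; }
        sleep(1);   /* still CL_QUEUED / CL_SUBMITTED / CL_RUNNING */
    }
    fprintf(stderr, "Kernel appears stuck on the GPU\n");
    clReleaseEvent(evt);
    return CL_INVALID_OPERATION;  /* fail the task instead of hanging indefinitely */
}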
ID: 3045
CosminZ

Send message
Joined: 20 Jun 12
Posts: 3
Credit: 51,772,719
RAC: 0
Message 3046 - Posted: 6 Feb 2021, 8:21:15 UTC

On Linux with an AMD Vega 64 it got stuck at 100% utilization and 80 W power usage (not doing anything) until I disabled the screen saver. After that it worked for 2 days without any problem.
For the Vega 64 the app is very inefficient: 7 minutes for a workunit at 160-170 W, and the maximum stable setting is numBlocks 1024 with threadsPerBlock 64. Running 2 workunits it hangs with the same 100% utilization and 80 W power usage. AMD driver with opencl=rocr.
A Linux laptop with an AMD RX 580 finishes in 8:00 to 8:10 minutes using 40 W with the 1024-64 settings, and in 9:30 minutes using 38-40 W with 1024-32. Many workunits error out with "SIGSEGV: segmentation violation" using the AMD driver with opencl=legacy,rocr.
Both graphics cards have a memory problem with the ROCm 4.0.1 driver: memory usage goes from 0 to 3200-3500 MB and back every 2-3 seconds, with desktop lag and run times measured in hours.
ID: 3046
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 8 Jul 11
Posts: 1386
Credit: 676,046,392
RAC: 817,726
Message 3047 - Posted: 6 Feb 2021, 17:34:31 UTC - in response to Message 3046.  

On Linux with an AMD Vega 64 it got stuck at 100% utilization and 80 W power usage (not doing anything) until I disabled the screen saver. After that it worked for 2 days without any problem.
For the Vega 64 the app is very inefficient: 7 minutes for a workunit at 160-170 W, and the maximum stable setting is numBlocks 1024 with threadsPerBlock 64. Running 2 workunits it hangs with the same 100% utilization and 80 W power usage. AMD driver with opencl=rocr.
A Linux laptop with an AMD RX 580 finishes in 8:00 to 8:10 minutes using 40 W with the 1024-64 settings, and in 9:30 minutes using 38-40 W with 1024-32. Many workunits error out with "SIGSEGV: segmentation violation" using the AMD driver with opencl=legacy,rocr.
Both graphics cards have a memory problem with the ROCm 4.0.1 driver: memory usage goes from 0 to 3200-3500 MB and back every 2-3 seconds, with desktop lag and run times measured in hours.


Thanks for the feedback! Without it, it's hard to know anything is wrong, since everything runs smoothly on my machines. The only time I've had issues is when I stream video while the app is running, and I've learned to suspend GPU processing in the client before doing such things. I thought that was typical behavior with GPU processing, but maybe I am wrong?

The inefficiency on the Vega 64 might be the driver, since all apps use the exact same OpenCL code. The biggest source of problems is the OpenCL compiler which comes bundled with the driver.

Regarding the SIGSEGV errors, I looked at the stderr and it appears these errors all occur before the GPU is acquired. In fact, they occur just before the polynomial buffers are allocated on the CPU. I don't see any memory allocation errors, but maybe the SIGSEGV is thrown before the return from the malloc call. Whatever the cause, this is not a GPU or OpenCL problem. It could be a resource problem on the CPU side, since the OpenCL apps require bigger data buffers on the CPU. If you continue to see these errors, try reducing numBlocks to 256 or 512 (keep threadsPerBlock at 64 for best performance).
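
To illustrate why numBlocks is the knob to turn, here is a rough sketch (the record layout and sizing are hypothetical, not the app's actual allocation code) of a host-side buffer whose size scales with numBlocks * threadsPerBlock, and of the NULL check a clean out-of-memory failure would hit:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record: 11 coefficients for a degree-10 polynomial. */
typedef struct { long long coeff[11]; } poly_t;

poly_t *alloc_poly_buffer(size_t numBlocks, size_t threadsPerBlock)
{
    size_t count = numBlocks * threadsPerBlock;    /* e.g. 1024 * 64 = 65536 entries */
    poly_t *buf = malloc(count * sizeof(poly_t));
    if (buf == NULL) {
        /* A clean out-of-memory failure shows up here as a NULL return that can be
         * logged; a SIGSEGV thrown before this check is ever reached points at
         * something else going wrong on the CPU side. */
        fprintf(stderr, "Host polynomial buffer allocation failed (%zu entries)\n", count);
    }
    return buf;
}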

Not sure what to make of the memory usage jumping between 0 and 3.2GB. Memory usage on the GPU should never get that high with the NumberFields app, unless numBlocks was set way too high.
ID: 3047
CosminZ

Send message
Joined: 20 Jun 12
Posts: 3
Credit: 51,772,719
RAC: 0
Message 3048 - Posted: 6 Feb 2021, 22:31:11 UTC

The latest driver version from the AMD site changed the OpenCL runtime from "pal" to "rocr" for cards built after Vega, and this might be the problem. The AMD RX 580 is still using the same opencl=legacy version used for cards built before the Vega 64.
RadeonOpenCompute (ROCm 4.0.1) is the open-source driver; it comes with all the new libraries and compilers, and this can create problems: one BOINC project errors out with this driver but works with the AMD driver. The app settings were standard and it looked like it was trying to fill the GPU memory, from 0 to 700 MB to 1200 MB all the way up to 3200-3500 MB and back to zero every 2-3 seconds.
ID: 3048
mg13 [HWU]

Send message
Joined: 24 May 19
Posts: 38
Credit: 1,616,925
RAC: 297
Message 3049 - Posted: 8 Feb 2021, 0:10:27 UTC - in response to Message 3044.  

Yes, I meant the Microsoft Edge browser, which has Settings > System > "Use hardware acceleration when available" enabled.
When I start it with that option active, it presumably asks the GPU for help, and that somehow unlocks it, creates the checkpoint, and lets the WU processing complete.
If I do not use this trick and leave the PC running the BOINC client all day without intervening, as I normally do, in the evening I will still find the WU unfinished, because the processing percentage stays stuck while the elapsed time keeps running inexorably.
Unless the system crashes first, in which case the task ends early and the WU errors out.
I hope I have cleared up your doubts and explained myself well.
Regards.


This almost sounds like a Windows 10 issue, as if it's blocking computation until hardware acceleration is turned on. I'm not a Windows user anymore, but I found online that hardware acceleration can be disabled in Windows 10. Is there any chance this is disabled and that's what's blocking it until enabled by Edge?

But then again, if the old app used to work with your current Windows configuration, then I'm not sure what's going on.

When you get a chance, could you look at the stderr.txt for a WU that's stuck and paste the output here (before you unstick it). That should give me an idea where in the processing it is before it hangs.


I tried disabling hardware acceleration in the Edge browser, and to be safe I also restarted the system. As you can imagine, when WU processing (and consequently the GPU) gets stuck, my trick no longer works; that is, starting the browser does not unlock my GPU.
I then tried resetting the Edge browser and restarted the system again to be safe, and the trick works once more when WU processing (and consequently the GPU) gets stuck, because apparently the reset re-enables GPU hardware acceleration by default.
There was a WU that locked up my system and then recovered; it didn't error out, on the contrary it completed processing, and if I remember correctly it is the one that took the longest among yesterday's tasks.
In case it is useful to you, I noticed from the GPU metrics that when the WU processing percentage freezes and GPU usage is pegged at 99%, video memory rises from about 800/900 MB to about 1900/2000 MB.
If you then pause WU processing or close the BOINC client, after about a minute the GPU unlocks and the video memory returns to its initial values of about 800/900 MB.
I also noticed, opening Task Manager, that the app GetDecics_4.02_windows_x86_64__opencl_amd does not appear until the WU processing percentage gets stuck, that is, about one minute after the start; it shows only 4-5% usage in the CPU column, while the GPU column shows 0%, despite the GPU metrics indicating 99% usage.
Out of curiosity: in the file "gpuLookupTable_v402.txt", for AMD GPUs, the "GPU Name" column says RX570, the "numBlocks" column says 2048 and the "threadsPerBlock" column says 64. Why does the app pick up these parameters for my GPU, which is an AMD Radeon RX 5700 XT 50th Anniversary and not an RX570?
ID: 3049
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 8 Jul 11
Posts: 1386
Credit: 676,046,392
RAC: 817,726
Message 3050 - Posted: 8 Feb 2021, 3:58:04 UTC - in response to Message 3049.  

Out of curiosity: in the file "gpuLookupTable_v402.txt", for AMD GPUs, the "GPU Name" column says RX570, the "numBlocks" column says 2048 and the "threadsPerBlock" column says 64. Why does the app pick up these parameters for my GPU, which is an AMD Radeon RX 5700 XT 50th Anniversary and not an RX570?


That would be a flaw in the lookup table code. It checks if the string in the lookup table is a substring of the BOINC provided GPU string. In this case "RX570" is a substring of "RX5700". This method should be made more robust so that there is a one-to-one mapping between entries in the lookup table and actual GPUs. The RX5700 should be able to handle the 2048x64 setting, so not a big deal.
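
For anyone curious, a small sketch of the pitfall and one possible tightening (illustrative code only, not the project's actual lookup routine; the table rows are made up):

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Made-up table rows; the real file has more columns and entries. */
struct gpu_entry { const char *name; int numBlocks; int threadsPerBlock; };
static const struct gpu_entry table[] = {
    { "RX570",  2048, 64 },
    { "RX5700", 2048, 64 },
};

/* Current behavior (roughly): any substring hit counts as a match. */
int naive_match(const char *gpu_string, const struct gpu_entry *e)
{
    return strstr(gpu_string, e->name) != NULL;
}

/* One possible tightening: the character right after the matched name must not
 * be another digit, so "RX570" no longer swallows "RX5700". */
int boundary_match(const char *gpu_string, const struct gpu_entry *e)
{
    const char *p = strstr(gpu_string, e->name);
    if (p == NULL) return 0;
    return !isdigit((unsigned char)p[strlen(e->name)]);
}

int main(void)
{
    const char *gpu = "AMDRadeonRX5700XT50thAnniversary";
    printf("naive    RX570  match: %d\n", naive_match(gpu, &table[0]));    /* 1: false positive */
    printf("boundary RX570  match: %d\n", boundary_match(gpu, &table[0])); /* 0 */
    printf("boundary RX5700 match: %d\n", boundary_match(gpu, &table[1])); /* 1 */
    return 0;
}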
ID: 3050
mg13 [HWU]

Send message
Joined: 24 May 19
Posts: 38
Credit: 1,616,925
RAC: 297
Message 3051 - Posted: 8 Feb 2021, 10:53:55 UTC - in response to Message 3050.  
Last modified: 8 Feb 2021, 10:54:48 UTC

I'm sorry for my ignorance, but where do I find the GPU string provided by BOINC?
However, I think the RX570 is not a substring of the RX5700, because they are two GPUs with different architectures; see the following links for more information:

AMD RADEON RX570 https://www.amd.com/en/products/graphics/radeon-rx-570
AMD RADEON RX5700 https://www.amd.com/en/products/graphics/amd-radeon-rx-5700-xt-50th-anniversary
ID: 3051
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 8 Jul 11
Posts: 1386
Credit: 676,046,392
RAC: 817,726
Message 3052 - Posted: 8 Feb 2021, 16:44:59 UTC - in response to Message 3051.  

I'm sorry for my ignorance, but where do I find the GPU string provided by BOINC?
However, I think the RX570 is not a substring of the RX5700, because they are two GPUs with different architectures; see the following links for more information:

AMD RADEON RX570 https://www.amd.com/en/products/graphics/radeon-rx-570
AMD RADEON RX5700 https://www.amd.com/en/products/graphics/amd-radeon-rx-5700-xt-50th-anniversary


This is a "string" from a software perspective, meaning just a series of raw characters. The characters "RX570" are part of "RX5700", which is what we mean when we say "RX570" is a substring of "RX5700". This has nothing to do with the actual graphics cards.

The string given by the BOINC api is printed at the top of the stderr. The stderr is given with each completed task. For example here is one of your tasks:
https://numberfields.asu.edu/NumberFields/result.php?resultid=107841147
And your GPU string is:
GPU Summary String = [CAL|AMDRadeonRX5700XT50thAnniversary|1|8176MB||200]
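
A tiny illustration of "substring" in this sense, using that exact summary string (sketch only, not project code):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *summary = "CAL|AMDRadeonRX5700XT50thAnniversary|1|8176MB||200";
    /* strstr() only looks for the raw characters "RX570" inside the larger
     * string; it knows nothing about GPU models or architectures. */
    printf("%s\n", strstr(summary, "RX570") ? "RX570 found" : "RX570 not found");
    return 0;
}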
ID: 3052
mg13 [HWU]

Send message
Joined: 24 May 19
Posts: 38
Credit: 1,616,925
RAC: 297
Message 3053 - Posted: 8 Feb 2021, 22:41:33 UTC - in response to Message 3052.  

I'm sorry for my ignorance, but where do I find the GPU string provided by BOINC?
However, I think the RX570 is not a substring of the RX5700, because they are two GPUs with different architectures; see the following links for more information:

AMD RADEON RX570 https://www.amd.com/en/products/graphics/radeon-rx-570
AMD RADEON RX5700 https://www.amd.com/en/products/graphics/amd-radeon-rx-5700-xt-50th-anniversary


This is a "string" from a software perspective, meaning just a series of raw characters. The characters "RX570" are part of "RX5700", which is what we mean when we say "RX570" is a substring of "RX5700". This has nothing to do with the actual graphics cards.

The string given by the BOINC api is printed at the top of the stderr. The stderr is given with each completed task. For example here is one of your tasks:
https://numberfields.asu.edu/NumberFields/result.php?resultid=107841147
And your GPU string is:
GPU Summary String = [CAL|AMDRadeonRX5700XT50thAnniversary|1|8176MB||200]


Thank you for your explanation.
I noticed that string in the app's WU stderr, but since you mentioned BOINC, I was thinking of the client and assumed the string was inside one of its files.
ID: 3053
mg13 [HWU]

Send message
Joined: 24 May 19
Posts: 38
Credit: 1,616,925
RAC: 297
Message 3054 - Posted: 12 Feb 2021, 12:50:21 UTC

Update.
AMD has released a new optional driver version, 21.2.2 (Win 10 64-bit, 20H2), and IT DOES NOT WORK.
OpenCL: AMD/ATI 0 GPU: AMD Radeon RX 5700 XT 50th Anniversary (driver version 3188.4 (PAL, LC), OpenCL 2.0 AMD-APP device version (3188.4), 8176MB, 8176MB available, peak GFLOPS 9370).
ID: 3054
mg13 [HWU]

Send message
Joined: 24 May 19
Posts: 38
Credit: 1,616,925
RAC: 297
Message 3055 - Posted: 23 Feb 2021, 12:27:26 UTC - in response to Message 3054.  

Update.
AMD has released a new optional driver version, 21.2.3 (Win 10 64-bit, 20H2), and IT DOES NOT WORK.
OpenCL: AMD/ATI 0 GPU: AMD Radeon RX 5700 XT 50th Anniversary (driver version 3188.4 (PAL, LC), OpenCL 2.0 AMD-APP device version (3188.4), 8176MB, 8176MB available, peak GFLOPS 9370).
ID: 3055
Dark Angel

Send message
Joined: 6 Jun 18
Posts: 5
Credit: 8,284,062
RAC: 21,652
Message 3847 - Posted: 29 May 2025, 23:14:43 UTC

So, will there be any updates to the Nvidia Linux apps? OpenCL perhaps? The CUDA drivers have also come a long way since cuda30, and some other projects are many versions further along.
ID: 3847
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 8 Jul 11
Posts: 1386
Credit: 676,046,392
RAC: 817,726
Message 3852 - Posted: 30 May 2025, 5:22:33 UTC - in response to Message 3847.  

So, will there be any updates to the Nvidia Linux apps? OpenCL perhaps? The CUDA drivers have also come a long way since cuda30, and some other projects are many versions further along.


I see no reason to update the OpenCL versions, since that's just source code that gets compiled at runtime. But updating the CUDA versions might be helpful, since those would be rebuilt with the latest CUDA improvements, which take advantage of the newer GPUs.
ID: 3852
Dark Angel

Send message
Joined: 6 Jun 18
Posts: 5
Credit: 8,284,062
RAC: 21,652
Message 3853 - Posted: 30 May 2025, 5:27:39 UTC - in response to Message 3852.  

So, will there be any updates to the Nvidia Linux apps? OpenCL perhaps? The CUDA drivers have also come a long way since cuda30, and some other projects are many versions further along.


I see no reason to update the OpenCL versions, since that's just source code that gets compiled at runtime. But updating the CUDA versions might be helpful, since those would be rebuilt with the latest CUDA improvements, which take advantage of the newer GPUs.


Just fyi, you don't have an OpenCL version for nvidia on Linux.
ID: 3853
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 8 Jul 11
Posts: 1386
Credit: 676,046,392
RAC: 817,726
Message 3854 - Posted: 30 May 2025, 5:44:12 UTC - in response to Message 3853.  

So, will there be any updates to the Nvidia Linux apps? OpenCL perhaps? The CUDA drivers have also come a long way since cuda30, and some other projects are many versions further along.


I see no reason to update the OpenCL versions, since that's just source code that gets compiled at runtime. But updating the CUDA versions might be helpful, since those would be rebuilt with the latest CUDA improvements, which take advantage of the newer GPUs.


Just fyi, you don't have an OpenCL version for nvidia on Linux.

True. So, per my statement above, updating the CUDA app would only benefit the Nvidia Linux app.
ID: 3854
DKlimax

Send message
Joined: 8 Jun 23
Posts: 15
Credit: 28,463,984
RAC: 118,451
Message 3860 - Posted: 30 May 2025, 20:31:43 UTC

Just a little warning: I was looking at the code and trying to get it to compile and run correctly, and so far I have failed (using current GCC and MSYS2/MinGW64). If you do get it to compile, could you please use GMP compiled with "--enable-fat" so that all the assembler optimizations are included? That would be great. (MinGW64 already ships such a version and it can be used as-is.) Even better would be to bump GMP and PARI to their current versions (for PARI this is required with current GCC, otherwise the compile fails).

Note: the author of libGMP is a bit crazy and the vast majority of the optimizations are in assembler, so a correct build is needed to get most of the performance out of it.

Note 2: Currently, in my build, GetDecics stops in nfinit0, which complains that the polynomial used is not irreducible. The cause of this spurious error is so far unknown. (GP as published accepts it without any problems.)
ID: 3860
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 8 Jul 11
Posts: 1386
Credit: 676,046,392
RAC: 817,726
Message 3861 - Posted: 30 May 2025, 20:55:23 UTC - in response to Message 3860.  

Just a little warning: I was looking at the code and trying to get it to compile and run correctly, and so far I have failed (using current GCC and MSYS2/MinGW64). If you do get it to compile, could you please use GMP compiled with "--enable-fat" so that all the assembler optimizations are included? That would be great. (MinGW64 already ships such a version and it can be used as-is.) Even better would be to bump GMP and PARI to their current versions (for PARI this is required with current GCC, otherwise the compile fails).

Note: the author of libGMP is a bit crazy and the vast majority of the optimizations are in assembler, so a correct build is needed to get most of the performance out of it.

Note 2: Currently, in my build, GetDecics stops in nfinit0, which complains that the polynomial used is not irreducible. The cause of this spurious error is so far unknown. (GP as published accepts it without any problems.)


You must be referring to the CPU app. The current executables are built with -O3 optimization and are about as efficient as possible for the GNU compiler. With that said, I was unaware of the --enable-fat option for GMP. It would be interesting to see how much faster the code runs on various CPUs with this option. Agreed that any new build should use the latest stable version of PARI (as well as BOINC and GMP).
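
For reference, the kind of micro-benchmark that would show the difference is sketched below (illustrative only; the operand sizes are arbitrary, and the only assumption is that the GMP being linked was configured with --enable-fat, its documented option for building all the CPU-specific assembler paths with runtime dispatch):

#include <gmp.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    mpz_t a, b, c;
    mpz_inits(a, b, c, NULL);
    mpz_ui_pow_ui(a, 3, 200000);          /* two large operands */
    mpz_ui_pow_ui(b, 5, 200000);

    clock_t t0 = clock();
    for (int i = 0; i < 1000; i++)
        mpz_mul(c, a, b);                 /* multiplication is where the assembler paths matter */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("GMP %s: 1000 big multiplies in %.3f s\n", gmp_version, secs);
    mpz_clears(a, b, c, NULL);
    return 0;
}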
ID: 3861
DKlimax

Send message
Joined: 8 Jun 23
Posts: 15
Credit: 28,463,984
RAC: 118,451
Message 3870 - Posted: 4 Jun 2025, 13:00:40 UTC - in response to Message 3861.  

Just a little warning: I was looking at the code and trying to get it to compile and run correctly, and so far I have failed (using current GCC and MSYS2/MinGW64). If you do get it to compile, could you please use GMP compiled with "--enable-fat" so that all the assembler optimizations are included? That would be great. (MinGW64 already ships such a version and it can be used as-is.) Even better would be to bump GMP and PARI to their current versions (for PARI this is required with current GCC, otherwise the compile fails).

Note: the author of libGMP is a bit crazy and the vast majority of the optimizations are in assembler, so a correct build is needed to get most of the performance out of it.

Note 2: Currently, in my build, GetDecics stops in nfinit0, which complains that the polynomial used is not irreducible. The cause of this spurious error is so far unknown. (GP as published accepts it without any problems.)


You must be referring to the CPU app. The current executables are built with -O3 optimization and are about as efficient as possible for the GNU compiler. With that said, I was unaware of the --enable-fat option for GMP. It would be interesting to see how much faster the code runs on various CPUs with this option. Agreed that any new build should use the latest stable version of PARI (as well as BOINC and GMP).

I meant both. For OpenCL/CUDA it is important that the serialized parts are as efficient as possible. (At a bare minimum it would allow one core to handle more GPUs, or at least give some extra cycles back to the CPU version.)
As for the optimization, that was five years ago. Since then there have been several major GCC releases (about five), so there is a good probability we can get some speedup just from a compiler upgrade.

Two notes. I am working on out-of-band caching for the OpenCL version using an intercept library, and looking at possible improvements; CUDA's main advantage would mostly be in compilation/caching. And if I could get GetDecics to compile, I'd like to make a CUDA version for Windows too.
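
For context, here is a sketch of the underlying OpenCL mechanism such a cache can lean on, saving the built program binary once and reloading it on later runs (single device, minimal error handling; the file handling and function names are mine, not GetDecics' or the intercept library's):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* After a successful clBuildProgram, dump the device binary to disk. */
int save_program_binary(cl_program prog, const char *path)
{
    size_t size = 0;
    if (clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL) != CL_SUCCESS)
        return -1;
    unsigned char *bin = malloc(size);
    unsigned char *bins[1] = { bin };
    if (bin == NULL || clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL) != CL_SUCCESS) {
        free(bin);
        return -1;
    }
    FILE *f = fopen(path, "wb");
    int ok = 0;
    if (f != NULL) { ok = (fwrite(bin, 1, size, f) == size); fclose(f); }
    free(bin);
    return ok ? 0 : -1;
}

/* On later runs, rebuild the program from the cached binary instead of the source. */
cl_program load_program_binary(cl_context ctx, cl_device_id dev, const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;
    fseek(f, 0, SEEK_END);
    size_t size = (size_t)ftell(f);
    rewind(f);
    unsigned char *bin = malloc(size);
    if (bin == NULL || fread(bin, 1, size, f) != size) { fclose(f); free(bin); return NULL; }
    fclose(f);

    const unsigned char *bins[1] = { bin };
    cl_int status, err;
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size, bins, &status, &err);
    free(bin);
    if (err != CL_SUCCESS || status != CL_SUCCESS) return NULL;
    /* A build call is still required, but it is typically far cheaper than
     * compiling the kernel source from scratch. */
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    return prog;
}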
ID: 3870
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 8 Jul 11
Posts: 1386
Credit: 676,046,392
RAC: 817,726
Message 3871 - Posted: 4 Jun 2025, 16:01:21 UTC - in response to Message 3870.  

I meant both. For OpenCL/CUDA it is important that the serialized parts are as efficient as possible. (At a bare minimum it would allow one core to handle more GPUs, or at least give some extra cycles back to the CPU version.)
As for the optimization, that was five years ago. Since then there have been several major GCC releases (about five), so there is a good probability we can get some speedup just from a compiler upgrade.

Two notes. I am working on out-of-band caching for the OpenCL version using an intercept library, and looking at possible improvements; CUDA's main advantage would mostly be in compilation/caching. And if I could get GetDecics to compile, I'd like to make a CUDA version for Windows too.

I'd be curious to see how you get a CUDA Windows app. I cross-compile using MinGW, and at the time it was not possible to cross-compile with CUDA, hence I went the route of OpenCL, where the compiling is offloaded to the user's GPU at runtime.
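
For anyone unfamiliar with that flow, a rough sketch of the runtime-compilation path (illustrative host code, not the actual GetDecics source):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

cl_program build_from_source(cl_context ctx, cl_device_id dev, const char *kernel_src)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    if (err != CL_SUCCESS) return NULL;

    /* The driver's OpenCL compiler does the device-specific work here on the
     * user's machine, which is why no offline CUDA-style cross-compile is
     * needed on the build host. */
    err = clBuildProgram(prog, 1, &dev, "-cl-std=CL1.2", NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t log_size = 0;
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = malloc(log_size + 1);
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        log[log_size] = '\0';
        fprintf(stderr, "OpenCL build failed:\n%s\n", log);  /* the sort of output that lands in the task's stderr */
        free(log);
        clReleaseProgram(prog);
        return NULL;
    }
    return prog;
}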
ID: 3871
DKlimax

Send message
Joined: 8 Jun 23
Posts: 15
Credit: 28,463,984
RAC: 118,451
Message 3878 - Posted: 7 Jun 2025, 17:13:05 UTC - in response to Message 3871.  

I meant both. For OpenCL/CUDA it is important that the serialized parts are as efficient as possible. (At a bare minimum it would allow one core to handle more GPUs, or at least give some extra cycles back to the CPU version.)
As for the optimization, that was five years ago. Since then there have been several major GCC releases (about five), so there is a good probability we can get some speedup just from a compiler upgrade.

Two notes. I am working on out-of-band caching for the OpenCL version using an intercept library, and looking at possible improvements; CUDA's main advantage would mostly be in compilation/caching. And if I could get GetDecics to compile, I'd like to make a CUDA version for Windows too.

I'd be curious to see how you get a CUDA Windows app. I cross-compile using MinGW, and at the time it was not possible to cross-compile with CUDA, hence I went the route of OpenCL, where the compiling is offloaded to the user's GPU at runtime.

It looks like the option "--compiler-bindir <path>" (-ccbin) might be the ticket. From the nvcc help:
Specify the directory in which the host compiler executable resides. The host compiler executable name can be also specified to ensure that the correct host compiler is selected. In addition, driver prefix options ('--input-drive-prefix', '--dependency-drive-prefix', or '--drive-prefix') may need to be specified, if nvcc is executed in a Cygwin shell or a MinGW shell on Windows.


If this doesn't work for some reason, there are several other options (these include: nvcc producing a DLL, or LLVM compiling the CUDA code and linking it either statically or as a DLL).

BTW: which version of the source code is the canonical (deployed) one? There's "GetDecicsSrc" and then "Mingw64", and they differ.
ID: 3878


Copyright © 2025 Arizona State University