CUDA work units?
Joined: 5 Oct 19 · Posts: 11 · Credit: 2,176,974 · RAC: 0
I am getting only open_cl work units for my GTX 1650 Super under Win10. Are there any CUDA work units being sent out?
Joined: 8 Jul 11 · Posts: 1355 · Credit: 574,895,692 · RAC: 780,502
> I am getting only open_cl work units for my GTX 1650 Super under Win10.

There's only a CUDA app for Linux. The OpenCL version is not much slower than CUDA.
Joined: 5 Oct 19 · Posts: 11 · Credit: 2,176,974 · RAC: 0
Thanks. By the way, I ran a test of efficiency (energy per work unit) between an RX 570 under Ubuntu 20.04.3 and the GTX 1650 Super under Win10. Somewhat to my surprise, they were about the same. I can try a GTX 1650 Super on Linux later; that should be the best, I would think.
Joined: 8 Jul 11 · Posts: 1355 · Credit: 574,895,692 · RAC: 780,502
Yes, it would be interesting to see the difference. Thanks!
Joined: 5 Oct 19 · Posts: 11 · Credit: 2,176,974 · RAC: 0
In more detail, for the GTX 1650 Super under Win10, I measured the average board power with GPU-Z at 67.7 watts, and the average time over 26 samples was 18.6 minutes, so the energy per work unit is 1262 watt-seconds. That is probably as good an accuracy as I can get. With the RX 570 under Ubuntu 20.04.3, I saw a power of 89 watts, though that is averaged by eye using a Linux utility, and the time was measured over only six samples at 13.7 minutes, so the energy per work unit is 1220 watt-seconds. That number is not as accurate as the other one, but good enough for my purposes. So that is close.
Joined: 8 Jul 11 · Posts: 1355 · Credit: 574,895,692 · RAC: 780,502
> In more detail, for the GTX 1650 Super under Win10, I measured the average board power with GPU-Z, and it was 67.7 watts.

I think you mean units of "watt-minutes", correct?
Joined: 5 Oct 19 · Posts: 11 · Credit: 2,176,974 · RAC: 0
> I think you mean units of "watt-minutes", correct?

Yes! I used minutes for all my data, but listed seconds. Thanks.
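With the corrected units, the arithmetic is just average board power times average run time in minutes. A quick sketch checking the two measurements quoted above (small differences against the posted 1262 and 1220 figures are expected, since the posters likely used unrounded sample data):

```python
def energy_watt_minutes(power_watts: float, minutes: float) -> float:
    # Energy per work unit = average power (W) x average run time (min),
    # giving watt-minutes (the earlier posts said "watt-seconds" by mistake).
    return power_watts * minutes

gtx_1650_super = energy_watt_minutes(67.7, 18.6)  # GTX 1650 Super, Win10
rx_570 = energy_watt_minutes(89.0, 13.7)          # RX 570, Ubuntu 20.04.3

print(round(gtx_1650_super))  # ~1259, vs the quoted 1262
print(round(rx_570))          # ~1219, vs the quoted 1220
```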
Joined: 5 Oct 19 · Posts: 11 · Credit: 2,176,974 · RAC: 0
> I can try a GTX 1650 Super on Linux later. That should be the best I would think.

Well, I was able to do a GTX 1060 first, which should be almost identical to the 1650 Super. It is on a Ryzen 2700 machine running Ubuntu 20.04.3, supported by one free virtual core (the others are on Universe).

4.02 Get Decic Fields (cuda 30), GTX 1060 on Ryzen 2700: 9.7 minutes at 67 watts (14 samples), 35% CPU usage => energy is 652 watt-minutes.

That is quite nice, about half the energy of the others, and I like the low CPU usage that CUDA provides. That makes it a good fit on that machine for me.

EDIT: BoincTasks initially estimated a run time of about 20 minutes, so I will let the rest finish (another 24 in the buffer) and see if the actual run-time average changes. The one currently in process is taking around 19 minutes. I will just let it run for a few days and let BoincTasks figure out the average.
Joined: 5 Oct 19 · Posts: 11 · Credit: 2,176,974 · RAC: 0
I have now been running these for 3 1/2 days, using "<rec_half_life_days>1.000000</rec_half_life_days>" in cc_config.xml to speed up the convergence of the time estimates. They run on average for 11 minutes, 4 seconds (664 seconds), so the energy is 664 seconds × 67 watts / 60 = 741 watt-minutes per CUDA work unit. That is a pretty accurate value by now, and a nice speedup on this card under Linux.
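For anyone wanting to try the same trick, that option belongs in the <options> section of cc_config.xml in the BOINC data directory. A minimal sketch (the client needs to re-read the config, or be restarted, for it to take effect):

```xml
<cc_config>
  <options>
    <!-- shorten the half-life of the recent-estimated-credit average
         so run-time estimates converge faster (default is 10 days) -->
    <rec_half_life_days>1.000000</rec_half_life_days>
  </options>
</cc_config>
```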
Joined: 5 Oct 19 · Posts: 11 · Credit: 2,176,974 · RAC: 0
I have been running an RX 570 for almost a day and have reported about 150 valids on a Ryzen 3600 machine (Ubuntu 20.04.3). https://numberfields.asu.edu/NumberFields/results.php?hostid=2798526&offset=0&show_names=0&state=4&appid=

So I have a good average time of 8 minutes 15 seconds. At about 87 watts of power, this gives an energy of 718 watt-minutes, slightly better even than the Nvidia card, not that it matters with numbers this close. But the RX 570 is helped a little by being supported by four free cores of the Ryzen 3600, whereas previously it was only one free core (the others are on BOINC, being WCG/ARP at the moment).

As I recall, the Nvidia GTX 1060 is about one generation later in technology than the RX 570, so this is good performance. I think this is a keeper.
Joined: 8 Jul 11 · Posts: 1355 · Credit: 574,895,692 · RAC: 780,502
Thanks for the update, Jim.
Joined: 4 Jan 25 · Posts: 7 · Credit: 12,794,224 · RAC: 452,590
> I am getting only open_cl work units for my GTX 1650 Super under Win10.

I noticed that with my RTX 4060Ti Super running under Windows, the run times were slightly longer and the APR slightly lower than yours running under Linux; then I noticed the OpenCL vs CUDA initialisation line in the Stderr output files.

Looking in my Stderr output for a valid task, I noticed that it appears to be compiling the OpenCL kernel on every run. One thing that would help performance on all systems using OpenCL would be if the application made use of the OpenCL compiler cache. On the initial run, the kernel is compiled (and on that initial compilation it can be done for all possible valid settings values). On subsequent runs, the appropriate cached pre-compiled kernel is used, unless there is a hardware change, in which case the cache is cleared and the kernels re-compiled, until the next hardware change.

The other thing I noticed was that on a system with two very different video cards, all work is reported as being done by the most powerful card in the system (as reported by BOINC). E.g. for a while for me it was a GTX 1070 and an RTX 2060 Super; even the work processed by the GTX 1070 was reported in Stderr output as being processed by the RTX 2060 Super. For the caching to work, I suspect this might need to be resolved: when the GPU application starts up, it needs to query the hardware it is actually starting on, not the BOINC-reported hardware, in order to use the right OpenCL kernel (although, thinking about it, the present system seems to work OK using default values and the lookup table, so just caching and re-using kernels compiled from those inputs should still work).

EDIT: although I notice in the CUDA Stderr output the line "Setting GPU device number 0.", which doesn't appear in the OpenCL Stderr output.

Grant
Darwin NT, Australia.
Joined: 8 Jul 11 · Posts: 1355 · Credit: 574,895,692 · RAC: 780,502
> I noticed that with my RTX 4060Ti Super running under Windows, the run times were slightly longer & the APR slightly lower than yours running under LINUX, then i noticed the OpenCL v CUDA initialisation line in the Stderr output files.

That is a great point regarding the compiler cache, and I've often wondered about that: the first 20 seconds of each job on my AMD card is spent compiling the OpenCL code. This is something I will look into later when I get some free time. When I run the code offline, it always uses the cached version, so I had assumed it was maybe something that had to be changed in the BOINC manager.
Joined: 8 Jul 11 · Posts: 1355 · Credit: 574,895,692 · RAC: 780,502
That link regarding the OpenCL compiler cache appears to be specific to the Intel implementation. I don't see anything similar in the OpenCL standard where I can tell it to save a cached copy of the compiled code. It looks like the app can get access to the compiled code, and then I could manually cache it. Not sure if that's the optimal solution, but either way it will require some modifications to the application code.

Given my limited time right now, I'm not sure I have the bandwidth to do that along with the subsequent testing and porting to all the OpenCL platforms (Windows/Linux and AMD/Nvidia/Intel). But I will put it near the top of the to-do list.
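The manual-caching idea above can be sketched independently of any vendor API: key the cache on a hash of the kernel source plus the device name, compile only on a miss, and keep the binary on disk. This is just the caching logic, not real OpenCL calls; `compile_fn` stands in for the actual compile step (e.g. building a program and fetching its binary), and the file layout is an assumption:

```python
import hashlib
from pathlib import Path

def cached_kernel(source: str, device_name: str, cache_dir: Path, compile_fn):
    """Return a compiled kernel binary, compiling only on a cache miss.

    compile_fn is a placeholder for the real vendor compile step;
    it takes the kernel source and must return bytes.
    """
    # Key on both the source and the device, so each card in a
    # mixed-GPU host gets its own cached binary.
    key = hashlib.sha256((device_name + "\0" + source).encode()).hexdigest()
    path = cache_dir / f"kernel_{key}.bin"
    if path.exists():                 # cache hit: skip the compile entirely
        return path.read_bytes()
    binary = compile_fn(source)       # cache miss: compile once and store
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(binary)
    return binary
```

With this pattern only the first task after a hardware or driver change pays the compile cost, and keying on the device name also sidesteps the mixed-GPU reporting issue mentioned earlier.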
Joined: 4 Jan 25 · Posts: 7 · Credit: 12,794,224 · RAC: 452,590
> That link regarding the OpenCL Compiler Cache appears to be specifically for the intel implementation. [...] But I will put it near the top of the to-do list.

Yeah, unfortunately it looks like it might be on a per-manufacturer OpenCL implementation basis. OpenCL is a standard, but each GPU manufacturer implements it in their own way. Nvidia don't even mention it in their OpenCL Best Practices Guide; it's probably the "Compilation caching options" section in their CUDA C++ Programming Guide (if any of that makes any sense).

I know it can be done, because back in the days of Seti there was an optimised application developed using OpenCL that ran on AMD and Nvidia hardware (and Intel, although it wasn't really worth it back then given the state of the iGPUs). The application was built to determine the number of compute units available, and it was set up to build kernels for all possible needed values; the kernels were then kept in the Seti project directory so they were ready to be used as needed and didn't have to be re-built every time a task started.
E.g.:

MB_clFFTplan_GeForceGTX1070_256_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_41881
MB_clFFTplan_GeForceGTX1070_256_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_43039
MB_clFFTplan_GeForceGTX1070_256_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_43064
MB_clFFTplan_GeForceGTX1070_256_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_43160
MB_clFFTplan_GeForceGTX1070_256_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_44259
MB_clFFTplan_GeForceGTX1070_256_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_44274
MB_clFFTplan_GeForceGTX1070_512_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_41881
MB_clFFTplan_GeForceGTX1070_512_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_43039
MB_clFFTplan_GeForceGTX1070_512_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_43064
MB_clFFTplan_GeForceGTX1070_512_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_43160
MB_clFFTplan_GeForceGTX1070_512_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_44259
MB_clFFTplan_GeForceGTX1070_512_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_44274
MB_clFFTplan_GeForceRTX2060_256_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_41881
MB_clFFTplan_GeForceRTX2060_256_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_43039
MB_clFFTplan_GeForceRTX2060_256_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_43064
MB_clFFTplan_GeForceRTX2060_256_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_43160
MB_clFFTplan_GeForceRTX2060_256_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_44259
MB_clFFTplan_GeForceRTX2060_256_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_44274
MB_clFFTplan_GeForceRTX2060_512_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_41881
MB_clFFTplan_GeForceRTX2060_512_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_43039
MB_clFFTplan_GeForceRTX2060_512_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_43064
MB_clFFTplan_GeForceRTX2060_512_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_43160
MB_clFFTplan_GeForceRTX2060_512_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_44259
MB_clFFTplan_GeForceRTX2060_512_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_44274
MB_clFFTplan_GeForceRTX2060_1024_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_41881
MB_clFFTplan_GeForceRTX2060_2048_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_41881
MB_clFFTplan_GeForceRTX2060_4096_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_41881
MB_clFFTplan_GeForceRTX2060_8192_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3557.bin_41881
etc, etc

> Given my limited time right now, I'm not sure I have the bandwidth to do that along with the subsequent testing and porting to all the openCL platforms (windows/linux and amd/nvidia/intel).

I understand all too well about having things to do but only so many hours in the day. It would be nice if the OpenCL standard were actually standard between all the video card manufacturers...

Grant
Darwin NT, Australia.