GPU status update

Author	Message
hsdecalc Joined: 22 Feb 19 Posts: 3 Credit: 5,733,515 RAC: 2,377	Message 2466 - Posted: 18 Jun 2019, 9:19:36 UTC
OK or wrong: on my PC the Nvidia GPU tasks additionally need a full CPU core (Get Decic Fields v3.03 (opencl_nvidia) windows_x86_64). The GTX 1080 Ti card has a power usage of only 25% and the GPU utilization is 50%. So I run two tasks. The GPU is underutilized and the CPU load is too high.
https://numberfields.asu.edu/NumberFields/result.php?resultid=50063132

Eric Driver Project administrator Project developer Project tester Project scientist Joined: 8 Jul 11 Posts: 1391 Credit: 700,007,392 RAC: 822,681	Message 2467 - Posted: 18 Jun 2019, 17:10:17 UTC - in response to Message 2466.
GPU utilization will be fixed with optimization. It depends on the individual card. My utilization is 97%, but that's because I optimized for my card, and those settings are now the default. When I find the time I will add an app_info.xml interface so that volunteers can easily tweak settings and report back the optimized values for their card.
I'm not sure why the CPU load is high. I've noticed that on the CUDA version too. As a test, I removed the call to the GPU and the app finished within seconds. That means the vast majority of the time should be spent on the GPU. I'm guessing the CPU is "waiting" on the GPU to finish. The CPU buffers polynomials and then throws them at the GPU. I currently do this in blocks. I could increase the buffer to include every polynomial so there is only 1 call to the GPU. Then the CPU would not have to constantly wait on the GPU, but this would require close to 100 GB of memory (worst case), which is not practical. But maybe there is a better way; I just haven't found it yet.
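For readers following along, the block-wise hand-off described in the post above can be sketched roughly as follows. This is only an illustration, not the project's code: `fake_gpu` is a hypothetical stand-in for the real kernel launch, and the integers stand in for polynomials. It shows how the block size trades buffer memory against the number of GPU calls the CPU has to wait on.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def blocks(items: Iterable[int], block_size: int) -> Iterator[List[int]]:
    """Yield consecutive blocks holding at most block_size items."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    block: List[int] = []
    for item in items:
        block.append(item)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block  # final, possibly partial, block


def run_in_blocks(
    items: Iterable[int],
    block_size: int,
    process_on_gpu: Callable[[List[int]], List[int]],
) -> Tuple[List[int], int]:
    """Process items one block at a time; return (results, number of GPU calls)."""
    results: List[int] = []
    calls = 0
    for block in blocks(items, block_size):
        # In a real app the CPU would block here until the GPU finishes.
        results.extend(process_on_gpu(block))
        calls += 1
    return results, calls


def fake_gpu(block: List[int]) -> List[int]:
    """Hypothetical placeholder for the GPU work on one block."""
    return [x * x for x in block]


if __name__ == "__main__":
    for size in (4, 10):
        _, calls = run_in_blocks(range(10), size, fake_gpu)
        print(f"block_size={size}: {calls} GPU call(s)")
```

A larger block means fewer calls but a bigger buffer; a single call needs the whole input in memory at once, which is the impractical worst case mentioned in the post.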

sterling Joined: 14 Jun 19 Posts: 1 Credit: 1,234,896 RAC: 0	Message 2473 - Posted: 20 Jun 2019, 4:41:08 UTC
Error: Failed to obtain OpenCL device id.
Error: Failed to initialize OpenCL.
Any chance I could look at your code that initializes?

Eric Driver Project administrator Project developer Project tester Project scientist Joined: 8 Jul 11 Posts: 1391 Credit: 700,007,392 RAC: 822,681	Message 2474 - Posted: 20 Jun 2019, 15:26:55 UTC - in response to Message 2473.
Please do! I will send it to you. Just so you know, when it can't obtain a device id, that's usually a sign of a configuration problem (not a code problem). I've seen it on my own system when I updated the graphics driver and the client then lost its connection to the GPU. To fix it I had to reboot.
Feel free to also look at the OpenCL code. Some AMD cards have a problem with it and I still don't know why (my best guess is buggy drivers, but it could be something in the code).

Aurel Joined: 25 Feb 13 Posts: 216 Credit: 9,899,302 RAC: 0	Message 2475 - Posted: 20 Jun 2019, 20:46:17 UTC - in response to Message 2467. Last modified: 20 Jun 2019, 20:47:16 UTC
Could you size the buffer as follows: read the user's / computer's maximum memory usage from their computing preferences, divide it by the maximum number of running tasks (number of threads and graphics devices, if they're supported / enabled), and set this value as the largest allowed buffer?
These are just my two cents~
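The arithmetic behind this suggestion is simple enough to write down. The sketch below is only an assumption about how it could look: the function name and parameters are hypothetical, and how the app would actually read the memory limit from the computing preferences is not specified in the thread.

```python
def per_task_buffer_bytes(max_memory_bytes: int, cpu_tasks: int, gpu_tasks: int) -> int:
    """Split a memory limit evenly across all tasks that may run at once.

    max_memory_bytes: memory limit taken from the user's preferences (assumed input)
    cpu_tasks:        maximum number of CPU tasks running concurrently
    gpu_tasks:        maximum number of GPU tasks running concurrently
    """
    running = cpu_tasks + gpu_tasks
    if max_memory_bytes < 0 or cpu_tasks < 0 or gpu_tasks < 0:
        raise ValueError("inputs must be non-negative")
    if running < 1:
        raise ValueError("at least one task must be able to run")
    # Integer division so the per-task share never exceeds the limit in total.
    return max_memory_bytes // running


if __name__ == "__main__":
    # Made-up example values: 8 GiB limit, 4 CPU tasks plus 2 GPU tasks.
    limit = 8 * 1024**3
    share = per_task_buffer_bytes(limit, cpu_tasks=4, gpu_tasks=2)
    print(f"{share / 1024**2:.0f} MiB per task")
```

An even split is the simplest reading of the proposal; a real implementation might instead give GPU tasks a larger share, since only they need the polynomial buffer.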

Eric Driver Project administrator Project developer Project tester Project scientist Joined: 8 Jul 11 Posts: 1391 Credit: 700,007,392 RAC: 822,681	Message 2481 - Posted: 30 Jun 2019, 4:12:05 UTC - in response to Message 2478.
Thanks for the info. I will use this when I optimize by card.
Just added an app_config.xml to two of my machines so I can run 2 concurrent WUs on my GPUs, which increases their load from 79% to 97%.
1st machine is an i5 (4 cores) with a GTX1060 GPU.
<app_config>
  <app>
    <name>GetDecics</name>
    <max_concurrent>4</max_concurrent>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
2nd machine is an i7 (8 multithreaded cores) with a GTX1070 GPU.
<app_config>
  <app>
    <name>GetDecics</name>
    <max_concurrent>8</max_concurrent>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
So 2 cores are allocated to the GPUs and the other cores are running CPU WUs. This is giving me a slight increase in throughput.
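The core split that follows from those settings can be checked with a few lines of arithmetic. This is a simplified model, not how the BOINC client actually schedules work: it assumes one GPU per machine, that a gpu_usage of 0.5 means two tasks share the GPU, and that each GPU task reserves cpu_usage cores. The function name is made up for the example.

```python
import math
from typing import Tuple


def split_cores(
    total_cores: int, num_gpus: int, gpu_usage: float, cpu_usage: float
) -> Tuple[int, float, float]:
    """Return (GPU tasks, cores reserved for GPU tasks, cores left for CPU tasks)."""
    if not 0 < gpu_usage <= 1:
        raise ValueError("this sketch only handles 0 < gpu_usage <= 1")
    # Number of tasks that fit on one GPU; the epsilon guards against float rounding.
    tasks_per_gpu = math.floor(1.0 / gpu_usage + 1e-9)
    gpu_tasks = tasks_per_gpu * num_gpus
    cores_for_gpu = gpu_tasks * cpu_usage
    cores_left = max(0, total_cores - cores_for_gpu)
    return gpu_tasks, cores_for_gpu, cores_left


if __name__ == "__main__":
    # Values taken from the two configurations in the post above.
    for name, cores in (("i5 / GTX1060", 4), ("i7 / GTX1070", 8)):
        gpu_tasks, reserved, left = split_cores(cores, 1, 0.5, 1)
        print(f"{name}: {gpu_tasks} GPU tasks, {reserved} cores reserved, {left} for CPU WUs")
```

Under these assumptions both machines reserve 2 cores for the 2 GPU tasks, which matches the "2 cores are allocated to the GPUs" remark in the post.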