Message boards : Number crunching : Running slow on Intel GPU
Joined: 25 Mar 22 · Posts: 6 · Credit: 2,335,914 · RAC: 340

On a B580, it takes about 7 times as long as on a 2080 Ti to complete a workunit. https://imgur.com/a/hti7VZj

GPU Vector Engine (XVE Arrays): Active 1.0%, Idle 7.5%, Stalled 91.5%
GPU Computing Threads Dispatch: XVE Threads Occupancy 80.0%, Thread Dispatcher Active 0.0%
GPU L3 Cache Bandwidth and Misses: L3 Read 289.754 GB/s, L3 Write 269.693 GB/s, L3 Misses 1,906,192,273/s, L3 Input Available 24.8%, L3 Output Ready 11.6%, L3 Busy 100.0%, L3 Stalled 14.6%, SQ Full 2.2%
GPU Memory Access: Read 237.404 GB/s, Write 74.790 GB/s, GPU Memory Active 99.8%
Eric Driver · Joined: 8 Jul 11 · Posts: 1434 · Credit: 807,036,129 · RAC: 873,185
Hard to say why. I'm not familiar with the Intel B580, so I did a quick search and verified that it should be comparable to the 2080 Ti.

I noticed that the CPU time was also very high for the system with the B580. The CPU assembles batches of polynomials to test and then hands them over to the GPU. It does this in parallel while the GPU is working on the previous batch, so if the CPU is very slow or under heavy load, the GPU could be idling while it waits on the CPU. The GPU could also be under heavy load itself from other processes you might be running on it. Only you can determine whether one of these things is the cause. The problem could also be a poor Intel driver, in particular the OpenCL implementation.

Finally, note that not all WUs have the same runtime. I have seen swings as high as a factor of 2, so it's really better to run a bunch of WUs and look at averages.
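A minimal C sketch of the overlap Eric describes: the CPU assembles the next batch of polynomials while the GPU is still crunching the current one. This is not the actual NumberFields code; build_batch(), gpu_run_batch_async() and gpu_wait() are hypothetical stand-ins for the real host/OpenCL calls, and the batch size is only illustrative.

```c
/* Sketch of a double-buffered CPU/GPU pipeline (hypothetical helpers,
 * not NumberFields source). */
#include <stddef.h>

#define BATCH 32768                        /* polynomials per batch (illustrative) */

typedef struct { long coeffs[8]; } poly_t; /* placeholder polynomial record */

static void build_batch(poly_t *dst, size_t n)             { (void)dst; (void)n; } /* CPU-side work */
static void gpu_run_batch_async(const poly_t *p, size_t n) { (void)p; (void)n; }   /* non-blocking enqueue */
static void gpu_wait(void)                                 { }                     /* wait for the GPU batch */

int main(void) {
    static poly_t buf[2][BATCH];           /* double buffer: one batch per side */
    int cur = 0;

    build_batch(buf[cur], BATCH);          /* prime the pipeline */
    for (int i = 0; i < 100; ++i) {
        gpu_run_batch_async(buf[cur], BATCH);  /* GPU starts on batch i          */
        build_batch(buf[cur ^ 1], BATCH);      /* CPU builds batch i+1 meanwhile */
        gpu_wait();                            /* if the CPU side is the slower one,
                                                  the GPU has been sitting idle and
                                                  this returns immediately */
        cur ^= 1;
    }
    return 0;
}
```

The point of the sketch is only that a slow or overloaded CPU turns directly into GPU idle time between batches.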
Joined: 25 Mar 22 · Posts: 6 · Credit: 2,335,914 · RAC: 340
I noticed that this caused high memory reads on Intel GPUs, but not on NVIDIA GPUs. Even on older NVIDIA GPUs with smaller caches, bandwidth bottlenecks weren't as severe. This could be one of the root causes of the problem. VTune showed that memory activity was close to 100%, XVE SBID stalls were close to 100%, and XVE array activity and XVE pipeline utilization were very low. Compared with other projects, this may be abnormal. Intel may have some issues with OpenCL that cause a severe bandwidth bottleneck. However, Intel GPUs also tend to be slower in other projects.

As for CPU time, this may be a common issue for Intel. In GFN-18, Intel also fully occupies a CPU thread, but reducing its CPU time by running a large number of CPU tasks and lowering the priority of GPU tasks does not result in longer task execution times. On NVIDIA GPUs, larger GFN tasks have a lower CPU time ratio. On Intel GPUs, the CPU may be waiting (busy-waiting) rather than sleeping while the GPU performs computations. However, on Amicable Numbers and AP27, the CPU time is smaller and does not always occupy an entire thread.

GPU XVE Stall Reasons: Instruction Fetch Stall 0.9%, Barrier Stall 0.0%, Dist or Acc Stall 48.8%, Send Stall 4.0%, Pipe Stall 0.1%, SBID Stall 99.1%, Control Stall 1.1%, Other Stall 0.0%
GPU XVE Pipelines: ALU0 and ALU1 Utilization 0.0%, ALU0 and XMX Utilization 0.0%, Multiple Pipe Utilization 0.0%, XVE ALU0 pipeline active 0.1%, XVE ALU1 pipeline active 1.2%, XVE XMX pipeline active 0.0%
Joined: 4 Jan 25 · Posts: 39 · Credit: 175,259,059 · RAC: 819,859
From Phoronix's B580 review:

"One of the compute driver bugs with the Intel Compute Runtime on Battlemage appears to be around having much higher latency than Alchemist and the NVIDIA/AMD competition. Here's a look at the OpenCL kernel latency being many times higher on the Arc B580 than the Alchemist A-Series. Hopefully this Intel Compute Runtime issue can be quickly worked out."

While these are Linux benchmarks, there's a very high likelihood that this issue is shared with the Windows drivers as well. If so, it would appear that Intel have yet to sort out this issue, even though it's been over 9 months since its release.

Grant
Darwin NT, Australia.
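For anyone who wants to check this on their own card, a small OpenCL probe (a sketch, not taken from Phoronix or from the NumberFields app; error handling omitted) can measure the queued-to-start latency of a trivial kernel launch using standard profiling events:

```c
/* Measure OpenCL kernel-launch latency (queued -> start) on the first GPU. */
#define CL_TARGET_OPENCL_VERSION 300
#include <stdio.h>
#include <CL/cl.h>

static const char *src = "__kernel void noop() { }";

int main(void) {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_queue_properties props[] = { CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0 };
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, props, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "noop", NULL);

    size_t global = 1;
    cl_event evt;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong t_queued = 0, t_start = 0;   /* timestamps in nanoseconds */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof t_queued, &t_queued, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,  sizeof t_start,  &t_start,  NULL);
    printf("kernel launch latency: %.1f us\n", (t_start - t_queued) / 1000.0);

    clReleaseEvent(evt); clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```

A launch latency that is many times larger on the B580 than on another GPU in the same box would support the driver-latency explanation.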
Joined: 25 Mar 22 · Posts: 6 · Credit: 2,335,914 · RAC: 340
I'm using the latest driver, 32.0.101.7026. I've found that lowering threadsPerBlock to smaller values like 4 or 8 can improve performance on Intel GPUs.
Joined: 25 Mar 22 · Posts: 6 · Credit: 2,335,914 · RAC: 340
Setting a small threadsPerBlock reduces the task's runtime from over 3000 seconds to around 400 seconds.

https://numberfields.asu.edu/NumberFields/result.php?resultid=263760673
GPU was not found in the lookup table. Using default values: numBlocks = 1024, threadsPerBlock = 32, polyBufferSize = 32768.
Run time: 55 min 58 sec, CPU time: 55 min 58 sec

https://numberfields.asu.edu/NumberFields/result.php?resultid=263760666
GPU found in lookup table: GPU Name = B580, numBlocks = 2560, threadsPerBlock = 4, polyBufferSize = 10240.
Run time: 13 min 31 sec, CPU time: 13 min 31 sec

On Linux, modifying the lookup table changes its file size and the app will not start:

https://numberfields.asu.edu/NumberFields/result.php?resultid=263769261
<core_client_version>8.0.4</core_client_version>
<![CDATA[
<message>
couldn't start app: Task file gpuLookupTable_v402.txt: file has the wrong size</message>
]]>
Eric Driver · Joined: 8 Jul 11 · Posts: 1434 · Credit: 807,036,129 · RAC: 873,185
"Setting a small threadsPerBlock reduces the task's runtime from over 3000 seconds to around 400 seconds."

So it could be an architectural thing with Intel GPUs. The threadsPerBlock is the number of threads that run in lockstep; 32 is the optimal number for Nvidia's architecture and is equal to what they call the warp size.

It sounds like I will need to update the lookup table at some point in the near future so everyone can benefit from this, but I'd like to see some more B580 hosts verify this before I change it.
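For reference, a rough sketch of how numBlocks and threadsPerBlock would typically map onto an OpenCL launch, assuming threadsPerBlock is used as the local work-group size (an assumption; the actual NumberFields host code is not shown in this thread):

```c
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

/* Hypothetical helper, not the project's actual code: the lookup-table
 * values expressed as an NDRange launch. The local size is the group
 * Eric describes as running in lockstep (the warp-size analogue). */
static cl_int launch_batch(cl_command_queue queue, cl_kernel kernel,
                           size_t numBlocks, size_t threadsPerBlock)
{
    size_t local_size  = threadsPerBlock;              /* e.g. 32 default, 4 for this B580 */
    size_t global_size = numBlocks * threadsPerBlock;  /* total work-items per batch       */
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global_size, &local_size,
                                  0, NULL, NULL);
}
```

If that mapping is right, dropping threadsPerBlock from 32 to 4 simply shrinks each work-group, which presumably sidesteps whatever the Battlemage driver is doing badly with larger groups.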
Joined: 25 Mar 22 · Posts: 6 · Credit: 2,335,914 · RAC: 340
Here are WU runtimes under Linux. Lowering the threadsPerBlock value reduces the run time on the Intel GPU. Also, the x86_64-pc-linux-gnu application doesn't have the problem with an entire CPU thread being occupied; the CPU time is significantly shorter than the run time. There may be some issue with the windows_x86_64 application.

Arch Linux [6.16.3-arch1-1 | libc 2.42]

https://numberfields.asu.edu/NumberFields/result.php?resultid=263784218
GPU Summary String = [CAL|AMDRadeonGraphics(radeonsi,raphael_mendocino,LLVM20.1.8,DRM3.64,6.16.3-arch1-1)|1|2048MB||300][INTEL|Intel(R)Arc(TM)B580Graphics|1|11605MB||300]
Loading GPU lookup table from file. GPU was not found in the lookup table. Using default values: numBlocks = 1024, threadsPerBlock = 32, polyBufferSize = 32768.
Run time: 1 hour 12 min 39 sec, CPU time: 3 min 59 sec

https://numberfields.asu.edu/NumberFields/result.php?resultid=263790057
GPU Summary String = [CAL|AMDRadeonGraphics(radeonsi,raphael_mendocino,LLVM20.1.8,DRM3.64,6.16.3-arch1-1)|1|2048MB||300][INTEL|Intel(R)Arc(TM)B580Graphics|1|11605MB||300]
Loading GPU lookup table from file. GPU found in lookup table: GPU Name = B580, numBlocks = 2560, threadsPerBlock = 4, polyBufferSize = 10240.
Run time: 9 min 45 sec, CPU time: 1 min 38 sec
Eric Driver · Joined: 8 Jul 11 · Posts: 1434 · Credit: 807,036,129 · RAC: 873,185
I looked into the Intel architecture documents and found that a SIMD width (the equivalent of warp size on Nvidia) of 32 should be fine. So I started looking through the database for other Intel GPUs to see if they also have this problem. The A580 does not have this problem. For example: https://numberfields.asu.edu/NumberFields/result.php?resultid=263290138 Also a bunch of UHD Graphics XXX and Iris Xe cards. Makes me wonder if there's something in the Intel driver that's affecting the newer B580???

Anyways, I can't find any other B580 cards in the database. In the short term, it looks like you have a workaround by changing the lookup table, until I get around to updating it on the server side.
Joined: 25 Mar 22 · Posts: 6 · Credit: 2,335,914 · RAC: 340
There may be some issues with the Intel graphics driver. In PrimeGrid, other users' Arc A580s complete GFN-18 and GFN-19 tasks faster than mine and other users' Arc B580s.

Other user's A580, GFN-18: Run time 678.70 (https://www.primegrid.com/result.php?resultid=1972100672)
Other user's A580, GFN-19: Run time 2,060.71 (https://www.primegrid.com/result.php?resultid=1972095096)
B580, GFN-18: Run time 813.03 (https://www.primegrid.com/result.php?resultid=1971845683)
B580, GFN-19: Run time 2,336.39 (https://www.primegrid.com/result.php?resultid=1971829718)
Other user's B580, GFN-18: Run time 866.14 (https://www.primegrid.com/result.php?resultid=1972316750)
Keith Myers · Joined: 14 May 23 · Posts: 18 · Credit: 307,979,559 · RAC: 531,736
Set the cc_config.xml option <dont_check_file_sizes>0</dont_check_file_sizes> to <dont_check_file_sizes>1</dont_check_file_sizes> in the <options> section so the client ignores your file size change in the lookup table file and you can test with the smaller value.
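For reference, a minimal cc_config.xml showing where that option lives (merge it into your existing file if you already have one, then restart the BOINC client or have it re-read the config files):

```xml
<!-- cc_config.xml in the BOINC data directory; only the option Keith mentions is shown. -->
<cc_config>
  <options>
    <dont_check_file_sizes>1</dont_check_file_sizes>
  </options>
</cc_config>
```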