Running slow on Intel GPU

esek

Joined: 25 Mar 22
Posts: 6
Credit: 2,335,914
RAC: 340
Message 3933 - Posted: 23 Aug 2025, 12:14:25 UTC

On a B580, a workunit takes about 7 times as long to complete as on a 2080 Ti.
https://imgur.com/a/hti7VZj

GPU Vector Engine (XVE Arrays)
  Active: 1.0%
  Idle: 7.5%
  Stalled: 91.5%

GPU Computing Threads Dispatch
  XVE Threads Occupancy: 80.0%
  Thread Dispatcher Active: 0.0%

GPU L3 Cache Bandwidth and Misses
  L3 Read: 289.754 GB/sec
  L3 Write: 269.693 GB/sec
  L3 Misses: 1,906,192,273 misses/sec
  L3 Input Available: 24.8%
  L3 Output Ready: 11.6%
  L3 Busy: 100.0%
  L3 Stalled: 14.6%
  SQ Full: 2.2%

GPU Memory Access
  Read: 237.404 GB/sec
  Write: 74.790 GB/sec
  GPU Memory Active: 99.8%
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1434
Credit: 807,036,129
RAC: 873,185
Message 3934 - Posted: 23 Aug 2025, 14:47:13 UTC - in response to Message 3933.  

Hard to say why. I'm not familiar with the Intel B580, so I did a quick search and verified that it should be comparable to the 2080Ti.

I noticed that the CPU time was also very high for the system with the B580. The CPU assembles batches of polynomials to test and then hands them over to the GPU. It does this in parallel while the GPU is working on the previous batch. So if the CPU is very slow or under heavy load, the GPU can sit idle waiting on the CPU. The GPU could also be under heavy load from other processes you might be running on it. Only you can determine whether one of these is the cause.
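To illustrate the pattern (just a toy sketch, not the project's actual code; assemble_batch and gpu_process are made-up stand-ins with made-up timings): the CPU builds batch N+1 while the GPU works on batch N, so a slow producer starves the consumer.

// Toy producer/consumer pipeline: the CPU prepares the next batch while the
// "GPU" (simulated here with a sleep) works on the current one.
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for the CPU-side work of assembling polynomials.
std::vector<int> assemble_batch(int n) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return std::vector<int>(1024, n);
}

// Hypothetical stand-in for the GPU-side work of testing a batch.
void gpu_process(const std::vector<int>& batch) {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}

int main() {
    auto next = std::async(std::launch::async, assemble_batch, 0);
    for (int n = 0; n < 8; ++n) {
        std::vector<int> batch = next.get();                          // wait for the CPU producer
        next = std::async(std::launch::async, assemble_batch, n + 1); // start assembling the next batch
        gpu_process(batch);                                           // "GPU" consumes the current batch
    }
    std::puts("done");
    return 0;
}

If assemble_batch takes longer than gpu_process (a slow or heavily loaded CPU), the loop stalls in next.get() and the GPU goes idle, which is the symptom described above.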

The problem could also be a poor Intel driver, in particular the OpenCL implementation. Finally, note that not all WUs have the same runtime. I have seen swings as high as a factor of 2, so it's really better to run a bunch of WUs and look at averages.
esek

Joined: 25 Mar 22
Posts: 6
Credit: 2,335,914
RAC: 340
Message 3935 - Posted: 23 Aug 2025, 15:50:42 UTC - in response to Message 3934.  

I noticed that these tasks cause high memory read traffic on Intel GPUs, but not on NVIDIA GPUs. Even on older NVIDIA GPUs with smaller caches, bandwidth bottlenecks weren't as severe. This could be one of the root causes of the problem.

VTune showed that memory activity was close to 100%, XVE SBID stalls were close to 100%, and XVE array activity and XVE pipeline utilization were very low. Compared with other projects, this looks abnormal.

Intel may have some issues with OpenCL, causing a severe bandwidth bottleneck. However, Intel GPUs also tend to be slower in other projects.

As for CPU time, this may be a common issue for Intel. In GFN-18 the Intel GPU app also fully occupies a CPU thread, but reducing its CPU time (by running a large number of CPU tasks and lowering the priority of the GPU task) does not make the tasks run longer.

On Nvidia GPUs, larger GFN tasks have a lower CPU time ratio. On Intel GPUs, the CPU may be busy-waiting rather than sleeping while the GPU computes. However, on Amicable Numbers and AP27, the CPU time is smaller and does not always occupy an entire thread.
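A toy illustration of the waiting-vs-sleeping difference (not the actual app or driver code): a host thread that spin-polls for GPU completion racks up CPU time roughly equal to the run time, while a blocking wait uses almost none. Which behavior you get can depend on how the OpenCL runtime implements its waits.

#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main() {
    std::mutex m;
    std::condition_variable cv;
    bool gpu_done = false;

    // Stand-in for a GPU kernel that takes about 1 second of wall time.
    std::thread gpu([&] {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        { std::lock_guard<std::mutex> lock(m); gpu_done = true; }
        cv.notify_one();
    });

    // Busy-waiting (what a spin-polling runtime effectively does) would be:
    //   while (!gpu_done) { /* poll status */ }   -> about 1 s of CPU time per 1 s of wall time.
    // A blocking wait (shown here) uses essentially no CPU time for the same wall time.
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return gpu_done; });

    gpu.join();
    std::puts("kernel finished");
    return 0;
}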

GPU XVE Stall Reasons
  XVE Instruction Fetch Stall: 0.9%
  XVE Barrier Stall: 0.0%
  XVE Dist or Acc Stall: 48.8%
  XVE Send Stall: 4.0%
  XVE Pipe Stall: 0.1%
  XVE SBID Stall: 99.1%
  XVE Control Stall: 1.1%
  XVE Other Stall: 0.0%

GPU XVE Pipelines
  ALU0 and ALU1 Utilization: 0.0%
  ALU0 and XMX Utilization: 0.0%
  Multiple Pipe Utilization: 0.0%
  XVE ALU0 pipeline active: 0.1%
  XVE ALU1 pipeline active: 1.2%
  XVE XMX pipeline active: 0.0%
Grant (SSSF)

Joined: 4 Jan 25
Posts: 39
Credit: 175,259,059
RAC: 819,859
Message 3936 - Posted: 23 Aug 2025, 18:51:29 UTC

From Phoronix's B580 review
One of the compute driver bugs with the Intel Compute Runtime on Battlemage appears to be around having much higher latency than Alchemist and the NVIDIA/AMD competition. Here's a look at the OpenCL kernel latency being many times higher on the Arc B580 than the Alchemist A-Series. Hopefully this Intel Compute Runtime issue can be quickly worked out.
Are you running the most recent video driver?
While these are Linux benchmarks, there's a very high likelihood that this issue is shared with the Windows drivers as well.
If so, it would appear that Intel have yet to sort out this issue, even though it's been over 9 months since its release.
Grant
Darwin NT, Australia.
esek

Joined: 25 Mar 22
Posts: 6
Credit: 2,335,914
RAC: 340
Message 3937 - Posted: 24 Aug 2025, 0:38:27 UTC - in response to Message 3936.  

I'm using the latest driver, 32.0.101.7026.
I've found that lowering threadsPerBlock to smaller values like 4 or 8 can improve performance on the Intel GPU.
esek

Joined: 25 Mar 22
Posts: 6
Credit: 2,335,914
RAC: 340
Message 3938 - Posted: 24 Aug 2025, 2:06:59 UTC

Setting a small threadsPerBlock reduces the task's runtime from over 3000 seconds to around 400 seconds.
https://numberfields.asu.edu/NumberFields/result.php?resultid=263760673
GPU was not found in the lookup table.  Using default values:
  numBlocks = 1024.
  threadsPerBlock = 32.
  polyBufferSize = 32768.
Run time	55 min 58 sec
CPU time	55 min 58 sec

https://numberfields.asu.edu/NumberFields/result.php?resultid=263760666
GPU found in lookup table:
  GPU Name = B580.
  numBlocks = 2560.
  threadsPerBlock = 4.
  polyBufferSize = 10240.
Run time	13 min 31 sec
CPU time	13 min 31 sec
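One thing I notice in those two logs (just an observation from the numbers, I don't know the app internals): polyBufferSize equals numBlocks × threadsPerBlock in both cases (1024 × 32 = 32768 and 2560 × 4 = 10240), so the buffer size appears to scale with the total number of GPU threads launched.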

On Linux, modifying the lookup table changes the file size and the app then refuses to start:
https://numberfields.asu.edu/NumberFields/result.php?resultid=263769261
<core_client_version>8.0.4</core_client_version>
<![CDATA[
<message>
couldn't start app: Task file gpuLookupTable_v402.txt: file has the wrong size</message>
]]>
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1434
Credit: 807,036,129
RAC: 873,185
Message 3939 - Posted: 24 Aug 2025, 5:00:51 UTC - in response to Message 3938.  

So it could be an architectural thing with Intel GPUs. The threadsPerBlock is the number of threads to run in lockstep. 32 is the optimal number for Nvidia's architecture and it's equal to what they call the warp size.

It sounds like I will need to update the lookup table at some point in the near future, so everyone can benefit from this. But I'd like to see some more B580 hosts verify this before I change it.
esek

Joined: 25 Mar 22
Posts: 6
Credit: 2,335,914
RAC: 340
Message 3940 - Posted: 24 Aug 2025, 7:17:46 UTC - in response to Message 3939.  

Here are WU runtimes under Linux. Lowering the threadsPerBlock value reduces the run time on the Intel GPU there as well.

Also, the x86_64-pc-linux-gnu application doesn't have the problem of occupying an entire thread; the CPU time is significantly shorter than the run time. There may be some issue with the windows_x86_64 application.

Arch Linux [6.16.3-arch1-1|libc 2.42]
https://numberfields.asu.edu/NumberFields/result.php?resultid=263784218
GPU Summary String = [CAL|AMDRadeonGraphics(radeonsi,raphael_mendocino,LLVM20.1.8,DRM3.64,6.16.3-arch1-1)|1|2048MB||300][INTEL|Intel(R)Arc(TM)B580Graphics|1|11605MB||300].
Loading GPU lookup table from file.
GPU was not found in the lookup table.  Using default values:
  numBlocks = 1024.
  threadsPerBlock = 32.
  polyBufferSize = 32768.
Run time 	1 hours 12 min 39 sec
CPU time 	3 min 59 sec 

https://numberfields.asu.edu/NumberFields/result.php?resultid=263790057
GPU Summary String = [CAL|AMDRadeonGraphics(radeonsi,raphael_mendocino,LLVM20.1.8,DRM3.64,6.16.3-arch1-1)|1|2048MB||300][INTEL|Intel(R)Arc(TM)B580Graphics|1|11605MB||300].
Loading GPU lookup table from file.
GPU found in lookup table:
  GPU Name = B580.
  numBlocks = 2560.
  threadsPerBlock = 4.
  polyBufferSize = 10240.
Run time 	9 min 45 sec
CPU time 	1 min 38 sec 
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1434
Credit: 807,036,129
RAC: 873,185
Message 3941 - Posted: 24 Aug 2025, 16:10:53 UTC - in response to Message 3940.  

I looked into the Intel architecture documents and found that a SIMD width (the equivalent of warp size on Nvidia) of 32 should be fine.

So I started looking through the database for other Intel GPUs to see if they also have this problem.

The A580 does not have this problem. For example:
https://numberfields.asu.edu/NumberFields/result.php?resultid=263290138
Also a bunch of UHD Graphics XXX and Iris Xe cards.
Makes me wonder if there's something in the Intel driver that's affecting the newer B580???

Anyways, I can't find any other B580 cards in the database. In the short term, it looks like you have a workaround by changing the lookup table, until I get around to updating it on the server side.
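Going back to the SIMD width question: if anyone wants to see what the driver itself chooses on the B580, here is a small standalone OpenCL probe (just a sketch, not part of the project app, and untested on Battlemage) that prints the preferred work-group size multiple and maximum work-group size for a trivial kernel.

// Build on Linux with: g++ probe.cpp -lOpenCL
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <cstdio>

static const char* src =
    "__kernel void probe(__global float* x) { x[get_global_id(0)] += 1.0f; }";

int main() {
    cl_platform_id plats[8];
    cl_uint np = 0;
    clGetPlatformIDs(8, plats, &np);
    for (cl_uint p = 0; p < np; ++p) {
        cl_device_id devs[8];
        cl_uint nd = 0;
        if (clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_GPU, 8, devs, &nd) != CL_SUCCESS)
            continue;
        for (cl_uint d = 0; d < nd; ++d) {
            char name[256] = {0};
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof(name), name, nullptr);

            cl_int err = CL_SUCCESS;
            cl_context ctx = clCreateContext(nullptr, 1, &devs[d], nullptr, nullptr, &err);
            cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
            clBuildProgram(prog, 1, &devs[d], nullptr, nullptr, nullptr);
            cl_kernel k = clCreateKernel(prog, "probe", &err);

            size_t simd = 0, maxwg = 0;
            clGetKernelWorkGroupInfo(k, devs[d], CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                     sizeof(simd), &simd, nullptr);
            clGetKernelWorkGroupInfo(k, devs[d], CL_KERNEL_WORK_GROUP_SIZE,
                                     sizeof(maxwg), &maxwg, nullptr);
            printf("%s: preferred work-group multiple = %zu, max work-group size = %zu\n",
                   name, simd, maxwg);

            clReleaseKernel(k);
            clReleaseProgram(prog);
            clReleaseContext(ctx);
        }
    }
    return 0;
}

If the Battlemage driver reports something unexpected here compared to Alchemist, that might point at the driver or its kernel compiler rather than the app.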
esek

Joined: 25 Mar 22
Posts: 6
Credit: 2,335,914
RAC: 340
Message 3942 - Posted: 24 Aug 2025, 22:58:11 UTC - in response to Message 3941.  
Last modified: 24 Aug 2025, 23:03:55 UTC

There may be some issues with the Intel graphics driver. On PrimeGrid, other users' Arc A580s complete GFN-18 and GFN-19 tasks faster than my Arc B580 and other users' B580s do.

Other user's A580 GFN-18: Run time 678.70
https://www.primegrid.com/result.php?resultid=1972100672
Other user's A580 GFN-19: Run time 2,060.71
https://www.primegrid.com/result.php?resultid=1972095096
B580 GFN-18: Run time 813.03
https://www.primegrid.com/result.php?resultid=1971845683
B580 GFN-19: Run time 2,336.39
https://www.primegrid.com/result.php?resultid=1971829718
Other user's B580 GFN-18: Run time 866.14
https://www.primegrid.com/result.php?resultid=1972316750
Profile Keith Myers

Joined: 14 May 23
Posts: 18
Credit: 307,979,559
RAC: 531,736
Message 3943 - Posted: 25 Aug 2025, 16:41:45 UTC - in response to Message 3938.  


<core_client_version>8.0.4</core_client_version>
<![CDATA[
<message>
couldn't start app: Task file gpuLookupTable_v402.txt: file has the wrong size</message>
]]>


Set the cc_config.xml option <dont_check_file_sizes> from 0 to 1 in the <options> section so the client ignores the file size change in the lookup table file and you can test with the smaller value.
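For reference, a minimal cc_config.xml with that option set (it goes in the BOINC data directory; re-read config files from the Manager or restart the client for it to take effect):

<cc_config>
    <options>
        <dont_check_file_sizes>1</dont_check_file_sizes>
    </options>
</cc_config>

Keep in mind this disables file size checking for every project on that client, so it's best treated as a temporary testing measure.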
