Consistent Crashing on GPU

DEV

Joined: 28 Aug 22
Posts: 8
Credit: 32,292,333
RAC: 52,710
Message 3605 - Posted: 10 Nov 2023, 11:20:40 UTC

The day before yesterday I tried the NV GPU tasks instead of CPU tasks, but problems arose:
From time to time the screen goes completely black for a second, and when the system tries to recover the desktop, all jobs currently running fail (in app_config.xml I set multitasking on the GPU), including those not from NF. The power, frequency and voltage settings on the GPU are also reset to default. It's really annoying!
The environment: Win11 Home, 4060 Laptop, 13500H.
Besides, I originally set the GPU overclock to +225 MHz. Apparently NF cannot tolerate that much overclocking and always fails (I wonder why EAH and Asteroids do not). Now I have set it to 168 MHz and it doesn't fail as often, but still occasionally. Is that relevant?
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1324
Credit: 413,372,406
RAC: 259,058
Message 3606 - Posted: 10 Nov 2023, 15:41:25 UTC - in response to Message 3605.  

Sorry for the frustration. I'm not sure what the problem is. I saw similar behavior years ago when overclocking the cpu - the system would overheat and then shut itself down. Maybe something similar is happening with the GPU?
DEV

Joined: 28 Aug 22
Posts: 8
Credit: 32,292,333
RAC: 52,710
Message 3607 - Posted: 11 Nov 2023, 3:00:52 UTC - in response to Message 3606.  

I don't really think so, because the GPU temperature doesn't even reach 80℃. Besides, the anti-blue-light settings are lost each time too.
Still, I notice that this sometimes happens immediately after I change app_config.xml. I'm not sure whether the previous crashes were the same.
DEV

Joined: 28 Aug 22
Posts: 8
Credit: 32,292,333
RAC: 52,710
Message 3608 - Posted: 11 Nov 2023, 15:05:56 UTC - in response to Message 3606.  

Besides, I don't see any energy efficiency increase when switching to GPU as expected. Is that normal?
DEV

Joined: 28 Aug 22
Posts: 8
Credit: 32,292,333
RAC: 52,710
Message 3609 - Posted: 11 Nov 2023, 15:08:05 UTC

Sorry for being in UTC+8.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1324
Credit: 413,372,406
RAC: 259,058
Message 3610 - Posted: 11 Nov 2023, 16:00:28 UTC - in response to Message 3608.  

Besides, I don't see any energy efficiency increase when switching to GPU as expected. Is that normal?


I'm not sure exactly what you mean. I see about 25x speedup on my 3070 Ti compared to a single cpu core, but it also uses a bunch more power, so not sure if it's any more energy efficient.
DEV

Joined: 28 Aug 22
Posts: 8
Credit: 32,292,333
RAC: 52,710
Message 3611 - Posted: 12 Nov 2023, 3:37:48 UTC - in response to Message 3610.  

Yes. I see about an 8x speedup on the 4060 Laptop over the CPU when running 3 tasks in parallel, but with 60 W power consumption. On the CPU, it only consumes 25 W with 16 tasks in parallel. That's odd, because heterogeneous computing normally increases power efficiency by nearly an order of magnitude.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1324
Credit: 413,372,406
RAC: 259,058
Message 3612 - Posted: 12 Nov 2023, 17:00:04 UTC - in response to Message 3611.  

Yes. I see about an 8x speedup on the 4060 Laptop over the CPU when running 3 tasks in parallel, but with 60 W power consumption. On the CPU, it only consumes 25 W with 16 tasks in parallel. That's odd, because heterogeneous computing normally increases power efficiency by nearly an order of magnitude.


Where is this power measurement coming from? Is it the GPU only or the whole system?

Another thing to keep in mind is that the GPU app also uses a portion of a CPU core, probably somewhere between 20% and 50% depending on the speed of the GPU. The CPU generates the list of polynomials to test and the GPU does the actual testing; when the GPU is really fast, the CPU has to work harder to keep feeding it, hence the CPU usage goes up.
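To illustrate the point above about the CPU feeding the GPU, here is a minimal toy model. It is not the actual NumberFields app: it simply assumes the CPU needs a fixed time to generate each batch of polynomials and that generation overlaps with GPU testing. The function name and all timings are made up for illustration.

```python
def cpu_busy_fraction(gen_seconds: float, gpu_test_seconds: float) -> float:
    """Fraction of one CPU core kept busy feeding the GPU (toy model).

    Assumes the CPU generates one batch while the GPU tests the previous
    one, so the CPU is busy for gen_seconds out of every gpu_test_seconds,
    capped at a full core when the GPU outruns the CPU.
    """
    if gen_seconds <= 0 or gpu_test_seconds <= 0:
        raise ValueError("timings must be positive")
    return min(1.0, gen_seconds / gpu_test_seconds)


if __name__ == "__main__":
    GEN_SECONDS = 1.0  # hypothetical CPU time to build one batch
    # Hypothetical GPU times per batch, from a slower to a faster GPU.
    for gpu_seconds in (5.0, 2.0, 0.5):
        share = cpu_busy_fraction(GEN_SECONDS, gpu_seconds)
        print(f"GPU batch time {gpu_seconds:.1f}s -> CPU share {share:.0%}")
```

With these invented timings, the CPU share rises as the GPU gets faster, which matches the direction of the effect described above; the real percentages depend on the app and the hardware.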
DEV

Joined: 28 Aug 22
Posts: 8
Credit: 32,292,333
RAC: 52,710
Message 3613 - Posted: 13 Nov 2023, 4:53:24 UTC - in response to Message 3612.  

The number is from GPU-Z, so I suppose it's only the power drawn by the GPU. In that case I'll consider switching back to CPU-only tasks. The VRAM and its controller seem to consume a lot of power, about half of the total GPU chip power draw. Thanks.
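As a rough sanity check on the figures quoted in this thread (60 W reported by GPU-Z for the GPU, 25 W for the CPU run, and an 8x speedup), the relative energy efficiency can be estimated as the throughput ratio divided by the power ratio. The sketch below is only an illustration. The thread does not say whether the 8x is measured against the whole CPU or against a single CPU task, so both readings are shown. It also assumes the 16 CPU tasks scale linearly and ignores the CPU power spent feeding the GPU.

```python
def relative_efficiency(throughput_ratio: float,
                        gpu_watts: float,
                        cpu_watts: float) -> float:
    """Work per joule on the GPU divided by work per joule on the CPU.

    throughput_ratio is GPU throughput / CPU throughput for the same work.
    A result above 1 means the GPU run is the more energy-efficient one.
    """
    return throughput_ratio / (gpu_watts / cpu_watts)


GPU_WATTS = 60.0   # from the thread (GPU-Z, GPU only)
CPU_WATTS = 25.0   # from the thread (CPU run, 16 tasks)
SPEEDUP = 8.0      # from the thread; baseline is ambiguous
CPU_TASKS = 16     # from the thread

# Reading A (assumption): 8x compares whole-GPU to whole-CPU throughput.
reading_a = relative_efficiency(SPEEDUP, GPU_WATTS, CPU_WATTS)

# Reading B (assumption): 8x compares the GPU to ONE CPU task, and the
# 16 CPU tasks scale linearly, so the CPU as a whole is 16x one task.
reading_b = relative_efficiency(SPEEDUP / CPU_TASKS, GPU_WATTS, CPU_WATTS)

if __name__ == "__main__":
    print(f"Reading A: GPU is {reading_a:.2f}x as efficient as the CPU")
    print(f"Reading B: GPU is {reading_b:.2f}x as efficient as the CPU")
```

Under reading A the GPU comes out at roughly 3.3x the CPU's efficiency, and under reading B at roughly 0.2x, so neither reading reaches the order-of-magnitude gain that was expected.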
Tamago_W

Joined: 18 Apr 23
Posts: 2
Credit: 4,251,836
RAC: 0
Message 3614 - Posted: 13 Nov 2023, 9:01:59 UTC

I think it's the overclocking. Nvidia has set some fairly strict voltage limits for the RTX 40 series laptop GPUs compared to previous generations. Previously, Nvidia GPUs would automatically raise the voltage within the allowed range after an overclock, until they hit the voltage limit or the power limit. Today's RTX 40 series laptop GPUs don't do that: they only increase the frequency at the voltage limit (about 0.9 V for the 4060), because the voltage limit is far more restrictive than the power limit. Even the default boost clock already hits the voltage limit. So overclocked settings may not hold up well under sustained high-stress tasks (because of the relatively low voltage for that frequency).

Additionally, I had the same issue about half a year ago on an overclocked GTX 1070. It was an MSI Gaming Z with very good cooling, and I never had any problem with it in games. I ended up removing the overclock and never had a problem again; it ran stable for a week straight.
DEV

Joined: 28 Aug 22
Posts: 8
Credit: 32,292,333
RAC: 52,710
Message 3615 - Posted: 13 Nov 2023, 12:07:01 UTC - in response to Message 3614.  

OK, I'll try.
By the way, my voltage limit is 1.0100 V, which is much higher than 0.9 V. Perhaps that's part of the reason: my CPU runs at 2.5 GHz with about 0.7850 V, but my GPU reaches the voltage limit. Still, from the top computers list one can see that the GPU really does have little advantage over the CPU on NF.
Tamago_W

Joined: 18 Apr 23
Posts: 2
Credit: 4,251,836
RAC: 0
Message 3616 - Posted: 14 Nov 2023, 2:23:13 UTC - in response to Message 3615.  

You are right; the 0.9 V limit is on the RTX 4070 Laptop. I remembered it wrong.
DEV

Joined: 28 Aug 22
Posts: 8
Credit: 32,292,333
RAC: 52,710
Message 3617 - Posted: 17 Nov 2023, 7:32:17 UTC - in response to Message 3616.  

I suppose that NF is FP64-intensive, which is not friendly to NVIDIA cards. Am I right?
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1324
Credit: 413,372,406
RAC: 259,058
Message 3618 - Posted: 17 Nov 2023, 16:53:41 UTC - in response to Message 3617.  

I suppose that NF is FP64-intensive, which is not friendly to NVIDIA cards. Am I right?


Actually, it's integer intensive.

Copyright © 2024 Arizona State University