New and improved apps coming soon

Richard Haselgrove

Joined: 28 Oct 11
Posts: 180
Credit: 239,311,034
RAC: 89,002
Message 2975 - Posted: 4 Jan 2021, 22:40:59 UTC - in response to Message 2974.  

No need to apologise - it was my decision to upgrade. The machine was largely idle because GPUGrid has come to the end of its current research run, and I'm not really enthused by make-work projects like Collatz and PrimeGrid. I'd been thinking of upgrading for a while, and you gave me an excuse to get my brain into gear after the holidays.

It's generally running smoothly, but I hit two tasks today which seemed to get stuck in an endless loop.

Task 105827986 (from wu_sf3_DS-16x270_Grp 3738460 of 3932160)
Task 105983915 (from wu_sf3_DS-16x270_Grp 3639160 of 3932160)

I don't monitor the GPU loadings, but I do run a monitor which displays the 'CPU efficiency' of the task - the percentage of time the CPU is under load. That's typically 20% for the Linux GPU tasks, but dropped well below 10% for these: elapsed time continued to rise, but task progress froze at 90.880% and 90.629% respectively. I paused them a couple of times each (which for GPU tasks removes them completely from memory): they restarted from checkpoint OK, but froze again at the same point. In the end, I aborted them.
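The symptom described above (elapsed time rising while progress stays frozen and CPU efficiency drops) can be checked mechanically. A minimal sketch, assuming periodic samples of elapsed time, CPU time and fraction done are available from some monitor; the `Sample` type, the thresholds and the helper names are hypothetical and not part of BOINC.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    elapsed_s: float      # wall-clock seconds since the task started
    cpu_s: float          # CPU seconds consumed so far
    fraction_done: float  # progress in the range 0.0 .. 1.0


def cpu_efficiency(prev: Sample, cur: Sample) -> float:
    """Share of the wall-clock time between two samples that the CPU was busy."""
    dt = cur.elapsed_s - prev.elapsed_s
    return (cur.cpu_s - prev.cpu_s) / dt if dt > 0 else 0.0


def looks_stuck(samples, min_window_s=600.0, max_efficiency=0.10) -> bool:
    """True if progress has been frozen for min_window_s and CPU efficiency is low."""
    if len(samples) < 2:
        return False
    last = samples[-1]
    # Walk backwards to the oldest sample that shows the same progress as the last one.
    start = last
    for s in reversed(samples):
        if s.fraction_done != last.fraction_done:
            break
        start = s
    frozen_for = last.elapsed_s - start.elapsed_s
    return frozen_for >= min_window_s and cpu_efficiency(start, last) < max_efficiency
```

A task that keeps advancing never accumulates a frozen window, so it is not flagged regardless of its CPU efficiency.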

I noticed the lookup file, but didn't explore it in detail. BOINC itself concentrates on the "compute capability", which is invariant for each iteration of the NVIDIA architecture - it determines such things as the number of shaders per multiprocessor, and the minimum CUDA level required in the driver. Do you plan to document your usage anywhere?
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1341
Credit: 484,221,081
RAC: 575,124
Message 2976 - Posted: 5 Jan 2021, 0:34:56 UTC - in response to Message 2975.  

I did see the problem with the app sometimes freezing. I spent what little time I had today debugging that, but I found the problem and have a fix for it. Unfortunately, I have not had the time to fix the GLIBC version problem yet. That should come soon and then I'll package both fixes in one update.

At some point I'll document the lookup table, probably when I migrate the code to GitHub.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1341
Credit: 484,221,081
RAC: 575,124
Message 2977 - Posted: 5 Jan 2021, 8:03:55 UTC - in response to Message 2976.  

I got both fixes in. The new CUDA app is version 4.01. According to objdump, it requires glibc version 2.17 or later, so I think it should be good.
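The glibc requirement can be read from the versioned symbols a binary imports, which `objdump -T` lists. A minimal sketch, assuming that output has already been captured as text; the sample lines below are illustrative and not taken from the actual 4.01 binary, and `required_glibc` is a hypothetical helper.

```python
import re


def required_glibc(objdump_output: str):
    """Return the highest GLIBC_x.y version referenced, as a tuple of ints, or None."""
    versions = [
        tuple(int(part) for part in m.group(1).split("."))
        for m in re.finditer(r"GLIBC_(\d+(?:\.\d+)+)", objdump_output)
    ]
    return max(versions) if versions else None


# Illustrative excerpt in the style of `objdump -T` output (not from the real app).
SAMPLE = """\
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 printf
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.14  memcpy
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.17  clock_gettime
"""

if __name__ == "__main__":
    print(".".join(str(n) for n in required_glibc(SAMPLE)))
```

Comparing versions as integer tuples matters here: compared as plain strings, "2.17" would sort before "2.2.5".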
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1341
Credit: 484,221,081
RAC: 575,124
Message 2978 - Posted: 5 Jan 2021, 8:13:42 UTC - in response to Message 2969.  

I've converted the GPU tasks back to run on the CPU, so they won't be wasted or need to be resent.


Richard,

I forgot to ask you what you meant by the above statement. Is this a newer feature of the client that I am not aware of? There have been times when I wished I could do such a conversion.
Richard Haselgrove

Joined: 28 Oct 11
Posts: 180
Credit: 239,311,034
RAC: 89,002
Message 2979 - Posted: 5 Jan 2021, 8:45:57 UTC - in response to Message 2978.  

I've converted the GPU tasks back to run on the CPU, so they won't be wasted or need to be resent.

Richard,

I forgot to ask you what you meant by the above statement. Is this a newer feature of the client that I am not aware of? There have been times when I wished I could do such a conversion.


No, it's not a public feature of the client - it's just making use of the way BOINC stores the data defining the tasks in the cache. For each task, there's a <workunit> and a <result> xml chunk in client_state.xml. At the time I wrote that, the only difference between a GPU task and a CPU task was the <plan_class> line in the <result> chunk. Text editor, search'n'replace, and 'cuda30' turned into 'default'. Voilà.
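The search-and-replace described above can be sketched in a few lines. This is an illustration under assumptions, not a supported tool: the fragment below is a simplified stand-in for client_state.xml, `retarget_plan_class` is a hypothetical helper, and the real file should only ever be edited with the BOINC client stopped and a backup to hand.

```python
import re


def retarget_plan_class(state_text: str, old: str, new: str) -> str:
    """Swap <plan_class>old</plan_class> for new, but only inside <result> chunks."""
    def fix_result(match: re.Match) -> str:
        return match.group(0).replace(
            f"<plan_class>{old}</plan_class>",
            f"<plan_class>{new}</plan_class>",
        )
    # DOTALL lets '.' span the line breaks inside each <result> ... </result> chunk.
    return re.sub(r"<result>.*?</result>", fix_result, state_text, flags=re.DOTALL)


# Simplified, hypothetical fragment; real chunks contain many more fields.
SAMPLE = """\
<app_version>
    <plan_class>cuda30</plan_class>
</app_version>
<result>
    <name>task_a_0</name>
    <plan_class>cuda30</plan_class>
</result>
"""

if __name__ == "__main__":
    print(retarget_plan_class(SAMPLE, "cuda30", "default"))
```

Restricting the replacement to <result> chunks leaves the <app_version> definitions alone, which a blanket search-and-replace over the whole file would not.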

The new cuda app is version 4.01


Now, we'll have to tweak the <version_num> lines in both <workunit> and <result>, but it's still doable - but it takes care and understanding. As we used to say, "For advanced users only. At your own risk."

But I've still got a Mint 19 machine, so I can try the new app.
Richard Haselgrove

Joined: 28 Oct 11
Posts: 180
Credit: 239,311,034
RAC: 89,002
Message 2980 - Posted: 5 Jan 2021, 11:19:52 UTC - in response to Message 2977.  

Yes, the new version 4.01 is running fine on 'Linux Mint 19.1 Tessa' (host 1697845)
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1341
Credit: 484,221,081
RAC: 575,124
Message 2983 - Posted: 5 Jan 2021, 21:23:54 UTC - in response to Message 2979.  

Now, we'll have to tweak the <version_num> lines in both <workunit> and <result>, but it's still doable - but it takes care and understanding. As we used to say, "For advanced users only. At your own risk."

Ah, so your trick also allows you to change version numbers. Very cool. Last night I had to abort over 100 tasks that were assigned to the older version. Not sure how feasible it would be to implement in general, but that functionality would be a nice addition to the client.
Richard Haselgrove

Joined: 28 Oct 11
Posts: 180
Credit: 239,311,034
RAC: 89,002
Message 2984 - Posted: 5 Jan 2021, 21:55:50 UTC - in response to Message 2983.  

We used to use that trick a lot at SETI@Home, which was open-source from the very beginning. The volunteers (collectively) had far more time for optimising the various apps than the project staff, and the staff actively encouraged volunteers to help speed things up - provided quality and accuracy were maintained.

There are four main tags to watch for:

<app_name>GetDecics</app_name>
<version_num>400</version_num>
<platform>windows_x86_64</platform>
<plan_class>default</plan_class>

Two appear in <workunit>, and three appear in <result> (<version_num> appears in both). The complete set of four has to be consistent for each task, and they have to match an <app_version> which is already defined on your system. Apart from that, it's easy....

Some people even devised scripts or applications to automate the process - look for the term 'reschedule' at SETI.
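The consistency rule above can be expressed as a check. A rough sketch, assuming a simplified, well-formed fragment wrapped in a single root element; the helper name `unmatched_tasks`, the sample workunit and result names, and the exact chunk layout are assumptions for illustration, and the real client_state.xml holds many more fields.

```python
import xml.etree.ElementTree as ET


def text(node, tag):
    """Stripped text of a child tag, or '' if the tag is absent."""
    child = node.find(tag)
    return (child.text or "").strip() if child is not None else ""


def unmatched_tasks(state_xml: str):
    """Names of results whose four tags match no <app_version> in the state."""
    root = ET.fromstring(state_xml)
    known = {
        (text(av, "app_name"), text(av, "version_num"),
         text(av, "platform"), text(av, "plan_class"))
        for av in root.iter("app_version")
    }
    workunits = {text(wu, "name"): wu for wu in root.iter("workunit")}
    bad = []
    for res in root.iter("result"):
        wu = workunits.get(text(res, "wu_name"))
        # <version_num> appears in both chunks and has to agree.
        if wu is None or text(wu, "version_num") != text(res, "version_num"):
            bad.append(text(res, "name"))
            continue
        key = (text(wu, "app_name"), text(res, "version_num"),
               text(res, "platform"), text(res, "plan_class"))
        if key not in known:
            bad.append(text(res, "name"))
    return bad


# Hypothetical, heavily trimmed state with one good task and one bad one.
SAMPLE = """\
<client_state>
  <app_version>
    <app_name>GetDecics</app_name>
    <version_num>400</version_num>
    <platform>windows_x86_64</platform>
    <plan_class>default</plan_class>
  </app_version>
  <workunit>
    <name>wu_1</name>
    <app_name>GetDecics</app_name>
    <version_num>400</version_num>
  </workunit>
  <workunit>
    <name>wu_2</name>
    <app_name>GetDecics</app_name>
    <version_num>400</version_num>
  </workunit>
  <result>
    <name>wu_1_0</name>
    <wu_name>wu_1</wu_name>
    <version_num>400</version_num>
    <platform>windows_x86_64</platform>
    <plan_class>default</plan_class>
  </result>
  <result>
    <name>wu_2_0</name>
    <wu_name>wu_2</wu_name>
    <version_num>400</version_num>
    <platform>windows_x86_64</platform>
    <plan_class>cuda30</plan_class>
  </result>
</client_state>
"""

if __name__ == "__main__":
    print(unmatched_tasks(SAMPLE))
```

Running a check like this after a manual edit, and before restarting the client, is one way to catch a task whose tags no longer point at an installed app version.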
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1341
Credit: 484,221,081
RAC: 575,124
Message 2986 - Posted: 6 Jan 2021, 4:59:08 UTC - in response to Message 2984.  
Last modified: 6 Jan 2021, 5:02:53 UTC

I will have to start using this trick. Thanks!
klepel

Joined: 31 Oct 18
Posts: 2
Credit: 17,788,613
RAC: 15,443
Message 2987 - Posted: 6 Jan 2021, 18:04:54 UTC

I tested on this computer: https://numberfields.asu.edu/NumberFields/show_host_detail.php?hostid=2674300

and immediately received this error:
<core_client_version>7.8.3</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
GPU Summary String = [CUDA|GeForceGTX1070|1|4095MB|45545|102].
Loading GPU lookup table from file.
GPU found in lookup table:
  GPU Name = GTX1070.
  numBlocks = 1600.
  threadsPerBlock = 512.
  polyBufferSize = 819200.
Setting GPU device number 0.
Cuda initialization was successful.
CHECKPOINT_FILE = wu_sf3_DS-16x270_Grp3896209of3932160_checkpoint.
Checkpoint Flag = 0.
Reading file ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-16x270_Grp3896209of3932160.dat
    K = x^2 - 2
    S = [2, 5]
    Disc Bound = 80000000000000000
    Skip = (P^2)*(Q^7)
    Num Congruences = 5
    SCALE = 1.000000
    |dK| = 8
    Signature = [2,0]
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-16x270_Grp3896209of3932160_0_r526772531_0
Now starting the targeted Martinet search:
  Num Cvecs = 5.
    Doing Cvec 1.
Error code 701: too many resources requested for launch file polDiscTest_gpuCuda.cu line 2330.
polDisc Test had an error. Aborting.

</stderr_txt>
]]>
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1341
Credit: 484,221,081
RAC: 575,124
Message 2988 - Posted: 6 Jan 2021, 21:28:59 UTC - in response to Message 2987.  
Last modified: 6 Jan 2021, 21:29:51 UTC

I think the lookup table has settings that are too high for the 1070. In the projects directory (where the executable is) you will find gpuLookupTable_v401.txt. If you edit the line for GTX1070 and set threadsPerBlock=32, I think it will fix your problem. You could even raise numBlocks to 4096 and you should still be good. What's important is that threadsPerBlock must be a multiple of 32.

The current settings for the 1070 were optimal for another user who helped test that card (I don't have a 1070). His card may have had more RAM than yours, and I subsequently made some changes to the code which require more RAM than before.
CosminZ

Joined: 20 Jun 12
Posts: 3
Credit: 51,772,719
RAC: 0
Message 2989 - Posted: 6 Jan 2021, 21:46:11 UTC
Last modified: 6 Jan 2021, 21:52:08 UTC

I get the same error on an RTX 2070m. I tested with the GTX 1070 settings, and after reducing numBlocks to 1024 I still got the same error.


<core_client_version>7.16.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
GPU Summary String = [CUDA|GeForceRTX2070|1|4095MB|45545|102].
Loading GPU lookup table from file.
GPU found in lookup table:
GPU Name = RTX2070.
numBlocks = 1024.
threadsPerBlock = 512.
polyBufferSize = 524288.
Setting GPU device number 0.
Cuda initialization was successful.
CHECKPOINT_FILE = wu_sf3_DS-16x271-1_Grp17910of2000000_checkpoint.
Checkpoint Flag = 0.
Reading file ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-16x271-1_Grp17910of2000000.dat
K = x^2 - 2
S = [2, 5]
Disc Bound = 2000000000000000000
Skip = (P^3)*(Q^5)
Num Congruences = 25
SCALE = 1.000000
|dK| = 8
Signature = [2,0]
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-16x271-1_Grp17910of2000000_0_r1623407575_0
Now starting the targeted Martinet search:
Num Cvecs = 25.
Doing Cvec 1.
Doing Cvec 2.
Doing Cvec 3.
Doing Cvec 4.
Doing Cvec 5.
Doing Cvec 6.
Doing Cvec 7.
Doing Cvec 8.
Doing Cvec 9.
Doing Cvec 10.
Error code 701: too many resources requested for launch file polDiscTest_gpuCuda.cu line 2298.
polDisc Test had an error. Aborting.

</stderr_txt>
]]>


threadsPerBlock must be a multiple of 32, but what about numBlocks? How does it relate to the number of CUDA cores or some other value?

I tested threadsPerBlock at 64, 128 and 256, and all of them work.
Is there a way to rerun the same workunit to see whether a changed setting is an improvement? Run times differ between workunits, so comparing different workunits doesn't tell me much.
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1341
Credit: 484,221,081
RAC: 575,124
Message 2990 - Posted: 6 Jan 2021, 23:51:50 UTC - in response to Message 2989.  

threadsPerBlock must be a multiple of 32, but what about numBlocks? How does it relate to the number of CUDA cores or some other value?

I tested threadsPerBlock at 64, 128 and 256, and all of them work.
Is there a way to rerun the same workunit to see whether a changed setting is an improvement? Run times differ between workunits, so comparing different workunits doesn't tell me much.


The optimal threads per block is a multiple of 32 because that is the warp size, which is the number of threads that are run in lock-step.

numBlocks is not as easy, but in general it should be as large as possible in order to get higher utilization. I found that setting numBlocks to a multiple of the number of CUDA cores is a good starting point.

As you guessed, the key to optimization is running the same WU with various settings to see which works best. To avoid tuning to one specific WU, I usually run several WUs and find the setting that minimizes the sum of their run times. In the near future (3 to 5 days) I will make the GitHub project public, which will make testing much easier for volunteers. In the meantime, if you set threadsPerBlock to a multiple of 32 and numBlocks to a multiple of the CUDA core count, and your utilization is >95%, then you are probably close to optimal.
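The tuning recipe above (multiples of 32 for threadsPerBlock, multiples of the CUDA core count for numBlocks, then whichever setting minimizes the summed run time over several WUs) can be sketched as follows. The core count, the candidate ranges and the timing numbers are made up for illustration; real timings would come from running the same WUs under each setting.

```python
WARP_SIZE = 32  # number of threads that run in lock-step


def candidate_settings(cuda_cores, max_threads=256, max_core_multiple=2):
    """Yield (numBlocks, threadsPerBlock) pairs that follow the rule of thumb."""
    for k in range(1, max_core_multiple + 1):
        for threads in range(WARP_SIZE, max_threads + 1, WARP_SIZE):
            yield (k * cuda_cores, threads)


def best_setting(run_times):
    """Pick the setting whose run times, summed over all test WUs, are smallest.

    run_times maps (numBlocks, threadsPerBlock) to a list of seconds, one entry
    per WU, using the same WUs in the same order for every setting.
    """
    return min(run_times, key=lambda setting: sum(run_times[setting]))


# Hypothetical measurements for three WUs on a card with 2048 CUDA cores.
TIMINGS = {
    (2048, 64): [410.0, 395.0, 620.0],
    (2048, 128): [380.0, 402.0, 575.0],
    (4096, 128): [372.0, 390.0, 590.0],
}

if __name__ == "__main__":
    print(best_setting(TIMINGS))
```

Summing over several WUs keeps one unusually short or long WU from deciding the outcome: in the made-up numbers, (2048, 128) wins on the third WU alone but loses on the total.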
klepel

Joined: 31 Oct 18
Posts: 2
Credit: 17,788,613
RAC: 15,443
Message 2991 - Posted: 8 Jan 2021, 1:43:32 UTC - in response to Message 2988.  

If you edit the line for GTX1070 and set threadsPerBlock=32, I think it will fix your problem.

This did the trick!

It runs with 256 without any problem. If I change to 512, it errors immediately.
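One possible explanation for 256 working while 512 fails, offered as an assumption rather than something confirmed in this thread: CUDA error 701 is `cudaErrorLaunchOutOfResources`, which often means the kernel needs more registers per thread than a block of that size can be given. The sketch below assumes a budget of 65,536 32-bit registers per block and a cap of 255 registers per thread, and it ignores register allocation granularity. The kernel's actual register usage is not known here, so the 160 used in the example is purely hypothetical.

```python
REGS_PER_BLOCK = 65536     # assumed per-block register budget (32-bit registers)
MAX_REGS_PER_THREAD = 255  # assumed per-thread cap


def max_regs_per_thread(threads_per_block: int) -> int:
    """Upper bound on the registers each thread may use for a launch to fit."""
    return min(MAX_REGS_PER_THREAD, REGS_PER_BLOCK // threads_per_block)


def launch_fits(threads_per_block: int, kernel_regs_per_thread: int) -> bool:
    """True if a kernel with the given register usage fits in a block of this size."""
    return kernel_regs_per_thread <= max_regs_per_thread(threads_per_block)


if __name__ == "__main__":
    for threads in (64, 128, 256, 512):
        # 160 registers per thread is a hypothetical figure, not a measured one.
        print(threads, max_regs_per_thread(threads), launch_fits(threads, 160))
```

Under these assumptions, any kernel using more than 128 registers per thread would launch at 256 threads per block but fail at 512, which matches the behaviour reported above.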
Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1341
Credit: 484,221,081
RAC: 575,124
Message 2992 - Posted: 8 Jan 2021, 6:52:29 UTC - in response to Message 2991.  

If you edit the line for GTX1070 and set threadsPerBlock=32, I think it will fix your problem.

This did the trick!

It runs with 256 without any problem. If I change to 512, it errors immediately.


Good to hear!

Copyright © 2024 Arizona State University