Message boards : News : New and improved apps coming soon
Joined: 28 Oct 11 Posts: 180 Credit: 239,311,034 RAC: 89,002
No need to apologise - it was my decision to upgrade. The machine was largely idle because GPUGrid has come to the end of its current research run, and I'm not really enthused by make-work projects like Collatz and PrimeGrid. I'd been thinking of upgrading for a while, and you gave me an excuse to get my brain into gear after the holidays.

It's generally running smoothly, but I hit two tasks today which seemed to get stuck in an endless loop:

Task 105827986 (from wu_sf3_DS-16x270_Grp3738460of3932160)
Task 105983915 (from wu_sf3_DS-16x270_Grp3639160of3932160)

I don't monitor the GPU loadings, but I do run a monitor which displays the 'CPU efficiency' of the task - the percentage of time the CPU is under load. That's typically 20% for the Linux GPU tasks, but it dropped well below 10% for these: elapsed time continued to rise, but task progress froze at 90.880% and 90.629% respectively. I paused them a couple of times each (which for GPU tasks removes them completely from memory): they re-started from checkpoint OK, but froze again at the same point. In the end, I aborted them.

I noticed the lookup file, but didn't explore it in detail. BOINC itself concentrates on the "compute capability", which is invariant for each iteration of the NVidia architecture - it determines such things as the number of shaders per multiprocessor, and the minimum CUDA level required in the driver. Do you plan to document your usage anywhere?
Joined: 8 Jul 11 Posts: 1341 Credit: 484,221,081 RAC: 575,124
No need to apologise - it was my decision to upgrade. The machine was largely idle because GPUGrid has come to the end of its current research run, and I'm not really enthused by make-work projects like Collatz and PrimeGrid. I'd been thinking of upgrading for a while, and you gave me an excuse to get my brain into gear after the holidays.

I did see the problem with the app sometimes freezing. I spent what little time I had today debugging that, but I found the problem and have a fix for it. Unfortunately, I have not had the time to fix the GLIBC version problem yet. That should come soon and then I'll package both fixes in one update.

At some point I'll document the lookup table, probably when I migrate the code to github.
Joined: 8 Jul 11 Posts: 1341 Credit: 484,221,081 RAC: 575,124
I got both fixes in. The new cuda app is version 4.01. According to objdump, it requires version 2.17 or later of glibc, so I think it should be good.
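For reference, the objdump check mentioned here typically looks something like the sketch below (the executable name is just a placeholder, not the app's real file name):

    # List the GLIBC symbol versions the binary references; the highest is the minimum glibc it needs at runtime.
    objdump -T ./GetDecics_cuda_binary | grep -o 'GLIBC_[0-9.]*' | sort -uV | tail -1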
Joined: 8 Jul 11 Posts: 1341 Credit: 484,221,081 RAC: 575,124
I've converted the GPU tasks back to run on the CPU, so they won't be wasted or need to be resent.

Richard, I forgot to ask you what you meant by the above statement. Is this a newer feature of the client that I am not aware of? There have been times when I wished I could do such a conversion.
Joined: 28 Oct 11 Posts: 180 Credit: 239,311,034 RAC: 89,002
I've converted the GPU tasks back to run on the CPU, so they won't be wasted or need to be resent.

No, it's not a public feature of the client - it's just making use of the way BOINC stores the data defining the tasks in the cache. For each task, there's a <workunit> and a <result> xml chunk in client_state.xml. At the time I wrote that, the only difference between a GPU task and a CPU task was the <plan_class> line in the <result> chunk. Text editor, search'n'replace, and 'cuda30' turned into 'default'. Voila.

The new cuda app is version 4.01

Now, we'll have to tweak the <version_num> lines in both <workunit> and <result>, but it's still doable - it just takes care and understanding. As we used to say, "For advanced users only. At your own risk." But I've still got a Mint 19 machine, so I can try the new app.
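A minimal sketch of the kind of edit being described, made with the BOINC client shut down (the task name below is made up, and a real <result> chunk contains many more tags than shown):

    <result>
        <name>wu_sf3_example_task_0</name>
        ...
        <plan_class>cuda30</plan_class>  <!-- change to <plan_class>default</plan_class> to run it on the CPU -->
    </result>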
Joined: 28 Oct 11 Posts: 180 Credit: 239,311,034 RAC: 89,002
Yes, the new version 4.01 is running fine on 'Linux Mint 19.1 Tessa' (host 1697845).
Joined: 8 Jul 11 Posts: 1341 Credit: 484,221,081 RAC: 575,124
Now, we'll have to tweak the <version_num> lines in both <workunit> and <result>, but it's still doable - it just takes care and understanding. As we used to say, "For advanced users only. At your own risk."

Ah, so your trick also allows you to change version numbers. Very cool. Last night I had to abort over 100 tasks that were assigned to the older version. Not sure how feasible it would be to implement in general, but that functionality would be a nice addition to the client.
Joined: 28 Oct 11 Posts: 180 Credit: 239,311,034 RAC: 89,002
We used to use that trick a lot at SETI@Home, which was open-source from the very beginning. The volunteers (collectively) had far more time for optimising the various apps than the project staff, and the staff actively encouraged volunteers to help speed things up - provided quality and accuracy were maintained.

There are four main tags to watch for:

<app_name>GetDecics</app_name>
<version_num>400</version_num>
<platform>windows_x86_64</platform>
<plan_class>default</plan_class>

Two appear in <workunit>, and three appear in <result> (<version_num> appears in both). The complete set of four has to be consistent for each task, and they have to match an <app_version> which is already defined on your system. Apart from that, it's easy....

Some people even devised scripts or applications to automate the process - look for the term 'reschedule' at SETI.
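As a rough illustration of what that consistency requirement looks like in client_state.xml (the workunit and task names below are made up, and the real chunks contain many more tags than shown), the edited task has to line up with an <app_version> block that is already present:

    <app_version>
        <app_name>GetDecics</app_name>
        <version_num>400</version_num>
        <platform>windows_x86_64</platform>
        <plan_class>default</plan_class>
        ...
    </app_version>

    <workunit>
        <name>wu_example</name>
        <app_name>GetDecics</app_name>
        <version_num>400</version_num>
        ...
    </workunit>

    <result>
        <name>wu_example_0</name>
        <wu_name>wu_example</wu_name>
        <version_num>400</version_num>
        <plan_class>default</plan_class>
        <platform>windows_x86_64</platform>
        ...
    </result>

That is, two of the four tags appear under <workunit> and three under <result>, as described above.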
Joined: 8 Jul 11 Posts: 1341 Credit: 484,221,081 RAC: 575,124
We used to use that trick a lot at SETI@Home, which was open-source from the very beginning. The volunteers (collectively) had far more time for optimising the various apps than the project staff, and the staff actively encouraged volunteers to help speed things up - provided quality and accuracy were maintained.

I will have to start using this trick. Thanks!
Joined: 31 Oct 18 Posts: 2 Credit: 17,788,613 RAC: 15,443
I tested on this computer: https://numberfields.asu.edu/NumberFields/show_host_detail.php?hostid=2674300 and immediately received this error:

<core_client_version>7.8.3</core_client_version>
<![CDATA[
<message> process exited with code 1 (0x1, -255)</message>
<stderr_txt>
GPU Summary String = [CUDA|GeForceGTX1070|1|4095MB|45545|102].
Loading GPU lookup table from file.
GPU found in lookup table: GPU Name = GTX1070.
numBlocks = 1600.
threadsPerBlock = 512.
polyBufferSize = 819200.
Setting GPU device number 0.
Cuda initialization was successful.
CHECKPOINT_FILE = wu_sf3_DS-16x270_Grp3896209of3932160_checkpoint.
Checkpoint Flag = 0.
Reading file ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-16x270_Grp3896209of3932160.dat
K = x^2 - 2
S = [2, 5]
Disc Bound = 80000000000000000
Skip = (P^2)*(Q^7)
Num Congruences = 5
SCALE = 1.000000
|dK| = 8
Signature = [2,0]
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-16x270_Grp3896209of3932160_0_r526772531_0
Now starting the targeted Martinet search:
Num Cvecs = 5.
Doing Cvec 1.
Error code 701: too many resources requested for launch file polDiscTest_gpuCuda.cu line 2330.
polDisc Test had an error. Aborting.
</stderr_txt>
]]>
Joined: 8 Jul 11 Posts: 1341 Credit: 484,221,081 RAC: 575,124
I tested on this computer: https://numberfields.asu.edu/NumberFields/show_host_detail.php?hostid=2674300

I think the lookup table has settings that are too high for the 1070. In the projects directory (where the executable is) you will find gpuLookupTable_v401.txt. If you edit the line for GTX1070 and set threadsPerBlock=32, I think it will fix your problem. You could even raise numBlocks to 4096 and you should still be good. What's important is that threadsPerBlock must be a multiple of 32.

The current settings for the 1070 were optimal for another user who helped test that card (I don't have a 1070). His card may have had more RAM than yours, and I subsequently made some changes to the code which require more RAM than before.
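The exact column layout of gpuLookupTable_v401.txt isn't shown in this thread, so the line below is only a hypothetical sketch built from the fields echoed in stderr (GPU name, numBlocks, threadsPerBlock) - check the file's own format before editing it:

    GTX1070   numBlocks=4096   threadsPerBlock=32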
Joined: 20 Jun 12 Posts: 3 Credit: 51,772,719 RAC: 0
Same error on an RTX 2070 (mobile), tested with the settings from the GTX 1070; after reducing numBlocks to 1024 I got the same error.

<core_client_version>7.16.14</core_client_version>
<![CDATA[
<message> process exited with code 1 (0x1, -255)</message>
<stderr_txt>
GPU Summary String = [CUDA|GeForceRTX2070|1|4095MB|45545|102].
Loading GPU lookup table from file.
GPU found in lookup table: GPU Name = RTX2070.
numBlocks = 1024.
threadsPerBlock = 512.
polyBufferSize = 524288.
Setting GPU device number 0.
Cuda initialization was successful.
CHECKPOINT_FILE = wu_sf3_DS-16x271-1_Grp17910of2000000_checkpoint.
Checkpoint Flag = 0.
Reading file ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-16x271-1_Grp17910of2000000.dat
K = x^2 - 2
S = [2, 5]
Disc Bound = 2000000000000000000
Skip = (P^3)*(Q^5)
Num Congruences = 25
SCALE = 1.000000
|dK| = 8
Signature = [2,0]
Opening output file ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-16x271-1_Grp17910of2000000_0_r1623407575_0
Now starting the targeted Martinet search:
Num Cvecs = 25.
Doing Cvec 1.
Doing Cvec 2.
Doing Cvec 3.
Doing Cvec 4.
Doing Cvec 5.
Doing Cvec 6.
Doing Cvec 7.
Doing Cvec 8.
Doing Cvec 9.
Doing Cvec 10.
Error code 701: too many resources requested for launch file polDiscTest_gpuCuda.cu line 2298.
polDisc Test had an error. Aborting.
</stderr_txt>
]]>

threadsPerBlock must be a multiple of 32, but what about numBlocks? How does it relate to the number of CUDA cores, or to some other value? I tested threadsPerBlock at 64, 128 and 256, and all of them work. Is there a way to re-run the same workunit to check for improvement when changing a setting, given the run-time differences between workunits?
Joined: 8 Jul 11 Posts: 1341 Credit: 484,221,081 RAC: 575,124
threadsPerBlock must be a multiple of 32, but what about numBlocks? How does it relate to the number of CUDA cores, or to some other value?

The optimal threads per block is a multiple of 32 because that is the warp size, which is the number of threads that are run in lock-step. numBlocks is not as easy, but in general it should be as large as possible in order to get higher utilization. I found that setting numBlocks to a multiple of the cuda cores is a good starting point.

As you guessed, the key to optimization is running the same WU with various settings to see which works best. And so we are not tuning to a specific WU, I usually run several WUs and find the setting that minimizes the sum of their run times. In the near future (3 to 5 days) I will make the github project public, which will make testing much easier for volunteers. In the meantime, if you set threadsPerBlock to a multiple of 32 and numBlocks to a multiple of the cuda core count, and your utilization is >95%, then you are probably close to optimal.
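As a rough worked example of those rules of thumb (the core counts are NVIDIA's published figures, so treat them as assumptions for your particular card): a GTX 1070 has 1920 CUDA cores and a desktop RTX 2070 has 2304, so hedged starting points would be

    GTX 1070: threadsPerBlock = 256 (a multiple of 32), numBlocks = 1920 or 3840
    RTX 2070: threadsPerBlock = 256, numBlocks = 2304 or 4608

and then time the same few WUs at neighbouring values to see which combination wins.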
Joined: 31 Oct 18 Posts: 2 Credit: 17,788,613 RAC: 15,443
If you edit the line for GTX1070 and set threadsPerBlock=32, I think it will fix your problem.

This did the trick! It runs with 256 without any problem. If I change it to 512, it errors immediately.
Joined: 8 Jul 11 Posts: 1341 Credit: 484,221,081 RAC: 575,124
If you edit the line for GTX1070 and set threadsPerBlock=32, I think it will fix your problem.

Good to hear!