GPU app status update

Author	Message
Eric Driver Project administrator Project developer Project tester Project scientist Send message Joined: 8 Jul 11 Posts: 1453 Credit: 1,046,605,489 RAC: 3,011,030	Message 2360 - Posted: 5 Apr 2019, 23:40:01 UTC So there have been some new developments over the last week. It's both good and bad. First of all, some history. The reason I waited so long to develop a GPU app is because the calculation was heavily dependent on multi-precision libraries (gmp) and number theoretic libraries (pari/gp). Both of these use dynamically allocated memory which is a big no-no in GPUs. I found a multi-precision library online that I could use by hard coding the precision to the maximum required (about 750 bits), thereby removing the dependence on memory allocations. The next piece of the puzzle was to code up a polynomial discriminant function. After doing this, I could finally compile a kernel for the GPU. That is the history for the current GPU app. It is about 20 to 30 times faster than the current cpu version (depends on WU and cpu/gpu speeds). But then I got thinking... my GPU polynomial discriminant algorithm is different from the one in the PARI library (theirs works for any degree and mine is specialized to degree 10). So to do a true apples-to-apples comparison, I replaced the PARI algorithm with mine in the cpu version of the code. I was shocked by what I found... the cpu version was now about 10x faster than it used to be. I never thought I was capable of writing an algorithm that would be 10x faster than a well established library function. WTF? Now I'm kicking myself in the butt for not having done this sooner! This brings mixed emotions. On one side, it is great that I now have a cpu version that is 10x faster. But it also means that my GPU code is total crap. With all the horsepower in a present day GPU I would expect it to be at least 10x faster than the equivalent cpu version. Compared with the new cpu version, the gpu is only 2 to 3 times faster. That is unacceptable. So the new plan is as follows: 1. Deploy new cpu executables. Since it's 10x faster, I will need to drop the credit by a factor of 10. (Credits/hour will remain the same for the cpu but will obviously drop for the GPU) 2. Develop new and improved GPU kernels. I don't blame the GPU users for jumping ship at this point. Frankly, the inefficiency of the current GPU app just makes it not worth it (for them or the project). For what it's worth, I did have openCL versions built. Nvidia version works perfectly. The AMD version is buggy for some reason, as is the windows version. Since I will be changing the kernels anyways, there is no point in debugging them yet. ID: 2360 · Rating: 0 · rate: / Reply Quote

nedmanjo Send message Joined: 10 Sep 17 Posts: 2 Credit: 3,220,977 RAC: 0	Message 2361 - Posted: 5 Apr 2019, 23:56:41 UTC - in response to Message 2360. Last modified: 5 Apr 2019, 23:58:55 UTC Actually, that's great news! An optimized CPU app and GPU app as well. Can I infer a AMD app will be available in time as well? That would be great! By the way, any sort of time line for deploying the new apps? ID: 2361 · Rating: 0 · rate: / Reply Quote

Eric Driver Project administrator Project developer Project tester Project scientist Send message Joined: 8 Jul 11 Posts: 1453 Credit: 1,046,605,489 RAC: 3,011,030	Message 2362 - Posted: 6 Apr 2019, 0:30:47 UTC - in response to Message 2361. Actually, that's great news! An optimized CPU app and GPU app as well. Can I infer a AMD app will be available in time as well? That would be great! By the way, any sort of time line for deploying the new apps? I just deployed the new cpu apps. Version 3.00. Feel free to abort any WUs associated with the older versions (2.xx). Not sure the best way to transition the credit value. If I change it now then late returns are penalized. If I wait then quick turn arounds will be overly rewarded. And new GPU apps are weeks away. ID: 2362 · Rating: 0 · rate: / Reply Quote

Eric Driver Project administrator Project developer Project tester Project scientist Send message Joined: 8 Jul 11 Posts: 1453 Credit: 1,046,605,489 RAC: 3,011,030	Message 2363 - Posted: 6 Apr 2019, 0:52:06 UTC - in response to Message 2362. I am temporarily going back to credit based on runtime. Once everyone has had a chance to settle in with the super fast cpu app, I will go back to fixed credit per wu. I think this is the fairest way to handle credits during the transition period. ID: 2363 · Rating: 0 · rate: / Reply Quote

Michael H.W. Weber Send message Joined: 30 Apr 18 Posts: 12 Credit: 38,505,377 RAC: 14,961	Message 2364 - Posted: 6 Apr 2019, 7:48:22 UTC - in response to Message 2363. Last modified: 6 Apr 2019, 7:49:52 UTC To my knowledge there is not a single project where the credit system had been adapted after deploying an improved client - if I was you, I wouldn't change anything. The credits inbetween projects are anyway not comparable and deployment of a new client version affects all project participants in the same way. Morevoer, there are several reasons to argue that even within projects CPU vs. GPU credits and even credits generated by different types of CPU architectures (ARM / AMD / Intel / ...) pose an issue. So, please focus on the research results and further (GPU?) client improvements. Michael. President of Rechenkraft.net ID: 2364 · Rating: 0 · rate: / Reply Quote

Julien Send message Joined: 14 Sep 13 Posts: 12 Credit: 2,980,366 RAC: 0	Message 2365 - Posted: 6 Apr 2019, 7:48:47 UTC Hello, I didn't find the answer in FAQ so just asking here: did you plan to put code (for Cpu and Gpu Nvidia or Amd) on github/gitlab or similar so people may contribute? Eg: I use cppcheck (a C/C++ static analyzer) to find some bugs. ID: 2365 · Rating: 0 · rate: / Reply Quote

Michael H.W. Weber Send message Joined: 30 Apr 18 Posts: 12 Credit: 38,505,377 RAC: 14,961	Message 2366 - Posted: 6 Apr 2019, 7:54:57 UTC - in response to Message 2365. Hello, I didn't find the answer in FAQ so just asking here: did you plan to put code (for Cpu and Gpu Nvidia or Amd) on github/gitlab or similar so people may contribute? Eg: I use cppcheck (a C/C++ static analyzer) to find some bugs. This indeed is a good idea - and again highlights a problem with the credits: Just check out how many projects already have optimized clients coded by third party people that are producing more credits/hour compared to the projects own software. Following Erics arguments above, even here credit system adaptations would be required - to my knowledge, again, nowhere this is put into practice... Michael. President of Rechenkraft.net ID: 2366 · Rating: 0 · rate: / Reply Quote

UBT - Timbo Send message Joined: 30 Dec 13 Posts: 4 Credit: 10,576,898 RAC: 648	Message 2368 - Posted: 6 Apr 2019, 14:05:36 UTC - in response to Message 2360. Last modified: 6 Apr 2019, 14:06:01 UTC So the new plan is as follows: 1. Deploy new cpu executables. Since it's 10x faster, I will need to drop the credit by a factor of 10. (Credits/hour will remain the same for the cpu but will obviously drop for the GPU) Hi In the past, I think that the NumberFields tasks took my PCs about 4 hours to complete and from my notes a while back the fixed credits were about 370 per completed task - so that's about 1.5 credits per minute.(I can't see any of my old results on the project so I'm not 100% sure of this). Using the v3.0 CPU app, the 2 PCs I've run Numberfields on (since last night) are earning at around 0.667 and 0.835 credits per minute (respectively). Am I using the wrong "old" data and making an incorrect assumption that the credits per hour are now less than the v2.x app? If so, then that would be a shame. On the other hand, I do hope your server(s) can cope with the increased number of tasks being downloaded as well as more frequent uploads being made. regards Tim ID: 2368 · Rating: 0 · rate: / Reply Quote

bcavnaugh Send message Joined: 4 Aug 14 Posts: 5 Credit: 6,721,320 RAC: 0	Message 2369 - Posted: 6 Apr 2019, 16:18:05 UTC - in response to Message 2368. Last modified: 6 Apr 2019, 16:20:13 UTC So the new plan is as follows: 1. Deploy new cpu executables. Since it's 10x faster, I will need to drop the credit by a factor of 10. (Credits/hour will remain the same for the cpu but will obviously drop for the GPU) Why? Other Projects that add GPU Apps gives us the Higher Credit for running GPU Tasks over CPU Tasks. ID: 2369 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 28 Oct 11 Posts: 182 Credit: 349,366,259 RAC: 244,212	Message 2370 - Posted: 6 Apr 2019, 17:05:00 UTC I was a bit taken aback to see the much shorter estimated runtime when I first saw my task list this morning, but once I'd focused on the version number and read this thread, all was explained. As it happens, I'd started a spreadsheet to measure the performance of my Windows machines in BOINC credit terms. I have three identical i5-4690 CPU @ 3.50GHz CPUs running Windows 7/64, but with different software loaded for different purposes: with version 2.12, they were recording 68, 70, 72 credits per hour with minuscule variation (st_dev down to 0.00073). Under version 3.00 - exactly the same! I don't know how you managed it, but that's the smoothest version upgrade I've ever seen. No problems with runtime estimates and over/under fetching, no interruption to work flow, no messy credit adjustments. The only thing I haven't checked yet is whether the more efficient application increases the power consumption of the CPU, but I'll check that later - I haven't got the watt-meter in circuit at the moment. I'd say that was a fair result. We are contributing the same hardware and (subject to checking) the same power, and we've done nothing to optimise our systems. You've done the work, and you've got the benefit in the form of a much increased result rate. Bravo, and well done. :-) ID: 2370 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 28 Oct 11 Posts: 182 Credit: 349,366,259 RAC: 244,212	Message 2371 - Posted: 6 Apr 2019, 17:26:00 UTC To reassure people with different recollections, I took version 2.12 credit readings from between 21 March and 25 March, before the first adjustments for the GPU release. There were between 47 and 75 results visible for the three machines. For version 3.00, I had between 27 and 40 results available per machine when I started updating the spreadsheet. Here are the raw figures, expressed as average credits per hour. Host v2.12 v3.00 1288 70.5627 70.9940 1290 68.0019 68.0024 1291 72.1462 72.1432 ID: 2371 · Rating: 0 · rate: / Reply Quote

Eric Driver Project administrator Project developer Project tester Project scientist Send message Joined: 8 Jul 11 Posts: 1453 Credit: 1,046,605,489 RAC: 3,011,030	Message 2372 - Posted: 6 Apr 2019, 18:41:49 UTC - in response to Message 2365. Hello, I didn't find the answer in FAQ so just asking here: did you plan to put code (for Cpu and Gpu Nvidia or Amd) on github/gitlab or similar so people may contribute? Eg: I use cppcheck (a C/C++ static analyzer) to find some bugs. I hadn't thought about that. Up until now, I have just emailed a tarball to anybody that wanted to help develop. I have extensive testing scripts that I run before deploying new executables and these include running on a private BOINC server. I dont know how all that would work in a github environment, and there would need to be some changes. Do any other projects develop in this way? I will checkout cppcheck when I get a chance. The bugs I was referring to are caused by different OpenCL implementations, because I coded as if it were normal C, but there are special rules that need to be followed for "OpenCL C". Nvidia's implementation seems to follow traditional C, so my code worked well there. As an example, I was passing an array of flags back to the host. I coded these as booleans. Nvidia had no problem, but to get it to work on AMD GPUs I had to change the booleans to chars. ID: 2372 · Rating: 0 · rate: / Reply Quote

Eric Driver Project administrator Project developer Project tester Project scientist Send message Joined: 8 Jul 11 Posts: 1453 Credit: 1,046,605,489 RAC: 3,011,030	Message 2373 - Posted: 6 Apr 2019, 19:01:21 UTC - in response to Message 2367. My 32-bit Linux machine is still using version 2.12. Are there plans to release version 3.00 apps for this platform (and 32-bit Windows)? So I just queried the database to get an idea of how many 32bit users there are. Here are the numbers over the last 5 days: Total WUs processed: 191,000 32 bit linux WUs: 141 (=.07%) 32 bit windows: 475 (=.25%) So I am not sure if its worth the effort to maintain these versions. I am adding it to my list of To-Dos, but it will be much lower priority. How old are these 32bit computers? Weren't the last 32bit machines back in the pentium days? Or are you running a VM in a newer machine? ID: 2373 · Rating: 0 · rate: / Reply Quote

Eric Driver Project administrator Project developer Project tester Project scientist Send message Joined: 8 Jul 11 Posts: 1453 Credit: 1,046,605,489 RAC: 3,011,030	Message 2374 - Posted: 6 Apr 2019, 19:09:34 UTC - in response to Message 2368. In the past, I think that the NumberFields tasks took my PCs about 4 hours to complete and from my notes a while back the fixed credits were about 370 per completed task - so that's about 1.5 credits per minute.(I can't see any of my old results on the project so I'm not 100% sure of this). Using the v3.0 CPU app, the 2 PCs I've run Numberfields on (since last night) are earning at around 0.667 and 0.835 credits per minute (respectively). Am I using the wrong "old" data and making an incorrect assumption that the credits per hour are now less than the v2.x app? If so, then that would be a shame. On the other hand, I do hope your server(s) can cope with the increased number of tasks being downloaded as well as more frequent uploads being made. regards Tim I recall seeing a small credit drop in January after upgrading the server, so there is probably some truth to your memories. It looks like Richard started recording data after that, so he wouldn't show the drop. ID: 2374 · Rating: 0 · rate: / Reply Quote

Eric Driver Project administrator Project developer Project tester Project scientist Send message Joined: 8 Jul 11 Posts: 1453 Credit: 1,046,605,489 RAC: 3,011,030	Message 2375 - Posted: 6 Apr 2019, 19:25:51 UTC - in response to Message 2370. I was a bit taken aback to see the much shorter estimated runtime when I first saw my task list this morning, but once I'd focused on the version number and read this thread, all was explained. As it happens, I'd started a spreadsheet to measure the performance of my Windows machines in BOINC credit terms. I have three identical i5-4690 CPU @ 3.50GHz CPUs running Windows 7/64, but with different software loaded for different purposes: with version 2.12, they were recording 68, 70, 72 credits per hour with minuscule variation (st_dev down to 0.00073). Under version 3.00 - exactly the same! I don't know how you managed it, but that's the smoothest version upgrade I've ever seen. No problems with runtime estimates and over/under fetching, no interruption to work flow, no messy credit adjustments. The only thing I haven't checked yet is whether the more efficient application increases the power consumption of the CPU, but I'll check that later - I haven't got the watt-meter in circuit at the moment. I'd say that was a fair result. We are contributing the same hardware and (subject to checking) the same power, and we've done nothing to optimise our systems. You've done the work, and you've got the benefit in the form of a much increased result rate. Bravo, and well done. :-) The reason the credit rates are so similar is the credit_from_runtime option. We may just have to stick with that, at least until the new GPU apps come out. I will be interested in seeing your power consumption analysis. My cpu monitor shows temps about 10 deg F higher. This may explain why the cpu version is so much more efficient than it used to be. In my version of the algorithm, I use gmp which is supposed to be highly efficient; the old version, being a PARI function, used PARI's built in multi-precision, which is probably less efficient. By efficiency, I mean keeping more of the data in on-chip cache instead of RAM. I have heard that RAM access is an order of magnitude slower than cache. ID: 2375 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 28 Oct 11 Posts: 182 Credit: 349,366,259 RAC: 244,212	Message 2376 - Posted: 6 Apr 2019, 20:05:22 UTC - in response to Message 2375. I will be interested in seeing your power consumption analysis. I'll dig them up, but it may take a while. I posted them on a message board, but I think it was SETI - which has crashed hard this weekend. And I tidied away my notes when I had a visitor last month: that's fatal, of course. ID: 2376 · Rating: 0 · rate: / Reply Quote

Julien Send message Joined: 14 Sep 13 Posts: 12 Credit: 2,980,366 RAC: 0	Message 2377 - Posted: 7 Apr 2019, 6:44:15 UTC Hello again, Thank you for your feedback. Also, perhaps it could be interesting you send your code about GPU polynomial discriminant algorithm to PARI authors. Indeed, it could help them, perhaps they could find some flaws but also they may have some idea to improve it even more! ID: 2377 · Rating: 0 · rate: / Reply Quote

[AF>Amis des Lapins] Jean-Luc Send message Joined: 16 May 12 Posts: 7 Credit: 318,661,689 RAC: 805,347	Message 2378 - Posted: 7 Apr 2019, 8:53:45 UTC - in response to Message 2377. Congratulations to Eric Driver for making the search much faster. This is incredible ! For credits, it's excellent right now. But when the GPU tasks come out, it will certainly be necessary to give a fixed credit for the tasks. A good solution might be to average all the credits given for the tasks currently and take this average as a fixed credit for the tasks. This should be very fair... There is no longer any hope now that it is possible to make GPU calculations significantly more efficient than CPU calculations. However, one thing worries me. If the calculations become 30 or 100 times more efficient, a GPU task will then take about 80 or 25 seconds, which is very short ! This will generate a lot of traffic ! ID: 2378 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 28 Oct 11 Posts: 182 Credit: 349,366,259 RAC: 244,212	Message 2379 - Posted: 7 Apr 2019, 13:20:53 UTC OK, SETI is back up, so I've recovered my readings from mid-February, and put my watt meter back into the same circuit. This is what I posted back then: I've long had a theory that it doesn't just matter whether you're using the CPU: it matters what you're doing with it, too. Since we're testing, I though I'd try to demonstrate that. My host 8121358 has an i5-6500 CPU @ 3.20GHz - a couple of generations old now. I plugged it into a Killa-watt meter when I first got it, and never got round to unplugging it again. Today's figures are: Idle - BOINC not running: 22 watts Running NumberFields on 4 cores: 55 watts Running SETI x64 AVX on 4 cores: 69 watts ditto at VHAR: 71 watts So, there's a significant difference between NumberFields@Home (primarily integer arithmetic) and the heavy use of the specialist floating point hardware by SETI. I've listed VHAR separately, because last time I tested this (about 10 years ago), I could see that VHAR put an extra load on the memory controller, too. (I kept both GPUs idle while I did that test) The computer is known as host 33342 here - CPU details as above. Obviously, I was running v2.12 back then: today's readings are Idle - BOINC not running: 22 watts Running NumberFields 3.00 on 4 cores: 60 watts There's been a BIOS update since the previous test, but the idle value didn't change - that's reassuring. As we expected, the new app is drawing more power, but not nearly as much as the SETI app. SETI is heavily into floating point maths, and like here uses a specialist external maths library - in their case, FFTW, or "The Fastest Fourier Transform in the West". SETI supplies this as an external library (DLL for Windows), and my understanding is that the library alone can detect host capability, and utilise SIMD instructions up to AVX if available, even if the calling application hasn't been compiled to use them in the main body of the program. The specific variant I tested does use AVX in both components, though. It might be worth reporting back your findings to the PARI people, and suggesting that they compare notes with FFTW to see if similar techniques could be employed. ID: 2379 · Rating: 0 · rate: / Reply Quote

Eric Driver Project administrator Project developer Project tester Project scientist Send message Joined: 8 Jul 11 Posts: 1453 Credit: 1,046,605,489 RAC: 3,011,030	Message 2380 - Posted: 7 Apr 2019, 19:01:47 UTC So there has been rampant abuse again by GPUs, so the credit system is in need of modification again. I feel like history is repeating itself here, but I am going back to CreditNew. The problem last time is CreditNew paid almost nothing to the GPUs, but I believe credits were fair for CPUs. This will put a stop to the obscene credits paid to GPUs for tasks that are only 2 to 3 times faster than the CPU. There was also an issue with outliers last time, which inhibited CreditNew from measuring proper stats and this has now been fixed. ID: 2380 · Rating: 0 · rate: / Reply Quote