Message boards :
Science :
Availability of source code
Send message Joined: 14 Sep 13 Posts: 12 Credit: 2,980,366 RAC: 0 |
Hello, Would it be possible to make the source code available (both the CPU and GPU programs) so that people can propose optimizations? Julien
Send message Joined: 8 Jul 11 Posts: 1342 Credit: 514,009,131 RAC: 581,148 |
Hello, Yes, thanks for reminding me. I have a private github repo that I've been meaning to make public. I'll try to do that later today. |
Send message Joined: 8 Jul 11 Posts: 1342 Credit: 514,009,131 RAC: 581,148 |
Ok, I have made the source code public. Here is the link to the github repo: https://github.com/drivere/get-decics-numberfields Please let me know if there are any questions or problems accessing it. |
Send message Joined: 14 Sep 13 Posts: 12 Credit: 2,980,366 RAC: 0 |
Hello, Thank you for the quick feedback! No problem accessing it. I hadn't expected a 2.4GB download! :-) I don't know if you received my message where I proposed a patch. (I haven't forked your repo yet to use the merge-request feature.)
Send message Joined: 8 Jul 11 Posts: 1342 Credit: 514,009,131 RAC: 581,148 |
Hello, It's probably big because I put the binaries into the repo. I did this to make it easy for novice users to just grab the binary and test it on their GPU without having to build it. I didn't see any message. Not sure where you messaged me. My inbox on github is empty and there are no new private messages on this site. |
Send message Joined: 14 Sep 13 Posts: 12 Credit: 2,980,366 RAC: 0 |
Hello, OK, I don't know what happened to my message. I wonder if hoisting invariants out of some loops may help a bit, plus adding a bit of redundancy to avoid repeating the same test n times. Here's a patch; I must admit I haven't even built it, since I haven't looked at the build mechanism yet.

```
diff --git a/GetDecicsSrc/TgtMartinet.cpp b/GetDecicsSrc/TgtMartinet.cpp
index e62c06a..014feaa 100755
--- a/GetDecicsSrc/TgtMartinet.cpp
+++ b/GetDecicsSrc/TgtMartinet.cpp
@@ -228,22 +228,45 @@ int TgtMartinet(char *FilenameIn, char *FilenameOut, int ChkPntFlag, pari_long C
   // sig1w[1] = sig1(w1) and sig1w[2] = sig1(w2)
   // Since K is quadratic, w1=1, so sig1w[1]=1 (the code below is left over from a more general version)
   // The same remarks also apply to sig2w[:].
-  for(a11=0;a11<=2;++a11)
+  bool b4DividesdK((dK%4)==0);
+  if (b4DividesdK)
   {
-    for(a12=0;a12<=2;++a12)
+    for(a11=0;a11<=2;++a11)
     {
-      if(a11==2 && a12==2 && (dK%4)!=0)  // a1=2+2w and 4 does not divide dK
+      auto tmp_gmulsg_a11_sig1w = gmulsg(a11,(GEN)sig1w[1]);
+      auto tmp_gmulsg_a11_sig2w = gmulsg(a11,(GEN)sig2w[1]);
+      for(a12=0;a12<=2;++a12)
       {
-        sig1a1 = gadd( gmulsg(-1,(GEN)sig1w[1]),gmulsg(a12,(GEN)sig1w[2]) );
-        sig2a1 = gadd( gmulsg(-1,(GEN)sig2w[1]),gmulsg(a12,(GEN)sig2w[2]) );
+        sig1a1 = gadd( tmp_gmulsg_a11_sig1w, gmulsg(a12,(GEN)sig1w[2]) );
+        sig2a1 = gadd( tmp_gmulsg_a11_sig2w, gmulsg(a12,(GEN)sig2w[2]) );
+        Ca1_pre[3*a11+a12] = ( gtodouble(gsqr(gabs(sig1a1,DEFAULTPREC))) +
+                               gtodouble(gsqr(gabs(sig2a1,DEFAULTPREC))) )/5.0;
       }
-      else
+    }
+  }
+  else
+  {
+    auto tmp_gmulsg_sig1w = gmulsg(-1,(GEN)sig1w[1]);
+    auto tmp_gmulsg_sig2w = gmulsg(-1,(GEN)sig2w[1]);
+    for(a11=0;a11<=2;++a11)
+    {
+      auto tmp_gmulsg_a11_sig1w = gmulsg(a11,(GEN)sig1w[1]);
+      auto tmp_gmulsg_a11_sig2w = gmulsg(a11,(GEN)sig2w[1]);
+      for(a12=0;a12<=2;++a12)
       {
-        sig1a1 = gadd( gmulsg(a11,(GEN)sig1w[1]),gmulsg(a12,(GEN)sig1w[2]) );
-        sig2a1 = gadd( gmulsg(a11,(GEN)sig2w[1]),gmulsg(a12,(GEN)sig2w[2]) );
+        if(a11==2 && a12==2)  // a1=2+2w and 4 does not divide dK
+        {
+          sig1a1 = gadd( tmp_gmulsg_sig1w, gmulsg(a12,(GEN)sig1w[2]) );
+          sig2a1 = gadd( tmp_gmulsg_sig2w, gmulsg(a12,(GEN)sig2w[2]) );
+        }
+        else
+        {
+          sig1a1 = gadd( tmp_gmulsg_a11_sig1w, gmulsg(a12,(GEN)sig1w[2]) );
+          sig2a1 = gadd( tmp_gmulsg_a11_sig2w, gmulsg(a12,(GEN)sig2w[2]) );
+        }
+        Ca1_pre[3*a11+a12] = ( gtodouble(gsqr(gabs(sig1a1,DEFAULTPREC))) +
+                               gtodouble(gsqr(gabs(sig2a1,DEFAULTPREC))) )/5.0;
       }
-      Ca1_pre[3*a11+a12] = ( gtodouble(gsqr(gabs(sig1a1,DEFAULTPREC))) +
-                             gtodouble(gsqr(gabs(sig2a1,DEFAULTPREC))) )/5.0;
     }
   }
```
Send message Joined: 8 Jul 11 Posts: 1342 Credit: 514,009,131 RAC: 581,148 |
Hello, I should mention that the vast majority of the compute time is spent inside the polynomial discriminant calculation, so the biggest improvement will be found inside the polDiscTest_*.cpp files (there are 3 versions: cpu, openCL, and cuda). The 2nd best place to look for improvement is the function Mart52Engine_Tgt(), especially inside the multitude of nested loops, and then the function testPolys, where a final set of tests is performed on the remaining polynomials. Note that these final tests involve computing the field discriminant, which is even more complex than the poly discriminant. It is so complex that I didn't even attempt to do it on the GPU. The good news is that the polDisc test filters out more than 99.99% of the polynomials, so the CPU can quickly test the few that remain. So unfortunately your patch would have almost no effect on the timing (that part of the code finishes instantaneously). But that was a good first try! Feel free to send me a private message to chat more; no need to bore the casual user with code discussion.
Send message Joined: 14 Sep 13 Posts: 12 Credit: 2,980,366 RAC: 0 |
Hello, Perhaps it has already been tried, but I thought auto-vectorization might help. This is typically the kind of computation where it can be applied. See: - https://en.wikipedia.org/wiki/Automatic_vectorization - https://www.codingame.com/playgrounds/283/sse-avx-vectorization/autovectorization - https://locklessinc.com/articles/vectorize/ (it works on my tablet but shows a blank page on my PC, though I use Firefox on both). If you already knew this and have tried to implement it by tuning the code a bit and/or using compiler flags, sorry for the noise. Regards, Julien
Send message Joined: 8 Jul 11 Posts: 1342 Credit: 514,009,131 RAC: 581,148 |
Hello, The compiler is configured for the highest level of optimization, so it may have done some form of vectorization already. If not, I guess it may be possible to refactor the code to take advantage of vectorization. But I don't know enough about it, and it seems like a lot of work with no guarantee of success. Maybe you know a guru who does this kind of stuff?
Send message Joined: 14 Sep 13 Posts: 12 Credit: 2,980,366 RAC: 0 |
I don't know any guru for this. I wanted to try a build because I noticed the option -fopt-info-vec-missed ("Detailed info about loops not being vectorized, and a lot of other detailed information"). So after building pari-2.8-1711-ge5c317c (make all failed on the doc part, but at least the libs were OK) and running make install, then make install for boinc, I failed to build get-decics-numberfields right at the start. I tried:

```
make -f ./GetDecicsSrc/Makefile.cpu_linux64
```

which gave:

```
make: *** No rule to make target 'GetDecics.cpp', needed by 'Olinux-x86_64/GetDecics.o'. Stop.
```

Then I also tried:

```
cd GetDecicsSrc
make Makefile.cpu_linux64
```

which gave:

```
g++ -O3 -Wall -fmax-errors=4 -I/home/eric/BOINC_Project/boinc -I/home/eric/BOINC_Project/boinc/lib -I/home/eric/BOINC_Project/boinc/api -I/home/eric/BOINC_Project/boinc/sched -I/home/eric/BOINC_Project/boinc/db -I/home/eric/BOINC_Project/Pari/pari-2.8-1711-ge5c317c/src/headers -I/home/eric/BOINC_Project/Pari/pari-2.8-1711-ge5c317c/Olinux-x86_64 -DAPP_VERSION_CPU_STD -c -o Olinux-x86_64/GetDecics.o GetDecics.cpp
GetDecics.cpp:8:12: fatal error: config.h: No such file or directory
    8 | # include "config.h"
      |            ^~~~~~~~~~
compilation terminated.
make: *** [Makefile.cpu_linux64:50: Olinux-x86_64/GetDecics.o] Error 1
```

Then I read this afterwards: "The makefiles will need to be modified so that the compiler can find the source files (both pari and boinc) and the libraries. There is also a dependency on the gmp library, so that will need to be installed. And obviously, openCL headers and libOpenCL will need to be present (and makefile changed accordingly)."

I'm more accustomed to configure/autogen (which complains about missing libs that must be installed to make it happy) followed by make. Given my limited knowledge of build configuration, I can't go further. Anyway, hopefully other people who know build configuration will be able to build this and try some options.
Send message Joined: 8 Jul 11 Posts: 1342 Credit: 514,009,131 RAC: 581,148 |
The change needed for the makefile is actually pretty simple. At the top there are several hard coded directories that need to be changed for your specific system. You'll notice all the -I/home/eric/... references in the build line - those should be pointing to directories on your system. |
Send message Joined: 14 Sep 13 Posts: 12 Credit: 2,980,366 RAC: 0 |
I finally succeeded in building it. The subdir "boinc" was empty, so I removed it and made a soft link with:

```
ln -s /home/julien/projects/boinc/
```

Then in Makefile.cpu_linux64 (since it's the one I use) I replaced

```
-CXXFLAGS = -O3 -Wall -fmax-errors=4
+CXXFLAGS = -O3 -Wall -fmax-errors=4 -fopt-info-vec-missed
```

and it built! I could then retrieve messages of this kind:

```
polDiscTest_cpu.cpp:87:23: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized
polDiscTest_cpu.cpp:571:19: missed: couldn't vectorize loop
polDiscTest_cpu.cpp:604:26: missed: couldn't vectorize loop
polDiscTest_cpu.cpp:604:26: missed: not vectorized: control flow in loop.
polDiscTest_cpu.cpp:613:11: missed: couldn't vectorize loop
polDiscTest_cpu.cpp:613:11: missed: not vectorized: latch block not empty.
polDiscTest_cpu.cpp:504:47: missed: statement clobbers memory: __gmpz_mul (&hPow, &h, &hPow);
polDiscTest_cpu.cpp:500:23: missed: couldn't vectorize loop
...
```

For the moment I see nothing that could be fixed, but I'm no auto-vectorization expert. Then I thought about adding this flag for the Pari lib and noticed that the version used is Pari 2.8, while https://pari.math.u-bordeaux.fr/download.html offers version 2.13. Perhaps there have been some improvements? (Of course, I suppose it would take some time and work to retrieve it, build it, and check that there is no regression and that it indeed improves performance.)
Send message Joined: 8 Jul 11 Posts: 1342 Credit: 514,009,131 RAC: 581,148 |
That's very interesting! It looks like it doesn't like having too many nested loops or if statements within loops. Not sure how to get around that. I wonder if it's vectorizing the innermost loops and how much that helps. |
Send message Joined: 14 Sep 13 Posts: 12 Credit: 2,980,366 RAC: 0 |
Hello, It certainly won't change performance, but here's a slight simplification of an initialization that can't hurt:

```
diff --git a/GetDecicsSrc/polDiscTest_cpu.cpp b/GetDecicsSrc/polDiscTest_cpu.cpp
index 6fa7280..7e0ee3c 100755
--- a/GetDecicsSrc/polDiscTest_cpu.cpp
+++ b/GetDecicsSrc/polDiscTest_cpu.cpp
@@ -72,12 +72,9 @@ polDiscTest_cpu(long long *polBuf, int numPolys, char *polGoodFlag, char *polyMa
 #ifdef COEFF_SIZE_TEST
   // maxBits[i][j] = max bits required when starting an iteration with degA=i and degB=j.
-  int maxBits[11][10];
-  int maxBitsGH[11][10];  // A separate measurement for g and h.
-  long freq[11][10];
-  for(int i=0; i<=10; ++i) {
-    for(int j=0; j<=9; ++j) { maxBits[i][j]=0; maxBitsGH[i][j]=0; freq[i][j]=0; }
-  }
+  int maxBits[11][10] = {0};
+  int maxBitsGH[11][10] = {0};  // A separate measurement for g and h.
+  long freq[11][10] = {0};
   mpz_t absB, absR, absGH;
   mpz_inits(absB, absR, absGH, NULL);
 #endif
```

(from here: https://slaystudy.com/initialize-2d-array-c-to-zero/)
Send message Joined: 14 Sep 13 Posts: 12 Credit: 2,980,366 RAC: 0 |
Another small patch:

```
diff --git a/GetDecicsSrc/polDiscTest_cpu.cpp b/GetDecicsSrc/polDiscTest_cpu.cpp
index 6fa7280..27b66f3 100755
--- a/GetDecicsSrc/polDiscTest_cpu.cpp
+++ b/GetDecicsSrc/polDiscTest_cpu.cpp
@@ -100,7 +100,7 @@ polDiscTest_cpu(long long *polBuf, int numPolys, char *polGoodFlag, char *polyMa
   // Compute the derivative of A. Call it B.
   // NOTE: B = 10*x^9 + 9*A[9]*x^8 + 8*A[8]*x^7 + ... + 2*A[2]*x + A[1]
   int64_t B[10];
-  for(int k = 1; k <= 10; k++) B[k-1] = k*A[k];
+  for(int k = 0; k < 10; k++) B[k] = (k+1)*A[k+1];
   // The discriminant of a monic decic is -1*Resultant(A,B).
   // We use Algorithm 3.3.7 in Cohen's Book (p.122).
```

Here's a small snippet of code to test this (C++11 style):

```
#include <chrono>
#include <cstdint>
#include <iostream>

void test1() {
    int64_t A[11] = {0};
    int64_t B[10] = {0};
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (int k = 1; k <= 10; k++) B[k-1] = k * A[k];
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Time difference = "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
              << "[ns]" << std::endl;
}

void test2() {
    int64_t A[11] = {0};
    int64_t B[10] = {0};
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (int k = 0; k < 10; k++) B[k] = (k+1) * A[k+1];
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Time difference = "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
              << "[ns]" << std::endl;
}

int main() {
    test1();
    test2();
}
```

Testing this, I get:

```
julien@debianamd:~/test/cpp$ ./a.out
Time difference = 130[ns]
Time difference = 70[ns]
julien@debianamd:~/test/cpp$ ./a.out
Time difference = 160[ns]
Time difference = 140[ns]
julien@debianamd:~/test/cpp$ ./a.out
Time difference = 140[ns]
Time difference = 60[ns]
julien@debianamd:~/test/cpp$ ./a.out
Time difference = 220[ns]
Time difference = 80[ns]
julien@debianamd:~/test/cpp$ ./a.out
Time difference = 130[ns]
Time difference = 70[ns]
julien@debianamd:~/test/cpp$ ./a.out
Time difference = 150[ns]
Time difference = 120[ns]
julien@debianamd:~/test/cpp$ ./a.out
Time difference = 160[ns]
Time difference = 80[ns]
```

So I know that with caching effects the difference between the methods varies, but the second one seems better.
Send message Joined: 14 Sep 13 Posts: 12 Credit: 2,980,366 RAC: 0 |
Still in the same vein:

```
diff --git a/GetDecicsSrc/polDiscTest_cpu.cpp b/GetDecicsSrc/polDiscTest_cpu.cpp
index 6fa7280..f152cd2 100755
--- a/GetDecicsSrc/polDiscTest_cpu.cpp
+++ b/GetDecicsSrc/polDiscTest_cpu.cpp
@@ -95,7 +95,7 @@ polDiscTest_cpu(long long *polBuf, int numPolys, char *polGoodFlag, char *polyMa
   // (i.e. A[0]*x^10 + A[1]*x^9 + ... + A[9]*x + A[10])
   // BUT FOR IMPLEMENTATION REASONS, WE REVERSE THE ORDER.
   int64_t A[11];
-  for(int col = 0; col < 11; col++) A[10-col] = polBuf[pIdx*11+col];
+  for(int col = 0; col < 11; col++) A[col] = polBuf[pIdx*11 + 10 - col];
   // Compute the derivative of A. Call it B.
   // NOTE: B = 10*x^9 + 9*A[9]*x^8 + 8*A[8]*x^7 + ... + 2*A[2]*x + A[1]
```

(Remark: I increased the number of array elements in my tests and repeated the same test several times; the modified version is sometimes a bit slower, so I hope I'm not on the wrong track.)
Send message Joined: 8 Jul 11 Posts: 1342 Credit: 514,009,131 RAC: 581,148 |
Hi Julien, Those are all good attempts; however, I think the improvement is down in the noise. Recall that an average job takes about an hour, so an improvement of nanoseconds is not even noticeable. That subroutine is called less than 1 million times per job, so the overall improvement is still in the milliseconds. As you can see, improving the performance is not an easy task...
Send message Joined: 1 Feb 12 Posts: 5 Credit: 11,790,180 RAC: 17,715 |
I am not an expert in computational number theory, but after reviewing the present code a bit, I think there is not much improvement to be had without an algorithmic advance or replacing GMP in the CPU code. The algorithmic part is clear. As for GMP: most of the computation is arithmetic done through GMP, which means the speed mostly depends on how GMP performs. However, GMP is a generic arbitrary-precision library, so there is internal bookkeeping on every operation to decide which algorithm to apply. This generally adds overhead, and probably also prevents any kind of vectorization except what is already inside GMP. One idea: from the CUDA code we can see that there is a natural bound on the numbers used in the computation. If they all fit in 128 bits, then gcc supports __int128. Otherwise, it might (!) help to use or write an emulated fixed-width type that the numbers fit in, like the one used in the CUDA code. That way, at least gcc would have more clues for vectorizing (or not!). fwjmath.
Send message Joined: 8 Jul 11 Posts: 1342 Credit: 514,009,131 RAC: 581,148 |
> I am not an expert in computational number theory, but after reviewing the present code a bit, I think there is not much improvement to be had without an algorithmic advance or replacing GMP in the CPU code. [...]

Thanks for the analysis. One thing I will point out is that the numbers grow to almost 1000 bits in size in the later passes through the loop.