Posts by Greg Tucker

21) Message boards : Number crunching : outch! (Message 603)
Posted 11 Apr 2012 by Greg Tucker
Post:
Looks like the boinc client is bombing. Can you run under strace or gdb?
22) Message boards : Number crunching : OpenCL/GPU app? (Message 579)
Posted 23 Mar 2012 by Greg Tucker
Post:

well, at least GMP-ECM has a working CUDA-branch, you don't happen to be using that lib?

primegrid recently started a world-record hunt for Generalized Fermat primes using CUDA apps. current runtimes are up to several days on the fastest GPUs, but would most likely take months on current high-end CPUs.

that app already owns the top 4 spots. ;)


No, we need the standard GMP. And even if GMP had cuda support, we still have the pari/gp library to deal with. We use pari for computing integral bases, checking irreducibility, computing field discriminants, and also factoring large integers. It's a large amount of code that would need to be "cudified". It would be a fun undertaking, if only I had the time.


Well... I'm not sure why you don't upload a copy of the code somewhere for users to have a look at. Maybe someone will have time to look at it and port it to use a GPU.

Actually, since you're using GPL'ed GMP and PARI, you'll have to release your source code upon request (consider this post a request ;) ), or otherwise you're violating the GPL.

TIA


You can link userspace apps to GPL libraries and not encumber under the GPL.

I'm all for posting the source though.

23) Message boards : Number crunching : abort? (Message 567)
Posted 19 Mar 2012 by Greg Tucker
Post:
Wiki updated for the confusion over units, though I don't think that's the real underlying problem.


Thanks Richard. That was quick.

I am pretty sure I was using "report_grace_period"; otherwise I wouldn't have questioned the units. I can't ssh into the project server from work, so I have a call in to Greg to take a look at the config file. He will be able to tell us exactly what is going on there.


It was report_grace_period. I changed to grace_period_hours.
24) Message boards : Number crunching : FUBAR! (Message 450)
Posted 5 Mar 2012 by Greg Tucker
Post:
i only want to run Get Bounded Decics.

i am not getting a single WU for days now.

should i stay or should i go?


Sorry about that. I think the weights were messed up again and confusing the server with fractional values. I just changed them to something that looks reasonable. It should be possible to just run the Get Bounded Decics app if you want to.
25) Message boards : Number crunching : ARIZONA - we got a problem.. (Message 444)
Posted 28 Feb 2012 by Greg Tucker
Post:
I won't be getting home until very late tonight, so I decided to turn the bounded app back on. This time I set the weights to 100-to-1. We will see if that has any effect.

I made the above change via the admin web interface. But I might need to restart the project daemons for the change to take effect, and I can't do this from work (no ssh allowed, don't ask me why). Anyway, I have a call in to the other admin, Greg, so that he can do it. Should happen relatively soon...

Eric


OK, I restarted.

26) Message boards : Number crunching : ETA for project (Message 433)
Posted 23 Jan 2012 by Greg Tucker
Post:
Good day everybody.

Is there a page with ETA for project?


According to Eric we could potentially go on forever. Check out his response in the science topics.

http://numberfields.asu.edu/NumberFields/forum_thread.php?id=5#8

27) Message boards : Number crunching : Runnitme discrepancy (Message 333)
Posted 13 Nov 2011 by Greg Tucker
Post:
Hi,
I had a few tasks that ran for, say, 30h.
When validated, the time credited was only 32000 sec, which is less than 9h.
Why is that?


Do you have a workunit number for these? The boinc manager measures the runtime and calculates the requested credit but we also print some times that we could cross check against.

28) Message boards : Number crunching : FATAL: Kernel too old (Message 330)
Posted 10 Nov 2011 by Greg Tucker
Post:
I have 2.6.32.

There shouldn't be any special kernel requirements for our project.

29) Message boards : Number crunching : Had to abort some WUs (Message 320)
Posted 4 Nov 2011 by Greg Tucker
Post:
Greg,

Here are the docs regarding resends (retries).

I would suggest setting the <reliable_reduced_delay_bound> to 1 rather than less than 1. A value less than 1 reduces the deadline on the resend to something less than the normal 3 day deadline. If the task is being resent because it's a 100 hour task that missed deadline it might not be wise to reduce the deadline, even if it will be sent to a fast-reliable host. You might even consider setting <reliable_reduced_delay_bound> to 1.33 to increase the resend's deadline from 3 days to 3.99.


Thanks, this looks like a reasonable thing to do. I'm a little confused by the documentation though. What would I put for <reliable_on_priority> to enable: 1? If so I will add the following.

<reliable_on_priority>1</reliable_on_priority>
<reliable_reduced_delay_bound>1.33</reliable_reduced_delay_bound>
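
For anyone following along, here is a rough Python sketch of the deadline arithmetic from the quoted post. It only illustrates the multiplication described there (a factor applied to the normal 3 day deadline). The function name `resend_deadline_days` is made up for this example and is not part of the BOINC server code.

```python
def resend_deadline_days(normal_deadline_days: float, factor: float) -> float:
    """Scale the normal deadline by the factor, as described in the quoted post.

    `factor` stands in for the value placed in <reliable_reduced_delay_bound>;
    the name and this helper are illustrative assumptions only.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return normal_deadline_days * factor


if __name__ == "__main__":
    # Compare a reduced, an unchanged, and an extended resend deadline.
    for factor in (0.5, 1.0, 1.33):
        print(f"factor {factor}: {resend_deadline_days(3, factor):.2f} days")
```

With a factor below 1 the resend gets less time than the original task, which is the case the quoted post warns against for 100 hour tasks.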

30) Message boards : Number crunching : Had to abort some WUs (Message 318)
Posted 3 Nov 2011 by Greg Tucker
Post:

One of the sad things about that WU is that it bounced around from 1 slow computer to another before ending up on a host with a fairly fast processor and short turnaround time. With the possibility of 100 hour run times and 3 day deadlines, it might be wise to start using the "issue task resends to fast-reliable hosts" mechanism but not reduce the deadline for resends. It might even make sense to extend the deadline for resends because the initial deadline is rather short and resends seem to be happening because the user aborts the task because it has no hope of finishing on time or the task simply times out.

Unfortunately the server documentation pages are unavailable at this time. I'll post a link later when the pages are back online.


I am currently out of town and have limited access to the internet. When I return I will look into the "resend to fast-reliable hosts" mechanism, which sounds like a good idea. Unless Greg wants to take a stab at it while I am gone...

Interesting that this seems to be the only wu having the problem out of almost 60000 for that set. Of course there were other slow ones, but they managed to make it through and get assimilated without drawing any attention.


I'll try to hold down the fort Eric. I'll look into it tonight.
31) Message boards : Number crunching : Had to abort some WUs (Message 311)
Posted 2 Nov 2011 by Greg Tucker
Post:
wu_12E10_SF161-1_Idx3_Grp57438of59290
getting stuck. aborted after 70 hours


Looks like this wu has some problems.
http://numberfields.asu.edu/NumberFields/result.php?resultid=258591

32) Message boards : News : Only Linux and Windows currently supported (Message 307)
Posted 1 Nov 2011 by Greg Tucker
Post:
Sorry about it taking so long. If I had login access to a Mac with Xcode installed it probably wouldn't take that long. I had an older version running fine at one time but my Mac access went away. I'm about to give up on the VM method. Anyone know where I could get login access to a Mac with Xcode installed?

33) Message boards : Number crunching : Process got signal 11 (Message 219)
Posted 21 Sep 2011 by Greg Tucker
Post:
I am having similar problems on my RHEL5 x64 hosts,
...
% ./GetBoundedDecics_1.07_x86_64-pc-linux-gnu
FATAL: kernel too old
Segmentation fault (core dumped)


We fixed the issue that was causing some Suse distros to fail. However, it looks like we are using a syscall that was added since kernel 2.6.18. I'm not sure which one or if we can avoid it. Looking through the kernel git log for the syscall table I think the last one was added in 2008. Perhaps you could run an strace on your RHEL5 system to see which syscall fails and I can look up when it was added.
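
If it helps, here is a small Python sketch for picking the failing syscall out of an strace log. It assumes the log was captured with something like `strace -f -o trace.log ./GetBoundedDecics_1.07_x86_64-pc-linux-gnu` and that an unsupported syscall shows up in the usual strace form, returning -1 with ENOSYS. The helper name `unimplemented_syscalls` and the sample log (including the placeholder `example_call`) are made up for illustration; they are not real output from this app.

```python
import re
from typing import List

# Matches strace lines such as:
#   example_call(0) = -1 ENOSYS (Function not implemented)
# with an optional PID prefix ("[pid 1234] " or "1234  ") as produced by -f.
_ENOSYS_LINE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*\)\s+=\s+-1\s+ENOSYS\b"
)


def unimplemented_syscalls(strace_output: str) -> List[str]:
    """Return names of syscalls that failed with ENOSYS, in first-seen order."""
    seen: List[str] = []
    for line in strace_output.splitlines():
        match = _ENOSYS_LINE.match(line.strip())
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    # Hypothetical log; "example_call" is a placeholder, not a real syscall.
    sample = (
        'execve("./app", ["./app"], [/* 20 vars */]) = 0\n'
        "example_call(0) = -1 ENOSYS (Function not implemented)\n"
        'open("/etc/passwd", O_RDONLY) = 3\n'
    )
    print(unimplemented_syscalls(sample))
```

Any name this reports can then be looked up in the kernel history to see when it was added.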

34) Message boards : Number crunching : Massive drop of credits per CPU hour (Message 206)
Posted 16 Sep 2011 by Greg Tucker
Post:
Well I can't work out why my faster computer (by only 200 MHz, an AMD Phenom II 1100T @ 3.3 GHz, my other is AMD Phenom II 955 @ 3.2 GHz), is getting consistently much lower results than my slower machine.
...


I really don't understand how the credit system works, but I gather it is all based on the benchmarks that run initially. Perhaps something else was running during the benchmark phase, which is lowering the scores on your Phenom. I believe you can force the manager to re-run them. Can you give that a try before you bail on your AMD?

35) Message boards : Number crunching : Error in the PARI system (Message 181)
Posted 11 Sep 2011 by Greg Tucker
Post:
I'm getting PARI errors too.

http://stat.la.asu.edu/NumberFields/result.php?resultid=59267
*** overflow in t_INT-->t_INT assignment.
*** Error in the PARI system. End of program.

http://stat.la.asu.edu/NumberFields/result.php?resultid=59281
*** bug in PARI/GP (Segmentation Fault), please report
*** Error in the PARI system. End of program.

If you browse through my errored tasks you'll also see I've been getting "Maximum disk size exceeded", exit code -177, but I haven't seen any more of those since allotting more disk space to BOINC, so ignore them.


I am aware of the pari seg faults. It seems like every time I find a bug and fix it, another one appears. Needless to say, that's on the top of my list of to-do's.

Out of curiosity, what did you need to set your maximum disk usage to? I just checked and mine was set to 20GB which I'm sure is overkill. I've noticed a few other users with exit code -177, so this information might help them set their preferences.

But we should have very low disk requirements. We don't have big data sets to scan like Einstein, and the work units are tiny. I don't see how we ever triggered that.
36) Message boards : Number crunching : Comp errors on 1.05 for Gentoo Linux (Message 177)
Posted 7 Sep 2011 by Greg Tucker
Post:
Thanks to the beta testers we confirmed the fix. It was the second and hopefully last bug we found in the pari source. Thanks again guys.
37) Message boards : Number crunching : Comp errors on 1.05 for Gentoo Linux (Message 162)
Posted 6 Sep 2011 by Greg Tucker
Post:
I don't think it has anything to do with lib versions on the client. I think something early is ill conditioned or relies on a failed syscall that leads to the exception. Any chance someone can run under a debugger? You would do #gdb <app name> then r for run. After the exception it should trap and give you a line number. That would be great since we have no way to reproduce. Any takers?


Well, this doesn't look real helpful, but here it is.

This GDB was configured as "x86_64-pc-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.gentoo.org/>...
Reading symbols from /tmp/GetBoundedDecics_1.06_x86_64-pc-linux-gnu...done.
(gdb) r
Starting program: /tmp/GetBoundedDecics_1.06_x86_64-pc-linux-gnu

Program received signal SIGFPE, Arithmetic exception.
0x00000031fee72b4c in ?? () from /lib64/libc.so.6


Perhaps there's a USE flag I need to enable in my glibc?

emerge -vp glibc

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild R ] sys-libs/glibc-2.12.2 USE="(multilib) nls -debug -gd -glibc-omitfp (-hardened) -profile (-selinux) -vanilla" 0 kB


This is helpful thanks. Looks like the same issue as Bok is having. We need to patch the pari source again and see if we can eliminate this outdated reference. Perhaps you can beta test for us when we do?
38) Message boards : Number crunching : Comp errors on 1.05 for Gentoo Linux (Message 161)
Posted 6 Sep 2011 by Greg Tucker
Post:
Sure thing. System has 16Gb Ram with almost all of it totally free right now.


(gdb) bt
#0  0x0000003a76277e13 in _int_free () from /lib64/libc.so.6
#1  0x0000003a76265eed in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
#2  0x00007ffff7de1f3a in _nss_files_getpwuid_r () from /lib64/libnss_files.so.2
#3  0x000000000085345d in getpwuid_r ()
#4  0x0000000000853029 in getpwuid ()
#5  0x000000000068a5e5 in pari_get_homedir ()
#6  0x000000000068abea in path_expand ()
#7  0x000000000068af29 in gp_expand_path ()
#8  0x000000000069cca0 in pari_init_opts ()
#9  0x000000000041150e in MartinetSearch(char*, char*, int, long, long, long*) ()
#10 0x0000000000403196 in main ()


Bok, this is great. Just what we need. I think we know what the problem is from this. The pari library (a collection of source for numberfields math) is using an outdated version of getpwuid(). Some systems have a workaround but others don't deal with it so well. We need to patch it so that it doesn't call this function and tell the pari people.
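
To show the idea behind the patch (not the patch itself, which would be in C inside the pari source), here is a minimal Python sketch of finding the home directory without going through the passwd lookup that shows up in the backtrace. It assumes the HOME environment variable is an acceptable source. The function name `home_directory` and its `default` parameter are made up for this example.

```python
import os


def home_directory(default: str = ".") -> str:
    """Return $HOME if it is set and non-empty, otherwise `default`.

    This deliberately never consults the passwd database (the Python
    equivalent of the getpwuid() call would be pwd.getpwuid), so the
    lookup path seen in the backtrace is not exercised at all.
    """
    home = os.environ.get("HOME", "")
    return home if home else default


if __name__ == "__main__":
    print(home_directory())
```

The trade-off is that a process started with no HOME set gets the fallback instead of the account's real home directory, which should be fine for an app that only needs somewhere to expand a path.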

Thanks for the effort on this. You rock Bok!

--Greg
39) Message boards : Number crunching : Comp errors on 1.05 for Gentoo Linux (Message 158)
Posted 6 Sep 2011 by Greg Tucker
Post:
Tried the debug-info command but it didn't find anything to install. Looks like the debuginfo repos for CentOS 6 are not updated yet. Do I need a version compiled with -g?

Reading symbols from /root/BOINC/projects/stat.la.asu.edu_NumberFields/GetBoundedDecics_1.06_x86_64-pc-linux-gnu...done.
(gdb) r
Starting program: /root/BOINC/projects/stat.la.asu.edu_NumberFields/GetBoundedDecics_1.06_x86_64-pc-linux-gnu

Program received signal SIGFPE, Arithmetic exception.
0x0000003a76277e13 in _int_free () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.7.el6.x86_64



OK thanks. Looks to me like a memory issue. _int_free() I believe is in the malloc path. You already have symbol information so don't worry about adding more. If you could do a backtrace (type bt at the gdb prompt) after the fail this may confirm it. Is your system heavily loaded on memory? Perhaps you could add a vmstat output.

Thanks for your help on this.
40) Message boards : Number crunching : Comp errors on 1.05 for Gentoo Linux (Message 146)
Posted 6 Sep 2011 by Greg Tucker
Post:
I don't think it has anything to do with lib versions on the client. I think something early is ill conditioned or relies on a failed syscall that leads to the exception. Any chance someone can run under a debugger? You would do #gdb <app name> then r for run. After the exception it should trap and give you a line number. That would be great since we have no way to reproduce. Any takers?
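
If typing at the gdb prompt is a hassle, the same steps can be scripted. Below is a minimal Python sketch that runs gdb non-interactively, starts the app, and prints a backtrace once it stops. It assumes gdb is installed and on the PATH. The helper names `gdb_backtrace_command` and `capture_backtrace` are made up for this example.

```python
import subprocess
from typing import List


def gdb_backtrace_command(app_path: str) -> List[str]:
    """Build a gdb command line: run the app, then print a backtrace."""
    # -batch exits when the commands finish, each -ex runs one gdb command,
    # and --args marks everything after it as the program to debug.
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args", app_path]


def capture_backtrace(app_path: str) -> str:
    """Run gdb on the app and return everything it printed."""
    result = subprocess.run(
        gdb_backtrace_command(app_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,  # a crashing app is the expected case here
    )
    return result.stdout
```

Calling `capture_backtrace()` with the path to the downloaded app binary should return text that can be pasted straight into a reply.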




Copyright © 2022 Arizona State University