Posts by Richard Haselgrove

41) Message boards : Number crunching : This project is using an old URL. When convenient, remove the project, then add http://numberfields...... (Message 2789)
Posted 30 May 2020 by Richard Haselgrove
Post:
Eric - please see message 2788 - looks like these two issues might be related.
42) Message boards : Number crunching : Upload server down? (Message 2788)
Posted 30 May 2020 by Richard Haselgrove
Post:
I'm getting this when I try to upload:

30/05/2020 14:12:06 | NumberFields@home | [http] [ID#3878] Info: Connected to numberfields.asu.edu (129.219.51.76) port 80 (#1791)
30/05/2020 14:12:06 | NumberFields@home | [http] [ID#3878] Received header from server: HTTP/1.1 301 Moved Permanently
30/05/2020 14:12:06 | NumberFields@home | [http] [ID#3878] Info: Issue another request to this URL: 'https://numberfields.asu.edu/NumberFields_cgi/file_upload_handler/'
30/05/2020 14:12:06 | NumberFields@home | [http] [ID#3878] Info: Connected to numberfields.asu.edu (129.219.51.76) port 443 (#1792)
30/05/2020 14:12:06 | NumberFields@home | [http] [ID#3878] Info: found 133 certificates in ca-bundle.crt
30/05/2020 14:12:06 | NumberFields@home | [http] [ID#3878] Info: found 402 certificates in /etc/ssl/certs
30/05/2020 14:12:06 | NumberFields@home | [http] [ID#3878] Info: ALPN, offering http/1.1
30/05/2020 14:12:06 | NumberFields@home | [http] [ID#3878] Info: SSL connection using TLS1.2 / ECDHE_RSA_AES_256_GCM_SHA384
30/05/2020 14:12:06 | NumberFields@home | [http] [ID#3878] Info: server certificate verification failed. CAfile: ca-bundle.crt CRLfile: none
30/05/2020 14:12:06 | NumberFields@home | [http] HTTP error: Peer certificate cannot be authenticated with given CA certificates
30/05/2020 14:12:07 | NumberFields@home | Temporarily failed upload of wu_sf3_DS-15x271_Grp4812969of6553600_0_r194046698_0: transient HTTP error
Seems like the switch from http to https hasn't gone smoothly.
43) Message boards : Number crunching : very weak credit (Message 2759)
Posted 10 May 2020 by Richard Haselgrove
Post:
I am considering using the credit per wu option, instead of creditNew. Does anyone know what a good credit/hour is on a single threaded CPU core (at 3.5 to 4 GHz)?
Let's be careful to research this properly - some people have, in the past, provided what must have been deliberately over-generous examples.

I would personally nominate Einstein@Home as a well-resourced, long-standing, thoughtful project that uses fixed credit and makes a conscientious effort to honour the cobblestone definition. I'll try to find some figures tomorrow.
44) Message boards : Number crunching : Discrepancy in the task deadline between server and BOINC manager (Message 2742)
Posted 8 May 2020 by Richard Haselgrove
Post:
The server always reports in UTC, whereas the Manager reports in your computer's local time zone. So there's still a small mental correction to make, but that's a more familiar one, which users are used to encountering and adjusting for.

I think this project is the only one I've encountered to actually use the grace period, although it's available for all project administrators to use if they wish. I think it's quite a good idea, given the differing run-times of different batches of tasks. The local BOINC on volunteers' computers will do its best to meet the deadline displayed on the local machine, but when the runtime changes, it may get caught out. The grace period allows that to happen without wastefully sending a replacement task to another computer when it's not needed.
45) Message boards : News : New GPU OpenCL versions available (Message 2661)
Posted 23 Feb 2020 by Richard Haselgrove
Post:
A problem has been reported with the Adrenalin 2020 drivers affecting the third-party SETI@Home OpenCL application, especially on RX 57xx cards. That problem was acknowledged and fixed in beta version 20.1.1 and release version 20.1.3.
46) Message boards : News : New GPU OpenCL versions available (Message 2583)
Posted 4 Nov 2019 by Richard Haselgrove
Post:
You may find you need to check the matching box for the right venue on https://numberfields.asu.edu/NumberFields/prefs.php?subset=project and update twice - once to update the client settings, and again to actually use the new ones.
47) Message boards : News : Batch plan (Message 2573)
Posted 10 Oct 2019 by Richard Haselgrove
Post:
Which raises the problem of the transition back to long tasks after our machines have adapted to the short runs from subfield 7. You may get a blip of runtime exceeded errors again, followed by another blip when the first resends try to run in half the time. Might be wise to cut down the 'maximum tasks in progress' for a while when the first long tasks are ready to flow out.
48) Message boards : News : Batch plan (Message 2570)
Posted 9 Oct 2019 by Richard Haselgrove
Post:
seti@home graphics are not working....what am i doing wrong?
Posting in the wrong forum.

Try https://setiathome.berkeley.edu/forum_help_desk.php - but next time, please give details of your operating system, BOINC version, and what you did prior to the change in behaviour.
49) Message boards : Number crunching : large swing in ETA (Message 2556)
Posted 8 Sep 2019 by Richard Haselgrove
Post:
The deadline is 1 week and then there is a 3 day grace period before it is reissued (or maybe the grace period only applies to receiving credit?). The client is usually pretty good at aborting tasks that haven't started before the deadline. If you're contacting the server regularly (once per day or more) I don't think you have anything to worry about.

The "reduced delay bound" is set to .5 which means reissued tasks have 3.5 days to complete. I would assume the 3 day grace period still applies on accelerated retries, so in effect you would have 6.5 days on retries.
Workunit 58533257 shows the timetabling quite well.

On this website, deadlines are displayed with the additional 3 day grace period included. The server uses these extended deadlines for sending replacement tasks after a timeout.

On our home computers, BOINC Manager displays the short deadline, and the BOINC client acts on the short deadline. The machine that I'm processing that resend on displays the deadline as 10 September, 23:51 in my local timezone (UTC+1).
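
Just to make the arithmetic explicit, here's a rough sketch in Python. The 7-day deadline, 3-day grace period and 0.5 reduced delay bound are the figures quoted above; the send time is made up, and whether the grace period really applies to resends is the open question above, so treat the last two lines as an assumption:

from datetime import datetime, timedelta

delay_bound = timedelta(days=7)     # normal deadline, per the figures above
grace_period = timedelta(days=3)    # extra wait before the server resends
reduced_delay = 0.5                 # "reduced delay bound" applied to resent tasks

sent = datetime(2019, 9, 3, 22, 51)                    # illustrative send time (UTC)
client_deadline = sent + delay_bound                   # what BOINC Manager enforces locally
server_resend_at = client_deadline + grace_period      # the deadline this website displays

resend_deadline = sent + delay_bound * reduced_delay   # 3.5 days for a resent task
resend_resend_at = resend_deadline + grace_period      # 6.5 days, if the grace still applies

print(client_deadline, server_resend_at, resend_deadline, resend_resend_at)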
50) Message boards : Number crunching : large swing in ETA (Message 2554)
Posted 7 Sep 2019 by Richard Haselgrove
Post:
You are correct, there are a higher number of resent tasks than usual. So getting back to Nick's original question, are you suggesting that this abundance of resent tasks is responsible for the decrease in GFLOPS (= increase in ETA)? Note it has been this way for several weeks now, and only recently has the GFLOPS dropped, so that's why I attributed it to a drop in volunteers.
Not directly, but I think they may be separate symptoms of the same underlying cause.

I've looked at a few of my _1 tasks, to see why they needed to be resent. There's the usual range of sporadic errors and problems, but also a non-trivial number of the ones I was talking about. They show up in task lists as

Not started by deadline - canceled
and in individual task pages as

Exit status	200 (0x000000C8) EXIT_UNSTARTED_LATE
They're also described as 'Aborted by user', which is false - they are aborted by the client, with no intervention by the user.

This is actually a very minor problem, which is why I haven't mentioned it before. No processing time (no CPU cycles) has been wasted - the machines have probably been busy on other jobs. It just looks a bit ugly.

So, what's the underlying common cause? BOINC is not well instrumented: there are very few things it can measure. It can't measure speed (GFlops) directly, because that depends on the hardware (which can be benchmarked - though not for GPUs) and also on the programming efficiency (as you found when you ditched the library in favour of custom code). BOINC ignores the programming efficiency - no way of measuring that.

So the only measure of speed that BOINC has is 'work done per unit time'. BOINC can measure time - it's one of the few things it's good at - but (again), it can't measure work done. So, the fallback 'work done' figure is our old friend <rsc_fpops_est> - declared for each workunit. And, as we've discussed before, the only way of measuring fpops here is post hoc - by running the workunit. Which defeats the object of the project.

The only possible mitigating action would be to sample each new batch of workunits before distribution starts and declare <rsc_fpops_est> for that batch as some sort of ballpark kludge. But that's a lot of work, still not very accurate, and I can't realistically recommend it. We're muddling through well enough.

In that scenario, GFlops is falling because "same work in longer time" means we're working more slowly. It's taken weeks for BOINC to notice that (which is far too slow): it's taken the same number of weeks for BOINC to adjust our runtime estimates (so we don't fetch more work than we can handle), because that's based on the same underlying data. I think we should just buckle in and enjoy the ride. It might be bumpy at times.
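
To put illustrative numbers on that (and they are purely illustrative - I'm not quoting real batch figures), the mechanism in Python terms is simply:

# Declared work per task stays fixed while real runtimes grow (made-up values).
rsc_fpops_est = 40e12          # fpops declared in the workunit template
old_runtime = 2 * 3600.0       # a quick 'loose ends' task, in seconds
new_runtime = 8 * 3600.0       # a slow DS16x271-style task

# The only speed BOINC can infer is declared work over measured time,
# so a 4x longer runtime with the same declared fpops reads as a 4x slower host.
print(rsc_fpops_est / old_runtime / 1e9, "apparent GFLOPS on the old batch")
print(rsc_fpops_est / new_runtime / 1e9, "apparent GFLOPS on the new batch")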
51) Message boards : Number crunching : large swing in ETA (Message 2550)
Posted 6 Sep 2019 by Richard Haselgrove
Post:
More to it than that. The DS16x271 sequence of tasks is much slower (longer runtime) than its immediate predecessors. Ever since it started, I've been noticing a much higher proportion than usual of '_1' (or higher) resent tasks. And because you have accelerated processing (shorter deadlines) for resent tasks, they are under even greater time pressure.

My suspicion is that if you examine the database, you will find a higher than usual proportion of tasks aborted by the client for "not started by deadline" - I can't remember the error number offhand, but I can look it up tomorrow.

And that is compounded by the very slow adjustment of Estimated Runtime under CreditNew. We had a good run of 'loose ends', which mostly ran quickly: estimates had sufficient time to adapt, and with the short runtime, work requests resulted in a large number of tasks allocated and cached. When these became the later, slower, tasks, caches were overfilled, and couldn't be processed in time.
52) Message boards : Number crunching : First 6,000 credits, now 60 credits? (Message 2397)
Posted 8 Apr 2019 by Richard Haselgrove
Post:
As I posted in the first reply to that thread,

The equivalent reference document hosted locally is http://boinc.berkeley.edu/trac/wiki/CreditNew. Because that's held as a Wiki, you can look back through the history, timeline, and (sole) authorship of how it evolved.
I suggest reading the original document whenever discussing its follow-up implications.
53) Message boards : News : GPU app status update (Message 2387)
Posted 7 Apr 2019 by Richard Haselgrove
Post:
Richard - are there any more configuration items I need to modify for CreditNew to work better? For example, how sensitive is it to rsc_fpops_est in the WU template?
To be honest, I don't think so, but I don't know for certain. My personal opinion is that <rsc_fpops_est> must be buried in there somewhere, but people I've spoken to who claim to understand the control-feedback algorithm involved (apparently it's a well-known tool in control engineering) don't agree with me - they use words like Kalman filter, or PI-controller, which I don't understand. Setting <rsc_fpops_est> by batch would be a nightmare for you to implement here - don't do anything until I can speak to others tomorrow.
54) Message boards : News : GPU app status update (Message 2384)
Posted 7 Apr 2019 by Richard Haselgrove
Post:
Credit "backwards" steps what you do now...wow... are most ****** thing i see in history of boinc projects ... you really dont need volunters and you acting like that..
I realy dont need another collatz" here..xx millions per day... but 100-200 per task from this project now is - laugh to volunter eyes..
You are running the Windows application, on CPUs only (hosts 1465177 and 1488499). The credits are normal by the BOINC definition of work done - even a little high.

This thread is about the GPU experiment, and the problems it's causing for everyone in establishing a fair reward for both types of application. Please have patience - we're trying to sort something out.
55) Message boards : News : GPU app status update (Message 2382)
Posted 7 Apr 2019 by Richard Haselgrove
Post:
Hmmm. The documentation says that credit from runtime shouldn't be used for GPUs, because they aren't benchmarked. So, what speed is the validator assuming in calculating those obscene credits? My guess would be that it's using the raw advertising claim ('GFLOPS peak'), which - coupled with the extended runtimes caused by the current optimisation state of the Beta app - would be a double whammy.

I'm in regular contact with the BOINC developers, but I'd like to check the figures behind that assumption, before raising the subject with them. Eric, could you let me know - by PM might be best - one or two example HostIDs exhibiting this problem, so I can run the numbers?

Edit - never mind, I found some. They do stick out rather, don't they?

Yes, that checks out. Device peak FLOPS (overstated by marketing) × raw elapsed time (extended by an inefficient app) × the same fiddle factor as my CPUs were getting = awarded credit.

But in the last hour and a half, the credit awarded to my own CPUs has gone up nearly tenfold. That can't be sustainable, either.
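
For anyone who wants to sanity-check the scale: the cobblestone definition is 200 credits per day of sustained 1 GFLOPS, so a back-of-envelope version of 'credit from runtime' looks like this in Python. The CPU and GPU speeds below are purely illustrative, and the idea that the validator uses the advertised peak figure is still only my assumption:

SECONDS_PER_DAY = 86_400
CREDITS_PER_GFLOP_DAY = 200               # the cobblestone definition

def runtime_credit(gflops_assumed, elapsed_seconds):
    # credit = (work assumed done, in GFLOP-days) x 200
    gflop_days = gflops_assumed * elapsed_seconds / SECONDS_PER_DAY
    return gflop_days * CREDITS_PER_GFLOP_DAY

print(runtime_credit(4, 3600))        # a ~4 GFLOPS benchmarked CPU core for an hour: ~33 credits
print(runtime_credit(10_000, 3600))   # a ~10 TFLOPS advertised-peak GPU for an hour: ~83,000 credits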
56) Message boards : News : GPU app status update (Message 2379)
Posted 7 Apr 2019 by Richard Haselgrove
Post:
OK, SETI is back up, so I've recovered my readings from mid-February, and put my watt meter back into the same circuit.

This is what I posted back then:

I've long had a theory that it doesn't just matter whether you're using the CPU: it matters what you're doing with it, too. Since we're testing, I thought I'd try to demonstrate that. My host 8121358 has an i5-6500 CPU @ 3.20GHz - a couple of generations old now. I plugged it into a Kill A Watt meter when I first got it, and never got round to unplugging it again. Today's figures are:

Idle - BOINC not running:		22 watts
Running NumberFields on 4 cores:	55 watts
Running SETI x64 AVX on 4 cores:	69 watts
ditto at VHAR:				71 watts
So, there's a significant difference between NumberFields@Home (primarily integer arithmetic) and the heavy use of the specialist floating point hardware by SETI. I've listed VHAR separately, because last time I tested this (about 10 years ago), I could see that VHAR put an extra load on the memory controller, too.

(I kept both GPUs idle while I did that test)
The computer is known as host 33342 here - CPU details as above. Obviously, I was running v2.12 back then: today's readings are

Idle - BOINC not running:		22 watts
Running NumberFields 3.00 on 4 cores:	60 watts
There's been a BIOS update since the previous test, but the idle value didn't change - that's reassuring.

As we expected, the new app is drawing more power, but not nearly as much as the SETI app. SETI is heavily into floating point maths and, like this project, uses a specialist external maths library - in their case, FFTW, or "The Fastest Fourier Transform in the West". SETI supplies this as an external library (DLL for Windows), and my understanding is that the library alone can detect host capability, and utilise SIMD instructions up to AVX if available, even if the calling application hasn't been compiled to use them in the main body of the program. The specific variant I tested does use AVX in both components, though.

It might be worth reporting back your findings to the PARI people, and suggesting that they compare notes with FFTW to see if similar techniques could be employed.
57) Message boards : News : GPU app status update (Message 2376)
Posted 6 Apr 2019 by Richard Haselgrove
Post:
I will be interested in seeing your power consumption analysis.
I'll dig them up, but it may take a while. I posted them on a message board, but I think it was SETI - which has crashed hard this weekend. And I tidied away my notes when I had a visitor last month: that's fatal, of course.
58) Message boards : News : GPU app status update (Message 2371)
Posted 6 Apr 2019 by Richard Haselgrove
Post:
To reassure people with different recollections, I took version 2.12 credit readings from between 21 March and 25 March, before the first adjustments for the GPU release. There were between 47 and 75 results visible for the three machines. For version 3.00, I had between 27 and 40 results available per machine when I started updating the spreadsheet.

Here are the raw figures, expressed as average credits per hour.

Host	v2.12		v3.00
1288	70.5627		70.9940
1290	68.0019		68.0024
1291	72.1462		72.1432
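
(For anyone who wants to reproduce that sort of comparison: the per-host figure is just granted credit divided by elapsed hours, averaged over the visible results. A minimal sketch in Python, with made-up credit/runtime pairs standing in for what you'd copy from the task pages:)

from statistics import mean, stdev

# (granted credit, run time in seconds) for one host - illustrative values only
results = [(23.5, 1195.0), (23.5, 1190.2), (23.5, 1198.7)]

per_hour = [credit / (seconds / 3600) for credit, seconds in results]
print(round(mean(per_hour), 4), "credits/hour, st_dev", round(stdev(per_hour), 5))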
59) Message boards : News : GPU app status update (Message 2370)
Posted 6 Apr 2019 by Richard Haselgrove
Post:
I was a bit taken aback to see the much shorter estimated runtime when I first saw my task list this morning, but once I'd focused on the version number and read this thread, all was explained.

As it happens, I'd started a spreadsheet to measure the performance of my Windows machines in BOINC credit terms. I have three identical i5-4690 @ 3.50GHz CPUs running Windows 7/64, but with different software loaded for different purposes: with version 2.12, they were recording 68, 70, 72 credits per hour with minuscule variation (st_dev down to 0.00073).

Under version 3.00 - exactly the same! I don't know how you managed it, but that's the smoothest version upgrade I've ever seen. No problems with runtime estimates and over/under fetching, no interruption to work flow, no messy credit adjustments. The only thing I haven't checked yet is whether the more efficient application increases the power consumption of the CPU, but I'll check that later - I haven't got the watt-meter in circuit at the moment.

I'd say that was a fair result. We are contributing the same hardware and (subject to checking) the same power, and we've done nothing to optimise our systems. You've done the work, and you've got the benefit in the form of a much increased result rate.

Bravo, and well done. :-)
60) Message boards : News : GPU app - beta version for linux nvidia (Message 2358)
Posted 4 Apr 2019 by Richard Haselgrove
Post:
Yes, I mean the same credit whether done on CPU or GPU. But let's be clear: I mean the same credit PER TASK. If you use a GPU, you will complete many more tasks, so your overall figures - RAC, total credit - will be much larger: there's your reward. But you'll be doing the same work, and that's what the credit system is designed to reflect.

I say 'designed', and I agree the design is flawed: I am particularly concerned that no attempt has been made internally to assess and correct its behaviour since it was launched in 2010. But I would like to go back to a situation where it was expected that all projects paid as near as dammit the same level of credit, so that decisions on what to crunch were made on other factors - scientific interest, publication rate, or whatever else tickles your fancy.

