Posts by Richard Haselgrove

1) Message boards : Number crunching : large swing in ETA (Message 2556)
Posted 11 days ago by Richard Haselgrove
Post:
The deadline is 1 week and then there is a 3 day grace period before it is reissued (or maybe the grace period only applies to receiving credit?). The client is usually pretty good at aborting tasks that haven't started before the deadline. If you're contacting the server regularly (once per day or more) I don't think you have anything to worry about.

The "reduced delay bound" is set to 0.5, which means reissued tasks have 3.5 days to complete. I would assume the 3-day grace period still applies to accelerated retries, so in effect you would have 6.5 days on retries.
Workunit 58533257 shows the timetabling quite well.
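The timetabling arithmetic above can be sketched as follows. The figures are the project settings quoted in this thread (7-day deadline, 3-day grace period, 0.5 reduced delay bound); the variable names are mine, not BOINC's:

```python
# Sketch of the deadline arithmetic described above. The numbers are
# the project settings quoted in this thread, not read from the server.
DEADLINE_DAYS = 7          # initial task deadline
GRACE_DAYS = 3             # grace period before reissue
REDUCED_DELAY_BOUND = 0.5  # multiplier applied to reissued (retry) tasks

retry_deadline = DEADLINE_DAYS * REDUCED_DELAY_BOUND   # 3.5 days
retry_with_grace = retry_deadline + GRACE_DAYS         # 6.5 days

print(f"retry deadline: {retry_deadline} days")
print(f"retry deadline incl. grace: {retry_with_grace} days")
```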

On this website, deadlines are displayed with the additional 3 day grace period included. The server uses these extended deadlines for sending replacement tasks after a timeout.

On our home computers, BOINC Manager displays the short deadline, and the BOINC client acts on the short deadline. The machine on which I'm processing that resend displays the deadline as 10 September, 23:51 in my local timezone (UTC+1).
2) Message boards : Number crunching : large swing in ETA (Message 2554)
Posted 12 days ago by Richard Haselgrove
Post:
You are correct, there are a higher number of resent tasks than usual. So getting back to Nick's original question, are you suggesting that this abundance of resent tasks is responsible for the decrease in GFLOPS (= increase in ETA)? Note it has been this way for several weeks now, and only recently has the GFLOPS dropped, so that's why I attributed it to a drop in volunteers.
Not directly, but I think they may be separate symptoms of the same underlying cause.

I've looked at a few of my _1 tasks, to see why they needed to be resent. Mostly the usual range of sporadic errors and problems, but also a non-trivial number of the kind I was talking about. They show up in task lists as

Not started by deadline - canceled
and in individual task pages as

Exit status	200 (0x000000C8) EXIT_UNSTARTED_LATE
They're also described as 'Aborted by user', which is false - they are aborted by the client, with no intervention by the user.

This is actually a very minor problem, which is why I haven't mentioned it before. No processing time (no CPU cycles) has been wasted - the machines have probably been busy on other jobs. It just looks a bit ugly.

So, what's the underlying common cause? BOINC is not well instrumented: there are very few things it can measure. It can't measure speed (GFlops) directly, because that depends on the hardware (which can be benchmarked - though not for GPUs) and also on the programming efficiency (as you found when you ditched the library in favour of custom code). BOINC ignores the programming efficiency - no way of measuring that.

So the only measure of speed that BOINC has is 'work done per unit time'. BOINC can measure time - it's one of the few things it's good at - but (again), it can't measure work done. So, the fallback 'work done' figure is our old friend <rsc_fpops_est> - declared for each workunit. And, as we've discussed before, the only way of measuring fpops here is post hoc - by running the workunit. Which defeats the object of the project.
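That fallback speed measure can be written out explicitly: declared work divided by measured wall time. This is an illustrative sketch of the relationship, with invented sample numbers, not BOINC's actual code:

```python
# BOINC's fallback speed measure: the declared work figure
# (<rsc_fpops_est>, set per workunit) divided by measured elapsed time.
# Sample numbers below are invented for illustration.
def effective_gflops(rsc_fpops_est: float, elapsed_seconds: float) -> float:
    """Apparent speed in GFLOPS: declared FP operations / wall time."""
    return rsc_fpops_est / elapsed_seconds / 1e9

# Same declared work, double the runtime -> half the apparent GFLOPS,
# even though nothing about the estimate itself changed.
print(effective_gflops(3.6e13, 10_000))  # 3.6 GFLOPS
print(effective_gflops(3.6e13, 20_000))  # 1.8 GFLOPS
```

This is why a batch of genuinely slower workunits, declared with the same `<rsc_fpops_est>` as earlier faster ones, shows up as a falling GFLOPS figure rather than a rising work figure.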

The only possible mitigating action would be to sample each new batch of workunits before distribution starts and declare <rsc_fpops_est> for that batch as some sort of ballpark kludge. But that's a lot of work, still not very accurate, and I can't realistically recommend it. We're muddling through well enough.

In that scenario, GFlops is falling because "same work in longer time" means we're working more slowly. It's taken weeks for BOINC to notice that (which is far too slow): it's taken the same number of weeks for BOINC to adjust our runtime estimates (so we don't fetch more work than we can handle), because that's based on the same underlying data. I think we should just buckle in and enjoy the ride. It might be bumpy at times.
3) Message boards : Number crunching : large swing in ETA (Message 2550)
Posted 12 days ago by Richard Haselgrove
Post:
More to it than that. The DS16x271 sequence of tasks is much slower (longer runtime) than its immediate predecessors. Ever since it started, I've been noticing a much higher proportion than usual of '_1' (or higher) resent tasks. And because you have accelerated processing (shorter deadlines) for resent tasks, they are under even greater time pressure.

My suspicion is that if you examine the database, you will find a higher than usual proportion of tasks aborted by the client for "not started by deadline" - I can't remember the error number offhand, but I can look it up tomorrow.

And that is compounded by the very slow adjustment of Estimated Runtime under CreditNew. We had a good run of 'loose ends', which mostly ran quickly: estimates had sufficient time to adapt, and with the short runtime, work requests resulted in a large number of tasks allocated and cached. When these became the later, slower, tasks, caches were overfilled, and couldn't be processed in time.
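The slow adjustment described above can be illustrated with a toy running average. This is an illustrative exponential moving average, not BOINC's actual CreditNew code, and the smoothing constant is invented; it just shows why a slowly-updated estimate lags badly when runtimes jump:

```python
# Toy exponential moving average -- NOT BOINC's actual CreditNew code --
# showing why a slowly-adapting runtime estimate lags a sudden change.
def ema(history, alpha=0.01):
    """Return successive runtime estimates for a stream of observed runtimes."""
    est = history[0]
    out = []
    for runtime in history:
        est += alpha * (runtime - est)
        out.append(est)
    return out

# 50 'fast' tasks (1 h each) followed by 50 'slow' tasks (4 h each):
runtimes = [1.0] * 50 + [4.0] * 50
estimates = ema(runtimes)
print(round(estimates[-1], 2))  # still well below the true 4.0 hours
```

With the estimate still far below the real runtime, work requests keep fetching as if tasks were short, which is exactly how caches get overfilled.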
4) Message boards : Number crunching : First 6,000 credits, now 60 credits? (Message 2397)
Posted 8 Apr 2019 by Richard Haselgrove
Post:
As I posted in the first reply to that thread,

The equivalent reference document hosted locally is http://boinc.berkeley.edu/trac/wiki/CreditNew. Because that's held as a Wiki, you can look back through the history, timeline, and (sole) authorship of how it evolved.
I suggest reading the original document whenever discussing its follow-up implications.
5) Message boards : News : GPU app status update (Message 2387)
Posted 7 Apr 2019 by Richard Haselgrove
Post:
Richard - are there any more configuration items I need to modify for CreditNew to work better? For example, how sensitive is it to rsc_fpops_est in the WU template?
To be honest, I don't think so, but I don't know for certain. My personal opinion is that <rsc_fpops_est> must be buried in there somewhere, but people I've spoken to who claim to understand the feedback-control algorithm involved (apparently it's a well-known tool in control engineering) don't agree with me - they use words like Kalman filter or PI controller, which I don't understand. Setting <rsc_fpops_est> per batch would be a nightmare for you to implement here - don't do anything until I can speak to others tomorrow.
6) Message boards : News : GPU app status update (Message 2384)
Posted 7 Apr 2019 by Richard Haselgrove
Post:
Credit "backwards" steps what you do now...wow... are most ****** thing i see in history of boinc projects ... you really dont need volunters and you acting like that..
I realy dont need another collatz" here..xx millions per day... but 100-200 per task from this project now is - laugh to volunter eyes..
You are running the Windows application, on CPUs only (hosts 1465177 and 1488499). The credits are normal by the BOINC definition of work done - even a little high.

This thread is about the GPU experiment, and the problems it's causing for everyone in establishing a fair reward for both types of application. Please have patience - we're trying to sort something out.
7) Message boards : News : GPU app status update (Message 2382)
Posted 7 Apr 2019 by Richard Haselgrove
Post:
Hmmm. The documentation says that credit from runtime shouldn't be used for GPUs, because they aren't benchmarked. So, what speed is the validator assuming in calculating those obscene credits? My guess would be that it's using the raw advertising claim ('GFLOPS peak'), which - coupled with the extended runtimes caused by the current optimisation state of the Beta app - would be a double whammy.

I'm in regular contact with the BOINC developers, but I'd like to check the figures behind that assumption, before raising the subject with them. Eric, could you let me know - by PM might be best - one or two example HostIDs exhibiting this problem, so I can run the numbers?

Edit - never mind, I found some. They do stick out rather, don't they?

Yes, that checks out. Device peak FLOPS (overstated by marketing) × raw elapsed time (extended by the inefficient app) × the same fiddle factor as my CPUs were getting = awarded credit.

But in the last hour and a half, the credit awarded to my own CPUs has gone up nearly tenfold. That can't be sustainable, either.
8) Message boards : News : GPU app status update (Message 2379)
Posted 7 Apr 2019 by Richard Haselgrove
Post:
OK, SETI is back up, so I've recovered my readings from mid-February, and put my watt meter back into the same circuit.

This is what I posted back then:

I've long had a theory that it doesn't just matter whether you're using the CPU: it matters what you're doing with it, too. Since we're testing, I thought I'd try to demonstrate that. My host 8121358 has an i5-6500 CPU @ 3.20GHz - a couple of generations old now. I plugged it into a Kill A Watt meter when I first got it, and never got round to unplugging it again. Today's figures are:

Idle - BOINC not running:		22 watts
Running NumberFields on 4 cores:	55 watts
Running SETI x64 AVX on 4 cores:	69 watts
ditto at VHAR:				71 watts
So, there's a significant difference between NumberFields@Home (primarily integer arithmetic) and the heavy use of the specialist floating point hardware by SETI. I've listed VHAR separately, because last time I tested this (about 10 years ago), I could see that VHAR put an extra load on the memory controller, too.

(I kept both GPUs idle while I did that test)
The computer is known as host 33342 here - CPU details as above. Obviously, I was running v2.12 back then: today's readings are

Idle - BOINC not running:		22 watts
Running NumberFields 3.00 on 4 cores:	60 watts
There's been a BIOS update since the previous test, but the idle value didn't change - that's reassuring.

As we expected, the new app is drawing more power, but not nearly as much as the SETI app. SETI is heavily into floating point maths and, like this project, uses a specialist external maths library - in their case FFTW, "The Fastest Fourier Transform in the West". SETI supplies this as an external library (a DLL for Windows), and my understanding is that the library alone can detect host capability and utilise SIMD instructions up to AVX if available, even if the calling application hasn't been compiled to use them in the main body of the program. The specific variant I tested does use AVX in both components, though.

It might be worth reporting back your findings to the PARI people, and suggesting that they compare notes with FFTW to see if similar techniques could be employed.
9) Message boards : News : GPU app status update (Message 2376)
Posted 6 Apr 2019 by Richard Haselgrove
Post:
I will be interested in seeing your power consumption analysis.
I'll dig them up, but it may take a while. I posted them on a message board, but I think it was SETI - which has crashed hard this weekend. And I tidied away my notes when I had a visitor last month: that's fatal, of course.
10) Message boards : News : GPU app status update (Message 2371)
Posted 6 Apr 2019 by Richard Haselgrove
Post:
To reassure people with different recollections, I took version 2.12 credit readings from between 21 March and 25 March, before the first adjustments for the GPU release. There were between 47 and 75 results visible for the three machines. For version 3.00, I had between 27 and 40 results available per machine when I started updating the spreadsheet.

Here are the raw figures, expressed as average credits per hour.

Host	v2.12		v3.00
1288	70.5627		70.9940
1290	68.0019		68.0024
1291	72.1462		72.1432
11) Message boards : News : GPU app status update (Message 2370)
Posted 6 Apr 2019 by Richard Haselgrove
Post:
I was a bit taken aback to see the much shorter estimated runtime when I first saw my task list this morning, but once I'd focused on the version number and read this thread, all was explained.

As it happens, I'd started a spreadsheet to measure the performance of my Windows machines in BOINC credit terms. I have three identical i5-4690 @ 3.50GHz CPUs running Windows 7/64, but with different software loaded for different purposes: with version 2.12, they were recording 68, 70, 72 credits per hour with minuscule variation (st_dev down to 0.00073).

Under version 3.00 - exactly the same! I don't know how you managed it, but that's the smoothest version upgrade I've ever seen. No problems with runtime estimates and over/under fetching, no interruption to work flow, no messy credit adjustments. The only thing I haven't checked yet is whether the more efficient application increases the power consumption of the CPU, but I'll check that later - I haven't got the watt-meter in circuit at the moment.

I'd say that was a fair result. We are contributing the same hardware and (subject to checking) the same power, and we've done nothing to optimise our systems. You've done the work, and you've got the benefit in the form of a much increased result rate.

Bravo, and well done. :-)
12) Message boards : News : GPU app - beta version for linux nvidia (Message 2358)
Posted 4 Apr 2019 by Richard Haselgrove
Post:
Yes, I mean the same credit whether done on CPU or GPU. But let's be clear: I mean the same credit PER TASK. If you use a GPU, you will complete many more tasks, so your overall figures - RAC, total credit - will be much larger: there's your reward. But you'll be doing the same work, and that's what the credit system is designed to reflect.

I say 'designed', and I agree the design is flawed: I am particularly concerned that no attempt has been made internally to assess and correct its behaviour since it was launched in 2010. But I would like to go back to a situation where it was expected that all projects paid as near as dammit the same level of credit, so that decisions on what to crunch were made on other factors - scientific interest, publication rate, or whatever else tickles your fancy.
13) Message boards : News : GPU app - beta version for linux nvidia (Message 2354)
Posted 2 Apr 2019 by Richard Haselgrove
Post:
I disagree. Payment should be for the work done in searching - the same number of credits for the search task, whatever the device used.

GPUs will win out vastly in the number of credits awarded per unit of time - per second, per hour, per day, however you choose to measure it. Your own statement that you've found more candidate values since the GPU app was released confirms that you're conducting more searches. Good on you - that's your reward. You don't need to be compensated twice - once for doing more searches, and again for doing searches on a different device.
14) Message boards : Number crunching : Too much credit ? (Message 2353)
Posted 2 Apr 2019 by Richard Haselgrove
Post:
I disagree. Payment should be for the work done in searching - the same number of credits for the search task, whatever the device used.

GPUs will win out vastly in the number of credits awarded per unit of time - per second, per hour, per day, however you choose to measure it. Your own statement that you've found more candidate values since the GPU app was released confirms that you're conducting more searches. Good on you - that's your reward. You don't need to be compensated twice - once for doing more searches, and again for doing searches on a different device.
15) Message boards : Number crunching : Too much credit ? (Message 2347)
Posted 31 Mar 2019 by Richard Haselgrove
Post:
Sure, I can send you my observations via PM if that works for you. Or would the BOINC projects mailing list be better?
Just seen that you've found the 'outlier' setting :-)

CreditNew doesn't kick in until host_app_version.pfc_avg and app_version.pfc_avg have usable values, so you'll probably have to unleash the outliers for a few days to fill up the tables. But there are still genuine outliers in the data, so we'll have to - eventually - find a way of distinguishing them: not by an absolute time value, but something relative.

You'd probably feel able to speak more freely in PM, but I think we need to get this on the record sometime, so I think the time has come for a formal report to the projects mailing list. In the meantime - being Windows only - my GPUs haven't joined the party, but my CPUs are plodding along exactly as normal. I request that the evidence in

https://boincstats.com/en/stats/122/user/detail/1969/charts

be taken into account.
16) Message boards : News : GPU app - beta version for linux nvidia (Message 2346)
Posted 31 Mar 2019 by Richard Haselgrove
Post:
How does the current credit/hour for the GPU compare to other projects?
If they are the same WUs, they should get the same credit - whatever hardware they are run on.

The concept of "credit/hour for the GPU" is a movable feast. I run the same GPUs for both SETI and GPUGrid: the credit awarded is roughly in the ratio 1:20. SETI pays too low, GPUGrid pays much too high. I usually reckon that the staff at Einstein make the most reasonable attempt to follow the definition of the cobblestone (see the section "Claimed and granted credit"). The 'cobblestone' is the Sunday-best formal name for the credit, named after Jeff Cobb of SETI. A 1 GFLOPS device will earn 200 cobblestones per day.
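That definition turns directly into a formula: a device sustaining 1 GFLOPS earns 200 credits per day, so credit is proportional to floating point operations done. A minimal sketch (the function name is mine):

```python
# The cobblestone definition paraphrased above: a device sustaining
# 1 GFLOPS (1e9 FLOPs/s) earns 200 credits per day (86400 s), so
#   credit = FLOPs done * 200 / (1e9 * 86400)
def cobblestones(flops_done: float) -> float:
    """Credit ('cobblestones') earned for a given number of FP operations."""
    return flops_done * 200 / (1e9 * 86400)

# One full day at exactly 1 GFLOPS:
print(cobblestones(1e9 * 86400))  # 200.0
```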

More coming in the other thread.
17) Message boards : Number crunching : Too much credit ? (Message 2337)
Posted 30 Mar 2019 by Richard Haselgrove
Post:
... and the message boards were quiet about it (until now).

It seems like a fixed credit per WU is fair. If a GPU is 20x faster it will get 20x the credit per hour. From what people have said, it seems like other projects pay a disproportionately higher number of credits for GPUs. Is there a good reason for this or is it just to attract more users?
It's a generally observed bit of psychology. If credit is too low, everyone complains. If credit is too high, there's an unspoken (and instinctive) conspiracy of silence.

The definition of credit is 'work done' - the number of floating point operations computed during the course of the work. So, equal credit per task for the same WU is right - until somebody thinks up a more efficient algorithm and computes the task with less work, at which point there's no fair answer.

Incidentally, notice those weasel words 'floating point operations'. Would I be right in thinking that this project mostly utilises integer arithmetic? And would GPUs be especially efficient - i.e. 'fast' - when processing integers? BOINC makes no effort to assess the real speed of GPUs - unlike CPUs, which are benchmarked (badly) in both integer and floating point mode.

Eric, would it be possible for you to make notes on your credit experiences during this transition? I really think that BOINC should - belatedly - address the real effectiveness of these various credit schemes (especially at times of transition), and your cool-headed observations would be most helpful.
18) Message boards : Number crunching : GPU Error (Message 2276)
Posted 26 Mar 2019 by Richard Haselgrove
Post:
ok, system log shows : API 7.5 will be used for the app.

Mean that, that CUDA 7.5 will be used ?
The app was built with cuda version 10.1 I believe.
API 7.5 sounds like a BOINC version number. That has no effect whatsoever on the CUDA version needed or used.
19) Message boards : News : GPU app - beta version for linux nvidia (Message 2244)
Posted 24 Mar 2019 by Richard Haselgrove
Post:
I wonder if I should be calling a function to set the device. I vaguely remember seeing something about that, but I completely forgot to follow up on it. Having only a single GPU, I was not perceptive to this "bug".
Yes, you should.

https://boinc.berkeley.edu/trac/wiki/AppCoprocessor#Deviceselection

Concentrate on boinc_get_init_data() - the older command line --device N is so old it can be relegated to an afterthought.
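For illustration only - in a real app the route to take is the C++ boinc_get_init_data() call documented at the link above. The same APP_INIT_DATA structure is serialized by the client into init_data.xml in the task's slot directory, so a wrapper-style sketch could read the assigned device from there. The <gpu_device_num> field name is my recollection of that serialization, so treat it as an assumption and verify against your BOINC version:

```python
# Hypothetical sketch: read the GPU device number the BOINC client
# assigned to this task from the slot directory's init_data.xml.
# The <gpu_device_num> field name is assumed from APP_INIT_DATA's
# serialization; verify it against your BOINC version before relying on it.
import xml.etree.ElementTree as ET

def assigned_gpu(init_data_path: str = "init_data.xml") -> int:
    """Return the client-assigned GPU index, or -1 if none was given."""
    root = ET.parse(init_data_path).getroot()
    node = root.find("gpu_device_num")
    return int(node.text) if node is not None else -1
```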
20) Message boards : News : Server upgrade (Message 2144)
Posted 25 Jan 2019 by Richard Haselgrove
Post:
It is not just a problem with the stats export. When I look at my account page here at NumberFields@home (but also at other projects) the information for NumberFields@home in the "Projects in which you are participating" list is wrong (the NumberFields@home specific data earlier on the page is correct though). In my case the data in the list is a copy from another project but definitely not the NumberFields@home data.

Can you please have a look at this too ?

Thanks,

Tom
Likewise here. In my case, the cross-project line is copied from the immediately previous project line in alphabetical order.

I've looked at the current stats export for my account:

<user>
 <id>1969</id>
 <name>Richard Haselgrove</name>
 <create_time>1319792132</create_time>
 <total_credit>96705660.619374</total_credit>
 <expavg_credit>34201.016603</expavg_credit>
 <expavg_time>1548365625.070331</expavg_time>
 <cpid>68aa4b6077c3fe48975e530f6ad94ca5</cpid>
</user>
which looks fine, and is also displayed correctly on BOINCstats.

So this looks like a processing error at netsoft online (which provides the aggregated cross-project stats for this list and the cross-project certificate). You may need to liaise with James Drews (contact details at GitHub) to work out what's going on.




Copyright © 2019 Arizona State University