Posts by Richard Haselgrove

1) Message boards : News : Database crash (Message 1939)
Posted 9 days ago by Richard Haselgrove
Post:
Yes, things getting back to normal here too - reported completed work and got a few new ones.

I'm getting some of those "permanent http error" messages too. They seem to be data files missing from the download storage area (HTTP 404), so of course the downloads fail for all replications of the workunit. You'll have a few jobs to re-issue when all this is over.
2) Message boards : News : Minor server overload problems (Message 1929)
Posted 17 days ago by Richard Haselgrove
Post:
It's fairly well known that a stopped feeder puts the server into a form of maintenance mode - nothing gets done, volunteer hosts are backed off for 1 hour (as my log shows). I presume this is deliberate to stop the situation getting worse until it can be inspected.

What that doesn't say is why the feeder stopped in the first place - your second post about the lack of temp file space seems as good an explanation as any.

At intervals throughout the day, I've seen the project come back up fully (so I could report and refill): then go into full maintenance mode with these boards down as well: then return to normal working (the current state: just reported and received new tasks). Which sounds like a good moment to go out and start celebrating the new year...
3) Message boards : News : Minor server overload problems (Message 1926)
Posted 17 days ago by Richard Haselgrove
Post:
Probably related to the other problems, I'm currently unable to report completed tasks: the messages I get are

31/12/2017 09:27:54 | NumberFields@home | Server error: feeder not running
31/12/2017 09:27:54 | NumberFields@home | Project requested delay of 3600 seconds

Not a problem - they can sit here until you're ready for them.
4) Message boards : Number crunching : No work 'ready to send' (Message 1903)
Posted 6 Nov 2017 by Richard Haselgrove
Post:
Yes, I'm topping up nicely. I guess the new batch ran more quickly than our machines are used to after DS12x271.
5) Message boards : Number crunching : No work 'ready to send' (Message 1901)
Posted 5 Nov 2017 by Richard Haselgrove
Post:
Although you opened up new search DS13x270 for us last week, we seem to be getting ahead of you ;-)
6) Message boards : News : implementing SSL on the server (Message 1784)
Posted 16 Nov 2016 by Richard Haselgrove
Post:
I tried out version 5.10.45, as it can be used portable on my USB drive, and received the message:
16-Nov-2016 08:51:21 [NumberFields@home] Scheduler request failed: Peer certificate cannot be authenticated with known CA certificates

I suppose that means no more portable operating for me.

You could try extracting the file 'ca-bundle.crt' from a newer BOINC download and replacing your old one with that. No promise that it will work, though.
7) Message boards : News : implementing SSL on the server (Message 1778)
Posted 11 Oct 2016 by Richard Haselgrove
Post:
I'd be interested to hear if someone found a simpler way to connect using an older machine/manager.

I see that Vitaly also has a Windows 7 machine attached to the project.

If he finds the file 'account_numberfields.asu.edu_NumberFields.xml' in the root of the BOINC data directory on that machine, and copies it to the equivalent location in the data directory of the Ubuntu machine, it *may* attach that machine to the project. You may need to restart the BOINC client/service.

I've used that method to attach a new Windows machine in the past, but I can't be sure that Linux will accept the Windows file format (CRLF line endings, instead of *nix LF only). Also, I did it before SSL came into widespread use: in fact, my own account file still has the http:// master url, and I keep getting nagged to detach and re-attach 'when convenient'. But since it's currently still working, I haven't bothered.
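If the CRLF line endings do turn out to be a problem, they could be converted before copying the file across. A minimal sketch in Python: the sample bytes are an illustrative stand-in rather than a real account file (only the master URL is taken from this thread), and there is no promise that BOINC on Linux actually needs this step.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Convert Windows CRLF line endings to Unix LF; lone LFs are left alone."""
    return data.replace(b"\r\n", b"\n")


# Illustrative stand-in for the account file contents (hypothetical layout).
sample = (
    b"<account>\r\n"
    b"    <master_url>http://numberfields.asu.edu/NumberFields/</master_url>\r\n"
    b"</account>\r\n"
)

converted = crlf_to_lf(sample)
print(converted.decode())
```

To use it on a real file, read the file in binary mode, pass the bytes through crlf_to_lf, and write the result to a new file, keeping the original as a backup.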
8) Message boards : Number crunching : Upload problems? (Message 1771)
Posted 1 Sep 2016 by Richard Haselgrove
Post:
I had that all day (starting about 07:00 UTC): the "transient HTTP error" in question was a server timeout, according to the http_debug log.

I had the same timeout error when attempting to access this website, but when this site came back to life again (about an hour and a half ago), the uploads resumed too. Worth giving them a prod with the Transfers::Retry Now button.
9) Message boards : Number crunching : Qsqrt421_DS3x8 15+ hour running time? (Message 1752)
Posted 6 Jul 2016 by Richard Haselgrove
Post:
I reckon I've got four of those:

wu_Qsqrt421_DS1x8_CV1_S1000_N2_21_N1_-6868to-4346 (the original _0 dated 14 June, still running on a slower machine)
wu_Qsqrt421_DS1x8_CV1_S1000_N2_41_N1_-7063to-4131
wu_Qsqrt421_DS1x8_CV1_S1000_N2_10_N1_-4391to-1809 (I'll abort that one, someone else has completed it)
wu_Qsqrt421_DS1x8_CV1_S1000_N2_28_N1_-4197to-1862

wu_Qsqrt421_DS1x8_CV1_S1000_N2_49_N1_-3971to-1938 had already slipped through - not as slow as the others - before I saw your post.
10) Message boards : Number crunching : Qsqrt421_DS3x8 15+ hour running time? (Message 1749)
Posted 27 Jun 2016 by Richard Haselgrove
Post:
Please don't feel you have to. I had

wu_Qsqrt421_DS1x8_CV1_S1000_N2_49_N1_-7142to-4045_0 for several days last week, but it finished fine and within the original deadline.

There does seem to be a class of workunit which spends a lot of time on the first few %age steps at the start of computation, and a similar slow phase at the end, but which runs through the middle section very quickly.
11) Message boards : News : Bounded App Final Tally of Results (Message 1731)
Posted 23 May 2016 by Richard Haselgrove
Post:
Yes. The original Hobbes link gave me, and is still giving,

This site can’t be reached

hobbes.la.asu.edu refused to connect.

ERR_CONNECTION_REFUSED

The use of the word "refused" implies something more specific than a simple failure.

The locally-hosted copy displays fine, though.
12) Message boards : Number crunching : Upload problems May 12 (Message 1723)
Posted 12 May 2016 by Richard Haselgrove
Post:
LOL - posting about it did the trick. All cleared now.
13) Message boards : Number crunching : Upload problems May 12 (Message 1722)
Posted 12 May 2016 by Richard Haselgrove
Post:
I'm getting upload failures on three machines so far.

12/05/2016 08:58:23 | NumberFields@home | [http] [ID#108] Info:  Connected to numberfields.asu.edu (129.219.51.76) port 80 (#191)
12/05/2016 08:58:23 | NumberFields@home | [http] [ID#108] Sent header to server: POST /NumberFields_cgi/file_upload_handler/ HTTP/1.1
12/05/2016 08:58:23 | NumberFields@home | [http] [ID#108] Sent header to server: Host: numberfields.asu.edu
12/05/2016 08:58:23 | NumberFields@home | [http] [ID#108] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.6.22)
12/05/2016 08:58:23 | NumberFields@home | [http] [ID#108] Sent header to server: Accept: */*
12/05/2016 08:58:23 | NumberFields@home | [http] [ID#108] Sent header to server: Accept-Encoding: deflate, gzip
12/05/2016 08:58:23 | NumberFields@home | [http] [ID#108] Sent header to server: Content-Type: application/x-www-form-urlencoded
12/05/2016 08:58:23 | NumberFields@home | [http] [ID#108] Sent header to server: Accept-Language: en_GB
12/05/2016 08:58:23 | NumberFields@home | [http] [ID#108] Sent header to server: Content-Length: 722
12/05/2016 08:58:23 | NumberFields@home | [http] [ID#108] Sent header to server:
12/05/2016 08:58:23 | NumberFields@home | [http] [ID#108] Info:  We are completely uploaded and fine
12/05/2016 08:59:28 | NumberFields@home | [http] [ID#108] Info:  Operation too slow. Less than 10 bytes/sec transferred the last 60 seconds
12/05/2016 08:59:28 | NumberFields@home | [http] [ID#108] Info:  Closing connection 191
12/05/2016 08:59:28 | NumberFields@home | [http] HTTP error: Timeout was reached
12/05/2016 08:59:28 | NumberFields@home | Temporarily failed upload of wu_sf3_DS-11x271_Grp158560of614400_0_0: transient HTTP error
12/05/2016 08:59:28 | NumberFields@home | Backing off 01:51:48 on upload of wu_sf3_DS-11x271_Grp158560of614400_0_0

The upload appears to complete, but I don't get the necessary acknowledgement that all is well.
14) Message boards : Number crunching : Long running wu_Qsqrt421_DS1x5 units - how long to let them run? (Message 1694)
Posted 19 Apr 2016 by Richard Haselgrove
Post:
This was explained to me once, many years ago - by Rom Walton, I think.

The action was

Exit status	203 (0xcb) EXIT_ABORTED_VIA_GUI

- user clicked the abort button.

They (the BOINC developers) deliberately coded it to throw a breakpoint on abort, to get some state debug information and feed it back to the developer. Perhaps, like in this thread, the user aborted it because it seemed to be in an infinite loop? If so, we'd like to know which code segment had the bug so it could be fixed.

There aren't many developer aids in the standard BOINC client: why they chose this one, I don't know. Maybe there were a lot of infinite loops in their test code?
15) Message boards : Number crunching : Long running wu_Qsqrt421_DS1x5 units - how long to let them run? (Message 1685)
Posted 18 Apr 2016 by Richard Haselgrove
Post:
I uploaded a fix for this 2 days ago, which you may not have picked up yet. The latest version of the app is 2.10.

Correct. He picked up WU 15561275 on 9 April, with application v2.08

So it'll be showing as deadline expired on his local machine, which may account for some of his impatience - but the server isn't expecting it back until 26 April.
16) Message boards : Number crunching : Long running wu_Qsqrt421_DS1x5 units - how long to let them run? (Message 1673)
Posted 16 Apr 2016 by Richard Haselgrove
Post:
... a handful of medium length Qsqrt_DS3x8 cases.

Them's big hands. I've got 11 of them, and I may have to toss back some of the ones that got stuck in a long queue on slower machines.
17) Message boards : Number crunching : Need more Time! (Message 1669)
Posted 16 Apr 2016 by Richard Haselgrove
Post:
Always glad to help. We tracked that one down just in time - task 16717469 has just finished and reported all by itself, taking the evidence with it.
18) Message boards : Number crunching : Long running wu_Qsqrt421_DS1x5 units - how long to let them run? (Message 1668)
Posted 16 Apr 2016 by Richard Haselgrove
Post:
I don't see the WUs you speak of. The earliest ones I see are due April 21; all other WUs on your task list appear to have completed successfully. Do you have a WU name or id?

Vik is possibly reading the 'due date' off BOINC Manager. The slow one we were discussing in the other thread is showing a due date of 16 Apr 2016, 23:56:47 locally, but 26 Apr 2016, 22:56:47 UTC on the website. The difference is made up of 1 hour for time zone offset, and 10 days grace period allowed by the project.
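The arithmetic can be checked with a few lines of Python. This is only a sketch: it assumes the local clock is UTC+1 and takes the 10-day grace period from the figures quoted above; the variable names are my own.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the local machine's clock is one hour ahead of UTC.
local_tz = timezone(timedelta(hours=1))

# Due date as displayed locally in BOINC Manager.
local_due = datetime(2016, 4, 16, 23, 56, 47, tzinfo=local_tz)

# Grace period the project appears to allow on the server side.
grace = timedelta(days=10)

# Add the grace period, then express the result in UTC as the website does.
server_due = (local_due + grace).astimezone(timezone.utc)
print(server_due.isoformat())
```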
19) Message boards : Number crunching : Need more Time! (Message 1664)
Posted 15 Apr 2016 by Richard Haselgrove
Post:
Something else to be aware of...
I have seen the progress meter go to 100.000% and the WU still continues processing for another few hours. I believe what is happening is that the progress is really 99.9995% and the client is rounding it up to 100. No need to worry that it's stuck; the WU will eventually finish.

It's even worse than that. I have a current task which has been running over 5 days, and is displaying 100%. (Don't worry, I've seen several like that, and most of them have completed already - it can take as long as it wants. I think this one has already been showing 100% for well over a day.)

The key file to investigate is boinc_task_state.xml in the task's slot directory.

It says:

<active_task>
    <project_master_url>http://numberfields.asu.edu/NumberFields/</project_master_url>
    <result_name>wu_Qsqrt421_DS3x8_CV1_S815_N2_-194161_N1_805982_k2_-1_0</result_name>
    <checkpoint_cpu_time>421149.300000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>499853.935949</checkpoint_elapsed_time>
    <fraction_done>0.000000</fraction_done>
    <peak_working_set_size>7725056</peak_working_set_size>
    <peak_swap_size>304140288</peak_swap_size>
    <peak_disk_usage>13256</peak_disk_usage>
</active_task>

<fraction_done> would normally be filled in by your application - I think you've said that the overheads of adding reporting at the very innermost loop would be too high, so you've left it a little further out. But that means that in this particular parameter space, all the progress comes from the BOINC client's attempt to reassure that all is well, by inventing its own pseudo-progress to report.
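For anyone who wants to check their own tasks, the relevant fields can be pulled out of the file with Python's standard XML parser. A rough sketch, using a trimmed copy of the XML above as input; the function name and the returned keys are my own invention, not part of BOINC.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the task state quoted above.
task_state = """<active_task>
    <result_name>wu_Qsqrt421_DS3x8_CV1_S815_N2_-194161_N1_805982_k2_-1_0</result_name>
    <checkpoint_cpu_time>421149.300000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>499853.935949</checkpoint_elapsed_time>
    <fraction_done>0.000000</fraction_done>
</active_task>"""


def read_task_state(xml_text):
    """Return the task name, app-reported fraction done, and elapsed days."""
    root = ET.fromstring(xml_text)
    return {
        "result_name": root.findtext("result_name"),
        "fraction_done": float(root.findtext("fraction_done")),
        "elapsed_days": float(root.findtext("checkpoint_elapsed_time")) / 86400,
    }


state = read_task_state(task_state)
print(state)
```

A fraction_done of zero after more than five days of elapsed time is the sign that the figure shown in the Manager is not coming from the application.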

By design, pseudo-progress tends asymptotically to a fraction of 1 (100%), but never reaches it. Because this task has run so far beyond its initial estimate (probably 7 hours on this machine), the asymptotic limit has become indistinguishable from 1 (to three decimal places). It's not the first time that BOINC coding has failed to cope transparently with extreme cases.
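To illustrate the effect, here is a toy model in Python. The exponential form is my assumption, chosen only because it approaches 1 without reaching it; it is not claimed to be the client's actual formula. The 7-hour estimate is the guess mentioned above.

```python
import math


def pseudo_progress(elapsed_s, estimate_s):
    """Toy model (assumed form): approaches 1 as elapsed time grows, never reaches it."""
    return 1.0 - math.exp(-elapsed_s / estimate_s)


estimate = 7 * 3600        # initial estimate, probably about 7 hours
elapsed = 499853.935949    # checkpoint_elapsed_time from the task state file

p = pseudo_progress(elapsed, estimate)
label = f"{100 * p:.3f}%"  # three decimal places, as in the Manager display
print(p < 1.0, label)
```

Under this model the fraction is still strictly below 1, but after nearly 20 times the estimated runtime the gap is far too small to show at three decimal places.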

The checkpoint (state) file for this task contains

0
-194161
805982
-1
91
3518541053
0
0
0
499851

and stderr's report on the Martinet search has reached

Now starting the targeted Martinet search:
    N2_L = -194161.
    N2_U = -194161.
      N2 = -194161.
        N1_L = 805982.
        N1_U = 805982.
          N1 = 805982.
            k2 range: -1 => -1.
            k2 = -1.
            k1 range: 76 => 136.
            k1 = 76.
            k1 = 77.
            k1 = 78.
            k1 = 79.
            k1 = 80.
            k1 = 81.
            k1 = 82.
            k1 = 83.
            k1 = 84.
            k1 = 85.
            k1 = 86.
            k1 = 87.
            k1 = 88.
            k1 = 89.
            k1 = 90.
            k1 = 91.

if that helps you track down what it's up to.
20) Message boards : Number crunching : Need more Time! (Message 1657)
Posted 14 Apr 2016 by Richard Haselgrove
Post:
Easier to find as WU 12731776




Copyright © 2018 Arizona State University