Extra Credit for GetDecics

frankhagen
Joined: 19 Aug 11
Posts: 76
Credit: 2,002,860
RAC: 0
Message 505 - Posted: 12 Mar 2012, 14:36:58 UTC - in response to Message 475.  

I also found some config items that I wasn't previously aware of. One of them is the size of the job cache that the scheduler uses. I bumped this up from the default of 100 to 800 jobs. Let's hope that keeps it from running out of work.


could you bump up the cache further?

i have outages again and i don't want to run a local buffer of several days.

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,722,178
RAC: 288,032
Message 506 - Posted: 12 Mar 2012, 16:23:58 UTC - in response to Message 505.  

I also found some config items that I wasn't previously aware of. One of them is the size of the job cache that the scheduler uses. I bumped this up from the default of 100 to 800 jobs. Let's hope that keeps it from running out of work.


could you bump up the cache further?

i have outages again and i don't want to run a local buffer of several days.


You might be confusing the scheduler job cache with the amount of unsent work. There is plenty of unsent work. The scheduler job cache is the number of jobs that the scheduler keeps on hand, ready to give out to any client that requests work; and this cache is constantly being refilled. The job cache is stored in a shared memory segment to increase communication speed with the clients.

I don't think I need to increase the cache anymore, nor would I want to since that would require using more system resources (RAM). If you are still having problems downloading work then I probably need to adjust something else.
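The distinction drawn above between unsent work and the scheduler job cache can be pictured with a toy model. The sketch below is only an illustration: the class and method names are hypothetical, it uses an ordinary Python list rather than a shared memory segment, and it is not BOINC's actual feeder or scheduler code.

```python
from collections import deque


class JobCache:
    """Toy model of a fixed-size scheduler job cache (hypothetical, not BOINC code)."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots  # None marks an empty slot

    def refill(self, unsent):
        """Feeder step: copy unsent jobs into empty slots; return how many were added."""
        added = 0
        for i, job in enumerate(self.slots):
            if job is None and unsent:
                self.slots[i] = unsent.popleft()
                added += 1
        return added

    def take(self, wanted):
        """Scheduler step: hand out up to `wanted` jobs and free their slots."""
        sent = []
        for i, job in enumerate(self.slots):
            if job is not None and len(sent) < wanted:
                sent.append(job)
                self.slots[i] = None
        return sent


unsent = deque(f"wu_{n}" for n in range(2000))  # stand-in for the unsent work pool
cache = JobCache(num_slots=800)                 # cache size quoted in the thread
cache.refill(unsent)
first_batch = cache.take(11)
cache.refill(unsent)                            # freed slots are topped up again
```

In this model a larger cache only changes how many jobs are staged at once; the pool of unsent work behind it is unaffected, which is the point being made in the post.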

frankhagen
Joined: 19 Aug 11
Posts: 76
Credit: 2,002,860
RAC: 0
Message 508 - Posted: 12 Mar 2012, 16:47:44 UTC - in response to Message 506.  

You might be confusing the scheduler job cache with the amount of unsent work. There is plenty of unsent work.

i know - whatever it is, i got WU's now - some time ago the server did not hand out bounded WU's again.

probably cache, buffer or something else was completely filled with GetDecics.

zombie67 [MM]
Joined: 19 Aug 11
Posts: 31
Credit: 63,078,601
RAC: 27,106
Message 514 - Posted: 13 Mar 2012, 22:24:46 UTC

FWIW, I cannot get any GetDecic tasks.
Reno, NV
Team: SETI.USA

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,722,178
RAC: 288,032
Message 515 - Posted: 14 Mar 2012, 0:55:50 UTC - in response to Message 514.  

FWIW, I cannot get any GetDecic tasks.


Dang it! I thought we fixed that problem. The scheduler must still be getting hung up for some reason. I'll restart the daemons when I get home tonight, a couple of hours from now. I think Greg is AWOL today, otherwise I'd ask him to do it.

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,722,178
RAC: 288,032
Message 516 - Posted: 14 Mar 2012, 3:43:19 UTC - in response to Message 514.  

FWIW, I cannot get any GetDecic tasks.


Should be good now.

zombie67 [MM]
Joined: 19 Aug 11
Posts: 31
Credit: 63,078,601
RAC: 27,106
Message 517 - Posted: 14 Mar 2012, 14:04:56 UTC

It seems I keep missing the window to get tasks.

42738 NumberFields@home 3/14/2012 7:02:12 AM Scheduler request completed: got 0 new tasks
42739 NumberFields@home 3/14/2012 7:02:12 AM [sched_op] Server version 613
42740 NumberFields@home 3/14/2012 7:02:12 AM No tasks sent
42741 NumberFields@home 3/14/2012 7:02:12 AM No tasks are available for Get Decic Fields
42742 NumberFields@home 3/14/2012 7:02:12 AM No tasks are available for the applications you have selected.

Reno, NV
Team: SETI.USA

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,722,178
RAC: 288,032
Message 518 - Posted: 14 Mar 2012, 14:31:18 UTC - in response to Message 517.  

I just checked the feeder log file and it has many GetDecic WUs in its "slots". I wonder if something else is going on, like the feeder thinks your host is not fast enough to handle the tasks. Does this happen on all your hosts?

frankhagen
Joined: 19 Aug 11
Posts: 76
Credit: 2,002,860
RAC: 0
Message 520 - Posted: 14 Mar 2012, 17:39:48 UTC - in response to Message 518.  

I just checked the feeder log file and it has many GetDecic WUs in its "slots". I wonder if something else is going on, like the feeder thinks your host is not fast enough to handle the tasks. Does this happen on all your hosts?


zombie67 does not have hosts here which can be considered really slow. and even then, afaik this would not result in "No tasks are available for Get Decic Fields ".

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,722,178
RAC: 288,032
Message 522 - Posted: 14 Mar 2012, 18:43:15 UTC - in response to Message 520.  

I just checked the feeder log file and it has many GetDecic WUs in its "slots". I wonder if something else is going on, like the feeder thinks your host is not fast enough to handle the tasks. Does this happen on all your hosts?


zombie67 does not have hosts here which can be considered really slow. and even then, afaik this would not result in "No tasks are available for Get Decic Fields ".


I know they're not slow, but I thought the client kept track of certain metrics that might imply that. Remember during the whole Credit New debacle, some really fast hosts were getting very low credits because the client (or maybe the validator?) thought the host was extremely slow. This was all because of some faulty internal parameter that the client was calculating (I'm in too much of a hurry right now to look up the details). Anyways, it was just an idea...

frankhagen
Joined: 19 Aug 11
Posts: 76
Credit: 2,002,860
RAC: 0
Message 523 - Posted: 14 Mar 2012, 19:15:26 UTC - in response to Message 522.  
Last modified: 14 Mar 2012, 19:30:57 UTC

I know they're not slow, but I thought the client kept track of certain metrics that might imply that. Remember during the whole Credit New debacle, some really fast hosts were getting very low credits because the client (or maybe the validator?) thought the host was extremely slow.


second guess.. ;)

we are talking about Space Sciences Laboratory..

remember that one: http://en.wikipedia.org/wiki/Mars_Surveyor_%2798_program ?

but this time it's not a faulty parameter, but faulty design that can never work on projects which have huge differences in runtimes of batches on a single app.

i'd say that APR-thing is as silly as it can get, but i'd not even bet a penny on DA not coming up with something even worse.. :(

zombie67 [MM]
Joined: 19 Aug 11
Posts: 31
Credit: 63,078,601
RAC: 27,106
Message 524 - Posted: 14 Mar 2012, 19:15:35 UTC

Yeah, all my hosts are getting this "no tasks available." They are all fast Core2 or better.
Reno, NV
Team: SETI.USA

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,722,178
RAC: 288,032
Message 525 - Posted: 14 Mar 2012, 22:17:17 UTC - in response to Message 524.  

Yeah, all my hosts are getting this "no tasks available." They are all fast Core2 or better.


Yeah, from looking at your hosts, I can't see any reason you should have a problem. But the fact that it happens on all your hosts might be a clue to what the problem is.

One thought that occurred to me is that the scheduler can sometimes deem a host unreliable if its error rate is too high, which might happen after too many aborts. If that were the case, then the feeder would still issue new WUs (those ending in _0), but maybe it has none of those available in its job cache. I temporarily increased the maximum error rate to guard against this type of thing.

The only other thing I can think of would be resetting the project. Maybe try that on one of your hosts and see if that helps.

If anyone else is having a problem downloading tasks, please let me know.

Conan
Joined: 3 Sep 11
Posts: 30
Credit: 7,701,817
RAC: 3,703
Message 526 - Posted: 14 Mar 2012, 22:40:10 UTC
Last modified: 14 Mar 2012, 22:43:47 UTC

I have been having the same problem on my Linux hosts only.
All my computers are the same AMD Phenom types, and only the Windows ones get any work; every attempt for the last 3 days on the Linux machines has met with the "No Work Available" message.

Run times have not been a problem, ranging from 33.74 hours to 89.9 hours.

The fastest times have been on one of the Linux machines.

Windows has thrown 2 results with 87 and 89 hour run times but all other results have been 60 hours or less.

Conan

zombie67 [MM]
Joined: 19 Aug 11
Posts: 31
Credit: 63,078,601
RAC: 27,106
Message 527 - Posted: 14 Mar 2012, 23:32:31 UTC
Last modified: 14 Mar 2012, 23:36:32 UTC

FYI, I am now getting:

15214 NumberFields@home 3/14/2012 4:27:30 PM update requested by user
15219 3/14/2012 4:28:31 PM Project communication failed: attempting access to reference site
15220 3/14/2012 4:28:32 PM Internet access OK - project servers may be temporarily down.

And at this same time, the server status page is all green. No problem getting to other projects.

I wonder if this could have something to do with BAM? Maybe they are using the wrong URL to attach to the project, and that is leading to all these issues?



Edit: Nope. Not a BAM issue. I just detached from BAM, detached from the project, and attached manually via BOINCmgr:


15531 3/14/2012 4:33:21 PM Removing account manager info
15538 NumberFields@home 3/14/2012 4:33:30 PM Resetting project
15539 NumberFields@home 3/14/2012 4:33:30 PM Detaching from project
15543 3/14/2012 4:34:16 PM Project communication failed: attempting access to reference site
15544 3/14/2012 4:34:18 PM Internet access OK - project servers may be temporarily down.
15547 3/14/2012 4:34:37 PM Fetching configuration file from http://numberfields.asu.edu/NumberFields/get_project_config.php
15552 3/14/2012 4:35:01 PM Project communication failed: attempting access to reference site
15553 3/14/2012 4:35:02 PM Internet access OK - project servers may be temporarily down.
Reno, NV
Team: SETI.USA

Richard Haselgrove
Joined: 28 Oct 11
Posts: 179
Credit: 220,376,862
RAC: 128,125
Message 528 - Posted: 14 Mar 2012, 23:54:07 UTC - in response to Message 527.  

Enable <http_debug>, see what IP address it's trying to contact?

Compare notes with Eric, and see if you can reach that IP with normal tools - browser, ping, tracert?
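Once <http_debug> output is available, the host and IP address the client actually contacted can be pulled out of the event log mechanically. The following is a minimal Python sketch, assuming the "Connected to host (ip) port n" line format seen in the log zombie67 posts in the next message; the function name and the regular expression are illustrative, not part of BOINC.

```python
import re

# Matches the "Connected to <host> (<ip>) port <n>" lines that appear in the
# event log when http_debug output is enabled (format assumed from this thread).
CONNECTED = re.compile(r"Info:\s+Connected to (\S+) \((\d+\.\d+\.\d+\.\d+)\) port (\d+)")


def contacted_endpoints(log_text):
    """Return the unique (host, ip, port) tuples found in an event-log dump."""
    seen = []
    for line in log_text.splitlines():
        match = CONNECTED.search(line)
        if match:
            endpoint = (match.group(1), match.group(2), int(match.group(3)))
            if endpoint not in seen:
                seen.append(endpoint)
    return seen


sample = (
    "17986 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Info: Trying 129.219.44.120...\n"
    "17987 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Info: Connected to "
    "stat.la.asu.edu (129.219.44.120) port 80 (#3)\n"
)
endpoints = contacted_endpoints(sample)
```

The resulting list gives the address to try with a browser, ping, or tracert.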

zombie67 [MM]
Joined: 19 Aug 11
Posts: 31
Credit: 63,078,601
RAC: 27,106
Message 529 - Posted: 15 Mar 2012, 0:19:55 UTC
Last modified: 15 Mar 2012, 0:20:51 UTC

Well of course, now the server problem has gone away. But I am still unable to get work. For the record, here is the HTTP info. What's up with all the "stat.la.asu.edu" still in there?


17977 NumberFields@home 3/14/2012 5:16:14 PM [sched_op] Starting scheduler request
17978 NumberFields@home 3/14/2012 5:16:14 PM Sending scheduler request: To fetch work.
17979 NumberFields@home 3/14/2012 5:16:14 PM Requesting new tasks for CPU
17980 NumberFields@home 3/14/2012 5:16:14 PM [sched_op] CPU work request: 86400.00 seconds; 0.40 devices
17981 NumberFields@home 3/14/2012 5:16:14 PM [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices
17982 NumberFields@home 3/14/2012 5:16:14 PM [sched_op] ATI work request: 0.00 seconds; 0.00 devices
17983 NumberFields@home 3/14/2012 5:16:14 PM [http] HTTP_OP::init_post(): http://stat.la.asu.edu/NumberFields_cgi/cgi
17984 NumberFields@home 3/14/2012 5:16:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle set
17985 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Info: About to connect() to stat.la.asu.edu port 80 (#3)
17986 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Info: Trying 129.219.44.120...
17987 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Info: Connected to stat.la.asu.edu (129.219.44.120) port 80 (#3)
17988 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: POST /NumberFields_cgi/cgi HTTP/1.1
17989 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.0.20)
17990 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: Host: stat.la.asu.edu
17991 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: Accept: */*
17992 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: Accept-Encoding: deflate, gzip
17993 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: Content-Type: application/x-www-form-urlencoded
17994 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: Content-Length: 11523
17995 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: Expect: 100-continue
17996 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server:
17997 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: HTTP/1.1 100 Continue
17998 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: HTTP/1.1 200 OK
17999 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: Date: Thu, 15 Mar 2012 00:16:21 GMT
18000 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: Server: Apache/2.2.14 (Ubuntu)
18001 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: Vary: Accept-Encoding
18002 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: Content-Encoding: gzip
18003 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: Content-Length: 600
18004 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: Content-Type: text/xml
18005 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server:
18006 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Info: Connection #3 to host stat.la.asu.edu left intact
18007 NumberFields@home 3/14/2012 5:16:15 PM Scheduler request completed: got 0 new tasks
18008 NumberFields@home 3/14/2012 5:16:15 PM [sched_op] Server version 613
18009 NumberFields@home 3/14/2012 5:16:15 PM No tasks sent
18010 NumberFields@home 3/14/2012 5:16:15 PM No tasks are available for Get Decic Fields
18011 NumberFields@home 3/14/2012 5:16:15 PM No tasks are available for the applications you have selected.
18012 NumberFields@home 3/14/2012 5:16:15 PM Project requested delay of 21 seconds
18013 NumberFields@home 3/14/2012 5:16:15 PM [sched_op] Deferring communication for 21 sec
18014 NumberFields@home 3/14/2012 5:16:15 PM [sched_op] Reason: requested by project
Reno, NV
Team: SETI.USA

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,722,178
RAC: 288,032
Message 530 - Posted: 15 Mar 2012, 1:22:14 UTC - in response to Message 529.  

Well of course, now the server problem has gone away. But still am unable to get work. For the record, here is the HTTP info. What's up with all the "stat.la.asu.edu" still in there?


As a test, in the project preferences, could you select the option that allows work for other apps if the selected apps have no work available (or just select the Bounded Decic app)? This will let me know if this is a problem specific to the GetDecics app, or with the entire project.

As far as the stat.la.asu.edu url, this is still being used inside the config file. Somebody can correct me if I am wrong, but I don't think this should matter, because the DNS should still resolve the url to the correct IP address (since the numberfields url and the stat url point to the same place).
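The claim that both hostnames point to the same place can be checked from any machine by resolving each name and comparing the addresses. This is a small Python sketch under that assumption; the function names are hypothetical, and the resolver is passed in as a parameter only so the comparison logic can be exercised without a network connection.

```python
import socket


def resolve_all(hostname, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses that `hostname` resolves to."""
    _name, _aliases, addresses = resolver(hostname)
    return set(addresses)


def same_server(host_a, host_b, resolver=socket.gethostbyname_ex):
    """True if the two hostnames share at least one resolved address."""
    return bool(resolve_all(host_a, resolver) & resolve_all(host_b, resolver))
```

Calling same_server("numberfields.asu.edu", "stat.la.asu.edu") performs live DNS lookups, so the answer depends on the DNS records at the time it is run.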

zombie67 [MM]
Joined: 19 Aug 11
Posts: 31
Credit: 63,078,601
RAC: 27,106
Message 531 - Posted: 15 Mar 2012, 1:36:43 UTC

Here you go. It downloaded 11 tasks with both apps selected, all are "Get Decics with Bounded Discriminant v2.04". You can see here: http://numberfields.asu.edu/NumberFields/results.php?hostid=3286

22257 NumberFields@home 3/14/2012 6:32:34 PM Requesting new tasks for CPU
22258 NumberFields@home 3/14/2012 6:32:34 PM [sched_op] CPU work request: 86220.41 seconds; 0.29 devices
22259 NumberFields@home 3/14/2012 6:32:34 PM [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices
22260 NumberFields@home 3/14/2012 6:32:34 PM [sched_op] ATI work request: 0.00 seconds; 0.00 devices
22261 NumberFields@home 3/14/2012 6:32:35 PM Scheduler request completed: got 11 new tasks
22262 NumberFields@home 3/14/2012 6:32:35 PM [sched_op] Server version 613
22263 NumberFields@home 3/14/2012 6:32:35 PM Project requested delay of 21 seconds
22264 NumberFields@home 3/14/2012 6:32:35 PM [sched_op] estimated total CPU task duration: 85470 seconds
22265 NumberFields@home 3/14/2012 6:32:35 PM [sched_op] estimated total NVIDIA task duration: 0 seconds
22266 NumberFields@home 3/14/2012 6:32:35 PM [sched_op] estimated total ATI task duration: 0 seconds
22267 NumberFields@home 3/14/2012 6:32:35 PM [sched_op] Deferring communication for 21 sec
22268 NumberFields@home 3/14/2012 6:32:35 PM [sched_op] Reason: requested by project

Reno, NV
Team: SETI.USA

Eric Driver
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Jul 11
Posts: 1318
Credit: 403,722,178
RAC: 288,032
Message 532 - Posted: 15 Mar 2012, 5:39:56 UTC - in response to Message 531.  

Here you go. It downloaded 11 tasks with both apps selected, all are "Get Decics with Bounded Discriminant v2.04". You can see here: http://numberfields.asu.edu/NumberFields/results.php?hostid=3286


When you get a chance, try only the GetDecics again.

I disabled the "accelerating retries" mechanism; that's what determines if a host is reliable. I did this because there have been so many timeouts and aborts that almost every WU being issued is a retry (name ends in _x where x>0), and these will only be issued to a "reliable" host. My guess is that, for whatever reason, the feeder has deemed your hosts unreliable. That's my best theory right now...
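The rule described above (retries go only to "reliable" hosts, first issues go to anyone) can be written down as a simplified model. The Python sketch below is an assumption-laden illustration of that description, not the BOINC scheduler's actual logic; the function names, the flag name, and the sample result names are all hypothetical.

```python
def retry_index(result_name):
    """Return the numeric suffix of a result name such as 'wu_abc_2' (here 2)."""
    _, _, suffix = result_name.rpartition("_")
    return int(suffix)


def may_send(result_name, host_reliable, retries_need_reliable_host=True):
    """Decide whether a result can go to a host under the simplified rule."""
    is_retry = retry_index(result_name) > 0
    if is_retry and retries_need_reliable_host:
        return host_reliable
    return True


# With the restriction on, an "unreliable" host only qualifies for _0 results.
queue = ["wu_a_1", "wu_b_2", "wu_c_0", "wu_d_3"]
restricted = [r for r in queue if may_send(r, host_reliable=False)]
unrestricted = [r for r in queue if may_send(r, False, retries_need_reliable_host=False)]
```

When most of the queue consists of retries, a host flagged as unreliable sees almost nothing it is allowed to receive, which would match the "No tasks are available" messages reported earlier in the thread.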