Extra Credit for GetDecics
Author | Message |
---|---|
Send message Joined: 19 Aug 11 Posts: 76 Credit: 2,002,860 RAC: 0 |
I also found some config items that I wasn't previously aware of. One of them is the size of the job cache that the scheduler uses. I bumped this up from the default of 100 to 800 jobs. Let's hope that keeps it from running out of work. could you bump up the cache further? i have outages again and i don't want to run a local buffer of several days. |
Send message Joined: 8 Jul 11 Posts: 1346 Credit: 552,728,324 RAC: 656,707 |
I also found some config items that I wasn't previously aware of. One of them is the size of the job cache that the scheduler uses. I bumped this up from the default of 100 to 800 jobs. Let's hope that keeps it from running out of work. You might be confusing the scheduler job cache with the amount of unsent work. There is plenty of unsent work. The scheduler job cache is the number of jobs that the scheduler keeps on hand, ready to give out to any client that requests work; and this cache is constantly being refilled. The job cache is stored in a shared memory segment to increase communication speed with the clients. I don't think I need to increase the cache any further, nor would I want to, since that would require using more system resources (RAM). If you are still having problems downloading work then I probably need to adjust something else. |
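The distinction drawn in the post above, between a small job cache and a much larger pool of unsent work, can be illustrated with a toy model. This is only a sketch under stated assumptions: the `JobCache` class, its method names, and the `wu_N` job names are hypothetical and are not taken from the BOINC server code.

```python
from collections import deque


class JobCache:
    """Toy model: a fixed-size job cache refilled from a pool of unsent work."""

    def __init__(self, slots, unsent_jobs):
        self.slots = [None] * slots        # fixed-size array (stands in for shared memory)
        self.unsent = deque(unsent_jobs)   # the larger pool of unsent work

    def refill(self):
        """Feeder step (hypothetical): copy unsent jobs into any empty slots."""
        for i, job in enumerate(self.slots):
            if job is None and self.unsent:
                self.slots[i] = self.unsent.popleft()

    def take(self, wanted):
        """Scheduler step (hypothetical): hand out up to `wanted` cached jobs."""
        sent = []
        for i, job in enumerate(self.slots):
            if len(sent) == wanted:
                break
            if job is not None:
                sent.append(job)
                self.slots[i] = None
        return sent


cache = JobCache(slots=4, unsent_jobs=[f"wu_{n}" for n in range(10)])
cache.refill()
first = cache.take(6)    # a request can never get more than the slots hold
cache.refill()
second = cache.take(6)
print(first, second, len(cache.unsent))
```

In this model a client can come away short even though unsent work remains, simply because the slots have not been refilled yet. A larger cache reduces that window at the cost of memory.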
Send message Joined: 19 Aug 11 Posts: 76 Credit: 2,002,860 RAC: 0 |
You might be confusing the scheduler job cache with the amount of unsent work. There is plenty of unsent work. i know - whatever it is, i got WU's now - some time ago the server did not hand out bounded WU's again. probably cache, buffer or something else was completely filled with GetDecics. |
Send message Joined: 19 Aug 11 Posts: 31 Credit: 73,965,386 RAC: 8,118 |
|
Send message Joined: 8 Jul 11 Posts: 1346 Credit: 552,728,324 RAC: 656,707 |
FWIW, I cannot get any GetDecic tasks. Dang it! I thought we fixed that problem. The scheduler must still be getting hung up for some reason. I'll restart the daemons when I get home tonight, in a couple of hours from now. I think Greg is awol today, otherwise I'd ask him to do it. |
Send message Joined: 8 Jul 11 Posts: 1346 Credit: 552,728,324 RAC: 656,707 |
FWIW, I cannot get any GetDecic tasks. Should be good now. |
Send message Joined: 19 Aug 11 Posts: 31 Credit: 73,965,386 RAC: 8,118 |
It seems I keep missing the window to get tasks. 42738 NumberFields@home 3/14/2012 7:02:12 AM Scheduler request completed: got 0 new tasks 42739 NumberFields@home 3/14/2012 7:02:12 AM [sched_op] Server version 613 42740 NumberFields@home 3/14/2012 7:02:12 AM No tasks sent 42741 NumberFields@home 3/14/2012 7:02:12 AM No tasks are available for Get Decic Fields 42742 NumberFields@home 3/14/2012 7:02:12 AM No tasks are available for the applications you have selected. Reno, NV Team: SETI.USA |
Send message Joined: 8 Jul 11 Posts: 1346 Credit: 552,728,324 RAC: 656,707 |
I just checked the feeder log file and it has many GetDecic WUs in its "slots". I wonder if something else is going on, like the feeder thinks your host is not fast enough to handle the tasks. Does this happen on all your hosts? |
Send message Joined: 19 Aug 11 Posts: 76 Credit: 2,002,860 RAC: 0 |
I just checked the feeder log file and it has many GetDecic WUs in its "slots". I wonder if something else is going on, like the feeder thinks your host is not fast enough to handle the tasks. Does this happen on all your hosts? zombie67 does not have hosts here which can be considered really slow. and even then, afaik this would not result in "No tasks are available for Get Decic Fields". |
Send message Joined: 8 Jul 11 Posts: 1346 Credit: 552,728,324 RAC: 656,707 |
I just checked the feeder log file and it has many GetDecic WUs in its "slots". I wonder if something else is going on, like the feeder thinks your host is not fast enough to handle the tasks. Does this happen on all your hosts? I know they're not slow, but I thought the client kept track of certain metrics that might imply that. Remember during the whole Credit New debacle, some really fast hosts were getting very low credits because the client (or maybe the validator?) thought the host was extremely slow. This was all because of some faulty internal parameter that the client was calculating (I'm in too much of a hurry right now to look up the details). Anyways, it was just an idea... |
Send message Joined: 19 Aug 11 Posts: 76 Credit: 2,002,860 RAC: 0 |
I know they're not slow, but I thought the client kept track of certain metrics that might infer that. Remember during the whole Credit New debacle, some really fast hosts were getting very low credits because the client (or maybe the validator?) thought the host was extremely slow. second guess.. ;) we are talking about Space Sciences Laboratory.. remember that one: http://en.wikipedia.org/wiki/Mars_Surveyor_%2798_program ? but this time it's not a faulty parameter, but faulty design that can never work on projects which have huge differences in runtimes of batches on a single app. i'd say that APR-thing is as silly as it can get, but i'd not even bet a penny on DA not coming up with something even worse.. :( |
Send message Joined: 19 Aug 11 Posts: 31 Credit: 73,965,386 RAC: 8,118 |
Yeah, all my hosts are getting this "no tasks available." They are all fast Core2 or better. Reno, NV Team: SETI.USA |
Send message Joined: 8 Jul 11 Posts: 1346 Credit: 552,728,324 RAC: 656,707 |
Yeah, all my hosts are getting this "no tasks available." They are all fast Core2 or better. Yeah, from looking at your hosts, I can't see any reason you should have a problem. But the fact that it happens on all your hosts might be a clue to what the problem is. One thought that occurred to me is that the scheduler can sometimes deem a host unreliable if the error rate is too high, which might occur from too many aborts. If that were the case, then the feeder would still issue new WUs (those ending in _0), but maybe it has none of those available in its job cache. I temporarily increased the maximum error rate to guard against this type of thing. The only other thing I can think of would be resetting the project. Maybe try that on one of your hosts and see if that helps. If anyone else is having a problem downloading tasks, please let me know. |
Send message Joined: 3 Sep 11 Posts: 30 Credit: 10,789,219 RAC: 11,430 |
I have been having the same problem on my Linux hosts only. All my computers are the same AMD Phenom types and only the Windows ones get any work; every attempt for the last 3 days on the Linux machines has met with the "No Work Available" message. Run times have not been a problem, ranging from 33.74 hours to 89.9 hours. The fastest times have been on one of the Linux machines. Windows has thrown 2 results with 87 and 89 hour run times but all other results have been 60 hours or less. Conan |
Send message Joined: 19 Aug 11 Posts: 31 Credit: 73,965,386 RAC: 8,118 |
FYI, I am now getting: 15214 NumberFields@home 3/14/2012 4:27:30 PM update requested by user 15219 3/14/2012 4:28:31 PM Project communication failed: attempting access to reference site 15220 3/14/2012 4:28:32 PM Internet access OK - project servers may be temporarily down. And at this same time, the server status page is all green. No problem getting to other projects. I wonder if this could have something to do with BAM? Maybe they are using the wrong URL to attach to the project, and it's leading to all these issues? Edit: Nope. Not a BAM issue. I just detached from BAM, detached from the project, and attached manually via BOINCmgr: 15531 3/14/2012 4:33:21 PM Removing account manager info 15538 NumberFields@home 3/14/2012 4:33:30 PM Resetting project 15539 NumberFields@home 3/14/2012 4:33:30 PM Detaching from project 15543 3/14/2012 4:34:16 PM Project communication failed: attempting access to reference site 15544 3/14/2012 4:34:18 PM Internet access OK - project servers may be temporarily down. 15547 3/14/2012 4:34:37 PM Fetching configuration file from http://numberfields.asu.edu/NumberFields/get_project_config.php 15552 3/14/2012 4:35:01 PM Project communication failed: attempting access to reference site 15553 3/14/2012 4:35:02 PM Internet access OK - project servers may be temporarily down. Reno, NV Team: SETI.USA |
Send message Joined: 28 Oct 11 Posts: 180 Credit: 259,785,097 RAC: 215,693 |
Enable <http_debug>, see what IP address it's trying to contact? Compare notes with Eric, and see if you can reach that IP with normal tools - browser, ping, tracert? |
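For anyone who has not set a log flag before, here is a minimal sketch of generating a `cc_config.xml` that turns on `<http_debug>`. It assumes the usual layout of a `<cc_config>` root with a `<log_flags>` child. The helper name `build_cc_config` is made up for illustration, and `sched_op_debug` is included only because `[sched_op]` lines appear in the logs in this thread.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(root, encoding="unicode")


config_text = build_cc_config(["http_debug", "sched_op_debug"])
print(config_text)
```

The resulting text would be saved as `cc_config.xml` in the BOINC data directory. The client then has to re-read its config file or be restarted before the extra `[http]` lines show up in the event log.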
Send message Joined: 19 Aug 11 Posts: 31 Credit: 73,965,386 RAC: 8,118 |
Well of course, now the server problem has gone away. But still am unable to get work. For the record, here is the HTTP info. What's up with all the "stat.la.asu.edu" still in there? 17977 NumberFields@home 3/14/2012 5:16:14 PM [sched_op] Starting scheduler request 17978 NumberFields@home 3/14/2012 5:16:14 PM Sending scheduler request: To fetch work. 17979 NumberFields@home 3/14/2012 5:16:14 PM Requesting new tasks for CPU 17980 NumberFields@home 3/14/2012 5:16:14 PM [sched_op] CPU work request: 86400.00 seconds; 0.40 devices 17981 NumberFields@home 3/14/2012 5:16:14 PM [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices 17982 NumberFields@home 3/14/2012 5:16:14 PM [sched_op] ATI work request: 0.00 seconds; 0.00 devices 17983 NumberFields@home 3/14/2012 5:16:14 PM [http] HTTP_OP::init_post(): http://stat.la.asu.edu/NumberFields_cgi/cgi 17984 NumberFields@home 3/14/2012 5:16:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle set 17985 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Info: About to connect() to stat.la.asu.edu port 80 (#3) 17986 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Info: Trying 129.219.44.120... 
17987 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Info: Connected to stat.la.asu.edu (129.219.44.120) port 80 (#3) 17988 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: POST /NumberFields_cgi/cgi HTTP/1.1 17989 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.0.20) 17990 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: Host: stat.la.asu.edu 17991 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: Accept: */* 17992 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: Accept-Encoding: deflate, gzip 17993 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: Content-Type: application/x-www-form-urlencoded 17994 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: Content-Length: 11523 17995 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: Expect: 100-continue 17996 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Sent header to server: 17997 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: HTTP/1.1 100 Continue 17998 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: HTTP/1.1 200 OK 17999 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: Date: Thu, 15 Mar 2012 00:16:21 GMT 18000 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: Server: Apache/2.2.14 (Ubuntu) 18001 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: Vary: Accept-Encoding 18002 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: Content-Encoding: gzip 18003 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: Content-Length: 600 18004 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: Content-Type: text/xml 
18005 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Received header from server: 18006 NumberFields@home 3/14/2012 5:16:15 PM [http] [ID#1] Info: Connection #3 to host stat.la.asu.edu left intact 18007 NumberFields@home 3/14/2012 5:16:15 PM Scheduler request completed: got 0 new tasks 18008 NumberFields@home 3/14/2012 5:16:15 PM [sched_op] Server version 613 18009 NumberFields@home 3/14/2012 5:16:15 PM No tasks sent 18010 NumberFields@home 3/14/2012 5:16:15 PM No tasks are available for Get Decic Fields 18011 NumberFields@home 3/14/2012 5:16:15 PM No tasks are available for the applications you have selected. 18012 NumberFields@home 3/14/2012 5:16:15 PM Project requested delay of 21 seconds 18013 NumberFields@home 3/14/2012 5:16:15 PM [sched_op] Deferring communication for 21 sec 18014 NumberFields@home 3/14/2012 5:16:15 PM [sched_op] Reason: requested by project Reno, NV Team: SETI.USA |
Send message Joined: 8 Jul 11 Posts: 1346 Credit: 552,728,324 RAC: 656,707 |
Well of course, now the server problem has gone away. But still am unable to get work. For the record, here is the HTTP info. What's up with all the "stat.la.asu.edu" still in there? As a test, in the project preferences could you select the option that allows work for other apps if the selected apps have no work available (Or just select the Bounded Decic App). This will let me know if this is a problem specific to the GetDecics app, or with the entire project. As far as the stat.la.asu.edu url, this is still being used inside the config file. Somebody can correct me if I am wrong, but I don't think this should matter, because the DNS should still resolve the url to the correct IP address (since the numberfields url and the stat url point to the same place). |
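The DNS point above is easy to check from any client machine. The sketch below compares the addresses that two hostnames resolve to. The function name `same_address` and the injectable `resolve` parameter are illustrative choices, not part of any BOINC tool. The demo uses a fake lookup table so it runs without network access, and the IP in that table is the one shown in the `[http]` log earlier in the thread.

```python
import socket


def same_address(host_a, host_b, resolve=socket.gethostbyname):
    """Resolve both hostnames and report whether they map to the same IPv4 address."""
    ip_a = resolve(host_a)
    ip_b = resolve(host_b)
    return ip_a == ip_b, ip_a, ip_b


# Offline demo with a stand-in resolver (assumed values, for illustration only).
fake_dns = {
    "numberfields.asu.edu": "129.219.44.120",
    "stat.la.asu.edu": "129.219.44.120",
}
match, ip_a, ip_b = same_address("numberfields.asu.edu", "stat.la.asu.edu",
                                 resolve=fake_dns.__getitem__)
print(match, ip_a, ip_b)
```

Calling `same_address("numberfields.asu.edu", "stat.la.asu.edu")` without the `resolve` argument performs real lookups through `socket.gethostbyname`, which raises `socket.gaierror` if a name cannot be resolved.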
Send message Joined: 19 Aug 11 Posts: 31 Credit: 73,965,386 RAC: 8,118 |
Here you go. It downloaded 11 tasks with both apps selected, all are "Get Decics with Bounded Discriminant v2.04". You can see here: http://numberfields.asu.edu/NumberFields/results.php?hostid=3286 22257 NumberFields@home 3/14/2012 6:32:34 PM Requesting new tasks for CPU 22258 NumberFields@home 3/14/2012 6:32:34 PM [sched_op] CPU work request: 86220.41 seconds; 0.29 devices 22259 NumberFields@home 3/14/2012 6:32:34 PM [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices 22260 NumberFields@home 3/14/2012 6:32:34 PM [sched_op] ATI work request: 0.00 seconds; 0.00 devices 22261 NumberFields@home 3/14/2012 6:32:35 PM Scheduler request completed: got 11 new tasks 22262 NumberFields@home 3/14/2012 6:32:35 PM [sched_op] Server version 613 22263 NumberFields@home 3/14/2012 6:32:35 PM Project requested delay of 21 seconds 22264 NumberFields@home 3/14/2012 6:32:35 PM [sched_op] estimated total CPU task duration: 85470 seconds 22265 NumberFields@home 3/14/2012 6:32:35 PM [sched_op] estimated total NVIDIA task duration: 0 seconds 22266 NumberFields@home 3/14/2012 6:32:35 PM [sched_op] estimated total ATI task duration: 0 seconds 22267 NumberFields@home 3/14/2012 6:32:35 PM [sched_op] Deferring communication for 21 sec 22268 NumberFields@home 3/14/2012 6:32:35 PM [sched_op] Reason: requested by project Reno, NV Team: SETI.USA |
Send message Joined: 8 Jul 11 Posts: 1346 Credit: 552,728,324 RAC: 656,707 |
Here you go. It downloaded 11 tasks with both apps selected, all are "Get Decics with Bounded Discriminant v2.04". You can see here: http://numberfields.asu.edu/NumberFields/results.php?hostid=3286 When you get a chance, try only the GetDecics again. I disabled the "accelerating retries" mechanism; that's what determines if a host is reliable. The reason I did this is that there have been so many timeouts and aborts that almost every WU being issued is a retry (name ends in _x where x>0) and these will only be issued to a "reliable" host. My guess is that, for whatever reason, the feeder has deemed your hosts unreliable. That's my best theory right now... |
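The theory in the last post can be written out as a small sketch. It is a toy model of the behaviour as described in this thread, not the actual BOINC scheduler logic. The function names, the `reliable_only_retries` switch, and the sample result names are all hypothetical.

```python
import re


def retry_index(result_name):
    """Return the trailing _N index of a result name, or None if there is none."""
    match = re.search(r"_(\d+)$", result_name)
    return int(match.group(1)) if match else None


def is_retry(result_name):
    """A result counts as a retry when its name ends in _x with x > 0."""
    index = retry_index(result_name)
    return index is not None and index > 0


def may_send(result_name, host_is_reliable, reliable_only_retries=True):
    """Decide whether a host may receive this result under the described rule."""
    if reliable_only_retries and is_retry(result_name):
        return host_is_reliable
    return True


# Hypothetical result names: a first issue (_0) and a retry (_2).
for name in ("wu_example_0", "wu_example_2"):
    print(name, may_send(name, host_is_reliable=False))
```

With the rule switched on, a host flagged as unreliable can only receive `_0` results. If nearly everything in the job cache is a retry, that host sees "No tasks are available" even though the cache is full. Passing `reliable_only_retries=False` models the change described above, where the mechanism is disabled.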