Message boards :
Number crunching :
Team credit lost !
Author | Message |
---|---|
Joined: 13 Apr 18 Posts: 5 Credit: 17,061,620 RAC: 0 |
Hello, before the crash my team (BOINCBE) had 22,414,026 credits (confirmed by SETIBZH and BOINCSTATS). Now it shows only 22,273,654, and that's after 85 new WUs received and returned today. So my team lost 140K!? Also, all the WUs I received on 09 Jan are finished and were returned today, but they don't appear. Look at host 156555. Everything was uploaded today, but it doesn't seem to be in my stats. The deadline was extended to 30 Jan, but I no longer have those tasks. Can someone explain? Best regards |
Joined: 8 Jul 11 Posts: 1341 Credit: 494,193,741 RAC: 559,842 |
Unfortunately, there will be 2 days of lost credit, because the database was restored to a point 2 days before the crash. Furthermore, any task sent out within that 2-day window (between the db restore point and the crash) will not be recognized by the server, so although the result will upload, the server doesn't know how to award credit for it. Sorry for the lost credit, but I don't think there is anything I can do about it. Even if I knew what the lost credit was, updating it manually for thousands of users would be a nightmare. |
Joined: 23 Jun 17 Posts: 5 Credit: 42,264,426 RAC: 0 |
I have 301 tasks in progress, but all of those were finished before the crash. They were waiting to be reported, but it seems none have been credited. I understand if nothing can be done. Will the tasks just reach their deadline and abort? Thanks, Steve |
Joined: 8 Jul 11 Posts: 1341 Credit: 494,193,741 RAC: 559,842 |
I have 301 tasks in progress, but all of those were finished before the crash. If you hit update, they should get uploaded and disappear from your client's queue. You will get credit if the task was sent out more than 2 days prior to the crash, since those are known to the database. If they were sent within 2 days of the crash they will be unknown to the database, so the validator will not know what to do with them; hence no credit. The good news, from the project's perspective, is that the result still shows up in the upload directory and is perfectly valid, so your CPU's efforts were not in vain. The only negative is that no one gets credit. You're probably wondering if it's possible to somehow backtrack where each result came from and award credit. I have thought about this, and technically there is a way: for each orphaned file in the upload directory, find its entry in the file_upload_handler log and use that to get the IP address of the sender, then use the database to associate the IP address with a user. Unfortunately, developing such a script would be very time consuming. |
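As a rough illustration, the IP-extraction step of that recovery idea might look like the Python sketch below. The log line format here is an assumption (the real file_upload_handler log would need to be checked), and it covers only the first half of the lookup.

```python
import re
from typing import Optional

# Hypothetical pattern for an IPv4 address in a log line -- verify
# against the real file_upload_handler log format before relying on it.
IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")

def ip_for_upload(log_text: str, filename: str) -> Optional[str]:
    """Return the sender's IP from the first log line that mentions
    the orphaned upload `filename`, or None if no line matches."""
    for line in log_text.splitlines():
        if filename in line and (m := IP_RE.search(line)):
            return m.group(1)
    return None
```

The remaining step would be a database query mapping that IP back to a host and user (the BOINC host table records the IP a host last connected from), which is where much of the effort the post mentions would go: several hosts can sit behind one IP, so the mapping is not always unique.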
Joined: 23 Jun 17 Posts: 5 Credit: 42,264,426 RAC: 0 |
I did as you suggested, but got no points. Oh well, not a big deal. |
Joined: 8 Jul 11 Posts: 1341 Credit: 494,193,741 RAC: 559,842 |
I did as you suggested, but got no points, oh well, not a big deal. I can see you have tasks that are "in progress" and are due soon, such as this one: https://numberfields.asu.edu/NumberFields/workunit.php?wuid=98279938 Is that one of the tasks you are not getting credit for? I see it in the upload directory and the database knows it's assigned to you, so I am not sure what the problem is. Maybe it's a validation problem. I will check the validator logs for clues. |
Joined: 8 Jul 11 Posts: 1341 Credit: 494,193,741 RAC: 559,842 |
I did as you suggested, but got no points, oh well, not a big deal. About the WU I referenced above, the database sent it out before the crash and the uploaded file's timestamp is in that 2-day window before the crash. So the database is in a state where it is "waiting" on you to return the result. This is just a guess, but I think the problem is that your client thinks it already uploaded the file (and it would be correct), but the server doesn't know this, so it can't validate the result. Assuming this hypothesis is correct, I could maybe push it along by manually changing its state in the database to "needs validate". Just out of curiosity I might try this, but I can't do it manually for every user since that would take forever. |
Joined: 28 Oct 11 Posts: 180 Credit: 241,853,899 RAC: 145,292 |
Each of my machines is making an unusual report every time it requests new work:
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399197of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399198of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399199of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399200of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399201of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399202of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399347of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399689of2000000_0 (expired)
They appear at the top of https://numberfields.asu.edu/NumberFields/results.php?hostid=1291&offset=0&show_names=1&state=1; judging by the dates, they're probably in a similar state. |
Joined: 8 Jul 11 Posts: 1341 Credit: 494,193,741 RAC: 559,842 |
Each of my machines is making an unusual report every time it requests new work: I saw similar messages on my hosts too, but they have since disappeared, and your tasks still show "in progress". I may have hit the update button on my hosts before I changed the deadline, which could be the difference. So maybe we wait until they expire (including the grace period)? Edit: I just checked my error results and found several "Timed out - no response", but they timed out on Jan 29th. That's confusing since I changed the deadline to Jan 30 for every outstanding WU. |
Joined: 28 Oct 11 Posts: 180 Credit: 241,853,899 RAC: 145,292 |
... your tasks still show "in progress" ... If you click through to the workunit (as distinct from the task), they show as 'WU cancelled'. I've also just got a batch (across several machines, including this one) of 'download failed': the scheduler issued the work, but the data file couldn't be found. I think that's all part of the process of the database healing itself. I'm not worried about any of these, though it might be worth giving what remains of the database a good spring-clean at the end of this run. You might want to do that anyway to check that you've got all the results you were expecting. But if you've got a load of unmatched upload files and a load of incomplete database records, a query might be able to flip the database state to 'needs validation' and recover them. |
Joined: 8 Jul 11 Posts: 1341 Credit: 494,193,741 RAC: 559,842 |
... your tasks still show "in progress" ... If you click through to the workunit (as distinct from the task), they show as 'WU cancelled'. I've also just got a batch (across several machines, including this one) of 'download failed': the scheduler issued the work, but the data file couldn't be found. I think that's all part of the process of the database healing itself. Ah yes, any WU with a group number less than 400000 was cancelled; since there were so few of them left (<1k), it wasn't worth letting the db try to resend them, as they would have just resulted in download failures. I'm not worried about missing results: my collators will tell me if anything is missing, and then I can go back and redo those manually (I have already done that for all grps <400k). But I will definitely do the spring cleaning after the dust settles. I did a test and flipped the validation switch on one WU, but all it did was mark it as valid and then it remained stuck. I think it requires flipping more switches, like server_state and client_state, and then triggering the transitioner... which is more than I can handle at the moment. |
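For the record, the "flipping more switches" idea could be sketched roughly as below. The field names match the standard BOINC result and workunit tables, but the numeric state codes here are assumptions that would have to be verified against the BOINC source before touching a live database; the sketch just models the update on plain dicts rather than issuing real SQL.

```python
# Assumed BOINC state codes -- verify against the BOINC source
# (lib/common_defs.h) before applying anything to a real database.
SERVER_STATE_OVER = 5      # result is no longer in progress
OUTCOME_SUCCESS = 1        # result completed successfully
FILES_UPLOADED = 5         # client reported its output files uploaded
VALIDATE_STATE_INIT = 0    # validator has not examined it yet

def mark_for_revalidation(result: dict, workunit: dict, now: int) -> None:
    """Set the fields the transitioner and validator inspect so an
    uploaded-but-orphaned result gets re-examined (sketch only)."""
    result["server_state"] = SERVER_STATE_OVER
    result["outcome"] = OUTCOME_SUCCESS
    result["client_state"] = FILES_UPLOADED
    result["validate_state"] = VALIDATE_STATE_INIT
    # The transitioner scans workunits whose transition_time has
    # passed, so moving it up to "now" prompts a re-check soon.
    workunit["transition_time"] = now
```

This matches the post's observation: setting validate_state alone leaves the result stuck, because the transitioner only hands work to the validator when the other state fields and the workunit's transition_time line up.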