Team credit lost!

Profile marsinph

Joined: 13 Apr 18
Posts: 5
Credit: 17,061,620
RAC: 0
Message 3004 - Posted: 27 Jan 2021, 16:16:33 UTC

Hello,
Before the crash, my team (BOINCBE) had 22,414,026 credits (confirmed by SETIBZH and BOINCSTATS).
Now it has only 22,273,654, and that is after 85 new WUs received and returned today.
So my team lost about 140K?

Also, all the WUs I received on 09 Jan are finished and were returned today, but they do not appear.
Look at host 156555: everything was uploaded today, but it does not seem to show up in my stats.
The deadline was extended until 30 Jan, but I no longer have the tasks.

Does anyone have an explanation?
Best regards
Profile Eric Driver
Project administrator
Project developer
Project tester
Project scientist

Joined: 8 Jul 11
Posts: 1344
Credit: 529,022,407
RAC: 564,075
Message 3005 - Posted: 27 Jan 2021, 18:12:31 UTC - in response to Message 3004.  

Unfortunately, there will be 2 days of lost credit, because the database was restored to a point 2 days before the crash. Furthermore, any task sent out within that 2-day window (between the db restore point and the crash) will not be recognized by the server, so although the result will upload, the server doesn't know how to award credit.

Sorry for the lost credit, but I don't think there is anything I can do about it. Even if I knew what the lost credit was, updating that manually for thousands of users would be a nightmare.
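
To make the 2-day window concrete, here is a minimal Python sketch of the rule as described in this post. The crash time used below is a made-up placeholder (the actual crash time is not stated in this thread), and `credit_expected` is a hypothetical helper, not part of BOINC.

```python
from datetime import datetime, timedelta


def credit_expected(sent: datetime, crash: datetime,
                    window: timedelta = timedelta(days=2)) -> bool:
    """Return True if a task sent at `sent` should still be known to the
    restored database, i.e. it was sent before the restore point."""
    restore_point = crash - window  # database was rolled back to here
    return sent < restore_point


# Placeholder crash time, chosen only for illustration.
crash = datetime(2021, 1, 20, 12, 0)

# Sent well before the restore point: the database still knows the task.
known = credit_expected(datetime(2021, 1, 17, 0, 0), crash)

# Sent inside the 2-day window: the task is unknown after the restore.
orphaned = credit_expected(datetime(2021, 1, 19, 0, 0), crash)
```

With these placeholder dates, `known` is `True` and `orphaned` is `False`.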
Profile HK-Steve

Joined: 23 Jun 17
Posts: 5
Credit: 42,264,426
RAC: 0
Message 3020 - Posted: 28 Jan 2021, 18:34:44 UTC

I have 301 tasks in progress, but all of those were finished before the crash.
They were waiting to be reported, but it seems none have been credited.

I understand if nothing can be done. Will the tasks just reach their deadline and abort?

Thanks
Steve
Profile Eric Driver
Message 3021 - Posted: 28 Jan 2021, 21:31:54 UTC - in response to Message 3020.  

If you hit update, they should get uploaded and disappear from your client's queue. You will get credit if the task was sent out more than 2 days before the crash, since those tasks are known to the database. If they were sent within 2 days of the crash, they will be unknown to the database, so the validator will not know what to do with them, hence no credit. The good news, from the project's perspective, is that the result still shows up in the upload directory and is perfectly valid, so your CPU's efforts were not in vain. The only negative is that no one gets credit.

You're probably wondering if it's possible to somehow backtrack where the result came from and award credit. I have thought about this, and technically there is a way: for each orphaned file in the upload directory, you could find an entry in the file_upload_handler log and use that to get the IP address of the sender, then use the database to associate the IP address with the user. Unfortunately, developing such a script would be very time-consuming.
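
For illustration only, the backtracking idea could be sketched roughly as below in Python. Everything specific here is an assumption: the log line layout matched by `LOG_PATTERN` is invented (the real file_upload_handler log format is not shown in this thread), the file names and IP addresses are made up, and a plain dict stands in for the database lookup from IP address to user.

```python
import re
from typing import Dict, Iterable, Optional

# ASSUMED log layout: "<timestamp> [<ip>] <filename> uploaded".
# This is a hypothetical format, not the real file_upload_handler log.
LOG_PATTERN = re.compile(r"\[(?P<ip>[0-9.]+)\]\s+(?P<name>\S+)")


def ip_by_file(log_lines: Iterable[str]) -> Dict[str, str]:
    """Map each uploaded file name to the IP address that sent it."""
    mapping: Dict[str, str] = {}
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match:
            mapping[match.group("name")] = match.group("ip")
    return mapping


def attribute_orphans(orphans: Iterable[str],
                      log_lines: Iterable[str],
                      user_by_ip: Dict[str, int]) -> Dict[str, Optional[int]]:
    """For each orphaned file, look up the sender IP in the log, then the
    user for that IP. None means no log entry or no matching user."""
    ips = ip_by_file(log_lines)
    return {name: user_by_ip.get(ips.get(name, "")) for name in orphans}


# Made-up sample data.
log = [
    "2021-01-25 10:00:00 [192.0.2.10] wu_example_A_0 uploaded",
    "2021-01-25 10:05:00 [192.0.2.20] wu_example_B_0 uploaded",
]
users = {"192.0.2.10": 101}  # stand-in for the database lookup
attributed = attribute_orphans(
    ["wu_example_A_0", "wu_example_B_0", "wu_example_C_0"], log, users)
```

In this sample only `wu_example_A_0` resolves to a user. A real script would also have to cope with users who share or change IP addresses, which is part of why it would take a while to get right.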
Profile HK-Steve

Message 3030 - Posted: 30 Jan 2021, 8:46:55 UTC

I did as you suggested, but got no points, oh well, not a big deal.
Profile Eric Driver
Message 3031 - Posted: 30 Jan 2021, 16:23:59 UTC - in response to Message 3030.  

I can see you have tasks that are "in progress" and are due soon, such as this one:
https://numberfields.asu.edu/NumberFields/workunit.php?wuid=98279938

Is that one of the tasks that you are not getting credit for? I see it in the upload directory and the database knows it's assigned to you, so I am not sure what the problem is. Maybe it's a validation problem. I will check the validator logs for clues.
Profile Eric Driver
Message 3032 - Posted: 30 Jan 2021, 16:47:11 UTC - in response to Message 3031.  

About the WU I referenced above: the database sent it out before the crash, and the uploaded file's time stamp is in that 2-day window before the crash. So the database is in a state where it is "waiting" on you to return the result. This is just a guess, but I think the problem is that your client thinks it already uploaded the file (and it would be correct) but the server doesn't know this, so it can't validate the result. Assuming this hypothesis is correct, I could maybe push it along by manually changing its state in the database to "needs validate". Just out of curiosity, I might try this, but I can't do this manually for every user since that would take forever.
Richard Haselgrove

Joined: 28 Oct 11
Posts: 180
Credit: 252,110,991
RAC: 179,991
Message 3033 - Posted: 30 Jan 2021, 19:20:18 UTC

Each of my machines is making an unusual report every time it requests new work:

30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399197of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399198of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399199of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399200of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399201of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399202of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399347of2000000_0 (expired)
30/01/2021 18:42:13 | NumberFields@home | Didn't resend lost task wu_sf3_DS-16x271-1_Grp399689of2000000_0 (expired)

They appear at the top of https://numberfields.asu.edu/NumberFields/results.php?hostid=1291&offset=0&show_names=1&state=1: judging by the dates, they're probably in a similar state.
Profile Eric Driver
Message 3034 - Posted: 30 Jan 2021, 20:14:55 UTC - in response to Message 3033.  
Last modified: 30 Jan 2021, 20:29:01 UTC

I saw similar messages on my hosts too, but they have since disappeared, and your tasks still show "in progress". I may have hit the update button on my hosts before I changed the deadline, which could be the difference. So maybe we wait until they expire (including the grace period)?

Edit: I just checked my error results and found several "Timed out - no response", but they timed out on Jan 29th. That's confusing since I changed the deadline to Jan 30 for every outstanding WU.
Richard Haselgrove

Message 3035 - Posted: 30 Jan 2021, 20:33:59 UTC - in response to Message 3034.  

... your tasks still show "in progress"....
If you click through to the workunit (as distinct from the task), they show as 'WU cancelled'. I've also just got a batch (across several machines, including this one) of 'download failed': the scheduler issued the work, but the data file couldn't be found. I think that's all part of the process of the database healing itself.

I'm not worried about any of these - though it might be worth giving what remains of the database a good spring-clean at the end of this run. You might want to do that anyway to check that you've got all the results you were expecting.

But if you've got a load of unmatched upload files, and a load of incomplete database records, a query might be able to flip the database state to 'needs validation', and recover them.
Profile Eric Driver
Message 3036 - Posted: 30 Jan 2021, 22:54:54 UTC - in response to Message 3035.  

Ah yes, any WU with a group number less than 400000 was cancelled. There were so few of them left (<1k) that it wasn't worth letting the db try to resend them, as they would have just resulted in download failures.

I'm not worried about missing results - my collators will tell me if anything is missing and then I can go back and redo them manually (I have already done that for all grps <400k). But I will definitely do the spring cleaning after the dust settles.

I did a test and flipped the validation switch on one WU, but all it did was mark it as valid, and then it remained stuck. I think it requires flipping more switches, like server_state and client_state, and then triggering the transitioner... which is more than I can chew at the moment.
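
As a rough sketch of what "flipping more switches" in one pass might look like, here is a toy Python example against an in-memory SQLite table. The table layout, the `validate_state` column, and the text state values are simplified placeholders chosen for readability; they are not the real BOINC schema or constants, and the example does not trigger the transitioner.

```python
import sqlite3

# Toy stand-in for the project database; column names and state values
# are illustrative placeholders, not the real BOINC schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE result ("
    " name TEXT PRIMARY KEY,"
    " server_state TEXT, client_state TEXT, validate_state TEXT)"
)
conn.executemany(
    "INSERT INTO result VALUES (?, ?, ?, ?)",
    [
        ("task_a", "in_progress", "new", "init"),      # stuck after restore
        ("task_b", "over", "files_uploaded", "valid"),  # already handled
    ],
)


def flag_for_validation(db: sqlite3.Connection, uploaded_names) -> int:
    """Set all three state fields together for tasks whose upload file
    exists but which the database still lists as in progress.
    Returns the number of rows changed."""
    updated = 0
    with db:  # one transaction: commit on success, roll back on error
        for name in uploaded_names:
            cur = db.execute(
                "UPDATE result SET server_state = 'over',"
                " client_state = 'files_uploaded',"
                " validate_state = 'needs_validate'"
                " WHERE name = ? AND server_state = 'in_progress'",
                (name,),
            )
            updated += cur.rowcount
    return updated


# task_c has an upload file but no database record, so it is skipped.
changed = flag_for_validation(conn, ["task_a", "task_b", "task_c"])
task_a = conn.execute(
    "SELECT server_state, client_state, validate_state"
    " FROM result WHERE name = 'task_a'"
).fetchone()
```

Here only `task_a` is changed. The `WHERE ... server_state = 'in_progress'` guard is a design choice to keep the update from touching results the server has already processed.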

Copyright © 2024 Arizona State University