cax skips actual rsync if error'ed DB cleared #109

Open
pdeperio opened this issue Jun 6, 2017 · 1 comment
pdeperio commented Jun 6, 2017

Running with "task_list": ["RetryStalledTransfer", "CopyPull"] clears the error successfully, but CopyPull then does not actually re-download the file in the same pass:

(pax_v6.6.5) [mklinton@midway2-login1 cax_pdp]$  HOSTNAME=midway-login1 cax --once --config cax_CopyPull.json  --log DEBUG  --run 6736
6736 midway-login1
root        : INFO     Using custom config file: cax_CopyPull.json
root        : INFO     Executing RetryStalledTransfer.
RetryStalledTransfer: ERROR    Transfer or process errored, retry.
RetryStalledTransfer: INFO     Deleting /project/lgrandi/xenon1t/processed/pax_v6.6.5/170202_2248.root
RetryStalledTransfer: ERROR    did not exist, notify run database.
RetryStalledTransfer: INFO     Removed from run database: /project/lgrandi/xenon1t/processed/pax_v6.6.5/170202_2248.root
root        : INFO     Executing CopyPull.
CopyPull    : INFO     rsync download dataset 170202_2248.root took 0 seconds

Running the same command a second time works:

(pax_v6.6.5) [mklinton@midway2-login1 cax_pdp]$  HOSTNAME=midway-login1 cax --once --config cax_CopyPull.json  --log DEBUG  --run 6736
6736 midway-login1
root        : INFO     Using custom config file: cax_CopyPull.json
root        : INFO     Executing RetryStalledTransfer.
root        : INFO     Executing CopyPull.
CopyPull    : INFO     downloading run 6736 to: midway-login1
CopyPull    : INFO     {'location': '/xenon/xenon1t_processed/pax_v6.6.5/170202_2248.root', 'creation_place': 'OSG', 'status': 'transferred', 'pax_version': 'v6.6.5', 'checksum': 'e1794299ba08041ebb150bc16cd75179468f9fbebaccc72c3889407f7d49c0cb6d8f8c5bbbef8f5d24ad90a0d2cd16d505df66e06fce2631d90dd4668e259b92', 'creation_time': [datetime.datetime(2017, 5, 26, 21, 5, 35, 407000)], 'host': 'login', 'type': 'processed'}
CopyPull    : INFO     Starting rsync
root        : INFO     download: login.ci-connect.uchicago.edu/xenon/xenon1t_processed/pax_v6.6.5/170202_2248.root to /project/lgrandi/xenon1t/processed/pax_v6.6.5/170202_2248.root
CopyPull    : INFO     time rsync -r --stats [email protected]:/xenon/xenon1t_processed/pax_v6.6.5/170202_2248.root /project/lgrandi/xenon1t/processed/pax_v6.6.5
CopyPull    : INFO         
Number of files: 1
Number of files transferred: 1
Total file size: 1854776707 bytes

This slows down transfer recovery by one periodic cycle of massive-cax.

@pdeperio
Contributor Author

This is still happening, and now involves a third stage, AddChecksum. The first try deletes the file and the DB entry:

$ HOSTNAME=midway-login1  cax --config /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json --once --run 14576
14576 midway-login1
root        : INFO     Using custom config file: /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json
root        : INFO     Executing RetryStalledTransfer.
root        : INFO     Executing RetryBadChecksumTransfer.
RetryBadChecksumTransfer: ERROR    Bad checksum 14576, midway-login1, processed
RetryBadChecksumTransfer: ERROR    Bad checksum v6.8.0
RetryBadChecksumTransfer: INFO     Deleting /project2/lgrandi/xenon1t/processed/pax_v6.8.0/171118_0702.root
RetryBadChecksumTransfer: INFO     Removed from run database: /project2/lgrandi/xenon1t/processed/pax_v6.8.0/171118_0702.root
root        : INFO     Executing CopyPull.
CopyPull    : INFO     rsync download dataset 171118_0702.root took 0 seconds
root        : INFO     Executing AddChecksum.
root        : INFO     Executing SetPermission.
root        : INFO     Executing ProcessBatchQueueHax.

A second try is needed for CopyPull to actually run:

$ HOSTNAME=midway-login1  cax --config /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json --once --run 14576
14576 midway-login1
root        : INFO     Using custom config file: /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json
root        : INFO     Executing RetryStalledTransfer.
root        : INFO     Executing RetryBadChecksumTransfer.
root        : INFO     Executing CopyPull.
CopyPull    : INFO     downloading run 14576 to: midway-login1
CopyPull    : INFO     {'creation_time': datetime.datetime(2017, 11, 19, 3, 50, 14, 367000), 'status': 'transferred', 'type': 'processed', 'checksum': 'b678b2f66ac193e6feb58d0a11b4b9bb9bd294ac1ac98dbf4a5a22f05e48f483ff6ab0d098fed64009c6672fab4d74ca77186df3ad746df846a099ca1fe9d0be', 'host': 'login', 'location': '/xenon/xenon1t_processed/pax_v6.8.0/171118_0702.root', 'creation_place': 'OSG', 'pax_version': 'v6.8.0'}
CopyPull    : INFO     Starting rsync
root        : INFO     download: login.xenon.ci-connect.net/xenon/xenon1t_processed/pax_v6.8.0/171118_0702.root to /project2/lgrandi/xenon1t/processed/pax_v6.8.0/171118_0702.root
CopyPull    : INFO     time rsync -r --stats [email protected]:/xenon/xenon1t_processed/pax_v6.8.0/171118_0702.root /project2/lgrandi/xenon1t/processed/pax_v6.8.0
root        : INFO     End of download

CopyPull    : INFO     rsync download dataset 171118_0702.root took 153 seconds
root        : INFO     Executing AddChecksum.
root        : INFO     Executing SetPermission.

And a third try is needed for AddChecksum to run:

$ HOSTNAME=midway-login1  cax --config /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json --once --run 14576
14576 midway-login1
root        : INFO     Using custom config file: /project/lgrandi/xenon1t/cax/cax_AddChecksum_only.json
root        : INFO     Executing RetryStalledTransfer.
root        : INFO     Executing RetryBadChecksumTransfer.
root        : INFO     Executing CopyPull.
CopyPull    : INFO     rsync download dataset 171118_0702.root took 0 seconds
root        : INFO     Executing AddChecksum.
AddChecksum : INFO     Adding a checksum to run 14576 processed
root        : INFO     Executing SetPermission.

This really slows things down, requiring three full cycles of massive-cax to get through the whole task list. It seems as if the DB entry is not being re-queried for each task, but @tunnell, I thought you said it is? Could you help check, please?
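To make the suspected cause concrete, here is a minimal sketch of the hypothesis, not cax's actual code. It assumes each task receives a run document that was fetched once per pass; the in-memory `db` dict and the names `fetch_run`, `retry_errored`, `copy_pull`, and `run_tasks` are all hypothetical stand-ins. With a stale document the copy step still sees the old local entry and skips, which would match the "took 0 seconds" line in the logs above; re-fetching before each task lets both steps complete in one pass.

```python
import copy

HOST = "midway-login1"


def make_db():
    """Hypothetical in-memory stand-in for the run database (not the cax API)."""
    return {14576: {"number": 14576,
                    "data": [{"host": HOST, "type": "processed",
                              "status": "error"}]}}


def fetch_run(db, number):
    """Return a fresh snapshot of the run document, as a new query would."""
    return copy.deepcopy(db[number])


def retry_errored(db, run_doc, log):
    """If the document this task sees has an errored entry, remove it from the DB."""
    if any(d["status"] == "error" for d in run_doc["data"]):
        entries = db[run_doc["number"]]["data"]
        db[run_doc["number"]]["data"] = [d for d in entries
                                         if d["status"] != "error"]
        log.append("retry: removed errored entry")


def copy_pull(db, run_doc, log):
    """Download only if the document this task sees has no entry for this host."""
    if any(d["host"] == HOST for d in run_doc["data"]):
        log.append("copy: skipped, local entry present")
        return
    db[run_doc["number"]]["data"].append(
        {"host": HOST, "type": "processed", "status": "transferred"})
    log.append("copy: downloaded")


def run_tasks(db, number, tasks, refetch):
    """Run tasks in order; optionally re-query the run document before each one."""
    log = []
    run_doc = fetch_run(db, number)  # fetched once at the start of the pass
    for task in tasks:
        if refetch:
            run_doc = fetch_run(db, number)  # fresh view for every task
        task(db, run_doc, log)
    return log


if __name__ == "__main__":
    db = make_db()
    print(run_tasks(db, 14576, [retry_errored, copy_pull], refetch=False))
    print(run_tasks(db, 14576, [retry_errored, copy_pull], refetch=False))
    db = make_db()
    print(run_tasks(db, 14576, [retry_errored, copy_pull], refetch=True))
```

If the real task loop behaves like the `refetch=False` branch, each cleanup step would cost one extra cycle, which is what the two-pass and three-pass logs suggest. Whether cax actually shares a stale document between tasks still needs to be checked against the code.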
