Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MD5 and size checks on remote files #582

Merged
merged 16 commits into from
Apr 3, 2019
Merged

Add MD5 and size checks on remote files #582

merged 16 commits into from
Apr 3, 2019

Conversation

PaulWessel
Copy link
Member

We need e mechanism to refresh a users cache files when the file on the server changes. Also, we need to be aware if a download failed before completing. This PR is exploring solutions to these problems, as discussed in #549.

We need e mechanism to refresh a users cache files when the file on the server changes.  Also, we need to be aware if a download failed before completing.  This PR is exploring solutions to these problems.
@joa-quim
Copy link
Member

Tried to check it out but could not find the md5remotecheck branch. Tried with VSC, Tortoise and cmd git checkout origin/md5remotecheck but no way.

Dealing with different modification times on OSX and Linux.
@PaulWessel
Copy link
Member Author

Well, there it is. But a bit premature to play with this - I am debugging it to the point it does not crash. Details of the stat structure on OSX differs from Linux, for example, and possibly on Win32. Does your stat have st_mtime?

@joa-quim
Copy link
Member

Don't know but it was because of this kind of stuff that I wanted to test if it builds. But, as I said, can't switch to this branch.

Looks like Apple is the only one without st_mtime as it exists under Windows and Linux.  Must instead use st_mtimespec.tc_sec.
@PaulWessel
Copy link
Member Author

I don't know why you don't see it. I don't think I did anything different. I created the branch on my computer, then pushed to origin. Updates showing up on this page.

@seisman
Copy link
Member

seisman commented Mar 27, 2019

@joa-quim Did you run git fetch before git checkout origin/md5remotecheck?

Tested the time, hash, and size checks.  Need solution for corrupted M5D table itself.
When run with -Vd the whole sequence should print messages.
@PaulWessel
Copy link
Member Author

PaulWessel commented Mar 27, 2019

I think this branch is ready to be tested by others. I have implemented all the steps discussed in #549. I was able to test it in the debugger by doing the following steps.

  1. I ran gmt grdinfo earth_relief_10m.grd -Vd
  2. The first time ~/.gmt/server/gmt_md5_server.txt is not found so it got downloaded and we are done (nothing to compare to yet). Since I already had the 10m file in cache then grdinfo found that file.
  3. I then aged the file via touch -t 201903221010 gmt_md5_server.txt so it would appear to be more than one day old.
  4. Running the grdinfo command again it found the gmt_md5_server.txt so it proceeds to get its modification time. Comparing that time with current time revealed the file was > 1 day old.
  5. We rename the old gmt_md5_server.txt to gmt_md5_server.old and download a fresh copy of gmt_md5_server.txt.
  6. We now load the contents of the two MD5 tables into memory structures. If the new table has a mismatch between the number of entries stated in the header record and the actual number of records we must have had a download problem and we delete the file and give up the check.
  7. We compare the information in the old and new structures. Files not found in the new table but existing locally are removed (these are files no longer supported by the GMT service).
  8. If the server and cached files have different MD5 hash we remove the local file (so it will be downloaded). Note: We are not running md5sum but just comparing the MD5 hashes as stored in the MD5 table.
  9. If the server and cached files have different file sizes then we remove the local file (so it will be downloaded). This will catch aborted downloads since the file size will be different.
  10. We now proceed to look for the file locally. If not found it will be downloaded.

A few things for discussion:

  • Item 7 may seem too harsh to you: why delete a file the user has downloaded from the GMT service when that file no longer exist on the server? My thinking is that we are responsible for the quality and integrity of those files on the server and if we decide some file should not be offered we do not want casual users to continue to use them. Others can simply move that file to a safe place if they worry about our garbage collector.
  • What about off-line work? Right now any failure to curl will simply state that as an error and move on. However, I can imagine if you are working offline if will be annoying to get those error messages. We could move them to the warning category (requires -V to be visible) if you feel that is a better solution.

Please give it a try and give feedback.

Also print when file is found and age < 1 day.
@joa-quim
Copy link
Member

Now I can see the branch. I deduced from this that the branch name only shows up locally when the tests have finished (they hadn't yesterday when I tested). Also, by doing git fetch I could do the git checkout origin/md5remotecheck but the branch showed up only with its hash name.

Anyway, tested this only to the point to see that it builds with no errors.

Improve checks and add more debug messages.
Because of the way GMT registeres inputs etc the dm5_refresh function could be called twice.  Added a variable to prevent wasting time on a repeat call.
Just improve the comments and clean up some of the debug print statements.
@PaulWessel
Copy link
Member Author

I think we need the @GenericMappingTools/core group as well as @abanbj to test this branch out. For it to be a real test I suggest you build and use this branch for a few days. After building, please run

gmt info @App_O_transect.txt -Vd

Examine the verbose output. If this is the first time it will simply get the gmt_md5_server.txt and that is the end of the check. Then the file may be downloaded if you don't already have it.
Notify me know when you have done this. I will then make a tiny change to that file and update the MD5 table. As you run gmt info again it will (when your gmt_md5_server.txt is 24 hours or older) get an updated gmt_md5_server.txt and detect the change, remove your old App_O_transect.txt file and download it again.

I have done this here for testing but would be nice to see this working also under WIndows (@joa-quim) and LInux (@leouieda).

@anbj
Copy link
Contributor

anbj commented Mar 29, 2019

OK, did;

gmt info @App_O_transect.txt -Vd

@PaulWessel
Copy link
Member Author

I have make a tiny change to that file and updated the gmt_md5_server.txt file so when your file is > 24 hours and you rerun that command it should be clear from the output that it refreshed the file and found that data table had changed and thus deleted it and downloaded it anew. Let me know your experience.

@anbj
Copy link
Contributor

anbj commented Mar 30, 2019

md5_test_1st.txt - first download
md5_test_2nd.txt - second download - immediately after first download
md5_test_3trd.txt - just now - after file has been updated

$ diff md5_test_1st.txt md5_test_2nd.txt
44,47c44,45
< gmtinfo [DEBUG]: Download remote file http:https://www.soest.hawaii.edu/gmt/data/gmt_md5_server.txt for the first time
< gmtinfo [INFORMATION]: Downloading file http:https://www.soest.hawaii.edu/gmt/data/gmt_md5_server.txt ...
< gmtinfo [INFORMATION]: Downloading file http:https://www.soest.hawaii.edu/gmt/data/cache/App_O_transect.txt ...
< gmtinfo [INFORMATION]: Download complete [Got 46.093 kb].
---
> gmtinfo [DEBUG]: Local file /home/user/.gmt/server/gmt_md5_server.txt found
> gmtinfo [DEBUG]: File /home/user/.gmt/server/gmt_md5_server.txt less than 24 hours old, refresh is premature.
53c51
< gmtinfo [DEBUG]: Object ID 1 : Registered Data Table Stream 7f3c63b4c760 as an Output resource with geometry Point|Line|Poly [n_objects = 2]
---
> gmtinfo [DEBUG]: Object ID 1 : Registered Data Table Stream 7f49ebd4c760 as an Output resource with geometry Point|Line|Poly [n_objects = 2]

$ diff md5_test_2nd.txt md5_test_3trd.txt
45c45,56
< gmtinfo [DEBUG]: File /home/user/.gmt/server/gmt_md5_server.txt less than 24 hours old, refresh is premature.
---
> gmtinfo [DEBUG]: File /home/user/.gmt/server/gmt_md5_server.txt older than 24 hours, get latest from server.
> gmtinfo [DEBUG]: Rename /home/user/.gmt/server/gmt_md5_server.txt -> /home/user/.gmt/server/gmt_md5_server.txt.old
> gmtinfo [INFORMATION]: Downloading file http:https://www.soest.hawaii.edu/gmt/data/gmt_md5_server.txt ...
> gmtinfo [DEBUG]: Load contents from /home/user/.gmt/server/gmt_md5_server.txt
> gmtinfo [DEBUG]: Load contents from /home/user/.gmt/server/gmt_md5_server.txt.old
> gmtinfo [DEBUG]: Server and cache versions of earth_relief_30m.grd are identical - no need to download new file.
> gmtinfo [DEBUG]: Server and cache versions of App_O_transect.txt have different MD5 hash codes - must download new copy.
> gmtinfo [DEBUG]: Delete /home/user/.gmt/cache/App_O_transect.txt
> gmtinfo [DEBUG]: Remove outdated file /home/user/.gmt/server/gmt_md5_server.txt.old.
> gmtinfo [DEBUG]: Delete /home/user/.gmt/server/gmt_md5_server.txt.old
> gmtinfo [INFORMATION]: Downloading file http:https://www.soest.hawaii.edu/gmt/data/cache/App_O_transect.txt ...
> gmtinfo [INFORMATION]: Download complete [Got 46.093 kb].
51c62
< gmtinfo [DEBUG]: Object ID 1 : Registered Data Table Stream 7f49ebd4c760 as an Output resource with geometry Point|Line|Poly [n_objects = 2]
---
> gmtinfo [DEBUG]: Object ID 1 : Registered Data Table Stream 7f2e2294c760 as an Output resource with geometry Point|Line|Poly [n_objects = 2]
78c89
< ~/b$
\ No newline at end of file
---
> ~$
\ No newline at end of file

@PaulWessel
Copy link
Member Author

Looks like a successful test, @abanbj.

@PaulWessel
Copy link
Member Author

I would like to merge this one into master since it is passing the various tests I have tried. However, I would feel happier if more of the @GenericMappingTools/core could give it a try first.

@joa-quim
Copy link
Member

Did the gmtinfo step

@PaulWessel
Copy link
Member Author

OK, I updated the gmt_md5_server.txt file so wait 24 hours then try again.

@joa-quim
Copy link
Member

joa-quim commented Apr 2, 2019

I guess this shows it had worked.

gmtinfo [DEBUG]: Server and cache versions of App_O_transect.txt have different MD5 hash codes - must download new copy.
gmtinfo [DEBUG]: Delete C:/progs_cygw/GMTdev/gmt5/tests_data/App_O_transect.txt
...
gmtinfo [DEBUG]: Remove outdated file C:/j/.gmt/server/gmt_md5_server.txt.old.
gmtinfo [DEBUG]: Delete C:/j/.gmt/server/gmt_md5_server.txt.old
gmtinfo [INFORMATION]: Downloading file http:https://www.soest.hawaii.edu/gmt/data/cache/App_O_transect.txt ...
gmtinfo [INFORMATION]: Download complete [Got 46.093 kb].

@PaulWessel
Copy link
Member Author

Does @seisman or @leouieda have any objections to merging this into master?

@seisman
Copy link
Member

seisman commented Apr 2, 2019

This PR looks good to me.

@leouieda
Copy link
Member

leouieda commented Apr 2, 2019

Sorry, I missed this notification. I built the recent branch and tried running gmt info @App_O_transect.txt -Vd. I'm getting the download message (below) every time I run it but there is no gmt_md5_server.txt in my USERDIR. This message also shows up when I turn off my internet connection so there might be some issue with the error checking code in gmtmd5_get_url.

gmtinfo [DEBUG]: Download remote file http:https://www.soest.hawaii.edu/gmt/data/gmt_md5_server.txt for the first time
gmtinfo [INFORMATION]: Downloading file http:https://www.soest.hawaii.edu/gmt/data/gmt_md5_server.txt ...

@PaulWessel
Copy link
Member Author

The gmt_md5_server.txt should be in ~/.gmt/server directory - nothing there? Curl should return with an error code if Internet is down - perhaps after a suitable timeout - so what happens after those mesages? Do you get a libcurl error reported? Is this Linux command line or via python?

@PaulWessel PaulWessel changed the title WIP experimenting with MD5 checks on remote files Add MD5 and size checks on remote files Apr 3, 2019
@leouieda
Copy link
Member

leouieda commented Apr 3, 2019

This is all Linux command line. There is nothing in the server folder. As far as I can tell, the file is not actually being downloaded even when the connection is on. No errors from libcurl.

@PaulWessel
Copy link
Member Author

OK, I will build on my Ubuntu laptop and see if there is anything that changes for me.

If user has never used remote files we must check if server directory exists and if not create it first.
@PaulWessel
Copy link
Member Author

OK, please update the branch and try again. I had not added a check to see if you had the server directory or not in the refresh function and since I had used remote files for a while I already had that directory before the new MD5 refresh check. I made the changes and it now runs fine for me on Ubuntu in an account that had no such directory to start with.

@leouieda
Copy link
Member

leouieda commented Apr 3, 2019

Just tested and seems to working for these scenarios:

  1. Both cache and server missing
  2. File in cache but server missing
  3. No file in cache but server up to date
  4. Both cache and server up to date
  5. cache missing and no internet causes error

The only thing that didn't work as expected was the case when the data file is in cache and there is no server file and there is no internet. I was expecting this to cause an error saying that couldn't download the server md5 file but I got nothing. Don't know if this is the intended behavior, though.

@anbj
Copy link
Contributor

anbj commented Apr 3, 2019

Just a side note on the last post; is the cache dir not supposed to be deleted when doing a gmt clear all? But server is?

~/.gmt 07:55:42 $ tree
.
├── cache
│   └── App_O_transect.txt
└── server
    ├── earth_relief_15m.grd
    ├── earth_relief_30m.grd
    ├── earth_relief_60m.grd
    └── gmt_md5_server.txt
~/.gmt 07:55:43 $ gmt clear all
~/.gmt 07:55:47 $ tree
.
└── cache

PaulWessel added a commit that referenced this pull request Apr 3, 2019
For some reason I had clear cache remove cache and its content, then create an empty cache dir.  But that dir will be created when needed, so makes more sense to remove it like everything else.  Addresses #582.
PaulWessel added a commit that referenced this pull request Apr 3, 2019
For some reason I had clear cache remove cache and its content, then create an empty cache dir.  But that dir will be created when needed, so makes more sense to remove it like everything else.  Addresses #582.
@PaulWessel
Copy link
Member Author

I will try to turn off Internet and recreate that situation and see how I can catch the no-connection error. Presumably libcurl returns the correct error code and we can issues a message and suggest if they are working off-line to set GMT_AUTO_DOWNLOAD to off.

Report all errors when getting the gmt_md5_server.txt file fails.
If we cannot connect, then disable auto-download for the rest of the session.
@PaulWessel
Copy link
Member Author

Now, if the libcurl operation on the md5 file fails we complain and then stop checking for that session. This should catch your last case, @leouieda.

@leouieda
Copy link
Member

leouieda commented Apr 3, 2019

👍 looks good to me

@PaulWessel
Copy link
Member Author

Time to merge this puppy.

@PaulWessel PaulWessel merged commit 6206666 into master Apr 3, 2019
@PaulWessel PaulWessel deleted the md5remotecheck branch April 3, 2019 20:20
obaney pushed a commit to obaney/gmt that referenced this pull request Aug 18, 2021
GMT 6.1.1 is already released, and the 6.1 branch won't be updated.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants