Unmatched size of mirrored data while finishing bandersnatch mirror #1105

Open
r00t1900 opened this issue Apr 3, 2022 · 7 comments
@r00t1900

r00t1900 commented Apr 3, 2022

desc

I have been using bandersnatch to sync from pypi.org for almost 10 days. Today it finally reached "generating global index page..." and finished all its work, but the mirrored size is only 8822G, which does not match the size reported at https://pypi.org/stats.

details

command: bandersnatch -c bs.conf mirror
bs.conf:

[mirror]
directory = /mnt/storage/data
master = https://pypi.org
json = true
timeout = 300
workers = 10
hash-index = false
stop-on-error = false
delete-packages = true
compare-method = stat
download-mirror = https://pypi.tuna.tsinghua.edu.cn
download-mirror-no-fallback = false

[plugins]
enabled = blocklist_project

[blocklist]
packages =
  tf-nightly
  tf-nightly-gpu
  tf-nightly-cpu
  tensorflow-io-nightly
  pyagrum-nightly

As the config file shows, I use an alternative download mirror and also block several packages. But even when I take the blocked packages into account, the numbers still do not match:

| item | from | size |
|---|---|---|
| size in pypi | / | 10.8T |
| size in tuna | / | 9.75T |
| size of blocked packages | manually calc from pypi | 1353G = 1353/1024 T = 1.32T |
| size of mirrored | df -h -B G | 8822G = 8822/1024 T = 8.61T |
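
The arithmetic behind these figures can be checked with a short script. This is only a sketch using the numbers quoted above; it assumes 1T = 1024G, the same conversion used in the table, and the variable names are made up for illustration.

```python
# Figures quoted in the table above; 1T is taken as 1024G, as in the table.
G_PER_T = 1024

pypi_total_t = 10.8   # size reported on https://pypi.org/stats
tuna_total_t = 9.75   # size shown on the tuna server status page
blocked_g = 1353      # blocked packages, summed by hand from PyPI
mirrored_g = 8822     # local mirror, from `df -h -B G`

blocked_t = blocked_g / G_PER_T
mirrored_t = mirrored_g / G_PER_T
expected_t = pypi_total_t - blocked_t  # PyPI total minus blocked packages

print(f"blocked  : {blocked_t:.2f}T")
print(f"mirrored : {mirrored_t:.2f}T")
print(f"expected : {expected_t:.2f}T")
print(f"shortfall: {expected_t - mirrored_t:.2f}T")
print(f"tuna vs expected: {tuna_total_t - expected_t:+.2f}T")
```

Note that 8822/1024 is about 8.615, so the mirrored size shows as 8.61T or 8.62T depending on whether it is truncated or rounded.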

questions

  • The expected size of my mirrored data should be 10.8T - 1.32T = 9.48T, so why do I only get 8.61T? Under what conditions would bandersnatch produce this result?
  • According to pypi.sh in tunasync-scripts, the pypi mirror hosted by tuna uses exactly the same configuration as mine, at least for the [blocklist] part. So why is the size shown in the tuna server status 9.75T and not 9.48T (as calculated above)?

    @tuna

btw

In recent days, whenever a run reaches "generating global index page...", bandersnatch always starts with a Response timeout error:
[Two screenshots of the error, pic1 and pic2, were attached here.]

The command I use is bandersnatch -c bs.conf mirror, as usual, even for incremental updates.
Q: Should I run bandersnatch verify instead?

@r00t1900
Author

r00t1900 commented Apr 5, 2022

something else

Today I found something more interesting:

  • With the data stable at 8.61T, every time I rerun bandersnatch -c bs.conf mirror to try to fix the size, the console always outputs:
...
2022-04-05 22:25:46,539 INFO: Fetching metadata for package: zxw (serial 7551230) (package.py:57)
2022-04-05 22:25:46,722 INFO: zxw no longer exists on PyPI (package.py:65)
2022-04-05 22:25:46,723 INFO: Fetching metadata for package: zxycba (serial 11240564) (package.py:57)
2022-04-05 22:25:46,780 INFO: zxj-env no longer exists on PyPI (package.py:65)
2022-04-05 22:25:46,780 INFO: Fetching metadata for package: zzz-web (serial 3308206) (package.py:57)
2022-04-05 22:25:46,864 INFO: ztz no longer exists on PyPI (package.py:65)
2022-04-05 22:25:46,864 INFO: Fetching metadata for package: zzzzzzzzz (serial 1189504) (package.py:57)
2022-04-05 22:25:46,907 INFO: zxycba no longer exists on PyPI (package.py:65)
2022-04-05 22:25:47,111 INFO: zeffee no longer exists on PyPI (package.py:65)
2022-04-05 22:25:47,203 INFO: zx-core-backend no longer exists on PyPI (package.py:65)
2022-04-05 22:25:47,356 INFO: zzz-web no longer exists on PyPI (package.py:65)
2022-04-05 22:25:47,375 INFO: zoomeye-dev no longer exists on PyPI (package.py:65)
2022-04-05 22:25:48,226 INFO: zipkin-query no longer exists on PyPI (package.py:65)
2022-04-05 22:25:48,421 INFO: zonda no longer exists on PyPI (package.py:65)
2022-04-05 22:25:49,274 INFO: yw2-hello no longer exists on PyPI (package.py:65)
2022-04-05 22:25:50,373 INFO: zzzzzzzzz no longer exists on PyPI (package.py:65)
2022-04-05 22:25:53,122 INFO: zet no longer exists on PyPI (package.py:65)
2022-04-05 22:25:53,122 INFO: Generating global index page. (mirror.py:483)

Yes, always the same list.

Note that I use the download-mirror option (see the first post above). Would changing the download-mirror parameter help?

  • The todo file never changed:
13251373
jsii-native-python 7007874
pyrblx 12875777
cspm 7127497
tempremoverwin 7155021
nPhase-pkg-oakheart 7757323
metalearn-rl 5435216
gym-blocksudoku-drakeor 8623552
btrcommands 8100696
Headers 3922766
wintempmanager 7155034
peak-finder-gabepoel 7375478
...

And why does the first line contain only a number (a serial?), unlike the other lines? Does this list match the earlier "no longer exists on PyPI" list?

  • If I delete the todo file and then rerun bandersnatch -c bs.conf mirror, what would happen? Will my current 8.61T of data be reset to 0? I ask because the console said Sync all package... start from serial:0, so I cancelled immediately. But if this won't make things any worse, I would like to try it again.

  • Do you need the todo file? How can I upload a file to an issue? Or which file-sharing service do you recommend?

last

Looking forward to your reply. I am now backing up the whole disk image before I go any further.

@cooperlees
Contributor

Hi there,

The size on PyPI is a sum of the database metadata. I wouldn't be surprised if the deletions are not updating it correctly or something. Could be worth a check.

Usually when this happens it's 1 package causing the issue. The todo file can be removed and bandersnatch will try to sync again from the serial in the serial file alongside the todo. So you should be safe to delete it and let it resume.

cooperlees added the question label Apr 8, 2022
@r00t1900
Author

r00t1900 commented Apr 8, 2022 via email

@happyaron
Contributor

According to pypi.sh in tunasync-scripts, the pypi mirror hosted by tuna is exactly the same configuration of mine, at least the [blocklist] part is. But why the size shown in tuna server status is 9.75T, not the 9.48T(as is calculated above)?

This might relate to the fact that bandersnatch does not automatically remove files that are gone upstream, so the mirror only does garbage collection when a full bandersnatch verify run is performed.

@cooperlees
Contributor

Good call. This is 100% the sad state of bandersnatch. We don't have a good mechanism to know which files to delete, as we keep the service stateless apart from the blob store (i.e. filesystem, s3 etc.). bandersnatch verify has to walk the whole filesystem.

Only options I see are:

  • Make new / extend PyPI API for this info (this will be a long slog with a PEP etc)
    • e.g. we could keep a "deleted-packages" in the metadata that bandersnatch could read and check to remove from the file store
  • Keep a local sqlite DB of all files per package and compare with JSON metadata as we update, to remove yanked releases etc.
    • SQL query much cheaper than walking the filesystem (especially with a nice index)
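
To make the second option concrete, here is a rough sketch of what such a local index could look like. Nothing here exists in bandersnatch: the `files` table and the `open_index` / `sync_package` helpers are hypothetical, and the list of upstream filenames is assumed to come from the package's JSON metadata.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) the hypothetical per-package file index."""
    conn = sqlite3.connect(path)
    # The composite primary key doubles as the index used for lookups.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " package TEXT NOT NULL,"
        " filename TEXT NOT NULL,"
        " PRIMARY KEY (package, filename))"
    )
    return conn


def sync_package(conn, package, upstream_filenames):
    """Record upstream's current files; return names that are now stale."""
    upstream = set(upstream_filenames)
    rows = conn.execute(
        "SELECT filename FROM files WHERE package = ?", (package,)
    ).fetchall()
    known = {row[0] for row in rows}
    stale = sorted(known - upstream)  # on record locally, gone upstream
    with conn:  # commit the delete and the insert together
        conn.executemany(
            "DELETE FROM files WHERE package = ? AND filename = ?",
            [(package, name) for name in stale],
        )
        conn.executemany(
            "INSERT OR IGNORE INTO files (package, filename) VALUES (?, ?)",
            [(package, name) for name in upstream],
        )
    return stale


conn = open_index()
sync_package(conn, "demo", ["demo-1.0.tar.gz", "demo-1.1.tar.gz"])
# A later sync where upstream no longer lists the 1.0 release file.
stale = sync_package(conn, "demo", ["demo-1.1.tar.gz"])
print(stale)
```

The caller would then delete the returned stale files from the blob store, one package at a time, instead of walking the whole filesystem.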

@lxyeternal

lxyeternal commented Jul 30, 2024

I have the same issue. I don't know why so many package files are missing from the mirror. How can I make a complete mirror of PyPI?

2024-07-30 20:31:59,734 INFO: Fetching metadata for package: zwero-brain-games1 (serial 14011926) (package.py:58)
2024-07-30 20:31:59,796 INFO: zutnlp no longer exists on PyPI (package.py:66)
2024-07-30 20:31:59,796 INFO: Fetching metadata for package: zx-core-backend (serial 3916140) (package.py:58)
2024-07-30 20:31:59,901 INFO: zwdata no longer exists on PyPI (package.py:66)
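
To check whether the same packages are reported on every run, the names can be pulled out of the console output. This is a minimal sketch that assumes the log line format shown above; `packages_reported_gone` is a made-up helper name, not part of bandersnatch.

```python
import re

# Matches lines such as:
#   "... INFO: zxw no longer exists on PyPI (package.py:65)"
GONE_RE = re.compile(r"INFO: (?P<name>\S+) no longer exists on PyPI")


def packages_reported_gone(log_lines):
    """Return the sorted, de-duplicated package names reported as gone."""
    names = set()
    for line in log_lines:
        match = GONE_RE.search(line)
        if match:
            names.add(match.group("name"))
    return sorted(names)


sample = """\
2024-07-30 20:31:59,734 INFO: Fetching metadata for package: zwero-brain-games1 (serial 14011926) (package.py:58)
2024-07-30 20:31:59,796 INFO: zutnlp no longer exists on PyPI (package.py:66)
2024-07-30 20:31:59,796 INFO: Fetching metadata for package: zx-core-backend (serial 3916140) (package.py:58)
2024-07-30 20:31:59,901 INFO: zwdata no longer exists on PyPI (package.py:66)
""".splitlines()

gone = packages_reported_gone(sample)
print(gone)
```

Running it over the saved output of two separate runs and comparing the two lists shows whether the set of packages is really identical each time.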

@cooperlees
Contributor

If this is from a failed sync, go to the resume file and remove the packages from there. I don't have a better solution, or time to try to fix this, sorry.
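
Removing entries by hand gets tedious for a long list, so something like the following could do it. This is only a sketch: it assumes the "resume file" is the todo file shown earlier in this thread, with a first line holding only a number and then one "<package> <serial>" entry per line. `drop_from_todo` is a hypothetical helper, names are compared case-insensitively as a simplification, and the real file should be backed up before it is rewritten.

```python
def drop_from_todo(todo_text, names_to_drop):
    """Return the todo text without the entries named in names_to_drop.

    Assumes the layout shown earlier in this thread: a first line holding
    only a number, then one "<package> <serial>" entry per line.
    """
    drop = {name.lower() for name in names_to_drop}
    lines = todo_text.splitlines()
    if not lines:
        return todo_text
    kept = [lines[0]]  # keep the leading number-only line untouched
    for line in lines[1:]:
        parts = line.split()
        if parts and parts[0].lower() in drop:
            continue  # skip entries for packages we want removed
        kept.append(line)
    return "\n".join(kept) + "\n"


sample_todo = """\
13251373
jsii-native-python 7007874
pyrblx 12875777
cspm 7127497
"""

cleaned = drop_from_todo(sample_todo, ["pyrblx"])
print(cleaned, end="")
```

The function works on text only, so reading the file, writing the result back, and rerunning bandersnatch -c bs.conf mirror are left to the caller.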
