Automatically add metadata to Hugging Face Hub repos when uploading projects #793

Open
wants to merge 2 commits into
base: main

@juhoinkinen (Member) commented Jun 17, 2024

With this PR, when running annif upload:

  • if README.md (the Model Card) does not exist in the destination repository, it is created with default contents and some metadata about the uploaded projects,
  • if README.md already exists, its metadata is updated as necessary.

Closes #790.

The metadata includes these:

language:
- <language-code tags automatically obtained from the uploaded projects>
tags:
- annif   # custom tag
pipeline_tag: text-classification  # HFH tag

The Model Card text content is very minimal: it has just the repo name as the heading and instructions for downloading projects from the repo. See an example at https://huggingface.co/juhoinkinen/Annif-models-upload-testing.
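The create-or-update behaviour described above can be sketched roughly as follows. This is a minimal illustration on plain dictionaries, not the code in this PR: the function name `upsert_metadata` and the constants are hypothetical, and it assumes that the existing `language` and `tags` fields, when present, are lists.

```python
from __future__ import annotations

# Hypothetical defaults mirroring the metadata listed in the PR description.
DEFAULT_TAGS = ["annif"]
DEFAULT_PIPELINE_TAG = "text-classification"


def upsert_metadata(existing: dict | None, project_langs: list[str]) -> dict:
    """Return Model Card metadata, created from defaults or updated."""
    # Copy so that the caller's dictionary is not modified.
    metadata = dict(existing) if existing else {}

    # Union of the languages already on the card and those of the uploaded projects.
    langs = set(metadata.get("language", [])) | set(project_langs)
    metadata["language"] = sorted(langs)

    # Keep any existing tags and append the defaults that are missing.
    tags = list(metadata.get("tags", []))
    for tag in DEFAULT_TAGS:
        if tag not in tags:
            tags.append(tag)
    metadata["tags"] = tags

    # Only set the pipeline tag if the card does not already define one.
    metadata.setdefault("pipeline_tag", DEFAULT_PIPELINE_TAG)
    return metadata
```

Passing `None` as the existing metadata corresponds to the case where README.md does not exist yet; passing a dictionary corresponds to updating an existing card while leaving unrelated fields untouched.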

@juhoinkinen
Member Author

About @osma's suggestions in #790 (comment):

For example it could include the Annif version used for training, the backend, vocabulary name and size, possibly some of the hyperparameters / configuration settings as well.

  • Annif version:
    • The Annif version used for training is not currently stored anywhere, and the version performing the upload is not necessarily the same. This kind of metadata should first be stored somewhere; that is tracked in the issue Store metadata of project training #329.
  • Backend, vocabulary name and other project configuration:


sonarcloud bot commented Jun 18, 2024

Quality Gate passed

Issues
6 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code



codecov bot commented Jun 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.65%. Comparing base (3b5f7a1) to head (125565e).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #793   +/-   ##
=======================================
  Coverage   99.64%   99.65%           
=======================================
  Files          91       91           
  Lines        6817     6886   +69     
=======================================
+ Hits         6793     6862   +69     
  Misses         24       24           


@juhoinkinen
Member Author

@CodiumAI-Agent /review

@CodiumAI-Agent

PR Reviewer Guide 🔍

⏱️ Estimated effort to review [1-5] 3
🧪 Relevant tests Yes
🔒 Security concerns No
⚡ Key issues to review
Possible Bug: Ensure that the upsert_modelcard function handles cases where project language data might be missing or malformed. The current implementation assumes that proj.vocab_lang is always available and valid.
Data Integrity: The merging of languages in upsert_modelcard should handle duplicates and potential case sensitivity issues to avoid incorrect language tags in the Model Card.
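One way to address the Data Integrity point is sketched below. It is an illustration only, not what upsert_modelcard does in this PR: the helper name `merge_languages` is hypothetical, and lowercasing every code is an assumption that may not suit tags whose casing matters.

```python
def merge_languages(existing, new):
    """Merge two collections of language codes without duplicates.

    Codes are stripped and lowercased before comparison, so "EN" and "en "
    count as the same language. Non-string and empty entries are skipped.
    """
    merged = set()
    for code in list(existing) + list(new):
        if not isinstance(code, str):
            continue
        normalized = code.strip().lower()
        if normalized:
            merged.add(normalized)
    # Sort for a stable order in the Model Card metadata.
    return sorted(merged)
```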

)
def test_upsert_modelcard_existing_card(ModelCard, _list_files_in_hf_hub, project):
    repo_id = "annif-user/Annif-HFH-repo"
    project.vocab_lang = "fi"
Member Author

The project fixture does not provide vocab_lang, so for these tests it is set here directly. Perhaps not the cleanest approach?

@juhoinkinen juhoinkinen marked this pull request as ready for review June 18, 2024 10:31
@juhoinkinen
Member Author

Possible Bug:
Ensure that the upsert_modelcard function handles cases where project language data might be missing or malformed. The current implementation assumes that proj.vocab_lang is always available and valid.

Good point by the AI, but I think the project language is always set if this point is reached...?
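If a guard were wanted anyway, it could look something like the sketch below. The helper `collect_languages` is hypothetical and is not part of this PR; it assumes that a missing language shows up as an absent attribute, `None`, or an empty string. It would not catch an exception raised by a property, so that case would need separate handling.

```python
from types import SimpleNamespace


def collect_languages(projects):
    """Collect vocabulary languages, skipping projects without a usable one."""
    langs = set()
    for proj in projects:
        # getattr with a default covers objects that lack the attribute entirely.
        lang = getattr(proj, "vocab_lang", None)
        if isinstance(lang, str) and lang.strip():
            langs.add(lang.strip())
    return sorted(langs)


# Stand-in objects for illustration; real Annif projects are not used here.
example_projects = [
    SimpleNamespace(vocab_lang="fi"),
    SimpleNamespace(vocab_lang=None),
    SimpleNamespace(),
    SimpleNamespace(vocab_lang="en"),
]
example_langs = collect_languages(example_projects)
```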

@juhoinkinen juhoinkinen requested a review from osma June 18, 2024 10:38