Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exclude certain languages from repo stats #11672

Closed
wants to merge 3 commits into from

Conversation

CirnoT
Copy link
Contributor

@CirnoT CirnoT commented May 29, 2020

For one reason or another, Linguist sometimes decides that it is better to not mark file as generated, so that its diff is not suppressed on GitHub.

One such example is yarn.lock, often present in Node repositories using Yarn as package manager. Due to it's nature it was decided to not include it as generated file as it is easy to inspect its diff for security purposes (lockfile poison).


However, despite that, GitHub still does not include such files in their language bar, unless they are the only file present in repository.

Further tests revealed that in fact, GitHub explicitly ignores certain languages unless they have 100% presence in repository. There is no publicly available list of such exclusions, however prominent examples are JSON and YAML - you can verify this by trying to search GitHub repositories by language:.

This PR aims to add support for similar logic in Gitea. The list of languages present right now might not be full, the only real way to find would be to test every language GitHub knows about.

@silverwind
Copy link
Member

I wonder if this should be better done upstream in go-enry.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label May 29, 2020
@CirnoT
Copy link
Contributor Author

CirnoT commented May 29, 2020

go-enry aims to replicate linguist behavior, and linguist decided that yarn.lock should not be marked as generated back in 2019.

Doing this in upstream would mean significant changes between linguist and go-enry, and I believe that is not the aim of the project.

Additionally, GitHub also performs same exclusion outside of linguist scope (otherwise we'd be seeing lot of YAML repos instead of JavaScript ones)


ref github-linguist/linguist#4348 github-linguist/linguist#4445 github-linguist/linguist#4459 github-linguist/linguist@3bc8185

@CirnoT
Copy link
Contributor Author

CirnoT commented May 29, 2020

Frankly I find reasoning given by linguist insufficient in consideration that every other lockfile in existence is excluded EXCEPT for yarn.lock because... someone didn't want diffs for it collapses and linguist maintainers agreed.

That being said I definitely agree that yarn.lock is particularly very easy and pleasant to review and I often find myself checking it on PRs that modify it. Perhaps linguist could implement additional list for lockfiles only, which would be ignored but not diff-collapses; but I'm no Ruby developer and I'm not sure maintainers are even interested in bringing new functionality not used by GitHub.

Either way, I think this PR is still valid as it aims to improve code statistics detection by removing unnecessary noise, bringing it closer to GitLab and GitHub's implementations.

@silverwind
Copy link
Member

I'd definitely agree that yarn.lock should be considered generated and it just seems silly to not do it for yarn.lock but do so for package-lock.json.

The argument that yaml is easier to diff than json is also silly, I can easily visually diff both for small changes and for bigger ones I skip over both of them.

@silverwind
Copy link
Member

Maybe add a comment describing that the reason for the existance of this list is to overrule linguist/enry's behaviour that is considered suboptimal.

@silverwind
Copy link
Member

silverwind commented May 29, 2020

What I worry thought is that let's say I have a repo for a Ansible role. These are almost purely YAML and those files are basically equivalent to code. If I introduce a bash script into said repo, it would no longer count any YAML at all?

Maybe some sort of filename-based blacklist would be better. It should only count files that are definitely always generated. Users could define their own classifications via gitattributes.

@CirnoT
Copy link
Contributor Author

CirnoT commented May 29, 2020

That isn't strictly true; the existence of the list can be owned to linguist poor choice because that's what personally made me write the code, but in general the purpose of the list for now and for future is to not only overrule but also remove noise.

I would rather not implement any file-based approach lest we re-do linguist/enry on our own. I advise you to push example repository to GitHub and see how they see it. My bet is that they will ignore YAML just like my implementation.

@CirnoT
Copy link
Contributor Author

CirnoT commented May 29, 2020

I found good example here: https://github.com/ansible/ansible-examples

@CirnoT
Copy link
Contributor Author

CirnoT commented May 29, 2020

GitLab outright refuses to even acknowledge YAML even if it's lone in repository: https://gitlab.com/maksold/ansible-role-certbot

@silverwind
Copy link
Member

silverwind commented May 29, 2020

So you're saying this exclusion is not part of linguist but some hackery that GitHub does after the files were processed by linuist?

I wonder if one can overrule this via gitattributes *.yaml -linguist-vendored -linguist-generated.

Maybe instead of a plain exclusion list, we want to specify the files as generated so enry will see them as such, it'd be more compatible and user-overridable.

Edit: created https://github.com/silverwind/ansible-examples, so far no change

@CirnoT
Copy link
Contributor Author

CirnoT commented May 29, 2020

Yes you can overrule it for linguist, it will make GitHub show full diff as it won't treat these files as generated/vendored, but it doesn't seem to affect language statistics.

GitLab doesn't use linguist at all and I'm not sure you can even override it there.

@CirnoT
Copy link
Contributor Author

CirnoT commented May 29, 2020

These ansible files aren't considered generated or vendored by linguist in the first place, so .gitattributes change has no meaning really. It's just that GitHub explicitly ignores specific languages from language bar/statistics, and so does GitLab. It helps keep the noise down I suppose.

@CirnoT
Copy link
Contributor Author

CirnoT commented May 29, 2020

Generally GitHub uses linguist for few different things. One of them is deciding if diff should be collapsed, another is language bar statistics. For language bar they apply additional logic to clean out certain languages from showing in almost every repo.

@silverwind
Copy link
Member

Ok, I guess the notion behind this that these files would never represent source code. Certainly not true for YAML or SVG but those are kind of grey areas.

BTW, yarn lockfile v1 is not strictly YAML format, only the v2 version is. Still most mechanisms seem to pick it up as YAML for some reason.

@CirnoT
Copy link
Contributor Author

CirnoT commented May 29, 2020

Actually linguist explicitly marks yarn.lock filename as YAML, for similar reason the generated exclusion was disabled - syntax coloring for diffs.

@jolheiser
Copy link
Member

jolheiser commented May 29, 2020

Would this be better served as a configurable option?

.gitea/stats file in the repo or something like that?

That would allow us to close #10266 (alternatively you could add TOML to that list, but I feel like a per-repo config file might be more future-proof)

@CirnoT
Copy link
Contributor Author

CirnoT commented May 29, 2020

I'm generally opposed to idea of configuring language statistics per-repo in yet-another-dot-file(r). My idea is that Gitea should properly identify language of repositories by default, without forcing user to apply additional configuration.

Remember, the function of language statistics is also so that repos can be looked up in search via their language, we can not rely on every repository owner doing that, and it also opens an issue of what to do with migrations?

I strongly believe expected user behavior is for language statistic to work out-of-the-box, without requiring additional configuration for every repository.

Added TOML to list.

@zeripath
Copy link
Contributor

although I'm generally opposed to requiring yet another dotfile - as I think we should generally just get things right first time - I think having an option to override behaviour is always a good idea.

That being said - that can be done as a different PR.

@lafriks
Copy link
Member

lafriks commented May 29, 2020

No need for additional dot file. Github does language stats fine-tuning using .gitattributes file but imho that is out of scope for this PR

@CirnoT
Copy link
Contributor Author

CirnoT commented May 29, 2020

I won't oppose any PR that adds such ability via .gitattributes if it's opt-out.

@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels May 29, 2020
@CirnoT
Copy link
Contributor Author

CirnoT commented May 29, 2020

Deprecated by #11681

@CirnoT CirnoT closed this May 29, 2020
@CirnoT CirnoT deleted the language_stat-excludes branch May 29, 2020 19:37
@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
lgtm/need 1 This PR needs approval from one additional maintainer to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants