Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect charset and convert non UTF-8 files for display #4950

Merged
merged 6 commits into from
Sep 29, 2018

Conversation

lafriks
Copy link
Member

@lafriks lafriks commented Sep 17, 2018

Fixes #4879

@@ -72,6 +76,21 @@ func DetectEncoding(content []byte) (string, error) {
return result.Charset, err
}

// DetectEncodingAndConvert detects the encoding of content and coverts to UTF-8 if possible
func DetectEncodingAndConvert(content []byte) []byte {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have the same function on templates.ToUTF8WithErr(buf)

Copy link
Member Author

@lafriks lafriks Sep 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but there are two problems with it. First it does return string this way I would have unneeded casting to string and back to byte array and second it does return error but I need to return default byte array in case of error. But I will rename my function and move to other module than

@bkcsoft bkcsoft added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Sep 18, 2018
@lunny lunny mentioned this pull request Sep 18, 2018
@lafriks
Copy link
Member Author

lafriks commented Sep 18, 2018

@lunny fixed

@codecov-io
Copy link

codecov-io commented Sep 21, 2018

Codecov Report

Merging #4950 into master will increase coverage by 0.01%.
The diff coverage is 63.63%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #4950      +/-   ##
=========================================
+ Coverage   37.39%   37.4%   +0.01%     
=========================================
  Files         306     306              
  Lines       45343   45373      +30     
=========================================
+ Hits        16956   16974      +18     
- Misses      25933   25945      +12     
  Partials     2454    2454
Impacted Files Coverage Δ
routers/repo/view.go 45.06% <100%> (+0.17%) ⬆️
modules/templates/helper.go 48.95% <35.71%> (-0.58%) ⬇️
modules/base/tool.go 74.52% <81.25%> (-0.62%) ⬇️
modules/process/manager.go 81.15% <0%> (+4.34%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6780661...92d7a69. Read the comment docs.

@bkcsoft bkcsoft added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Sep 21, 2018
@jonasfranz
Copy link
Member

@lunny Need your review

@lunny
Copy link
Member

lunny commented Sep 21, 2018

When the content is too short, the detection will be wrong randomly. I think maybe copy itself before detection.
See

diff --git a/modules/base/tool.go b/modules/base/tool.go
index 2dfd8ffec..19953c319 100644
--- a/modules/base/tool.go
+++ b/modules/base/tool.go
@@ -59,7 +59,17 @@ func DetectEncoding(content []byte) (string, error) {
                return "UTF-8", nil
        }

-       result, err := chardet.NewTextDetector().DetectBest(content)
+       var detectedContent []byte
+       if len(content) < 1024 {
+               times := 1024 / len(content)
+               detectedContent = make([]byte, 0, times*len(content))
+               for i := 0; i < times; i++ {
+                       detectedContent = append(detectedContent, content...)
+               }
+       } else {
+               detectedContent = content
+       }
+       result, err := chardet.NewTextDetector().DetectBest(detectedContent)
        if err != nil {
                return "", err
        }

@lunny
Copy link
Member

lunny commented Sep 21, 2018

I also tested change 1024 to 100. It's still right.

@lafriks
Copy link
Member Author

lafriks commented Sep 23, 2018

@lunny I think that should be fixed in other PR

@lunny
Copy link
Member

lunny commented Sep 23, 2018

@lafriks but this PR will be back port to release/v1.5, I don't think we need two PRs to fix that. This problem should be a chardet library's bug, my code is a temporary method. And it's not easy to update vendor on release/v1.5.

@lafriks
Copy link
Member Author

lafriks commented Sep 24, 2018

@lunny added your requested changes

@bkcsoft bkcsoft added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Sep 25, 2018
@lunny
Copy link
Member

lunny commented Sep 25, 2018

But CI failed.

}

// If there is an error, we concatenate the nicely decoded part and the
// original left over. This way we won't loose data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: loose should be lose

@techknowlogick
Copy link
Member

CI is failing because of following test:

assert.Error(t, err)

It fails to return an error detecting encoding of []byte{0xfa}

I think this is related to the code that has the if len(content) < 1024 { conditional.

@lafriks
Copy link
Member Author

lafriks commented Sep 29, 2018

Fixed tests and comments

@lunny lunny merged commit 81702e6 into go-gitea:master Sep 29, 2018
@lunny
Copy link
Member

lunny commented Sep 29, 2018

Please send back port to v1.5

@lafriks lafriks deleted the fix/chardet_view_file branch September 29, 2018 08:36
lafriks added a commit to lafriks-fork/gitea that referenced this pull request Sep 29, 2018
* Detect charset and convert non UTF-8 files for display

* Refactor and move function to correct module

* Revert unrelated changes

* More unrelated changes

* Duplicate content for small text to have better encoding detection

* Check if original content is valid before duplicating it
@lafriks lafriks added the backport/done All backports for this PR have been created label Sep 29, 2018
lunny pushed a commit that referenced this pull request Sep 30, 2018
* Detect charset and convert non UTF-8 files for display

* Refactor and move function to correct module

* Revert unrelated changes

* More unrelated changes

* Duplicate content for small text to have better encoding detection

* Check if original content is valid before duplicating it
aswild added a commit to aswild/gitea that referenced this pull request Oct 24, 2018
Prepare for wild/v1.6 branch

* BREAKING
  * Respect email privacy option in user search via API (go-gitea#4512)
  * Simply remove tidb and deps (go-gitea#3993)
  * Swagger.v1.json template (go-gitea#3572)
* FEATURE
  * Pull request review/approval and comment on code (go-gitea#3748)
  * Added dependencies for issues (go-gitea#2196) (go-gitea#2531)
  * Add the ability to have built in themes in Gitea and provide dark theme arc-green (go-gitea#4198)
  * Add sudo functionality to the API (go-gitea#4809)
  * Add oauth providers via cli (go-gitea#4591)
  * Disable merging a WIP Pull request (go-gitea#4529)
  * Force user to change password (go-gitea#4489)
  * Add letsencrypt to Gitea (go-gitea#4189)
  * Add push webhook support for mirrored repositories (go-gitea#4127)
  * Add csv file render support defaultly (go-gitea#4105)
  * Add Recaptcha functionality to Gitea (go-gitea#4044)
* BUGFIXES
  * Fix release creation via API (go-gitea#5076)
  * Remove links from topics in edit mode  (go-gitea#5026)
  * Fix missing AppSubUrl in few more templates (fixup) (go-gitea#5021)
  * Fix missing AppSubUrl in some templates (go-gitea#5020)
  * Hide outdated comments in file view (go-gitea#5017)
  * Upgrade gopkg.in/testfixtures.v2 (go-gitea#4999)
  * Disable debug routes unless PPROF is enabled in configuration (go-gitea#4995)
  * Fix user menu item styling (go-gitea#4985)
  * Fix layout of the topics editing form (go-gitea#4971)
  * Fix null pointer dereference in ParseCommitWithSignature (go-gitea#4962)
  * Fix url in discord webhook (go-gitea#4953)
  * Detect charset and convert non UTF-8 files for display (go-gitea#4950)
  * Make sure to catch the right error so it is displayed on the UI (go-gitea#4945)
  * Fix(topics): don't redirect to explore page. (go-gitea#4938)
  * Fix bug forget to remove Stopwatch when remove repository (go-gitea#4928)
  * Fix bug when repo remained bare if multiple branches pushed in single push (go-gitea#4923)
  * Fix: Let's Encrypt configuration settings (go-gitea#4911)
  * Fix: Crippled diff (go-gitea#4726) (go-gitea#4900)
  * Fix trimming of markup section names (go-gitea#4863)
  * Issues api allow pulls and fix go-gitea#4832 (go-gitea#4852)
  * Do not autocreate directory for new users/orgs (go-gitea#4828) (go-gitea#4849)
  * Fix redirect with non-ascii branch names (go-gitea#4764) (go-gitea#4810)
  * Fix missing release title in webhook (go-gitea#4783) (go-gitea#4796)
  * User shouldn't be able to approve or reject his/her own PR (go-gitea#4729)
  * Make sure to reset commit count in the cache on mirror syncing (go-gitea#4720)
  * Fixed bug where team with admin privelege type doesn't get any unit  (go-gitea#4719)
  * Fix incorrect caption of webhook setting (go-gitea#4701) (go-gitea#4717)
  * Allow WIP marker to contains < or > (go-gitea#4709)
  * Hide org/create menu item in Dashboard if user has no rights (go-gitea#4678) (go-gitea#4680)
  * Site admin could create repos even MAX_CREATION_LIMIT=0 (go-gitea#4645)
  * Fix custom templates being ignored (go-gitea#4638)
  * Fix starring icon after semantic ui update (go-gitea#4628)
  * Fix Split-View line adjustment (go-gitea#4622)
  * Fix integer constant overflows in tests (go-gitea#4616)
  * Push whitelist now doesn't apply to branch deletion (go-gitea#4601) (go-gitea#4607)
  * Fix bugs when too many IN variables (go-gitea#4594)
  * Fix failure on creating pull request with assignees (go-gitea#4419) (go-gitea#4583)
  * Fix panic issue on update avatar email (go-gitea#4580) (go-gitea#4581)
  * Fix status code label for a successful webhook (go-gitea#4540)
  * An inactive user shouldn't be able to be added as a collaborator (go-gitea#4535)
  * Don't fail silently if trying to add a collaborator twice (go-gitea#4533)
  * Fix incorrect MergeWhitelistTeamIDs check in CanUserMerge function (go-gitea#4519) (go-gitea#4525)
  * Fix out-of-transaction query in removeOrgUser (go-gitea#4521) (go-gitea#4522)
  * Fix migration from older releases (go-gitea#4495)
  * Accept 'Data:' in commit graph (go-gitea#4487)
  * Update xorm to latest version and fix correct `user` table referencing in sql (go-gitea#4473)
  * Relative URLs for LibreJS page (go-gitea#4460)
  * Redirect to correct page after using scratch token (go-gitea#4458)
  * Fix column droping for MSSQL that need new transaction for that (go-gitea#4440)
  * Replace src with raw to fix image paths (go-gitea#4377)
  * Add default merge options when creating new repository (go-gitea#4369)
  * Fix docker build (go-gitea#4358)
  * Fixes repo membership check in API (go-gitea#4341)
  * Dep upgrade mysql lib (go-gitea#4161)
  * Fix some issues with special chars in branch names (go-gitea#3767)
  * Responsive design fixes (go-gitea#4508)
* ENHANCEMENT
  * Fix milestones sorted wrongly (go-gitea#4987)
  * Allow api to create tags for releases if they don't exist (go-gitea#4890)
  * Fix go-gitea#4877 to follow the OpenID Connect Audiences spec (go-gitea#4878)
  * Enforce token on api routes [fixed critical security issue go-gitea#4357] (go-gitea#4840)
  * Update legacy branch and tag URLs in dashboard to new format (go-gitea#4812)
  * Slack webhook channel name cannot be empty or just contain an hashtag (go-gitea#4786)
  * Add whitespace handling to PR-comparsion (go-gitea#4683)
  * Make reverse proxy auth optional (go-gitea#4643)
  * MySQL TLS (go-gitea#4642)
  * Make sure to set PR split view when creating/previewing a pull request  (go-gitea#4617)
  * Log user in after a successful sign up (go-gitea#4615)
  * Fix typo IsPullReuqestBroken -> IsPullRequestBroken (go-gitea#4578)
  * Allow admin toggle forcing a password change for newly created users (go-gitea#4563)
  * Update jQuery to v1.12.4 (go-gitea#4551)
  * Env var GITEA_PUSHER_EMAIL (go-gitea#4516)
  * Feat(repo): support search repository by topic name (go-gitea#4505)
  * Small improvements to dependency UI (go-gitea#4503)
  * Make max commits in graph configurable (go-gitea#4498)
  * Add valid for lfs oid (go-gitea#4461)
  * Add shortcut to save wiki page (go-gitea#4452)
  * Allow administrator to create repository for any organization (go-gitea#4368)
  * Fix repository last updated time update when delete a user who watched the repo (go-gitea#4363)
  * Switch plaintext scratch tokens to use hash instead (go-gitea#4331)
  * Increase default TOTP secret size to 320 bits (go-gitea#4287)
  * Keep preseeded database password (go-gitea#4284)
  * Implemented hover text showing user FullName (go-gitea#4261)
  * Add ability to delete a token (go-gitea#4235)
  * Fix typos in i18n variable names. (go-gitea#4080)
  * Api: repos/search: add parameters to control the sort order (go-gitea#3964)
  * Add missing path in the Docker app.ini template (go-gitea#2181)
  * Add file name and branch to page title (go-gitea#4902)
  * Offline use of google fonts (go-gitea#4872)
  * Add missing History link to directory listings v2 (go-gitea#4829)
  * Locale for Edit and Remove due date issue (go-gitea#4802)
  * Disable 'May Import Local Repository' when is disabled by setting (Is… (go-gitea#4780)
  * API /admin/users/{username} missing parameter (go-gitea#4775)
  * Display error when adding a user to a team twice (go-gitea#4746)
  * Remove UsePrivilegeSeparation from the Docker sshd_config, see go-gitea#2876 (go-gitea#4722)
  * Focus title input when clicking helper link (go-gitea#4696)
  * Add vendor to user reserved words and format words list according alphabet (go-gitea#4685)
  * Add gitea/issues link to 500 page (go-gitea#4654)
  * Hide home button when landing page is not set to home (go-gitea#4651)
  * Remove link to GitHub issues in 404 template (go-gitea#4639)
  * Cmd/serve: pprof cpu and memory profile dumps to disk (go-gitea#4560)
  * Add flash message after an account has been successfully activated (go-gitea#4510)
  * Prevent html entity escaping on delete branch (go-gitea#4471)
  * Locale for button Edit on protected branch (go-gitea#4442)
  * Update notification icon (go-gitea#4343)
  * Added front-end topics validation (go-gitea#4316)
  * Don't display buttons if there are no system notifications (go-gitea#4280)
  * Issue due date api (go-gitea#3890)
* SECURITY
  * Improve URL validation for external wiki  and external issues (go-gitea#4710)
  * Make cookies HttpOnly and obey COOKIE_SECURE flag (go-gitea#4706)
  * Don't disclose emails of all users when sending out emails (go-gitea#4664)
  * Check that repositories can only be migrated to own user or organizations (go-gitea#4366)
* TRANSLATION
  * Fix punctuation in English translation (go-gitea#4958)
  * Fix translation (go-gitea#4355)
@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/done All backports for this PR have been created lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The readme.txt rendering displays the garbled code
6 participants