-
Notifications
You must be signed in to change notification settings - Fork 43.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix multi-byte character handling in read_file
#3173
Fix multi-byte character handling in read_file
#3173
Conversation
For me it's ok, I've tested the file_operations.py and requirements.txt with several datasets @sidewaysthought |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested the code with a several files dataset to stimulate encoding reading, with the following logs 👍
encoded file type: ascii
encoded file content: ciaociao
encoded file type: ascii
encoded file content: ciaociao
encoded file type: ascii
encoded file content: ciaociao
encoded file type: UTF-8-SIG
encoded file content: ciaociao
encoded file type: UTF-16
encoded file content: ciaociao
encoded file type: UTF-16
encoded file content: ciaociao
encoded file type: UTF-16
encoded file content: ciaociao
encoded file type: UTF-16
encoded file content: ciaociao
encoded file type: UTF-16
encoded file content: ciaociao
encoded file type: ascii
encoded file content: premi?re is first
premie?re is slightly different
????????? is Cyrillic
? am Deseret
encoded file type: Windows-1254
encoded file content: premiŠre is first
premie?re is slightly different
????????? is Cyrillic
? am Deseret
encoded file type: KOI8-R
encoded file content: premi?re is first
premie?re is slightly different
Кириллица is Cyrillic
? am Deseret
encoded file type: ISO-8859-1
encoded file content: première is first
premie?re is slightly different
????????? is Cyrillic
? am Deseret
encoded file type: Windows-1254
I had the same issue so I removed the file and replaced it with txt file. But this is definitely a bug |
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Ignored Deployment
|
read_file
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## master #3173 +/- ##
==========================================
+ Coverage 60.26% 60.28% +0.02%
==========================================
Files 69 69
Lines 3143 3150 +7
Branches 525 525
==========================================
+ Hits 1894 1899 +5
- Misses 1116 1118 +2
Partials 133 133
☔ View full report in Codecov by Sentry. |
8ebe826
to
38cf8c8
Compare
Background
Related to issue #232 - Multi-byte character handling
User reports UTF files not being read. Some files may be UTF-8 but have a BOM marker which Python can see as invalid, or may not be UTF-8 at all.
Changes
The read_file command has been updated to detect the file encoding, then use the detected encoding to provide more reliable reading of files. The change requires the addition of the chardet dependency. It has been added to requirements.txt
Documentation
Change is minor with what I believe to be obvious intent when viewing the code.
Test Plan
Tested by making agent create a file by output redirect from PowerShell (which includes BOM markers, and previously failed read.) Files now correctly read.
PR Quality Checklist
Unfamiliar with generating test cases on a project of this scale. Would rather not horribly break things.