Fix multi-byte character handling in read_file #3173

Merged

sidewaysthought
Contributor

@sidewaysthought sidewaysthought commented Apr 24, 2023

Background

Related to issue #232 - Multi-byte character handling
Users report that some UTF-encoded files cannot be read. A file may be UTF-8 but start with a BOM (byte order mark), which Python can treat as invalid, or it may not be UTF-8 at all.

Changes

The read_file command now detects the file's encoding and uses the detected encoding to read the file more reliably. The change requires the chardet dependency, which has been added to requirements.txt.
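
For readers who want to see the idea without opening the diff, here is a minimal sketch of the detect-then-decode approach described above. It is not the merged code: the helper names `detect_encoding` and `read_text_detected` are hypothetical, and the BOM-based fallback used when chardet is missing or returns no guess is an assumption added only to keep the sketch runnable on its own.

```python
import codecs
from pathlib import Path

try:
    import chardet  # third-party dependency that this PR adds
except ImportError:  # assumption: keep the sketch usable without chardet
    chardet = None


def detect_encoding(raw: bytes) -> str:
    """Guess the encoding of raw bytes (hypothetical helper)."""
    if chardet is not None:
        # chardet.detect returns a dict with an "encoding" key,
        # which may be None when it cannot make a guess.
        guess = chardet.detect(raw)["encoding"]
        if guess:
            return guess
    # Fallback (not part of the PR): look for a BOM, else assume UTF-8.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"


def read_text_detected(path) -> str:
    """Read a file as bytes, then decode it with the detected encoding."""
    raw = Path(path).read_bytes()
    return raw.decode(detect_encoding(raw))
```

Reading the bytes once and decoding them in memory avoids opening the file twice. Detection on very short inputs is only a guess, so callers may still want to handle `UnicodeDecodeError`.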

Documentation

Change is minor with what I believe to be obvious intent when viewing the code.

Test Plan

Tested by having the agent create a file via output redirection from PowerShell (the resulting file includes a BOM and previously failed to read). Such files are now read correctly.
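
The failure can be reproduced without PowerShell. The sketch below assumes the redirected output was UTF-16 LE with a BOM, which is what Windows PowerShell 5.1 writes by default; the PR text itself only says the file had a BOM. The helper names `make_bom_file` and `read_as_utf8` are hypothetical and use only the standard library.

```python
import codecs
import tempfile
from pathlib import Path


def make_bom_file(text: str) -> Path:
    """Write text as UTF-16 LE with a leading BOM and return the path."""
    path = Path(tempfile.mkdtemp()) / "redirected.txt"
    path.write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))
    return path


def read_as_utf8(path: Path):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        return None


sample = make_bom_file("ciaociao")
print(read_as_utf8(sample))                 # strict UTF-8 read fails on the BOM bytes
print(sample.read_text(encoding="utf-16"))  # BOM-aware read recovers the text
```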

PR Quality Checklist

  • My pull request is atomic and focuses on a single change.
  • I have thoroughly tested my changes with multiple different prompts.
  • I have considered potential risks and mitigations for my changes.
  • I have documented my changes clearly and comprehensively.
  • I have not snuck in any "extra" small tweaks or changes.

Unfamiliar with generating test cases on a project of this scale. Would rather not horribly break things.

@SpaceTTAngel

Looks OK to me. I've tested file_operations.py and requirements.txt with several datasets. @sidewaysthought

SpaceTTAngel
SpaceTTAngel previously approved these changes Apr 25, 2023

@SpaceTTAngel SpaceTTAngel left a comment

I've tested the code with a dataset of several files to simulate reading different encodings, with the following logs 👍

encoded file type: ascii
encoded file content: ciaociao


encoded file type: ascii
encoded file content: ciaociao


encoded file type: ascii
encoded file content: ciaociao


encoded file type: UTF-8-SIG
encoded file content: ciaociao


encoded file type: UTF-16
encoded file content: ciaociao


encoded file type: UTF-16
encoded file content: ciaociao


encoded file type: UTF-16
encoded file content: ciaociao


encoded file type: UTF-16
encoded file content: ciaociao


encoded file type: UTF-16
encoded file content: ciaociao


encoded file type: ascii
encoded file content: premi?re is first
premie?re is slightly different
????????? is Cyrillic
? am Deseret


encoded file type: Windows-1254
encoded file content: premiŠre is first
premie?re is slightly different
????????? is Cyrillic
? am Deseret


encoded file type: KOI8-R
encoded file content: premi?re is first
premie?re is slightly different
Кириллица is Cyrillic
? am Deseret


encoded file type: ISO-8859-1
encoded file content: première is first
premie?re is slightly different
????????? is Cyrillic
? am Deseret


encoded file type: Windows-1254

@ashreim-UPL

I had the same issue, so I removed the file and replaced it with a txt file. But this is definitely a bug.

@Pwuts Pwuts linked an issue Apr 28, 2023 that may be closed by this pull request
@Pwuts Pwuts self-assigned this Apr 28, 2023
@Pwuts Pwuts changed the title Detect-file-encoding-3158 Fix multi-byte character handling in read_file Apr 28, 2023
@codecov

codecov bot commented Apr 28, 2023

Codecov Report

Patch coverage: 81.81% and project coverage change: +0.02 🎉

Comparison is base (94ec4a4) 60.26% compared to head (9df829a) 60.28%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3173      +/-   ##
==========================================
+ Coverage   60.26%   60.28%   +0.02%     
==========================================
  Files          69       69              
  Lines        3143     3150       +7     
  Branches      525      525              
==========================================
+ Hits         1894     1899       +5     
- Misses       1116     1118       +2     
  Partials      133      133              
Impacted Files                         Coverage Δ
autogpt/commands/web_selenium.py       86.04% <66.66%> (-1.61%) ⬇️
autogpt/commands/file_operations.py    83.13% <100.00%> (+0.19%) ⬆️

Review thread on requirements.txt (outdated, resolved)
@Pwuts Pwuts added the internationalization Multi-language support label Apr 28, 2023
@Pwuts Pwuts force-pushed the detect-file-encoding-3158 branch from 8ebe826 to 38cf8c8 Compare May 1, 2023 16:03
@Pwuts Pwuts merged commit a5f8563 into Significant-Gravitas:master May 1, 2023
@Pwuts Pwuts added this to the v0.3.0 release milestone May 1, 2023
Labels
internationalization Multi-language support size/s
Development

Successfully merging this pull request may close these issues.

Multi-byte character handling
4 participants