Fix multi-byte character handling in `read_file` #3173

sidewaysthought · 2023-04-24T23:32:46Z

Background

Related to issue #232 - Multi-byte character handling
Users report that UTF-encoded files are not being read. Some files may be UTF-8 but start with a BOM, which Python can see as invalid, or may not be UTF-8 at all.
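To make the failure modes concrete, here is a small standard-library illustration. It is a sketch rather than code from this PR, and the UTF-16 sample is only an assumed example of a file that is "not UTF-8 at all". With the plain `utf-8` codec, a UTF-8 BOM does not raise an error but is kept as a leading U+FEFF character, while a UTF-16 BOM cannot be decoded as UTF-8.

```python
import codecs

text = "ciaociao"

# UTF-8 with a BOM: the plain "utf-8" codec keeps the BOM as U+FEFF,
# while "utf-8-sig" strips it.
utf8_bom = codecs.BOM_UTF8 + text.encode("utf-8")
kept_bom = utf8_bom.decode("utf-8")
stripped = utf8_bom.decode("utf-8-sig")

# UTF-16 LE with a BOM: the leading 0xFF byte is never valid UTF-8,
# so decoding as UTF-8 raises UnicodeDecodeError.
utf16_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
try:
    utf16_bom.decode("utf-8")
    utf16_failed = False
except UnicodeDecodeError:
    utf16_failed = True

# The "utf-16" codec reads the BOM to pick the byte order and drops it.
utf16_ok = utf16_bom.decode("utf-16")
```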

Changes

The read_file command now detects the file's encoding and then uses the detected encoding to read the file more reliably. The change requires the chardet dependency, which has been added to requirements.txt.
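The detect-then-decode idea can be sketched with the standard library alone. This is not the PR's implementation, which relies on a detection library. It swaps in simple BOM sniffing with a UTF-8 attempt and a Latin-1 fallback. The names `sniff_encoding` and `read_text_detected` are hypothetical.

```python
import codecs
from pathlib import Path

# BOM signatures, UTF-32 first so a UTF-32 LE BOM (FF FE 00 00)
# is not mistaken for a UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Guess an encoding from a BOM, then try UTF-8, then fall back."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: Latin-1 as a last resort, since it can decode any bytes.
        return fallback


def read_text_detected(path) -> str:
    """Read raw bytes, guess the encoding, and decode with it."""
    data = Path(path).read_bytes()
    return data.decode(sniff_encoding(data))
```

A statistical detector can tell apart legacy encodings that this sketch lumps into the fallback, which is the reason for pulling in a dependency.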

Documentation

The change is minor, and I believe its intent is obvious when viewing the code.

Test Plan

Tested by having the agent create a file via output redirection from PowerShell (which includes a BOM and previously failed to read). Such files are now read correctly.
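A manual check like this could be turned into a repeatable one along the following lines. This is a rough sketch, not a test from the PR. It assumes the redirected file was UTF-16 LE with a BOM, and `check_reader` and `naive_reader` are hypothetical names. `reader` stands for whatever function is under test.

```python
import codecs
import tempfile
from pathlib import Path


def check_reader(reader, expected="ciaociao"):
    """Write `expected` in several encodings and return the files misread."""
    samples = {
        "utf8_bom.txt": codecs.BOM_UTF8 + expected.encode("utf-8"),
        "utf16_le_bom.txt": codecs.BOM_UTF16_LE + expected.encode("utf-16-le"),
        "plain_ascii.txt": expected.encode("ascii"),
    }
    failures = []
    with tempfile.TemporaryDirectory() as tmp:
        for name, raw in samples.items():
            path = Path(tmp) / name
            path.write_bytes(raw)
            try:
                ok = reader(path) == expected
            except UnicodeDecodeError:
                ok = False
            if not ok:
                failures.append(name)
    return failures


def naive_reader(path):
    """Stand-in for a reader that always assumes plain UTF-8."""
    return Path(path).read_text(encoding="utf-8")
```

Calling `check_reader(naive_reader)` lists both BOM-prefixed samples as failures. A reader that handles BOMs should return an empty list.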

PR Quality Checklist

My pull request is atomic and focuses on a single change.
I have thoroughly tested my changes with multiple different prompts.
I have considered potential risks and mitigations for my changes.
I have documented my changes clearly and comprehensively.
I have not snuck in any "extra" small tweaks or changes.

Unfamiliar with generating test cases on a project of this scale. Would rather not horribly break things.

SpaceTTAngel · 2023-04-25T10:14:44Z

Looks OK to me. I've tested file_operations.py and requirements.txt with several datasets, @sidewaysthought.

SpaceTTAngel

I've tested the code with a dataset of several files to simulate reading different encodings, with the following logs 👍

encoded file type: ascii
encoded file content: ciaociao

encoded file type: UTF-8-SIG
encoded file content: ciaociao

encoded file type: UTF-16
encoded file content: ciaociao

encoded file type: ascii
encoded file content: premi?re is first
premie?re is slightly different
????????? is Cyrillic
? am Deseret

encoded file type: Windows-1254
encoded file content: premiŠre is first
premie?re is slightly different
????????? is Cyrillic
? am Deseret

encoded file type: KOI8-R
encoded file content: premi?re is first
premie?re is slightly different
Кириллица is Cyrillic
? am Deseret

encoded file type: ISO-8859-1
encoded file content: première is first
premie?re is slightly different
????????? is Cyrillic
? am Deseret

encoded file type: Windows-1254

ashreim-UPL · 2023-04-26T11:20:29Z

I had the same issue, so I removed the file and replaced it with a txt file. But this is definitely a bug.

vercel · 2023-04-26T21:07:52Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment

Name	Status	Preview	Comments	Updated (UTC)
docs	⬜️ Ignored (Inspect)	Visit Preview		May 1, 2023 4:22pm

codecov · 2023-04-28T13:49:36Z

Codecov Report

Patch coverage: 81.81% and project coverage change: +0.02 🎉

Comparison is base (94ec4a4) 60.26% compared to head (9df829a) 60.28%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3173      +/-   ##
==========================================
+ Coverage   60.26%   60.28%   +0.02%     
==========================================
  Files          69       69              
  Lines        3143     3150       +7     
  Branches      525      525              
==========================================
+ Hits         1894     1899       +5     
- Misses       1116     1118       +2     
  Partials      133      133

Impacted Files	Coverage Δ
autogpt/commands/web_selenium.py	`86.04% <66.66%> (-1.61%)`	⬇️
autogpt/commands/file_operations.py	`83.13% <100.00%> (+0.19%)`	⬆️

sidewaysthought added 2 commits April 24, 2023 17:00

read_file: detect character encoding

a625e2d

Merge branch 'master' of https://github.com/Significant-Gravitas/Auto…

7bf1b3b

…-GPT

github-actions bot added the size/s label Apr 24, 2023

SpaceTTAngel previously approved these changes Apr 25, 2023

Merge branch 'Significant-Gravitas:master' into detect-file-encoding-…

b95d8f9

…3158

vercel bot temporarily deployed to Preview April 26, 2023 21:08 Inactive

Merge branch 'Significant-Gravitas:master' into detect-file-encoding-…

b4968d7

…3158

vercel bot temporarily deployed to Preview April 27, 2023 01:50 Inactive

Pwuts linked an issue Apr 28, 2023 that may be closed by this pull request

Multi-byte character handling #232

Closed

Merge branch 'master' into detect-file-encoding-3158

b9530c5

Pwuts self-assigned this Apr 28, 2023

Pwuts changed the title ~~Detect-file-encoding-3158~~ Fix multi-byte character handling in read_file Apr 28, 2023

vercel bot temporarily deployed to Preview April 28, 2023 13:46 Inactive

Pwuts requested changes Apr 28, 2023

requirements.txt (outdated, resolved)

Pwuts added the internationalization Multi-language support label Apr 28, 2023

Merge branch 'master' into detect-file-encoding-3158

f39b516

Pwuts dismissed SpaceTTAngel’s stale review via 8ebe826 May 1, 2023 16:03

Use charset_normalizer

38cf8c8

Pwuts force-pushed the detect-file-encoding-3158 branch from 8ebe826 to 38cf8c8 Compare May 1, 2023 16:03

Pwuts approved these changes May 1, 2023

Merge branch 'master' into detect-file-encoding-3158

9df829a

Pwuts merged commit a5f8563 into Significant-Gravitas:master May 1, 2023

Pwuts added this to the v0.3.0 release milestone May 1, 2023
