Windows: UCS-2 aka UTF-16 textfiles are treated as binary files by file.replace #52793

dhs-rec · 2019-05-02T14:23:38Z

Description of Issue

When trying to run file.replace on a UCS-2 encoded file on Windows, it errors out with the following message:

----------
          ID: patch_minimum_ps_version
    Function: file.replace
        Name: C:/Program Files/WindowsPowerShell/Modules/PSWindowsUpdate/PSWindowsUpdate.psd1
      Result: False
     Comment: An exception occurred in this state: Traceback (most recent call last):
                File "c:\salt\bin\lib\site-packages\salt\state.py", line 1919, in call
                  **cdata['kwargs'])
                File "c:\salt\bin\lib\site-packages\salt\loader.py", line 1918, in wrapper
                  return f(*args, **kwargs)
                File "c:\salt\bin\lib\site-packages\salt\states\file.py", line 4331, in replace
                  backslash_literal=backslash_literal)
                File "c:\salt\bin\lib\site-packages\salt\modules\file.py", line 2229, in replace
                  .format(path)
              salt.exceptions.SaltInvocationError: Cannot perform string replacements on a binary file: C:\Program Files\WindowsPowerShell\Modules\PSWindowsUpdate\PSWindowsUpdate.psd1
     Started: 14:12:03.685800
    Duration: 15.6 ms
     Changes:

Instead of doing this, file.replace should support specifying the correct encoding for the file, for example:

patch_minimum_ps_version:
  file.replace:
    - name: C:/Program Files/WindowsPowerShell/Modules/PSWindowsUpdate/PSWindowsUpdate.psd1
    - pattern: PowerShellVersion\s+=\s+.*
    - repl: PowerShellVersion = '3.0'
    - encoding: utf16

The following Python(3) example does this (and even automatically writes a BOM):

#!/usr/bin/python3

import re

file = 'PSWindowsUpdate.psd1'

with open(file, 'r', encoding='utf16') as pswu:
  text = pswu.read()

with open(file + '.new', 'w', encoding='utf16') as f:
  for line in text.splitlines():
    if re.match('^PowerShellVersion\s+=\s+.*', line):
      f.write('PowerShellVersion = \'3.0\'\r\n')
    else:
      f.write(line + '\r\n')

Versions Report

> salt-call --versions-report
Salt Version:
           Salt: 2018.3.4

Dependency Versions:
           cffi: 1.10.0
       cherrypy: 10.2.1
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: 2.0.5
      gitpython: 2.1.3
          ioflo: Not Installed
         Jinja2: 2.9.6
        libgit2: Not Installed
        libnacl: 1.6.1
       M2Crypto: Not Installed
           Mako: 1.0.6
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: Not Installed
      pycparser: 2.17
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.5.3 (v3.5.3:1880cb95a742, Jan 16 2017, 16:02:32) [MSC v.1900 64 bit (AMD64)]
   python-gnupg: 0.4.1
         PyYAML: 3.12
          PyZMQ: 16.0.3
           RAET: Not Installed
          smmap: 2.0.5
        timelib: 0.2.4
        Tornado: 4.5.1
            ZMQ: 4.1.6

System Versions:
           dist:
         locale: cp1252
        machine: AMD64
        release: 2008ServerR2
         system: Windows
        version: 2008ServerR2 6.1.7601 SP1 Multiprocessor Free

The text was updated successfully, but these errors were encountered:

dhs-rec · 2019-05-03T12:54:54Z

Might even be possible to determine the encoding automatically. At least the file command on Linux is able to do so:

% file PSWindowsUpdate.psd1 
PSWindowsUpdate.psd1: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

dhs-rec · 2019-05-03T13:11:06Z

The following, slightly modified version of above sample code does this, using libmagic Python bindings (aka. python3-magic):

#!/usr/bin/python3

import magic
import re

file = 'PSWindowsUpdate.psd1'

# Open file in binary mode and detect encoding
text = open(file, 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(text)

# Read and write file with correct encoding
text = open(file, 'r', encoding=encoding).read()
with open(file + '.new', 'w', encoding=encoding) as f:
  for line in text.splitlines():
    if re.match('^PowerShellVersion\s+=\s+.*', line):
      f.write('PowerShellVersion = \'3.0\'\r\n')
    else:
      f.write(line + '\r\n')

waynew · 2019-05-03T21:23:45Z

I, for one, would like to overhaul a lot of how we handle file/stream encodings, with the ability to specify all the things (and perhaps default to utf-8, if possible).

dhs-rec · 2019-05-06T05:26:20Z

Sounds good, although I'd consider this a bug.

Anyway, here's a version of the sample code using chardet, which might be more platform independent:

#!/usr/bin/python3

import chardet
import re

file = 'PSWindowsUpdate.psd1'

# Detect character encoding
charenc = chardet.detect(open(file, 'rb').read())['encoding']
print(charenc)

# Read file and detect linefeed type
with open(file, 'r', encoding=charenc) as f:
  text = f.read()
  lf = f.newlines

# Write new file, replacing some text
with open(file + '.new', 'w', encoding=charenc, newline=lf) as f:
  for line in text.splitlines():
    if re.match('^PowerShellVersion\s+=\s+.*', line):
      f.write('PowerShellVersion = \'3.0\'\n')
    else:
      f.write(line + '\n')

dhs-rec · 2019-06-21T12:39:21Z

Any news on this bug?

waynew · 2019-07-17T19:08:21Z

Nothing currently - we're currently pushing hard to stabilize our test suites/pipeline, but this is definitely on my radar.

My personal opinion is that Salt should never assume encoding (though right now we assume utf-8, if we assume anything). I think I'd welcome a PR that adds an encoding argument to file.replace, if it doesn't require a more invasive change. Otherwise I think we should open a SEP detailing the work required/risks to updating the way Salt handles encoding things.

stale · 2020-01-08T07:10:43Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

dhs-rec · 2020-01-08T07:14:21Z

Ping.

stale · 2020-01-08T07:14:24Z

Thank you for updating this issue. It is no longer marked as stale.

amalaguti · 2023-02-06T18:00:42Z

file.managed seems to have a similar issue, trying to manage a file encoded utf-16 LE (xml export from Windows Task Scheduler), added some Jinja to templatize the file, but is treated as binary. not replacing jinja by defaults values

get_xml:
  file.managed:
    - name: 'C:\task.xml'
    - source: salt:https://win_tasks/some_template.txt
    - template: jinja
    # - encoding: tried with utf16, utf16-le, utf-16-le 
    - defaults:
        TimeTrigger_StartBoundary: {{ TimeTrigger_StartBoundary }}
        TimeTrigger_StopBoundary: {{ TimeTrigger_StopBoundary }}

some_template.txt (UTF-16LE encoded)
      <StartBoundary>{{ TimeTrigger_StartBoundary }}</StartBoundary>
      <EndBoundary>{{ TimeTrigger_StopBoundary }}</EndBoundary>
      <ExecutionTimeLimit>PT30M</ExecutionTimeLimit>

This causes the file to be treated as binary and Jinja does not get replaced.
The file looks exactly like original

minion-win-1:
----------
          ID: get_xml
    Function: file.managed
        Name: C:\task.xml
      Result: True
     Comment: File C:\task.xml updated
     Started: 17:39:38.661603
    Duration: 140.625 ms
     Changes:
              ----------
              diff:
                  Replace text file with binary file

As soon encoding of the file is changed to utf-8, the file is replaced by text file and jinja values are replaced.

waynew added Feature new functionality including changes to functionality and code refactors, etc. P4 Priority 4 labels May 3, 2019

waynew added this to the Approved milestone May 3, 2019

stale bot added the stale label Jan 8, 2020

stale bot removed the stale label Jan 8, 2020

waynew added the Confirmed Salt engineer has confirmed bug/feature - often including a MCVE label Feb 27, 2020

sagetherage removed the P4 Priority 4 label Jun 3, 2020

oeuftete mentioned this issue Oct 16, 2020

[BUG] file.replace breaks with UTF-8 encoded files on Windows #58749

Closed

twangboy self-assigned this Mar 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Windows: UCS-2 aka UTF-16 textfiles are treated as binary files by file.replace #52793

Windows: UCS-2 aka UTF-16 textfiles are treated as binary files by file.replace #52793

dhs-rec commented May 2, 2019

dhs-rec commented May 3, 2019

dhs-rec commented May 3, 2019

waynew commented May 3, 2019

dhs-rec commented May 6, 2019 •

edited

Loading

dhs-rec commented Jun 21, 2019

waynew commented Jul 17, 2019

stale bot commented Jan 8, 2020

dhs-rec commented Jan 8, 2020

stale bot commented Jan 8, 2020

amalaguti commented Feb 6, 2023

Windows: UCS-2 aka UTF-16 textfiles are treated as binary files by file.replace #52793

Windows: UCS-2 aka UTF-16 textfiles are treated as binary files by file.replace #52793

Comments

dhs-rec commented May 2, 2019

Description of Issue

Versions Report

dhs-rec commented May 3, 2019

dhs-rec commented May 3, 2019

waynew commented May 3, 2019

dhs-rec commented May 6, 2019 • edited Loading

dhs-rec commented Jun 21, 2019

waynew commented Jul 17, 2019

stale bot commented Jan 8, 2020

dhs-rec commented Jan 8, 2020

stale bot commented Jan 8, 2020

amalaguti commented Feb 6, 2023

dhs-rec commented May 6, 2019 •

edited

Loading