Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows: UCS-2 aka UTF-16 textfiles are treated as binary files by file.replace #52793

Open
dhs-rec opened this issue May 2, 2019 · 10 comments
Assignees
Labels
Confirmed Salt engineer has confirmed bug/feature - often including a MCVE Feature new functionality including changes to functionality and code refactors, etc.
Milestone

Comments

@dhs-rec
Copy link
Contributor

dhs-rec commented May 2, 2019

Description of Issue

When trying to run file.replace on a UCS-2 encoded file on Windows, it errors out with the following message:

----------
          ID: patch_minimum_ps_version
    Function: file.replace
        Name: C:/Program Files/WindowsPowerShell/Modules/PSWindowsUpdate/PSWindowsUpdate.psd1
      Result: False
     Comment: An exception occurred in this state: Traceback (most recent call last):
                File "c:\salt\bin\lib\site-packages\salt\state.py", line 1919, in call
                  **cdata['kwargs'])
                File "c:\salt\bin\lib\site-packages\salt\loader.py", line 1918, in wrapper
                  return f(*args, **kwargs)
                File "c:\salt\bin\lib\site-packages\salt\states\file.py", line 4331, in replace
                  backslash_literal=backslash_literal)
                File "c:\salt\bin\lib\site-packages\salt\modules\file.py", line 2229, in replace
                  .format(path)
              salt.exceptions.SaltInvocationError: Cannot perform string replacements on a binary file: C:\Program Files\WindowsPowerShell\Modules\PSWindowsUpdate\PSWindowsUpdate.psd1
     Started: 14:12:03.685800
    Duration: 15.6 ms
     Changes:

Instead of doing this, file.replace should support specifying the correct encoding for the file, for example:

patch_minimum_ps_version:
  file.replace:
    - name: C:/Program Files/WindowsPowerShell/Modules/PSWindowsUpdate/PSWindowsUpdate.psd1
    - pattern: PowerShellVersion\s+=\s+.*
    - repl: PowerShellVersion = '3.0'
    - encoding: utf16

The following Python(3) example does this (and even automatically writes a BOM):

#!/usr/bin/python3

import re

file = 'PSWindowsUpdate.psd1'

with open(file, 'r', encoding='utf16') as pswu:
  text = pswu.read()

with open(file + '.new', 'w', encoding='utf16') as f:
  for line in text.splitlines():
    if re.match('^PowerShellVersion\s+=\s+.*', line):
      f.write('PowerShellVersion = \'3.0\'\r\n')
    else:
      f.write(line + '\r\n')

Versions Report

> salt-call --versions-report
Salt Version:
           Salt: 2018.3.4

Dependency Versions:
           cffi: 1.10.0
       cherrypy: 10.2.1
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: 2.0.5
      gitpython: 2.1.3
          ioflo: Not Installed
         Jinja2: 2.9.6
        libgit2: Not Installed
        libnacl: 1.6.1
       M2Crypto: Not Installed
           Mako: 1.0.6
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: Not Installed
      pycparser: 2.17
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.5.3 (v3.5.3:1880cb95a742, Jan 16 2017, 16:02:32) [MSC v.1900 64 bit (AMD64)]
   python-gnupg: 0.4.1
         PyYAML: 3.12
          PyZMQ: 16.0.3
           RAET: Not Installed
          smmap: 2.0.5
        timelib: 0.2.4
        Tornado: 4.5.1
            ZMQ: 4.1.6

System Versions:
           dist:
         locale: cp1252
        machine: AMD64
        release: 2008ServerR2
         system: Windows
        version: 2008ServerR2 6.1.7601 SP1 Multiprocessor Free
@dhs-rec
Copy link
Contributor Author

dhs-rec commented May 3, 2019

Might even be possible to determine the encoding automatically. At least the file command on Linux is able to do so:

% file PSWindowsUpdate.psd1 
PSWindowsUpdate.psd1: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

@dhs-rec
Copy link
Contributor Author

dhs-rec commented May 3, 2019

The following, slightly modified version of above sample code does this, using libmagic Python bindings (aka. python3-magic):

#!/usr/bin/python3

import magic
import re

file = 'PSWindowsUpdate.psd1'

# Open file in binary mode and detect encoding
text = open(file, 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(text)

# Read and write file with correct encoding
text = open(file, 'r', encoding=encoding).read()
with open(file + '.new', 'w', encoding=encoding) as f:
  for line in text.splitlines():
    if re.match('^PowerShellVersion\s+=\s+.*', line):
      f.write('PowerShellVersion = \'3.0\'\r\n')
    else:
      f.write(line + '\r\n')

@waynew waynew added Feature new functionality including changes to functionality and code refactors, etc. P4 Priority 4 labels May 3, 2019
@waynew waynew added this to the Approved milestone May 3, 2019
@waynew
Copy link
Contributor

waynew commented May 3, 2019

I, for one, would like to overhaul a lot of how we handle file/stream encodings, with the ability to specify all the things (and perhaps default to utf-8, if possible).

@dhs-rec
Copy link
Contributor Author

dhs-rec commented May 6, 2019

Sounds good, although I'd consider this a bug.

Anyway, here's a version of the sample code using chardet, which might be more platform independent:

#!/usr/bin/python3

import chardet
import re

file = 'PSWindowsUpdate.psd1'

# Detect character encoding
charenc = chardet.detect(open(file, 'rb').read())['encoding']
print(charenc)

# Read file and detect linefeed type
with open(file, 'r', encoding=charenc) as f:
  text = f.read()
  lf = f.newlines

# Write new file, replacing some text
with open(file + '.new', 'w', encoding=charenc, newline=lf) as f:
  for line in text.splitlines():
    if re.match('^PowerShellVersion\s+=\s+.*', line):
      f.write('PowerShellVersion = \'3.0\'\n')
    else:
      f.write(line + '\n')

@dhs-rec
Copy link
Contributor Author

dhs-rec commented Jun 21, 2019

Any news on this bug?

@waynew
Copy link
Contributor

waynew commented Jul 17, 2019

Nothing currently - we're currently pushing hard to stabilize our test suites/pipeline, but this is definitely on my radar.

My personal opinion is that Salt should never assume encoding (though right now we assume utf-8, if we assume anything). I think I'd welcome a PR that adds an encoding argument to file.replace, if it doesn't require a more invasive change. Otherwise I think we should open a SEP detailing the work required/risks to updating the way Salt handles encoding things.

@stale
Copy link

stale bot commented Jan 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

@stale stale bot added the stale label Jan 8, 2020
@dhs-rec
Copy link
Contributor Author

dhs-rec commented Jan 8, 2020

Ping.

@stale
Copy link

stale bot commented Jan 8, 2020

Thank you for updating this issue. It is no longer marked as stale.

@stale stale bot removed the stale label Jan 8, 2020
@waynew waynew added the Confirmed Salt engineer has confirmed bug/feature - often including a MCVE label Feb 27, 2020
@sagetherage sagetherage removed the P4 Priority 4 label Jun 3, 2020
@amalaguti
Copy link

file.managed seems to have a similar issue, trying to manage a file encoded utf-16 LE (xml export from Windows Task Scheduler), added some Jinja to templatize the file, but is treated as binary. not replacing jinja by defaults values

get_xml:
  file.managed:
    - name: 'C:\task.xml'
    - source: salt:https://win_tasks/some_template.txt
    - template: jinja
    # - encoding: tried with utf16, utf16-le, utf-16-le 
    - defaults:
        TimeTrigger_StartBoundary: {{ TimeTrigger_StartBoundary }}
        TimeTrigger_StopBoundary: {{ TimeTrigger_StopBoundary }}       
some_template.txt (UTF-16LE encoded)
      <StartBoundary>{{ TimeTrigger_StartBoundary }}</StartBoundary>
      <EndBoundary>{{ TimeTrigger_StopBoundary }}</EndBoundary>
      <ExecutionTimeLimit>PT30M</ExecutionTimeLimit>

This causes the file to be treated as binary and Jinja does not get replaced.
The file looks exactly like original

minion-win-1:
----------
          ID: get_xml
    Function: file.managed
        Name: C:\task.xml
      Result: True
     Comment: File C:\task.xml updated
     Started: 17:39:38.661603
    Duration: 140.625 ms
     Changes:
              ----------
              diff:
                  Replace text file with binary file

As soon encoding of the file is changed to utf-8, the file is replaced by text file and jinja values are replaced.

@twangboy twangboy self-assigned this Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Confirmed Salt engineer has confirmed bug/feature - often including a MCVE Feature new functionality including changes to functionality and code refactors, etc.
Projects
None yet
Development

No branches or pull requests

5 participants