ampersand (&) in HTML corrupts word document file #1500

Open
2 of 3 tasks
silverbackdan opened this issue Nov 7, 2018 · 9 comments

@silverbackdan

silverbackdan commented Nov 7, 2018

This is:

Expected Behavior

Correct generation of a word file

Current Behavior

A word file is generated but cannot be opened because it is corrupted.

Failure Information

When adding HTML, if there is an ampersand (just an & character) the output document is corrupt. It is required to turn the ampersand into the HTML-encoded &amp;

How to Reproduce

$section = $this->phpWord->addSection();
Html::addHtml($section, '&', false, false);

Context

  • PHP version: 7.2.7
  • PHPWord version: 0.15

EDIT:
My mistake - reproduction can be with either

Html::addHtml($section, '&', false, false);

OR

Html::addHtml($section, '&amp;', false, false);

HTML character codes such as &quot; work fine

@silverbackdan
Author

silverbackdan commented Nov 7, 2018

I've had a quick look and the html normalization seems a little too simplistic as well.

Instead of
$html = str_replace('&', '&amp;', $html);
how about something like
$html = preg_replace('/(&)(?![0-9a-z]+;)/i', '&amp;', $html);
This way you will not catch HTML entities that are already encoded. TBH I'm still not really sure why the ampersand causes this issue. Does it need to be escaped further down the line when inserting into the Word document format?
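For comparison, here is a rough sketch of how the two replacements behave on a sample string. The sample input and the stricter pattern at the end are illustrative assumptions, not code from PHPWord. Note that the proposed pattern only recognises named entities, so a numeric entity such as &#38; would still be double-encoded.

```php
<?php
// Sample input (hypothetical): a bare ampersand mixed with already-encoded entities.
$html = 'Tom & Jerry &quot;x&quot; &amp; co &#38; more';

// Plain replacement: every ampersand is escaped, so existing entities are double-encoded.
$naive = str_replace('&', '&amp;', $html);

// Proposed replacement: skip ampersands that already start a named entity.
$proposed = preg_replace('/(&)(?![0-9a-z]+;)/i', '&amp;', $html);

// Hypothetical stricter variant that also leaves numeric entities untouched.
$strict = preg_replace('/&(?!(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);)/i', '&amp;', $html);

echo $naive, PHP_EOL;    // Tom &amp; Jerry &amp;quot;x&amp;quot; &amp;amp; co &amp;#38; more
echo $proposed, PHP_EOL; // Tom &amp; Jerry &quot;x&quot; &amp; co &amp;#38; more
echo $strict, PHP_EOL;   // Tom &amp; Jerry &quot;x&quot; &amp; co &#38; more
```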

EDIT
Another bug is present in the addText method.
$section->addText('&'); causes the issue, $section->addText('&amp;'); does not. I think I'm nearing a solution here hopefully.

silverbackdan added a commit to silverbackdan/PHPWord that referenced this issue Nov 7, 2018
@troosan
Contributor

troosan commented Nov 25, 2018

It is normal that the following fails, as it is not valid HTML.

Html::addHtml($section, '&', false, false);

The following works fine

Html::addHtml($section, '&amp;', false, false);

As for the bug with the addText function, indeed, this should be escaped when writing to XML, but I don't think the Element/Text class is the place to do this. You will break the RTF writer. This should be escaped in the Word2007 writer instead.
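To illustrate why an unescaped ampersand breaks the file: a .docx stores its text as XML, and a bare & makes that XML not well-formed. The sketch below uses only PHP's built-in DOM extension. The isWellFormedXml helper and the simplified <t> element are hypothetical stand-ins, not PHPWord's actual Word2007 writer code.

```php
<?php
// Hypothetical helper: returns true when the string parses as well-formed XML.
function isWellFormedXml(string $xml): bool
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $ok = $dom->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok;
}

$text = 'Hello & Bye';

// Simplified stand-in for a text run in the document XML (namespaces omitted).
$raw = '<t>' . $text . '</t>';

// Escaping at write time turns the bare ampersand into &amp;.
$escaped = '<t>' . htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</t>';

var_dump(isWellFormedXml($raw));     // bool(false)
var_dump(isWellFormedXml($escaped)); // bool(true)
```

Escaping at the point where the XML is written keeps the text element itself format-neutral, which matches the concern about not breaking the RTF writer.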

@silverbackdan
Author

When the DOM element is created in PHP, &amp; is converted to just &, so I don't think there is any getting around it. There is already a normalization attempt that I spotted, but it is reverted when getting the HTML from the PHP DOM element.
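The decoding described above can be reproduced with PHP's DOMDocument alone. This is a minimal sketch, assuming the HTML is wrapped in a root element and parsed as XML; it is not taken from PHPWord's source.

```php
<?php
// Entity-encoded markup, wrapped in a root element so that it parses as XML.
$dom = new DOMDocument();
$dom->loadXML('<body>Hello &amp; Bye</body>');

// Reading the node value returns the decoded text, with a bare ampersand.
$decoded = $dom->documentElement->nodeValue;

// Serializing the node escapes the ampersand again.
$serialized = $dom->saveXML($dom->documentElement);

echo $decoded, PHP_EOL;    // Hello & Bye
echo $serialized, PHP_EOL; // <body>Hello &amp; Bye</body>
```

So any text read from a DOM node has to be escaped again before it is written into another XML document.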

I am not sure what the final document formats look like, I've not researched that before so am not sure what breaking changes there are between output formats. Could you link me to the function that you'd like me to move the fix to? I'm happy to do that if you'd like - I'm also not precious about the fix should you want to move it to where you think it is most appropriate.

Thanks for your time on this.

@jeromeWeissmann

Hi, got the same error here.
I've tried the following:
Html::addHtml($this->section, "Hello & Bye");
Or
Html::addHtml($this->section, "Hello &amp; Bye");

docx file is corrupt in both cases.

@jeromeWeissmann

After looking for more information, I found that calling
\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(true);
fixes the problem.

@silverbackdan
Author

Oh wow, that's a bit strange - I wonder why a user would not escape the output, are there any side-effects?

@troosan
Contributor

troosan commented Nov 28, 2018

@silverbackdan Indeed, ideally, the setting should be set to true by default.
This was not done for backward compatibility reasons. Maybe this could be changed in version 1.0

@margori

margori commented Dec 7, 2018

I confirm this bug.

@davidcheal

I realise this is an old bug, but I spent a couple of hours chasing it today so thought I'd resurface it. If I send '&amp;' in the HTML it should go all the way through to Word, but it gets converted back to a '&', and in a Word doc that corrupts it.

I tried finding how the conversion happens, but can't work it out. I can see the '&amp;' in the HTML at line 82 of Html.php, but in the call to parseNode it is suddenly a '&'.

setOutputEscapingEnabled(true); does solve the issue though.
