Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Special Characters (ampersand) in Title break docx output #401

Closed
jonnsn opened this issue Oct 22, 2014 · 27 comments
Closed

Special Characters (ampersand) in Title break docx output #401

jonnsn opened this issue Oct 22, 2014 · 27 comments

Comments

@jonnsn
Copy link
Contributor

jonnsn commented Oct 22, 2014

Hi,

if you change Sample_17_TitleTOC.php to add a title including an &-sign

$section->addTitle('Foo & Bar', 1);

and select docx-Result the document won't open (tested in Word 2013):
"The file ____ cannot be opened because of problems with the contents"
Details: "Invalid char"
Position: [position of the ampersand in document.xml]

I did not test this with other characters but there might also be problems with ",',< and >

I found this post which might indicate the problem.
Escaping the character does not really help (see the StackOveflow-Thread linked above).

If I find some time I will try to solve this issue and make a patch, but I just wanted to note it down here already. Maybe someone has a note or already solved it somehow :-)

Regards

@ghost
Copy link

ghost commented Oct 22, 2014

Hi, @jonnsn. You must escape any string you submit to PHPWord. As for me, I prefer Zend\Escaper for that purpose. PHPWord code is not responsible for text escaping. Developer, who use PHPWord, is responsible for escaping string he passes to PHPWord. There maybe lack of documentation on that topic. Feel free to modify PHPWord documentation. ;)

@ghost ghost self-assigned this Oct 22, 2014
@ghost ghost added Documentation and removed Documentation labels Oct 22, 2014
@ghost ghost removed their assignment Oct 22, 2014
@jonnsn
Copy link
Contributor Author

jonnsn commented Oct 22, 2014

Hi @RomanSyroeshko

Thanks for your reply!
if I try

$section->addTitle(htmlspecialchars('Foo & Bar'), 1);

the document gets opened and the title is correct in TOC, but not in the document body. Its plain 'Foo & amp; Bar' [added whitespace due to github interpreting the entity] there. This is also discussed in the linked StackOverflow-Thread.
According to the linked post the problem is that the TOC works with entitites while the title in the body requires Unicode escaped characters.

I just tried quick debugging and changed line 100 of PhpWord\Writer\Word2007\Element\TOC to

$xmlWriter->writeRaw(htmlspecialchars($title->getText()));

and it seemed to have worked. Maybe others could test this too...

ghost pushed a commit that referenced this issue Oct 22, 2014
@ghost
Copy link

ghost commented Oct 22, 2014

The behavior is caused by double escaping in the code. That's why you get "& amp;" in title. I fixed this in develop branch, so $section->addTitle(htmlspecialchars('Foo & Bar'), 1); should work fine since now.

Could you please test?

@ghost ghost self-assigned this Oct 22, 2014
@jonnsn
Copy link
Contributor Author

jonnsn commented Nov 3, 2014

sorry for the late reply, I was on vacation.
I can confirm that it works now in the sample If I do something like
$section->addTitle(htmlspecialchars('Foo & Bar'), 1);
I am still having some problems in my real application, but i will check this on my own first, as it does not seem to be related to this issue directly.

Thanks for your help!
Regards

@ghost ghost added this to the 0.12.0 milestone Nov 4, 2014
ghost pushed a commit that referenced this issue Nov 4, 2014
@ghost ghost removed the Documentation label Nov 4, 2014
@ghost
Copy link

ghost commented Nov 4, 2014

OK, closed for now. Thanks for the bug report.

@ghost ghost closed this as completed Nov 4, 2014
@PawelAwmous
Copy link

Hi @RomanSyroeshko
I don't think this works now, I still have some issues with TOC and double escaping titles, I found another place in the code where you use "htmlspecialchars" - I removed it to test and my problems has been solved, please check: https://github.com/PHPOffice/PHPWord/blob/master/src/PhpWord/Writer/Word2007/Element/AbstractElement.php#L126

-- edit
ignore my comment, I didn't see it's fixed in develop branch :) Good job guys!

@SailorMax
Copy link
Contributor

@RomanSyroeshko
Looks like not fully fixed. Same problem we have at Writer/Word2007/Element/Text.php:48
This is strange place. Here we write text in tag <w:t> using $this->getText(), which uses String::controlCharacterPHP2OOXML, which convert special symbols to something like x00010000.

Microsoft made this format for "Encode and Decode XML Element and Attribute Names". https://msdn.microsoft.com/en-us/library/vstudio/35577sxd(v=vs.100).aspx
Did it work with element values? And there is no any work with ampersand.

I think we need replace $this->getText() -> htmlspecialchars() here.

@ghost
Copy link

ghost commented Jan 26, 2015

Hi, @maxibul. Thanks for pointing this out. Does it cause any issues: generated document is broken or whatever?

@SailorMax
Copy link
Contributor

Yes, generated document is broken without this fix. Same result, as in first post.

@ghost ghost modified the milestones: 0.13.0, 0.12.0 Jan 27, 2015
@ghost ghost reopened this Jan 27, 2015
@ghost
Copy link

ghost commented Jan 27, 2015

What if you replace $xmlWriter->writeRaw($this->getText($element->getText())) with $xmlWriter->writeRaw($element->getText()) in PhpOffice\PhpWord\Writer\Word2007\Element\Text and escape string you pass to Text outside of it? Will it help you?

I mean, if you say that controlCharacterPHP2OOXML function call is useless here, lets exclude it and see what happens next. Yet, please, don't forget escape strings you pass to PHPWord methods, because all escaping must be done outside of the library since 0.12.0.

P. S.
Also you may share the detailed testcase which could let us reproduce exactly your problem.

@ghost ghost closed this as completed Jan 27, 2015
@SailorMax
Copy link
Contributor

I little confused with "all escaping must be done outside of the library". PHPWord support export result in different formats (docx, pdf,...). Each format has own rules of character escaping. In result if I need result in 2 different formats, I need learn each required format and create 2 documents with different escaping? Do you think this is normal? Or I miss something? If library has different functions for export in different formats, why it can't prepare data for each format?

@ghost
Copy link

ghost commented Jan 28, 2015

OK, lets align our terminology first. What do you mean under "escaping"? Simple htmlspecialchars method call?

@ghost ghost reopened this Jan 28, 2015
@SailorMax
Copy link
Contributor

under "escaping" I mean:

  • for MS XML text between tags = htmlspecialchars(), for example;
  • for MS XML tag names/attributes = to characters like x00010000 (underline at left and right);
  • for C string = backslash before special character;

...

@gmta
Copy link
Contributor

gmta commented Feb 7, 2015

Hi Roman, commit b2286f8 made PHPWord completely broken for all Word documents that have the ampersand ('&') character anywhere in the body of the document, as opposed to being broken only for that character residing in the title of the document.

You said the following: "PHPWord code is not responsible for text escaping."

I sincerely hope that you don't mean that PHPWord is not responsible for proper output escaping. If I supply a valid UTF-8 string to PHPWord, PHPWord should apply proper escaping based on the output location. So if it ends up as the contents for an XML element, you apply XML character escaping. If it ends up in HTML, you apply HTML character escaping. If you don't do this, you will effectively create a broken library.

Please apply MaxiBul's suggestions for output escaping.

@ghost
Copy link

ghost commented Feb 9, 2015

@gmta, what's the problem? Have you read our release notes? Have you reviewed the changes before applying the new version of PHPWord? Did you try escaping strings BEFORE you pass them to the library?

@gmta
Copy link
Contributor

gmta commented Feb 12, 2015

@RomanSyroeshko Please stop deleting my comments. You know that people get email notifications of these, right? And my activity feed still contains those comments?

I gave you a number of reasons why your current direction on output escaping for PHPWord is just plain wrong. Please supply counter arguments if you think I'm wrong. Again, please read this informative article for more information.

Trying to hide critique is a bit dictator-y, if you ask me.

@ghost
Copy link

ghost commented Feb 13, 2015

Your comment was aggressive, not constructive and you didn't answer my questions. Such comments will be deleted in future too.

@gmta
Copy link
Contributor

gmta commented Feb 13, 2015

I do not think it was aggressive - I answered your questions, I gave reasons why the new course for PHPWord output escaping would be a mistake and even offered you help and referenced you to further reading material.

Could you please respond to this issue though? We still do not know if you plan to stick with this new course or plan to change PHPWord in such a way that it becomes a library that can generate valid Word documents without requiring its users to perform escaping themselves?

@ghost
Copy link

ghost commented Feb 15, 2015

We consider introducing escaping components in the next release. For now escaping must be done by user. If you don't like this, continue using v0.11.1.

@maxibul, thanks for the given rules.

@playmono
Copy link

playmono commented May 7, 2016

Totally agree with @gmta . Programming as spanish developer with this tool is completely crazy. If escaping is a core-must to make it work the library. Why the library does not provide good methods to do it? Or better, why library doesn't do it by itself?

@lrigdon
Copy link

lrigdon commented May 20, 2016

Based on comments in gethub (#514) I changed my process to replace & amp ; in the text that I'm feeding phpword with & amp ; amp ; and it works. That gets me past this issue for now. You'll need to remove the spaces, I added them to get them to display correctly

@judgej
Copy link

judgej commented Aug 11, 2016

This is crazy - the application is responsible for escaping text passed in through methods that accept text based on some internal use of that text which the application does not get involved in and should not care about in the slightest? What the hell? Really? You really need to rethink this decision.

@mjpelmear
Copy link

mjpelmear commented Oct 28, 2016

Just adding another comment to the pile.
Dealing with special characters in this manner is just wrong.
Why is there such resistance to handling it within the library?

edit:
It looks like this was finally given a half-hearted response.
If you ran into problems with characters not being escaped, go look at \PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(true);

@judgej
Copy link

judgej commented Oct 28, 2016

I suspect it comes down to what level of abstraction is being applied to this library. On one hand the author looks at it as a library to help being do stuff with the the XML that makes up a document, so the XML kind of bleeds out of it. On the other hand, since it has multiple output formats, it is seen as a generic document generation tool, some document types of which have no relation to XML whatsoever. So it's all a but confused.

@mjpelmear IMO You are right - the level of abstraction that applications using using the library should be dealing with is not XML - not even in the slightest bit XML. It's all just data: words, structures and formatting instructions. However, those applications should be able to extend the XML layer that converts the generic document objects into an XMLDOC stream, so that additional features can be added at that level and only affect the XML generation and not rtf documents, and of course plug into the various different XML document types separately.

For example, adding tabs to a header involves some horrible hacks in the application - inserting different special characters depending on what the known output format is going to be (e.g. escape sequences for RDF, closing tags and inserting other elements then reopening tags again for XMLDOC, and completely different XML hacks for OpenXML). There has to be an easier way to extend it to support tabs instead,

The problem in solving this is that the library has evolved the way it has because (I suspect) it has been quite a journey of discovery - there is very little information out there that makes real sense of these document formats. However, having done that, starting again with lessons learnt, and a more formal layered approach is needed to lead on to a more maintainable library. That's a lot of work though, so we are stuck here. I don't want to put down the fantastic effort put into this library so far - I really don't. I do think it is standing at a crossroads though. That's just my personal feeling about it.

@phpdave
Copy link

phpdave commented Dec 2, 2016

This feels weird

\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(true);

I feel like it should be

$this->_phpWord = new \PhpOffice\PhpWord\PhpWord();
$this->_phpWord->getSettings()->setOutputEscapingEnabled(true);

But I guess for some reason doing a static class was decided. Its just hard to know what PHPWord can do when things aren't connected.

@nicdnepr nicdnepr mentioned this issue Jan 26, 2018
1 task
@chrisribe
Copy link

Thanks for the comment @mjpelmear, helped solve my issues !

Just adding another comment to the pile.
Dealing with special characters in this manner is just wrong.
Why is there such resistance to handling it within the library?

edit:
It looks like this was finally given a half-hearted response.
If you ran into problems with characters not being escaped, go look at \PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(true);

@chrisribe
Copy link

chrisribe commented Aug 12, 2019

Dang spoke too soon, was still causing issues.

  1. The escape flag is only used on write, was having issues since I need it to occur before write.

  2. The zend escape function did not help and issues are still present.
    lrigdon
    Finaly used @lrigdon solution of replacing all "&" symbols with:

    & amp ; amp ;
    

Needs review / work as @judgej says.

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

10 participants