Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CloneBlock, replaceBlock, and deleteBlock not working in template docx #2410

Open
MuhammadSyahmi98 opened this issue Jun 9, 2023 · 5 comments

Comments

@MuhammadSyahmi98
Copy link

Describe the Bug

I am experiencing issues with the cloneBlock, replaceBlock, and deleteBlock methods in the PHPWord library. These methods are used to manipulate template DOCX files by cloning, replacing, or deleting specific blocks of content. However, they are not functioning as expected.

Steps to Reproduce

Prepare a template DOCX file with the following content:

Customer: ${customer_name}
Address: ${customer_address}
${/block_name}

In the controller or PHP script, use the following code:

$templateProcessor->cloneBlock('block_name', 3, true, true);

Expected Behavior

The cloneBlock method should clone the block_name block three times in the template DOCX file, resulting in multiple repetitions of the block.

Current Behavior

The cloneBlock method does not clone the block as expected. No changes are made to the template DOCX file. Still same like original template

Additional Information

  • The issue persists even when using the simple code provided in the PHPWord documentation.
  • The setValue and setValues methods work correctly, allowing the replacement of individual placeholders in the template.
  • This problem occurs specifically with the cloneBlock, replaceBlock, and deleteBlock methods.
  • The issue is present when using PHP 8.1.
  • I have verified that I am using the latest version of the PHPWord library.
  • No errors or exceptions are thrown during the execution of the code.

Context

Please fill in your environment information:

  • PHP Version: 8.1
  • PHPWord Version: 1.1.0
  • Laravel Version: 10
@MuhammadSyahmi98
Copy link
Author

#341 solution from @liborm85 is working. For my case need to change the preg_match code. If have problem will update soon. Hope someone can fix this issue without need to override existing code

Create new controller as TemplateProcessorMod extends TemplateProcessor

public function cloneBlock($blockname, $clones = 1, $replace = true, $indexVariables = false, $variableReplacements = null)
    {
        $xmlBlock = null;
        $matches = array();
        preg_match(
            '/(\${' . $blockname . '})(.*?)(\${\/' . $blockname . '})/is',
            $this->tempDocumentMainPart,
            $matches
        );

        if (isset($matches[3])) {
            $xmlBlock = $matches[2];
            if ($indexVariables) {
                $cloned = $this->indexClonedVariables($clones, $xmlBlock);
            } elseif ($variableReplacements !== null && is_array($variableReplacements)) {
                $cloned = $this->replaceClonedVariables($variableReplacements, $xmlBlock);
            } else {
                $cloned = array();
                for ($i = 1; $i <= $clones; $i++) {
                    $cloned[] = $xmlBlock;
                }
            }

            if ($replace) {
                $this->tempDocumentMainPart = str_replace(
                    $matches[1] . $matches[2] . $matches[3],
                    implode('', $cloned),
                    $this->tempDocumentMainPart
                );
            }
        }

        return $xmlBlock;
    }

In controller change from

 $templateProcessor = new TemplateProcessor($url_file);

to

 $templateProcessor = new TemplateProcessorMod($url_file);

@jlcfly
Copy link

jlcfly commented Jul 6, 2023

I encountered this back in 0.16.0 and am just today updating due to migration to PHP 8.2. Apparently, I fixed this in my own code. The issue in in the preg_match and it's super simple to fix in TemplateProcessor.php. I'm not a contributor, but will tell you where the issue is.

Instead of "<w:p.*" it should be changed to "<w:p\b.*". There's two places on this line where it occurs, so both need to be replaced, to make it look like this:

'/(<\?xml.*)(<w:p\b.*>' . $escapedMacroOpeningChars . $blockname . $escapedMacroClosingChars . '<\/w:.*?p>)(.*)(<w:p\b.*' . $escapedMacroOpeningChars . '\/' . $blockname . $escapedMacroClosingChars . '<\/w:.*?p>)/is',

The reason this is an issue is because it will also capture <w:pPr> nodes if the \b isn't present, which ends up orphaning any <w:p> nodes with <w:pPr> nodes nested inside, so something like: "<w:p><w:pPr>...</w:pPr></w:p>" ends up with just "<w:p>" with no closing "</w:p>" tag because it replaces the "<w:pPr>" through "</w:p>", leaving only "<w:p>". The "\b" addition will cause it to account for the word boundary, so will only include "<w:p>" nodes.

@thomasb88
Copy link

Customer: ${customer_name}
Address: ${customer_address}
${/block_name}

?

Do You mean

${block_name}
Customer: ${customer_name}
Address: ${customer_address}
${/block_name}

@MuhammadSyahmi98
Copy link
Author

@thomasb88 Yes like you mention

${block_name}
Customer: ${customer_name}
Address: ${customer_address}
${/block_name}

@thomasb88
Copy link

Actual version of PHPWord CloneBlock function

"public function cloneBlock($blockname, $clones = 1, $replace = true, $indexVariables = false, $variableReplacements = null)
{
$xmlBlock = null;
$matches = [];
$escapedMacroOpeningChars = self::$macroOpeningChars;
$escapedMacroClosingChars = self::$macroClosingChars;
preg_match(
//'/(.((?s)<w:p\b(?:(?!<w:p\b).)?{{' . $blockname . '}</w:.?p>))(.)((?s)<w:p\b(?:(?!<w:p\b).)[^$]?{{/' . $blockname . '}</w:.?p>)/is',
'/(.((?s)<w:p\b(?:(?!<w:p\b).)?\' . $escapedMacroOpeningChars . $blockname . $escapedMacroClosingChars . '</w:.?p>))(.)((?s)<w:p\b(?:(?!<w:p\b).)[^$]?\' . $escapedMacroOpeningChars . '/' . $blockname . $escapedMacroClosingChars . '</w:.?p>)/is',
//'/(.((?s)<w:p\b(?:(?!<w:p\b).)?\'. $escapedMacroOpeningChars . $blockname . '}</w:.?p>))(.)((?s)<w:p\b(?:(?!<w:p\b).)[^$]?\'.$escapedMacroOpeningChars.'/' . $blockname . '}</w:.?p>)/is',
$this->tempDocumentMainPart,
$matches
);

    if (isset($matches[3])) {
        $xmlBlock = $matches[3];
        if ($indexVariables) {
            $cloned = $this->indexClonedVariables($clones, $xmlBlock);
        } elseif ($variableReplacements !== null && is_array($variableReplacements)) {
            $cloned = $this->replaceClonedVariables($variableReplacements, $xmlBlock);
        } else {
            $cloned = [];
            for ($i = 1; $i <= $clones; ++$i) {
                $cloned[] = $xmlBlock;
            }
        }

        if ($replace) {
            $this->tempDocumentMainPart = str_replace(
                $matches[2] . $matches[3] . $matches[4],
                implode('', $cloned),
                $this->tempDocumentMainPart
            );
        }
    }

    return $xmlBlock;
}"

Which mean the regexp is

'/(.((?s)<w:p\b(?:(?!<w:p\b).)?\' . $escapedMacroOpeningChars . $blockname . $escapedMacroClosingChars . '</w:.?p>))(.)((?s)<w:p\b(?:(?!<w:p\b).)[^$]?\' . $escapedMacroOpeningChars . '/' . $blockname . $escapedMacroClosingChars . '</w:.?p>)/is',

This is the hard part, because how the regexp is written seems to create assumptions on the way you create your template file.

Let's first replace PHP variables, that are there for generic purpose:

/(.((?s)<w:p\b(?:(?!<w:p\b).)?\${block_name}</w:.?p>))(.)((?s)<w:p\b(?:(?!<w:p\b).)[^$]?\${/block_name}</w:.?p>)/is

Using https://regex101.com/, one can have kind of explanation

/
(.((?s)<w:p\b(?:(?!<w:p\b).)?\${block_name}</w:.?p>))(.)((?s)<w:p\b(?:(?!<w:p\b).)[^$]?\${/block_name}</w:.?p>)
/
is
1st Capturing Group (.((?s)<w:p\b(?:(?!<w:p\b).)?\${block_name}</w:.*?p>))
. matches any character

  • matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    2nd Capturing Group ((?s)<w:p\b(?:(?!<w:p\b).)?\${block_name}</w:.?p>)
    (?s) match the remainder of the pattern with the following effective flags: is
    s modifier: single line. Dot matches newline characters
    <w:p matches the characters <w:p literally (case insensitive)
    \b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
    Non-capturing group (?:(?!<w:p\b).)*?
    *? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
    Negative Lookahead (?!<w:p\b)
    Assert that the Regex below does not match
    <w:p matches the characters <w:p literally (case insensitive)
    \b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
    . matches any character
    \ matches the character \ with index 9210 (5C16 or 1348) literally (case insensitive)
    $ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
    {block_name}< matches the characters {block_name}< literally (case insensitive)
    / matches the character / with index 4710 (2F16 or 578) literally (case insensitive)
    w: matches the characters w: literally (case insensitive)
    . matches any character
    ? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
    p> matches the characters p> literally (case insensitive)
    3rd Capturing Group (.
    )
    . matches any character
  • matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    4th Capturing Group ((?s)<w:p\b(?:(?!<w:p\b).)[^$]?\${/block_name}</w:.?p>)
    (?s) match the remainder of the pattern with the following effective flags: is
    s modifier: single line. Dot matches newline characters
    <w:p matches the characters <w:p literally (case insensitive)
    \b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
    Non-capturing group (?:(?!<w:p\b).)
    Negative Lookahead (?!<w:p\b)
    Assert that the Regex below does not match
    <w:p matches the characters <w:p literally (case insensitive)
    \b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
    . matches any character
    Match a single character not present in the list below [^$]
    *? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
    $ matches the character $ with index 3610 (2416 or 448) literally (case insensitive)
    \ matches the character \ with index 9210 (5C16 or 1348) literally (case insensitive)
    $ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
    { matches the character { with index 12310 (7B16 or 1738) literally (case insensitive)
    / matches the character / with index 4710 (2F16 or 578) literally (case insensitive)
    block_name}< matches the characters block_name}< literally (case insensitive)
    / matches the character / with index 4710 (2F16 or 578) literally (case insensitive)
    w: matches the characters w: literally (case insensitive)
    . matches any character
    *? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
    p> matches the characters p> literally (case insensitive)
    Global pattern flags
    i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
    s modifier: single line. Dot matches newline characters

Not so clear ? -)

Well, some synthetic points about this:
1 - The regexp is made of 3 parts
image
-> First part, try to find the opening block pattern
-> Last Part, try to find the closing block pattern
-> In the middle, take all -)

2 - What one can see looking at the 1st or the last part, is that added to the block_name pattern, there is a lot of stuff with <w:p> and </w:p>, that is to say about how paragraph interact with the block pattern (see http:https://officeopenxml.com/WPparagraph.php)

3 - First try to use cloneBlock was to write directly the block patent as text on word, and expect to be able to read it directly in the corresponding xml file (it would all have same style as it was the case from the user point of view). It fails.

4 - It seems many things can make it fail. For example Track Changes (that introduce w:rsidR tags), Doing enter or Ctrl + Enter at the end of the block patent (Enter should produce a new paragrah, whereas Ctrl+Enter should not - but got examples when i used only Enter and there is only 1 paragraph, but multiple run), adding custom Tabs on the rule (w:tab),...

5 - Word split the document in paragraph, then paragraph style, then run(s), then run(s) style(s)... To see all the tags combination that are allowed, or even used, by your OOXML generator (MS Word, Open Office...), the whole ECMA-376 should be parsed, which is a really long work (https://www.ecma-international.org/publications-and-standards/standards/ecma-376/)

6 - On the last version of PHPWord, the choice have been made to remove lot of tags around block pattern:

"/**
* Finds parts of broken macros and sticks them together.
* Macros, while being edited, could be implicitly broken by some of the word processors.
*
* @param string $documentPart The document part in XML representation
*
* @return string
*/
protected function fixBrokenMacros($documentPart)
{
$brokenMacroOpeningChars = substr(self::$macroOpeningChars, 0, 1);
$endMacroOpeningChars = substr(self::$macroOpeningChars, 1);
$macroClosingChars = self::$macroClosingChars;

    return preg_replace_callback(
        '/\\' . $brokenMacroOpeningChars . '(?:\\' . $endMacroOpeningChars . '|[^{$]*\>\{)[^' . $macroClosingChars . '$]*\}/U',
        function ($match) {
            return strip_tags($match[0]);
        },
        $documentPart
    );
}" 

7 - It seems the way the cloneBlock function is implemented put the complexity on the preg_match part (as one can see the replacement after that is easy). It also means when the document begin to be bigger, then it happens than it reach the regexp limit more easily.

8 - So instead of try to "check" the OOXML consistency, a solution can be to find only the block pattern (opening and closing). It seems that is what you have done on the new controller TemplateProcessorMod

9 - But then, the "block" to replace might not be "inside" the opening and the closing pattern, due to other tags that can surround the block patterns

And so there are still assumptions that are to be made to make things work:

  • Does the block pattern should be in a separate paragraph (on MS Word it mean you press Enter before entering the data) ?
  • Can a same block pattern be repetead more than once in a document ?
  • ...

I see a lot of tickets on those functions, with regexp modification to fit specific docx files. But i didn't find the corresponding specification. If i miss it, sorry, my bad (but i would be interested to see it).

Else, i would:

  • For performance purpose, use a simpler regexp, that don't capture anything specific outside the opening and closing block patterns,
  • Then, when found, retrieve the paragraph start from the block pattern position, that is to say assume a block is a bunch of paragraphs (first one before opening block pattern to first one after closing block pattern).

Note in the actual code, i think one of the problem is that on findContainingXmlBlockForMacro, when searching for the closing block pattern, it start from the opening block pattern (and so it assume that all is included inside 1 paragraph, which is not always true)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants