"MsDoc" reader fails to open and/or correctly process MS Word 97-2003 (*.doc) files #1318

Open
1 task done
voltel opened this issue Mar 22, 2018 · 3 comments

voltel commented Mar 22, 2018

This is:

  • a bug report

Expected Behavior

  1. The MS Word 97-2003 document (*.doc) is opened and processed correctly by
    $phpWord = IOFactory::load($c_file_name, 'MsDoc'); // this line causes error

  2. Styles are set internally by the generatePhpWord() method in MsDoc.php:

        foreach ($this->arraySections as $itmSection) {
            $oSection = $this->phpWord->addSection();
            $oSection->setStyle($itmSection->styleSection); // this line causes error
        ...

Current Behavior

The load fails with one of several errors, and which one appears is inconsistent:

Notice: Uninitialized string offset: 327680 (or some other wildly large number)
Error traced in getInt2d() and/or getInt1d() of vendor\phpoffice\phpword\src\PhpWord\Reader\MsDoc.php (line 2317)

or

Fatal error: Uncaught PhpOffice\PhpWord\Exception\Exception: Could not open resources/resources/n_466.doc for reading! File does not exist, or it is not readable. in D:\xxx\xxx\vendor\phpoffice\phpword\src\PhpWord\Shared\OLERead.php:78

or

Notice: Undefined property: stdClass::$styleSection
traced to vendor\phpoffice\phpword\src\PhpWord\Reader\MsDoc.php generatePhpWord()

or, when it does manage to convert a test file, the layout is completely wrong:
no styles, line breaks in the wrong places, parts of words missing, and the table not reproduced.

The elements returned to the snippet below are all of type Text; paragraphs are not recognized, and a simple table is not recognized at all.
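
The "Uninitialized string offset" notice is what PHP raises when a string is indexed past its end, which suggests the reader is following an offset that lies outside the data it loaded. Below is a minimal sketch of a bounds-checked little-endian read. `readUInt16LE` is a hypothetical helper, not part of PHPWord, and it only illustrates how such a read could fail loudly instead of silently producing a wrong value.

```php
<?php
/**
 * Hypothetical helper (not PHPWord API): read an unsigned 16-bit
 * little-endian integer from $data at byte offset $pos.
 */
function readUInt16LE(string $data, int $pos): int
{
    // Refuse offsets that would index past the end of the buffer
    if ($pos < 0 || $pos + 2 > strlen($data)) {
        throw new OutOfBoundsException(
            sprintf('Offset %d is outside a buffer of %d bytes', $pos, strlen($data))
        );
    }

    // Low byte first, then the high byte shifted into place
    return ord($data[$pos]) | (ord($data[$pos + 1]) << 8);
}
```

With this shape, an offset such as 327680 on a short buffer would raise an exception that names the bad offset, which is easier to trace than a notice.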

Failure Information

I tried MS Word 97-2003 documents from every source available to me, created in MS Word 2007 and in MS Word 365. I tried downloaded files (e.g. n_466.doc or d466.doc) and new files that I created manually in both versions of MS Word (2007 and 365) and saved as *.doc.
The set-up provided below works fine with the same documents saved as .docx files, which go through a different reader class.
test_documents.zip

Version, copied from the composer.json:
"phpoffice/phpword": "^0.14.0",

or from composer.lock:
"name": "phpoffice/phpword",
"version": "v0.14.0",
"source": {
"type": "git",
"url": "https://github.com/PHPOffice/PHPWord.git",
"reference": "b614497ae6dd44280be1c2dda56772198bcd25ae"
},

How to Reproduce

This is a part of Symfony 4 project.

Service class:

<?php
namespace App\Service\Parser;

use PhpOffice\PhpWord\Element\{
    Line,
    Section,
    Table,
    Text,
    TextBreak,
    TextRun
};


use PhpOffice\PhpWord\IOFactory;

class DecParser
{
    /**
     * @param string $c_file_name
     * @return array
     * @throws \Exception
     */
    public function get_doc_tables_array(string $c_file_name) : array
    {
        $a_tables = [];

        $readerName = null;
        if (preg_match('/\.(\w*)$/', $c_file_name, $a_matches)) {
            if ($a_matches[1] == 'docx') $readerName = 'Word2007';
            else if ($a_matches[1] == 'doc') $readerName = 'MsDoc';
        }//

        //dump('Reader name: ' . $readerName);
        $phpWord = IOFactory::load($c_file_name, $readerName);
        $a_sections = $phpWord->getSections();

        $table_index = 0;
        foreach ($a_sections as $this_section) {
            foreach ($this_section->getElements() as $el) {

                if ($el instanceof Table) {
                    foreach ($el->getRows() as $row_index => $row) {
                        $a_tables[$table_index][$row_index] = [];
                        foreach ($row->getCells() as $col_index => $cell) {
                            $a_tables[$table_index][$row_index][$col_index] = '';

                            foreach ($cell->getElements() as $cell_el) {
                                $a_tables[$table_index][$row_index][$col_index] .= self::extract_text_from_element($cell_el);
                            }//endforeach

                        }//endforeach
                    }//endforeach

                    $table_index++;
                }//endif
            }//endforeach

        }//endforeach
        return $a_tables;
    }//end of function

    /**
     * @param $el
     * @param int $depth
     * @return null|string
     * @throws \Exception
     */
    private static function extract_text_from_element($el, $depth = 0) : ?string
    {
        $c_text = null;

        if ($depth > 100) throw new \Exception("Depth of recursions is over the limit of 100 in " . __METHOD__);

        if ($el instanceof Line) {
            $c_text = "\n\n";

        } else if ($el instanceof TextBreak) {
            $c_text = "\n";

        } else if ($el instanceof Text) {
            $c_text = $el->getText();

        } else if ($el instanceof TextRun) {
            $depth++;
            $a_elements = $el->getElements();

            $c_text = '';
            foreach($a_elements as $this_el) {
                $c_text .= self::extract_text_from_element($this_el, $depth);
            }//endforeach

            if (count($a_elements) > 0 ) {
                $c_text .= "\n";
            }//endif
        }//endif

        return $c_text;
    }//end of function

}//end of class
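
The service above picks the reader from the file extension alone. As a rough sketch, the reader could instead be guessed from the first bytes of the file, which would catch a file that carries a .doc extension but is not a binary Word document. `guessReaderName` is a hypothetical helper, not part of PHPWord; the reader names 'MsDoc' and 'Word2007' are the ones used above, and 'RTF' is assumed to be available in the installed PHPWord version.

```php
<?php
/**
 * Hypothetical helper: guess a PHPWord reader name from the file
 * signature rather than the extension. Returns null when unknown.
 */
function guessReaderName(string $path): ?string
{
    $fh = @fopen($path, 'rb');
    if ($fh === false) {
        return null;
    }
    $magic = fread($fh, 8);
    fclose($fh);
    if ($magic === false) {
        return null;
    }

    // OLE2 compound file signature, used by binary Word 97-2003 files
    if ($magic === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'MsDoc';
    }
    // ZIP local file header: .docx is a ZIP container (so are other formats)
    if (strncmp($magic, "PK\x03\x04", 4) === 0) {
        return 'Word2007';
    }
    // RTF text is sometimes saved with a .doc extension
    if (strncmp($magic, '{\rtf', 5) === 0) {
        return 'RTF';
    }

    return null;
}
```

In `get_doc_tables_array()` the result could stand in for the `preg_match` on the extension, with a `null` return treated as an unsupported file.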

Controller class:

<?php
namespace App\Controller;

use App\Service\Parser\DecParser;

use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;
use Symfony\Component\Routing\Annotation\Route;

/**
 * @Route("/parse")
 */
class ParserController extends Controller
{
    /**
     * @Route("/dec")
     * The single argument is injected as a dependency when the controller runs;
     * alternatively, create a new instance of the DecParser service above.
     */
    public function show_parsed_doc(DecParser $parser) : Response
    {
        $doc_name = '../docs/temp/d466.docx'; // change this to real file location

        $a_tables = $parser->get_doc_tables_array($doc_name);
        $a_template_data = [
            'tables' => $a_tables
        ];
        
        // edit twig template to visualize the table data - the sample is provided below
        return $this->render('dec/dec_orders.html.twig', $a_template_data);
    }//end of function

}//end of class

Sample implementation of twig template

{% extends "base.html.twig" %}

{% block title %}Parsed tables{% endblock %}

{% block main %}
    {% if tables is defined %}
        {% for this_table in tables %}
            <h2>Table {{ loop.index }}</h2>
            <table class="table table-bordered table-light">
                <tbody>
                {% for row in this_table %}
                    <tr>
                        {% for cell in row %}
                            <td>{{ cell }} </td>
                        {% endfor %}
                    </tr>
                {% endfor %}
                </tbody>
            </table>
        {% endfor %}
    {% endif %}

{% endblock %}

Context

  • PHP version: 7.1.6
  • PHP Framework: Symfony 4
  • PHPWord version: 0.14
@Progi1984 Progi1984 self-assigned this Oct 4, 2018

nickpoulos commented Jul 19, 2019

Hey @Progi1984, any luck with this? I am seeing similar behavior. The reader seems unable to handle a pretty standard Word 97 doc with no special formatting. Instead I get broken, fragmented text, and some sections are missing entirely.

I was able to get much better results from a simple fread-style function. Unfortunately, it is only useful for plaintext extraction: it returns no style or formatting data.

function readWord($filename) {
    if (!file_exists($filename)) {
        return false;
    }

    // Open in binary mode so bytes are not translated on Windows
    if (($fh = fopen($filename, 'rb')) === false) {
        return false;
    }

    $headers = fread($fh, 0xA00);

    // The length bytes read below sit at 0x21C-0x21F; bail out on shorter files
    if ($headers === false || strlen($headers) < 0x220) {
        fclose($fh);
        return false;
    }

    // Byte 0: document has from 0 to 255 characters
    $n1 = ord($headers[0x21C]) - 1;

    // Byte 1: document has from 256 to 63743 characters
    $n2 = (ord($headers[0x21D]) - 8) * 256;

    // Byte 2: document has from 63744 to 16775423 characters
    $n3 = ord($headers[0x21E]) * 256 * 256;

    // Byte 3: document has from 16775424 to 4294965504 characters
    $n4 = ord($headers[0x21F]) * 256 * 256 * 256;

    // Total length of text in the document
    $textLength = $n1 + $n2 + $n3 + $n4;

    // fread() needs a positive length, so return an empty string otherwise
    $extracted_plaintext = $textLength > 0 ? fread($fh, $textLength) : '';
    fclose($fh);

    return $extracted_plaintext;
}
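
The four terms above add up to the 32-bit little-endian integer stored at offset 0x21C, minus a constant: the "- 1" and "(... - 8) * 256" parts together subtract 1 + 2048 = 0x801. A minimal sketch of the same arithmetic with `unpack` follows. `textLengthFromHeader` is a hypothetical name, and what the field at 0x21C means inside the file format is not asserted here, only that the result matches the sum.

```php
<?php
/**
 * Hypothetical helper: same value as $n1 + $n2 + $n3 + $n4 above,
 * computed from the 4 bytes at offset 0x21C of the header buffer.
 */
function textLengthFromHeader(string $headers): int
{
    if (strlen($headers) < 0x220) {
        throw new LengthException('Header buffer is shorter than 0x220 bytes');
    }

    // 'V' decodes an unsigned 32-bit little-endian integer
    $value = unpack('V', substr($headers, 0x21C, 4))[1];

    // Subtract the combined constant from the per-byte adjustments
    return $value - 0x801;
}
```

For example, bytes 05 0A 01 00 at 0x21C give 4 + 512 + 65536 + 0 = 66052 with the four-term sum, and 0x00010A05 - 0x801 = 66052 here.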

@woaijiangjing

bad


ijohnson-TCR commented Jan 20, 2023

Any updates on this? I am trying to convert a .doc file to PDF. The conversion runs, but part of the text is cut off in the PDF and the italics are gone.

require 'vendor/autoload.php';

use PhpOffice\PhpWord\IOFactory;
use PhpOffice\PhpWord\Settings;

Settings::setPdfRendererName(Settings::PDF_RENDERER_DOMPDF);
Settings::setPdfRendererPath('.');

$phpWord = IOFactory::load('TEST2.doc', 'MsDoc');
$phpWord->save('word_doc.pdf', 'PDF');
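
One way to tell whether text is lost by the reader or by the PDF renderer is to inspect what the reader produced before saving. The sketch below counts element classes per document. `countElementClasses` is a hypothetical diagnostic, not part of PHPWord, and it relies only on `getSections()` and `getElements()`, which the service class earlier in this thread already uses.

```php
<?php
/**
 * Hypothetical diagnostic: count how many elements of each class the
 * reader produced. $phpWord must expose getSections(), and each section
 * getElements().
 */
function countElementClasses($phpWord): array
{
    $counts = [];
    foreach ($phpWord->getSections() as $section) {
        foreach ($section->getElements() as $element) {
            $class = get_class($element);
            $counts[$class] = ($counts[$class] ?? 0) + 1;
        }
    }

    // Most frequent classes first
    arsort($counts);

    return $counts;
}
```

Calling `print_r(countElementClasses($phpWord));` right after `IOFactory::load()` would show, for instance, whether any Table or TextRun elements were recovered at all. If they are already missing at this stage, the problem likely sits in the reader rather than in the PDF renderer.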
