How Read Doc Or DocX #2106

Open
lucaswhob opened this issue Jun 24, 2021 · 10 comments

Comments

@lucaswhob

Hello,
Thank you for this excellent word-processing library.
I need to read a DOCX file and extract: 1. the text, 2. all images, 3. all links with their titles.
Please guide me on how to read a DOCX file.
I have read the documentation and all of your examples, but I could not find an example that reads elements and sections.
Thanks

@gisostallenberg

@lucaswhob something like this?

$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
$phpWord = $objReader->load('my/file.docx'); // instance of \PhpOffice\PhpWord\PhpWord
$text = '';
foreach ($phpWord->getSections() as $section) {
    foreach ($section->getElements() as $element) {
        if ($element instanceof \PhpOffice\PhpWord\Element\Text) {
            $text .= $element->getText();
        }
        // and so on for other element types (see src/PhpWord/Element)
    }
}

@nikunjbhatt

@gisostallenberg
I got no output from echo $text; and no error either.

The DOCX reader sample at https://github.com/PHPOffice/PHPWord/blob/develop/samples/Sample_11_ReadWord2007.php has no useful information about how to actually read a Word 2007 file.

@gisostallenberg

@nikunjbhatt

This was just a simple example. Sections also seem to contain TextRuns (these are containers), which in turn contain sub-elements.
Something like this should work:

<?php

use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Element\Text;
use PhpOffice\PhpWord\IOFactory as WordIOFactory;

require_once __DIR__.'/vendor/autoload.php';

$objReader = WordIOFactory::createReader('Word2007');
$phpWord = $objReader->load('file.docx'); // instance of \PhpOffice\PhpWord\PhpWord
$text = '';

function getWordText($element) {
    $result = '';
    if ($element instanceof AbstractContainer) {
        foreach ($element->getElements() as $element) {
            $result .= getWordText($element);
        }
    } elseif ($element instanceof Text) {
        $result .= $element->getText();
    }
    // and so on for other element types (see src/PhpWord/Element)

    return $result;
}

foreach ($phpWord->getSections() as $section) {
    foreach ($section->getElements() as $element) {
        $text .= getWordText($element);
    }
}

echo $text;

@peter-at-bpt

Might I suggest a small improvement to the recursive method? As written, it can miss text from several element types:

    // I would assume this is being run in the context of a Class
    
    public function getDocumentText(string $filepath): string
    {
        $document = IOFactory::createReader('Word2007')
            ->load($filepath);
        $documentText = '';

        foreach ($document->getSections() as $section) {
            foreach ($section->getElements() as $element) {
                $text = $this->getElementText($element);
                
                if (strlen($text)) {
                    // Append a newline so that text from one element doesn't stickRightToTheNextElementLikeThis
                    $documentText .= $text . "\n";
                }
            }
        }

        return $documentText;
    }
    
    protected function getElementText($element): string
    {
        $result = '';

        if ($element instanceof AbstractContainer) {
            foreach ($element->getElements() as $subElement) {
                $result .= $this->getElementText($subElement);
            }
        }

        if (method_exists($element, 'getText')) {
            $result .= $element->getText();
        }

        return $result;
    }
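
The original question also asked for links and images, so here is a minimal sketch of the same recursive walk that collects those instead of text. The function name collectLinksAndImages() and the shape of the $found array are my own assumptions, not part of PHPWord. It relies on Link::getSource(), Link::getText(), Image::getSource(), Table::getRows() and Row::getCells(); I have not tested it against every document, and which elements the reader actually produces may vary by PHPWord version.

```php
<?php

use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Element\Image;
use PhpOffice\PhpWord\Element\Link;
use PhpOffice\PhpWord\Element\Table;

/**
 * Recursively collect links and images found at or below $element.
 * Hypothetical helper: $found must be ['links' => [], 'images' => []].
 */
function collectLinksAndImages($element, array &$found): void
{
    if ($element instanceof Link) {
        // getSource() is the link target, getText() the visible link text
        $found['links'][] = [
            'url'   => $element->getSource(),
            'title' => $element->getText(),
        ];
    } elseif ($element instanceof Image) {
        // Store the image source string exactly as PHPWord reports it
        $found['images'][] = $element->getSource();
    }

    if ($element instanceof Table) {
        // A Table is not an AbstractContainer, so walk its rows and cells by hand
        foreach ($element->getRows() as $row) {
            foreach ($row->getCells() as $cell) {
                collectLinksAndImages($cell, $found);
            }
        }
    } elseif ($element instanceof AbstractContainer) {
        foreach ($element->getElements() as $child) {
            collectLinksAndImages($child, $found);
        }
    }
}
```

To use it, load the document as in the examples above, set $found = ['links' => [], 'images' => []]; and call collectLinksAndImages($section, $found) for each section returned by getSections().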

@osnard

osnard commented Nov 16, 2021

Sorry for hijacking the topic, but I have a related question. I am also walking the document object tree with a recursive implementation. I am trying to extract a "table of contents", so I am looking for PhpOffice\PhpWord\Element\Title objects. Unfortunately, even though the document seems to be formatted properly, the object model does not give me any such objects. I only see PhpOffice\PhpWord\Element\TextRun|Text|Break|Image|...

The XML looks like this:

<w:p xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordml" w:rsidP="02051CF4" w14:paraId="4E47C1E7" wp14:textId="5ECEFD8F">
  <w:pPr>
    <w:pStyle w:val="Title"/>
    <w:rPr>
      <w:rFonts w:ascii="Calibri Light" w:hAnsi="Calibri Light" w:eastAsia="" w:cs=""/>
      <w:sz w:val="56"/>
      <w:szCs w:val="56"/>
    </w:rPr>
  </w:pPr>
  <w:bookmarkStart w:name="_GoBack" w:id="0"/>
  <w:bookmarkEnd w:id="0"/>
  <w:r w:rsidR="7A933B85">
    <w:rPr/>
    <w:t xml:space="preserve">The </w:t>
  </w:r>
  <w:proofErr w:type="spellStart"/>
  <w:r w:rsidR="7A933B85">
    <w:rPr/>
    <w:t>document</w:t>
  </w:r>
  <w:proofErr w:type="spellEnd"/>
  <w:r w:rsidR="7A933B85">
    <w:rPr/>
    <w:t xml:space="preserve"> title</w:t>
  </w:r>
</w:p>

Do you have any suggestions?

@richardsonoge

@nikunjbhatt

This was just a simple example. Sections also seem to contain TextRuns (these are containers), which in turn contain sub-elements. Something like this should work:

<?php

use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Element\Text;
use PhpOffice\PhpWord\IOFactory as WordIOFactory;

require_once __DIR__.'/vendor/autoload.php';

$objReader = WordIOFactory::createReader('Word2007');
$phpWord = $objReader->load('file.docx'); // instance of \PhpOffice\PhpWord\PhpWord
$text = '';

function getWordText($element) {
    $result = '';
    if ($element instanceof AbstractContainer) {
        foreach ($element->getElements() as $element) {
            $result .= getWordText($element);
        }
    } elseif ($element instanceof Text) {
        $result .= $element->getText();
    }
    // and so on for other element types (see src/PhpWord/Element)

    return $result;
}

foreach ($phpWord->getSections() as $section) {
    foreach ($section->getElements() as $element) {
        $text .= getWordText($element);
    }
}

echo $text;

Thank you for showing how to extract the content of a docx file. Could you also show me how to extract the content of a doc file, please?

@mrtsglk

mrtsglk commented Jan 24, 2024

Yes, I am trying to convert the doc extension file to text. In the examples given, we can convert the docx file to text. How can we convert a doc extension file to text?

@gravitiq-cm

A method like Reader::getContentAsPlainText() would be very useful!

@richardsonoge

A method like Reader::getContentAsPlainText() would be very useful!
Can you show me how to use a method like this with my code?

@richardsonoge

Yes, I am trying to convert the doc extension file to text. In the examples given, we can convert the docx file to text. How can we convert a doc extension file to text?

But how can I do it with my code? Or can you give me some code that does it?
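
For the .doc question, one possible approach is to pick the reader by file extension, since PHPWord also ships an MsDoc reader next to Word2007. The sketch below is untested against real files, and support for the legacy .doc format is limited, so some documents may not load or may lose content. The function names readerNameFor() and loadWordFile() are hypothetical helpers, not part of the library. IOFactory::load() and the reader names in the map are the library's own.

```php
<?php

use PhpOffice\PhpWord\IOFactory;

/**
 * Map a file extension to a PHPWord reader name.
 * Hypothetical helper; the reader names are the ones IOFactory accepts.
 */
function readerNameFor(string $path): string
{
    $readers = [
        'docx' => 'Word2007',
        'doc'  => 'MsDoc',
        'odt'  => 'ODText',
        'rtf'  => 'RTF',
    ];

    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (!isset($readers[$extension])) {
        throw new InvalidArgumentException("No reader known for .$extension files");
    }

    return $readers[$extension];
}

/**
 * Load a document with the reader that matches its extension.
 * Returns a \PhpOffice\PhpWord\PhpWord instance, as in the examples above.
 */
function loadWordFile(string $path)
{
    return IOFactory::load($path, readerNameFor($path));
}
```

With vendor/autoload.php required as in the earlier examples, $phpWord = loadWordFile('file.doc'); should give the same kind of object as the docx examples, so the recursive text extraction can then be run on it unchanged.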
