How Read Doc Or DocX #2106

lucaswhob · 2021-06-24T07:42:13Z

Hello,
Thank you for this great word-processing library.
I need to read a DOCX file and extract: 1. the text, 2. all images, 3. all links with their titles.
Please guide me on how to read a DOCX file.
I have read the documentation and all of your examples, but I could not find an example of reading elements and sections.
Please help me.
Thanks

gisostallenberg · 2021-06-29T09:03:13Z

@lucaswhob something like this?

$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
$phpWord = $objReader->load('my/file.docx'); // instance of \PhpOffice\PhpWord\PhpWord
$text = '';
foreach ($phpWord->getSections() as $section) {
    foreach ($section->getElements() as $element) {
        if ($element instanceof \PhpOffice\PhpWord\Element\Text) {
            $text .= $element->getText();
        }
        // and so on for other element types (see src/PhpWord/Element)
    }
}

nikunjbhatt · 2021-08-15T15:10:08Z

@gisostallenberg
I got no output from echo $text; and no error either.

The DOCX reader sample at https://github.com/PHPOffice/PHPWord/blob/develop/samples/Sample_11_ReadWord2007.php has no useful information about how to actually read the content of a Word 2007 file.

gisostallenberg · 2021-08-18T09:36:32Z

@nikunjbhatt

This was just a simple example. Sections also seem to contain TextRuns (these are containers), which in turn contain sub-elements.
Something like this should work:

<?php

use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Element\Text;
use PhpOffice\PhpWord\IOFactory as WordIOFactory;

require_once __DIR__.'/vendor/autoload.php';

$objReader = WordIOFactory::createReader('Word2007');
$phpWord = $objReader->load('file.docx'); // instance of \PhpOffice\PhpWord\PhpWord
$text = '';

function getWordText($element) {
    $result = '';
    if ($element instanceof AbstractContainer) {
        // Use a separate loop variable so the $element parameter is not overwritten.
        foreach ($element->getElements() as $child) {
            $result .= getWordText($child);
        }
    } elseif ($element instanceof Text) {
        $result .= $element->getText();
    }
    // and so on for other element types (see src/PhpWord/Element)

    return $result;
}

foreach ($phpWord->getSections() as $section) {
    foreach ($section->getElements() as $element) {
        $text .= getWordText($element);
    }
}

echo $text;

peter-at-bpt · 2021-08-31T20:37:40Z

Might I suggest a small improvement to the recursive method, since it can miss text from several element types:

    // Assumes this runs inside a class, with PhpOffice\PhpWord\IOFactory and
    // PhpOffice\PhpWord\Element\AbstractContainer imported via use statements
    
    public function getDocumentText(string $filepath): string
    {
        $document = IOFactory::createReader('Word2007')
            ->load($filepath);
        $documentText = '';

        foreach ($document->getSections() as $section) {
            foreach ($section->getElements() as $element) {
                $text = $this->getElementText($element);
                
                if ($text !== '') {
                    // The newline keeps text from one element from running straight into the next
                    $documentText .= $text . "\n";
                }
            }
        }

        return $documentText;
    }
    
    protected function getElementText($element): string
    {
        $result = '';

        if ($element instanceof AbstractContainer) {
            foreach ($element->getElements() as $subElement) {
                $result .= $this->getElementText($subElement);
            }
        }

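        if (method_exists($element, 'getText')) {
            $elementText = $element->getText();
            // getText() may return another element instead of a string for some
            // types (for example a Title built from a TextRun), so recurse in that case
            $result .= is_object($elementText)
                ? $this->getElementText($elementText)
                : (string) $elementText;
        }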

        return $result;
    }
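
The original question also asked for images and links with their titles. The same recursive walk can collect those too. What follows is only a minimal sketch, not something confirmed in this thread: collectLinksAndImages() is a hypothetical helper, and it assumes that Link elements expose getSource() and getText() and that Image elements expose getSource() and getImageStringData(), which is how the classes in src/PhpWord/Element appear to be laid out.

```php
<?php
// Sketch only. collectLinksAndImages() is a hypothetical helper name.
// Duck typing (method_exists) is used, in the same spirit as the
// getElementText() method above, so no class imports are required here.

function collectLinksAndImages($element, array &$links, array &$images): void
{
    if (method_exists($element, 'getElements')) {
        // Container (Section, TextRun, ...): recurse into its children.
        foreach ($element->getElements() as $child) {
            collectLinksAndImages($child, $links, $images);
        }
    } elseif (method_exists($element, 'getImageStringData')) {
        // Image: keep where it came from plus its data, base64-encoded.
        $images[] = [
            'source' => $element->getSource(),
            'base64' => $element->getImageStringData(true),
        ];
    } elseif (method_exists($element, 'getSource') && method_exists($element, 'getText')) {
        // Link: getText() is the visible title, getSource() the target URL.
        $links[] = [
            'title' => (string) $element->getText(),
            'url'   => $element->getSource(),
        ];
    }
}
```

To use it, start with $links = [] and $images = [] and call collectLinksAndImages($section, $links, $images) for each section returned by $phpWord->getSections(). An image can then be written to disk with file_put_contents() after base64_decode(). Tables are not descended into here, because they expose rows and cells rather than getElements(); that would need an extra branch.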

osnard · 2021-11-16T16:40:12Z

Sorry for hijacking the topic, but I have a related question. I am also walking the document object tree with a recursive implementation. I am trying to extract a "table of contents", so I am looking for PhpOffice\PhpWord\Element\Title objects. Unfortunately, even though the document seems to be formatted properly, the object model does not give me any such objects. I only see PhpOffice\PhpWord\Element\TextRun|Text|Break|Image|...

The XML looks like this

<w:p xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordml" w:rsidP="02051CF4" w14:paraId="4E47C1E7" wp14:textId="5ECEFD8F">
  <w:pPr>
    <w:pStyle w:val="Title"/>
    <w:rPr>
      <w:rFonts w:ascii="Calibri Light" w:hAnsi="Calibri Light" w:eastAsia="" w:cs=""/>
      <w:sz w:val="56"/>
      <w:szCs w:val="56"/>
    </w:rPr>
  </w:pPr>
  <w:bookmarkStart w:name="_GoBack" w:id="0"/>
  <w:bookmarkEnd w:id="0"/>
  <w:r w:rsidR="7A933B85">
    <w:rPr/>
    <w:t xml:space="preserve">The </w:t>
  </w:r>
  <w:proofErr w:type="spellStart"/>
  <w:r w:rsidR="7A933B85">
    <w:rPr/>
    <w:t>document</w:t>
  </w:r>
  <w:proofErr w:type="spellEnd"/>
  <w:r w:rsidR="7A933B85">
    <w:rPr/>
    <w:t xml:space="preserve"> title</w:t>
  </w:r>
</w:p>

Do you have any suggestions?

richardsonoge · 2023-01-04T15:59:04Z

@nikunjbhatt

This was just a simple example. Sections also seem to contain TextRuns (these are containers), which in turn contain sub-elements. Something like this should work:

<?php

use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Element\Text;
use PhpOffice\PhpWord\IOFactory as WordIOFactory;

require_once __DIR__.'/vendor/autoload.php';

$objReader = WordIOFactory::createReader('Word2007');
$phpWord = $objReader->load('file.docx'); // instance of \PhpOffice\PhpWord\PhpWord
$text = '';

function getWordText($element) {
    $result = '';
    if ($element instanceof AbstractContainer) {
        // Use a separate loop variable so the $element parameter is not overwritten.
        foreach ($element->getElements() as $child) {
            $result .= getWordText($child);
        }
    } elseif ($element instanceof Text) {
        $result .= $element->getText();
    }
    // and so on for other element types (see src/PhpWord/Element)

    return $result;
}

foreach ($phpWord->getSections() as $section) {
    foreach ($section->getElements() as $element) {
        $text .= getWordText($element);
    }
}

echo $text;

Thank you for showing how to extract the content of a .docx file. Could you also show me how to extract the content of a .doc file, please?

mrtsglk · 2024-01-24T12:39:29Z

Yes, I am also trying to convert a .doc file to text. The examples given convert a .docx file to text. How can we convert a .doc file to text?

gravitiq-cm · 2024-04-06T13:59:53Z

A method like Reader::getContentAsPlainText() would be very useful!

richardsonoge · 2024-04-06T14:12:37Z

A method like Reader::getContentAsPlainText() would be very useful!
Can you show me how to use this method with my code?

richardsonoge · 2024-04-06T19:56:49Z

Yes, I am trying to convert the doc extension file to text. In the examples given, we can convert the docx file to text. How can we convert a doc extension file to text?

But how can I do it with my code? Or can you give me some code to do it?
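
On the .doc question: besides 'Word2007', IOFactory also accepts the reader name 'MsDoc' for the older binary format, so the text-extraction code above should be reusable once the right reader is chosen. This is a rough sketch under assumptions: readerNameForFile() and loadWordFile() are hypothetical helpers, not part of PHPWord, and nothing in this thread confirms how completely the MsDoc reader handles any particular file.

```php
<?php
// Sketch only. readerNameForFile() and loadWordFile() are hypothetical helpers.

function readerNameForFile(string $path): string
{
    // Map the file extension to a PHPWord reader name.
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    $readers = [
        'docx' => 'Word2007',
        'doc'  => 'MsDoc',
        'odt'  => 'ODText',
        'rtf'  => 'RTF',
        'html' => 'HTML',
    ];

    if (!isset($readers[$extension])) {
        throw new InvalidArgumentException("No reader known for .$extension files");
    }

    return $readers[$extension];
}

function loadWordFile(string $path)
{
    // Requires phpoffice/phpword to be autoloaded (vendor/autoload.php).
    // Returns an instance of \PhpOffice\PhpWord\PhpWord.
    return \PhpOffice\PhpWord\IOFactory::load($path, readerNameForFile($path));
}
```

With that in place, $phpWord = loadWordFile('file.doc'); can be followed by the same getSections() loop as in the .docx examples. If the MsDoc reader cannot parse a given file, converting it to .docx first with an external tool is a possible fallback.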
