Number of words in document #1089

Open
mr-pack opened this issue Jul 3, 2017 · 3 comments

Comments

@mr-pack

mr-pack commented Jul 3, 2017

What is the best and fastest way to count the number of words in a document (.doc, .docx, .pdf, etc.)?

Thanks.



@Anexo

Anexo commented Jul 24, 2017

I'm doing it this way:

https://gist.github.com/Anexo/106c25dc4a99843936562ab71c5eef18

But I cannot get the content under TextRun. Any ideas?

@FBnil

FBnil commented Oct 8, 2017

@mr-pack another idea.
@Anexo You do not need to know whether it is a TextRun, only whether it has a method called getText. Also, since elements can be nested, you need recursion:

function ExtractText($obj, $nested = 0){
	$txt = "";
	if(method_exists($obj, 'getSections')) {
		foreach ($obj->getSections() as $section) {
			$txt .= " " . ExtractText($section, $nested+1);
		}
	}else if (method_exists($obj, 'getElements')) {
		foreach ($obj->getElements() as $element) {
			$txt .= " " . ExtractText($element, $nested+1);
		}
	}else if (method_exists($obj, 'getText')) {
		$txt .= $obj->getText();
	}else if(method_exists($obj, 'getRows')) {
		foreach ($obj->getRows() as $row) {
			$txt .= " " . ExtractText($row, $nested+1);
		}
	}else if(method_exists($obj, 'getCells')) {
		foreach ($obj->getCells() as $cell) {
			$txt .= " " . ExtractText($cell, $nested+1);
		}
	}else if (get_class($obj) != "PhpOffice\PhpWord\Element\TextBreak"){
		$txt .= "(".get_class($obj).")"; # unknown object, you need to add it
	}
	return $txt;
}

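$text = ExtractText($phpWord->load($filename));
$text = str_replace('&nbsp;', "", $text);
$text = str_replace('•', "", $text);
// PREG_SPLIT_NO_EMPTY drops the empty entries that leading or trailing whitespace would otherwise add to the count
$textArray = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
$numberWords = count($textArray);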

This example is woefully incomplete, though. Do not use it to count words as is. The reason is that the texts inside a TextRun do not actually need the space that gets prepended after the previous text, so extra markers need to be added in order to remove the artificial spaces between words. Something like:
if (get_class($obj) == "PhpOffice\PhpWord\Element\TextRun") { $txt = "<bs>" . $txt; }
The markers are then removed before returning (but only when nested == 0).
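
To make the marker idea concrete, here is a minimal sketch of one way to read it. It is only an interpretation, not tested against PHPWord itself: the FakeText, FakeContainer, FakeSection and FakeTextRun classes, the BS_MARKER constant and the extractWithMarkers function are all hypothetical stand-ins so that the snippet runs without the library. Inside a run, every artificial space is tagged with the marker, and the tagged spaces are deleted once at the top level.

```php
<?php
// Hypothetical stand-ins for PHPWord elements so the sketch runs without the library.
class FakeText
{
    private $text;
    public function __construct($text) { $this->text = $text; }
    public function getText() { return $this->text; }
}

class FakeContainer
{
    private $elements;
    public function __construct(array $elements) { $this->elements = $elements; }
    public function getElements() { return $this->elements; }
}

class FakeSection extends FakeContainer {}
class FakeTextRun extends FakeContainer {}

// Assumed marker string; it must never occur in the real document text.
const BS_MARKER = '<bs>';

function extractWithMarkers($obj, $nested = 0)
{
    $txt = '';
    if (method_exists($obj, 'getElements')) {
        // Inside a text run, tag each artificial space so it can be removed later.
        $marker = ($obj instanceof FakeTextRun) ? BS_MARKER : '';
        foreach ($obj->getElements() as $element) {
            $txt .= ' ' . $marker . extractWithMarkers($element, $nested + 1);
        }
    } elseif (method_exists($obj, 'getText')) {
        $txt .= $obj->getText();
    }
    if ($nested === 0) {
        // Only the outermost call deletes the tagged spaces.
        $txt = str_replace(' ' . BS_MARKER, '', $txt);
    }
    return $txt;
}

// "Hel" and "lo" are two fragments of one word inside the same run.
$doc = new FakeSection([
    new FakeText('One'),
    new FakeTextRun([new FakeText('Hel'), new FakeText('lo')]),
    new FakeText('two'),
]);
$text = extractWithMarkers($doc);
$words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
echo count($words), "\n"; // prints 3: "One", "Hello", "two"
```

Without the marker, the same input would be counted as four words, because "Hel" and "lo" would be separated by an artificial space.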

@francescozanoni

@FBnil "You do not need to know it is a TextRun": the problem is with class PhpOffice\PhpWord\Element\Title, which can return a TextRun when getText() is executed:

class Title extends AbstractElement
{
    // [...]

    /**
     * Create a new Title Element
     *
     * @param string|TextRun $text
     * @param int $depth
     */
    public function __construct($text, $depth = 1)
    {
        if (is_string($text)) {
            $this->text = CommonText::toUTF8($text);
        } elseif ($text instanceof TextRun) {       // <-- THIS IS THE PROBLEM
            $this->text = $text;                    // <-- THIS IS THE PROBLEM
        }
        // [...]
    }

    /**
     * Get Title Text content
     *
     * @return string
     */
    public function getText()
    {
        return $this->text;
    }

    // [...]

}

Here's my solution, which extends yours:

function ExtractText($obj, $nested = 0) {
    $txt = "";
    if (method_exists($obj, 'getSections')) {
        foreach ($obj->getSections() as $section) {
            $txt .= " " . ExtractText($section, $nested + 1);
        }
    } else if (method_exists($obj, 'getElements')) {
        foreach ($obj->getElements() as $element) {
            $txt .= " " . ExtractText($element, $nested + 1);
        }
    } else if (method_exists($obj, 'getText')) {
        // --------------------------------------------------------------
        // THIS IS THE DIFFERENT BLOCK
        $extracted = $obj->getText();
        if (is_string($extracted) === true) {
            $txt .= $extracted;
        } else {
            $txt .= " " . ExtractText($extracted, $nested + 1);
        }
        // --------------------------------------------------------------
    } else if (method_exists($obj, 'getRows')) {
        foreach ($obj->getRows() as $row) {
            $txt .= " " . ExtractText($row, $nested + 1);
        }
    } else if (method_exists($obj, 'getCells')) {
        foreach ($obj->getCells() as $cell) {
            $txt .= " " . ExtractText($cell, $nested + 1);
        }
    } else if (get_class($obj) != "PhpOffice\PhpWord\Element\TextBreak") {
        $txt .= "(" . get_class($obj) . ")"; # unknown object, you need to add it
    }
    return $txt;
}

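$text = ExtractText($phpWord->load($filename));
$text = str_replace('&nbsp;', "", $text);
$text = str_replace('•', "", $text);
// PREG_SPLIT_NO_EMPTY drops the empty entries that leading or trailing whitespace would otherwise add to the count
$textArray = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
$numberWords = count($textArray);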
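
For anyone who wants to check the Title case without a real document, here is a small sketch that reproduces it with stand-in classes. StubText, StubRun, StubTitle, extractStubText and countWords are hypothetical names made up for this illustration, not PHPWord classes; StubTitle only mimics the fact that getText() may hand back either a string or a run object. The counting helper also assumes you want empty pieces ignored, since every nested level prepends a space.

```php
<?php
// Hypothetical stand-ins; these are not PHPWord classes.
class StubText
{
    private $text;
    public function __construct($text) { $this->text = $text; }
    public function getText() { return $this->text; }
}

class StubRun
{
    private $elements;
    public function __construct(array $elements) { $this->elements = $elements; }
    public function getElements() { return $this->elements; }
}

// Mimics Title: getText() returns whatever was passed in, a string or a run object.
class StubTitle
{
    private $text;
    public function __construct($text) { $this->text = $text; }
    public function getText() { return $this->text; }
}

function extractStubText($obj)
{
    $txt = '';
    if (method_exists($obj, 'getElements')) {
        foreach ($obj->getElements() as $element) {
            $txt .= ' ' . extractStubText($element);
        }
    } elseif (method_exists($obj, 'getText')) {
        $extracted = $obj->getText();
        // Recurse when getText() returns an object instead of a string.
        $txt .= is_string($extracted) ? $extracted : (' ' . extractStubText($extracted));
    }
    return $txt;
}

function countWords($text)
{
    // PREG_SPLIT_NO_EMPTY ignores the empty piece produced by leading whitespace.
    return count(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}

$runTitle = new StubTitle(new StubRun([new StubText('Annual'), new StubText('Report')]));
$plainTitle = new StubTitle('Summary');

echo countWords(extractStubText($runTitle)), "\n";   // prints 2
echo countWords(extractStubText($plainTitle)), "\n"; // prints 1
```

Without the is_string() check, the run-based title would fail at the string concatenation instead of being counted.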

Hope this helps! :-)

Francesco
