Invalid image: zip:https://t8.docx#word/media/image1.wmf #1612

Open
2 tasks
yangweijie opened this issue Apr 12, 2019 · 4 comments

Comments

@yangweijie

This is:
When I read a .docx file, the code throws an exception:
Invalid image: zip:https://t8.docx#word/media/image1.wmf

Expected Behavior

Please describe the behavior you are expecting.

Current Behavior

What is the current behavior?

Failure Information

Please help provide information about the failure.

How to Reproduce

Please provide a code sample that reproduces the issue.

<?php
// Excerpt from a larger method; $path and $check_titles come from the caller.
$phpWord  = \PhpOffice\PhpWord\IOFactory::load($path);
		$sections = $phpWord->getSections();
		$ph = new PhpWordHelper;
		$to_combines = [];
		foreach ($sections as $section) {
			$htitle  = $ph->get_section_header_text($section);
			foreach ($check_titles as $title) {
				$short_title = str_ireplace(' ', '', $title);
				if(false !== stripos($htitle, $title) || false !== stripos($htitle, $short_title)){
					// Titles are stored as keys, so test the key rather than the values.
					if(!isset($to_combines[$title])){
						$to_combines[$title][] = $title;
					}
				}
			}
		}
		return $to_combines;
<?php
namespace util;
use PhpOffice\PhpWord\PhpWord;
use PhpOffice\PhpWord\IOFactory;
use PhpOffice\PhpWord\Style\Font;
use PhpOffice\PhpWord\Shared\ZipArchive;
use PhpOffice\PhpWord\Settings;
use PhpOffice\PhpWord\Reader\Word2007\Styles;
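/**
 * Helper for splitting and inspecting Word documents with PHPWord.
 */
class PhpWordHelper {

    private $currentPage = 0;  // current page
    private $page        = 0; // number of inserted pages
    private $args        = null; // text paragraph style
    private $tmpFiles    = []; // temporary files
    private $styles      = [];
    private $breakCount  = 0;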


    public function split_by_section($file, $dir = ''){
        $old_phpword = IOFactory::load($file);
        // dump($old_phpword);
        // dump([
        //     $old_phpword->getDefaultFontName(),
        //     $old_phpword->getDefaultFontSize(),
        // ]);
        // die;
        $sections = $old_phpword->getSections();
        $arr      = [];
        foreach ($sections as $section) {
            $elements = $section->getElements();
            // dump($elements);
            // die;
            $phpWord = new PhpWord();
            $headers = $section->getHeaders();
            // dump($headers);
            $htitle = $this->get_section_header_text($section);
            $new_section = $phpWord->addSection();
            // $this->setAttr('section', $new_section, $section);
            $this->copyElement($new_section, $headers);
            // dump($phpWord);
            // Read and write the nodes level by level
            $this->copyElement($new_section, $elements);
            // dump($phpWord);
            $footers = $section->getFooters();
            $this->copyElement($new_section, $footers);
            // dump($phpWord);
            $fwrite = IOFactory::createWriter($phpWord);
            $path = $dir?:dirname($file);
            if(!is_dir($path)) mkdir($path);

            // dump($htitle);
            // $htitle = \mb_convert_encoding($htitle, IS_WIN?'GB2312':'UTF-8', 'UTF-8');
            $filePath = $path . DS . $htitle. '.docx';
            // dump($filePath);
            // $fwrite->save($filePath);
        }
    }

    // Get the header text of a section
    public function get_section_header_text($section){
        $headers = $section->getHeaders();
        $text = [];
        foreach ($headers as $key => $header) {
            $elements = $header->getElements();
            foreach ($elements as $key => $element) {
                $type = $this->get_element_type($element);
                if(in_array($type, ['Text', 'TextRun', 'Link', 'Title'])){
                    if($type == 'TextRun'){
                        $textRunElements = $element->getElements();
                        foreach ($textRunElements as $key => $textRunElement) {
                            $text[] = $textRunElement->getText();
                        }
                    }else{
                        $text[] = $element->getText();
                    }
                }
            }
        }
        $str = \implode('', $text);
        return \str_ireplace(' ', '', $str);
    }

    // Get the type of an element
    public function get_element_type($element){
        $class = \get_class($element);
        // dump(explode('\\', $class)[3]);
        return explode('\\', $class)[3];
    }

    public function copyElement(&$container, $elements){
        foreach ($elements as $key => $element) {
            if(!is_null($element)){
                $phpword = $element->getPhpWord();
                $type    = $this->get_element_type($element);
                // dump($type);
                $fun     = "add{$type}";
                // Check whether the element is a container
                if(in_array($type, [
                    'Header',
                    'Footer',
                    'Section'
                ])){
                    if($type == 'Section'){
                        $newEl = $container->$fun();
                    }else{
                        $ptype = $element->getType();
                        // dump('ptype:'.$ptype);
                        // dump("{$type}_{$key}");
                        // dump([$element, $element->getElements()]);
                        // dump($element->getElements());
                        $newEl = $container->$fun($ptype);
                    }
                    // dump($newEl);
                    // if(is_null($newEl)){
                    //     dump($container);
                    // }
                    $this->copyElement($newEl, $element->getElements());
                }elseif(in_array($type, [
                    'Table',
                    'Footnote',
                    'Endnote',
                    'TextRun'
                ])){ // compound element with children of its own
                    switch ($type) {
                        case 'Table':
                            $tableStyle = $element->getStyle();
                            $newEl      = $container->$fun($tableStyle);
                            $this->setAttr($type, $newEl, $element);
                            // A table exposes rows rather than generic child elements.
                            $this->copyElement($newEl, $element->getRows());
                            break;
                        case 'Footnote':
                        case 'Endnote':
                            $newEl = $container->$fun();
                            $this->setAttr($type, $newEl, $element);
                            $this->copyElement($newEl, $element->getElements());
                            break;
                        case 'TextRun':
                            $paragraphStyle = $element->getParagraphStyle();
                            // dump($paragraphStyle);
                            $newEl          = $container->$fun($paragraphStyle);
                            $this->copyElement($newEl, $element->getElements());
                            break;
                        default:
                            break;
                    }
                }else{
                    if(\method_exists($element, 'getText')){
                        $text = $element->getText();
                    }else{
                        $text = '';
                    }
                    // dump($type);
                    switch ($type) {
                        case 'Text':
                            // dump([
                            //     $text,
                            //     $element->getFontStyle(),
                            //     $element->getParagraphStyle()
                            // ]);
                            $container->addText($text, $element->getFontStyle(), $element->getParagraphStyle());
                            break;
                        case 'Title':
                            $depth = $element->getDepth();
                            $style = $element->getStyle();
                            $phpword->addTitleStyle($depth, $style['font'], $style['paragraph']);
                            $container->$fun($text, $depth);
                            break;
                        case 'Link':
                            $linkName       = $text;
                            $linkSrc        = $element->getLinkSrc();
                            $fontStyle      = $element->getFontStyle();
                            $paragraphStyle = $element->getParagraphStyle();
                            $container->$fun($linkSrc, $linkName, $fontStyle, $paragraphStyle);
                            break;
                        case 'PreserveText':
                            $container->$fun();
                            break;
                        case 'TextBreak':
                            $fontStyle      = $element->getFontStyle();
                            $paragraphStyle = $element->getParagraphStyle();
                            $container->$fun(1, $fontStyle, $paragraphStyle);
                            break;
                        case 'PageBreak':
                            $container->$fun();
                            break;
                        case 'Line':
                            $lineStyle = $element->getStyle();
                            $container->$fun($lineStyle);
                            break;
                        case 'Chart':
                            $type       = $element->getType();
                            $series     = $element->getSeries();
                            $style      = $element->getStyle();
                            $categories = array_column($series, 'categories');
                            $chart      = $container->$fun($type, $categories, $style);
                            break;
                        case 'ListItem':
                            $depth          = $element->getDepth();
                            $fontStyle      = $element->getTextObject()->getFontStyle();
                            $paragraphStyle = $element->getTextObject()->getParagraphStyle();
                            $listStyle      = $element->getStyle();
                            $container->$fun($text, $depth, $fontStyle, $listStyle, $paragraphStyle);
                            break;
                        case 'ListItemRun':
                            // TODO
                            break;
                        case 'Shape':
                            $type  = $element->getType();
                            $style = $element->getStyle();
                            $container->$fun($type, $style);
                            break;
                        case 'Image':
                            $source = $element->getSource();
                            $style  = $element->getStyle();
                            if($element->isWatermark() && !$element->isInSection()){
                                if($this->get_element_type($element->getParent()) == 'Header'){
                                    $element->getParent()->addWatermark($source, $style);
                                }
                            }
                            $container->$fun($source, $style);
                            break;
                        case 'OLEObject':
                            $source = $element->getSource();
                            $style  = $element->getStyle();
                            $container->$fun($source, $style);
                            break;
                        case 'Row':
                            $height = $element->getHeight();
                            $style  = $element->getStyle();
                            $row    = $container->$fun($height, $style);
                            // Copy the cells of the source row; the new row starts empty.
                            $cells  = $element->getCells();
                            if($cells){
                                $this->copyElement($row, $cells);
                            }
                            break;
                            break;
                        case 'Cell':
                            $width = $element->getWidth();
                            $style = $element->getStyle();
                            $cell  = $container->$fun($width, $style);
                            break;
                        default:
                            // Shape SDT
                            # code...
                            break;
                    }
                }
            }
        }
    }

    public static function checkDocumentLinks($file){
        $links       = [];
        try {
            $phpword     = IOFactory::load($file);
        } catch (\Exception $e) {
            ptrace($e->getMessage() . PHP_EOL . $e->getTraceAsString());
            return $links;
        }
        $sections    = $phpword->getSections();

        foreach ($sections as $section) {
            // checkElement() appends to $links by reference and returns nothing,
            // so its result must not be appended a second time.
            self::checkElement($section->getHeaders(), $links);
            // Read the nodes level by level
            self::checkElement($section->getElements(), $links);
            self::checkElement($section->getFooters(), $links);
        }
        return $links;
    }

    public static function checkElement($elements, &$links){
        foreach ($elements as $key => $element) {
            $class = \get_class($element);
            $type  = explode('\\', $class)[3];
            $ret   = [];
            // dump($type);
            // Check whether the element is a container
            if(in_array($type, [
                'Table',
                'Row',
                'Footnote',
                'Endnote',
                'TextRun'
            ])){ // compound element: recurse; matches are appended to $links by reference
                switch ($type) {
                    case 'Table':
                        self::checkElement($element->getRows(), $links);
                        break;
                    case 'Row':
                        self::checkElement($element->getCells(), $links);
                        break;
                    case 'Footnote':
                    case 'Endnote':
                    case 'TextRun':
                        self::checkElement($element->getElements(), $links);
                        break;
                }
            }elseif(method_exists($element, 'getElements')){
                self::checkElement($element->getElements(), $links);
            }else{
                if(\method_exists($element, 'getText')){
                    $text = $element->getText();
                }else{
                    $text = '';
                }
                // dump($type);
                switch ($type) {
                    case 'Link':
                        $ret =  [
                            'word'=>$text,
                            'link'=>$element->getLinkSrc()
                        ];
                        break;
                    case 'PreserveText':
                    case 'Text':
                    case 'Title':
                    case 'Cell':
                        if($text){
                            $ret = self::check_text_link($text);
                        }
                        // $ret = self::check_text_link($text);
                        break;
                    case 'TextBreak':
                    case 'PageBreak':
                    case 'Line':
                    case 'Chart':
                    case 'ListItem':
                    case 'ListItemRun':
                    case 'Shape':
                    case 'Image':
                    case 'OLEObject':
                        $ret = [];
                        break;
                    default:
                        // Shape SDT
                        break;
                }
            }
            if($ret){
                $links[] = $ret;
            }
        }
    }

    // { HYPERLINK &quot;https://baike.baidu.com/item/%E5%BB%BA%E7%AD%91&quot; \t &quot;_blank&quot; }
    public static function check_text_link($text){
        if(is_array($text)){
            $text = implode('', $text);
            $text = trim($text);
        }elseif(is_numeric($text)){
            return [];
        }
        if(stripos($text, '{ HYPERLINK ') === false && stripos($text, 'quot; }') === false){
            return [];
        }else{
            ptrace($text);
            $arr      = explode('&quot;', $text);
            $link     = $arr[1] ?? ''; // guard against text without a quoted URL
            $new_link = urldecode($link);
            $word     = getChinese($new_link)?:$new_link;
            return [
                    'word'=>$word == false?'':$word,
                    'link'=>$new_link,
            ];
        }
    }

    private function setAttr($elName, &$newEl, $element)
    {
        switch (strtolower($elName)) {
            case 'section':
                $orders = [
                    'getOrientation'        => 'setOrientation',
                    'getGutter'             => 'setGutter',
                    'getBreakType'          => 'setBreakType',
                    'getPaperSize'          => 'setPaperSize',
                    'getMarginTop'          => 'setMarginTop',
                    'getMarginLeft'         => 'setMarginLeft',
                    'getMarginBottom'       => 'setMarginBottom',
                    'getHeaderHeight'       => 'setHeaderHeight',
                    'getFooterHeight'       => 'setFooterHeight',
                    'getColsNum'            => 'setColsNum',
                    'getColsSpace'          => 'setColsSpace',
                    'getLineNumbering'      => 'setLineNumbering',
                    'getPageSizeW'          => 'setPageSizeW',
                    'getPageSizeH'          => 'setPageSizeH',
                    'getPageNumberingStart' => 'setPageNumberingStart',
                ];
                foreach ($orders as $get => $set) {
                    $newEl->$set($element->$get());
                }
                break;
            case 'footnote':
                $newEl->setReferenceId($element->getReferenceId());
                break;
            case 'formfield':
                $newEl->setName($element->getName());
                $newEl->setDefault($element->getDefault());
                $newEl->setValue($element->getValue());
                $newEl->setEntries($element->getEntries());
                break;
            case 'object':
                $newEl->setImageRelationId($element->getImageRelationId());
                $newEl->setObjectId($element->getObjectId());
                break;
            case 'sdt':
                $newEl->setValue($element->getValue());
                $newEl->setListItems($element->getListItems());
                break;
            case 'table':
                $newEl->setWidth($element->getWidth());
                break;
        }
    }
}

Context

t8.docx

  • PHP version:
  • PHPWord version: dev-develop 1534dc2
@JieAnthony

Has this been solved?

@yangweijie
Author

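Has this been solved?

No. I later switched to running macros on a Windows server to do the formatting and content extraction. I also catch some of the exceptions. Special image formats inside Word files are not supported; only JPG, PNG, and the like work.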
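
One way to catch such files before loading them is to look inside the .docx, which is a ZIP archive, and list the embedded media whose extension is not one of the formats expected to load. The following is only a sketch: `filterUnsupportedMedia` and `findUnsupportedMedia` are hypothetical helpers, not part of PHPWord, and the default extension list is an assumption based on the "JPG, PNG, and the like" remark above, not taken from the PHPWord source.

```php
<?php
// Hypothetical helpers, not part of PHPWord.

/**
 * Keep the entries under word/media/ whose extension is not in $supported.
 *
 * @param string[] $entryNames ZIP entry names of a .docx
 * @param string[] $supported  lowercase extensions assumed to load (an assumption)
 * @return string[]
 */
function filterUnsupportedMedia(array $entryNames, array $supported = ['jpg', 'jpeg', 'png', 'gif', 'bmp']): array
{
    $unsupported = [];
    foreach ($entryNames as $name) {
        // Skip anything that is not an embedded media file, including directory entries.
        if (strpos($name, 'word/media/') !== 0 || substr($name, -1) === '/') {
            continue;
        }
        $ext = strtolower(pathinfo($name, PATHINFO_EXTENSION));
        if (!in_array($ext, $supported, true)) {
            $unsupported[] = $name;
        }
    }
    return $unsupported;
}

/**
 * Read the entry names of a .docx with PHP's ZipArchive (requires ext-zip)
 * and return the media entries that look unsupported.
 */
function findUnsupportedMedia(string $docxPath): array
{
    $zip = new \ZipArchive();
    if ($zip->open($docxPath) !== true) {
        throw new \RuntimeException("Cannot open {$docxPath} as a ZIP archive");
    }
    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();
    return filterUnsupportedMedia($names);
}
```

A caller could run `findUnsupportedMedia($path)` first and skip or reroute the file when the result is not empty, instead of relying only on a try/catch around `IOFactory::load()`.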

@JieAnthony

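Has this been solved?

No. I later switched to running macros on a Windows server to do the formatting and content extraction. I also catch some of the exceptions. Special image formats inside Word files are not supported; only JPG, PNG, and the like work.

Then you would probably have to dig through the source to see where the restriction is. If you are converting Word to HTML, another option is to use LibreOffice.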
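
As a rough sketch of the LibreOffice route, the conversion can be driven from PHP by calling the command-line converter in headless mode. `buildSofficeCommand` and `convertDocxToHtml` are hypothetical helper names, and the code assumes a `soffice` binary is installed and on the PATH. How embedded WMF images come out in the HTML is not covered here.

```php
<?php
// Hypothetical helpers; assume LibreOffice's `soffice` binary is on the PATH.

/**
 * Build the shell command for a headless .docx to HTML conversion.
 */
function buildSofficeCommand(string $docxPath, string $outDir, string $binary = 'soffice'): string
{
    return sprintf(
        '%s --headless --convert-to html --outdir %s %s',
        escapeshellarg($binary),
        escapeshellarg($outDir),
        escapeshellarg($docxPath)
    );
}

/**
 * Run the conversion and return the path of the generated HTML file.
 */
function convertDocxToHtml(string $docxPath, string $outDir): string
{
    exec(buildSofficeCommand($docxPath, $outDir) . ' 2>&1', $output, $exitCode);

    // LibreOffice names the output after the input file, with the new extension.
    $htmlPath = rtrim($outDir, '/\\') . DIRECTORY_SEPARATOR
        . pathinfo($docxPath, PATHINFO_FILENAME) . '.html';

    // Check the output file as well as the exit code.
    if ($exitCode !== 0 || !is_file($htmlPath)) {
        throw new \RuntimeException("LibreOffice conversion failed:\n" . implode("\n", $output));
    }
    return $htmlPath;
}
```

For the file attached to this issue, the call would look like `convertDocxToHtml('t8.docx', sys_get_temp_dir())`.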

@thomasb88

Same issue as #1480 (also add WMF support).
