Mapping an empty line to a new section #25

Closed
skinkie opened this issue Jun 12, 2021 · 14 comments

@skinkie

skinkie commented Jun 12, 2021

I would like to map an empty line in the source document to a new section in TeX (I am using docx2tex), or to anything that could be post-processed; \newline would be fine too. I noticed that the following fragment from the docx file is omitted completely: it does not appear in the 24.docx2hub_join-runs.xml file at all.

Hence I would like to replace any w:p that has exactly one child, a w:pPr, with something.

    <w:p w14:paraId="1E4C5E8A" w14:textId="77777777" w:rsidR="00AE18CA" w:rsidRPr="003A1575" w:rsidRDefault="00AE18CA" w:rsidP="00F03C13">
      <w:pPr>
        <w:spacing w:after="0" w:line="360" w:lineRule="auto"/>
        <w:ind w:firstLine="284"/>
        <w:jc w:val="both"/>
        <w:rPr>
          <w:rFonts w:ascii="Constantia" w:hAnsi="Constantia" w:cs="Courier New"/>
          <w:szCs w:val="24"/>
        </w:rPr>
      </w:pPr>
    </w:p>
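
The pattern described above (a w:p whose only child element is w:pPr) can be written as an XSLT match pattern. The following is a minimal sketch only, assuming one pre-processed word/document.xml directly, outside of docx2hub and docx2tex; the marker text NEWLINE is an arbitrary placeholder, not something either tool defines:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

  <!-- Identity template: copy everything unchanged by default -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- A paragraph whose only child element is w:pPr, i.e. an empty line -->
  <xsl:template match="w:p[count(*) = 1 and w:pPr]">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <!-- Hypothetical marker run so that the paragraph is no longer empty -->
      <w:r>
        <w:t>NEWLINE</w:t>
      </w:r>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The predicate counts child elements only, so whitespace between the tags does not prevent a match.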
@gimsieke
Contributor

gimsieke commented Jun 12, 2021

Hi Stefan,

This can absolutely be done, although not in docx2hub. This is a module that uniformly converts docx to an XML representation that can be handled more easily than OOXML. It is not meant to be configured, for example in order to associate meaning with certain paragraph or character styles, or with empty paragraphs, as in your case.

The place to apply custom processing is the next macroscopic step in the pipeline, which we call 'evolve-hub'.

The XProc pipeline that is invoked by d2t or d2t.bat has an input port called custom-evolve-hub-driver to which you can pass an XSLT stylesheet that implements your sectioning rules. (Unfortunately you can’t pass this XSLT’s location to the scripts yet; we’d need to provide such an option unless you invoke calabash/calabash.sh yourself.)

Based on the assumption that neither XSLT nor XProc are your home turf, we can create such an XSLT for you and maybe add an option to the invocation script that allows you to pass the XSLT’s location to the XProc pipeline.

We might add this XSLT to the repo in order to give other people a starting point for their own customizations.

Gerrit

@skinkie
Author

skinkie commented Jun 12, 2021

Does the evolve-hub step run before docx2hub? Because if docx2hub by definition does not pick up an empty line, there isn't much to convert either. Since pandoc has exactly the same issue, I now use the workaround that every NLP guy would reach for: Microsoft Word's fantastic non-regexp search and replace, replacing ^p^p with ^pNEWLINE^p. That did the trick downstream.

@gimsieke
Contributor

No, evolve-hub runs after docx2hub.

Only after posting my comment did I become aware of your assertion that empty paragraphs are removed by docx2hub. I had my doubts. See the attached H1.docx. It contains empty paragraphs, and they are present in 24.docx2hub_join-runs.xml:

   <para role="Heading1">H1</para>
   <para>text1</para>
   <para/>
   <para>text2</para>
   <para role="Heading2">H2</para>
   <para/>
   <para>text3</para>
   <para role="Heading1">H1</para>
   <para>text4</para>
   <para/>
   <para>text5</para>
   <para/>

@skinkie
Author

skinkie commented Jun 12, 2021 via email

@gimsieke
Contributor

Sure

@gimsieke
Contributor

Maybe you inserted a page break? These also have a paragraph marker in Word, but they will only be converted to a paragraph if they also contain text. In the latter case, the resulting paragraph will have a css:page-break-after="always" attribute. In the former case, the next paragraph will have a css:page-break-before="always" attribute. Of course one can write templates (in an XSLT mode of evolve-hub) that take these attributes as a hint to create headings. (I don’t think that LaTeX will accept headings with no text though.)
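
Treating the page-break attribute as a sectioning hint could be sketched roughly as follows. This is an illustration under stated assumptions, not the docx2tex implementation: the css namespace URI, the reuse of the docx2tex-preprocess mode, and the role name PageBreakSection are assumptions, and the identity template is included only so that the sketch runs standalone (with docx2tex-preprocess as the initial mode). A real driver would rely on the default handling of the stylesheet it imports.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:css="http://www.w3.org/1996/css"
  xmlns="http://docbook.org/ns/docbook"
  xpath-default-namespace="http://docbook.org/ns/docbook">

  <!-- Standalone-only identity template in the assumed mode -->
  <xsl:template match="@* | node()" mode="docx2tex-preprocess">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- A paragraph that follows a page break without text of its own -->
  <xsl:template match="para[@css:page-break-before = 'always']"
                mode="docx2tex-preprocess">
    <!-- Hypothetical marker paragraph that a later step could map to a heading -->
    <para role="PageBreakSection"/>
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```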

gimsieke added a commit to transpect/docx2tex that referenced this issue Jun 13, 2021
@gimsieke
Contributor

gimsieke commented Jun 13, 2021

./d2t -d -e xsl/custom-evolve-hub-driver-example.xsl ../tmp/H1.docx (after a git pull of https://github.com/transpect/docx2tex, and of course mutatis mutandis wrt the docx file location…)

Empty headings don’t seem to be a problem.

@skinkie
Author

skinkie commented Jun 13, 2021

I am currently going down a rabbit hole: AbiWord crashes when I modify (or even just copy) the document I want to use as an example. With respect to empty sections, \section*{} will just work. This is my magic trick to work around another issue: https://tex.stackexchange.com/questions/595254/paragraph-hanging-based-on-lines

But this adventure helped me to validate your statement: you are right, it does do something with the empty paragraph.

  <para css:margin-bottom="0pt" css:line-height="1.5" css:text-indent="14.2pt" css:text-align="justify">
    <anchor xml:id="SysFlag"/>
    <phrase css:font-family="Constantia">Met een diepe rimpel in zijn voorhoofd beleeft Peter alles opnieuw.</phrase>
  </para>
  <para css:margin-bottom="0pt" css:line-height="1.5" css:text-align="justify"/>
  <para css:hyphens="manual" css:margin-bottom="0pt" css:line-height="1.5" css:text-indent="14.2pt">
    <phrase css:font-family="Constantia">Er werd gezamenlijk buiten gespeeld door groep 1 en 2. </phrase>
  </para>

So yes, in this case it could be as simple as matching paragraphs with empty text content.
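
"Empty" can be tested in two slightly different ways, and the difference matters for paragraphs like the first one above, which carries an anchor. A minimal sketch, assuming the Hub paragraphs are in the DocBook namespace and are processed in the docx2tex-preprocess mode; the role names EmptyPara and BlankPara are hypothetical, and the identity template is included only so that the sketch runs standalone:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xpath-default-namespace="http://docbook.org/ns/docbook">

  <!-- Standalone-only identity template in the assumed mode -->
  <xsl:template match="@* | node()" mode="docx2tex-preprocess">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- No child nodes at all: matches a self-closing para, attributes or not -->
  <xsl:template match="para[empty(node())]" mode="docx2tex-preprocess">
    <xsl:copy>
      <xsl:apply-templates select="@*" mode="#current"/>
      <xsl:attribute name="role" select="'EmptyPara'"/>
    </xsl:copy>
  </xsl:template>

  <!-- Has child nodes but no non-whitespace text, e.g. only an anchor -->
  <xsl:template match="para[node()][not(normalize-space())]"
                mode="docx2tex-preprocess">
    <xsl:copy>
      <xsl:apply-templates select="@*" mode="#current"/>
      <xsl:attribute name="role" select="'BlankPara'"/>
      <xsl:apply-templates select="node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The two predicates are mutually exclusive, so no template conflict arises between them.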

@gimsieke
Contributor

gimsieke commented Jun 13, 2021

If Abiword crashes and if you don't have Word, maybe you can use LibreOffice?

Alternatively, you can send the original document to letexml at le-tex.de.

Or you can use another font. The Abiword bug seems to be font-related.

If you want LaTeX output and do something based on empty paras, you need to provide a custom evolve-hub XSLT nevertheless since empty paras are removed by the default evolve-hub customization that docx2tex uses.

This custom evolve-hub is what my previous comment was about. I provided an example that creates chapter headings from each empty para.

@skinkie
Author

skinkie commented Jun 13, 2021

> If Abiword crashes and if you don't have Word, maybe you can use LibreOffice?

I asked the document producer to create a lighter version. That showed me my initial assumption was wrong: the empty paragraph was there, just as a self-closing para.

> If you want LaTeX output and do something based on empty paras, you need to provide a custom evolve-hub XSLT nevertheless since empty paras are removed by the default evolve-hub customization that docx2tex uses.
>
> This custom evolve-hub is what my previous comment was about. I provided an example that creates chapter headings from each empty para.

I'm going to give it a shot. Thanks for your suggestions so far, it is sincerely appreciated.

@skinkie
Author

skinkie commented Jun 13, 2021

For what I wanted to achieve I only needed to make this small change.

diff --git a/conf/conf.xml b/conf/conf.xml
index fb8f10b..5ad3d60 100755
--- a/conf/conf.xml
+++ b/conf/conf.xml
@@ -614,6 +614,12 @@
     </rule>
   </template>
 
+  <template context="dbk:para[@role = ('ResetParagraph')]">
+    <rule break-after="2" name="section*" type="cmd">
+      <param/>
+    </rule>
+  </template>
+
   <xsl:variable name="footnote-ids" as="xs:string*" 
                 select="for $i in $footnotes return generate-id($i)"/>
 
diff --git a/xsl/custom-evolve-hub-driver-example.xsl b/xsl/custom-evolve-hub-driver-example.xsl
index ea744d5..201f22d 100755
--- a/xsl/custom-evolve-hub-driver-example.xsl
+++ b/xsl/custom-evolve-hub-driver-example.xsl
@@ -13,8 +13,8 @@
   <xsl:template match="para[empty(node())]" mode="docx2tex-preprocess">
     <xsl:copy>
       <xsl:apply-templates select="@*"/>
-      <xsl:attribute name="role" select="'Heading1'"/>
+      <xsl:attribute name="role" select="'ResetParagraph'"/>
     </xsl:copy>
   </xsl:template>

skinkie closed this as completed Jun 13, 2021

@gimsieke
Contributor

…although this could have probably been achieved without changing the core configuration. But whatever floats your boat. If you need to make this change stable across updates, feel free to reopen this issue or to open a new one.

@skinkie
Author

skinkie commented Jun 13, 2021

> …although this could have probably been achieved without changing the core configuration. But whatever floats your boat. If you need to make this change stable across updates, feel free to reopen this issue or to open a new one.

I think both Dutch and German have different words for "paragraph" and "section" (Absatz, Sektion, Paragraaf, Alinea). The problem with this kind of thing is that users expect it to 'work', while they acknowledge it does not work in Microsoft Word in the first place. Hence faking paragraph hanging works by setting a margin, but then the "reset" is not picked up.

So maybe, more generically, an unnumbered section (section*) could be added; adding a specific ResetParagraph template to the default config wouldn't make sense for anyone unaware of the issue and the hack. I think the fundamental question is: should Word create individual "sections", or should the intention of the user be enough to have it in evolve-hub? The way docx2tex currently works does not reflect the markup in the original document with respect to empty lines, so something in that direction should be added anyway, maybe even just:

   <xsl:template match="para[empty(node())]" mode="docx2tex-preprocess">
     <xsl:copy>
       <xsl:apply-templates select="@*"/>
       <xsl:attribute name="role" select="'NewLine'"/>
     </xsl:copy>
   </xsl:template>
  <template context="dbk:para[@role = ('NewLine')]">
    <rule break-after="2" name="newline" type="cmd">
      <param/>
    </rule>
  </template>

(This is not correct, but I did not investigate how it should be encoded.)

@gimsieke
Contributor

gimsieke commented Jun 15, 2021

Empty paras have been removed in docx2tex since 2015-12. Because few people have complained about it since then, and because it has been quite easy to treat empty paras differently since 2021-06, we should leave it at that.

I documented how to override the evolve-hub behavior in README.md.
