Mapping an empty line to a new section #25
Hi Stefan,
This can absolutely be done, although not in docx2hub. docx2hub is a module that uniformly converts docx to an XML representation that is easier to handle than OOXML. It is not meant to be configured, for example to associate meaning with certain paragraph or character styles or, as in your case, with empty paragraphs. The place to apply custom processing is the next macroscopic step in the pipeline, which we call 'evolve-hub'. The XProc pipeline that is invoked by
Based on the assumption that neither XSLT nor XProc is your home turf, we can create such an XSLT for you and maybe add an option to the invocation script that allows you to pass the XSLT’s location to the XProc pipeline. We might add this XSLT to the repo to give other people a starting point for their own customizations.
Gerrit |
Does the evolve-hub step run before docx2hub? If docx2hub by definition does not pick up an empty line, there isn't much left to convert. Since pandoc has exactly the same issue, I now use the workaround that every NLP guy would: Microsoft Word's fantastic non-regexp search and replace, replacing ^p^p with ^pNEWLINE^p. That did the trick downstream. |
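The same workaround can be sketched outside Word. This is only an illustration of the idea, assuming the document text is already available as a list of paragraph strings; the function name `mark_empty_paragraphs` is hypothetical, and the `NEWLINE` marker is the one used in the comment above.

```python
def mark_empty_paragraphs(paragraphs, marker="NEWLINE"):
    """Return a copy of `paragraphs` with every blank entry replaced by `marker`."""
    # A paragraph counts as empty here if it holds nothing but whitespace.
    return [p if p.strip() else marker for p in paragraphs]


paragraphs = ["text1", "", "text2", "", "", "text3"]
print(mark_empty_paragraphs(paragraphs))
# ['text1', 'NEWLINE', 'text2', 'NEWLINE', 'NEWLINE', 'text3']
```

Working per paragraph rather than on pairs of paragraph marks means that runs of several empty paragraphs each receive their own marker.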
No, evolve-hub runs after docx2hub. Only after posting my comment did I become aware of your assertion that empty paragraphs are removed by docx2hub. I had doubts. See the attached H1.docx. It contains empty paragraphs, and they are present in 24.docx2hub_join-runs.xml:
<para role="Heading1">H1</para>
<para>text1</para>
<para/>
<para>text2</para>
<para role="Heading2">H2</para>
<para/>
<para>text3</para>
<para role="Heading1">H1</para>
<para>text4</para>
<para/>
<para>text5</para>
<para/> |
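A quick way to check an intermediate file for empty paragraphs is sketched below. It is a minimal sketch using only Python's standard library; the `<section>` wrapper in the sample and the function names are assumptions made for illustration, and elements are matched by local name so that the check does not depend on which namespace the real file uses.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def empty_para_positions(xml_text):
    """Return the 0-based positions (among all para elements) of empty paras.

    A para counts as empty when it has no child elements and no
    non-whitespace text; attributes are ignored. Note that this is slightly
    more lenient than the XPath test para[empty(node())], which would treat
    a whitespace-only text node as content.
    """
    root = ET.fromstring(xml_text)
    paras = [el for el in root.iter() if local_name(el.tag) == "para"]
    return [
        i for i, el in enumerate(paras)
        if len(el) == 0 and not (el.text or "").strip()
    ]


# Hypothetical wrapper element around paragraphs copied from the comment above.
sample = """<section>
  <para role="Heading1">H1</para>
  <para>text1</para>
  <para/>
  <para>text2</para>
</section>"""

print(empty_para_positions(sample))  # [2]
```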
Would it help if I provide a document for QA?
--
Stefan
|
Sure |
Maybe you inserted a page break? These also have a paragraph marker in Word, but they will only be converted to a paragraph if they also contain text. In the latter case, the resulting paragraph will have a |
Empty headings don’t seem to be a problem. |
I am currently going down a rabbit hole: AbiWord crashes when I modify (or even just copy) the document I want to use as the example. With respect to empty sections, \section*{} will just work. This is my magic trick to work around another issue: https://tex.stackexchange.com/questions/595254/paragraph-hanging-based-on-lines
But this adventure helped me validate your statement: you are right, it does do something with the empty paragraph.
<para css:margin-bottom="0pt" css:line-height="1.5" css:text-indent="14.2pt" css:text-align="justify">
<anchor xml:id="SysFlag"/>
<phrase css:font-family="Constantia">Met een diepe rimpel in zijn voorhoofd beleeft Peter alles opnieuw.</phrase>
</para>
<para css:margin-bottom="0pt" css:line-height="1.5" css:text-align="justify"/>
<para css:hyphens="manual" css:margin-bottom="0pt" css:line-height="1.5" css:text-indent="14.2pt">
<phrase css:font-family="Constantia">Er werd gezamenlijk buiten gespeeld door groep 1 en 2. </phrase>
</para>
So yes, in this case it could just be a matter of searching for empty text content. |
If Abiword crashes and if you don't have Word, maybe you can use LibreOffice? Alternatively, you can send the original document to letexml at le-tex.de. Or you can use another font. The Abiword bug seems to be font-related. If you want LaTeX output and do something based on empty paras, you need to provide a custom evolve-hub XSLT nevertheless since empty paras are removed by the default evolve-hub customization that docx2tex uses. This custom evolve-hub is what my previous comment was about. I provided an example that creates chapter headings from each empty para. |
I asked the document producer to create a lighter version. That showed me that my initial assumption was wrong: it was just a self-closing para.
I'm going to give it a shot. Thanks for your suggestions so far, it is sincerely appreciated. |
For what I wanted to achieve, I only needed to make this small change.
diff --git a/conf/conf.xml b/conf/conf.xml
index fb8f10b..5ad3d60 100755
--- a/conf/conf.xml
+++ b/conf/conf.xml
@@ -614,6 +614,12 @@
</rule>
</template>
+ <template context="dbk:para[@role = ('ResetParagraph')]">
+ <rule break-after="2" name="section*" type="cmd">
+ <param/>
+ </rule>
+ </template>
+
<xsl:variable name="footnote-ids" as="xs:string*"
select="for $i in $footnotes return generate-id($i)"/>
diff --git a/xsl/custom-evolve-hub-driver-example.xsl b/xsl/custom-evolve-hub-driver-example.xsl
index ea744d5..201f22d 100755
--- a/xsl/custom-evolve-hub-driver-example.xsl
+++ b/xsl/custom-evolve-hub-driver-example.xsl
@@ -13,8 +13,8 @@
<xsl:template match="para[empty(node())]" mode="docx2tex-preprocess">
<xsl:copy>
<xsl:apply-templates select="@*"/>
- <xsl:attribute name="role" select="'Heading1'"/>
+ <xsl:attribute name="role" select="'ResetParagraph'"/>
</xsl:copy>
</xsl:template> |
…although this could have probably been achieved without changing the core configuration. But whatever floats your boat. If you need to make this change stable across updates, feel free to reopen this issue or to open a new one. |
I think both Dutch and German have different words for a "paragraph" and a "section" (Absatz, Sektion, Paragraaf, Alinea). The problem with this type of thing is that users expect it to 'work', while they acknowledge that it does not work in Microsoft Word in the first place. Hence faking paragraph hanging by setting a margin works, but then the "reset" is not picked up. So maybe, more generically, an unnumbered section (section*) could be added; adding a specific ResetParagraph template to the default config wouldn't make sense for anyone unaware of the issue and the hack. I think the fundamental question is: should Word create individual "sections", or should the intention of the user be enough to have it in evolve-hub? How docx2tex currently works does not match the markup in the original document with respect to empty lines, so something in that direction should be added anyway, maybe even just:
<xsl:template match="para[empty(node())]" mode="docx2tex-preprocess">
<xsl:copy>
<xsl:apply-templates select="@*"/>
<xsl:attribute name="role" select="'NewLine'"/>
</xsl:copy>
</xsl:template>
(This is not correct, but I did not investigate how it should be encoded.) |
Empty paras have been removed in docx2tex since 2015-12. Since not many people have complained about it since then, and because it has been quite easy to treat empty paras differently since 2021-06, we should leave it at that. I documented how to override the evolve-hub behavior in README.md. |
I would like to map an empty line in the source document to a new section in TeX (I am using docx2tex), or to anything that could be post-processed; \newline would be fine too. I noticed that this part of the docx file is completely omitted, so it does not appear in the 24.docx2hub_join-runs.xml file at all.
Hence I would like to replace any w:p that has exactly one child, a w:pPr, with something?
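The condition described here (a w:p whose only child is w:pPr) can be checked directly on the WordprocessingML. The following is a minimal sketch using Python's standard library, not part of docx2tex or docx2hub; the function name and the sample markup are made up for illustration. In a real .docx file, which is a ZIP archive, the main document part would typically be read from word/document.xml.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, conventionally bound to the prefix "w".
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_P = f"{{{W_NS}}}p"
W_PPR = f"{{{W_NS}}}pPr"


def properties_only_paragraphs(document_xml):
    """Return 0-based positions of w:p elements whose only child is w:pPr."""
    root = ET.fromstring(document_xml)
    positions = []
    for i, p in enumerate(root.iter(W_P)):
        children = list(p)
        if len(children) == 1 and children[0].tag == W_PPR:
            positions.append(i)
    return positions


# Hypothetical, heavily reduced document.xml content.
sample_docx_xml = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:pPr><w:jc w:val="both"/></w:pPr><w:r><w:t>text1</w:t></w:r></w:p>
    <w:p><w:pPr><w:jc w:val="both"/></w:pPr></w:p>
    <w:p><w:r><w:t>text2</w:t></w:r></w:p>
  </w:body>
</w:document>"""

print(properties_only_paragraphs(sample_docx_xml))  # [1]
```

A paragraph with no children at all is also empty in Word; the sketch deliberately follows the narrower condition stated in the question.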