
Commit

Change default encoding for PDFToTextConverter from Latin 1 to `UTF-8` (#2420)

* Change default encoding for PDFToTextConverter

* Update Documentation & Code Style

* Improve docstring

* Update Documentation & Code Style

* Add list of ligatures to ignore and add the possibility to modify such list at need

* Add docstring

* Add tests

* Rename parameter

* Update Documentation & Code Style

* Move implementation into the base converter to make mypy happier

* Update Documentation & Code Style

* mypy and pylint

* mypy

* move encoding parameter to init of PDFToTextConverter

* Update Documentation & Code Style

* make utf8 default and fix mypy

* Update Documentation & Code Style

* Update Documentation & Code Style

* remove note on encoding in tutorial8

* Update Documentation & Code Style

* skip OCRConverter and test converter.run

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Julian Risch <[email protected]>
3 people committed May 4, 2022
1 parent a4e603c commit 01ea4bf
Showing 11 changed files with 300 additions and 49 deletions.
55 changes: 44 additions & 11 deletions docs/_src/api/api/file_converter.md
@@ -43,7 +43,7 @@ In this case the id will be generated by using the content and the defined metadata.

```python
@abstractmethod
def convert(file_path: Path, meta: Optional[Dict[str, str]], remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = "utf-8", id_hash_keys: Optional[List[str]] = None) -> List[Document]
def convert(file_path: Path, meta: Optional[Dict[str, str]], remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = "UTF-8", id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

Convert a file to a dictionary containing the text and any associated meta data.
@@ -65,7 +65,7 @@ The rows containing strings are thus retained in this option.
This option can be used to test for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.
- `encoding`: Select the file encoding (default is `utf-8`)
- `encoding`: Select the file encoding (default is `UTF-8`)
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
@@ -81,6 +81,40 @@ def validate_language(text: str, valid_languages: Optional[List[str]] = None) ->

Validate if the language of the text is one of the valid languages.
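
To illustrate how `valid_languages` can serve as a check for encoding errors, here is a minimal sketch of calling `validate_language` on extracted text. The use of `TextConverter` and the file name `sample.txt` are assumptions for illustration, not part of this commit.

```python
# Minimal sketch (assumptions: TextConverter as the concrete converter, sample.txt exists).
from pathlib import Path

from haystack.nodes import TextConverter

converter = TextConverter(valid_languages=["en"])
documents = converter.convert(file_path=Path("sample.txt"), meta=None)

# If the extracted text is not detected as English, this may point to an encoding problem.
if not converter.validate_language(documents[0].content, valid_languages=["en"]):
    print("Extracted text does not look like English - possible encoding error.")
```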

<a id="base.BaseConverter.run"></a>

#### run

```python
def run(file_paths: Union[Path, List[Path]], meta: Optional[Union[Dict[str, str], List[Optional[Dict[str, str]]]]] = None, remove_numeric_tables: Optional[bool] = None, known_ligatures: Dict[str, str] = KNOWN_LIGATURES, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = "UTF-8")
```

Extract text from a file.

**Arguments**:

- `file_paths`: Paths to the files you want to convert
- `meta`: Optional dictionary with metadata that shall be attached to all resulting documents.
Can be any custom keys and values.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
Tabular structures in documents can be noise for the reader model if it
does not have table-parsing capability for finding answers. However, tables
may also contain long strings that are possible candidates for answers.
The rows containing strings are therefore retained when this option is enabled.
- `known_ligatures`: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f).
Such ligatures, however, make the text hard to compare with the content of other files,
which are generally ligature-free. Therefore we automatically find and replace the most
common ligatures with their split counterparts. The default mapping is in
`haystack.nodes.file_converter.base.KNOWN_LIGATURES`: it is rather biased towards Latin alphabets
but excludes all ligatures that are known to be used in IPA.
You can use this parameter to provide your own set of ligatures to clean up from the documents.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to test for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.
- `encoding`: Select the file encoding (default is `UTF-8`)
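
As a concrete illustration of the new `known_ligatures` and `encoding` parameters, here is a minimal sketch of a `run` call with a customized ligature mapping. The converter class, the file name, and the extra mapping entry are assumptions for illustration only, not part of this commit.

```python
# Minimal sketch (assumptions: TextConverter as the concrete converter, my_file.txt exists).
from pathlib import Path

from haystack.nodes import TextConverter
from haystack.nodes.file_converter.base import KNOWN_LIGATURES

# Start from the default mapping and extend or prune it as needed for your corpus.
my_ligatures = dict(KNOWN_LIGATURES)
my_ligatures["ﬃ"] = "ffi"  # illustrative entry: make sure the "ffi" ligature is split as well

converter = TextConverter()
output, _ = converter.run(
    file_paths=[Path("my_file.txt")],
    known_ligatures=my_ligatures,
    encoding="UTF-8",
)
print(output["documents"][0].content[:200])
```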

<a id="docx"></a>

# Module docx
@@ -261,7 +295,7 @@ class PDFToTextConverter(BaseConverter)
#### \_\_init\_\_

```python
def __init__(remove_numeric_tables: bool = False, valid_languages: Optional[List[str]] = None, id_hash_keys: Optional[List[str]] = None)
def __init__(remove_numeric_tables: bool = False, valid_languages: Optional[List[str]] = None, id_hash_keys: Optional[List[str]] = None, encoding: Optional[str] = "UTF-8")
```

**Arguments**:
@@ -280,13 +314,16 @@ in garbled text.
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.
- `encoding`: Encoding that will be passed as `-enc` parameter to `pdftotext`.
Defaults to "UTF-8" in order to support special characters (e.g. German Umlauts, Cyrillic ...).
(See list of available encodings, such as "Latin1", by running `pdftotext -listenc` in the terminal)
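
For context, here is a minimal sketch of how the new default plays out in practice; the file name `sample.pdf` is an assumption, and the `pdftotext` binary must be installed for this converter to work.

```python
# Minimal sketch (assumptions: sample.pdf exists and the pdftotext binary is installed).
from pathlib import Path

from haystack.nodes import PDFToTextConverter

# UTF-8 is now the default encoding, so special characters (Umlauts, Cyrillic, ...) survive conversion.
converter = PDFToTextConverter(remove_numeric_tables=False, valid_languages=["en"])
documents = converter.convert(file_path=Path("sample.pdf"), meta=None)
print(documents[0].content[:200])

# To restore the previous behaviour for a single call, override the encoding per convert():
legacy_documents = converter.convert(file_path=Path("sample.pdf"), meta=None, encoding="Latin1")
```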

<a id="pdf.PDFToTextConverter.convert"></a>

#### convert

```python
def convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = "Latin1", id_hash_keys: Optional[List[str]] = None) -> List[Document]
def convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = None, id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

Extract text from a .pdf file using the pdftotext library (https://www.xpdfreader.com/pdftotext-man.html)
@@ -306,11 +343,7 @@ The rows containing strings are thus retained in this option.
This option can be used to test for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.
- `encoding`: Encoding that will be passed as -enc parameter to pdftotext. "Latin 1" is the default encoding
of pdftotext. While this works well on many PDFs, it might be needed to switch to "UTF-8" or
others if your doc contains special characters (e.g. German Umlauts, Cyrillic characters ...).
Note: With "UTF-8" we experienced cases, where a simple "fi" gets wrongly parsed as
"xef\xac\x81c" (see test cases). That's why we keep "Latin 1" as default here.
- `encoding`: Encoding that overwrites self.encoding and will be passed as `-enc` parameter to `pdftotext`.
(See list of available encodings by running `pdftotext -listenc` in the terminal)
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
@@ -357,7 +390,7 @@ In this case the id will be generated by using the content and the defined metadata.
#### convert

```python
def convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = "utf-8", id_hash_keys: Optional[List[str]] = None) -> List[Document]
def convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = "UTF-8", id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

Convert a file to a dictionary containing the text and any associated meta data.
@@ -379,7 +412,7 @@ The rows containing strings are thus retained in this option.
This option can be used to test for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.
- `encoding`: Select the file encoding (default is `utf-8`)
- `encoding`: Select the file encoding (default is `UTF-8`)
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
1 change: 0 additions & 1 deletion docs/_src/tutorials/tutorials/8.md
@@ -67,7 +67,6 @@ Haystack's converter classes are designed to help you turn files on your computer
that can be processed by the Haystack pipeline.
There are file converters for txt, pdf, docx files as well as a converter that is powered by Apache Tika.
The parameter `valid_languages` does not convert files to the target language, but checks if the conversion worked as expected.
For converting PDFs, try changing the encoding to UTF-8 if the conversion isn't great.


5 changes: 5 additions & 0 deletions haystack/json-schemas/haystack-pipeline-1.3.0.schema.json
@@ -2375,6 +2375,11 @@
"items": {
"type": "string"
}
},
"encoding": {
"title": "Encoding",
"default": "UTF-8",
"type": "string"
}
},
"additionalProperties": false,

0 comments on commit 01ea4bf
