A table of contents (TOC) can be extracted from any PyMuPDF document type with the method doc.get_toc(simple=True). Pass simple=False to get more detail per item. The old name getTOC is deprecated and kept as an alias.
The method returns a Python list like this:
>>> pprint(doc.get_toc(True))
[[1, 'Page 1 - level 1', 1], # hierarchy level, title, page number
 [2, 'Page 2 - level 2', 2],
 [3, 'Page 3 - level 3', 3],
 [1, 'Page 4 - level 1', 4],
 [2, 'Page 5 - level 2', 5],
 [3, 'Page 6 - level 3', 6],
 [1, 'Page 7 - level 1', 7],
 [2, 'Page 8 - level 2', 8],
 [3, 'Page 9 - level 3', 9]]
>>>
With simple=False, the list looks like this:
>>> pprint(doc.get_toc(False))
[[1,  # hierarchy level
  'Page 1 - level 1',  # title
  1,  # page number, followed by item details:
  {'collapse': True,
   'kind': 1,
   'page': 0,
   'to': Point(72.0, 36.0),
   'xref': 15,
   'zoom': 0.0}],
 [2,
  'Page 2 - level 2',
  2,
  {'collapse': True,
   'kind': 1,
   'page': 1,
   'to': Point(72.0, 36.0),
   'xref': 16,
   'zoom': 0.0}],
 [3,
  'Page 3 - level 3',
  3,
  {'kind': 1, 'page': 2, 'to': Point(72.0, 36.0), 'xref': 17, 'zoom': 0.0}],
 [1,
  'Page 4 - level 1',
  4,
  {'collapse': True,
   'kind': 1,
   'page': 3,
   'to': Point(72.0, 36.0),
   'xref': 18,
   'zoom': 0.0}],
 [2,
  'Page 5 - level 2',
  5,
  {'collapse': True,
   'kind': 1,
   'page': 4,
   'to': Point(72.0, 36.0),
   'xref': 19,
   'zoom': 0.0}],
 [3,
  'Page 6 - level 3',
  6,
  {'kind': 1, 'page': 5, 'to': Point(72.0, 36.0), 'xref': 20, 'zoom': 0.0}],
 [1,
  'Page 7 - level 1',
  7,
  {'collapse': True,
   'kind': 1,
   'page': 6,
   'to': Point(72.0, 36.0),
   'xref': 21,
   'zoom': 0.0}],
 [2,
  'Page 8 - level 2',
  8,
  {'collapse': True,
   'kind': 1,
   'page': 7,
   'to': Point(72.0, 36.0),
   'xref': 22,
   'zoom': 0.0}],
 [3,
  'Page 9 - level 3',
  9,
  {'kind': 1, 'page': 8, 'to': Point(72.0, 36.0), 'xref': 23, 'zoom': 0.0}]]
The second version provides an additional entry with more detail per item.
Please note that the items occur in the sequence defined in the document. They are not sorted, and in particular not by page number.
Page numbers may be -1 to indicate that the item does not point anywhere in the document. In that case, the detail dictionary tells whether the target is a different document, an internet resource, or nothing at all.
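To illustrate the page number convention, here is a minimal sketch that splits a TOC list into items that point into the document and items that do not. The sample list and the helper name split_by_target are made up for this example; a real list would come from doc.get_toc().

```python
def split_by_target(toc):
    """Split TOC rows into (internal, other) by their page number.

    Rows look like [level, title, page, ...]. A page number of -1 means
    the item does not point to a page of this document.
    """
    internal = [row for row in toc if row[2] != -1]
    other = [row for row in toc if row[2] == -1]
    return internal, other


# Made-up sample data in the simple format: level, title, page number.
sample_toc = [
    [1, "Introduction", 1],
    [1, "Project homepage", -1],
    [2, "Details", 5],
    [2, "Summary", 3],
]

internal, other = split_by_target(sample_toc)
print([row[1] for row in internal])  # document order is kept, not page order
print([row[1] for row in other])
```

Both result lists keep the original item sequence, so page numbers may still appear out of order.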
In PDF documents, a list like this can be used as the argument of the method doc.set_toc. This will either completely replace any existing TOC or create a new one.
Both of the above formats are supported. For items with the simple format, a default detail dictionary will be created internally.
The important thing to note is that you can manipulate the TOC list to your liking before using the method. The following rules must be adhered to, however:
- The first item in the list must have a level of 1.
- The level of a successor item must either be (a) smaller (= a higher level), (b) the same, or (c) 1 larger than that of the previous item.
- Page numbers must be 1-based. The maximum value is the document's page count.
If you follow these rules, you can add or remove items, change titles, page numbers, page target points, and target types.
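The three rules can be checked before calling doc.set_toc. The following is only a sketch: the function name check_toc is hypothetical, and it applies the rules literally as stated above, so it rejects a page number of -1. PyMuPDF performs its own validation, which may differ in detail.

```python
def check_toc(toc, page_count):
    """Raise ValueError if `toc` breaks one of the three rules above."""
    previous_level = 0  # forces the first item to have level 1
    for i, row in enumerate(toc):
        level, page = row[0], row[2]
        if i == 0 and level != 1:
            raise ValueError("row 0: the first item must have level 1")
        # A level may shrink, stay the same, or grow by at most 1.
        if level < 1 or level > previous_level + 1:
            raise ValueError(f"row {i}: bad hierarchy level {level}")
        # Page numbers are 1-based and limited by the page count.
        if not 1 <= page <= page_count:
            raise ValueError(f"row {i}: page number {page} out of range")
        previous_level = level


# Made-up example list in the simple format.
toc = [[1, "Chapter 1", 1], [2, "Section 1.1", 2], [1, "Chapter 2", 5]]
check_toc(toc, page_count=10)  # passes silently
```

Running such a check after editing the list gives an error message that names the offending row before the document is touched.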
PDF viewers normally support collapsed or expanded views of the TOC. PyMuPDF offers the following options to control this:
- Use the collapse method parameter: doc.set_toc(toc, collapse=n). This will collapse all items with a hierarchy level greater than n. The default is 1, so only top-level items are initially shown. To show all items expanded, use 0 or None (or a very large integer).
- Use the key "collapse" of the item detail dictionary. If set to True, items below this one are collapsed. In this case, set the collapse parameter of the method to 0.
Advanced PDF viewers also support colored TOC views and more sophisticated properties like bold and italics.
In PyMuPDF you can use the item detail dictionary to achieve this:
- Set the "color" key to a PDF RGB color triple (red, green, blue). Each of the three entries is a float in the range 0 to 1.
- Set the "bold" key to True / False.
- Set the "italic" key to True / False.
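Colors are often given as hex strings or 0 to 255 values, while the "color" key expects floats between 0 and 1. The sketch below shows one way to convert; the helper name hex_to_pdf_rgb is an assumption for this example and not part of PyMuPDF.

```python
def hex_to_pdf_rgb(hex_color):
    """Convert 'RRGGBB' or '#RRGGBB' to a (red, green, blue) float triple."""
    digits = hex_color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    # Each pair of hex digits is one channel in 0..255; scale it to 0..1.
    return tuple(int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))


# Style entries using the three keys described above.
style = {"color": hex_to_pdf_rgb("#ff0000"), "bold": True, "italic": False}
print(style["color"])  # (1.0, 0.0, 0.0)
```

These entries would be merged into an existing item detail dictionary, for instance with ddict.update(style), rather than replacing it, so that the target information of the item is kept.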
Replacing the complete TOC as offered by doc.set_toc may not always be desirable.
If all you want is to change, for example, a few bookmark titles out of several hundred or even thousands of TOC items, then replacing the whole bunch may be a waste of disk space and / or processing time.
For example, the Adobe manual has about 800 TOC items, the Pandas manual over 500 and the SWIG manual about 1240.
To modify a single item, use the method doc.set_toc_item. All properties are available for change. Please see the following call pattern:
toc = doc.get_toc(False) # recommended to always use False
item = toc[idx] # some item of the list
dest = item[3] # the detail dictionary of the destination
# now manipulate the 'dest' values, then:
doc.set_toc_item(
    idx,             # index of an item in the above list
    dest_dict=dest,  # modified detail dict of item
    kind=int,        # link kind, only if dest_dict omitted
    pno=int,         # target page, only if dest_dict omitted
    uri=str,         # URI, only if dest_dict omitted
    title=str,       # new title
    to=point,        # target point, only if dest_dict omitted
    filename=str,    # only if dest_dict omitted
    zoom=float,      # zoom factor, only if dest_dict omitted
)
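As a small use of this call pattern, the following sketch renames every item that carries a given title and leaves everything else alone. The function name rename_toc_items and the file names in the usage comment are placeholders; doc is assumed to be an open PyMuPDF PDF document.

```python
def rename_toc_items(doc, old_title, new_title):
    """Rename every TOC item titled `old_title`; return the changed indices.

    `doc` is assumed to offer get_toc() and set_toc_item() as described
    in the text above.
    """
    changed = []
    for idx, item in enumerate(doc.get_toc(False)):
        if item[1] == old_title:  # item is [level, title, page, details]
            doc.set_toc_item(idx, title=new_title)  # pass only the new title
            changed.append(idx)
    return changed


# Possible usage (file names are placeholders):
# import fitz
# doc = fitz.open("example.pdf")
# rename_toc_items(doc, "Page 1 - level 1", "Introduction")
# doc.save("renamed-toc.pdf")
```

The list index returned by enumerate is the same index that doc.set_toc_item expects, which is why the TOC is read once and not modified while looping.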
Here is a "live" script that reads the PDF file in this folder and does the following:
- expand all items
- set level 1 to red and bold
- set level 2 to blue and italic
- set other levels to green
import fitz

doc = fitz.open("example.pdf")
toc = doc.get_toc(False)
for i, item in enumerate(toc):
    lvl, title, pno, ddict = item
    ddict["collapse"] = False
    if lvl == 1:
        ddict["color"] = (1, 0, 0)
        ddict["bold"] = True
        ddict["italic"] = False
    elif lvl == 2:
        ddict["color"] = (0, 0, 1)
        ddict["bold"] = False
        ddict["italic"] = True
    else:
        ddict["color"] = (0, 1, 0)
        ddict["bold"] = ddict["italic"] = False
    doc.set_toc_item(i, dest_dict=ddict)

doc.save("colored-toc.pdf")
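In the resulting file, all items are expanded, level 1 titles appear in bold red, level 2 titles in italic blue, and all other levels in green.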
There are differences in favor of using this function:
- doc.set_toc_item() reuses the old xref number, whereas doc.set_toc() acquires new xref numbers. The old ones must be reclaimed by using a garbage option of 2 or more on save().
- If an exception occurs within underlying MuPDF functions, doc.set_toc_item() is easier to recover from because of its granular approach (it changes only one xref at a time). doc.set_toc() consists of two internal steps, both of which are bulk changes to all xref numbers dealing with TOC storage, making it impossible to roll back.
Other advantages of using doc.set_toc_item():
- If a PDF is signed, only incremental changes are possible without invalidating the signature. Also, incremental saves are fast, but always increase the file size. So you should generally be interested in small overall changed data volumes in order to minimize those increments. This makes the method an ideal candidate for small changes to the TOC.
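A small TOC change followed by an incremental save might be sketched as below. The helper name save_small_change is hypothetical. The sketch assumes the PyMuPDF document methods can_save_incrementally() and saveIncr(); please check the documentation of your PyMuPDF version before relying on them.

```python
def save_small_change(doc):
    """Save `doc` incrementally, or raise if that is not possible.

    `doc` is assumed to be a PyMuPDF PDF document that was opened from
    a file and then modified, for instance with doc.set_toc_item().
    """
    if not doc.can_save_incrementally():
        raise RuntimeError("this document cannot be saved incrementally")
    doc.saveIncr()  # writes the changes back to the original file


# Possible usage (file name and new title are placeholders):
# import fitz
# doc = fitz.open("example.pdf")
# doc.set_toc_item(0, title="New title")
# save_small_change(doc)
```

Checking first avoids a failed save on documents where an incremental update is not possible.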
There are disadvantages, too:
- You cannot change the overall TOC structure with doc.set_toc_item(), only the content of each item. You cannot add items, change hierarchy levels, or change positions inside the TOC list.
- You cannot delete TOC items either ... but there is the method doc.del_toc_item(idx), which sets the item's target to empty and its color to some gray.