root committed Apr 17, 2015 (0 parents, commit 8c8b264)
Showing 208 changed files with 13,036 additions and 0 deletions.
@@ -0,0 +1,122 @@ | ||
# Analyzing OOXML with OfficeDissector | ||
|
||
OOXML files can be thought of at six levels of abstraction: | ||
|
||
* As a file | ||
* As a Zip archive | ||
* As a set of Parts (a Part has a _name_ and an array of data bytes called a _stream_) | ||
* As a set of Parts with data in well defined formats (usually XML) | ||
* As a graph of Parts, connected by Relationships, and having associated metadata (specifically Content-Type) | ||
* As a document with Features and CoreProperties | ||
|
||
Each abstraction level completely contains all the levels beneath it; there are no leaks. That is, all Features are implemented as Parts+Relationships, all Parts are contained in the Zip file, etc. This concept will be explained more below. | ||
|
||
Different types of analysis can be done at each of these levels. | ||
|
||
## File | ||
|
||
At its simplest, an OOXML file is a single file; that is, a filename and a sequence of bytes. OfficeDissector's Document class takes this file and exposes its deeper content: | ||
|
||
import officedissector | ||
doc = officedissector.doc.Document('test/fraunhoferlibrary/Artikel.docx') # Returns a Document object | ||
|
||
Note that the filename's extension is significant and affects the behavior of Microsoft Office. Member variables `doc.type`, `doc.is_macro_enabled` and `doc.is_template` expose the meaning of the filename extension. | ||
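As a rough illustration of what the extension encodes, the sketch below maps the common OOXML extensions to an application, a macro-enabled flag and a template flag. The table and the `classify_extension` helper are hypothetical and are not OfficeDissector's code; the values the library stores in `doc.type` may well differ.

```python
import os

# Hypothetical lookup table (an assumption, not OfficeDissector's code):
# extension -> (application, is_macro_enabled, is_template)
OOXML_EXTENSIONS = {
    '.docx': ('Word', False, False),
    '.docm': ('Word', True, False),
    '.dotx': ('Word', False, True),
    '.dotm': ('Word', True, True),
    '.xlsx': ('Excel', False, False),
    '.xlsm': ('Excel', True, False),
    '.xltx': ('Excel', False, True),
    '.xltm': ('Excel', True, True),
    '.pptx': ('PowerPoint', False, False),
    '.pptm': ('PowerPoint', True, False),
    '.potx': ('PowerPoint', False, True),
    '.potm': ('PowerPoint', True, True),
}


def classify_extension(filename):
    """Return (application, is_macro_enabled, is_template), or None if unknown."""
    ext = os.path.splitext(filename)[1].lower()
    return OOXML_EXTENSIONS.get(ext)
```

For example, `classify_extension('Artikel.docx')` gives `('Word', False, False)`, which agrees with the `is_macro_enabled` and `is_template` values shown for that file in the README's Usage session.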
|
||
## Zip file | ||
|
||
An OOXML file is a Zip archive (each member of this archive is called a Part, described below). The `doc.zip()` method returns a Zip object that provides access at this level of abstraction. However, most analysis is best performed at the deeper levels of abstraction below. | ||
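To see what the Zip level looks like without OfficeDissector, the following sketch uses only Python's standard `zipfile` module. It builds a tiny OOXML-like archive in memory (the member contents are placeholders, not a valid document) and lists its members; `build_sample_package` and `list_members` are names made up for this illustration.

```python
import io
import zipfile


def build_sample_package():
    """Build a tiny, OOXML-like Zip archive in memory (placeholder contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        zf.writestr('[Content_Types].xml', '<Types/>')
        zf.writestr('_rels/.rels', '<Relationships/>')
        zf.writestr('word/document.xml', '<document/>')
    buf.seek(0)
    return buf


def list_members(fileobj):
    """Return (name, uncompressed size) for every member of the archive."""
    with zipfile.ZipFile(fileobj) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


members = list_members(build_sample_package())
```

Passing the path of a real `.docx` file to `list_members` should work the same way, since `zipfile.ZipFile` accepts either a path or a file-like object.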
|
||
## Part | ||
|
||
Parts are the heart of OOXML. **The entire Document is defined by its Parts** (other than the limited information provided by the filename's extension and Zip metadata described above). Even the Parts' own metadata is stored in Parts (as will be described below). Thus, **analyzing OOXML is about analyzing Parts**. | ||
|
||
`Document.parts` provides a List of all Parts in the document, and `Document.parts_by_name` provides a Dictionary of Parts by their name. | ||
|
||
At their simplest level, Parts have only two properties: a _name_ (exposed through `Part.name`) and an array of data bytes called a _stream_ (exposed through `Part.stream()`). | ||
|
||
for p in doc.parts: | ||
print p.name # p.name returns a String | ||
print p.stream().read(10) # p.stream() returns a File-like object | ||
|
||
Note that, with a few exceptions, the Part's name is irrelevant and will not affect the behavior of Office. | ||
Parts' roles are instead determined by their Content-Type and Relationships (described below). | ||
|
||
## Content-Type | ||
|
||
All Parts have a Content-Type, such as `image/png` or `application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml`, | ||
exposed via the `Part.content_type()` method. (Content-Types are defined by a single Part with a well known name, | ||
`[Content_Types].xml`, which OfficeDissector automatically parses.) | ||
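The sketch below shows, under simplifying assumptions, how a Content-Type can be resolved from the content-types Part (named `[Content_Types].xml` in the Open Packaging Conventions). It uses the standard library's ElementTree rather than OfficeDissector, the sample XML is made up, and `content_type_for` is a hypothetical helper: an `Override` for the exact Part name wins, otherwise the `Default` for the file extension applies.

```python
import xml.etree.ElementTree as ET

CT_NS = 'http://schemas.openxmlformats.org/package/2006/content-types'

# Made-up sample of a content-types Part, for illustration only.
SAMPLE_TYPES = (
    '<Types xmlns="%s">'
    '<Default Extension="jpeg" ContentType="image/jpeg"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="'
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>' % CT_NS
)


def content_type_for(part_name, types_xml):
    """Resolve a Part's Content-Type: Override first, then Default by extension."""
    root = ET.fromstring(types_xml)
    for el in root.findall('{%s}Override' % CT_NS):
        if el.get('PartName') == part_name:
            return el.get('ContentType')
    ext = part_name.rsplit('.', 1)[-1].lower()
    for el in root.findall('{%s}Default' % CT_NS):
        if el.get('Extension').lower() == ext:
            return el.get('ContentType')
    return None
```

This simplified version compares Part names exactly; a robust parser would also need to handle case-insensitive Part names and malformed input.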
|
||
## Relationships | ||
|
||
Parts have Relationships to other parts (or external resources), forming a graph-like structure. (Relationships themselves are defined by dedicated `.rels` Parts, | ||
which OfficeDissector automatically finds and parses.) | ||
|
||
`Document.relationships` provides a List of all Relationships in the Document. `Part.relationships_in()` provides a List of all Relationships pointing _to_ a Part; and `Part.relationships_out()` provides a List of all | ||
Relationships _from_ a Part. | ||
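For intuition about what a `.rels` Part contains, here is a minimal sketch that parses one with the standard library. It is not OfficeDissector's implementation: the sample XML, the `rId9` hyperlink entry and the `parse_rels` helper are assumptions made for this example. Internal targets are resolved relative to the source Part's directory, while targets marked `TargetMode="External"` are left untouched.

```python
import posixpath
import xml.etree.ElementTree as ET

REL_NS = 'http://schemas.openxmlformats.org/package/2006/relationships'
OFFICE_REL = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'

# Made-up sample of /word/_rels/document.xml.rels, for illustration only.
SAMPLE_RELS = (
    '<Relationships xmlns="%s">'
    '<Relationship Id="rId8" Type="%s/endnotes" Target="endnotes.xml"/>'
    '<Relationship Id="rId9" Type="%s/hyperlink"'
    ' Target="http://example.com/" TargetMode="External"/>'
    '</Relationships>' % (REL_NS, OFFICE_REL, OFFICE_REL)
)


def parse_rels(source_part, rels_xml):
    """Return one dict per Relationship going out of source_part."""
    base = posixpath.dirname(source_part)
    rels = []
    for el in ET.fromstring(rels_xml).findall('{%s}Relationship' % REL_NS):
        external = el.get('TargetMode') == 'External'
        target = el.get('Target')
        if not external:
            # Internal targets are relative to the source Part's directory.
            target = posixpath.normpath(posixpath.join(base, target))
        rels.append({'id': el.get('Id'), 'type': el.get('Type'),
                     'target': target, 'external': external})
    return rels


rels = parse_rels('/word/document.xml', SAMPLE_RELS)
```

Here the first Relationship resolves to `/word/endnotes.xml`, matching the endnotes Part seen in the README's Usage session.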
|
||
|
||
## Finding Parts | ||
|
||
OfficeDissector provides several ways to find Parts of interest: | ||
|
||
* `Document.main_part()` returns the main Part; each Document has exactly one main Part | ||
* `Document.parts_by_content_type()` and `Document.parts_by_content_type_regex()` allow finding all Parts with a particular Content-Type | ||
* `Document.parts_by_relationship_type()` allows finding all Parts which have an incoming Relationship of a particular type. | ||
|
||
In nearly every case, Office's treatment of a Part is solely determined by the Part's Content-Type and incoming Relationship types. Thus, | ||
these methods provide the best way to perform most analysis. | ||
|
||
## XML Parts | ||
|
||
Most Parts are in XML format. OfficeDissector will parse the XML for you, using the `Part.xml()` method, or perform XPath queries (recommended), using the `Part.xpath()` method. | ||
XPath is a very powerful tool which makes most OOXML analysis much simpler. | ||
|
||
OOXML makes heavy use of XML Namespaces. XML Namespaces can be confusing if you do not have experience with them. | ||
See XML_NAMESPACES.txt for a quick introduction. | ||
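The namespace pitfall can be shown with the standard library alone. The sketch below is an assumption-laden illustration, not `Part.xpath()` itself: it runs a namespace-aware query over a made-up WordprocessingML fragment using ElementTree's limited XPath support, and shows that the same query without the namespace prefix finds nothing.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the prefix 'w' is arbitrary, only the URI must match.
NS = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

# Made-up document fragment, for illustration only.
SAMPLE_DOC = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p>'
    '<w:r><w:t>Hello</w:t></w:r>'
    '<w:r><w:t>world</w:t></w:r>'
    '</w:p></w:body>'
    '</w:document>'
)

root = ET.fromstring(SAMPLE_DOC)

# Namespace-qualified query: finds both text runs.
texts = [t.text for t in root.findall('.//w:t', NS)]

# Unqualified query: matches nothing, because the elements live in a namespace.
unqualified = root.findall('.//t')
```

An empty result from an XPath query on an OOXML Part is very often a sign of a missing or wrong namespace mapping rather than missing content.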
|
||
## Other Parts | ||
|
||
Multimedia and embedded objects are typically stored in their own Parts, one Part per object. For example, each image comprises its own Part, | ||
with the appropriate Content-Type (e.g. `image/png`) and Relationships. Since each image is kept separate, in its native format, and with its | ||
Content-Type exposed, analyzing them is easy: The image's data is readable as `part.stream()`. For example: | ||
|
||
jpegs = doc.parts_by_content_type('image/jpeg') | ||
jpegs[0].name # Returns '/word/media/image1.jpeg' | ||
jpegs[0].stream().read(10) # Returns '\xff\xd8\xff\xe0\x00\x10JFIF', the first 10 bytes of the JPEG data | ||
|
||
## Features and CoreProperties | ||
|
||
The Features object (retrieved via `Document.features`) provides access to common Document features, such as macros, images, videos, and | ||
sounds. This interface provides convenient access to the features most relevant to security analysis. (It's important to realize that all of | ||
these features are simply Parts found by their Content-Type and Relationships.) | ||
|
||
The CoreProperties object (retrieved via `Document.core_properties`) parses and exposes the Document's Core Properties, such as `creator` | ||
and `modified`; these properties are very useful for security analysis and forensics. | ||
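Core Properties are themselves stored in an XML Part (typically `/docProps/core.xml`). As a sketch of what parsing them involves, the example below reads two properties from a made-up core-properties fragment with the standard library. The `core_property` helper and the sample author are hypothetical, and OfficeDissector's own parsing may differ.

```python
import xml.etree.ElementTree as ET

NS = {
    'cp': 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties',
    'dc': 'http://purl.org/dc/elements/1.1/',
    'dcterms': 'http://purl.org/dc/terms/',
}

# Made-up core-properties fragment, for illustration only.
SAMPLE_CORE = (
    '<cp:coreProperties'
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
    ' xmlns:dcterms="http://purl.org/dc/terms/">'
    '<dc:creator>Example Author</dc:creator>'
    '<dcterms:modified>2009-12-04T14:47:00Z</dcterms:modified>'
    '</cp:coreProperties>'
)


def core_property(root, qname):
    """Return the text of a prefixed element such as 'dc:creator', or None."""
    el = root.find(qname, NS)
    return el.text if el is not None else None


core_root = ET.fromstring(SAMPLE_CORE)
creator = core_property(core_root, 'dc:creator')
modified = core_property(core_root, 'dcterms:modified')
```

Note that the properties are spread across three namespaces, which is why a namespace map is needed even for this small Part.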
|
||
# How do I...? | ||
|
||
To **get started**, install OfficeDissector (see Installing in README.txt), and begin with these two lines: | ||
|
||
import officedissector | ||
doc = officedissector.doc.Document('path/to/your/ooxml.docx') # Returns a Document object | ||
|
||
To **interactively analyze an OOXML document**, use ipython. See Usage in README.txt for an example. | ||
|
||
To **automatically analyze a large volume of documents**, use plugins. See mastiff-plugins/README.txt. | ||
|
||
To **learn OfficeDissector**, use ipython, and press the TAB key to see the methods available. | ||
(This is demonstrated in Usage in README.txt.) Look at the full API docs for more details. | ||
|
||
If you want to **retrieve specific features and properties of a document**, use the Features and CoreProperties interface, described above. | ||
|
||
The best way to **explore the behavior of Office** is to retrieve Parts by their Content-Type and Relationships; see Finding Parts above. | ||
As far as we know, all of Office's behavior can be reconstructed through this technique. (Office also looks at the filename's extension; | ||
see the section "File" above.) **This is therefore the most general purpose and powerful mode of analysis**. (This is the method OfficeDissector uses | ||
internally to find Features, such as multimedia.) | ||
|
||
To **drill deeper into an XML Part**, use `Part.xpath()`, paying careful attention to XML namespaces; see the section "XML Parts" above. (This is | ||
the method OfficeDissector uses internally to find and parse CoreProperties, such as `creator`.) | ||
|
||
To **export the data into other tools**, use the `to_json()` method, which nearly all OfficeDissector objects provide. | ||
|
@@ -0,0 +1,10 @@ | ||
|
||
Copyright (C) 2013-2015 Grier Forensics | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, | ||
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A | ||
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT | ||
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF | ||
CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE | ||
OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. | ||
|
@@ -0,0 +1,191 @@ | ||
# OfficeDissector | ||
|
||
OfficeDissector is a parser library for static security analysis of Office Open XML (OOXML) Documents, | ||
created by Grier Forensics for the Cyber System Assessments Group at MIT's Lincoln Laboratory. | ||
|
||
OfficeDissector is the first parser designed specifically for security analysis of OOXML documents. It exposes all internals, including | ||
document properties, parts, content types, relationships, embedded macros and multimedia, comments, and more. | ||
It provides full JSON export, and a MASTIFF based plugin architecture. It also includes a nearly 600 MB test corpus, unit tests with nearly | ||
100% coverage, smoke tests running against the entire corpus, and simple, well factored, fully commented code. | ||
|
||
## Install | ||
|
||
OfficeDissector requires the lxml package and Python version 2.7. | ||
|
||
To use OfficeDissector without installing it, set the `PYTHONPATH` to the `officedissector` directory: | ||
|
||
$ export PYTHONPATH=/path/to/thisfolder | ||
|
||
Alternatively, it can be installed using pip (recommended) or python setup: | ||
|
||
$ sudo pip install /path/to/thisfolder # Recommended, as pip supports uninstall | ||
$ sudo python setup.py install # Alternative | ||
|
||
## Documentation | ||
|
||
To view the OfficeDissector documentation, open the following file in a browser: | ||
|
||
$ doc/html/index.html | ||
|
||
## Testing | ||
|
||
To test, first set PYTHONPATH or install `officedissector` as described above. Then: | ||
|
||
# Unit tests | ||
$ cd test/unit_test | ||
$ python test_officedissector.py | ||
|
||
# Smoke tests | ||
$ cd test | ||
$ python smoke_tests.py | ||
|
||
The smoke tests write log files that contain more detailed information about each run. | ||
|
||
## MASTIFF Plugins | ||
|
||
To find more information about the MASTIFF architecture and sample plugins, see | ||
`mastiff-plugins/README.txt`. | ||
|
||
## Usage | ||
|
||
Below is an ipython session demonstrating usage of OfficeDissector: | ||
|
||
$ ipython | ||
In [1]: import officedissector | ||
In [2]: doc = officedissector.doc.Document('test/fraunhoferlibrary/Artikel.docx') | ||
In [4]: doc.is_macro_enabled | ||
Out[4]: False | ||
|
||
In [5]: doc.is_template | ||
Out[5]: False | ||
|
||
In [6]: mp = doc.main_part() | ||
In [7]: mp.content_type() | ||
Out[7]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml' | ||
|
||
In [9]: mp.name | ||
Out[9]: '/word/document.xml' | ||
|
||
In [10]: mp.content_type() | ||
Out[10]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml' | ||
|
||
# We can read the part's stream of data: | ||
In [17]: mp.stream().read(200) | ||
Out[17]: '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-c' | ||
|
||
# Or use XPath to parse it: | ||
In [33]: t = mp.xpath('//w:t', {'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}) | ||
In [37]: t[2].text | ||
Out[37]: u'Das vorliegende Dokument ist ein Beispiel f\xfcr einen zur Publikation in einer Zeitschrift vorgesehenen Artikel. Es verwendet f\xfcr Autor und Titel in den Dokumenteigenschaften festgelegte Eintr\xe4ge.' | ||
|
||
# All Relationships in and out are exposed: | ||
In [38]: mp.relationships_in() | ||
Out[38]: [Relationship [rId1] (source Part [RootPart])] | ||
|
||
In [39]: mp.relationships_out() | ||
Out[39]: | ||
[Relationship [rId8] (source Part [/word/document.xml]), | ||
Relationship [rId13] (source Part [/word/document.xml]), | ||
Relationship [rId3] (source Part [/word/document.xml]), | ||
... | ||
Relationship [rId14] (source Part [/word/document.xml])] | ||
|
||
In [40]: rel = mp.relationships_out()[0] | ||
In [43]: rel.type | ||
Out[43]: 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/endnotes' | ||
|
||
In [46]: endnotes = rel.target_part | ||
In [48]: endnotes.content_type() | ||
Out[48]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml' | ||
|
||
# Any Part (or the entire Document) can be exported to JSON: | ||
In [50]: print endnotes.to_json() | ||
{ | ||
"content-type": "application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml", | ||
"uri": "/word/endnotes.xml", | ||
"relationships_out": [], | ||
"relationships_in": [ | ||
"Relationship [rId8] (source Part [/word/document.xml])" | ||
] | ||
} | ||
|
||
# Features are automatically exposed: | ||
In [55]: doc.features.[TAB] | ||
... | ||
doc.features.comments | ||
doc.features.custom_properties | ||
doc.features.custom_xml | ||
doc.features.digital_signatures | ||
doc.features.doc | ||
doc.features.embedded_controls | ||
doc.features.embedded_objects | ||
doc.features.embedded_packages | ||
doc.features.fonts | ||
doc.features.get_parts | ||
doc.features.get_union | ||
doc.features.images | ||
doc.features.macros | ||
doc.features.sounds | ||
doc.features.videos | ||
|
||
In [55]: doc.features.images | ||
Out[55]: [Part [/word/media/image1.jpeg]] | ||
|
||
In [56]: image = doc.features.images[0] | ||
In [58]: image.content_type() | ||
Out[58]: 'image/jpeg' | ||
|
||
# We can export the binary data to JSON as well, by setting include_stream = True: | ||
In [61]: print image.to_json(include_stream = True) | ||
{ | ||
"stream_b64": "/9j/4AAQSkZJRgABAQEASABIAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAAFAAUDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3uGGO3iWKJdqL0Gc0UUUAf//Z", | ||
"content-type": "image/jpeg", | ||
"uri": "/word/media/image1.jpeg", | ||
"relationships_out": [], | ||
"relationships_in": [ | ||
"Relationship [rId1] (source Part [/word/theme/theme1.xml])" | ||
] | ||
} | ||
|
||
# Check for macros: | ||
In [62]: doc.features.macros | ||
Out[62]: [] | ||
|
||
# Or comments: | ||
In [63]: doc.features.comments | ||
Out[63]: [] | ||
|
||
# Core properties are exposed: | ||
In [64]: doc.core_properties.[TAB] | ||
... | ||
doc.core_properties.content_status | ||
doc.core_properties.core_prop_part | ||
doc.core_properties.created | ||
doc.core_properties.creator | ||
doc.core_properties.description | ||
doc.core_properties.identifier | ||
doc.core_properties.keywords | ||
doc.core_properties.language | ||
doc.core_properties.last_modified_by | ||
doc.core_properties.last_printed | ||
doc.core_properties.modified | ||
doc.core_properties.name | ||
doc.core_properties.parse_all | ||
doc.core_properties.parse_prop | ||
doc.core_properties.revision | ||
doc.core_properties.subject | ||
doc.core_properties.title | ||
doc.core_properties.version | ||
doc.core_properties.category | ||
|
||
In [68]: doc.core_properties.modified | ||
Out[68]: '2009-12-04T14:47:00Z' | ||
|
||
## Analyzing OOXML | ||
|
||
See `doc/txt/ANALYZING_OOXML.txt` for a quick start guide on how to use | ||
OfficeDissector to analyze OOXML documents. | ||
|
||
## API | ||
|
||
For more details about OfficeDissector, see the API documentation at `doc/html/rst/api.html`. |