Initial Commit
root committed Apr 17, 2015
0 parents commit 8c8b264
Showing 208 changed files with 13,036 additions and 0 deletions.
122 changes: 122 additions & 0 deletions ANALYZING_OOXML.txt
@@ -0,0 +1,122 @@
# Analyzing OOXML with OfficeDissector

OOXML files can be thought of at six levels of abstraction:

* As a file
* As a Zip archive
* As a set of Parts (a Part has a _name_ and an array of data bytes called a _stream_)
* As a set of Parts with data in well defined formats (usually XML)
* As a graph of Parts, connected by Relationships, and having associated metadata (specifically Content-Type)
* As a document with Features and CoreProperties

Each abstraction level completely contains all the levels beneath it; there are no leaks. That is, all Features are implemented as Parts+Relationships, all Parts are contained in the Zip file, etc. This concept will be explained more below.

Different types of analysis can be done at each of these levels.

## File

At its simplest, an OOXML file is a single file; that is, a filename and a sequence of bytes. OfficeDissector's Document class takes this file and exposes its deeper content:

import officedissector
doc = officedissector.doc.Document('test/fraunhoferlibrary/Artikel.docx') # Returns a Document object

Note that the filename's extension is significant and affects the behavior of Microsoft Office. Member variables `doc.type`, `doc.is_macro_enabled` and `doc.is_template` expose the meaning of the filename extension.
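
As a rough illustration of what a filename extension can convey, here is a minimal standard-library sketch (Python 3, not OfficeDissector code). The table `OOXML_EXTENSIONS` and the helper `classify_extension` are hypothetical names invented for this example, and the table covers only the most common Word, Excel and PowerPoint extensions.

```python
import os

# Hypothetical lookup table: extension -> (application, macro_enabled, template).
OOXML_EXTENSIONS = {
    ".docx": ("word", False, False),
    ".docm": ("word", True, False),
    ".dotx": ("word", False, True),
    ".dotm": ("word", True, True),
    ".xlsx": ("excel", False, False),
    ".xlsm": ("excel", True, False),
    ".xltx": ("excel", False, True),
    ".xltm": ("excel", True, True),
    ".pptx": ("powerpoint", False, False),
    ".pptm": ("powerpoint", True, False),
    ".potx": ("powerpoint", False, True),
    ".potm": ("powerpoint", True, True),
}


def classify_extension(filename):
    """Return (application, macro_enabled, template), or None if unrecognized."""
    extension = os.path.splitext(filename)[1].lower()
    return OOXML_EXTENSIONS.get(extension)


print(classify_extension("Artikel.docx"))
```

The lookup is case-insensitive, so `report.DOTM` is treated the same as `report.dotm`.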

## Zip file

An OOXML file is a Zip archive (each member of this archive is called a Part, described below). The method `doc.zip()` returns a Zip object that provides access at this level of abstraction. However, most analysis is best performed at the deeper levels of abstraction below.
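
To see the Zip level on its own, the following sketch uses only Python 3's standard `zipfile` module rather than OfficeDissector. It builds a tiny stand-in archive in memory (the two members and their contents are made up for the example) and then lists and reads its members the way one would with a real OOXML file.

```python
import io
import zipfile

# Build a tiny stand-in archive in memory so the example runs without a real document.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("word/document.xml", "<document/>")

# Reopen the same bytes for reading, as one would open a .docx file on disk.
with zipfile.ZipFile(buf) as zf:
    member_names = zf.namelist()
    first_bytes = zf.read("word/document.xml")[:5]

print(member_names)
print(first_bytes)
```

Note that Zip member names carry no leading slash, while Part names such as `/word/document.xml` do.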

## Part

Parts are the heart of OOXML. **The entire Document is defined by its Parts** (other than the limited information provided by the filename's extension and Zip metadata described above). Even the Parts' own metadata is stored in Parts (as will be described below). Thus, **analyzing OOXML is about analyzing Parts**.

`Document.parts` provides a List of all Parts in the document, and `Document.parts_by_name` provides a Dictionary of Parts by their name.

At their simplest level, Parts have only two properties: a _name_ (exposed through `Part.name`) and an array of data bytes called a _stream_ (exposed through `Part.stream()`).

for p in doc.parts:
    print p.name                 # p.name returns a String
    print p.stream().read(10)    # p.stream() returns a File-like object

Note that, with a few exceptions, the Part's name is irrelevant and will not affect the behavior of Office.
Parts' roles are instead determined by their Content-Type and Relationships (described below).

## Content-Type

All Parts have a Content-Type, such as `image/png` or `application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml`,
exposed via the `Part.content_type()` method. (Content-Types are defined by a single Part with a well known name,
`[Content_Types].xml`, which OfficeDissector automatically parses.)
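
The following sketch shows how Content-Types can be resolved from a `[Content_Types].xml` Part, using Python 3's standard library. It is not OfficeDissector's implementation. The sample XML is hand-written for the example and `build_content_type_lookup` is a hypothetical helper. `Override` entries name a specific Part and take precedence; `Default` entries apply by file extension.

```python
import posixpath
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
MAIN_CT = ("application/vnd.openxmlformats-officedocument."
           "wordprocessingml.document.main+xml")

# Hand-written sample standing in for a real [Content_Types].xml Part.
SAMPLE_TYPES = (
    '<Types xmlns="%s">'
    '<Default Extension="jpeg" ContentType="image/jpeg"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="%s"/>'
    '</Types>' % (CT_NS, MAIN_CT)
)


def build_content_type_lookup(xml_text):
    """Return a function mapping a Part name to its Content-Type (or None)."""
    root = ET.fromstring(xml_text)
    defaults = {el.get("Extension").lower(): el.get("ContentType")
                for el in root.findall("{%s}Default" % CT_NS)}
    overrides = {el.get("PartName"): el.get("ContentType")
                 for el in root.findall("{%s}Override" % CT_NS)}

    def lookup(part_name):
        # An Override for the exact Part name wins over any extension Default.
        if part_name in overrides:
            return overrides[part_name]
        extension = posixpath.splitext(part_name)[1].lstrip(".").lower()
        return defaults.get(extension)

    return lookup


lookup = build_content_type_lookup(SAMPLE_TYPES)
print(lookup("/word/document.xml"))
print(lookup("/word/media/image1.jpeg"))
```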

## Relationships

Parts have Relationships to other parts (or external resources), forming a graph-like structure. (Relationships themselves are defined by dedicated `.rels` Parts,
which OfficeDissector automatically finds and parses.)

`Document.relationships` provides a List of all Relationships in the Document. `Part.relationships_in()` provides a List of all Relationships pointing _to_ a Part; and `Part.relationships_out()` provides a List of all
Relationships _from_ a Part.
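
To make the graph structure concrete, here is a standard-library sketch (Python 3, not OfficeDissector code) that turns the contents of a `.rels` Part into a list of edges. The sample XML and the helper `parse_relationships` are assumptions made for this example. Internal targets are resolved relative to the source Part's directory, while external targets are left untouched.

```python
import posixpath
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OD_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Hand-written sample standing in for /word/_rels/document.xml.rels.
SAMPLE_RELS = (
    '<Relationships xmlns="%s">'
    '<Relationship Id="rId8" Type="%s/endnotes" Target="endnotes.xml"/>'
    '<Relationship Id="rId9" Type="%s/hyperlink" '
    'Target="http://example.com/" TargetMode="External"/>'
    '</Relationships>' % (REL_NS, OD_REL, OD_REL)
)


def parse_relationships(source_part, rels_xml):
    """Return one dict per Relationship leaving source_part."""
    base_dir = posixpath.dirname(source_part)
    edges = []
    for rel in ET.fromstring(rels_xml).findall("{%s}Relationship" % REL_NS):
        target = rel.get("Target")
        external = rel.get("TargetMode") == "External"
        if not external:
            # Internal targets are relative to the source Part's directory.
            target = posixpath.normpath(posixpath.join(base_dir, target))
        edges.append({"id": rel.get("Id"), "type": rel.get("Type"),
                      "source": source_part, "target": target,
                      "external": external})
    return edges


edges = parse_relationships("/word/document.xml", SAMPLE_RELS)
for edge in edges:
    print(edge["id"], edge["target"], edge["external"])
```

Grouping these edges by `target` gives the incoming Relationships of each Part; grouping by `source` gives the outgoing ones.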


## Finding Parts

OfficeDissector provides several ways to find Parts of interest:

* `Document.main_part()` returns the main Part; each Document has exactly one main Part
* `Document.parts_by_content_type()` and `Document.parts_by_content_type_regex()` find all Parts with a particular Content-Type
* `Document.parts_by_relationship_type()` allows finding all Parts which have an incoming Relationship of a particular type.

In nearly every case, Office's treatment of a Part is solely determined by the Part's Content-Type and incoming Relationship types. Thus,
these methods provide the best way to perform most analysis.
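
The two lookup strategies above can be sketched over plain data structures. This is only an illustration in standard-library Python 3: the dictionaries are hand-written stand-ins for what a parser would extract, and `find_by_content_type_regex` and `find_by_incoming_relationship` are hypothetical helpers, not OfficeDissector's methods.

```python
import re

OD_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
WML = "application/vnd.openxmlformats-officedocument.wordprocessingml"

# Hand-written stand-ins: Part name -> Content-Type.
CONTENT_TYPES = {
    "/word/document.xml": WML + ".document.main+xml",
    "/word/endnotes.xml": WML + ".endnotes+xml",
    "/word/media/image1.jpeg": "image/jpeg",
}

# Hand-written stand-ins: (source Part, Relationship type, target Part).
RELATIONSHIPS = [
    ("/word/document.xml", OD_REL + "/endnotes", "/word/endnotes.xml"),
    ("/word/theme/theme1.xml", OD_REL + "/image", "/word/media/image1.jpeg"),
]


def find_by_content_type_regex(pattern):
    """Return sorted names of Parts whose Content-Type matches the regex."""
    return sorted(name for name, content_type in CONTENT_TYPES.items()
                  if re.search(pattern, content_type))


def find_by_incoming_relationship(rel_type):
    """Return sorted names of Parts targeted by a Relationship of rel_type."""
    return sorted({target for _, kind, target in RELATIONSHIPS
                   if kind == rel_type})


print(find_by_content_type_regex(r"^image/"))
print(find_by_incoming_relationship(OD_REL + "/endnotes"))
```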

## XML Parts

Most Parts are in XML format. OfficeDissector will parse the XML for you, using the `Part.xml()` method, or perform XPath queries (recommended), using the `Part.xpath()` method.
XPath is a very powerful tool which makes most OOXML analysis much simpler.

OOXML makes heavy use of XML Namespaces. XML Namespaces can be confusing if you do not have experience with them.
See XML_NAMESPACES.txt for a quick introduction.
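
As a small illustration of why namespaces matter in queries, the sketch below uses Python 3's standard `xml.etree.ElementTree` (which supports a subset of XPath) rather than `Part.xpath()`. The sample document is hand-written for the example. A query must map the `w` prefix to the WordprocessingML namespace URI; an unprefixed query finds nothing.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written sample standing in for the XML of a main document Part.
SAMPLE_DOCUMENT = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r><w:t>Hello</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> world</w:t></w:r>'
    '</w:p></w:body></w:document>' % W_NS
)

root = ET.fromstring(SAMPLE_DOCUMENT)

# The prefix used in the query is resolved through this mapping, not the file.
texts = [t.text for t in root.findall(".//w:t", {"w": W_NS})]

# Without the namespace, the same element name matches nothing.
unqualified = root.findall(".//t")

print(texts)
print(unqualified)
```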

## Other Parts

Multimedia and embedded objects are typically stored in their own Parts, one Part per object. For example, each image comprises its own Part,
with the appropriate Content-Type (e.g. `image/png`) and Relationships. Since each image is kept separate, in its native format, and with its
Content-Type exposed, analyzing them is easy: The image's data is readable as `part.stream()`. For example:

jpegs = doc.parts_by_content_type('image/jpeg')
jpegs[0].name # Returns '/word/media/image1.jpeg'
jpegs[0].stream().read(10) # Returns '\xff\xd8\xff\xe0\x00\x10JFIF', the first 10 bytes of the JPEG data

## Features and CoreProperties

The Features object (retrieved via `Document.features`) provides access to common Document features, such as macros, images, videos, and
sounds. This interface provides convenient access to the features most relevant to security analysis. (It's important to realize that all of
these features are simply Parts found by their Content-Type and Relationships.)

The CoreProperties object (retrieved via `Document.core_properties`) parses and exposes the Document's Core Properties, such as `creator`
and `modified`; these properties are very useful for security analysis and forensics.
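
Core Properties live in an XML Part of their own, so they can also be read directly. The sketch below is a standard-library Python 3 illustration, not OfficeDissector's code. The sample XML (including the author and editor names) is invented for the example, and `read_core_properties` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hand-written sample standing in for a core properties Part.
SAMPLE_CORE = (
    '<cp:coreProperties xmlns:cp="%(cp)s" xmlns:dc="%(dc)s" '
    'xmlns:dcterms="%(dcterms)s">'
    '<dc:creator>Example Author</dc:creator>'
    '<cp:lastModifiedBy>Example Editor</cp:lastModifiedBy>'
    '<dcterms:modified>2009-12-04T14:47:00Z</dcterms:modified>'
    '</cp:coreProperties>' % NAMESPACES
)


def read_core_properties(xml_text):
    """Return a dict of a few core properties; missing ones map to None."""
    root = ET.fromstring(xml_text)
    return {
        "creator": root.findtext("dc:creator", namespaces=NAMESPACES),
        "last_modified_by": root.findtext("cp:lastModifiedBy", namespaces=NAMESPACES),
        "modified": root.findtext("dcterms:modified", namespaces=NAMESPACES),
        "title": root.findtext("dc:title", namespaces=NAMESPACES),
    }


properties = read_core_properties(SAMPLE_CORE)
print(properties)
```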

# How do I...?

To **get started**, install OfficeDissector (see Installing in README.txt), and begin with these two lines:

import officedissector
doc = officedissector.doc.Document('path/to/your/ooxml.docx') # Returns a Document object

To **interactively analyze an OOXML document**, use ipython. See Usage in README.txt for an example.

To **automatically analyze a large volume of documents**, use plugins. See mastiff-plugins/README.txt.

To **learn OfficeDissector**, use ipython, and press the TAB key to see the methods available.
(This is demonstrated in Usage in README.txt.) Look at the full API docs for more details.

If you want to **retrieve specific features and properties of a document**, use the Features and CoreProperties interface, described above.

The best way to **explore the behavior of Office** is to retrieve Parts by their Content-Type and Relationships; see Finding Parts above.
As far as we know, all of Office's behavior can be reconstructed through this technique. (Office also looks at the filename's extension;
see the section "File" above.) **This is therefore the most general purpose and powerful mode of analysis**. (This is the method OfficeDissector uses
internally to find Features, such as multimedia.)

To **drill deeper into an XML Part**, use `Part.xpath()`, paying careful attention to XML namespaces; see the section "XML Parts" above. (This is
the method OfficeDissector uses internally to find and parse CoreProperties, such as `creator`.)

To **export the data into other tools**, use the `to_json()` method, which nearly all OfficeDissector objects provide.
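
Binary streams cannot be placed in JSON directly, so an export has to encode them. The following standard-library Python 3 sketch shows one way to do this with base64. The helper `part_to_json` is hypothetical; its key names mirror the sample output shown under Usage in README.txt, but it is not OfficeDissector's `to_json()`.

```python
import base64
import json


def part_to_json(name, content_type, stream_bytes, include_stream=False):
    """Serialize a Part's basic metadata, optionally with base64-encoded data."""
    record = {"uri": name, "content-type": content_type}
    if include_stream:
        # base64 yields ASCII-safe text that can live inside a JSON string.
        record["stream_b64"] = base64.b64encode(stream_bytes).decode("ascii")
    return json.dumps(record, indent=2, sort_keys=True)


jpeg_header = b"\xff\xd8\xff\xe0\x00\x10"
exported = part_to_json("/word/media/image1.jpeg", "image/jpeg",
                        jpeg_header, include_stream=True)
print(exported)

# A consuming tool can recover the original bytes from the export.
recovered = base64.b64decode(json.loads(exported)["stream_b64"])
```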

10 changes: 10 additions & 0 deletions LEGAL.txt
@@ -0,0 +1,10 @@

Copyright (C) 2013-2015 Grier Forensics

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE
OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

191 changes: 191 additions & 0 deletions README.txt
@@ -0,0 +1,191 @@
# OfficeDissector

OfficeDissector is a parser library for static security analysis of Office Open XML (OOXML) Documents,
created by Grier Forensics for the Cyber System Assessments Group at MIT's Lincoln Laboratory.

OfficeDissector is the first parser designed specifically for security analysis of OOXML documents. It exposes all internals, including
document properties, parts, content-type, relationships, embedded macros and multimedia, comments, and more.
It provides full JSON export and a MASTIFF-based plugin architecture. It also includes a nearly 600 MB test corpus, unit tests with nearly
100% coverage, smoke tests running against the entire corpus, and simple, well-factored, fully commented code.

## Install

OfficeDissector requires the lxml package and Python version 2.7.

To use OfficeDissector without installing it, set the `PYTHONPATH` to the `officedissector` directory:

$ export PYTHONPATH=/path/to/thisfolder

Alternatively, it can be installed using pip (recommended) or python setup:

$ sudo pip install /path/to/thisfolder # Recommended, as pip supports uninstall
$ sudo python setup.py install # Alternative

## Documentation

To view OfficeDissector documentation, open in a browser:

$ doc/html/index.html

## Testing

To test, first set PYTHONPATH or install `officedissector` as described above. Then:

# Unit tests
$ cd test/unit_test
$ python test_officedissector.py

# Smoke tests
$ cd test
$ python smoke_tests.py

The smoke tests write log files containing more detailed information about each run.

## MASTIFF Plugins

To find more information about the MASTIFF architecture and sample plugins, see
`mastiff-plugins/README.txt`.

## Usage

Below is an ipython session demonstrating usage of OfficeDissector:

$ ipython
In [1]: import officedissector
In [2]: doc = officedissector.doc.Document('test/fraunhoferlibrary/Artikel.docx')
In [4]: doc.is_macro_enabled
Out[4]: False

In [5]: doc.is_template
Out[5]: False

In [6]: mp = doc.main_part()
In [7]: mp.content_type()
Out[7]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml'

In [9]: mp.name
Out[9]: '/word/document.xml'

In [10]: mp.content_type()
Out[10]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml'

# We can read the part's stream of data:
In [17]: mp.stream().read(200)
Out[17]: '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-c'

# Or use XPath to parse it:
In [33]: t = mp.xpath('//w:t', {'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main"})
In [37]: t[2].text
Out[37]: u'Das vorliegende Dokument ist ein Beispiel f\xfcr einen zur Publikation in einer Zeitschrift vorgesehenen Artikel. Es verwendet f\xfcr Autor und Titel in den Dokumenteigenschaften festgelegte Eintr\xe4ge.'

# All Relationships in and out are exposed:
In [38]: mp.relationships_in()
Out[38]: [Relationship [rId1] (source Part [RootPart])]

In [39]: mp.relationships_out()
Out[39]:
[Relationship [rId8] (source Part [/word/document.xml]),
Relationship [rId13] (source Part [/word/document.xml]),
Relationship [rId3] (source Part [/word/document.xml]),
...
Relationship [rId14] (source Part [/word/document.xml])]

In [40]: rel = mp.relationships_out()[0]
In [43]: rel.type
Out[43]: 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/endnotes'

In [46]: endnotes = rel.target_part
In [48]: endnotes.content_type()
Out[48]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml'

# Any Part (or the entire Document) can be exported to JSON:
In [50]: print endnotes.to_json()
{
"content-type": "application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml",
"uri": "/word/endnotes.xml",
"relationships_out": [],
"relationships_in": [
"Relationship [rId8] (source Part [/word/document.xml])"
]
}

# Features are automatically exposed:
In [55]: doc.features.[TAB]
...
doc.features.comments
doc.features.custom_properties
doc.features.custom_xml
doc.features.digital_signatures
doc.features.doc
doc.features.embedded_controls
doc.features.embedded_objects
doc.features.embedded_packages
doc.features.fonts
doc.features.get_parts
doc.features.get_union
doc.features.images
doc.features.macros
doc.features.sounds
doc.features.videos

In [55]: doc.features.images
Out[55]: [Part [/word/media/image1.jpeg]]

In [56]: image = doc.features.images[0]
In [58]: image.content_type()
Out[58]: 'image/jpeg'

# We can export the binary data to JSON as well, by setting include_stream = True:
In [61]: print image.to_json(include_stream = True)
{
"stream_b64": "/9j/4AAQSkZJRgABAQEASABIAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAAFAAUDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3uGGO3iWKJdqL0Gc0UUUAf//Z",
"content-type": "image/jpeg",
"uri": "/word/media/image1.jpeg",
"relationships_out": [],
"relationships_in": [
"Relationship [rId1] (source Part [/word/theme/theme1.xml])"
]
}

# Check for macros:
In [62]: doc.features.macros
Out[62]: []

# Or comments:
In [63]: doc.features.comments
Out[63]: []

# Core properties are exposed:
In [64]: doc.core_properties.[TAB]
...
doc.core_properties.content_status
doc.core_properties.core_prop_part
doc.core_properties.created
doc.core_properties.creator
doc.core_properties.description
doc.core_properties.identifier
doc.core_properties.keywords
doc.core_properties.language
doc.core_properties.last_modified_by
doc.core_properties.last_printed
doc.core_properties.modified
doc.core_properties.name
doc.core_properties.parse_all
doc.core_properties.parse_prop
doc.core_properties.revision
doc.core_properties.subject
doc.core_properties.title
doc.core_properties.version
doc.core_properties.category

In [68]: doc.core_properties.modified
Out[68]: '2009-12-04T14:47:00Z'

## Analyzing OOXML

See `doc/txt/ANALYZING_OOXML.txt` for a quick start guide on how to use
OfficeDissector to analyze OOXML documents.

## API

For more details about OfficeDissector, see the API documentation at `doc/html/rst/api.html`.