Initial Commit

grierforensics · Apr 17, 2015 · commit 8c8b264

Showing 208 changed files with 13,036 additions and 0 deletions.
diff --git a/ANALYZING_OOXML.txt b/ANALYZING_OOXML.txt
@@ -0,0 +1,122 @@
+# Analyzing OOXML with OfficeDissector
+
+OOXML files can be thought of at six levels of abstraction:
+
+* As a file
+* As a Zip archive
+* As a set of Parts (a Part has a _name_ and an array of data bytes called a _stream_)
+* As a set of Parts with data in well defined formats (usually XML)
+* As a graph of Parts, connected by Relationships, and having associated metadata (specifically Content-Type)
+* As a document with Features and CoreProperties
+
+Each abstraction level completely contains all the levels beneath it; there are no leaks. That is, all Features are implemented as Parts+Relationships, all Parts are contained in the Zip file, etc. This concept will be explained more below.
+
+Different types of analysis can be done at each of these levels.
+
+## File
+
+At its simplest, an OOXML file is a single file; that is, a filename and a sequence of bytes. OfficeDissector's Document class takes this file and exposes its deeper content:
+
+ import officedissector
+ doc = officedissector.doc.Document('test/fraunhoferlibrary/Artikel.docx') # Returns a Document object
+
+Note that the filename's extension is significant and affects the behavior of Microsoft Office. The member variables `doc.type`, `doc.is_macro_enabled` and `doc.is_template` expose the meaning of the filename extension.
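As a rough illustration of how much the extension alone conveys, the sketch below maps common OOXML extensions to an application, a macro flag and a template flag. The lookup table and the `classify_extension` helper are hypothetical and use only the standard library; they are not OfficeDissector's implementation, and the actual values of `doc.type` may be spelled differently.

```python
import os

# Hypothetical lookup table: extension -> (application, macro-enabled, template).
# Covers common Word, Excel and PowerPoint extensions only.
OOXML_EXTENSIONS = {
    '.docx': ('Word', False, False),
    '.docm': ('Word', True, False),
    '.dotx': ('Word', False, True),
    '.dotm': ('Word', True, True),
    '.xlsx': ('Excel', False, False),
    '.xlsm': ('Excel', True, False),
    '.xltx': ('Excel', False, True),
    '.xltm': ('Excel', True, True),
    '.pptx': ('PowerPoint', False, False),
    '.pptm': ('PowerPoint', True, False),
    '.potx': ('PowerPoint', False, True),
    '.potm': ('PowerPoint', True, True),
}


def classify_extension(filename):
    """Return (application, is_macro_enabled, is_template), or None if unknown."""
    ext = os.path.splitext(filename)[1].lower()
    return OOXML_EXTENSIONS.get(ext)
```

The lookup is case-insensitive, so `report.DOCM` is still reported as a macro-enabled Word document.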
+
+## Zip file
+
+An OOXML file is a Zip archive (each member of this archive is called a Part, described below). The method `doc.zip()` returns a Zip object that gives access at this level of abstraction. However, most analysis is best performed at the deeper levels of abstraction below.
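To make the Zip level concrete, here is a minimal sketch that uses only the standard `zipfile` module rather than OfficeDissector. The in-memory archive built by `build_sample_package` is a deliberately incomplete stand-in for a real document, and both helper names are made up for this example.

```python
import io
import zipfile


def build_sample_package():
    """Build a tiny, incomplete OOXML-like Zip archive in memory (illustrative only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        zf.writestr('[Content_Types].xml', '<Types/>')
        zf.writestr('_rels/.rels', '<Relationships/>')
        zf.writestr('word/document.xml', '<document/>')
    buf.seek(0)
    return buf


def list_members(fileobj):
    """Return (name, uncompressed size) for every member of the Zip archive."""
    with zipfile.ZipFile(fileobj) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


members = list_members(build_sample_package())
```

Note that Zip member names carry no leading slash, whereas the Part names shown in this guide do (for example `/word/document.xml`).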
+
+## Part
+
+Parts are the heart of OOXML. **The entire Document is defined by its Parts** (other than the limited information provided by the filename's extension and Zip metadata described above). Even the Parts' own metadata is stored in Parts (as will be described below). Thus, **analyzing OOXML is about analyzing Parts**.
+
+`Document.parts` provides a List of all Parts in the document, and `Document.parts_by_name` provides a Dictionary of Parts by their name.
+
+At their simplest level, Parts have only two properties: a _name_ (exposed through `Part.name`) and an array of data bytes called a _stream_ (exposed through `Part.stream()`).
+
+ for p in doc.parts:
+     print p.name # p.name returns a String
+     print p.stream().read(10) # p.stream() returns a File-like object
+
+Note that, with a few exceptions, the Part's name is irrelevant and will not affect the behavior of Office. 
+Parts' roles are instead determined by their Content-Type and Relationships (described below).
+
+## Content-Type
+
+All Parts have a Content-Type, such as `image/png` or `application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml`, 
+exposed via the `Part.content_type()` method. (Content-Types are defined by a single Part with a well known name, 
+`[Content_Types].xml`, which OfficeDissector automatically parses.)
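The lookup that `[Content_Types].xml` implies can be sketched with the standard library alone. The sample XML and the `content_type_for` helper below are assumptions made for illustration; OfficeDissector's own parsing may differ in detail.

```python
import xml.etree.ElementTree as ET

CT_NS = 'http://schemas.openxmlformats.org/package/2006/content-types'

# Illustrative sample, not taken from a real document.
SAMPLE_CONTENT_TYPES = '''<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="xml" ContentType="application/xml"/>
  <Default Extension="jpeg" ContentType="image/jpeg"/>
  <Override PartName="/word/document.xml"
            ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>'''


def content_type_for(part_name, content_types_xml):
    """Resolve a Part's Content-Type: an Override on the exact name wins,
    otherwise the Default for the file extension applies."""
    root = ET.fromstring(content_types_xml)
    for override in root.findall('{%s}Override' % CT_NS):
        if override.get('PartName', '').lower() == part_name.lower():
            return override.get('ContentType')
    extension = part_name.rsplit('.', 1)[-1].lower()
    for default in root.findall('{%s}Default' % CT_NS):
        if default.get('Extension', '').lower() == extension:
            return default.get('ContentType')
    return None
```

With this sample, `/word/document.xml` resolves through its Override, while any other `.xml` Part falls back to the Default for its extension.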
+
+## Relationships
+
+Parts have Relationships to other Parts (or external resources), forming a graph-like structure. (Relationships themselves are defined by dedicated `.rels` Parts, 
+which OfficeDissector automatically finds and parses.)
+
+`Document.relationships` provides a List of all Relationships in the Document. `Part.relationships_in()` provides a List of all Relationships pointing _to_ a Part; and `Part.relationships_out()` provides a List of all 
+Relationships _from_ a Part.
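For a sense of what a `.rels` Part contains, the following standard-library sketch parses one and resolves internal targets relative to the source Part. The sample XML, the `parse_rels` helper and its dictionary keys are hypothetical; they are not the objects that `Part.relationships_out()` returns.

```python
import posixpath
import xml.etree.ElementTree as ET

REL_NS = 'http://schemas.openxmlformats.org/package/2006/relationships'

# Illustrative contents of /word/_rels/document.xml.rels, which holds the
# outgoing Relationships of /word/document.xml.
SAMPLE_RELS = '''<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId8"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/endnotes"
      Target="endnotes.xml"/>
  <Relationship Id="rId9"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
      Target="http://example.com/" TargetMode="External"/>
</Relationships>'''


def parse_rels(source_part, rels_xml):
    """Return one dict per Relationship, resolving internal targets
    relative to the folder of the source Part."""
    base = posixpath.dirname(source_part)
    rels = []
    for rel in ET.fromstring(rels_xml).findall('{%s}Relationship' % REL_NS):
        external = rel.get('TargetMode') == 'External'
        target = rel.get('Target')
        if not external:
            target = posixpath.normpath(posixpath.join(base, target))
        rels.append({'id': rel.get('Id'), 'type': rel.get('Type'),
                     'target': target, 'external': external})
    return rels


rels = parse_rels('/word/document.xml', SAMPLE_RELS)
```

External targets are left untouched, which makes them easy to single out when looking for links that leave the document.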
+
+
+## Finding Parts
+
+OfficeDissector provides several ways to find Parts of interest:
+
+* `Document.main_part()` returns the main Part; each Document has exactly one main Part
+* `Document.parts_by_content_type()` and `Document.parts_by_content_type_regex()` allow finding all Parts with a particular Content-Type
+* `Document.parts_by_relationship_type()` allows finding all Parts which have an incoming Relationship of a particular type.
+
+In nearly every case, Office's treatment of a Part is solely determined by the Part's Content-Type and incoming Relationship types. Thus, 
+these methods provide the best way to perform most analysis.
+
+## XML Parts
+
+Most Parts are in XML format. OfficeDissector can parse the XML for you via the `Part.xml()` method, or run XPath queries (recommended) via the `Part.xpath()` method. 
+XPath is a very powerful tool which makes most OOXML analysis much simpler.
+
+OOXML makes heavy use of XML Namespaces. XML Namespaces can be confusing if you do not have experience with them. 
+See XML_NAMESPACES.txt for a quick introduction.
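The namespace handling can be tried out without OfficeDissector. This sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (OfficeDissector itself relies on lxml); the sample document is made up for the example.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Illustrative sample of a WordprocessingML main Part.
SAMPLE_DOCUMENT = '''<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Hello</w:t></w:r></w:p>
    <w:p><w:r><w:t>OOXML</w:t></w:r></w:p>
  </w:body>
</w:document>'''

root = ET.fromstring(SAMPLE_DOCUMENT)
# The prefix used in the query ('w') is resolved through the mapping passed
# here, not through the prefixes declared in the XML itself.
texts = [t.text for t in root.findall('.//w:t', {'w': W_NS})]
```

Because the prefix is resolved through the mapping, the query would work equally well with any other prefix name, as long as it maps to the same namespace URI.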
+
+## Other Parts
+
+Multimedia and embedded objects are typically stored in their own Parts, one Part per object. For example, each image comprises its own Part, 
+with the appropriate Content-Type (e.g. `image/png`) and Relationships. Since each image is kept separate, in its native format, and with its 
+Content-Type exposed, analyzing them is easy: the image's data is readable via `Part.stream()`. For example:
+
+ jpegs = doc.parts_by_content_type('image/jpeg')
+ jpegs[0].name # Returns '/word/media/image1.jpeg'
+ jpegs[0].stream().read(10) # Returns '\xff\xd8\xff\xe0\x00\x10JFIF', the first 10 bytes of the JPEG data
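One simple check that builds on this is comparing a Part's declared Content-Type with the first bytes of its stream. The sketch below is an assumption about how such a check might look, not a feature of OfficeDissector; the `MAGIC_BYTES` table and `matches_content_type` helper are hypothetical and cover only three image formats.

```python
# A few well-known file signatures, keyed by Content-Type.
MAGIC_BYTES = {
    'image/jpeg': b'\xff\xd8\xff',
    'image/png': b'\x89PNG\r\n\x1a\n',
    'image/gif': b'GIF8',
}


def matches_content_type(content_type, data):
    """Return True/False when the signature is known, None when it is not."""
    magic = MAGIC_BYTES.get(content_type)
    if magic is None:
        return None
    return data.startswith(magic)
```

The bytes read from `jpegs[0].stream()` in the example above could be passed in as `data`; a `False` result would flag a Part whose content does not match its declared type.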
+
+## Features and CoreProperties
+
+The Features object (retrieved via `Document.features`) provides access to common Document features, such as macros, images, videos, and 
+sounds. This interface provides convenient access to the features most relevant to security analysis. (It's important to realize that all of 
+these features are simply Parts found by their Content-Type and Relationships.)
+
+The CoreProperties object (retrieved via `Document.core_properties`) parses and exposes the Document's Core Properties, such as `creator` 
+and `modified`; these properties are very useful for security analysis and forensics.
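To show where values such as `creator` and `modified` come from, here is a standard-library sketch that reads them from a sample Core Properties Part. The sample XML is invented for the example (apart from the timestamp, which mirrors the README session), and the code is not OfficeDissector's own parser.

```python
import xml.etree.ElementTree as ET

NS = {
    'cp': 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties',
    'dc': 'http://purl.org/dc/elements/1.1/',
    'dcterms': 'http://purl.org/dc/terms/',
}

# Illustrative contents of a Core Properties Part (commonly /docProps/core.xml).
SAMPLE_CORE = '''<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:creator>Example Author</dc:creator>
  <cp:lastModifiedBy>Example Editor</cp:lastModifiedBy>
  <dcterms:modified xsi:type="dcterms:W3CDTF">2009-12-04T14:47:00Z</dcterms:modified>
</cp:coreProperties>'''

root = ET.fromstring(SAMPLE_CORE)
creator = root.findtext('dc:creator', namespaces=NS)
last_modified_by = root.findtext('cp:lastModifiedBy', namespaces=NS)
modified = root.findtext('dcterms:modified', namespaces=NS)
```

The properties live in three different namespaces, which is why the namespace mapping matters even for such a small Part.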
+
+# How do I...?
+
+To **get started**, install OfficeDissector (see Install in README.txt), and begin with these two lines:
+
+ import officedissector
+ doc = officedissector.doc.Document('path/to/your/ooxml.docx') # Returns a Document object
+
+To **interactively analyze an OOXML document**, use ipython. See Usage in README.txt for an example.
+
+To **automatically analyze a large volume of documents**, use plugins. See mastiff-plugins/README.txt.
+
+To **learn OfficeDissector**, use ipython, and press the TAB key to see the methods available. 
+(This is demonstrated in Usage in README.txt.) Look at the full API docs for more details.
+
+If you want to **retrieve specific features and properties of a document**, use the Features and CoreProperties interface, described above.
+
+The best way to **explore the behavior of Office** is to retrieve Parts by their Content-Type and Relationships; see Finding Parts above. 
+As far as we know, all of Office's behavior can be reconstructed through this technique. (Office also looks at the filename's extension; 
+see the section "File" above.) **This is therefore the most general purpose and powerful mode of analysis**. (This is the method OfficeDissector uses 
+internally to find Features, such as multimedia.)
+
+To **drill deeper into an XML Part**, use `Part.xpath()`, paying careful attention to XML namespaces; see the section "XML Parts" above. (This is 
+the method OfficeDissector uses internally to find and parse CoreProperties, such as `creator`.)
+
+To **export the data into other tools**, use the `to_json()` method, which nearly all OfficeDissector objects provide.
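As a sketch of what such an export might involve, the function below serializes a Part-like record to JSON and base64-encodes the binary stream on request. The key names (`uri`, `content-type`, `stream_b64`) and the `include_stream` flag mirror the output shown in the Usage section of README.txt; the `part_to_json` function itself is hypothetical and is not the library's `to_json()`.

```python
import base64
import json


def part_to_json(name, content_type, data, include_stream=False):
    """Serialize a Part-like record; binary data is base64-encoded on request."""
    record = {'uri': name, 'content-type': content_type}
    if include_stream:
        record['stream_b64'] = base64.b64encode(data).decode('ascii')
    return json.dumps(record, indent=4, sort_keys=True)


exported = part_to_json('/word/media/image1.jpeg', 'image/jpeg',
                        b'\xff\xd8\xff\xe0', include_stream=True)
```

Base64 keeps the output valid JSON regardless of the bytes in the stream, and a consuming tool can recover the original data with `base64.b64decode`.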
+
diff --git a/LEGAL.txt b/LEGAL.txt
@@ -0,0 +1,10 @@
+
+Copyright (C) 2013-2015 Grier Forensics
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
+INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
+PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
+CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE
+OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
diff --git a/README.txt b/README.txt
@@ -0,0 +1,191 @@
+# OfficeDissector
+
+OfficeDissector is a parser library for static security analysis of Office Open XML (OOXML) Documents,
+created by Grier Forensics for the Cyber System Assessments Group at MIT's Lincoln Laboratory.
+
+OfficeDissector is the first parser designed specifically for security analysis of OOXML documents. It exposes all internals, including 
+document properties, parts, content-type, relationships, embedded macros and multimedia, comments, and more. 
+It provides full JSON export and a MASTIFF-based plugin architecture. It also includes a nearly 600 MB test corpus, unit tests with nearly 
+100% coverage, smoke tests running against the entire corpus, and simple, well-factored, fully commented code.
+
+## Install
+
+OfficeDissector requires the lxml package and Python version 2.7.
+
+To use OfficeDissector without installing it, set the `PYTHONPATH` to the `officedissector` directory:
+
+ $ export PYTHONPATH=/path/to/thisfolder
+
+Alternatively, it can be installed using pip (recommended) or python setup:
+
+ $ sudo pip install /path/to/thisfolder # Recommended, as pip supports uninstall
+ $ sudo python setup.py install # Alternative
+
+## Documentation
+
+To view OfficeDissector documentation, open in a browser:
+
+ doc/html/index.html
+
+## Testing
+
+To test, first set PYTHONPATH or install `officedissector` as described above. Then:
+
+ # Unit tests
+ $ cd test/unit_test
+ $ python test_officedissector.py
+
+ # Smoke tests
+ $ cd test
+ $ python smoke_tests.py
+
+The smoke tests will create log files with more information about them.
+
+## MASTIFF Plugins
+
+To find more information about the MASTIFF architecture and sample plugins, see
+`mastiff-plugins/README.txt`.
+
+## Usage
+
+Below is an ipython session demonstrating usage of OfficeDissector:
+
+ $ ipython
+ In [1]: import officedissector
+ In [2]: doc = officedissector.doc.Document('test/fraunhoferlibrary/Artikel.docx')
+ In [4]: doc.is_macro_enabled
+ Out[4]: False
+
+ In [5]: doc.is_template
+ Out[5]: False
+
+ In [6]: mp = doc.main_part()
+ In [7]: mp.content_type()
+ Out[7]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml'
+
+ In [9]: mp.name
+ Out[9]: '/word/document.xml'
+
+ In [10]: mp.content_type()
+ Out[10]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml'
+
+ # We can read the part's stream of data:
+ In [17]: mp.stream().read(200)
+ Out[17]: '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-c'
+
+ # Or use XPath to parse it:
+ In [33]: t = mp.xpath('//w:t', {'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main"})
+ In [37]: t[2].text
+ Out[37]: u'Das vorliegende Dokument ist ein Beispiel f\xfcr einen zur Publikation in einer Zeitschrift vorgesehenen Artikel. Es verwendet f\xfcr Autor und Titel in den Dokumenteigenschaften festgelegte Eintr\xe4ge.'
+
+ # All Relationships in and out are exposed:
+ In [38]: mp.relationships_in()
+ Out[38]: [Relationship [rId1] (source Part [RootPart])]
+
+ In [39]: mp.relationships_out()
+ Out[39]:
+ [Relationship [rId8] (source Part [/word/document.xml]),
+ Relationship [rId13] (source Part [/word/document.xml]),
+ Relationship [rId3] (source Part [/word/document.xml]),
+ ...
+ Relationship [rId14] (source Part [/word/document.xml])]
+
+ In [40]: rel = mp.relationships_out()[0]
+ In [43]: rel.type
+ Out[43]: 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/endnotes'
+
+ In [46]: endnotes = rel.target_part
+ In [48]: endnotes.content_type()
+ Out[48]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml'
+
+ # Any Part (or the entire Document) can be exported to JSON:
+ In [50]: print endnotes.to_json()
+ {
+ "content-type": "application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml",
+ "uri": "/word/endnotes.xml",
+ "relationships_out": [],
+ "relationships_in": [
+ "Relationship [rId8] (source Part [/word/document.xml])"
+ ]
+ }
+
+ # Features are automatically exposed:
+ In [55]: doc.features.[TAB]
+ ...
+ doc.features.comments
+ doc.features.custom_properties
+ doc.features.custom_xml
+ doc.features.digital_signatures
+ doc.features.doc
+ doc.features.embedded_controls
+ doc.features.embedded_objects
+ doc.features.embedded_packages
+ doc.features.fonts
+ doc.features.get_parts
+ doc.features.get_union
+ doc.features.images
+ doc.features.macros
+ doc.features.sounds
+ doc.features.videos
+
+ In [55]: doc.features.images
+ Out[55]: [Part [/word/media/image1.jpeg]]
+
+ In [56]: image = doc.features.images[0]
+ In [58]: image.content_type()
+ Out[58]: 'image/jpeg'
+
+ # We can export the binary data to JSON as well, by setting include_stream = True:
+ In [61]: print image.to_json(include_stream = True)
+ {
+ "stream_b64": "/9j/4AAQSkZJRgABAQEASABIAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAAFAAUDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3uGGO3iWKJdqL0Gc0UUUAf//Z",
+ "content-type": "image/jpeg",
+ "uri": "/word/media/image1.jpeg",
+ "relationships_out": [],
+ "relationships_in": [
+ "Relationship [rId1] (source Part [/word/theme/theme1.xml])"
+ ]
+ }
+
+ # Check for macros:
+ In [62]: doc.features.macros
+ Out[62]: []
+
+ # Or comments:
+ In [63]: doc.features.comments
+ Out[63]: []
+
+ # Core properties are exposed:
+ In [64]: doc.core_properties.[TAB]
+ ...
+ doc.core_properties.content_status
+ doc.core_properties.core_prop_part
+ doc.core_properties.created
+ doc.core_properties.creator
+ doc.core_properties.description
+ doc.core_properties.identifier
+ doc.core_properties.keywords
+ doc.core_properties.language
+ doc.core_properties.last_modified_by
+ doc.core_properties.last_printed
+ doc.core_properties.modified
+ doc.core_properties.name
+ doc.core_properties.parse_all
+ doc.core_properties.parse_prop
+ doc.core_properties.revision
+ doc.core_properties.subject
+ doc.core_properties.title
+ doc.core_properties.version
+ doc.core_properties.category
+
+ In [68]: doc.core_properties.modified
+ Out[68]: '2009-12-04T14:47:00Z'
+
+## Analyzing OOXML
+
+See `doc/txt/ANALYZING_OOXML.txt` for a quick start guide on how to use 
+OfficeDissector to analyze OOXML documents.
+
+## API
+
+For more details about OfficeDissector, see the API documentation at `doc/html/rst/api.html`.