forked from unixpickle/ANHTML

A lightweight HTML parser for Objective-C (ARC only)

gitm2m/ANHTML

ANHTML

ANHTML is an easy-to-use, standalone Objective-C HTML parser written with ARC. It does not depend on libxml or NSXMLParser, yet it still provides reliable DOM-based parsing.

At the core of ANHTML, the ANHTMLParser class tokenizes the XML/HTML document, telling the delegate when it finds text, tags, or escape codes. On top of ANHTMLParser is ANHTMLDocument, the DOM implementation. ANHTMLDocument encapsulates an ANHTMLParser, and constructs a DOM as the parser sends it callbacks.

Once an ANHTMLDocument has been created and has finished processing the document, its DOM elements can be traversed and examined through a few simple methods. The ANHTMLDocument class exposes the root element through the rootElement getter. From there, the children property can be used to walk the root's children. If the given document does not have a single root node, the rootElement getter returns a nameless element with more than one sub-node.

A simple parsing example

The main.m file included in this repository contains a basic example of parsing with an ANHTMLDocument. Here is a walkthrough of that example:

NSData * testData = [@"<html><body><p>This is a basic test</p></body></html>" dataUsingEncoding:NSASCIIStringEncoding];

This line creates a small document for us to parse. Next, we create an ANHTMLDocument from this data. The initWithDocumentData: method parses the given data automatically, so a complete DOM representation is available by the time the method returns:

ANHTMLDocument * document = [[ANHTMLDocument alloc] initWithDocumentData:testData];

Since the <html> element is the root element of this document, calling rootElement on the document variable returns the ANHTMLElement for our <html> element:

ANHTMLElement * htmlElem = [document rootElement];

Next, we can get the <body> element, which is a child of the <html> element:

ANHTMLElement * body = [[htmlElem childElementsWithName:@"body"] lastObject];

Note that the childElementsWithName: method returns an array of children that have a specified element name. In this case, we know that there is only one <body> element, so childElementsWithName: will return an array with one object. Calling lastObject on the array is the shortest way to get the one and only element that is in the array.
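
When the number of matches is not known in advance, it may be safer to check the array before taking an element from it. The helper below is a minimal sketch and is not part of ANHTML: the function name ANFirstChildNamed is hypothetical, and the header names are assumed to match the class names. It relies only on childElementsWithName: as described above.

```objective-c
#import <Foundation/Foundation.h>
#import "ANHTMLDocument.h" // assumption: header is named after the class
#import "ANHTMLElement.h"  // assumption: header is named after the class

// Hypothetical helper (not part of ANHTML): returns the first child of
// `parent` with the given element name, or nil when there is no match.
static ANHTMLElement * ANFirstChildNamed(ANHTMLElement * parent, NSString * name) {
    NSArray * matches = [parent childElementsWithName:name];
    // -count is 0 for an empty array, and also when `matches` is nil,
    // because messaging nil returns 0.
    if ([matches count] == 0) {
        return nil;
    }
    return [matches objectAtIndex:0];
}
```

With this helper, ANFirstChildNamed(htmlElem, @"body") could stand in for the lastObject call above, and the caller can test the result against nil before using it.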

Finally, we can get all of the text from the <body> element and log it to the console:

NSLog(@"%@", [body stringValue]);
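
For reference, the steps of this walkthrough can be assembled into one small program. This is a sketch under assumptions rather than a copy of main.m: the header name ANHTMLDocument.h and the helper name ANBodyText are hypothetical, and the nil check is an addition for safety. Every ANHTML call is one that appears in the walkthrough.

```objective-c
#import <Foundation/Foundation.h>
#import "ANHTMLDocument.h" // assumption: header is named after the class

// Hypothetical helper: parses `data` and returns the text of the <body>
// element, or nil if the root element has no <body> child.
static NSString * ANBodyText(NSData * data) {
    // Parsing happens inside the initializer.
    ANHTMLDocument * document = [[ANHTMLDocument alloc] initWithDocumentData:data];
    ANHTMLElement * htmlElem = [document rootElement];
    // lastObject returns nil when the array is empty.
    ANHTMLElement * body = [[htmlElem childElementsWithName:@"body"] lastObject];
    return [body stringValue];
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        NSData * testData = [@"<html><body><p>This is a basic test</p></body></html>"
                             dataUsingEncoding:NSASCIIStringEncoding];
        NSString * text = ANBodyText(testData);
        if (text == nil) {
            NSLog(@"No <body> element found");
            return 1;
        }
        NSLog(@"%@", text);
    }
    return 0;
}
```

Keeping the lookup in a separate function makes it easy to reuse with other documents and to handle a missing <body> element in one place.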
