markup-parser


A simple parser for HTML-like markup

This set of C# classes parses "HTML-like" markup. The classes are intended to be subclassed rather than used directly.

Class Descriptions

Tag
The simplest possible form of a tag: a tag name and a list of attributes.
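
As a rough illustration of the shape described here, a tag holding a name and an attribute list might look like the sketch below. This is not the library's source: the SketchTag and SketchAttribute names, their members, and the case-insensitive lookup are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a single name/value attribute.
public class SketchAttribute
{
    public string Name { get; set; }
    public string Value { get; set; }
}

// Hypothetical stand-in for Tag; the real class may expose different members.
public class SketchTag
{
    public SketchTag(string name)
    {
        Name = name;
        Attributes = new List<SketchAttribute>();
    }

    public string Name { get; private set; }
    public List<SketchAttribute> Attributes { get; private set; }

    // Returns the value of the first attribute with the given name
    // (compared case-insensitively), or null if there is none.
    public string GetAttribute(string name)
    {
        foreach (SketchAttribute attribute in Attributes)
        {
            if (string.Equals(attribute.Name, name, StringComparison.OrdinalIgnoreCase))
            {
                return attribute.Value;
            }
        }
        return null;
    }
}
```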

PrimitiveTag
LiteralPrimitiveTag, CommentPrimitiveTag
A low-level representation of a tag. In HTML, it corresponds to a piece of text contained between < and >. Literals (text outside of < and >) and comments are also treated as tags.

ParserState
Container class that holds the raw input (as a string) and the output (as a list of PrimitiveTags), and provides methods for navigating the input.

Parser
This class encapsulates most of the parsing logic. Given a ParserState (which contains the raw markup as a string), it produces a collection of PrimitiveTags.
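
To make the idea of splitting markup into tags, literals, and comments concrete, here is a minimal tokenizer sketch. It is not the Parser class itself: SketchKind, SketchPrimitive, and SketchTokenizer are hypothetical names, and the sketch makes simplifying assumptions (for example, it does not handle a > character inside a quoted attribute value).

```csharp
using System;
using System.Collections.Generic;

public enum SketchKind { Tag, Literal, Comment }

// Hypothetical stand-in for a primitive tag: a kind plus the raw text.
public class SketchPrimitive
{
    public SketchKind Kind;
    public string Text;
}

public static class SketchTokenizer
{
    public static List<SketchPrimitive> Tokenize(string source)
    {
        var result = new List<SketchPrimitive>();
        int position = 0;
        while (position < source.Length)
        {
            if (source[position] != '<')
            {
                // Literal: everything up to the next '<' or the end of input.
                int next = source.IndexOf('<', position);
                if (next < 0) next = source.Length;
                Add(result, SketchKind.Literal, source.Substring(position, next - position));
                position = next;
                continue;
            }

            // Comments run from "<!--" to "-->"; other tags run from '<' to '>'.
            bool isComment = string.CompareOrdinal(source, position, "<!--", 0, 4) == 0;
            string terminator = isComment ? "-->" : ">";
            int searchFrom = isComment ? position + 4 : position + 1;
            int end = source.IndexOf(terminator, searchFrom, StringComparison.Ordinal);
            if (end < 0)
            {
                // Unterminated markup: treat the remainder as literal text.
                Add(result, SketchKind.Literal, source.Substring(position));
                break;
            }

            int stop = end + terminator.Length;
            Add(result, isComment ? SketchKind.Comment : SketchKind.Tag,
                source.Substring(position, stop - position));
            position = stop;
        }
        return result;
    }

    private static void Add(List<SketchPrimitive> list, SketchKind kind, string text)
    {
        list.Add(new SketchPrimitive { Kind = kind, Text = text });
    }
}
```

Keeping the raw text on every piece means the original input can be reassembled by concatenating the pieces in order.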

HierarchyNode
LiteralNode, CommentNode, RootNode
A higher-level representation of a tag. At this level, the document hierarchy is maintained as a collection of child nodes.
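
The sketch below shows one way a node that owns its children could render itself recursively, with a literal subclass overriding the rendering. It is an illustration only: SketchNode and SketchLiteralNode are hypothetical names, and the real HierarchyNode classes may be structured differently (for instance, this sketch ignores attributes).

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical element node: a name plus an ordered list of children.
public class SketchNode
{
    private readonly List<SketchNode> children = new List<SketchNode>();

    public SketchNode(string name)
    {
        Name = name;
    }

    public string Name { get; private set; }

    public IList<SketchNode> Children
    {
        get { return children; }
    }

    // Renders the opening tag, then every child in order, then the closing tag.
    public virtual string Render()
    {
        var builder = new StringBuilder();
        builder.Append('<').Append(Name).Append('>');
        foreach (SketchNode child in children)
        {
            builder.Append(child.Render());
        }
        builder.Append("</").Append(Name).Append('>');
        return builder.ToString();
    }
}

// Hypothetical literal node: renders its text verbatim and has no tag of its own.
public class SketchLiteralNode : SketchNode
{
    private readonly string text;

    public SketchLiteralNode(string text) : base("#text")
    {
        this.text = text;
    }

    public override string Render()
    {
        return text;
    }
}
```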

TagConverter
Given a Tag, produces a HierarchyNode. Can also identify tags which automatically close themselves (like <br> in HTML).

The IsSelfClosingMethod attribute marks the methods of the TagConverter class that determine whether a tag is self-closing. The GetHierarchyNodeMethod attribute marks the method used to create a HierarchyNode from a given Tag.
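
The general pattern of marking methods with an attribute and discovering them through reflection can be sketched as follows. This is a guess at the idea, not the library's implementation: the SketchSelfClosingCheck attribute, the method signature (a tag name in, a bool out), and the "any marked method returns true" rule are all assumptions.

```csharp
using System;
using System.Reflection;

// Hypothetical marker attribute, standing in for IsSelfClosingMethod.
[AttributeUsage(AttributeTargets.Method)]
public sealed class SketchSelfClosingCheckAttribute : Attribute
{
}

// Hypothetical converter base class.
public class SketchConverter
{
    [SketchSelfClosingCheck]
    public bool IsVoidElement(string tagName)
    {
        return tagName == "br" || tagName == "hr" || tagName == "img";
    }

    // Invokes every public instance method marked with the attribute,
    // including those added by subclasses; the tag counts as self-closing
    // if any of them returns true.
    public bool IsSelfClosing(string tagName)
    {
        MethodInfo[] methods = GetType().GetMethods(BindingFlags.Instance | BindingFlags.Public);
        foreach (MethodInfo method in methods)
        {
            if (!method.IsDefined(typeof(SketchSelfClosingCheckAttribute), true))
            {
                continue;
            }
            if ((bool)method.Invoke(this, new object[] { tagName }))
            {
                return true;
            }
        }
        return false;
    }
}

// A subclass extends the behavior simply by adding another marked method.
public class SketchFormConverter : SketchConverter
{
    [SketchSelfClosingCheck]
    public bool IsFormVoidElement(string tagName)
    {
        return tagName == "input";
    }
}
```

With this arrangement a subclass does not need to override anything; adding a marked method is enough for it to take part in the check.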

DocumentHierarchyCreator
Traverses a list of PrimitiveTags and creates a tree of HierarchyNodes. Uses the TagConverter class to create the HierarchyNodes.
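
One common way to turn a flat list of tags into a tree is to keep a stack of currently open elements. The sketch below shows that technique under simplifying assumptions: tokens are plain strings, stray closing tags are ignored, and the SketchTreeNode and SketchHierarchyBuilder names are hypothetical. The real DocumentHierarchyCreator may work differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tree node: a name plus child nodes.
public class SketchTreeNode
{
    public string Name;
    public List<SketchTreeNode> Children = new List<SketchTreeNode>();

    public SketchTreeNode(string name)
    {
        Name = name;
    }
}

public static class SketchHierarchyBuilder
{
    // tokens: "<name>" opens an element, "</name>" closes one,
    // and anything else is treated as literal text.
    public static SketchTreeNode Build(string[] tokens, Predicate<string> isSelfClosing)
    {
        var root = new SketchTreeNode("#root");
        var open = new Stack<SketchTreeNode>();
        open.Push(root);

        foreach (string token in tokens)
        {
            bool isTag = token.Length > 2 && token[0] == '<' && token[token.Length - 1] == '>';
            if (!isTag)
            {
                // Literal text becomes a child of the innermost open element.
                open.Peek().Children.Add(new SketchTreeNode("#text:" + token));
            }
            else if (token[1] == '/')
            {
                // Closing tag: pop only if it matches the innermost open element.
                string name = token.Substring(2, token.Length - 3);
                if (open.Count > 1 && open.Peek().Name == name)
                {
                    open.Pop();
                }
            }
            else
            {
                // Opening tag: attach to the parent, and keep it open
                // unless it is self-closing.
                string name = token.Substring(1, token.Length - 2);
                var node = new SketchTreeNode(name);
                open.Peek().Children.Add(node);
                if (!isSelfClosing(name))
                {
                    open.Push(node);
                }
            }
        }
        return root;
    }
}
```

The isSelfClosing callback plays the role that the TagConverter plays in the description above: it tells the builder which elements never receive children.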

Workflow

Instantiate a ParserState object, and set its Source property to the markup to be parsed:

ParserState state = new ParserState();

state.Source = markupToParse;


Instantiate a Parser object, passing in the ParserState created earlier:

Parser parser = new Parser(state);


Call the Parse() method, and store the results:

Collection<PrimitiveTag> tags = parser.Parse();


Instantiate a TagConverter and DocumentHierarchyCreator:

TagConverter converter = new TagConverter();

DocumentHierarchyCreator hierarchyCreator = new DocumentHierarchyCreator();


Call the BuildHierarchy method of the DocumentHierarchyCreator, passing in the list of PrimitiveTags and the TagConverter:

HierarchyNode root = hierarchyCreator.BuildHierarchy(tags.ToArray<PrimitiveTag>(), converter);

At this point, root contains the parsed document as a tree. Calling its Render method converts the tree back into a string:

string output = root.Render();
