Author: | Paul McGuire |
---|---|
Address: | ptmcg@users.sourceforge.net |
Revision: | 1.2.2 |
Date: | September, 2004 |
Copyright: | Copyright © 2003,2004 Paul McGuire. |
abstract: | This document provides how-to instructions for the pyparsing library, an easy-to-use Python module for constructing and executing basic text parsers. The pyparsing module is useful for evaluating user-definable expressions, processing custom application language commands, or extracting data from formatted reports. |
To parse an incoming data string, the client code must follow these steps:
The following complete Python program will parse the greeting "Hello, World!", or any other greeting of the form "<salutation>, <addressee>!":
    from pyparsing import Word, alphas

    greet = Word( alphas ) + "," + Word( alphas ) + "!"
    greeting = greet.parseString( "Hello, World!" )
    print(greeting)
The parsed tokens are returned in the following form:
['Hello', ',', 'World', '!']
The pyparsing module can be used to interpret simple command strings or algebraic expressions, or can be used to extract data from text reports with complicated format and structure ("screen or report scraping"). However, it is possible that your defined matching patterns may accept invalid inputs. Use pyparsing to extract data from strings assumed to be well-formatted.
To keep up the readability of your code, use the +, |, and ^ operators to combine expressions. You can also combine string literals with ParseExpressions - they will be automatically converted to Literal objects. For example:
    from pyparsing import Word, alphas, nums

    integer  = Word( nums )            # simple unsigned integer
    variable = Word( alphas, max=1 )   # single letter variable, such as x, z, m, etc.
    arithOp  = Word( "+-*/", max=1 )   # arithmetic operators
    equation = variable + "=" + integer + arithOp + integer    # will match "x=2+2", etc.
In the definition of equation, the string "=" will get added as a Literal("="), but in a more readable way.
The pyparsing module's default behavior is to ignore whitespace, which suits the great majority of parsers. This allows you to write simple, clean grammars, such as the equation above, without cluttering them with explicit whitespace markers. The equation grammar will successfully parse all of the following statements:
    x=2+2
    x = 2+2
    a = 10 * 4
    r= 1234/ 100000
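As a quick check, the equation grammar can be run over each of these statements. This is a minimal sketch that repeats the definitions so it runs on its own; the names statements and parsed are illustrative only.

```python
from pyparsing import Word, alphas, nums

integer = Word(nums)              # simple unsigned integer
variable = Word(alphas, max=1)    # single letter variable
arithOp = Word("+-*/", max=1)     # arithmetic operators
equation = variable + "=" + integer + arithOp + integer

statements = ["x=2+2", "x = 2+2", "a = 10 * 4", "r= 1234/ 100000"]

# Whitespace between tokens is skipped, so every spelling yields five tokens.
parsed = [equation.parseString(text).asList() for text in statements]
for text, tokens in zip(statements, parsed):
    print(repr(text), "->", tokens)
```

Each statement produces the same flat shape: variable, "=", integer, operator, integer.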
Of course, it is quite simple to extend this example to support more elaborate expressions, with nested parentheses, floating point numbers, scientific notation, and named constants (such as e or pi). See fourFn.py, included in the examples directory.
MatchFirst expressions (defined using the | operator) are matched left-to-right, and the first alternative that matches causes all later alternatives to be skipped, so be sure to define less-specific patterns after more-specific patterns. If you are not sure which expressions are most specific, use Or expressions (defined using the ^ operator) - they will always match the longest expression, although they are more compute-intensive.
Or expressions will evaluate all of the specified subexpressions to determine which is the "best" match, that is, which matches the longest string in the input data. In case of a tie, the left-most expression in the Or list will win.
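The difference between the two operators is easiest to see when a less-specific pattern is listed first. The sketch below is illustrative only; the integer and real expressions are assumed for the example and are not part of any grammar described here.

```python
from pyparsing import Combine, Word, nums

integer = Word(nums)
real = Combine(Word(nums) + "." + Word(nums))

# MatchFirst: integer is listed first and matches "3", so real is never tried.
firstMatch = integer | real
# Or: both alternatives are evaluated and the longest match is kept.
longestMatch = integer ^ real

firstTokens = firstMatch.parseString("3.1416").asList()
longestTokens = longestMatch.parseString("3.1416").asList()
print(firstTokens)    # ['3']
print(longestTokens)  # ['3.1416']
```

Writing the alternatives as real | integer would give the MatchFirst version the same result as the Or version, without the extra evaluation.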
If parsing the contents of an entire file, pass it to the parseFile method using:
expr.parseFile( sourceFile )
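A minimal sketch of parseFile in use follows. To keep the example self-contained, an in-memory text stream stands in for an open file; that substitution is an assumption of the example, not a requirement of the method.

```python
import io
from pyparsing import Word, alphas

greet = Word(alphas) + "," + Word(alphas) + "!"

# io.StringIO behaves like an open text file; a filename string could be
# passed instead of a file object.
sourceFile = io.StringIO("Hello, World!\n")
tokens = greet.parseFile(sourceFile).asList()
print(tokens)
```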
ParseExceptions will report the location where an expected token or expression failed to match. In the case of complex expressions, the reported location may not be exactly where you would expect.
Use the Group class to enclose logical groups of tokens within a sublist. This will help organize your results into a more hierarchical form (the default behavior is to return matching tokens as a flat list of matching input strings).
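The effect of Group is clearest when the same repeated pattern is parsed with and without it. The entry pattern below (a word followed by a number) is a made-up example.

```python
from pyparsing import Group, OneOrMore, Word, alphas, nums

entry = Word(alphas) + Word(nums)

flat = OneOrMore(entry)             # tokens returned as one flat list
grouped = OneOrMore(Group(entry))   # one sublist per repetition of entry

flatTokens = flat.parseString("a 1 b 2").asList()
groupedTokens = grouped.parseString("a 1 b 2").asList()
print(flatTokens)     # ['a', '1', 'b', '2']
print(groupedTokens)  # [['a', '1'], ['b', '2']]
```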
Punctuation may be significant for matching, but is rarely of much interest in the parsed results. Use the suppress() method to keep these tokens from cluttering up your returned lists of tokens. For example, delimitedList() matches a succession of one or more expressions, separated by delimiters (commas by default), but only returns a list of the actual expressions - the delimiters are used for parsing, but are suppressed from the returned output.
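Both techniques can be sketched in a few lines. The call pattern below is a hypothetical example: parentheses are required for the match but suppressed from the results, and delimitedList drops the commas on its own.

```python
from pyparsing import Literal, Word, alphas, delimitedList

colors = delimitedList(Word(alphas))   # comma-separated words, commas dropped
lparen = Literal("(").suppress()       # matched, but kept out of the results
rparen = Literal(")").suppress()
call = Word(alphas) + lparen + colors + rparen

colorTokens = colors.parseString("red, green, blue").asList()
callTokens = call.parseString("paint(red, green)").asList()
print(colorTokens)  # ['red', 'green', 'blue']
print(callTokens)   # ['paint', 'red', 'green']
```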
Parse actions can be used to convert values from strings to other data types (ints, floats, booleans, etc.). But be careful not to include converted data within a Combine object.
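One way to follow this advice, sketched here under the assumption of a simple digits-dot-digits number format, is to let Combine assemble the string first and attach the conversion to the Combine expression itself, so that only strings are ever combined.

```python
from pyparsing import Combine, Word, nums

# The parts inside Combine stay as strings; conversion happens afterwards,
# on the single combined token.
real = Combine(Word(nums) + "." + Word(nums))
real.setParseAction(lambda s, l, t: [float(t[0])])

realTokens = real.parseString("3.14").asList()
print(realTokens)  # [3.14]
```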
Be careful when defining parse actions that modify global variables or data structures (as in fourFn.py), especially for low level tokens or expressions that may occur within an And expression; an early element of an And may match, but the overall expression may fail.
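The following sketch shows the hazard with a made-up name = number pattern: the parse action on the first element runs as soon as that element matches, even though the overall expression then fails. The log list stands in for whatever global state the action modifies.

```python
from pyparsing import ParseException, Word, alphas, nums

log = []   # stands in for a global data structure

# list.append returns None, so the matched tokens are left unchanged.
name = Word(alphas).setParseAction(lambda s, l, t: log.append(t[0]))
pair = name + "=" + Word(nums)

try:
    pair.parseString("width = wide")   # fails: "wide" is not a number
    failed = False
except ParseException:
    failed = True

print(failed, log)   # the side effect happened despite the failure
```

If this matters for your application, consider attaching the action to the enclosing expression instead, so that it runs only after the whole pattern has matched.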
Performance of pyparsing may be slow for complex grammars and/or large input strings. The psyco package can be used to improve the speed of the pyparsing module with no changes to grammar or program logic - observed improvements have been in the 20-50% range.
ParserElement - abstract base class for all pyparsing classes; methods for code to use are:
parseString( sourceString ) - only called once, on the overall matching pattern; returns a ParseResults object that makes the matched tokens available as a list, and optionally as a dictionary, or as an object with named attributes
parseFile( sourceFile ) - a convenience function that accepts an input file object or filename. The file contents are passed as a string to parseString().
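scanString( sourceString ) - generator function, used to find and extract matching text in the given source string; for each matched text, returns a tuple of the matched tokens (packaged as a ParseResults object), the start location of the matched text in the source string, and the end location.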
scanString allows you to scan through the input source string for matches wherever they occur, instead of exhaustively defining the grammar for the entire source text (as would be required with parseString).
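A short sketch of scanString picking integers out of otherwise unparsed text follows. Each item produced by the generator unpacks into the matched tokens and the start and end locations of the match; the sample text is invented for the example.

```python
from pyparsing import Word, nums

integer = Word(nums)
text = "order 66 shipped 3 crates"

# Only the integers are described by the grammar; the rest is skipped over.
matches = [(tokens.asList(), start, end)
           for tokens, start, end in integer.scanString(text)]
for tokens, start, end in matches:
    print(tokens, start, end, repr(text[start:end]))
```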
transformString( sourceString ) - convenience wrapper function for scanString, to process the input source string, and replace matching text with the tokens returned from parse actions defined in the grammar (see setParseAction).
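As an illustration, the sketch below doubles every integer in a string and leaves the surrounding text untouched. The doubling rule is arbitrary and chosen only to make the substitution visible.

```python
from pyparsing import Word, nums

# The parse action returns the replacement text for each match.
doubled = Word(nums).setParseAction(lambda s, l, t: [str(int(t[0]) * 2)])

transformed = doubled.transformString("2 apples and 10 pears")
print(transformed)  # 4 apples and 20 pears
```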
setName( name ) - associate a short descriptive name with this element, useful in displaying exceptions and trace information
setResultsName( string, listAllMatches=False ) - name to be given to tokens matching the element; returns a copy of the element so that a single basic element can be referenced multiple times and given different names within a complex grammar; if multiple tokens match within a repetition group (such as ZeroOrMore or delimitedList), the default is to return only the last matching token - if listAllMatches is set to True, then a list of matching tokens is returned.
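The effect of listAllMatches can be sketched with a repeated integer; the results name "num" is chosen only for this example.

```python
from pyparsing import OneOrMore, Word, nums

lastOnly = OneOrMore(Word(nums).setResultsName("num"))
allMatches = OneOrMore(Word(nums).setResultsName("num", listAllMatches=True))

lastValue = lastOnly.parseString("1 2 3")["num"]
allValues = allMatches.parseString("1 2 3")["num"].asList()
print(lastValue)   # only the last matching token is kept under the name
print(allValues)   # every matching token is kept under the name
```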
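setParseAction( fn ) - function to call after successful matching of the element; the function is defined as fn( s, loc, toks ), where s is the original string being parsed, loc is the location in the string where matching started, and toks is the list of matched tokens.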
fn can return a modified toks list, to perform conversion, or string modifications. For brevity, fn may also be a lambda - here is an example of using a parse action to convert matched integer tokens from strings to integers:
intNumber = Word(nums).setParseAction( lambda s,l,t: [ int(t[0]) ] )
If fn does not modify the toks list, it does not need to return anything at all.
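A parse action that only observes the match might look like the following sketch. The record function and the seen list are hypothetical names; because record returns nothing, the matched tokens pass through unchanged.

```python
from pyparsing import Word, nums

seen = []

def record(s, loc, toks):
    # Side effect only: remember where the match started and what matched.
    seen.append((loc, toks[0]))

rawNumber = Word(nums).setParseAction(record)
rawTokens = rawNumber.parseString("17").asList()
print(rawTokens)  # the token is still the string '17'
print(seen)
```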
leaveWhiteSpace() - change default behavior of skipping whitespace before starting matching (mostly used internally to the pyparsing module, rarely used by client code)
suppress() - convenience function to suppress the output of the given element, instead of wrapping it with a Suppress object.
ignore( expr ) - function to specify parse expression to be ignored while matching defined patterns; can be called repeatedly to specify multiple expressions; useful to specify patterns of comment syntax, for example
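For instance, a grammar can be told to skip "#" comments running to the end of the line. The comment syntax and sample input below are assumptions made for the sketch.

```python
from pyparsing import Literal, OneOrMore, Word, alphas, restOfLine

comment = Literal("#") + restOfLine   # "#" followed by the rest of the line
words = OneOrMore(Word(alphas))
words.ignore(comment)                 # comments may appear between any tokens

source = "alpha # first\nbeta # second\ngamma"
wordTokens = words.parseString(source).asList()
print(wordTokens)  # ['alpha', 'beta', 'gamma']
```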
setDebug( dbgFlag=True ) - function to enable/disable tracing output when trying to match this element
validate() - function to verify that the defined grammar does not contain infinitely recursive constructs
Word - one or more contiguous characters; construct with a string containing the set of allowed initial characters, and an optional second string of allowed body characters; if only one string is given, the same character set is used for both the initial character and the word body; a Word may also be constructed with any of the optional parameters min, max, and exact, which set the minimum, maximum, or exact length of the word.
If exact is specified, it will override any values for min or max.
CharsNotIn - similar to Word, but matches characters not in the given constructor string (accepts only one string for both initial and body characters); also supports min, max, and exact optional parameters.
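A few illustrative definitions follow; the identifier, zipCode, and field patterns are made up for this sketch and are not part of the module.

```python
from pyparsing import CharsNotIn, Word, alphanums, alphas, nums

# Separate initial and body character sets: a letter or underscore first,
# then letters, digits, or underscores.
identifier = Word(alphas + "_", alphanums + "_")
# exact: match a run of exactly five digits.
zipCode = Word(nums, exact=5)
# CharsNotIn: everything up to, but not including, the next comma.
field = CharsNotIn(",")

identifierTokens = identifier.parseString("x_1").asList()
zipTokens = zipCode.parseString("12345").asList()
fieldTokens = field.parseString("New York, NY").asList()
print(identifierTokens, zipTokens, fieldTokens)
```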
SkipTo - skips ahead in the input string, accepting any characters up to the specified pattern; may be constructed with the following optional parameters:
Group - causes the matched tokens to be enclosed in a list; useful in repeated elements like ZeroOrMore and OneOrMore to break up matched tokens into groups for each repeated pattern
Dict - like Group, but also constructs a dictionary, using the [0]'th elements of all enclosed token lists as the keys, and each token list as the value
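A small sketch of Dict applied to key=value pairs follows; the input format is invented for the example. Each Group contributes its first token as a key. With two-token groups like these, looking up a key returns the single remaining token.

```python
from pyparsing import Dict, Group, OneOrMore, Suppress, Word, alphas, nums

entry = Group(Word(alphas) + Suppress("=") + Word(nums))
table = Dict(OneOrMore(entry))

result = table.parseString("width=10 height=20")
print(result.asList())    # the grouped token lists are still present
print(result["width"])    # lookup by the first token of each group
print(result["height"])
```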
SkipTo - catch-all matching expression that accepts all characters up until the given pattern is found to match; useful for specifying incomplete grammars
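In the sketch below, SkipTo stands in for the undefined body of a hypothetical "note ... ;" statement: everything up to the terminating semicolon is accepted as a single token.

```python
from pyparsing import Literal, SkipTo

semicolon = Literal(";")
# The text between "note" and ";" is not described by any grammar.
statement = Literal("note") + SkipTo(semicolon) + semicolon

statementTokens = statement.parseString("note anything goes here; more text").asList()
print(statementTokens)  # ['note', 'anything goes here', ';']
```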
Forward - placeholder token used to define recursive token patterns; when defining the actual expression later in the program, insert it into the Forward object using the << operator (see fourFn.py for an example).
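A minimal recursive grammar, assumed here purely for illustration, is a list of integers in which parenthesized sublists may nest to any depth.

```python
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, nums

nested = Forward()   # placeholder, referenced before it is defined
item = Word(nums) | Group(Suppress("(") + nested + Suppress(")"))
nested << ZeroOrMore(item)   # insert the actual expression

nestedTokens = nested.parseString("1 (2 (3 4)) 5").asList()
print(nestedTokens)  # ['1', ['2', ['3', '4']], '5']
```

Because each recursive step must first consume an opening parenthesis, this grammar cannot loop without advancing through the input.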
ParseException - exception raised when a grammar parse fails; ParseExceptions have attributes loc, msg, line, lineno, and column
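These attributes can be inspected by catching the exception, as in the sketch below. The input deliberately omits the comma the grammar expects; the exact message text and location depend on the grammar and are not shown here.

```python
from pyparsing import ParseException, Word, alphas

greet = Word(alphas) + "," + Word(alphas) + "!"

failure = None
try:
    greet.parseString("Hello World!")   # missing comma
except ParseException as err:
    failure = err

print(failure.msg)                      # description of what was expected
print(failure.loc, failure.lineno, failure.column)
print(failure.line)                     # text of the line where matching failed
```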
RecursiveGrammarException - exception raised by validate() if the grammar contains a recursive infinite loop, such as:
    badGrammar = Forward()
    goodToken = Literal("A")
    badGrammar << Optional(goodToken) + badGrammar
ParseResults - class used to contain and manage the lists of tokens created from parsing the input using the user-defined parse expression. ParseResults can be accessed in a number of ways:
ParseResults can also be converted to an ordinary list of strings by calling asList(). Note that this will strip the results of any field names that have been defined for any embedded parse elements. (The pprint module is especially good at printing out the nested contents given by asList().)
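The sketch below reads the same results by position, by key, and by attribute, and then flattens them with asList(). The results names "person" and "age" are chosen only for the example.

```python
from pyparsing import Word, alphas, nums

entry = Word(alphas).setResultsName("person") + Word(nums).setResultsName("age")
results = entry.parseString("Alice 30")

print(results[0], len(results))   # list-style access
print(results["age"])             # dictionary-style access
print(results.person)             # attribute-style access
plain = results.asList()          # ordinary list; results names are dropped
print(plain)
```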
Finally, ParseResults can be converted to an XML string by calling asXML(). Where possible, results will be tagged using the results names defined for the respective ParseExpressions. asXML() takes two optional arguments: