Saturday, January 7, 2012

A Pythonic C++ Parser

If you search for "python C++ parser", you will find a variety of internet discussions about parsing C++ in Python.  C++ cannot be parsed by a LALR parser, and it is well known that parsing C++ is a nontrivial task.  Thus, these discussions generally fall into one of two categories:
  1. It is too hard to parse C++ in Python, so use a package like GCC_XML that does this for you.  If you really need to do something in Python, write a wrapper to GCC_XML.
  2. It is too hard to perform a complete parse of C++ in Python, but we can use a LALR parser to collect gross structural information from C++ files.  CppHeaderParser is an example of this type of package: it uses the PLY (Python Lex-Yacc) parser to collect information about classes in header files.
In the recent release of CxxTest, I included a LALR C++ parser that is similar to CppHeaderParser. CxxTest is a unit testing framework for C++ that is similar in spirit to JUnit, CppUnit, and xUnit. CxxTest is easy to use because it does not require precompiling a CxxTest testing library, it employs no advanced features of C++ (e.g. RTTI) and it supports a very flexible form of test discovery.

CxxTest performs test discovery by searching C++ header files for CxxTest test classes. The default discovery process is simple: it analyzes each line of a header file sequentially, looking for lines that represent class definitions and test method definitions.
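
To make the line-by-line idea concrete, here is a minimal sketch of that kind of scan in Python. It is not the actual CxxTest scanner: the regular expressions, the `discover_tests` name, and the sample header are all assumptions made for illustration, and the real tool handles many more cases.

```python
import re

# Hypothetical patterns for illustration; the real CxxTest scanner is more thorough.
CLASS_RE = re.compile(r'^\s*class\s+(\w+)')
SUITE_RE = re.compile(
    r'^\s*class\s+\w+\s*:\s*public\s+(?:CxxTest\s*::\s*)?TestSuite\b')
TEST_RE = re.compile(
    r'^\s*(?:virtual\s+)?void\s+([Tt]est\w*)\s*\(\s*(?:void)?\s*\)')


def discover_tests(lines):
    """Return {suite_name: [test_method, ...]} from a sequential line scan."""
    suites = {}
    current = None  # name of the test suite we are currently inside, if any
    for line in lines:
        m = CLASS_RE.match(line)
        if m:
            # Any class line starts a new scope; only direct TestSuite
            # subclasses are recognized by this simple scan.
            current = m.group(1) if SUITE_RE.match(line) else None
            if current is not None:
                suites[current] = []
            continue
        m = TEST_RE.match(line)
        if m and current is not None:
            suites[current].append(m.group(1))
    return suites


header = """\
class MyTestSuite : public CxxTest::TestSuite
{
public:
    void testAddition(void) { TS_ASSERT_EQUALS(1 + 1, 2); }
    void helper() {}
    void test_multiplication() { TS_ASSERT_EQUALS(2 * 2, 4); }
};

class Derived : public MyTestSuite
{
public:
    void testExtra() {}
};
"""

print(discover_tests(header.splitlines()))
```

In this sketch, `Derived` is skipped because it inherits from `CxxTest::TestSuite` only indirectly, and a scan that looks at one line at a time has no record of what `MyTestSuite` derives from.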

I added a new test discovery mechanism in CxxTest 4.0 that is based on a parser for the Flexible Object Generator (FOG) language, which is a superset of C++. The grammar for the FOG language was adapted to parse C++ header files to identify class definitions and class inheritance relationships, class and namespace nesting of declarations, and class methods. This allows CxxTest to identify test classes that are defined with complex inheritance relationships.
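
As a rough illustration of why the inheritance information matters, suppose a parser has already produced a mapping from each class name to its direct base classes. The sketch below then finds every class that derives from the test suite base class, directly or indirectly. The `find_test_suites` function and the `bases` dictionary are hypothetical and do not reflect the data structures the FOG parser in CxxTest actually uses.

```python
def find_test_suites(bases, root="CxxTest::TestSuite"):
    """Return sorted names of classes deriving, directly or indirectly, from root.

    `bases` maps each class name to a list of its direct base class names.
    """
    def derives(name, seen):
        if name in seen:  # guard against cycles in malformed input
            return False
        seen.add(name)
        for base in bases.get(name, ()):
            if base == root or derives(base, seen):
                return True
        return False

    return sorted(name for name in bases if derives(name, set()))


# Hypothetical parser output: class name -> direct base classes.
bases = {
    "MyTestSuite": ["CxxTest::TestSuite"],
    "Derived": ["MyTestSuite"],
    "Mixin": [],
    "Combined": ["Mixin", "Derived"],
    "Unrelated": ["Mixin"],
}

print(find_test_suites(bases))
```

Here `Derived` and `Combined` are reported alongside `MyTestSuite`, even though neither names `CxxTest::TestSuite` in its own declaration.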

As I noted earlier, the CxxTest FOG parser is similar to the parser in CppHeaderParser.  Based on my limited knowledge of CppHeaderParser, here are some points of contrast between these two capabilities:

  1. The FOG parser is embedded in CxxTest, while the CppHeaderParser is a stand-alone package.  Although I implemented the FOG parser as a separate component in CxxTest, I did not have specific design requirements that led me to make this a separate package.  (Interested parties should give me a buzz...)
  2. The FOG parser is specifically focused on the features required by CxxTest, and thus it does not parse out much of the information that CppHeaderParser provides (return values, argument types, etc.).
  3. The FOG parser was specifically designed to capture class inheritance relationships.  It is not clear to me that the CppHeaderParser does this.
  4. The FOG parser is based on a superset of C++.  Thus, it can robustly parse C++ method and function definitions.  The examples provided by CppHeaderParser suggest that it can parse function and method declarations, but not headers that include their definitions.  (Of course, the FOG parser ignores these definitions, but that's the point.  The parser can do that.)
  5. The FOG parser has been tested on a large set of C and C++ test files that are used to test the ELSA compiler.  This is a much more extensive test suite than is used to develop CppHeaderParser.
The point of this comparison is that the FOG parser may be of interest for other C++ parsing applications.  It has not been developed for general use, but it could easily be adapted to provide a more general capability.  

1 comment:

  1. Yet another possibility is to use libclang, which has Python bindings. Clang is a full-featured, fast, production-quality (used in Xcode, for example) front end for C/C++/Objective-C. Unlike GCC, it is designed to be used as a library. However, it's obviously not a pure Python solution.