GSoC 2010: Multiple language support for autodoc in Sphinx via ANTLR

Problem description

The Sphinx documentation system is widely used by the Python community to document projects. The autodoc extension is especially useful because it allows external documentation (outside the source code) and internal documentation (inside the source code) to be mixed elegantly. However, autodoc currently works only with Python. Better support for other languages is being developed in the sphinx-domains branch, but that work will still not provide autodoc for other languages out of the box.

This project will add autodoc support for multiple languages. It does so with the ANTLR parser generator, which can generate parsers for programming languages; those parsers are then used to extract documentation from source code.

A successful implementation of this project will establish Sphinx as the ultimate documentation system for languages supported by ANTLR, and will also pave the way for using other parsers to generate documentation in Sphinx.

Use cases

Case 1: A developer using Pylons for an AJAX-heavy website wants to document her static JavaScript files inline (she obviously already uses Sphinx for her Python codebase). She would be able to use autodoc for the inline JavaScript documentation, and she could also write JavaScript documentation in ReST format.
Case 2: A project [1] is largely written in Python but has certain critical components written in C for increased performance. The project could write inline documentation for the C code in ReST, the format all its developers are familiar with.
Case 3: A project is largely written in C but is scriptable with Python (and a few other languages!) [2]. Rather than using a different documentation system for each scripting language it supports, the project could use Sphinx for all of them.
Case 4: A PHP developer wants to document his project but finds the documentation systems Doxygen and phpDocumentor lacking [3]. Namely, he does not like the format used to write the documentation (either DocBook or a loosely defined proprietary format) or the way those systems integrate external documentation. He would be delighted that Sphinx accommodates all his needs.

Implementation details

sphinx.ext.autodoc will be refactored to separate out its only current source of documentation, i.e. live inspection of Python objects. The original Python implementation can therefore still be used (it will be included by default, just as the py: domain will be a default domain). Existing autodoc tests should work with little modification.

A new internal module will be created. It will reuse or import parts of sphinx.ext.autodoc wherever applicable and serve as a generic autodoc-from-ANTLR base module.
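One way to picture the separation described above is a small registry of interchangeable documentation sources. The following is only a sketch under stated assumptions: the names DocEntry, AutodocSource, PythonInspectSource, SOURCES and register_source are all hypothetical and are not part of Sphinx; the only real API used is the standard library's inspect module.

```python
import inspect
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class DocEntry:
    """One documented object, independent of where it came from (hypothetical)."""
    domain: str       # e.g. "py", "c", "js"
    kind: str         # e.g. "function", "class"
    name: str
    docstring: str
    members: list = field(default_factory=list)


class AutodocSource(ABC):
    """Hypothetical base class: one way of obtaining documentation."""

    @abstractmethod
    def collect(self, target):
        """Return a DocEntry describing the given target."""


class PythonInspectSource(AutodocSource):
    """Stands in for the current behaviour: live inspection of Python objects."""

    def collect(self, target):
        kind = "class" if inspect.isclass(target) else "function"
        return DocEntry("py", kind, target.__name__, inspect.getdoc(target) or "")


# Hypothetical registry mapping a source name to a source instance.
SOURCES = {}


def register_source(name, source):
    SOURCES[name] = source


# The Python source is registered by default; an ANTLR-based source
# would be registered in the same way.
register_source("py", PythonInspectSource())


def example(x):
    """Return x unchanged."""
    return x


entry = SOURCES["py"].collect(example)
print(entry.kind, entry.name, entry.docstring)
```

With this shape, the generic autodoc code only ever sees DocEntry objects, so it does not need to know whether they came from live inspection or from a parser.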

For language X to be supported by the autodoc-from-ANTLR module, someone needs to provide a language support extension for it. The extension depends on the new internal module and contains:

  1. either
    • the ANTLR-compatible EBNF grammar of language X, or
    • the ANTLR-generated parser [4];
  2. a rule to extract documentation tokens (which are most likely just comments with some restrictions) given an AST;
  3. a rule to extract documentable tokens (e.g. classes, functions, attributes) given an AST, plus additional rules, e.g. whether a token is "class-like" (one that automatically documents its member documentable tokens).
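The three items above could be bundled into a single declarative object. This is a minimal sketch, not a proposed final interface: the class LanguageSupport, its field names, the grammar file name "C.g" and the "/**" doc-comment convention are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class LanguageSupport:
    """Hypothetical description of one language support extension."""
    name: str
    grammar_file: Optional[str]             # item 1: ANTLR grammar, or ...
    parser_module: Optional[str]            # item 1: ... an already generated parser
    is_doc_comment: Callable[[str], bool]   # item 2: which comments are documentation
    documentable_kinds: frozenset           # item 3: node kinds that can be documented
    class_like_kinds: frozenset             # item 3: kinds whose members are documented too

    def __post_init__(self):
        # At least one way of obtaining a parser must be supplied.
        if self.grammar_file is None and self.parser_module is None:
            raise ValueError("provide grammar_file or parser_module")


# Illustrative values only; a real C extension would need more care.
C_SUPPORT = LanguageSupport(
    name="c",
    grammar_file="C.g",
    parser_module=None,
    is_doc_comment=lambda text: text.startswith("/**"),
    documentable_kinds=frozenset({"function", "struct", "typedef"}),
    class_like_kinds=frozenset({"struct"}),
)

print(C_SUPPORT.is_doc_comment("/** Add two integers. */"))
```

Keeping the extension as plain data plus small predicates means the generic module can do all the tree walking itself.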

ANTLR-compatible EBNF grammars for most popular languages are readily available on the internet, and the rules that define the tokens should be quite straightforward to write.

From this data ANTLR is able to generate a Python-based parser for language X. This parser can generate an Abstract Syntax Tree (AST) for source code in language X; the extension then walks this AST and generates internal Sphinx data structures as necessary, guided by the rules defined in the language support extension.
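The walk described above can be sketched without any ANTLR dependency. The Node class below is a hypothetical stand-in for whatever tree type the generated parser produces, and the pairing rule (a documentation comment applies only to the documentable node that immediately follows it) is an assumption, not a settled design.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified AST node."""
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


def extract(node, doc_kinds, class_like, is_doc_comment):
    """Pair each documentation comment with the next documentable sibling."""
    entries = []
    pending = None  # documentation comment waiting for its node
    for child in node.children:
        if child.kind == "comment":
            pending = child.text if is_doc_comment(child.text) else None
        elif child.kind in doc_kinds:
            entry = {"kind": child.kind, "name": child.text,
                     "doc": pending or "", "members": []}
            if child.kind in class_like:
                # Class-like nodes document their members recursively.
                entry["members"] = extract(child, doc_kinds, class_like,
                                           is_doc_comment)
            entries.append(entry)
            pending = None
        else:
            pending = None  # anything else breaks the comment/node pairing
    return entries


tree = Node("unit", children=[
    Node("comment", "/** Add two integers. */"),
    Node("function", "add"),
    Node("comment", "/* not documentation */"),
    Node("function", "helper"),
    Node("comment", "/** A 2D point. */"),
    Node("struct", "point", [
        Node("comment", "/** Horizontal position. */"),
        Node("field", "x"),
    ]),
])

entries = extract(tree, {"function", "struct", "field"}, {"struct"},
                  lambda text: text.startswith("/**"))
for e in entries:
    print(e["kind"], e["name"], repr(e["doc"]), len(e["members"]))
```

In the real extension the resulting entries would be turned into Sphinx data structures rather than plain dictionaries.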

Moreover, a Sphinx domain for language X can also be automatically defined from the data, thus eliminating the need to manually create domains altogether [5].
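As a rough illustration of deriving a domain from the same data, the sketch below turns a language name and its documentable kinds into the directive and role names a generated domain might expose. The function describe_domain and the naming scheme are hypothetical; a real implementation would build an actual Sphinx domain object rather than a dictionary.

```python
def describe_domain(language, documentable_kinds):
    """Return a plain description of a domain derived from the extension data."""
    kinds = sorted(documentable_kinds)
    return {
        "name": language,
        # One directive and one cross-reference role per documentable kind.
        "directives": [f"{language}:{kind}" for kind in kinds],
        "roles": [f":{language}:{kind}:" for kind in kinds],
    }


domain = describe_domain("c", {"struct", "function"})
print(domain["directives"])
print(domain["roles"])
```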

The internals will also be implemented in such a way that parsers not generated by ANTLR can be supported, should the ANTLR parser not suffice. For example, Clang is a C, C++, Objective-C and Objective-C++ front end for the LLVM compiler that might provide useful parsing and diagnostic features to Sphinx users documenting code in those languages.

Rationale for using ANTLR

Deliverables for this project

  • a refactored or more extensible autodoc module that supports multiple autodoc sources;
  • a new ANTLR autodoc source module;
  • a new module that can automatically define domains provided the language support extension is available;
  • C and Javascript implementations of the language support extension as proof-of-concept;
  • documentation on creating new language support extensions.

Plan

I will push all of my work to Bitbucket. I will start coding from the sphinx-domains branch (unless the branch gets merged into main before I start the project, which will not matter much thanks to the wonders of Mercurial merging). I will always try to "release early, release often".

Communication with the mentors and the community about technical issues will mostly take place on IRC at #pocoo@freenode and perhaps by email. Depending on the severity of the concerns at hand and the response on IRC, I may also write to sphinx-dev@googlegroups.com, especially when conceptual or architectural decisions are being made.

Documentation will be written alongside the code; tests and additional documentation will be written at the specific points given in the Timeline section below. The current platforms for documentation (Sphinx itself) and tests (Nose) will be used.

Timeline

I will start the project on 3 May, which is the last day of my final examinations.

The first month might look a bit sparse, because I have a compulsory part-time attachment from my faculty (i.e. becoming cheap labour). I believe it would not affect my performance significantly.

3 -- 9 May
Become more familiar with the code (and the community!) and branch from the sphinx-domains branch.
10 -- 23 May
Refactor autodoc to separate the Python-specific autodoc source from the autodoc module. Ensure that current tests work.
24 May -- 6 June
Build domains support in autodoc, where autodoc sources can specify the domains of the documentation they generate. Ensure current tests work.
7 -- 13 June
Write additional tests for the domain support of autodoc. Also improve on tests and documentation generated so far.
14 -- 27 June
Study the ANTLR parser, its generated AST (Abstract Syntax Tree), and ways to manipulate it from Python. Then create the ANTLR autodoc source and a script that generates a parser, provided that the ANTLR binaries are available.
28 June -- 4 July
Generate a C parser from ANTLR and build the language support module for C.
5 -- 11 July
Add and refine tests and documentation for the parser generator script, ANTLR autodoc source, and C implementation.
12 -- 25 July
Create a module to automate generating new language domains using the information provided in the language support extension; test and document accordingly.
26 July -- 8 August
Create the JavaScript implementation of the language support extension; document and test accordingly.
9 -- 16 August
Write more tests and documentation. A guide to creating a language support module will also be written.

About me

My name is Leontius Adhika Pradhana. I am a second-year pharmacy student at the National University of Singapore, and I lived in Indonesia before university. Despite the somewhat unrelated major, I have been programming since grade 9 of formal school and have won several national programming competitions; one of my projects (a web application to manage a programming contest) competed at Asia-Pacific level. These competitions and projects were mostly done in Pascal, C, C++, Java, or PHP.

At university I joined some informatics-related activities and gained exposure to C# / ASP.NET, more PHP, and working with and leading large teams.

Python is my current favourite language, and I have used it for a whole host of one-time-use scripts such as tabular data / CSV processing, backup scripts, and pulling data from the internet (BeautifulSoup rocks!). I have also dabbled with Pylons, although I have not done any substantial work with it. I am using Sphinx for an internal project (an intranet). I am also adept at general web development (HTML, CSS, JavaScript, SQL), as a lot of my projects depend on those skills.

So far my experience with open source is largely with Drupal. There I discussed issues and submitted patches. I also authored two Drupal modules: Pingback and IP2Nation API.

A semi-exhaustive CV (not the ones that you send to companies) can be viewed at http://tr.im/cvleon.

Contact

Blog: http://leapon.net/
Email: leon@leapon.net
IRC: leonth in irc://irc.freenode.net:6667
Google Talk: leontius@gmail.com
Phone: +6584256806 (Singapore, UTC+8)
[1] Mercurial and SciPy are examples of such projects.
[2] CPython and openoffice.org are examples of such projects.
[3] This was me at some point in time :)
[4] This way, end users do not even need to touch ANTLR. I would imagine another repository maintaining generated parsers for commonly used languages, so that people wishing for support for those languages would only need to install the pure Python extension.
[5] The need to write domains will be replaced by the need to write the "language support" extension, but this will enable autodoc support out of the box.