Python code quality – building a tokenizer and rules for sklearn, or maybe a recursive-descent parser

by Security Dude

sklearn hack-a-thon vs Hurricane Sandy

I was at a Hack-a-thon for sklearn hours before Hurricane Sandy. I remember walking through Grand Central Station, and people were scrambling to get out of NYC. I was scrambling home to pick up beer and pop tarts, because Corey told me that it's good hurricane food. When I got to the grocery store, I couldn't find anything but beer and pop tarts left on the shelves. It was a madhouse.

In the scramble home, I put the day's conversation with Jake (sklearn dude) on the back burner.

Foreground (fg)

Jake and I had set up sklearn and discussed how a n00b could contribute to it. I'm really new to the Machine Learning and Data Mining disciplines, so I found it really hard to contribute to a library that I barely understand. Most of the documentation bugs were about cleaning up example code in the sklearn tutorials. You definitely have to have a great handle on the library and on ML/DM to help out.

Documentation woes

Jake had mentioned that most libraries have difficulty enforcing documentation standards on submissions. I have dug through the Twisted code and submitted a couple of patches to clean up docstrings and the like, so I have seen some of the challenges of OSS development. Some of my research has led me to understand that to tackle this challenge, one might build a parser to manage “docstring standardization”.
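As a rough sketch of the idea (my own illustration, not code from sklearn or Twisted), Python's built-in ast module can walk a source file and flag functions and classes that are missing a docstring entirely:

import ast

def find_missing_docstrings(source, filename="<string>"):
    """Return (lineno, name) pairs for functions/classes without a docstring."""
    tree = ast.parse(source, filename=filename)
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            if ast.get_docstring(node) is None:
                problems.append((node.lineno, node.name))
    return problems

if __name__ == "__main__":
    code = '''
def documented():
    """I have a docstring."""

def undocumented():
    pass
'''
    for lineno, name in find_missing_docstrings(code):
        print("line %d: %s is missing a docstring" % (lineno, name))

Checking the style inside a docstring (summary line, blank line, parameter descriptions) is where it starts to feel like a real parsing problem.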

Parsers are programs that we build to process text. This text could be a set of encoded notations in log files (say, looking for IP addresses), node-edge descriptions showing the interconnections of a graph, or HTML tags in a web page. In each case, the parser processes a specific set of character groups and patterns.
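For example, here is a minimal sketch of the log-file case, pulling IPv4-looking addresses out of a line with the standard re module (the log line itself is made up for illustration):

import re

# Rough IPv4 pattern: four dot-separated groups of 1-3 digits.
# It will happily match out-of-range octets like 999.1.1.1 -- fine for a sketch.
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

log_line = "Oct 29 22:14:01 sshd[4242]: Failed login from 192.168.1.50"
print(IP_PATTERN.findall(log_line))  # ['192.168.1.50']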

Regular expressions are commonly used for parsing text. Some Python programs tokenize the input and apply rules after a match. My instincts tell me that this is something every project would benefit from. Based on the PHP CodeSniffer project, I will look into how to build a code sniffer for Python. Another approach is to build a recursive-descent parser; I have found the pyparsing module and the PLY module, which I will look at. Some other programs, like pep8.py, pylint and pychecker, can parse code and find violations of the rules.
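To make the tokenize-then-parse idea concrete, here is a minimal sketch of my own (the grammar and names are invented for illustration, not taken from pyparsing or PLY): a regex tokenizer feeding a hand-written recursive-descent parser for "+"/"-" arithmetic.

import re

# Tokenizer: \s* skips whitespace, then match a number or a single operator char.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")

def tokenize(text):
    tokens = []
    for number, op in TOKEN_RE.findall(text):
        if number:
            tokens.append(("NUM", int(number)))
        elif op.strip():  # skip any stray whitespace matches
            tokens.append(("OP", op))
    tokens.append(("END", None))
    return tokens

# Recursive descent for the grammar:
#   expr := term (("+" | "-") term)*
#   term := NUM
class Parser(object):
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expr(self):
        value = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            _, op = self.advance()
            value = value + self.term() if op == "+" else value - self.term()
        return value

    def term(self):
        kind, value = self.advance()
        if kind != "NUM":
            raise SyntaxError("expected a number, got %r" % (value,))
        return value

print(Parser(tokenize("1 + 2 - 3 + 10")).expr())  # prints 10

pyparsing and PLY take over the bookkeeping parts of this (token definitions, grammar rules, error reporting), which is why they are on my list to look at.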

Bunch-o-links (one day I would like to do a tutorial)

http://pear.php.net/package/PHP_CodeSniffer/redirected

http://www.logilab.org/857

http://pychecker.sourceforge.net/

https://github.com/cburroughs/pep8.py

http://www.doughellmann.com/articles/pythonmagazine/completely-different/2008-03-linters/index.html

http://pyparsing.wikispaces.com/HowToUsePyparsing

http://cutter.rexx.com/~dkuhlman/python_201/python_201.html#SECTION007600000000000000000

http://wiki.python.org/moin/PyCon2006/Talks#4

http://onlamp.com/pub/a/python/2006/01/26/pyparsing.html

http://www.gooli.org/blog/a-simple-lexer-in-python/

http://www.dabeaz.com/ply/

http://www.dabeaz.com/ply/PLYTalk.pdf

http://www.ptmcg.com/geo/python/confs/pyCon2006_pres2.html

http://sigusr2.net/2011/Apr/18/parser-combinators-made-simple.html

http://effbot.org/zone/simple-top-down-parsing.htm

http://javascript.crockford.com/tdop/tdop.html
