Python Code quality – building tokenizer and rules for sklearn or maybe recursive-descent parser

by Security Dude

sklearn hack-a-thon vs Hurricane Sandy

I was at a hack-a-thon for sklearn hours before Hurricane Sandy. I remember walking through Grand Central Station while people were scrambling to get out of NYC. I was scrambling home to pick up beer and pop tarts, because Corey told me that it’s good hurricane food. When I got to the grocery store, I couldn’t find anything but beer and pop tarts left on the shelves. It was a madhouse.

In the scramble home, I put the day’s conversation with Jake (sklearn dude) on the back burner.

Foreground (fg)

Jake and I had set up sklearn and discussed how a n00b could contribute to it. I’m really new to the Machine Learning and Data Mining disciplines, so I found it really hard to contribute to a library that I barely understand. Most of the documentation bugs were about cleaning up example code in the sklearn tutorials. You definitely need a great handle on the library and the ML/DM disciplines to help out.

Documentation woes

Jake had mentioned that most libraries have difficulty enforcing documentation standards on submissions. I have dug through the Twisted code and submitted a couple of patches to clean up docstrings etc., so I have seen some of the challenges of OSS development. Some of my research has led me to understand that, to meet this challenge, one might build a parser to manage “docstring standardization”.

Parsers are programs that we build to process text. The text could be encoded notation in log files (IP addresses, for example), node-edge descriptions of the interconnections in a graph, or HTML tags in a web page. In each case, the parser processes a specific set of character groups and patterns.
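To make the log file case concrete, here is a minimal sketch of pulling IP addresses out of a line of text. The log line is made up for illustration, and the `find_ips` helper is my own name, not something from a library. The regex only finds things that look like dotted quads; the standard library's `ipaddress` module then acts as the "rule" that throws out impossible addresses.

```python
import ipaddress
import re

# Rough dotted-quad pattern. It does NOT check that each octet is <= 255,
# so every candidate is validated afterwards.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_ips(line):
    """Return the valid IPv4 addresses found in one line of text."""
    found = []
    for candidate in IPV4_CANDIDATE.findall(line):
        try:
            ipaddress.ip_address(candidate)  # raises ValueError if invalid
        except ValueError:
            continue
        found.append(candidate)
    return found


# Hypothetical web server log line, invented for this example.
log_line = '203.0.113.7 - - [29/Oct/2012:14:02:11 -0400] "GET /index.html HTTP/1.1" 200 512'
print(find_ips(log_line))  # ['203.0.113.7']
```

Splitting the job into "match first, then apply a rule" keeps the regex short and readable, at the cost of a second pass over each match.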

Regular expressions are commonly used for parsing text. Some Python programs tokenize the input and apply rules after a match. My instincts tell me that this is something every project would benefit from. Based on the Code Sniffer project, I will look into how to build a code sniffer for Python. Another approach is to build a recursive-descent parser. I have found the pyparsing module and the PLY module, which I will look at. Other programs, like pylint and pychecker, can parse code and find violations of the rules.
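As a rough idea of what "tokenize, then apply a rule" could look like for the docstring problem, here is a small sketch built on the standard library's `tokenize` module. This is not how Code Sniffer, pylint or pychecker work internally; it is just one rule I made up (every `def` or `class` body must start with a string), and `find_missing_docstrings` and `SAMPLE` are hypothetical names for this example. One known shortcut: one-line bodies such as `def f(): pass` are always reported as undocumented.

```python
import io
import tokenize


def find_missing_docstrings(source):
    """Return (line, name) pairs for each def/class whose body lacks a docstring."""
    # Comments and blank lines (NL tokens) carry no structure, so drop them.
    tokens = [
        tok
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in (tokenize.COMMENT, tokenize.NL)
    ]
    violations = []
    for i, tok in enumerate(tokens):
        if tok.type != tokenize.NAME or tok.string not in ("def", "class"):
            continue
        name = tokens[i + 1].string
        # Skip to the NEWLINE that ends the (possibly multi-line) header.
        j = i + 1
        while tokens[j].type != tokenize.NEWLINE:
            j += 1
        # Rule: the body must open with INDENT followed directly by a STRING.
        body = tokens[j + 1 : j + 3]
        has_docstring = (
            len(body) == 2
            and body[0].type == tokenize.INDENT
            and body[1].type == tokenize.STRING
        )
        if not has_docstring:
            violations.append((tok.start[0], name))
    return violations


SAMPLE = '''
def documented():
    """Say hello."""
    return "hello"

def undocumented(x,
                 y):
    # a comment is not a docstring
    return x + y

class Widget:
    def method(self):
        return 1
'''

for line, name in find_missing_docstrings(SAMPLE):
    print("line %d: %s has no docstring" % (line, name))
```

Working on tokens instead of raw lines means multi-line signatures and comment-only lines do not confuse the rule, which is the main reason to tokenize before matching.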

Bunch-o-links (one day I would like to do a tutorial)