Tuesday, 4 February 2014

Python, PyBrain, Cython and CyBrain

The best thing about Python is that it's a diverse language: some use it to create commercial web apps in Django, some to teach programming, and some in the scientific community want an open source replacement for Matlab.
When I first came into contact with Python I belonged to this last group; my main motivation for using the language was its growing popularity in Artificial Intelligence and Machine Learning. In this terrain, packages like NumPy, SciPy and the SciKits collection have been doing a great job of giving the scientific community consistent tools.

For Data Mining and Machine Learning, one of the things Python offers is Scikit-Learn, an extremely well documented module with some YouTube tutorials and a very smart team behind it. When I first started to wander the machine learning world, my main interest soon became neural networks, and not wanting to depend on Matlab I put my hopes in Scikit-Learn. But things didn't go so smoothly. It seems they initially had plans to support neural nets and at some point began to implement them, but soon decided not to and passed the ball to PyBrain, a package specialised in NN and AI.

After searching many forums and trying many modules, I finally settled on PyBrain. Its "modular" philosophy is great: you build neural networks as if they were Legos by creating layers, connecting them in any (consistent) order you want, adding them to a network, and finally training it. In the long run, a package like this is needed to do modern machine learning because it lets you create deep networks with a custom architecture.
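
To make the Lego idea concrete, here is a minimal pure-Python sketch of that workflow: create layers, connect them, add them to a network, and run a forward pass. The class and method names (`Layer`, `FullConnection`, `Network`, `add_layer`, `connect`, `activate`) are made up for this illustration and are not PyBrain's actual API; the sketch also stops before training.

```python
import math
import random


class Layer:
    """A named group of neurons that share one activation function."""

    def __init__(self, name, size, activation=lambda x: x):
        self.name = name
        self.size = size
        self.activation = activation


class FullConnection:
    """Links every neuron of `source` to every neuron of `target`."""

    def __init__(self, source, target, rng):
        self.source = source
        self.target = target
        # weights[i][j] links source neuron i to target neuron j.
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(target.size)]
                        for _ in range(source.size)]


class Network:
    """Collects layers and connections, then runs a forward pass."""

    def __init__(self):
        self.layers = []
        self.connections = []

    def add_layer(self, layer):
        self.layers.append(layer)
        return layer

    def connect(self, source, target, rng):
        self.connections.append(FullConnection(source, target, rng))

    def activate(self, inputs):
        # Assumption: layers were added in feed-forward order, and the
        # first layer simply passes the inputs through unchanged.
        outputs = {self.layers[0].name: list(inputs)}
        for layer in self.layers[1:]:
            sums = [0.0] * layer.size
            for conn in self.connections:
                if conn.target is layer:
                    for i, value in enumerate(outputs[conn.source.name]):
                        for j in range(layer.size):
                            sums[j] += value * conn.weights[i][j]
            outputs[layer.name] = [layer.activation(s) for s in sums]
        return outputs[self.layers[-1].name]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(0)  # fixed seed so the weights are reproducible
net = Network()
inp = net.add_layer(Layer("in", 2))
hidden = net.add_layer(Layer("hidden", 5, sigmoid))
out = net.add_layer(Layer("out", 1, sigmoid))
net.connect(inp, hidden, rng)
net.connect(hidden, out, rng)

result = net.activate([0.5, -0.25])
print(result)  # a single sigmoid output between 0 and 1
```

Note that every multiplication happens inside nested interpreted loops, one Python operation per weight.
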
As pretty as PyBrain may be, it has a huge Achilles heel: PYTHON IS SLOW! PyBrain is written in pure Python and you will hit a dead end if you need wings. The truth is that large scale neural nets are one of those real world examples where you can have a function with millions of parameters that must be optimised by running through thousands of training cases. But you don't need to go to that extreme to hit the wall: just create a network with 2 or 3 hidden layers, each with about 5 neurons, and you will feel the pain of waiting many seconds for the console to show the answer. Because of this limitation, PyBrain is at best an educational (maybe not even scientific) package, since using it in a real world scenario would be unreasonable.

PyBrain's philosophy is great, but PyBrain itself may not suit my purposes. That is why I decided to start CyBrain, a neural networks module inspired by PyBrain and written in Cython. For those who don't know what Cython is, I will just say it's an inch away from being the "perfect language". Formally, Cython is a superset of the Python language that compiles Pythonic code to optimised C. Superset means that (except for generators) every Python statement is a valid Cython statement, but not every Cython statement is a valid Python statement. The real deal is that Cython gives you the opportunity to mix Pythonic code and pseudo-C code in any way you want; specifically, Cython lets you write C TYPES!

When you write Cython code you feel you are connecting two foreign realms, and at first it is thrilling and confusing. The first thing you automatically do is test the speed; it's like driving a Ferrari: even if you don't like cars you are bound to hit the accelerator. Cython is fast, C fast. The start is a little rough: since Cython is a compiled language, you have to arrange all the parts in a setup script. The documentation helps in this first stage, but I really took off after a 4 part YouTube tutorial from a guy at Enthought. Get past that initial trial, relax and watch your Python code run from 1.4x to 7x faster; then add some types and feel the wind of 100x+ speed!!!

Back to CyBrain: I just finished the basic parts of what you could call a minimum viable product. I don't know a lot about PyBrain's internals; I did download the project since it's open source, but looking at unfamiliar code becomes boring after a few minutes unless you want to fix it. My main design inspiration came from a section in the docs that teaches you to create your own custom neurons by subclassing the Neuron class and overriding some functions; these functions were the ones that gave me the hints.
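
As a rough illustration of that subclass-and-override pattern, here is a small pure-Python sketch. The base class, its `activation` and `derivative` hooks, and the `forward`/`backward` methods are hypothetical names chosen for this example; they are not taken from PyBrain or CyBrain.

```python
import math


class Neuron:
    """Hypothetical base class: subclasses override the two hooks below."""

    def __init__(self):
        self.z = 0.0  # weighted input seen on the last forward pass
        self.y = 0.0  # output after applying the activation function

    def activation(self, z):
        return z  # linear by default

    def derivative(self, z):
        return 1.0  # slope of the linear activation

    def forward(self, z):
        self.z = z
        self.y = self.activation(z)
        return self.y

    def backward(self, error):
        # Chain rule: scale the incoming error by the local slope.
        return error * self.derivative(self.z)


class TanhNeuron(Neuron):
    """Custom neuron: only the two hooks change, the plumbing is inherited."""

    def activation(self, z):
        return math.tanh(z)

    def derivative(self, z):
        return 1.0 - math.tanh(z) ** 2


neuron = TanhNeuron()
output = neuron.forward(0.0)     # tanh(0) = 0
gradient = neuron.backward(2.0)  # slope at 0 is 1, so the error passes through
print(output, gradient)
```
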
In Cython you can do some nasty tricks like using pointers... pointers!!! This is heaven and hell at the same time. You have to malloc them (ahh!!!), but then you can insert them into C++ vectors (which Cython supports) and, for free, have variables like floats act like modern objects. This might not seem directly useful, but state sharing is very efficient for some applications: the weight sharing technique is really easy with this.
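
The state sharing idea can be sketched in plain Python, with a mutable object standing in for the pointer. This is only an analogy: CyBrain works with C pointers stored in C++ vectors, which this example does not reproduce, and the `Weight` and `Connection` classes are hypothetical names for illustration.

```python
class Weight:
    """A mutable box around one float, so several owners can share it."""

    def __init__(self, value):
        self.value = value


class Connection:
    """Holds a reference to a Weight rather than a copy of its value."""

    def __init__(self, weight):
        self.weight = weight

    def transmit(self, signal):
        return signal * self.weight.value


shared = Weight(0.5)
first = Connection(shared)
second = Connection(shared)  # same Weight object: this is the weight sharing

before = (first.transmit(2.0), second.transmit(4.0))

# A single update to the shared weight...
shared.value -= 0.25

# ...is seen by both connections without copying anything.
after = (first.transmit(2.0), second.transmit(4.0))
print(before, after)
```

A bare Python float cannot be shared this way because floats are immutable, which is why the box object is needed here.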

Anyway, this was a long digression. I haven't compared speeds yet, but the results seem promising. If you want to fork the code, go ahead, here is the GitHub link to CyBrain. Feedback is welcome.