The Python language has become one of the premier computational languages for scientific research on account of its many useful in-built data handling methods. Additionally, there are a number of science-oriented packages that rival industry-standard computational packages (I’m mainly thinking of Matlab). The most popular add-on Python science packages are NumPy and SciPy.
Python has a steeper learning curve than Matlab, but once the user has gained enough experience there’s a surprising wealth of modules that can be wielded for powerful results. Many of these Python add-ons came from academic institutions that decided to release their tools to the Python community for free use.
Data mining in Python has become very popular. The two tools I am briefly reviewing here are OpenCV and scikits.learn.
I have already benefited from OpenCV, an open source machine vision package. The package is actually a collection of C++ libraries, but Boost Python wrappers have been written to open up the libraries to Python. Its machine learning algorithms include boosting, decision tree learning, expectation-maximization, k-nearest neighbors, the naive Bayes classifier, artificial neural networks, random forests, and support vector machines (SVMs).
I’ve also recently come across scikits.learn. This is a more general-purpose collection of machine learning modules written for Python. As of this writing the project is relatively new, but it already has a well-developed set of supervised learning modules (support vector machines and generalized linear models) and is developing a set of unsupervised learning modules (clustering, Gaussian mixture models, manifold learning, ICA, and Gaussian processes).
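To give a feel for what the supervised learning modules look like in use, here is a minimal sketch of training a support vector machine. It assumes the present-day `sklearn` import name (the successor to `scikits.learn`; older releases may need a different import path), and the toy data points are made up purely for illustration.

```python
from sklearn import svm

# Made-up toy data: two well-separated groups of 2-D points.
X_train = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
           [2.0, 2.0], [2.1, 1.8], [1.9, 2.2]]
y_train = [0, 0, 0, 1, 1, 1]  # class label for each point

# Fit a linear-kernel support vector classifier on the labeled points.
clf = svm.SVC(kernel="linear")
clf.fit(X_train, y_train)

# Classify two unseen points, one near each group.
predictions = clf.predict([[0.1, 0.1], [2.0, 2.1]])
print(predictions)
```

The same fit/predict pattern applies to the other supervised estimators in the package, so swapping in a different classifier is usually a one-line change.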
I should point out that SciKits is actually a group of modules (scikits.learn being one) built using SciPy. It includes a statistical computation module, image processing routines, and vector plotting algorithms, among many, many others.
Are there any data-mining/pattern recognition Python packages that you can add to this list?
A big thanks to Ben Racine who alerted me to:
- Machine Learning Python, aka “mlpy”. This package has been developed at an Italian research center, Fondazione Bruno Kessler. From the package homepage, I see it includes: SVM (Support Vector Machine) and KNN (K Nearest Neighbor); FDA, SRDA, PDA, DLDA (Fisher, Spectral Regression, Penalized, Diagonal Linear Discriminant Analysis) for classification and feature weighting; I-RELIEF, DWT and FSSun for feature weighting; RFE (Recursive Feature Elimination) and RFS (Recursive Forward Selection) for feature ranking; OLS, (Kernel) Ridge Regression, LASSO, LARS, Gradient Descent and Elastic Net for regression; DWT, UWT, CWT (Discrete, Undecimated, Continuous Wavelet Transform); KNN imputing; DTW (Dynamic Time Warping); Hierarchical Clustering, k-medoids and k-means; Resampling Methods; Metric Functions; and Canberra indicators.
- Machine Learning Toolkit, aka “milk”, was written by Luis Pedro Coelho and is released under the MIT license. It focuses on supervised classification via SVMs (based on libsvm), k-NN, random forests, and decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems. For unsupervised learning, milk supports k-means clustering and affinity propagation.
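Since k-means clustering shows up in several of these packages, here is a rough pure-Python sketch of the idea (Lloyd's algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its points). This is not the milk or mlpy implementation; the `kmeans` function, its parameters, and the sample points are all made up for illustration.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Hypothetical helper: plain Lloyd's algorithm on a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, labels

# Made-up data: three points near the origin, three near (5, 5).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The library versions will be much faster and handle edge cases (initialization, convergence checks) more carefully, but the two alternating steps are the heart of the method.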