Archive for the 'data mining' Category

Earthquakes vs. Time of Day

66,725 Earthquakes from 1973 - 2011 with color indicating magnitude

Time and tide wait for no one. Add to that: earthquakes. I live in the San Francisco Bay Area, a.k.a. “earthquake country”, in a small house built in the 1950s before earthquake building codes had been created. Within the next 30 years, the USGS tells us we can expect a “big one” in the East Bay right along the fault where I dwell.

So here’s my question: If there’s a 30-year window for the next big one to occur, can I at least know the most likely time of day? This is not a crazy question. The time of day is really just a way of expressing where the sun is located with respect to your geographical location. The moon is responsible for sweeping the tides all around the Earth. So it seems reasonable to think that the moon or the sun may at least be an influence in “Earth tides” which might act as a trigger for earthquakes. Here’s a quick sketch showing my hypothesis:

This question turned out to be fairly straightforward to answer, so I’ll cut to the chase: no, not really. Some hours are a teeny bit more earthquake-prone than others, but the variations proved to be statistically insignificant.

The Method

I downloaded all earthquakes of magnitude 5 or greater from 1973 through mid-2011 from the USGS Global Earthquake Search website. This gave me a list of 66,725 earthquakes, a reasonably sized dataset. I mapped the positions and color-coded magnitudes of all 66,725 earthquakes (green = mag 5.0 up through red = mag 9.1), shown at the top of this blog entry.

It’s an interesting fact that these earthquakes span a time period of 338,117 hours, which works out to an average of about 0.2 earthquakes (mag 5 or greater) per hour, or roughly a 20% chance of one during any given hour. The chance during any hour for a magnitude 6 or greater earthquake drops to only 1.6%. By the time you get to magnitude 7 or greater it’s much less than 1% per hour.
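As a rough check on the hourly figure, here is a minimal Python sketch. The two counts come from the post; treating earthquakes as a Poisson process (independent events at a constant rate) is my assumption, and it is only an approximation because aftershocks cluster in time.

```python
import math

N_QUAKES = 66_725   # magnitude >= 5 events, from the post
N_HOURS = 338_117   # length of the catalogue in hours, from the post

def hourly_rate(n_events: int, n_hours: int) -> float:
    """Average number of events per hour."""
    return n_events / n_hours

def prob_at_least_one(rate: float) -> float:
    """P(one or more events in an hour), assuming a Poisson process."""
    return 1.0 - math.exp(-rate)

rate = hourly_rate(N_QUAKES, N_HOURS)
print(f"mean rate: {rate:.3f} quakes/hour")
print(f"P(at least one quake in an hour): {prob_at_least_one(rate):.3f}")
```

Under that assumption the chance of at least one quake in a given hour comes out a little below the raw rate, because some hours contain more than one event.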

The next step was to calculate the right ascension of the sun and moon and translate the longitudinal position of each earthquake into right ascension to match. Below shows an illustration of what I’m describing.

The important thing to note is that the latitude of the sun, moon and earthquake can all be different and I am calculating the difference with respect solely to longitude. This is because I’m wondering about the most earthquake-prone time of day so the longitudes are the relevant quantities rather than the latitudes.

After these coordinates had been calculated for all 66,725 earthquakes, I found the difference of the position of the sun/moon with respect to each earthquake in terms of right ascension. Following that, I grouped the differences by “relative hour” (by which I mean the relative position as described above) and graphed the resulting histogram.
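The grouping step can be sketched roughly as follows. This is not the code I used: the function names are hypothetical, the sample events are made up, and the sub-solar longitude is a crude approximation (sun over 0° longitude at 12:00 UTC, moving 15° west per hour, ignoring the equation of time). A real analysis would take the sun and moon positions from an ephemeris.

```python
from collections import Counter

def relative_hour(quake_lon_deg: float, body_lon_deg: float) -> int:
    """Bin the east-west angle from the body (sun or moon) to the quake
    into one of 24 'relative hours' of 15 degrees each."""
    diff = (quake_lon_deg - body_lon_deg) % 360.0   # wrap into [0, 360)
    return int(diff // 15.0) % 24                   # final % guards float edge cases

def approx_subsolar_lon(utc_hour: float) -> float:
    """Crude sub-solar longitude in degrees east (assumption, see above)."""
    return (12.0 - utc_hour) * 15.0

def histogram(events):
    """events: iterable of (quake longitude in degrees east, UTC hour)."""
    counts = Counter(relative_hour(lon, approx_subsolar_lon(h)) for lon, h in events)
    return [counts.get(h, 0) for h in range(24)]

# Made-up sample events, for illustration only.
sample = [(-122.0, 20.0), (139.7, 3.0), (0.0, 12.0)]
print(histogram(sample))
```

Only the longitude difference matters here, which is why the latitudes never appear in the calculation.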

Position of the Sun vs. Earthquake Time

The histogram of the earthquakes relative to the position of the sun looked like this.

The blue bars show what the histogram should look like if there was an equal probability of having an earthquake regardless of the position of the sun and the red shows the actual data. You can see the bars are very close in size. There’s a small peak around the 17th hour. But is it statistically significant?

The mean number of earthquakes during any given relative hour was 2780, with a standard deviation of +/- 60. The number of earthquakes in the 17th relative hour was 2902, just outside two standard deviations, which makes it an outlier by 2 earthquakes. Not a strong outlier! In fact, the p-value using the Watson U-squared test is a paltry 0.48, far above the usual 0.05 threshold for a significant result. Translation: not publishable!
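To put the 17th-hour peak in context, here is a small sketch of the arithmetic. The counts are from the post; the normal approximation for the tail probability and the Poisson estimate of the expected scatter are my assumptions, and this is not the Watson U-squared test itself.

```python
import math

N_QUAKES = 66_725   # from the post
N_BINS = 24         # relative hours
STD = 60.0          # standard deviation across bins, from the post
PEAK = 2902         # count in the 17th relative hour, from the post

mean = N_QUAKES / N_BINS
z = (PEAK - mean) / STD

def two_sided_tail(z_score: float) -> float:
    """P(|Z| > z_score) for a standard normal variable."""
    return math.erfc(abs(z_score) / math.sqrt(2.0))

# How many of the 24 bins would land beyond 2 standard deviations by chance?
expected_outliers = N_BINS * two_sided_tail(2.0)

# Scatter expected from pure counting noise if every hour were equally likely.
poisson_std = math.sqrt(mean)

print(f"mean per bin: {mean:.1f}, z-score of peak: {z:.2f}")
print(f"bins expected beyond 2 sigma by chance: {expected_outliers:.2f}")
print(f"Poisson scatter: {poisson_std:.1f}")
```

With 24 bins, about one bin beyond two standard deviations is what chance alone would produce, so a single marginal outlier is no surprise.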

Position of the Moon vs. Earthquake Time

Since it is the moon, not the sun, that is primarily responsible for sweeping the ocean tides around the Earth, perhaps I am looking at the wrong entity (actually, the sun is responsible for a smaller, secondary ocean tide, but the magnitude pales in comparison to the moon’s effect). The moon’s position has nothing to do with the time of day, so if a correlation had turned up you’d need to consult moon charts every day!

I redid my previous analysis, grouping earthquakes by position relative to the moon. The following histogram was the result:

As can easily be seen with the eye, the deviation of the data from the mean is even smaller this time (+/- 53). In fact, there were no points outside two standard deviations and the p-value was 0.33. Again, statistically insignificant!

Conclusion

In short, there’s no particular hour to be extra wary of earthquakes. Unfortunately, I’ll just have to settle for the USGS’s “sometime in the next 30 years”.


-Lyndie Chiou

KDD Cup 2011

This year’s KDD Cup contest is similar in style to the Netflix competition, except 1st place is only $5000 instead of a million. Bummer. But it still seems like a fun contest.

There are two parts. The first track attempts to predict user music ratings and the second track tries to predict whether or not a user will rate a particular track at all. Contestants can choose either track or both.

The deadline for the KDD Cup is June 30th, 2011.

-Lyndie Chiou

Python data-mining and pattern recognition packages

The Python language has become one of the premier computational languages for scientific research on account of its many useful built-in data handling methods. Additionally, there are a number of science-oriented packages that rival industry-standard computational packages (I’m mainly thinking of Matlab). The most popular add-on Python science packages are NumPy and SciPy.

Python has a steeper learning curve than Matlab, but once the user has gained enough experience there’s a surprising wealth of modules that can be wielded for powerful results. Many of these Python add-ons came from academic institutions that decided to release their tools into the Python community for free use.

Data-mining in Python has become very popular. Two tools that I am briefly reviewing here are OpenCV and scikits.learn.

I have already benefited from OpenCV, an open source machine vision package. The package is actually a collection of C++ libraries, but Boost Python wrappers have been written to open up the libraries to Python. Learning algorithms include boosting, decision tree learning, expectation-maximization algorithm, the k-nearest neighbor algorithm, the naive Bayes classifier, artificial neural networks, random forest, and support vector machine (SVM).
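To give a flavor of one of the algorithms in that list, here is a minimal k-nearest-neighbor classifier in plain Python. It does not use OpenCV's API; the function name and the toy data are hypothetical, and it is only meant to show the idea of voting among the closest training points.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature tuple, label) pairs.
    Returns the majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: two well-separated clusters (made up for illustration).
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```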

I’ve also recently come across scikits.learn. This is a more general-purpose collection of machine learning modules written for Python. As of this writing the project is relatively new, but it already has a well-developed set of supervised learning modules (support vector machines and generalized linear models) and is developing a set of unsupervised learning modules: clustering, Gaussian mixture models, manifold learning, ICA, and Gaussian processes.

I should point out that SciKits is actually a group of modules (scikits.learn being one) built using SciPy. It includes a statistical computation module, image processing routines and vector plotting algorithms, among many, many others.

Are there any data-mining/pattern recognition Python packages that you can add to this list?

Updates:

A big thanks to Ben Racine who alerted me to:

  • Machine Learning Python, aka “mlpy”. This package has been developed at an Italian research center, Fondazione Bruno Kessler. From the package homepage, I see it includes: SVM (Support Vector Machine), KNN (K Nearest Neighbor), FDA, SRDA, PDA, DLDA (Fisher, Spectral Regression, Penalized, Diagonal Linear Discriminant Analysis) for classification and feature weighting, I-RELIEF, DWT and FSSun for feature weighting, RFE (Recursive Feature Elimination) and RFS (Recursive Forward Selection) for feature ranking, OLS, (Kernel) Ridge Regression, LASSO, LARS, Gradient Descent for Regression, Elastic Net, DWT, UWT, CWT (Discrete, Undecimated, Continuous Wavelet Transform), KNN imputing, DTW (Dynamic Time Warping), Hierarchical Clustering, k-medoids, k-means, Resampling Methods, Metric Functions, Canberra indicators.
  • Machine Learning Toolkit, aka “milk”, was written by Luis Pedro Coelho and is released under the MIT license. It focuses on supervised classification via SVMs (based on libsvm), k-NN, random forests, and decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems. For unsupervised learning, milk supports k-means clustering and affinity propagation.

-Lyndie Chiou

Most genetic breakthroughs false?

I was once involved in a discussion about the lack of negative results in research. Negative results occur when the assumed hypothesis is proved false. Or in other words, what the researcher was trying to prove turned out to be wrong.

This new knowledge is just as important as positive results, but such studies are rarely published. To prove my point, I looked through a sample of online scientific studies to find a paper where the main result had a p-value greater than 0.05. The p-value is the probability of seeing data at least as extreme as what was observed if there were really no effect (the null hypothesis). The larger the p-value, the more easily chance alone could explain the result. A p-value below 0.05 is the standard cut-off for calling a result significant.

The result? Zero. Nada. No papers in my casual, small survey presented a negative result.

One might conclude that scientific intuition is always correct. But in fact people already know that negative results get trashed. Or even worse, they get recycled. Recycling can happen when the researcher keeps tweaking the data and testing with different statistical measures until he/she finds one that gives a low enough p-value to get published.
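A quick sketch shows why this kind of recycling is so effective. It assumes each extra test is independent and that there is no real effect, so every p-value is uniformly distributed between 0 and 1. Real re-analyses of the same data are correlated, so treat the numbers as illustrative only; the function names are my own.

```python
import random

def chance_of_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent null tests has p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

def simulate(n_tests: int, alpha: float = 0.05, trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo check: draw uniform p-values and count the trials in which
    at least one of them falls below alpha."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(trials)
    )
    return hits / trials

for n in (1, 5, 20):
    print(n, round(chance_of_false_positive(n), 3), round(simulate(n), 3))
```

Under these assumptions, trying 20 different analyses gives better than even odds of finding a "significant" result where none exists.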

In the end, the real research behind such results is a search for whichever statistical method best fits the data to the hypothesis.

The main problem with this approach is that when larger datasets are used to check the results, the conclusions don’t hold up. There are a number of historical examples where this exact scenario has happened.

In The Guardian, it was reported that several genes linked to behavior turned out to have no correlation in larger datasets. For instance, one study claimed that an enzyme used to produce serotonin in the brain correlated with depression. The study was widely reported not just in scientific journals, but also in the mainstream media. Unfortunately, several larger, more controlled studies showed no such correlation. And, of course, these more carefully controlled larger studies were ignored by the mass media (although they were reported in scientific journals).

I wonder if one day a Journal of Negative Results will gain just as much traction and generate just as much interest as today’s positive-result journals? Such a journal could push the boundary of knowledge just as much as they do. Given the relatively cheap distribution model of the Internet, it seems like a quality negative results journal shouldn’t be too hard to birth. And I’d be willing to bet anyone a coffee that there’s a veritable MOUNTAIN of papers just waiting to grace its pages.

References:

Munafo M. et al, Genetic ‘breakthroughs’ in medicine are often nothing of the sort, guardian.co.uk, 9 Nov 2009

-Lyndie Chiou

Your friends could be the reason why you were denied credit!

We all know that our credit reports are used to determine our future credit risk. Last year it emerged that American Express was using the shops where we make purchases to deny credit to customers. Now a new company is data-mining our friends on social networking sites to determine credit risk.

Yes, really! You Facebook-friended that loser just to be nice to him, but now he’s caused your credit card company to deny you credit.

This is the work of companies such as Rapleaf of San Francisco that use the perfectly legal method of getting friend lists from social networks like Facebook, MySpace, and Twitter to help determine your credit worthiness.

This bleeding-edge business space is called SMM (social media monitoring). Everything about you is added to your particular data profile, from the books you have reviewed on Amazon to the comments you left on the blogs you visit. Then in-house algorithms are used to serve up suggestions to customers on everything from advertisements you might like to your credit worthiness. The privacy laws on the books were certainly not written with our current information age in mind!

I’ve thought for a while that every digital fingerprint you leave online is really building an online brand about you. This can go two ways and might make for an interesting psychology PhD for a grad student somewhere. Does the profile that you build online match your concept of yourself? Perhaps your online data profile really describes more about you than you realize… You can see what Rapleaf thinks about you by visiting their website to check out your online profile (you have to register for a free account). You can also opt out.

References:
How Rapleaf Is Data-Mining Your Friend Lists to Predict Your Credit Risk

-Lyndie Chiou

Data-mining in school districts

Starting a decade ago more and more schools across the US began implementing data mining programs to improve school performance.

It was claimed that data-mining real-time and after the school year had ended could identify students who were in danger of not graduating. Real-time data mining would be a way to find these students early on so they could be provided with targeted assistance in their weak subjects.

Data-mining after the school year was over provided a method to check how programs performed and to evaluate the weaknesses and strengths of individual teachers and schools.

After a media splash, more and more school districts turned to data-mining as a way to improve their overall performance.

I decided to follow up on schools to try to data-mine the effect of data-mining. OK, that was originally what I wanted to do, but very quickly I realized it would be a long research project, not a short one. So this analysis is not that sophisticated. I just looked to see whether graduation rates at one school district had improved since it started its data-mining program.

One of the earliest to jump onto the data-mining bandwagon was Broward County School District in Florida. This is a large district with a low graduation rate. Broward County School District is one of the nation’s 10 largest school districts, with almost 250,000 students. It was featured in a 2000 CNN article detailing its plans to provide data-mining services via a $2 million grant from IBM.

The year the data-mining project began, the graduation rate was 62.3%. In the 2008-2009 school year, the most recent school year with reported rates, it was 76.3%. Below is a graph of the graduation rate from 1998-2008 (data-mining began in 2000).

Given the Broward High School’s class size of about 1,200 kids/grade, that’s about 1,300 extra students who graduated during the 8-year time span who otherwise would not have made it, about the same size as a whole class.

Of course, it’s hard to tell if this can all be put down to the benefits of real-time data-mining. To do this study properly, all the schools in the country that had implemented data-mining programs should be compared against all the schools that hadn’t. This brings me to another point… Data in the education domain is very difficult to obtain and seemingly unreliable! I found at least 3 different websites with conflicting data for the same clearly-defined measure of graduation rate. In the end, I went with the stats listed on the Florida Department of Education which differed from the Broward County website data which was also different from the National Center for Education Statistics data! These are all government sources and therefore reliable, one would have thought…

Thus I drop the ball of examining the usefulness of data-mining in education right here… I wish education data was more centralized and therefore easier to access and hopefully more reliable. Any parent who is trying to decide where to live in order to put their kids in the best schools probably has wished the exact same thing, as well as researchers looking for ways to improve education.

UPDATE: After I wrote this post, I discovered a mine-load of data (although mainly test scores, incidents, etc., not graduation rates) at this Florida Dept of Education link. I may revisit this subject to form a more proper conclusion about Broward District’s results sometime in the future!

-Lyndie Chiou

Data mining contests

I’m a sucker for competitions with lots of prize money… So I went fishing on the web looking for data mining contests. I only found three results. Do you know of any others? Comment on this post and I can update the list for everyone. Here are the competitions I found:

1. Of course, round 1 of the Netflix competition has ended, but did you know there’s a round 2 — also with a $1 million prize? Round 2 will be a time-limited contest involving sparse datasets. The full details for the Netflix 2 prize will be announced in the near future on their website. Once the contest has been officially started, it will have a progress prize at 6 months and then finish at 18 months.

2. There’s a statistical methods competition called the OMOP Cup: Method Competition. It’s organized by the Observational Medical Outcomes Partnership. The purpose is to improve on current methods of utilizing real-time data to ensure drug safety. There are two parts to the competition (taken from the website):

  • Challenge 1 explores how well your method works when provided an entire dataset, so the goal is accurate classification of which drugs are associated with which outcomes.
  • Challenge 2 evaluates the timeliness of detection of drug-event associations by having your methods run against data sequentially as it accumulates over time.

The total prize money is $20,000. Visit the OMOP Cup: Method Competition website for full details.

3. Every year, KDD (Knowledge Discovery and Data-Mining) sponsors a data-mining competition with a cash prize of around $5000. The competition is usually announced in spring, so apologies for mentioning it now; you will have to wait until 2010. You can look at info on past competitions here.

-Lyndie Chiou

How to solve ancient mysteries in one day

An algorithm from researchers at Cornell has managed to data-mine the underlying laws of physics in just under one day.

For the past 50 years, it has been postulated that computer learning algorithms would out-pace the human mind in deriving laws of behavior from large, complicated datasets. Prizes (like the Leibniz Prize) have even been offered for the first program to fundamentally change mathematics. Despite many earnest attempts, HAL has yet to be created.

However, the modern availability of cheap memory and speed has meant that the processing power of researchers has grown exponentially, allowing for more sophisticated learning algorithms. Researchers have been able to learn from these algorithms, refine them and create even more elaborate methods until voila! A computer has been able to derive laws of physics with nothing more than experimental data. 

To be more specific, the researchers, Schmidt and Lipson, fed experimental data on several simple systems such as a weight and spring or a 2-arm pendulum into the program. The program also had knowledge of simple mathematical operations. Using a method similar to Monte Carlo simulations, it kept trying and optimizing different forms of equations until it had derived equations that described the systems. In the process, it also expressed some underlying laws like conservation of momentum and Newton’s 2nd law of motion.
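The try-and-score idea can be illustrated with a toy search. This is not Schmidt and Lipson's algorithm: the candidate list, the function names and the synthetic spring data (force proportional to displacement, with a made-up constant of 3) are all my own assumptions, and the search simply tries every candidate rather than evolving equations.

```python
import math
import random

# Hypothetical one-term model forms y = c * f(x).
CANDIDATES = {
    "c*x": lambda x: x,
    "c*x**2": lambda x: x * x,
    "c*sin(x)": math.sin,
    "c*exp(x)": math.exp,
}

def fit_coefficient(f, xs, ys):
    """Least-squares value of c for the model y = c * f(x)."""
    num = sum(f(x) * y for x, y in zip(xs, ys))
    den = sum(f(x) ** 2 for x in xs)
    return num / den

def mse(f, c, xs, ys):
    """Mean squared error of y = c * f(x) against the data."""
    return sum((c * f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def best_form(xs, ys):
    """Fit every candidate and return (error, name, coefficient) of the best."""
    scored = []
    for name, f in CANDIDATES.items():
        c = fit_coefficient(f, xs, ys)
        scored.append((mse(f, c, xs, ys), name, c))
    return min(scored)

# Synthetic "weight on a spring" measurements: F = -3 * x plus small noise.
rng = random.Random(1)
xs = [i / 10 for i in range(-20, 21)]
ys = [-3.0 * x + rng.gauss(0.0, 0.05) for x in xs]

err, name, c = best_form(xs, ys)
print(name, round(c, 2))
```

On this synthetic data the linear form wins with a coefficient close to -3, which is Hooke's law recovered from nothing but the measurements and a menu of building blocks.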

In another example of data-mining, researchers were able to answer one part of a hieroglyphic mystery that has perplexed archeologists to this day. The Indus Script from 4,000 years ago has remained undecipherable. Some linguists have insisted that it is no language at all, but merely political ciphers (like the Democratic donkey or Republican elephant) that were important in that day. The problem is that the longest chain of Indus Script contains only 27 characters, making it extremely difficult to crack. A group of researchers has now managed to “prove” that it is a real language by showing that the entropy level of the order of the Indus characters is very similar to human language.

Rao, a machine learning expert, and his team, fed their learning algorithm several different modern and ancient languages and measured the entropy of the word order. They also fed it non-verbal communication like DNA and FORTRAN code. These non-verbal communications turned out to have either extremely low or high entropy. The Indus language matched the mid-entropy level of human languages.
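One simple way to measure the entropy of symbol order is the conditional entropy of each symbol given the one before it. The sketch below is my own illustration with made-up sequences; it is not the exact estimator used in the paper.

```python
import math
import random
from collections import Counter

def conditional_entropy(seq) -> float:
    """H(next symbol | current symbol) in bits, estimated from adjacent pairs."""
    pairs = Counter(zip(seq, seq[1:]))
    firsts = Counter(seq[:-1])
    total = sum(pairs.values())
    h = 0.0
    for (a, _b), n in pairs.items():
        p_pair = n / total            # P(a followed by b)
        p_next_given_a = n / firsts[a]  # P(b | a)
        h -= p_pair * math.log2(p_next_given_a)
    return h

# A rigid pattern: the next symbol is always fully determined.
rigid = "ab" * 50

# A random sequence over 4 symbols: the next symbol is unpredictable.
rng = random.Random(0)
noisy = [rng.choice("abcd") for _ in range(5000)]

print(conditional_entropy(rigid))
print(conditional_entropy(noisy))
```

A rigid pattern scores near zero and a random sequence scores near the maximum (2 bits for four symbols); the claim in the paper is that languages fall in between.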

In other fascinating news, did you know that you can’t tie your shoelaces? If you don’t believe me, visit Ian’s shoelace site. Apparently, most of us use Granny knots to tie our shoelaces instead of the “correct” Shoelace knot.

Now you know.

Citations:

“Distilling Free-Form Natural Laws from Experimental Data.” By Michael Schmidt and Hod Lipson. Science, Vol. 324, April 3, 2009.

Wired Science article on the topic: http://www.wired.com/wiredscience/2009/04/newtonai

“Entropic Evidence for Linguistic Structure in the Indus Script.” By Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari and Iravatham Mahadevan. Science, Vol. 324, Issue 5926, April 24, 2009.

Wired Science article on the Indus language: http://www.wired.com/wiredscience/2009/04/indusscript#comment-152291563

Ian’s shoelace website: http://www.fieggen.com/shoelace/index.htm

-Lyndie Chiou



