Archive for the 'data mining' Category

Your friends could be the reason why you were denied credit!

We all know that our credit reports are used to determine our future credit risk. Last year it emerged that American Express was using the shops where we make purchases to deny credit to customers. Now a new company is data-mining our friends on social networking sites to determine credit risk.

Yes, really! You Facebook-friended that loser just to be nice to him, but now he’s caused your credit card company to deny you credit.

This is the work of companies such as Rapleaf of San Francisco, which use the perfectly legal method of pulling friend lists from social networks like Facebook, MySpace, and Twitter to help determine your creditworthiness.

This bleeding-edge business space is called SMM (social media monitoring). Everything about you is added to your data profile, from the books you have reviewed on Amazon to the comments you left on the blogs you visit. In-house algorithms then serve up suggestions to customers on everything from advertisements you might like to your creditworthiness. The privacy laws on the books were certainly not written with our current information age in mind!

I’ve thought for a while that every digital fingerprint you leave online is really building an online brand about you. This can go two ways and might make for an interesting psychology PhD for a grad student somewhere. Does the profile that you build online match your concept of yourself? Perhaps your online data profile really describes more about you than you realize… You can see what Rapleaf thinks about you by visiting their website to check out your online profile (you have to register for a free account). You can also opt out.

References:
How Rapleaf Is Data-Mining Your Friend Lists to Predict Your Credit Risk

Data-mining in school districts

Starting a decade ago, more and more schools across the US began implementing data-mining programs to improve school performance.

It was claimed that data-mining, both in real time and after the school year had ended, could identify students who were in danger of not graduating. Real-time data-mining would be a way to find these students early on so they could be given targeted assistance in their weak subjects.

Data-mining after the school year was over provided a method to check how programs performed and to evaluate the weaknesses and strengths of individual teachers and schools.

After a media splash, more and more school districts turned to data-mining as a way to improve their overall performance.

I decided to follow up on schools to try to data-mine the effect of data-mining. OK, that was what I originally wanted to do, but I quickly realized it would be a long research project, not a short one. So this analysis is not that sophisticated: I just looked at whether graduation rates improved at one school district after it started its data-mining program.

One of the earliest schools to jump onto the data-mining bandwagon was Broward County School District in Florida. This is a large school with a low graduation rate. Broward County School District is one of the nation’s 10 largest school districts, with almost 250,000 students. In 2000 they were featured in a CNN article detailing their plans to provide data-mining services via a $2 million grant from IBM.

The year the data-mining project began, the graduation rate was 62.3%. In the 2008-2009 school year, the most recent school year with reported rates, it was 76.3%. Below is a graph of the graduation rate from 1998-2008 (data-mining began in 2000).

Given the Broward High School’s class size of about 1,200 kids/grade, that’s about 1,300 extra students who graduated during the 8-year span who otherwise would not have made it, which is about the same size as a whole class.
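The post does not say how the 1,300 figure was computed. One reading that lands close to it, sketched below, assumes the full 14-point gain (62.3% to 76.3%) applied in each of the 8 years. That is an upper bound. If the rate instead climbed steadily over the period, the total would be roughly half. The function name and both scenarios are illustrative assumptions, not the original calculation.

```python
def extra_graduates(class_size, start_rate, end_rate, years):
    """Extra graduates if the full rate gain applied in every year (upper bound)."""
    return class_size * (end_rate - start_rate) * years


# Figures quoted in the post: 1,200 students per grade, 62.3% -> 76.3%, 8 years.
upper_bound = extra_graduates(1200, 0.623, 0.763, 8)

# Assumption: if the rate rose linearly instead, the average gain over
# the period is half the final gain, so the total is halved.
linear_ramp = upper_bound / 2

print(round(upper_bound))   # about 1,344
print(round(linear_ramp))   # about 672
```

The upper bound is the one that matches the "about 1,300" in the post, so the real number of extra graduates is probably lower.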

Of course, it’s hard to tell if this can all be put down to the benefits of real-time data-mining. To do this study properly, all the schools in the country that had implemented data-mining programs should be compared against all the schools that hadn’t. This brings me to another point… Data in the education domain is very difficult to obtain and seemingly unreliable! I found at least 3 different websites with conflicting data for the same clearly defined measure of graduation rate. In the end, I went with the stats listed by the Florida Department of Education, which differed from the Broward County website data, which in turn differed from the National Center for Education Statistics data! These are all government sources and therefore reliable, one would have thought…

Thus I drop the ball of examining the usefulness of data-mining in education right here… I wish education data was more centralized and therefore easier to access and hopefully more reliable. Any parent who is trying to decide where to live in order to put their kids in the best schools probably has wished the exact same thing, as well as researchers looking for ways to improve education.

UPDATE: After I wrote this post, I discovered a mine-load of data (although mainly test scores, incidents, etc., not graduation rates) at this Florida Dept of Education link. I may revisit this subject to form a more proper conclusion about Broward District’s results sometime in the future!

Data mining contests

I’m a sucker for competitions with lots of prize money… So I went fishing on the web looking for data mining contests. I only found three results. Do you know of any others? Comment on this post and I can update the list for everyone. Here are the competitions I found:

1. Of course, round 1 of the Netflix competition has ended, but did you know there’s a round 2 — also with a $1 million prize? Round 2 will be a time-limited contest involving sparse datasets. The full details for the Netflix 2 prize will be announced in the near future on their website. Once the contest has been officially started, it will have a progress prize at 6 months and then finish at 18 months.

2. There’s a statistical methods competition called the OMOP Cup: Method Competition. It’s organized by the Observational Medical Outcomes Partnership. The purpose is to improve on current methods of utilizing real-time data to ensure drug safety. There are two parts to the competition (taken from the website):

  • Challenge 1 explores how well your method works when provided an entire dataset, so the goal is accurate classification of which drugs are associated with which outcomes.
  • Challenge 2 evaluates the timeliness of detection of drug-event associations by having your methods run against data sequentially as it accumulates over time.

The total prize money is $20,000. Visit the OMOP Cup: Method Competition website for full details.

3. Every year, KDD (Knowledge Discovery and Data-Mining) sponsors a data-mining competition with a cash prize of around $5,000. The competition is usually announced in spring, so apologies for mentioning it now; you will have to wait until 2010. You can look at info on past competitions here.

How to solve ancient mysteries in one day

An algorithm from researchers at Cornell has managed to data-mine the underlying laws of physics in just under one day.

For the past 50 years, it has been postulated that computer learning algorithms would outpace the human mind in deriving laws of behavior from large, complicated datasets. Prizes (like the Leibniz Prize) have even been offered for the first program to fundamentally change mathematics. Despite many earnest attempts, HAL has yet to be created.

However, the modern availability of cheap memory and fast processors has meant that the computing power available to researchers has grown exponentially, allowing for more sophisticated learning algorithms. Researchers have been able to learn from these algorithms, refine them and create even more elaborate methods until voilà! A computer has been able to derive laws of physics with nothing more than experimental data.

To be more specific, the researchers, Schmidt and Lipson, fed experimental data on several simple systems such as a weight and spring or a 2-arm pendulum into the program. The program also had knowledge of simple mathematical operations. Using a method similar to Monte Carlo simulations, it kept trying and optimizing different forms of equations until it had derived equations that described the systems. In the process, it also expressed some underlying laws like conservation of momentum and Newton’s 2nd law of motion.
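The "keep trying and optimizing different forms of equations" idea can be illustrated with a toy random search. This is only a sketch under loud assumptions: the candidate forms, the synthetic spring data (F = -3x) and the function names are all made up for illustration. It is not the authors' program, which the cited paper describes as symbolic regression rather than a blind random search.

```python
import random

# Hypothetical candidate equation forms; each maps (x, c) to a predicted force.
CANDIDATES = {
    "c * x": lambda x, c: c * x,
    "c * x**2": lambda x, c: c * x * x,
    "c + x": lambda x, c: c + x,
    "c": lambda x, c: c,
}


def mean_squared_error(form, c, data):
    """Average squared gap between a candidate's predictions and the data."""
    return sum((form(x, c) - y) ** 2 for x, y in data) / len(data)


def random_search(data, trials=20000, seed=0):
    """Repeatedly pick a random form and constant, keeping the best fit so far."""
    rng = random.Random(seed)
    names = list(CANDIDATES)
    best_err, best_name, best_c = float("inf"), None, None
    for _ in range(trials):
        name = rng.choice(names)
        c = rng.uniform(-10.0, 10.0)
        err = mean_squared_error(CANDIDATES[name], c, data)
        if err < best_err:
            best_err, best_name, best_c = err, name, c
    return best_err, best_name, best_c


# Synthetic "experiment": an ideal spring with stiffness 3 (made-up data).
data = [(k / 10.0, -3.0 * k / 10.0) for k in range(-20, 21)]

err, name, c = random_search(data)
print(name, c, err)  # the linear form should win, with c close to -3
```

Even this crude search recovers the linear force law from the data alone. The real system searches a far larger space of expressions and has to avoid trivial or overfitted equations.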

In another example of data-mining, researchers were able to answer one part of a hieroglyphic mystery that has perplexed archeologists to this day. The Indus Script from 4,000 years ago has remained undecipherable. Some linguists have insisted that it is no language at all, but merely political symbols (like the Democratic donkey or Republican elephant) that were important in that day. The problem is that the longest chain of Indus Script contains only 27 characters, making it extremely difficult to crack. A group of researchers has now managed to “prove” that it is a real language by showing that the entropy level of the order of the Indus characters is very similar to that of human language.

Rao, a machine learning expert, and his team fed their learning algorithm several different modern and ancient languages and measured the entropy of the word order. They also fed it non-linguistic sequences like DNA and FORTRAN code. These non-linguistic sequences turned out to have either extremely low or extremely high entropy. The Indus script matched the mid-entropy level of human languages.
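To make "entropy of the order" concrete, here is a minimal sketch. It assumes the measure is something like the conditional entropy of the next symbol given the current one, estimated from adjacent pairs. The two symbol strings are invented toy inputs, not Indus data, and the function is an illustration rather than the team's code.

```python
import math
from collections import Counter


def conditional_entropy(sequence):
    """Estimate H(next symbol | current symbol) in bits from adjacent pairs."""
    pairs = list(zip(sequence, sequence[1:]))
    pair_counts = Counter(pairs)
    current_counts = Counter(current for current, _ in pairs)
    total = len(pairs)
    entropy = 0.0
    for (current, _), count in pair_counts.items():
        p_pair = count / total                         # P(current, next)
        p_next_given = count / current_counts[current]  # P(next | current)
        entropy -= p_pair * math.log2(p_next_given)
    return entropy


# Rigid ordering: the next symbol is fully determined by the current one.
rigid = conditional_entropy("ABABABABABABABAB")

# Flexible ordering: after either symbol, both symbols are equally likely.
flexible = conditional_entropy("AABBA")

print(rigid, flexible)
```

The rigid sequence scores zero bits and the fully flexible two-symbol sequence scores one bit, the maximum for two symbols. The claim in the post is that natural languages, and the Indus script, fall between such extremes.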

In other fascinating news, did you know that you can’t tie your shoelaces? If you don’t believe me, visit Ian’s shoelace site. Apparently, most of us use Granny knots to tie our shoelaces instead of the “correct” Shoelace knot.

Now you know.

Citations:

“Distilling Free-Form Natural Laws from Experimental Data.” By Michael Schmidt and Hod Lipson.  Science, Vol. 324, April 3, 2009.

Wired Science article on the topic: http://www.wired.com/wiredscience/2009/04/newtonai

“Entropic Evidence for Linguistic Structure in the Indus Script.” By Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari and Iravatham Mahadevan. Science, Vol. 324, Issue 5926, April 24, 2009.

Wired Science article on the Indus language:   http://www.wired.com/wiredscience/2009/04/indusscript#comment-152291563

Ian’s shoelace website: http://www.fieggen.com/shoelace/index.htm