Your friends could be the reason why you were denied credit!

We all know that our credit reports are used to determine our future credit risk. Last year it emerged that American Express was using the shops where we make purchases to deny credit to customers. Now a new company is data-mining our friends on social networking sites to determine credit risk.

Yes, really! You Facebook-friended that loser just to be nice to him, but now he’s caused your credit card company to deny you credit.

This is the work of companies such as Rapleaf of San Francisco, which use the perfectly legal method of pulling friend lists from social networks like Facebook, MySpace, and Twitter to help determine your creditworthiness.

This bleeding-edge business space is called SMM, or social media monitoring. Everything about you is added to your particular data profile, from the books you have reviewed on Amazon to the comments you left on the blogs you visit. In-house algorithms then use the profile to serve up suggestions to the company’s customers on everything from advertisements you might like to your creditworthiness. The privacy laws on the books were certainly not written with our current information age in mind!

I’ve thought for a while that every digital fingerprint you leave online is really building an online brand about you. This can go two ways and might make for an interesting psychology PhD for a grad student somewhere. Does the profile that you build online match your concept of yourself? Perhaps your online data profile really describes more about you than you realize… You can see what Rapleaf thinks about you by visiting their website to check out your online profile (you have to register for a free account). You can also opt out.

References:
How Rapleaf Is Data-Mining Your Friend Lists to Predict Your Credit Risk


Data-mining in school districts

Starting a decade ago, more and more schools across the US began implementing data-mining programs to improve school performance.

It was claimed that data-mining, both in real time and after the school year had ended, could identify students who were in danger of not graduating. Real-time data-mining would be a way to find these students early on so they could be provided with targeted assistance in their weak subjects.

Data-mining after the school year was over provided a method to check how programs performed and to evaluate the weaknesses and strengths of individual teachers and schools.

After a media splash, more and more school districts turned to data-mining as a way to improve their overall performance.

I decided to follow up on schools to try to data-mine the effect of data-mining. OK, that was originally what I wanted to do, but very quickly I realized it would be a long research project, not a short one. So this analysis is not that sophisticated: I just looked to see whether graduation rates had improved at one school district since it started its data-mining program.

One of the earliest to jump onto the data-mining bandwagon was Broward County School District in Florida. This is a large district with a low graduation rate. Broward County School District is among the nation’s 10 largest school districts, with almost 250,000 students. It was featured in a 2000 CNN article detailing its plans to provide data-mining services via a $2 million grant from IBM.

The year the data-mining project began, the graduation rate was 62.3%. In the 2008-2009 school year, the most recent school year with reported rates, it was 76.3%. Below is a graph of the graduation rate from 1998-2008 (data-mining began in 2000).

Given the Broward High School’s class size of about 1,200 kids/grade, that’s about 1,300 extra students who graduated during the 8-year time span who otherwise would not have made it, about the same size as a whole class.
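For anyone who wants to check the arithmetic, here is a rough sketch of one way to arrive at a figure of that size. The rates, class size and time span are the ones quoted in this post; the assumption that every class in the span graduated at the latest rate rather than the starting rate is mine, and it makes the result an upper bound rather than a careful estimate.

```python
# Figures quoted in the post
rate_start = 0.623      # graduation rate the year data-mining began
rate_latest = 0.763     # graduation rate in 2008-2009
class_size = 1200       # students per grade, as quoted
years = 8               # length of the time span, as quoted

# Assumption (mine): every class in the span graduated at the latest rate
# instead of the starting rate, which makes this an upper bound.
extra_per_year = (rate_latest - rate_start) * class_size
extra_total = extra_per_year * years

print(f"{extra_per_year:.0f} extra graduates per year, {extra_total:.0f} in total")
```

If the rate actually climbed gradually over the period, the real total would be smaller than this.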

Of course, it’s hard to tell if this can all be put down to the benefits of real-time data-mining. To do this study properly, all the schools in the country that had implemented data-mining programs should be compared against all the schools that hadn’t. This brings me to another point… Data in the education domain is very difficult to obtain and seemingly unreliable! I found at least 3 different websites with conflicting data for the same clearly defined measure of graduation rate. In the end, I went with the stats listed by the Florida Department of Education, which differed from the Broward County website data, which in turn differed from the National Center for Education Statistics data! These are all government sources and therefore reliable, one would have thought…

Thus I drop the ball of examining the usefulness of data-mining in education right here… I wish education data were more centralized and therefore easier to access and hopefully more reliable. Any parent who is trying to decide where to live in order to put their kids in the best schools has probably wished the exact same thing, as have researchers looking for ways to improve education.

UPDATE: After I wrote this post, I discovered a mine-load of data (although mainly test scores, incidents, etc., not graduation rates) at this Florida Dept of Education link. I may revisit this subject to form a more proper conclusion about Broward District’s results sometime in the future!


Data mining contests

I’m a sucker for competitions with lots of prize money… So I went fishing on the web looking for data mining contests. I only found three results. Do you know of any others? Comment on this post and I can update the list for everyone. Here are the competitions I found:

1. Of course, round 1 of the Netflix competition has ended, but did you know there’s a round 2 — also with a $1 million prize? Round 2 will be a time-limited contest involving sparse datasets. The full details for the Netflix 2 prize will be announced in the near future on their website. Once the contest has been officially started, it will have a progress prize at 6 months and then finish at 18 months.

2. There’s a statistical methods competition called the OMOP Cup: Method Competition. It’s organized by the Observational Medical Outcomes Partnership. The purpose is to improve on current methods of utilizing real-time data to ensure drug safety. There are two parts to the competition (taken from the website):

  • Challenge 1 explores how well your method works when provided an entire dataset, so the goal is accurate classification of which drugs are associated with which outcomes.
  • Challenge 2 evaluates the timeliness of detection of drug-event associations by having your methods run against data sequentially as it accumulates over time.

The total prize money is $20,000. Visit the OMOP Cup: Method Competition website for full details.

3. Every year, KDD (Knowledge Discovery and Data Mining) sponsors a data-mining competition with a cash prize of around $5,000. The competition is usually announced in spring, so apologies for mentioning it now; you will have to wait until 2010. You can look at info on past competitions here.


How to solve ancient mysteries in one day

An algorithm from researchers at Cornell has managed to data-mine the underlying laws of physics in just under one day.

For the past 50 years, it has been postulated that computer learning algorithms would outpace the human mind in deriving laws of behavior from large, complicated datasets. Prizes (like the Leibniz Prize) have even been offered for the first program to fundamentally change mathematics. Despite many earnest attempts, HAL has yet to be created.

However, the modern availability of cheap memory and speed has meant that the processing power available to researchers has grown exponentially, allowing for more sophisticated learning algorithms. Researchers have been able to learn from these algorithms, refine them and create even more elaborate methods until voilà! A computer has been able to derive laws of physics with nothing more than experimental data.

To be more specific, the researchers, Schmidt and Lipson, fed experimental data on several simple systems, such as a weight on a spring or a two-arm pendulum, into the program. The program also had knowledge of simple mathematical operations. Using a method similar to Monte Carlo simulations, it kept trying and optimizing different forms of equations until it had derived equations that described the systems. In the process, it also expressed some underlying laws like conservation of momentum and Newton’s 2nd law of motion.
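To give a flavor of "trying and optimizing different forms of equations", here is a toy sketch. It is emphatically not Schmidt and Lipson’s program: the data is synthetic, the spring constant, the handful of candidate forms and the least-squares fit are all my own assumptions, and a real system searches a vastly larger space of expressions.

```python
import math

# Synthetic "experimental" data for an ideal mass on a spring (assumed values).
k_over_m = 4.0                      # hypothetical spring constant / mass
omega = math.sqrt(k_over_m)
samples = []
for i in range(200):
    t = i * 0.05
    x = math.cos(omega * t)                  # position
    v = -omega * math.sin(omega * t)         # velocity
    a = -omega ** 2 * math.cos(omega * t)    # acceleration
    samples.append((x, v, a))

# Candidate equation forms a = c * f(x, v), built from simple operations.
candidates = {
    "a = c * x": lambda x, v: x,
    "a = c * v": lambda x, v: v,
    "a = c * x * v": lambda x, v: x * v,
    "a = c * x**2": lambda x, v: x * x,
    "a = c": lambda x, v: 1.0,
}

def fit(form):
    """Least-squares fit of c for one form; returns (c, mean squared error)."""
    fs = [form(x, v) for x, v, _ in samples]
    accs = [a for _, _, a in samples]
    c = sum(f * a for f, a in zip(fs, accs)) / sum(f * f for f in fs)
    mse = sum((c * f - a) ** 2 for f, a in zip(fs, accs)) / len(samples)
    return c, mse

# Try every form, keep the one that explains the data best.
results = {name: fit(form) for name, form in candidates.items()}
best_name = min(results, key=lambda name: results[name][1])
best_c, best_mse = results[best_name]
print(best_name, "with c =", round(best_c, 3))
```

The winning form says acceleration is proportional to position with a negative constant, which is Newton’s 2nd law combined with Hooke’s law for this made-up spring.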

In another example of data-mining, researchers were able to answer one part of a hieroglyphic mystery that has perplexed archeologists to this day. The Indus Script from 4,000 years ago has remained undecipherable. Some linguists have insisted that it is no language at all, but merely political symbols (like the Democratic donkey or Republican elephant) that were important in that day. The problem is that the longest chain of Indus Script contains only 27 characters, making it extremely difficult to crack. A group of researchers has now managed to “prove” that it is a real language by showing that the entropy level of the order of the Indus characters is very similar to that of human language.

Rao, a machine learning expert, and his team fed their learning algorithm several different modern and ancient languages and measured the entropy of the word order. They also fed it non-verbal communication like DNA and FORTRAN code. These non-verbal communications turned out to have either extremely low or extremely high entropy. The Indus language matched the mid-entropy level of human languages.
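Here is a small sketch of what measuring the "entropy of the order" might look like. I am assuming the measure is the conditional entropy of a symbol given the one before it, estimated from pair counts; the three sequences are synthetic stand-ins of my own (a rigid cycle, pure noise, and something in between), not the corpora the team used.

```python
import math
import random
from collections import Counter

def conditional_entropy(seq):
    """H(next | previous) in bits, estimated from adjacent-pair counts."""
    pairs = Counter(zip(seq, seq[1:]))
    firsts = Counter(seq[:-1])
    total = sum(pairs.values())
    h = 0.0
    for (a, b), n in pairs.items():
        p_pair = n / total              # probability of seeing the pair (a, b)
        p_b_given_a = n / firsts[a]     # probability of b right after a
        h -= p_pair * math.log2(p_b_given_a)
    return h

alphabet = "abcd"

def cycle_with_noise(length, p_follow):
    """Follow the fixed cycle a->b->c->d->a with probability p_follow,
    otherwise pick the next symbol at random."""
    seq = [random.choice(alphabet)]
    for _ in range(length - 1):
        if random.random() < p_follow:
            nxt = alphabet[(alphabet.index(seq[-1]) + 1) % len(alphabet)]
        else:
            nxt = random.choice(alphabet)
        seq.append(nxt)
    return seq

random.seed(0)
h_rigid = conditional_entropy(cycle_with_noise(2000, 1.0))   # fixed order
h_mixed = conditional_entropy(cycle_with_noise(2000, 0.7))   # some structure
h_random = conditional_entropy(cycle_with_noise(2000, 0.0))  # no structure

print(h_rigid, h_mixed, h_random)
```

The rigid sequence sits at the low end, the random one near the maximum of 2 bits for four symbols, and the partly structured one falls in between, which is the kind of middle ground the post describes for human languages.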

In other fascinating news, did you know that you can’t tie your shoelaces? If you don’t believe me, visit Ian’s shoelace site. Apparently, most of us use Granny knots to tie our shoelaces instead of the “correct” Shoelace knot.

Now you know.

Citations:

“Distilling Free-Form Natural Laws from Experimental Data.” By Michael Schmidt and Hod Lipson.  Science, Vol. 324, April 3, 2009.

Wired Science article on the topic: http://www.wired.com/wiredscience/2009/04/newtonai

“Entropic Evidence for Linguistic Structure in the Indus Script.” By Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari and Iravatham Mahadevan. Science, Vol. 324 Issue 5926, April 24, 2009.

Wired Science article on the Indus language: http://www.wired.com/wiredscience/2009/04/indusscript#comment-152291563

Ian’s shoelace website: http://www.fieggen.com/shoelace/index.htm


Human DNA

There’s been a big push to gather together a vast genomic library of human DNA. The purposes for the dataset range from personal genomic analysis to research into the genetic causes of diseases. Will any of these datasets be available to the public? The answer is nope.

Back in 2006, the National Human Genome Research Institute began collecting human DNA and posting it online for researchers to freely download. The datasets were downloaded 491 times before access was restricted. The reason? Fears about protecting patient privacy.

The basic reasoning goes something like this: say you donated your DNA to be analyzed and an unscrupulous person downloaded your data. This person could then synthesize your data and plant it at a crime scene. Investigators taking evidence would find your synthesized DNA and compare it with the online database from the National Human Genome Research Institute. If there was a match, they could then legally compel the Institute to turn over your identity for prosecution.

Does this seem a bit far-fetched? Over 99.9% of the human genome is identical from one human to the next. So forensic detectives just analyze the unique sections of the human genome to identify a person. Only these sections need to be synthesized. There is a well-known technique that allows a researcher to take a section of DNA and substitute a specific genomic pattern. Scientists such as Craig Venter are using this approach to try to create “artificial” life.

Of course, there’s the much more mundane threat that an insurance company could data mine the online DNA profiles to screen applicants. Or perhaps an employer could look up your genetic risk for alcoholism.

You might argue that the National Human Genome Research Institute should only provide averaged data. However, in August 2008, a group of extremely clever researchers published a paper describing how to extract individual genomes from highly averaged data. So even averaged data does not protect the individual.

For the time being, you have to be approved in order to access the human genome datasets at the National Human Genome Research Institute.

The number of for-profit personal DNA analysis companies is also growing. A quick session with Google easily found five (for the list of retailers I found, see the bottom of this post). These companies charge anywhere from $399 to $99,500 for access to your personal genome. The results of their analyses can be very entertaining, informing you of your genetic lineage as well as a range of genetic diseases for which you may carry a susceptible gene or two. 23andMe.com even won the honor of Time’s Best Invention of 2008.

However, some question what safeguards are in place regarding their customers’ DNA. Do the retail genomic companies offer the same protection that financial companies use to protect customer records? Of course, we all know how well those protections have worked. If someone were to hack 23andMe.com, you don’t have the luxury of changing your DNA the way you can change your credit card number.

There have been a number of interesting articles written about the subject. Probably the best one was recently published in the American Scientist.

Retail genomic companies

  • 23andMe.com - $399. Genotype information for about 600,000 SNPs. They claim to estimate the genetic risk of the patient for over 80 diseases as well as ancestry analyses.
  • deCODEme.com - $985 – very similar to 23andMe but performs an analysis of 1 million SNPs and estimates the risk of 38 diseases.
  • RetailGenomics.com - $1000. Lists 72 conditions for which it tests. A relative newcomer; I can’t seem to find much information on it.
  • Navigenics - $2500. Uses the Affymetrix Genome-Wide Human SNP Array 6.0, which tests some 900,000 SNPs and provides results on 20 diseases.
  • Knome - $99,500. Provides whole genome (98% genome) sequencing services. After analysis, the customer must travel to company headquarters where the scientists who developed their results discuss the analysis with them.
  • After having used any of these services, there is a free program called Promethease that will further analyze your personal genome for you. The service is FREE! 

Non-retail genomic organizations

  • PersonalGenomes.org – A charitable health & disease research project. Volunteers must permit their DNA to be made freely available to the public; however, the volunteers receive personal analysis as well as access to their own genome.
  • Research Program on Genes, Environment and Health – Kaiser Permanente’s project to sequence the genes of its members in Northern California. Volunteers are supposedly anonymized and do not receive any genetic results. Limited access for approved researchers.
  • National Human Genome Research Institute – a government-funded genomics project. I don’t see a link for volunteers. The volunteer DNA is made available to approved researchers.

One researcher’s peek into the online gaming world

EverQuest II image (from Wikipedia)

Professor Noshir Contractor from Northwestern University and his colleagues recently data-mined the massive 60-terabyte dataset released by EverQuest II. In case someone out there has never heard of EverQuest, it’s an online role-playing game occupying the spare time of about 45 million people.

Contractor’s efforts turned up some interesting results. First off, those 45 million people are NOT mostly teens. The average age was much higher. And despite having the entire world literally at their fingertips, people still tended to network with their geographical neighbors.

“People end up playing with people nearby, often with people they already know,” Contractor said. “It’s not creating new networks. It’s reinforcing existing networks. You can talk to anyone anywhere, and yet individuals 10 kilometers away from each other are five times more likely to be partners than those who are 100 kilometers away from each other.”

I would have expected some grouping due to language barriers, but the geographical localization far exceeds that. I wonder if the even stronger statement could be made that players tend to play with people they previously knew?

A survey was also distributed to 7,000 players. Using the results of the survey, Professor Contractor found disproportionate rates of self-reported depression vs. the general population. Additionally, he found that players tended to underestimate how much time they devoted to the game, and that women are generally the most devoted and content (but apparently don’t like to play with other women!).

Hmmm…. So perhaps the guys are depressed because after long hours of playing EverQuest with mainly other guys they realize after the game has been switched off they still don’t have a girlfriend? Whereas women spend their time mostly getting attention from lots of admiring EverQuest men so it makes up for not having a boyfriend?

One thing this shows is that patterns can exist in data, but the reasons behind those patterns may still be a mystery. I wonder if Professor Contractor has plans to follow up his research with a further survey to try to tease out of players why they interact the way they do – now that he knows how they interact!

Whatever the next step, it was an interesting series of finds extracted from a massive dataset. Please refer to the original article written by Megan Fellman from Northwestern University’s news center. You can also visit Professor Noshir Contractor’s blog.


Surprising uses of corporate data mining


The NY Times recently described an interesting policy by American Express. The credit card company had been lowering the credit limits of customers who had shopped at certain retail outlets. Using their proprietary dataset gathered over their customer base, American Express had identified certain retailers whose customers had a hard time paying their credit card bills. They concluded that all customers who shopped at these stores were a credit risk and correspondingly lowered their credit limits. The catch here is that American Express did not reveal their retailer blacklist. So customers had no way of knowing which stores to avoid. Walmart? Neiman Marcus? Baskin-Robbins?

Once the story appeared in the press, American Express retracted its policy. In fact, they went one step further and insisted this had never been their policy, despite thousands of letters to “curtailed” customers that explicitly detailed otherwise. Read the original NY Times report of the surprising use of American Express data mining.

In other news about creative uses of data mining…

Microsoft Live Search is thinking about inserting social technology into its search service. They found that a relatively new technique called “groupization” turned up more relevant results in an internal test run. The idea is to use a person’s social network to influence the results that are returned to a user. A user searches using a set of keywords, which are then correlated with the results the user’s social group found relevant. While Microsoft was keen on the idea, they were also worried that implementing it on a large scale might be nigh impossible. You can read a summary of the idea at the website Ars Technica.
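To make the idea concrete, here is a toy sketch of what re-ranking by a social group could look like. Microsoft has not published its method as far as I know, so the function name `groupize`, the scoring formula, the `weight` parameter and the example URLs are all hypothetical.

```python
def groupize(results, group_clicks, weight=0.5):
    """Re-rank (url, base_score) pairs, boosting URLs that members of the
    user's social group previously found relevant (hypothetical scoring)."""
    total = sum(group_clicks.values()) or 1   # avoid dividing by zero
    rescored = [
        (url, score + weight * group_clicks.get(url, 0) / total)
        for url, score in results
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Plain keyword results, best first, with made-up relevance scores.
results = [("a.example", 0.9), ("b.example", 0.8), ("c.example", 0.7)]

# How often people in the user's group marked each result as relevant.
group_clicks = {"c.example": 9, "b.example": 1}

ranked = groupize(results, group_clicks)
print([url for url, _ in ranked])
```

In this made-up case the result the group liked most jumps from last place to first, while the keyword-only favorite drops to second.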

Personally, I have my doubts that this would really contribute to my personal search results. I’ve noticed that the advertisements on the social networking site Facebook are 180 degrees different from what I’m actually interested in. I get a bunch of ads for the acai berry diet, movies, and how to get long eyelashes. In case anyone from Facebook is reading this, I don’t want to detangle my eyelashes in the morning! I think the low relevance of the ads on Facebook has to do with the fact that it’s a social networking site and therefore relatively “fluffy”. Perhaps Microsoft could create its own pre-defined community titles and a user could click their interests/hobbies when they create an account profile. Then based on these broad categories, Microsoft could perform the “groupization”. This might be a scalable approach to their strategy.


Data Scraping

I came across a useful post on the blog ouseful.wordpress.com. The blogger, Tony Hirst, blogs about whatever he finds interesting. He figured out a way to scrape data off Wikipedia using the Google spreadsheet function =importHTML("","table",N).

The blogger gives detailed instructions on how to extract a table containing population data from England using the Google spreadsheet function. He then uses Yahoo! Pipes to geocode the data and create a Google mashup. It’s a very ingenious method of extracting limited datasets!
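If you would rather do the same kind of table scraping outside a spreadsheet, here is a rough Python analogue using only the standard library. This is my own sketch, not the blogger’s method: the helper name `import_html_table` and the sample page are made up, I have assumed N counts tables starting from 1, and nested tables are not handled.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects the rows of every <table> in a page as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []     # each table is a list of rows
        self._row = None     # row currently being read, if any
        self._cell = None    # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

def import_html_table(html, n):
    """Return the n-th table on the page (assumed to count from 1)."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[n - 1]

# A made-up page standing in for a downloaded Wikipedia article.
page = """
<table><tr><th>Ignored</th></tr></table>
<table>
  <tr><th>Town</th><th>Population</th></tr>
  <tr><td>Exampleton</td><td>12345</td></tr>
</table>
"""
rows = import_html_table(page, 2)
print(rows)
```

For a real page you would first download the HTML (for example with urllib.request) and pass the text in place of the sample string.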

Another blogger, GoogleMapsMania, suggested using Batchgeocode to get the latitude and longitude data and then applying that data in the Google Spreadsheet Map Wizard to map out the data.


A fun blog to read…

Have you seen the blog Flowing Data? This blog is written by Nathan Yau, a graduate student in data visualization and statistics. He gives almost daily updates on data analysis he has performed along with well-thought-out methods of presenting the data. The blog reads like a nicely written magazine.


Google’s Palimpsest cancelled

I was surprised to read that Google has decided to cancel its data hosting service, née Palimpsest. The name Palimpsest came from the book, Archimedes Palimpsest. To read more on the project that recently restored the Archimedes Palimpsest, click here.

Google was originally planning to host terabytes of data, including astronomical and large governmental data sets. Many bloggers had been hailing Google’s service as the start of a new era of transparency for the USA.

The decision to cancel Google’s Palimpsest came just a week ago. The sharp downturn in the economy played a key role. Google’s stock fell from an all-time high of about $700 a year ago to around $300 today. This caused Google to sharply curtail “experimental” programs that have no guaranteed revenues.
