<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Research Pipeline Blog &#187; data mining</title>
	<atom:link href="http://www.researchpipeline.com/wordpress/tag/data-mining/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.researchpipeline.com/wordpress</link>
	<description>Notes about data and analysis</description>
	<lastBuildDate>Wed, 03 Feb 2010 05:34:18 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Data mining contests</title>
		<link>http://www.researchpipeline.com/wordpress/2009/10/13/data-mining-contests/</link>
		<comments>http://www.researchpipeline.com/wordpress/2009/10/13/data-mining-contests/#comments</comments>
		<pubDate>Wed, 14 Oct 2009 05:20:46 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[data mining]]></category>
		<category><![CDATA[datasets]]></category>
		<category><![CDATA[contests]]></category>

		<guid isPermaLink="false">http://www.researchpipeline.com/wordpress/?p=121</guid>
		<description><![CDATA[I&#8217;m a sucker for competitions with lots of prize money&#8230; So I went fishing on the web looking for data mining contests. I only found three results &#8211; do you know of any others? Comment on this post and I can update the list for everyone. Here are the competitions I found:
1. Of course, round 1 [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m a sucker for competitions with lots of prize money&#8230; So I went fishing on the web looking for data mining contests. I only found three results &#8211; do you know of any others? Comment on this post and I can update the list for everyone. Here are the competitions I found:</p>
<p>1. Of course, round 1 of the <a href="http://www.netflixprize.com/">Netflix competition has ended</a>, but did you know there&#8217;s a round 2 &#8212; also with a $1 million prize? Round 2 will be a time-limited contest involving sparse datasets. The full details for the <a href="http://www.netflixprize.com//community/viewtopic.php?id=1520" target="_blank">Netflix 2 prize</a> will be announced in the near future on their <a href="http://www.netflixprize.com//community/viewtopic.php?id=1520" target="_blank">website</a>. Once the contest officially starts, it will award a progress prize at 6 months and finish at 18 months.</p>
<p>2. There&#8217;s a statistical methods competition called the <a href="http://omopcup.orwik.com/" target="_blank">OMOP Cup: Method Competition</a>. It&#8217;s organized by the Observational Medical Outcomes Partnership. The purpose is to improve on current methods of utilizing real-time data to ensure drug safety. There are two parts to the competition (taken from the website):</p>
<ul>
<li>Challenge 1 explores how well your method works when provided an entire dataset, so the goal is accurate classification of which drugs are associated with which outcomes.</li>
<li>Challenge 2 evaluates the timeliness of detection of drug-event associations by having your methods run against data sequentially as it accumulates over time.</li>
</ul>
<p>The total prize money is $20,000. Visit the <a href="http://omopcup.orwik.com/" target="_blank">OMOP Cup: Method Competition</a> website for full details.</p>
<p>3. Every year, <a href="http://www.sigkdd.org/kddcup/index.php" target="_blank">KDD (Knowledge Discovery and Data Mining)</a> sponsors a data-mining competition with a cash prize of around $5,000. The competition is usually announced in spring, so apologies for mentioning it now &#8211; you will have to wait until 2010. You can look at info on past competitions <a href="http://www.sigkdd.org/kddcup/index.php" target="_blank">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.researchpipeline.com/wordpress/2009/10/13/data-mining-contests/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to solve ancient mysteries in one day</title>
		<link>http://www.researchpipeline.com/wordpress/2009/04/27/how-to-solve-ancient-mysteries-in-one-day/</link>
		<comments>http://www.researchpipeline.com/wordpress/2009/04/27/how-to-solve-ancient-mysteries-in-one-day/#comments</comments>
		<pubDate>Mon, 27 Apr 2009 22:29:22 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[data mining]]></category>
		<category><![CDATA[knots]]></category>

		<guid isPermaLink="false">http://www.researchpipeline.com/wordpress/?p=95</guid>
		<description><![CDATA[An algorithm from researchers at Cornell has managed to data-mine the underlying laws of physics in just under one day.
For the past 50 years, it has been postulated that computer learning algorithms would out-pace the human mind in deriving laws of behavior from large, complicated datasets. Prizes (like the Leibniz Prize) were even offered [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-96" title="newton-principia-mathematica_small" src="http://www.researchpipeline.com/wordpress/wp-content/uploads/2009/04/newton-principia-mathematica_small.jpg" alt="newton-principia-mathematica_small" width="150" height="210" />An algorithm from researchers at Cornell has managed to data-mine the underlying laws of physics in just under one day.</p>
<p>For the past 50 years, it has been postulated that computer learning algorithms would out-pace the human mind in deriving laws of behavior from large, complicated datasets. Prizes (like the <a href="http://www.ams.org/prizes/atp-prizes.html" target="_blank">Leibniz Prize</a>) were even offered for the first program to fundamentally change mathematics. Despite many earnest attempts, HAL has yet to be created.</p>
<p>However, the modern availability of cheap memory and fast processors has meant that the computing power available to researchers has grown exponentially, allowing for more sophisticated learning algorithms. Researchers have been able to learn from these algorithms, refine them and create even more elaborate methods until voil&#224;! A computer has been able to derive laws of physics with nothing more than experimental data.</p>
<p>To be more specific, the researchers, <a href="http://www.sciencemag.org/cgi/content/abstract/sci;324/5923/81?maxtoshow=&amp;HITS=10&amp;hits=10&amp;RESULTFORMAT=&amp;fulltext=&quot;Distilling+Free-Form+Natural+Laws+from+Experimental+Data.&quot;+By+Michael+Schmidt+and+Hod+Lipson&amp;searchid=1&amp;FIRSTINDEX=0&amp;resourcetype=HWCIT" target="_blank">Schmidt and Lipson</a>, fed experimental data on several simple systems, such as a weight on a spring or a double pendulum, into the program. The program also had knowledge of simple mathematical operations. Using an evolutionary search, a method loosely similar to Monte Carlo simulation, it kept trying and optimizing different forms of equations until it had derived equations that described the systems. In the process, it also expressed some underlying laws like conservation of momentum and Newton&#8217;s second law of motion.</p>
<p><img class="alignright size-full wp-image-97" title="indusvalleyseals_small" src="http://www.researchpipeline.com/wordpress/wp-content/uploads/2009/04/indusvalleyseals_small.jpg" alt="indusvalleyseals_small" width="300" height="150" />In another example of data-mining, researchers were able to answer one part of a hieroglyphic mystery that has perplexed archeologists to this day. The Indus Script from 4,000 years ago has remained undeciphered. Some linguists have insisted that it is no language at all, but merely political symbols (like the Democratic donkey or Republican elephant) that were important in that day. The problem is that the longest known sequence of Indus Script contains only 27 characters, making it extremely difficult to crack. A group of researchers has now managed to &#8220;prove&#8221; that it is a real language by showing that the entropy of the ordering of the Indus characters is very similar to that of human language.</p>
<p>Rao, a machine learning expert, and his team fed their learning algorithm several different modern and ancient languages and measured the entropy of the word order. They also fed it non-linguistic sequences such as DNA and FORTRAN code. These non-linguistic sequences turned out to have either extremely low or extremely high entropy. The Indus script matched the mid-range entropy of human languages.</p>
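<p>To make the entropy idea concrete, here is a minimal Python sketch. It is not the code Rao&#8217;s team used. It assumes that &#8220;entropy of the order&#8221; means the conditional entropy of the next symbol given the current one, and the function name and the two toy sequences are invented purely for illustration.</p>

```python
import math
from collections import Counter


def conditional_entropy(tokens):
    """Entropy, in bits, of the next token given the current token."""
    pairs = list(zip(tokens, tokens[1:]))      # adjacent (current, next) pairs
    pair_counts = Counter(pairs)
    first_counts = Counter(first for first, _ in pairs)
    total = len(pairs)
    entropy = 0.0
    for (first, _second), count in pair_counts.items():
        p_pair = count / total                          # P(current, next)
        p_next_given_first = count / first_counts[first]  # P(next | current)
        entropy -= p_pair * math.log2(p_next_given_first)
    return entropy


# Hypothetical toy data: a rigid cycle versus a more freely ordered sequence.
rigid = list("ABCABCABCABC")
flexible = list("ABACBCCABBCA")

print("rigid:   ", conditional_entropy(rigid))
print("flexible:", conditional_entropy(flexible))
```

<p>The rigid sequence scores 0 bits because each symbol fully determines the next one, while the more varied sequence scores higher. The study&#8217;s argument is that human languages fall between those rigid and near-random extremes.</p>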
<p><img class="alignleft size-full wp-image-98" title="knots_small" src="http://www.researchpipeline.com/wordpress/wp-content/uploads/2009/04/knots_small.png" alt="knots_small" width="200" height="196" />In other fascinating news, did you know that you can&#8217;t tie your shoelaces? If you don&#8217;t believe me, visit <a href="http://www.fieggen.com/shoelace/index.htm" target="_blank">Ian&#8217;s shoelace site</a>. Apparently, most of us use Granny knots to tie our shoelaces instead of the &#8220;correct&#8221; Shoelace knot.</p>
<p>Now you know.</p>
<p>Citations:</p>
<p><em>&#8220;Distilling Free-Form Natural Laws from Experimental Data.&#8221; By Michael Schmidt and Hod Lipson.  </em>Science<em>, Vol. 324, April 3, 2009.</em></p>
<p><em>Wired Science article on the topic: <a href="http://www.wired.com/wiredscience/2009/04/newtonai">http://www.wired.com/wiredscience/2009/04/newtonai</a></em></p>
<p><em>&#8220;Entropic Evidence for Linguistic Structure in the Indus Script.&#8221; By Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari and Iravatham Mahadevan. </em>Science<em>, Vol. 324, Issue 5926, April 24, 2009.</em></p>
<p style="text-align: left; "><em>Wired Science article on the Indus language: <a href="http://www.wired.com/wiredscience/2009/04/indusscript#comment-152291563">http://www.wired.com/wiredscience/2009/04/indusscript#comment-152291563</a></em></p>
<p><em>Ian&#8217;s shoelace website: <a href="http://www.fieggen.com/shoelace/index.htm">http://www.fieggen.com/shoelace/index.htm</a></em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.researchpipeline.com/wordpress/2009/04/27/how-to-solve-ancient-mysteries-in-one-day/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>One researcher&#8217;s peek into the online gaming world</title>
		<link>http://www.researchpipeline.com/wordpress/2009/02/22/one-researchers-peek-into-the-online-gaming-world/</link>
		<comments>http://www.researchpipeline.com/wordpress/2009/02/22/one-researchers-peek-into-the-online-gaming-world/#comments</comments>
		<pubDate>Sun, 22 Feb 2009 08:14:30 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[datasets]]></category>
		<category><![CDATA[data mining]]></category>

		<guid isPermaLink="false">http://www.researchpipeline.com/wordpress/?p=47</guid>
		<description><![CDATA[
Professor Noshir Contractor from Northwestern University and his colleagues recently data-mined the massive 60-terabyte dataset released by EverQuest II. In case someone out there has never heard of EverQuest, it&#8217;s an online role-playing game occupying the spare time of about 45 million people.
Contractor&#8217;s efforts turned up some interesting results. First off, [...]]]></description>
			<content:encoded><![CDATA[<p><img class="size-full wp-image-49 alignleft" title="EverQuest II image (from Wikipedia)" src="http://www.researchpipeline.com/wordpress/wp-content/uploads/2009/02/250px-eq2_level_60_mount.jpg" alt="EverQuest II image (from Wikipedia)" width="250" height="188" /></p>
<p>Professor Noshir Contractor from Northwestern University and his colleagues recently data-mined the massive 60-terabyte dataset released by EverQuest II. In case someone out there has never heard of EverQuest, it&#8217;s an online role-playing game occupying the spare time of about 45 million people.</p>
<p>Contractor&#8217;s efforts turned up some interesting results. First off, those 45 million people are NOT mostly teens. The average age was much higher. And despite having the entire world at their fingertips, people still tended to network with their geographical neighbors.</p>
<p>“People end up playing with people nearby, often with people they already know,” Contractor said. “It’s not creating new networks. It’s reinforcing existing networks. You can talk to anyone anywhere, and yet individuals 10 kilometers away from each other are five times more likely to be partners than those who are 100 kilometers away from each other.”</p>
<p>I would have expected some grouping due to language barriers, but the geographical localization far exceeds that. I wonder if the even stronger statement could be made that players tend to play with people they previously knew?</p>
<p>A survey was also distributed to 7,000 players. Using the results of the survey, Professor Contractor found disproportionate rates of self-reported depression compared with the general population. Additionally, he found that players tended to underestimate how much time they devoted to the game, and that women are generally the most devoted and content (but apparently don&#8217;t like to play with other women!).</p>
<p>Hmmm&#8230; So perhaps the guys are depressed because, after long hours of playing EverQuest with mainly other guys, they realize once the game has been switched off that they still don&#8217;t have a girlfriend? Whereas women spend their time mostly getting attention from lots of admiring EverQuest men, so it makes up for not having a boyfriend?</p>
<p>One thing this shows is that patterns can exist in data, but the reasons behind those patterns may still be a mystery. I wonder if Professor Contractor has plans to follow up his research with a further survey to try to tease out of players why they interact the way they do &#8211; now that he knows how they interact!</p>
<p>Whatever the next step, it was an interesting series of finds extracted from a massive dataset. Please refer to the original article written by Megan Fellman from <a href="http://www.northwestern.edu/newscenter/stories/2009/02/virtualworlds.html">Northwestern University&#8217;s news center</a>. You can also visit <a href="http://www.iknowinc.com/wordpress/">Professor Noshir Contractor&#8217;s blog</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.researchpipeline.com/wordpress/2009/02/22/one-researchers-peek-into-the-online-gaming-world/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Surprising uses of corporate data mining</title>
		<link>http://www.researchpipeline.com/wordpress/2009/02/14/surprising-uses-of-corporate-data-mining/</link>
		<comments>http://www.researchpipeline.com/wordpress/2009/02/14/surprising-uses-of-corporate-data-mining/#comments</comments>
		<pubDate>Sat, 14 Feb 2009 23:10:28 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[datasets]]></category>
		<category><![CDATA[American Express]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[groupization]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://www.researchpipeline.com/wordpress/?p=41</guid>
		<description><![CDATA[
The NY Times recently described an interesting policy by American Express. The credit card company had been lowering the credit limits of customers who had shopped at certain retail outlets. Using their proprietary dataset gathered over their customer base, American Express had identified certain retailers whose customers had a hard time paying their credit card [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-89" title="amexpcard" src="http://www.researchpipeline.com/wordpress/wp-content/uploads/2009/02/amexpcard.jpg" alt="amexpcard" width="244" height="154" /></p>
<p>The NY Times recently described an interesting policy by American Express. The credit card company had been lowering the credit limits of customers who had shopped at certain retail outlets. Using their proprietary dataset gathered over their customer base, American Express had identified certain retailers whose customers had a hard time paying their credit card bills. They concluded that all customers who shopped at these stores were a credit risk and correspondingly lowered their credit limits. The catch here is that American Express did not reveal their retailer blacklist. So customers had no way of knowing which stores to avoid. Walmart? Neiman Marcus? Baskin-Robbins?</p>
<p>Once the story appeared in the press, American Express retracted its policy. In fact, they went one step further and insisted this had never been their policy, despite thousands of letters to &#8220;curtailed&#8221; customers that explicitly detailed otherwise. Read the original NY Times report of the surprising use of <a href="http://www.nytimes.com/2009/01/31/your-money/credit-and-debit-cards/31money.html?_r=2&amp;ref=business">American Express data mining</a>.</p>
<p>In other news about creative uses of data mining&#8230;</p>
<p>Microsoft Live Search is thinking about adding social technology to its search service. They found that a relatively new technique called &#8220;groupization&#8221; turned up more relevant results in an internal test run. The idea is to use a person&#8217;s social network to influence the results returned to that person: the user&#8217;s search keywords are correlated with the results that the user&#8217;s social group found relevant. While Microsoft was keen on the idea, they were also worried that implementing it on a large scale might be nigh impossible. You can read a summary of the idea at <a href="http://arstechnica.com/microsoft/news/2009/02/microsoft-looking-at-using-groupization-to-bolster-search.ars">Ars Technica</a>.</p>
<p>Personally, I have my doubts that this would really improve my own search results. I&#8217;ve noticed that the advertisements on the social networking site Facebook are 180 degrees away from what I&#8217;m actually interested in. I get a bunch of ads for the acai berry diet, movies, and how to get long eyelashes. In case anyone from Facebook is reading this, I don&#8217;t want to detangle my eyelashes in the morning! I think the low relevance of the ads on Facebook has to do with the fact that it&#8217;s a social networking site and therefore relatively &#8220;fluffy&#8221;. Perhaps Microsoft could create its own pre-defined community titles, and a user could select their interests and hobbies when creating an account profile. Then, based on these broad categories, Microsoft could perform the &#8220;groupization&#8221;. This might be a scalable approach to their strategy.</p>
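<p>For the curious, here is a rough Python sketch of what a groupization-style re-ranking could look like. This is only my guess at the general shape, not Microsoft&#8217;s method: the function name, the blending formula, the <code>weight</code> parameter and the example URLs are all hypothetical, and it assumes the group has at least one member.</p>

```python
def groupize(results, group_clicks, group_size, weight=0.5):
    """Re-rank (url, base_score) pairs using feedback from a social group.

    group_clicks maps url -> number of group members who found it relevant.
    The linear blend below is a hypothetical choice, not a documented formula.
    """
    def blended(item):
        url, base_score = item
        # Fraction of the group that found this result relevant.
        group_score = group_clicks.get(url, 0) / group_size
        return (1 - weight) * base_score + weight * group_score

    return sorted(results, key=blended, reverse=True)


# Hypothetical search results with made-up base relevance scores.
results = [("a.example", 0.9), ("b.example", 0.7), ("c.example", 0.6)]
# Hypothetical feedback from a group of five people.
group_clicks = {"b.example": 4, "c.example": 1}

ranked = groupize(results, group_clicks, group_size=5)
print([url for url, _ in ranked])
```

<p>Here the second result jumps to the top because four of the five group members liked it. Swapping the group for one of the broad interest categories suggested above would only change where <code>group_clicks</code> comes from.</p>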
]]></content:encoded>
			<wfw:commentRss>http://www.researchpipeline.com/wordpress/2009/02/14/surprising-uses-of-corporate-data-mining/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
