
Earthquakes vs. Time of Day

66,725 Earthquakes from 1973 - 2011 with color indicating magnitude

Time and tide wait for no one. Add to that: earthquakes. I live in the San Francisco Bay Area, a.k.a. “earthquake country”, in a small house built in the 1950s before earthquake building codes had been created. Within the next 30 years, the USGS tells us we can expect a “big one” in the East Bay right along the fault where I dwell.

So here’s my question: If there’s a 30-year window for the next big one to occur, can I at least know the most likely time of day? This is not a crazy question. The time of day is really just a way of expressing where the sun is located with respect to your geographical location. The moon is responsible for sweeping the tides all around the Earth. So it seems reasonable to think that the moon or the sun may at least be an influence in “Earth tides” which might act as a trigger for earthquakes. Here’s a quick sketch showing my hypothesis:

This question turned out to be fairly straightforward to answer and I’ll cut to the chase and say, no, not really. There are some hours that are a teeny bit more earthquake prone than others, but the variations proved to be statistically insignificant.

The Method

I downloaded all earthquakes magnitude 5 or greater from 1973 through mid-2011 from the USGS Global Earthquake Search website. This gave me a list of 66,725 earthquakes, a reasonably sized dataset. I mapped the positions and color-coded magnitudes of all 66,725 earthquakes (green = mag 5.0 up through red = mag 9.1), shown at the top of this blog entry.

It’s an interesting fact that these earthquakes span a time period of 338,117 hours, which works out to an average of about 0.2 earthquakes (mag 5 or greater) per hour, or roughly a 20% chance during any hour. The chance during any hour for a magnitude 6 or greater earthquake drops to only 1.6%. By the time you get to a magnitude 7 or greater it’s much less than 1%/hr.
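The hourly figure can be checked with a few lines of arithmetic. This is a minimal sketch, assuming earthquakes arrive as a Poisson process (my assumption, and one that ignores aftershock clustering): dividing the count by the hours gives the mean rate, and the Poisson formula turns that rate into the chance of at least one earthquake in a given hour.

```python
import math

# Figures quoted in the post
n_quakes = 66_725   # magnitude 5+ earthquakes in the catalog
n_hours = 338_117   # hours spanned by the catalog

# Mean number of earthquakes per hour
rate = n_quakes / n_hours

# Poisson assumption: P(at least one event in an hour) = 1 - exp(-rate)
p_at_least_one = 1 - math.exp(-rate)

print(f"mean rate: {rate:.3f} earthquakes per hour")
print(f"P(at least one in an hour): {p_at_least_one:.3f}")
```

Under this assumption, the chance of at least one earthquake comes out slightly below the mean rate, because some hours contain more than one earthquake.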

The next step was to calculate the right ascension of the sun and moon and translate the longitudinal position of each earthquake into right ascension to match. Below is an illustration of what I’m describing.

The important thing to note is that the latitude of the sun, moon and earthquake can all be different and I am calculating the difference with respect solely to longitude. This is because I’m wondering about the most earthquake-prone time of day so the longitudes are the relevant quantities rather than the latitudes.

After these coordinates had been calculated for all 66,725 earthquakes, I found the difference of the position of the sun/moon with respect to each earthquake in terms of right ascension. Following that, I grouped the differences by “relative hour” (by which I mean the relative position as described above) and graphed the resulting histogram.
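The grouping step can be sketched as follows. This is a simplified illustration, not the calculation used for the post: instead of right ascension it assumes you already have, for each earthquake, the longitude directly beneath the sun or moon at that moment (the hypothetical `body_lon_deg` input). The function names are also mine.

```python
from collections import Counter

def relative_hour(quake_lon_deg: float, body_lon_deg: float) -> int:
    """Return the relative hour (0-23) of an earthquake with respect to a body.

    Both arguments are longitudes in degrees, east positive. body_lon_deg is
    the longitude directly beneath the sun or moon when the earthquake struck.
    """
    diff_deg = (quake_lon_deg - body_lon_deg) % 360.0  # wrap into [0, 360)
    # 15 degrees of longitude correspond to one hour; the final % 24 guards
    # against floating-point rounding pushing the difference up to 360.0.
    return int(diff_deg // 15.0) % 24

def relative_hour_histogram(pairs):
    """Count earthquakes per relative hour from (quake_lon, body_lon) pairs."""
    counts = Counter(relative_hour(q, b) for q, b in pairs)
    return [counts.get(hour, 0) for hour in range(24)]

# Usage with three made-up positions (illustrative only, not catalog data)
example = [(0.0, 0.0), (15.0, 0.0), (-170.0, 170.0)]
print(relative_hour_histogram(example))
```

The modulo keeps differences that cross the 180 degree meridian in the right bin, which is the main thing that goes wrong with a naive subtraction.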

Position of the Sun vs. Earthquake Time

The histogram of the earthquakes relative to the position of the sun looked like this.

The blue bars show what the histogram should look like if there were an equal probability of having an earthquake regardless of the position of the sun and the red shows the actual data. You can see the bars are very close in size. There’s a small peak around the 17th hour. But is it statistically significant?

The mean number of earthquakes during any given relative hour was 2780. The standard deviation was +/- 60 earthquakes. The number of earthquakes in the 17th relative hour was 2902, just outside two standard deviations (2780 + 120 = 2900), which makes it an outlier by 2 earthquakes. Not a strong outlier! In fact, the p-value using the Watson U-Squared test is a paltry 0.48, far above the usual 0.05 threshold for a significant result. Translation: not publishable!
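The outlier arithmetic can be reproduced from the figures quoted above. The sketch also adds one comparison that is my own assumption rather than part of the original analysis: if each earthquake independently fell into one of 24 equally likely relative hours, the bin counts would follow a binomial distribution, and its standard deviation shows how much scatter pure chance would produce.

```python
import math

# Figures quoted in the post
n_quakes = 66_725
mean_count = 2780   # mean earthquakes per relative hour
std_dev = 60        # observed standard deviation across the 24 bins
peak_count = 2902   # count in the 17th relative hour

# How far the peak sits above the mean, in standard deviations
z_score = (peak_count - mean_count) / std_dev

# How many earthquakes the peak lies beyond the two-sigma line
excess_over_two_sigma = peak_count - (mean_count + 2 * std_dev)

# Assumption: independent earthquakes, 24 equally likely bins (binomial model)
p_bin = 1 / 24
chance_std_dev = math.sqrt(n_quakes * p_bin * (1 - p_bin))

print(f"z-score of the peak: {z_score:.2f}")
print(f"earthquakes beyond two sigma: {excess_over_two_sigma}")
print(f"scatter expected from chance alone: {chance_std_dev:.1f}")
```

The scatter expected from chance alone is of the same order as the observed standard deviation, which fits the conclusion that the histogram is essentially flat.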

Position of the Moon vs. Earthquake Time

Since it is the moon, not the sun that is primarily responsible for sweeping the ocean tides around the Earth, perhaps I am looking at the wrong entity (actually, the sun is responsible for a smaller, secondary ocean tide, but the magnitude pales in comparison to the moon’s effect). The time of day has nothing to do with the moon’s position so if there worked out to be a correlation then you’d need to consult moon charts every day!

I redid my previous analysis, grouping earthquakes by position relative to the moon. The following histogram was the result:

As can easily be seen with the eye, the deviation of the data from the mean is even smaller this time (+/- 53). In fact, there were no points outside two standard deviations and the p-value was 0.33. Again, statistically insignificant!

Conclusion

In short, there’s no particular hour to be extra wary of earthquakes. Unfortunately, I’ll just have to settle for the USGS’s “sometime in the next 30 years”.


-Lyndie Chiou

CDC Flu App Challenge

The CDC is getting into apps. They are challenging developers to design an app to fight the flu using mashups with their datasets as well as other publicly available datasets. Cash prizes! It seems like they are looking for nice ways to present and interact with flu data that will keep people educated about the flu year-round. You can design apps for just about any platform, including websites and mobile phones. The deadline is May 27, 2011.

-Lyndie Chiou

Economic Recovery in Red vs Blue States

I was chatting with my Dad recently and he brought up a debate he’d heard on the radio between a Republican and Democratic candidate.

The Republican candidate said that in our present-day recession economy, Republican states were better off than Democratic states. My Dad seemed to particularly relish how the Democratic candidate scrambled to defend his party but didn’t contradict anything the Republican guy was saying.

Politicians are known for saying anything to win elections. Is it really true that Republicans manage the economies of their states better?

I found a treasure trove of state political data on Wikipedia. I also found information on unemployment for the month of February 2011 at the website for the US Bureau of Labor Statistics. And finally, I was able to get an estimate of the budget gap/head via The Center on Budget and Policy Priorities.

And so I put everything together into a spreadsheet and stared at the data.

One assumption I’d always had (thanks to the New Deal era of President FDR) was that government spending was the best way to keep the country afloat during a recession. My data allowed me to plot unemployment vs. the state budget gap per person. Each dot in the graph below represents a state. The data is obviously very noisy, but there appeared to be a correlation between higher state budget gaps and greater unemployment. You could fit a line to this data, but the correlation was weak (only ~0.24).
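For readers who want to repeat this kind of check on their own spreadsheet, here is a minimal sketch of Pearson's correlation coefficient. It assumes the ~0.24 quoted above is Pearson's r, which the post does not state, and the sample numbers are made up purely to show the call; they are not the state data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical values: budget gap in $/head and unemployment in percent
budget_gap = [0, 200, 400, 600, 800]
unemployment = [8.0, 7.0, 9.5, 8.5, 10.0]
print(f"r = {pearson_r(budget_gap, unemployment):.2f}")

# If 0.24 is Pearson's r, the share of variance a fitted line explains is r squared
print(f"r squared for r = 0.24: {0.24 ** 2:.3f}")
```

Squaring r is a quick way to see how weak the relationship is: a line fitted to data with r of 0.24 would account for only a small fraction of the variation in unemployment.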

The colors of the dots in this graph also show whether the state governor was a Republican or Democrat (if the governor’s party changed because of the November 2010 election, I went with the previous governor).

One might also conclude from this graph that Republican governors ran up larger state budget gaps and had higher unemployment than Democratic governors.  Just for the record, the state on the extreme right is Alaska, home of Sarah Palin. The state with the highest unemployment is Nevada.

Not wishing to make conclusions too quickly, we can use another metric to decide how to categorize the “Republicaness” of a state — the ratio of Republicans and Democrats in the upper and lower state legislatures.

It turned out that lower legislatures were all majority Democratic. That was surprising! In the chart below, red dots are more highly Republican legislatures and blue are more highly Democratic. Shades of purple show the degree of mixture. You can see the dots are all blue and shades of purple.

On the other hand, the upper legislatures in the states varied between majority Republican and majority Democratic.

The colors of the dots in this image now reflect the ratio of Republicans vs. Democrats in the upper legislatures. I’ve added threshold lines showing the highest value of the budget gap associated with each party. Clearly, the Democratic legislatures had larger budget gaps, but only narrowly. What struck me as a stronger relationship here was that purple, or split states, actually had the highest budget gap/person.

In fact, if I rotate the figure and fit a Gaussian bounding the outer edges…

The higher the budget gap, the more mixed the legislature. Note that the reverse was not true: a more mixed legislature did not necessarily imply that the budget gap was larger. In fact, there were several purple states very close to and even on top of the $0/head mark. Perhaps a graph of the standard deviation would also be enlightening:

Is it obvious that the more homogeneous the legislature, the more fiscally responsible its actions? Democrats spend more, but also tax more. Republicans tax less, but also spend less. An even mixture of the two bodies can lead to the extremes of spending less and taxing more (the $0/head Montana) or spending more and taxing less (the $1830/head Alaska).

It seems clear that majority Republican states are not better off than majority Democratic states. But… Both the Republican and Democratic candidates could have gotten away with claiming that their states were better off than (some of) the purple states!

For your enjoyment, I uploaded my spreadsheet of state Republican vs. Democrat data to this website’s wiki: Republican and Democratic Economic Data, Feb 2011.


-Lyndie Chiou

Ratio of Republicans in State Upper Legislatures

Here’s an interesting relationship. The graph shows the percentage of Republicans in the states’ lower legislatures versus the upper legislatures.

It would appear that if you’re a Republican, you have the best chances of winning an election in the lower house in states where the upper house is split. You will have a hard time in states where the upper house is either mostly Republican or Democratic.

Go figure…

The data for this spreadsheet has been uploaded to this website’s wiki: Republican and Democratic Economic Data, Feb 2011

-Lyndie Chiou

International Women’s Day

Thanks to Google’s ngrams project page I have wasted my scarce spare hours looking at micro trends in literature. A couple of months ago, the Google ngrams project presented a database of all the words from Google’s extensive book collection. Making the books freely available presents copyright issues, but a database of word frequency in a collection of books is legal. They even created a simple graphing tool so you can basically play with the data. Or you can download the entire dataset for your own purposes. Micro-trends in literature might not sound very exciting, but once I started trying words, it became an addictive tool to try to prove my zany cultural theories.

One graph seemed very appropriate for today, International Women’s Day.  I plotted the words “men”, “women” and “children” versus time. And look!

“men” in blue, “women” in red, “children” in green

The years range from 1800 to 2008 and you can see clearly that the word “men” (the blue line) rules by a long shot up until about 1920. To be fair, “men” can be used in a generalized sense to mean both men and women similar to the word “mankind”. Since there’s no context I can’t distinguish what percentage of the words actually refer to both sexes.

But the interesting part of the graph is the uptick in the usage of “women” starting during the era of 1960s feminism. Even more interesting, “women” overtakes “men” in the mid-1990s.

Shortly afterwards, “women” decreases and “men” once again rule. A decline in feminism? Or perhaps the bubble in the 1990s was due to the peak in so-called chick-lit which has since gone out of favor. To provide a cultural reference point, Bridget Jones’s Diary, the epitome of chick-lit, came out in 1996.

“Children” seem to have a steady increase all the way from the 1800s to the present day. The slow rate of increase in the word “children” surprises me since there’s been an explosion of children’s books since the days of Beatrix Potter. Perhaps Google has disdained uploading children’s literature into its database? I also tried the words “boy” and “girl” and they show a lower percentage of usage than “children”:

“children” in blue, “boy” in red, “girl” in green

Happy International Women’s Day.

-Lyndie Chiou

Google Brings Data Back

If you follow my blog (wink!) you’ll recall that I was surprised that Google cancelled its data hosting service, Palimpsest. Well, they’ve brought it back big time, albeit under a new moniker. The Google Public Data Explorer was announced yesterday. You can upload any dataset you like, so long as you format the data using DSPL (Dataset Publishing Language), which is based on XML. The hosting service is totally free.

They’ve also got a service called DataWiki listed in their Google Labs section which allows users to upload “structured” data. I’m not sure how these two services differ and whether Google really needs both. But at least duplication is far better than the situation in 2009 when they cancelled the one data-hosting service they ran.

In addition to the data hosting, there is a set of tools which can be used to display the data. Right now, the home page shows a graphic for lifespan vs. number of offspring by geographical region. The graphic includes a cool slider which shows the fertility bubbles jumping around as time progresses.

I’m guessing this is a harbinger for Google to return to its research roots and take a step back from the profit juggernaut it has turned into.

A related snippet of news that I came across was the announcement by Intel that it is starting a center at Stanford devoted to visual computing. And GE recently came out with visualizing.org, a website devoted to data visualization which also incorporates other elements such as website contests. I noticed they have a contest for visualizing eco data with a $5,000 prize! Their ads have been appearing everywhere, including on this blog.

 

-Lyndie Chiou

Most genetic breakthroughs false?

I was once involved in a discussion about the lack of negative results in research. Negative results occur when the assumed hypothesis is proved false. Or in other words, what the researcher was trying to prove turned out to be wrong.

This new knowledge is just as important as positive results, but such studies are rarely published. To prove my point, I looked through a sample of online scientific studies to find a paper where the main result had a p-value of greater than 0.05. The p-value is the probability of seeing data at least as extreme as what was observed if there were no real effect (the null hypothesis). The larger the p-value, the more plausible it is that the result is a fluke rather than a real effect. A p-value of 0.05 is the standard cut-off: results below it are conventionally called statistically significant.
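To make the p-value concrete, here is a minimal sketch using a made-up coin-flipping experiment (not data from any study mentioned here). The null hypothesis is that the coin is fair, and the p-value is the probability of getting at least as many heads as observed if that hypothesis is true. The function name is mine.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) when the null hypothesis holds."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical experiments: 100 flips each
for heads in (55, 60):
    p = binomial_p_value(heads, 100)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{heads} heads out of 100: p = {p:.4f} ({verdict} at 0.05)")
```

A result that lands above the 0.05 cut-off is the kind of negative result this post is about: the data are compatible with there being no effect at all.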

The result? Zero. Nada. No papers in my casual, small survey presented a negative result.

One might conclude that scientific intuition is always correct. But in fact people already know that negative results get trashed. Or even worse, they get recycled. Recycling can happen when the researcher keeps tweaking the data and testing with different statistical measures until he/she finds one that gives a low enough p-value to get published.
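The danger of recycling can be put in numbers. This is a minimal sketch, assuming each extra analysis is an independent test at the 0.05 level on data with no real effect. That is a simplification of mine, since re-analyses of the same data are usually correlated, but it shows how quickly a spurious "significant" result becomes likely.

```python
def chance_of_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """P(at least one test gives p < alpha) when no real effect exists.

    Assumes the tests are independent of one another.
    """
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 10, 20):
    chance = chance_of_false_positive(n_tests)
    print(f"{n_tests:2d} tests: {chance:.0%} chance of a spurious result")
```

Under these assumptions the chance of at least one spurious result grows with every additional test, which is why a single low p-value found after many attempts says very little.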

In the end, the real research in such results is the study to see which statistical method best correlates the data with the hypothesis.

The main problem with this approach is that when larger datasets are used to check the results, the conclusions don’t hold up. There are a number of historical examples where this exact scenario has happened.

In The Guardian, it was reported that several genes linked to behavior turned out to have no correlation in larger datasets. For instance, one study claimed that an enzyme used to produce serotonin in the brain correlated with depression. The study was widely reported not just in scientific journals, but also the mainstream media. Unfortunately, the results of several larger, more controlled studies turned out to show no such correlation. And, of course, these more carefully controlled larger studies were ignored by the mass media (although they were reported in scientific journals).

I wonder if one day a Journal of Negative Results will gain just as much traction and generate just as much interest as our current positive result bias? Such a journal could push the boundary of knowledge just as much as all the positive results journals.  Given the relatively cheap distribution model of the Internet, it seems like a quality negative results  journal shouldn’t be too hard to birth. And I’d be willing to bet anyone a coffee that there’s a veritable MOUNTAIN of papers just waiting to grace its pages.

References:

Munafo M. et al, Genetic ‘breakthroughs’ in medicine are often nothing of the sort, guardian.co.uk, 9 Nov 2009

-Lyndie Chiou

Data-mining in school districts

Starting a decade ago, more and more schools across the US began implementing data mining programs to improve school performance.

It was claimed that data-mining real-time and after the school year had ended could identify students who were in danger of not graduating. Real-time data mining would be a way to find these students early on so they could be provided with targeted assistance in their weak subjects.

Data-mining after the school year was over provided a method to check how programs performed and to evaluate the weaknesses and strengths of individual teachers and schools.

After a media splash, more and more school districts turned to data-mining as a way to improve their overall performance.

I decided to follow up on schools to try to data-mine the effect of data-mining. Ok, that was originally what I wanted to do, but very quickly I realized it would be a long research project, not just a short project. So this analysis is not that sophisticated. I just looked to see whether graduation rates improved in one school district after it started its data-mining program.

One of the earliest to jump onto the data-mining bandwagon was Broward County School District in Florida. This is a large district with a low graduation rate. Broward County School District is among the nation’s 10 largest school districts with almost 250,000 students. It was featured in a CNN article in 2000 detailing its plans to provide data-mining services via a $2 million grant from IBM.

The year the data-mining project began, the graduation rate was 62.3%. In the 2008-2009 school year, the most recent school year with reported rates, it was 76.3%. Below is a graph of the graduation rate from 1998-2008 (data-mining began in 2000).

Given Broward High School’s class size of about 1,200 kids/grade, that’s about 1,300 extra students who graduated during the 8-year time span who otherwise would not have made it, about the same size as a whole class.

Of course, it’s hard to tell if this can all be put down to the benefits of real-time data-mining. To do this study properly, all the schools in the country that had implemented data-mining programs should be compared against all the schools that hadn’t. This brings me to another point… Data in the education domain is very difficult to obtain and seemingly unreliable! I found at least 3 different websites with conflicting data for the same clearly-defined measure of graduation rate. In the end, I went with the stats listed on the Florida Department of Education which differed from the Broward County website data which was also different from the National Center for Education Statistics data! These are all government sources and therefore reliable, one would have thought…

Thus I drop the ball of examining the usefulness of data-mining in education right here… I wish education data was more centralized and therefore easier to access and hopefully more reliable. Any parent who is trying to decide where to live in order to put their kids in the best schools probably has wished the exact same thing, as well as researchers looking for ways to improve education.

UPDATE: After I wrote this post, I discovered a mine-load of data (although mainly test scores, incidents, etc., not graduation rates) at this Florida Dept of Education link. I may revisit this subject to form a more proper conclusion about Broward District’s results sometime in the future!

-Lyndie Chiou

Data mining contests

I’m a sucker for competitions with lots of prize money… So I went fishing on the web looking for data mining contests. I only found three results – do you know of any others? Comment on this post and I can update the list for everyone. Here are the competitions I found:

1. Of course, round 1 of the Netflix competition has ended, but did you know there’s a round 2 — also with a $1 million prize? Round 2 will be a time-limited contest involving sparse datasets. The full details for the Netflix 2 prize will be announced in the near future on their website. Once the contest has been officially started, it will have a progress prize at 6 months and then finish at 18 months.

2. There’s a statistical methods competition called the OMOP Cup: Method Competition. It’s organized by the Observational Medical Outcomes Partnership. The purpose is to improve on current methods of utilizing real-time data to ensure drug safety. There are two parts to the competition (taken from the website):

  • Challenge 1 explores how well your method works when provided an entire dataset, so the goal is accurate classification of which drugs are associated with which outcomes.
  • Challenge 2 evaluates the timeliness of detection of drug-event associations by having your methods run against data sequentially as it accumulates over time.

The total prize money is $20,000. Visit the OMOP Cup: Method Competition website for full details.

3. Every year, KDD (Knowledge Discovery and Data Mining) sponsors a data-mining competition with a cash prize of around $5,000. The competition is usually announced in spring, so apologies for mentioning it now – you will have to wait until 2010. You can look at info on past competitions here.

-Lyndie Chiou

Human DNA

There’s been a big push to gather together a vast genomic library of human DNA. The purposes of the dataset range from personal genomic analysis to research into the genetic causes of diseases. Will any of these datasets be available to the public? The answer is nope.

Back in 2006, the National Human Genome Research Institute began collecting human DNA and posting it online for researchers to freely download. The datasets were downloaded 491 times before access was restricted. The reason? Fears about protecting patient privacy.

The basic reasoning goes something like this: say you donated your DNA to be analyzed and an unscrupulous person downloaded your data. This person could then synthesize your data and plant it at a crime scene. Investigators taking evidence would find your synthesized DNA and compare it with the online database from the National Human Genome Research Institute. If there was a match, they could then legally compel the Institute to turn over your identity for prosecution.

Does this seem a bit far-fetched? Over 99.9% of the human genome is identical from one human to the next. So forensic detectives just analyze the unique sections of the human genome to identify a person. Only these sections need to be synthesized. There is a well-known technique that allows a researcher to take a section of DNA and substitute a specific genomic pattern. Scientists such as Craig Venter are using this approach to try to create “artificial” life.

Of course, there’s the much more mundane threat that an insurance company could data mine the online DNA profiles to screen applicants. Or perhaps an employer could look up your genetic risk for alcoholism.

You might argue that the National Human Genome Research Institute should only provide averaged data. However, in August 2008, a group of extremely clever researchers published a paper describing how to extract individual genomes from highly averaged data. So even averaged data does not protect the individual.

For the time being, in order to access the human genome datasets at the National Human Genome Research Institute you have to be approved.

The number of for-profit personal DNA analysis companies is also growing.  A quick session with Google easily found five (for the list of retailers I found, see the bottom of this post). These companies charge anywhere from $399 to $99,500 for access to your personal genome. The results of their analyses can be very entertaining, informing you of your genetic lineage as well as a range of genetic diseases for which you may carry a susceptible gene or two. 23andMe.com even won the honor of Time’s Best Invention of 2008.

However, some question what safeguards are in place regarding their customers’ DNA. Do the retail genomic companies offer the same protection that financial companies use to protect customer records? Of course, we all know how well those protections have worked. If someone were to hack 23andMe.com, you wouldn’t have the luxury of changing your DNA the way you can change your credit card number.

There have been a number of interesting articles written about the subject. Probably the best one was recently published in the American Scientist.

Retail genomic companies

  • 23andMe.com - $399. Genotype information for about 600,000 SNPs. They claim to estimate the genetic risk of the patient for over 80 diseases as well as ancestry analyses.
  • deCODEme.com - $985 – very similar to 23andMe but performs an analysis of 1 million SNPs and estimates the risk of 38 diseases.
  • RetailGenomics.com - $1000. Lists 72 conditions for which it tests. A relative newcomer; I can’t seem to find much information on it.
  • Navigenics - $2500. Uses Affymetrix Genome-Wide Human SNP Array 6.0, which tests some 900,000 SNPs and provides results on 20 diseases.
  • Knome - $99,500. Provides whole genome (98% genome) sequencing services. After analysis, the customer must travel to company headquarters where the scientists who developed their results discuss the analysis with them.
  • After having used any of these services, there is a free program called Promethease that will further analyze your personal genome for you. The service is FREE! 

Non-retail genomic organizations

  • PersonalGenomes.org – A charitable health & disease research project. Volunteers must permit their DNA to be made freely available to the public; however, the volunteers receive personal analysis as well as access to their own genome.
  • Research Program on Genes, Environment and Health – Kaiser Permanente’s project  to sequence the genes of its members in Northern California. Volunteers are supposedly anonymized and do not receive any genetic results. Limited access for approved researchers.
  • National Human Genome Research Institute – a government funded genomics project. I don’t see a link for volunteers. The volunteer DNA is made available for approved researchers.

-Lyndie Chiou



