Difference between revisions of "Ngrams"

From RP

Revision as of 02:59, 16 March 2011

About Google's Ngram Dataset


Google compiled a word database from its collection of online books. Because of legal issues, Google could not post the unedited content of the books, so it published word and phrase frequencies instead.

Access the Data

http://ngrams.googlelabs.com/datasets

Contents of the Data

These datasets were generated in July 2009. Google will update them as its book scanning continues, and each updated version will have a distinct and persistent version identifier (20090715 for the current set).

Each of the links on the datasets page directly downloads a fragment of the given corpus. For instance, the first hundred links collectively comprise the 1-gram (i.e., individual word) counts for English, as collected from Google's scanned books around July 15, 2009.

Inside each file, the ngrams are sorted alphabetically and then chronologically. Note that the files themselves are not ordered with respect to one another: a French two-word phrase starting with 'm' will be in the middle of one of the French 2-gram files, but there is no way to know which one without checking them all.
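Because the fragments are unordered, looking up one phrase means scanning every fragment of the corpus, although the sort order inside a fragment lets the scan stop as soon as it has passed the phrase. The following is a minimal Python sketch of that lookup, not an official tool. It assumes each fragment has already been unzipped to a tab-separated text file whose first three columns are the ngram, the year, and the match count; the names NgramRow, find_in_fragment, and find_in_corpus are hypothetical and chosen for illustration only.

```python
from typing import Iterable, Iterator, List, NamedTuple


class NgramRow(NamedTuple):
    ngram: str
    year: int
    match_count: int


def parse_rows(lines: Iterable[str]) -> Iterator[NgramRow]:
    """Parse tab-separated lines (assumed layout: ngram, year, match_count, ...)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        yield NgramRow(fields[0], int(fields[1]), int(fields[2]))


def find_in_fragment(lines: Iterable[str], target: str) -> List[NgramRow]:
    """Return every row for `target` found in a single fragment."""
    hits: List[NgramRow] = []
    for row in parse_rows(lines):
        if row.ngram == target:
            hits.append(row)
        elif hits:
            # Rows for one ngram are contiguous in a sorted file, so the
            # first different ngram after a hit means we are past the target.
            break
    return hits


def find_in_corpus(paths: Iterable[str], target: str) -> List[NgramRow]:
    """Scan every fragment, since the files are not ordered among themselves."""
    hits: List[NgramRow] = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            hits.extend(find_in_fragment(handle, target))
    return hits
```

The early exit relies only on the rows for one ngram being adjacent, not on Python's string ordering matching the ordering used in the files, so it stays correct even if the two collations differ. Scanning every fragment rather than stopping at the first match is a deliberately conservative choice.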
