CULTUROMICS: EXPERIMENTING ON GOOGLE BOOKS WITH N-GRAM VIEWER

Susan Haynes, Eastern Michigan University, MASAL 2013

Definition of culturomics
• Computational analysis of cultural or social trends using quantitative analysis of digitized text.
• Term coined in 2010 by researchers who helped create Google’s Ngram Viewer.
• Example research:
  • Detecting censorship and suppression (e.g., frequency trends of “Marc Chagall”)
  • Trends in fame (see following slide)
  • Trends in interest in “current events”
  • Using the Ngram Viewer to attempt a physics-motivated statistical model of fluctuations in English word usage, from word “birth” to “death”
  • Using news archives to identify cultural events (Arab Spring 2011)

“FAME”

The Data
• Google Books project:
  • As of March 2012, more than 20 million books scanned and digitized
  • Publication dates range back to the 1500s (before 1700, approx. 0.5 million English books were published)
• Access to text:
  • If the book is in the public domain, the complete text is viewable
  • If the book is in copyright and the copyright owner grants permission, the viewer may “preview” a subset of pages
  • If the copyright owner does not grant “preview” permission, permission may still be given for “snippets” (2 or 3 lines)
  • If the copyright owner is unknown: “snippets”
• Books with full view or “preview” (including snippets) permission are searchable, though they may not be viewable.
• Books with neither full view nor “preview” permission are not searchable. Google Books gives only the book title.
• NB, one opinion: “[…] Google Books searches are an unreliable indicator of the prevalence of specific usages or terms, because many authoritative works fall into the unsearchable category” [http://en.wikipedia.org/wiki/Google_books]
• OCR is done very fast (1000 pages/hour), and there are many errors, including upside-down pages, interleaved books, and wrong characters. User feedback on errors has been curtailed because of the volume of data.
• The 2012 data are much improved over the previous data set.

N-grams
• An n-gram is a sequence of n items from the books.
• Sizes 1, 2, and 3 are “unigram”, “bigram”, and “trigram”, respectively.
• Larger sizes use the value of n: 4-gram, 5-gram, …
• In Google Books, an “item” is a sequence of alphanumerics.
• In computational linguistics, the items of an n-gram can be any fixed unit: words, syllables, letters, phonemes, …
• In other areas where probabilistic methods are used to predict the next item in a sequence, an item could be, e.g., an amino acid (proteomics) or a DNA base pair (genomics).
• A common model for predicting the next item in an n-gram is a Markov model of order n-1. (See next slide for the “math”.)
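The definition above can be made concrete with a short sketch. It treats an item as a maximal run of letters and digits, which follows the slide's description but is only an approximation of how Google Books actually tokenizes text; the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into items: maximal runs of ASCII letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def ngram_counts(text, n):
    """Count how often each n-gram occurs in the text."""
    return Counter(ngrams(tokenize(text), n))


if __name__ == "__main__":
    sample = "the cat is pretty and the cat is small"
    # Show the three most frequent bigrams of the sample sentence.
    for gram, count in ngram_counts(sample, 2).most_common(3):
        print(" ".join(gram), count)
```

A sequence shorter than n yields no n-grams, so the same function works unchanged for unigrams through 5-grams.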

Markov chain of order m (m is the “amount of memory” needed to predict the next state)
A Markov chain of order m is a statistical process satisfying, for n > m:
P(X_n = x_n | X_{n-1} = x_{n-1}, …, X_1 = x_1) = P(X_n = x_n | X_{n-1} = x_{n-1}, X_{n-2} = x_{n-2}, …, X_{n-m} = x_{n-m})
where X_i is a random variable and x_i is a value. That is, the state at index n depends only on the values of the previous m states.
Example: “The cat is _____”, with X_1 = “The”, X_2 = “cat”, X_3 = “is”, and X_4 the missing word:
P(X_4 = “pretty” | X_3 = “is”, X_2 = “cat”, X_1 = “The”)
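One simple way to estimate such a conditional probability is by counting: of all the times a given context of m items occurs, how often is it followed by a given item? The sketch below does that on a toy word sequence. The function names and the sample text are made up for illustration, and no smoothing for unseen contexts is attempted.

```python
from collections import Counter, defaultdict


def train(items, m):
    """Count which item follows each context of m consecutive items."""
    followers = defaultdict(Counter)
    for i in range(m, len(items)):
        context = tuple(items[i - m:i])
        followers[context][items[i]] += 1
    return followers


def probability(followers, context, item):
    """Estimate P(next = item | previous m items = context) from counts."""
    counts = followers.get(tuple(context))
    if not counts:
        return 0.0  # context never seen in the training items
    return counts[item] / sum(counts.values())


if __name__ == "__main__":
    # Toy training text; real estimates would come from n-gram counts.
    text = "the cat is pretty . the cat is small . the cat is pretty ."
    model = train(text.split(), 3)
    print(probability(model, ["the", "cat", "is"], "pretty"))
```

With m = 3 this is the estimate a 4-gram table supports: the count of the full 4-gram divided by the total count of 4-grams sharing its first three items.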

Google Books Ngram Viewer
• The data: k-grams, k = 1 … 5
• Each is divided into several files to make data downloads more manageable. For example, in the English 20120701 corpus, here is the list of 3-gram files:
  0 1 2 3 4 5 6 7 8 9
  _ADJ_ _ADP_ _ADV_ _CONJ_ _DET_ _NOUN_ _NUM_ _PRON_ _PRT_ _VERB_
  a_ aa ab ac ad ae af ag ah ai aj ak al am an ao ap aq ar as at au av aw ax ay az
  b_ ba bb bc bd be bf bg bh bi bj bk bl bm bn bo bp bq br bs bt bu bv bw bx by bz
  c_ ca cb cc cd ce cf cg ch ci cj ck cl cm cn co cp cq cr cs ct cu cv cw cx cy cz
  d_ da db dc dd de df dg dh di dj dk dl dm dn do dp dq dr ds dt du dv dw dx dy dz
  e_ ea eb ec ed ee ef eg eh ei ej ek el em en eo ep eq er es et eu ev ew ex ey ez
  f_ fa fb fc fd fe ff fg fh fi fj fk fl fm fn fo fp fq fr fs ft fu fv fw fx fy fz
  g_ ga gb gc gd ge gf gg gh gi gj gk gl gm gn go gp gq gr gs gt gu gv gw gx gy gz
  h_ ha hb hc hd he hf hg hh hi hj hk hl hm hn ho hp hq hr hs ht hu hv hw hx hy hz
  i_ ia ib ic id ie if ig ih ii ij ik il im in io ip iq ir is it iu iv iw ix iy iz
  j_ ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju jv jw jx jy jz
  k_ ka kb kc kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz
  l_ la lb lc ld le lf lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz
  m_ ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz
  n_ na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny nz
  o_ oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot other ou ov ow ox oy oz
  p_ pa pb pc pd pe pf pg ph pi pj pk pl pm pn po pp pq pr ps pt pu punctuation pv pw px py pz
  q_ qa qb qc qd qe qf qg qh qi qj qk ql qm qn qo qp qq qr qs qt qu qv qw qx qy qz
  r_ ra rb rc rd re rf rg rh ri rj rk rl rm rn ro rp rq rr rs rt ru rv rw rx ry rz
  s_ sa sb sc sd se sf sg sh si sj sk sl sm sn so sp sq sr ss st su sv sw sx sy sz
  t_ ta tb tc td te tf tg th ti tj tk tl tm tn to tp tq tr ts tt tu tv tw tx ty tz
  u_ ua ub uc ud ue uf ug uh ui uj uk ul um un uo up uq ur us ut uu uv uw ux uy uz
  v_ va vb vc vd ve vf vg vh vi vj vk vl vm vn vo vp vq vr vs vt vu vv vw vx vy vz
  w_ wa wb wc wd we wf wg wh wi wj wk wl wm wn wo wp wq wr ws wt wu wv ww wx wy wz
  x_ xa xb xc xd xe xf xg xh xi xj xk xl xm xn xo xp xq xr xs xt xu xv xw xx xy xz
  y_ ya yb yc yd ye yf yg yh yi yj yk yl ym yn yo yp yq yr ys yt yu yv yw yx yy yz
  z_ za zb zc zd ze zf zg zh zi zj zk zl zm zn zo zp zq zr zs zt zu zv zw zx zy zz
• The ngram file format is: ngram TAB year TAB match_count TAB volume_count
• Each file is TSV (tab-separated values). Two sample contiguous lines in the y_ file (2-grams) record the bigram “Y2O3 have” for the years 1972 and 1974; the first is:
  Y2O3 have TAB 1972 TAB 1 TAB 1
• The first line means that the bigram “Y2O3 have” appeared one time in 1972, in one volume.
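Given the four-column format above, reading a downloaded file is a few lines of Python. This is a minimal sketch: the record type and function name are hypothetical, and the input is assumed to be an already-decompressed text file or any iterable of lines.

```python
import csv
import io
from typing import NamedTuple


class NgramRecord(NamedTuple):
    ngram: str
    year: int
    match_count: int
    volume_count: int


def read_ngram_records(lines):
    """Yield one NgramRecord per tab-separated line."""
    # QUOTE_NONE keeps quotation marks inside an ngram as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed lines
        ngram, year, match_count, volume_count = row
        yield NgramRecord(ngram, int(year), int(match_count), int(volume_count))


if __name__ == "__main__":
    # The sample line from the slide: one match in one volume in 1972.
    sample = io.StringIO("Y2O3 have\t1972\t1\t1\n")
    for record in read_ngram_records(sample):
        print(record)
```

Because the function is a generator, it can stream a large file without loading it all into memory.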

Simple example of n-gram viewer • Query: computer science, computer technology, information technology • Settings: corpus, start date, end date, smoothing
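The smoothing setting is worth a closer look. As far as I understand it, a smoothing of s replaces each year's value with the average of that value and up to s values on either side, i.e., a centered moving average. The sketch below implements that reading; it is an assumption about the Viewer's behavior, not taken from its code, and the function name is hypothetical.

```python
def smooth(values, smoothing):
    """Centered moving average; the window shrinks at the ends of the series."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)
        hi = min(len(values), i + smoothing + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


if __name__ == "__main__":
    # Made-up yearly frequencies with one spike, for illustration only.
    raw = [0.0, 0.0, 9.0, 0.0, 0.0]
    print(smooth(raw, 0))  # smoothing 0 leaves the data unchanged
    print(smooth(raw, 1))  # the spike is spread over neighboring years
```

Larger smoothing values make long-term trends easier to see but blur the exact year in which a change occurs.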

Areas of computer science

Simple example 3

Social impact of computing “digital divide”

Some n-gram viewer operations
• Some arithmetic operations: +, -, /, *, ( )
  • (and/or) retrieves the frequency of the phrase “and/or”, while and/or returns the frequency of “and” divided by the frequency of “or”
  • * is useful for comparing trends that have very different frequencies
• Choose a corpus with “:”
  • faux:eng_2012, faux:fre_2012 shows frequency trends for “faux” in the English corpus and in the French corpus
• Part-of-speech tagging
  • _NOUN_, _VERB_, …
  • E.g., race_NOUN, race_VERB
• Dependencies: => (see next slide for example)
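The arithmetic operators act on whole frequency series, one value per year. A rough sketch of what division and scaling amount to, using hypothetical helper names and made-up numbers (this illustrates the idea only and is not the Viewer's implementation):

```python
def combine(series_a, series_b, op):
    """Apply op year by year over the years present in both series."""
    years = sorted(set(series_a) & set(series_b))
    return {year: op(series_a[year], series_b[year]) for year in years}


def scale(series, factor):
    """Multiply every value by a constant, as `phrase * factor` would."""
    return {year: value * factor for year, value in series.items()}


if __name__ == "__main__":
    # Made-up frequencies, for illustration only.
    freq_and = {1990: 0.02, 2000: 0.03}
    freq_or = {1990: 0.01, 2000: 0.01}
    # Analogue of the query and/or: frequency of "and" over frequency of "or".
    print(combine(freq_and, freq_or, lambda a, b: a / b))
    # Analogue of a rare phrase scaled up so it can share an axis with a common one.
    print(scale({1990: 0.00001, 2000: 0.00002}, 1000))
```

Years missing from either series are dropped rather than guessed, and a zero in the divisor series raises an error instead of silently producing a value.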

Example: how often “computer” modifies “crime” versus “computer crime”

Non-computer technology trends (?)

Example: computer terminology

Cultural trends in impact of computers (?)

Y2K, identity theft

Using the Google Books Ngram corpora for data mining
• The inventors of the Ngram Viewer state that it is just another tool, albeit a quantitative one, for research in culture, linguistics, etc.
• From my own perspective, weaknesses include:
  • Not all of the books used to build the n-grams are accessible, which makes it impossible to apply clustering or other data-mining techniques to the inaccessible data.
  • There is no pointer back to the books an n-gram was obtained from (just the number of hits and the number of volumes), which makes it impossible to use the Ngram Viewer to identify specific books.
  • Phrase “discovery” in a particular domain (e.g., computing) is not really possible in the n-gram files; the phrase has to be known ahead of time (e.g., researchers used “Marc Chagall” to identify censorship, not “censorship” to discover that Marc Chagall was censored).

Problems
• In fast-moving areas, books are not a good medium for analysis. For computer-related topics, the web itself may be better.
• Web data is unmanageable given its lack of context relative to books. The ngram data sets may be useful for crude identification of trends over time.
• Trends over time only, not over space or other possible dimensions (…)
• NOT A PROBLEM: there is a Python utility and a Windows utility that will retrieve the data that the Ngram Viewer uses to construct its graphs.

References
• Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. “Quantitative Analysis of Culture Using Millions of Digitized Books”. Science 331 (2011). [http://www.sciencemag.org/content/331/6014/176.full.pdf]
• http://www.culturomics.org/
• books.google.com/ngrams
• J. Bohannon. “Digital Data. Google Books, Wikipedia and the future of culturomics”. Science, January 14, 2011. [http://www.sciencemag.org/content/331/6014/135.long]
• Brian Hayes. “Bit Lit”. American Scientist 99(3), May-June 2011.
• David W. Letcher (April 6, 2011). “Culturomics: A New Way to See Temporal Changes in the Prevalence of Words and Phrases”. American Institute of Higher Education 6th International Conference Proceedings 4 (1): 228. [http://www.amhighed.com/documents/charleston2011/AIHE2011_Proceedings.pdf#page=228]
• Ben Zimmer. “When physicists do linguistics”. Boston Globe, February 10, 2013. [http://bostonglobe.com/ideas/2013/02/10/when-physicistslinguistics/ZoHNxhE6uunmM7976nWsRP/story.html]
• http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3663/3040