Text-Garden Software Suite
Quick Overview
Marko Grobelnik, Dunja Mladenic
Jozef Stefan Institute, Ljubljana, Slovenia
Outline
  - What is Text-Garden?
  - How is Text-Garden being built?
  - Major functionalities
  - Technical aspects
  - Future developments
  - Availability
  - Text-Garden & SMART
What is Text-Garden?
  - Text-Garden is a software library and a collection of software tools for solving large-scale tasks dealing with structured, semi-structured and unstructured data
      - …in particular, the functionality emphasizes dealing with text
  - It can be used in various ways, covering research and applied scenarios
      - It is used by several institutions such as BT, MSR, CMU, …
Some history…
  - The work started in 1996 as a set of C++ classes for dealing with text and performing text-learning tasks (two people working on it)
  - …until 2002 it developed slowly, following the academic tasks on our agenda
  - From 2003 on, Text-Garden became the central software platform that JSI uses in many research and applied projects (~10 people contributing)
      - …all the solutions and results JSI is working on eventually become part of the Text-Garden environment
…local JSI development of Text-Garden
[Diagram: the JSI team develops Text-Garden within the projects SMART STREP, PASCAL NoE, SEKT IP, NEON IP, …]
Major functionalities
Functionality blocks
  - Lexical text processing (tokenization, stop-words, stemming, n-grams, Wordnet)
  - Unsupervised learning (KMeans, Hierarchical-KMeans, OneClassSVM)
  - Semi-Supervised learning (Uncertainty sampling)
  - Supervised learning (SVM, Winnow, Perceptron, NBayes)
  - Dimensionality reduction (LSI, PCA)
  - Visualization (Graph based, Tiling, Density based, …)
  - Named Entity Extraction (capitalization based)
  - Cross Correlation (KCCA, matching text with other data)
  - Keyword Extraction (contrast, centroid, taxonomy based)
  - Large Taxonomies (dealing with DMoz, Medline)
  - Crawling Web and Search Eng. (for large scale data acquisition)
  - Scalable Search (inverted index)
Lexical processing
  - Includes transformation from various formats into a bag-of-words representation
      - …text/html, many custom formats (Svm-light, Reuters, …)
      - …will support most text encodings
  - Lexical processing includes
      - Tokenization
      - Stop-word removal
      - Stemming (Porter stemmer for English; we have ML for learning stemmers from lexicons for other languages)
      - Frequent N-Gram features (consecutive words co-appearing)
      - Proximity features (words co-appearing within window)
      - Wordnet integration (words co-appearing in synsets or in e.g. hypernym relations)
  - Output of this level is a .BOW (Bag-Of-Words) file
      - …with dictionary and sparse vectors for documents
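The tokenization, stop-word removal and n-gram steps listed above can be sketched in a few lines of Java. This is a generic illustration, not Text-Garden code: the class name `BowSketch`, the tiny stop-word list and the `_` bigram separator are all assumptions made for the example, and it counts every bigram rather than only frequent ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: not part of Text-Garden. All names here are hypothetical.
final class BowSketch {
    // Tiny stop-word list chosen for the example; real lists are much longer.
    static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("a", "an", "and", "are", "is", "of", "the", "to"));

    // Lower-case the text, split on non-letters, and drop stop-words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Count single words and pairs of tokens that are adjacent after stop-word removal.
    static Map<String, Integer> bagOfWords(String text) {
        List<String> tokens = tokenize(text);
        Map<String, Integer> bow = new TreeMap<String, Integer>();
        for (int i = 0; i < tokens.size(); i++) {
            bow.merge(tokens.get(i), 1, Integer::sum);
            if (i + 1 < tokens.size()) {
                bow.merge(tokens.get(i) + "_" + tokens.get(i + 1), 1, Integer::sum);
            }
        }
        return bow;
    }

    public static void main(String[] args) {
        System.out.println(bagOfWords("Different economic systems answer the basic questions."));
    }
}
```

The resulting map plays the role of one sparse document vector; a dictionary shared across documents would map each feature string to an integer id.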
Unsupervised learning
  - Algorithms:
      - K-means clustering
          - …clustering into flat clusters
      - Hierarchical-K-Means
          - …creating hierarchy of clusters
      - One-Class-SVM
          - …learning from positive class only
  - …result is a .BowPart (BOW Partition) file
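As a reminder of what flat K-means clustering does, here is a minimal Java sketch. It is not Text-Garden's implementation: it assumes dense vectors, squared Euclidean distance and a deterministic start from the first k points, whereas a text library would more likely work on sparse vectors with a cosine-style similarity. The name `KMeansSketch` is hypothetical.

```java
import java.util.Arrays;

// Illustrative only: plain Lloyd-style K-means on dense vectors.
final class KMeansSketch {
    static double dist2(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) {
            s += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return s;
    }

    // Returns the cluster index of every point. Assumes points.length >= k.
    static int[] cluster(double[][] points, int k, int maxIter) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // assumption: start from the first k points
        }
        int[] assign = new int[points.length];
        Arrays.fill(assign, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: move each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist2(points[i], centroids[c]) < dist2(points[i], centroids[best])) {
                        best = c;
                    }
                }
                if (assign[i] != best) {
                    assign[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point changed cluster
            }
            // Update step: recompute each centroid as the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assign[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assign[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                for (int d = 0; d < dim && counts[c] > 0; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {10, 10}, {0, 1}, {10, 11}};
        System.out.println(Arrays.toString(cluster(pts, 2, 20)));
    }
}
```

A hierarchical variant can be obtained by applying the same routine recursively to the points of each resulting cluster.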
Supervised learning
  - The following algorithms are implemented:
      - SVM (two class, regression)
      - Winnow
      - Perceptron
      - NaiveBayes
      - K-Nearest-Neighbor (KNN)
      - …
  - …result is a .BowMd (BOW Model) file which is used for further classification
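Of the listed algorithms, the perceptron is the shortest to write down. The following Java sketch trains one on sparse vectors (feature id mapped to weight), which is the natural shape for bag-of-words data. It is a textbook version under assumed names (`SparsePerceptronSketch`, labels of +1/-1), not the Text-Garden implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: classic perceptron over sparse feature vectors.
final class SparsePerceptronSketch {
    private final Map<Integer, Double> weights = new HashMap<Integer, Double>();
    private double bias = 0.0;

    // Signed score; a positive value means the positive class is predicted.
    double score(Map<Integer, Double> doc) {
        double s = bias;
        for (Map.Entry<Integer, Double> e : doc.entrySet()) {
            s += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return s;
    }

    // labels[i] must be +1 or -1; weights change only on misclassified documents.
    void train(List<Map<Integer, Double>> docs, int[] labels, int epochs) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < docs.size(); i++) {
                if (labels[i] * score(docs.get(i)) <= 0) {
                    for (Map.Entry<Integer, Double> e : docs.get(i).entrySet()) {
                        weights.merge(e.getKey(), labels[i] * e.getValue(), Double::sum);
                    }
                    bias += labels[i];
                }
            }
        }
    }
}
```

Only the features present in a misclassified document are touched during an update, so the cost of one pass scales with the number of non-zero entries rather than with the dictionary size.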
Dimensionality reduction
  - Transform the original space into a low-dimensional one and project the original data
  - Two classical methods
      - Latent Semantic Indexing (LSI)
          - …efficient implementation, working with sparse matrices
      - Principal Component Analysis (PCA)
  - …result is a .SemSpace (Semantic Space) file which is used for projecting the data
      - We can e.g. project the original .BOW file into a transformed lower-dimensional .BOW file
Named Entity Extraction
  - Simple and efficient NEE
      - …it is based on word capitalization
          - Candidate NEs (words and sequences of words) need to be capitalized
          - Heuristic rule: capitalized candidates must appear within text at least once
      - …we handle exceptions separately
  - Works well with some user interaction
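One way to read the capitalization heuristic is sketched below in Java. The interpretation is an assumption: "within text" is taken to mean "somewhere other than the start of a sentence", where capitalization carries no information. The class `CapitalizedEntitySketch`, the naive sentence splitter and the rule that punctuation ends a candidate are all choices made for this example, not details from Text-Garden.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: capitalization-based candidate extraction under stated assumptions.
final class CapitalizedEntitySketch {
    // Strip leading and trailing non-letters such as quotes or commas.
    static String clean(String word) {
        return word.replaceAll("^[^\\p{L}]+|[^\\p{L}]+$", "");
    }

    static boolean isCapitalized(String word) {
        String c = clean(word);
        return !c.isEmpty() && Character.isUpperCase(c.charAt(0));
    }

    // Collect maximal runs of capitalized words that start after the first word of a sentence.
    static Set<String> extract(String text) {
        Set<String> accepted = new LinkedHashSet<String>();
        for (String sentence : text.split("[.!?]+\\s*")) { // naive sentence splitting
            String[] words = sentence.trim().split("\\s+");
            int i = 0;
            while (i < words.length) {
                if (!isCapitalized(words[i])) {
                    i++;
                    continue;
                }
                int start = i;
                StringBuilder run = new StringBuilder();
                while (i < words.length && isCapitalized(words[i])) {
                    if (run.length() > 0) {
                        run.append(' ');
                    }
                    run.append(clean(words[i]));
                    String w = words[i];
                    boolean endsClause = !Character.isLetter(w.charAt(w.length() - 1));
                    i++;
                    if (endsClause) {
                        break; // a comma or similar mark ends the candidate
                    }
                }
                if (start > 0) {
                    accepted.add(run.toString());
                }
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(extract("The writer Oscar Wilde lived in London. Many people admired him."));
    }
}
```

A sketch like this misses entities that only ever open a sentence and breaks on abbreviations, which fits the remark that exceptions are handled separately and that some user interaction helps.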
Crawler & Search Engine
  - Crawler
      - …the Slovenian internet archive is being crawled by the Text-Garden Crawler
  - Highly scalable indexing & search of text documents
      - E.g. indexing of 10M documents in several hours
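The functionality overview names an inverted index as the basis of scalable search. The Java sketch below shows the idea at its smallest: each term maps to the sorted set of documents containing it, and a multi-word query intersects those sets. It is an in-memory toy under assumed names (`InvertedIndexSketch`, `add`, `search`), not the Text-Garden search engine.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: term -> sorted document ids, with AND queries.
final class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings =
        new HashMap<String, SortedSet<Integer>>();

    // Register every term of the document under the given id.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<Integer>()).add(docId);
            }
        }
    }

    // Return the ids of documents that contain all query terms.
    SortedSet<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            SortedSet<Integer> docs = postings.get(term);
            if (docs == null) {
                return new TreeSet<Integer>(); // an unknown term matches nothing
            }
            if (result == null) {
                result = new TreeSet<Integer>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }
}
```

Query time depends on the lengths of the posting lists involved rather than on the total number of documents, which is what makes the structure suitable for large collections.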
Support for selected external sources
  - Text-Garden has special support for the following databases and services
      - Google Search (Web/News/Scholar)
      - DMoz/Open Directory Project
      - Medline
      - WordNet
      - Yahoo! Finance
      - CIA World-Fact-Book
      - Cordis project database
      - Cyc (OpenCyc/ResearchCyc)
      - Reuters datasets (old, new), ACM TechNews, …
  - In preparation
      - Wikipedia, EuroVoc, AgroVoc (FAO)
      - MSN Search, Yahoo Search
Technical Aspects
Technical aspects
  - Text-Garden is almost entirely written in portable C++
      - …it compiles under Windows (Microsoft Visual C++, Borland C++) and Unix/Linux (GNU C)
      - …it runs on 32-bit and 64-bit platforms
      - …it consists of ~200,000 relatively compact lines of code
How to use Text-Garden functionality?
  - Text-Garden functionality can be accessed in a number of ways:
      - As plain C++ classes
          - Complete functionality
      - As DLL library of ~250 functions
          - Simplified extract of major functionality
      - As command line utilities
          - ~60 command line utilities that can be connected in a pipeline
      - Through GUI tools
          - (e.g. DocAtlas, OntoGen, …)
      - Through interfaces to several platforms
          - (Java, Matlab, …) – next slide
Multiplatform Text-Garden
  - Text-Garden has the following interfaces with the same API:
      - C/C++ – through simplified DLL & native C++
      - Java – through JNI
      - .NET – e.g. accessible through C#, VB, …
      - Matlab – through standard Matlab interface
      - Python – through standard Python interface
      - Mathematica, Prolog, R – in preparation
  - API has ~40 classes and ~250 functions
      - …interfaces to all the above platforms are generated automatically from the master Text-Garden header file
  - …next slides include some examples in Matlab and Java
Text Parsing to TFIDF – Matlab

    % Create a new bag-of-words document base
    BowDocBsId = NewBowDocBs('en523', 'porter');

    % Add two documents given as HTML strings
    DId(1) = AddBowDocBs_HtmlDocStr(BowDocBsId, 'Economics', 'There are several basic and incomplete questions that must be answered in order to resolve the problems of economics satisfactorily. The scarcity problem, for example, requires answers to basic questions, such as: what to produce, how to produce it, and who gets what is produced. An economic system is a way of answering these basic questions. Different economic systems answer them differently.', '', 1);
    DId(2) = AddBowDocBs_HtmlDocStr(BowDocBsId, 'Oscar Wilde', 'Oscar Fingal OFlahertie Wills Wilde (October 16, 1854 -- November 30, 1900) was an Irish playwright, novelist, poet, short story writer and Freemason. One of the most successful playwrights of late Victorian London, and one of the greatest celebrities of his day, known for his barbed and clever wit, he suffered a dramatic downfall and was imprisoned after being convicted in a famous trial for gross indecency (homosexual acts).', '', 1);

    % Compute the weighted (TFIDF) document vectors
    BowDocWgtBsId = GenBowDocWgtBs(BowDocBsId);

    % Print every word of the first document together with its weight
    WIds = GetBowDocwgtBs_DocWIds(BowDocWgtBsId, DId(1));
    for WIdN = 0:(WIds-1)
        WId = GetBowDocWgtBs_DocWId(BowDocWgtBsId, DId(1), WIdN);
        WordStr = GetBowDocBs_WordStr(BowDocBsId, WId);
        WordWgt = GetBowWgtDocBs_DocWWgt(BowDocWgtBsId, DId(1), WId);
        sprintf('%s: %.5f', WordStr, WordWgt)
    end
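For readers who want to see what a TFIDF weight is, here is a small self-contained Java sketch. The slides do not state which TFIDF variant Text-Garden uses, so the formula below (raw term frequency times the natural log of N/df, without length normalization) is an assumption, as is the class name `TfIdfSketch`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: weight(term, doc) = tf(term, doc) * ln(N / df(term)).
final class TfIdfSketch {
    static List<Map<String, Double>> weigh(List<List<String>> docs) {
        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<String>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> result = new ArrayList<Map<String, Double>>();
        for (List<String> doc : docs) {
            Map<String, Double> vec = new TreeMap<String, Double>();
            for (String term : doc) {
                vec.merge(term, 1.0, Double::sum); // raw term frequency
            }
            for (Map.Entry<String, Double> e : vec.entrySet()) {
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                e.setValue(e.getValue() * idf);
            }
            result.add(vec);
        }
        return result;
    }
}
```

With this variant a term that occurs in every document gets weight zero, which is the intended effect: such a term does not help to tell documents apart.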
SVM Classification – Java

    import si.ijs.jtextgarden.*;

    public class SVM {
        public static void main(String[] args) {
            System.out.println("Loading JTextGardenLib...");
            JTextGardenLib tg = new JTextGardenLib();

            // Load a bag-of-words file and save its statistics
            System.out.println("Loading bow...");
            int BowDocBsId = tg.LoadBowDocBs("./data/topic50k.bow");
            tg.SaveBowDocBsStat(BowDocBsId, "./res/BowDocBsStat.txt");

            // Train a binary classifier for the "ECAT" category
            System.out.println("Training linear SVM binary classifier...");
            int ECatId = tg.GetBowDocBs_CId(BowDocBsId, "ECAT");
            int BinSVMBowMdId = tg.GetBinSVMBowMd(BowDocBsId, ECatId);

            // Classify two previously unseen documents
            System.out.println("Testing model...");
            String Doc1 = "There are several basic and incomplete questions that "
                + "must be answered in order to resolve the problems of economics "
                + "satisfactorily. The scarcity problem, for example, requires answers "
                + "to basic questions, such as: what to produce, how to produce it, "
                + "and who gets what is produced. An economic system is a way of "
                + "answering these basic questions. Different economic systems answer "
                + "them differently.";
            double CfyRes1 = tg.GetBowMdCfyFromHtml(BinSVMBowMdId, BowDocBsId, Doc1);
            System.out.println("CfyRes1 = " + CfyRes1);

            String Doc2 = "Oscar Fingal O'Flahertie Wills Wilde (October 16, "
                + "1854 -- November 30, 1900) was an Irish playwright, novelist, poet, short "
                + "story writer and Freemason. One of the most successful playwrights of late "
                + "Victorian London, and one of the greatest celebrities of his day, known "
                + "for his barbed and clever wit, he suffered a dramatic downfall and was "
                + "imprisoned after being convicted in a famous trial for gross indecency "
                + "(homosexual acts).";
            double CfyRes2 = tg.GetBowMdCfyFromHtml(BinSVMBowMdId, BowDocBsId, Doc2);
            System.out.println("CfyRes2 = " + CfyRes2);
        }
    }
Google Web&News Querying – Java

    import si.ijs.jtextgarden.*;

    public class Google {
        public static void main(String[] args) {
            System.out.println("Loading JTextGardenLib...");
            JTextGardenLib tg = new JTextGardenLib();

            // Web search: print the titles of the first 10 hits
            System.out.println("Web Search...");
            int WebRSetId = tg.GoogleWebSearch("slovenia", 50);
            System.out.println("Number of hits: " + tg.GetRSet_Hits(WebRSetId));
            for (int HitN = 0; HitN < 10; HitN++) {
                System.out.println("Hit " + (HitN+1) + ": " +
                    tg.GetRSet_HitTitleStr(WebRSetId, HitN));
            }

            // News search: print the URLs of the first 10 hits
            System.out.println("News Search...");
            int NewsRSetId = tg.GoogleNewsSearch("slovenia basketball");
            System.out.println("Number of hits: " + tg.GetRSet_Hits(NewsRSetId));
            for (int HitN = 0; HitN < 10; HitN++) {
                System.out.println("Hit " + (HitN+1) + ": " +
                    tg.GetRSet_HitUrlStr(NewsRSetId, HitN));
            }
        }
    }
Active Learning – Java

    import si.ijs.jtextgarden.*;
    import java.io.*;

    public class ActiveLearning {
        public static void main(String[] args) throws IOException {
            System.out.println("Loading JTextGardenLib...");
            JTextGardenLib tg = new JTextGardenLib();

            System.out.println("Loading bow...");
            int BowDocBsId = tg.LoadBowDocBs("./data/CiaWFB.partly.bow");
            String CatNm = "Europe";
            int CatId = tg.GetBowDocBs_CId(BowDocBsId, CatNm);
            int BowALId = tg.NewBowAL(BowDocBsId, CatId);

            // Ask the user to label one queried document per iteration
            System.out.println("Starting Active Learning loop...");
            BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
            while (true) {
                if (!tg.GetBowAL_QueryDIdV(BowALId)) { break; }
                double QueryMnDist = tg.GetBowAL_QueryDist(BowALId, 0);
                int QueryDId = tg.GetBowAL_QueryDId(BowALId, 0);
                String DocNm = tg.GetBowDocBs_DocNm(BowDocBsId, QueryDId);
                System.out.println("Does the following document belong to the '" +
                    CatNm + "' category?");
                System.out.println(DocNm + " [" + QueryMnDist + "]");
                System.out.println("1-yes, 2-no, 3-stop");
                int UserResponse = Integer.parseInt(br.readLine());
                if (UserResponse == 3) { break; }
                tg.MarkBowAL_QueryDId(BowALId, QueryDId, (UserResponse==1));
            }

            System.out.println("Finishing Active Learning...");
            System.out.println("Mark the rest of the documents? (1 - yes, 2 - no)");
            int UserResponse = Integer.parseInt(br.readLine());
            if (UserResponse == 1) {
                // Label the remaining documents and list those in the category
                tg.MarkBowAL_UnlabeledPosDocs(BowALId);
                int Docs = tg.GetBowDocBs_Docs(BowDocBsId);
                for (int DId = 0; DId < Docs; DId++) {
                    String DocNm = tg.GetBowDocBs_DocNm(BowDocBsId, DId);
                    int DocCIds = tg.GetBowDocBs_DocCIds(BowDocBsId, DId);
                    for (int DocCIdN = 0; DocCIdN < DocCIds; DocCIdN++) {
                        // The source is cut off inside this call; the DId argument is
                        // restored by analogy with GetBowDocBs_DocCIds above.
                        int DocCId = tg.GetBowDocBs_DocCId(BowDocBsId, DId, DocCIdN);
                        String DocCatNm = tg.GetBowDocBs_CatNm(BowDocBsId, DocCId);
                        if (DocCatNm.equals(CatNm)) {
                            System.out.println(DocNm + " [" + CatNm + "]");
                        }
                    }
                }
            }
        }
    }
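The functionality overview lists uncertainty sampling as the semi-supervised method, and the listing above reports a distance for each queried document. The core selection step can be sketched as follows: among the unlabeled documents, ask about the one whose classifier score is closest to the decision boundary. This is a generic Java sketch under assumed names (`UncertaintySamplingSketch`, `pickQuery`); how Text-Garden chooses its queries internally is not shown in the slides.

```java
// Illustrative only: pick the unlabeled document with the smallest absolute score.
final class UncertaintySamplingSketch {
    // scores[i] is the signed classifier score of document i (0 means "on the boundary").
    // Returns the index of the next document to ask about, or -1 if all are labeled.
    static int pickQuery(double[] scores, boolean[] labeled) {
        int best = -1;
        for (int i = 0; i < scores.length; i++) {
            if (labeled[i]) {
                continue;
            }
            if (best < 0 || Math.abs(scores[i]) < Math.abs(scores[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {2.0, -0.1, 0.5, 0.05};
        boolean[] labeled = {false, false, false, true};
        System.out.println("Next query: document " + pickQuery(scores, labeled));
    }
}
```

In a full loop the classifier is retrained after each answer and the scores are recomputed, so that every question targets the document the current model is least sure about.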
Future developments
  - A text-mining textbook is being prepared around Text-Garden
      - …Text-Garden will serve as the software covering most of the topics in the book
  - Text-Garden is being extended with other sets of functionality
      - …Graph-Garden – dealing with graphs and networks (collaboration with CMU)
      - …Media-Garden – dealing with images and videos
      - …Semantic-Garden – dealing with Semantic-Web issues
      - …Stream-Garden – dealing with streams of data
Availability
  - Text-Garden is released under the LGPL license
  - It is available from www.textmining.net
      - …a new complete release will be uploaded soon
…some images from Text-Garden
[Screenshots: Document-Atlas, Content-Land, SEKTbar, Semantic-Graphs, Contexter, Onto-Genesis]