Using N-gram and Word Network Features for Native Language Identification

- Slides: 1
Shibamouli Lahiri, Rada Mihalcea
Computer Science and Engineering, University of North Texas, Denton, TX 76207, USA
shibamoulilahiri@my.unt.edu, rada@cs.unt.edu

Native Language Identification

- Native speakers of L1 speaking L2.
- Classify L2 samples according to L1.
- Related to author profiling and authorship attribution.

Dataset

The TOEFL11 corpus contains 12,100 English essays (L2), of which 9,900 are for training, 1,100 are for test, and 1,100 are for validation. There are 11 L1s (Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu and Turkish).

A Word Network

[Figure: word network over the words "the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog"]

N-gram Features

- Word, POS, and character n-grams (n = 1, 2, 3).
- Explored variations including and excluding punctuation and spaces.
- Most frequent 100, 200, 500, 1000 n-grams on the train + development set.

Word Network Features

- Degree, coreness, neighborhood size and clustering coefficient of words.
- Explored variations including directed and undirected versions of the above.
- Most frequent 100, 200, 500, 1000 words.

Results on Training Set (10-fold Cross-validation Accuracy (%))

| N-gram Feature | Top 100 | Top 200 | Top 500 | Top 1000 |
|---|---|---|---|---|
| Word unigram | 45.07 | 52.85 | 60.14 | 62.46 |
| Word bigram | 39.54 | 44.75 | 51.70 | 56.06 |
| Word trigram | 30.62 | 35.26 | 41.56 | 44.97 |

| Word Network Feature | Top 100 Words | Top 200 Words | Top 500 Words | Top 1000 Words |
|---|---|---|---|---|
| Clustering coefficient | 15.31 | 17.73 | 19.96 | 20.71 |
| Degree | 41.05 | 50.74 | 58.17 | 60.21 |
| Coreness | 35.32 | 45.84 | 53.54 | 57.18 |
| Neighborhood size (order 1) | 41.83 | 50.68 | 57.40 | 60.41 |

Information Gain Ranking of Word Network Features on Training Set

| Rank | Word Network Feature | Information Gain |
|---|---|---|
| 1 | Degree of “a” | 0.1058 |
| 2 | Neighborhood size of “a” | 0.1054 |
| 3 | Out-neighborhood size of “a” | 0.1050 |
| 4 | Out-degree of “a” | 0.1049 |
| 5 | In-neighborhood size of “a” | 0.1017 |

Submitted Systems

| System | 10-fold CV Accuracy on Training Set (%) | Accuracy on Test Set (%) | Description |
|---|---|---|---|
| UNT-closed-1.csv | 64.50 | 63.20 | Raw frequency of all words in the training set including stop words. Naïve Bayes classifier. |
| UNT-closed-2.csv | 65.10 | 63.70 | Raw frequency of all words in the training set except stop words. Naïve Bayes classifier. |
| UNT-closed-3.csv | 62.46 | 64.50 | Raw frequency of 1000 most frequent words in the training+dev set including punctuation. SVM (SMO) classifier. |
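
The n-gram features described above (raw frequencies of the most frequent word n-grams, n = 1, 2, 3) can be illustrated with a minimal sketch. The function names `word_ngrams`, `top_k_vocabulary` and `frequency_vector` are hypothetical, essays are assumed to be already tokenized, and nothing here is taken from the authors' code. POS and character n-grams would presumably follow the same pattern over a different token sequence.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_k_vocabulary(docs, n, k):
    """The k most frequent n-grams over a collection of tokenized essays."""
    counts = Counter()
    for tokens in docs:
        counts.update(word_ngrams(tokens, n))
    return [gram for gram, _ in counts.most_common(k)]


def frequency_vector(tokens, vocab, n):
    """Raw frequency of each vocabulary n-gram in one essay."""
    counts = Counter(word_ngrams(tokens, n))
    return [counts[gram] for gram in vocab]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    essays = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog"]]
    vocab = top_k_vocabulary(essays, n=1, k=2)
    print(vocab, frequency_vector(["the", "the", "cat"], vocab, n=1))
```

The resulting vectors could then be passed to any classifier; the poster reports Naïve Bayes and SVM (SMO).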
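
The word network features named above (degree, coreness, neighborhood size, clustering coefficient, in directed and undirected variants) can be sketched in plain Python. Several choices below are assumptions rather than the authors' definitions: an edge links each word to the word that immediately follows it, degree is taken as in-degree plus out-degree, order-1 neighborhood size is the number of distinct adjacent words, and clustering coefficient and coreness are computed on the undirected version of the graph. All function names are hypothetical.

```python
from collections import defaultdict


def build_word_network(tokens):
    """Return (nodes, successors, predecessors) for a directed word network.

    Assumption: an edge u -> v exists when word v immediately follows word u.
    """
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for u, v in zip(tokens, tokens[1:]):
        if u != v:  # ignore self-loops from immediately repeated words
            successors[u].add(v)
            predecessors[v].add(u)
    return set(tokens), successors, predecessors


def clustering_coefficient(neighbors, word):
    """Fraction of pairs of a word's neighbors that are themselves linked."""
    ns = sorted(neighbors[word])
    k = len(ns)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if ns[j] in neighbors[ns[i]])
    return 2.0 * links / (k * (k - 1))


def coreness(neighbors):
    """k-core number of every node, found by peeling the lowest-degree node."""
    degree = {w: len(ns) for w, ns in neighbors.items()}
    remaining = set(neighbors)
    core, k = {}, 0
    while remaining:
        w = min(remaining, key=degree.get)
        k = max(k, degree[w])
        core[w] = k
        remaining.remove(w)
        for n in neighbors[w]:
            if n in remaining:
                degree[n] -= 1
    return core


def word_network_features(tokens):
    """Per-word feature dictionary for one tokenized text."""
    nodes, succ, pred = build_word_network(tokens)
    neighbors = {w: succ[w] | pred[w] for w in nodes}  # undirected view
    core = coreness(neighbors)
    return {
        w: {
            "in_degree": len(pred[w]),
            "out_degree": len(succ[w]),
            "degree": len(pred[w]) + len(succ[w]),
            "neighborhood_size": len(neighbors[w]),
            "clustering": clustering_coefficient(neighbors, w),
            "coreness": core[w],
        }
        for w in nodes
    }
```

Calling `word_network_features("the quick brown fox jumped over the lazy dog".split())` gives one feature dictionary per word. Restricting the output to the most frequent words would then yield a fixed-length vector per essay.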
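
The information gain ranking above scores how much knowing a feature reduces uncertainty about the L1 label. A minimal sketch of that quantity follows, assuming the feature has already been discretized into bins and that entropy is measured in bits (log base 2). The poster does not state how numeric features were binned or which log base was used, so both are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) minus the expected entropy of labels within each feature bin."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A feature whose bins separate the classes perfectly reaches the full label entropy, while a feature that is independent of the label scores zero.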