
Adjective + Noun to Specific Types in the Knowledge Base
Wang Yuan (王远), 2018/11/22

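Pipeline overview
Wikipedia Pages (Adj + Noun) -> dbo:wikiPageID -> DBpedia Entity -> SDType -> DBpedia Ontology, YAGO Ontology -> generate candidate Types
• Resource size: Adj + Noun: 1,836,620; Adj: 75,867; Noun: 70,140
• Test data set:
  1. 49 <adj, noun> pairs were extracted from QALD 1-7.
  2. Each <adj, noun> pair was annotated with one gold-standard Type, based on the gold-standard SPARQL queries provided by QALD.
• Test results:
  Top-100   Top-50   Top-20   Top-10   Average
  40        35       21       4        28
  (The Top-100 figure is consistent with the analysis below: 49 - (3 + 2 + 4) = 40.)
• Result analysis:
  1. In 3 <adj, noun> pairs the noun is a compound noun that is not in WordNet.
     Examples: Grunge#;#record label, Australian#;#metalcore band
  2. 2 <adj, noun> pairs are not in the resource (they may have been filtered out, or Wikipedia may not contain this adj + noun).
     Example: Swedish#;#holiday
  3. For 4 <adj, noun> pairs the gold-standard Type is not in the Top-100.
     Example: anti-apartheid#;#activist ( ?uri rdf:type text:"anti-apartheid activist" . )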
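One plausible way to compute Top-k figures like those in the test results is to count, for each k, the pairs whose gold-standard Type appears among the first k ranked candidates. The slides do not give the evaluation code, so the following is only a minimal sketch; the function name `hits_at_k` and the placeholder type names `T1` to `T4` are assumptions made for illustration.

```python
def hits_at_k(rankings, gold, k):
    """Count pairs whose gold Type is among the first k ranked candidates.

    rankings: dict mapping an (adj, noun) pair to its ranked candidate Types.
    gold:     dict mapping an (adj, noun) pair to its single gold Type.
    """
    return sum(
        1
        for pair, candidates in rankings.items()
        if gold.get(pair) in candidates[:k]
    )


# Hypothetical toy data: T1..T4 are placeholder type names, not real classes.
rankings = {
    ("english", "engineer"): ["T1", "T2", "T3"],
    ("swedish", "holiday"): [],              # pair missing from the resource
    ("american", "president"): ["T2", "T4"],
}
gold = {
    ("english", "engineer"): "T1",
    ("swedish", "holiday"): "T3",
    ("american", "president"): "T4",
}

for k in (1, 2):
    print(f"Top-{k}: {hits_at_k(rankings, gold, k)}")
```

A pair with an empty candidate list can never be a hit, which is how pairs missing from the resource would show up in such a count.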

• The Type's Local Name contains the Adj + Noun
  • DBpedia: 60/414
  • YAGO: 135,219/369,144

  adj + noun                count   adj         count   noun         count
  musical + group           1292    american    1753    people       4494
  political + party         880     musical     1388    descent      3011
  military + unit           627     military    1230    group        1578
  gaelic + footballer       482     political   1198    school       1176
  religious + building      459     british     1052    party        1131
  archaeological + site     431     defunct     1039    football     793
  american + people         340     canadian    942     unit         704
  american + football       301     french      893     building     639
  human + rights            266     german      881     footballer   638
  ……                                ……

• Future work
  • Generate feature data for each candidate Type
  • Manually annotate 100~200 groups of data
  • Train a binary classification model for filtering and ranking
  • ……
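The Local Name containment check at the top of this slide could be sketched as follows. The slides do not say how the Local Name is tokenized, so splitting on CamelCase boundaries and ignoring the numeric suffix is an assumption, as are the helper names `local_name_tokens` and `contains_adj_noun`.

```python
import re


def local_name_tokens(local_name):
    """Split a CamelCase Local Name into lowercase word tokens (digits are dropped)."""
    return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+", local_name)]


def contains_adj_noun(local_name, adj, noun):
    """True if the adj is immediately followed by the noun inside the Local Name."""
    tokens = local_name_tokens(local_name)
    return any(
        tokens[i] == adj.lower() and tokens[i + 1] == noun.lower()
        for i in range(len(tokens) - 1)
    )


print(contains_adj_noun("NuclearWeapon103834604", "nuclear", "weapon"))
print(contains_adj_noun("Engineer109615807", "English", "engineer"))
```

Because the match is on exact tokens, a plural Local Name such as PublicLibraries does not match "public library" unless the tokens are lemmatized first.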

• Thanks

Heiko Paulheim, Christian Bizer. Type Inference on Noisy RDF Data. ISWC 2013.
Motivation
In DBpedia, common reasons for missing type statements are:
-- Missing infoboxes. An article without an infobox is not assigned any type.
-- Too general infoboxes. If an article about an actor uses a person infobox instead of the more specific actor infobox, the instance is assigned the type dbpedia-owl:Person, but not dbpedia-owl:Actor.
-- Wrong infobox mappings. The videogame infobox is mapped to dbpedia-owl:VideoGame, not dbpedia-owl:Game, and dbpedia-owl:VideoGame is not a subclass of dbpedia-owl:Game in the DBpedia ontology.
-- Unclear semantics. dbpedia-owl:College, in British and US English, can denote private secondary schools, universities, or institutions within universities.
Reasoning seems the straightforward approach to completing missing types.
Standard RDFS reasoning via entailment rules:
-- ?x a ?t1 . ?t1 rdfs:subClassOf ?t2   entails   ?x a ?t2
-- ?x ?r ?y . ?r rdfs:domain ?t         entails   ?x a ?t
-- ?y ?r ?x . ?r rdfs:range ?t          entails   ?x a ?t

dbr:Germany
Types: country, award, city, sports team, mountain, stadium, record label, person, and military conflict ……
Examples of erroneous statements:
  dbpedia:Mze dbpedia-owl:sourceMountain dbpedia:Germany .
  dbpedia:XII_Corps_(United_Kingdom) dbpedia-owl:battle dbpedia:Germany .
SDType: an approach for inducing types which is tolerant with respect to erroneous and noisy data.
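The way a single noisy statement produces a wrong type under standard RDFS reasoning can be illustrated with a small sketch of the three entailment rules (subclass, domain, range). Triples are plain Python tuples here rather than a real RDF store. The two rdfs:range statements and the rdfs:subClassOf statement are assumptions added so that the example has something to reason over; they are not taken from the slides.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def rdfs_types(triples):
    """Return all (instance, type) pairs entailed by the three RDFS rules."""
    subclass = {(s, o) for s, p, o in triples if p == SUBCLASS}
    domain = {(s, o) for s, p, o in triples if p == DOMAIN}
    range_ = {(s, o) for s, p, o in triples if p == RANGE}
    types = {(s, o) for s, p, o in triples if p == RDF_TYPE}

    # Domain rule types the subject, range rule types the object.
    for s, p, o in triples:
        for prop, t in domain:
            if p == prop:
                types.add((s, t))
        for prop, t in range_:
            if p == prop:
                types.add((o, t))

    # Subclass rule, repeated until no new type statement is derived.
    changed = True
    while changed:
        changed = False
        for x, t1 in list(types):
            for sub, sup in subclass:
                if sub == t1 and (x, sup) not in types:
                    types.add((x, sup))
                    changed = True
    return types


triples = [
    # The two noisy statements from the slide.
    ("dbpedia:Mze", "dbpedia-owl:sourceMountain", "dbpedia:Germany"),
    ("dbpedia:XII_Corps_(United_Kingdom)", "dbpedia-owl:battle", "dbpedia:Germany"),
    # Assumed schema statements, for illustration only.
    ("dbpedia-owl:sourceMountain", RANGE, "dbpedia-owl:Mountain"),
    ("dbpedia-owl:battle", RANGE, "dbpedia-owl:MilitaryConflict"),
    ("dbpedia:Germany", RDF_TYPE, "dbpedia-owl:Country"),
    ("dbpedia-owl:Country", SUBCLASS, "dbpedia-owl:PopulatedPlace"),
]

germany_types = sorted(t for x, t in rdfs_types(triples) if x == "dbpedia:Germany")
print(germany_types)
```

Under these assumptions, each wrong link to dbpedia:Germany adds one wrong type, which is the behaviour SDType is meant to tolerate.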

Evaluation
• Random samples of 10,000 instances from DBpedia and OpenCyc.
• Only ingoing properties are used. In DBpedia, outgoing properties and types are generated in the same step, so the correct type can be trivially predicted from outgoing properties.
• The reason for that is that DBpedia, with its stronger focus on coverage than on correctness, contains more faulty statements. When more links are present, the influence of each individual statement is reduced, which allows for correcting errors.

Evaluation
• Of all 550,048 untyped resources in DBpedia, the classifier identifies 519,900 (94.5%) as typeable.
• Types were generated for those resources and evaluated manually on a sample of 100 random resources.

Estimating Type Completeness in DBpedia
• DBpedia types are at most 63.7% complete, with at least 2.7 million missing type statements (while YAGO types, which can be assessed accordingly, are at most 53.3% complete).

Generating candidate Types
Adj + Noun -> Wikipedia Pages -> dbo:wikiPageID -> DBpedia Entity -> SDType, rdf:type -> DBpedia Ontology, YAGO Ontology
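The flow on this slide can be read as: find the Wikipedia pages that mention the adj + noun, follow the page ID to the DBpedia entity, and collect that entity's types as candidates. A minimal sketch under that reading follows. The in-memory dictionaries, the page IDs, the page texts, and the entity names are all hypothetical, and counting how often each type is reached is an assumption about how candidates might be ordered.

```python
from collections import Counter


def candidate_types(phrase, page_text, page_entity, entity_types):
    """Collect types of entities whose Wikipedia page mentions the phrase.

    page_text:    page ID -> page text
    page_entity:  page ID -> DBpedia entity (the dbo:wikiPageID link, inverted)
    entity_types: entity -> list of types (from rdf:type or a type-inference step)
    Returns (type, count) pairs, most frequent first.
    """
    counts = Counter()
    for page_id, text in page_text.items():
        if phrase.lower() in text.lower():
            entity = page_entity.get(page_id)
            for type_name in entity_types.get(entity, ()):
                counts[type_name] += 1
    return counts.most_common()


# Hypothetical toy data for illustration only.
page_text = {
    101: "He was an English engineer known for his bridges.",
    102: "An English engineer and businessman.",
    103: "A town in the north of the country.",
}
page_entity = {101: "dbr:Person_A", 102: "dbr:Person_B", 103: "dbr:Town_C"}
entity_types = {
    "dbr:Person_A": ["yago:Engineer109615807", "yago:Colleague109935990"],
    "dbr:Person_B": ["yago:Engineer109615807", "yago:Businessperson109882716"],
    "dbr:Town_C": ["dbo:Town"],
}

print(candidate_types("English engineer", page_text, page_entity, entity_types))
```

Plain substring matching is used to keep the sketch short; a real pipeline would more likely match on the extracted adj + noun pairs.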

• Thanks

Extracting adj + noun from Wikipedia text
• Wikipedia
  • Version: 2018/10
  • Size: 15.9 G
• Extract Adj + NP (where the NP contains only nouns)
  • Filter out adjectives that are numeric ordinals
  • Filter out cases where the adj or noun contains special characters
  • Filter out comparatives and superlatives
  • Filter out adj + nouns (more than one noun)
  • Filter out cases where the adj + noun is an entity or the noun is a proper noun
• Counts reported across the filtering steps
  • adj + noun: 13,309,280; adj: 834,028; noun: 2,436,532
  • adj + noun: 12,819,292; adj: 831,202; noun: 2,361,237
  • adj + noun: 8,095,419; adj: 747,043; noun: 396,040
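The filtering rules above could be sketched over already POS-tagged tokens as follows. The sketch assumes Penn Treebank tags (JJ, JJR, JJS, NN, NNS, NNP, NNPS), takes the tagged tokens as given instead of calling a tagger, and leaves out the entity filter, which needs a knowledge base lookup. The function name `extract_adj_noun` and both regular expressions are illustrative choices, not the author's rules.

```python
import re

ORDINAL = re.compile(r"^\d+(st|nd|rd|th)$", re.IGNORECASE)   # 1st, 22nd, 21st ...
CLEAN = re.compile(r"^[A-Za-z]+(-[A-Za-z]+)*$")              # letters and inner hyphens only
COMMON_NOUN_TAGS = {"NN", "NNS"}
ALL_NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_adj_noun(tagged):
    """Return (adj, noun) pairs from a list of (token, POS tag) tuples."""
    pairs = []
    for i in range(len(tagged) - 1):
        (adj, adj_tag), (noun, noun_tag) = tagged[i], tagged[i + 1]
        if adj_tag != "JJ":                    # also drops JJR / JJS (comparative, superlative)
            continue
        if noun_tag not in COMMON_NOUN_TAGS:   # drops proper nouns
            continue
        if i + 2 < len(tagged) and tagged[i + 2][1] in ALL_NOUN_TAGS:
            continue                           # adj + nouns: more than one noun
        if ORDINAL.match(adj):                 # numeric ordinals
            continue
        if not (CLEAN.match(adj) and CLEAN.match(noun)):
            continue                           # special characters
        pairs.append((adj.lower(), noun.lower()))
    return pairs


tagged = [
    ("the", "DT"), ("English", "JJ"), ("engineer", "NN"), ("built", "VBD"),
    ("a", "DT"), ("larger", "JJR"), ("bridge", "NN"), ("in", "IN"),
    ("the", "DT"), ("21st", "JJ"), ("century", "NN"), ("for", "IN"),
    ("a", "DT"), ("public", "JJ"), ("record", "NN"), ("label", "NN"),
]
print(extract_adj_noun(tagged))
```

Only numeric ordinals such as "21st" are dropped; word ordinals such as "first" pass through.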

adj + noun statistics
• adj + noun: 8,095,419
• [Figure: frequency distribution of adj + noun pairs]

• Statistics of the adjectives in adj + noun
  • adj: 747,043
  • Adjectives shown: first, other, new, second, same, several, many, own

• Statistics of the adjectives in adj + noun
  • adj: 747,043
  • Overlap between the adjectives in adj + noun and the adjectives in WordNet:

               Adjs in WordNet   Overlap
    Total      21,557            16,880
    adj.all    17,777            13,785
    adj.pert   4,379             3,055
    adj.ppl    76                40

  • 4,677 WordNet adjectives do not occur among the adjectives in adj + noun.

• Statistics of the nouns in adj + noun
  • noun: 396,040
  • Nouns shown: time, years, season, life, school, people, group, state, style, approach, version, form, system

• Statistics of the nouns in adj + noun
  • noun: 396,040
  • Overlap between the nouns in adj + noun and the nouns in WordNet
    • Total: 119,188 nouns in WordNet; 35,629 in the overlap
    • WordNet categories: noun.person, noun.artifact, noun.communication, noun.attribute, noun.state, noun.cognition, noun.animal, noun.substance, noun.plant, noun.group, noun.food, noun.location, noun.body, noun.event, noun.object, noun.quantity, noun.process, noun.feeling, noun.possession, noun.time, noun.phenomenon, noun.shape, noun.relation, noun.Tops, noun.motive
    • Nouns in WordNet, per category: 18899, 16381, 9459, 8300, 4802, 5622, 4429, 14324, 4639, 17809, 3972, 3595, 4907, 3572, 1663, 2303, 2031, 1127, 773, 1520, 1689, 986, 540, 679, 83, 78
    • Overlap, per category: 6703, 6655, 5274, 3882, 3255, 2726, 2465, 2351, 1949, 1614, 1381, 1237, 1194, 1132, 1064, 867, 844, 665, 610, 563, 532, 416, 357, 312, 63, 41

WordNet attribute relation
• Synset (320), part_of_speech: noun -- attribute -- Synset 1 (620), part_of_speech: adjective

                        WordNet   Overlap
  Adjs + [attribute]    698       656
  Nouns + [attribute]   606       502

Thanks

Background
• Adj + Noun is also an important part of question understanding. For example, most KBQA question answering systems (e.g., gAnswer) map "adjective + noun" to "special classes".
• The general method for Adj + Noun -> special classes
  • Compute the lexical similarity between the "adj + noun" and the class name
  • Example: nuclear weapon -> yago:NuclearWeapons, yago:NuclearWeapon103834604
  • Common lexical similarity methods:
    • Edit distance
    • Word2Vec
    • SimHash
    • Jaro distance
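Of the similarity methods listed, edit distance is the simplest to show. The sketch below is a standard Levenshtein distance with a length-normalized similarity; the normalization and the function names are illustrative assumptions, and the class names are compared as plain lowercase phrases rather than as full YAGO identifiers.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def lexical_similarity(phrase, class_label):
    """Similarity in [0, 1]: 1 minus the edit distance over the longer length."""
    a, b = phrase.lower(), class_label.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


for phrase in ("nuclear weapon", "atomic weapon"):
    print(phrase, round(lexical_similarity(phrase, "nuclear weapons"), 3))
```

The second phrase scores lower even though it names the same concept, since edit distance only sees characters.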

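Motivation
• Problems with the general method
  • When the surface form of the "adj + noun" differs substantially from the class name, the mapping fails.
    • Example: "atomic weapon" cannot be mapped accurately to yago:NuclearWeapons
  • Relying on lexical similarity alone leads to wrong mappings.
    • Example: public library
      yago:PublicLibraries: 6 entities
      yago:PublicLibrary107978170: 262 entities
  • Context information in the question is hard to exploit.
    • Which Greek goddesses dwelt on Mount Olympus?
    • Which European countries have a constitutional monarchy?
    • Give me all American presidents in the last 20 years.
    • Give me all chemical elements.
• Context information of classes in the knowledge base
  • Information between classes
  • Information between entities and classes
• Use Wikipedia to link adj + noun to entities/classes in the knowledge base
  • Online retrieval + statistical learning
  • Offline construction of a resource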

Extracting adj + noun from Wikipedia text
• Text corpus: 4,641,892 Wikipedia articles
• Tool: Stanford NLP POS
• Filtering rules:
  1. Filter out cases where the adj is an ordinal
  2. Filter out adj + specific nouns
  3. Filter out adjectives in comparative or superlative form
  4. Filter out cases where the adj + noun is an entity
  5. Filter out cases where the adj or noun contains special characters
  6. Filter out low-frequency adj + noun
  7. The adj + noun case was not considered
• adjs: 26,693
• adj + noun: 288,452
• On average, each adj modifies 10.8 nouns
• Number of adjectives by number of nouns modified (chart values; the counts sum to 26,693):

  Nouns modified   Adjectives
  1                15382
  2                3503
  3                1609
  4                961
  5                669
  6                506
  7                373
  8                320
  9                229
  10-99            2626
  100-999          491
  >=1000           24

Goals
• Obtain the nouns that an adj can modify
• Map Adj + Noun to classes in the knowledge base (DBpedia)
• [Diagram: the adjective English linked to the nouns engineer and city; the nouns linked to classes (yago:Engineer109615807, Class 2, Class 3, Class 4, Class 5, Class 6)]

1. Extracting Adj + noun
• Corpus source: Wikipedia

2. Generating candidate classes for Adj + noun
Example: English engineer
• Type:
  yago:CausalAgent100007347
  yago:Colleague109935990
  yago:ComputerScientist109951070
  yago:ComputerUser109951274
  yago:Contestant109613191
  yago:Engineer109615807
  yago:MilitaryOfficer110317007
  ……
• Type:
  yago:Businessperson109882716
  yago:Capitalist109609232
  yago:CausalAgent100007347
  yago:CivilAuthority110541833
  yago:Donor110025730
  yago:Engineer109615807
  yago:Contestant109613191
  ……

3. Filtering and re-ranking the candidate classes
• Manual annotation + classification
• Candidate classes of <English, engineer> and their features
  Columns:
  (1) co-occurrence count of Adj + Noun and class
  (2) level of the candidate class in the ontology class hierarchy
  (3) number of classes at the candidate class's level
  (4) lexical similarity between noun and class
  (5) semantic similarity between noun and class
  (6) manual label

  Candidate class                    (1)   (2)   (3)   (4)    (5)    (6)
  yago:CausalAgent100007347          1     ……    ……    ……     ……     0
  yago:Colleague109935990            1     ……    ……    ……     ……     0
  yago:ComputerScientist109951070    1     ……    ……    ……     ……     0
  yago:ComputerUser109951274         1     ……    ……    ……     ……     0
  yago:Contestant109613191           1     ……    ……    ……     ……     0
  yago:Engineer109615807             2     3     10    0.81   0.92   1
  yago:MilitaryOfficer110317007      1     ……    ……    ……     ……     0
  yago:CausalAgent100007347          1     ……    ……    ……     ……     0
  yago:Colleague109935990            1     ……    ……    ……     ……     0
  yago:ComputerScientist109951070    1     ……    ……    ……     ……     0
  owl:Thing                          2     1     1     ……     ……     ……
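The slides only say that a binary classifier is trained on manually labeled candidates and then used to filter and re-rank them; they do not name the model. As one possible instance, the sketch below trains a small logistic regression by gradient descent in plain Python. Only the yago:Engineer109615807 values (2, 0.81, 0.92, label 1) and the co-occurrence count 1 with label 0 for the other rows come from the slide. The similarity values of the negative rows are hypothetical, and only three of the listed features are used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(model, features):
    """Probability that a candidate class is a correct mapping."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, features)) + bias)


def train(samples, labels, learning_rate=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for features, label in zip(samples, labels):
            error = predict((weights, bias), features) - label
            for j, value in enumerate(features):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


# Features: [co-occurrence count, lexical similarity, semantic similarity].
# Similarity values of the three negative rows are hypothetical.
candidates = {
    "yago:Engineer109615807": ([2, 0.81, 0.92], 1),
    "yago:CausalAgent100007347": ([1, 0.10, 0.20], 0),
    "yago:Colleague109935990": ([1, 0.15, 0.30], 0),
    "yago:Contestant109613191": ([1, 0.20, 0.25], 0),
}

model = train([f for f, _ in candidates.values()], [y for _, y in candidates.values()])
ranked = sorted(candidates, key=lambda c: predict(model, candidates[c][0]), reverse=True)
kept = [c for c in ranked if predict(model, candidates[c][0]) >= 0.5]
print(ranked[0], kept)
```

Sorting by predicted probability gives the re-ranking, and the 0.5 threshold gives the filtering; with the 100~200 labeled groups mentioned in the future-work slide, a held-out split would be needed to judge such a model.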

4. Expanding the resource
• WordNet and PPDB
• [Diagram: Adj and Adj 2 connected by antonymy; Noun 1 and Noun 2; classes Class 1 to Class 6]

• Thanks