English names for units of a million bytes and above: MB, GB, TB, PB, EB (binary counterparts: mebi, gibi, tebi, pebi, exbi)

Unit       Symbol  Value   Number name   Relation
megabyte   MB      10^6    million       1 MB = 1000 KB
gigabyte   GB      10^9    billion       1 GB = 1000 MB
terabyte   TB      10^12   trillion      1 TB = 1000 GB
petabyte   PB      10^15   quadrillion   1 PB = 1000 TB
exabyte    EB      10^18   quintillion   1 EB = 1000 PB

Larger number names: 10^21 sextillion, 10^24 septillion, 10^27 octillion, 10^30 nonillion, 10^33 decillion, 10^36 undecillion, 10^303 centillion.
Commonly used prefixes and their meanings

Prefix   Symbol(s)    Power of 10   Power of 2
kilo-    k or K **    10^3          2^10
mega-    M            10^6          2^20
giga-    G            10^9          2^30
tera-    T            10^12         2^40
peta-    P            10^15         2^50
exa-     E            10^18 *       2^60

* Not generally used to express data speed
** k = 10^3 and K = 2^10
Prefixes for binary quantities (new proposal)

Full technical name   Proposed prefix   Proposed symbol   Numeric multiplier
kilobinary            kibi-             Ki                2^10
megabinary            mebi-             Mi                2^20
gigabinary            gibi-             Gi                2^30
terabinary            tebi-             Ti                2^40
petabinary            pebi-             Pi                2^50
exabinary             exbi-             Ei                2^60
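The gap between the decimal and binary readings of a prefix can be checked with a few lines of Python. This is a minimal sketch: the dictionary and function names (`DECIMAL`, `BINARY`, `to_bytes`, `binary_excess_percent`) are illustrative and do not come from the slides.

```python
# Multipliers taken from the two prefix tables above.
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18}
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}


def to_bytes(value, prefix):
    """Convert a value with a decimal or binary prefix into a plain byte count."""
    table = {**DECIMAL, **BINARY}
    return value * table[prefix]


def binary_excess_percent(decimal_prefix):
    """How much larger (in percent) the binary prefix is than its decimal twin."""
    binary_prefix = decimal_prefix.upper() + "i"  # e.g. "k" -> "Ki", "G" -> "Gi"
    return (BINARY[binary_prefix] / DECIMAL[decimal_prefix] - 1) * 100


if __name__ == "__main__":
    for prefix in DECIMAL:
        print(prefix, round(binary_excess_percent(prefix), 1), "%")
```

The difference grows with each step up the scale, which is the motivation for giving the binary multiples their own names.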
The Web explosion
• 9.6 million web servers as of Dec 1999
• 72.4 million web sites as of Jan 2000
• 275 million people online as of Mar 2000
• 800 million publicly indexable pages
• 180 million images
• 30% of web pages are copied or mirrored
• 1 billion hyperlinks
Challenges of web information resources
• Enormous volume
  – No single search engine indexes more than 16% of web sites
  – All search engines combined cover only 42%
• Extreme heterogeneity
  – Variable information value
  – Variable length
  – Often containing grammatical mistakes and typos
  – Content may be outdated, false, or unreliable
  – Multiple data formats
  – Multiple languages and alphabets
• Speed
  – 15,000 to 20,000 search queries requested per minute
Internet usage
• Internet users: 30 to 300 million in 2001
• Internet traffic: doubles every 70 days
• E-commerce: US$1.3 trillion in 2002
• In 1997, PC unit sales exceeded TV unit sales
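To get a feel for what "doubles every 70 days" implies, the growth factor over any period can be computed as 2^(days / 70). The sketch below assumes steady exponential growth, which the slide does not state explicitly; the function name is illustrative.

```python
def growth_factor(days, doubling_period_days=70):
    """Multiplicative growth after `days`, assuming steady doubling every period."""
    return 2 ** (days / doubling_period_days)


if __name__ == "__main__":
    # One year at this rate is roughly a 37-fold increase in traffic.
    print(round(growth_factor(365), 1))
```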
Problems in information retrieval
• Language problems
  – Polysemy (one word, many meanings):
    • Bank: a river boundary or a savings and loan?
    • DNA: microbiology or Digital Equipment Corporation's Network Architecture?
    • Free rider: economic game theory or urban transportation systems?
  – Synonymy (one meaning, many words):
    • Blair example (p. 295): trap correction, wire warp, shunt correction system, roman circle method, air truck, ...
    • Car, automobile, vehicle, sedan, horseless carriage, ...
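Both language problems show up even in a toy exact-match search. The sketch below uses a made-up four-document collection and a hand-written synonym set (both hypothetical, for illustration only): a query for "car" misses the "automobile" document unless synonyms are supplied, while a query for "bank" returns both senses of the word.

```python
# Hypothetical mini collection; document texts are invented for illustration.
DOCS = {
    1: "used car for sale",
    2: "automobile repair manual",
    3: "river bank erosion",
    4: "bank savings account",
}

# Hypothetical synonym set, loosely following the slide's "car" example.
SYNONYMS = {"car": {"car", "automobile", "vehicle", "sedan"}}


def search(query_term, docs, synonyms=None):
    """Return ids of documents containing the term (or any supplied synonym)."""
    terms = synonyms.get(query_term, {query_term}) if synonyms else {query_term}
    return sorted(doc_id for doc_id, text in docs.items() if terms & set(text.split()))


if __name__ == "__main__":
    print(search("car", DOCS))            # synonymy: the automobile document is missed
    print(search("car", DOCS, SYNONYMS))  # found once synonyms are considered
    print(search("bank", DOCS))           # polysemy: both senses come back
```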
Search Engines
Subject trees (directory-style sites): small coverage, high-quality sites
• 150 editors
• 1.2 million web links
• 200 editors
• 1 million web links
• 700 subcategories
• Overseen by professional guides
• Provides Encyclopedia Britannica
• Provides articles from top magazines
• Contributed by the web community
• 16,000 editors, 14,000 subcategories
Search engines: databases of Internet content
• 340 million pages
• Fastest engine with parallel processing
• Offers 6,200 full-text journals, books, etc.
• Grouping of search results in categories
• 250 million pages
• Image search and language translation
• Uses PageRank algorithm
• Ranking based on popularity (links)
• Natural language processing technology
• More than 7 million FAQs
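Link-based popularity ranking can be illustrated with a simplified PageRank computed by power iteration. This is a textbook-style sketch, not any search engine's production method; the four-page graph, the damping value 0.85, and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its damped rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked to by A, B and D.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

if __name__ == "__main__":
    for page, score in sorted(pagerank(GRAPH).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

In this graph the page with the most inbound links ends up with the highest score, which is the "popularity" idea the slide refers to.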
Search engine size
Legend: GG = Google, FAST = FAST, AV = AltaVista, INK = Inktomi, WT = WebTop.com, NL = Northern Light, EX = Excite

Service      Searches per day
Google       100 million
AltaVista    50 million
Inktomi      80 million
Direct Hit   20 million
FAST         12 million
GoTo         5 million
Ask Jeeves   4 million
Spiders for Search Engines
• Where to explore next? Create a queue of pages to be explored
  – Depth-first: high load on servers
  – Breadth-first: favors smaller web servers
  – Best-first: based on popularity heuristic
• What information to keep?
  – Titles + headers vs. whole document
  – Manual description vs. automated abstracts
Crawl loop (diagram): choose a page → fetch page content, extract all links → add to queue; process page to extract information → database
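The three exploration orders differ only in how the next page is taken from the queue. The sketch below runs on a small in-memory link graph instead of fetching real pages; the graph, the function name `crawl`, and the choice of "number of known inbound links" as the popularity heuristic are all assumptions made for illustration.

```python
import heapq
from collections import deque

# Hypothetical link graph standing in for fetched pages: page -> outgoing links.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["D"],
}


def crawl(links, start, strategy="breadth"):
    """Return pages in the order a spider would visit them."""
    # Popularity heuristic (an assumption): count of known inbound links.
    inbound = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            inbound[target] = inbound.get(target, 0) + 1

    seen = {start}
    order = []
    if strategy == "best":
        frontier = [(-inbound.get(start, 0), start)]  # min-heap on negated score
    else:
        frontier = deque([start])

    while frontier:
        # Choose a page: the only step that differs between strategies.
        if strategy == "best":
            _, page = heapq.heappop(frontier)
        elif strategy == "depth":
            page = frontier.pop()        # newest first (stack)
        else:
            page = frontier.popleft()    # oldest first (queue)
        order.append(page)

        # "Fetch" the page, extract its links, and add unseen ones to the queue.
        for link in links.get(page, []):
            if link not in seen:
                seen.add(link)
                if strategy == "best":
                    heapq.heappush(frontier, (-inbound.get(link, 0), link))
                else:
                    frontier.append(link)
    return order


if __name__ == "__main__":
    for name in ("breadth", "depth", "best"):
        print(name, crawl(LINKS, "A", name))
```

A real spider would replace the dictionary lookup with an HTTP fetch and link extraction, and would add politeness delays to limit server load.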
Knowing what we don't know
• What we should know: what we know that we should know
• What we know: what we know that we do know
• What we do not know: what we know that we do not know
• What others know: what we know that others know
• What we don't know that we don't know
Not knowing what we don't know
"We struggle between 1% of what we know and 1% of what we don't know, but rarely come across the 98% of what we don't know that we don't know."
Knowing leads to ...
Transformational Librarianship
• Data          • Norm
• Information   • Form
• Knowledge     • Transform
• Behaviour     • Perform
Success
Acquire knowledge, create knowledge, refine knowledge, provide knowledge, store knowledge, manage knowledge
Source: Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson, 6th edition. Copyright 2001, Prentice Hall, Upper Saddle River, NJ
Knowledge organization and access (2)
XML-based markup format (diagram labels: article, title, abstract, section)
• Document Type Definition (DTD): a user-defined set of rules governing an individual markup language created using the principles of XML. A DTD describes the formal rules for the structure of a class of information chunks (documents).
• Element: a component of a document (a contiguous chunk of useful information in an XML document marked by a start-tag and end-tag).

    <article>
      <title>Knowledge management technology</title>
      <description>Applications of information technology in knowledge management</description>
      <section id="1">Types of technology</section>
      <p pid="1">……</p>
      <p pid="2">……</p>
      <section id="2">Types of structure</section>
      <p pid="3">……</p>
      <p pid="4">……</p>
    </article>
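The idea of an element (start-tag, content, end-tag) can be made concrete by parsing the sample article with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the paragraph texts are placeholders, and the helper name `describe` is illustrative.

```python
import xml.etree.ElementTree as ET

# The slide's sample article, with placeholder paragraph text.
DOC = """<article>
  <title>Knowledge management technology</title>
  <description>Applications of information technology in knowledge management</description>
  <section id="1">Types of technology</section>
  <p pid="1">...</p>
  <p pid="2">...</p>
  <section id="2">Types of structure</section>
  <p pid="3">...</p>
  <p pid="4">...</p>
</article>"""


def describe(element):
    """List each child element as (tag name, attributes, text content)."""
    return [
        (child.tag, dict(child.attrib), (child.text or "").strip())
        for child in element
    ]


root = ET.fromstring(DOC)

if __name__ == "__main__":
    for tag, attributes, text in describe(root):
        print(tag, attributes, text)
```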
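Knowledge organization and access (3)
• Conversion and compatibility between databases and XML markup
  – Use the document type definition to generate the database structure
  – Use the database structure to generate the document type definition

Document type definition:
    <!ELEMENT article (title, description, section+)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT description (#PCDATA)>
    <!ELEMENT section (P+)>
    <!ATTLIST section id CDATA #REQUIRED>
    <!ELEMENT P (#PCDATA)>
    <!ATTLIST P id CDATA #REQUIRED>

Database after conversion (one-to-many, 1 : M):
    Table (1 side): Article ID | Title | Description
    Table (M side): Article ID | Section title | PID | Paragraph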
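One direction of the conversion, from a marked-up article into the two database tables, might be sketched as follows. It uses Python's standard `sqlite3` and `xml.etree.ElementTree` modules. The table and column names, the function `load_article`, and the sample paragraph texts are assumptions based on the table layout above; the input follows the layout of the sample article from the previous slide, where paragraphs follow their section heading.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Sample article in the layout of the previous slide; paragraph text is invented.
XML_DOC = """<article>
  <title>Knowledge management technology</title>
  <description>Applications of IT in knowledge management</description>
  <section id="1">Types of technology</section>
  <p pid="1">First paragraph</p>
  <p pid="2">Second paragraph</p>
  <section id="2">Types of structure</section>
  <p pid="3">Third paragraph</p>
  <p pid="4">Fourth paragraph</p>
</article>"""


def load_article(conn, xml_text, article_id=1):
    """Store one article as a row in `article` plus many rows in `paragraph`."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "CREATE TABLE article ("
        "article_id INTEGER PRIMARY KEY, title TEXT, description TEXT)"
    )
    conn.execute(
        "CREATE TABLE paragraph ("
        "article_id INTEGER REFERENCES article(article_id), "
        "section_title TEXT, pid INTEGER, paragraph TEXT)"
    )
    conn.execute(
        "INSERT INTO article VALUES (?, ?, ?)",
        (article_id, root.findtext("title"), root.findtext("description")),
    )
    section_title = None
    for child in root:
        if child.tag == "section":
            # Remember the current section so later paragraphs can refer to it.
            section_title = (child.text or "").strip()
        elif child.tag == "p":
            conn.execute(
                "INSERT INTO paragraph VALUES (?, ?, ?, ?)",
                (article_id, section_title, int(child.get("pid")), (child.text or "").strip()),
            )
    return conn


if __name__ == "__main__":
    connection = load_article(sqlite3.connect(":memory:"), XML_DOC)
    for row in connection.execute("SELECT * FROM paragraph ORDER BY pid"):
        print(row)
```

The reverse direction would read the table definitions and emit one element declaration per table and column.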
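Chief Knowledge Officer (CKO)
§ Maximize the organization's knowledge assets
§ Design and implement knowledge management strategies
§ Exchange knowledge assets effectively
§ Promote system use
Source: Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson, 6th edition. Copyright 2001, Prentice Hall, Upper Saddle River, NJ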
Staffing for knowledge management
• Chief knowledge officer
  – senior executive, builds knowledge culture, creates infrastructure
• Knowledge project managers
  – temporary roles, lead developments and embed into processes
• Knowledge management specialists
  – permanent group, various backgrounds, variety of roles
• Front-line knowledge workers
  – staff at all levels, producing and using knowledge in their work
Galaxy of News
Current issues with news information:
• The current information infrastructure simply can't handle the exploding scale of news information and its cross-correlation.
• There is a need for an intelligent system that automatically builds the correlations and relationships between news articles.
Rennison '94
Potential application: Internet search engine
Themescapes, Cartia PNL
• Mountain height = cluster size
Map.net
• http://maps.map.net/start
Information extraction
Diagram components: Intranet, Web, query processing, ontology, IE, database
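Information extraction architecture
EXAMPLE: rund 60 bis 70 Prozent der Steigerungsrate (about 60 to 70 percent increase)
Pipeline (diagram): ASCII documents → processing stages → XML output interface → files
• Tokenizer
  – rund: lowercase
  – 60: two-digit-integer
• Syntactic analysis
  – Steigerungsrate: steigerung+[s]+rate
  – bis: prep|adv
• POS filtering
  – bis: adv
• Named entities
  – rund 60 bis 70 Prozent: percentage-NP
• Phrase recognition
  – rund 60 bis 70 Prozent: NP
  – der Steigerungsrate: NP
• Sentence boundary detection
Shared resources: linguistic knowledge base, text chart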
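The first stage of such a pipeline, a tokenizer that assigns each token a surface class, could look roughly like this. Only the class names "lowercase" and "two-digit-integer" come from the slide; the other classes, the regular expressions, and the function names are hypothetical, and this is not the architecture's actual implementation.

```python
import re

# Token classes are tried in order; the first full match wins.
TOKEN_CLASSES = [
    ("two-digit-integer", re.compile(r"\d{2}")),
    ("integer", re.compile(r"\d+")),                      # hypothetical class
    ("lowercase", re.compile(r"[a-zäöüß]+")),
    ("capitalized", re.compile(r"[A-ZÄÖÜ][a-zäöüß]+")),   # hypothetical class
]


def classify(token):
    """Return the name of the first token class that matches the whole token."""
    for name, pattern in TOKEN_CLASSES:
        if pattern.fullmatch(token):
            return name
    return "other"


def tokenize(text):
    """Split on whitespace and pair each token with its class."""
    return [(token, classify(token)) for token in text.split()]


if __name__ == "__main__":
    for token, token_class in tokenize("rund 60 bis 70 Prozent der Steigerungsrate"):
        print(f"{token}: {token_class}")
```

Later stages (POS filtering, named entities, phrase recognition) would consume these token classes rather than the raw text.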