Medical Information Retrieval (Information Retrieval)
2003. 9. 24.
Choi Jin-wook (최진욱)

What is Information Retrieval?
• Finding the information you want
• Data retrieval vs. information retrieval

What is IR?
• Information Retrieval is a science that deals with the knowledge representation, storage, and organization of information items, and with access to them.

NLM and Medline
• 10 million articles
• 3,500 journals since 1966
• PubMed, Internet Grateful Med
  – http://www.nlm.nih.gov

Technical fields related to IR
• Internet
• Search Engine
• Vocabulary System
• Information Modeling
• Filtering and Classification
• Natural Language Processing
• ……

Internet
• a worldwide network that uses TCP/IP
• public (not free, but open to everyone)
• carrier of electronic mail
• convenient for getting free software
• terabytes of information
• dynamic rerouting

Telephone network

Another network: usage fees?

Network in early stage
• Department of Defense ARPAnet
• used the TCP/IP communication protocol

TCP/IP
• Protocol
  – rules of behavior
  – Korea: get ready to stop
  – Germany: get ready to go
• TCP/IP
  – 2 widely used network protocols: about 100 rules for connecting to a computer network

Internet Address
32 bits = network address (8 to 24 bits) + host address

Internet Classes
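The address layout above (32 bits, with 8 to 24 bits of network address) can be illustrated with a minimal sketch. It assumes the slide refers to the classic class A/B/C scheme, where the leading bits of the first octet select an 8-, 16-, or 24-bit network part. The function name `classful_split` is hypothetical and not from the slides.

```python
import ipaddress

def classful_split(addr: str):
    """Return (class, network bits, network part, host part) for a class A/B/C address."""
    value = int(ipaddress.IPv4Address(addr))  # the 32-bit address as an integer
    first_octet = value >> 24
    if first_octet < 128:      # leading bit 0
        cls, net_bits = "A", 8
    elif first_octet < 192:    # leading bits 10
        cls, net_bits = "B", 16
    elif first_octet < 224:    # leading bits 110
        cls, net_bits = "C", 24
    else:
        raise ValueError("class D/E addresses have no network/host split")
    host_bits = 32 - net_bits
    network = value >> host_bits            # upper bits: network address
    host = value & ((1 << host_bits) - 1)   # lower bits: host address
    return cls, net_bits, network, host

if __name__ == "__main__":
    for example in ("10.1.2.3", "172.16.5.4", "192.168.1.7"):
        print(example, classful_split(example))
```

Modern networks use classless prefixes (CIDR) instead, so this sketch only mirrors the historical scheme.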

E-mail Address
jinchoi@snu.ac.kr

URL
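As a small illustration of what a URL contains, the sketch below splits one into its parts with Python's standard `urllib.parse.urlsplit`. The helper name `describe_url` is hypothetical; the example address is the NLM URL quoted earlier in these slides.

```python
from urllib.parse import urlsplit

def describe_url(url: str) -> dict:
    """Break a URL into scheme, host, port, path, query, and fragment."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # protocol, e.g. "http"
        "host": parts.hostname,      # server name
        "port": parts.port,          # None when the URL gives no explicit port
        "path": parts.path,          # resource location on the server
        "query": parts.query,        # text after "?"
        "fragment": parts.fragment,  # text after "#"
    }

if __name__ == "__main__":
    print(describe_url("http://www.nlm.nih.gov"))
```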

Hypertext

Finding information on the Internet
• Use a search engine
• Ask a newsgroup
• Use mailing lists

Internet Search Engine
• www.yahoo.com
  – www.yahoo.co.kr
• www.altavista.com
  – www.altavista.co.kr
• www.dreamwiz.com
• www.naver.com
• www.excite.com
• www.lycos.com
  – www.lycos.co.kr

Internet Search Engine

AND search
• Search for Monet AND Renoir
• Search for +Monet +Renoir
• Search for Monet Renoir
  – “All the words” option

OR search
• Search for UPS U.P.S.
• Search for UPS OR U.P.S.
• Search for UPS U.P.S.
  – “Any of the Words” option
• “foreign policy” vs. foreign policy

NOT search
• Search for “bugs life” -ants
• Search for “bugs life” NOT ants
• Search for “bugs life” AND NOT ants
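The AND, OR, and NOT searches above correspond to set intersection, union, and difference over the documents that contain each word. The following is a minimal sketch under that assumption; the document collection and all function names are hypothetical, and phrase search (the quoted form) is not covered.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search_and(index, *terms):
    """Documents containing all of the terms."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Documents containing any of the terms."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

def search_not(index, include, exclude):
    """Documents containing `include` but not `exclude`."""
    return index.get(include.lower(), set()) - index.get(exclude.lower(), set())

# Hypothetical toy collection, not from the slides.
DOCS = {
    1: "Monet and Renoir in Paris",
    2: "Monet water lilies",
    3: "Renoir portraits",
}

if __name__ == "__main__":
    idx = build_index(DOCS)
    print(search_and(idx, "Monet", "Renoir"))
    print(search_or(idx, "Monet", "Renoir"))
    print(search_not(idx, "Monet", "Renoir"))
```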

Near Search
• Korea NEAR climate
  – Altavista (advanced search)
  – two terms within 10 words
• Korea NEAR climate
  – Lycos (advanced search)
  – two terms within 25 words
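A proximity (NEAR) test can be sketched by comparing word positions. The exact distance rule each engine used is not given in the slides, so this sketch assumes "within N words" means the two word positions differ by at most N; the function name `near` is hypothetical.

```python
import re

def near(text, a, b, window=10):
    """True if words a and b occur at positions that differ by at most `window`."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

if __name__ == "__main__":
    sentence = "Korea has a temperate climate"
    print(near(sentence, "Korea", "climate", window=10))
    print(near(sentence, "Korea", "climate", window=3))
```

With this rule, a wider window (25 instead of 10) can only match more documents, never fewer.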

USENET
- USENET system: a wide-ranging bulletin board system
- news group: a discussion group set up on USENET
(Figure labels: news server in SNU, news server in Melbourne)

Newsgroup Search Engine

Mailing list
Example topics: computer & privacy; travel, weather; the White House; history of the Internet
Automatic mailing programs
• LISTSERV
• Majordomo

IR Modeling

IR steps
• Text processing
• Indexing
  – inverted file
  – signature file
• Organization in DB
• Query processing
• Evaluation

Information-Retrieval Process
(Diagram labels: Information Need, Content, Indexing, Database, Query Formulation, Retrieval, Query, Result, Evaluation, Refinement)

Indexing Process
(Diagram labels: document, accent, spacing, stop word, noun group, stemming, automatic or manual indexing, structure recognition, full text, index terms)
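Some of the stages named in the indexing diagram (accent removal, spacing, stop words, stemming) can be sketched as a small pipeline. This is only an illustration: the stop-word list and the suffix-stripping stemmer are hypothetical stand-ins, and noun grouping and structure recognition are left out.

```python
import re
import unicodedata

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to"}

def strip_accents(text):
    """Remove accents by dropping combining marks after Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(document):
    """document -> accents/spacing -> stop words -> stemming -> index terms."""
    text = strip_accents(document).lower()
    tokens = re.findall(r"[a-z0-9]+", text)  # also normalizes spacing and punctuation
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

if __name__ == "__main__":
    print(index_terms("The café is indexing documents"))
```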

Classification of IR
User Task
• Retrieval: Adhoc, Filtering
• Browsing
Classic Models: boolean, vector, probabilistic
Structured Models: non-overlapping lists, proximal nodes
Browsing: Flat, Structure Guided, Hypertext
Set Theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector, Latent Semantic Index, Neural Network
Probabilistic: Inference Network, Belief Network

Boolean Model
query can be written in disjunctive normal form
• q = ka ∧ (kb ∨ ¬kc)
• qdnf = (1, 1, 1) ∨ (1, 1, 0) ∨ (1, 0, 0)
• Ka, Kb, Kc
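The disjunctive normal form can be derived mechanically by listing every (ka, kb, kc) combination for which the query is true. The sketch below assumes the query reads ka AND (kb OR NOT kc), which is the reading consistent with the three components (1, 1, 1), (1, 1, 0), (1, 0, 0) on the slide. The function names are hypothetical.

```python
from itertools import product

def q(ka, kb, kc):
    """Assumed query: ka AND (kb OR NOT kc)."""
    return bool(ka and (kb or not kc))

def dnf_components(query, n_terms=3):
    """Return every 0/1 assignment of the index terms that satisfies the query."""
    return [bits for bits in product((1, 0), repeat=n_terms) if query(*bits)]

if __name__ == "__main__":
    print(dnf_components(q))
```

A document matches the query exactly when its (ka, kb, kc) pattern equals one of these components.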

Vector Model and Weight function
K = {Sonata, 2000 cc, automatic transmission, white, …, kt}
D1 = {20, 11, 5, …, 5}
D2 = {20, 18, 12, 4, …, 5}
D30 = {0, 20, 12, 3, …, 9}
(Figure labels: Hyundai car, Samsung car, weight)
Terms are assumed to be mutually independent!
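In the vector model, documents and queries are weight vectors over the same index terms and can be compared numerically. The slide does not name a similarity measure; the sketch below assumes the commonly used cosine of the angle between the two vectors. The vectors in the example are hypothetical and are not the elided D1, D2, D30 values.

```python
import math

def cosine(d, q):
    """Cosine similarity between two weight vectors of equal length."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

def rank(docs, query):
    """Return document names sorted by decreasing similarity to the query."""
    return sorted(docs, key=lambda name: cosine(docs[name], query), reverse=True)

if __name__ == "__main__":
    # Hypothetical 3-term weight vectors.
    docs = {"Da": [3, 0, 1], "Db": [0, 2, 2], "Dc": [1, 1, 0]}
    query = [1, 0, 0]
    print(rank(docs, query))
```

Treating each term as its own axis is what the independence assumption on the slide amounts to.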

Boolean vs. Vector model
Terms: Petroleum, Mexico, Oil, Texas, Refinery, Ship
Boolean: (1 1 1 0)
Vector: (2.8 1.6 3.5 3 3.1 1)

Retrieval Issues
• Indexing
  – inverted file
• Ranking
  – ranking by relevance
  – ranking by chronology
• Display
Example term groups: (aspirin, prevention); (prevention, …); (aspirin, attack, heart)

Item         Attribute (Doc. #)
Aspirin      1, 5, 6, 9
Attack       3, 6, 7, 8
Heart        4, 6, 7, 10
Prevention   1, 2, 6, 9
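With an inverted file, a multi-term query is answered by intersecting the sorted document lists of its terms. The sketch below uses the four posting lists from the slide's table, assuming each list belongs to the term in the same position (Aspirin, Attack, Heart, Prevention). The function names are hypothetical.

```python
# Posting lists as read from the slide's table (term -> sorted document numbers).
INDEX = {
    "aspirin": [1, 5, 6, 9],
    "attack": [3, 6, 7, 8],
    "heart": [4, 6, 7, 10],
    "prevention": [1, 2, 6, 9],
}

def intersect(p1, p2):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    """Documents containing every term; shortest lists are intersected first."""
    postings = sorted((index.get(t, []) for t in terms), key=len)
    if not postings:
        return []
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result

if __name__ == "__main__":
    print(and_query(INDEX, ["aspirin", "prevention"]))
    print(and_query(INDEX, ["aspirin", "attack", "heart"]))
```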

Indexing with Inverted File
Radiology Results
Patient No.: 27750177
Report No.: 20022777035
Exam code: RC 102
Exam name: Brain CT (Pre contrast)
Principal diagnosis: Infarction Of Posterior Cerebral Artery Territory
Exam date: 2002-08-02
Exam result: BRAIN MRI + MRA
[Finding] The multiple UBOs in the bilateral PVWM appear likely to reflect underlying SVD. Patchy high signal intensity is present in the left thalamus and occipital lobe, with the same finding on the FLAIR image. On T1WI, as a portion of iso signal intensity …

Inverted file (significant words only; column headers on the slide: WORD, D, P, W)
abort      1, 5
aberrant   3
brain      7, 3
…          …
portion    4, 8
topology   5, 8
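Building such an inverted file from report text can be sketched as follows: every significant word is mapped to the documents it appears in, here together with its word positions. The stop-word list, the two short example texts, and the function name are hypothetical, and how the slide's D, P, and W columns were actually filled is not known.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list used to keep "significant words only".
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "with"}

def build_inverted_file(docs):
    """Map each significant word to {document id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for pos, word in enumerate(words):
            if word in STOP_WORDS:
                continue  # positions still count stop words, but they are not indexed
            index[word].setdefault(doc_id, []).append(pos)
    # Return a plain dict with words in alphabetical order, as in the slide's table.
    return {word: index[word] for word in sorted(index)}

if __name__ == "__main__":
    # Hypothetical report fragments.
    reports = {
        1: "Patchy high signal intensity in the left thalamus",
        2: "High signal intensity on the FLAIR image",
    }
    for word, postings in build_inverted_file(reports).items():
        print(word, postings)
```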

Problems with word-based indexing
• Context
  – meaning is affected by the meaning of other words
  – high, blood, pressure
  – low pressure at high altitude increase red blood cell
• Polysemy
  – lead vs. lead
• Synonymy
  – hypertension vs. high blood pressure
• Granularity
  – antibiotics, penicillin
• Focus of Content
  – Key word vs. Plain word
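One common way to reduce the synonymy problem is to map synonymous phrases to a single preferred term before indexing. The sketch below is a minimal illustration with a hypothetical one-entry vocabulary; it does not address context, polysemy, or granularity.

```python
import re

# Hypothetical controlled vocabulary: phrase -> preferred term.
SYNONYMS = {"high blood pressure": "hypertension"}

def normalize(text, synonyms=SYNONYMS):
    """Replace each known phrase with its preferred term (longest phrases first)."""
    text = text.lower()
    for phrase in sorted(synonyms, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", synonyms[phrase], text)
    return text

if __name__ == "__main__":
    print(normalize("History of High Blood Pressure"))
    print(normalize("low pressure at high altitude"))
```

Because the whole phrase must match, the second sentence is left alone even though it contains the words "high" and "pressure".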

The End