Documents, Text Editors, Text Retrieval, and Web Pages Class 3 LBSC 690 Information Technology
Agenda • Questions • Unix Survival Guide • Document Creation (Word Processing and HTML) • Document Retrieval • Project Overview
Unix Survival Guide • WAM account • Directory structure (mkdir, cd, .., /) • How much space is used (du, ls -l) • Eliminating unneeded files (rm) • Managing mail (pine, attachments) • Moving files (mv, cp, ftp) • Editing files (pico, more) • Web anywhere (lynx)
Document Creation • Editors • Word Processors • Desktop Publishing • Structured Documents • HTML/SGML/XML
Editors (Text Editing vs. Word Processing) • Purpose – Create and modify ASCII text • Examples – pico, axe, and emacs on WAM • Advantages – Compatible with virtually everything (VT-100) • Disadvantages – Limited format control, sometimes no mouse
Word Processors • Purpose – Create documents intended for human readers • Examples – Microsoft Word and WordPerfect in OWL • Advantages – Good format control – WYSIWYG (“What You See Is What You Get”) • Disadvantages – No (universal) standard interchange format
Desktop Publishing • Purpose – Produce documents for wide (paper) distribution • Examples – Adobe PageMaker in the WAM labs • Advantages – Allows very detailed layout control • Disadvantages – Requires fairly extensive user expertise
Structured Documents • Purpose – Specify logical structure of the documents • Examples – email, HTML, LaTeX, SGML/XML • Advantages – Allows easy reformatting for different displays • Disadvantages – Hard to read unless “rendered” before viewing
Hyper-Text Markup Language (HTML) • Purpose – Structured document language for web pages • Advantages – Adapts easily to different display capabilities – Widely available rendering software (browsers) • Disadvantages – Direct control over layout is limited – The HTML “standard” is still evolving
First Steps in HTML • Find a web page you like • Select “Document Source” in “View” menu • Compare HTML code with rendered version – Observe how to achieve each effect • Select “Save As” in “File” menu • FTP the file to ~/../pub/ on WAM • Edit the file using pico • http://www.wam.umd.edu/~userid/filename
HTML Document Structure • Markup tags (open and close) bracket content: <tag> … </tag> • Title shows up in the Web browser’s frame • Headers show up in the page itself • For each link, specify the URL and link text: <a href="URL">link text</a> • Inline graphics can replace the link text: <img src="oard.jpg">
Designing Web Pages • Key design issues: – Content: What do you want to publish? – Style: How do you want to present it? – Syntax: How can you achieve that presentation? • Sources of information – Online tutorials (Yahoo points to lots of these) – Technical materials (e.g., the HTML 3.0 spec)
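The structure described on this slide can be checked mechanically. The following is a minimal sketch, assuming Python's standard html.parser module; the sample page, its URL, and the names SAMPLE_PAGE, StructureParser, and parse_structure are made up for illustration and are not part of the course materials.

```python
from html.parser import HTMLParser

# Hypothetical sample page, written only for this illustration.
SAMPLE_PAGE = """<html>
<head><title>My Home Page</title></head>
<body>
<h1>Welcome</h1>
<p>Visit <a href="http://www.wam.umd.edu/">the WAM site</a>.</p>
<img src="oard.jpg">
</body>
</html>"""

HEADER_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class StructureParser(HTMLParser):
    """Collects the title, headers, links, and inline images of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headers = []
        self.links = []       # (URL, link text) pairs
        self.images = []      # values of img src attributes
        self._current = None  # tag whose text content is being read
        self._href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            # img has no closing tag; only its attributes matter.
            self.images.append(attrs.get("src"))
        elif tag == "a":
            self._href = attrs.get("href")
            self._current = "a"
        elif tag == "title" or tag in HEADER_TAGS:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "a":
            self.links.append((self._href, data))
        elif self._current in HEADER_TAGS:
            self.headers.append(data)


def parse_structure(html):
    """Parse an HTML string and return the populated parser."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser


page = parse_structure(SAMPLE_PAGE)
print("Title:  ", page.title)
print("Headers:", page.headers)
print("Links:  ", page.links)
print("Images: ", page.images)
```

The title comes from the head of the page and the header from the body, which mirrors the slide's distinction between what appears in the browser's frame and what appears in the page itself.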
Style Guidelines • Design for generic browsers – And test on every version you wish to support • Provide appropriate access points – User needs and navigation strategies differ • Design useful navigational aids – A web search may lead to the middle of a site • Include some indication of currency – Date of last update, “new” icons, etc.
HTML Editors • Goal is to create web pages, not learn HTML! • Several are available – In Explorer, “Edit-Page” for FrontPage Express – In Netscape, “File-Edit Page” for Composer • You may still need to edit the HTML file – Some editors use browser-specific features – Some HTML features may be missing entirely – File names may be butchered by FTP
SGML/XML • Generalized Markup Languages – SGML - Standard Generalized Markup Language (for paper documents) – XML - eXtensible Markup Language (for Web documents) (see W3C) • These allow people to design – DTDs - Document Type Definitions • A document also needs: – DSSSL - Document Style Semantics and Specification Language
Document Retrieval • Making documents is often easier than finding them! • Hypertext vs. Cataloging vs. Searching – Yahoo vs. AltaVista • Lots of applications – Chasing down citations in papers you read – Web search engines – Managing your personal files • Two basic approaches to searching – Explicit queries (“information retrieval”) – “Watch what I do” (“adaptive filtering”)
Ways of Searching for Text • Controlled vocabulary – Manual indexing based on named concepts • Free text – Characterize documents by the words they contain • Social filtering – Exchange and interpret personal ratings
“Exact Match” Retrieval • Find all documents with some characteristic – Indexed as “Presidents -- United States” – Containing the words “Clinton” and “Peso” – Read by my boss • A set of documents is returned – Each is as likely to be useful as any other – Usually listed in date or alphabetical order
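Exact-match retrieval on words can be sketched as a set test. This is a minimal illustration, assuming whitespace tokenization and a toy collection; the documents, the name DOCS, and the function exact_match are hypothetical.

```python
# Hypothetical toy collection, made up for this illustration.
DOCS = {
    "doc1": "Clinton approves loan package to support the peso",
    "doc2": "Clinton visits Maryland",
    "doc3": "Peso falls against the dollar",
}


def exact_match(docs, required_words):
    """Return the IDs of documents containing every required word.

    The result is an unranked set, listed here in alphabetical order:
    each returned document is treated as equally likely to be useful.
    """
    required = {word.lower() for word in required_words}
    hits = [
        doc_id
        for doc_id, text in docs.items()
        if required <= set(text.lower().split())  # subset test
    ]
    return sorted(hits)


print(exact_match(DOCS, ["Clinton", "Peso"]))  # ['doc1']
print(exact_match(DOCS, ["peso"]))             # ['doc1', 'doc3']
```

A document either satisfies the condition or it does not, so nothing distinguishes a strong match from a weak one within the returned set.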
Ranked Retrieval • Put most useful documents near top of a list – Put possibly useful documents lower in the list • No need to exclude any documents – Just list those least likely to be useful last • Two basic techniques – Similarity-based – Probability-based
Similarity-Based Retrieval • Assume “most useful” = most similar to query • Lots of clues to meaning – Repeated words are good cues to meaning – Rarely used words make searches more selective • Easily combined – Compute a “weight” for each term – Add up the weights for query terms in a document
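The two steps on this slide (compute a weight for each term, then add up the weights for the query terms in a document) can be sketched as follows. The weight used here, term frequency times log(N / document frequency), is one common choice and an assumption of this sketch, not necessarily the formula used in the course. The collection and the names DOCS and rank are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy collection, made up for this illustration.
DOCS = {
    "d1": "peso peso crisis",
    "d2": "clinton speech",
    "d3": "clinton peso loan",
}


def rank(docs, query):
    """Rank every document by the summed weights of the query terms.

    Assumed weight: (occurrences in the document) * log(N / df),
    where df is the number of documents containing the term.
    Repeated words raise the score; rarely used words count for more.
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))

    scores = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)
        scores[doc_id] = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query.lower().split()
            if term in df  # terms absent from the collection add nothing
        )

    # Highest score first; ties broken by document ID. No document is excluded.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


for doc_id, score in rank(DOCS, "peso loan"):
    print(doc_id, round(score, 3))
```

For the query "peso loan", d3 ranks first because it contains the rare word "loan", d1 follows because it repeats "peso", and d2 is listed last with a score of zero rather than being dropped.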
Project Overview • Goal: Solve a practical problem – One which is fairly complex • You choose the technology – Make a set of web pages (a web “site”) – Make a database (optional for summer 690) – Do something else that is equally complex • Multimedia presentation, Java program, … • Suggest two-person groups
Web Projects • Have significant content! (see “What is a Book” web site under CLIS Dean’s Award) • Multiple access points – Taxonomy, search engine, map, etc. • Be creative (in a useful way)! For example: – Choose a novel application – Engage the user with an interactive approach – Adopt an innovative organization – Implement a creative layout
Database Projects (very ambitious for Summer 690) • Your focus should be on scalability – What if the IRS decided to use your database? • The user interface is important – Designed to be used without taking 690 first! • Include enough content to allow testing – But focus on organization, not on content • The same creativity issues as web projects
Project Timeline and Deliverables (summer 690) • Project specification (1-2 pages) • Should include User Manual (FAQ) and Test Plan components • Project demonstrations last week of class – Scheduled individually – All team members (two or three) get the same grade