ProjectWise 101 – Chapter 9: Document Indexing
Gary Cochrane – Technical Director Geospatial Sales – North America

Introduction
• ProjectWise Document Indexing
– Really means three things
• Full Text Indexing, in support of full text searching
• Thumbnail Extraction
• Document Property Extraction
– We won’t cover this one in PW 101
– See Bentley Institute PW Admin course guide for this

Full Text Indexing
• We did not write the engine for this
– But elected to use the one Microsoft provides
• Included with every copy of Windows
– That engine is called the MS Indexing Service
• And it was installed in the VM as an optional Windows component
– Microsoft indexes the following file formats
• MS Word, Excel, PPT, HTML, XML, TXT

Pre-installed in VM
ProjectWise Integration Server
ProjectWise Orchestration Framework
MicroStation V8i-SS1
Supported Database Engine
Microsoft Message Queuing Service
Microsoft Indexing Service
Microsoft .NET Framework 2.0
Windows Server 2003 with SP2

Extending the MS Index Service
• Microsoft provides an SDK for third parties to extend the Indexing service
– So the Indexing service will know how to “filter” files from that vendor
• For instance, Adobe provides an “iFilter” that teaches the MS Index Service how to extract text from a PDF file
• The Adobe PDF iFilter is installed with Acrobat Reader V9.x

Indexing Overview
• Within PW, Indexing consists of:
– Scheduling
• A process that wakes up, checks for new (or modified) files, adds them to the Copy-out queue, and goes back to sleep
– Copy-out
• Copy the file from the Storage Area to the machine running the Indexing Service. Then add the file to the Extraction queue.
• Remember, files may be stored on multiple servers
• Also, in large installations, a machine may be dedicated to indexing

Indexing Overview – Part II
• Overview – continued
– Extraction
• This process gets the text from the file and adds it to the MS Index catalog. Then it adds the file to the Update queue
– Update
• This process sets the flag on the file (in the PW database) that says it is “done”
• New files are added with the flag set to “undone”
• Check-out/in causes the flag to be set to “undone”

A note on “done”
• Done does not necessarily mean it was successful
– It means the file has been processed
• In other words, what happens if an unknown file (e.g., an AutoCAD file) is sent to the Indexing Service?
– The file is attempted…
• And the indexing service says, “I don’t know how to extract text from this file”
– There would be no point in trying the file again
• So it is marked as “done”, even when unsuccessful
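
The four stages and the meaning of the “done” flag can be sketched as a small simulation. This is only an illustrative model, not ProjectWise code: the `Document` class, the `KNOWN_TYPES` set, and the `run_pass` and `check_in` functions are all hypothetical names, and the real processes run as separate queued services rather than as loops in one function.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical set of extensions the text extractor understands.
KNOWN_TYPES = {"doc", "xls", "ppt", "html", "xml", "txt", "pdf"}


@dataclass
class Document:
    name: str
    text: str = ""
    done: bool = False     # means "processed", not "successfully indexed"
    indexed: bool = False  # illustrative only: did any text reach the catalog?


def run_pass(documents, catalog):
    """One scheduler wake-up: queue the undone files, then drain each stage."""
    copy_out = deque(d for d in documents if not d.done)  # Scheduling
    extraction = deque()
    update = deque()
    while copy_out:                                       # Copy-out
        extraction.append(copy_out.popleft())
    while extraction:                                     # Extraction
        doc = extraction.popleft()
        ext = doc.name.rsplit(".", 1)[-1].lower()
        if ext in KNOWN_TYPES:
            catalog[doc.name] = doc.text
            doc.indexed = True
        update.append(doc)  # queued for Update whether or not text was found
    while update:                                         # Update
        update.popleft().done = True


def check_in(doc):
    """A check-out/check-in with changes sets the flag back to undone."""
    doc.done = False


docs = [Document("report.doc", "bridge detail"), Document("plan.dwg", "detail A")]
catalog = {}
run_pass(docs, catalog)
```

After the pass, both documents are flagged done, but only `report.doc` has text in the catalog. The unknown `plan.dwg` will not be retried until something resets its flag.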

MicroStation and AutoCAD
• ProjectWise provides a mechanism to index the text from these file types
– Instead of writing an iFilter, Bentley elected to:
• Copy-out the file
• Run MicroStation in the background, extract all the text, and write it to an XML file
• Send the XML file to the Indexing Engine
– Since MicroStation can parse DWG as well…
• Then this method saved us from having to write two iFilters

Summary
• So within ProjectWise, we index:
– Word, PPT, Excel, XML, HTML, TXT
– Adobe PDF
– DGN & DWG
• More good news
– iFilters can be found for many file formats
• Some free, and some for purchase

PW Orchestration Framework
• Remember when we installed this?
– PWOF is responsible for managing batch processes for ProjectWise
• This includes all those processes discussed on the previous slides
– For Full Text Indexing, that means
• Scheduler process, Copy-out process, Extraction process, Updater process, and the MicroStation instance running in the background

Lab 1a
• PW Orchestration Framework
– Start the Windows Task Manager
• Hint: Right-click on an empty part of the Taskbar
– Examine memory usage
• On the Performance tab
– Switch to Processes tab
• Sort by Mem Usage column (descending)
• Look for ustation.exe
• Look for DmsAfpEngine(s)
– Lots of memory consumed here…

Lab 1b
• Now open Services dialog
– Remember “gears” icon on Quick-Launch
• Locate PW Orchestration Framework service
– Select the PWOF service, and choose> Stop
• Watch memory usage in Task Manager
– For remainder of exercise, we need PWOF running
• So start it back up now
• Note PWOF is configured for automatic startup
– It will run each time machine is booted
– Close Services and Task Manager

Lab 2a
• Open PW Administrator
– Log in as> adminpw
– Drill down to:
• Document Processors> Full Text Indexing
– Right-click, choose> Properties

Lab 2b - Full Text Indexing
Accept default, unless Indexing is to be run on another machine
Turn on adminpw
Set to 60

Lab 2c - Full Text Indexing
Enable all times in the schedule
Set to 2

Lab 2d
• Switch to File Type Associations tab
– Press> Add
• In the Extension field, enter> DWG
• In the bottom field, enter> DGN
– So that DWG files are processed as if they were DGN
– Press> OK

Lab 2e

Lab 2f
• Still on the File Type Associations tab
– Again, press> Add
• In the Extension field, enter> itiff
• In the bottom, enable> Do not process these documents
– You can’t extract text from a raster, so this prevents wasted file transfers
– Press> OK
• Press OK again
– To close the Full Text Indexing Properties
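
The effect of the two association entries made so far (DWG processed as DGN, itiff skipped) can be pictured as a simple lookup. This is a rough sketch of the idea only. The `ASSOCIATIONS` table, the `DO_NOT_PROCESS` set, and the `resolve` function are hypothetical names, not part of ProjectWise.

```python
# Hypothetical tables mirroring the lab's File Type Associations entries.
ASSOCIATIONS = {"dwg": "dgn"}   # process DWG files as if they were DGN
DO_NOT_PROCESS = {"itiff"}      # rasters: skip, so no file transfer is wasted


def resolve(filename):
    """Return the extension to process the file as, or None to skip it."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in DO_NOT_PROCESS:
        return None
    # Extensions with no association are processed as themselves.
    return ASSOCIATIONS.get(ext, ext)
```

For example, `resolve("site.DWG")` gives `"dgn"`, while `resolve("scan.itiff")` gives `None`, meaning the file would never be copied out.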

Lab 2g
• Open Task Manager again
– Switch to Performance tab
• Within 2 minutes, you should see heavy CPU usage
• Memory usage will also go up
– Up to 60 documents will be indexed in the first pass
• If there are more than 60 documents to be done, the rest will be queued for the next pass
– 2 minutes from now
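
With the lab settings (up to 60 documents per pass, one pass every 2 minutes), a rough estimate of how long the indexer needs to catch up is simple arithmetic. The function names below are made up for illustration, and the estimate assumes each pass finishes before the next one starts and ignores processing time.

```python
import math


def passes_needed(total_docs, batch_size=60):
    """Number of scheduler passes required to attempt every document."""
    return math.ceil(total_docs / batch_size)


def minutes_until_last_pass(total_docs, batch_size=60, interval_minutes=2):
    """Upper bound on the wait until the final pass starts.

    The first pass starts within one interval; each later pass starts
    one interval after the previous one.
    """
    return passes_needed(total_docs, batch_size) * interval_minutes
```

For a datasource with 200 undone documents, that is 4 passes, with the last one starting within about 8 minutes.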

Analysis
• All documents will eventually be processed
– When done, the index will be ready for fast full text searches
• Once the indexer has caught up, future load will be lighter because only new or modified documents are processed

Lab 3a
• When done, close Task Manager, open PW Explorer
– Log in as user1
• From the main tool box, select> Find Documents
– Binocular icon
• Change to Full Text tab
– Enter Look For> detail
• Press OK to start search
– Then Close the Search dialog
• Your results should include: DGNs, DWGs, and PDFs

Lab 3b
• Browse to:
– User1/Document Indexing/MS-SHT
• These files were not indexed successfully because they have an unknown extension
• But they were attempted, and flagged as done
• Return to PW Administrator
– Select datasource name (pwdemo)
• Right-click, choose> Properties
• Change to Statistics tab
• Choose Refresh
• Review Full Text Statistics
– Close dialog

Lab 3c
• While still in PW Administrator
– Open Full Text Indexing Properties again
• Switch to the File Type Associations tab
– Press Add
• In the Extension field, enter> SHT
• In the bottom Extension field, enter> DGN
– So that SHT files will be processed as if they were DGN files
• Press OK to complete the Extension mapping
– Press OK again to close the Properties dialog

Lab 3d
• Once new file type has been added…
– Now a small problem
• These files were flagged as done, and the Indexer won’t try them again unless they are checked out/in
• And even that won’t work unless you actually make changes…
• PW compares files to the version on the server, and doesn’t transfer back if there are no changes

Lab 3e
• Rather than check them all out, and back in
– From PW Administrator
• Right-click Full Text Indexing
– Choose>
• Mark folder Documents for Reprocessing
– Browse “…” to
• User1/Document Indexing/MS-SHT
– Press OK
• Press OK again
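
Conceptually, marking a folder for reprocessing just sets the “done” flag back to “undone” for every document in that folder, so the next scheduler pass picks them up again. A minimal sketch of that idea follows, assuming the flags are held in a plain dictionary keyed by document path. The `mark_folder_for_reprocessing` function and the file names are hypothetical. In ProjectWise the flag lives in the PW database and is changed through PW Administrator.

```python
def mark_folder_for_reprocessing(flags, folder):
    """Reset the done flag for every document under folder.

    flags maps a document path to True (done) or False (undone).
    Returns the number of documents that were reset.
    """
    prefix = folder.rstrip("/") + "/"
    count = 0
    for path in flags:
        if path.startswith(prefix) and flags[path]:
            flags[path] = False
            count += 1
    return count


# Illustrative data only; the file names are made up.
flags = {
    "Document Indexing/MS-SHT/a.sht": True,
    "Document Indexing/MS-SHT/b.sht": True,
    "Document Indexing/MS-V8/c.dgn": True,
}
reset = mark_folder_for_reprocessing(flags, "Document Indexing/MS-SHT")
```

Only the two SHT documents are reset here. The DGN file in the other folder keeps its done flag and is not processed again.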

Analysis
• Within 2 minutes, these documents will be reprocessed
– If you run the search again (in a few minutes), you should also get SHT files in your results
– Re-visit Datasource statistics to see if the Full Text categories have changed

Summary
• Once the index is created,
– You can stop the PW Orchestration Framework service
• It is used to create the index, but not to search the index
– This will save memory and CPU cycles
• So in a demo, your machine will run faster
• BUT new (or modified) files will not be re-indexed
– Up until now, the PWOF was not being used at all
• Full Text Indexing is the first time we’ve needed PWOF, even though it has been running since installation

PW Thumbnails
• PW Thumbnails is not “indexing” in the proper sense, but it is similar in nature to Full Text
– PW Thumbnails extracts a thumbnail from the document, and stores a copy in the PW database
• This allows one to browse PW Explorer, and see thumbnails in the Preview Pane
– Not all file types support thumbnails
• Among those that do, some don’t do it per the industry standard

Thumbnails – Part II
• Important to remember
– ProjectWise does not create thumbnails
• It only extracts what might be in the file
– A good test is to check whether Windows Explorer displays a thumbnail for the file
• If it does, then PW should as well

Lab 4a
• Open Windows Explorer
– Browse to:
• C:\PW-101 Class Files\Document Indexing\MS-V8
– Change to Thumbnail display
• MicroStation V8 files have thumbnails

Lab 4b
• Browse through remaining Document Indexing folders
– Note which include thumbnails
– Additional notes
• PDF files take a long time because you are really looking at a small view of the whole file, not a thumbnail
• AutoCAD doesn’t adhere to the industry standard
– These files only display correctly because MicroStation is installed, and is responsible for displaying a thumbnail
– Autodesk may have fixed this in later versions?

Lab 5a
• Open PW Administrator
– Log in as> adminpw
– Drill down to:
• Document Processors> Thumbnail Extraction
– Right-click, choose> Properties
• Similar to Full Text Indexing
– But actually less involved

Lab 5b
Turn on adminpw
Set to 60

Lab 5c
Enable all times in the schedule
Set to 2

Lab 5d
• No changes required on the File Type Associations tab
– Press OK to complete the configuration and close the dialog
• Within a few minutes, thumbnails should show up in the preview pane

Analysis
• Thumbnails are extracted and stored in the PW database
– Because document storage may not be local
• Thus “touching” the document to see a thumbnail in real-time is not practical
– Thumbnail notes
• Requires less processing than full text
– MicroStation not running in this process
– Requires PWOF to extract, but not to display

Review
• Topics covered in this Chapter
– Full Text Indexing – Configuration
– Full Text Searches
– ProjectWise Orchestration Framework
– Thumbnail Extraction
– Microsoft Indexing Service
• And iFilters to extend default supported file types
• (I have a free Visio, and MSG iFilter from Microsoft)