Reading Microsoft Word XML files with SAS
August 25, 2005
Larry Hoyle, Policy Research Institute, University of Kansas
Revised 8/18/2005

3 scenarios
• Extracting text along with associated properties (styles and attributes)
• Extracting all data from tables
• Extracting coordinates of objects in drawings

XML - syntax
• Should begin with this prolog (the XML declaration)
• Tags are paired and case sensitive; there must be exactly 1 root tag
• Empty tags end with />
• Tags and their content are called an "element"
• Tags can be qualified by attributes
• Elements can be nested, and must start and end in the same parent

<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag> Some content </nestedTag>
  <nestedTag anAttribute="wha"> Other content </nestedTag>
</LarryRootTag>
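The syntax rules above can be checked mechanically with any XML parser. The following is a minimal sketch in Python (standard library only, not part of the original SAS workflow) that parses the sample document and shows that mismatched case and badly nested elements are rejected. The helper name `is_well_formed` and the two "bad" strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from the slide, with camel-case tag names.
SAMPLE = """<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag>Some content</nestedTag>
  <nestedTag anAttribute="wha">Other content</nestedTag>
</LarryRootTag>"""


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(SAMPLE)

# One (tag, attributes, text) tuple per child element of the root.
children = [(child.tag, child.attrib, child.text) for child in root]

# Tag names are case sensitive, so this end tag does not match its start tag.
bad_case = "<nestedTag>Some content</NestedTag>"

# Elements must end inside the parent they started in.
bad_nesting = "<a><b></a></b>"

print(root.tag, children)
print(is_well_formed(bad_case), is_well_formed(bad_nesting))
```

The empty tag has no text at all, while the attribute on the second nestedTag arrives as a name/value pair separate from the element's content.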

Word XML

Extracting text and properties
• SAS XML Engine
• Needs an XMLMAP file
• Can use XML Mapper to generate the XMLMAP
• The XMLMAP only needs to be generated once for each type of extract

Example Document I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.

XML - Example Document
Paragraph property: /w:wordDocument/w:body/wx:sect/w:pPr
Run property: /w:wordDocument/w:body/wx:sect/w:p/w:rPr

Rows
• The XMLMap has to describe a path that delineates rows
• In this case it’s each text element in a run (in a paragraph…)

<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>

Columns – the text
• The XMLMap has to describe a path that delineates each column
• The text itself is:

<COLUMN name="t">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>

Columns – the text element number
• A sequential number for the text element is:

<COLUMN name="tNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>

Columns – the paragraph number
• A sequential number for the paragraph is:

<COLUMN name="pNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
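To make the row path and the two ordinal counters concrete, here is a rough Python emulation (standard library only). It is a sketch of the idea, not of how the SAS XML engine is implemented: one output row per w:t element, a counter that goes up at each text element, and a counter that goes up at each paragraph. The tiny document and the function name `text_rows` are hypothetical; real Word 2003 XML files carry far more markup.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Hypothetical, heavily trimmed stand-in for a Word 2003 XML file.
DOC = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:p><w:r><w:t>What a pleasant experience.</w:t></w:r></w:p>
    <w:p>
      <w:r><w:t>I have nothing to say. </w:t></w:r>
      <w:r><w:t>I love you guys.</w:t></w:r>
    </w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def text_rows(xml_text):
    """Return one dict per w:t element, with paragraph and text counters."""
    root = ET.fromstring(xml_text)  # root is the w:wordDocument element
    rows = []
    p_num = 0  # plays the role of the paragraph counter column
    t_num = 0  # plays the role of the text element counter column
    for p in root.findall("./w:body/wx:sect/w:p", NS):
        p_num += 1  # incremented at the start of every paragraph
        for t in p.findall("./w:r/w:t", NS):
            t_num += 1  # incremented at the start of every text element
            rows.append({"pNum": p_num, "tNum": t_num, "t": t.text})
    return rows


for row in text_rows(DOC):
    print(row)
```

The second paragraph holds two runs, so it yields two rows that share a paragraph number but have different text element numbers.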

Columns – paragraph color

<COLUMN name="PColorVal" retain="YES">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:pPr/w:rPr/w:color/@val</PATH>

Columns – run color

<COLUMN name="RColorVal" retain="YES">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:rPr/w:color/@val</PATH>

Our dataset

Tables

All Tables Into One Dataset

Tables – Word XML

Tables – DataSet Rows

<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>

Tables – Table Number

<COLUMN name="tblNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl</INCREMENT-PATH>

Tables – Row Number

<COLUMN name="trNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr</INCREMENT-PATH>
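The same idea applied to tables can be sketched in Python (standard library only) as an illustration of what the row path and the table and row counters describe. The sample document, the `cell` helper, and the function name `table_rows` are hypothetical. The sketch assumes that the row counter, as written above with only an increment path, keeps counting across tables rather than restarting at 1 in each table.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}


def cell(text):
    """Build the markup for one table cell holding a single text run."""
    return "<w:tc><w:p><w:r><w:t>%s</w:t></w:r></w:p></w:tc>" % text


# Hypothetical document: a 2x2 table, an empty paragraph, then a 1x1 table.
DOC = (
    '<w:wordDocument'
    ' xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"'
    ' xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">'
    "<w:body><wx:sect>"
    "<w:tbl>"
    "<w:tr>" + cell("a1") + cell("b1") + "</w:tr>"
    "<w:tr>" + cell("a2") + cell("b2") + "</w:tr>"
    "</w:tbl>"
    "<w:p/>"
    "<w:tbl><w:tr>" + cell("c1") + "</w:tr></w:tbl>"
    "</wx:sect></w:body></w:wordDocument>"
)


def table_rows(xml_text):
    """Return one dict per text element found inside a top-level table."""
    root = ET.fromstring(xml_text)
    rows = []
    tbl_num = 0
    tr_num = 0  # assumption: never reset, so it runs on across tables
    for tbl in root.findall("./w:body/wx:sect/w:tbl", NS):
        tbl_num += 1
        for tr in tbl.findall("./w:tr", NS):
            tr_num += 1
            for t in tr.findall("./w:tc/w:p/w:r/w:t", NS):
                rows.append({"tblNum": tbl_num, "trNum": tr_num, "t": t.text})
    return rows


for row in table_rows(DOC):
    print(row)
```

All cells from all tables land in one list, and the two counters are what allow the original table and row to be recovered afterwards.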

We Could Add Properties if Needed

Nested tables

Nested Tables – Absolute Path for Rows

<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>

Nested Tables – Rootless Path for Rows

<TABLE-PATH syntax="XPath">w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
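The difference between the two paths can be illustrated with Python's ElementTree (standard library only). This is an analogy rather than a statement about the SAS engine's matching rules: an absolute path reaches only cells of top-level tables, while a path that may start at any depth (written `.//` in ElementTree) also reaches cells of a table nested inside a cell. The sample document and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Hypothetical document: an outer table whose second cell holds a nested table.
DOC = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>outer A</w:t></w:r></w:p></w:tc>
        <w:tc>
          <w:tbl>
            <w:tr>
              <w:tc><w:p><w:r><w:t>inner B</w:t></w:r></w:p></w:tc>
            </w:tr>
          </w:tbl>
          <w:p/>
        </w:tc>
      </w:tr>
    </w:tbl>
  </wx:sect></w:body>
</w:wordDocument>"""

CELL_TEXT = "w:tbl/w:tr/w:tc/w:p/w:r/w:t"


def absolute_texts(root):
    """Match only tables that sit directly under the section."""
    return [t.text for t in root.findall("./w:body/wx:sect/" + CELL_TEXT, NS)]


def rootless_texts(root):
    """Match the same tail of the path starting at any depth."""
    return [t.text for t in root.findall(".//" + CELL_TEXT, NS)]


root = ET.fromstring(DOC)
print(absolute_texts(root))
print(rootless_texts(root))
```

With the absolute path the nested table's text is missed, because its w:tbl element sits under a w:tc rather than directly under wx:sect.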

Drawing Objects
VML – Vector Markup Language
• Drawings in Word get stored as XML also
• We’ll just look at lines

VML – Vector Markup Language

Dataset – One Row for Each Line

<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>

Dataset – Column: From

<COLUMN name="from">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>

Dataset – Column: To

<COLUMN name="to">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>

Dataset – Column: StrokeColor

<COLUMN name="StrokeColor">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor</PATH>
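As an illustration of what these three columns collect, the following Python sketch (standard library only) walks the same path and reads the from, to, and strokecolor attributes of each v:line element. The sample drawing, its coordinate values, and the function name `line_rows` are made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
    "v": "urn:schemas-microsoft-com:vml",
}

# Hypothetical drawing: a group holding two lines, one without a stroke color.
DOC = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:v="urn:schemas-microsoft-com:vml">
  <w:body><wx:sect>
    <w:p><w:r><w:pict>
      <v:group>
        <v:line from="0,0" to="100,50" strokecolor="red"/>
        <v:line from="100,50" to="200,0"/>
      </v:group>
    </w:pict></w:r></w:p>
  </wx:sect></w:body>
</w:wordDocument>"""

LINE_PATH = "./w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line"


def line_rows(xml_text):
    """Return one dict per v:line element with its three attributes."""
    root = ET.fromstring(xml_text)
    return [
        {
            "from": line.get("from"),
            "to": line.get("to"),
            # get() returns None when the attribute is absent
            "strokecolor": line.get("strokecolor"),
        }
        for line in root.findall(LINE_PATH, NS)
    ]


for row in line_rows(DOC):
    print(row)
```

Each endpoint arrives as a single "x,y" string, which is why the coordinates still have to be split apart before plotting.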

The Dataset

Usage Example: Annotate dataset

if prxmatch(xyPattern, from) then do;
   function='move';
   x = input(PRXPOSN(xyPattern, 1, from), 10.);
   if prxmatch('/flip:y/', style) then
      y = -1 * input(PRXPOSN(xyPattern, 2, to), 10.);
   else
      y = -1 * input(PRXPOSN(xyPattern, 2, from), 10.);
   output;
end;
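The coordinate extraction in this step can be sketched outside SAS as well. The Python version below (standard library only) mirrors the logic of the fragment: take x from the from attribute, take y from the to attribute when the style contains flip:y and from the from attribute otherwise, and negate y. The slide does not show the definition of the xy pattern, so the regular expression here is an assumption (two comma-separated integers), and the function name `move_point` is hypothetical. Unlike the SAS fragment, the sketch runs a separate match on whichever string supplies y.

```python
import re

# Assumed pattern: two comma-separated integers such as "1800,2160".
# The pattern actually used in the presentation is not shown.
XY_PATTERN = re.compile(r"\s*(-?\d+)\s*,\s*(-?\d+)")


def move_point(from_attr, to_attr, style):
    """Return an annotate-style 'move' record, or None if nothing matches."""
    m_from = XY_PATTERN.search(from_attr)
    if m_from is None:
        return None
    x = float(m_from.group(1))

    # With flip:y in the style, y is read from the 'to' endpoint instead.
    y_source = to_attr if "flip:y" in style else from_attr
    m_y = XY_PATTERN.search(y_source)
    if m_y is None:
        return None

    # y is negated, as in the SAS fragment.
    y = -1 * float(m_y.group(2))
    return {"function": "move", "x": x, "y": y}


print(move_point("10,20", "30,40", "position:absolute"))
print(move_point("10,20", "30,40", "position:absolute;flip:y"))
```

Matching each string separately avoids relying on capture positions from one string when reading another.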

Plotted in SAS

Contact Information
Larry Hoyle
Policy Research Institute, University of Kansas
Larry.Hoyle@ku.edu
http://www.ku.edu/pri/ksdata/sashttp/sugi31