Parsing XML sequence

  • Slides: 13

Parsing XML sequence?
  • We have an i2xml filter (exercise); we want xml2i also
  • Don’t have to write an XML parser, Python provides one
  • Thus, algorithm:
      – Open file
      – Use Python parser to obtain the DOM tree
      – Traverse tree to extract sequence information, build Isequence objects
  • Ignoring whitespace nodes, we have to search a tree like this:
      SEQUENCEDATA
        SEQ (type)
          NAME
          ID
          DATA
        SEQ (type)
          NAME
          ID
          DATA
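The "ignoring whitespace nodes" remark can be made concrete with a small sketch. It assumes the standard-library `xml.dom.minidom` parser (the slides do not say which DOM module is used), and the helper names `element_children` and `outline` are made up for illustration. The DATA string is shortened.

```python
from xml.dom.minidom import parseString

# Small new-format document; the DATA content is shortened for the example.
XML = """<?xml version="1.0"?>
<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Aspergillus awamori</NAME>
    <ID>U03518</ID>
    <DATA>aacctgcgg</DATA>
  </SEQ>
</SEQUENCEDATA>"""


def element_children(node):
    """Return only the element children, skipping whitespace #text nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def outline(node, depth=0):
    """Return the element names of the subtree as indented lines."""
    lines = ["  " * depth + node.nodeName]
    for child in element_children(node):
        lines.extend(outline(child, depth + 1))
    return lines


doc = parseString(XML)
root = doc.documentElement

# childNodes also contains the whitespace between the tags as #text nodes.
print(len(root.childNodes), "child nodes,",
      len(element_children(root)), "of them elements")
print("\n".join(outline(root)))
```

Filtering on `nodeType` is one way to skip the whitespace; checking for `#text` nodes whose content is blank would work as well.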

xml2i.py (part 1)
We're still being systematic. Code callouts:
  • Usual name for parse method
  • Obtain a parse tree with the XML data for free
  • Convert this SEQ subtree to an Isequence object
      SEQUENCEDATA
        SEQ (type)
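The code on this slide is not reproduced in the text, so the following is only a sketch of what such a filter might look like, not the original xml2i.py. It assumes `xml.dom.minidom`; the `Isequence` class below is a hypothetical stand-in with assumed field names, and `child_text`, `seq_to_isequence` and `isequences_from_dom` are illustrative helper names.

```python
from xml.dom import minidom


class Isequence:
    """Hypothetical stand-in for the course's Isequence class (assumed fields)."""

    def __init__(self, seq_type, name, seq_id, sequence):
        self.seq_type = seq_type
        self.name = name
        self.seq_id = seq_id
        self.sequence = sequence


def child_text(seq_node, tag):
    """Return the text under the first <tag> child of seq_node, or ''."""
    for child in seq_node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.nodeName == tag:
            if child.firstChild is not None:
                return child.firstChild.data.strip()
    return ""


def seq_to_isequence(seq_node):
    """Convert one SEQ subtree to an Isequence object."""
    return Isequence(
        seq_node.getAttribute("type"),
        child_text(seq_node, "NAME"),
        child_text(seq_node, "ID"),
        child_text(seq_node, "DATA"),
    )


def isequences_from_dom(doc):
    """Walk the DOM tree and build one Isequence per SEQ element."""
    return [seq_to_isequence(n) for n in doc.getElementsByTagName("SEQ")]


def parse(filename):
    """Parse an XML sequence file; minidom.parse hands us the tree for free."""
    return isequences_from_dom(minidom.parse(filename))


# Demonstration on an in-memory string (DATA shortened for the example).
doc = minidom.parseString(
    '<SEQUENCEDATA><SEQ type="dna"><NAME>Aspergillus awamori</NAME>'
    "<ID>U03518</ID><DATA>aacctgcgg</DATA></SEQ></SEQUENCEDATA>"
)
seqs = isequences_from_dom(doc)
print(seqs[0].name, seqs[0].seq_id, seqs[0].seq_type)
```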

xml2i.py (part 2)
  • Way of getting to all attributes of a node
  • Way of getting to a specific named attribute
  • Recall: text kept in a #text node underneath
[Figure: SEQ (type) subtree with children NAME, ID and DATA, showing a #text node underneath]
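The three points above might look as follows with `xml.dom.minidom`. This is a sketch under the assumption that the slide's code uses the standard DOM interface; the variable names are illustrative.

```python
from xml.dom.minidom import parseString

doc = parseString(
    '<SEQ type="dna"><NAME>Aspergillus awamori</NAME><ID>U03518</ID></SEQ>'
)
seq = doc.documentElement

# All attributes of a node: a NamedNodeMap, here turned into a plain dict.
all_attrs = dict(seq.attributes.items())

# One specific named attribute.
seq_type = seq.getAttribute("type")

# The text of NAME is not stored in the NAME node itself but in a
# #text node underneath it.
name_node = seq.getElementsByTagName("NAME")[0]
text_node = name_node.firstChild

print(all_attrs, seq_type, text_node.nodeName, text_node.data)
```

Note that `getAttribute` returns an empty string when the attribute is missing, so a filter may want to check `hasAttribute` first.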

What if the XML sequence format changes?
  • Now the name of the finder of the sequence is stored as a new tag:
      SEQUENCEDATA
        SEQ (type)
          NAME
          FOUNDBY
          ID
          DATA
        SEQ (type)
          NAME
          FOUNDBY
          ID
          DATA

Robustness of XML format
  • Our xml2i filter still works because the DOM parser still works
      – Can’t extract the finder information: it ignores the FOUNDBY node
      – But: doesn’t crash! Still extracts other information
      – Easy to update filter to incorporate new info
      SEQ (type)
        NAME
        FOUNDBY
        ID
        DATA
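A small sketch of this robustness argument, assuming `xml.dom.minidom` and a filter that looks up nodes by tag name. The `extract` helper is hypothetical, the DATA string is shortened, and the FOUNDBY value is an illustrative placeholder.

```python
from xml.dom.minidom import parseString

# New-format record with the extra FOUNDBY tag.
NEW_XML = """<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Aspergillus awamori</NAME>
    <FOUNDBY>BiRC</FOUNDBY>
    <ID>U03518</ID>
    <DATA>aacctgcgg</DATA>
  </SEQ>
</SEQUENCEDATA>"""


def extract(seq_node, tags):
    """Collect the text of the child elements whose names are in tags."""
    record = {"type": seq_node.getAttribute("type")}
    for child in seq_node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.nodeName in tags:
            text = child.firstChild.data.strip() if child.firstChild else ""
            record[child.nodeName] = text
    return record


seq = parseString(NEW_XML).getElementsByTagName("SEQ")[0]

# The old filter knows nothing about FOUNDBY: it skips the node, no crash.
old_filter = extract(seq, ("NAME", "ID", "DATA"))

# Updating the filter is a one-word change.
new_filter = extract(seq, ("NAME", "ID", "DATA", "FOUNDBY"))

print(old_filter)
print(new_filter)
```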

Compare with extending Fasta format
Say that the Fasta format is modified so the finder appears in the second line after a >:
      >HSBGPG Human gene for bone gla protein (BGP)
      >BiRC
      CGAGACGGCGCGCGTCCCCTTCGGAGGCGCTCTATTACGCGCGATCGACCC..
Our Fasta parser would go wrong!
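To see how a Fasta parser could go wrong, here is a sketch of a typical naive parser. It is not the course's own parser, which is not shown in the slides; `parse_fasta` is an illustrative name, and it simply treats every line starting with > as the header of a new record.

```python
def parse_fasta(lines):
    """Naive Fasta parser: each '>' line starts a new record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append({"header": line[1:], "sequence": ""})
        elif records:
            records[-1]["sequence"] += line
    return records


# The modified format from the slide: finder on a second '>' line.
modified = [
    ">HSBGPG Human gene for bone gla protein (BGP)",
    ">BiRC",
    "CGAGACGGCGCGCGTCCCCTTCGGAGGCGCTCTATTACGCGCGATCGACCC",
]

records = parse_fasta(modified)
for record in records:
    print(repr(record["header"]), len(record["sequence"]))
```

With this parser the finder line is silently read as a second sequence header, so the real record ends up with no sequence data and a bogus record appears. Unlike the XML case, the extra information is misread rather than ignored.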

XML robust
  • So, the good thing about XML is that it is robust because of its well-defined structure
  • Widely used, i.e. this overall tag structure won’t change and other applications can read your XML data
  • Parser available in Python already:
      – Read XML into a DOM tree
      – DOM tree can be traversed but also manipulated (see next slide)

See all the methods and attributes of a DOM tree on pages 537 ff.
  • Possible to manipulate the DOM tree using these methods (add new nodes, remove nodes, set attributes, etc.)
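A short sketch of the three kinds of manipulation mentioned, using standard DOM methods as provided by `xml.dom.minidom`. The document content and the FOUNDBY value are illustrative only.

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<SEQUENCEDATA><SEQ><NAME>Aspergillus awamori</NAME>"
    "<ID>U03518</ID></SEQ></SEQUENCEDATA>"
)
seq = doc.getElementsByTagName("SEQ")[0]

# Set an attribute.
seq.setAttribute("type", "dna")

# Add a new node: create the element and its #text child, then attach it.
foundby = doc.createElement("FOUNDBY")
foundby.appendChild(doc.createTextNode("BiRC"))
seq.appendChild(foundby)

# Remove a node (ID is removed here purely to illustrate removeChild).
id_node = seq.getElementsByTagName("ID")[0]
seq.removeChild(id_node)

result = doc.documentElement.toxml()
print(result)
```

New nodes are always created through the document object (`createElement`, `createTextNode`) and only become part of the tree once they are appended to a parent.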

Convert old format XML sequence to new format
  • Old format: sequence type has its own tag
      SEQUENCEDATA
        TYPE
        SEQ
          NAME
          ID
          DATA
  • New format: sequence type is attribute of SEQ tag
      SEQUENCEDATA
        SEQ (type)
          NAME
          ID
          DATA

old_xml2i.py
  • Add new method to original xml2i.py and call it after parsing the XML file
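The method itself is not reproduced in the text. One way such a conversion step could be written, assuming `xml.dom.minidom` and assuming the old format keeps TYPE as a direct child of SEQUENCEDATA, is sketched below; the function name `convert_old_format` is hypothetical.

```python
from xml.dom.minidom import parseString


def convert_old_format(doc):
    """Rewrite an old-format DOM tree in place to the new format.

    The TYPE element is removed and its text becomes the 'type'
    attribute of every SEQ element. A tree without a TYPE element
    is returned unchanged.
    """
    root = doc.documentElement
    type_nodes = [
        c for c in root.childNodes
        if c.nodeType == c.ELEMENT_NODE and c.nodeName == "TYPE"
    ]
    if not type_nodes:
        return doc
    type_node = type_nodes[0]
    seq_type = type_node.firstChild.data.strip() if type_node.firstChild else ""
    root.removeChild(type_node)
    for seq in doc.getElementsByTagName("SEQ"):
        seq.setAttribute("type", seq_type)
    return doc


# Old-format example (DATA shortened for the example).
old_doc = parseString(
    "<SEQUENCEDATA><TYPE>dna</TYPE><SEQ><NAME>Aspergillus awamori</NAME>"
    "<ID>U03518</ID><DATA>aacctgcgg</DATA></SEQ></SEQUENCEDATA>"
)
new_doc = convert_old_format(old_doc)
print(new_doc.documentElement.toxml())
```

Once the tree has been converted, the unchanged new-format traversal can be applied to it, which is why the conversion is called right after parsing.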

old_xml2phylip.py
  • Import new module
  • Check that type information is saved in the Isequence (not used in phylip format)

Testing on old format XML sequence
U03518b.xml:
      <?xml version="1.0"?>
      <SEQUENCEDATA>
        <TYPE>dna</TYPE>
        <SEQ>
          <NAME>Aspergillus awamori</NAME>
          <ID>U03518</ID>
          <DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatc
          cgtgtctattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctgcc
          ccccgggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtctgagttgatt
          gaatgcaatcagttaaaactttcaacaatggatctcttggttccggc</DATA>
        </SEQ>
      </SEQUENCEDATA>
Running the filter:
      python old_xml2phylip.py U03518b.xml U03518b
      sequence is of type dna

Remark: book uses old version of DOM parser
  • XML examples in book won’t work (except the revised fig. 16.04)
  • Look in the presented example programs to see what you have to import
  • All the methods and attributes of a DOM tree on pages 537 ff. are the same