Parsing XML sequence
- Slides: 13
Parsing XML sequence?
• We have an i2xml filter (exercise); we want xml2i also
• Don't have to write an XML parser: Python provides one
• Thus, algorithm:
  – Open file
  – Use Python parser to obtain the DOM tree
  – Traverse tree to extract sequence information, build Isequence objects

Ignoring whitespace nodes, we have to search a tree like this:

SEQUENCEDATA
  SEQ (type)
    NAME
    ID
    DATA
  SEQ (type)
    NAME
    ID
    DATA
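The three steps of the algorithm can be sketched roughly as follows, using the `xml.dom.minidom` parser from the Python standard library. This is only a sketch under assumptions: `SequenceRecord` is a hypothetical stand-in for the course's Isequence class, the helper names and the example data are invented for illustration, and the XML is parsed from a string so the block is self-contained (for a real file, `minidom.parse(filename)` does the opening and parsing in one step).

```python
from dataclasses import dataclass
from xml.dom import minidom


@dataclass
class SequenceRecord:
    """Hypothetical stand-in for the course's Isequence class."""
    name: str
    id: str
    data: str
    type: str


def text_of(element):
    """Join the #text children of an element and trim surrounding whitespace."""
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def child_text(seq_node, tag):
    """Text of the first <tag> element below seq_node, or '' if it is missing."""
    matches = seq_node.getElementsByTagName(tag)
    return text_of(matches[0]) if matches else ""


def parse_sequences(xml_text):
    """Obtain the DOM tree, then turn every SEQ subtree into a record."""
    dom = minidom.parseString(xml_text)
    records = []
    for seq in dom.getElementsByTagName("SEQ"):
        records.append(SequenceRecord(
            name=child_text(seq, "NAME"),
            id=child_text(seq, "ID"),
            data=child_text(seq, "DATA"),
            type=seq.getAttribute("type"),
        ))
    return records


# Invented example data in the tree shape shown above.
EXAMPLE = """<?xml version="1.0"?>
<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Example one</NAME>
    <ID>X1</ID>
    <DATA>acgt</DATA>
  </SEQ>
  <SEQ type="protein">
    <NAME>Example two</NAME>
    <ID>X2</ID>
    <DATA>MKV</DATA>
  </SEQ>
</SEQUENCEDATA>"""

records = parse_sequences(EXAMPLE)
for record in records:
    print(record.id, record.type, record.data)
```

Looking elements up by tag name means the whitespace #text nodes between the tags never have to be handled explicitly.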
xml2i.py (part 1)
We're still being systematic:
• Usual name for parse method
• Obtain a parse tree with the XML data for free
• Convert this SEQ subtree to an Isequence object
[Figure: SEQUENCEDATA with a SEQ (type) subtree]
xml2i.py (part 2)
• Way of getting to all attributes of a node
• Way of getting to a specific named attribute
• Recall: text is kept in a #text node underneath
[Figure: SEQ (type) with children NAME, ID and DATA; the text sits in a #text node below the element]
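As an illustration of the three points above, here is a small sketch with `xml.dom.minidom`. It is not the xml2i.py code from the slides; the one-element document and the variable names are assumptions made for the example.

```python
from xml.dom import minidom

# A tiny one-SEQ document, invented for this example.
dom = minidom.parseString(
    '<SEQ type="dna"><NAME>Aspergillus awamori</NAME></SEQ>'
)
seq = dom.documentElement

# All attributes of a node: seq.attributes is a NamedNodeMap.
all_attributes = {name: value for name, value in seq.attributes.items()}

# One specific named attribute.
seq_type = seq.getAttribute("type")

# The text is not stored on NAME itself but in a #text child node.
name_node = seq.getElementsByTagName("NAME")[0]
text_node = name_node.firstChild
node_name = text_node.nodeName
name_text = text_node.data

print(all_attributes, seq_type, node_name, name_text)
```

Note that `getAttribute` returns an empty string when the attribute is absent, so a missing type does not raise an error.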
What if the XML sequence format changes?
• Now the name of the finder of the sequence is stored as a new tag:

SEQUENCEDATA
  SEQ (type)
    NAME
    FOUNDBY
    ID
    DATA
  SEQ (type)
    NAME
    FOUNDBY
    ID
    DATA
Robustness of XML format
• Our xml2i filter still works because the DOM parser still works
  – Can't extract the finder information: it ignores the FOUNDBY node
  – But: doesn't crash! Still extracts the other information
  – Easy to update filter to incorporate new info
[Figure: SEQ (type) with children NAME, FOUNDBY, ID and DATA]
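The behaviour described above can be illustrated with a short sketch. It assumes a filter that only picks out the tags it knows about; the `extract` helper and the example values are hypothetical, not the course's xml2i code.

```python
from xml.dom import minidom

# New-format example with the extra FOUNDBY tag (values invented).
NEW_FORMAT = """<SEQUENCEDATA>
  <SEQ type="dna">
    <NAME>Example</NAME>
    <FOUNDBY>Example finder</FOUNDBY>
    <ID>X1</ID>
    <DATA>acgt</DATA>
  </SEQ>
</SEQUENCEDATA>"""


def extract(seq_node, wanted_tags):
    """Collect the text of the wanted child elements; ignore everything else."""
    fields = {}
    for child in seq_node.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # skip whitespace #text nodes between the tags
        if child.tagName in wanted_tags:
            text = child.firstChild.data.strip() if child.firstChild else ""
            fields[child.tagName] = text
    return fields


seq = minidom.parseString(NEW_FORMAT).getElementsByTagName("SEQ")[0]

# The old filter knows nothing about FOUNDBY: it skips the node, no crash.
old_filter = extract(seq, {"NAME", "ID", "DATA"})

# Updating the filter amounts to adding one more tag name.
updated_filter = extract(seq, {"NAME", "ID", "DATA", "FOUNDBY"})

print(old_filter)
print(updated_filter)
```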
Compare with extending Fasta format
Say that the Fasta format is modified so the finder appears in the second line after a >:

>HSBGPG Human gene for bone gla protein (BGP)
>BiRC
CGAGACGGCGCGCGTCCCCTTCGGAGGCGCTCTATTACGCGCGATCGACCC..

Our Fasta parser would go wrong!
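To see one way this can go wrong, consider a deliberately naive Fasta reader that starts a new record at every line beginning with ">". This is a hypothetical parser written for the example, not the course's Fasta parser, which may fail differently.

```python
def parse_fasta(lines):
    """Naive reader: every '>' line starts a new record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append({"header": line[1:], "sequence": ""})
        elif records:
            records[-1]["sequence"] += line
    return records


# The modified format from the slide: finder on a second '>' line.
modified = [
    ">HSBGPG Human gene for bone gla protein (BGP)",
    ">BiRC",
    "CGAGACGGCGCGCGTCCCCTTCGGAGGCGCTCTATTACGCGCGATCGACCC",
]

records = parse_fasta(modified)
for record in records:
    print(repr(record["header"]), len(record["sequence"]))
```

With this reader the finder line is mistaken for a second sequence header: the real record ends up with an empty sequence, and the data is attached to a record named after the finder. Nothing in the flat text format marks the new line as a different kind of information.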
XML robust
• So, the good thing about XML is that it is robust because of its well-defined structure
• Widely used: the overall tag structure won't change, and other applications can read your XML data
• Parser available in Python already:
  – Read XML into a DOM tree
  – DOM tree can be traversed but also manipulated (see next slide)
See all the methods and attributes of a DOM tree on pages 537 ff.
Possible to manipulate the DOM tree using these methods (add new nodes, remove nodes, set attributes, etc.)
Convert old format XML sequence to new format

Old format: sequence type has its own tag

SEQUENCEDATA
  TYPE
  SEQ
    NAME
    ID
    DATA

New format: sequence type is attribute of SEQ tag

SEQUENCEDATA
  SEQ (type)
    NAME
    ID
    DATA
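A conversion like this can be done by manipulating the DOM tree directly. The sketch below is one possible version with `xml.dom.minidom`, not the method from old_xml2i.py; the function name is hypothetical, the DATA value is shortened, and it assumes the single TYPE tag applies to every SEQ in the file.

```python
from xml.dom import minidom

# Old-format example: the sequence type has its own TYPE tag.
OLD_FORMAT = """<SEQUENCEDATA>
  <TYPE>dna</TYPE>
  <SEQ>
    <NAME>Aspergillus awamori</NAME>
    <ID>U03518</ID>
    <DATA>aacctgcgg</DATA>
  </SEQ>
</SEQUENCEDATA>"""


def convert_old_to_new(dom):
    """Move the text of the TYPE tag into a 'type' attribute on each SEQ."""
    root = dom.documentElement
    type_nodes = [
        node for node in root.childNodes
        if node.nodeType == node.ELEMENT_NODE and node.tagName == "TYPE"
    ]
    if not type_nodes:
        return dom  # nothing to do: already in the new format
    type_node = type_nodes[0]
    seq_type = type_node.firstChild.data.strip() if type_node.firstChild else ""
    for seq in root.getElementsByTagName("SEQ"):
        seq.setAttribute("type", seq_type)  # set attribute
    root.removeChild(type_node)             # remove node
    type_node.unlink()                      # free the detached subtree
    return dom


dom = convert_old_to_new(minidom.parseString(OLD_FORMAT))
seq = dom.getElementsByTagName("SEQ")[0]
print(seq.getAttribute("type"))
print(len(dom.getElementsByTagName("TYPE")))
```

After the conversion, a filter written for the new format can read the tree unchanged, which is why the conversion can simply be called right after parsing.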
old_xml2i.py
Add new method to original xml2i.py and call it after parsing the XML file
old_xml2phylip.py
• Import new module
• Check that type information is saved in the Isequence (not used in phylip format)
Testing on old format XML sequence

U03518b.xml:

<?xml version="1.0"?>
<SEQUENCEDATA>
<TYPE>dna</TYPE>
<SEQ>
<NAME>Aspergillus awamori</NAME>
<ID>U03518</ID>
<DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatc
cgtgtctattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctgcc
ccccgggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtctgagttgatt
gaatgcaatcagttaaaactttcaacaatggatctcttggttccggc</DATA>
</SEQ>
</SEQUENCEDATA>

python old_xml2phylip.py U03518b.xml U03518b
sequence is of type dna
Remark: book uses old version of DOM parser
• XML examples in book won't work (except the revised fig. 16.04)
• Look in the presented example programs to see what you have to import
• All the methods and attributes of a DOM tree on pages 537 ff. are the same