XML
XML is a Semi-Structured Data Model for many applications
XML
- it is a superset of HTML
- XML is a generic data format
- for machine-to-machine communication and data exchange
- especially widely used for Data Integration
- there are ways to impose some constraints via schema: DTD, XML Schema
- there are many tools
Representations
Tree Representation
In XML Trees,
- values are always at the leaf level
- all other nodes contain information about these nodes
Examples:
Object-oriented model of these trees:
- DOM or Document Object Model
Serialized Form
The following are the serialized forms of these trees:
<r>
<name>Alan</name>
<tel>32190</tel>
<email>alan@aol.ru</email>
</r>
<r>
<name>
<first>Alan</first>
<last>Black</last>
</name>
<tel>32190</tel>
<email>alan@aol.ru</email>
</r>
So this the serialized form is
- it’s a textual, linear representation of the tree
=== Examples === ```text only
```xml
<document> Hello World| </document>
``` | |```xml
<document>
<salutation> Hello World| </salutation> |</document> |
<?xml version="1.0" encoding="utf-8" ?>
<document>
<salutation color="blue"> Hello World| </salutation> |</document> |
Bigger Example:
<solar_system>
<star>
<name>Sun</name>
<spectral_type>G2</spectral_type>
<age unit="billions years">5</age>
</star>
<planet type="telluric">
<name>Earth</name>
<distance unit="km">149600000</distance>
<mass unit="kg">5.98e24</mass>
<diameter unit="km">12756</diameter>
<satellite number="1"/>
</planet>
<planet ring="yes" type="gaseous">
<name>Saturn</name>
<distance unit="UA">5.2</distance>
<mass unit="Earth mass">95</mass>
<diameter unit="Earth diameter">9.4</diameter>
<satellite number="18"/>
</planet>
<planet ring="yes" type="gaseous">
<name>Uranus</name>
<distance unit="UA">19.2</distance>
<mass unit="Earth mass">14.5</mass>
<diameter unit="Earth diameter">4</diameter>
<satellite number="15"/>
</planet>
</solar_system>
- note that digits under nodes signify the positions
- '’positions’’ is a way of identifying nodes
- to denote the position of the root we use $\epsilon$
Working with XML
Parsing
For applications
- typically an application parses a serialized form and produces a tree
- it works with it: accesses parts of the document, reorganizes it, etc
- after finishing, it serializes it back to XML
Parsers
- a parser takes a serialized form an produces a tree form (for example, DOM)
- validation
- first of all it checks if the document is well-formed
- using schemas, the parser checks if the document is valid
Schemas
Schemas are used to
- specify constraints for XML documents: order of elements, structure, data types
- augment documents: default values, white space processing, etc
- give some semantics to the documents you want to design
- to reuse and document your decisions
- designing contracts for web services
There are 3 ways of doing it:
- DTD - based
- XML Schema
- Relax NG
Schemas are build on top of Tree Automata and Regular Expressions theory
- Validation of a document = a run of a tree automaton
Attributes vs Elements
Consider the following XML documents :
<university>
<teacher subject="math" students="180">M. Durant</teacher>
<teacher subject="CS" students="130">M. Smith</teacher>
<teacher subject="CS" students="150">Mme. Martin</teacher>
</university>
<university>
<teacher>
<name>M. Durant</name>
<subject>Math</subject>
<students>180</students>
</teacher>
<teacher>
<name>M. Smith</name>
<subject>CS</subject>
<students>130</students>
</teacher>
<teacher>
<name>Mme. Martin</name>
<subject>Math</subject>
<students>150</students>
</teacher>
</university>
Which representation is better?
- it depends
- attributes should rather be used for metadata (like units of measure, etc)
-
also attributes must be used only for simple types - not for complex values ```text only
Better:
```xml
<note>
<date>
<day>10</day>
<month>01</month>
<year>2008</year>
</date>
</note>
Namespaces
Suppose we have two tags with the same name, but different meaning
For example, consider a tag
- it can be an HTML title
- a description of a book
- the title of a person
How to avoid naming conflicts?
- use namespaces: add prefixes to the names
- this way it’s possible to give unique names
- each namespace prefix is uniquely identified with some URI
| |
<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
| |
<t:table>
<t:name>African Coffee</t:name>
<t:width>80</t:width>
<t:lenght>120</t:lenght>
</t:table>
<root>
<h:table xmlns:h="http://www.w3.org/TR/html4/">
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<t:table xmlns:t="http://www.foo.fr/furniture">
<t:name>African Coffee</t:name>
<t:width>80</t:width>
<t:lenght>120</t:lenght>
</t:table>
</root>
- Note that in this case
h
is defined in the element prefixed withh:
it is possible
Default namespace
- for some element we can define a default namespace
- so we don’t have to prefix anything down the tree: the default namespace is assumed
- another element down the tree can redefine the default namespace for all its children
<chapter xmlns="http://www.mydescription.com">
<paragraph>
...
</paragraph>
</chapter>
<chapter xmlns="http://www.mydescription.com/">
<paragraph xmlns="http://www.foo.fr/">
...
</paragraph>
</chapter>
Several namespaces
- it’s also possible to attach several namespaces to one element
<root xmlns:h="http://www.w3.org/TR/html4/" xmlns:t="http://www.foo.fr/furniture">
<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<t:table>
<t:name>African Coffee</t:name>
<t:width>80</t:width>
<t:lenght>120</t:lenght>
</t:table>
</root>