W3C XML Schema:
Background and Introduction to "Best Practices"


  1. What is XML?
  2. What is a "DTD" ?
  3. What is a "Schema" ?
  4. What is a "Namespace" ?
  5. What is an XML "Document" ?
  6. What is "Constraint-Modeled Data" ?
  7. What is "Data Binding" ?

What is XML?
XML is the acronym for "eXtensible Markup Language". It is a Recommendation of the W3C (World Wide Web Consortium) and defines a "generic, platform-independent syntax for marking up and structuring data by means of simple, human-readable tags." 1

Let us examine some key parts in the above statement.

By "generic" we mean that it is not tied to any particular type of data or document. XML can be used to define any type of data/document for any purpose. It looks like the HTML you are probably used to reading and writing, but it is not HTML. Rather, it may be used to define documents which end up being displayed as HTML, or in some other presentation format, or which are never presented at all. XML is not tied to a particular use; it is just a way to define a document. XML is "pure content and structure".

By "platform-independent" we mean that it is not tied to any particular machine or operating system. XML is XML, whether it is to be utilized by a Unix machine or a Windows machine or a Macintosh or any other machine/OS pair. As you may imagine, this helps make XML highly portable. XML is neutral in the operating-systems/hardware wars.

By "designed for marking up data" we mean that an XML document contains "pure" data; for example: text, numbers, booleans, and complex mixtures of data-types. Because it is generic and platform-independent, you can probably see that the data in an XML document is pretty much ideal for interchange between machines and systems. XML is about content and structure, and not about presentation.

By "designed for structuring data" we mean that the data in an XML document has a structure or ordering imposed upon it by the relationships between tags and the data contained within the tags (see below for more details). Data by itself is more or less meaningless; by imposing a structure upon data we give it meaning, we turn it into information.

By "human-readable format" we mean that both the structure and the data can be read and understood in "raw" form by people as well as by machines. There have been many discussions in many venues about the advantages of binary-encoded versus human-readable data, and I will not go into these issues here.

The majority of XML documents are merely "well-formed" rather than "valid". A well-formed document has exactly one root element, and every sub-element (at any depth of nesting) has delimiting start- and end-tags, or is written as an empty-element tag, and is properly nested within its parent. A valid document, on the other hand, is well-formed and also conforms to a specified set of production rules, such as those given in a DTD or a Schema. 2
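
As a rough illustration of the "well-formed" half of this distinction, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree parser. The helper name is_well_formed and the sample strings are made up for this example. Note that this parser checks well-formedness only; it does not validate against a DTD or a Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One root element, every element closed and properly nested.
good = "<order><item>pen</item><note/></order>"
# The end-tags overlap instead of nesting.
bad_nesting = "<order><item>pen</order></item>"
# Two root elements in one document.
two_roots = "<order/><order/>"

print(is_well_formed(good))         # True
print(is_well_formed(bad_nesting))  # False
print(is_well_formed(two_roots))    # False
```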

Class discussion question:
What might be some advantages or disadvantages of data-content and data-structure being human-readable?


What is a "DTD" ?
A set of rules for document construction that lies at the heart of SGML development and all valid XML document construction. Processing applications and authoring tools rely on DTDs to inform them of the parts required by a particular document type. A document with a DTD may be validated against the definition. 3
The purpose of a Document Type Definition (DTD) is to define the legal building blocks of any SGML-based (SGML = Standard Generalized Markup Language) document. It defines the document structure with a list of legal elements.
DTDs have been used since the 1970s. 3

An example (and very simple) DTD:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DOCUMENT [
  <!ELEMENT DOCUMENT (#PCDATA)>
  <!ENTITY Description "This is entity content.">
]>

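And an example document conforming to the above DTD (because the DTD above is written as an internal subset, the DOCTYPE declaration and this root element belong together in one document):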

<DOCUMENT>
  This is element text and an entity follows:
  &Description;
</DOCUMENT>
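
To see the two pieces above working together, the following sketch joins the DOCTYPE declaration and the root element into one document and parses it with Python's standard xml.etree.ElementTree module. It is only an illustration: this parser expands the internally declared entity but does not validate the content against the element declaration.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset comes first, then the root element it describes.
document = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DOCUMENT [
  <!ELEMENT DOCUMENT (#PCDATA)>
  <!ENTITY Description "This is entity content.">
]>
<DOCUMENT>
  This is element text and an entity follows:
  &Description;
</DOCUMENT>
"""

root = ET.fromstring(document)

# Collapse the line breaks and indentation so the text is easy to read.
text = " ".join(root.text.split())

print(root.tag)
print(text)
```

The printed text should show the entity reference replaced by the entity's declared content.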


What is a "Schema" ?
A schema (plural: schemata) is "a diagrammatic representation; an outline or model."

Something that formally describes the abstract structure of a set of data can therefore be called a schema.

An XML-schema is a document that describes the valid format of an XML data-set. This description includes which elements are (and are not) allowed at any point, what the attributes for any element may be, how many occurrences of an element are permitted, and so on.

Note: XML-Schema documents are not known for their brevity. An XML-Schema document for a reasonably-sized XML instance-document will be fairly large. Disk space is cheap and bandwidth is not a huge bottleneck, so there is no need to worry about it.

It does mean that you will do a lot of typing though. 2

An example (and slightly more complex) Schema:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

 <xsd:annotation>
  <xsd:documentation xml:lang="en">
   Purchase order schema for Example.com.
   Copyright 2000 Example.com. All rights reserved.
  </xsd:documentation>
 </xsd:annotation>

 <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>

 <xsd:element name="comment" type="xsd:string"/>

 <xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
   <xsd:element name="shipTo" type="USAddress"/>
   <xsd:element name="billTo" type="USAddress"/>
   <xsd:element ref="comment" minOccurs="0"/>
   <xsd:element name="items"  type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
 </xsd:complexType>

 <xsd:complexType name="USAddress">
  <xsd:sequence>
   <xsd:element name="name"   type="xsd:string"/>
   <xsd:element name="street" type="xsd:string"/>
   <xsd:element name="city"   type="xsd:string"/>
   <xsd:element name="state"  type="xsd:string"/>
   <xsd:element name="zip"    type="xsd:decimal"/>
  </xsd:sequence>
  <xsd:attribute name="country" type="xsd:NMTOKEN"
     fixed="US"/>
 </xsd:complexType>

 <xsd:complexType name="Items">
  <xsd:sequence>
   <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
    <xsd:complexType>
     <xsd:sequence>
      <xsd:element name="productName" type="xsd:string"/>
      <xsd:element name="quantity">
       <xsd:simpleType>
        <xsd:restriction base="xsd:positiveInteger">
         <xsd:maxExclusive value="100"/>
        </xsd:restriction>
       </xsd:simpleType>
      </xsd:element>
      <xsd:element name="USPrice"  type="xsd:decimal"/>
      <xsd:element ref="comment"   minOccurs="0"/>
      <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
     </xsd:sequence>
     <xsd:attribute name="partNum" type="SKU" use="required"/>
    </xsd:complexType>
   </xsd:element>
  </xsd:sequence>
 </xsd:complexType>

 <!-- Stock Keeping Unit, a code for identifying products -->
 <xsd:simpleType name="SKU">
  <xsd:restriction base="xsd:string">
   <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
 </xsd:simpleType>

</xsd:schema>

And an example document conforming to the above Schema:

<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
    <shipTo country="US">
        <name>Alice Smith</name>
        <street>123 Maple Street</street>
        <city>Cambridge</city>
        <state>MA</state>
        <zip>12345</zip>
    </shipTo>
    <billTo country="US">
        <name>Robert Smith</name>
        <street>8 Oak Avenue</street>
        <city>Cambridge</city>
        <state>MA</state>
        <zip>12345</zip>
    </billTo>
    <items/>
</purchaseOrder>
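
Once an instance document like this is available, a program can walk its structure. The sketch below (Python, standard library only) reads the purchase order shown above and pulls out a few values. The variable names are chosen for this example, and nothing here checks the document against the Schema.

```python
import xml.etree.ElementTree as ET
from datetime import date

PURCHASE_ORDER = """<purchaseOrder orderDate="1999-10-20">
    <shipTo country="US">
        <name>Alice Smith</name>
        <street>123 Maple Street</street>
        <city>Cambridge</city>
        <state>MA</state>
        <zip>12345</zip>
    </shipTo>
    <billTo country="US">
        <name>Robert Smith</name>
        <street>8 Oak Avenue</street>
        <city>Cambridge</city>
        <state>MA</state>
        <zip>12345</zip>
    </billTo>
    <items/>
</purchaseOrder>"""

root = ET.fromstring(PURCHASE_ORDER)

# Attributes are read from the element; child elements are reached by path.
order_date = date.fromisoformat(root.get("orderDate"))
ship_to_name = root.findtext("shipTo/name")
bill_to_street = root.findtext("billTo/street")
item_count = len(root.findall("items/item"))

print(order_date.year)   # 1999
print(ship_to_name)      # Alice Smith
print(bill_to_street)    # 8 Oak Avenue
print(item_count)        # 0
```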


What is a "Namespace" ?
We envision applications of Extensible Markup Language (XML) where a single XML document may contain elements and attributes (here referred to as a "markup vocabulary") that are defined for and used by multiple software modules. One motivation for this is modularity; if such a markup vocabulary exists which is well-understood and for which there is useful software available, it is better to re-use this markup rather than re-invent it.

Such documents, containing multiple markup vocabularies, pose problems of recognition and collision. Software modules need to be able to recognize the tags and attributes which they are designed to process, even in the face of "collisions" occurring when markup intended for some other software package uses the same element type or attribute name.

These considerations require that document constructs should have universal names, whose scope extends beyond their containing document. This specification describes a mechanism, XML namespaces, which accomplishes this.

[Definition:] An XML namespace is a collection of names, identified by a URI reference [RFC2396], which are used in XML documents as element types and attribute names. XML namespaces differ from the "namespaces" conventionally used in computing disciplines in that the XML version has internal structure and is not, mathematically speaking, a set. These issues are discussed in "A. The Internal Structure of XML Namespaces".

[Definition:] URI references which identify namespaces are considered identical when they are exactly the same character-for-character. Note that URI references which are not identical in this sense may in fact be functionally equivalent. Examples include URI references which differ only in case, or which are in external entities which have different effective base URIs.

Names from XML namespaces may appear as qualified names, which contain a single colon, separating the name into a namespace prefix and a local part. The prefix, which is mapped to a URI reference, selects a namespace. The combination of the universally managed URI namespace and the document's own namespace produces identifiers that are universally unique. Mechanisms are provided for prefix scoping and defaulting.

URI references can contain characters not allowed in names, so cannot be used directly as namespace prefixes. Therefore, the namespace prefix serves as a proxy for a URI reference. An attribute-based syntax described below is used to declare the association of the namespace prefix with a URI reference; software which supports this namespace proposal must recognize and act on these declarations and prefixes. 4
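
The following sketch may help make the prefix-as-proxy idea concrete. It uses Python's standard xml.etree.ElementTree module, which reports each element name in the form {URI}local-part. The namespace URIs and element names are hypothetical, chosen only for this example; they serve as identifiers and nothing is fetched from them.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, used purely as identifiers.
ORDERS = "http://example.com/orders"
FURNITURE = "http://example.com/furniture"
MARKUP = "http://example.com/html-like"

DOC = """<order xmlns="http://example.com/orders"
       xmlns:f="http://example.com/furniture"
       xmlns:h="http://example.com/html-like">
  <f:table>oak dining table</f:table>
  <h:table>rows and columns</h:table>
</order>"""

root = ET.fromstring(DOC)

# Two elements share the local part "table" but do not collide,
# because each prefix selects a different namespace.
tags = [child.tag for child in root]
print(root.tag)  # {http://example.com/orders}order
print(tags)

# The prefix itself carries no meaning: a different prefix bound to the
# same URI yields the same universal name.
other = ET.fromstring('<x:table xmlns:x="http://example.com/furniture"/>')
same_name = other.tag == root[0].tag
print(same_name)  # True

# Searching by qualified name needs a prefix-to-URI mapping.
furniture_text = root.find("f:table", {"f": FURNITURE}).text
print(furniture_text)  # oak dining table
```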


What is an XML "Document" ?

An XML "document" might be a file, but then again it might not be a file. Why is this?

Remember that an XML "file" (or data source) is just a stream of character data which has a predictable structure and form. Because an XML "document" is just a stream with a predictable structure and form, it can "come from" anywhere on the internetwork and it can be stored (or "persisted") in any form which may be turned into a stream of character data which has the appropriate structure and form.

An XML "document" can persist as one or more entries in a DBMS, it can exist momentarily as a stream from another machine which is the result of some transformation operations (created "on-the-fly" under programmatic control), it can exist as a text file already in the correct structure and form, or it can originate from anywhere and from any source-type which can be transformed into the correct structure and form.
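
As one possible illustration of a document that never exists as a file, the sketch below builds XML "on-the-fly" from a few in-memory rows and then reads it back from a byte stream. It is written in Python with the standard library only. The rows list is made-up sample data standing in for the result of a DBMS query, and the element names simply follow the purchase order example.

```python
import io
import xml.etree.ElementTree as ET

# Made-up rows standing in for the result of a database query.
rows = [("872-AA", "Lawnmower", 1), ("926-AA", "Baby Monitor", 1)]

# Build the document under programmatic control.
items = ET.Element("items")
for part_num, product_name, quantity in rows:
    item = ET.SubElement(items, "item", partNum=part_num)
    ET.SubElement(item, "productName").text = product_name
    ET.SubElement(item, "quantity").text = str(quantity)

# Serialize to bytes and wrap them in a stream, as if they had
# arrived from another machine rather than from a file on disk.
stream = io.BytesIO(ET.tostring(items, encoding="utf-8"))

# The consumer only sees a stream of characters with the right form.
parsed = ET.parse(stream).getroot()
names = [item.findtext("productName") for item in parsed.findall("item")]
part_numbers = [item.get("partNum") for item in parsed.findall("item")]

print(names)         # ['Lawnmower', 'Baby Monitor']
print(part_numbers)  # ['872-AA', '926-AA']
```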

Class discussion question:
Describe some possible sources for XML "documents" and some ways you could see them being created "on-the-fly" under programmatic control.


What is "Constraint-Modeled Data" ?

What does Schema accomplish? Amongst other things, Schema allows you to state the "constraints" which should be applied to your data: what type(s) and what range of value(s) are allowed to exist in any given element or attribute. This is known as "constraint-modeled data", in that the constraints you establish in your Schema allow a program to create a "model" of the data-set which may be found in any given "instance document" (an XML document which meets the requirements of your Schema). If a program can tell (through examining the constraints on legal data) what the legal contents are, then that program can model that data (for example, as a set of Java classes).
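
To give a feel for what such constraints mean in practice, here is a hand-written sketch in Python that checks two of the constraints from the purchase order Schema: the SKU pattern on partNum and the range of quantity (a positive integer below 100). This is not Schema validation; a real validator would read these rules from the Schema itself. The function name check_item and the sample items are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as the SKU simpleType. Schema patterns must match the
# whole value, so fullmatch() is used below rather than search().
SKU_PATTERN = re.compile(r"\d{3}-[A-Z]{2}")
# Simplified lexical form of a positive integer (optional plus sign).
POSITIVE_INTEGER = re.compile(r"\+?[0-9]+")


def check_item(item):
    """Return a list of constraint violations found in one item element."""
    problems = []

    part_num = item.get("partNum")
    if part_num is None:
        problems.append("partNum attribute is required")
    elif not SKU_PATTERN.fullmatch(part_num):
        problems.append("partNum %r does not match the SKU pattern" % part_num)

    quantity = (item.findtext("quantity") or "").strip()
    if not POSITIVE_INTEGER.fullmatch(quantity) or not 1 <= int(quantity) < 100:
        problems.append("quantity %r is not a positive integer below 100" % quantity)

    return problems


good_item = ET.fromstring('<item partNum="872-AA"><quantity>1</quantity></item>')
bad_item = ET.fromstring('<item partNum="872-aa"><quantity>100</quantity></item>')

print(check_item(good_item))      # []
print(len(check_item(bad_item)))  # 2
```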

Class discussion question:
Look at the classes generated by the PO example, and discuss the correspondence between them and the Schema.


What is "Data Binding" ?

Discussion items:

  1. Class generation: why do it?
  2. Unmarshalling: what is it?
  3. Marshalling: what is it?
  4. Binding Schemas: how does this take us from simple utility classes to persistence?
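
As a starting point for the discussion items above, here is a small hand-written sketch of the idea in Python. It is not JAXB and nothing here is generated: the USAddress class and the two functions are written by hand, as assumptions for this example, to stand in for what a binding tool would produce from the USAddress type in the Schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from decimal import Decimal

FIELDS = ("name", "street", "city", "state", "zip")


@dataclass
class USAddress:
    """Hand-written stand-in for a class generated from the Schema type."""
    name: str
    street: str
    city: str
    state: str
    zip: Decimal          # the Schema declares zip as xsd:decimal
    country: str = "US"   # the Schema fixes this attribute to "US"


def unmarshal_address(element):
    """XML element -> object."""
    return USAddress(
        name=element.findtext("name"),
        street=element.findtext("street"),
        city=element.findtext("city"),
        state=element.findtext("state"),
        zip=Decimal(element.findtext("zip")),
        country=element.get("country", "US"),
    )


def marshal_address(tag, address):
    """Object -> XML element."""
    element = ET.Element(tag, country=address.country)
    for field_name in FIELDS:
        ET.SubElement(element, field_name).text = str(getattr(address, field_name))
    return element


SHIP_TO = (
    '<shipTo country="US"><name>Alice Smith</name>'
    "<street>123 Maple Street</street><city>Cambridge</city>"
    "<state>MA</state><zip>12345</zip></shipTo>"
)

address = unmarshal_address(ET.fromstring(SHIP_TO))
rebuilt = ET.tostring(marshal_address("shipTo", address), encoding="unicode")

print(address.city)        # Cambridge
print(rebuilt == SHIP_TO)  # True
```

The program works with address.city rather than with tags and text nodes, which is the main convenience that class generation is meant to provide.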

References
  1. JAXB Users Guide – XML Basics
  2. XML Schema, a brief introduction
  3. XML: A Primer, Third Edition, Simon St. Laurent, M&T Books, 2001, p. 532.
  4. Namespaces in XML
Last updated: 10 DEC 2002 Prof. Jeff Sonstein