xml basics

52
XML Programming

Upload: kumar

Post on 14-Feb-2017

420 views

Category:

Education


0 download

TRANSCRIPT

Page 1: Xml basics

XML Programming

Page 2: Xml basics

XML Document

• Physical View• Logical View

Page 3: Xml basics

XML Document: Physical View

• Either Markup or Character Data• To store contents in one or more les• Collectively they represent an XML document

Page 4: Xml basics

XML Document: Logical View

• Structure of document• XML document is viewed as tree• Elements divide the document into its

constituent parts• Elements are organized into a hierarchy

Page 5: Xml basics

Document Structure

• Root element: Top element of the hierarchy• Also called Document Element• Root element defines boundary• It encloses all the other elements• Only one root element in a document• XML documents are commonly stored as text files• Any text editor may be used to create XML document• Use extension .xml for clarity• XML document does not contain formatting information

Page 6: Xml basics

Parsing XML document

• The XML parser reads the XML document, checks its syntax, reports any errors and allows programmatic access to the document’s contents.

• If the document is well formed, the parser makes the document’s data available to the application (i.e., IE5), using the XML document.

Page 7: Xml basics

An Example XML Document<!-- \Prolog" --><?xml version=“1.0"?><!DOCTYPE Greeting SYSTEM “wel.dtd"><!-- End of Prolog --><!-- This is the first XML example --><Greeting>

<from>Instructor</from><to>CSIB 313 Class</to><myGreetings>

<greetings>Welcome to XML</greetings></myGreetings>

</Greeting>

Page 8: Xml basics

Comments

• Tags: Enclosed between <>• Tags demarcate and label the parts of

documents• Comment start tag: <!--• Comment end tag: -->• May extend several lines

Page 9: Xml basics

Document Prolog

• XML document: Prolog and Body• It may hold additional information• Text encoding, Processing Instructions, and

Document Type Definition being used• Ordering between prolog and body significant

Page 10: Xml basics

XML declaration

• Syntax: <?xml name1 = “value1" name2 = “value2" ...

• Three properties can be set– version, encoding, and standalone

• <?xml version=“1.0" encoding=“US-ASCII“ standalone=“yes“ ?>

• Note: single and double quotes enclosing values

Page 11: Xml basics

Contd.

• All properties are optional• Property names must be in lowercase• Values must be quoted: single or double• Desirable to include version

Page 12: Xml basics

XML Character Set

• A character set consists of the characters that may be represented in a document.

• ASCII: 7 bit coding• ISO: 8 bit coding• ISO-8859-1,-2,-3,..• XML document may contain: carriage return,

line feed, and Unicode characters

Page 13: Xml basics

Contd.

• The Unicode consortium: www.unicode.org• Unicode: 16 bit code• Encodes major scripts of world• Universal Character System(UCS): 32 bits• ISO-10646: Lower 2 bytes are Unicode

Page 14: Xml basics

Character Encoding

• Each subset for a script• This subset for a script is mapped to 8 bit code• The characters are in same order, but starts at

lower number• The mapped subset is known as character

encoding• Most common encoding scheme: UTF-8• UTF-8: Efficient to encode documents having ASCII

characters

Page 15: Xml basics

Contd..

• UTF-8 is default XML character encoding• Some common character encodings:– US-ASCII, ISO-8859-1, ISO-8859-n– ISO-8859-1-Windows-3.1-Latin-1– UTF-7, UTF-8, UTF-16– ISO-10646-UCS-2, ISO-10646-UCS-4– UCS-2 is same as Unicode

Page 16: Xml basics

Elements• Elements are building blocks• Elements are organized in an hierarchy• Hierarchy defines the logical structure• Elements acts as containers and labels• Types of the elements are differentiated by tags• Start-tag, End-tag, Empty-tag• Empty-element tag: <name/>• Start-tag must have a matching end tag• XML IS CASE SENSITIVE• <message> ....</Message> : Wrong

Page 17: Xml basics

Attributes

• Attributes describe elements• An element may have zero or more attributes• Attributes are placed within element's start

tag• Only one occurrence of each attribute• Example: <car doors = “4"/>• Attribute doors has value “4"

Page 18: Xml basics

Elements and Attributes Names

• Can be of any length• Must begin with a letter or an underscore• May contain letters, digits, underscores,

hyphens, and periods• Names are case sensitive• Example:<instructor Designation=“Professor"> PKD</instructor>

Page 19: Xml basics

Reserved Attributes

• xml:lang: Classies an element by language• xml:lang=“en"• xml:space: Species whether whitespace

should be preserved• xml:space=“default"• xml:link and xml:attribute

Page 20: Xml basics

White Space Characters

• Spaces, tabs, line feeds and carriage return• An XML parser is required to pass all

characters in a document• Application need to decide the significance• Insignificant white spaces may be collapsed

into single white space character or removed• This process is called normalization

Page 21: Xml basics

Entity References

• Reserved characters: <, >, & , `, “, etc.• To use these characters in content: Entity

References• Entity references begin with & and ends with ;• Unicode may be used in document• &#1583; &#1571; ..

Page 22: Xml basics

Built-in Entities

• XML provides built-in entities• &amp;, &lt;, &gt;, &apos;, and &quote;• Example:<message>&lt;&gt;&quote;</

message>

Page 23: Xml basics

CDATA Section• CDATA sections are helpful for XML authors• may contain text, reserved characters (e.g., <) and whitespace

characters• It is set off by <![CDATA[ and ]]>• Everything between <![CDATA[ and the ]]> is treated as raw data• Character data in a CDATA section is not processed by the• XML parser• In CDATA < is not start tag• An acronym for “character data"• An example:

<para> Then you may say <![CDATA[ if(&x < &y) ]]> and be done with it.</para>

Page 24: Xml basics

Entities

• An entity is a placeholder for content• Declared once and may be used several times• Does not add anything semantically• Convenient to read, write and maintain XML

document• It represents physical containers such as files

or URLs

Page 25: Xml basics

Contd..• Can be used for different reasons, but always

eliminate an inconvenience• Standing in for impossible-to-type characters to

marking the place where file should be imported.• Entities can hold a single character, a string of text,

or even a chunk of XML markup• Without entities, XML would be much less useful.• For e.g. define an entity w3url to represent the

W3C's URL. Whenever you enter the entity in a document, it will be replaced with the text http://www.w3.org/.

Page 26: Xml basics

Different types of entities

• Entities are classified as: Parameter and General• Parameter entities are used in only DTDs• General entities: Character, Unparsed, and

mixed-content (General entities are placeholders for any content that occurs at the level of or inside the root element of an XML document)

• Mixed-content: Internal and External• Character: Predefined, Numbered, and Named

Page 27: Xml basics

Contd..• An entity consists of a name and a value• The value may be anything (a single character to a file of

XML markup)• As the parser scans the XML document, it encounters entity

references(ER), which are special markers derived from entity names.

• For each ER, the parser consults a table in memory for something with which to replace the marker. It replaces the entity reference with the appropriate replacement text or markup,

• Two syntax for entity reference:– General entities: &entity_name;– Parameter entities: %entity_name;

Page 28: Xml basics

Example<?xml version=“1.0"?><!DOCTYPE message SYSTEM “xmldtds/message.dtd" [

<!ENTITY client “Mr. John"><!ENTITY agent “Ms. Sally"><!ENTITY phone “<number>8755944998 </number>“ >

]><message><opening>Dear &client;</opening><body>We have an exciting opportunity for you! A set of ocean-front cliff dwellings in Pi&#241;ata, Mexico have been renovated as time-share vacation homes. They're going fast! To reserve a place for your holiday, call &agent; at &phone;. Hurry, &client;. Time is running out!</body></message>

Page 29: Xml basics

Contd..

• The entities &client;, &agent;, and &phone; are declared in the internal subset of this document and referenced in the <message> element.

• A fourth entity, &#241;, is a numbered character entity that represents the character ñ.

• This entity is referenced but not declared;• no declaration is necessary because numbered

character entities are implicitly defined in XML as references to characters in the current character set.

Page 30: Xml basics

• All entities (besides predefined ones) must be declared before they are used in a document

• Two acceptable places to declare them are – in the internal subset, which is ideal for local

entities, and – in an external DTD, which is more suitable for

entities shared between documents.

Page 31: Xml basics

Character Entities

• Character entities: Contain single character• Classified into several groups• Predefined character entities: amp, apos, lt,....• Numbered character entities: &#231, &#xe7• Hexadecimal version is distinguished with x

Page 32: Xml basics

Contd..

• Named character entities• They have to be declared• Easy to remember mnemonic• Large numbers have been defined in DTDs• ISO-8879: Standardized set of named

character entities

Page 33: Xml basics

Mixed Content Entities

• Unlimited length• May include markup as well as text• Categories: Internal and External• Internal: Replacement text is defined in the

declaration• &client; &agent; &phone;

Page 34: Xml basics

Contd..

• Predefined character entities must not be used in markup if they are used in entity definition

• Exceptions: quote and apos• Entities can contain entity reference if it has

been declared• Recursive declaration not permitted

Page 35: Xml basics

• External: Replacement text in another file• Useful for sharing the contents• Breaks a document into multiple physical parts• A linking mechanism

Page 36: Xml basics

An example:<?xml version=“1.0"?><!DOCTYPE doc SYSTEM “http://www.dtds.com/generic.dtd" [<!ENTITY part1 SYSTEM “p1.xml"><!ENTITY part2 SYSTEM “p2.xml"><!ENTITY part3 SYSTEM “p3.xml">]><longdoc>

&part1;&part2;&part3;

</longdoc>

Page 37: Xml basics

• Replacement text is inserted at the time of parsing• &part1;, &part2;, and &part3; are external entities• Keyword SYSTEM is used.• File names are under single or double quotes.• System identifier: To identify resource location.• Quoted string may be URL• Alternative is to use keyword PUBLIC• XML processor will locate the resource• Public identifier may be accompanied with a system

identifier

Page 38: Xml basics

Unparsed Entities

• Kind of entity that holds replacement text/content that should not be parsed

• Because it contains something other than text and would likely confuse the parser

• Used to import graphic, sound files and other non-character data.

• Declaration looks similar to that of an external entity, with additional information

Page 39: Xml basics

Example

An Example:<!DOCTYPE doc [<!ENTITY mypic SYSTEM “photos/erick.gif“ NDATA GIF>]>< doc >

&mypic;</doc>

Page 40: Xml basics

Contd..

• Keyword NDATA used after system identifier• NDATA: Notation for data• tells the parser that the entity's content is in a

special format, or notation, other than the usual parsed mixed content.

• It is followed by notation identifier that specifies the data format.

Page 41: Xml basics

Processing Instructions

• Presentational information should be kept out of document

• For example: if we want to store page numbers in the document

• The prescription for this kind of information is a processing instruction

• It is container for data that is targeted toward a specific XML processor

Page 42: Xml basics

Contd..• PI contain two pieces of information: a target keyword and

target data• The parser passes processing instructions up to the next

level of processing. • If the processing instruction handler recognizes the target

keyword, it may choose to use the data; otherwise, the data is discarded.

• A PI starts with a two-character delimiter– consisting of an open angle bracket and a question mark (<?), – followed by a target and an optional string of characters that is

the data portion of the PI– a closing delimiter, consisting of a question mark and closing

angle bracket (?>).

Page 43: Xml basics

Contd..• The target is a keyword that an XML processor uses to

determine whether the data is meant for it or not• More than one program can use a PI, and a single

program can accept multiple PIs.• The PI can contain any data except the combination ?>• Examples:– <? flubber pg = 9 recto ?>– <? things ?>– <title> The Introduction to the XML <? lb ?> portability <? lb

?> and integratin</title>, where <? lb ?> forces line break

Page 44: Xml basics

Well-Formed Documents

• An XML parser or processor• Parses XML documents• Produce parse tree• Document is not well-formed if parser is not

able to generate tree• Every XML document must be well formed

Page 45: Xml basics

Well-Formed Documents (Contd...)

• Some of the syntactic rules checked– Case sensitive: message and Message are different

tags– Single root element– A start and end tag for each element– Properly nested tags– Quoted attribute values– Comments and processing instructions may not

appear inside tags

Page 46: Xml basics

Document Validation

• A valid document includes a Document Type Declaration

• The DTD defines rules for the documents• These are additional constraints• Validating parser compares documents to

their DTDs• The DTD lists all elements, attributes, and

entities the document use and their contexts.

Page 47: Xml basics

Document Modeling

• Define a Markup Language for documents• Define a grammar for documents• It is also called as XML Application Modeling• Markups may appear in the documents• It describes the restrictions which each

document instance has to honor

Page 48: Xml basics

Document Type Definitions• Document Type Definition is written in formal syntax• Describes elements, attributes, entities, and contents

which may appear in the documents• Validating Parser compares documents to their DTDs• The DTD does not say:– What is the document's root element?– How many instances of each element?– What character data are inside elements look like?– What is the meaning of an element?

Page 49: Xml basics

A Simple DTD Example

<!ELEMENT person (name, profession*) ><!ELEMENT name (first name, last name) ><!ELEMENT first name (#PCDATA) ><!ELEMENT last name (#PCDATA) ><!ELEMENT profession (#PCDATA) >

Page 50: Xml basics

Contd..

• The DTD describes person element• The person element has two children elements /

sub-elements• Sub-elements are: name and profession• A person may have “zero or more" profession• The name has two sub-elements

Page 51: Xml basics

Document Type Declaration

• A valid document includes a reference to its DTD

• The DTD declaration is included in prolog• <!DOCTYPE person SYSTEM

“http://www.upes.ac.in/xml/dtds/person.dtd" >

• The DTD is generally stored in separate file• Optionally, it may have extension .dtd

Page 52: Xml basics

Contd..

• Root element is person• The DTD can be found at the URL• Relative URL may be used if DTD is on same

site• File name may be used if the document and

the DTD are in the same directory• DTD may be stored at several URLs