Formatting and Indenting XML Documents

Oxygen XML Author creates XML documents using several edit modes. In Text mode, you as the author decide how the XML file is formatted and indented. In the other modes, and when you switch between modes, Oxygen XML Author must decide how to format and indent the XML. Oxygen XML Author will also format and indent your XML for you in Text mode if you use one of the Format and Indent options:

  • DocumentSourceFormat and Indent - Formats and indents the whole document.

  • DocumentSourceIndent Selection - Indents the current selection (but does not add line breaks). This action is also available in the Source submenu of the contextual menu.

  • DocumentSourceFormat and Indent Element - Formats and indents the current element (the inmost nested element that currently contains the cursor) and its child-elements. This action is also available in the Source submenu of the contextual menu.

A number of settings affect how Oxygen XML Author formats and indents XML. Many of these settings have to do with how whitespace is handled.

Significant and insignificant whitespace in XML

XML documents are text files that describe complex documents. Some of the white space (spaces, tabs, line feeds, etc.) in the XML document belongs to the document it describes (such as the space between words in a paragraph) and some of it belongs to the XML document (such as a line break between two XML elements). Whitespace belonging to the XML file is called insignificant whitespace. The meaning of the XML would be the same if the insignificant whitespace were removed. Whitespace belonging to the document being described is called significant whitespace.

Knowing when whitespace is significant or insignificant is not always easy. For instance, a paragraph in an XML document might be laid out like this:

<p>NO Free man shall be taken or imprisoned, or be stripped of his Freedom,
or Liberties, or free Customs, or be outlawed, or exiled, or any otherwise
destroyed; nor will we not pass upon him, nor condemn him, but by lawful 
judgment of his Peers, or by the <xref 
href="http://en.wikipedia.org/wiki/Law_of_the_land" format="html"
scope="external">Law of the land</xref>. 
We will sell to no man, we will not deny to any man either Justice or Right.</p>

By default, XML considers a single whitespace between words to be significant, and all other whitespace to be insignificant. The paragraph above could have been written on one line because the XML parser would see it as exactly the same paragraph since all multiple consecutive whitespaces will be replaced with a single whitespace. Removing the insignificant space in markup like this is called normalizing space.

In some cases, all the spaces inside an element should be treated as significant. For example, in a code sample:

<codeblock>
class HelloWorld
{
   public static void main(String args[])
   {
      System.out.println("Hello World");
   }
}
</codeblock>

Here every whitespace character between the codeblock tags should be treated as significant.

How Oxygen XML Author determines when whitespace is significant

When Oxygen XML Author formats and indents an XML document, it introduces or removes insignificant whitespace to produce a layout with reasonable line lengths and elements indented to show their place in the hierarchy of the document. To correctly format and indent the XML source, Oxygen XML Author needs to know when to treat whitespace as significant and when to treat it as insignificant. However it is not always possible to tell this from the XML source file alone. To determine what whitespace is significant, Oxygen XML Author assigns each element in the document to one of four categories:

Ignore space

In the ignore space category, all whitespace is considered insignificant. This generally applies to content that consists only of elements nested inside other elements, with no text content.

Normalize space

In the normalize space category, a single whitespace character between character strings is considered significant and all other spaces are considered insignificant. Therefore, all consecutive whitespaces will be replaced with a single space. This generally applies to elements that contain text content only.

Mixed content

In the mixed content category, a single whitespace between text characters is considered significant and all other spaces are considered insignificant. However,

  • Whitespace between two child elements embedded in the text is normalized to a single space (rather than to zero spaces as would normally be the case for a text node with only whitespace characters, or the space between elements generally).

  • The lack of whitespace between a child element embedded in the text and either adjacent text or another child element is considered significant. That is, no whitespace can be introduced here when formatting and indenting the file.

For example:

<p>The file is located in <i>HOME</i>/<i>USER</i>/hello. 
     This is a <strong>big</strong> 

<emphasis>deal</emphasis>.
</p>

In this example, whitespace should not be introduced around the i tags as it would introduce extra significant whitespace into the document. The space between the end </strong> tag and the beginning <emphasis> tag should be normalized to a single space, not zero spaces.

Preserve space

In the preserve space category, all whitespace in the element is regarded as significant. No changes are made to the spaces in elements in this category. However, child elements may be in another category, and may be treated differently.

Attribute values are always in the preserve space category. The spaces between attributes in an element tag are always in the default space category.

Oxygen XML Author consults several pieces of information to assign an element to one of these categories. An element is always assigned to the most restrictive category (from Ignore to Preserve) that it is assigned to by any of the sources Oxygen XML Author consults. For instance, if the element is named on the Default elements list (as described below) but it has an xml:space="preserve" attribute in the source file, it will be assigned to the preserve space category. If an element has the xml:space="default" attribute in the source, but is listed on the Mixed content elements list, it will be assigned to the mixed content category.

To assign elements to these categories, Oxygen XML Author consults information from the following sources:

xml:space
If the XML element contains the xml:space attribute, the element is promoted to the appropriate category based on the value of the attribute.
CSS whitespace property
If the CSS stylesheet controlling the Author mode editor applies the whitespace: pre setting to an element, it is promoted to the preserve space category.
CSS display property
If a text node contains only white-spaces:
  • If the node has a parent element with the CSS display property set to inline then the node is promoted to the mixed content category.
  • If the left or right sibling is an element with the CSS display property set to inline then the node is promoted to the mixed content category.
  • If one of its ancestors is an element with the CSS display property set to table then the node is assigned to the ignore space category.

Schema aware formatting

If a schema is available for the XML document, Oxygen XML Author can use information from the schema to promote the element to the appropriate category. For example:

  • If the schema declares an element to be of type xs:string, the element will be promoted to the preserve space category because the string built-in type has the whitespace facet with the value preserve.

  • If the schema declares an element to be mixed content, it will be promoted to the mixed content category.

Schema aware formatting can be turned on and off.

Preserve space elements list

If an element is listed in the Preserve space tab of the Element Spacing list in the XML formatting preferences, it is promoted to the preserve space category.

Default space elements list

If an element is listed in the Default space tab of the Element Spacing list in the XML formatting preferences, it is promoted to the default space category

Mixed content elements list

If an element is listed in the Mixed content tab of the Element Spacing list in the XML formatting preferences, it is promoted to the mixed content category.

Element content

If an element contains mixed content, that is, a mix of text and other elements, it is promoted to the mixed content category. (Note that, in accordance with these rules, this happens even if the schema declares the element to have element only content.)

If an element contains text content, it is promoted to the default space category.

Text node content
If a text node contains any non-whitespace characters then the text node is promoted to the normalize space category.

Exception to the Rule

In general, a element can only be promoted to a more restrictive category (one that treats more whitespace as significant). However, there is one exception. In Author mode, if an element is marked as mixed content in the schema, but the actual element contains no text content, it can be demoted to the space ignore category if all of its child elements are displayed as blocks by the associated CSS (that is, they have a CSS property of display: block). For example, in some schemas, a section or a table entry can be defined as having mixed content but in many cases they contain only block elements. In these cases, any whitespace they contain cannot be significant and they can be treated as space ignore elements. This exception can be turned on or off using the Schema Aware Editing option in the Schema-Aware preferences page.

How Oxygen XML Author formats and indents XML

You can control how Oxygen XML Author formats and indents XML documents. This can be particularly important if you store your XML document in a version control system, as it allows you to limit the number of trivial changes in spacing between versions of an XML document. The following preference pages include options that control how XML documents are formatted:

When Oxygen XML Author formats and indents XML

Oxygen XML Author formats and indents a document, or part of it, on the following occasions:

  • In Text mode when you select one of the format and indent actions (DocumentSourceFormat and Indent, DocumentSourceIndent Selection, or DocumentSourceFormat and Indent Element).
  • When saving documents in Author mode.
  • When switching from Author mode to another mode.
  • When saving or switching to Text mode from Grid mode, if the Format and indent when passing from grid to text or on save option is selected in the Grid preferences page.

Was this helpful?