Reading XML

  • Jonathan Hartwell
Chapter

Abstract

Reading XML is the cornerstone of handling XML in any application. If your application is unable to read XML, then you won't be able to do much. There are several ways to read XML, and this chapter will give you an insight into what methods are available to you.

Using XmlDocument

The XmlDocument class was the first way to handle reading and writing XML using the .NET Framework with C# and is included in the System.Xml namespace. With XmlDocument you can not only read XML but also manipulate and write XML, which will be covered in later chapters.

To start, we will need an XML document to demonstrate with. The following example is a small database of books and movies in our imaginary library:

Library.xml

<?xml version="1.0"?>
<library>
        <books>
                <book checkedout="no">
                        <title>To Kill a Mockingbird</title>
                        <author>Harper Lee</author>
                </book>
                <book checkedout="no">
                        <title>Pride and Prejudice</title>
                        <author>Jane Austen</author>
                </book>
                <book checkedout="yes">
                        <title>The Great Gatsby</title>
                        <author>F. Scott Fitzgerald</author>
                </book>
                <book checkedout="no">
                        <title>1984</title>
                        <author>George Orwell</author>
                </book>
        </books>
        <movies>
                <movie checkedout="no">
                        <title>King Kong</title>
                        <year>1933</year>
                </movie>
                <movie checkedout="yes">
                        <title>King Kong</title>
                        <year>2005</year>
                </movie>
                <movie checkedout="yes">
                        <title>To Kill A Mockingbird</title>
                        <year>1962</year>
                </movie>
                <movie checkedout="no">
                        <title>The Green Mile</title>
                        <year>1999</year>
                </movie>
        </movies>
</library>

To be able to do anything with this XML document, we first need to load the XML into an XmlDocument instance. There are two ways to do this: by file and by string.

Loading XML from a File

XmlDocument document = new XmlDocument();
document.Load("library.xml");

Loading XML from a String

XmlDocument document = new XmlDocument();
string xml = "<input>test</input>";
document.LoadXml(xml);
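
As a minimal sketch of the string-based approach, the following console program loads a small inline XML string and also shows what happens when the string is not well formed. The inline XML and the use of Console.WriteLine instead of the RichTextBox are assumptions made so that the sketch runs on its own.

```csharp
using System;
using System.Xml;

// Hypothetical inline XML so the sketch runs without library.xml on disk.
string xml = "<library><books><book checkedout=\"no\"><title>1984</title></book></books></library>";

XmlDocument document = new XmlDocument();
document.LoadXml(xml);

// DocumentElement is the root element of the loaded document.
string rootName = document.DocumentElement?.Name ?? "";
Console.WriteLine(rootName);

// XML that is not well formed makes LoadXml throw an XmlException.
bool failed = false;
try
{
    new XmlDocument().LoadXml("<library><books></library>");
}
catch (XmlException ex)
{
    failed = true;
    Console.WriteLine("Could not parse XML: " + ex.Message);
}
```

Catching XmlException around Load or LoadXml is one way to keep a bad input file from crashing the application.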

Once we have a file loaded, we can begin reading from the contents, which can be done in multiple ways.

Searching with XPath

Think of using XPath as having random read access to the XML document. It can be used to retrieve a single node or a collection of nodes. To select multiple nodes, use the SelectNodes method. Going back to the library.xml example, we can use SelectNodes to return all the books. To get at the books, however, the proper XPath query is needed.

Starting at the top level, there is the library node, so the XPath must start with library. From there, the books child node contains all of the books that are in the library, which means that the books node is going to be appended to the XPath to give us library/books. That XPath alone will give the books node, including all of the children, but that is one step above what we want to get at so we append book to the XPath query to finally give us the query library/books/book.

XmlDocument document = new XmlDocument();
document.Load("library.xml");
XmlNodeList books = document.SelectNodes("library/books/book");
foreach (XmlNode book in books)
{
    richTextBox1.AppendText(book.OuterXml + Environment.NewLine);
}

The above code creates an instance of XmlDocument named document and loads the library.xml file into it. Once the XML is loaded, SelectNodes is used since we are looking for multiple book nodes instead of a specific one. If there were only one book node in the document, SelectNodes would still return an XmlNodeList, containing that single element. The OuterXml property is used to get the XML of the current node, including the node’s own tags. There are other properties that get various XML content from an XmlNode, which will be covered later in this chapter. This example writes to a RichTextBox in a Windows Forms application, which is included in the downloadable source, and prints the XML content of each book on a new line, as shown in Figure 2-1.
Figure 2-1.

Output from XML Viewer application based on the code above
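
The same pattern can be sketched as a console program. This version assumes a shortened inline copy of the library and, instead of printing OuterXml, runs a second XPath query relative to each book node to read just the title.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Shortened inline copy of the library (an assumption for this sketch).
string xml = @"<library>
  <books>
    <book checkedout=""no""><title>To Kill a Mockingbird</title><author>Harper Lee</author></book>
    <book checkedout=""yes""><title>The Great Gatsby</title><author>F. Scott Fitzgerald</author></book>
  </books>
</library>";

XmlDocument document = new XmlDocument();
document.LoadXml(xml);

List<string> titles = new List<string>();
XmlNodeList books = document.SelectNodes("library/books/book");
foreach (XmlNode book in books)
{
    // This XPath is evaluated relative to the current book node.
    XmlNode title = book.SelectSingleNode("title");
    titles.Add(title.InnerText);
}
Console.WriteLine(string.Join(", ", titles));
```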

With the code above, you can modify the XPath query to get an XML list of any of the nodes. For instance, if we want to get a list of the movies that are available, then we modify the XPath to library/movies/movie. That would give us the following (see Figure 2-2):
Figure 2-2.

The XML Viewer results of selecting all movie nodes

XmlDocument document = new XmlDocument();
document.Load("library.xml");
XmlNodeList movies = document.SelectNodes("library/movies/movie");
foreach (XmlNode movie in movies)
{
    richTextBox1.AppendText(movie.OuterXml + Environment.NewLine);
}

If you know that you are only looking for a single element, then you can use the SelectSingleNode method. This method takes an XPath expression and returns a single XmlNode. No matter how many nodes match the query, it returns at most one node: the first match, or null when nothing matches. If we execute the code below using our library.xml file, we will get a single book that looks like Figure 2-3.
Figure 2-3.

Single book output produced by SelectSingleNode

XmlDocument document = new XmlDocument();
document.Load("library.xml");
XmlNode book = document.SelectSingleNode("//book");
richTextBox1.AppendText(book.InnerXml);

You’ll notice that the output above returns the first book that is under the books element. SelectSingleNode will take whatever element it sees first and return that for the given XPath expression. There are ways to get a specific node though using a query such as the following:

XmlNode book = document.SelectSingleNode("/library/books/book[title = 'The Great Gatsby']");

Unlike the query above, this one does not simply return the first book element. The predicate in square brackets filters on the title child element, so the query returns the book that has the title The Great Gatsby.
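
A short sketch of a filtered SelectSingleNode call, including the case where nothing matches. The inline XML and the title Dracula, which is deliberately absent from the library, are assumptions for illustration.

```csharp
using System;
using System.Xml;

// Shortened inline copy of the library (an assumption for this sketch).
string xml = "<library><books>"
    + "<book checkedout=\"no\"><title>To Kill a Mockingbird</title></book>"
    + "<book checkedout=\"yes\"><title>The Great Gatsby</title></book>"
    + "</books></library>";

XmlDocument document = new XmlDocument();
document.LoadXml(xml);

// The predicate in square brackets filters on the title child element.
XmlNode found = document.SelectSingleNode("/library/books/book[title = 'The Great Gatsby']");

// No book has this title, so SelectSingleNode returns null.
XmlNode missing = document.SelectSingleNode("/library/books/book[title = 'Dracula']");

Console.WriteLine(found != null ? found.OuterXml : "not found");
Console.WriteLine(missing != null ? missing.OuterXml : "not found");
```

Checking the result for null before using it avoids a NullReferenceException when the query has no match.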

Search Using Attributes

Up until now, the focus has been solely on searching based on elements, but attributes can be searched as well. For instance, let’s say somebody asks us to find all books and movies that are not checked out. It is an easy feat if we just use SelectNodes with XPath.

Explicitly Finding Movies and Books That Are Not Checked Out

XmlDocument document = new XmlDocument();
document.Load("library.xml");
XmlNodeList movies = document.SelectNodes("library/books/book[@checkedout='no'] | library/movies/movie[@checkedout='no']");

The above code uses the attribute notation to find both movies and books that are not checked out. In order to search for both, it is necessary to add a pipe between the two XPath queries. You can think of it like the double pipe OR statement in C#. This notation is useful when there are many different children of the root element and you want to filter it down. If we had CDs in this library, we would be able to use the above code to only find books and movies and ignore the CDs.
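
To make the CD scenario concrete, here is a small sketch that adds a hypothetical cds element to a shortened inline library and runs the union query against it. The cds and cd element names are assumptions and are not part of library.xml.

```csharp
using System;
using System.Xml;

// Inline library with a hypothetical cds section added for illustration.
string xml = "<library>"
    + "<books><book checkedout=\"no\"><title>1984</title></book>"
    + "<book checkedout=\"yes\"><title>The Great Gatsby</title></book></books>"
    + "<movies><movie checkedout=\"no\"><title>The Green Mile</title></movie></movies>"
    + "<cds><cd checkedout=\"no\"><title>Hypothetical Album</title></cd></cds>"
    + "</library>";

XmlDocument document = new XmlDocument();
document.LoadXml(xml);

// The pipe combines the results of the two queries; cd elements match neither.
XmlNodeList available = document.SelectNodes(
    "library/books/book[@checkedout='no'] | library/movies/movie[@checkedout='no']");

int cdCount = 0;
foreach (XmlNode item in available)
{
    if (item.Name == "cd") cdCount++;
    Console.WriteLine(item.Name + ": " + item.SelectSingleNode("title").InnerText);
}
```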

The problem with the above code is that it is very verbose, especially if there are many children of the node that you are searching under. There is an easier way to find this information without having to type out every single possibility and that is to use the star operator and recursive search in XPath.

Using the Recursive Search and Star

XmlDocument document = new XmlDocument();
document.Load("library.xml");
XmlNodeList movies = document.SelectNodes("library//*[@checkedout='no']");

The above code contains two different shortcuts. First, there is the double slash. This is a way to get all children recursively. For this library XML file, this means it would look not only at the books and movies elements but also the children of those nodes. This gives us access to all elements under the library node. Be careful when using the double slash in your XPath queries as it will select all elements regardless of the type of element. If any other elements were added to the library element, those elements would be included in the results as well.

The second shortcut that is in the code above is the star. The star is a shortcut that ignores the element type. What that means is that it treats an element of book the same as the movie element. It is extremely useful when you have several different children or grandchildren under a single element and want to search all of them. We don’t have to use the | operator to combine multiple queries, which drastically cuts down on the amount of code that is needed.
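
The caution about the double slash can be sketched with a hypothetical cds section added to a shortened inline library. Because the star ignores the element name, the cd element is selected along with the book and the movie.

```csharp
using System;
using System.Xml;

// Inline library with a hypothetical cds section added for illustration.
string xml = "<library>"
    + "<books><book checkedout=\"no\"><title>1984</title></book>"
    + "<book checkedout=\"yes\"><title>The Great Gatsby</title></book></books>"
    + "<movies><movie checkedout=\"no\"><title>The Green Mile</title></movie></movies>"
    + "<cds><cd checkedout=\"no\"><title>Hypothetical Album</title></cd></cds>"
    + "</library>";

XmlDocument document = new XmlDocument();
document.LoadXml(xml);

// Every descendant of library with checkedout='no' matches, whatever its name.
XmlNodeList available = document.SelectNodes("library//*[@checkedout='no']");
foreach (XmlNode item in available)
{
    Console.WriteLine(item.Name);
}
```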

The previous example used attributes to search, but it is also possible to inspect what attributes are on an element by using the Attributes property. Attributes will return an XmlAttributeCollection, which can be iterated on.

XmlDocument doc = new XmlDocument();
doc.Load("library.xml");
XmlNodeList books = doc.SelectNodes("//book");
foreach(XmlNode book in books)
{
    var attributes = book.Attributes;
    foreach(XmlAttribute attr in attributes)
    {
        richTextBox1.AppendText(attr.Value + Environment.NewLine);
    }
}

In the code above, the iteration follows the same pattern we have used with XmlNode instances so far, with one difference: the use of Value. For an XmlAttribute, the Value property returns only the text that is surrounded by the quotes for that attribute, which makes it the most direct way to read an attribute. When running the code above, you will get the output in Figure 2-4.
Figure 2-4.

The checkedout attribute for all of the books in the library
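
As a small sketch, the attribute collection can also be indexed by name when you already know which attribute you want. The shelf attribute below is hypothetical and only added so the loop has more than one attribute to show.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// The shelf attribute is a hypothetical addition for this sketch.
string xml = "<book checkedout=\"no\" shelf=\"A3\"><title>1984</title></book>";

XmlDocument document = new XmlDocument();
document.LoadXml(xml);
XmlNode book = document.SelectSingleNode("/book");

// Name is the attribute name; Value is the text between the quotes.
List<string> pairs = new List<string>();
foreach (XmlAttribute attr in book.Attributes)
{
    pairs.Add(attr.Name + "=" + attr.Value);
}
Console.WriteLine(string.Join(", ", pairs));

// Indexing by name returns null when the attribute does not exist.
XmlAttribute checkedOut = book.Attributes["checkedout"];
XmlAttribute isbn = book.Attributes["isbn"];
Console.WriteLine(checkedOut.Value);
Console.WriteLine(isbn == null ? "no isbn attribute" : isbn.Value);
```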

Handling Namespaces

Up until now, there only has been straight XML with no namespaces required. While this may happen when you have full control of the XML, chances are that you will encounter namespaces and will need to know how to handle them when it comes to using the XmlDocument class. Namespaces are useful when it comes to preventing collisions with names and so the XmlDocument must take that into consideration.

<?xml version="1.0"?>
<library xmlns:network="http://www.library.com">
        <books>
                <book checkedout="no">
                        <title>To Kill a Mockingbird</title>
                        <author>Harper Lee</author>
                </book>
                <book checkedout="no">
                        <title>Pride and Prejudice</title>
                        <author>Jane Austen</author>
                </book>
                <book checkedout="yes">
                        <title>The Great Gatsby</title>
                        <author>F. Scott Fitzgerald</author>
                </book>
                <book checkedout="no">
                        <title>1984</title>
                        <author>George Orwell</author>
                </book>
        </books>
        <movies>
                <movie checkedout="no">
                        <title>King Kong</title>
                        <year>1933</year>
                </movie>
                <movie checkedout="yes">
                        <title>King Kong</title>
                        <year>2005</year>
                </movie>
                <movie checkedout="yes">
                        <title>To Kill A Mockingbird</title>
                        <year>1962</year>
                </movie>
                <movie checkedout="no">
                        <title>The Green Mile</title>
                        <year>1999</year>
                </movie>
        </movies>
</library>

As we have added a new namespace to our library.xml example, we need to load the namespace into our XmlDocument:

Adding a Namespace to the XmlDocument before Loading XML

XmlDocument document = new XmlDocument();
XmlNamespaceManager namespaceManager = new XmlNamespaceManager(document.NameTable);
namespaceManager.AddNamespace("network", "http://www.library.com");
document.Load("library-network.xml");

In order to resolve a namespace prefix, an XmlNamespaceManager needs to be created from the document’s name table, which is of type XmlNameTable and is exposed through the NameTable property of the XmlDocument. The XmlNamespaceManager is the class that keeps track of the namespaces and allows us to add or remove them; the URI passed to AddNamespace must match the URI declared in the XML exactly. The manager is not attached to the document itself. Instead, it is passed as the second argument to SelectNodes or SelectSingleNode whenever the XPath query uses a prefix. Removing a namespace is done with the RemoveNamespace method of XmlNamespaceManager, which takes the same arguments as AddNamespace.
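
A minimal sketch of a query that actually uses the prefix. The network:branch element is hypothetical; the library example only declares the namespace, so an element is placed in it here to have something to select.

```csharp
using System;
using System.Xml;

// Hypothetical document: the branch element lives in the network namespace.
string xml = "<library xmlns:network=\"http://www.library.com\">"
    + "<network:branch>Downtown</network:branch>"
    + "<books><book checkedout=\"no\"><title>1984</title></book></books>"
    + "</library>";

XmlDocument document = new XmlDocument();
document.LoadXml(xml);

XmlNamespaceManager namespaceManager = new XmlNamespaceManager(document.NameTable);
namespaceManager.AddNamespace("network", "http://www.library.com");

// The manager is passed to the query so the network prefix can be resolved.
XmlNode branch = document.SelectSingleNode("/library/network:branch", namespaceManager);

// Elements without a prefix can still be queried without the manager.
XmlNode title = document.SelectSingleNode("/library/books/book/title");

Console.WriteLine(branch.InnerText);
Console.WriteLine(title.InnerText);
```

The prefix used in the XPath query is looked up in the manager, so it only has to map to the same URI as the one declared in the document.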

Using XPathDocument

The XPathDocument class is similar to the XmlDocument class, but the difference is that, unlike XmlDocument, XPathDocument is read only. It is excellent for reading XML when you have no intention of modifying the data. The XPathDocument class relies on two separate classes to do the actual querying: XPathNavigator and XPathNodeIterator.

The XPathNavigator class is what is used to actually query the XML. It allows the use of XPath queries or generic methods that allow you to get at elements and attributes without having to know any XPath.

There are two steps involved in using the XPathNavigator. First, an XPathDocument needs to be instantiated; then that instance is used to create the XPathNavigator. Once the XPathNavigator is created, you can query the XML.

Create XPathDocument and XPathNavigator Instances

XPathDocument xPathDocument = new XPathDocument("library.xml");
XPathNavigator navigator = xPathDocument.CreateNavigator();

The XPathNavigator is only the first step into being able to read and query XML. To query the document, one must create an XPathNodeIterator. The XPathNodeIterator will provide access to all the elements under the root element.

Iterating on the Children of the Root Element

XPathDocument xpathDocument = new XPathDocument("library.xml");
XPathNavigator navigator = xpathDocument.CreateNavigator();
XPathNodeIterator iterator = navigator.SelectChildren(XPathNodeType.Element);
while (iterator.MoveNext())
{
    richTextBox1.AppendText(iterator.Current.Value);
}
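
The navigator also accepts XPath queries through its Select method, which returns the same kind of iterator. The sketch below assumes a shortened inline copy of the library, loaded through a StringReader so that it runs without a file on disk.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

// Shortened inline copy of the library (an assumption for this sketch).
string xml = "<library><books>"
    + "<book checkedout=\"no\"><title>To Kill a Mockingbird</title></book>"
    + "<book checkedout=\"yes\"><title>The Great Gatsby</title></book>"
    + "<book checkedout=\"no\"><title>1984</title></book>"
    + "</books></library>";

XPathDocument xPathDocument = new XPathDocument(new StringReader(xml));
XPathNavigator navigator = xPathDocument.CreateNavigator();

// Select takes an XPath expression and returns an iterator over the matches.
XPathNodeIterator iterator = navigator.Select("/library/books/book[@checkedout='no']/title");

List<string> titles = new List<string>();
while (iterator.MoveNext())
{
    // Current is a navigator positioned on the matching node.
    titles.Add(iterator.Current.Value);
}
Console.WriteLine(string.Join(", ", titles));
```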

Reading with XmlReader

XmlReader is different from the other XML handling classes that we have used as it is stream-based. What that means for us is that it will only operate going forward and prevents querying. XmlReader is a good option if you are handling large XML files and don’t care about random access to the elements in the XML document. Because XmlReader uses streams to load the XML document, you can read in files that are too large for the XmlDocument. XmlReaders require much more setup than the other classes we’ve looked at prior, but because of their ability to handle large data it is more than worth it. We can start with a basic example:

Creating an XmlReader and Reading the Library File

StreamReader xmlStream = new StreamReader("library.xml");
using (XmlReader reader = XmlReader.Create(xmlStream))
{
    while(reader.Read())
    {
        richTextBox1.AppendText(reader.Value + Environment.NewLine);
    }
}

The output of this will present a quite different view from what we have seen already (Figure 2-5).
Figure 2-5.

The values of the XML elements from the library.xml that has gone through the XmlReader

Notice that there is a lot of blank space and that only values are shown. The reason for that is the way the XmlReader handles the underlying XML stream. Remember that it is a forward-only stream, and each call to Read moves to the next node. Since we used the Value property, the output only shows text for nodes that have a value. But why the blank space? Each blank line represents a node that did not have a value, such as an element’s start or end tag. This is where XmlReader becomes more complicated than the other methods of reading: it hands us every node in turn, regardless of its type. The reader does store the type information, but we must check it manually through the NodeType property, which is an XmlNodeType enumeration value. So far the output has contained only values, so let’s create a reader that handles both elements and values.

Handling Both Elements and Values

StreamReader xmlStream = new StreamReader("library.xml");
using (XmlReader reader = XmlReader.Create(xmlStream))
{
    while(reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                richTextBox1.AppendText(reader.Name + Environment.NewLine);
                break;
            case XmlNodeType.Text:
                richTextBox1.AppendText(reader.Value + Environment.NewLine);
                break;
            case XmlNodeType.EndElement:
                richTextBox1.AppendText(reader.Name + Environment.NewLine);
                break;
            default:
                break;
        }
    }
}

The Element node type is handled as well as the EndElement node type. There is a good reason that you may want to include both. If you only handle Element, you will get the element name and then the value of that element, followed by the next element. On the other hand, if you handle both Element and EndElement, it will give you Figure 2-6.
Figure 2-6.

Output of the XmlReader once we put checks in for specific type being read

Remember when I said there was a reason you would want both? The reason is that when you want to write out the XML that you are reading from the XmlReader stream, you need to know where each element ends as well as where it begins.
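
Attributes can be read with XmlReader as well. The sketch below, which assumes a shortened inline copy of the library, uses GetAttribute while the reader is positioned on an element and keeps a little state between calls to Read, since the reader cannot look back.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Shortened inline copy of the library (an assumption for this sketch).
string xml = "<library>"
    + "<books><book checkedout=\"no\"><title>1984</title></book>"
    + "<book checkedout=\"yes\"><title>The Great Gatsby</title></book></books>"
    + "<movies><movie checkedout=\"yes\"><title>King Kong</title><year>2005</year></movie></movies>"
    + "</library>";

List<string> checkedOutTitles = new List<string>();
using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
{
    string currentElement = "";
    bool checkedOut = false;
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                currentElement = reader.Name;
                // Only book and movie carry attributes in this document.
                if (reader.HasAttributes)
                {
                    checkedOut = reader.GetAttribute("checkedout") == "yes";
                }
                break;
            case XmlNodeType.Text:
                // Keep the text only when it belongs to a title of a checked-out item.
                if (currentElement == "title" && checkedOut)
                {
                    checkedOutTitles.Add(reader.Value);
                }
                break;
        }
    }
}
Console.WriteLine(string.Join(", ", checkedOutTitles));
```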

Using LINQ to XML

XmlDocument has been part of the .NET Framework since its earliest releases, and the classes in System.Xml remained the main way to handle XML until .NET 3.5 was released. In the .NET 3.5 release, we saw the introduction of LINQ (Language Integrated Query) and the advent of LINQ to XML. LINQ to XML is now the preferred method of handling XML, so let’s dive in using the library XML example from the earlier sections.

XDocument is LINQ to XML’s equivalent of the XmlDocument. The nature of LINQ gives XDocument a whole different feel, but don’t worry because you can still fall back on XPath. For instance, the following code will load an XDocument and use LINQ to get the values of every book.

// XDocument.Load is a static method that returns a new, populated XDocument.
XDocument doc = XDocument.Load("library.xml");
IEnumerable<XElement> books = doc.Descendants().Where(x => x.Name == "book");
foreach(XElement book in books)
{
    richTextBox1.AppendText(book.Value + Environment.NewLine);
}

This approach gives a much cleaner interface, and it is clear what is happening even to those who aren’t well versed in XPath. That being said, the Value property of the XElement book is not as clear as one might think. Instead of giving XML, it concatenates the text of all of the element’s descendants, as seen in Figure 2-7.
Figure 2-7.

Output from the LINQ to XML query
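
As a sketch of how to avoid the concatenated output, the Element method can be used to read one child at a time. The example assumes a shortened inline copy of the library and uses the Descendants overload that takes an element name, which filters in the same way as the Where clause above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Shortened inline copy of the library (an assumption for this sketch).
string xml = "<library><books>"
    + "<book checkedout=\"no\"><title>1984</title><author>George Orwell</author></book>"
    + "<book checkedout=\"yes\"><title>The Great Gatsby</title><author>F. Scott Fitzgerald</author></book>"
    + "</books></library>";

XDocument doc = XDocument.Parse(xml);

// Value on the book element concatenates the text of title and author.
string combined = doc.Descendants("book").First().Value;
Console.WriteLine(combined);

// Element("title") selects a single child, so Value is just the title text.
List<string> titles = doc.Descendants("book")
    .Select(book => book.Element("title").Value)
    .ToList();
Console.WriteLine(string.Join(", ", titles));
```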

Document vs. Document.Root: Getting to the Root Element

There are two ways to get at the root element using XDocument: using the instantiated XDocument class directly or using the Root property on the instantiated class. There is a subtle difference, which is that the Root property is an XElement instead of an XDocument. Because of that, you are able to use all of the methods that you would normally get with XElement, working with the root element directly.

We used the XDocument in the example above because the XDocument allows access to the descendants of the root element, which is what we needed in order to traverse the XML structure and find the book elements. If we had used the Root property, it would have allowed us to not only get descendants but to add elements as well as get to the attributes of the root element.
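
The difference can be sketched with a tiny inline document. The name attribute on the library element is hypothetical and only added so that there is a root attribute to read.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// The name attribute is a hypothetical addition for this sketch.
XDocument doc = XDocument.Parse(
    "<library name=\"Hypothetical Branch\"><books /><movies /></library>");

// The only child element of the document itself is the root element.
List<string> documentChildren = doc.Elements().Select(e => e.Name.LocalName).ToList();

// Root is an XElement, so its children and attributes are directly available.
XElement root = doc.Root;
List<string> rootChildren = root.Elements().Select(e => e.Name.LocalName).ToList();
string branchName = (string)root.Attribute("name");

Console.WriteLine(string.Join(", ", documentChildren));
Console.WriteLine(string.Join(", ", rootChildren));
Console.WriteLine(branchName);
```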

Searching for Attributes

We saw how to search using attributes when handling XML with XmlDocument and how it required XPath to get at the attributes relatively easily; that same task becomes much easier with LINQ to XML. Let’s say we want to find all the elements that have a checkedout attribute. We can use a LINQ expression to find them.

List<XElement> elementsWithAttributes = document.Descendants().Where(x => x.HasAttributes && x.Attribute("checkedout") != null).ToList();

There is a drawback to using LINQ, which is that it is much more verbose than using straight XPath. With XDocument we could still use XPath to get the same results. Where LINQ shines is when you have more complex queries that may be difficult to read in XPath or require more in-depth knowledge of XPath. For instance, we could look for all movie and book elements, as we did earlier with XPath, by using LINQ instead.

List<XElement> booksOrMovies = document.Descendants().Where(x => x.Name == "movie" || x.Name == "book").ToList();

Or we could retrieve movies that were released between certain years. For instance, let’s look for movies that were released between 1990 and 2015. That would be an incredibly complex XPath query that would be horrible to maintain in the future. On the other hand, by using LINQ and XDocument it becomes a simple where clause to filter out the unwanted titles. We can do all that in the following code:

List<XElement> movies = document.Descendants()
    .Where(x => x.Name == "year" && (int.Parse(x.Value) >= 1990 && int.Parse(x.Value) <= 2015))
    .Select(x => x.Parent).ToList();

Several things in the above code may need an explanation, either because we have not seen them before or because they may not be intuitive. First off, we are searching for the year element instead of directly for a movie element. This allows us to easily get at the value of that element, which is the year the movie was made, instead of having to filter down even more based on the movie element’s children. That approach is only feasible because of the Parent property, which returns the parent XElement of the current XElement. In this case, the movie element is the parent of the year element, so we can get back up to the movie element after we are done filtering. We are also parsing the year into an integer with int.Parse; this is not recommended in production code, as it throws an exception when the value is not a valid number. I am doing this here for demonstration purposes.
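
One possible way to make the parsing safer is int.TryParse, which reports failure instead of throwing. The sketch below assumes a shortened inline copy of the movies and adds a hypothetical entry whose year is not a number to show that it is simply skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// The last movie is a hypothetical entry with a year that cannot be parsed.
string xml = "<library><movies>"
    + "<movie><title>King Kong</title><year>1933</year></movie>"
    + "<movie><title>King Kong</title><year>2005</year></movie>"
    + "<movie><title>The Green Mile</title><year>1999</year></movie>"
    + "<movie><title>Untitled</title><year>unknown</year></movie>"
    + "</movies></library>";

XDocument document = XDocument.Parse(xml);

// TryParse returns false for "unknown", so that element is filtered out.
List<XElement> movies = document.Descendants("year")
    .Where(y => int.TryParse(y.Value, out int year) && year >= 1990 && year <= 2015)
    .Select(y => y.Parent)
    .ToList();

foreach (XElement movie in movies)
{
    Console.WriteLine(movie.Element("title").Value + " (" + movie.Element("year").Value + ")");
}
```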

Transforming Results

LINQ to XML allows us to use all of the extension methods that come with LINQ, which gives us access to the Select method. This method can allow us to transform our results into a different class or anonymous class. We have a way to get to the books in our XML library, but we haven’t done anything with them yet. That is about to change. We are going to capture the information about the books and put them in a C# class called Book that is defined below.

class Book
{
    public string Author { get; set; }
    public string Title { get; set; }
}
List<Book> books = document.Descendants()
    .Where(x => x.Name == "book")
    .Select(x => new Book()
        { Title = x.Element("title").Value,
          Author = x.Element("author").Value
        }).ToList();

The above code filters the XML document down to just book elements and then transforms the title and author into a new Book class that we had defined. Notice that the Element method is used instead of Descendants because we know that the title and author elements are children of the book element, rather than grandchildren, so we don’t need to go any deeper.
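
The anonymous class variant mentioned earlier can be sketched as follows. The property names Title and CheckedOut are chosen for this example, and the explicit string conversions are used because they return null rather than throwing when an element or attribute is missing.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Shortened inline copy of the library (an assumption for this sketch).
string xml = "<library><books>"
    + "<book checkedout=\"no\"><title>1984</title><author>George Orwell</author></book>"
    + "<book checkedout=\"yes\"><title>The Great Gatsby</title><author>F. Scott Fitzgerald</author></book>"
    + "</books></library>";

XDocument document = XDocument.Parse(xml);

// Each book is projected into an anonymous type with two properties.
var summaries = document.Descendants("book")
    .Select(b => new
    {
        Title = (string)b.Element("title"),
        CheckedOut = (string)b.Attribute("checkedout") == "yes"
    })
    .ToList();

foreach (var summary in summaries)
{
    Console.WriteLine(summary.Title + " checked out: " + summary.CheckedOut);
}
```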

Using XPath with XDocument

As mentioned before, it is possible to use XPath with XDocument, though not recommended. For instance, we could use XPath to get a list of all movies:

List<XElement> movies = document.XPathSelectElements("//movie").ToList();

One thing to note is that this does not return XmlNode like the XmlDocument does, but instead returns XElement. The XPath methods are extension methods, so a using directive for the System.Xml.XPath namespace is required. There is also the XPathSelectElement method, which is equivalent to the XmlDocument SelectSingleNode. Just like SelectSingleNode, if you use an XPath query that matches multiple elements, only the first element is returned.

XElement movie = document.XPathSelectElement("//movie");
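
Put together as a runnable sketch, assuming a shortened inline copy of the movies, the two XPath extension methods look like this. Note the using directive for System.Xml.XPath, without which the methods are not visible on XDocument.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath; // brings XPathSelectElements and XPathSelectElement into scope

// Shortened inline copy of the library (an assumption for this sketch).
string xml = "<library><movies>"
    + "<movie checkedout=\"no\"><title>King Kong</title><year>1933</year></movie>"
    + "<movie checkedout=\"yes\"><title>King Kong</title><year>2005</year></movie>"
    + "<movie checkedout=\"yes\"><title>To Kill A Mockingbird</title><year>1962</year></movie>"
    + "</movies></library>";

XDocument document = XDocument.Parse(xml);

// All movie elements whose checkedout attribute is yes.
List<XElement> checkedOutMovies = document.XPathSelectElements("//movie[@checkedout='yes']").ToList();

// Several movies match, but only the first one is returned.
XElement firstMovie = document.XPathSelectElement("//movie");

Console.WriteLine(checkedOutMovies.Count);
Console.WriteLine(firstMovie.Element("year").Value);
```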

In the end, LINQ to XML allows for easy querying of data from an XML document, and in a more consistent format as well.

Copyright information

© Jonathan Hartwell 2017

Authors and Affiliations

  • Jonathan Hartwell, Joliet, USA
