Introduction To Xml

2 minute read

describe XML data model
extract data from XML documents using XPath

XML stands for eXtensible Markup Language.

Announcements

Plan to go over next homework assignment for next class
STAT 128 in Fall 2021: virtual or hybrid?

123 GO: Would you rather do virtual or hybrid for a class like this in Fall?

References

XPath Syntax by W3 schools

Why should we learn how to extract data from XML?

Or why learn any other data format, for that matter?

    <Filer>
      <EIN>042888848</EIN>
      <BusinessName>
        <BusinessNameLine1Txt>FREE SOFTWARE FOUNDATION INC</BusinessNameLine1Txt>
      </BusinessName>
      <BusinessNameControlTxt>FREE</BusinessNameControlTxt>
      <PhoneNum>6175425942</PhoneNum>
      <USAddress>
        <AddressLine1Txt>51 FRANKLIN STREET SUITE 500</AddressLine1Txt>
        <CityNm>BOSTON</CityNm>
        <StateAbbreviationCd>MA</StateAbbreviationCd>
        <ZIPCd>02110</ZIPCd>
      </USAddress>
    </Filer>

terms: document, opening and closing tags

XML represents data in a tree like, hierarchical format {.t}

Draw tree of document, nodes, element and text nodes

Many other data formats can represent hierarchical data

{
   "EIN": "042888848",
   "BusinessName": {
      "BusinessNameLine1Txt": "FREE SOFTWARE FOUNDATION INC"
   },
   "BusinessNameControlTxt": "FREE",
   "PhoneNum": "6175425942",
   "USAddress": {
      "AddressLine1Txt": "51 FRANKLIN STREET SUITE 500",
      "CityNm": "BOSTON",
      "StateAbbreviationCd": "MA",
      "ZIPCd": "02110"
   }
}

123 GO- JSON stands for Javascript Object Notation. Why?

If we didn’t know XML, how might we find and extract `BOSTON` from inside `CityNm`?

    <Filer>
      <EIN>042888848</EIN>
      <BusinessName>
        <BusinessNameLine1Txt>FREE SOFTWARE FOUNDATION INC</BusinessNameLine1Txt>
      </BusinessName>
      <BusinessNameControlTxt>FREE</BusinessNameControlTxt>
      <PhoneNum>6175425942</PhoneNum>
      <USAddress>
        <AddressLine1Txt>51 FRANKLIN STREET SUITE 500</AddressLine1Txt>
--->    <CityNm>BOSTON</CityNm>
        <StateAbbreviationCd>MA</StateAbbreviationCd>
        <ZIPCd>02110</ZIPCd>
      </USAddress>
    </Filer>

We know command line tools…

If your data contains structure, then use it.

DON’T do this:

$ grep "CityNm" 201932259349302043_public.xml

        <CityNm>SOUTHBORO</CityNm>
        <CityNm>BOSTON</CityNm>
        <CityNm>BOSTON</CityNm>
          <CityNm>BOSTON</CityNm>

We might be tempted to ignore all the XML structure and just treat the data as text. This is fine for familiarizing yourself with the data, but BAD for just about anything else.

Why bad?

We actually wanted `CityNm` under the `Filer` node.

->  <Filer>
      <EIN>042888848</EIN>
      <BusinessName>
        <BusinessNameLine1Txt>FREE SOFTWARE FOUNDATION INC</BusinessNameLine1Txt>
      </BusinessName>
      <BusinessNameControlTxt>FREE</BusinessNameControlTxt>
      <PhoneNum>6175425942</PhoneNum>
      <USAddress>
        <AddressLine1Txt>51 FRANKLIN STREET SUITE 500</AddressLine1Txt>
--->    <CityNm>BOSTON</CityNm>
        <StateAbbreviationCd>MA</StateAbbreviationCd>
        <ZIPCd>02110</ZIPCd>
      </USAddress>
    </Filer>

Devil’s advocate: just write a more complicated version of grep, right?

XPath identifies and navigates nodes in an XML document.

$ xmllint --xpath "//Filer//CityNm/text()" \
    201932259349302043_public.xml 

BOSTON

Let’s break this command down.

In data analysis, XPath expressions usually contain `//`.

With //:

$ xmllint --xpath "//Filer//CityNm/text()" \
    201932259349302043_public.xml 

BOSTON

Alternatively, with an absolute (full) path:

$ xmllint --xpath \
    "/Return/ReturnHeader/Filer/USAddress/CityNm/text()" \
    201932259349302043_public.xml 

BOSTON

Suppose we slightly change the structure of the document, say ReturnHeader becomes Header.

123 GO: Will the absolute path still find BOSTON?

`grep` is a general purpose tool and XPath is a specific tool {.t}

hammer and drill analogy

Programming languages have packages to process XML using XPath.

For example, with Julia:

using EzXML

doc = readxml("201932259349302043_public.xml")

city_names = findall("//CityNm", doc)

#  4-element Array{EzXML.Node,1}:
#   EzXML.Node(<ELEMENT_NODE[CityNm]@0x00007ffe8993d9c0>)
#   EzXML.Node(<ELEMENT_NODE[CityNm]@0x00007ffe899549c0>)
#   EzXML.Node(<ELEMENT_NODE[CityNm]@0x00007ffe899314c0>)
#   EzXML.Node(<ELEMENT_NODE[CityNm]@0x00007ffe89999bd0>)

123 GO - is it better to learn and use XPath, or something programming language specific?

An example of querying the document from julia.

nodecontent(city_names[1])

#  "SOUTHBORO"

[nodecontent(n) for n in city_names]

#  4-element Array{String,1}:
#   "SOUTHBORO"
#   "BOSTON"
#   "BOSTON"
#   "BOSTON"

Using our xpath above:

fc = findall("//Filer//CityNm/text()", doc)

#   1-element Array{EzXML.Node,1}:
#    EzXML.Node(<TEXT_NODE@0x00007ffe89954a40>)

nodecontent(fc[1])

#   "BOSTON"

Twitter Facebook LinkedIn

Introduction To Xml

Announcements

References

Why should we learn how to extract data from XML?

XML represents data in a tree like, hierarchical format {.t}

Many other data formats can represent hierarchical data

If we didn’t know XML, how might we find and extract `BOSTON` from inside `CityNm`?

If your data contains structure, then use it.

We actually wanted `CityNm` under the `Filer` node.

XPath identifies and navigates nodes in an XML document.

In data analysis, XPath expressions usually contain `//`.

`grep` is a general purpose tool and XPath is a specific tool {.t}

Programming languages have packages to process XML using XPath.

An example of querying the document from julia.

Using our xpath above:

You May Also Enjoy

Diversity Inclusivity Statement

General Student Advice

Homework Covid Database

Introduction Sql

Announcements

References

Why should we learn how to extract data from XML?

XML represents data in a tree like, hierarchical format {.t}

Many other data formats can represent hierarchical data

If we didn’t know XML, how might we find and extract BOSTON from inside CityNm?

If your data contains structure, then use it.

We actually wanted CityNm under the Filer node.

XPath identifies and navigates nodes in an XML document.

In data analysis, XPath expressions usually contain //.

grep is a general purpose tool and XPath is a specific tool {.t}

Programming languages have packages to process XML using XPath.

An example of querying the document from julia.

Using our xpath above:

You May Also Enjoy

Diversity Inclusivity Statement

General Student Advice

Homework Covid Database

Introduction Sql

If we didn’t know XML, how might we find and extract `BOSTON` from inside `CityNm`?

We actually wanted `CityNm` under the `Filer` node.

In data analysis, XPath expressions usually contain `//`.

`grep` is a general purpose tool and XPath is a specific tool {.t}