Introduction To Xml
- describe XML data model
- extract data from XML documents using XPath
XML stands for eXtensible Markup Language.
Announcements
- Plan to go over next homework assignment for next class
- STAT 128 in Fall 2021: virtual or hybrid?
123 GO: Would you rather do virtual or hybrid for a class like this in Fall?
References
- XPath Syntax by W3 schools
Why should we learn how to extract data from XML?
Or why learn any other data format, for that matter?
<Filer>
<EIN>042888848</EIN>
<BusinessName>
<BusinessNameLine1Txt>FREE SOFTWARE FOUNDATION INC</BusinessNameLine1Txt>
</BusinessName>
<BusinessNameControlTxt>FREE</BusinessNameControlTxt>
<PhoneNum>6175425942</PhoneNum>
<USAddress>
<AddressLine1Txt>51 FRANKLIN STREET SUITE 500</AddressLine1Txt>
<CityNm>BOSTON</CityNm>
<StateAbbreviationCd>MA</StateAbbreviationCd>
<ZIPCd>02110</ZIPCd>
</USAddress>
</Filer>
terms: document, opening and closing tags
XML represents data in a tree like, hierarchical format {.t}
Draw tree of document, nodes, element and text nodes
Many other data formats can represent hierarchical data
{
"EIN": "042888848",
"BusinessName": {
"BusinessNameLine1Txt": "FREE SOFTWARE FOUNDATION INC"
},
"BusinessNameControlTxt": "FREE",
"PhoneNum": "6175425942",
"USAddress": {
"AddressLine1Txt": "51 FRANKLIN STREET SUITE 500",
"CityNm": "BOSTON",
"StateAbbreviationCd": "MA",
"ZIPCd": "02110"
}
}
123 GO- JSON stands for Javascript Object Notation. Why?
If we didn’t know XML, how might we find and extract BOSTON
from inside CityNm
?
<Filer>
<EIN>042888848</EIN>
<BusinessName>
<BusinessNameLine1Txt>FREE SOFTWARE FOUNDATION INC</BusinessNameLine1Txt>
</BusinessName>
<BusinessNameControlTxt>FREE</BusinessNameControlTxt>
<PhoneNum>6175425942</PhoneNum>
<USAddress>
<AddressLine1Txt>51 FRANKLIN STREET SUITE 500</AddressLine1Txt>
---> <CityNm>BOSTON</CityNm>
<StateAbbreviationCd>MA</StateAbbreviationCd>
<ZIPCd>02110</ZIPCd>
</USAddress>
</Filer>
We know command line tools…
If your data contains structure, then use it.
DON’T do this:
$ grep "CityNm" 201932259349302043_public.xml
<CityNm>SOUTHBORO</CityNm>
<CityNm>BOSTON</CityNm>
<CityNm>BOSTON</CityNm>
<CityNm>BOSTON</CityNm>
We might be tempted to ignore all the XML structure and just treat the data as text. This is fine for familiarizing yourself with the data, but BAD for just about anything else.
Why bad?
We actually wanted CityNm
under the Filer
node.
-> <Filer>
<EIN>042888848</EIN>
<BusinessName>
<BusinessNameLine1Txt>FREE SOFTWARE FOUNDATION INC</BusinessNameLine1Txt>
</BusinessName>
<BusinessNameControlTxt>FREE</BusinessNameControlTxt>
<PhoneNum>6175425942</PhoneNum>
<USAddress>
<AddressLine1Txt>51 FRANKLIN STREET SUITE 500</AddressLine1Txt>
---> <CityNm>BOSTON</CityNm>
<StateAbbreviationCd>MA</StateAbbreviationCd>
<ZIPCd>02110</ZIPCd>
</USAddress>
</Filer>
Devil’s advocate: just write a more complicated version of grep
, right?
XPath identifies and navigates nodes in an XML document.
$ xmllint --xpath "//Filer//CityNm/text()" \
201932259349302043_public.xml
BOSTON
Let’s break this command down.
In data analysis, XPath expressions usually contain //
.
With //
:
$ xmllint --xpath "//Filer//CityNm/text()" \
201932259349302043_public.xml
BOSTON
Alternatively, with an absolute (full) path:
$ xmllint --xpath \
"/Return/ReturnHeader/Filer/USAddress/CityNm/text()" \
201932259349302043_public.xml
BOSTON
Suppose we slightly change the structure of the document, say ReturnHeader
becomes Header
.
123 GO: Will the absolute path still find BOSTON
?
grep
is a general purpose tool and XPath is a specific tool {.t}
hammer and drill analogy
Programming languages have packages to process XML using XPath.
For example, with Julia:
using EzXML
doc = readxml("201932259349302043_public.xml")
city_names = findall("//CityNm", doc)
# 4-element Array{EzXML.Node,1}:
# EzXML.Node(<ELEMENT_NODE[CityNm]@0x00007ffe8993d9c0>)
# EzXML.Node(<ELEMENT_NODE[CityNm]@0x00007ffe899549c0>)
# EzXML.Node(<ELEMENT_NODE[CityNm]@0x00007ffe899314c0>)
# EzXML.Node(<ELEMENT_NODE[CityNm]@0x00007ffe89999bd0>)
123 GO - is it better to learn and use XPath, or something programming language specific?
An example of querying the document from julia.
nodecontent(city_names[1])
# "SOUTHBORO"
[nodecontent(n) for n in city_names]
# 4-element Array{String,1}:
# "SOUTHBORO"
# "BOSTON"
# "BOSTON"
# "BOSTON"
Using our xpath above:
fc = findall("//Filer//CityNm/text()", doc)
# 1-element Array{EzXML.Node,1}:
# EzXML.Node(<TEXT_NODE@0x00007ffe89954a40>)
nodecontent(fc[1])
# "BOSTON"