SLAXML is a pure-Lua SAX-like streaming XML parser. It is more robust than many (simpler) pattern-based parsers that exist (such as mine), properly supporting code like `<expr test="5 > 7" />`, CDATA nodes, comments, namespaces, and processing instructions.
It is currently not a truly valid XML parser, however, as it allows certain XML that is syntactically-invalid (not well-formed) to be parsed without reporting an error.
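* Supports processing instructions (`<?foo bar?>`).
* Supports comments (`<!-- hello world -->`).
* Supports CDATA sections (`<![CDATA[ whoa <xml> & other content as text ]]>`).
* Supports namespaces (`<foo xmlns="bar">` and `<wrap xmlns:bar="bar"><bar:kittens/></wrap>`).
* Unescapes named XML entities (`&lt; &gt; &amp; &quot; &apos;`) and numeric entities (e.g. `&#10;`) in attributes and text nodes (but, properly, not in comments or CDATA). Properly handles edge cases like `&#38;amp;`.

local SLAXML = require 'slaxml'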
local myxml = io.open('my.xml'):read('*all')
-- Specify as many/few of these as you like
local parser = SLAXML:parser{
  startElement = function(name,nsURI,nsPrefix)       end, -- When "<foo" or <x:foo is seen
  attribute    = function(name,value,nsURI,nsPrefix) end, -- attribute found on current element
  closeElement = function(name,nsURI)                end, -- When "</foo>" or </x:foo> or "/>" is seen
  text         = function(text,cdata)                end, -- text and CDATA nodes (cdata is true for cdata nodes)
  comment      = function(content)                   end, -- comments
  pi           = function(target,content)            end, -- processing instructions e.g. "<?yes mon?>"
}
-- Ignore whitespace-only text nodes and strip leading/trailing whitespace from text
-- (does not strip leading/trailing whitespace from CDATA)
parser:parse(myxml,{stripWhitespace=true})
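Because the parser does not report XML that is not well-formed (for example, `<foo></bar>` produces `startElement("foo")` followed by `closeElement("bar")`), a caller who needs that guarantee could layer a check on top of the callbacks. The following is only a sketch: `balanceChecker` is a hypothetical helper, not part of SLAXML, and the calls at the end simulate the callback sequence by hand instead of running the parser.

```lua
-- balanceChecker is a hypothetical helper, not part of SLAXML.
-- It returns a callbacks table plus the list that collects any errors.
local function balanceChecker()
  local open, errors = {}, {}
  -- Elements are compared by namespace URI and local name together
  local function key(name, nsURI) return (nsURI or '') .. '|' .. name end
  local callbacks = {
    startElement = function(name, nsURI)
      open[#open+1] = { name=name, key=key(name, nsURI) }
    end,
    closeElement = function(name, nsURI)
      local top = table.remove(open)  -- nil if nothing is open
      if not top then
        errors[#errors+1] = 'unexpected </' .. name .. '>'
      elseif top.key ~= key(name, nsURI) then
        errors[#errors+1] = 'expected </' .. top.name .. '> but saw </' .. name .. '>'
      end
    end,
  }
  return callbacks, errors
end

-- Simulate by hand the calls described for <foo></bar>
local callbacks, errors = balanceChecker()
callbacks.startElement('foo')
callbacks.closeElement('bar')
print(errors[1])  --> expected </foo> but saw </bar>
```

With the library available, the same table could be handed to `SLAXML:parser(callbacks)` and parsed as shown above. This sketch does not report elements that are never closed.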
If you just want to see if it will parse your document correctly, you can simply do:
local SLAXML = require 'slaxml'
SLAXML:parse(myxml)
…which will cause SLAXML to use its built-in callbacks that print the results as they are seen.
If you simply want to build tables from your XML, you can alternatively:
local SLAXML = require 'slaxdom' -- also requires slaxml.lua; be sure to copy both files
local doc = SLAXML:dom(myxml)
The returned table is a 'document' composed of tables for elements, attributes, text nodes, comments, and processing instructions. See the following documentation for what each supports.
Document (the root table returned from the `SLAXML:dom()` method):
* `doc.type`: the string `"document"`
* `doc.name`: the string `"#doc"`
* `doc.kids`: an array table of child processing instructions, the root element, and comment nodes
* `doc.root`: the root element for the document

Element:
* `someEl.type`: the string `"element"`
* `someEl.name`: the string name of the element (without any namespace prefix)
* `someEl.nsURI`: the namespace URI for this element; `nil` if no namespace is applied
* `someEl.nsPrefix`: the namespace prefix string; `nil` if no prefix is applied
* `someEl.attr`: a table of attributes, indexed by name and index
  * `local value = someEl.attr['attribute-name']`: any namespace prefix of the attribute is not part of the name
  * `local someAttr = someEl.attr[1]`: a single attribute table (see below); useful for iterating all attributes of an element, or for disambiguating attributes with the same name in different namespaces
* `someEl.kids`: an array table of child elements, text nodes, comment nodes, and processing instructions
* `someEl.el`: an array table of child elements only
* `someEl.parent`: reference to the parent element or document table

Attribute:
* `someAttr.type`: the string `"attribute"`
* `someAttr.name`: the name of the attribute (without any namespace prefix)
* `someAttr.value`: the string value of the attribute (with XML and numeric entities unescaped)
* `someAttr.nsURI`: the namespace URI for the attribute; `nil` if no namespace is applied
* `someAttr.nsPrefix`: the namespace prefix string; `nil` if no prefix is applied
* `someAttr.parent`: reference to the owning element table

Text:
* `someText.type`: the string `"text"`
* `someText.name`: the string `"#text"`
* `someText.cdata`: `true` if the text was from a CDATA block
* `someText.value`: the string content of the text node (with XML and numeric entities unescaped for non-CDATA nodes)
* `someText.parent`: reference to the parent element table

Comment:
* `someComment.type`: the string `"comment"`
* `someComment.name`: the string `"#comment"`
* `someComment.value`: the string content of the comment
* `someComment.parent`: reference to the parent element or document table

Processing instruction:
* `somePI.type`: the string `"pi"`
* `somePI.name`: the string name of the PI, e.g. `<?foo …?>` has a name of `"foo"`
* `somePI.value`: the string content of the PI, i.e. everything but the name
* `somePI.parent`: reference to the parent element or document table

The following function can be used to calculate the "inner text" for an element:
function elementText(el)
  local pieces = {}
  for _,n in ipairs(el.kids) do
    if n.type=='element' then pieces[#pieces+1] = elementText(n)
    elseif n.type=='text' then pieces[#pieces+1] = n.value
    end
  end
  return table.concat(pieces)
end
local xml = [[<p>Hello <em>you crazy <b>World</b></em>!</p>]]
local para = SLAXML:dom(xml).root
print(elementText(para)) --> "Hello you crazy World!"
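The `el` and `parent` properties described above also make other traversals short. As a rough illustration, the sketch below collects every descendant element with a given name. The `element` and `text` constructors and the `findAll` function are hypothetical helpers written for this example; the tables they build are hand-made stand-ins that follow the documented shape, not output from `SLAXML:dom()`.

```lua
-- Hand-built stand-ins for the tables SLAXML:dom() is documented to return
local function element(name, children)
  local node = { type='element', name=name, attr={}, kids={}, el={} }
  for _, child in ipairs(children or {}) do
    child.parent = node
    node.kids[#node.kids+1] = child
    if child.type == 'element' then node.el[#node.el+1] = child end
  end
  return node
end
local function text(value) return { type='text', name='#text', value=value } end

-- Collect every descendant element with the given name, in document order
local function findAll(el, name, found)
  found = found or {}
  for _, child in ipairs(el.el) do
    if child.name == name then found[#found+1] = child end
    findAll(child, name, found)
  end
  return found
end

local list = element('ul', {
  element('li', { text('one') }),
  text('\n'),
  element('li', { text('two'), element('ul', { element('li', { text('nested') }) }) }),
})
local items = findAll(list, 'li')
print(#items)                  --> 3
print(items[3].parent.name)    --> ul
print(items[3].kids[1].value)  --> nested
```

Iterating `el` instead of `kids` skips text, comment, and processing-instruction nodes without a type check.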
If you want the DOM tables to be easier to inspect you can supply the `simple` option via:
local dom = SLAXML:dom(myxml,{ simple=true })
In this case the document will have no `root` property, no table will have a `parent` property, elements will not have the `el` collection, and the `attr` collection will be a simple array (without values accessible directly via attribute name). In short, the output will be a strict hierarchy with no internal references to other tables, and all data represented in exactly one spot.
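Since `attr` is only an array in this mode, looking up an attribute by name means scanning it. A minimal sketch of such a lookup follows; `getAttr` is a hypothetical helper, and the sample element is hand-built on the assumption that each array entry carries the attribute fields documented above (`name`, `value`, `nsURI`, `nsPrefix`).

```lua
-- getAttr is a hypothetical helper, not part of SLAXML.
-- Returns the value of the attribute with this name and namespace URI
-- (nsURI of nil means "no namespace"), or nil if there is none.
local function getAttr(el, name, nsURI)
  for _, attr in ipairs(el.attr) do
    if attr.name == name and attr.nsURI == nsURI then return attr.value end
  end
  return nil
end

-- Hand-built stand-in for an element with two same-named attributes
local e = {
  type='element', name='e',
  attr={
    { type='attribute', name='alpha', value='y' },
    { type='attribute', name='alpha', value='a', nsURI='uri2', nsPrefix='x' },
  },
}
print(getAttr(e, 'alpha'))          --> y
print(getAttr(e, 'alpha', 'uri2'))  --> a
print(getAttr(e, 'beta'))           --> nil
```

The same scan also disambiguates same-named attributes in different namespaces, which a lookup by name alone cannot do.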
You can serialize any DOM table to an XML string by passing it to the `SLAXML:xml()` method:
local SLAXML = require 'slaxdom'
local doc = SLAXML:dom(myxml)
-- ...modify the document...
local xml = SLAXML:xml(doc)
The `xml()` method takes an optional table of options as its second argument:
local xml = SLAXML:xml(doc,{
  indent = 2,    -- each pi/comment/element/text node on its own line, indented by this many spaces
  indent = '\t', -- ...or, supply a custom string to use for indentation
  sort   = true, -- sort attributes by name, with no-namespace attributes coming first
  omit   = {...} -- an array of namespace URIs; removes elements and attributes in these namespaces
})
When using the `indent` option, you likely want to ensure that you parsed your DOM using the `stripWhitespace` option. This will prevent you from having whitespace text nodes between elements that are then placed on their own indented line.
Some examples showing the serialization options:
local xml = [[
<!-- a simple document showing sorting and namespace culling -->
<r c="1" z="3" b="2" xmlns="uri1" xmlns:x="uri2" xmlns:a="uri3">
  <e a:foo="f" x:alpha="a" a:bar="b" alpha="y" beta="beta" />
  <a:wrap><f/></a:wrap>
</r>
]]
local dom = SLAXML:dom(xml, {stripWhitespace=true})
print(SLAXML:xml(dom))
--> <!-- a simple document showing sorting and namespace culling --><r c="1" z="3" b="2" xmlns="uri1" xmlns:x="uri2" xmlns:a="uri3"><e a:foo="f" x:alpha="a" a:bar="b" alpha="y" beta="beta"/><a:wrap><f/></a:wrap></r>
print(SLAXML:xml(dom, {indent=2}))
--> <!-- a simple document showing sorting and namespace culling -->
--> <r c="1" z="3" b="2" xmlns="uri1" xmlns:x="uri2" xmlns:a="uri3">
-->   <e a:foo="f" x:alpha="a" a:bar="b" alpha="y" beta="beta"/>
-->   <a:wrap>
-->     <f/>
-->   </a:wrap>
--> </r>
print(SLAXML:xml(dom.root.kids[2]))
--> <a:wrap><f/></a:wrap>
-- NOTE: you can serialize any DOM table node, not just documents
print(SLAXML:xml(dom.root.kids[1], {indent=2, sort=true}))
--> <e alpha="y" beta="beta" a:bar="b" a:foo="f" x:alpha="a"/>
-- NOTE: attributes with no namespace come first
print(SLAXML:xml(dom, {indent=2, omit={'uri3'}}))
--> <!-- a simple document showing sorting and namespace culling -->
--> <r c="1" z="3" b="2" xmlns="uri1" xmlns:x="uri2">
-->   <e x:alpha="a" alpha="y" beta="beta"/>
--> </r>
-- NOTE: Omitting a namespace omits:
-- * namespace declaration(s) for that space
-- * attributes prefixed for that namespace
-- * elements in that namespace, INCLUDING DESCENDANTS
print(SLAXML:xml(dom, {indent=2, omit={'uri3', 'uri2'}}))
--> <!-- a simple document showing sorting and namespace culling -->
--> <r c="1" z="3" b="2" xmlns="uri1">
-->   <e alpha="y" beta="beta"/>
--> </r>
print(SLAXML:xml(dom, {indent=2, omit={'uri1'}}))
--> <!-- a simple document showing sorting and namespace culling -->
-- NOTE: Omitting namespace for the root element removes everything
Serialization of elements and attributes ignores the `nsURI` property in favor of the `nsPrefix` property. As such, you can construct DOMs that serialize to invalid XML:
local el = {
  type="element",
  nsPrefix="oops", name="root",
  attr={
    {type="attribute", name="xmlns:nope", value="myuri"},
    {type="attribute", nsPrefix="x", name="wow", value="myuri"}
  }
}
print( SLAXML:xml(el) )
--> <oops:root xmlns:nope="myuri" x:wow="myuri"/>
So, if you want to use a `foo` prefix on an element or attribute, be sure to add an appropriate `xmlns:foo` attribute defining that namespace on an ancestor element.
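One way to catch this before serializing is to walk the element tree and compare every `nsPrefix` against the `xmlns:…` attributes visible at that point. The sketch below is illustrative only: `undeclaredPrefixes` is a hypothetical helper, and it assumes a declaration appears either as an attribute named `xmlns:foo` (as in the hand-built table above) or as an attribute with `nsPrefix="xmlns"` and `name="foo"`. How the DOM builder represents parsed declarations is not confirmed here.

```lua
-- undeclaredPrefixes is a hypothetical helper, not part of SLAXML.
-- Returns a list of nsPrefix values used in the element subtree
-- without a declaration on the element itself or an ancestor.
local function undeclaredPrefixes(node, declared, missing)
  declared = declared or { xml = true }  -- the xml prefix needs no declaration
  missing  = missing or {}
  if node.type ~= 'element' then return missing end

  -- Copy the inherited set so declarations do not leak to siblings
  local scope = {}
  for prefix in pairs(declared) do scope[prefix] = true end

  -- Record declarations made on this element (both assumed forms)
  for _, attr in ipairs(node.attr or {}) do
    local prefix = attr.name:match('^xmlns:(.+)$')
    if not prefix and attr.nsPrefix == 'xmlns' then prefix = attr.name end
    if prefix then scope[prefix] = true end
  end

  -- Check the element's own prefix, then its attributes' prefixes
  if node.nsPrefix and not scope[node.nsPrefix] then
    missing[#missing+1] = node.nsPrefix
  end
  for _, attr in ipairs(node.attr or {}) do
    local p = attr.nsPrefix
    if p and p ~= 'xmlns' and not scope[p] then missing[#missing+1] = p end
  end

  for _, kid in ipairs(node.kids or {}) do
    undeclaredPrefixes(kid, scope, missing)
  end
  return missing
end

local el = {
  type="element", nsPrefix="oops", name="root",
  attr={
    {type="attribute", name="xmlns:nope", value="myuri"},
    {type="attribute", nsPrefix="x", name="wow", value="myuri"}
  }
}
print(table.concat(undeclaredPrefixes(el), ', '))  --> oops, x
```

The function expects an element table (for a document, pass `doc.root`); any other node type yields an empty list.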
Known limitations:
* Does not require or enforce well-formed XML; certain syntax errors are accepted without an error. For example:
  * `foo="yes & no"` is seen as a valid attribute
  * `<foo></bar>` invokes `startElement("foo")` followed by `closeElement("bar")`
  * `<foo> 5 < 6 </foo>` is seen as valid text contents
* Expands only the standard XML entities (`&lt; &gt; &quot; &apos; &amp;`) and numeric entities (e.g. `&#10;` or `&#x3c;`)
* XML declarations (`<?xml version="1.x"?>`) are incorrectly reported as Processing Instructions
* Does not check that the `xml` prefix is never redefined to an illegal namespace
* Does not check that the `xmlns` prefix is never used as an element prefix

History:
* Adds `SLAXML:xml()` to serialize the DOM back to XML.
* Adds `nsPrefix` properties to the DOM tables for elements and attributes (needed for round-trip serialization).
* Removes the `doc.root` key from DOM when `simple=true` is specified.
* Supports hexadecimal numeric entities (e.g. `&#x3c;`). (Thanks Leorex/Ben Bishop)
* The `xml` prefix may be used without pre-declaring it. (Thanks David Durkee.)
* `<foo xmlns="bar">` now directly generates `startElement("foo","bar")` with no post callback for `namespace` required.
* Uses the `local SLAXML=require 'slaxml'` pattern to prevent any pollution of the global namespace.
* Allows empty attributes, e.g. `foo=""`
* `closeElement` no longer includes namespace prefix in the name, includes the nsURI
* DOM tables have `.parent` references
* `SLAXML.ignoreWhitespace` is now `:parse(xml,{stripWhitespace=true})`
* Namespace support for elements and attributes:
  * `<foo xmlns="barURI">` will call `startElement("foo",nil)` followed by `namespace("barURI")` (and then `attribute("xmlns","barURI",nil)`); you must apply the namespace to your element after creation.
  * Child elements that inherit that namespace receive `startElement("child","barURI")`
  * `<xy:foo>` will call `startElement("foo","uri-for-xy")`
  * `<foo xy:bar="yay">` will call `attribute("bar","yay","uri-for-xy")`
* Expands numeric entities, e.g. `&#34;` -> `"`
Copyright © 2013 Gavin Kistner
Licensed under the MIT License. See LICENSE.txt for more details.