I have been toying with the idea of a specialized regular expression syntax for XML. Often, the regular expression questions that people email to me indicate that they are using them on XML and HTML. There is a very nice discussion of part of the problem here on Joe Gregorio’s blog. On that blog, he considers solving the problem by trying to limit XML, but I think the problem may be that the Regex needs to be adapted.
Ideally, I’d like to write a pattern as simple as <a>(.*)</a> and match it against an XML document, retrieving the text inside an “a” tag.
How should this work?
- CDATA sections: If we have a Regex parser that works on a stream, the stream reader can disentangle this bit for us.
- Comments: We don’t want to match on a tag if it is in a comment. The pattern <a>.*</a> should know how to identify and ignore comments. Of course, we might want to match something inside a comment. If that’s the case, then our pattern should explicitly say so. It should look like this: <!--.*<a>.*</a>.*-->
- Matching nested blocks: Consider <root><a>bar</a><a><b><a>foo</a></b></a></root>. Obviously neither <a>.*</a> nor <a>.*?</a> is quite the right thing if we expand .* according to standard regex rules. Our XML regex could be taught to understand what tags are, and how they nest.
- If I write a regex to match <a>.*</a> I want it to match <a x='smile'>foo</a>. If we want to require that a certain attribute be present, we could specify it as follows: <a x='.*'>.*</a>. If we want an attribute to not be present, we could write this: <a x != '.*'>.*</a>. If we want the tag to have no attributes, we could write this: <a '.*'!='.*'>.*</a>
- I should not need to specify the quote character. Matches should work regardless of whether single, double, or no quotes (HTML) are used. Escaped quotes could be used to identify the quote character if I do care: <a x=\"foo\">.*</a>
- Extra white space should not matter: <a> should be the same as <a >.
- Flags could be provided for ignoring prefixes, allowing <a>.*</a> to match on <root:a>narf</root:a>
- When searching text, it should be possible to ignore <b>, <i>, <em>, <font> tags that are mixed into the text. I want foo to match against <b>f</b>oo
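Several of the bullets above (optional attributes, any quote style, stray white space, prefixes, inline formatting tags) can be roughly approximated by translating the XML-flavored pattern into an ordinary regex. The sketch below is only that: tag_pattern and strip_inline_tags are hypothetical helper names made up for illustration, the attribute grammar is simplified, and the lazy body still knows nothing about nesting, comments, or CDATA.

```python
import re

# Simplified attribute grammar (an assumption, not the full XML/HTML rules):
# name, optionally followed by = and a double-quoted, single-quoted,
# or unquoted value.
_ATTRS = r"""(?:\s+[\w:.-]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?)*"""


def tag_pattern(name, ignore_prefix=False):
    """Hypothetical helper: build a plain regex for <name>...</name> that
    tolerates attributes, any quote style, and extra white space."""
    # Optionally allow a namespace-style prefix such as "root:".
    prefix = r"(?:[\w.-]+:)?" if ignore_prefix else ""
    open_tag = rf"<{prefix}{re.escape(name)}{_ATTRS}\s*>"
    close_tag = rf"</{prefix}{re.escape(name)}\s*>"
    # Lazy body: this does NOT handle nested <name> elements correctly.
    return re.compile(open_tag + r"(.*?)" + close_tag, re.DOTALL)


def strip_inline_tags(text, tags=("b", "i", "em", "font")):
    """Hypothetical helper: drop inline formatting tags so that a search
    for foo can see through <b>f</b>oo."""
    names = "|".join(re.escape(t) for t in tags)
    return re.sub(rf"</?(?:{names})\b[^>]*>", "", text)


if __name__ == "__main__":
    a = tag_pattern("a")
    print(a.search("<a x='smile'>foo</a>").group(1))
    print(a.search("<a x=smile >foo</a >").group(1))
    prefixed = tag_pattern("a", ignore_prefix=True)
    print(prefixed.search("<root:a>narf</root:a>").group(1))
    print(strip_inline_tags("<b>f</b>oo"))
```

Translating to a standard regex keeps the sketch short, but it also shows the limit of the approach: anything that depends on document structure needs the matcher itself to understand tags.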
What else would a tool like this need?
Is it necessary? Are existing tools like XQuery and Beautiful Soup sufficient?
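As one data point on that question, here is a small comparison on the nested example from the list. It uses Python's standard-library xml.etree.ElementTree as a stand-in for the tools named above (that substitution is mine, not a claim about XQuery or Beautiful Soup), and regex_texts and tree_texts are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

# The nested example from the list above.
DOC = "<root><a>bar</a><a><b><a>foo</a></b></a></root>"


def regex_texts(pattern, doc):
    """Return whatever the capture group of a plain regex grabs."""
    return re.findall(pattern, doc)


def tree_texts(doc, tag):
    """Parse the document and return the text content of every <tag>
    element, in document order, including text of nested children."""
    root = ET.fromstring(doc)
    return ["".join(el.itertext()) for el in root.iter(tag)]


if __name__ == "__main__":
    # Greedy: one match running from the first <a> to the last </a>.
    print(regex_texts(r"<a>(.*)</a>", DOC))
    # Lazy: stops at the first </a>, cutting the nested block in half.
    print(regex_texts(r"<a>(.*?)</a>", DOC))
    # Tree-aware: one entry per <a> element.
    print(tree_texts(DOC, "a"))
```

The parser gets nesting right, and by default it also drops comments, but the query is no longer a one-line pattern. That gap is what the proposed syntax is trying to close.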