Problems with xml:base in feeds
Tue Nov 3 2009
Another pain in parsing feeds is the xml:base attribute!
It's allowed in Atom feeds, notably in Atom 1.0, and in fact many
websites make use of it.
The problem is that XSLT 1.0 processors do not support the base-uri()
function (it only arrived with XPath 2.0), so if you are using XSLT to
parse the feed you need to look for xml:base yourself: first in the root
<feed> element, then in the <entry> element and then in the <content>
element.
This is because xml:base may appear anywhere in the hierarchy.
It may even appear on more than one element, so you must be sure you are
using the one that is closest to the content element.
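The "closest xml:base wins" lookup can be sketched as a walk up the ancestor chain. This is only an illustration in Python's standard library, not the PHP or XSLT code discussed in this post; the function name nearest_xml_base and the sample feed are made up for the example. It follows the simplified rule described above and does not resolve a relative xml:base against an outer one.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"
ATOM_CONTENT = "{http://www.w3.org/2005/Atom}content"

def nearest_xml_base(root, target):
    """Return the xml:base closest to target, walking up to the root (hypothetical helper)."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        base = node.get(XML_BASE)
        if base is not None:
            return base
        node = parents.get(node)
    return None

# Made-up sample: the first entry overrides the feed-level base, the second does not.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://example.org/">
  <entry xml:base="http://example.org/blog/">
    <content type="html">first</content>
  </entry>
  <entry>
    <content type="html">second</content>
  </entry>
</feed>"""

feed_root = ET.fromstring(SAMPLE_FEED)
contents = feed_root.findall(".//" + ATOM_CONTENT)
first_base = nearest_xml_base(feed_root, contents[0])
second_base = nearest_xml_base(feed_root, contents[1])
```

Here first_base comes from the <entry> and second_base falls back to the <feed> element.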
RSS does not require parsers to support xml:base and relative
paths, but it's recommended that parsers support them anyway:
http://cyber.law.harvard.edu/rss/relativeURI.html
It's quite easy to get the xml:base value that applies to an element when
parsing the feed directly with the DOMDocument class: every DOMNode has the
baseURI property, e.g. $oDom->baseURI. But when parsing with an
XSLT processor it becomes quite tricky.
Also, RSS 2.0 suggests that if xml:base is not defined anywhere in
the feed, then the value of <channel><link> should be used.
This makes things even more complicated, since now $oDom->baseURI
will not work: it only knows about xml:base and has no idea about this
weird way of taking the base URI from the <link> element of the RSS
<channel>.
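The fallback could look roughly like this. Again a Python sketch rather than PHP; rss_base_uri and the sample feeds are invented for the example, and it only checks xml:base on <channel> and <rss> before falling back to <channel><link>.

```python
import xml.etree.ElementTree as ET

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

def rss_base_uri(rss_root):
    """Prefer an explicit xml:base, else use <channel><link> (hypothetical helper)."""
    channel = rss_root.find("channel")
    if channel is None:
        return None
    # An explicit xml:base on <channel> or <rss> wins over the <link> fallback.
    for node in (channel, rss_root):
        base = node.get(XML_BASE)
        if base:
            return base
    link = channel.findtext("link")
    return link.strip() if link else None

# Made-up sample feeds: one without xml:base, one with it on <channel>.
PLAIN_RSS = """<rss version="2.0"><channel>
  <title>News</title>
  <link>http://example.org/news/</link>
</channel></rss>"""

BASED_RSS = """<rss version="2.0"><channel xml:base="http://example.org/assets/">
  <title>News</title>
  <link>http://example.org/news/</link>
</channel></rss>"""

plain_base = rss_base_uri(ET.fromstring(PLAIN_RSS))
explicit_base = rss_base_uri(ET.fromstring(BASED_RSS))
```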
Also, the base URI is only needed if the image or link URLs in the feed
item are relative. This means that you now also have to parse each item,
look for <img> and <a> tags, then
extract the 'src' (or 'href') attribute value and find out whether it
starts with http:// or not.
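A small sketch of that check, in Python with a made-up helper name. Instead of testing for the literal http:// prefix it tests for any URL scheme, which is an assumption on my part but also covers https:// links.

```python
from urllib.parse import urlparse

def is_relative_url(url):
    """True when the URL carries no scheme such as http or https (hypothetical helper)."""
    return urlparse(url.strip()).scheme == ""

checks = {
    "/image1.gif": is_relative_url("/image1.gif"),
    "images/photo.gif": is_relative_url("images/photo.gif"),
    "http://somesite.com/image1.gif": is_relative_url("http://somesite.com/image1.gif"),
}
```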
The problem here is that the content of the feed item (the actual HTML of
the item) is not parsed by the DOM, since it's often enclosed in a CDATA
section.
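To see the effect, here is a tiny Python illustration with an invented <item>. The XML parser hands back the CDATA body as one plain string, so the <img> inside it never shows up as an element.

```python
import xml.etree.ElementTree as ET

# Made-up RSS item whose HTML body is wrapped in CDATA.
ITEM_XML = """<item>
  <title>Example</title>
  <description><![CDATA[<p><img src="/image1.gif"></p>]]></description>
</item>"""

item = ET.fromstring(ITEM_XML)
description = item.find("description")

# The markup inside CDATA is plain text: no child elements were created for it.
child_count = len(description)
embedded_html = description.text
```

child_count is 0 and embedded_html holds the raw markup, which is why a second parsing pass is needed.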
So now you need to extract the HTML from each item's content, then load
it into a new DOMDocument object (which may not be easy and may require
wrapping the content in yet another <div>
tag). Once the content is loaded into the DOMDocument you can parse
it, look for all <img> and <a> tags, find the src or href attribute of
each one and, if it is relative, prepend the base URI
that was extracted earlier.
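That second pass could be sketched like this. It is a Python sketch, not the DOMDocument approach described above: it uses html.parser, which tolerates fragments without a wrapper <div>, and re-emits the markup with rewritten attributes. LinkRewriter and absolutize_links are names invented here. It resolves URLs with urljoin, which follows standard URI resolution, so a path starting with a slash resolves against the host root rather than being appended to the base.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# (tag, attribute) pairs whose values should be made absolute.
URL_ATTRS = {("img", "src"), ("a", "href")}

class LinkRewriter(HTMLParser):
    """Re-emits an HTML fragment, resolving img/src and a/href against a base URI."""

    def __init__(self, base_uri):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.base_uri = base_uri
        self.out = []

    def _emit_tag(self, tag, attrs, closing):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. "disabled"
                parts.append(name)
                continue
            if (tag, name) in URL_ATTRS:
                value = urljoin(self.base_uri, value)  # absolute URLs pass through
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

def absolutize_links(html_fragment, base_uri):
    parser = LinkRewriter(base_uri)
    parser.feed(html_fragment)
    parser.close()
    return "".join(parser.out)

fragment = '<p><img src="images/pic.gif"> <a href="http://other.example/x">x</a></p>'
rewritten = absolutize_links(fragment, "http://somesite.com/assets/")
```

The relative image path gets the base in front of it, while the already absolute link is left alone.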
This is quite complicated already, but to make it more complicated, the
value of xml:base usually ends with a forward slash, like this:
http://somesite.com/assets/
and the relative paths in feed items usually start with a
forward slash, like this: /image1.gif
So when prepending the base URI to a relative path you have to make sure
you don't end up with a double forward slash, but also make
sure you have at least one.
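The slash juggling itself is a one-liner. This Python sketch uses a made-up helper, join_base_and_path, that does the plain concatenation described above. One caveat worth knowing: standard URI resolution (what Python's urljoin does) treats a leading slash as "relative to the host root", so it gives a different answer than concatenation for /image1.gif.

```python
from urllib.parse import urljoin

def join_base_and_path(base_uri, relative_path):
    """Concatenate with exactly one slash between the two parts (hypothetical helper)."""
    return base_uri.rstrip("/") + "/" + relative_path.lstrip("/")

# Plain concatenation, as described in the text.
concatenated = join_base_and_path("http://somesite.com/assets/", "/image1.gif")

# Standard resolution: the leading slash replaces the whole path of the base.
resolved = urljoin("http://somesite.com/assets/", "/image1.gif")
```

concatenated is http://somesite.com/assets/image1.gif while resolved is http://somesite.com/image1.gif, so which one is "right" depends on what the feed's author meant.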
Hopefully the RSS specification will NEVER start requiring support for
the xml:base feature of XML and, more importantly, the users and developers
who are responsible for generating RSS feeds won't use this
feature.
It just doesn't make sense to use it.