LampCMS development and php tips

Announcements and articles about this site development and stuff like that

problems with xml:base in feed
Tue Nov 3 2009
Another pain in parsing the feed is the xml:base thingy!

It's allowed in Atom feeds (the Atom 1.0 specification explicitly supports it), and in fact many websites make use of the xml:base attribute.

The problem is that XSLT 1.0 processors do not support the base-uri() function (it only arrived with XPath 2.0), so if you are using XSL to parse the XML feed, you need to look for xml:base yourself: in the root <feed> element, then in the <entry> element, and then in the <content> element.

This is because the xml:base may appear anywhere in the hierarchy.

Also, it may even appear in more than one tag, so you must be sure you are using the one that is closest to the content tag.
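
As a rough sketch of the "closest one wins" lookup, the XPath 1.0 expression below walks up from a node and picks the nearest element that carries xml:base. It uses only XPath 1.0 features, so the same expression should also be usable from an XSLT 1.0 stylesheet; here it is run through PHP's DOMXPath. The sample feed and the example.com URLs are made up for illustration, and the expression returns the raw attribute value only (it does not resolve a relative xml:base against an outer one).

```php
<?php
// Hypothetical sample feed: xml:base is set on <feed> and on one <content>.
$xml = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://example.com/">
  <entry>
    <content type="html" xml:base="http://example.com/blog/">text</content>
  </entry>
  <entry>
    <content type="html">text</content>
  </entry>
</feed>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// ancestor-or-self is a reverse axis, so [1] selects the nearest element
// (starting with the context node itself) that has an xml:base attribute.
$nearestBase = 'string(ancestor-or-self::*[@xml:base][1]/@xml:base)';

$contents = $doc->getElementsByTagName('content');
$first  = $xpath->evaluate($nearestBase, $contents->item(0));
$second = $xpath->evaluate($nearestBase, $contents->item(1));

echo $first, "\n";   // http://example.com/blog/ (set on <content> itself)
echo $second, "\n";  // http://example.com/ (inherited from <feed>)
```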

The RSS specification does not require parsers to support xml:base and relative paths, but it's recommended that parsers support them anyway.

http://cyber.law.harvard.edu/rss/relativeURI.html

It's quite easy to extract the value of xml:base of a tag when parsing the feed directly with the DOMDocument class: the DOMNode has the property baseURI: $oDom->baseURI, but when parsing with an XSLT processor, it may become quite tricky.
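
For illustration, here is a minimal sketch of reading baseURI from a node. The feed string and the example.com URL are invented for the example; the <content> element has no xml:base of its own, so the value is expected to come from the <feed> element above it.

```php
<?php
// Hypothetical feed: only the root <feed> element declares xml:base.
$xml = '<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://example.com/assets/">'
     . '<entry><content type="html">text</content></entry>'
     . '</feed>';

$oDom = new DOMDocument();
$oDom->loadXML($xml);

// baseURI is a read-only property available on every DOMNode.
$content = $oDom->getElementsByTagName('content')->item(0);
$base = $content->baseURI;

echo $base, "\n";
```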

Also, RSS 2.0 suggests that if xml:base is not defined anywhere in the feed, then the value of <channel><link> should be used.

This makes things even more complicated, since now $oDom->baseURI will not work: it only looks for xml:base and has no idea about this weird way of extracting the base URI from the <link> tag of a <channel> element.
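
One possible way to handle this fallback is sketched below: look for the nearest xml:base first, and only if none is found use the channel's <link>. The function name rssBaseUri and the sample feed are hypothetical, and this is just one way to order the two lookups.

```php
<?php
/**
 * Hypothetical helper: work out the base URI for an RSS 2.0 <item>.
 * Prefers the nearest xml:base, otherwise falls back to <channel><link>.
 */
function rssBaseUri(DOMElement $item): string
{
    $xpath = new DOMXPath($item->ownerDocument);

    // Nearest xml:base on the item or any of its ancestors, if present.
    $base = $xpath->evaluate(
        'string(ancestor-or-self::*[@xml:base][1]/@xml:base)',
        $item
    );
    if ($base !== '') {
        return $base;
    }

    // No xml:base anywhere above the item: use the channel's <link> text.
    return trim($xpath->evaluate('string(ancestor::channel[1]/link)', $item));
}

// Made-up feed without any xml:base attribute.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Example</title>
    <link>http://example.com/</link>
    <item><title>One</title><description>text</description></item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($rss);
$item = $doc->getElementsByTagName('item')->item(0);

echo rssBaseUri($item), "\n"; // http://example.com/
```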

Also, the baseURI is only needed if the image or link tags in the feed item are relative. This means that now you also have to parse each item, look for <img> and <a> tags, then extract the 'src' or 'href' attribute value and find out if it starts with http:// or not.

The problem here is that the content of the feed item (the actual HTML of the item) is not parsed by the DOM, since it's often enclosed in a CDATA section.

So now you need to extract the HTML from each item's content, then load it into a new DOMDocument object (which may not be easy and may require wrapping the content in yet another <div> tag). Once the content is loaded into DOMDocument you can parse it, look for all <img> and <a> tags, then find the src or href attribute of each one and possibly prepend it with the baseURI that was extracted earlier.
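
The steps above could look roughly like the sketch below. The function name absolutizeHtml is made up, the wrapper <div> is the trick mentioned above, and the "is it relative?" check is a simple assumption (anything that does not start with a scheme, // or # is treated as relative). Character encoding is not handled here, so non-ASCII content would need extra care.

```php
<?php
/**
 * Hypothetical helper: rewrite relative src/href values inside an HTML
 * fragment so they are prefixed with $baseUri.
 */
function absolutizeHtml(string $html, string $baseUri): string
{
    $doc = new DOMDocument();

    // Feed HTML is often sloppy, so keep libxml warnings quiet while loading.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The first <div> in document order is the wrapper added above.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    foreach (['img' => 'src', 'a' => 'href'] as $tag => $attr) {
        foreach ($wrapper->getElementsByTagName($tag) as $node) {
            $url = $node->getAttribute($attr);
            // Skip empty values, absolute URLs (scheme:), //host and #anchor.
            if ($url === '' || preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $url)) {
                continue;
            }
            // Join with exactly one slash between base and path.
            $node->setAttribute($attr, rtrim($baseUri, '/') . '/' . ltrim($url, '/'));
        }
    }

    // Serialize only the wrapper's children, dropping the wrapper itself.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$fixed = absolutizeHtml(
    '<p><img src="/image1.gif"> <a href="http://other.example/x">x</a></p>',
    'http://somesite.com/assets/'
);
echo $fixed, "\n";
```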

This is quite complicated already, but to make it more complicated, the value of xml:base usually ends with a forward slash, like this: http://somesite.com/assets/
and then the relative paths in feed items usually start with the forward slash like this: /image1.gif

So now, when prepending the baseURI to a relative path, you have to make sure you don't end up with double forward slashes, but also make sure you have at least one.
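
A small sketch of that slash handling, assuming plain string prepending as described here (this is not full URI resolution, and the function name prependBase is made up):

```php
<?php
/**
 * Hypothetical helper: join a base URI and a relative path with exactly
 * one forward slash between them.
 */
function prependBase(string $baseUri, string $path): string
{
    // Strip slashes on both sides of the seam, then add a single one back.
    return rtrim($baseUri, '/') . '/' . ltrim($path, '/');
}

echo prependBase('http://somesite.com/assets/', '/image1.gif'), "\n";
// http://somesite.com/assets/image1.gif
echo prependBase('http://somesite.com/assets', 'image1.gif'), "\n";
// http://somesite.com/assets/image1.gif
```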

Hopefully the RSS specification will NEVER start requiring support for the xml:base feature of XML and, more importantly, the users and developers who are responsible for generating RSS feeds won't use this feature.

It just doesn't make sense to use it.