LampCMS development and php tips

Announcements and articles about this site's development and stuff like that

More on feed parsing
Tue Nov 10 2009
Again, parsing RSS feeds is not always easy.

Recently I needed a 'bullet-proof' class that could parse just about any feed and return a consistent object with items. Each item must have a guid, title, link and body, and optionally 'author' and 'categories'.

I could not find such a class. The PEAR class was good but not good enough: I could parse RSS 2 with it but was having problems with Atom feeds.

That's the main problem with parsing feeds: there are at least 3 major formats (Atom, RSS 2 and RSS 1), and each one has several minor versions. Also, RSS 1.1 is somewhat different from RSS 1 and not at all like RSS 2.

On top of that, some feed providers like FeedBurner (now owned by Google) have some important rules about how you should access their feeds. In particular, you must send standard HTTP request headers just like a browser does, or else they will forward your requests to their slow backup servers, where feeds may not be the most up-to-date.
This means that when you receive the feed you must parse not only the feed itself but also the HTTP response headers, extract the values of 'ETag' and 'Last-Modified', and store them somewhere so that next time you can include these values in your request headers ('If-None-Match' for the ETag value and 'If-Modified-Since' for the Last-Modified value).
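
Here is a minimal sketch of that bookkeeping. The function names buildConditionalHeaders() and extractValidators() are made up for this example, the saved values are assumed to live in a plain array, and the browser-like header values are only placeholders:

```php
<?php
// Hypothetical helpers for illustration; the names are not from any library.

/**
 * Build request headers for a conditional GET from the validators
 * saved after the previous fetch.
 */
function buildConditionalHeaders(array $saved): array
{
    $headers = [
        // Browser-like headers; the exact values here are an assumption.
        'User-Agent' => 'Mozilla/5.0 (compatible; ExampleFeedReader/1.0)',
        'Accept'     => 'application/atom+xml, application/rss+xml, application/xml;q=0.9, */*;q=0.8',
    ];
    if (!empty($saved['etag'])) {
        $headers['If-None-Match'] = $saved['etag'];
    }
    if (!empty($saved['last_modified'])) {
        $headers['If-Modified-Since'] = $saved['last_modified'];
    }
    return $headers;
}

/**
 * Extract ETag and Last-Modified from raw response header lines.
 * Header names are matched case-insensitively.
 */
function extractValidators(array $headerLines): array
{
    $found = ['etag' => null, 'last_modified' => null];
    foreach ($headerLines as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue; // status line or malformed header
        }
        $name  = strtolower(trim($parts[0]));
        $value = trim($parts[1]);
        if ($name === 'etag') {
            $found['etag'] = $value;
        } elseif ($name === 'last-modified') {
            $found['last_modified'] = $value;
        }
    }
    return $found;
}
```

The array returned by extractValidators() is what you would store next to the feed URL and pass back into buildConditionalHeaders() on the next request.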

But that's not even the whole story.
Some feed providers will do a wicked thing: if the server determines that the feed has not changed since the date in your 'If-Modified-Since' request header, it will send you an HTTP 200 response code, but the body of the reply will be empty!
This is totally not the standard way to do this. The server is supposed to send you a 304 (Not Modified) response in such cases.

This brings us to another aspect of parsing a feed: you need to actually examine the HTTP reply codes and look out for the 304 code, meaning the feed has not changed since your last request. Also look out for the dreaded 'redirect' headers.
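
To make the cases concrete, here is a rough sketch of how a script might decide what to do with a response. The function name and the string labels it returns are invented for this example; it also treats the non-standard "200 with an empty body" reply the same way as a 304:

```php
<?php
// Hypothetical helper for illustration; the name and return labels are made up.

/**
 * Decide what to do with a feed response.
 * Returns one of: 'not_modified', 'redirect', 'parse', 'gone', 'error'.
 */
function classifyFeedResponse(int $status, string $body): string
{
    if ($status === 304) {
        return 'not_modified';
    }
    // Non-standard servers: 200 with an empty body also means "no change".
    if ($status === 200 && trim($body) === '') {
        return 'not_modified';
    }
    if (in_array($status, [301, 302, 303, 307, 308], true)) {
        return 'redirect';
    }
    if ($status === 200) {
        return 'parse';
    }
    if ($status === 404 || $status === 410) {
        return 'gone';
    }
    return 'error';
}
```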

I say dreaded because there are many different reply codes that mean 'redirect' (301, 302, 303 and 307, to name a few), and you should also keep track of the number of redirects and have your script give up after about 3 or 4 of them to prevent a very long and possibly infinite redirect loop.
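
A redirect counter could look something like the sketch below. Everything here is an assumption made for the example: $fetch stands in for whatever HTTP client you use and is expected to return an array with 'status', 'location' and 'body' keys, and the exception class name is made up. Relative 'Location' values are not resolved, to keep it short:

```php
<?php
// Hypothetical sketch; $fetch stands in for whatever HTTP client is used.

class TooManyRedirectsException extends RuntimeException {}

/**
 * Follow redirects up to $maxRedirects times.
 * $fetch is a callable taking a URL and returning
 * ['status' => int, 'location' => ?string, 'body' => string].
 * Note: relative Location values are not resolved in this sketch.
 */
function fetchFollowingRedirects(callable $fetch, string $url, int $maxRedirects = 4): array
{
    $redirects = 0;
    while (true) {
        $response   = $fetch($url);
        $isRedirect = in_array($response['status'], [301, 302, 303, 307, 308], true);
        if (!$isRedirect || empty($response['location'])) {
            // Final answer; also report the URL we ended up at.
            return $response + ['final_url' => $url];
        }
        if (++$redirects > $maxRedirects) {
            throw new TooManyRedirectsException("Gave up after $maxRedirects redirects");
        }
        $url = $response['location'];
    }
}
```

If the final URL differs from the one you started with after a permanent (301) redirect, it probably makes sense to store the new URL and use it next time.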

Lastly, some servers ignore the 'If-Modified-Since' and 'If-None-Match' (ETag) request headers and will serve you the latest version of the feed every time, even if the feed has not been modified. Actually ALL the major forum software packages like vBulletin, phpBB and many others don't bother parsing the 'If-Modified-Since' header, which is totally wrong: parsing these headers could save a lot of time and server resources.

But what do you do if the server always sends a feed, even if it's the same feed? Sure, you can still parse it and then compare the individual guids against the guids that you have already parsed and stored somewhere in your database. But it would be a waste of time to even load the feed into DOMDocument if you know that the feed has not changed.

So how do you determine if the feed has changed? Simply use the crc32() function to get the crc32 value of the raw feed and store it in your database, then compare the crc32 of each newly downloaded feed against the most recent value in your database. If they are the same, don't parse the file at all.

If they are not the same, record the new crc32 value in your database and parse the feed, and remember that you still need to compare each item's guid against the ones that you have already parsed.
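
Roughly, the check could look like this. The function names are made up, and the stored checksum and stored guids are passed in as plain values instead of coming from a real database. The sketch uses hash('crc32b', ...) instead of crc32() itself: it is the same checksum, but returned as a hex string, which sidesteps the negative integers crc32() can return on 32-bit builds:

```php
<?php
// Hypothetical helpers; in real use the stored values would come from the database.

/**
 * Checksum of the raw feed body as an 8-character hex string.
 * hash('crc32b', ...) is the same algorithm as crc32(), but returns
 * a string, which avoids signed-integer surprises on 32-bit builds.
 */
function feedChecksum(string $rawFeed): string
{
    return hash('crc32b', $rawFeed);
}

/** True when there is no stored checksum yet or the feed bytes differ. */
function feedHasChanged(string $rawFeed, ?string $storedChecksum): bool
{
    return $storedChecksum === null || feedChecksum($rawFeed) !== $storedChecksum;
}

/** Guids from the feed that are not in the list of already stored guids. */
function newGuids(array $feedGuids, array $storedGuids): array
{
    return array_values(array_diff($feedGuids, $storedGuids));
}
```

Keep in mind that the checksum changes whenever any byte of the feed changes, even if the items are the same, so the guid comparison is still the final word on what is new.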

On top of that, there is a major pain with different encodings and HTML entities that may be included (incorrectly) inside the feeds. Some feeds may declare the XML encoding to be UTF-8 while the server sends the encoding as ASCII or, even worse, an encoding that is not compatible with UTF-8. This means your class must have a method to detect non-UTF-8 input and re-encode it the best it can.
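
A very small sketch of such a method, assuming the mbstring extension is available. The function name is made up, and the fallback encoding (ISO-8859-1) is just a guess for the example; a real class would first look at the HTTP 'Content-Type' header and the XML declaration:

```php
<?php
// Hypothetical helper; the fallback encoding is an assumption, not something the feed tells us.

/**
 * Return the feed as valid UTF-8.
 * If the bytes are already valid UTF-8 they are returned untouched;
 * otherwise they are converted from $fallback.
 */
function ensureUtf8(string $raw, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}
```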

In order to be able to set the correct HTTP request headers I use the HttpRequest class from pecl_http. I extended it to make my own class that handles various HTTP reply codes and throws custom exceptions that represent redirects or '404 Not Found' replies. This is also an important part of parsing feeds: sometimes a site may just not be available, or may have moved, so you need to stop requesting feeds from it.

In order to parse just about any feed format and transform it into an object that has consistent data, I used PHP's XSLTProcessor and custom XSL templates, one for each major feed type. The result is that the feed type is automatically detected, the correct XSL template is used, and the feed is transformed into a standard DOMDocument object (an extended version of it).
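
The detection step can be sketched by looking at the root element of the feed. This is only an illustration, not my actual class: the function name and the labels it returns are made up, every feed with an 'rss' root element is lumped together as 'rss2', and RSS 1.1 is not handled at all. The label would then be used to pick the matching XSL template:

```php
<?php
// Hypothetical helper; the return values are labels made up for this example.

/**
 * Detect the major feed format from the root element.
 * Returns 'atom', 'rss2', 'rss1' or null when the format is not recognised.
 */
function detectFeedType(string $xml): ?string
{
    if (trim($xml) === '') {
        return null;
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $loaded = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded || $doc->documentElement === null) {
        return null;
    }
    $root = $doc->documentElement;
    if ($root->localName === 'feed' && $root->namespaceURI === 'http://www.w3.org/2005/Atom') {
        return 'atom';
    }
    if ($root->localName === 'rss') {
        return 'rss2'; // also matches older 0.9x feeds, which share the root element
    }
    if ($root->localName === 'RDF' && $root->namespaceURI === 'http://www.w3.org/1999/02/22-rdf-syntax-ns#') {
        return 'rss1';
    }
    return null;
}
```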

From that object I can actually get the items and each item's body, title, links, etc. I even made this object implement Iterator so I can just use foreach($o as $item).
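
To show the idea, here is a stripped-down stand-in for such an object. It is not my real class: the class name is made up, the items are plain arrays, and it uses IteratorAggregate (handing out an ArrayIterator) instead of implementing Iterator by hand, which gives the same foreach behaviour with less code:

```php
<?php
// Hypothetical class; the item fields mirror the ones listed at the top of the post.

class FeedItems implements IteratorAggregate, Countable
{
    /** @var array[] each item: guid, title, link, body, optional author and categories */
    private $items;

    public function __construct(array $items)
    {
        $this->items = array_values($items);
    }

    public function getIterator(): Iterator
    {
        return new ArrayIterator($this->items);
    }

    public function count(): int
    {
        return count($this->items);
    }
}

// Usage: loop over the items just like over an array.
$o = new FeedItems([
    ['guid' => 'g1', 'title' => 'First post',  'link' => 'http://example.com/1', 'body' => '<p>One</p>'],
    ['guid' => 'g2', 'title' => 'Second post', 'link' => 'http://example.com/2', 'body' => '<p>Two</p>'],
]);

$titles = [];
foreach ($o as $item) {
    $titles[] = $item['title'];
}
```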

That's still not all. Next I want to make it so that each item's body is represented by a separate DOMDocument object, so that I can do custom parsing of the body itself, like counting the number of links in the article, counting the number of images, maybe extracting images. Most importantly, I will be sure that when I dump that feed into a file, it will be a valid HTML fragment.
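
Since this part is not written yet, the following is only a sketch of what I have in mind. The function name is made up, the body is wrapped in a div so that a bare fragment loads cleanly, and 'links' here simply counts every 'a' element:

```php
<?php
// Hypothetical helper illustrating the idea; not a finished implementation.

/**
 * Load an item's HTML body into its own DOMDocument and count links and images.
 */
function bodyStats(string $htmlBody): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely clean
    // The XML declaration is a common trick to make loadHTML() treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $htmlBody . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [
        'links'  => $doc->getElementsByTagName('a')->length,
        'images' => $doc->getElementsByTagName('img')->length,
    ];
}
```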

Dudes, parsing feeds is not easy, trust me.

If you want me to release my classes as an open source RSS parser, I will do that, no problem.
All you have to do is actually ask for it, let me know that people need it.

Just post a comment here.

See you later.