Again, parsing RSS feeds is not always easy.
Recently I needed a 'bullet-proof' class that could parse just about
any feed and return a consistent object with items. Each item must have a
guid, title, link and body, and optionally 'author' and
'categories'.
I could not find such a class. The
PEAR class was
good but not good enough: I could parse RSS2 with it but was having
problems with Atom
feeds.
That's the main problem with parsing feeds - there are at least 3 major
formats (Atom, RSS2 and RSS1), and each one has several minor versions.
Also, RSS 1.1 is somewhat different from RSS1 and not at
all like RSS2.
On top of that, some feed providers like FeedBurner (now owned by
Google) have some important rules about how you should access their feeds.
In particular, you must send standard HTTP request
headers just like a browser does, or else they will forward your
requests to their slow backup servers, where feeds may not be the most
up to date.
This means that when you receive the feed you must parse not only the
feed itself but also the HTTP response headers, extract the values of
'ETag' and 'Last-Modified' and store them
somewhere, so that next time you can send these values back in your request
headers ('If-None-Match' and 'If-Modified-Since').
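Here is a minimal sketch of that bookkeeping. The function names and the
array layout are made up for illustration (they are not from my classes),
and it assumes you already have the response headers as an array of raw
"Name: value" lines.

```php
<?php
// Hypothetical helper: pull the cache validators out of raw header lines.
function extractValidators(array $headerLines): array
{
    $validators = ['etag' => null, 'last_modified' => null];
    foreach ($headerLines as $line) {
        if (strpos($line, ':') === false) {
            continue; // status line or malformed header, nothing to read
        }
        // Limit of 2 keeps the colons inside the date value intact
        [$name, $value] = explode(':', $line, 2);
        $name = strtolower(trim($name)); // header names are case-insensitive
        if ($name === 'etag') {
            $validators['etag'] = trim($value);
        } elseif ($name === 'last-modified') {
            $validators['last_modified'] = trim($value);
        }
    }
    return $validators;
}

// Hypothetical helper: build the conditional headers for the next request.
function conditionalHeaders(array $validators): array
{
    $headers = [];
    if ($validators['etag'] !== null) {
        $headers['If-None-Match'] = $validators['etag'];
    }
    if ($validators['last_modified'] !== null) {
        $headers['If-Modified-Since'] = $validators['last_modified'];
    }
    return $headers;
}
```

You would store the result of extractValidators() next to the feed URL and
pass conditionalHeaders() to whatever HTTP client you use.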
But that's not even the whole story.
Some feed providers will do a wicked thing: if the server determines
that your 'If-Modified-Since' request header is not older than their feed
(meaning the feed has not changed), that server will
send you an HTTP 200 response code but the body of the reply will be
empty!
This is totally not the standard way to do this - the server is supposed
to send you a 304 reply code in such cases.
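A small sketch of how both cases can be treated the same way. The function
name is hypothetical, and treating a whitespace-only body as empty is my
assumption here, not something the servers promise.

```php
<?php
// Hypothetical helper: decide whether a response means "nothing new".
function feedIsUnchanged(int $statusCode, string $body): bool
{
    if ($statusCode === 304) {
        return true; // the standard "not modified" answer
    }
    // The non-standard variant: a 200 reply with nothing in it
    return $statusCode === 200 && trim($body) === '';
}
```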
This brings us to another aspect of parsing a feed - you need to
actually examine the HTTP reply codes and look out for the 304 code, meaning
the feed has not changed since your last request. Also
look out for the dreaded 'redirect' headers.
I say dreaded because there are several different reply codes that mean
'redirect' (301, 302, 303 and 307, for example), and you should also keep
track of the number of redirects and
have your script give up after about 3 or 4 redirects
to prevent a very long and possibly infinite redirect loop.
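The redirect limit can be sketched like this. Everything here is
hypothetical: $fetch stands for any callable that takes a URL and returns
an array with 'status', 'headers' and 'body' keys, and the sketch assumes
the 'Location' header holds an absolute URL.

```php
<?php
// Hypothetical exception for the "too many hops" case.
class TooManyRedirectsException extends RuntimeException {}

function fetchFollowingRedirects(callable $fetch, string $url, int $maxRedirects = 4): array
{
    $redirectCodes = [301, 302, 303, 307, 308];
    for ($hops = 0; $hops <= $maxRedirects; $hops++) {
        $response = $fetch($url);
        if (!in_array($response['status'], $redirectCodes, true)) {
            $response['url'] = $url; // remember where we finally landed
            return $response;
        }
        if (empty($response['headers']['Location'])) {
            throw new RuntimeException("Redirect without a Location header from $url");
        }
        $url = $response['headers']['Location'];
    }
    // Every response so far was a redirect: stop instead of looping forever
    throw new TooManyRedirectsException("Gave up after $maxRedirects redirects");
}
```

If the final URL differs from the one you started with after a 301, that
is a good moment to update the stored feed address.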
Lastly, some servers ignore the 'If-Modified-Since' and 'If-None-Match'
headers and will serve you the latest version of the feed all the
time, even if the feed has not been modified. Actually
ALL the major forum software like vBulletin, phpBB and many others
don't bother parsing the 'If-Modified-Since' header, which is totally
wrong, since handling these headers could save a lot of
time and server resources.
But what do you do if a server always sends the feed, even if it's the same
feed? Sure, you can still parse it and then compare the individual guids
against the guids that you have already parsed and
stored somewhere in your database. But it would be a waste of time
to even load the feed into DOMDocument if you know that the feed has not
changed.
So how do you determine if the feed has changed? Simply use the crc32()
function to get a checksum of the feed body and store it in your database,
then compare the crc32 of each newly downloaded feed
against the most recent value in your database. If they are the same,
then don't parse the file at all.
If they are not the same, then record the new crc32 value in your
database and parse the feed. Remember that you still need to compare
each item's guid against the ones that you have already
parsed.
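In code the check is only a few lines. This is a sketch: the function name
is made up, and the $checksums array stands in for the database table.

```php
<?php
// Hypothetical helper: returns true only when the body differs from the
// last one seen for this URL, and records the new checksum in that case.
function feedHasChanged(string $feedUrl, string $feedBody, array &$checksums): bool
{
    $checksum = crc32($feedBody);
    if (isset($checksums[$feedUrl]) && $checksums[$feedUrl] === $checksum) {
        return false; // same bytes as last time, skip parsing
    }
    $checksums[$feedUrl] = $checksum; // new or changed feed
    return true;
}
```

Keep in mind that some feeds put a timestamp such as lastBuildDate into
every response, so the checksum can change even when no item is new. That
is one more reason the per-item guid comparison is still needed.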
On top of that, there is a major pain with different encodings and HTML
entities that may be included (incorrectly) inside the feeds. Some feeds
may declare the XML encoding to be UTF-8 while the
server sends the content as ASCII or, even worse, in an encoding that is
not UTF-8 compatible. This means your class must have a method to detect
non-UTF-8 content and re-encode it the best it can.
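A very rough sketch of such a method. It assumes the mbstring extension is
available, and it simply guesses ISO-8859-1 when the bytes are not valid
UTF-8. That guess is an assumption for illustration only. A real class
would look at the HTTP Content-Type header and the XML declaration first.
It also only covers the case where the feed claims UTF-8 but is not.

```php
<?php
// Hypothetical helper: make sure the feed body is valid UTF-8.
function toUtf8(string $xml, string $fallbackEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($xml, 'UTF-8')) {
        return $xml; // already valid UTF-8 (plain ASCII passes this check too)
    }
    // Not valid UTF-8: re-encode from the assumed source encoding
    return mb_convert_encoding($xml, 'UTF-8', $fallbackEncoding);
}
```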
In order to be able to set the correct HTTP request headers I use the
HttpRequest class from pecl_http. I extended it to make my own class that
handles various HTTP reply codes and throws custom
exceptions that represent redirects or 404 'not found' reply codes.
(This is also an important part of parsing feeds. Sometimes a site may just
not be available, or may have moved, so you need to
stop requesting feeds from it.)
In order to parse just about any feed format and transform it into an
object that has consistent data I used PHP's XSLTProcessor and custom XSL
templates, one for each major feed type. The
result is that the feed type is automatically detected, the correct XSL
template is used and the feed is transformed into a standard DOMDocument
object (an extended version of it).
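The detection step can be sketched by looking at the root element. This is
not my actual class, just an illustration: the function name and the
returned labels are made up, it needs the DOM extension, and it does not
try to recognise RSS 1.1 or tell apart the older RSS 0.9x versions.

```php
<?php
// Hypothetical helper: guess the feed family from the XML root element.
function detectFeedType(string $xml): string
{
    if (trim($xml) === '') {
        throw new InvalidArgumentException('Empty feed body');
    }
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $doc = new DOMDocument();
    $loaded = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded || $doc->documentElement === null) {
        throw new InvalidArgumentException('Not a well-formed XML document');
    }
    // localName drops any namespace prefix, so rdf:RDF becomes RDF
    $name = strtolower($doc->documentElement->localName);
    switch ($name) {
        case 'feed':
            return 'atom';
        case 'rss':
            return 'rss2';
        case 'rdf':
            return 'rss1';
        default:
            throw new InvalidArgumentException("Unknown feed root element: $name");
    }
}
```

The returned label could then be used to pick the stylesheet that gets
loaded with XSLTProcessor::importStylesheet() before calling
transformToDoc().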
From that object I can actually get the items and each item's body, title,
links, etc. I even made this object implement Iterator so I can just use
foreach ($o as $item).
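For anyone who has not used the Iterator interface, here is a stripped-down
sketch. The class name is hypothetical and it wraps a plain array instead
of a DOMDocument. The return types assume PHP 8.0 or newer.

```php
<?php
// Hypothetical container: lets a list of items be walked with foreach.
class FeedItems implements Iterator
{
    private array $items;
    private int $position = 0;

    public function __construct(array $items)
    {
        $this->items = array_values($items); // force keys 0, 1, 2, ...
    }

    public function current(): mixed
    {
        return $this->items[$this->position];
    }

    public function key(): mixed
    {
        return $this->position;
    }

    public function next(): void
    {
        $this->position++;
    }

    public function rewind(): void
    {
        $this->position = 0;
    }

    public function valid(): bool
    {
        return $this->position < count($this->items);
    }
}

$o = new FeedItems([
    ['guid' => 'a1', 'title' => 'First post'],
    ['guid' => 'b2', 'title' => 'Second post'],
]);
foreach ($o as $item) {
    echo $item['title'], "\n";
}
```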
That's still not all. Next I want to make it so that each item's body
is represented by a separate DOMDocument object, so that I can do custom
parsing of the body itself, like counting the number of
links in the article, counting the number of images, maybe extracting
images. Most importantly, I will be sure that when I dump that feed into a
file, it will be a valid HTML fragment.
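A sketch of what that could look like. The function name and the returned
array keys are made up, it needs the DOM extension, and it ignores
character encoding issues to stay short.

```php
<?php
// Hypothetical helper: load an item body as HTML and collect a few facts.
function inspectBody(string $html): array
{
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely clean
    $doc = new DOMDocument();
    // The wrapper div keeps the fragment under a single known parent
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $imageUrls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $imageUrls[] = $img->getAttribute('src');
    }

    return [
        'links'  => $doc->getElementsByTagName('a')->length, // counts every <a> element
        'images' => $imageUrls,
    ];
}
```

Because the parser repairs unclosed tags while loading, saving the wrapper
element back out with saveHTML() should give a fragment that is at least
well formed.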
Dudes, parsing feeds is not easy, trust me.
If you want me to release my classes as an open source RSS parser, I
will do that, no problem.
All you have to do is actually ask for it, let me know that people need
it.
Just post a comment here.
See you later.