Google May Be Crawling AJAX Now – How To Best Take Advantage Of It

on February 01, 2011

In October 2009, Google proposed a new standard for implementing AJAX on web sites that would help search engines extract the content. Now, there’s evidence this proposal is either live or is about to be. Read on for more details on the proposal, how it works, and why it might be past the proposal stage.

The Trouble With AJAX
Historically, search engines have had trouble accessing AJAX-based content and this proposal would enable Google (and presumably other search engines that adopted the standard) to index more of the web. The standard SEO advice for AJAX implementations has traditionally been to follow accessibility best practices. If you build the site with progressive enhancement or graceful degradation techniques so that screen readers can render the content, chances are that search engines can access the content as well. Last May, I outlined some of the crawlability issues with AJAX and options for search-friendly implementations.
Google has offered advice on AJAX as well.
One of the primary search engine problems with AJAX is that it generates URLs that contain a hash mark (#). Since hash marks are also used for named anchors within a page, search engines typically ignore everything in a URL beginning with one (called the URL fragment). So, for instance, Google would see the following two URLs as identical:
  • http://www.buffy.com/seasons.php
  • http://www.buffy.com/seasons.php#best=2
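
To make that concrete, here is a minimal sketch of what "ignoring the fragment" amounts to. The function name serverVisibleUrl is made up for illustration; the only real API used is the standard URL class available in browsers and Node.js.

```javascript
// Sketch: strip the fragment the way a crawler (or any HTTP request) effectively does.
// serverVisibleUrl is a hypothetical helper name, not part of any library.
function serverVisibleUrl(href) {
  const url = new URL(href);
  url.hash = ""; // everything from the # onward is dropped
  return url.toString();
}

const plain = serverVisibleUrl("http://www.buffy.com/seasons.php");
const withFragment = serverVisibleUrl("http://www.buffy.com/seasons.php#best=2");

console.log(plain === withFragment); // true
```

Both calls end up at the same address, which is why the AJAX state after the # never gets indexed on its own.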

Google’s AJAX Proposal
With Google’s proposal, an AJAX-generated URL that contains a hash mark (#) would be replaced with a URL that uses #! in place of #. So, the second URL above would become http://www.buffy.com/seasons.php#!best=2. When Googlebot encounters the exclamation point after the hash mark, it requests the URL from the server using a syntax that replaces the #! with ?_escaped_fragment_=.
Still with me? All this means is that when Googlebot encounters:
http://www.buffy.com/seasons.php#!best=2
it will request the following URL from the server:
http://www.buffy.com/seasons.php?_escaped_fragment_=best=2
Why, you ask? Because ?_escaped_fragment_= in the URL tells the server to route the request to a headless browser, which executes the AJAX code and renders a static page. But, you might protest, I don’t want my URLs in the search results to look like that! Not to worry: Google requests the URL using that syntax, but then translates the ?_escaped_fragment_= back into #! when displaying it to searchers.
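
The round trip described above can be sketched in a few lines. This is an illustration of the substitution only, not Google's actual code: the function names are hypothetical, the sketch skips any percent-escaping of special characters in the fragment, and the use of & when the URL already has a query string is an assumption on my part.

```javascript
const ESCAPED_PARAM = "_escaped_fragment_=";

// Turn a #! URL into the form the crawler would request from the server.
function toCrawlerUrl(href) {
  const bang = href.indexOf("#!");
  if (bang === -1) return href; // no #! present, nothing to translate
  const base = href.slice(0, bang);
  const fragment = href.slice(bang + 2);
  // Assumption: append with & if the URL already carries a query string.
  const separator = base.includes("?") ? "&" : "?";
  return base + separator + ESCAPED_PARAM + fragment;
}

// Turn the crawler form back into the #! URL shown to searchers.
function toDisplayUrl(href) {
  const start = href.indexOf(ESCAPED_PARAM);
  if (start === -1) return href;
  const base = href.slice(0, start - 1); // drop the ? or & before the parameter
  return base + "#!" + href.slice(start + ESCAPED_PARAM.length);
}

console.log(toCrawlerUrl("http://www.buffy.com/seasons.php#!best=2"));
// http://www.buffy.com/seasons.php?_escaped_fragment_=best=2
```

Feeding the result back through toDisplayUrl returns the original #! URL, which is the translation step that keeps the ugly version out of the search results.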

How Do I Implement This?
This implementation basically requires that you:
  • Modify your AJAX implementation so that URLs that contain hash marks (#) are also available via the hash mark/exclamation point (#!) combination (or, as I recommend below, that you replace the # versions entirely with the #! ones).
  • Configure a headless browser on your web server that processes the ?_escaped_fragment_= versions of the URLs, executes the JavaScript on the page and returns a static page.
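
On the server side, the routing decision in the second step might look something like the sketch below. It assumes a JavaScript server for illustration; routeRequest, renderSnapshot, and serveApp are all hypothetical names, and the headless browser itself is represented only by the renderSnapshot callback you would supply.

```javascript
// Sketch: decide whether a request is a crawler asking for a static snapshot.
// renderSnapshot(path, state) and serveApp(path) are hypothetical callbacks.
function routeRequest(requestUrl, renderSnapshot, serveApp) {
  // The base is only needed so relative request paths can be parsed.
  const url = new URL(requestUrl, "http://www.buffy.com");
  if (url.searchParams.has("_escaped_fragment_")) {
    // The parameter value is the AJAX state, e.g. "best=2".
    const state = url.searchParams.get("_escaped_fragment_");
    return renderSnapshot(url.pathname, state);
  }
  return serveApp(url.pathname); // ordinary visitors get the normal AJAX page
}

const result = routeRequest(
  "/seasons.php?_escaped_fragment_=best=2",
  (path, state) => `snapshot of ${path} with ${state}`,
  (path) => `app for ${path}`
);
console.log(result); // snapshot of /seasons.php with best=2
```

Keeping the renderer as a callback means the same routing works whichever headless browser you end up configuring.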
Oh, you still have questions? I have answers! Well, and some questions of my own.

What about all those links?
Is Google going to consolidate all links to the # version of the URL and attribute them to the #! version? It appears that the answer is no. Currently, all links to URLs that contain a hash mark are attributed to the URL before the fragment, and that will continue to be the case. And the canonical tag won’t work in this case, since Google doesn’t process the # version of the URL. So, returning to our earlier example, all links to http://www.buffy.com/seasons.php#best=2 are attributed to http://www.buffy.com/seasons.php.

Wait, do we need to start using #! instead of #?
You likely don’t want to implement this in such a way that the # and #! URLs co-exist. Instead, you’ll want to replace # URLs with #! URLs. You can’t redirect search engine bots, of course (for the same reason bots can’t crawl and index the AJAX URLs as is). This means that, as noted above, the pages won’t get credit for past links to the # version of the URLs. You should ensure that the #! version of the URLs is what displays in a visitor’s browser though, so that any new links are to the (now indexable) #! versions. What about visitors coming from existing links to the # versions of the URLs? You’ll want to add code that transforms the # version of the URLs to the #! version (see below for more on that).

How do I create #! URLs in place of # URLs?
That’s pretty straightforward. Just (I know, there is no “just”) modify the AJAX code that creates URLs to output #! URLs instead of # URLs. As noted above, for any existing AJAX pages that use #, you’ll want to redirect visitors to the new URLs that use #!. This won’t cause Google to transfer links from the # versions to the #! versions, but it will ensure that visitors see only the #! version. Any new links will therefore be to that version, which will cause Google to start accruing PageRank for those pages. Obviously, you’ll want to get any new links to the versions of the pages that Google will index so those pages have a better chance at ranking well.
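
As a rough idea of what that change looks like, here is a sketch of a URL builder that emits #! instead of #. The function name buildHashBangUrl and the idea of passing the AJAX state as a plain object are assumptions for illustration; your own code will have its own way of tracking state.

```javascript
// Sketch: build a #! URL from a page URL and an object describing the AJAX state.
// buildHashBangUrl is a hypothetical helper, not part of any library.
function buildHashBangUrl(pageUrl, state) {
  const fragment = Object.entries(state)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`)
    .join("&");
  return `${pageUrl}#!${fragment}`;
}

console.log(buildHashBangUrl("http://www.buffy.com/seasons.php", { best: 2 }));
// http://www.buffy.com/seasons.php#!best=2
```

If every link and history update in the AJAX code goes through one helper like this, switching from # to #! becomes a one-line change.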

Below are a few suggestions for redirecting visitors from the # versions to the #! versions of the URLs.
  • JavaScript – You can use document.location, such as: <script type="text/javascript"> document.location = "http://www.buffy.com/seasons.php#!best=2"; </script>
  • PHP – You can write a short PHP script, such as: <?php header("HTTP/1.1 301 Moved Permanently"); header("Location: http://www.buffy.com/seasons.php#!best=2"); ?> (As with the .htaccess option below, this only helps when the old URL has no fragment, since the server never sees anything after the #.)
  • .htaccess – For Apache servers, you can use the NE flag in a rewrite rule, as shown below (although this really only works if you’re moving to the #! structure from a non-# URL):
      RewriteCond %{QUERY_STRING} ^best=(.*)$
      RewriteRule ^seasons\.php$ /seasons.php#!%1? [R=301,NE]
  • Meta refresh – Generally, a meta refresh isn’t recommended for redirects, as search engines do a better job of following 301s, but in this case you’re only redirecting visitors. You can add code similar to the following to the <head> section of the original page: <meta http-equiv="refresh" content="0; url=http://www.buffy.com/seasons.php#!best=2">
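
The examples above hardcode a single destination. Since only the browser ever sees the fragment, a more general client-side version might look like the sketch below. The function name upgradeHash is made up, and the sketch assumes every # fragment on the page is AJAX state; if the site also uses # for ordinary named anchors, you would need to exclude those.

```javascript
// Sketch: turn a legacy "#state" fragment into "#!state".
// upgradeHash is a hypothetical helper; it leaves "#!" and empty fragments alone.
function upgradeHash(hash) {
  if (hash.length > 1 && hash.startsWith("#") && !hash.startsWith("#!")) {
    return "#!" + hash.slice(1);
  }
  return hash;
}

// Only runs in a browser; skipped where there is no window object.
if (typeof window !== "undefined") {
  const upgraded = upgradeHash(window.location.hash);
  if (upgraded !== window.location.hash) {
    // replace() swaps the address without adding an extra history entry
    window.location.replace(
      window.location.pathname + window.location.search + upgraded
    );
  }
}

console.log(upgradeHash("#best=2")); // #!best=2
```

Run early in the page, this puts the #! version in the address bar before a visitor has a chance to copy the old URL.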