
Creating robots.txt for CakePHP apps

Posted in CakePHP on 09.07.2008.

Sometimes you just don't want parts of your site indexed or crawled. Learn how to automate this with CakePHP and robots.txt.

A while ago I got a lot of errors when trying to diagnose my site with Google Webmaster Tools. One of the core problems was that Googlebot was trying to index AJAX links, which just can't be good.

We start simply by adding a route for our robots.txt:

// ~/app/config/routes.php
Router::connect(
    '/robots.txt',
    array(
        'controller' => 'seo',
        'action' => 'robots'
    )
);

After that, we need to add some code to our SeoController:

// ~/app/controllers/seo_controller.php
class SeoController extends AppController
{
    var $uses = array();
    var $components = array('RequestHandler');

    function robots()
    {
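        // turn debug output off so it doesn't end up in the plain-text response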
        if (Configure::read('debug'))
        {
            Configure::write('debug', 0);
        }

        $urls = array();

        // ...snip...
        // fill the $urls array with those you don't
        // want to be indexed/crawled
        // for example
        $urls[] = '/articles/view/some-article/commentpaging';

        $this->set(compact('urls'));
        $this->RequestHandler->respondAs('text');
        $this->viewPath .= '/text';
        $this->layout = 'ajax';
    }
}

Of course, the URLs should not be hardcoded like this; I'll show how mine are generated further down.

Important bits:

  • response type is text
  • layout is ajax (simply because it's empty; a bare equivalent is sketched below)
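
For reference, the only reason the ajax layout works here is that it outputs nothing but the view itself. If your app happens to override the stock ajax layout, a bare equivalent would look like this (a minimal sketch, not something the setup above requires you to create):

// ~/app/views/layouts/ajax.ctp
<?php echo $content_for_layout; ?>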

After that, in your view:

// ~/app/views/seo/text/robots.ctp
User-agent: *
<?php
foreach ($urls as $url)
{
    echo 'Disallow: '.$url."\n";
}
?>
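
With the example URL from the controller above, the rendered robots.txt comes out as plain text along these lines:

User-agent: *
Disallow: /articles/view/some-article/commentpaging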

And that's it! Of course, this can easily be extended to deny access only to certain robots, but that's normally not a very good idea (except for those bastard bots which don't even bother checking robots.txt).
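
If you really do need per-robot rules, a hypothetical variant of the view could emit a separate User-agent block (BadBot is just a placeholder name here, not a real crawler):

// ~/app/views/seo/text/robots.ctp (hypothetical per-agent variant)
User-agent: BadBot
Disallow: /

User-agent: *
<?php
foreach ($urls as $url)
{
    echo 'Disallow: '.$url."\n";
}
?>

Robots pick the most specific User-agent block that matches them and ignore the rest, so BadBot gets blocked everywhere while everyone else only skips the listed URLs.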

To see this in action, take a look at my robots.txt:

http://lecterror.com/robots.txt

The URLs for it are generated like this:

$articles = $this->Article->getSitemapInformation();
$urls = array();

foreach ($articles as $article)
{
    $urls[] = Router::url(
            array
            (
                'controller' => 'articles',
                'action' => 'view'
            )
        ).'/'.$article['Article']['slug'].'/page';
}
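
Since SeoController keeps var $uses = array(), the Article model isn't loaded automatically there. A rough sketch of plugging the same generation into the robots() action could look like this (getSitemapInformation() is a custom model method, and loading the model via ClassRegistry is just one way to do it):

// inside SeoController::robots(), in place of the snipped section
$Article = ClassRegistry::init('Article');
$articles = $Article->getSitemapInformation();

foreach ($articles as $article)
{
    $urls[] = Router::url(
            array(
                'controller' => 'articles',
                'action' => 'view'
            )
        ).'/'.$article['Article']['slug'].'/page';
}

With the default routes, each entry then comes out as something like /articles/view/some-article/page.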

Easy, right?

Happy baking!

Article comments


Daniel Hofstetter :: 10.07.2008 01:10:46
If the robots.txt is static you could also put the file directly into app/webroot.
lecterror :: 10.07.2008 02:31:18
Yes, if the list of URLs is going to be static, a physical file is definitely a better solution. Otherwise it would be a real pain to maintain...
Langdon :: 06.09.2008 02:31:27
Nice tutorial thanks. You just saved me the time of figuring all of this out for myself.
lecterror :: 08.09.2008 06:35:02
Thanks, glad I could help!
Paul Edenburg :: 08.03.2010 09:48:41
It's important to add these two properties to your controller to make it work:

var $uses = array();
var $components = array('RequestHandler');

So this way you won't get errors like:
- Error: Database table seos for model Seo was not found.
- Undefined property: SeoController::$RequestHandler

Gr, Paul Edenburg