OddThinking

A blog for odd things and odd thoughts.

Robots.txt Boost: Don’t record these URLs

So, I have a web-site with some automatically-generated pages. It includes very mildly personal information; for the sake of argument, let’s say it is their shoe-size.

This information is intended only for a select group of shoe-enthusiasts, doing research. I don’t want this web-site to appear in the search engines. People randomly searching for someone shouldn’t find their shoe-size.

The first option would be to password-protect it (or otherwise authenticate each user). However, that is hugely onerous on the legitimate users of the site, for something as mild as shoe-size.

So, I went for the honour system provided by robots.txt. “Please, please,” I asked the search engines, “Don’t go looking in my shoe-size directory.” While compliance is not guaranteed, the robots.txt convention is widely followed by bots.
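For the sake of illustration, here is a minimal sketch of what such a request looks like and how a well-behaved bot would interpret it, using Python's standard `urllib.robotparser`. The `/shoe-size/` path, the `example.com` URLs and the `ExampleBot` name are all made up for this example; they are not the real site's values.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the directory name is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /shoe-size/
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if a bot that honours robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite bot stays out of the shoe-size directory...
print(may_fetch("ExampleBot", "https://example.com/shoe-size/kevin-rudd.html"))  # False
# ...but is free to read the rest of the site.
print(may_fetch("ExampleBot", "https://example.com/about.html"))  # True
```

Note that this only answers "may I fetch this URL?". It says nothing about whether the URL itself may be listed, which is the gap described next.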

Google honoured the letter of my request, but not the spirit. (I only pick Google out because I use it, not because it is necessarily the only offender.) It doesn’t search the page, but when someone links to it, Google will still return it. (e.g. <a href=wherever.html>Kevin Rudd's Shoe-Size</a> will be found when searching for Kevin Rudd’s shoe-size, even though the URL itself is never visited by Google’s bots.)

How do you get around this? Google explain that there is another convention – the noindex meta tag.

I need to put in every page a note to the robots to say “Hey, don’t include this very page in your index at all.” But, the only way that they will see that note is if I let the search engines read the file – I need to remove the robots.txt restriction.
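As a rough sketch of why the page has to be fetched first: the note lives inside the HTML, so a bot can only discover it by downloading and parsing the page. The example below uses Python's standard `html.parser`; the sample page and the `is_noindex` helper are hypothetical, and real crawlers will certainly be more elaborate.

```python
from html.parser import HTMLParser

# Hypothetical generated page carrying the noindex note.
PAGE = """<html><head>
<meta name="robots" content="noindex">
<title>Shoe-size report</title>
</head><body>...</body></html>"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, nofollow".
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def is_noindex(html: str) -> bool:
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


print(is_noindex(PAGE))  # True
```

The function needs the page body as its input, which is exactly what a robots.txt restriction stops a compliant bot from ever obtaining.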

Now, I am in a dilemma for two reasons.

The first is that I have to let Google crawl all around my database, which may have hundreds of thousands of records and reports of shoe-sizes. That is going to cost me in bandwidth and CPU, just to say “This page shouldn’t appear in your list”.

The second is that there are sure to be robots out there that support the robots.txt convention but not the newer noindex meta tag convention. I don’t know who they are. I can’t exclude them in the robots.txt, while allowing the bots through that comply with the new convention.

We need an updated standard for robots.txt that says “Not only shouldn’t you look in this directory, you shouldn’t even admit that such URLs exist in your database.”

8 Comments
Categories: Insufficiently Advanced Technology, S/W Dev
Tags: Google, HTML, robots.txt, search engines, web

Comments

  1. robots meta tag with its “noindex, nofollow” content has been around since at least 98: I was using it then fairly successfully with wget, the GNU web spider, which by now is probably the template of most link harvesters. It seems this tag was defined as part of the W3C recommendation for HTML 4.0, and that was in 1997. The text of the spec also has this interesting statement (even in the 4.01 update):

    Note. In early 1997 only a few robots implement this, but this is expected to change as more public attention is given to controlling indexing robots.

    It’s been more than ten years, so I think you can stop worrying about whether it’s supported. In fact, I think any spider that doesn’t respect this tag isn’t going to respect your robots.txt either.

  2. Support OpenID?

  3. There’s this sort-of club near where I live where they talk about shoe sizes. They meet up in this abandoned building with broken windows and stuff. There’s no “club leader” as such, and no “club registry”. People just politely assume that they belong to the club when they wander in.

    Unfortunately, sometimes people tell other people about the club, or sometimes people see other people wander into the building and get curious. Worst case it sometimes gets into shoe mags and sometimes a bunch of randoms show up and it’s really awkward… The people don’t want to make it a proper club, because that makes it too much effort, and they want it to be casual.

    They instituted a rule:

    The first rule of shoe club is that nobody talks about shoe club.

    The second rule: If this is your first night, you have to… shoe…

    True story.

  4. Aristotle,

    I may be missing something, but supporting OpenID won’t solve the problem, for two reasons.

    It will mean that (depending how I configure it) users won’t need yet another username and password for my site, but they would still need to authenticate somehow.

    Similarly, there is still an initial registration. As a regular web-surfer, I still do not yet have an OpenID account (to my knowledge? perhaps some of the web accounts I have are OpenID-ready?). I don’t expect the occasional visitor will have one. I hope both of those facts change in the next five years.

    More importantly, while Googlebot won’t be able to read the contents of the page, it will still serve links to it. Kevin Rudd seekers will find a link to his shoe-size.

  5. Richard,

    You have assuaged my fear that common search bots might not support the NoIndex meta tag.

    It still leaves the problem that each bot will need to visit each of the 10,000-or-so generated pages just to find out it shouldn’t index the page.

    Perhaps I am being too miserly with CPU and bandwidth? 10,000 page hits per bot spread over time is possibly not worth worrying over. The pages aren’t large (assuming you don’t download the images).

    Oh dear! Images! Suppose you don’t link to Kevin Rudd’s generated database page, but instead link straight to the image of Kevin Rudd’s footprint. While I can block the images with a judicious robots.txt file, I can’t include a NoIndex clause in a JPEG file.

  6. You may not have an account for it, but you already have an OpenID… or four. 😉

  7. The footprint image will only be shown when someone embeds that image in their site, not when they link to it. So if your site has noindex and nobody embeds the image, it will not be indexed. And if they do embed it, you can’t really stop indexing from -their- site, can you?

  8. Configurator,

    If they link directly to the image (e.g. <a href="mysite.com/wherever.jpg">Kevin Rudd's Shoe-Size</a>), it will be found when you search for Kevin Rudd. I have no option to include a NoIndex modifier. You claim that is indexing their site. I claim that it is linking the name Kevin Rudd to my site, which I want to prevent.

    I can see these are two different ways of looking at the same thing, but there doesn’t seem to be a way for me to control the indexing of my images here, which seems wrong.

    I could modify my server to give 404s to Google and other known bots when they look at images, but that seems over the top.
