{"id":531,"date":"2008-04-09T05:05:17","date_gmt":"2008-04-08T19:05:17","guid":{"rendered":"http:\/\/www.somethinkodd.com\/oddthinking\/?p=531"},"modified":"2008-04-06T18:07:28","modified_gmt":"2008-04-06T08:07:28","slug":"robotstxt-boost-dont-record-these-urls","status":"publish","type":"post","link":"https:\/\/www.somethinkodd.com\/oddthinking\/2008\/04\/09\/robotstxt-boost-dont-record-these-urls\/","title":{"rendered":"Robots.txt Boost: Don&#8217;t record these URLs"},"content":{"rendered":"<p>So, I have a web-site with some automatically-generated pages. It includes very mildly personal information about people; for the sake of argument, let&#8217;s say it is their shoe-size.<\/p>\n<p>This information is intended only for a select group of shoe-enthusiasts, doing research. I don&#8217;t want this web-site to appear in the search engines. People randomly searching for someone shouldn&#8217;t find their shoe-size.<\/p>\n<p>The first option would be to password-protect it (or otherwise authenticate each user). However, that is hugely onerous on the legitimate users of the site, for something as mild as shoe-size.<\/p>\n<p>So, I went for the honour system provided by robots.txt. &#8220;Please, please,&#8221; I asked the search engines, &#8220;Don&#8217;t go looking in my shoe-size directory.&#8221; While compliance is not guaranteed, the robots.txt convention is widely followed by bots.<\/p>\n<p>Google honoured the letter of my request, but not the spirit. (I only pick Google out because I use it, not because it is necessarily the only offender.) It doesn&#8217;t crawl the page, but when someone links to it, Google will still return the URL in its results. (e.g. <code>&lt;a href=wherever.html&gt;Kevin Rudd's Shoe-Size&lt;\/a&gt;<\/code> will be found when searching for Kevin Rudd&#8217;s shoe-size, even though the URL itself is never visited by Google&#8217;s bots.)<\/p>\n<p>How do you get around this? 
Google <a href=\"http:\/\/www.google.com\/support\/webmasters\/bin\/answer.py?answer=35303\">explain<\/a> that there is another convention &#8211; the <a href=\"http:\/\/www.google.com\/support\/webmasters\/bin\/answer.py?answer=61050\">noindex meta tag<\/a>.<\/p>\n<p>I need to put a note to the robots in every page, saying &#8220;Hey, don&#8217;t include this very page in your index at all.&#8221; But the only way that they will see that note is if I let the search engines read the file &#8211; I need to remove the robots.txt restriction.<\/p>\n<p>Now, I am in a dilemma for two reasons.<\/p>\n<p>The first is that I have to let Google crawl all around my database, which may have hundreds of thousands of records and reports of shoe-sizes. That is going to cost me in bandwidth and CPU, just to say &#8220;This page shouldn&#8217;t appear in your list&#8221;.<\/p>\n<p>The second is that there are sure to be robots out there that support the robots.txt convention but not the newer noindex meta tag convention. I don&#8217;t know who they are. 
I can&#8217;t exclude them in robots.txt while still allowing through the bots that comply with the newer convention.<\/p>\n<p>We need an updated standard for robots.txt that says &#8220;Not only shouldn&#8217;t you look in this directory, you shouldn&#8217;t even admit that such URLs exist in your database.&#8221;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I am in a dilemma about how to tell Google to not index my pages.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_s2mail":"","footnotes":""},"categories":[25,34],"tags":[114,265,266,264,95],"class_list":["post-531","post","type-post","status-publish","format-standard","hentry","category-insufficiently-advanced-technology","category-software-development","tag-google","tag-html","tag-robotstxt","tag-search-engines","tag-web"],"_links":{"self":[{"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/posts\/531","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/comments?post=531"}],"version-history":[{"count":0,"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/posts\/531\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/media?parent=531"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/categories?post=531"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/tags?post=531"}],"curies":[{"name":"wp","href":"h
ttps:\/\/api.w.org\/{rel}","templated":true}]}}