Tim Converse, an engineering manager in the Yahoo! Content group, talks about crawler traps and redirection.
Excerpted from An Interview with Tim Converse, part 2 (emphasis mine):
JQ: You talked about comprehensiveness. There’s this perception that there’s the web that most of us see and then this dark web: the stuff that the crawlers don’t reach. How do we try to get that data into the index? Are there barriers that webmasters put up that they should avoid to help us better index the content?
A: At its simplest, webmasters aren’t aware of robots.txt and its uses. Redirection can also be problematic if people create content by creating lots of domains or hosts, so we encourage people to organize their sites into many documents before they get a new host.
And of course, there’s also the issue of crawler traps which some people do intentionally but much more often, they’ve unintentionally created crawler traps….
JQ: …and a crawler trap is…
A: A crawler trap is something where you crawl a page and it has a link, usually in the same site, that’s dynamically created. Then you follow that link and it has another analogous link that’s dynamically created, and often, just because people make mistakes, you’re attaching another directory every time, which doesn’t exist and takes you back to an automatically generated error page which has the same link. So you can fall into traps where there are an infinite number of pages that don’t have any content.
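The trap Tim describes usually starts with a relative link on an automatically generated error page: each time the link is followed, resolving it appends one more directory that doesn’t exist. A rough sketch of the effect, and of the kind of depth cap a crawler might use to bail out; the link name and the cap here are made up for illustration:

```python
from urllib.parse import urljoin

# Hypothetical scenario: every auto-generated error page on the site
# carries the same relative link, e.g. href="more/index.html".
RELATIVE_LINK = "more/index.html"   # assumed for illustration
MAX_DEPTH = 5                       # arbitrary cap a crawler might apply

url = "http://example.com/index.html"
for _ in range(8):
    # Resolving the relative link against the current URL tacks on
    # another directory each time -- the chain of pages never ends.
    url = urljoin(url, RELATIVE_LINK)
    depth = url.count("/") - 2      # rough count of path segments
    print(depth, url)
    if depth > MAX_DEPTH:
        print("looks like a crawler trap; stop following this chain")
        break
```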
Another thing people can do to help us, and this is sort of geeky, is: don’t make page-not-found pages that return a status 200.
JQ: I was just about to ask that. 404 pages, back in the day, were these ugly grey things with block text that all looked the same, and now they’re done up to look like regular pages to be more appealing to users.
A: We do actually have ways of detecting that but it’s a lot easier for us if a web server just says, "this page doesn’t exist" as opposed to creating a nice page for the user that to a crawler looks like any other page. In general, if the server tells us 404, then we discard it.
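Site owners can spot this “status 200 for a missing page” problem themselves by requesting a URL that cannot possibly exist and checking what status code comes back. A minimal sketch using Python’s standard library; the test URL is made up:

```python
from urllib.request import urlopen
from urllib.error import HTTPError

# Hypothetical URL that should not exist on the site being checked.
test_url = "http://example.com/this-page-should-not-exist-xyz123"

try:
    status = urlopen(test_url).status
except HTTPError as err:
    status = err.code            # urlopen raises on 4xx/5xx responses

if status == 200:
    print("soft 404: the error page is served with status 200")
else:
    print(f"server returned {status} for the missing page")
```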
YQ: I worked for a company that used CIDs instead of cookies to follow users through the site and it turned out to be a disaster. We went from having pretty much every page indexed to hardly any. So what about CIDs and how they affect the crawlers?
A: If you have differences in the URL that don’t actually make a difference in the site, that can be hard for us to untangle. We’re getting better at it. One of the scenarios you’re talking about there would just create a lot of duplicates for us. So it’s nicer for us if we have one URL per actual content, but we understand that you’re not designing this just for us. And we obviously do a lot of duplicate detection; actually, we do duplicate detection in a couple of different ways. Finding out if documents are the same; finding out if sites are mirrors of each other.
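The “one URL per actual content” point is essentially URL canonicalization: if a session identifier in the URL doesn’t change the page, strip it before comparing or publishing URLs. A rough sketch, assuming a hypothetical session parameter named cid; real duplicate detection is of course far more involved than this:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that change the URL but not the content (assumed names).
TRACKING_PARAMS = {"cid", "sessionid"}

def canonicalize(url: str) -> str:
    """Drop session/tracking parameters so duplicate URLs collapse to one."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))

print(canonicalize("http://example.com/product?id=42&cid=abc123"))
# -> http://example.com/product?id=42
```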
JQ: This question came up today on a mailing list that I’m on. The concern for this particular company is that they want to move their site to a new domain but they don’t want to become invisible for the next six months or year or however long it’ll take for people to point to their new website. What can we tell people like that?
A: We can tell them that in the future, if you actually want to move your site, you want to use a 301 redirect which will do as much of the right thing as we can.
YQ: What actually happens there? I’ve heard of companies who have used 301 redirects and yet their old pages continued to show up in the search engines anyway. Why is that?
A: The underlying problem is that people out there haven’t changed their links and search engines do pay attention to links.
I can’t give you a date, but we’re changing how we deal with redirects. The thing about redirects is that everyone thinks it’s obvious how a search engine should treat them, and the obvious answer is not really that helpful. Any policy you develop with redirects is going to make someone unhappy, but what we’re about to roll out will pay better attention to 301 redirects, and the exact problem you’re talking about should come up less.
[In the time since we met with Tim, the team has rolled out a fix for 301/302 redirects. Documents will be handled by the new redirect policy as they are re-crawled and re-indexed and webmasters will start to see many of the sites change in the next couple of weeks. The index should be fully propagated within a month. See Tim Mayer’s Webmaster World presentation for details.]
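For reference, the 301 Tim recommends when moving a site simply means the old server answers every request with the new location of that URL. A minimal sketch using Python’s standard library, with the new host name made up; in practice this is usually configured in the web server rather than written by hand:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

NEW_HOST = "http://www.new-example.com"   # assumed new location of the site

class MovedPermanently(BaseHTTPRequestHandler):
    """Answer every request on the old host with a 301 to the new one."""
    def do_GET(self):
        self.send_response(301)
        self.send_header("Location", NEW_HOST + self.path)  # keep the path
        self.end_headers()

if __name__ == "__main__":
    # Run this in place of the old site so crawlers and browsers are told,
    # per URL, where the content has permanently moved.
    HTTPServer(("", 8080), MovedPermanently).serve_forever()
```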