8 things the web content optimization community can learn from Matt Cutts's cloaking video

For years, I have been answering questions about web content optimization (WCO) and search engine optimization (SEO). (See this post from earlier this year.) Recently, I’ve had several calls with customers who want definitive answers to a handful of recurring questions about how WCO interacts with Google’s cloaking rules.

In the past, we’ve worked with the Google team to get answers to these questions. Recently, Matt Cutts posted this video on cloaking, based on our discussions. (Shout-out to Patrick Meenan for his tireless efforts here. Thanks, Pat!)

Like most search-related answers, the video is a general overview of cloaking, but it contains all of the information we need to answer the questions above.

(Before I get into that, though, I have to mention how much Matt reminds me of Mr. Rogers in this video. He seems so approachable and kind. I’d let him take care of my kids.)

Now let’s get into things. In order to analyze Matt’s points, I’ve transcribed the video to the best of my abilities and patience. I’ll be quoting Matt, and then interpreting his comments in the context of web content optimization.

Point #1: You cannot target the bot specifically, and you cannot serve it different content.

“Cloaking is essentially showing different content to users than to Googlebot. So imagine that you have a web server right here and a user comes and asks for a page. So you know, here is your user. [He’s drawing on the whiteboard here.] You give him some sort of page and everybody is happy. And now let’s have Googlebot come and ask for a page as well, and you give Googlebot a page. Now, in the vast majority of situations, the same content goes to Googlebot and users. Everybody is happy. Cloaking is when you show different content to users and to Googlebot, and it is definitely high risk; that is a violation of our quality guidelines.”

I think he makes the basic principle very clear here: you cannot target the bot specifically, and you cannot serve it different content.

Two concepts are key here:

  • What constitutes different content.
  • What situations outside of the “vast majority of situations” are acceptable use cases for sending different content to different browsers.

Let’s keep going as Matt expands on both of these concepts.

Point #2: Intent is important.

“Why do we consider cloaking bad, or why does Google not like cloaking? Well, the answer is sort of in the ancient days of search engines, when you saw a lot of people do really deceptive or misleading things with cloaking. For example, when Googlebot came, the web server that was cloaking might return a page all about cartoons, Disney cartoons or whatever, but when a user came and visited the page, the web server might return something like porn. And so if you did a search for Disney cartoons on Google, you would get a page that looked like it would be about cartoons, you would click on it, and then you would get porn. That is a hugely bad experience. People complain about it. It is an awful experience for users. So we say that all types of cloaking are against the quality guidelines. There is no such thing as “white hat cloaking”. Certainly, when somebody is doing something especially deceptive or misleading, that is when we care the most; that is when the web spam team really gets involved.”

In this section, Matt makes it clear that intent is important. The cloaking debate really centers around the issue of intentionally misleading and deceiving users. Like most Google initiatives, the purpose of banning cloaking is to ensure that the system has integrity and does no evil. When analyzing how web content optimization deals with bots, we need to keep this principle in mind.

Let’s keep going.

Point #3: Testing with simple hash comparisons won’t work for dynamic sites.

“Okay, so what are some rules of thumb, to save you trouble or help you stay out of a high-risk area? … Take a hash of a page — take all that different content and boil it down to one number [the hash] — and then pretend to be Googlebot. You know, with the Googlebot user agent. We even have a “fetch as Googlebot” feature in Google webmaster tools. So you fetch a page as Googlebot and you hash that page, as well, and if those numbers are different [i.e., the page hash taken from a browser versus the hash taken from the bot’s perspective], then that could be a little bit tricky. That could be something where you might be in a high-risk area. Now pages can be dynamic — you might have things like time stamps, or the ads might change. So it’s not a hard and fast rule.”

Obviously, a simple hash of a dynamic page will not answer the cloaking question. Sites are so dynamic that this test in its simplest form would fail for most pages. I tried it on 10 prominent sites and found that the hashes were completely different due to dynamic content such as ads and rotating copy. We need to keep searching for answers.
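To make the limitation concrete, here is a minimal Python sketch of Matt’s hash test. The two HTML snippets are invented for illustration (a real check would fetch the live page twice, once with a normal browser user agent and once as Googlebot); they differ only in an embedded timestamp, the kind of dynamic detail that defeats a naive comparison.

```python
import hashlib

def page_hash(html: str) -> str:
    """Boil a page down to one number, as Matt describes."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

# Two hypothetical fetches of the same page, identical except for a
# timestamp in a comment -- the visible content is the same.
fetch_as_browser = "<html><body>Hello</body><!-- rendered 12:00:01 --></html>"
fetch_as_googlebot = "<html><body>Hello</body><!-- rendered 12:00:02 --></html>"

# The naive hashes differ even though no cloaking is happening.
print(page_hash(fetch_as_browser) == page_hash(fetch_as_googlebot))  # False
```

Because the user-visible content is identical, a smarter comparison would strip or normalize dynamic regions (timestamps, ad slots) before hashing.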

Point #4: Targeting the bot and serving it different content is a clear violation.

“Another simple heuristic to keep in mind is, if you were to look through the code of your web server [or in the WCO market, your friendly neighborhood automation vendor :) ], would you find something that deliberately checks for a user agent of Googlebot specifically, or Googlebot’s IP address, specifically? Because if you’re doing something very different or special or unusual for Googlebot — either its user agent or its IP address — that has the potential to, you know, maybe be showing different content to Googlebot than to users. That is the stuff that is high risk. So keep those kinds of things in mind.”

This provides good guidance for web content optimization. Any WCO solution that targets the bot specifically and serves it different content is clearly violating the rules. At Strangeloop, we don’t do this, and I’m not aware of anyone who does in our industry. (I checked a number of other vendors while preparing this post.)
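As a rough illustration, this is the kind of branch Matt suggests auditing for. The function and file names here are hypothetical; the red flag is any code path that singles out Googlebot’s user agent (or IP range) and returns different content.

```python
def choose_page(user_agent: str) -> str:
    """The anti-pattern to grep for: a branch that singles out Googlebot."""
    if "googlebot" in user_agent.lower():   # <-- this check is the red flag
        # Serving the bot its own page is exactly the cloaking Matt describes.
        return "seo_optimized_page.html"
    return "normal_page.html"
```

If an audit of your server (or your vendor’s proxy) turns up a conditional like this, you are in the high-risk territory the video warns about.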

Next, Matt transitions to a few examples that are very relevant to our world.

Point #5: Serving different content to different clients, based on client needs, is okay.

“Now, one question we get from a lot of people who are white hat and who do not want to be involved in cloaking in any way, and who want to make sure that they stay clear of high-risk areas [that’s me, and ostensibly, if you’re still following along, it’s you too]: what about geolocation and mobile user agents, so you know, phones and that sort of thing? The good news, in an executive sort of summary, is that you don’t really need to worry about that [geolocation/mobile user agents], but let’s talk through exactly why geolocation and handling mobile phones is not cloaking.”

In other words, addressing different client needs is not cloaking. He continues with his example, but it is important to note that serving different content to users based on capabilities is clearly defined as acceptable. Just as mobile phones have different capabilities, so do different browser versions.

For more clarity, let’s examine Matt’s examples.

Point #6: Treat Googlebot like any normal desktop browser.

“Okay… so until now we have had one user. Now let’s go ahead and say this user is coming from France. And let’s have a completely different user, and let’s say maybe they are coming from the United Kingdom. In an ideal world, if you have your content available on a dot-FR domain or dot-UK domain or different languages because you have gone through the work of translating them, it is really, really helpful if someone coming from a French IP address gets their content in French. They are going to be much happier about that. So what geolocation does is, whenever a request comes in to the web server, you look at the IP address and you say ‘Ah, this is a French IP address. I am going to send them the French language version or send them to the dot-FR version of my domain.’ If someone comes in and their browser language is English or their IP address is something from America or Canada or something like that, then you say, ‘Aha, English is probably the best message.’ Unless they are coming from the French part, of course. [I like the shout-out to my friends in Quebec.]

“So what that is doing is, you are making the decision based on the IP address. As long as you are not making up some specific country that Googlebot belongs to, “GoogleLandia” or something like that, then you are not doing something special or different for Googlebot. At least currently, when we are making this video, Googlebot crawls from the United States, so you would treat Googlebot just like a visitor from the United States. You’d serve up content in English, and we typically recommend that you treat Googlebot just like a regular desktop browser, so you know, Internet Explorer 7 or whatever a very common desktop browser is for your particular site. So geolocation — that is, looking at the IP address and reacting to that — is totally fine, as long as you are not reacting specifically to the IP address of just Googlebot, just that very narrow range, and instead you are looking at what is the best user experience overall depending on the IP address.”

This example really helps us understand our role in web content optimization. The goal is to provide the best user experience, and this can change depending on country or browser. Matt is asking us to treat the bot like we would any normal desktop browser. Don’t do anything special for it.
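A sketch of compliant geolocation, with an invented country-to-language table standing in for a real GeoIP lookup: because Googlebot crawls from US IP addresses, it takes the same branch as any other American visitor, and there is no special case anywhere in the code.

```python
# Hypothetical mapping for illustration; a production server would resolve
# the client's IP through a GeoIP database instead.
COUNTRY_TO_LANG = {"FR": "fr", "GB": "en", "US": "en"}

def pick_language(country_code: str) -> str:
    """Choose a language from the visitor's country, with no bot-specific path.

    Googlebot's US IPs resolve to "US", so it simply gets the English
    version, exactly like any other visitor from the United States.
    """
    return COUNTRY_TO_LANG.get(country_code, "en")
```

The compliant property is structural: the decision depends only on the IP-derived country, never on who is asking.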

His mobile example, next, provides further clarity.

Point #7: Case in point: serving customized “squeezed” pages to mobile devices is fine.

“In the same way, if someone now comes in — and let’s say they are coming in from a mobile phone, so they are accessing it on an iPhone or Android phone — and you can figure out, okay, that is a completely different user agent. It has got completely different capabilities. It is totally fine to respond to that [mobile] user agent and give them, you know, a more squeezed version of the website or something that fits better on the smaller screen. Again, the difference is, if you are treating Googlebot like a desktop user, so that user agent doesn’t have anything special or different that you are doing, then you should be in perfectly fine shape. So, you know, you are looking at the capabilities of the mobile phone, you are returning an appropriately customized page, and you are not trying to do anything deceptive or misleading, you are not treating Googlebot really differently based on the user agent, and you should be fine.”

Matt clearly states that it is okay to give the user an experience that is tailored to the browser’s capabilities. This is a strong endorsement of the fact that advanced web content optimization features, which only apply to one user agent, are perfectly legal and encouraged, so long as the bot is not treated in any special way.
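The same idea in sketch form, with invented marker strings and template names: the decision keys off device capabilities, and Googlebot, which matches none of the mobile markers, falls through to the desktop template like any regular desktop browser.

```python
# Illustrative substrings only; real mobile detection uses a fuller UA database.
MOBILE_MARKERS = ("iphone", "android")

def select_template(user_agent: str) -> str:
    """Pick a page template from device capabilities, not from bot identity."""
    ua = user_agent.lower()
    if any(marker in ua for marker in MOBILE_MARKERS):
        return "mobile_squeezed.html"   # smaller page for small screens
    # Googlebot matches no mobile marker, so it lands here with every
    # other desktop browser -- nothing special or different for the bot.
    return "desktop.html"
```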

He gives even more insight below.

Point #8: No, really, you can’t treat Googlebot differently than you treat users. Ever.

“So the one last thing I want to mention — this is a little bit of a power user kind of thing — is some people are like, ‘Okay, I won’t make the distinction based on the exact user agent string or the exact IP address range that Googlebot comes from, but maybe I will, say, check for cookies, and if somebody does not respond to cookies, or if they don’t treat JavaScript the same way, then I will carve that out and treat it differently.’ The litmus test there is: are you basically using that as an excuse to try to find a way to treat Google differently, or to try to find some way to segment Googlebot and make it do a completely different thing? So again, the instinct behind cloaking is: are you treating users the same way as you are treating Googlebot? We want to score and return roughly the same page that the user is going to see. So we want the end user experience when they click on the Google result to be the same as if they’d just come to the page themselves.

“So that is why you shouldn’t treat Googlebot differently, and that is why cloaking is a bad experience, why it violates our quality guidelines, and that is why we do pay attention to it. There is no such thing as “white hat” cloaking. We really do want to make sure that the page the user sees is the same page the Googlebot saw.

“Okay, I hope that kind of helps.”

Thanks, Matt. This does help.

To summarize:

  1. It is safe to provide different pages to mobile browsers, different locations, and different user agents, so long as the Googlebot user agent, its IP addresses, or its capabilities (e.g., not accepting cookies) are not directly targeted.
  2. Treating the Googlebot like a desktop browser with the basic acceleration treatments one would apply to any generic browser is fine and encouraged.
  3. As I was re-watching the video, another concept stood out: content. Matt repeatedly says not to serve different content to Googlebot. The operative question here is: if Googlebot were a full-featured browser, would it see the same page (including images, etc.) as a normal browser? If we’re not changing the content itself and are only changing the way the content is delivered (through techniques like inlining, MHTML, or data URIs), then we’re clearly not in violation.
  4. Intent is key. Something that is geared toward speeding up sites, and that has no intention of deceiving users, is safe so long as it abides by Google’s rules and regulations.
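As an illustration of point 3, changing delivery without changing content, here is a small sketch of inlining an image as a data URI. The function name is mine, invented for this example; the point is that the bytes the user sees are unchanged, and only the number of HTTP requests differs.

```python
import base64

def inline_image_as_data_uri(png_bytes: bytes) -> str:
    """Embed a PNG directly in the markup: same content, different delivery.

    The browser decodes the identical image bytes; the page just avoids
    one extra HTTP round trip -- nothing Googlebot sees has changed.
    """
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f'<img src="data:image/png;base64,{encoded}">'
```

Because the decoded bytes are byte-for-byte identical to the original file, this kind of transformation optimizes delivery while leaving the content itself untouched.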

I’m confident you are safe if:

  • You don’t target the bot or Google IP addresses specifically.
  • You don’t try to game search engines by singling out browsers that don’t accept cookies or don’t run JavaScript and treating them differently.
  • You provide the basic features that apply to all browsers to all browsers, including bots.
  • You save advanced features for the specific browsers for which they are built.
  • You don’t change the actual content (images, etc.) you serve to Googlebot.
  • Most importantly, your intention is to speed up pages and not do evil.
