google-plus-platform - issue #178

Specific user-agent to identify the google-plus crawler


Posted on Feb 24, 2012 by Swift Bear

What steps will reproduce the problem?
1. The Google+ bot that crawls a page after a user clicks the +1 button doesn't identify itself as a bot.
2. Nor does it identify itself as an agent without cookie support.

What is the expected output? What do you see instead? There are several problems if the crawler can't be identified:
- The system counts the crawler's visit as a normal one.
- The system cannot use a load balancer or cache to prioritize normal users.

Comment #1

Posted on Feb 24, 2012 by Massive Rhino

(No comment was entered for this change.)

Comment #2

Posted on Mar 6, 2012 by Happy Kangaroo

Having a consistent User-Agent would help our multi-language website detect language settings and share content effectively.

Comment #3

Posted on Mar 12, 2012 by Grumpy Kangaroo

Declaring a consistent user agent would allow the +1 button to work on our 100% SSL site. We have to manually set cookies for each of the social sharing tools (Twitter, Facebook, etc.), but we cannot set it up for G+ yet because of the current bot user-agent issue...

Comment #4

Posted on Mar 22, 2012 by Grumpy Hippo

Just noticed this also. Google +1 hits on links to my site count as visits with the user agent "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0". Facebook has the "facebookexternalhit" user agent for this purpose, and LinkedIn has "LinkedInBot" (I am filtering those).
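
For illustration, a minimal sketch of the kind of filter described here, assuming simple substring matching on the User-Agent header is enough (the bot tokens are the ones named above; at this point the Google+ crawler cannot be caught this way because it sends a plain Firefox UA):

    <?php
    // Skip visit counting for known social crawlers.
    // The Google+ crawler slips through: it sends a generic Firefox UA.
    function is_known_social_crawler($userAgent) {
        $botTokens = array('facebookexternalhit', 'LinkedInBot');
        foreach ($botTokens as $token) {
            if (stripos($userAgent, $token) !== false) {
                return true;
            }
        }
        return false;
    }

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (!is_known_social_crawler($ua)) {
        // count this request as a normal visit
    }
    ?>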

Comment #5

Posted on Apr 12, 2012 by Massive Rhino

Issue 209 has been merged into this issue.

Comment #6

Posted on Apr 13, 2012 by Helpful Elephant

Bummer that this was requested almost 2 months ago and it has not been implemented yet. I found a workaround for my case, but it is not ideal.

Comment #7

Posted on Apr 13, 2012 by Swift Bear

I think you're right; 2 months is too long for such an easy fix. What is your workaround?

Comment #8

Posted on Apr 13, 2012 by Helpful Elephant

I use the Google-provided #!ajax workaround (the AJAX crawling scheme) for +1 button URLs. I can then use any server-side software to handle the request.

The only drawback is that g+ has the full URL in the shared item (i.e. //domain.com/seo/url#!ajax instead of //domain.com/seo/url).

This in and of itself is likely a "bug" as the Google search bot strips out the #!ajax before adding the URL to the index. I will submit it as one to see if they can address it too.
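
For context, a minimal sketch of how this workaround is detected server-side, assuming the crawler follows Google's AJAX crawling scheme and requests the #! fragment as an _escaped_fragment_ query parameter:

    <?php
    // Under the AJAX crawling scheme, a crawler fetching
    //   //domain.com/seo/url#!ajax
    // instead requests
    //   //domain.com/seo/url?_escaped_fragment_=ajax
    // which makes the fragment visible to the server.
    if (isset($_GET['_escaped_fragment_'])) {
        // Serve the crawler-friendly version of the page
        // (e.g. skip the cookie wall or sign-up screen).
    }
    ?>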

Comment #9

Posted on Apr 17, 2012 by Quick Ox

Comment deleted

Comment #10

Posted on Apr 17, 2012 by Quick Ox

Comment deleted

Comment #11

Posted on Apr 17, 2012 by Quick Ox

Comment deleted

Comment #12

Posted on Apr 17, 2012 by Quick Ox

This also affects pages on sites where some sort of user action is required, such as entering a zip code/state/etc. before you are able to view a product page that is tied to a specific region/product/etc. Or the SSL use case mentioned by Kym above.

It seems this would be a very easy change in just about any programming language I can think of. Please, Google, can we haz useragentz? :) Here's a couple of examples, in Java and PHP, of how the bot could set its own user agent:

Java:

    java.net.URLConnection c = url.openConnection();
    c.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; GooglePlusBot/20120417)");

PHP (assuming an HTTP client object such as Zend_Http_Client):

    <?php $client->setHeaders(array('User-Agent' => 'Mozilla/5.0 (compatible; GooglePlusBot/20120417)')); ?>

And even if the Google+ bot is just some script driving Firefox... the user agent can be overridden in about:config via general.useragent.override.

Comment #13

Posted on May 17, 2012 by Grumpy Kangaroo

There's some discussion of this going on over on StackOverflow:

http://stackoverflow.com/questions/10538919/how-can-i-identify-web-requests-created-when-someone-links-to-my-site-from-googl/10539080#comment13774178_10539080

Suffice it to say, a reliable way of identifying the Google+ bot would be extremely useful to us...

Comment #14

Posted on May 21, 2012 by Quick Panda

A site I administer has an age-check screen that uses bot detection to allow bots to crawl the site while bypassing the age check. Not being able to detect Google+ as a bot means that each time a user shares a link from the site on Google+ (or uses a social plugin to +1 a link), the share gets the title, image, description, and URL of the age-check screen instead.

Comment #15

Posted on Jul 5, 2012 by Swift Ox

In the Netherlands we have to ask every visitor for permission BEFORE we place cookies. This means users who have not agreed to receive cookies see a splash screen. Since the Google+ crawler does not carry such a cookie, it gets that splash screen no matter which page the user pressed the Google+ button on, and since the crawler doesn't have an identifiable user agent, it is impossible for us to get the button to work properly. Facebook's button works perfectly fine. If this is not resolved on short notice, we will have no choice but to remove the Google+ and +1 buttons and functionality from our site.

Comment #16

Posted on Jul 20, 2012 by Swift Rhino

You can look for

X-Goog-Source:LP_

to know whether the request comes from the Google crawler.
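
A minimal PHP sketch of that check, assuming the header arrives as described above (the LP_ value prefix is taken from this comment; the header is undocumented and could change at any time):

    <?php
    // Undocumented heuristic from this thread: the Google+ crawler sends
    // an X-Goog-Source header whose value reportedly starts with "LP_".
    $source = isset($_SERVER['HTTP_X_GOOG_SOURCE']) ? $_SERVER['HTTP_X_GOOG_SOURCE'] : '';
    $isGooglePlusCrawler = (strpos($source, 'LP_') === 0);

    if ($isGooglePlusCrawler) {
        // e.g. skip the cookie-consent splash so the shared snippet is correct
    }
    ?>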

Comment #17

Posted on Jul 20, 2012 by Swift Rabbit

Still showing the same user agent:

"Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0"

Comment #18

Posted on Jul 28, 2012 by Happy Camel

I cannot use the Google+ button without a fixed user agent for it, as my pages are only visible to people who sign up.

Please implement a fixed user agent, Google.

Comment #19

Posted on Jul 29, 2012 by Swift Ox

24 days after my comment above (#15), still nothing has changed. The snippet the Google+ button generates is the cookie-permission text, because we are unable to grant the crawler permission to skip the cookie-permission page. We at fok.nl will therefore be removing all Google+ functionality from our site within the next week. Google's arrogance in this is a shame, but I'm sure Facebook won't mind.

Comment #20

Posted on Jul 30, 2012 by Helpful Cat

Hello all. Shall we use "HTTP_X_GOOG_SOURCE" to identify the Google+ bot? It is the only notable difference I found while inspecting the $_SERVER array (in PHP).

Comment #21

Posted on Nov 14, 2012 by Happy Bear

It's been 9 months now and there's still no change on this? Using HTTP_X_GOOG_SOURCE works, but it feels like a hack and could break at any time. Even one of Google's own testing tools sends "Googlebot-richsnippets"; why not bring this in line?

Comment #22

Posted on Nov 14, 2012 by Quick Bird

Comment deleted

Comment #23

Posted on Dec 21, 2012 by Swift Camel

Comment deleted

Comment #24

Posted on Jan 28, 2013 by Happy Rabbit

@drood: I noticed at fok.nl you have fixed this issue, since Google+ sharing now shows the correct information. Could you share your solution with us? Thanks in advance.

Comment #25

Posted on Jan 28, 2013 by Swift Ox

@fr...: we used the solution from post #20 and check for $_SERVER['HTTP_X_GOOG_SOURCE']. We also whitelisted a known Google IP range.
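
A minimal sketch of such an IP-range whitelist, assuming a CIDR containment check; the range used below is purely illustrative, not the one fok.nl actually whitelisted:

    <?php
    // Hypothetical example: check whether the client IP falls inside a
    // whitelisted CIDR range. 66.249.80.0/20 is illustrative only, not a
    // range confirmed anywhere in this thread.
    function ip_in_cidr($ip, $cidr) {
        list($subnet, $bits) = explode('/', $cidr);
        $mask = -1 << (32 - (int)$bits);
        return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
    }

    $isWhitelisted = ip_in_cidr($_SERVER['REMOTE_ADDR'], '66.249.80.0/20');
    ?>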

Comment #26

Posted on Feb 4, 2013 by Quick Horse

In case someone needs a copy-pastable ASP.NET version:

    bool isGooglePlusCrawler = HttpContext.Current.Request.ServerVariables["HTTP_X_GOOG_SOURCE"] != null;

Comment #27

Posted on Feb 4, 2013 by Helpful Horse

I agree that it should use a specific user agent. The Facebook crawler does, and it would make sense to use something like "Googlebot-richsnippets", as suggested in comment #21.

Comment #28

Posted on Feb 5, 2013 by Grumpy Bird

What exactly is this HTTP_X_GOOG_SOURCE thing? How would I detect that in haproxy?

Comment #29

Posted on Feb 6, 2013 by Happy Elephant

@elyog...: for Haproxy use "hdr*" matching criteria in your ACLs. See http://cbonte.github.com/haproxy-dconv/configuration-1.5.html#7-hdr
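
A minimal HAProxy sketch of such an ACL, assuming the X-Goog-Source header behaves as described in comment #16 (the backend name is hypothetical, and the header itself is undocumented):

    # Match requests whose X-Goog-Source header value begins with "LP_"
    # and send them to a dedicated backend.
    acl is_gplus_crawler hdr_beg(X-Goog-Source) LP_
    use_backend crawler_pool if is_gplus_crawler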

Comment #30

Posted on Feb 27, 2013 by Massive Rabbit

I've found a new user agent: 'Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google'. Maybe we can rely on 'Google'?

Comment #31

Posted on Feb 27, 2013 by Swift Cat

I was relying on the HTTP_X_GOOG_SOURCE header to identify the crawler; however, I think it stopped working (I noticed that yesterday; maybe a recent change?).

So my question is the same as in the previous comment: can I rely on the user agent 'Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google'? I just noticed that it includes 'Google' at the end, which looks like a recent change as well.

Comment #32

Posted on Feb 27, 2013 by Happy Giraffe

This is not a supported feature of the API, so no, you can't rely on the user agent. It could change at any time.

Comment #33

Posted on Feb 28, 2013 by Swift Bear

WTF!? Too bad. The very poor solution we had, using HTTP_X_GOOG_SOURCE, is gone. And you, engineers of Google, what the hell are you thinking!? You, along with a few other companies, defined the behavior and netiquette of bots, and now it seems you are not able to implement the most basic rule: SEND YOUR OWN USER AGENT!!! We opened this issue 1 YEAR AGO!! After SIX MONTHS, someone came up with a solution (thanks a lot, btw) that uses a strange header coming with the request from Google+. It was not the best solution, but it was a solution, and we put it in our code. And now even this poor solution is gone, and so is the Google+ integration on our site. Thanks for nothing, Google!

Comment #34

Posted on Mar 1, 2013 by Grumpy Kangaroo

ugh, +1. It's hard to justify continued use of the +1 button after this.

Comment #35

Posted on Mar 1, 2013 by Happy Rabbit

@drood: I noticed the G+ button on Fok.nl continues to work. Can you tell me which Google IP range you whitelist? We added a whole bunch of them, which worked fine, but since a couple of days ago we are again sharing the cookie page instead of the article. So I suspect you use a broader range. Thanks in advance.

Comment #36

Posted on Mar 5, 2013 by Grumpy Panda

Feedback acknowledged.

To make filtering requests easier, the User-Agent will soon contain a link to the snippets help page. If you have custom rules for processing User-Agents, please change them to recognize this suffix:

Google (+https://developers.google.com/+/web/snippet/)

Thank you - more details to come.
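
A minimal PHP sketch matching that suffix, assuming the literal string above appears verbatim in the User-Agent header:

    <?php
    // Detect the Google+ snippet fetcher by the announced UA suffix.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $isGooglePlusCrawler =
        (strpos($ua, 'Google (+https://developers.google.com/+/web/snippet/)') !== false);
    ?>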

Comment #37

Posted on Mar 5, 2013 by Swift Ox

Only took just over a year as well. Kudos.

Comment #38

Posted on Mar 5, 2013 by Grumpy Panda

The User-Agent change is now live. Documentation changes are coming soon, as well as DNS PTR records for outbound IPs.

Comment #39

Posted on Mar 25, 2013 by Grumpy Panda

Reverse DNS for all fetch IPs is now available.

% host 66.249.80.100
100.80.249.66.in-addr.arpa domain name pointer google-proxy-66-249-80-100.google.com.
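
With PTR records in place, a minimal sketch of forward-confirmed reverse DNS, assuming (as the record above suggests) the fetcher hostnames end in google.com:

    <?php
    // Forward-confirmed reverse DNS: resolve the client IP to a hostname,
    // check the google.com suffix, then resolve the hostname back and
    // confirm it matches the original IP (guards against spoofed PTRs).
    function is_google_fetcher($ip) {
        $host = gethostbyaddr($ip);
        if ($host === false || $host === $ip) {
            return false;   // lookup failed
        }
        if (!preg_match('/\.google\.com$/', $host)) {
            return false;   // PTR does not point at google.com
        }
        return gethostbyname($host) === $ip;
    }

    $isGoogleFetcher = is_google_fetcher($_SERVER['REMOTE_ADDR']);
    ?>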

Status: Fixed

Labels:
Type-Enhancement Priority-Medium Component-Plugins