This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

Disable beaconing for bots #813

Closed
GoogleCodeExporter opened this issue Apr 6, 2015 · 18 comments

Comments

@GoogleCodeExporter

For example, Googlebot goes after '/mod_pagespeed_beacon' links in the
JavaScript that mod_pagespeed (MPS) injects. I realize one can block it in
robots.txt, etc., but it would be just as easy, and preferable, if that
weren't needed at all.

All that has to be done is to replace '/' with a variable, so
'/mod_pagespeed_beacon'

becomes

s+'mod_pagespeed_beacon' 

and Googlebot will leave the links alone. :-)

Original issue reported on code.google.com by webmas...@clubsilver.org on 1 Nov 2013 at 5:59
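
A minimal sketch of the workaround proposed above, assuming the goal is only to keep the literal string '/mod_pagespeed_beacon' out of the page source. The variable names `s` and `beaconUrl` are illustrative and are not taken from mod_pagespeed. Whether a crawler really ignores the concatenated form is the reporter's observation, not documented crawler behavior.

```javascript
// Hypothetical illustration: the leading slash lives in a variable, so the
// contiguous literal '/mod_pagespeed_beacon' never appears in the markup
// that a crawler scans for link-like strings.
var s = '/';
var beaconUrl = s + 'mod_pagespeed_beacon';

// At run time the browser still sees the same path as before.
console.log(beaconUrl);
```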

@GoogleCodeExporter
Author

Are you seeing a specific problem with googlebot following mod_pagespeed_beacon 
links? Also, do you know which filter this is applying to? We use beacons for 
several filters including add_instrumentation, prioritize_critical_css, and 
lazyload_images, it'd be helpful to know which of those filters you have 
enabled on your site.

Original comment by j...@google.com on 7 Nov 2013 at 4:18

@GoogleCodeExporter
Author

[deleted comment]

@GoogleCodeExporter
Author

Yes, I see Googlebot requesting (GET) beacon links and producing 400 errors.

It seems to be a critical css filter. Here is a snippet from any page that is 
not a front page, e.g., https://clubsilver.org/l/silverdaddies-countries.html : 


pagespeed.criticalCssBeaconInit('/mod_pagespeed_beacon','https://clubsilver.org/l/silverdaddies-countries.html','TRAzTRclVP','Rriib57DqH4',pagespeed.selectors);

I would suggest that any MPS code that generates client-side JS, should hide 
'/' in a variable or else--from my experience--Googlebot will chase these 
links. I had to hide text as trivial as 'js/' or else Googlebot keeps 
requesting directory listing that it cannot retrieve.

Original comment by webmas...@clubsilver.org on 7 Nov 2013 at 4:42

@GoogleCodeExporter
Author

Thanks for pinging about this issue. I think a better solution is to just 
disable beacon injection for googlebot. Otherwise we might start having to play 
whack-a-mole with string literals in our JS to stop googlebot from following 
them (it seems pretty aggressive). Plus, as I understand it, Googlebot can execute 
some limited JS (https://twitter.com/mattcutts/status/131425949597179904), and 
we certainly don't want it to try to send back valid beacons either.

Original comment by j...@google.com on 27 Jan 2014 at 3:11

@GoogleCodeExporter
Author

Original comment by j...@google.com on 27 Jan 2014 at 3:14

  • Changed title: Disable beaconing for bots
  • Changed state: Accepted

@GoogleCodeExporter
Author

Well, Googlebot may be able to execute some JS, but for my own scripts, I know 
that for now, replacing '/' with a variable addressed the issue. If you only 
fix it for GB, what will happen if Baidu or Yahoo or Bing bots start following 
JS links as well? Just a thought.

Original comment by webmas...@clubsilver.org on 27 Jan 2014 at 5:18

@GoogleCodeExporter
Author

Just to clarify, it would be disabling not just for googlebot but for all bots 
that we recognize, which includes bing, yahoo, and baidu among others. You can 
view our code for bot detection in 
https://code.google.com/p/modpagespeed/source/browse/trunk/src/pagespeed/kernel/http/bot_checker.gperf

Original comment by j...@google.com on 27 Jan 2014 at 6:07
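
The idea described above, skipping beacon injection for any recognized bot, can be sketched roughly as follows. This is not the mod_pagespeed implementation (which is C++ and driven by bot_checker.gperf). The token list is a short, hypothetical stand-in for the real one, and the function names `isKnownBot` and `shouldInjectBeacon` are invented for illustration.

```javascript
// Hypothetical, abbreviated list of crawler User-Agent tokens. The real
// list in bot_checker.gperf is much longer; these are illustrative only.
const BOT_TOKENS = ['googlebot', 'bingbot', 'baiduspider', 'slurp'];

// Returns true when the User-Agent contains one of the known tokens.
// The match is case-insensitive, so 'GoogleBot' and 'Googlebot/2.1' both hit.
function isKnownBot(userAgent) {
  if (!userAgent) return false;
  const ua = userAgent.toLowerCase();
  return BOT_TOKENS.some((token) => ua.includes(token));
}

// Beacon code is injected only for requests that do not look like a bot.
// In this sketch a missing User-Agent is treated as a regular client.
function shouldInjectBeacon(userAgent) {
  return !isKnownBot(userAgent);
}
```

A substring check like this only catches crawlers that identify themselves, so a bot sending a browser-like User-Agent would still receive instrumented pages.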

@GoogleCodeExporter
Author

This issue was closed by revision r3751.

Original comment by j...@google.com on 30 Jan 2014 at 10:03

  • Changed state: Fixed

@GoogleCodeExporter
Author

X-Mod-Pagespeed:1.8.31.2-3973 does not seem to fix it (assuming no 
configuration changes were required to address specifically the beacon issue). 
This version has been running on the server for at least a week now.

66.249.75.84 - - [18/May/2014:02:56:07 +0000] "GET /mod_pagespeed_beacon?url=https%3A%2F%2Fclubsilver.org%2Fprofiles%2Fusnicencgu HTTP/1.1" 301 251 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 901 187
66.249.75.84 - - [18/May/2014:04:10:23 +0000] "GET /mod_pagespeed_beacon?url=https%3A%2F%2Fclubsilver.org%2Fl%2Fmaturemen-france.html HTTP/1.1" 301 251 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 1161 221
66.249.75.84 - - [18/May/2014:04:10:24 +0000] "GET /mod_pagespeed_beacon HTTP/1.1" 400 226 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 1381 368
66.249.75.84 - - [18/May/2014:05:39:35 +0000] "GET /mod_pagespeed_beacon?url=https%3A%2F%2Fclubsilver.org%2Fprofiles%2Fcaterryt61 HTTP/1.1" 301 251 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 1209 183
66.249.75.84 - - [18/May/2014:10:44:11 +0000] "GET /mod_pagespeed_beacon?url=https%3A%2F%2Fclubsilver.org%2Fl%2Fsilverdaddies-saint%2520thomas.html HTTP/1.1" 301 251 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 955 166
66.249.75.84 - - [18/May/2014:15:17:52 +0000] "GET /mod_pagespeed_beacon?url=https%3A%2F%2Fclubsilver.org%2Fl%2Fsilverdaddies-greater%2520london%2520(england).html HTTP/1.1" 301 251 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 1243 176
66.249.75.84 - - [18/May/2014:18:26:38 +0000] "GET /mod_pagespeed_beacon?url=https%3A%2F%2Fclubsilver.org%2Fprofiles%2Fusrelaxedm HTTP/1.1" 301 251 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 1284 226
66.249.75.84 - - [18/May/2014:18:26:39 +0000] "GET /mod_pagespeed_beacon HTTP/1.1" 400 226 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 1653 373


Original comment by webmas...@clubsilver.org on 18 May 2014 at 10:19
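
To see which page each beacon request in a log like the one above refers to, the `url` query parameter can be decoded. The sketch below uses the standard `URL` class; the helper name `beaconSourcePage` and the placeholder base host are assumptions made for illustration.

```javascript
// Extracts the page URL carried in the `url` query parameter of a beacon
// request target such as '/mod_pagespeed_beacon?url=https%3A%2F%2F...'.
// Returns null for a bare '/mod_pagespeed_beacon' request.
function beaconSourcePage(requestTarget) {
  // The base is a placeholder; URL() needs one to parse a path-only target.
  const parsed = new URL(requestTarget, 'http://placeholder.invalid');
  return parsed.searchParams.get('url');
}

console.log(beaconSourcePage(
  '/mod_pagespeed_beacon?url=https%3A%2F%2Fclubsilver.org%2Fl%2Fmaturemen-france.html'
));
```

Note that `searchParams.get` decodes only one level, so a double-encoded sequence such as `%2520` comes back as `%20` rather than a space.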

@GoogleCodeExporter
Author

OK, I'm reopening so we don't forget to follow up on this.

Original comment by jmaes...@google.com on 19 May 2014 at 1:40

  • Changed state: Accepted

@GoogleCodeExporter
Author

Ah, it looks like we are still enabling add_instrumentation for bots. The 
change I previously made disabled just the critical image and CSS beacons for 
bots. I'll update to disable for add_instrumentation as well.

Original comment by j...@google.com on 6 Jun 2014 at 7:39

@GoogleCodeExporter
Author

This issue was closed by revision r4040.

Original comment by j...@google.com on 17 Jun 2014 at 6:37

  • Changed state: Fixed

@GoogleCodeExporter
Author

This is still an issue. Taken from Google Webmaster Tools (GWT) on 2/11/2015:

URL:
http://irshelpoklahoma.com/mod_pagespeed_beacon

Error details

Last crawled: 1/28/15
First detected: 10/28/14
Googlebot couldn't access this page because the server didn't understand the 
syntax of Googlebot's request.

Original comment by bfr...@gmail.com on 11 Feb 2015 at 7:59

@GoogleCodeExporter
Author

(This is on X-Mod-Pagespeed:1.9.32.3-4448 according to the site headers.)

Original comment by jmaes...@google.com on 12 Feb 2015 at 1:52

@GoogleCodeExporter
Author

Fetching https://irshelpoklahome.com/ with curl and user-agent 'GoogleBot' 
doesn't yield any beaconing code.  I'm going to follow up with folks here who 
might know what's going on.

Original comment by jmaes...@google.com on 12 Feb 2015 at 1:57
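
A rough equivalent of the curl check described above, assuming Node.js 18 or later (which provides a global `fetch` that allows a custom User-Agent header). The helper names `containsBeaconCode` and `fetchAsBot` are hypothetical, and searching for the default beacon path is only a heuristic, since a site may configure a different beacon URL.

```javascript
// Heuristic: treat any mention of the default beacon path as beaconing code.
function containsBeaconCode(html) {
  return html.includes('mod_pagespeed_beacon');
}

// Fetches `url` while presenting `userAgent`, then reports whether the
// returned HTML appears to contain beaconing code.
async function fetchAsBot(url, userAgent) {
  const response = await fetch(url, { headers: { 'User-Agent': userAgent } });
  return containsBeaconCode(await response.text());
}
```

Calling `fetchAsBot` once with a crawler User-Agent and once with a browser User-Agent, then comparing the two results, would show whether the server varies its output by User-Agent.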

@GoogleCodeExporter
Author

Same as Jan, I was not able to repro while fetching with the googlebot user 
agent. Is it possible for you to dig through your access logs and see which 
user agent the bot was using to access /mod_pagespeed_beacon?

Original comment by j...@google.com on 12 Feb 2015 at 2:40
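
One way to pull the requested information out of an access log is sketched below. It assumes the Apache combined log format seen earlier in this thread, possibly with extra trailing fields. The function name `beaconUserAgents` and the regular expression are illustrative, not part of any mod_pagespeed tooling.

```javascript
// Captures method, request target, status, and User-Agent from a
// combined-format access log line; extra trailing fields are ignored.
const LOG_LINE = /"(\S+) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"/;

// Counts how often each User-Agent requested the beacon endpoint.
// Returns a Map from User-Agent string to request count.
function beaconUserAgents(logText) {
  const counts = new Map();
  for (const line of logText.split('\n')) {
    const m = LOG_LINE.exec(line);
    if (!m) continue; // skip lines that are not in the expected format
    const [, , target, , userAgent] = m;
    if (!target.startsWith('/mod_pagespeed_beacon')) continue;
    counts.set(userAgent, (counts.get(userAgent) || 0) + 1);
  }
  return counts;
}
```

Feeding the log file contents to `beaconUserAgents` and printing the resulting map gives the list of User-Agent strings that hit `/mod_pagespeed_beacon`, which is what is being asked for here.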

@azerborodach

It's still not fixed.

@jmaessen
Contributor

It'd be helpful to get the User-Agent string which is receiving instrumented pages.
