Issue 248: Inappropriately converting / -> /index.html while mirroring sites with Slurping
Status:  Fixed
Owner:  jmara...@google.com
Closed:  Mar 2011


Project Member Reported by sligocki@google.com, Mar 21, 2011
From jmaessen:

I've been collecting a fresh slurp, since we're now doing more stuff than we were the last time I did so.  But in looking at the logs, I've realized we're running into an odd problem:

When we ask Apache for a URI ending in /, such as http://www.ibm.com/ , Apache sees the URL and says "hey, a directory, I'd better append index.html".  So we end up asking the web for http://www.ibm.com/index.html , which is great except that this page says "302, try http://www.ibm.com/ instead".  As a result we fail to fetch quite a lot of content, because Apache corrupts the URI as it proxies it through the slurper.  I presume (but don't know for sure) that mod_proxy doesn't have the same flaw.  I'm not 100% sure of the mechanics inside Apache that cause this to happen; does anyone know more?
Mar 21, 2011
Project Member #1 sligocki@google.com
Confirmed for trunk build:

$ curl -x localhost:8080 -v http://www.ibm.com/
* About to connect() to proxy localhost port 8080 (#0)
*   Trying ::1... connected
* Connected to localhost (::1) port 8080 (#0)
> GET http://www.ibm.com/ HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
> Host: www.ibm.com
> Accept: */*
> Proxy-Connection: Keep-Alive
> 
< HTTP/1.1 302 Found
< Date: Wed, 16 Mar 2011 20:31:36 GMT
< Server: Apache/2.2.16 (Unix) DAV/2
< Location: http://www.ibm.com/
< Content-Length: 203
< Content-Type: text/html
< 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://www.ibm.com/">here</a>.</p>
</body></html>
* Connection #0 to host localhost left intact
* Closing connection #0

But it is not broken in the latest release:

$ curl -x localhost:80 -v http://www.ibm.com/
* About to connect() to proxy localhost port 80 (#0)
*   Trying ::1... connected
* Connected to localhost (::1) port 80 (#0)
> GET http://www.ibm.com/ HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
> Host: www.ibm.com
> Accept: */*
> Proxy-Connection: Keep-Alive
> 
< HTTP/1.1 302 Found
< Date: Wed, 16 Mar 2011 20:20:07 GMT
< Server: IBM_HTTP_Server
< Location: http://www.ibm.com/us/en/
< Cache-Control: no-cache, must-revalidate
< Pragma: no-cache
< Expires: Mon, 01 Jan 1990 00:00:20 GMT
< Vary: Accept-Encoding
< Content-Length: 209
< Content-Type: text/html
< 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://www.ibm.com/us/en/">here</a>.</p>
</body></html>
* Connection #0 to host localhost left intact
* Closing connection #0

Mar 21, 2011
Project Member #2 morlov...@google.com
This also affects playback of older slurps -- we simply 404 on them.

Mar 23, 2011
Project Member #3 jmara...@google.com
I believe I know why this happens; I'm running into the same situation now.  It's due to this call sequence:
   apache_slurp.cc: SlurpUrl()
   InstawebContext::MakeRequestUrl()
   ap_construct_url()
In my debugger, this occurs for a request with these fields:
     the_request = 0x7c10d0 "GET http://www.vip-chicks.de/ HTTP/1.1", 
     hostname = 0x7601a0 "www.vip-chicks.de", 
     unparsed_uri = 0x7b9550 "/index.html", 
     uri = 0x7b9570 "/index.html", 
     parsed_uri.path = "/index.html"
     main != NULL
Here 'main' points to a request where:
     unparsed_uri = 0x75f4f0 "http://www.vip-chicks.de/", 
     uri = 0x74e980 "/", 
     parsed_uri.path = 0x74e980 "/", 
So I think a good fix is, in MakeRequestUrl, to follow the 'main' pointer until it's null before looking at the 'uri' fields.
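
A minimal sketch of that idea, not the actual patch: FakeRequest is a hypothetical stand-in for the two request fields shown in the debugger output above ('main' and 'unparsed_uri'), and TopLevelRequest and OriginalUri are illustrative names that do not come from the mod_pagespeed source. It only shows the pointer walk: keep following 'main' until it is null, then read the URI from that request instead of from the subrequest.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the two fields of Apache's request record that
// matter here; the real struct has many more members.
struct FakeRequest {
  const FakeRequest* main;   // parent request, or nullptr for the top-level request
  std::string unparsed_uri;  // URI as seen by this (sub)request
};

// Follows the 'main' pointer until it is null and returns the request
// that the client originally sent.
const FakeRequest* TopLevelRequest(const FakeRequest* request) {
  assert(request != nullptr);
  while (request->main != nullptr) {
    request = request->main;
  }
  return request;
}

// Returns the URI of the original request rather than the one that was
// rewritten to /index.html on the subrequest.
std::string OriginalUri(const FakeRequest* request) {
  return TopLevelRequest(request)->unparsed_uri;
}
```

With the values from the debugger session, a subrequest whose unparsed_uri is "/index.html" and whose 'main' points at the request for "http://www.vip-chicks.de/" would yield "http://www.vip-chicks.de/" from OriginalUri.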

Mar 23, 2011
Project Member #4 jmara...@google.com
(No comment was entered for this change.)
Owner: jmara...@google.com
Mar 23, 2011
Project Member #5 jmara...@google.com
fix coming....
Status: Started
Mar 24, 2011
Project Member #6 jmara...@google.com
(No comment was entered for this change.)
Status: Fixed
May 6, 2011
Project Member #7 jmara...@google.com
(No comment was entered for this change.)
Summary: Inappropriately converting / -> /index.html while mirroring sites with Slurping