URL Encoding/Decoding is broken in urlrewritefilter #27

GoogleCodeExporter · 2015-07-05T22:25:40Z

urlrewritefilter fails to properly handle encoded URLs such as
http://foo/plus+slash%2fsemicolon%3bquestionmark%3f?first=plus+plus

This is due to several common misconceptions. The previous URL once decoded
is http://foo/plus+slash/semicolon;questionmark??first=plus+plus which is
impossible to reencode correctly since we have just lost the precious
information that "plus+slash/semicolon;questionmark?" used to be a single
path segment, and not several, including one with a path parameter and an
invalid query string.

The first real error lies in using either java.net.URLDecoder or
commons-codec to decode URLs, since "+" in a path segment is a valid
character which should not be decoded to a space, as should be in the query
part. I've included URLEncoder and URLDecoder classes in the patch that
perform valid escaping based on where in the URL we are doing the
encoding/decoding.

The second obstacle is that there should be both ${escape:...} (assuming we
keep it for query parts) and ${escapePath:...} functions available for
substitution, since we may need to be able to decode from a path segment
and encode to a query part (and reverse):

<from>/(.*)?foo=(.*)</from>
<to>/foo/${encodePath:$2}?query=${encode:$1}</to>

I've added these functions (and decode) to my patch.

Actually we also may need <from decode="false">...</from> in many cases,
but I haven't made that change in my patch. We want it only for some rules,
as opposed to all rules as is currently the case.

The last thing my patch does is fix <to> substitution. Because this is not
evaluated as a real language with a lexer/parser there are many errors. For
example it is currently possible to have the following happen:

Suppose you have a variable FOO set to "$1" and the URL being matched "/bar":
<from>/(.*)</from>
<to>/url/%{FOO}</to>
will be substituted first to "/url/$1" then "/url/bar" due to sequential
substitution.

My patch introduces a poor-man's evaluation/substitution technique in the
form of a chain of substitution filters similar to servlet filters, which
means that the output of (backreferences, previous back-references,
variables and functions) will not be substituted again. Also it enables
recursive function evaluation such as "${escape:${unescapePath:$1}}" which
was not previously working.

The tests have been updated and all run except one weird test which I
suspect is not applicable any more.

The only thing I did not properly fix is
"${replace:${replace:foo:bar}:gee}" because the regexp which finds the
semicolons is wrong and will find the first one even though it belongs to
an inner function. If there is a way to write a regexp that matches a
semicolon which isn't withing "${}" then I could fix it, but my expertise
stops there.

This patch applies to 3.2.0 and fixes all URL issues we had with
urlrewritefilter for us. Do not hesitate to contact me for more info.

Thanks for this great product.

Original issue reported on code.google.com by stephane...@gmail.com on 21 Jan 2009 at 11:38

Attachments:

urlrewritefilter-3.2.0.stef1.patch

The text was updated successfully, but these errors were encountered:

GoogleCodeExporter added Type-Defect auto-migrated Priority-Medium labels Jul 5, 2015

GoogleCodeExporter closed this as completed Jul 5, 2015

GoogleCodeExporter mentioned this issue Jul 5, 2015

<to encode="true"> should apply to forwards #38

Closed

arvindsv mentioned this issue Jan 29, 2019

Pipeline parameters misinterpretation gocd/gocd#11707

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

URL Encoding/Decoding is broken in urlrewritefilter #27

URL Encoding/Decoding is broken in urlrewritefilter #27

GoogleCodeExporter commented Jul 5, 2015

URL Encoding/Decoding is broken in urlrewritefilter #27

URL Encoding/Decoding is broken in urlrewritefilter #27

Comments

GoogleCodeExporter commented Jul 5, 2015