Issue 61: Unicode issues due to recent LWPs and a simple solution
Ok, so I was talking to people at IPW about the unicode troubles and their
effect on mechanize-using code, notably Test::WWW::Mechanize::Catalyst.
dakkar was kind enough to go through and work out what part of the problem
belonged to Test::WWW::Mechanize::Catalyst and what part belongs to
WWW::Mechanize, and provide a very simple patch to WWW::Mechanize that gets
rid of the bug at that level and will then make it trivial for downstream
users such as TWMC to fix -their- code.
If you'd be kind enough to read the explanation he provided to me, which I
reproduce in full as part of this ticket, and have a look at the patches
involved, I'd hope we can at least open a dialogue about getting this patch
or something similar into the next release of Mechanize.
=== explanation from dakkar from here to end of ticket
==================================================
The problems with ``LWP`` and ``WWW::Mechanize``
==================================================
:author: dakkar@thenautilus.net
:date: 2008-09-25
Prologue
========
A scalar value in `Perl 5` can be one of *three* things:
1) a number
2) a byte string
3) a character string
`Java` has this nice distinction between ``String`` and ``byte[]``,
and all (non-deprecated) conversion functions between them require
that the encoding be specified.
A *character encoding* is a pair of functions mapping characters to
bytes and vice-versa. `ASCII` and the `ISO` 8859 family define the
identity encoding for their character sets: the character identified
by the number "x" is represented by the byte having that same
value. The other encoding we are interested in is ``UTF-8``, which
maps the whole `Unicode` character repertoire (with code points
numbered from 0 to 0x10FFFF) to bytes, using a variable-length code
[UNICODE]_.
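As a minimal sketch of the two encodings just described, the following Perl uses the core ``Encode`` module to turn the same character into bytes under Latin-1 and under ``UTF-8``. The helper name ``hex_bytes`` is purely illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex octets.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $char = "\x{E9}";    # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# ISO 8859-1 is the identity mapping: code point 0xE9 becomes byte 0xE9.
print hex_bytes(encode('ISO-8859-1', $char)), "\n";

# UTF-8 is variable-length: the same character needs two bytes.
print hex_bytes(encode('UTF-8', $char)), "\n";

# Decoding with the matching encoding recovers the original character.
my $round_trip = decode('UTF-8', encode('UTF-8', $char));
print $round_trip eq $char ? "round trip ok\n" : "mismatch\n";
```

The same character has two different byte representations, which is why a byte string is meaningless as text until its encoding is known.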
The ``HTTP`` protocol describes all the messages in terms of "octets"
(which, for all intents and purposes, are bytes). In particular, the
content of the ``HTTP`` message is defined to be a sequence of
arbitrary bytes, whose interpretation is left to whatever software is
using ``HTTP`` to communicate. [HTTP]_
.. note:: The fact that ``MIME`` and the ``HTTP`` `RFC` use the term
"character set" to mean "character encoding" (and even admit it),
does not help clarify the issues. ``HTTP`` uses the term "content
coding" (and the header field ``Content-Encoding``) to mean a
byte-level transformation of the body for transmission purposes
(usually, compression). (``Transfer-Encoding`` means something
completely unrelated, and should be ignored for this discussion).
When the ``Content-Type`` ``HTTP`` header declares the message body to
be of a subtype of ``text`` (e.g. ``text/plain`` or ``text/html``),
without specifying the ``charset`` parameter, the character encoding
is assumed to be the one defined by `ISO` 8859-1 (usually called
"Latin 1").
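The defaulting rule above can be sketched in a few lines of plain Perl. The function ``effective_charset`` is a hypothetical helper written for this explanation, not something provided by ``LWP``, and its parsing of the header value is deliberately simplistic.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of LWP: pick the character encoding that
# applies to a Content-Type header value under the rule described above.
sub effective_charset {
    my ($content_type) = @_;

    # Only text/* bodies are covered by the rule; anything else is bytes.
    return unless $content_type =~ m{^\s*text/}i;

    # An explicit charset parameter wins (the value may be quoted).
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i) {
        return $1;
    }

    # No charset parameter: fall back to Latin-1.
    return 'ISO-8859-1';
}

for my $ct ('text/html', 'text/html; charset=UTF-8', 'image/png') {
    my $charset = effective_charset($ct);
    printf "%-26s => %s\n", $ct, defined $charset ? $charset : '(bytes only)';
}
```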
What ``HTTP::Message`` does
===========================
``HTTP::Message`` exposes 2 methods to access the message body:
``content``
returns and/or sets the raw message body, as it is (or would be) on
the wire; the data returned by this method is obviously a byte
string; the data that this method accepts is *either* a byte string,
or a character string that can be converted to a byte string using
the "native" character encoding and set (either Latin-1 or
``EBCDIC``)
``decoded_content``
returns (it can't be used to set) the message body (which must *not*
be a ``multipart/*``) as interpreted under the ``Content-Encoding``
(that is, this method decompresses the body if necessary);
furthermore, *if* the ``Content-Type`` header declares the body to
be a ``text/*``, the returned value is a character string, obtained
by decoding the bytes of the body according to the ``charset``
parameter of the ``Content-Type`` header value (defaulting to
Latin-1), *unless* ``charset => 'none'`` has been passed as option
to the method. So, the value returned by this method can be *either*
a byte string, or a character string, depending on the value of the
``Content-Type`` header, but it's possible to always get the byte
string.
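To make the difference between the two accessors concrete, here is a small sketch using ``HTTP::Response`` (a subclass of ``HTTP::Message``). It assumes the ``HTTP::Message`` distribution is installed; ``sample_response`` is an illustrative helper, and the body is just a four-character test string.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Illustrative helper: a response whose body is the UTF-8 encoding of the
# four characters c, a, f, U+00E9 (five bytes on the wire).
sub sample_response {
    return HTTP::Response->new(
        200, 'OK',
        [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
        encode('UTF-8', "caf\x{E9}"),
    );
}

my $res = sample_response();

# content: the raw bytes, as on the wire.
printf "content:                %d\n", length($res->content);

# decoded_content: characters, decoded according to the charset parameter.
printf "decoded_content:        %d\n", length($res->decoded_content);

# charset => 'none': any Content-Encoding is still undone, but the
# result stays a byte string.
printf "decoded_content (none): %d\n",
    length($res->decoded_content(charset => 'none'));
```

The lengths differ because the first and third calls count bytes while the second counts characters.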
This is all well and good, and according to the spec; I'd suggest a
little "improvement":
allow ``content`` to accept an arbitrary character string, *if* the
``Content-Type`` header has already been set, and have it
automatically convert the character string into bytes, according to
the value of the ``Content-Type`` header (of course, this can still
``die`` if the given characters can't be represented in the declared
encoding/"charset")
(This is not exactly trivial to implement, but not hard, either)
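A rough sketch of the conversion step this suggestion needs is shown below. ``chars_to_body`` is a hypothetical helper, not an ``HTTP::Message`` method, and it assumes the charset name has already been extracted from the ``Content-Type`` header.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper sketching the proposed behaviour; HTTP::Message has
# no such function. $charset is assumed to come from the Content-Type header.
sub chars_to_body {
    my ($charset, $chars) = @_;

    # FB_CROAK makes encode() die on characters the charset cannot
    # represent; LEAVE_SRC stops it from modifying $chars in place.
    return Encode::encode(
        $charset, $chars,
        Encode::FB_CROAK() | Encode::LEAVE_SRC(),
    );
}

my $euro = "\x{20AC}";    # U+20AC EURO SIGN

# Representable in UTF-8, where it takes three bytes.
print length(chars_to_body('UTF-8', $euro)), "\n";

# Not representable in Latin-1: encode() dies, so report the failure.
my $body = eval { chars_to_body('ISO-8859-1', $euro) };
print defined $body ? "encoded\n" : "cannot encode as ISO-8859-1\n";
```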
What ``WWW::Mechanize`` does
============================
``WWW::Mechanize`` plainly states::
# use charset => 'none' because while we want LWP to handle Content-Encoding for
# the auto-gzipping with Compress::Zlib we don't want it messing with charset
That is, it always wants to handle byte strings.
This is arguably a bug: as long as we are dealing with binary data
(e.g. images), it must, indeed, handle bytes. But ``WWW::Mechanize``
is used to deal with ``HTML`` pages, which are *text*, and text is
usually understood to be comprised of characters.
On the other hand, `Perl 5` tries to be helpful, and will happily
treat any byte string as if it were encoded using the "native"
encoding, and apply character semantics to it [test1]_.
So, whatever ``WWW::Mechanize`` does, as long as the page contains
`ASCII` or `ISO` 8859-1 text, nothing breaks. This *does* break,
however, as soon as the page contains text encoded differently
[test2]_.
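The breakage can be reproduced without any network code. The sketch below is not the ``uni-mech.pl`` test, only an illustration with core modules, and ``code_points`` is a hypothetical helper. It shows what happens when undecoded ``UTF-8`` bytes are combined with a real character string.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The UTF-8 bytes of a single character, U+00E9: 0xC3 0xA9.
my $bytes = encode('UTF-8', "\x{E9}");

# Hypothetical helper: list the code points Perl sees in a string.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# Mixing the undecoded bytes with a character string: Perl reads each byte
# as a Latin-1 character, so one e-acute turns into two wrong characters.
print code_points($bytes . "\x{263A}"), "\n";

# Decoding first gives the intended single character.
print code_points(decode('UTF-8', $bytes) . "\x{263A}"), "\n";
```

With Latin-1 input the two lines would print the same code points, which is why the problem only shows up with other encodings.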
I have a very small patch that lets ``WWW::Mechanize`` handle
arbitrarily-encoded pages, with all tests passing (including my
[test2]_). [patch1]_
Please note that the best way to handle this would be to let the user
of ``WWW::Mechanize`` decide. In fact, right next to the comment quoted
above is this one::
# See docs in HTTP::Message for details. Do we need to expose the options there?
I'm pretty sure the answer is "yes". The usual cycle for
backwards-incompatible changes should be followed:
1) expose the ``charset`` option, document it, and warn that the
default will change from ``none`` to ``auto`` (i.e. it will not be
passed by default to ``decoded_content``) sometime in the future
2) a release or two down the line, change the default
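One possible shape for step 1 is sketched below. ``body_for`` is a hypothetical wrapper, not the actual ``WWW::Mechanize`` interface or the attached patch, and it treats ``auto`` as "pass no ``charset`` option at all", as described in the list above. It assumes the ``HTTP::Message`` distribution is installed.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Hypothetical wrapper, not WWW::Mechanize's actual API: choose how the
# body is returned based on a user-visible "charset" setting.
sub body_for {
    my ($response, $charset) = @_;
    $charset = 'none' unless defined $charset;    # step 1: default stays 'none'

    # 'auto' here means "pass no charset option", so decoded_content
    # follows the Content-Type header.
    return $response->decoded_content if $charset eq 'auto';
    return $response->decoded_content(charset => $charset);
}

my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    encode('UTF-8', "\x{263A}"),    # U+263A as three UTF-8 bytes
);

print length(body_for($res)),         "\n";    # counts bytes
print length(body_for($res, 'auto')), "\n";    # counts characters
```

Step 2 would then amount to changing the fallback value from ``none`` to ``auto``.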
What ``Test::WWW::Mechanize::Catalyst`` does
============================================
Finally we arrive at the point that manifested all the problems::
# For some reason Test::WWW::Mechanize uses $response->content everywhere
# instead of $response->decoded_content;
$response->content( $response->decoded_content );
This line is *wrong*. It takes what could well be a character string,
and tries to stuff it into an attribute that can only accept byte
strings. The fact that it used to work before ``HTTP::Message`` became
stricter was due to a combination of bugs and badly-defined
behaviours.
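To illustrate why a bytes-only attribute has to reject some character strings, here is a sketch using only core Perl. ``store_bytes`` is a hypothetical stand-in for such an attribute and is not ``HTTP::Message``'s own code.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical stand-in for an attribute that only accepts byte strings;
# this is not HTTP::Message's code.
sub store_bytes {
    my ($value) = @_;

    # utf8::downgrade with a true second argument returns false, rather
    # than dying, when the string contains a character above 0xFF.
    utf8::downgrade($value, 1)
        or croak 'content must be bytes';
    return $value;
}

# Characters up to 0xFF fit in one byte each, so this is accepted.
print length(store_bytes("caf\x{E9}")), "\n";

# A character above 0xFF cannot be a byte, so this is rejected.
my $stored = eval { store_bytes("\x{263A}") };
print defined $stored ? "stored\n" : "rejected\n";
```

A decoded page that happens to contain only Latin-1 characters slips through, which matches the observation that the bad line appeared to work for some pages.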
Using my patched ``WWW::Mechanize`` and removing the bad line from
``Test::WWW::Mechanize::Catalyst`` allows all tests to pass again.
References
==========
.. [UNICODE] The Unicode Standard, Version 4.0 ISBN 0-321-18578-1
.. [HTTP] http://www.rfc-editor.org/rfc/rfc2616.txt
.. [test1] see the ``uni-regex.pl`` test file
.. [test2] see the ``uni-mech.pl`` test file
.. [patch1] see the ``WWW-Mechanize-1.34-uniaware.patch`` file
Sep 27, 2008
I have made this change in 1.49_01 which I will put out tonight.
Status: Accepted
Oct 26, 2008
(No comment was entered for this change.)
Status: Fixed