Issue 61: Unicode issues due to recent LWPs and a simple solution
Status:  Fixed
Owner:  ----
Closed:  Oct 2008
Type-Defect
Priority-Medium


Reported by m...@shadowcatsystems.co.uk, Sep 27, 2008
Ok, so I was talking to people at IPW about the unicode troubles and their
effect on mechanize-using code, notably Test::WWW::Mechanize::Catalyst.

dakkar was kind enough to go through and work out what part of the problem
belonged to Test::WWW::Mechanize::Catalyst and what part belongs to
WWW::Mechanize, and provide a very simple patch to WWW::Mechanize that gets
rid of the bug at that level and will then make it trivial for downstream
users such as TWMC to fix -their- code.

If you'd be kind enough to read the explanation he provided to me, which I
reproduce in full as part of this ticket, and have a look at the patches
involved, I'd hope we can at least open a dialogue about getting this patch
or something similar into the next release of Mechanize.

=== explanation from dakkar from here to end of ticket

==================================================
 The problems with ``LWP`` and ``WWW::Mechanize``
==================================================
:author: dakkar@thenautilus.net
:date: 2008-09-25

Prologue
========

A scalar value in `Perl 5` can be one of *three* things:

1) a number
2) a byte string
3) a character string

`Java` has this nice distinction between ``String`` and ``byte[]``,
and converting between them is always an explicit call, normally with
the encoding specified.

A *character encoding* is a pair of functions mapping characters to
bytes and vice-versa. `ASCII` and the `ISO` 8859 family define the
identity encoding for their character sets: the character identified
by the number "x" is represented by the byte having that same
value. The other encoding we are interested in is ``UTF-8``, which
maps the whole `Unicode` character repertoire (with code points
numbered from 0 to 0x10FFFF) to bytes, using a variable-length code
[UNICODE]_.
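To make the "pair of functions" idea concrete, here is a minimal sketch using the core ``Encode`` module. The sample word and variable names are only illustrative; the point is that the same four characters become different byte sequences under different encodings, and that decoding only round-trips when the encodings match.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character string: "café", with é written as code point U+00E9.
my $chars = "caf\x{e9}";

# Encoding maps characters to bytes; the result depends on the encoding.
my $latin1_bytes = encode('ISO-8859-1', $chars);   # é becomes the byte 0xE9
my $utf8_bytes   = encode('UTF-8',      $chars);   # é becomes 0xC3 0xA9

printf "characters:    %d\n", length $chars;          # 4
printf "Latin-1 bytes: %d\n", length $latin1_bytes;   # 4
printf "UTF-8 bytes:   %d\n", length $utf8_bytes;     # 5

# Decoding is the inverse mapping, but only with the matching encoding.
my $round_trip = decode('UTF-8',      $utf8_bytes);
my $mismatch   = decode('ISO-8859-1', $utf8_bytes);   # wrong encoding

print "round trip ok\n" if $round_trip eq $chars;
printf "mismatched decode length: %d\n", length $mismatch;   # 5, not 4
```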

The ``HTTP`` protocol describes all the messages in terms of "octets"
(which, for all intents and purposes, are bytes). In particular, the
content of the ``HTTP`` message is defined to be a sequence of
arbitrary bytes, whose interpretation is left to whatever software is
using ``HTTP`` to communicate [HTTP]_.

.. note:: The fact that ``MIME`` and the ``HTTP`` `RFC` use the term
   "character set" to mean "character encoding" (and even admit it),
   does not help clarify the issues. ``HTTP`` uses the term "content
   coding" (and the header field ``Content-Encoding``) to mean a
   byte-level transformation of the body for transmission purposes
   (usually, compression). (``Transfer-Encoding`` means something
   completely unrelated, and should be ignored for this discussion).

When the ``Content-Type`` ``HTTP`` header declares the message body to
be of a subtype of ``text`` (e.g. ``text/plain`` or ``text/html``),
without specifying the ``charset`` parameter, the character encoding
is assumed to be the one defined by `ISO` 8859-1 (usually called
"Latin 1").
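The defaulting rule above can be sketched as a small function. ``charset_for`` is a hypothetical helper written for this illustration, not part of ``LWP``, and its header parsing is deliberately simplistic.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): pick the character encoding for a
# Content-Type header value, using the HTTP/1.1 default described above.
sub charset_for {
    my ($content_type) = @_;

    # Only text/* bodies have a character encoding at all.
    return unless $content_type =~ m{^\s*text/}i;

    # An explicit charset parameter wins; it may be quoted.
    if ($content_type =~ /;\s*charset\s*=\s*"?([^";\s]+)"?/i) {
        return $1;
    }

    # No charset parameter: assume ISO 8859-1.
    return 'ISO-8859-1';
}

for my $type ('text/html', 'text/html; charset=UTF-8', 'image/png') {
    my $charset = charset_for($type);
    printf "%-26s => %s\n", $type, defined $charset ? $charset : '(bytes only)';
}
```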

What ``HTTP::Message`` does
===========================

``HTTP::Message`` exposes two methods to access the message body:

``content``
  returns and/or sets the raw message body, as it is (or would be) on
  the wire; the data returned by this method is obviously a byte
  string; the data that this method accepts is *either* a byte string,
  or a character string that can be converted to a byte string using
  the "native" character encoding and set (either Latin-1 or
  ``EBCDIC``)

``decoded_content``
  returns (it can't be used to set) the message body (which must *not*
  be a ``multipart/*``) as interpreted under the ``Content-Encoding``
  (that is, this method decompresses the body if necessary);
  furthermore, *if* the ``Content-Type`` header declares the body to
  be a ``text/*``, the returned value is a character string, obtained
  by decoding the bytes of the body according to the ``charset``
  parameter of the ``Content-Type`` header value (defaulting to
  Latin-1), *unless* ``charset => 'none'`` has been passed as option
  to the method. So, the value returned by this method can be *either*
  a byte string, or a character string, depending on the value of the
  ``Content-Type`` header, but it's possible to always get the byte
  string.
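A short sketch of the difference between the two methods, assuming a reasonably recent ``HTTP::Message`` is installed (it is not a core module, so the script skips itself otherwise). The response is built by hand with a ``UTF-8`` body purely for illustration.

```perl
use strict;
use warnings;

# HTTP::Response ships with the HTTP-Message distribution, not with core
# Perl, so skip politely when it is missing.
eval { require HTTP::Response; 1 }
    or do { print "HTTP::Response not installed, skipping\n"; exit 0 };

# The body as it travels on the wire: "café" encoded as UTF-8 (5 bytes).
my $wire_bytes = "caf\xc3\xa9";

my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
    $wire_bytes,
);

my $raw       = $response->content;                             # bytes
my $text      = $response->decoded_content;                     # characters
my $undecoded = $response->decoded_content(charset => 'none');  # bytes again

printf "content:                           %d\n", length $raw;
printf "decoded_content:                   %d\n", length $text;
printf "decoded_content(charset => none):  %d\n", length $undecoded;
```

With the explicit ``charset`` parameter the first and third lengths count bytes (5) and the second counts characters (4).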

This is all well and good, and according to the spec; I'd suggest a
little "improvement":

  allow ``content`` to accept an arbitrary character string, *if* the
  ``Content-Type`` header has already been set, and have it
  automatically convert the character string into bytes, according to
  the value of the ``Content-Type`` header (of course, this can still
  ``die`` if the given characters can't be represented in the declared
  encoding/"charset")

(This is not exactly trivial to implement, but not hard, either)
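A rough sketch of what such a conversion could look like, using only the core ``Encode`` module. ``body_bytes_for`` is a hypothetical helper, not an ``HTTP::Message`` method, and the header parsing is simplified.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, not an HTTP::Message method: turn a character string
# into the bytes to store as the body, using the charset declared in the
# Content-Type header value (Latin-1 when none is declared).
sub body_bytes_for {
    my ($content_type, $chars) = @_;

    my ($charset) = $content_type =~ /;\s*charset\s*=\s*"?([^";\s]+)"?/i;
    $charset = 'ISO-8859-1' unless defined $charset;

    # FB_CROAK makes encode() die on characters the charset cannot
    # represent. Work on a copy, since a CHECK value may modify the source.
    my $copy = $chars;
    return encode($charset, $copy, Encode::FB_CROAK);
}

my $snowman = "\x{2603}";   # a character outside Latin-1

my $utf8_body = body_bytes_for('text/plain; charset=UTF-8', $snowman);
printf "UTF-8 body: %d bytes\n", length $utf8_body;   # 3

my $ok = eval { body_bytes_for('text/plain', $snowman); 1 };
print "Latin-1 body rejected\n" unless $ok;
```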

What ``WWW::Mechanize`` does
============================

``WWW::Mechanize`` plainly states::

   # use charset => 'none' because while we want LWP to handle Content-Encoding for
   # the auto-gzipping with Compress::Zlib we don't want it messing with charset

That is, it always wants to handle byte strings.

This is arguably a bug: as long as we are dealing with binary data
(e.g. images), it must, indeed, handle bytes. But ``WWW::Mechanize``
is used to deal with ``HTML`` pages, which are *text*, and text is
usually understood to be comprised of characters.

On the other hand, `Perl 5` tries to be helpful, and will happily
treat any byte string as if it were encoded using the "native"
encoding, and apply character semantics to it [test1]_.

So, whatever ``WWW::Mechanize`` does, as long as the page contains
`ASCII` or `ISO` 8859-1 text, nothing breaks.  Things *do* break,
however, as soon as the page contains text encoded differently
[test2]_.
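The following sketch shows both halves of that observation with a regular expression. It is written for this explanation and is not the attached ``uni-regex.pl`` or ``uni-mech.pl``; the sample strings are assumptions.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $latin1_bytes = "caf\xe9";       # "café" as ISO 8859-1 bytes
my $utf8_bytes   = "caf\xc3\xa9";   # "café" as UTF-8 bytes

# "." matches one character; on a byte string, each byte counts as one.
my $pattern = qr/^caf.$/;

print "Latin-1 bytes match\n"      if     $latin1_bytes =~ $pattern;
print "UTF-8 bytes do not match\n" unless $utf8_bytes   =~ $pattern;
print "decoded UTF-8 matches\n"    if decode('UTF-8', $utf8_bytes) =~ $pattern;

# Mixing the byte string with a real character string treats each byte as
# the Latin-1 character with the same number, which garbles UTF-8 data.
my $mixed = $utf8_bytes . "\x{263A}";
printf "mixed length: %d\n", length $mixed;   # 6: c, a, f, 0xC3, 0xA9, U+263A
```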

I have a very small patch [patch1]_ that lets ``WWW::Mechanize``
handle arbitrarily-encoded pages, with all tests passing (including
my [test2]_).

Please note that the best way to handle this would be to let the user
of ``WWW::Mechanize`` decide. In fact, right next to the above-quoted
comment in the source there is this one::

  # See docs in HTTP::Message for details. Do we need to expose the options there?

I'm pretty sure the answer is "yes". The usual cycle for
backwards-incompatible changes should be followed:

1) expose the ``charset`` option, document it, and warn that the
   default will change from ``none`` to ``auto`` (i.e. it will not be
   passed by default to ``decoded_content``) sometime in the future
2) a release or two down the line, change the default
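As an illustration of step 1, here is a sketch of an accessor that keeps ``none`` as the default while letting callers opt in. ``page_content`` and its arguments are hypothetical stand-ins, not ``WWW::Mechanize``'s API; ``auto`` is used here in the sense given above (decode with the declared charset).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for a Mechanize-style accessor; the names and the
# option handling are illustrative only.
sub page_content {
    my ($body_bytes, $declared_charset, %opt) = @_;

    # Step 1 of the cycle: "none" stays the default, callers may opt in.
    # Step 2 would change this default to "auto".
    my $mode = exists $opt{charset} ? $opt{charset} : 'none';

    return $body_bytes if $mode eq 'none';              # bytes, untouched

    my $charset = $mode eq 'auto' ? $declared_charset : $mode;
    return decode($charset, $body_bytes);               # characters
}

my $body = "caf\xc3\xa9";   # "café" as UTF-8 bytes

printf "default:           %d\n", length page_content($body, 'UTF-8');
printf "charset => 'auto': %d\n",
    length page_content($body, 'UTF-8', charset => 'auto');
```

The default call returns the 5 bytes unchanged; the opt-in call returns 4 characters.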

What ``Test::WWW::Mechanize::Catalyst`` does
============================================

Finally we arrive at the point that manifested all the problems::

     # For some reason Test::WWW::Mechanize uses $response->content everywhere
     # instead of $response->decoded_content;
        $response->content( $response->decoded_content );

This line is *wrong*. It takes what could well be a character string,
and tries to stuff it into an attribute that can only accept byte
strings. The fact that it used to work before ``HTTP::Message`` became
stricter was due to a combination of bugs and badly-defined
behaviours.
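To see why a stricter setter rejects such a value, consider this sketch of a bytes-only attribute. ``set_bytes_only`` is a hypothetical function illustrating the kind of check involved; it is not ``HTTP::Message``'s actual code.

```perl
use strict;
use warnings;

# Hypothetical strict setter, illustrating the kind of check a bytes-only
# attribute can make; this is not HTTP::Message's actual code.
sub set_bytes_only {
    my ($store, $value) = @_;

    # utf8::downgrade with a true second argument returns false, instead
    # of dying, when the string holds a character above 0xFF.
    my $copy = $value;
    utf8::downgrade($copy, 1)
        or die "content must be bytes\n";

    $store->{content} = $copy;
    return;
}

my %message;

set_bytes_only(\%message, "caf\xc3\xa9");   # UTF-8 bytes: accepted
print "bytes accepted\n";

# A decoded page containing a euro sign (U+20AC) is a character string.
my $ok = eval { set_bytes_only(\%message, "price: \x{20AC}10"); 1 };
print "characters rejected: $@" unless $ok;
```

A character string that happens to contain only code points up to 0xFF still passes this check, which is consistent with Latin-1 pages never showing the problem.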

Using my patched ``WWW::Mechanize`` and removing the bad line from
``Test::WWW::Mechanize::Catalyst`` allows all tests to pass again.

References
==========

.. [UNICODE] The Unicode Standard, Version 4.0, ISBN 0-321-18578-1

.. [HTTP] http://www.rfc-editor.org/rfc/rfc2616.txt

.. [test1] see the ``uni-regex.pl`` test file

.. [test2] see the ``uni-mech.pl`` test file

.. [patch1] see the ``WWW-Mechanize-1.34-uniaware.patch`` file

Attachments:
uni-mech.pl (405 bytes)
uni-regex.pl (465 bytes)
WWW-Mechanize-1.34-uniaware.patch (2.3 KB)

Comment 1 by petdance, Sep 27, 2008
I have made this change in 1.49_01 which I will put out tonight.

Status: Accepted
Comment 2 by petdance, Oct 26, 2008
(No comment was entered for this change.)
Status: Fixed