Issue 57: TouchXML: UTF8 string with non-latin chars appears wrong after parsing
Status:  Fixed
Owner:  ----
Closed:  May 2009


 
Reported by yurypetr...@gmail.com, May 1, 2009
What steps will reproduce the problem?
1. Init CXMLDocument with a UTF-8 string containing non-Latin characters. I
tried to parse a Russian website this way. The website uses CP1251
encoding, so I converted the NSData holding the HTML page to an NSString using
initWithData:encoding:. The string looked fine (I saw Russian chars as
Russian chars in Xcode).
2. Try to parse the CXMLDocument using XPath. 

What is the expected output? What do you see instead?

The nodes come out with Russian chars unrecognizable, i.e. they look like
ÖÑÊÀ âûøåë â ôèíàë instead.
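
The garbled text above is consistent with CP1251 bytes being read as Latin-1 somewhere along the way. The sketch below reproduces that effect in Python rather than with TouchXML. The Russian sample string is an assumption, obtained by reversing the garbled output, and the Latin-1 misreading is a guess at the mechanism, not something confirmed in this report.

```python
# Assumption: the page bytes are CP1251 and some later step interprets
# them as Latin-1 (ISO-8859-1), one byte per character.
original = "ЦСКА вышел в финал"

cp1251_bytes = original.encode("cp1251")   # bytes as served by the site
garbled = cp1251_bytes.decode("latin-1")   # the wrong interpretation

print(garbled)

# The mix-up is reversible as long as no bytes were dropped.
recovered = garbled.encode("latin-1").decode("cp1251")
print(recovered)
```

Because the round trip is lossless, garbled text of this shape usually points to a wrong encoding label rather than to corrupted data.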

What version of the product are you using? On what operating system?

The latest as of April 28, 2009, whatever it was.

Please provide any additional information below.

I guess this has something to do with converting the NSString to char* using
UTF8String in the CXMLDocument initWithString method. Somehow the correct
encoding is lost along the way. NSXMLDocument has no such problem:
Russian still looks Russian after being parsed.

May 1, 2009
Project Member #1 jwight
Please provide sample code and sample data, see: 
https://code.google.com/p/touchcode/wiki/BugSubmission

Thanks!
May 10, 2009
#2 jpedroso@gmail.com
Yury, John,

Attached you'll find unit tests that should help to reproduce the problem. 

Yury, tests are showing that the problem only happens with CXMLDocument's initWithData:options:error:. For me, 
initWithXMLString:options:error: is working fine, contrary to what your report describes.

I went on to check NSXMLDocument. Its initWithData:options:error: doesn't parse data which isn't proper UTF-8 but 
does fine when using initWithString:options:error:, much(1) like the current CXMLDocument implementation. I don't 
see any differences here, so this could be a won't-fix in order to keep CXMLDocument 1:1 compatible with 
NSXMLDocument's API.

One workaround is to convert the NSData to an NSString using the data's actual encoding and then work from there. Nevertheless, as a proof of concept, I attach a patch to CXMLDocument that accepts an encoding on its data initialiser in 
order to correctly parse NSData with encodings other than UTF-8. The patch is backward compatible. 
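
The workaround (decode the bytes with their real encoding, then parse the resulting string) can be sketched in Python with the standard library's ElementTree. This is only an illustration of the idea, not the attached patch or TouchXML's API. The helper name `parse_with_encoding` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_with_encoding(data: bytes, encoding: str) -> ET.Element:
    """Hypothetical helper: decode bytes with a known encoding, then parse."""
    text = data.decode(encoding)
    return ET.fromstring(text)


# Sample CP1251 document without an XML declaration (made up for this sketch).
sample = "<news><title>ЦСКА вышел в финал</title></news>".encode("cp1251")

# With no declaration the parser assumes UTF-8, so the raw bytes are
# expected to be rejected as malformed.
try:
    ET.fromstring(sample)
    raw_parse_failed = False
except (ET.ParseError, UnicodeError):
    raw_parse_failed = True

# Decoding first with the correct encoding gives the parser clean text.
root = parse_with_encoding(sample, "cp1251")
title = root.find("title").text
print(raw_parse_failed, title)
```

Passing the encoding explicitly keeps the caller in charge of the decision, which is the same idea the patch applies to the data initialiser.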


(1) On encoding errors, the current CXMLDocument actually goes on with the parsing and returns a document without 
reporting the encoding error, whereas NSXMLDocument returns a nil document and an error. This is a subject for another issue, though.


Attachments:
EncodingTests.h (1.3 KB)
EncodingTests.m (5.0 KB)
CXMLDocument_AcceptEncoding.patch (1.7 KB)
May 13, 2009
Project Member #3 jwight
Jorge, thanks for the unit tests and new method for CXMLDocument.

Hopefully people will find the new method handy - I've accepted the patch and it is in 
the repository now.

Closing this bug as fixed. Yury, please try the new API.
Status: Fixed
May 13, 2009
Project Member #4 jwight
Oh, and if you want to become a project committer, Jorge, let me know. Really happy to 
add committers who write unit tests :-)
Jun 5, 2009
#5 sircambr...@gmail.com
touchJSON has the same issue. Looking through the code, I can't find a place to make a similar modification, as 
everything is using NSUTF8StringEncoding
Jun 5, 2009
#6 sircambr...@gmail.com
Never mind :) I was using it wrong. I was using NSString stringWithContentsOfURL, then converting to NSData with 
UTF-8 encoding, then feeding it to the parser, which... "double decodes" the UTF-8?
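
One plausible way a round trip like this goes wrong is double encoding: UTF-8 bytes are first read with a single-byte encoding, and the resulting string is then encoded as UTF-8 again. The Python sketch below shows that effect. The sample word and the Latin-1 misreading are assumptions for illustration; the comment above does not say which encoding was actually applied.

```python
original = "café"

utf8_bytes = original.encode("utf-8")     # what the server sends
misread = utf8_bytes.decode("latin-1")    # assumed wrong decode on load
re_encoded = misread.encode("utf-8")      # "convert to data with UTF-8"

# A UTF-8 parser decodes the bytes correctly, but the text they carry
# is already the garbled string.
seen_by_parser = re_encoded.decode("utf-8")
print(seen_by_parser)
```

Feeding the original bytes straight to the parser, or decoding them once with the correct encoding, avoids the extra step.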
