Issue 57: TouchXML: UTF8 string with non-latin chars appears wrong after parsing
What steps will reproduce the problem?
1. Init a CXMLDocument with a UTF-8 string containing non-Latin characters. I tried to parse a Russian website this way. The website uses CP1251 encoding, so I converted the NSData holding the HTML page to an NSString using initWithData:encoding:. The string looked fine (I saw Russian chars as Russian chars in Xcode).
2. Query the CXMLDocument using XPath.

What is the expected output? What do you see instead?
The nodes come out with the Russian chars unrecognizable, i.e. they look like ÖÑÊÀ âûøåë â ôèíàë instead.

What version of the product are you using? On what operating system?
The latest as of April 28, 2009, whatever it was.

Please provide any additional information below.
I guess this has something to do with converting the NSString to char* using UTF8String in the CXMLDocument initWithXMLString:options:error: method. Somehow, in the end, the correct encoding is lost. NSXMLDocument has no such problem: Russian still looks Russian after being parsed.
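For anyone trying to recognise this kind of garbling: the sample output in the report is what CP1251 bytes look like when each byte is read as a Latin-1 character. The sketch below uses Python purely for illustration (it is not TouchXML code) and only reverses that misreading on the sample string. Where in TouchXML or libxml2 the bytes get misread is not established by this snippet.

```python
# The garbled sample from the report, copied verbatim.
garbled = "ÖÑÊÀ âûøåë â ôèíàë"

# Undo the misreading: turn each character back into its Latin-1 byte,
# then decode those bytes as CP1251 (the site's real encoding).
raw_bytes = garbled.encode("latin-1")
recovered = raw_bytes.decode("cp1251")
print(recovered)  # readable Russian text

# Going the other way reproduces the garbled text exactly.
assert recovered.encode("cp1251").decode("latin-1") == garbled
```

If the recovered text reads as proper Russian, the original bytes were intact and only the choice of decoder was wrong.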
May 1, 2009
Project Member
#1
jwight
May 10, 2009
Yury, John, attached you'll find unit tests that should help reproduce the problem.

Yury, the tests show that the problem only happens with CXMLDocument's initWithData:options:error:. For me, initWithXMLString:options:error: works fine, contrary to what your report describes.

I went on to check NSXMLDocument. Its initWithData:options:error: doesn't parse data that isn't proper UTF-8, but it does fine when using initWithXMLString:options:error:, much (1) like the current CXMLDocument implementation. I'm not seeing any differences here, so this could be a "won't fix" in order to keep CXMLDocument 1:1 compatible with NSXMLDocument's API. One workaround is to convert the NSData to an NSString using the appropriate encoding and then work from there.

Nevertheless, as a proof of concept, I attach a patch to CXMLDocument that accepts an encoding on its data initialiser in order to correctly parse NSData with encodings other than UTF-8. The patch is backward compatible.

(1) On encoding errors, the current CXMLDocument actually goes on with the parsing and returns a document, omitting the encoding error, whereas NSXMLDocument returns a nil document and an error. This is a subject for another issue, though.
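The workaround mentioned above (decode the data with its real encoding first, then give the parser UTF-8) can be sketched as follows. This is Python with the standard library's xml.etree.ElementTree standing in for CXMLDocument, so it only illustrates the idea and says nothing about TouchXML's actual behaviour. The sample fragment is hypothetical and deliberately has no XML declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment stored as CP1251 bytes, with no XML declaration.
cp1251_data = "<title>ЦСКА вышел в финал</title>".encode("cp1251")

# Parsing the raw bytes fails: without a declaration the parser assumes UTF-8,
# and these bytes are not valid UTF-8.
try:
    ET.fromstring(cp1251_data)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False

# Workaround: decode with the known encoding, re-encode as UTF-8, then parse.
text = cp1251_data.decode("cp1251")
root = ET.fromstring(text.encode("utf-8"))
print(parsed_raw, root.text)
```

The decode step needs the encoding to be known from somewhere else (an HTTP header, a meta tag, or prior knowledge of the site), which is the same information an encoding-aware data initialiser would need.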
May 13, 2009
Jorge, thanks for the unit tests and new method for CXMLDocument. Hopefully people will find the new method handy - I've accepted the patch and it is in the repository now. Closing this bug as fixed. Yury, please try the new API.
Status:
Fixed
May 13, 2009
Oh, and if you want to become a project committer, Jorge, let me know. Really happy to add committers who write unit tests :-)
Jun 5, 2009
touchJSON has the same issue. Looking through the code, I can't find a place to make a similar modification, as everything is using NSUTF8StringEncoding.
Jun 5, 2009
Never mind :) I was using it wrong. I was using NSString stringWithContentsOfURL, then converting to NSData with UTF-8 encoding, then feeding it to the parser, which... "double decodes" the UTF-8?
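One way such a "double" conversion can arise, sketched in Python: this assumes the fetched bytes were UTF-8 but were turned into a string with a single-byte encoding before being re-encoded as UTF-8. Whether that is what happened in the comment above is a guess, and Latin-1 is only a stand-in for whichever encoding was picked.

```python
original = "финал"
utf8_bytes = original.encode("utf-8")  # what the server sent

# Step 1 (the mistake): the UTF-8 bytes are decoded with the wrong encoding.
misread = utf8_bytes.decode("latin-1")

# Step 2: the misread string is encoded as UTF-8 again before parsing.
double_encoded = misread.encode("utf-8")

# A UTF-8 parser accepts this input, but it decodes to the misread text.
seen_by_parser = double_encoded.decode("utf-8")
print(len(utf8_bytes), len(double_encoded))

# Reversing both steps recovers the original text.
repaired = seen_by_parser.encode("latin-1").decode("utf-8")
```

The parser is not at fault in this scenario: its input is valid UTF-8, it just no longer encodes the intended characters. Handing the parser the bytes as fetched avoids the extra round trip.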