Issue 165: pyodbc / unixODBC problem reading nvarchar
What steps will reproduce the problem?
1. On a custom-built Linux system (kernel 2.6.9), build pyodbc from source using Python 2.6 and unixODBC 2.3.0.
2. In a Netezza database, set up a table with one column of datatype nvarchar and one column of datatype varchar.
3. Insert one row of values, say ('A', 'B'), into the table.
4. Run a SQL query using pyodbc: select * from test_table;
The test script looks like this:
    import pyodbc

    def main():
        conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=server.example.com;DATABASE=mydatabase;UID=myusername;PWD=password", ansi=True)
        sql = 'select * from MY_TABLE'
        cur = conn.cursor()
        cur.execute(sql)
        row = cur.fetchone()
        print row

    if __name__ == "__main__":
        main()
What is the expected output? What do you see instead?
I expect to see (u'A', 'B') or ('A', 'B'), but I actually see (u'', 'B').
If I insert more characters, say ('AAAA', 'BBBB'), then I get this when querying with pyodbc:
(u'\u4141\u4141', 'BBBB')
Meanwhile, I get the expected ('AAAA', 'BBBB') using either isql or nzsql, or even pyodbc on Windows.
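As an aside, the odd u'\u4141\u4141' value is exactly what appears when the four UTF-8 bytes of 'AAAA' are misinterpreted as UTF-16 code units. A minimal standalone Python sketch (no pyodbc or database required) reproduces the symptom:

```python
# 'AAAA' encodes to four bytes (41 41 41 41) in UTF-8. Decoding those
# same bytes as little-endian UTF-16 pairs them into two 16-bit code
# units of 0x4141, i.e. two U+4141 characters.
utf8_bytes = u'AAAA'.encode('utf-8')        # b'AAAA' -> 41 41 41 41
misread = utf8_bytes.decode('utf-16-le')    # bytes paired into 0x4141 units
print(misread == u'\u4141\u4141')           # True
```

This strongly suggests the driver is handing back UTF-8 bytes through a buffer that the client decodes as UTF-16.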
What version of the product are you using? On what operating system?
Linux 2.6.9 64 bit, custom built
python 2.6
pyodbc 2.1.8 (custom built)
unixODBC 2.3.0
Netezza odbc driver 64 bit
Please provide any additional information below.
I haven't tried this with other database types, as I don't have access to one at the moment.
On the other hand, I can run the same query with isql (a binary tool that ships with unixODBC) and nzsql (a binary tool that ships with the Netezza database) and see the expected output.
Thanks
Mar 21, 2011
Hey, you got it exactly right. After I changed the setting to UnicodeTranslationOption = utf16, nvarchar data does show up correctly. However, would this cause any issues with other types of data that might actually need to be translated using utf8 instead of utf16? Thanks
Mar 21, 2011
I do not expect you'd have problems with other data types. My guess is that this option attempts to cater for non-unicode-aware clients, but if you're only using pyodbc as a client then setting this to utf16 should be fine.
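For reference, the setting being discussed lives in the driver's section of odbcinst.ini. A sketch of what such an entry might look like (the driver path shown here is an assumption for illustration, not taken from an actual Netezza install):

```ini
[NetezzaSQL]
; Path is illustrative; use the location of your installed driver.
Driver                   = /usr/local/nz/lib64/libnzodbc.so
; Controls how the driver encodes wide-character (SQL_C_WCHAR) data;
; "utf16" matches what unixODBC and pyodbc expect for those buffers.
UnicodeTranslationOption = utf16
```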
May 20, 2011
(No comment was entered for this change.)
Status: NoFix
Jan 19, 2012
I just looked at this issue more closely. The Netezza ODBC driver returns SQL_WVARCHAR regardless of whether it is using UTF-8 or UTF-16 encoding, and regardless of whether the result is Unicode.

pyodbc actually does support receiving Unicode through UTF-8; however, the switch statement in GetDataString() in getdata.cpp that handles this only checks the 'unicode_results' flag when the driver reports SQL_CHAR, SQL_VARCHAR, SQL_LONGVARCHAR, SQL_GUID, or SQL_SS_XML. If I move SQL_WVARCHAR up into that section of the switch statement, this bug goes away. However, that whole section only applies for Python < 3.0 (not sure why that is; does Python 3.0 somehow figure out Unicode on its own?).

Anyway, attached is my patched getdata.cpp, which fixes this problem. The most important question, however, is: does this work for all ODBC drivers?
Jan 19, 2012
The fundamental problem is that ODBC does *not* support UTF-8 or UTF-16. The specification only allows for 1-byte ASCII/latin1 and 2-byte UCS-2. The UTF encodings are variable width, with a minimum code-unit size of 8 or 16 bits; a single character can take up to 4 bytes. I don't know why driver writers can't understand this. Their job is to convert whatever the database sends into ASCII or UCS-2.

Client software, such as pyodbc, simply cannot guess what encoding is being used -- instead, CHAR vs. WCHAR is how the driver tells the client what to expect. The unicode_results flag is used to have pyodbc convert actual ANSI results to Unicode before returning them to the user, not to expect Unicode when the driver says the data is ANSI.

I have considered adding a separate setting for an encoding such as UTF-8, in particular so that MySQL users can also tell pyodbc that their driver is broken and will lie about the result type. That may be the fix you need. In the meantime, if you can configure the driver to use UCS-2, it should work. Is that possible?
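The variable-width point above can be checked directly in Python; a small standard-library sketch comparing the byte lengths of a few sample characters:

```python
# UTF-8 and UTF-16 are variable-width encodings: a single character
# occupies 1-4 bytes in UTF-8 and 2 or 4 bytes in UTF-16 (surrogate
# pairs), unlike fixed-width 2-byte UCS-2.
samples = {
    u'A': (1, 2),           # ASCII letter: 1 byte UTF-8, 2 bytes UTF-16
    u'\u00e9': (2, 2),      # e-acute: 2 bytes UTF-8, 2 bytes UTF-16
    u'\U0001F600': (4, 4),  # emoji: 4 bytes UTF-8, surrogate pair in UTF-16
}
for ch, expected in samples.items():
    actual = (len(ch.encode('utf-8')), len(ch.encode('utf-16-le')))
    assert actual == expected
print('ok')
```

Because the widths vary per character, a driver cannot simply hand UTF-8 bytes to a client that expects fixed-width wide-character data, which is exactly the mismatch seen in this issue.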
Jun 27, 2014
"Change UnicodeTranslationOption to utf16 in odbcinst.ini" is actionable. "Configure the driver to use UCS2" is more of a vision statement. I'm a simple programmer! I'll see if we can get IBM to fix the driver, but is pyodbc fixing it? Has a fix been released?
The string u'\u4141\u4141' indicates that a UTF-8 string is being received when UTF-16 is expected:

    >>> u'AAAA'.encode('UTF-8').decode('UTF-16')
    u'\u4141\u4141'

This would happen in the following scenario:
1) The Netezza ODBC driver reports the column's SQL type to be "wvarchar" (represented by the ODBC constant SQL_WVARCHAR).
2) pyodbc then requests to read the column data as the C language type "wchar_t*" (represented by the ODBC constant SQL_C_WCHAR).
3) The Netezza ODBC driver then returns UTF-8 data (wrong! wchar_t* in C is UTF-16 or UCS4 depending on compiler/platform, never UTF-8).

While step 3 should be regarded as a bug in the Netezza ODBC driver, I wonder if this is configurable behaviour. Looking at Netezza's sample odbcinst.ini shown in this forum post: http://www.netezzacommunity.com/message/5801 ... I see the following option:

    UnicodeTranslationOption = utf8

Do you have a setting like this? I'd suggest changing this to "utf16".