Ticket #37 (closed defect: wontfix)
UTF-8 outside BMP doesn't work in tables
| Reported by: | julian | Owned by: | sheep |
|---|---|---|---|
| Priority: | Normal | Milestone: | 1.4.0 |
| Component: | Hatta Wiki | Version: | 1.3.3dev |
| Keywords: | | Cc: | |
Description
If a UTF-8 character outside of the BMP (code point above U+FFFF) is used in a table, then it is not displayed correctly. If the same UTF-8 character is used outside of a table, such as in a paragraph, then it is displayed correctly.
This appears to be a bad interaction with UTF-16, at least in my version of Python on Mac OS X 10.5.8, Python 2.5.1 (r251:54863, Feb 6 2009, 19:02:12). Although my browser displays the Unicode character as ��, the actual bytes output appear to be the UTF-16 surrogate pair representation. I didn't look into this further.
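For readers unfamiliar with the surrogate pair representation mentioned above, here is a minimal sketch of how a code point above U+FFFF maps to two UTF-16 code units. It is written in Python 3 syntax, whereas the ticket concerns Python 2.5. The helper name `to_surrogate_pair` and the sample code point U+1D11E are illustrative assumptions and are not taken from Hatta.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) is an arbitrary example outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))  # 0xd834 0xdd1e
```

On a narrow (UCS-2) build, a string holding such a character contains these two code units as separate items, so code that walks a string one unit at a time may split the pair. Whether that is what happens in the table parser was not confirmed in this ticket.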
According to hg bisect, this problem first appeared in this changeset:
The first bad revision is:
changeset: 621:633ab90042f4
user: sheep@ghostwheel
date: Wed Oct 21 22:22:30 2009 +0200
summary: completely overhaul the table parsing to allow links with | in them
Change History
comment:2 Changed 22 months ago by julian
Here you go:
Python 2.5.1 (r251:54863, Feb 6 2009, 19:02:12)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
This is the stock Python interpreter bundled in Mac OS X version 10.5.8. Are there any workarounds?
comment:3 follow-up: ↓ 4 Changed 22 months ago by sheep
- Status changed from accepted to closed
- Resolution set to wontfix
Well, the obvious solution would be to compile your own Python interpreter with full Unicode support. If the stock one is so broken, I would expect to find a lot of tutorials on how to do it.
Another solution that comes to mind would be to prepare a Mac bundle with the right interpreter in it, but unfortunately I'm not familiar enough with Macs to do it. Volunteers welcome :)
comment:4 in reply to: ↑ 3 Changed 22 months ago by julian
OK, I understand your decision. It sucks that after nearly 10 years of Unicode support in Python, the compilation still defaults to --enable-unicode=ucs2 (per PEP 261) even though on the Mac platform sizeof(wchar_t) == 4. This causes problems for the uninitiated (me), but that's a discussion for another time or space.
Would it be possible to put in a run-time check for a wide Python build then, at least with a warning?
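A run-time check of this kind could look roughly like the sketch below. It is not Hatta code. The function names `is_wide_build` and `warn_if_narrow_build` and the warning text are hypothetical, and the only interpreter facility relied on is `sys.maxunicode`, the value quoted elsewhere in this ticket.

```python
import sys
import warnings

FULL_UNICODE_MAX = 0x10FFFF  # 1114111, the value reported by a wide build


def is_wide_build(maxunicode=None):
    """Return True if the interpreter can store every code point directly."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return maxunicode >= FULL_UNICODE_MAX


def warn_if_narrow_build(maxunicode=None):
    """Emit a RuntimeWarning on a narrow build; return True if one was emitted."""
    if is_wide_build(maxunicode):
        return False
    warnings.warn(
        "Narrow (UCS-2) Python build detected: characters above U+FFFF "
        "may not be handled correctly.",
        RuntimeWarning,
    )
    return True


# Hypothetical placement: call once at application start-up.
warn_if_narrow_build()
```

The optional `maxunicode` argument exists only so the check can be exercised without a narrow interpreter at hand. On Python 3.3 and later, `sys.maxunicode` is always 1114111, so the warning would only ever fire on older narrow builds such as the one described here.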

Can you check what sys.maxunicode returns in your Python interpreter?
If it's not "1114111", then the solution is to recompile your Python interpreter with full Unicode support.