Ticket #37 (closed defect: wontfix)

Opened 23 months ago

Last modified 22 months ago

UTF-8 outside BMP doesn't work in tables

Reported by: julian Owned by: sheep
Priority: Normal Milestone: 1.4.0
Component: Hatta Wiki Version: 1.3.3dev
Keywords: Cc:

Description

If a UTF-8 character outside of the BMP (code point above U+FFFF) is used in a table, then it is not displayed correctly. If the same UTF-8 character is used outside of a table, such as in a paragraph, then it is displayed correctly.

This appears to be a bad interaction with UTF-16, at least in my version of Python on Mac OS X 10.5.8, Python 2.5.1 (r251:54863, Feb 6 2009, 19:02:12). Although my browser displays the Unicode character as ��, the actual bytes output appear to be the UTF-16 surrogate pair representation. I didn't look into this further.
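
For reference, the surrogate pair mentioned above can be computed by hand. The following is only a minimal sketch of the standard UTF-16 encoding arithmetic, not code from Hatta; the function name `to_surrogate_pair` and the sample code point U+1D11E are illustrative assumptions.

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF.

    Hypothetical helper for illustration; it is not part of Hatta.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) is an arbitrary example outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
print("U+%04X U+%04X" % (high, low))  # prints "U+D834 U+DD1E"
```

On a narrow (UCS-2) Python 2 build, a unicode string holding such a character has length 2 and stores these two surrogate code units, so code that processes a string one element at a time may split the pair.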

Using hg bisect, this problem first appeared in this changeset:

The first bad revision is:
changeset: 621:633ab90042f4
user: sheep@ghostwheel
date: Wed Oct 21 22:22:30 2009 +0200
summary: completely overhaul the table parsing to allow links with | in them

Change History

comment:1 Changed 22 months ago by sheep

  • Status changed from new to accepted
  • Milestone set to 1.4.0

Can you check what this piece of Python code returns for you?

import sys
sys.maxunicode

If it's not "1114111", then the solution is to recompile your Python interpreter with full Unicode support.

comment:2 Changed 22 months ago by julian

Here you go:

Python 2.5.1 (r251:54863, Feb  6 2009, 19:02:12) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535

This is the stock Python interpreter bundled in Mac OS X version 10.5.8. Are there any workarounds?

comment:3 follow-up: ↓ 4 Changed 22 months ago by sheep

  • Status changed from accepted to closed
  • Resolution set to wontfix

Well, the obvious solution would be to compile your own Python interpreter with full Unicode support. If the stock one is so broken, I would expect to find a lot of tutorials on how to do it.

Another solution that comes to mind would be to prepare a Mac bundle with the right interpreter in it, but unfortunately I'm not familiar enough with Macs to do it. Volunteers welcome :)

comment:4 in reply to: ↑ 3 Changed 22 months ago by julian

OK, I understand your decision. It sucks that after nearly 10 years of Unicode support in Python, the compilation still defaults to --enable-unicode=ucs2 (per PEP 261) even though on the Mac platform sizeof(wchar_t) == 4. This causes problems for the uninitiated (me), but that's a discussion for another time or space.

Would it be possible to put in a run-time check for a wide Python build then, at least with a warning?
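
A run-time check along these lines might look like the sketch below. It relies only on `sys.maxunicode` and the standard `warnings` module; the names `is_wide_build` and `warn_if_narrow_build` and the message wording are hypothetical and do not reflect anything Hatta actually ships.

```python
import sys
import warnings

# Value of sys.maxunicode on a wide (UCS-4) build; narrow builds report 65535.
WIDE_MAXUNICODE = 0x10FFFF  # 1114111


def is_wide_build(maxunicode=None):
    """Return True if the interpreter can hold any code point in one element.

    The optional argument exists only so the check can be exercised with
    other values; by default the real sys.maxunicode is used.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return maxunicode >= WIDE_MAXUNICODE


def warn_if_narrow_build(maxunicode=None):
    """Emit a RuntimeWarning on a narrow build; return True if one was issued."""
    if is_wide_build(maxunicode):
        return False
    warnings.warn(
        "This Python was built with narrow (UCS-2) Unicode support; "
        "characters above U+FFFF may not be displayed correctly.",
        RuntimeWarning,
    )
    return True


# Hypothetical call site, e.g. once at application start-up.
warn_if_narrow_build()
```

A warning rather than a hard failure seems to match the request here, since text that stays inside the BMP is unaffected on a narrow build.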
