I have the following vexing problem I've been trying to solve for weeks now, to no avail so far.

*WARNING: overly long question. In short:* what I need, in essence, is a system-wide way to define exactly which fonts will be used to display a given unicode codepoint. Ideally, this decision would be made by referring to unicode code blocks, with a way to give fallbacks for missing codepoints and, as a super-plus, to define overrides for single codepoints.
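As a rough illustration of the lookup being asked for (per-block font lists, fallback on missing glyphs, single-codepoint overrides), here is a minimal Python sketch. It is not how ubuntu's font system works; `BLOCK_FONTS`, `DEMO_COVERAGE`, and `pick_font` are hypothetical names, the font-to-block assignment is only an example, and the coverage data is made up for the demonstration.

```python
# Hypothetical per-block preferences: (first, last, block name, fonts in priority order).
# Block ranges follow the Unicode block definitions; the font assignment is illustrative.
BLOCK_FONTS = [
    (0x2E80, 0x2EFF, "CJK Radicals Supplement", ["Sun-ExtA", "BabelStoneHan"]),
    (0x2F00, 0x2FDF, "Kangxi Radicals", ["Sun-ExtA", "BabelStoneHan"]),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs", ["Sun-ExtA", "BabelStoneHan"]),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B", ["Sun-ExtB", "BabelStoneHan"]),
]

# Made-up coverage data (font -> codepoints it has glyphs for), NOT read from real fonts.
# U+534B is deliberately removed from the first font to show the fallback step.
DEMO_COVERAGE = {
    "Sun-ExtA": (set(range(0x2E80, 0x2FE0)) | set(range(0x4E00, 0xA000))) - {0x534B},
    "Sun-ExtB": range(0x20000, 0x2A6E0),
    "BabelStoneHan": range(0x2E80, 0x2A6E0),
}


def pick_font(cp, coverage, overrides=None, default="DejaVu Sans Mono"):
    """Choose a font for codepoint cp: override first, then block list, then default."""
    if overrides and cp in overrides:
        return overrides[cp]
    for first, last, _name, fonts in BLOCK_FONTS:
        if first <= cp <= last:
            # Walk the block's fonts in priority order; skip fonts lacking the glyph.
            for font in fonts:
                if cp in coverage.get(font, ()):
                    return font
            break
    return default


if __name__ == "__main__":
    for cp in (0x5F50, 0x534B, 0x2F39, 0x20000, 0x41):
        print("U+{:04X} -> {}".format(cp, pick_font(cp, DEMO_COVERAGE)))
```

Passing `overrides={0x2F39: "BabelStoneHan"}` would force that single codepoint to a specific font regardless of the block rule.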

I have found no solution so far, and many descriptions on the net seem to be outdated for ubuntu 10.04.

Helpful answers would include explanations of, or pointers to, how the current ubuntu font rendering is intended to work, and what you can configure at all.

*long explanation:*

I work a lot with unicode characters from the so-called 'astral planes', that is, with codepoints beyond unicode's original 16 bits. Now there are many situations (browser address bar, terminal, text editors) where fonts cannot be configured the way you would do in, say, a word processor or an html/css file, where you can explicitly define the font for each character to be displayed.

Instead, in each such application, precisely what image will appear is a result of fonts installed on the system, application-wide settings, possibly font system configuration, and, it would seem, your good or bad luck.

For the purpose of working with chinese/japanese/korean (cjk) characters I have installed Sun-ExtA.ttf, Sun-ExtB.ttf, and BabelStoneHan.ttf, alongside quite a number of other fonts, including the default ubuntu offering. Also, I have (under wine) BabelMap and do all of my editing in Komodo Edit 6.1.

Komodo is configured to use DejaVu Sans Mono, which I find sufficiently pleasant to work with. By way of system-wide glyph substitution (I believe), I am getting a lot of correct images for cjk codepoints. However, I am not entirely sure those images do indeed originate from the fonts mentioned above. You see, the cjk blocks contain well over 70000 codepoints, some with subtle differences, some with negligible variants, and some being outright copies. It's a surprisingly hairy subject. Basically you can only successfully work in this field if you can be absolutely sure what a given codepoint is intended to look like, and the most faithful renderings I have found are contained in the fonts mentioned above.

Unfortunately, ubuntu seems to mess up quite a few codepoints. Take, for example,

u-cjk/5f50    彐
u-cjk-rad1/2f39    ⼹
u-cjk-rad2/2e95    ⺕

In all applications, including firefox without the proper css and komodo, these three codepoints look exactly identical on my machine. However, if you look the characters up in a source like http://www.longwiki.net/%E5%BD%90, which, in my experience, has very well selected gifs for the characters in question, there are subtle differences between these three codepoints.
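To confirm that these really are three separate codepoints rather than one character pasted three times, a small Python 3 check with the standard `unicodedata` module can print the name each one carries in the Unicode character database. The helper name `describe` is only for illustration, and the names reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# The three look-alike codepoints listed above.
SAMPLES = ["\u5f50", "\u2f39", "\u2e95"]


def describe(ch):
    """Return 'U+XXXX  NAME' for a single character (ASCII-only output)."""
    return "U+{:04X}  {}".format(ord(ch), unicodedata.name(ch, "<unnamed>"))


if __name__ == "__main__":
    for ch in SAMPLES:
        print(describe(ch))
```

If the three lines show three different names, any identical rendering comes from the font layer and not from the text itself.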

I am not so happy that unicode chose to define so many virtually-identical codepoints, but then cjk encoding has been known to be quite a hard problem for decades. Now I do have fonts (here it's Sun-ExtA.ttf) installed that render these three codepoints with the intended looks, but my feeling is that these fonts never get a chance to render because ubuntu or whoever at some point intervenes, declaring that all these codepoints should be conflated to a single one. Or maybe it's some font that ubuntu considers the correct font for these codepoints that does the conflation. Let me show you why it is highly unlikely that this is the correct and desired behavior: from the list above you can see the codepoints reside in three different unicode blocks, namely

CJK UNIFIED IDEOGRAPHS
KANGXI RADICALS
CJK RADICALS SUPPLEMENT

respectively. The unicode consortium has developed a pretty strange viewpoint on the so-called 'radicals', which means they treat them as 'symbols' (for symbols of sections in dictionaries), not as 'characters' (which you use to write texts), which I believe is plain bollocks. This policy drives unicode to include a character like 馬 'horse' more than once, as

u-cjk/99ac    馬
u-cjk-rad1/2fba    ⾺

which to me is plainly and simply a case of unwarranted codepoint duplication, and it's a stated policy of unicode that these points look the same but are to be treated differently. Now while there are known and admitted cases of inadvertent character/glyph duplication (where some committee got drowned in the myriads of codepoints and admitted a character more than once; other codesets suffer from that problem, too), this is highly unlikely in this case. The two radicals blocks are but a few hundred codepoints long, and the supplemental one was added only after the introduction of the primary 'kangxi' radicals block (even the naming is wacky), for the sole purpose of differentiating glyphs. Therefore, given the assumption that it is highly unlikely such a doublet was introduced by error (any first-year student of chinese could check those short lists for correctness; that's what you spend a lot of time with when learning chinese, sorting out and remembering all those near-lookalikes), we must conclude that a difference in appearance at least between two of the codepoints was fully intended by unicode, and, therefore, my computer is wrong in trying to convince me they should look the same.
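The relationship between a kangxi radical and its 'ordinary' character is recorded in the Unicode data itself as a compatibility mapping, which can be inspected from Python 3 with the standard `unicodedata` module. The sketch below is only a way to look at that data; `compat_target` is a hypothetical helper name, and whether a given supplement radical such as U+2E95 has such a mapping is left for the script to report rather than assumed here.

```python
import unicodedata


def compat_target(ch):
    """Return the NFKC form of ch if normalization changes it, else None."""
    folded = unicodedata.normalize("NFKC", ch)
    return folded if folded != ch else None


if __name__ == "__main__":
    # The horse pair and the three look-alikes discussed earlier.
    for cp in (0x2FBA, 0x99AC, 0x2F39, 0x2E95, 0x5F50):
        target = compat_target(chr(cp))
        shown = "U+{:04X}".format(ord(target)) if target else "(unchanged)"
        print("U+{:04X} -> {}".format(cp, shown))
```

A codepoint that folds to another one under NFKC is one that unicode treats as a compatibility variant; one that stays unchanged is treated as fully distinct, which is a separate matter from which glyph a font draws for it.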

Another glitch I have noticed is that some scattered codepoints are definitely displayed using a different font from most others; as an example, the three codepoints in the first group below are rendered by some sans-serif font (possibly from the Ume Gothic or Wen Quan Yi series), while those in the second group are rendered in song style:

u-cjk/534b    卋
u-cjk/5359    卙
u-cjk/535b    卛

u-cjk/534c    卌
u-cjk/534f    协
u-cjk/535a    博

This behavior can be observed both in gedit and komodo edit, so I can be pretty sure it happens on the os level, not within the application.

Observe that the codepoints in question are immediately neighboring ones, so my guess is that the default song-style font has a few missing codepoints, and ubuntu believes a sans-serif font to contain the best alternatives for those points. It gets this wrong, since, after all, the installed Sun-ExtA.ttf does have complete coverage of song-style glyphs for this block of unicode (that said, I have never seen a glyph substitution system that really works).
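One way to test the missing-glyph guess is to ask fontconfig which installed font families claim to cover each codepoint. The Python 3 sketch below shells out to the `fc-list` command with a `:charset=<hex>` pattern; it assumes fontconfig's command-line tools are installed, and the helper names `fc_list_args` and `families_covering` are made up for this example. It reports coverage only and says nothing about which font an application finally picks.

```python
import shutil
import subprocess


def fc_list_args(codepoint):
    """Build the fc-list invocation listing families that cover one codepoint."""
    return ["fc-list", ":charset={:x}".format(codepoint), "family"]


def families_covering(codepoint):
    """Return sorted family names covering codepoint, or [] if fc-list is absent."""
    if shutil.which("fc-list") is None:
        return []
    result = subprocess.run(
        fc_list_args(codepoint),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})


if __name__ == "__main__":
    # First group rendered sans-serif, second group rendered in song style.
    for cp in (0x534B, 0x5359, 0x535B, 0x534C, 0x534F, 0x535A):
        print("U+{:04X}: {}".format(cp, ", ".join(families_covering(cp)) or "(none found)"))
```

If the song-style family appears for the second group but not for the first, that would support the guess that the fallback is triggered by gaps in that font.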

Above, I mentioned BabelMap, which is quite a useful tool for doing character encoding work. One of the outstanding aspects of BabelMap is that the glyph table can be configured in a very manageable way to use specific fonts for each unicode block. I'd actually like to have even more fine-grained control for a few edge cases, but this is as good as it seems to get in this age.

Jorge Castro
flow
  • Hello, this question has had no new information or activity for a very long time. I am closing it for now. If for any reason you think this question is still viable or useful in any way, or that there is still a good chance it will be answered, please flag it to a moderator or add a comment with the reasons why you want it open. Regards – fossfreedom Feb 19 '12 at 19:51