
Terminology doesn't support combining characters
Open, Normal, Public

Description

Example: If you print a codepoint "a" and then the codepoint U+20D7, you don't get an "a" with an arrow above in Terminology.

Works fine in xterm and gnome-terminal.

I tried to add it to termptyops.c myself, but I can't figure out how to get it to wait for combining characters before advancing the position. Right now it seems to advance the position right after printing each codepoint, which makes combining impossible.

It would have to

  1. print the non-combining character, remember the width but don't advance yet
  2. for each trailing combining codepoint check the width (and don't advance)
  3. and finally advance by the maximum of those widths.

Also, what if _termpty_text_append were called once with just a normal character and then again with the codepoint of a combining character? So there would need to be a new state for "am I combining, and what's the cumulative width of the character so far?" in ty->state.
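The steps above, including the case where the base character and the combining codepoint arrive in separate calls, could be sketched roughly like this. Everything here is hypothetical: `Combine_State` only stands in for new fields in ty->state, `feed` and `flush` are not Terminology functions, and `is_combining` covers just a few combining-mark blocks rather than the full Unicode data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical and deliberately incomplete: only the main combining
 * diacritical blocks. A real check would use full Unicode data. */
static bool is_combining(uint32_t cp)
{
   return (cp >= 0x0300 && cp <= 0x036F) ||
          (cp >= 0x1AB0 && cp <= 0x1AFF) ||
          (cp >= 0x1DC0 && cp <= 0x1DFF) ||
          (cp >= 0x20D0 && cp <= 0x20FF) ||
          (cp >= 0xFE20 && cp <= 0xFE2F);
}

/* Hypothetical state that survives between calls. */
typedef struct {
   int cx;            /* cursor column */
   int pending_width; /* width of the unfinished cluster, 0 if none */
} Combine_State;

/* Feed one codepoint together with its display width. */
static void feed(Combine_State *st, uint32_t cp, int width)
{
   assert(st && width >= 0);
   if (is_combining(cp) && st->pending_width > 0) {
      /* Steps 2 and 3: keep the maximum width, do not advance yet. */
      if (width > st->pending_width) st->pending_width = width;
      return;
   }
   /* A new base character: commit the previous cluster first. */
   st->cx += st->pending_width;
   /* Step 1: remember the width of the new base, do not advance yet. */
   st->pending_width = width;
}

/* Commit the pending cluster, e.g. before handling a control code. */
static void flush(Combine_State *st)
{
   st->cx += st->pending_width;
   st->pending_width = 0;
}
```

Feeding "a" (width 1) and then U+20D7 (width 0) leaves `cx` unchanged until the next base character or a `flush`, which is the deferred advance described above.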

Also, termpty_cell_codepoint_att_fill seems to overwrite what's already in the cell - it should be combining them for combining characters.

dannym updated the task description.
dannym raised the priority of this task from to Normal.
dannym added a project: Terminology.
dannym added a subscriber: dannym.
billiob renamed this task from [BUG] Terminology doesn't support combining characters to Terminology doesn't support combining characters. Nov 23 2013, 6:26 AM

Huh, I think I got at least the termpty part working now and now the obvious happens:

Since, for the last cell, it's impossible to know whether the cell is finished or not (whether there are still combining codepoints following or not), the caret (say in a shell) will stay ON the last character while it waits for input.

I wonder how xterm does it O_o

They print the main character as soon as they get it. If a combining character follows, they backspace and add it.

billiob added a subscriber: billiob. Dec 1 2013, 3:03 PM

I think you should look at this as you know more about those weird characters!

egmont added a subscriber: egmont. Dec 14 2013, 8:48 AM

Fyi: it's also valid to put multiple combining accents over a letter (perhaps with a reasonable safety cap of 10 or so). I'm not sure how xterm and others implement it without significantly increasing memory usage. Gnome-terminal (vte) keeps only one Unicode codepoint per character cell, and builds up a "palette" mapping unused code points (above U+1fffff) to sequences of base letter + combining accents, e.g. it says that from now on U+200000 means "a" followed by U+20D7. Yes, strictly speaking it can run out of free slots after a while, and I don't think it ever free()s the sequences that are no longer used, but it works quite well in practice.
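A minimal sketch of such a palette, assuming a fixed-size table and a linear search. This is an illustration of the idea only, not VTE's actual code; `palette_combine`, `palette_expand`, and the constants are made-up names, and the cap of 10 marks follows the safety cap suggested above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PALETTE_BASE 0x200000u /* first fake codepoint, as in the example */
#define PALETTE_MAX  1024      /* arbitrary table size for this sketch */
#define SEQ_MAX      11        /* base letter + up to 10 combining marks */

typedef struct {
   uint32_t seq[SEQ_MAX];
   int len;
} Palette_Entry;

static Palette_Entry palette[PALETTE_MAX];
static int palette_used = 0;

/* Expand a cell value into its codepoint sequence; returns the length. */
static int palette_expand(uint32_t cell, uint32_t out[SEQ_MAX])
{
   if (cell < PALETTE_BASE) { out[0] = cell; return 1; }
   uint32_t idx = cell - PALETTE_BASE;
   assert(idx < (uint32_t)palette_used);
   memcpy(out, palette[idx].seq, sizeof(uint32_t) * (size_t)palette[idx].len);
   return palette[idx].len;
}

/* Return the cell value meaning "cell followed by mark".
 * Falls back to the unchanged cell when the cap or the table is full. */
static uint32_t palette_combine(uint32_t cell, uint32_t mark)
{
   uint32_t seq[SEQ_MAX];
   int len = palette_expand(cell, seq);
   if (len >= SEQ_MAX) return cell;
   seq[len++] = mark;
   /* Reuse an existing slot if this exact sequence was seen before. */
   for (int i = 0; i < palette_used; i++)
      if (palette[i].len == len &&
          memcmp(palette[i].seq, seq, sizeof(uint32_t) * (size_t)len) == 0)
         return PALETTE_BASE + (uint32_t)i;
   if (palette_used >= PALETTE_MAX) return cell;
   memcpy(palette[palette_used].seq, seq, sizeof(uint32_t) * (size_t)len);
   palette[palette_used].len = len;
   return PALETTE_BASE + (uint32_t)palette_used++;
}
```

The cell itself still stores a single 32-bit value; only cells that actually carry combining marks cost a table slot. As noted above, slots are never freed in this sketch either.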

terminology already uses the unused unicode space for inlined media. :( yeah - it needs all of it as it allocates the bit fields to all sorts of different data.

what it likely needs to do is remember the PREVIOUS char cell pos and IF it sees a combining char, jump back and modify the previous cell (then advance again as if it started at the previous pos). each new modifier repeats the process. to make this possible it needs a table of all possible modifier chars and resulting unicode "combined" versions. if you add another combining char on top of something already combined.. well the table needs to have that combo in it too. there must be a limited set of such combinations.

> it needs a table of all possible modifier chars and resulting unicode "combined" versions

I'm not entirely sure we're on the same page... Some char+modifier combinations do have a combined Unicode character (e.g. U+E1 is a precomposed version of a + U+301) but most don't; that's the point, to allow arbitrary combinations. Even when there is a precomposed version, there might be a tiny difference in rendering, and for copy-paste purposes it is probably better to keep whatever version was printed by the application rather than substituting the precomposed one. So you'd need to allow multiple Unicode codepoints per character cell.

Keeping track of the previous char cell position is indeed another, totally independent problem. Although, simply the cell on the left of the cursor might be a safe bet. I don't think it's specified anywhere what should happen if an application prints some letters, then moves the cursor, then prints a combining accent.

Cf. https://bugzilla.gnome.org/show_bug.cgi?id=673981, see Mosh's opinion and me expressing why I disagree with that.

i would argue that any combinations that do not have already combined versions in unicode are "not valid". :) they may be theoretically composable, but they are not practical considerations. the point of composing is to CREATE these pre-composed chars to get accents etc. etc.

i would lean towards vetoing multiple unicode codepoints per cell. basically because of the complexity and overhead (memory and computation). we already use 8 bytes per cell. we need all of that to hold fg/bg color (2 bytes there) 4 bytes for the char and another 2 bytes worth of flags like 256 color mode flags, brightness/bold, etc. etc. - all the bits are packed and used. to do more means expanding to more than 8 bytes (and for alignment reasons that probably means going to 16 bytes). we could use variable encodings but then compression just got nasty as well as addressing. admittedly compression does reduce the cost of "bigger cells" as if we only sometimes use the extra memory space (and normally zero it out) it'll compress nicely, but its uncompressed form will bloat out. also supporting "arbitrary numbers of codepoints" means bloating out by a lot. it's essentially unbounded. so then back to the above - there is obviously a purpose to composing... and that is to do a + ^ = â. or... the like. thus there exist pre-composed versions...

now i'd disagree with "Cell on the left". i'd just say "previous cell" whatever it happens to be, since we line-wrap at the end. if the app inserts control codes in between to move the cursor around - i'd say that breaks the "chain" and it is no longer a + ^, but 2 separate chars. so the app has to emit both right next to each other to get it to work. nothing in between (no spaces, newlines, linefeeds or other control sequences). if you use such a combining char without a valid "previous char" in the "1 char back history buffer", then you treat the char as a stand-alone. :) that solves the "combining char on first position of a line" case.

> i would argue that any combinations that do not have already combined versions in unicode are "not valid". :)
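a rough sketch of that "previous cell" rule, under loud assumptions: the grid is a tiny flat array so that line-wrap falls out of the linear index, `compose` is a two-entry stand-in for the table of precomposed forms, and `put_char`, `control_code` and `is_combining` are invented names, not Terminology code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COLS 4
#define ROWS 2

/* Hypothetical and incomplete: only the basic combining diacritics. */
static bool is_combining(uint32_t cp)
{
   return cp >= 0x0300 && cp <= 0x036F;
}

/* Stand-in for a real table of precomposed forms; 0 means "none". */
static uint32_t compose(uint32_t base, uint32_t mark)
{
   if (base == 'a' && mark == 0x0301) return 0x00E1; /* a + acute */
   if (base == 'a' && mark == 0x0302) return 0x00E2; /* a + circumflex */
   return 0;
}

typedef struct {
   uint32_t cells[ROWS * COLS];
   int cur;  /* linear cursor position; wrapping is implicit */
   int prev; /* index of the last printed cell, -1 if the chain is broken */
} Grid;

static void put_char(Grid *g, uint32_t cp)
{
   assert(g);
   if (is_combining(cp) && g->prev >= 0) {
      /* Jump back and modify the previous cell; the cursor stays put. */
      uint32_t c = compose(g->cells[g->prev], cp);
      if (c) { g->cells[g->prev] = c; return; }
   }
   /* No valid previous char (or no table entry): treat as stand-alone. */
   if (g->cur >= ROWS * COLS) return; /* sketch only: no scrolling */
   g->cells[g->cur] = cp;
   g->prev = g->cur++;
}

/* Any control sequence breaks the chain. */
static void control_code(Grid *g)
{
   g->prev = -1;
}
```

because `prev` is an index and not "cursor minus one", a combining char that arrives right after a wrap still lands on the last cell of the previous row.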

Please check http://www.cl.cam.ac.uk/~mgk25/unicode.html -> "What are UCS implementation levels?". Your approach will lead to Level 1 only, even if you find the precomposed equivalent whenever possible (that could be level 1.5 or so, but it's not defined there). Apparently there is the need to be able to compose characters that don't have precomposed versions, it's required by certain scripts (level 2) and other misc uses such as math (level 3). I guess combining characters wouldn't exist if everyone could just happily live with the precomposed ones - they do exist because it's not the case.

It's your call to figure out the Unicode support level, you might say it's low prio or it won't happen, but IMHO aiming at level "1.5" is not time well spent. I'd personally either stick with level 1 and put zero effort in it, or I'd try to get it right.

> i would lean towards vetoing multiple unicode codepoints per cell.

This is where I referred to Gnome-Terminal (VTE) using an indexed "palette", e.g. upon seeing "a + U+301" it inserts "U+200000 -> a + U+301" into a lookup table, and stores U+200000 as a single fake Unicode character in the cell, thus leaving a cell's memory usage at 8 bytes. Just an idea you might consider, I don't know how other emulators handle combining chars.

> now i'd disagree to "Cell on the left".

I guess you're right, I didn't think of soft line breaks.

dannym added a comment. Edited Dec 15 2013, 2:13 AM

> i would argue that any combinations that do not have already combined versions in unicode are "not valid". :) they may be theoretically composable, but they are not practical considerations. the point of composing is to CREATE these pre-composed chars to get accents etc. etc.

Not at all. If that were the case, you could just type the precombined characters to begin with. That there are precomposed characters in Unicode in the first place is just for backward compatibility with legacy encodings.

I'm mainly using combining characters to type mathematics. For example, a vector v⃗, or a statistical average x̄, or ⟨x̄⟩ for the expectation value of the observable x in the density matrix representation.

Something like v⃗̄ can happen, too (the statistical average of the vector v⃗), although the notation is so overloaded then that it's not used much (instead ⟨v⃗⟩ is).

Also, ẋ for the derivative of x by time. ẍ for the second derivative of x by time.

raster assigned this task to billiob. Jan 22 2014, 3:02 AM
billiob removed billiob as the assignee of this task. Jan 29 2015, 3:29 PM

Is this still an issue? I just tested on Terminology-0.8.0 and character combinations work as expected, e.g. '^' + 'i' = 'î', '^' + 'a' = 'â' etc. This should be closed!

for input it works (if you have an input method dealing with it or multi_key + combos). i use multi-key compose sequences for input and it works fine for me. :)

billiob changed the visibility from "All Users" to "Public (No Login Required)". Dec 27 2015, 3:00 PM

hmm back to this - we can't do the palette thing. upper unicode codepoints are used by terminology already for other purposes. so that means storing a list of codepoints per cell - that's a bit nasty.

since this would be VERY rare... we can keep an "overlay" that is a list of char cell "x" values per row where we might overlay more info (eg combining chars) and most of the time this ptr to the list per row will be NULL ... and in some cases we walk it.
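a minimal sketch of such a per-row overlay, assuming a singly linked list keyed by column. `Row`, `Overlay` and the `overlay_*` functions are hypothetical names for illustration, not existing Terminology structures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One extra codepoint attached to the cell at column x. */
typedef struct Overlay {
   int x;
   uint32_t mark;
   struct Overlay *next;
} Overlay;

typedef struct {
   /* ... the regular fixed-size cells would live here ... */
   Overlay *overlay; /* NULL in the common case: no combining chars */
} Row;

/* Append a mark for column x, keeping arrival order. Returns 0 on success. */
static int overlay_add(Row *row, int x, uint32_t mark)
{
   assert(row && x >= 0);
   Overlay *o = malloc(sizeof(*o));
   if (!o) return -1;
   o->x = x;
   o->mark = mark;
   o->next = NULL;
   Overlay **tail = &row->overlay;
   while (*tail) tail = &(*tail)->next;
   *tail = o;
   return 0;
}

/* Copy up to max marks for column x into out; returns how many were copied. */
static int overlay_get(const Row *row, int x, uint32_t *out, int max)
{
   int n = 0;
   for (const Overlay *o = row->overlay; o; o = o->next)
      if (o->x == x && n < max) out[n++] = o->mark;
   return n;
}

/* Free the whole list, e.g. when the row is cleared or scrolled away. */
static void overlay_clear(Row *row)
{
   while (row->overlay) {
      Overlay *next = row->overlay->next;
      free(row->overlay);
      row->overlay = next;
   }
}
```

rows without combining chars pay only for one NULL pointer; the list is walked only for rows that have one.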

one big issue is evas's textgrid can't display combining chars. until it can - this isn't doable. it supports one unicode codepoint per cell. unless of course we do in evas textgrid what gnome terminal does - assign invalid high unicode codepoint space to a lookup table of combined chars.