AWK - How to reference Unicode character by number for correct value assignment

ericmarceau · 19 February 2025 03:18

Instinctively, this should be downright simple. But I have been looking high and low for the last 3 days and can't find an example that specifies a Unicode character numerically and then either assigns the proper character to its position in a sequential array or prints the proper glyph for it.

Things seem to work OK for some characters, but others are not being assigned/displayed/printed correctly.

The test script I have is this:

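#!/bin/sh

# Emit the character numbers 32..255, one per line.
i=32

while [ "$i" -le 255 ]
do
	#echo '\u'$i
	echo "$i"
	i=$((i + 1))
done |
LC_CTYPE=C awk 'BEGIN{
	split( "", glyph ) ;	# start with an empty array
	n=32 ;
}{
	# %c with a numeric argument yields the character with that code;
	# adding 0 forces the input line to be treated as a number.
	glyph[n] = sprintf("%c", $0 + 0 ) ;
	n++ ;
}END{
	for( n=32 ; n<256 ; n++ ){
		printf("\t glyph[%d] = %c\n", n, glyph[n] ) ;
	} ;
}'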

Where things break down is captured in this snapshot:

I tried using the following instead,

LC_CTYPE=en_US.UTF-8

but the results seem worse, rather than better:

Does anyone know how to code the "value" or "glyph" assignment so that it displays properly in MATE Terminal?

I don't think it is relevant, but the Font used is "Liberation Sans".

Edit: I actually verified that the Fonts "C059" and "Deja Vu Sans Mono" both actually have glyphs defined for the character range 32-255, changed my MATE terminal profile, re-ran the script, and still have the same issue of either the character not being assigned correctly, or the glyph not being displayed correctly. Don't understand why !!! Real head-scratcher!

UM 22.04.5 LTS
Kernel 6.8.0-45-generic
MATE 1.26.0

Also, in that context, how do I "properly" handle the single quote when it is one of the glyphs assigned numerically, but is later referenced as the character value of an array position?
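One common way to handle the single quote is sketched below, as an illustration rather than a confirmed fix for the script above. Inside a single-quoted awk program the quote can be produced from its number (39) or written as the octal escape "\047", so no literal quote ever appears in the program text. The function name quote_demo is hypothetical.

```shell
#!/bin/sh
# quote_demo is a hypothetical name used only for this sketch.
quote_demo() {
	awk 'BEGIN {
		glyph[39] = sprintf("%c", 39)	# numeric 39 -> single quote
		q = "\047"			# octal escape for the same character
		# Compare the stored array value against the escaped form.
		printf("glyph[39] = %s, same as q: %s\n", glyph[39], (glyph[39] == q) ? "yes" : "no")
	}'
}

quote_demo
```

The stored value behaves like any other one-character string, so later comparisons against the array slot need no special treatment.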

ugnvs · 19 February 2025 09:43

The UM Character Map application will help a lot. It lets you inspect fonts and shows Unicode code points.

The following links also may be useful

https://www.unicode.org/charts/

tkn · 19 February 2025 09:57

UTF-8 is a multi-byte encoding of Unicode. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that a UTF-8-encoded file using only those characters is identical to an ASCII file. Byte values of 128 and above look undefined at first sight (as you already discovered), but they are the lead and continuation bytes of multi-byte sequences: the lead byte indicates the length of the character sequence.

The first 128 code points (ASCII) need 1 byte. The next 1,920 code points need two bytes to encode, etc etc etc

This means that to encode or decode characters with a value of 128 or higher we need to use more bytes per character, which demands a kind of escape construct.

The following link describes how this multi-byte encoding works in UTF-8:
https://en.wikipedia.org/wiki/UTF-8#Description
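
The byte layout described above can be sketched with plain shell arithmetic. This is a minimal illustration that only covers the one-byte and two-byte cases (code points below 2048); the function name utf8_hex is hypothetical, and it prints the bytes as hex instead of sending them to the terminal.

```shell
#!/bin/sh
# utf8_hex: hypothetical helper that prints, in hex, the UTF-8 bytes
# for a decimal code point. Only code points below 2048 are handled.
utf8_hex() {
	cp=$1
	if [ "$cp" -lt 128 ]; then
		# ASCII range: one byte, same value as the code point.
		printf '%02x\n' "$cp"
	elif [ "$cp" -lt 2048 ]; then
		# Two bytes: lead byte 110xxxxx, continuation byte 10xxxxxx.
		printf '%02x %02x\n' $(( 192 + cp / 64 )) $(( 128 + cp % 64 ))
	else
		echo "code point $cp needs three or more bytes (not handled here)" >&2
		return 1
	fi
}

utf8_hex 65	# prints: 41
utf8_hex 233	# prints: c3 a9
```

So character number 233 is not the single byte 0xE9 in UTF-8; a UTF-8 terminal expects the pair 0xC3 0xA9.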

ericmarceau · 19 February 2025 18:14

Thank you, Eugene.

I have been referring to the Character Map application on an ongoing basis. Other than realising from its use that "Liberation Sans" did not have glyphs in the range that came up empty (which is why I switched fonts on my MATE Terminal), that hasn't helped me much.
Thank you for the second URL. I had not seen that, but it still doesn't explain why MATE Terminal is not displaying correctly when, from what I understand, UM is supposed to be fully UTF-8 capable, so it should be displaying those glyphs correctly, IMHO.
I have referred to the Wikipedia page for Unicode characters for many, many moons now, so nothing new there for me, but others may find it educational if it's new for them.

ericmarceau · 19 February 2025 18:28

Thank you, Thom, for giving that succinct explanation.

However, I was already aware of all that. I had seen that URL many months ago for a problem I was tackling back then.

... Which is why I had tried the test script with the option

LC_CTYPE=en_US.UTF-8 awk 'BEGIN{

instead of

LC_CTYPE=C awk 'BEGIN{

What I find bizarre, is that with the UTF-8 command option, the script aborts at character number 157 (decimal) with return code = 0 !!!

On the off chance that it might fix the issue, I tried that with gawk instead of awk, to no avail.

I am starting to wonder whether MATE Terminal just can't handle Unicode. Otherwise, it's the formatting options being used with awk/gawk that have not been correctly identified, which was, in fact, my original question.

Looking at this posting on StackOverflow, there might be a hint of an answer, but I don't quite understand how to translate that into what I need for

assigning individual characters into an array slot (i.e. array[n] ), or
specifying the printing format that will result in proper display of the Unicode character, since MATE Terminal appears not to understand.

Is there an option for the awk/gawk utility to specify the output stream as specifically UTF-8 ? If such is used, will MATE Terminal respond correctly to that stream for display ? At this point, I am starting to have my doubts.

I even tried what was recommended in this post, regarding adding a "header" to the file,

# -*- coding: utf-8 -*-

but I'm not sure if that applies only to that discussion involving python, and not just any output stream.

ugnvs · 19 February 2025 19:27

Hello, Eric,
Yes, UM supports Unicode without a glitch. Have a look at
https://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF

Please pay attention to the codes from 0x80 to 0x9f. These are control codes, which by their nature do not have associated glyphs. The printable characters that some extended-ASCII code pages placed in this range are now found elsewhere in Unicode.

And a boring note: UTF-8 is a method of representing Unicode code points as variable-length byte sequences, not the code points themselves.
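
Putting both points together, here is a hedged sketch of one way to fill the array from the original question: skip the control codes and store each character as its UTF-8 byte sequence. Forcing LC_ALL=C is an assumption made so that %c emits exactly one byte per call; the names make_table and utf8 are hypothetical, and DEL (127) is skipped as well because it is also a control code.

```shell
#!/bin/sh
# make_table and utf8() are hypothetical names used only for this sketch.
make_table() {
	LC_ALL=C awk '
	# Return the UTF-8 byte sequence for a code point below 2048.
	function utf8(cp) {
		if (cp < 128) return sprintf("%c", cp)
		return sprintf("%c%c", 192 + int(cp / 64), 128 + cp % 64)
	}
	BEGIN {
		for (n = 32; n < 256; n++) {
			# 127 (DEL) and 128..159 (0x80..0x9f) are control codes.
			if (n >= 127 && n <= 159) continue
			glyph[n] = utf8(n)
		}
		# Print with %s: in the C locale %c would keep only the first byte.
		for (n = 32; n < 256; n++)
			if (n in glyph) printf("\t glyph[%d] = %s\n", n, glyph[n])
	}'
}

make_table
```

On a UTF-8 terminal the entries from 160 to 255 should then show the Latin-1 characters, provided the font has glyphs for them.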

ugnvs · 19 February 2025 19:30

Yes, that is a Python-only thing.

ugnvs · 19 February 2025 19:36

And that is correct. 157 = 0x9D = [OSC]
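
OSC (Operating System Command) is a control code that a terminal may treat as the start of a control sequence rather than as text, which would explain output appearing to stop there while the script still exits with 0. A small sketch for inspecting such output safely: pipe it through od so the terminal only ever receives hex digits. The helper name show_bytes is hypothetical.

```shell
#!/bin/sh
# show_bytes: hypothetical helper that dumps stdin as hex bytes, so the
# terminal never sees the raw control code.
show_bytes() {
	od -An -tx1
}

# U+009D (decimal 157) encoded as UTF-8 is the two bytes 0xC2 0x9D.
printf '\302\235' | show_bytes
```

The same filter can be appended to the test script's pipeline to check which bytes awk really wrote for each array slot.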

ericmarceau · 19 February 2025 20:03

When I looked at the glyphs for Deja Vu Sans Mono in the Font Manager, I thought that the contiguous presentation implied that the numbers were sequential.

I am humbled by the fact that I did not check that initial assumption, which makes me look like an idiot!

Lesson learned!