December 02, 2022
On 12/2/22 13:18, thebluepandabear wrote:

> But I don't really understand this? What does it mean that it 'must be
> represented by at least 2 bytes'?

The integral value of Ğ in Unicode is 286 (U+011E in hexadecimal notation):

  https://unicodeplus.com/U+011E

Since 'char' is 8 bits, it cannot store 286.

At first, that sounds like a hopeless situation, as if Ğ could not be represented in a string at all. The concept of encoding comes to the rescue: Ğ can be encoded by 2 chars:

import std.stdio;

void main() {
    // By default, foreach over a string visits its UTF-8
    // code units (chars) one by one.
    foreach (c; "Ğ") {
        writefln!"%b"(c);  // print each code unit in binary
    }
}

That program prints

11000100
10011110

Articles like the following explain well why that second byte is called a continuation byte:

  https://en.wikipedia.org/wiki/UTF-8#Encoding

(It's a continuation byte because it starts with the bits 10).
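That bit pattern can be checked directly in code. Here is a small sketch (the masks follow the UTF-8 byte patterns described on that Wikipedia page):

```d
import std.stdio;

void main() {
    // The two UTF-8 code units of Ğ, as printed above.
    ubyte[] units = [0b11000100, 0b10011110];

    foreach (u; units) {
        // A continuation byte has the form 10xxxxxx, so masking
        // off all but the top two bits must yield 10000000.
        bool isContinuation = (u & 0b1100_0000) == 0b1000_0000;
        writefln!"%08b -> continuation byte: %s"(u, isContinuation);
    }
}
```

It reports false for the first byte (a leading byte) and true for the second.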

> I don't think it was explained well in
> the book.

Coincidentally, according to other recent feedback I received, Unicode and UTF are introduced way too early for such a book. I agree. I hadn't understood a single thing the first time smart people tried to explain Unicode and UTF encodings at the company where I worked, and I had years of programming experience by then. (Although, I now think the instructors were not really good; and the company was pretty bad as well. :) )

> Any help would be appreciated.

I recommend the Wikipedia page I linked above. It is enlightening to understand how about 150K Unicode characters can be encoded with units of 8 bits.
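As a concrete illustration of that scheme, the following sketch decodes the two bytes of Ğ back into its code point by hand; the masks follow the two-byte pattern 110xxxxx 10xxxxxx from that page:

```d
import std.stdio;

void main() {
    ubyte b1 = 0b11000100;  // leading byte: 110xxxxx carries 5 payload bits
    ubyte b2 = 0b10011110;  // continuation byte: 10xxxxxx carries 6 payload bits

    // Concatenate the payload bits: 5 from the first byte, 6 from the second.
    uint codePoint = ((b1 & 0b0001_1111) << 6) | (b2 & 0b0011_1111);

    writeln(codePoint);  // 286, i.e. U+011E, Ğ
}
```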

You can safely ignore wchar, dchar, wstring, and dstring for daily coding. Only special programs may need to deal with those types. 'char' and string are what we need and do use predominantly in D.
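For completeness, here is a sketch of how the same text looks through string, wstring, and dstring; the .length differences show how each type counts its own code units:

```d
import std.stdio;

void main() {
    string  s = "Ğ";   // UTF-8:  two 8-bit code units
    wstring w = "Ğ"w;  // UTF-16: one 16-bit code unit
    dstring d = "Ğ"d;  // UTF-32: one 32-bit code unit (the code point itself)

    writeln(s.length);  // 2
    writeln(w.length);  // 1
    writeln(d.length);  // 1
}
```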

Ali

December 02, 2022
On Fri, Dec 02, 2022 at 02:32:47PM -0800, Ali Çehreli via Digitalmars-d-learn wrote:
> On 12/2/22 13:44, rikki cattermole wrote:
> 
> > Yeah you're right, its code unit not code point.
> 
> This proves yet again how badly chosen those names are. I must look it up every time before using one or the other.
> 
> So they are both "code"? One is a "unit" and the other is a "point"? Sheesh!
[...]

Think of Unicode as a vector space.  A code point is a point in this space, and a code unit is one of the unit vectors; although some points can be reached with a single unit vector, to get to a general point you need to combine one or more unit vectors.

Furthermore, the set of unit vectors you have depends on which coordinate system (i.e., encoding) you're using.  Reencoding a Unicode string is essentially changing your coordinate system. ;-) (Exercise for the reader: compute the transformation matrix for reencoding. :-P)

Also, a grapheme is a curve through this space (you *graph* the curve, you see), and as we all know, a curve may consist of more than one point.

:-D

(Exercise for the reader: what's the Hausdorff dimension of the set of strings over Unicode space? :-P)


T

-- 
First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
December 02, 2022
> :-D
>
> (Exercise for the reader: what's the Hausdorff dimension of the set of strings over Unicode space? :-P)
>
>
> T

Your explanation was great and cleared things up... not sure about the linear algebra one though ;)
December 02, 2022
On Friday, 2 December 2022 at 23:44:28 UTC, thebluepandabear wrote:
>> :-D
>>
>> (Exercise for the reader: what's the Hausdorff dimension of the set of strings over Unicode space? :-P)
>>
>>
>> T
>
> Your explanation was great and cleared things up... not sure about the linear algebra one though ;)

Actually now when I think about it, it is quite a creative way of explaining things. I take back what I said.
December 02, 2022
On Fri, Dec 02, 2022 at 11:47:30PM +0000, thebluepandabear via Digitalmars-d-learn wrote:
> On Friday, 2 December 2022 at 23:44:28 UTC, thebluepandabear wrote:
> > > :-D
> > > 
> > > (Exercise for the reader: what's the Hausdorff dimension of the set of strings over Unicode space? :-P)
> > > 
> > > 
> > > T
> > 
> > Your explanation was great and cleared things up... not sure about the linear algebra one though ;)
> 
> Actually now when I think about it, it is quite a creative way of explaining things. I take back what I said.

It was a math joke. :-P  It was half-serious, though, and I think the analogy surprisingly holds up well enough in many cases.  In any case, silly analogies are often a good mnemonic for remembering things like Unicode terminology. :-D


T

-- 
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.
December 03, 2022
On 02.12.22 22:39, thebluepandabear wrote:
> Hm, that specifically might not be. The thing is, I thought a UTF-8 code unit can store 1-4 bytes for each character, so how is it right to say that `char` is a utf-8 code unit, it seems like it's just an ASCII code unit.

You're simply not using the term "code unit" correctly. A UTF-8 code unit is just one of those 1-4 bytes. Together they form a "sequence" which encodes a "code point".
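D lets you see the two concepts side by side: iterating a string by char visits code units, while asking foreach for dchar elements makes it decode whole code points on the fly. A small sketch:

```d
import std.stdio;

void main() {
    // By default, foreach over a string visits UTF-8 code units (char).
    foreach (char c; "Ğ") {
        writefln!"code unit:  %08b"(c);
    }

    // Requesting dchar makes foreach decode entire code points.
    foreach (dchar c; "Ğ") {
        writefln!"code point: %d"(c);  // 286
    }
}
```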

And all (true) ASCII code units are indeed also valid UTF-8 code units. Because UTF-8 is a superset of ASCII. If you save a file as ASCII and open it as UTF-8, that works. But it doesn't work the other way around.
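A sketch of that asymmetry: every ASCII byte is below 128, so it already matches UTF-8's single-byte pattern 0xxxxxxx, while a byte like 0xC4 from the Ğ example has no ASCII meaning.

```d
import std.stdio;

void main() {
    // ASCII bytes (all < 128) are single-byte UTF-8 code units, unchanged.
    string ascii = "hello";
    foreach (char c; ascii) {
        assert(c < 0x80);  // 0xxxxxxx: the same byte in ASCII and UTF-8
    }

    // The first byte of Ğ is >= 128: valid UTF-8, but not ASCII.
    string g = "Ğ";
    assert(g[0] == 0xC4);

    writeln("ASCII round-trips into UTF-8 unchanged");
}
```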