December 27, 2021
On Monday, 27 December 2021 at 11:21:54 UTC, rempas wrote:
> So should I just use UTF-8 only for Linux?

Most unix things do utf-8 more often than not, but technically you are supposed to check the locale and change the terminal settings to do it right.

> But what about Windows?

You should ALWAYS use the -W suffix functions on Windows when available, and pass them utf-16 encoded strings.

There's a bunch of Windows things taking utf-8 nowadays too, but utf-16 is what they standardized on back in the 1990s, so it gives you a lot of compatibility. The Windows OS will convert to other things for you if you use utf-16 consistently.

> Unfortunately I have to support this OS too with my library so I should know.

The Windows API is an absolute pleasure to work with next to much of the trash you're forced to deal with on Linux.
December 27, 2021

On Monday, 27 December 2021 at 11:21:54 UTC, rempas wrote:
> So should I just use UTF-8 only for Linux? What about other operating systems? I suppose Unix-based OSs (maybe MacOS as well, if I'm lucky) work the same way. But what about Windows? Unfortunately I have to support this OS too with my library, so I should know. If you know and can tell me, of course...

https://utf8everywhere.org/ - this is advice from a Windows programmer, and I follow it too. Windows allocates a per-thread buffer, and when you call, say, WriteConsoleA, it first transcodes the string to UTF-16 in that buffer and then calls WriteConsoleW; you would do something like that.

December 27, 2021
On Mon, Dec 27, 2021 at 02:30:55PM +0000, Adam D Ruppe via Digitalmars-d-learn wrote:
> On Monday, 27 December 2021 at 11:21:54 UTC, rempas wrote:
> > So should I just use UTF-8 only for Linux?
> 
> Most unix things do utf-8 more often than not, but technically you are supposed to check the locale and change the terminal settings to do it right.

Technically, yes. But practically all modern Linux distros have standardized on UTF-8, and you're quite unlikely to run into non-UTF-8 environments except on legacy systems or in extremely specialized applications. I don't know what the situation is on BSD, but I'd imagine it's pretty similar. A lot of modern Linux applications don't even work properly under anything non-UTF-8, so for practical purposes I'd say don't even worry about it, unless you're specifically targeting a non-UTF-8 environment for a specific reason.


> > But what about Windows?
> 
> You should ALWAYS use the -W suffix functions on Windows when available, and pass them utf-16 encoded strings.
[...]

I'm not a regular Windows user, but I do remember running into problems where sometimes cmd.exe doesn't handle Unicode properly and needs an API call to switch it to UTF mode or something.


T

-- 
First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
December 27, 2021
On Monday, 27 December 2021 at 15:26:16 UTC, H. S. Teoh wrote:
> A lot of modern Linux applications don't even work properly under anything non-UTF-8

yeah, you're supposed to check the locale, but since so many people just assume UTF-8, that's becoming the new de facto reality

just like how people blindly shoot out vt100 codes without checking TERM, and that usually works too.


> I'm not a regular Windows user, but I do remember running into problems where sometimes cmd.exe doesn't handle Unicode properly and needs an API call to switch it to UTF mode or something.

That'd be because someone called the -A function instead of the -W one. The -W ones just work if you use them; the -A ones are there for compatibility with Windows 95 and have quirks. This is the point of the blog post I linked before: people saying to make that API call don't understand the problem, and are patching over one bug with another bug instead of actually fixing it with the correct function call.
December 27, 2021
On Mon, Dec 27, 2021 at 04:40:19PM +0000, Adam D Ruppe via Digitalmars-d-learn wrote:
> On Monday, 27 December 2021 at 15:26:16 UTC, H. S. Teoh wrote:
> > A lot of modern Linux applications don't even work properly under anything non-UTF-8
> 
> yeah, you're supposed to check the locale, but since so many people just assume UTF-8, that's becoming the new de facto reality

Yep, sad reality.


> just like how people blindly shoot out vt100 codes without checking TERM and that usually works too.

Haha, doesn't terminal.d do that in a few places too? ;-)

To be fair, though, most of the popular terminal emulators are based on extensions of the vt100 codes anyway, so the basic escape sequences more or less work across the board. AFAIK non-vt100 codes are getting rarer and can practically be treated as legacy these days. (At least on Linux, that is. Can't say for the other *nixen.)


> > I'm not a regular Windows user, but I do remember running into problems where sometimes cmd.exe doesn't handle Unicode properly and needs an API call to switch it to UTF mode or something.
> 
> That'd be because someone called the -A function instead of the -W ones. The -W ones just work if you use them. The -A ones are there for compatibility with Windows 95 and have quirks. This is the point behind my blog post i linked before, people saying to make that api call don't understand the problem and are patching over one bug with another bug instead of actually fixing it with the correct function call.

Point.


T

-- 
Just because you survived after you did it, doesn't mean it wasn't stupid!
December 27, 2021

On Monday, 27 December 2021 at 07:12:24 UTC, rempas wrote:
> On Sunday, 26 December 2021 at 21:22:42 UTC, Adam Ruppe wrote:
>> write just transfers a sequence of bytes. It doesn't know nor care what they represent - that's for the receiving end to figure out.
>
> Oh, so it was as I expected :P

Well, to add functionality with, say, ANSI, you send an escape code and then stuff like offset, color, effect, etc. UTF-8 has its own built-in markers in a similar way: any byte 128 or over is part of a multi-byte character, so as long as the terminal understands UTF-8, the terminal is what handles it.

https://www.robvanderwoude.com/ansi.php

In the end it's all just a binary string of 1's and 0's.

December 28, 2021
On Monday, 27 December 2021 at 14:23:37 UTC, Adam D Ruppe wrote:
> [...]

After reading the whole thing, I said it and I'll say it again: you guys should get paid for your support!!!! I also helped a guy in another forum yesterday by writing a very long reply, and tbh it felt great :P

> (or of course when you get to a human reader, they can interpret it differently too but obviously human language is a whole other mess lol)

Yep! If machines are complicated, humans are even more complicated. Tho machines are also made by humans so... hmmmm!
December 28, 2021
On 27.12.21 15:23, Adam D Ruppe wrote:
> Let's look at:
> 
> "Hello 😂\n";
[...]
> Finally, there's "string", which is utf-8, meaning each element is 8 bits, but again, there is a buffer you need to build up to get the code points you feed into that VM.
[...]
> H, e, l, l, o, <space>, <next point is combined by these bits PLUS THREE MORE elements>, <this is a work-in-progress element and needs two more>, <this is a work-in-progress element and needs one more>, <this is the final work-in-progress element>, <new line>
[...]
> Notice how each element here told you how many elements are left. This is encoded into the bit pattern and is part of why it took 4 elements instead of just three; there's some error-checking redundancy in there. This is a nice part of the design allowing you to validate a utf-8 stream more reliably and even recover if you jumped somewhere in the middle of a multi-byte sequence.

It's actually just the first byte that tells you how many are in the sequence. The continuation bytes don't have redundancies for that.

To recover from the middle of a sequence, you just skip the orphaned continuation bytes one at a time.
December 28, 2021
On Monday, 27 December 2021 at 14:30:55 UTC, Adam D Ruppe wrote:
> Most unix things do utf-8 more often than not, but technically you are supposed to check the locale and change the terminal settings to do it right.

Cool! I mean, I don't plan on supporting legacy systems so I think we're fine if the up-to-date systems fully support UTF-8 as the default.

> You should ALWAYS use the -W suffix functions on Windows when available, and pass them utf-16 encoded strings.
>
> There's a bunch of windows things taking utf-8 nowdays too, but utf-16 is what they standardized on back in the 1990's so it gives you a lot of compatibility. The Windows OS will convert to other things for you it for you do this utf-16 consistently.

That's pretty nice. In this case it's even better because, at least for now, I will not work on Windows myself; making the library work on Linux is enough of a challenge by itself. So I will wait for contributors to work on that; they will probably know how Windows converts UTF-8 to UTF-16, and they will be able to run tests. Also, I plan to officially support only Windows 10/11 64-bit, so just like with Unix, I don't mind if legacy systems don't work.

> The Windows API is an absolute pleasure to work with next to much of the trash you're forced to deal with on Linux.

Whaaaat??? Don't crush my dreams, senpai!!! I mean, this may sound stupid, but which kind of API are you referring to? Do you mean system library stuff (like "unistd.h" for Linux and "windows.h" for Windows) or low-level system calls?
December 28, 2021

On Monday, 27 December 2021 at 14:47:51 UTC, Kagamin wrote:
> https://utf8everywhere.org/ - this is advice from a Windows programmer, and I follow it too. Windows allocates a per-thread buffer, and when you call, say, WriteConsoleA, it first transcodes the string to UTF-16 in that buffer and then calls WriteConsoleW; you would do something like that.

That's awesome! Like I said to Adam, I will not officially write Windows code myself (at least for now), so it will probably be up to the contributors to decide anyway. Tho knowing that there won't be compatibility problems with the latest versions of Windows is just nice to know. Thanks a lot for the info, man!