Why ElementType!(char[3]) == dchar instead of char? (page 2)

On 02.09.2015 00:08, Jonathan M Davis via Digitalmars-d-learn wrote:
> On Tuesday, September 01, 2015 20:05:18 drug via Digitalmars-d-learn wrote:
>> My case is I don't know what type user will be using, because I write a
>> library. What's the best way to process char[..] in this case?
>
> char[] should never be anything other than UTF-8. Similarly, wchar[] is
> UTF-16, and dchar[] is UTF-32. So, if you're getting something other than
> UTF-8, it should not be char[]. It should be something more like ubyte[].
> If you want to operate on it as char[], you should convert it to UTF-8.
> std.encoding may or may not help with that. But pretty much everything in D
> - certainly in the standard library - assumes that char, wchar, and dchar
> are UTF-encoded, and the language spec basically defines them that way.
> Technically, you _can_ put other encodings in them, but it's just asking for
> trouble.
>
> - Jonathan M Davis

I see, thanks. So I should always treat char[] as UTF in D itself, but because I need to pass char[], wchar[] or dchar[] to a C library, I should treat it not as UTF but as a sequence of ubyte, ushort or uint - just to pass it correctly, right?
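For a library that does not know which string type the caller will use, the idea above can be sketched as follows. This is a minimal sketch, not something posted in the thread: `rawCodeUnits` is a hypothetical helper name, and it assumes Phobos' default auto-decoding behaviour, under which the range element type of `char[]` is `dchar` while the stored code unit is still `char`.

```d
import std.range.primitives : ElementEncodingType, ElementType;
import std.string : representation;
import std.traits : isSomeString;

// Range primitives auto-decode narrow strings, so the range element type of
// char[] is dchar, while the stored code unit type is still char.
static assert(is(ElementType!(char[]) == dchar));
static assert(is(ElementEncodingType!(char[]) == char));

/// Hypothetical helper: exposes any D string as its raw code units
/// (ubyte, ushort or uint), without decoding or validating it.
auto rawCodeUnits(S)(S s)
if (isSomeString!S)
{
    return s.representation;
}
```

The `isSomeString` constraint rejects static arrays such as `char[3]`, so a caller would slice them first (`buf[]`).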

On Wednesday, 2 September 2015 at 05:00:42 UTC, drug wrote:
> 02.09.2015 00:08, Jonathan M Davis via Digitalmars-d-learn wrote:
>> On Tuesday, September 01, 2015 20:05:18 drug via Digitalmars-d-learn wrote:
>>> My case is I don't know what type user will be using, because I write a
>>> library. What's the best way to process char[..] in this case?
>>
>> char[] should never be anything other than UTF-8. Similarly, wchar[] is
>> UTF-16, and dchar[] is UTF-32. So, if you're getting something other than
>> UTF-8, it should not be char[]. It should be something more like ubyte[].
>> If you want to operate on it as char[], you should convert it to UTF-8.
>> std.encoding may or may not help with that. But pretty much everything in D
>> - certainly in the standard library - assumes that char, wchar, and dchar
>> are UTF-encoded, and the language spec basically defines them that way.
>> Technically, you _can_ put other encodings in them, but it's just asking for
>> trouble.
>>
>> - Jonathan M Davis
>>
> I see, thanks. So I should always treat char[] as UTF in D itself, but
> because I need to pass char[], wchar[] or dchar[] to a C library I
> should treat it as not UTF but ubytes sequence or ushort or uint
> sequence - just to pass it correctly, right?

You should just keep in mind that strings returned by Phobos are UTF-encoded. Does your C library have UTF support? Is it relevant at all? Maybe it just treats a char array as binary data. But if it does some non-trivial string and character manipulation or talks to the file system, then it surely expects strings in some specific encoding, and if that encoding is not UTF, you should re-encode the data before passing it from D to this library.

Also, C does not have wchar and dchar, but it has wchar_t, whose size is not fixed and depends on the particular platform.
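The point about wchar_t can be sketched in D. This is only an illustration under stated assumptions: `toWideForC` is a hypothetical helper name, and the code relies on druntime's `core.stdc.stddef.wchar_t` alias, which matches the size used by the platform's C compiler.

```d
import core.stdc.stddef : wchar_t;
import std.utf : toUTF16, toUTF32;

/// Hypothetical helper: re-encodes a UTF-8 D string into code units of the
/// same width as the platform's C wchar_t.
immutable(wchar_t)[] toWideForC(string s)
{
    static if (wchar_t.sizeof == 2)
        return toUTF16(s); // 2-byte wchar_t, for example on Windows
    else
        return toUTF32(s); // 4-byte wchar_t, for example on most POSIX systems
}
```

The returned slice is not zero-terminated, so a C function that expects a terminated wide string would still need a trailing zero appended.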

On 02.09.2015 11:30, FreeSlave wrote:
>> I see, thanks. So I should always treat char[] as UTF in D itself, but
>> because I need to pass char[], wchar[] or dchar[] to a C library I
>> should treat it as not UTF but ubytes sequence or ushort or uint
>> sequence - just to pass it correctly, right?
>
> You should just keep in mind that strings returned by Phobos are UTF
> encoded. Does your C library have UTF support? Is it relevant at all?
> Maybe it just treats char array as binary data. But if it does some
> non-trivial string and character manipulations or talks to file system,
> then it surely should expect strings in some specific encoding, and if
> it's not UTF, you should re-encode data before passing from D to this
> library.
>
> Also C does not have wchar and dchar, but has wchar_t which size is not
> fixed and depends on particular platform.

Well, I think it's not a simple question. The C library I use is the HDF5 library, and in general it stores data without processing it. In particular cases I need to evaluate the situation concretely, I guess.
Thanks all for the answers.

September 03, 2015

Re: Why ElementType!(char[3]) == dchar instead of char?

Posted by Jonathan M Davis
in reply to drug

On Wednesday, September 02, 2015 11:47:11 drug via Digitalmars-d-learn wrote:
> On 02.09.2015 11:30, FreeSlave wrote:
> >> I see, thanks. So I should always treat char[] as UTF in D itself, but because I need to pass char[], wchar[] or dchar[] to a C library I should treat it as not UTF but ubytes sequence or ushort or uint sequence - just to pass it correctly, right?
> >
> > You should just keep in mind that strings returned by Phobos are UTF encoded. Does your C library have UTF support? Is it relevant at all? Maybe it just treats char array as binary data. But if it does some non-trivial string and character manipulations or talks to file system, then it surely should expect strings in some specific encoding, and if it's not UTF, you should re-encode data before passing from D to this library.
> >
> > Also C does not have wchar and dchar, but has wchar_t which size is not fixed and depends on particular platform.
> Well, I think it's not a simple question. The C library I use is the HDF5
> library, and in general it stores data without processing it. In particular
> cases I need to evaluate the situation concretely, I guess.
> Thanks all for the answers.

Yeah. char in C is often used for what D uses ubyte for, so just because a C API uses char doesn't mean that it even has anything to do with strings, let alone UTF. The correct way to deal with a C function depends on the C function, and that requires that you understand enough about what it's doing to know whether you're really dealing with a string or just bytes.
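A minimal sketch of that distinction, assuming a C routine that takes char* but only ever sees raw bytes. The routine `count_nonzero` is hypothetical and is defined in D here purely so the example is self-contained; it is not part of HDF5 or any real library.

```d
// Stand-in for a C routine with the prototype
//   size_t count_nonzero(const char *buf, size_t len);
// It is hypothetical and defined in D only to keep the sketch self-contained.
extern (C) size_t count_nonzero(const(char)* buf, size_t len) nothrow @nogc
{
    size_t n = 0;
    foreach (i; 0 .. len)
        if (buf[i] != 0)
            ++n;
    return n;
}

/// D-side wrapper: the payload is binary, so it is typed as ubyte[] and only
/// cast to char* at the C boundary.
size_t countNonZero(const(ubyte)[] data) nothrow @nogc
{
    return count_nonzero(cast(const(char)*) data.ptr, data.length);
}
```

Keeping the D-side type as ubyte[] means bytes such as 0xFF or 0x80, which are not valid UTF-8 on their own, never end up in a char[] that Phobos would try to decode.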

Fortunately, most of the time - in *nix-land anyway - when char* is treated as string data, it's either ASCII or UTF-8. However, on Windows, it's not, and the situation gets far less pleasant (though if you're dealing with strings in a Windows API, you should almost always be using UTF-16 and avoid that whole issue altogether).
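For the UTF-16 route, a hedged sketch of what the conversion looks like from D. `std.utf.toUTF16z` is a real Phobos function; `utf16UnitCount` and `wideLength` are hypothetical names standing in for a routine that receives a zero-terminated wide string, so no actual Windows API is called here.

```d
import std.utf : toUTF16z;

/// Counts UTF-16 code units up to the terminating zero, the way a routine
/// receiving a zero-terminated wide string pointer would see them.
size_t utf16UnitCount(const(wchar)* p)
{
    size_t n = 0;
    while (p[n] != 0)
        ++n;
    return n;
}

/// Hypothetical helper: converts a UTF-8 D string to zero-terminated UTF-16
/// and reports how many code units the receiving side would read.
size_t wideLength(string s)
{
    return utf16UnitCount(s.toUTF16z);
}
```

A code point outside the Basic Multilingual Plane becomes a surrogate pair, so the number of UTF-16 code units can differ from both the UTF-8 length and the number of code points.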

In any case, you have to be familiar with what the C function is doing and whether it's operating on string data or not rather than just blindly seeing char* and thinking that it's a zero-terminated string.

- Jonathan M Davis
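When the C function really does operate on zero-terminated string data, the conversion in both directions can be sketched as below. `toStringz`, `fromStringz` and `strlen` are real Phobos and druntime functions; `cStringLength` and `fromC` are hypothetical wrapper names, and the sketch assumes the C side expects UTF-8 or ASCII.

```d
import core.stdc.string : strlen;
import std.string : fromStringz, toStringz;

/// D slices carry a length and are not zero-terminated in general, so
/// toStringz is used to obtain a terminated copy before calling into C.
size_t cStringLength(string s)
{
    return strlen(s.toStringz);
}

/// Going the other way: slice the C string up to its terminator and copy it
/// into memory owned by D.
string fromC(const(char)* p)
{
    return p.fromStringz.idup;
}
```

If the C function takes a pointer and a length instead, passing `s.ptr` and `s.length` avoids the copy and the terminator entirely.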
