September 02, 2015
On 02.09.2015 00:08, Jonathan M Davis via Digitalmars-d-learn wrote:
> On Tuesday, September 01, 2015 20:05:18 drug via Digitalmars-d-learn wrote:
>> My case is I don't know what type user will be using, because I write a
>> library. What's the best way to process char[..] in this case?
>
> char[] should never be anything other than UTF-8. Similarly, wchar[] is
> UTF-16, and dchar[] is UTF-32. So, if you're getting something other than
> UTF-8, it should not be char[]. It should be something more like ubyte[].
> If you want to operate on it as char[], you should convert it to UTF-8.
> std.encoding may or may not help with that. But pretty much everything in D
> - certainly in the standard library - assumes that char, wchar, and dchar
> are UTF-encoded, and the language spec basically defines them that way.
> Technically, you _can_ put other encodings in them, but it's just asking for
> trouble.
>
> - Jonathan M Davis
>
I see, thanks. So I should always treat char[] as UTF in D itself, but because I need to pass char[], wchar[] or dchar[] to a C library, I should treat it not as UTF but as a sequence of ubyte, ushort or uint - just to pass it correctly, right?
September 02, 2015
On Wednesday, 2 September 2015 at 05:00:42 UTC, drug wrote:
> On 02.09.2015 00:08, Jonathan M Davis via Digitalmars-d-learn wrote:
>> On Tuesday, September 01, 2015 20:05:18 drug via Digitalmars-d-learn wrote:
>>> My case is I don't know what type user will be using, because I write a
>>> library. What's the best way to process char[..] in this case?
>>
>> char[] should never be anything other than UTF-8. Similarly, wchar[] is
>> UTF-16, and dchar[] is UTF-32. So, if you're getting something other than
>> UTF-8, it should not be char[]. It should be something more like ubyte[].
>> If you want to operate on it as char[], you should convert it to UTF-8.
>> std.encoding may or may not help with that. But pretty much everything in D
>> - certainly in the standard library - assumes that char, wchar, and dchar
>> are UTF-encoded, and the language spec basically defines them that way.
>> Technically, you _can_ put other encodings in them, but it's just asking for
>> trouble.
>>
>> - Jonathan M Davis
>>
> I see, thanks. So I should always treat char[] as UTF in D itself, but because I need to pass char[], wchar[] or dchar[] to a C library, I should treat it not as UTF but as a sequence of ubyte, ushort or uint - just to pass it correctly, right?

You should just keep in mind that strings returned by Phobos are UTF encoded. Does your C library have UTF support? Is it relevant at all? Maybe it just treats a char array as binary data. But if it does some non-trivial string and character manipulation or talks to the file system, then it surely expects strings in some specific encoding, and if that encoding is not UTF, you should re-encode the data before passing it from D to this library.
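
As a rough illustration of the re-encoding step, here is a minimal sketch using std.encoding. It assumes, purely for the example, that the C side wants zero-terminated Latin-1; `toLatin1z` is a hypothetical helper name, not something from Phobos or from any particular C library.

```d
import std.encoding : Latin1String, transcode;

// Hypothetical helper: re-encode a UTF-8 D string as zero-terminated
// Latin-1 bytes. Latin-1 is only an assumed target encoding here;
// use whatever encoding the C library actually documents.
immutable(ubyte)[] toLatin1z(string utf8)
{
    Latin1String latin1;
    transcode(utf8, latin1);                      // UTF-8 -> Latin-1
    auto bytes = cast(immutable(ubyte)[]) latin1; // reinterpret, no copy
    return bytes ~ cast(ubyte) 0;                 // append the C terminator
}

void main()
{
    import std.stdio : writefln;

    auto buf = toLatin1z("caf\u00E9");
    // buf.ptr could now be handed to a C function declared as taking
    // const(ubyte)* (or cast to const(char)* to match a C prototype).
    writefln("%(%02x %)", buf);
}
```

Keeping the re-encoded data as ubyte[] rather than char[] on the D side makes it harder to feed non-UTF-8 bytes to Phobos functions by accident.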

Also, C does not have wchar and dchar, but it has wchar_t, whose size is not fixed and depends on the platform.
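
On the D side, druntime exposes the platform's wchar_t as an alias in core.stdc.stddef, so a sketch like the following can adapt to either size. `toWideCString` is a hypothetical helper name used only for illustration.

```d
import core.stdc.stddef : wchar_t;
import std.utf : toUTFz;

// Hypothetical helper: convert a D string to a zero-terminated wide
// string in whatever code unit the platform's C wchar_t uses.
const(wchar_t)* toWideCString(string s)
{
    return toUTFz!(const(wchar_t)*)(s);
}

void main()
{
    import std.stdio : writeln;

    // The size is known at compile time, so static if can branch on it.
    static if (wchar_t.sizeof == 2)
        writeln("wchar_t is 2 bytes here (matches D's wchar)");
    else
        writeln("wchar_t is ", wchar_t.sizeof, " bytes here");

    auto p = toWideCString("h\u00E9llo");
    assert(p[5] == 0); // five code units, then the terminator
}
```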
September 02, 2015
On 02.09.2015 11:30, FreeSlave wrote:
>> I see, thanks. So I should always treat char[] as UTF in D itself, but
>> because I need to pass char[], wchar[] or dchar[] to a C library, I
>> should treat it not as UTF but as a sequence of ubyte, ushort or uint -
>> just to pass it correctly, right?
>
> You should just keep in mind that strings returned by Phobos are UTF
> encoded. Does your C library have UTF support? Is it relevant at all?
> Maybe it just treats a char array as binary data. But if it does some
> non-trivial string and character manipulation or talks to the file system,
> then it surely expects strings in some specific encoding, and if that
> encoding is not UTF, you should re-encode the data before passing it from
> D to this library.
>
> Also, C does not have wchar and dchar, but it has wchar_t, whose size is
> not fixed and depends on the platform.
Well, I think it's not a simple question. The C library I use is the HDF5 library, and in general it stores data without processing it. I guess I need to evaluate each particular situation concretely.
Thanks, all, for the answers.
September 03, 2015
On Wednesday, September 02, 2015 11:47:11 drug via Digitalmars-d-learn wrote:
> On 02.09.2015 11:30, FreeSlave wrote:
> >> I see, thanks. So I should always treat char[] as UTF in D itself, but because I need to pass char[], wchar[] or dchar[] to a C library, I should treat it not as UTF but as a sequence of ubyte, ushort or uint - just to pass it correctly, right?
> >
> > You should just keep in mind that strings returned by Phobos are UTF encoded. Does your C library have UTF support? Is it relevant at all? Maybe it just treats a char array as binary data. But if it does some non-trivial string and character manipulation or talks to the file system, then it surely expects strings in some specific encoding, and if that encoding is not UTF, you should re-encode the data before passing it from D to this library.
> >
> > Also, C does not have wchar and dchar, but it has wchar_t, whose size is not fixed and depends on the platform.
> Well, I think it's not a simple question. The C library I use is the HDF5
> library, and in general it stores data without processing it. I guess I
> need to evaluate each particular situation concretely.
> Thanks, all, for the answers.

Yeah. char in C is often used for what D uses ubyte for, so just because C uses a char doesn't mean that it even has anything to do with strings, let alone UTF. The correct way to deal with a C function depends on the C function, and that requires that you understand enough about what it's doing to know whether you're really dealing with a string or just bytes.

Fortunately, most of the time - in *nix-land anyway - when char* is treated as string data, it's either ASCII or UTF-8. On Windows, however, it's not, and the situation gets far less pleasant (though if you're passing strings to a Windows API, you should almost always be using UTF-16 and avoid that whole issue altogether).

In any case, you have to be familiar with what the C function is doing and whether it's operating on string data or not rather than just blindly seeing char* and thinking that it's a zero-terminated string.
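
For the case where a C function really does hand back a zero-terminated string, a minimal sketch of checking it before treating it as D's char[] might look like this. `isValidUtf8z` is a hypothetical helper name. Passing validation only shows that the bytes are well-formed UTF-8 (pure ASCII always passes); it does not prove that the C library intended them as UTF-8.

```d
import std.string : fromStringz;
import std.utf : UTFException, validate;

// Hypothetical helper: true if the zero-terminated C buffer holds
// well-formed UTF-8 and can therefore be used as a D char[] safely.
bool isValidUtf8z(const(char)* cstr)
{
    if (cstr is null)
        return false;

    auto slice = fromStringz(cstr); // slices up to the first '\0', no copy
    try
    {
        validate(slice);            // throws UTFException on malformed UTF-8
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Well-formed UTF-8: "caf" followed by the two-byte sequence for U+00E9.
    assert(isValidUtf8z("caf\u00E9".ptr));

    // The same word in Latin-1: the lone 0xE9 byte is not valid UTF-8,
    // so this buffer is better kept as ubyte[] on the D side.
    immutable ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9, 0x00];
    assert(!isValidUtf8z(cast(const(char)*) latin1.ptr));
}
```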

- Jonathan M Davis
