Wide characters support in D
June 07, 2010
Note: I posted this already on runtime D list, but I think that list was a wrong one for this question. Sorry for duplication :-)

Hi. I am new to D. It looks like D supports 3 types of characters: char, wchar, dchar. This is cool, however, I have some questions about it:

1. When we have 2 methods (one with wchar[] and another with char[]), how D will determine which one to use if I pass a string "hello world"?
2. Many libraries (e.g. tango or phobos) don't provide functions/methods (or have incomplete support) for wchar/dchar
e.g. writefln probably assumes char[] for strings like "Number %d..."
3. Even if they do support them, it is kind of annoying to provide methods for all 3 types of chars. Especially, if we want to use native mode (e.g. for Windows wchar is better, for Linux char is better). E.g. Windows has _wopen, _wdirent, _wreaddir, _wopendir, _wmain(int argc, wchar_t[] argv) and so on, and they should be native (in a sense that no conversion is necessary when we do, for instance, _wopen). Linux doesn't have them as UTF-8 is used widely there.

Since D language is targeted on system programming, why not to try to use whatever works better on a particular system (e.g. char will be 2 bytes on Windows and 1 byte on Linux; it can be a compiler switch, and all libraries can be compiled properly on a particular system). It's still necessary to have all 3 types of char for cooperation with C. But in those cases byte, short and int will do their work. For this kind of situation, it would be nice to have some built-in functions for transparent conversion from char to byte/short/int and vice versa (especially, if conversion only happens if needed on a particular platform).

In my opinion, to separate notion of character from byte would be nice, and it makes sense as a particular platform uses either UTF-8 or UTF-16 natively. Programmers may write universal code (like TCHAR on Windows). Unfortunately, C uses 'char' and 'byte' interchangeably but why D has to make this mistake again?

Sorry if my suggestion sounds odd. Anyway, it would be great to hear something from D gurus :-)

Ruslan.



June 07, 2010
Ruslan Nikolaev <nruslan_devel@yahoo.com> wrote:

> 1. When we have 2 methods (one with wchar[] and another with char[]), how D will determine which one to use if I pass a string "hello world"?

String literals in D(2) are of type immutable(char)[] (char[] in D1) by
default, and thus will be handled by the char[]-version of the function.
Should you want a string literal of a different type, append a c, w, or
d to specify char[], wchar[] or dchar[]. Or use a cast.
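As a compilable D2 sketch of the above (the `pick` overloads are made up purely for illustration):

```d
// Each overload accepts one of D's three string types.
string pick(string s)  { return "utf8"; }   // immutable(char)[]
string pick(wstring s) { return "utf16"; }  // immutable(wchar)[]
string pick(dstring s) { return "utf32"; }  // immutable(dchar)[]

void main()
{
    assert(pick("hello"c) == "utf8");   // postfix fixes the literal's type
    assert(pick("hello"w) == "utf16");
    assert(pick("hello"d) == "utf32");
    assert(pick("hello")  == "utf8");   // no postfix: defaults to string
}
```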

> Since D language is targeted on system programming, why not to try to use whatever works better on a particular system (e.g. char will be 2 bytes on Windows and 1 byte on Linux; it can be a compiler switch, and all libraries can be compiled properly on a particular system).

Because this leads to unportable code, that fails in unexpected ways
when moved from one system to another, thus increasing rather than
decreasing the cognitive load on the hapless programmer.

> It's still necessary to have all 3 types of char for cooperation with C. But in those cases byte, short and int will do their work.

Absolutely not. One of the things D tries, is doing strings right. For
that purpose, all 3 types are needed.

> In my opinion, to separate notion of character from byte would be nice, and it makes sense as a particular platform uses either UTF-8 or UTF-16 natively. Programmers may write universal code (like TCHAR on Windows). Unfortunately, C uses 'char' and 'byte' interchangeably but why D has to make this mistake again?

D has not. A char is a character, a possibly incomplete UTF-8 codepoint,
while a byte is a byte, a humble number in the order of -128 to +127.

Yes, it is possible to abuse char in D, and byte likewise. D aims to allow
programmers to program close to the metal if the programmer so wishes, and
thus does not pretend char is an opaque type about which nothing can be
known.

-- 
Simen
June 07, 2010
On 07/06/10 22:48, Ruslan Nikolaev wrote:
> Note: I posted this already on runtime D list, but I think that list
> was a wrong one for this question. Sorry for duplication :-)
>
> Hi. I am new to D. It looks like D supports 3 types of characters:
> char, wchar, dchar. This is cool, however, I have some questions
> about it:
>
> 1. When we have 2 methods (one with wchar[] and another with char[]),
> how D will determine which one to use if I pass a string "hello
> world"?

If you pass "Hello World", this is always a string (char[] in D1, immutable(char)[] in D2). If you want to specify a type with a string literal, you can use "Hello World"w or "Hello World"d for wstring and dstring respectively.

> 2. Many libraries (e.g. tango or phobos) don't provide
> functions/methods (or have incomplete support) for wchar/dchar e.g.
> writefln probably assumes char[] for strings like "Number %d..."

In Tango most, if not all, string functions are templated, so they work with all string types: char[], wchar[] and dchar[]. I don't know how well Phobos supports other string types; I know Phobos 1 is extremely limited for types other than char[], and I don't know about Phobos 2.

> 3.
> Even if they do support, it is kind of annoying to provide methods
> for all 3 types of chars. Especially, if we want to use native mode
> (e.g. for Windows wchar is better, for Linux char is better). E.g.
> Windows has _wopen, _wdirent, _wreaddir, _wopenddir, _wmain(int argc,
> wchar_t[] argv) and so on, and they should be native (in a sense that
> no conversion is necessary when we do, for instance, _wopen). Linux
> doesn't have them as UTF-8 is used widely there.

Enter templates! You can write the function once and have it work with all three string types with little effort involved. All the lower level functions that interact with the operating system are abstracted away nicely for you in both Tango and Phobos, so you'll never have to deal with this for basic functions. For your own it's a simple matter of templating them in most cases.
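For instance, a single templated function (countSpaces is a made-up name here) compiles once per string type you actually call it with:

```d
// One definition serves char[], wchar[] and dchar[]; Char is inferred.
size_t countSpaces(Char)(const(Char)[] text)
{
    size_t n;
    foreach (dchar c; text)  // decoding foreach yields whole code points
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    assert(countSpaces("a b c")  == 2);  // instantiated with char
    assert(countSpaces("a b c"w) == 2);  // instantiated with wchar
    assert(countSpaces("a b c"d) == 2);  // instantiated with dchar
}
```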

> Since D language is targeted on system programming, why not to try to
> use whatever works better on a particular system (e.g. char will be 2
> bytes on Windows and 1 byte on Linux; it can be a compiler switch,
> and all libraries can be compiled properly on a particular system).
> It's still necessary to have all 3 types of char for cooperation with
> C. But in those cases byte, short and int will do their work. For
> this kind of situation, it would be nice to have some built-in
> functions for transparent conversion from char to byte/short/int and
> vice versa (especially, if conversion only happens if needed on a
> particular platform).

This is something C did wrong. If compilers are free to choose their own width for the string type you end up with the mess C has where every library introduces their own custom types to make sure they're the expected length, eg uint32_t etc. Having things the other way around makes life far easier - int is always 32bits signed for example, the same applies to strings. You can use version blocks if you want to specify a type which changes based on platform, I wouldn't recommend it though, it just makes life harder in the long run.

> In my opinion, to separate notion of character from byte would be
> nice, and it makes sense as a particular platform uses either UTF-8
> or UTF-16 natively. Programmers may write universal code (like TCHAR
> on Windows). Unfortunately, C uses 'char' and 'byte' interchangeably
> but why D has to make this mistake again?

They are different types in D, so I'm not sure what you mean. byte/ubyte have no encoding associated with them, char is always UTF-8, wchar UTF-16 etc.

Robert
June 07, 2010
Ruslan Nikolaev wrote:

> 1. When we have 2 methods (one with wchar[] and another with char[]), how D will determine which one to use if I pass a string "hello world"?

I asked the same question on the D.learn group recently. Literals like that don't have a particular encoding. The programmer must specify explicitly to resolve ambiguities: "hello world"c or "hello world"w.

> 3. Even if they do support, it is kind of annoying to provide methods for all 3 types of chars. Especially, if we want to use native mode

I think the solution is to take advantage of templates and use template constraints if the template parameter is too flexible.

Another approach might be to use dchar within the application and use other encodings on the interfaces.
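In Phobos 2 that round trip could look like this (a sketch using std.conv; the variable names are arbitrary):

```d
import std.conv : to;

void main()
{
    // Data arrives at the interface as UTF-8...
    string input = "héllo";

    // ...is widened to UTF-32 for internal work, one code point per
    // element, so indexing and length behave "by character"...
    dstring work = to!dstring(input);
    assert(work.length == 5);    // five code points
    assert(input.length == 6);   // six UTF-8 code units ('é' takes two)

    // ...and is narrowed back to UTF-8 at the boundary.
    assert(to!string(work) == input);
}
```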

Ali
June 07, 2010
This doesn't answer all your questions and suggestions, but here goes.
In answer to #1, "Hello world" is a literal of type char[] (or string). If you want
to use UTF-16 or 32, use "Hello world"w and "Hello world"d respectively.
In partial answer to #2 and #3, it's generally pretty easy to adapt a string
function to support string, wstring, and dstring by using templating and the fact
that D can do automatic conversions for you. For instance:

string blah = "hello world";
foreach (dchar c; blah)   // guaranteed to get a full character
  // do something
June 07, 2010
Ok, ok... that was just a suggestion... Thanks for the reply about the "Hello world" representation. Were the postfixes "w" and "d" added initially or just recently? I did not know about them. I thought D did automatic conversion for string literals.

Yes, templates may help. However, that unnecessarily makes the code bigger (since we have to compile it for every char type). The other problem is that it allows the programmer to choose which one to use. He or she may just prefer char[] as UTF-8 (or wchar[] as UTF-16). That will be fine on a platform that supports this encoding natively (e.g. for file system operations, screen output, etc.), whereas it will cause conversion overhead on the other. Not to say that it's a big overhead, but an unnecessary one. Having said this, I do agree that there must be some flexibility (e.g. in Java char[] is always 2 bytes); however, I don't believe that this flexibility should be available to the application programmer.

I don't think there is any problem with having a different size of char per platform. In fact, that would make programs better (since application programmers would have to think in terms of characters as opposed to bytes). System programmers (i.e. OS programmers) may choose to think of it as they expect it to be (since a char width option can be added to the compiler). TCHAR in Windows is a good example of this. Whenever you need to determine the size of an element (e.g. for allocation), you can use 'sizeof'. Again, it does not mean that you're deprived of char/wchar/dchar capability. It can still be supported (e.g. via ubyte/ushort/uint) for the sake of interoperability or some special cases. Special string constants (e.g. ""b, ""w, ""d) can be supported, too. My only point is that it would be good to have a universal char type that depends on the platform. That, in turn, allows a unified char type for all libraries on that platform.

In addition, commonly used constants '\n', '\r', '\t' will be the same regardless of char width.

Anyway, that was just a suggestion. You may disagree with this if you wish.

Ruslan.



June 07, 2010
Ruslan Nikolaev wrote:
> Note: I posted this already on runtime D list,

Although D is designed to be fairly agnostic about character types, in practice I recommend the following:

1. Use the string type for strings, it's char[] on D1 and immutable(char)[] on D2.

2. Use dchar's to hold individual characters.

The problem with wchar's is that everyone forgets about surrogate pairs. Most UTF-16 programs in the wild, including nearly all Java programs, are broken with regard to surrogate pairs. The problem with dchar's is strings of them consume memory at a prodigious rate.
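The surrogate-pair pitfall is easy to demonstrate in D2 (a sketch; any character above U+FFFF will do):

```d
import std.conv : to;

void main()
{
    // U+1D11E (musical G clef) lies outside the Basic Multilingual Plane,
    // so UTF-16 must encode it as a surrogate pair.
    dstring d = "\U0001D11E"d;
    wstring w = to!wstring(d);

    assert(d.length == 1);  // one code point in UTF-32
    assert(w.length == 2);  // two UTF-16 code units: a surrogate pair
    // Code that treats each wchar as one character gets this wrong.
}
```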
June 08, 2010
On 08.06.2010 01:16, Ruslan Nikolaev wrote:
> Ok, ok... that was just a suggestion... Thanks, for reply about "Hello world" representation. Was postfix "w" and "d" added initially or just recently? I did not know about it. I thought D does automatic conversion for string literals.
>

There is automatic conversion, try this example:

---
import std.stdio;

//void f(char[] s) { writefln("char"); }
void f(wchar[] s) { writefln("wchar"); }

void main()
{
  f("hello");
}
---

As long as there's just one possible match, a string literal with no postfix will be interpreted as char[], wchar[], or dchar[] depending on context.  But if you uncomment the first f(), the compiler will complain about there being two matching overloads.  Then you'll have to add the 'c' or 'w' postfixes to the string literal to disambiguate.

For templates and type inference, string literals default to char[].

This example prints 'char':
---
import std.stdio;

void f(T)(T[] s) { writefln(T.stringof); }

void main()
{
  f("hello");
}
---
June 08, 2010
On Mon, 07 Jun 2010 17:48:09 -0400, Ruslan Nikolaev <nruslan_devel@yahoo.com> wrote:

> Note: I posted this already on runtime D list, but I think that list was a wrong one for this question. Sorry for duplication :-)
>
> Hi. I am new to D. It looks like D supports 3 types of characters: char, wchar, dchar. This is cool, however, I have some questions about it:
>
> 1. When we have 2 methods (one with wchar[] and another with char[]), how D will determine which one to use if I pass a string "hello world"?
> 2. Many libraries (e.g. tango or phobos) don't provide functions/methods (or have incomplete support) for wchar/dchar
> e.g. writefln probably assumes char[] for strings like "Number %d..."
> 3. Even if they do support, it is kind of annoying to provide methods for all 3 types of chars. Especially, if we want to use native mode (e.g. for Windows wchar is better, for Linux char is better). E.g. Windows has _wopen, _wdirent, _wreaddir, _wopendir, _wmain(int argc, wchar_t[] argv) and so on, and they should be native (in a sense that no conversion is necessary when we do, for instance, _wopen). Linux doesn't have them as UTF-8 is used widely there.
>
> Since D language is targeted on system programming, why not to try to use whatever works better on a particular system (e.g. char will be 2 bytes on Windows and 1 byte on Linux; it can be a compiler switch, and all libraries can be compiled properly on a particular system). It's still necessary to have all 3 types of char for cooperation with C. But in those cases byte, short and int will do their work. For this kind of situation, it would be nice to have some built-in functions for transparent conversion from char to byte/short/int and vice versa (especially, if conversion only happens if needed on a particular platform).
>
> In my opinion, to separate notion of character from byte would be nice, and it makes sense as a particular platform uses either UTF-8 or UTF-16 natively. Programmers may write universal code (like TCHAR on Windows). Unfortunately, C uses 'char' and 'byte' interchangeably but why D has to make this mistake again?

One thing that may not be clear from your interpretation of D's docs, all strings representable by one character type are also representable by all the other character types.  This means that a function that takes a char[] can also take a dchar[] if it is sent through a converter (i.e. toUtf8 on Tango I think).

So D's char is decidedly not like byte or ubyte, or C's char.

In general, I use char (utf8) because I am used to C and ASCII (which is exactly represented in utf-8).  But because char is utf-8, it could potentially accept any unicode string.
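In Phobos 2 the converters live in std.utf; a sketch of the lossless round trip described above:

```d
import std.utf : toUTF8, toUTF16, toUTF32;

void main()
{
    string  s8  = "Привет";       // non-ASCII text stored as UTF-8
    wstring s16 = toUTF16(s8);    // same text, UTF-16 code units
    dstring s32 = toUTF32(s8);    // same text, one dchar per code point

    assert(s32.length == 6);      // six code points
    assert(toUTF8(s16) == s8);    // conversions round-trip losslessly
}
```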

-Steve
June 08, 2010
"Ruslan Nikolaev" <nruslan_devel@yahoo.com> wrote in message news:mailman.122.1275952601.24349.digitalmars-d@puremagic.com...
> Ok, ok... that was just a suggestion... Thanks, for reply about "Hello world" representation. Was postfix "w" and "d" added initially or just recently? I did not know about it. I thought D does automatic conversion for string literals.
>

The postfix 'c', 'w' and 'd' have been in there a long time. But D does have a little bit of automatic conversion. Let me try to clarify:

    "hello"c  // string, UTF-8
    "hello"w  // wstring, UTF-16
    "hello"d  // dstring, UTF-32
    "hello"   // Depends how you use it

Suppose I have a function that takes a UTF-8 string, and I call it:

    void cfoo(string a) {}

    cfoo("hello"c); // Works
    cfoo("hello"w); // Error, wrong type
    cfoo("hello"d); // Error, wrong type
    cfoo("hello");  // Works, assumed to be UTF-8 string

If I make a different function that takes a UTF-16 wstring instead:

    void wfoo(wstring a) {}

    wfoo("hello"c); // Error, wrong type
    wfoo("hello"w); // Works
    wfoo("hello"d); // Error, wrong type
    wfoo("hello");  // Works, assumed to be UTF-16 wstring

And then, a UTF-32 dstring version would be similar:

    void dfoo(dstring a) {}

    dfoo("hello"c); // Error, wrong type
    dfoo("hello"w); // Error, wrong type
    dfoo("hello"d); // Works
    dfoo("hello");  // Works, assumed to be UTF-32 dstring

As you can see, the literals with postfixes are always the exact type you specify. If you have no postfix, then you get whatever the compiler expects it to be.

But, then the question is, what happens if any of those types can be used? Which does the compiler choose?

    void Tfoo(T)(T a)
    {
        // When compiling, display the type used.
        pragma(msg, T.stringof);
    }

    Tfoo("hello");

(Normally you'd want to add in a constraint that T must be one of the string types, so that no one tries to pass in an int or float or something. I skipped that in there.)

In that, Tfoo isn't expecting any particular type of string, it can take any type. And "hello" doesn't have a postfix, so the compiler uses the default: UTF-8 string.
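The constraint mentioned in the parenthetical might look like this in D2 (using isSomeString from std.traits):

```d
import std.traits : isSomeString;

// Restrict T to string, wstring or dstring (and their mutable variants).
void Tfoo(T)(T a) if (isSomeString!T)
{
    pragma(msg, T.stringof);  // shows the chosen type at compile time
}

void main()
{
    Tfoo("hello");   // fine: T is string
    // Tfoo(42);     // would not compile: rejected by the constraint
}
```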

> Yes, templates may help. However, that unnecessary make code bigger (since we have to compile it for every char type).<

It only generates code for the types that are actually needed. If, for instance, your program never uses anything except UTF-8, then only one version of the function will be made - the UTF-8 version. If you don't use every char type, then it doesn't generate code for every char type - just the ones you choose to use.

>The other problem is that it allows programmer to choose which one to use. He or she may just prefer char[] as UTF-8 (or wchar[] as UTF-16). That will be fine on platform that supports this encoding natively (e.g. for file system operations, screen output, etc.), whereas it will cause conversion overhead on the other. I don't think there is any problem with having different size of char. In fact, that would make programs better (since application programmers will have to think in terms of characters as opposed to bytes). Not to say that it's a big overhead, but unnecessary one. Having said this, I do agree that there must be some flexibility (e.g. in Java char[] is always 2 bytes), however, I don't believe that this flexibility should be available for application programmer.
<

That's not good. First of all, UTF-16 is a lousy encoding, it combines the worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned like UTF-8, but it still wastes a lot of space like UTF-32. So even if your OS uses it natively, it's still best to do most internal processing in either UTF-8 or UTF-32. (And with templated string functions, if the programmer actually does want to use the native type in the *rare* cases where he's making enough OS calls that it would actually matter, he can still do so.)

Secondly, the programmer *should* be able to use whatever type he decides is appropriate. If he wants to stick with native, he can do so, but he shouldn't be forced into choosing between "use the native encoding" and "abuse the type system by pretending that an int is a character". For instance, complex low-level text processing *relies* on knowing exactly what encoding is being used and coding specifically to that encoding. As an example, I'm currently working on a generalized parser library ( http://www.dsource.org/projects/goldie ). Something like that is complex enough already that implementing the internal lexer natively for each possible native text encoding is just not worthwhile, especially since the text hardly ever gets passed to or from any OS calls that expect any particular encoding. Or maybe you're on a fancy OS that can handle any encoding natively. Or maybe the programmer is in a low-memory (or very-large-data) situation and needs the space savings of UTF-8 regardless of OS and doesn't care about speed. Or maybe they're actually *writing* an OS (Most modern languages are completely useless for writing an OS. D isn't). A language or a library should *never* assume it knows the programmer's needs better than the programmer does.

Also, C already tried the approach of multi-sized types (ex, C's "int"), and it ended up being a big PITA disaster that everyone ended up having to make up hacks to work around.

> System programmers (i.e. OS programmers) may choose to think as they expect it to be (since char width option can be added to compiler).<

See that's the thing, D is intended as a systems language, so a D programmer must be able to easily handle it that way whenever they need to.

>TCHAR in Windows is a good example of it. Whenever you need to determine size of element (e.g. for allocation), you can use 'sizeof'. Again, it does not mean that you're deprived of char/wchar/dchar capability. It still can be supported (e.g. via ubyte/ushort/uint) for the sake of interoperability or some special cases. Special string constants (e.g. ""b, ""w, ""d) can be supported, too. My only point is that it would be good to have universal char type that depends on platform.

You can have that easily:

version(Windows)
    alias wstring tstring;
else
    alias string tstring;

Besides, just because you *can* get a job done a certain way doesn't mean languages should never try to allow a better way for those who want a better way.

> That, in turns, allows to have unified char for all libraries on this platform.
>

With templated text functions, there is very little benefit to be gained from having a unified char. Just wouldn't serve any real purpose. All it would do is cause problems for anyone who needs to work at the low-level.

-------------------------------
Not sent from an iPhone.

