Thread overview: "String type."

	Jakob Kemi (Mar 11, 2002)
	Pavel Minayev (Mar 11, 2002)
	Jakob Kemi (Mar 11, 2002)
	Jakob Kemi (Mar 11, 2002)
	Walter (Mar 11, 2002)
	Walter (Mar 11, 2002)
	Pavel Minayev (Mar 12, 2002)
	Walter (Mar 12, 2002)
	Walter (Mar 27, 2002)
	J. Daniel Smith (Mar 11, 2002)
	Jakob Kemi (Mar 11, 2002)
	Serge K (Mar 12, 2002)
	Walter (Mar 11, 2002)
	Jakob Kemi (Mar 12, 2002)
	Pavel Minayev (Mar 12, 2002)
	Jakob Kemi (Mar 12, 2002)
	Walter (Mar 12, 2002)
March 11, 2002
As I understand from the docs, D is supposed to use wchars
(2 to 4 bytes) for representing non-ASCII strings. I think it would be
better to let all string functions only handle UTF-8 (which is fully
backwards compatible with ASCII). UTF-8 is slowly becoming standard in
UNIX. (just look at X and Gtk+ 2.0)
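The compatibility claim, as a sketch (illustrative code, not part of
the original post): every ASCII byte is already valid UTF-8, and the
lead byte of a multi-byte sequence encodes its length.

    // Bytes 0x00-0x7F encode themselves in UTF-8, so a pure-ASCII
    // string is already valid UTF-8, byte for byte. Non-ASCII code
    // points become multi-byte sequences, and the lead byte alone
    // determines the sequence length:
    int utf8SequenceLength(ubyte lead)
    {
        if (lead < 0x80)           return 1; // 0xxxxxxx: plain ASCII
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        if ((lead & 0xFC) == 0xF8) return 5; // 111110xx (original spec)
        if ((lead & 0xFE) == 0xFC) return 6; // 1111110x (original spec)
        return -1;                           // continuation or invalid byte
    }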

Just a thought.

	Jakob Kemi
March 11, 2002
"Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...

> As I understand from the docs, D is supposed to use wchars
> (2 to 4 bytes) for representing non-ASCII strings. I think it would be
> better to let all string functions only handle UTF-8 (which is fully
> backwards compatible with ASCII). UTF-8 is slowly becoming standard in
> UNIX. (just look at X and Gtk+ 2.0)

...while UNICODE is already a standard on at least Windows and BeOS
(these, I know for sure; Linux?). I'd prefer to have both char
and wchar flavor for each and every string-manipulation function,
probably overloaded (so you don't really see the difference).
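Overloaded char/wchar flavors of a function would look something like
this (illustrative signatures, not actual library code):

    // The same operation in char and wchar flavor; overload
    // resolution picks the right one, so calling code looks
    // identical either way.
    int find(char[] s, char c)
    {
        for (int i = 0; i < s.length; i++)
            if (s[i] == c)
                return i;
        return -1;
    }

    int find(wchar[] s, wchar c)
    {
        for (int i = 0; i < s.length; i++)
            if (s[i] == c)
                return i;
        return -1;
    }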

BTW, Walter, a question. String literals seem to be char[] by
default, I guess they are wchar[] if the program is written
in UNICODE, though? Also, are UNICODE literals allowed in ASCII
programs?


March 11, 2002
On Mon, 11 Mar 2002 21:48:45 +0100, Pavel Minayev wrote:

> "Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...
> 
>> As I understand from the docs, D is supposed to use wchars (2 to 4 bytes) for representing non-ASCII strings. I think it would be better to let all string functions only handle UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is slowly becoming standard in UNIX. (just look at X and Gtk+ 2.0)
> 
> ...while UNICODE is already a standard on at least Windows and BeOS (these, I know for sure; Linux?). I'd prefer to have both char and wchar flavor for each and every string-manipulation function, probably overloaded (so you don't really see the difference).
> 
> BTW, Walter, a question. String literals seem to be char[] by default, I guess they are wchar[] if the program is written in UNICODE, though? Also, are UNICODE literals allowed in ASCII programs?

Linux supports UTF-8 very well. You can use all your standard
programs with UTF-8 encoding (cat, less, etc.). The best part is
that if all string functions are written to handle UTF-8, they'll
also work for ordinary (legacy) ASCII strings. There's no need to
change the char type or anything else to deal with UTF-8. wchar
would still be useful for interacting with older C libs.
March 11, 2002

I forgot to add: UNICODE is a very loose notion, as it includes
UCS-2 (2-byte), UCS-4 (4-byte), and UTF-8 (variable-width), among
others.

My first reaction was that variable-width characters are kinda gross
and inelegant. However, one would waste memory (yeah, I know, it's
cheap) with 4-byte characters in order to be UNICODE compliant. Also,
the fact that UTF-8 works so well with the ASCII inheritance, and its
fast acceptance in the UNIX world, makes me feel all warm and fuzzy
about it despite its variable character size.
By sticking with UCS-2 and UCS-4 we'll still be in the present-day
situation: all internationalization will be clumsy add-ons to the
standard ASCII strings, and every program will have to decide which
routines to use, etc. (a hell when you're developing big applications
and/or exchanging data between different countries and different
systems). The best thing would of course be if the whole world,
including all legacy databases, every file ever written, and every
old and unsupported application, just magically and instantly were
rewritten or converted to UCS-4. But there is _no_ way that it's
ever going to happen. UTF-8 gives the best compromise, IMO.
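To put rough numbers on that trade-off (bytes per code point; the
example characters are illustrative, not from the post):

    Character         UTF-8   UCS-2   UCS-4
    'A'  (U+0041)       1       2       4
    'é'  (U+00E9)       2       2       4
    '中' (U+4E2D)       3       2       4

UCS-2 stays flat at two bytes, but it cannot represent code points
above U+FFFF at all.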

	Jakob Kemi
March 11, 2002
UTF-8 is fine for strings that are mostly ASCII with some UNICODE (source code, Western European languages). But if the string is entirely UNICODE (something in Chinese, for example), the UTF-8 encoding can consume MORE memory, since the UTF-8 transformation can be as many as six bytes long.

UTF-8 solves a lot of problems, but I'm not sure you want to wire it into the language as the only option.

   Dan

"Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...
> As I understand from the docs, D is supposed to use wchars
> (2 to 4 bytes) for representing non-ASCII strings. I think it would be
> better to let all string functions only handle UTF-8 (which is fully
> backwards compatible with ASCII). UTF-8 is slowly becoming standard in
> UNIX. (just look at X and Gtk+ 2.0)
>
> Just a thought.
>
> Jakob Kemi


March 11, 2002
On Mon, 11 Mar 2002 22:52:46 +0100, J. Daniel Smith wrote:

> UTF-8 is fine for strings that are mostly ASCII with some UNICODE (source code, Western European languages). But if the string is entirely UNICODE (something in Chinese, for example), the UTF-8 encoding can consume MORE memory, since the UTF-8 transformation can be as many as six bytes long.
True.
Globally, however, UTF-8 will save memory compared to UCS-4
(no, UCS-2 isn't enough), since 6-byte-wide characters are rare.
But I think the memory issue doesn't really matter, and it will
matter even less as prices fall. Also, if someone is storing
_huge_ amounts of text, they will just compress it and remove most
of the redundancy in the codeset.


> UTF-8 solves a lot of problems, but I'm not sure you want to wire it into the language as the only option.
It sure does solve lots of problems, and the best part is that you don't have to give up anything else. Just design all string functions to handle UTF-8 and you'll have the best of both worlds (ordinary ASCII char strings and UTF-8, that is).

If the need arises, you can still have special UCS-4 functions and
a ucs4_char type (or whatever).

>    Dan
	Jakob
March 11, 2002
"Pavel Minayev" <evilone@omen.ru> wrote in message news:a6j584$1kuq$1@digitaldaemon.com...
> "Jakob Kemi" <jakob.kemi@telia.com> wrote in message
> news:a6j4v5$1koq$1@digitaldaemon.com...
> BTW, Walter, a question. String literals seem to be char[] by
> default, I guess they are wchar[] if the program is written
> in UNICODE, though? Also, are UNICODE literals allowed in ASCII
> programs?

Actually, string literals are uncommitted by default. They then get converted to char[], wchar[], char, or wchar depending on the context.

You can insert unicode literals into strings with the \uUUUU syntax.
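In other words (an illustrative snippet of the conversions just
described):

    char[]  a = "hello";          // the literal commits to char[] here
    wchar[] b = "hello";          // ...and the same literal to wchar[] here
    char    c = "x";              // a one-character literal can commit to char
    wchar[] w = "price: \u20AC";  // \uUUUU embeds the euro sign, U+20AC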


March 11, 2002
Supporting UTF-8 would just be a matter of using char[] arrays!

"Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j8hj$1m6n$1@digitaldaemon.com...
> On Mon, 11 Mar 2002 21:48:45 +0100, Pavel Minayev wrote:
>
> > "Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...
> >
> >> As I understand from the docs, D is supposed to use wchars (2 to 4 bytes) for representing non-ASCII strings. I think it would be better to let all string functions only handle UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is slowly becoming standard in UNIX. (just look at X and Gtk+ 2.0)
> >
> > ...while UNICODE is already a standard on at least Windows and BeOS (these, I know for sure; Linux?). I'd prefer to have both char and wchar flavor for each and every string-manipulation function, probably overloaded (so you don't really see the difference).
> >
> > BTW, Walter, a question. String literals seem to be char[] by default, I guess they are wchar[] if the program is written in UNICODE, though? Also, are UNICODE literals allowed in ASCII programs?
>
> I forgot to add. UNICODE is a very loose notion as it includes
> UCS-2 (2 byte), UCS-4 (4 byte) and UTF-8 (variable width) among
> others.
>
> My first reaction was that variable width characters is kinda gross
> and inelegant.
> However, one would waste memory (yeah, I know, it's cheap) with 4
> byte characters in order to be UNICODE compliant. Also the fact
> that UTF-8 works so well with ASCII inheritance and it's fast
> acceptance in the UNIX world make me feel all warm and fuzzy about
> it, despite its variable character size.
> By sticking with UCS-2 and UCS-4 we'll still be in present day's
> situation, all internationalization will be clumpsy addons to
> the standard ASCII strings and every program will have to decide
> which routines to use etc. (a hell when you're developing big
> applications and/or exchanging data from different countries and
> different systems.) The best thing would offcourse be if the whole
> world including all legacy databases, every file ever written and
> every old and unsupported application just magically and instantly
> were rewritten or converted to UCS-4. But there is _no_ way that
> it's ever going to happen. UTF-8 gives the best compromise IMO.
>
> Jakob Kemi


March 11, 2002
"Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...
> As I understand from the docs, D is supposed to use wchars
> (2 to 4 bytes) for representing non-ASCII strings. I think it would be
> better to let all string functions only handle UTF-8 (which is fully
> backwards compatible with ASCII). UTF-8 is slowly becoming standard in
> UNIX. (just look at X and Gtk+ 2.0)

At one time I had written a lexer that handled UTF-8 source. It turned out to cause a lot of problems, because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented.

It turned out to be a lot of trouble :-( and I finally converted it to wchars.
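Concretely, the problem is that the n-th character of a UTF-8 string
is no longer s[n]; you have to walk from the start (an illustrative
sketch, not the actual lexer code):

    // Find the byte offset of the n-th character in a UTF-8 string.
    // Every lookup is O(n), which is exactly what breaks "simply
    // indexed by character position".
    size_t byteOffsetOfChar(char[] s, size_t n)
    {
        size_t i = 0;
        while (n > 0 && i < s.length)
        {
            ubyte b = cast(ubyte) s[i];
            if      (b < 0x80)           i += 1; // plain ASCII
            else if ((b & 0xE0) == 0xC0) i += 2;
            else if ((b & 0xF0) == 0xE0) i += 3;
            else if ((b & 0xF8) == 0xF0) i += 4;
            else                         i += 1; // skip invalid byte
            n--;
        }
        return i;
    }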


March 12, 2002
On Tue, 12 Mar 2002 01:00:49 +0100, Walter wrote:


> "Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...
>> As I understand from the docs, D is supposed to use wchars (2 to 4 bytes) for representing non-ASCII strings. I think it would be better to let all string functions only handle UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is slowly becoming standard in UNIX. (just look at X and Gtk+ 2.0)
> 
> At one time I had written a lexer that handled UTF-8 source. It turned out to cause a lot of problems, because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented.
> 
> It turned out to be a lot of trouble :-( and I finally converted it to wchars.

You already have this problem on Windows, with line breaks being two bytes. Just use custom iterators in your string class implementation, and if you need to set/get positions in streams, use tell and seek (you're not supposed to assume that 1 character == 1 byte anyway, according to the standards). There should be no real _need_ to index characters in strings with pointers.
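Such a custom iterator might look roughly like this (an illustrative
sketch; the decoding is simplified and does no validation):

    // Walks a UTF-8 buffer one code point at a time, so callers
    // never index by byte position.
    struct Utf8Iter
    {
        char[] s;   // the buffer being walked
        size_t i;   // current byte offset

        bool done() { return i >= s.length; }

        // Decode the code point at the current position and advance
        // past however many bytes it occupies.
        dchar next()
        {
            ubyte b = cast(ubyte) s[i];
            int len;
            uint c;
            if      (b < 0x80)           { len = 1; c = b;        }
            else if ((b & 0xE0) == 0xC0) { len = 2; c = b & 0x1F; }
            else if ((b & 0xF0) == 0xE0) { len = 3; c = b & 0x0F; }
            else                         { len = 4; c = b & 0x07; }
            for (int k = 1; k < len; k++)
                c = (c << 6) | (cast(ubyte) s[i + k] & 0x3F);
            i += len;
            return cast(dchar) c;
        }
    }

Stream positions then stay byte-based (tell and seek), while
character positions only ever advance through the iterator.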

	Jakob Kemi