Thread overview: "String type."

	Jakob Kemi (Mar 11, 2002)
	Pavel Minayev (Mar 11, 2002)
	Jakob Kemi (Mar 11, 2002)
	Jakob Kemi (Mar 11, 2002)
	Walter (Mar 11, 2002)
	Walter (Mar 11, 2002)
	Pavel Minayev (Mar 12, 2002)
	Walter (Mar 12, 2002)
	Walter (Mar 27, 2002)
	J. Daniel Smith (Mar 11, 2002)
	Jakob Kemi (Mar 11, 2002)
	Serge K (Mar 12, 2002)
	Walter (Mar 11, 2002)
	Jakob Kemi (Mar 12, 2002)
	Pavel Minayev (Mar 12, 2002)
	Jakob Kemi (Mar 12, 2002)
	Walter (Mar 12, 2002)
March 11, 2002
As I understand from the docs, D is supposed to use wchars
(2 to 4 bytes) for representing non-ASCII strings. I think it would be
better to let all string functions only handle UTF-8 (which is fully
backwards compatible with ASCII). UTF-8 is slowly becoming standard in
UNIX. (just look at X and Gtk+ 2.0)
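The compatibility claim, as a sketch (illustrative code, not part of
the original post): every ASCII byte is already valid UTF-8, and the
lead byte of a multi-byte sequence encodes its length.

    // Bytes 0x00-0x7F encode themselves in UTF-8, so a pure-ASCII
    // string is already valid UTF-8, byte for byte. Non-ASCII code
    // points become multi-byte sequences, and the lead byte alone
    // determines the sequence length:
    int utf8SequenceLength(ubyte lead)
    {
        if (lead < 0x80)           return 1; // 0xxxxxxx: plain ASCII
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        if ((lead & 0xFC) == 0xF8) return 5; // 111110xx (original spec)
        if ((lead & 0xFE) == 0xFC) return 6; // 1111110x (original spec)
        return -1;                           // continuation or invalid byte
    }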

Just a thought.

	Jakob Kemi
March 11, 2002
"Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...

> As I understand from the docs, D is supposed to use wchars
> (2 to 4 bytes) for representing non-ASCII strings. I think it would be
> better to let all string functions only handle UTF-8 (which is fully
> backwards compatible with ASCII). UTF-8 is slowly becoming standard in
> UNIX. (just look at X and Gtk+ 2.0)

...while UNICODE is already a standard on at least Windows and BeOS
(these, I know for sure; Linux?). I'd prefer to have both char
and wchar flavor for each and every string-manipulation function,
probably overloaded (so you don't really see the difference).
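Overloaded char/wchar flavors of a function would look something like
this (illustrative signatures, not actual library code):

    // The same operation in char and wchar flavor; overload
    // resolution picks the right one, so calling code looks
    // identical either way.
    int find(char[] s, char c)
    {
        for (int i = 0; i < s.length; i++)
            if (s[i] == c)
                return i;
        return -1;
    }

    int find(wchar[] s, wchar c)
    {
        for (int i = 0; i < s.length; i++)
            if (s[i] == c)
                return i;
        return -1;
    }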

BTW, Walter, a question. String literals seem to be char[] by
default, I guess they are wchar[] if the program is written
in UNICODE, though? Also, are UNICODE literals allowed in ASCII
programs?


March 11, 2002
On Mon, 11 Mar 2002 21:48:45 +0100, Pavel Minayev wrote:

> "Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...
> 
>> As I understand from the docs, D is supposed to use wchars (2 to 4 bytes) for representing non-ASCII strings. I think it would be better to let all string functions only handle UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is slowly becoming standard in UNIX. (just look at X and Gtk+ 2.0)
> 
> ...while UNICODE is already a standard on at least Windows and BeOS (these, I know for sure; Linux?). I'd prefer to have both char and wchar flavor for each and every string-manipulation function, probably overloaded (so you don't really see the difference).
> 
> BTW, Walter, a question. String literals seem to be char[] by default, I guess they are wchar[] if the program is written in UNICODE, though? Also, are UNICODE literals allowed in ASCII programs?

Linux supports UTF-8 very well. You can use all your standard
programs with UTF-8 encoding (cat, less, etc.). The best part is
that if all string functions are written to handle UTF-8, they'll
also work for ordinary (legacy) ASCII strings. There's no need to
change the char type or anything else to deal with UTF-8. wchar
would still be useful for interacting with older C libs.
March 11, 2002

I forgot to add: UNICODE is a very loose notion, as it includes
UCS-2 (2-byte), UCS-4 (4-byte), and UTF-8 (variable-width), among
others.

My first reaction was that variable-width characters are kinda gross
and inelegant. However, one would waste memory (yeah, I know, it's
cheap) with 4-byte characters in order to be UNICODE compliant. Also,
the fact that UTF-8 works so well with the ASCII inheritance, and its
fast acceptance in the UNIX world, makes me feel all warm and fuzzy
about it despite its variable character size.
By sticking with UCS-2 and UCS-4 we'll still be in the present-day
situation: all internationalization will be clumsy add-ons to the
standard ASCII strings, and every program will have to decide which
routines to use, etc. (a hell when you're developing big applications
and/or exchanging data between different countries and different
systems). The best thing would of course be if the whole world,
including all legacy databases, every file ever written, and every
old and unsupported application, just magically and instantly were
rewritten or converted to UCS-4. But there is _no_ way that it's
ever going to happen. UTF-8 gives the best compromise, IMO.
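To put rough numbers on that trade-off (bytes per code point; the
example characters are illustrative, not from the post):

    Character         UTF-8   UCS-2   UCS-4
    'A'  (U+0041)       1       2       4
    'é'  (U+00E9)       2       2       4
    '中' (U+4E2D)       3       2       4

UCS-2 stays flat at two bytes, but it cannot represent code points
above U+FFFF at all.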

	Jakob Kemi
March 11, 2002
UTF-8 is fine for strings that are mostly ASCII with some UNICODE (source code, Western European languages). But if the string is entirely UNICODE (something in Chinese, for example), the UTF-8 encoding can consume MORE memory, since the UTF-8 transformation can be as many as six bytes long.

UTF-8 solves a lot of problems, but I'm not sure you want to wire it into the language as the only option.

   Dan

"Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...
> As I understand from the docs, D is supposed to use wchars
> (2 to 4 bytes) for representing non-ASCII strings. I think it would be
> better to let all string functions only handle UTF-8 (which is fully
> backwards compatible with ASCII). UTF-8 is slowly becoming standard in
> UNIX. (just look at X and Gtk+ 2.0)
>
> Just a thought.
>
> Jakob Kemi


March 11, 2002
On Mon, 11 Mar 2002 22:52:46 +0100, J. Daniel Smith wrote:

> UTF-8 is fine for strings that are mostly ASCII with some UNICODE (source code, Western European languages). But if the string is entirely UNICODE (something in Chinese, for example), the UTF-8 encoding can consume MORE memory, since the UTF-8 transformation can be as many as six bytes long.
True.
Globally, however, UTF-8 will save memory compared to UCS-4
(no, UCS-2 isn't enough), since 6-byte-wide characters are rare.
But I think the memory issue doesn't really matter, and it will
matter even less as prices fall. Also, if someone is storing
_huge_ amounts of text, they will just compress it and remove most
of the redundancy in the codeset.


> UTF-8 solves a lot of problems, but I'm not sure you want to wire it into the language as the only option.
It sure does solve lots of problems, and the best part is that you don't have to give up anything else. Just design all string functions to handle UTF-8 and you'll have the best of both worlds (ordinary ASCII char strings and UTF-8, that is).

If the need arises, you can still have special UCS-4 functions and
a ucs4_char type (or whatever).

>    Dan
	Jakob
March 11, 2002
"Pavel Minayev" <evilone@omen.ru> wrote in message news:a6j584$1kuq$1@digitaldaemon.com...
> "Jakob Kemi" <jakob.kemi@telia.com> wrote in message
> news:a6j4v5$1koq$1@digitaldaemon.com...
> BTW, Walter, a question. String literals seem to be char[] by
> default, I guess they are wchar[] if the program is written
> in UNICODE, though? Also, are UNICODE literals allowed in ASCII
> programs?

Actually, string literals are uncommitted by default. They then get converted to char[], wchar[], char, or wchar depending on the context.

You can insert unicode literals into strings with the \uUUUU syntax.
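In other words (an illustrative snippet of the conversions just
described):

    char[]  a = "hello";          // the literal commits to char[] here
    wchar[] b = "hello";          // ...and the same literal to wchar[] here
    char    c = "x";              // a one-character literal can commit to char
    wchar[] w = "price: \u20AC";  // \uUUUU embeds the euro sign, U+20AC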


March 11, 2002
Supporting UTF-8 would just be a matter of using char[] arrays!

"Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j8hj$1m6n$1@digitaldaemon.com...
> On Mon, 11 Mar 2002 21:48:45 +0100, Pavel Minayev wrote:
>
> > "Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...
> >
> >> As I understand from the docs, D is supposed to use wchars (2 to 4 bytes) for representing non-ASCII strings. I think it would be better to let all string functions only handle UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is slowly becoming standard in UNIX. (just look at X and Gtk+ 2.0)
> >
> > ...while UNICODE is already a standard on at least Windows and BeOS (these, I know for sure; Linux?). I'd prefer to have both char and wchar flavor for each and every string-manipulation function, probably overloaded (so you don't really see the difference).
> >
> > BTW, Walter, a question. String literals seem to be char[] by default, I guess they are wchar[] if the program is written in UNICODE, though? Also, are UNICODE literals allowed in ASCII programs?
>
> I forgot to add. UNICODE is a very loose notion as it includes
> UCS-2 (2 byte), UCS-4 (4 byte) and UTF-8 (variable width) among
> others.
>
> My first reaction was that variable width characters is kinda gross
> and inelegant.
> However, one would waste memory (yeah, I know, it's cheap) with 4
> byte characters in order to be UNICODE compliant. Also the fact
> that UTF-8 works so well with ASCII inheritance and it's fast
> acceptance in the UNIX world make me feel all warm and fuzzy about
> it, despite its variable character size.
> By sticking with UCS-2 and UCS-4 we'll still be in present day's
> situation, all internationalization will be clumpsy addons to
> the standard ASCII strings and every program will have to decide
> which routines to use etc. (a hell when you're developing big
> applications and/or exchanging data from different countries and
> different systems.) The best thing would offcourse be if the whole
> world including all legacy databases, every file ever written and
> every old and unsupported application just magically and instantly
> were rewritten or converted to UCS-4. But there is _no_ way that
> it's ever going to happen. UTF-8 gives the best compromise IMO.
>
> Jakob Kemi


March 11, 2002
"Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...
> As I understand from the docs, D is supposed to use wchars
> (2 to 4 bytes) for representing non-ASCII strings. I think it would be
> better to let all string functions only handle UTF-8 (which is fully
> backwards compatible with ASCII). UTF-8 is slowly becoming standard in
> UNIX. (just look at X and Gtk+ 2.0)

At one time I had written a lexer that handled UTF-8 source. It turned out to cause a lot of problems, because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented.

It turned out to be a lot of trouble :-( and I finally converted it to wchars.
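Concretely, the problem is that the n-th character of a UTF-8 string
is no longer s[n]; you have to walk from the start (an illustrative
sketch, not the actual lexer code):

    // Find the byte offset of the n-th character in a UTF-8 string.
    // Every lookup is O(n), which is exactly what breaks "simply
    // indexed by character position".
    size_t byteOffsetOfChar(char[] s, size_t n)
    {
        size_t i = 0;
        while (n > 0 && i < s.length)
        {
            ubyte b = cast(ubyte) s[i];
            if      (b < 0x80)           i += 1; // plain ASCII
            else if ((b & 0xE0) == 0xC0) i += 2;
            else if ((b & 0xF0) == 0xE0) i += 3;
            else if ((b & 0xF8) == 0xF0) i += 4;
            else                         i += 1; // skip invalid byte
            n--;
        }
        return i;
    }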


March 12, 2002
On Tue, 12 Mar 2002 01:00:49 +0100, Walter wrote:


> "Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...
>> As I understand from the docs, D is supposed to use wchars (2 to 4 bytes) for representing non-ASCII strings. I think it would be better to let all string functions only handle UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is slowly becoming standard in UNIX. (just look at X and Gtk+ 2.0)
> 
> At one time I had written a lexer that handled UTF-8 source. It turned out to cause a lot of problems, because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented.
> 
> It turned out to be a lot of trouble :-( and I finally converted it to wchars.

You already have this problem on Windows, with line breaks being two bytes. Just use custom iterators in your string class implementation, and if you need to set/get positions in streams, use tell and seek (you're not supposed to assume that 1 character == 1 byte anyway, according to the standards). There should be no real _need_ to index characters in strings with pointers.
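Such a custom iterator might look roughly like this (an illustrative
sketch; the decoding is simplified and does no validation):

    // Walks a UTF-8 buffer one code point at a time, so callers
    // never index by byte position.
    struct Utf8Iter
    {
        char[] s;   // the buffer being walked
        size_t i;   // current byte offset

        bool done() { return i >= s.length; }

        // Decode the code point at the current position and advance
        // past however many bytes it occupies.
        dchar next()
        {
            ubyte b = cast(ubyte) s[i];
            int len;
            uint c;
            if      (b < 0x80)           { len = 1; c = b;        }
            else if ((b & 0xE0) == 0xC0) { len = 2; c = b & 0x1F; }
            else if ((b & 0xF0) == 0xE0) { len = 3; c = b & 0x0F; }
            else                         { len = 4; c = b & 0x07; }
            for (int k = 1; k < len; k++)
                c = (c << 6) | (cast(ubyte) s[i + k] & 0x3F);
            i += len;
            return cast(dchar) c;
        }
    }

Stream positions then stay byte-based (tell and seek), while
character positions only ever advance through the iterator.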

	Jakob Kemi