strings in D (page 2) - D Programming Language Discussion Forum

February 19, 2005

Re: strings in D

Posted by Thomas Kühne
in reply to Andrew Fedoniouk

Andrew Fedoniouk wrote:
| I think that existing names of entities in D are misleading.
|
| 'char' in fact is not a character but element of UTF-8 sequence -
| ubyte.
| 'wchar' in fact is not a "wide" character but element of
| UTF-16 sequence - ushort. and only 'dchar' has meaning of character.

'dchar' is not a _character_; it represents a _codepoint_.

While codepoints are interesting in some cases, you are much more likely to
a) treat strings as void[]/byte[]/ubyte[] (most cases), or
b) be interested in graphemes (display/text editing)

http://www.unicode.org/faq/char_combmark.html
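
The three levels mentioned here (raw code units, code points, graphemes) can be counted separately. This is a minimal sketch, assuming a current D2 compiler and Phobos (both postdate this thread); the sample string is an arbitrary choice:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // "e" followed by U+0301 (combining acute accent):
    // one grapheme on screen, but two code points.
    string s = "e\u0301";

    writeln(s.length);                // 3: UTF-8 code units (bytes)
    writeln(s.walkLength);            // 2: code points
    writeln(s.byGrapheme.walkLength); // 1: graphemes
}
```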

Hint: search the digitalmars.D newsgroup archive before posting any more
about strings/*chars.

Thomas

February 20, 2005

Re: strings in D

Posted by Andrew Fedoniouk
in reply to Thomas Kühne

According to
http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
which is based on "digitalmars.D newsgroup archive" I believe,
D's 'char' and 'wchar' are not 'characters' as their names state but
rather "code units". Right?

And about "code point": in Unicode terms, a code point is a number between 0
and 0x10FFFF.
To represent these codes (or Unicode character indexes) from this range you
may use
either uint8 (for Latin-1 code points), uint16 (Basic Multilingual Plane
codes) or uint21 (full Unicode range).

Some code values in the 0 to 0x10FFFF range are illegal.
E.g. cast(dchar)0xD800 should raise an error in an ideal 'D' world.

If D wants to treat its strings in UTF8 or UTF16 form it should provide the methods recommended by the W3C: http://www.w3.org/TR/DOM-Level-2-Core/i18n.html

I think that ideally
D.char, D.wchar and D.dchar should be treated as code point value storage
types and not as code units.
This will give some meaning to these type names at least.

String literals should have type of 'utf8' like this: typedef ubyte[] utf8;

Intrinsic conversion routines like:
wchar[] str = "Привет мир"; // utf8  ("Hello World" in Russian)
should create str as a sequence of codepoints, substituting values that wchar
cannot hold with, let's say, 0xFFFF.

The same rule should apply to
char[] str = "Привет мир"; // utf8
(in this case str will contain ten 0xFF as these are not Latin-1 codes)

Andrew Fedoniouk.
http://terrainformatica.com

February 20, 2005

Re: strings in D

Posted by John Reimer
in reply to Charlie Patterson

On Sat, 19 Feb 2005 10:54:44 -0500, Charlie Patterson wrote:

> "John Reimer" <brk_6502@yahoo.com> wrote in message news:pan.2005.02.19.05.02.08.170345@yahoo.com...
>> On Fri, 18 Feb 2005 19:52:24 -0800, Andrew Fedoniouk wrote:
>>
>> > Is there any string class for the D?
>> > ...
> 
>> This question has been asked many times in the D groups.  If there ever were a "big three" in the D debates department, I think this one would rank as one of them.
> 
> The D newsgroup could probably use a FAQ.  I also don't know where the land mines are buried!

Yep, navigating this newsgroup can be quite a chore.

I'm not sure, but I thought the D wiki site has some references to these topics.  Justin Calvarese would probably know.

February 20, 2005

Re: strings in D

Posted by Anders F Björklund
in reply to Andrew Fedoniouk

Andrew Fedoniouk wrote:

> Is there any string class for the D?

There is no built-in (Phobos) class, as reasoned in:
http://www.digitalmars.com/d/cppstrings.html

However, there are at least two 3rd-party ones:
http://dool.sourceforge.net/dool_String_String.html
http://svn.dsource.org/svn/projects/mango/trunk/doc/html/classUString.html

I'm not sure having a default *class* in a hybrid
language is such a great idea in the first place ?
(then again, Exceptions are classes and default...)

> Or are there any plans to create string for D?

As a built-in value type ? No, that will not happen.

Although, there are three good alternatives already...
(the famous: str, wstr, dstr as I prefer to call them)

> char[], dchar[] and qchar[] cannot serve string purposes as they
> use utf encodings which are "transport" encodings and cannot be used
> in most cases as strings.

This is not true. All of UTF-8, UTF-16 and UTF-32 can
be used for storing an array of Unicode code points...

Just that some code points require more than just one
code unit, just as one "grapheme" might require more
than just one "code point" anyway when using Unicode.
http://www.unicode.org/faq/utf_bom.html#1

> String as an entity is a sequence of "code points" - ascii, ucs-2(basic multilang plane)
> and ucs-4 so operator[] always returns character in full (for the given supported plane).
> The same should apply to foreach().

You can "foreach dchar", over all three string types.
If you want to index by code point, you will need to
convert the two smaller code units to UTF-32 first...
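
Both points can be sketched as follows, assuming a current D2 compiler and Phobos (`std.utf.toUTF32` is the modern spelling; the sample word is arbitrary):

```d
import std.stdio : writeln;
import std.utf : toUTF32;

void main()
{
    string s = "na\u00EFve";       // U+00EF takes two UTF-8 code units
    size_t codePoints = 0;
    foreach (dchar c; s)           // foreach decodes UTF-8 on the fly
        codePoints++;
    writeln(s.length, " ", codePoints); // 6 5

    // Indexing by code point: convert to UTF-32 first,
    // so that every array element is a whole code point.
    dstring d = toUTF32(s);
    writeln(d[2] == '\u00EF');     // true
}
```

The same foreach works unchanged over a wstring or a dstring.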

--anders

February 20, 2005

Re: strings in D

Posted by Thomas Kühne
in reply to Andrew Fedoniouk

Andrew Fedoniouk wrote:

| According to http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
| which is based on "digitalmars.D newsgroup archive" I believe, D's
| 'char' and 'wchar' are not 'characters' as their names state but
| rather "code units". Right?
|
| And about "code point": in terms of UNICODE code point is a number
| between 0 and 0x10FFFF. To represent this codes (or unicode character
| indexes) from this range you may use either uint8 (for Latin-1 code
| points)
UTF-8 supports all code points;
depending on the value of the codepoint, 1 - 4 chars are required

| or uint16 (Basic Multilang Plane  codes)  or uint21 (full UNICODE
| range).
UTF-16 supports all code points;
depending on the value of the codepoint, 1 - 2 wchars are required
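
The 1 - 4 and 1 - 2 figures can be checked with `std.utf.codeLength`. This is a sketch assuming a current D2 Phobos; the four sample code points (ASCII, Latin-1, BMP, supplementary plane) are arbitrary picks:

```d
import std.stdio : writeln;
import std.utf : codeLength;

void main()
{
    dchar[] samples = ['A', '\u00E9', '\u20AC', '\U0001D11E'];
    foreach (dchar c; samples)
    {
        // chars needed in UTF-8, then wchars needed in UTF-16
        writeln(codeLength!char(c), " ", codeLength!wchar(c));
    }
    // prints:
    // 1 1
    // 2 1
    // 3 1
    // 4 2
}
```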

| Some code values from 0 and 0x10FFFF range are illeagal. E.g.
| cast(dchar)0xD800 should rise an error in ideal 'D' world.
The codepoint 0xD800 isn't illegal; it's unassigned and is very likely
to remain unassigned in all future Unicode versions.
The uint16 0xD800 on its own is illegal, as it is one half of a UTF-16
surrogate pair.

| If D wants to treat its strings in UTF8 or UTF16 form it should
| provide methods recommended by W3C:
| http://www.w3.org/TR/DOM-Level-2-Core/i18n.html
findOffset8/16/32 are very simple functions.

I'm sure that there is at least one project at dsource.org providing
this functionality.
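
To give an idea of how small such a function is, here is a sketch in the spirit of the offset helpers mentioned above, not the W3C interface itself. It assumes a current D2 Phobos and valid UTF-8 input; `offsetOfCodePoint` is a hypothetical name:

```d
import std.stdio : writeln;
import std.utf : stride;

// Hypothetical helper: code-unit offset of the n-th code point
// in a valid UTF-8 string (n counts from 0).
size_t offsetOfCodePoint(const(char)[] s, size_t n)
{
    size_t offset = 0;
    while (n > 0 && offset < s.length)
    {
        offset += stride(s, offset); // width of the sequence starting here
        n--;
    }
    return offset;
}

void main()
{
    string s = "a\u00E9\u20ACz";      // UTF-8 widths: 1, 2, 3, 1
    writeln(offsetOfCodePoint(s, 3)); // 6: where 'z' starts
}
```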

Thomas

February 20, 2005

Re: strings in D

Posted by Anders F Björklund
in reply to Andrew Fedoniouk

Andrew Fedoniouk wrote:

> According to
> http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
> which is based on "digitalmars.D newsgroup archive" I believe,
> D's 'char' and 'wchar' are not 'characters' as their names state but
> rather "code units". Right?

Right, if you want to get all technical about it at once. :-)

However, "char" is still a perfectly good *ASCII* character.
It's just that the-high-bit-set is now defined, unlike in C...

And "wchar" is also *usually* a character (BMP), just like "char"
was in Java for a number of years... (they're now using int instead:
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html,
which means that D wchar = Java char, D dchar = Java int nowadays)

So they are still "characters" ? Just that there are "exceptions"
(being the surrogate code units, referring to next unit in array)
And as long as you watch out for these, it's perfectly OK to use
them as good-old-fashioned characters (and it could be faster, too)

> And about "code point": in terms of UNICODE code point is a number between 0 and 0x10FFFF.
> To represent this codes (or unicode character indexes) from this range you may use
> either uint8 (for Latin-1 code points) or uint16 (Basic Multilang Plane codes)  or uint21 (full UNICODE range).
> 
> Some code values from 0 and 0x10FFFF range are illeagal.
> E.g. cast(dchar)0xD800 should rise an error in ideal 'D' world.

For reasons of efficiency, D does not check all values upon assignment.
You must instead call the Phobos helper function: std.utf.isValidDchar
Note that "char" only holds ASCII in D, wchar must be used for Latin-1.
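
A short sketch of that check, assuming a current D2 Phobos where `std.utf.isValidDchar` still exists under this name:

```d
import std.stdio : writeln;
import std.utf : isValidDchar;

void main()
{
    dchar lone = cast(dchar) 0xD800;  // the cast itself is not checked
    writeln(isValidDchar(lone));      // false: inside the surrogate range
    writeln(isValidDchar('\u00E9'));  // true
    writeln(isValidDchar(cast(dchar) 0x110000)); // false: beyond 0x10FFFF
}
```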

I suggested adding the new functions isAscii and isSurrogate too,
but it was ignored. (They're all copied and pasted at the moment)
http://www.digitalmars.com/d/archives/digitalmars/D/bugs/2154.html

> Intrinsic conversion routines like:
> wchar[] str = "Привет мир"; // utf8  ("Hello World" in Russian)
> should create str as sequence of codepoints with substitution of unsupported values for wchar with lets say 0xFFFF.

Substituting all surrogates with invalid characters will *lose data*.
That is clearly not good, and using UTF-8 sounds like a better idea ?
If you want single-codeunit strings, you can search/replace yourself.

In the example above, the string literal will be converted to UTF-16.
(as in: the actual literal data, it will also be '\0'-escaped for C)

> The same rule should apply to
> char[] str = "Привет мир"; // utf8
> (in this case str will contain ten 0xFF as these are not Latin-1 codes)

You can use ubyte[] for storing 8-bit encodings (such as Latin-1, etc.)

Using char[] will give "invalid UTF sequence" when encountering high bytes, although the first 0x100 characters are the same in both "sets",
that is ISO-8859-1 and UTF-8. But only the first 0x80 of them fit in a single "char".

Note that (char*) is still used for NUL-terminated 8-bit strings too!
This is mostly for making it much simpler to use external C functions,
which is the same reason why all D string literals are NUL-terminated.

--anders

February 20, 2005

Re: strings in D

Posted by Anders F Björklund
in reply to Andrew Fedoniouk

Andrew Fedoniouk wrote:

> Ok. Seems like I did not explain this clearly. Let's try again then from different point of view (this time more technical).
[...]
> What is the meaning of strlen() in utf16string case? 4 or 6? D thinks that utf16string is sequence of wchars. I wouldn't say so. These are not characters in common sense but just parts of the sequence of 16bit units. You cannot treat them as characters e.g. you cannot insert new wchar at position 3 of utf16string.

Length in D counts code units. Always. (but yes, an array insert
operation only gives useful results when there's no surrogates)

As has been said, counting codeunits is a lot faster than counting codepoints.
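
The difference shows up as soon as a string contains a code point outside the BMP. A minimal sketch, assuming a current D2 compiler and Phobos; the musical symbol U+1D11E is an arbitrary example:

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    wstring w = "\U0001D11E"w;  // one code point outside the BMP
    writeln(w.length);          // 2: .length counts UTF-16 code units
    writeln(w.walkLength);      // 1: decoding finds a single code point

    // The first unit is a high surrogate, so inserting a wchar
    // at index 1 would split the pair.
    writeln(w[0] >= 0xD800 && w[0] <= 0xDBFF); // true
}
```

Reading `.length` is a constant-time field access, while the decoding count has to walk the whole array.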

> Only dchar could be considered as a real UNICODE character (UCS-4).
> But modern computers are not ready yet for UCS-4. Too much memory needed.

dchar is quite alright for use in parameters and such, since the
registers are 32-bit wide anyway. For string storage, I agree...

UTF-32 wastes too much space, and UTF-16 or even UTF-8 is better.

> As soon as D has built-in conversion routines then list of character types
> should look like as:
> 
> char    - element of utf8 sequence. char[] - utf8 encoded unicode sequence.
> wchar - element of utf16 sequence. wchar[] - utf16 encoded unicode sequence.
> dchar  - ucs-4 character. full unicode character. dchar[] - ucs-4 string.
> char2  - ucs-2 (BMP) character. codes D800 - DBFF do not represent start of
>              UTF16 sequence - do not expand into ucs-4 by system.
> char2[] - ucs-2 string - sequence of characters.
>         Could be manipulated arbitrarily, e.g. characters (char2) could
>         be inserted or deleted at any given position.

As you've discovered, D "only" concerns itself with UTF code units...
(dchar is of the UTF-32 subset instead of the full ucs-4, but anyway)
This means that if you want to handle arrays of Latin-1 characters
or arrays of BMP characters, you can not use the "character" types.

However, you are free to use the ubyte and ushort types to represent
those types of strings (that are still Unicode, encoded differently)
But there is really not much use of introducing two new types just
to represent those two special cases of the more general UTF ones ?

For ASCII (only), char[] and ubyte[] with ISO-8859-1 would be the same.
Just as for non-surrogates (only), wchar[] and ushort[] are identical.
But the latter two types would be unable to handle higher code points.

Converting between the two is trivial, but there could be a loss of
data when going from char[] -> ubyte[], or from wchar[] -> ushort[]
(e.g. if replacing any surrogates with something like \xFF or \uFFFF)
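
The lossy direction can be sketched like this, assuming a current D2 compiler. `toLatin1Lossy` is a hypothetical helper, and '?' as the substitute is an arbitrary choice:

```d
import std.stdio : writeln;

// Hypothetical helper: UTF-8 to Latin-1 bytes,
// substituting '?' for anything above U+00FF.
ubyte[] toLatin1Lossy(const(char)[] s)
{
    ubyte[] result;
    foreach (dchar c; s) // decode one code point at a time
        result ~= c <= 0xFF ? cast(ubyte) c : cast(ubyte) '?';
    return result;
}

void main()
{
    // U+00E9 fits in Latin-1 and survives; U+20AC (euro sign) does not.
    writeln(toLatin1Lossy("caf\u00E9 \u20AC")); // [99, 97, 102, 233, 32, 63]
}
```

After the conversion nothing records which '?' bytes were substitutes, which is the data loss described above.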

And I think it's better to go with the lossless format, than to
support the rare operation of indexing individual codepoints...
(and in case you need to do this often, there's still dchar[])

> Let me highlight again:
> 
> /////
> /////   elements of utf sequence *are not* characters.
> /////
> 
> So such functions as strchr(string,char) must be declared either as
> 
> int strchr(char1[], char1 c) // latin-1 string
> --or--
> int strchr(char2[], char2 c) // ucs-2 string and char
> --or--
> int strchr(char4[], char4 c) // ucs-4 string or 'dchar'
> 
> This message has one sole reason: to make D close to perfect.

int strchr(char[], dchar c) would also work...
(would return the *start* of 1-4 code units)
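
In current Phobos, `std.string.indexOf` has roughly this shape: it searches a UTF-8 string for a dchar and reports a code-unit offset. A sketch, assuming a current D2 compiler and Phobos, with an arbitrary sample string:

```d
import std.stdio : writeln;
import std.string : indexOf;

void main()
{
    // Two-byte U+00E9, 't', two-byte U+00E9, space, three-byte U+20AC.
    string s = "\u00E9t\u00E9 \u20AC";
    writeln(s.indexOf('\u20AC')); // 6: offset where the 3-unit sequence starts
    writeln(s.indexOf('x'));      // -1: not found
}
```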

--anders

February 20, 2005

Re: strings in D

Posted by Thomas Kühne
in reply to Anders F Björklund

Anders F Björklund wrote:

| For reasons of efficiency, D does not check
| all values upon assignment. You must instead call the Phobos helper
| function: std.utf.isValidDchar Note that "char" only holds ASCII in
| D, wchar must be used for Latin-1.

clarification

char:
can only hold 0x00 -> 0x7F, otherwise it's an illegal UTF-8 fragment

char[]/char*
can hold any Unicode codepoint/codepoint sequence
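
A small sketch of the same point, assuming a current D2 Phobos: `std.utf.encode` writes one code point into a char buffer and returns how many chars it used.

```d
import std.stdio : writeln;
import std.utf : encode;

void main()
{
    char[4] buf;
    size_t n = encode(buf, '\u00E9'); // U+00E9 needs two UTF-8 code units
    writeln(n);                        // 2
    writeln(buf[0] > 0x7F);            // true: a lead byte, meaningless on its own
}
```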

Thomas

February 20, 2005

Re: strings in D (FAQ)

Posted by Anders F Björklund
in reply to John Reimer

John Reimer wrote:

>>The D newsgroup could probably use a FAQ.  I also don't know where the land mines are
>>buried!
> 
> Yep, Navigating this newsgroup can be quite a chore.
> 
> I'm not sure, but I thought the D wiki site has some references to these
> topics.  Justin Calvarese would probably know.

There are several FAQs.
http://www.digitalmars.com/d/faq.html (Official FAQ)
http://int19h.tamb.ru/faq.html (Unofficial FAQ)
http://www.prowiki.org/wiki4d/wiki.cgi?FaqRoadmap


But you might be looking for simple things like:
http://www.prowiki.org/wiki4d/wiki.cgi?ShortFrequentAnswers

> Strings are not null-terminated but hold explicit length information.
> Therefore you need to use %.*s not %s in printf, or just use writef!

> Comparing an object reference like: "if (object == null)" will crash.
> You must use "if (object is null)"

> Checking for a key in an AA like: "if(array[key])" will create it
> if it's missing. You must use "if(key in array)"
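
The three short answers above, sketched for a current D2 compiler (an assumption: details have changed since this thread, e.g. newer compilers reject a comparison against null with == outright, and indexing a missing key reports an error instead of creating it):

```d
import core.stdc.stdio : printf;
import std.stdio : writeln;

void main()
{
    // A slice carries its length and has no terminating NUL of its own.
    string s = "hello world"[0 .. 5];
    printf("%.*s\n", cast(int) s.length, s.ptr); // prints "hello"

    Object o = null;
    writeln(o is null);               // true: identity test, no dereference

    int[string] counts = ["a": 1];
    writeln(("b" in counts) is null); // true: the key is absent
    writeln(counts.length);           // 1: the lookup did not insert "b"
}
```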


Or just a quick summary, like I posted earlier:
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/12609

> Q: What's the default boolean type in D ?
> A: bit.
> (bool is an "alias")
> 
> Q: Is that really type-safe ?
> A: No.
> (just as in C99/C++)
> 
> Q: What's the default string type in D ?
> A: char[].
> (since main() uses it)
> 
> Q: Is that a single class ?
> A: No.
> (it's a primitive type)
> 
> Q: Was this done by accident or by choice ?
> A: choice.
> (by Walter Bright)
>
> Q: Will this change before D version 1.0 ?
> A: No.
> (at least unlikely)


At least the String Wars and the Boolean Wars are *over*...
And it was char[]/wchar[]/dchar[] and bit/wbit/dbit that won.

--anders


Copyright © 1999-2021 by the D Language Foundation