July 31, 2006
Serg Kovrov wrote:
> * Oskar Linde:
>> Having char[].length return something other than the actual number
>> of char-units would break its array semantics.
> 
> Yes, I see. That's why I do not much like char[] as a substitute for a string
> type.
> 
>> It is actually not very often that you need to count the number
>> of characters as opposed to the number of (UTF-8) code units.
> 
> Why not use separate properties for that?
> 
>> Counting the number of characters is also a rather expensive
>> operation. 
> 
> Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.

The question is, how often do you need it? Especially if you are not indexing by character.

>> All the ordinary operations (searching, slicing, concatenation, sub-string  search, etc) operate on code units rather than
>> characters.
> 
> Yes, that's a tough one. If you want to slice an array - use the array's unit count for that. But if you want to slice a *string* (substring, search, etc.) - use the character count for that.

Why? Code unit indices will work equally well for substrings, searching etc.

> Maybe there should be interchangeable types - string and char[] - for different length, slice, find, etc. behaviors? I mean it could be the same actual type, but with different contexts for the properties.

Indexing a UTF-8 encoded string by character rather than by code unit is expensive in either time or memory. If for some reason you need character indexing, use a dchar[].
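
For example, a minimal sketch (the helper name is mine), transcoding once with Phobos' std.utf.toUTF32 and indexing the resulting dchar[]:

import std.utf;

// Index by character (code point) by transcoding to UTF-32 once.
dchar nthChar(char[] s, size_t i)
{
    dchar[] d = std.utf.toUTF32(s); // one allocation, O(length) decode
    return d[i];
}

// e.g. nthChar("тест", 1) == 'е'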

> And besides, string, as opposed to char[], is more pleasant to my eyes =)

There is always alias.
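
For instance:

alias char[] string;    // the same array type under a friendlier name

string s = "hello";     // s is still a plain char[]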
July 31, 2006
* Oskar Linde:
> Serg Kovrov wrote:
>> * Oskar Linde:
>>> Having char[].length return something other than the actual number
>>> of char-units would break its array semantics.
>>
>> Yes, I see. That's why I do not much like char[] as a substitute for a string
>> type.
>>
>>> It is actually not very often that you need to count the number
>>> of characters as opposed to the number of (UTF-8) code units.
>>
>> Why not use separate properties for that?
>>
>>> Counting the number of characters is also a rather expensive
>>> operation. 
>>
>> Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.
> 
> The question is, how often do you need it? Especially if you are not indexing by character.
> 
>>> All the ordinary operations (searching, slicing, concatenation, sub-string  search, etc) operate on code units rather than
>>> characters.
>>
>> Yes, that's a tough one. If you want to slice an array - use the array's unit count for that. But if you want to slice a *string* (substring, search, etc.) - use the character count for that.
> 
> Why? Code unit indices will work equally well for substrings, searching etc.
> 
>> Maybe there should be interchangeable types - string and char[] - for different length, slice, find, etc. behaviors? I mean it could be the same actual type, but with different contexts for the properties.
> 
> Indexing a UTF-8 encoded string by character rather than by code unit is expensive in either time or memory. If for some reason you need character indexing, use a dchar[].
> 
>> And besides, string, as opposed to char[], is more pleasant to my eyes =)
> 
> There is always alias.

You've got some valid points; I just presented mine.
July 31, 2006
Serg Kovrov wrote:
> * Frits van Bommel:
>> Serg Kovrov wrote:
>>> * Oskar Linde:
>>>> Counting the number of characters is also a rather expensive
>>>> operation. 
>>>
>>> Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.
>>
>> Store where? You can't put it in the array data itself without breaking slicing, and putting it in the reference introduces problems with it getting out of date if the array is modified through another reference (without enforcing COW, that is).
> 
> I have to say that I do not have an idea where to store it, nor where the current length property is stored. I'm really glad that the compiler does it for me.
> 
> As a language user I just want to be confident that the compiler does it wisely, and focus on my domain problems.

The length is stored in the reference, but the character count would depend not only on the memory location and size (which the reference holds) but also on the data itself (at least for char and wchar), which may be accessed through different references as well. That's the problem I was pointing out.
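
A small sketch of that aliasing problem (D1-style; the names are mine):

void main()
{
    char[] a = "тест".dup;  // 8 code units, 4 characters
    char[] b = a[0 .. 2];   // a slice: shares a's memory, holds only ptr + length
    b[] = "ab";             // overwrite the two code units of 'т' in place
    assert(a == "abест");   // still 8 code units, but now 5 characters -
                            // a count cached next to a's reference would be stale
}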
July 31, 2006
Oskar Linde wrote:
> It is easy to implement your own character count though:
> 
> size_t count(char[] arr) {
>     size_t n = 0;
>     foreach (dchar c; arr)  // decodes UTF-8: one iteration per code point
>         n++;
>     return n;
> }
> 
> assert("ั‚ะตัั‚".count() == 4);

std.utf.toUCSindex(s, s.length) will also give the character count.
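
For example, a quick sketch:

import std.utf;

unittest
{
    char[] s = "тест";
    assert(s.length == 8);                        // UTF-8 code units
    assert(std.utf.toUCSindex(s, s.length) == 4); // code points
}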
July 31, 2006
Oskar Linde wrote on 2006-07-31:
> Serg Kovrov wrote:

>> For example,
>> char[] str = "тест";
>> the word "test" in Russian - 4 Cyrillic characters - would give you
>> str.length == 8, which makes this length property of no use if you are not sure
>> the string contains Latin characters only.
>
> It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units. Counting the number of characters is also a rather expensive operation. All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters.
>
> It is easy to implement your own character count though:
>
> size_t count(char[] arr) {
>     size_t n = 0;
>     foreach (dchar c; arr)  // decodes UTF-8: one iteration per code point
>         n++;
>     return n;
> }
>
> assert("????".count() == 4);
>
> Also note that:
>
> assert("????"d.length == 4);

I hate to be pedantic, but dchar[] can only be used to count the code points - not the characters. A "character" can be composed of more than one code point/dchar. This feature is frequently used for accents, marks and some Asian scripts.
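
A small illustration (using the combining acute accent, U+0301):

assert("e\u0301"d.length == 2); // base letter + combining accent: two code points
assert("\u00E9"d.length == 1);  // the precomposed form is a single code point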

-> http://www.unicode.org

Thomas


July 31, 2006
Derek, thanks for summarizing all this, but I will put it as follows.

There are two types of text encodings for two distinct use cases:
  1) transport/storage encodings - one Unicode code point is
      represented by multiple code units of the encoded sequence (e.g. UTF).
      string.length returns the length in code units of the encoding -
      not in characters.

  2) manipulation encodings - one Unicode code point is represented
      by one and only one element of the sequence (e.g. one byte, word
      or dword). string.length here returns the length in code points
      (mapped character glyphs).
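
For illustration, the same Russian word under D's three array types:

assert("тест".length  == 8); // char[]:  UTF-8 code units  (case 1)
assert("тест"w.length == 4); // wchar[]: UTF-16 code units (4 here, but surrogates exist)
assert("тест"d.length == 4); // dchar[]: one element per code point (case 2)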

The problem as I see it is this:
D proposes to use a transport encoding for manipulation purposes,
which is the main problem here, IMO - transport encodings are not
designed for manipulation - it is extremely difficult to use
them for manipulation in practice, as we may see.

One more problem:

Encodings like UTF-8 and UTF-16 are almost useless
with, let's say, the Windows API - say the TextOutA and TextOutW functions.
Neither of them will accept D's char[] or wchar[] directly.

- ***A functions in Windows take a byte string (LPSTR) and the current
  codepage id to render text. ( byte + codepage = Unicode code point )

- ***W functions in Windows use LPWSTR things, which are
  sequences of code points from the Unicode Basic Multilingual Plane (BMP).
  ( cast(dword) word = Unicode code point )
  Only a few functions in the Windows API treat LPWSTR as UTF-16.

-----------------
"D strings are utf encoded sequences only" is a design mistake, IMO.
On disk (serialized form) - yes. But not in memory for manipulation please.

Andrew Fedoniouk.
http://terrainformatica.com



"Derek" <derek@psyc.ward> wrote in message news:177u058vq8cdj.koexsq99n112.dlg@40tude.net...
> On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:
>
>
>> ... but this is far from concept of null codepoint in character encodings.
>
> Andrew and others,
> I've read through these posts a few times now, trying to understand the
> various points of view being presented. I keep getting the feeling that
> some people are deliberately trying *not* to understand what other people
> are saying. This is a sad situation.
>
> Andrew seems to be stating ...
> (a) char[] arrays should be allowed to hold encodings other than UTF-8,
> and
> thus initializing them with hex-FF byte values is not useful.
> (b) UTF-8 encoding is not an efficient encoding for text analysis.
> (c) UTF encodings are not optimized for data transmission (they contain
> redundant data in many contexts).
> (d) The D type called 'char' may not have been the best name to use if it
> is meant to be used to contain only UTF-8 octets.
>
> I, and many others including Walter, would probably agree to (b), (c) and
> (d). However, considering (b) and (c), UTF has benefits that outweigh
> these
> issues and there are ways to compensate for these too. Point (d) is a
> casualty of history and to change the language now to rename 'char' to
> anything else would be counterproductive. But feel free to implement
> your own flavour of D.<g>
>
> Back to point (a)... The fact is, char[] is designed to hold UTF-8
> encodings so don't try to force anything else into such arrays. If you
> wish
> to use some other encodings, then use a more appropriate data structure
> for
> it. For example, to hold 'KOI-8' encodings of Russian text, I would
> recommend using ubyte[] instead. To transform char[] to any other encoding
> you will have to provide the functions to do that, as I don't think it is
> Walter's or D's responsibility to do it. The point of initializing UTF-8
> strings with illegal values is to help detect coding or logical mistakes.
> And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode
> codepoint *is* illegal. If you must store an octet of hex-FF then use
> ubyte[] arrays to do it.
>
> -- 
> Derek Parnell
> Melbourne, Australia
> "Down with mediocrity!"


July 31, 2006
"Thomas Kuehne" <thomas-dloop@kuehne.cn> wrote in message news:ls52q3-3o8.ln1@birke.kuehne.cn...
>
> Oskar Linde wrote on 2006-07-31:
>> Serg Kovrov wrote:
>
>>> For example,
>>> char[] str = "тест";
>>> the word "test" in Russian - 4 Cyrillic characters - would give you
>>> str.length == 8, which makes this length property of no use if you are not sure
>>> the string contains Latin characters only.
>>
>> It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units. Counting the number of characters is also a rather expensive operation. All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters.
>>
>> It is easy to implement your own character count though:
>>
>> size_t count(char[] arr) {
>>     size_t n = 0;
>>     foreach (dchar c; arr)  // decodes UTF-8: one iteration per code point
>>         n++;
>>     return n;
>> }
>>
>> assert("????".count() == 4);
>>
>> Also note that:
>>
>> assert("????"d.length == 4);
>
> I hate to be pedantic, but dchar[] can only be used to count the code points - not the characters. A "character" can be composed of more than one code point/dchar. This feature is frequently used for accents, marks and some Asian scripts.
>
> -> http://www.unicode.org
>


Right, Thomas,

An umlaut can exist as a separate code point,
so A-with-umlaut can be represented by two code points.
But as far as I remember, the intention was and is
to also have in Unicode all the full (precomposed) forms, like "A-with-umlaut".
So you can always "compress" multi-code-point forms into
their single-code-point counterparts.

This way "тест"d.length == 4 will be true -
it just depends on your text parser.

Andrew.



> Thomas
>
>


July 31, 2006
Andrew Fedoniouk wrote:
> The problem as I see it is this:
> D proposes to use a transport encoding for manipulation purposes,
> which is the main problem here, IMO - transport encodings are not
> designed for manipulation - it is extremely difficult to use
> them for manipulation in practice, as we may see.

I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.
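
For example, foreach with a dchar loop variable decodes the UTF-8 on the fly:

import std.stdio;

// One iteration per Unicode code point, even though s is stored as UTF-8.
void printCodePoints(char[] s)
{
    foreach (dchar c; s)
        writefln("U+%04X", cast(uint) c);
}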

It's also certainly easier than codepage based multibyte designs like shift-JIS (I used to write code for shift-JIS).

> Encodings like UTF-8 and UTF-16 are almost useless
> with, let's say, the Windows API - say the TextOutA and TextOutW functions.
> Neither of them will accept D's char[] or wchar[] directly.
> 
> - ***A functions in Windows take a byte string (LPSTR) and the current
>   codepage id to render text. ( byte + codepage = Unicode code point )

Win9x only supports the A functions, and Phobos does a translation of the output into the Win9x code page when running on Win9x. Of course, this fails when one has characters not supported by Win9x, but code pages aren't going to help that either.

Win9x is obsolete anyway, and there's no reason to cripple a new language by accommodating the failures of an obsolete system.

When running on NT or later Windows, the W functions are used instead which work directly with UTF-16. Later Windows also support UTF-8 with the A functions.

> - ***W functions in Windows use LPWSTR things, which are
>   sequences of code points from the Unicode Basic Multilingual Plane (BMP).
>   ( cast(dword) word = Unicode code point )
>   Only a few functions in the Windows API treat LPWSTR as UTF-16.

BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages.
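
For illustration, a code point outside the BMP:

assert("\U0001D11E"w.length == 2); // UTF-16: one surrogate pair = two wchar code units
assert("\U0001D11E"d.length == 1); // UTF-32: a single dchar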

So, the W functions can and do take UTF-16 directly, and in fact the Phobos implementation does use the W functions, transmitting wchar[] to them, and it works fine.
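
A sketch of that path (assuming Phobos' std.utf.toUTF16z and the MessageBoxW/MB_OK declarations in std.c.windows.windows; the wrapper name is mine):

import std.utf;
import std.c.windows.windows;

// char[] (UTF-8) is transcoded and zero-terminated once; a wchar[] is
// already UTF-16 and would only need the terminating zero.
void showMessage(char[] msg)
{
    MessageBoxW(null, std.utf.toUTF16z(msg), std.utf.toUTF16z("D"), MB_OK);
}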

The neat thing about Phobos is it adapts to whether you are using Win9x, full 32 bit Windows, or Linux, and adjusts the char output accordingly so it "just works."

> -----------------
> "D strings are utf encoded sequences only" is a design mistake, IMO.
> On disk (serialized form) - yes. But not in memory for manipulation please.

There isn't any better method of handling international character sets in a portable way. Code pages have serious, crippling, unfixable problems - including all the downsides of multibyte systems (because the Asian code pages are multibyte).
August 01, 2006
"Walter Bright" <newshound@digitalmars.com> wrote in message news:eam1ec$10e1$1@digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> The problem as I see it is this:
>> D proposes to use a transport encoding for manipulation purposes,
>> which is the main problem here, IMO - transport encodings are not
>> designed for manipulation - it is extremely difficult to use
>> them for manipulation in practice, as we may see.
>
> I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.

Sorry, but strings in DMDScript are quite different, in terms of:
0) there is no such thing as char in JavaScript.
1) strings are Strings - not vectors of octets - js::string[] and d::char[]
are different things.
2) they are not supposed to be used by any OS API.
3) there are 12 or so methods of the String class in JS - a limited perimeter -
so what model you've chosen to store them in is irrelevant -
in some implementations they are even represented by a list of fixed runs.

>
> It's also certainly easier than codepage based multibyte designs like shift-JIS (I used to write code for shift-JIS).
>
>> Encodings like UTF-8 and UTF-16 are almost useless
>> with, let's say, the Windows API - say the TextOutA and TextOutW functions.
>> Neither of them will accept D's char[] or wchar[] directly.
>>
>> - ***A functions in Windows take a byte string (LPSTR) and the current
>>   codepage id to render text. ( byte + codepage = Unicode code point )
>
> Win9x only supports the A functions,

You are not right here.

TextOutA and TextOutW are both supported by Win98.
And the intention in Harmonia was to use only those ***W
functions which come out of the box on Win98 (without the need for MSLU).

> and Phobos does a translation of the output into the Win9x code page when running on Win9x. Of course, this fails when one has characters not supported by Win9x, but code pages aren't going to help that either.
>
> Win9x is obsolete anyway, and there's no reason to cripple a new language by accommodating the failures of an obsolete system.

There is a huge market of embedded devices.
If you think that computer evolution expands only in the more-RAM-and-speed
direction, then you are in trouble.

http://www.litepc.com/graphics/eossystem.jpg


>
> When running on NT or later Windows, the W functions are used instead which work directly with UTF-16. Later Windows also support UTF-8 with the A functions.

http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx


>
>> - ***W functions in Windows use LPWSTR things, which are
>>   sequences of code points from the Unicode Basic Multilingual Plane (BMP).
>>   ( cast(dword) word = Unicode code point )
>>   Only a few functions in the Windows API treat LPWSTR as UTF-16.
>
> BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages.

Sorry, this scares me: "BMP is a proper subset of UTF-16".
UTF-16 is a group name for *byte stream encodings*
(UTF-16LE and UTF-16BE) of the Unicode code set.

BTW: which one of these UTFs does D use? Platform dependent, I believe.


>
> So, the W functions can and do take UTF-16 directly, and in fact the Phobos implementation does use the W functions, transmitting wchar[] to them, and it works fine.
>
> The neat thing about Phobos is it adapts to whether you are using Win9x, full 32 bit Windows, or Linux, and adjusts the char output accordingly so it "just works."
>

It should work well. Efficiently, I mean.
The language shall be agnostic to the meaning of char as much as possible.
It shall not prevent you from writing efficient algorithms.

>> -----------------
>> "D strings are utf encoded sequences only" is a design mistake, IMO. On disk (serialized form) - yes. But not in memory for manipulation please.
>
> There isn't any better method of handling international character sets in a portable way. Code pages have serious, crippling, unfixable problems - including all the downsides of multibyte systems (because the Asian code pages are multibyte).

We are speaking in different languages:

A: "strings are utf encoded sequences only" is a design mistake. W: "use any encoding other that utf" is a design mistake.

Different meaning, eh?

Forget about codepages.
Let those who are aware of them deal with them efficiently.
A "codepage" (c) Walter (e.g. ASCII) is an efficient way of
representing text. That is it.

Others who can afford the full set will work with full 21-bit values.
Practically it is enough to have 16 bits (the BMP), but...

Andrew Fedoniouk.
http://terrainformatica.com

August 01, 2006
On Mon, 31 Jul 2006 18:23:19 -0700, Andrew Fedoniouk wrote:

> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eam1ec$10e1$1@digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> The problem as I see it is this:
>>> D proposes to use a transport encoding for manipulation purposes,
>>> which is the main problem here, IMO - transport encodings are not
>>> designed for manipulation - it is extremely difficult to use
>>> them for manipulation in practice, as we may see.
>>
>> I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.
> 
> Sorry, but strings in DMDScript are quite different, in terms of:
> 0) there is no such thing as char in JavaScript.
> 1) strings are Strings - not vectors of octets - js::string[] and d::char[]
> are different things.
> 2) they are not supposed to be used by any OS API.
> 3) there are 12 or so methods of the String class in JS - a limited perimeter -
> so what model you've chosen to store them in is irrelevant -
> in some implementations they are even represented by a list of fixed runs.

For what it's worth, to do *character* manipulation I convert strings to UTF-32, do my stuff, and convert back to the initial format.

import std.utf;

char[] somefunc(char[] x)
{
   return std.utf.toUTF8( somefunc( std.utf.toUTF32(x) ) );
}

wchar[] somefunc(wchar[] x)
{
   return std.utf.toUTF16( somefunc( std.utf.toUTF32(x) ) );
}

dchar[] somefunc(dchar[] x)
{
   dchar[] result;
   ...
   return result;
}

This seems to work fast enough for my purposes. DBuild (nee Build) uses
this a lot.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
1/08/2006 11:45:36 AM