September 30, 2006 Re: First Impressions | ||||
---|---|---|---|---|
Posted in reply to Derek Parnell | Derek Parnell wrote:
> On Sat, 30 Sep 2006 03:03:02 +0300, Georg Wrede wrote:
>
>>As long as you don't start tampering with the individual octets in strings, you should be just fine. Don't think about UTF and you'll prosper.
>
> The Build program does lots of 'tampering'. I had to rewrite many standard
> routines and create some new ones to deal with unicode characters because
> the standard ones just don't work.
Do you still remember which they were?
> And Build still fails to do some things
> correctly (e.g. case insensitive compares) but that's on the TODO list.
Yes, case insensitive compares are difficult if you want to cater for non-ASCII strings. While it may not be unreasonably difficult to get American, European and Russian strings right, there will always be languages and character sets where even the Unicode guys aren't sure what is right. Unfortunately.
|
September 30, 2006 Re: First Impressions | ||||
---|---|---|---|---|
Posted in reply to Chad J | Chad J wrote:
> But this is what I'm talking about... you can't slice them or index them. I might actually index a character out of an array from time to time. If I don't know about UTF, and I do just keep on coding, and I do something like this:
>
> char[] str = "some string in nonenglish text";
> for ( int i = 0; i < str.length; i++ )
> {
>     str[i] = doSomething( str[i] );
> }
>
> and this will fail right?
>
> If it does fail, then everything is not alright. You do have to worry about UTF. Someone has to tell you to use a foreach there.
Yes, you do have to be aware of it being UTF, just like in C you have to be aware that strings are 0 terminated. But once aware of it, there is plenty of support for it in the core language and in std.utf.
You can also simply use dchar[], which has a one to one mapping between characters and indices, if you prefer.
Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode.
You can also wrap char[] inside a class that provides a view of the data as if it were dchar's. But I don't think the performance of such a class would be competitive. Interestingly, it turns out that most string operations do not need to be concerned with the number of char's in a character (like "find this substring"), and forcing them to care just makes for inefficiency.
|
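A minimal sketch of the foreach approach described in the post above, assuming a current D2 compiler (the thread itself predates D2). The names `unaccent` and `mapChars` are hypothetical and exist only for illustration. Declaring the loop variable as `dchar` makes foreach decode one code point per iteration, so the transform never sees half of a multibyte sequence.

```d
import std.utf : count, toUTF8;

/// Hypothetical per-character transform: replaces U+00E9 (e with acute) with plain 'e'.
dchar unaccent(dchar c)
{
    return c == '\u00E9' ? 'e' : c;
}

/// Hypothetical helper: applies `transform` to each code point of UTF-8 text.
string mapChars(string text, dchar function(dchar) transform)
{
    dchar[] result;
    result.reserve(text.length);
    // A dchar loop variable makes foreach decode one code point per iteration.
    foreach (dchar c; text)
        result ~= transform(c);
    return toUTF8(result);
}

void main()
{
    string s = "caf\u00E9";
    assert(s.length == 5);   // five UTF-8 code units ...
    assert(count(s) == 4);   // ... but only four code points
    assert(mapChars(s, &unaccent) == "cafe");
}
```

Indexing `s[i]` in a plain for loop would instead hand the transform individual code units, which is the failure mode asked about in the quoted question.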
September 30, 2006 Re: First Impressions | ||||
---|---|---|---|---|
Posted in reply to Walter Bright | Walter Bright wrote:
> Derek Parnell wrote:
>
>> And is it there yet? I mean, given that a string is just a lump of text, is
>> there any text processing operation that cannot be simply done to a char[]
>> item? I can't think of any but maybe somebody else can.
>
>
> I believe it's there. I don't think std::string or java.lang.String have anything over it.
>
>> And if a char[] is just as capable as a std::string, then why not have an
>> official alias in Phobos? Will 'alias char[] string' cause anyone any
>> problems?
>
>
> I don't think it'll cause problems, it just seems pointless.
The reason *I* want it is that _alias_ does not respect the private: visibility modifier.
So when I pull out an old piece of code which says
alias char[] string;
and import it in my newer module, I get conflicts when I compile.
Then I must do this silly hack where I include the newer file from the old or vice versa.
If you don't add this to Phobos, at least adopt a method to discriminate between more than one alias with the same name to resolve the issue.
-DavidM
|
September 30, 2006 Re: First Impressions | ||||
---|---|---|---|---|
Posted in reply to Georg Wrede | Georg Wrede wrote:
> Geoff Carlton wrote:
>> But even that mostly works, you can iterate through, looking for ASCII sequences, chop out ASCII and string together more stuff, it all works because you can just ignore the higher order bytes. Pretty much the only thing that fails is if you said "I don't know what's in the string, but chop it off at index 12".
>
> Yes.
How should we chop strings on character boundaries?
I have a text rendering function that uses freetype and want to restrict the width of the rendered string by truncating it (I have to use some sort of search here, binary or linear). Right now I use dchar, but if char is sufficient it would save me conversions all over the place.
|
September 30, 2006 Re: First Impressions | ||||
---|---|---|---|---|
Posted in reply to Derek Parnell | Derek Parnell wrote:
> I'm pretty sure that the phobos routines for search and replace only work
> for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
> always fail to deliver the correct result. It finds the first occurrence of
> the byte value for the letter 'a' which may well be inside a Japanese
> character.
That cannot happen, because multibyte sequences *always* have the high bit set, and 'a' does not. That's one of the things that sets UTF-8 apart from other multibyte formats. You might be thinking of the older Shift-JIS multibyte encoding, which did suffer from such problems.
> It looks for byte-subsets rather than character sub-sets.
I don't think it's broken, but if it is, those are bugs, not fundamental problems with char[], and should be filed in bugzilla.
> It may very well be pointless for your way of thinking, but your language
> is also for people who may not necessarily think in the same manner as
> yourself. I, for example, think there is a point to having my code read
> like it's dealing with strings rather than arrays of characters. I suspect
> I'm not alone. We could all write the alias in all our code, but you could
> also be helpful and do it for us - like you did with bit/bool.
I'm concerned about just adding more names that don't add real value. As I wrote in a private email discussion about C++ typedefs, they should only be used when:
1) they provide an abstraction against the presumption that the underlying type may change
2) they provide a self-documentation purpose
(1) certainly doesn't apply to string. (2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[]. So I don't think string fits (2).
And lastly, there's the inevitable confusion. People learning the language will see char[] and string, and wonder which should be used when. I can't think of any consistent understandable rule for that. So it just winds up being wishy-washy.
Adding more names into the global space (which is what names in object.d are) should be done extremely conservatively.
If someone wants to use the string alias as their personal or company style, I have no issue with that, as other people *do* think differently than me (which is abundantly clear here!).
|
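The high-bit property mentioned in the post above can be checked directly. This is only a sketch, assuming a current D2 compiler and `std.string.indexOf` (the present-day counterpart of the `std.string.find` discussed in the thread); `allHighBitSet` is a hypothetical helper name.

```d
import std.string : indexOf;

/// Hypothetical check: true if every UTF-8 code unit in `text` has its high bit set,
/// which holds for any text made up solely of non-ASCII characters.
bool allHighBitSet(string text)
{
    // A char loop variable iterates raw code units without decoding.
    foreach (char unit; text)
        if ((unit & 0x80) == 0)
            return false;
    return true;
}

void main()
{
    string japanese = "日本語";
    assert(japanese.length == 9);            // three characters, three code units each
    assert(allHighBitSet(japanese));         // no code unit can be mistaken for ASCII
    assert(japanese.indexOf('a') == -1);     // so a search for 'a' finds nothing

    string mixed = "日本語a";
    assert(mixed.indexOf('a') == 9);         // the real 'a' sits at code unit index 9
}
```

Note that the returned index counts code units, not characters, so it is suitable for slicing but not for "the Nth character" arithmetic.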
September 30, 2006 Re: First Impressions | ||||
---|---|---|---|---|
Posted in reply to Johan Granberg | Johan Granberg wrote:
> How should we chop strings on character boundaries?
std.utf.toUTFindex() should do the trick.
|
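A rough sketch of how `std.utf.toUTFindex` can be used for this, assuming a current D2 compiler; `truncateChars` is a hypothetical helper name. It converts a character (code point) index into a code unit index, so the slice always ends on a character boundary. It counts code points rather than rendered width, so for the freetype case the width search would still have to be done by the caller.

```d
import std.utf : count, toUTFindex;

/// Hypothetical helper: keeps at most `maxChars` code points of UTF-8 text.
string truncateChars(string text, size_t maxChars)
{
    // Guard first: asking toUTFindex for an index past the end is an error.
    if (count(text) <= maxChars)
        return text;
    // Translate the code point index into a code unit index, then slice.
    return text[0 .. toUTFindex(text, maxChars)];
}

void main()
{
    string s = "h\u00E9llo";                       // 'e' with acute takes two code units
    assert(truncateChars(s, 2) == "h\u00E9");
    assert(truncateChars(s, 2).length == 3);       // two characters, three code units
    assert(truncateChars(s, 10) == s);             // shorter than the limit: unchanged
}
```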
September 30, 2006 Re: First Impressions | ||||
---|---|---|---|---|
Posted in reply to Derek Parnell | Derek Parnell wrote on 2006-09-30:
> On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:
>
>> Derek Parnell wrote:
>>> And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.
>>
>> I believe it's there. I don't think std::string or java.lang.String have anything over it.
>
> I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurrence of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.
~wow~
Have a look at std.string.find's source and try to stop giggling *g*
The correct implementation would be:
# import std.string;
# import std.c.string;
# import std.utf;
#
# int find(char[] s, dchar c)
# {
#     if (c <= 0x7F)
#     {   // Plain old ASCII
#         auto p = cast(char*)memchr(s, c, s.length);
#         if (p)
#             return p - cast(char *)s;
#         else
#             return -1;
#     }
#
#     // c is a universal character
#     return std.string.find(s, toUTF8([c]));
# }
The same applies to ifind and the like.
Thomas
|
September 30, 2006 Re: First Impressions | ||||
---|---|---|---|---|
Posted in reply to Thomas Kuehne | Thomas Kuehne wrote on 2006-09-30:
>
> Derek Parnell wrote on 2006-09-30:
>> On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:
>>
>>> Derek Parnell wrote:
>>>> And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.
>>>
>>> I believe it's there. I don't think std::string or java.lang.String have anything over it.
>>
>> I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurrence of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.
>
>
> ~wow~
>
> Have a look at std.string.find's source and try to stop giggling *g*
>
> The correct implementation would be:
As it seems, the original code depends on the undocumented index behavior with regards to silent transcoding in foreach.
Thomas
|
September 30, 2006 Re: First Impressions | ||||
---|---|---|---|---|
Posted in reply to Thomas Kuehne | Thomas Kuehne wrote:
>
> As it seems, the original code depends on the undocumented index behavior
> with regards to silent transcoding in foreach.
The wording could be more explicit, but I think the current documentation implies the actual behavior:
"The index must be of int or uint type, it cannot be inout, and it is set to be the index of the array element."
The docs should probably also be revised to allow for 64-bit indices, where the index would be long or ulong. Something along the lines of:
"The index must be an integer type of size equal to size_t.sizeof. . ."
Sean
|
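The index behavior under discussion can be observed with a small sketch, assuming a current D2 compiler, where the index is a `size_t`; `codePointStarts` is a hypothetical helper name. When foreach decodes a char array into dchar values, the index is the code unit position where each character starts, not a running character count.

```d
/// Hypothetical helper: records the foreach index seen for each decoded character.
size_t[] codePointStarts(string text)
{
    size_t[] starts;
    // The index refers to the underlying char array, even though c is decoded.
    foreach (i, dchar c; text)
        starts ~= i;
    return starts;
}

void main()
{
    // 'a' occupies one code unit, U+00E9 occupies two, '!' occupies one.
    size_t[] expected = [0, 1, 3];
    assert(codePointStarts("a\u00E9!") == expected);
}
```

A search routine that returns this index therefore yields a value that can be used directly for slicing the original char array.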
September 30, 2006 Re: First Impressions | ||||
---|---|---|---|---|
Posted in reply to Walter Bright | Walter Bright wrote:
>
> Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode.
As long as you're aware that you are working in UTF-8 I think std::string could still be used. It just may be strange to use substring searches to find multibyte characters with no built-in support for dchar-type searching.
> You can also wrap char[] inside a class that provides a view of the data as if it were dchar's. But I don't think the performance of such a class would be competitive. Interestingly, it turns out that most string operations do not need to be concerned with the number of char's in a character (like "find this substring"), and forcing them to care just makes for inefficiency.
Yup. I realized this while working on array operations and it came as a surprise--when I began I figured I would have to provide overloads for char strings, but in most cases it simply isn't necessary.
Sean
|