August 31, 2004 arrays and strings | ||||
---|---|---|---|---|
| ||||
All this talk about Unicode made it clear that a straight array may not be the right tool for string handling. Sure, the most common operations can be done on an array (concatenation, sub-arrays, etc.). However, if we are to assume any kind of encoding support other than ASCII, it is simply not safe unless we are talking about "dchar" arrays.

For example, logically speaking I may want to get the second and third characters of this string (UTF-8): 彼は来る (only four characters). It is the Japanese text for "kyo kimasu" (he comes). I'm into martial arts, so I can't get away from the Japanese language (it is tied to what I study)--even though I can't really speak a lick.

Now, tell me what I would get in a UTF-8 environment:

    char[] kyokimasu = "彼は来る";
    char[] test = kyokimasu[1..3];

    assert("は来" == test);

I guarantee you the assertion would fail. Why? Because strict array slicing does not take multibyte encoding into account. Each of these characters takes three bytes in UTF-8, so the slice holds only the second and third bytes of the first character's encoding.

Any UTF-aware system would either need to build this knowledge into the language (bad idea IMO), or have a string type to take care of that info for you. Things are a bit better with wchar[] (I'm not sure, but I think the above will pass)--but there are still some cases where one character needs more than one wchar (surrogate pairs).

Not to mention that the UTF-8 string above is 12 bytes long, more than the 8 bytes of the wchar[] version.

The only way to make it work seamlessly is to have a string class that would make the proper adjustments. Of course this would also affect the speed demons here.

I think having something generally useful for internationalization is very important, or we shoot ourselves in the foot (we want D to succeed; "as long as you speak English" does not make sense). General purpose i18n and l10n is not easy to do by any stretch--but I think it is generally agreed that it would have to be done in libraries.

I just don't think we can rely on D's native (up to now) way of dealing with string manipulation.
|
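The failure described in this post can be reproduced with a short sketch. It assumes a present-day D2 compiler (hence `string` and `wstring` literals instead of the `char[]` assignments used in the thread) and a UTF-8 encoded source file. `isValidUtf8` is a hypothetical helper name chosen for illustration, not a Phobos function. One way to run it is `dmd -unittest -main -run example.d`.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true when `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

unittest
{
    string kare = "彼は来る";

    assert(kare.length == 12);          // length counts bytes, not characters
    assert(kare[0 .. 3] == "彼");        // the first character alone takes 3 bytes
    assert(kare[1 .. 3] != "は来");      // byte slice: two bytes from inside 彼
    assert(!isValidUtf8(kare[1 .. 3])); // and those two bytes are not valid UTF-8
    assert(kare[3 .. 9] == "は来");      // the byte range that was actually wanted

    // All four characters fit in one UTF-16 code unit each,
    // so the same slice does work on the wchar version.
    wstring w = "彼は来る"w;
    assert(w.length == 4);
    assert(w[1 .. 3] == "は来"w);

    // A character outside the Basic Multilingual Plane needs a surrogate pair.
    assert("\U0001D11E"w.length == 2);
}
```

The last assertion is the wchar[] caveat in miniature: indexing by code unit is only safe for UTF-16 while every character stays inside the Basic Multilingual Plane.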
August 31, 2004 Re: arrays and strings | ||||
---|---|---|---|---|
| ||||
Posted in reply to Berin Loritsch | Berin Loritsch wrote:
> For example, logically speaking I may want to get the second and third
> characters of this string (UTF8): 彼は来る (only four characters). It
> is the Japanese text for "kyo kimasu" (he comes). I'm into martial
> arts, so I can't get away from the Japanese language (it is tied to
> what I study)--even though I can't really speak a lick.

Not really. From my limited Japanese abilities, this should actually be "kare wa kuru", which means the same thing (he comes). I don't think the kanji 彼 can be pronounced "kyo", if you look at this page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D

Of course this is nitpicking (I'm sorry :D ) and doesn't make your point invalid. I agree that a string should be somewhat more "intelligent" than an array.

-Sebastian
|
August 31, 2004 Re: arrays and strings | ||||
---|---|---|---|---|
| ||||
Posted in reply to Sebastian Beschke | Sebastian Beschke wrote:
> don't think the kanji 彼 can be pronounced "kyo", if you look at this page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D
Whoops, the link doesn't work. Nevermind.
|
August 31, 2004 Re: arrays and strings | ||||
---|---|---|---|---|
| ||||
Posted in reply to Sebastian Beschke | Sebastian Beschke wrote:
> Berin Loritsch wrote:
>
>> For example, logically speaking I may want to get the second and third
>> characters of this string (UTF8): 彼は来る (only four characters). It
>> is the Japanese text for "kyo kimasu" (he comes). I'm into martial
>> arts, so I can't get away from the Japanese language (it is tied to
>> what I study)--even though I can't really speak a lick.
>
>
> Not really. From my limited Japanese abilities, this should actually be "kare wa kuru", which means the same thing (he comes). I don't think the kanji 彼 can be pronounced "kyo", if you look at this page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D
>
> Of course this is nitpicking (I'm sorry :D ) and doesn't make your point invalid. I agree that a string should be somewhat more "intelligent" than an array.
>
> -Sebastian
Blasted electronic translators...
|
August 31, 2004 Re: arrays and strings | ||||
---|---|---|---|---|
| ||||
Posted in reply to Berin Loritsch | This is what dchar[] is for. With dchar[], array indexing === character indexing.

A couple of helper functions in std.string

    char[] slice(char[] str, int a, int b); // slice characters a to b, not indices a to b
    wchar[] slice(wchar[] str, int a, int b);

would also be nice for those cases when one doesn't want to convert to dchar[]. Maybe such functions are already in Phobos somewhere? I haven't looked too hard.
|
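The dchar[] route suggested here might look like the following sketch, assuming a present-day D2 compiler and `std.conv.to` for the conversion. `sliceViaUtf32` is a hypothetical name used only for illustration. Note that "character" here means a Unicode code point.

```d
import std.conv : to;

/// Hypothetical helper: slice code points a .. b by converting to UTF-32 first.
dstring sliceViaUtf32(string s, size_t a, size_t b)
{
    dstring wide = s.to!dstring; // one dchar per code point; allocates a copy
    return wide[a .. b];         // plain array slicing is now character slicing
}

unittest
{
    dstring wide = "彼は来る".to!dstring;

    assert(wide.length == 4);                 // four code points
    assert(wide.length * dchar.sizeof == 16); // 16 bytes, versus 12 in UTF-8

    assert(sliceViaUtf32("彼は来る", 1, 3) == "は来"d);
    assert(sliceViaUtf32("彼は来る", 1, 3).to!string == "は来");
}
```

The trade-off is the one the thread keeps circling: indexing becomes trivial, at the price of a conversion pass and four bytes per character.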
August 31, 2004 Re: arrays and strings | ||||
---|---|---|---|---|
| ||||
Posted in reply to Ben Hinkle | Actually, now that I think about it, another way to slice from character a to b is to have a function that returns the index of the nth character:

    int character(char[] str, int n);

and then slicing is

    str[character(str, a) .. character(str, b)];

That is probably better than special slicing functions.
|
August 31, 2004 Re: arrays and strings | ||||
---|---|---|---|---|
| ||||
Posted in reply to Ben Hinkle | OK - enough replying to myself, I know, I know. Here's the code implementing what I'm talking about:

    import std.utf;

    // Return the index in str at which character number n (0-based) starts.
    size_t character(char[] str, size_t n)
    {
        size_t i = 0;
        while (n--)
        {
            decode(str, i); // decode one character and advance i past it
        }
        return i;
    }

    // Same for UTF-16; here i counts wchar code units.
    size_t character(wchar[] str, size_t n)
    {
        size_t i = 0;
        while (n--)
        {
            decode(str, i);
        }
        return i;
    }
|
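On the earlier question of whether Phobos already has such a function: present-day Phobos documents `std.utf.toUTFindex`, which maps a character (code point) count to a code unit index, much like `character()` above. The thread does not say whether it was available in the 2004 library, so the sketch below is written against a current D2 compiler only.

```d
import std.utf : toUTFindex;

unittest
{
    string s = "彼は来る";

    // Each character is 3 bytes in UTF-8, so character n starts at byte 3 * n.
    assert(toUTFindex(s, 0) == 0);
    assert(toUTFindex(s, 1) == 3);
    assert(toUTFindex(s, 3) == 9);

    // Character-based slicing expressed through byte indices.
    assert(s[toUTFindex(s, 1) .. toUTFindex(s, 3)] == "は来");
}
```

As with two calls to `character()`, each `toUTFindex` call scans from the start of the string.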
August 31, 2004 Re: arrays and strings | ||||
---|---|---|---|---|
| ||||
Posted in reply to Ben Hinkle | In article <ch26je$sq4$1@digitaldaemon.com>, Ben Hinkle says...
>
>actually now that I think about it another way to slice from character a to
>b is to have a function that returns the index of the nth character:
> int character(char[] str, int n);
>and then slicing is
> str[character(a) .. character(b)];
>That is probably better than special slicing functions.

It's more flexible, but it is slightly slower. The two calls to character() will parse the string once each, while a splice() function could do it in one run.

Nick
|
August 31, 2004 Re: arrays and strings | ||||
---|---|---|---|---|
| ||||
Posted in reply to Nick | In article <ch2i6t$13ma$1@digitaldaemon.com>, Nick says...
>
>
>It's more flexible, but it is slightly slower. The two calls to character() will
>parse the string once each, while a splice() function could do it in one run.
                                     ^^^^^^
Err, that should be slice() :-)

Nick
|
August 31, 2004 Re: arrays and strings | ||||
---|---|---|---|---|
| ||||
Posted in reply to Nick | "Nick" <Nick_member@pathlink.com> wrote in message news:ch2i6t$13ma$1@digitaldaemon.com...
> In article <ch26je$sq4$1@digitaldaemon.com>, Ben Hinkle says...
> >
> >actually now that I think about it another way to slice from character a to
> >b is to have a function that returns the index of the nth character:
> > int character(char[] str, int n);
> >and then slicing is
> > str[character(a) .. character(b)];
> >That is probably better than special slicing functions.
>
> It's more flexible, but it is slightly slower. The two calls to character() will
> parse the string once each, while a splice() function could do it in one run.
>
> Nick

Good point. Plus it is less typing. So here's version 2:

    import std.utf;

    // Return the index reached after decoding n characters of str,
    // starting from index i.
    size_t character(char[] str, size_t n, size_t i = 0)
    {
        while (n--)
        {
            decode(str, i); // decode one character and advance i past it
        }
        return i;
    }

    size_t character(wchar[] str, size_t n, size_t i = 0)
    {
        while (n--)
        {
            decode(str, i);
        }
        return i;
    }

    // Slice characters a to b (not indices a to b). Assumes a <= b.
    char[] slice(char[] str, size_t a, size_t b)
    {
        size_t ai = character(str, a);         // index where character a starts
        size_t bi = character(str, b - a, ai); // continue from ai, no rescan from 0
        return str[ai .. bi];
    }

    wchar[] slice(wchar[] str, size_t a, size_t b)
    {
        size_t ai = character(str, a);
        size_t bi = character(str, b - a, ai);
        return str[ai .. bi];
    }
|
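A sketch of how version 2 could be exercised, ported to present-day D2 as an assumption: the `const`/`inout` qualifiers and the `a <= b` check are additions for illustration and are not part of the code posted in the thread. Only the UTF-8 overloads are shown.

```d
import std.utf : decode;

/// Index reached after decoding n code points of str, starting at index i.
size_t character(const(char)[] str, size_t n, size_t i = 0)
{
    while (n--)
        decode(str, i); // advances i past one code point
    return i;
}

/// Slice code points a .. b with a single left-to-right scan.
inout(char)[] slice(inout(char)[] str, size_t a, size_t b)
{
    assert(a <= b, "slice requires a <= b"); // b - a would wrap around otherwise
    immutable ai = character(str, a);
    immutable bi = character(str, b - a, ai); // resumes at ai instead of rescanning
    return str[ai .. bi];
}

unittest
{
    // The example from the first post now gives the expected answer.
    assert(slice("彼は来る", 1, 3) == "は来");
    assert(slice("彼は来る", 0, 4) == "彼は来る");
    assert(slice("abc", 1, 1) == "");

    // Mutable buffers work too; the result is a view into buf, not a copy.
    char[] buf = "彼は来る".dup;
    char[] mid = slice(buf, 1, 3);
    assert(mid == "は来");
}
```

Asking for more characters than the string holds is not handled here, so callers would need to check bounds themselves.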
Copyright © 1999-2021 by the D Language Foundation