arrays and strings
August 31, 2004
All this talk about Unicode made it clear that a straight array
may not be the right tool for string handling.  Sure, the most common
operations can be done on an array (concatenation, sub-arrays, etc.).
However, if we are to assume any kind of encoding support other than
ASCII, it is simply not safe unless we are talking about "dchar" arrays.

For example, logically speaking I may want to get the second and third
characters of this string (UTF8): 彼は来る (only four characters).  It
is the Japanese text for "kyo kimasu" (he comes).  I'm into martial
arts, so I can't get away from the Japanese language (it is tied to
what I study)--even though I can't really speak a lick.

Now, tell me what I would get in a UTF8 environment:

char[] kyokimasu = "彼は来る";
char[] test = kyokimasu[1..3];  // bytes 1 and 2, not characters 2 and 3

assert("は来" == test);

I guarantee you the assertion would fail.  Why?  Because plain array
slicing works on bytes and does not take the multibyte encoding into
account.  Each of these characters takes three bytes in UTF-8, so I
will get only the second and third bytes of the first character.
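
A minimal sketch of that failure, written for present-day D rather than the 2004 compiler (an assumption: `string` is now an alias for `immutable(char)[]`, and `std.utf.validate` throws `UTFException` on malformed input):

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    string kyokimasu = "彼は来る";

    // Four characters, but twelve UTF-8 code units (three bytes each).
    assert(kyokimasu.length == 12);

    // Indices 1..3 select two continuation bytes of the first character.
    string test = kyokimasu[1 .. 3];
    assert(test != "は来");
    assertThrown!UTFException(validate(test));

    // The second and third characters actually occupy bytes 3..9.
    assert(kyokimasu[3 .. 9] == "は来");
}
```

The slice itself succeeds, because the array bounds are valid; only the result is no longer well-formed UTF-8.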

Any UTF-aware system would either need to build this knowledge into the
language (bad idea IMO), or have a string class take care of that info
for you.  Things are a bit better with wchar[] (I'm not sure, but I
think the above will pass), but there are still some cases where one
character needs more than one wchar (surrogate pairs).

Not to mention that the UTF-8 string above is 12 bytes long, more than
the 8 bytes of the wchar[] version.
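
To put numbers on the three encodings, here is a small sketch in present-day D (assuming the current `std.utf.toUTF16` and `std.utf.toUTF32`, which return `wstring` and `dstring`). The G clef character is an illustrative choice, not something from the example above:

```d
import std.utf : toUTF16, toUTF32;

void main()
{
    string  s8  = "彼は来る";
    wstring s16 = toUTF16(s8);
    dstring s32 = toUTF32(s8);

    // Code-unit counts: 12 chars, 4 wchars, 4 dchars.
    assert(s8.length == 12);
    assert(s16.length == 4);
    assert(s32.length == 4);

    // Storage in bytes: 12, 8 and 16 respectively.
    assert(s8.length * char.sizeof == 12);
    assert(s16.length * wchar.sizeof == 8);
    assert(s32.length * dchar.sizeof == 16);

    // These characters are all in the BMP, so wchar slicing happens to work.
    assert(s16[1 .. 3] == "は来"w);

    // Outside the BMP, one character takes two wchars (a surrogate pair).
    wstring clef = "\U0001D11E"w; // MUSICAL SYMBOL G CLEF
    assert(clef.length == 2);
}
```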

The only way to make it work seamlessly is to have a string class that
would make the proper adjustments.  Of course this would also affect
the speed demons here.

I think having something generally useful for internationalization is
very important, or we shoot ourselves in the foot ("we want D to succeed,
as long as you speak English" does not make sense).  General purpose i18n
and l10n is not easy to do by any stretch, but I think it is generally
agreed that it would have to be done in libraries.

I just don't think we can rely on D's native (up to now) way of dealing
with string manipulation.
August 31, 2004
Berin Loritsch wrote:
> For example, logically speaking I may want to get the second and third
> characters of this string (UTF8): 彼は来る (only four characters).  It
> is the Japanese text for "kyo kimasu" (he comes).  I'm into martial
> arts, so I can't get away from the Japanese language (it is tied to
> what I study)--even though I can't really speak a lick.

Not really. From what my limited Japanese abilities tell me, this should actually be "kare wa kuru", which means the same thing (he comes). I don't think the kanji 彼 can be pronounced "kyo", if you look at this page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D

Of course this is nitpicking (I'm sorry :D ) and doesn't make your point invalid. I agree that a string should be somewhat more "intelligent" than an array.

-Sebastian
August 31, 2004
Sebastian Beschke wrote:
> don't think the kanji 彼 can be pronounced "kyo", if you look at this page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D

Whoops, the link doesn't work. Never mind.
August 31, 2004
Sebastian Beschke wrote:

> Berin Loritsch wrote:
> 
>> For example, logically speaking I may want to get the second and third
>> characters of this string (UTF8): 彼は来る (only four characters).  It
>> is the Japanese text for "kyo kimasu" (he comes).  I'm into martial
>> arts, so I can't get away from the Japanese language (it is tied to
>> what I study)--even though I can't really speak a lick.
> 
> 
> Not really. From what my limited Japanese abilities tell me, this should actually be "kare wa kuru", which means the same thing (he comes). I don't think the kanji 彼 can be pronounced "kyo", if you look at this page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D
> 
> Of course this is nitpicking (I'm sorry :D ) and doesn't make your point invalid. I agree that a string should be somewhat more "intelligent" than an array.
> 
> -Sebastian

Blasted electronic translators...
August 31, 2004
This is what dchar[] is for. With dchar[], array indexing is character
indexing.
A couple of helper functions in std.string
 char[] slice(char[] str, int a, int b); // slice characters a to b, not indices a to b
 wchar[] slice(wchar[] str, int a, int b);
would also be nice for those cases when one doesn't want to convert to
dchar[]. Maybe such functions are already in Phobos somewhere? I haven't
looked too hard.
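
A short sketch of the dchar[] route in present-day D (assuming the current `std.utf.toUTF32` and `std.utf.toUTF8`; "character" here means one Unicode code point):

```d
import std.utf : toUTF8, toUTF32;

void main()
{
    string utf8 = "彼は来る";

    // One dchar per code point, so indices line up with characters.
    dstring utf32 = toUTF32(utf8);
    assert(utf32.length == 4);

    dstring middle = utf32[1 .. 3];
    assert(middle == "は来"d);

    // Convert back when a UTF-8 result is needed.
    assert(toUTF8(middle) == "は来");
}
```

The cost is a full conversion and four bytes per character, which is why the helper functions are attractive when only an occasional slice is needed.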

August 31, 2004
Actually, now that I think about it, another way to slice from character a to
b is to have a function that returns the index of the nth character:
 int character(char[] str, int n);
and then slicing is
 str[character(str, a) .. character(str, b)];
That is probably better than special slicing functions.

August 31, 2004
OK - enough replying to myself, I know, I know. Here's the code implementing what I'm talking about:

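import std.utf;

// Returns the index in str at which the nth character (0-based) begins.
// Each decode() call advances i past one complete UTF-8 sequence.
size_t character(char[] str, size_t n) {
  size_t i = 0;
  while (n--) {
    decode(str,i);
  }
  return i;
}

// Same for UTF-16: decode() steps over a surrogate pair as one character.
size_t character(wchar[] str, size_t n) {
  size_t i = 0;
  while (n--) {
    decode(str,i);
  }
  return i;
}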
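
A usage sketch for present-day D, under one assumption that differs from the code above: the parameter is `const(char)[]` so that an immutable `string` literal can be passed in.

```d
import std.utf : decode;

// Index at which the n-th character (0-based) of a UTF-8 string begins.
size_t character(const(char)[] str, size_t n)
{
    size_t i = 0;
    while (n--)
        decode(str, i); // advances i past one encoded character
    return i;
}

void main()
{
    string s = "彼は来る";

    assert(character(s, 0) == 0);
    assert(character(s, 1) == 3);
    assert(character(s, 3) == 9);

    // Character-based slice: second and third characters.
    assert(s[character(s, 1) .. character(s, 3)] == "は来");
}
```

Asking for a character beyond the end of the string is not handled here; `decode` expects the index to be inside the string.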


August 31, 2004
In article <ch26je$sq4$1@digitaldaemon.com>, Ben Hinkle says...
>
>actually now that I think about it another way to slice from character a to b is to have a function that returns the index of the nth character:
> int character(char[] str, int n);
>and then slicing is
> str[character(a) .. character(b)];
>That is probably better than special slicing functions.

It's more flexible, but it is slightly slower. The two calls to character() will
parse the string once each, while a splice() function could do it in one run.

Nick


August 31, 2004
In article <ch2i6t$13ma$1@digitaldaemon.com>, Nick says...
>
>
>It's more flexible, but it is slightly slower. The two calls to character() will
>parse the string once each, while a splice() function could do it in one run.
^^^^^^
Err, that should be slice() :-)

Nick


August 31, 2004
"Nick" <Nick_member@pathlink.com> wrote in message news:ch2i6t$13ma$1@digitaldaemon.com...
> In article <ch26je$sq4$1@digitaldaemon.com>, Ben Hinkle says...
> >
> >actually now that I think about it another way to slice from character a to
> >b is to have a function that returns the index of the nth character:
> > int character(char[] str, int n);
> >and then slicing is
> > str[character(a) .. character(b)];
> >That is probably better than special slicing functions.
>
> It's more flexible, but it is slightly slower. The two calls to character() will
> parse the string once each, while a splice() function could do it in one run.
>
> Nick

Good point. Plus it is less typing. So here's version 2:

import std.utf;

// Returns the index of the character n positions after index i.
size_t character(char[] str, size_t n, size_t i = 0) {
  while (n--) {
    decode(str,i);
  }
  return i;
}

size_t character(wchar[] str, size_t n, size_t i = 0) {
  while (n--) {
    decode(str,i);
  }
  return i;
}

// Characters a up to (not including) b. Assumes a <= b. The second
// call resumes at ai, so the string is scanned only once.
char[] slice(char[] str, size_t a, size_t b) {
  size_t ai = character(str,a);
  size_t bi = character(str,b-a,ai);
  return str[ai .. bi];
}

wchar[] slice(wchar[] str, size_t a, size_t b) {
  size_t ai = character(str,a);
  size_t bi = character(str,b-a,ai);
  return str[ai .. bi];
}
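
A usage sketch of the UTF-8 pair for present-day D. The `const(char)[]` and `string` parameter types and the `a <= b` assertion are assumptions added here so that the example compiles against string literals and fails loudly on a reversed range:

```d
import std.utf : decode;

// Index of the character n positions after index i in a UTF-8 string.
size_t character(const(char)[] str, size_t n, size_t i = 0)
{
    while (n--)
        decode(str, i);
    return i;
}

// Characters a up to (not including) b, found in a single pass.
string slice(string str, size_t a, size_t b)
{
    assert(a <= b, "slice expects a <= b");
    size_t ai = character(str, a);         // scan from the start to a
    size_t bi = character(str, b - a, ai); // continue from ai, no rescan
    return str[ai .. bi];
}

void main()
{
    string s = "彼は来る";

    assert(slice(s, 1, 3) == "は来");
    assert(slice(s, 0, 4) == s);
    assert(slice(s, 2, 2) == "");
}
```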

