arrays and strings - D Programming Language Discussion Forum

August 31, 2004

arrays and strings

Posted by Berin Loritsch

All this talk about Unicode made it clear that using a straight array
may not be the right tool for string handling.  Sure, the most common
operations can be done on an array (concatenation, sub-arrays, etc.).
However, if we are to assume any kind of encoding support other than
ASCII, it is simply not safe unless we are talking about "dchar" arrays.

For example, logically speaking I may want to get the second and third
characters of this string (UTF-8): 彼は来る (only four characters).  It
is the Japanese text for "kyo kimasu" (he comes).  I'm into martial
arts, so I can't get away from the Japanese language (it is tied to
what I study)--even though I can't really speak a lick.

Now, tell me what I would get in a UTF-8 environment:

char[] kyokimasu = "彼は来る";
char[] test = kyokimasu[1..3];

assert(test == "は来");

I guarantee you the assertion would fail.  Why?  Because plain array
slicing works on bytes and does not take multibyte encoding into account.
Essentially I will get only part of the first character's encoding.
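
To make that concrete, here is a small sketch of what the slice actually contains. It assumes a current D2 compiler, where string literals are `string` (immutable(char)[]) rather than the char[] used above, and isValidUtf8 is a helper name made up for this example.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "彼は来る";

    // Four characters, three UTF-8 code units each.
    assert(s.length == 12);

    // [1 .. 3] is the last two bytes of 彼, which is not valid text by itself.
    assert(!isValidUtf8(s[1 .. 3]));

    // The second and third characters sit at code-unit offsets 3 .. 9.
    assert(s[3 .. 9] == "は来");
}
```

So the slice does not just compare unequal, it is not even well-formed UTF-8.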

Any UTF-aware system would either need to build this knowledge into the
language (bad idea IMO), or have a string class take care of that for
you.  Things are a bit better with wchar[] (I'm not sure, but I think
the above will pass), but there are still some characters that need more
than one wchar.

Not to mention that the UTF-8 string above takes 12 bytes, more than the
8 bytes of the wchar[] version.
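
A quick sketch of the wchar[] case, again assuming a D2 compiler and its wstring literals. The G clef character is only my pick of a character outside the Basic Multilingual Plane; any such character behaves the same way.

```d
void main()
{
    wstring w = "彼は来る"w;

    // All four characters fit in one UTF-16 code unit each: 8 bytes total.
    assert(w.length == 4);
    assert(w.length * wchar.sizeof == 8);

    // So plain slicing happens to work for this particular string.
    assert(w[1 .. 3] == "は来"w);

    // U+1D11E (musical G clef) needs a surrogate pair: one character, two wchars.
    wstring clef = "\U0001D11E"w;
    assert(clef.length == 2);
}
```

Slicing clef[0 .. 1] would leave a lone surrogate, which is the same problem as in the char[] case, only rarer.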

The only way to make it work seamlessly is to have a string class that
would make the proper adjustments.  Of course this would also affect
the speed demons here.

I think having something generally useful for internationalization is
very important, or we shoot ourselves in the foot ("we want D to succeed,
as long as you speak English" does not make sense).  General-purpose i18n
and l10n are not easy to do by any stretch, but I think it is generally
agreed that they would have to be done in libraries.

I just don't think we can rely on D's native (up to now) way of dealing
with string manipulation.

August 31, 2004

Re: arrays and strings

Posted by Sebastian Beschke
in reply to Berin Loritsch

Berin Loritsch wrote:
> For example, logically speaking I may want to get the second and third
> characters of this string (UTF8): 彼は来る (only four characters).  It
> is the Japanese text for "kyo kimasu" (he comes).  I'm into martial
> arts, so I can't get away from the Japanese language (it is tied to
> what I study)--even though I can't really speak a lick.

Not really. From my limited Japanese abilities, this should actually be "kare wa kuru", which means the same thing (he comes). I don't think the kanji 彼 can be pronounced "kyo", if you look at this page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D

Of course this is nitpicking (I'm sorry :D ) and doesn't make your point invalid. I agree that a string should be somewhat more "intelligent" than an array.

-Sebastian

August 31, 2004

Re: arrays and strings

Posted by Sebastian Beschke
in reply to Sebastian Beschke

Sebastian Beschke wrote:
> don't think the kanji 彼 can be pronounced "kyo", if you look at this page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D

Whoops, the link doesn't work. Never mind.

August 31, 2004

Re: arrays and strings

Posted by Berin Loritsch
in reply to Sebastian Beschke

Blasted electronic translators...

August 31, 2004

Re: arrays and strings

Posted by Ben Hinkle
in reply to Berin Loritsch

This is what dchar[] is for. With dchar[], array indexing === character
indexing.
A couple of helper functions in std.string
 char[] slice(char[] str, int a, int b); // slice characters a to b, not indices a to b
 wchar[] slice(wchar[] str, int a, int b);
would also be nice for those cases when one doesn't want to convert to
dchar[]. Maybe such functions are already in Phobos somewhere? I haven't
looked too hard.
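
A minimal sketch of the dchar[] route, assuming a current D2 compiler and std.conv.to for the transcoding. sliceViaDchar is a name made up for this example, not something in Phobos.

```d
import std.conv : to;

// Hypothetical helper: slice characters a .. b by converting to UTF-32 first.
string sliceViaDchar(string s, size_t a, size_t b)
{
    dstring d = s.to!dstring;    // one dchar per character
    return d[a .. b].to!string;  // index == character position here
}

void main()
{
    assert(sliceViaDchar("彼は来る", 1, 3) == "は来");
}
```

It is simple, but it converts the whole string and allocates twice, which is why index-based helpers that stay in UTF-8 are attractive.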

August 31, 2004

Re: arrays and strings

Posted by Ben Hinkle
in reply to Ben Hinkle

Actually, now that I think about it, another way to slice from character a to
b is to have a function that returns the index of the nth character:
 int character(char[] str, int n);
and then slicing is
 str[character(str, a) .. character(str, b)];
That is probably better than special slicing functions.
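
Current Phobos has a function of this shape, std.utf.toUTFindex, which maps a character index to a code-unit index. A short sketch, assuming a D2 compiler (so the literal is a `string`, not char[]):

```d
import std.utf : toUTFindex;

void main()
{
    string s = "彼は来る";

    // Character 1 starts at code unit 3, character 3 at code unit 9.
    assert(toUTFindex(s, 1) == 3);
    assert(toUTFindex(s, 3) == 9);

    // Slicing by character position then looks like the line above.
    assert(s[toUTFindex(s, 1) .. toUTFindex(s, 3)] == "は来");
}
```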

August 31, 2004

Re: arrays and strings

Posted by Ben Hinkle
in reply to Ben Hinkle

OK - enough replying to myself, I know, I know. Here's the code implementing what I'm talking about:

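import std.utf;

// Returns the index in str at which the nth character (0-based) starts.
size_t character(char[] str, size_t n) {
  size_t i = 0;
  while (n--) {
    decode(str,i);  // advances i past one UTF-8 encoded character
  }
  return i;
}

// Same as above, for UTF-16 strings.
size_t character(wchar[] str, size_t n) {
  size_t i = 0;
  while (n--) {
    decode(str,i);  // advances i past one UTF-16 encoded character
  }
  return i;
}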

August 31, 2004

Re: arrays and strings

Posted by Nick
in reply to Ben Hinkle

In article <ch26je$sq4$1@digitaldaemon.com>, Ben Hinkle says...
>
>actually now that I think about it another way to slice from character a to b is to have a function that returns the index of the nth character:
> int character(char[] str, int n);
>and then slicing is
> str[character(a) .. character(b)];
>That is probably better than special slicing functions.

It's more flexible, but it is slightly slower. The two calls to character() will
parse the string once each, while a splice() function could do it in one run.

Nick

August 31, 2004

Re: arrays and strings

Posted by Nick
in reply to Nick

In article <ch2i6t$13ma$1@digitaldaemon.com>, Nick says...
>
>
>It's more flexible, but it is slightly slower. The two calls to character() will
>parse the string once each, while a splice() function could do it in one run.
^^^^^^
Err, that should be slice() :-)

Nick

August 31, 2004

Re: arrays and strings

Posted by Ben Hinkle
in reply to Nick

"Nick" <Nick_member@pathlink.com> wrote in message news:ch2i6t$13ma$1@digitaldaemon.com...
> In article <ch26je$sq4$1@digitaldaemon.com>, Ben Hinkle says...
> >
> >actually now that I think about it another way to slice from character a to
> >b is to have a function that returns the index of the nth character:
> > int character(char[] str, int n);
> >and then slicing is
> > str[character(a) .. character(b)];
> >That is probably better than special slicing functions.
>
> It's more flexible, but it is slightly slower. The two calls to character() will
> parse the string once each, while a splice() function could do it in one run.
>
> Nick

Good point. Plus it is less typing. So here's version 2:

import std.utf;

// Returns the index reached after skipping n characters, starting at index i.
size_t character(char[] str, size_t n, size_t i = 0) {
  while (n--) {
    decode(str,i);  // advances i past one encoded character
  }
  return i;
}

size_t character(wchar[] str, size_t n, size_t i = 0) {
  while (n--) {
    decode(str,i);
  }
  return i;
}

// Slices characters a to b (not indices a to b) in a single pass.
char[] slice(char[] str, size_t a, size_t b) {
  size_t ai = character(str,a);
  size_t bi = character(str,b-a,ai);  // continue from ai instead of rescanning
  return str[ai .. bi];
}

wchar[] slice(wchar[] str, size_t a, size_t b) {
  size_t ai = character(str,a);
  size_t bi = character(str,b-a,ai);
  return str[ai .. bi];
}
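
For anyone trying this on a current D2 compiler: string literals are immutable there, so the char[] and wchar[] overloads above will not accept them directly. Below is a sketch of the same two functions written as templates over the element type. The template form is my assumption about how to adapt the code, not part of the original post.

```d
import std.utf : decode;

// Index reached after skipping n characters, starting from code-unit index i.
size_t character(C)(C[] str, size_t n, size_t i = 0)
{
    while (n--)
        decode(str, i); // advances i past one encoded character
    return i;
}

// Slice characters a .. b (not code units a .. b) in a single pass.
C[] slice(C)(C[] str, size_t a, size_t b)
{
    immutable ai = character(str, a);
    immutable bi = character(str, b - a, ai); // resume from ai
    return str[ai .. bi];
}

void main()
{
    assert(slice("彼は来る", 1, 3) == "は来");    // UTF-8
    assert(slice("彼は来る"w, 1, 3) == "は来"w);  // UTF-16

    char[] buf = "彼は来る".dup;                  // mutable arrays work too
    assert(slice(buf, 0, 1) == "彼");
}
```

Like the original, this does no bounds checking of its own: it assumes a <= b and that the string has at least b characters.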

Copyright © 1999-2021 by the D Language Foundation