The length of strings vs. # of chars vs. sizeof - D Programming Language Discussion Forum

November 01, 2009

The length of strings vs. # of chars vs. sizeof

Posted by Charles Hixson

I've read and re-read the documentation, but I can't decide whether a UTF-8 character that takes multiple bytes to express counts as one or multiple values in length and sizeof.  Sizeof seems to presume that all entries are the same length, but otherwise it seems to be the property I need.  (I suppose that I could just enter a string that I know is multi-byte chars, but it sure would be better if I could find out from the documentation.)  I'm pretty certain that it just counts as one character for indexing, so length would almost need to also count the number of characters rather than bytes.

Sizeof *should* be the correct property, and I've been assuming that it is, but I'm a bit afraid that I'll run across some unexpected character and it won't act the way I think it should.  And the documentation reads ambiguously.

Does anyone just *know* the answer?  (And if so, could they make the documentation explicit?)

November 01, 2009

Re: The length of strings vs. # of chars vs. sizeof

Posted by Rainer Deyke
in reply to Charles Hixson

Charles Hixson wrote:
> I've read and re-read the documentation, but I can't decide whether a UTF-8 character that takes multiple bytes to express counts as one or multiple values in length and sizeof.  Sizeof seems to presume that all entries are the same length, but otherwise it seems to be the property I need.  (I suppose that I could just enter a string that I know is multi-byte chars, but it sure would be better if I could find out from the documentation.)  I'm pretty certain that it just counts as one character for indexing, so length would almost need to also count the number of characters rather than bytes.

Strings are just arrays of code units.  Their length is the number of elements (i.e. code units) they contain, just like other arrays.  A code point may comprise multiple code units, and a logical character may comprise multiple code points.  The latter is true even with dchar/utf-32.
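
To make the three levels concrete, here is a minimal sketch. It assumes a reasonably recent Phobos: `std.uni.byGrapheme` is, as far as I know, a later addition than this thread, and the sample text is just an illustration.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one logical character.
    string  s8  = "e\u0301";
    wstring s16 = "e\u0301"w;
    dstring s32 = "e\u0301"d;

    writeln(s8.length);   // 3: UTF-8 code units (bytes)
    writeln(s16.length);  // 2: UTF-16 code units
    writeln(s32.length);  // 2: UTF-32 code units, i.e. code points

    writeln(s8.walkLength);             // 2: code points (the string is decoded while walking)
    writeln(s8.byGrapheme.walkLength);  // 1: logical characters (grapheme clusters)
}
```

So `.length` answers "how many array elements", and only by coincidence "how many characters".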


-- 
Rainer Deyke - rainerd@eldwood.com

November 01, 2009

Re: The length of strings vs. # of chars vs. sizeof

Posted by Jérôme M. Berger
in reply to Rainer Deyke

Rainer Deyke wrote:
> Charles Hixson wrote:
>> I've read and re-read the documentation, but I can't decide whether a UTF-8 character that takes multiple bytes to express counts as one or multiple values in length and sizeof.  Sizeof seems to presume that all entries are the same length, but otherwise it seems to be the property I need.  (I suppose that I could just enter a string that I know is multi-byte chars, but it sure would be better if I could find out from the documentation.)  I'm pretty certain that it just counts as one character for indexing, so length would almost need to also count the number of characters rather than bytes.
> 
> Strings are just arrays of code units.  Their length is the number of elements (i.e. code units) they contain, just like other arrays.  A code point may comprise multiple code units, and a logical character may comprise multiple code points.  The latter is true even with dchar/utf-32.
> 
	So, in UTF-8, length is the number of bytes in the string and
sizeof is 8 (on 32-bit systems).
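
A small sketch of the difference, assuming the usual slice layout of one length field plus one pointer (which is what gives 8 on 32-bit and 16 on 64-bit targets); the sample string is only an illustration:

```d
import std.stdio : writeln;

void main()
{
    string s = "h\u00E9llo";   // U+00E9 takes two bytes in UTF-8

    writeln(s.length);                       // 6: number of bytes (code units)
    writeln(s.sizeof);                       // size of the slice itself, not of the text
    writeln(s.sizeof == 2 * size_t.sizeof);  // true under the layout assumed above
    writeln(s.length * char.sizeof);         // 6: bytes occupied by the text
}
```

In other words, sizeof never changes with the content of the string; only length does.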

		Jerome
-- 
mailto:jeberger@free.fr
http://jeberger.free.fr
Jabber: jeberger@jabber.fr

November 02, 2009

Re: The length of strings vs. # of chars vs. sizeof

Posted by Jesse Phillips
in reply to Charles Hixson

On Sun, 01 Nov 2009 11:36:31 -0800, Charles Hixson wrote:

> Does anyone just *know* the answer?  (And if so, could they make the
> documentation explicit?)

I believe the documentation you are looking for is:

http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

It is more about understanding UTF than it is about learning strings.

November 02, 2009

Re: The length of strings vs. # of chars vs. sizeof

Posted by Rainer Deyke
in reply to Jérôme M. Berger

Jérôme M. Berger wrote:
> Rainer Deyke wrote:
>> Strings are just arrays of code units.  Their length is the number of elements (i.e. code units) they contain, just like other arrays.  A code point may comprise multiple code units, and a logical character may comprise multiple code points.  The latter is true even with dchar/utf-32.
>>
>     So, in UTF-8, length is the number of bytes in the string and sizeof
> is 8 (on 32-bit systems).

Yes.


-- 
Rainer Deyke - rainerd@eldwood.com

November 02, 2009

Re: The length of strings vs. # of chars vs. sizeof

Posted by Rainer Deyke
in reply to Jesse Phillips

Jesse Phillips wrote:
> I believe the documentation you are looking for is:
> 
> http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
> 
> It is more about understanding UTF than it is about learning strings.

One thing that page fails to mention is that D has no awareness of
anything higher-level than code points.  In particular:
  - dchar contains a code point, not a logical character.
  - D has no awareness of canonical forms and precomposed/decomposed
characters (at the language level).  (Some characters can be represented
as either one or two code points.  D does not know that these are
supposed to represent the same character.)
  - Although D stops you from outputting an incomplete code point, it
does not stop you from outputting an incomplete logical character.

Also, some D library functions only work on the ASCII subset of utf-8.
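
The precomposed/decomposed point can be sketched as follows. This assumes a Phobos that ships `std.uni.normalize`, which is library code and not something the language does for you; plain `==` still compares code units.

```d
import std.stdio : writeln;
import std.uni : normalize, NFC;

void main()
{
    string precomposed = "\u00E9";   // U+00E9, a single code point
    string decomposed  = "e\u0301";  // 'e' plus U+0301 COMBINING ACUTE ACCENT

    // Same logical character, different code units, so they compare unequal.
    writeln(precomposed == decomposed);                   // false
    writeln(precomposed.length, " ", decomposed.length);  // 2 3

    // Normalizing to the same form first makes the comparison meaningful.
    writeln(normalize!NFC(decomposed) == precomposed);    // true
}
```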


-- 
Rainer Deyke - rainerd@eldwood.com

November 02, 2009

Re: The length of strings vs. # of chars vs. sizeof

Posted by Daniel Keep
in reply to Rainer Deyke

Rainer Deyke wrote:
> Jesse Phillips wrote:
>> I believe the documentation you are looking for is:
>>
>> http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
>>
>> It is more about understanding UTF than it is about learning strings.
> 
> One thing that page fails to mention is that D has no awareness of
> anything higher-level than code points.  In particular:
>   - dchar contains a code point, not a logical character.
>   - D has no awareness of canonical forms and precomposed/decomposed
> characters (at the language level).  (Some characters can be represented
> as either one or two code points.  D does not know that these are
> supposed to represent the same character.)
>   - Although D stops you from outputting an incomplete code point, it
> does not stop you from outputting an incomplete logical character.
> 
> Also, some D library functions only work on the ASCII subset of utf-8.

Well, it *is* on a Wiki.

November 02, 2009

Re: The length of strings vs. # of chars vs. sizeof

Posted by Charles Hixson
in reply to Jesse Phillips

Jesse Phillips wrote:
> On Sun, 01 Nov 2009 11:36:31 -0800, Charles Hixson wrote:
>
>> Does anyone just *know* the answer?  (And if so, could they make the
>> documentation explicit?)
>
> I believe the documentation you are looking for is:
>
> http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
>
> It is more about understanding UTF than it is about learning strings.
Thanks, that does appear to be the answer.

So if a string is too long, and I shorten it by one character, I'd better test it with std.utf.validate(str).  If it doesn't throw an error, it's ok.  Otherwise shorten it again and retry.

I hope I understood this correctly.  (I'm sure there's a more elegant way to do this, but here I'm going for a simple approach, as I should rarely be encountering this problem.)
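
Roughly what that retry loop could look like, as a sketch only. `shortenTo` and `maxBytes` are names made up for this example, and it assumes the input was valid UTF-8 before it was cut.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: trims `s` to at most `maxBytes` code units,
/// then drops trailing bytes until the result is valid UTF-8 again.
string shortenTo(string s, size_t maxBytes)
{
    if (s.length <= maxBytes)
        return s;
    s = s[0 .. maxBytes];          // may cut a multi-byte sequence in half
    while (s.length > 0)
    {
        try
        {
            validate(s);           // throws UTFException on a truncated sequence
            break;
        }
        catch (UTFException)
        {
            s = s[0 .. $ - 1];     // drop one more byte and retry
        }
    }
    return s;
}

void main()
{
    string s = "na\u00EFve";       // U+00EF is two bytes, so s.length is 6
    writeln(shortenTo(s, 3));      // "na": a cut at 3 would split U+00EF
    writeln(shortenTo(s, 4));      // "na" followed by U+00EF
}
```

Since a UTF-8 sequence is at most four bytes, the loop retries only a few times, but each `validate` call rescans the whole string.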

November 03, 2009

Re: The length of strings vs. # of chars vs. sizeof

Posted by rmcguire
in reply to Charles Hixson

Charles Hixson <charleshixsn@earthlink.net> wrote:

> Jesse Phillips wrote:
>> On Sun, 01 Nov 2009 11:36:31 -0800, Charles Hixson wrote:
>>
>>> Does anyone just *know* the answer?  (And if so, could they make the
>>> documentation explicit?)
>>
>> I believe the documentation you are looking for is:
>>
>> http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
>>
>> It is more about understanding UTF than it is about learning strings.
> Thanks, that does appear to be the answer.
> 
> So if a string is too long, and I shorten it by one character, I'd better test it with std.utf.validate(str).  If it doesn't throw an error, it's ok.  Otherwise shorten it again and retry.
> 
> I hope I understood this correctly.  (I'm sure there's a more elegant way to do this, but here I'm going for a simple approach, as I should rarely be encountering this problem.)
> 
> 
As far as I know, if you want to shorten a UTF-8 string you just check the first bit of the last byte. If it's 0, that byte is a complete single-byte character and you can simply remove it. If it's 1, go back further until you find a byte that starts with 11, and then remove that byte too.

Every multi-byte character starts with a byte that starts with 11; the number of leading 1s in that first byte tells you how many bytes are in the character.

Hope that helps, but you should find a library that already has a "shorten my string" function.

-Rory

November 03, 2009

Re: The length of strings vs. # of chars vs. sizeof

Posted by Bill Baxter
in reply to rmcguire

On Tue, Nov 3, 2009 at 2:47 AM, rmcguire <rjmcguire@gmail.com> wrote:
> Charles Hixson <charleshixsn@earthlink.net> wrote:
>
>> Jesse Phillips wrote:
>>> On Sun, 01 Nov 2009 11:36:31 -0800, Charles Hixson wrote:
>>>
>>>> Does anyone just *know* the answer?  (And if so, could they make the
>>>> documentation explicit?)
>>>
>>> I believe the documentation you are looking for is:
>>>
>>> http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
>>>
>>> It is more about understanding UTF than it is about learning strings.
>> Thanks, that does appear to be the answer.
>>
>> So if a string is too long, and I shorten it by one character, I'd better test it with std.utf.validate(str).  If it doesn't throw an error, it's ok.  Otherwise shorten it again and retry.
>>
>> I hope I understood this correctly.  (I'm sure there's a more elegant way to do this, but here I'm going for a simple approach, as I should rarely be encountering this problem.)
>>
>>
> As far as I know, if you want to shorten a UTF-8 string you just check the first bit of the last byte. If it's 0, that byte is a complete single-byte character and you can simply remove it. If it's 1, go back further until you find a byte that starts with 11, and then remove that byte too.
>
> Every multi-byte character starts with a byte that starts with 11; the number of leading 1s in that first byte tells you how many bytes are in the character.
>
> Hope that helps, but you should find a library that already has a "shorten my string" function.

It's explained well in Andrei's book.
0* -- single byte character
11* -- first byte of multi-byte char
10* -- subsequent byte of multi-byte char
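
Those three patterns are enough to drop the last code point by scanning backward. A minimal sketch, assuming the input is valid UTF-8; `isContinuation` and `dropLastCodePoint` are names invented for this example:

```d
import std.stdio : writeln;

/// True for a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool isContinuation(char c)
{
    return (c & 0xC0) == 0x80;
}

/// Hypothetical helper: removes the last code point from valid UTF-8 `s`.
string dropLastCodePoint(string s)
{
    if (s.length == 0)
        return s;
    size_t i = s.length - 1;
    while (i > 0 && isContinuation(s[i]))
        --i;                   // step back over 10xxxxxx bytes
    return s[0 .. i];          // s[i] is 0xxxxxxx or 11xxxxxx: cut it too
}

void main()
{
    writeln(dropLastCodePoint("abc"));            // "ab"
    writeln(dropLastCodePoint("na\u00EF"));       // "na"
    writeln(dropLastCodePoint("\u20AC").length);  // 0: all three bytes removed
}
```

Phobos also has `std.utf.strideBack`, which as far as I know reports how many code units the preceding code point occupies, so it may be preferable to hand-rolled bit tests.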

--bb

Copyright © 1999-2021 by the D Language Foundation