Thread overview
size of a string in bytes
Jan 28, 2017
Nestor
Jan 28, 2017
rikki cattermole
Jan 28, 2017
Nestor
Jan 28, 2017
H. S. Teoh
Jan 28, 2017
rikki cattermole
Jan 28, 2017
Ivan Kazmenko
Jan 28, 2017
Nestor
Jan 28, 2017
Adam D. Ruppe
Jan 28, 2017
ag0aep6g
Jan 29, 2017
Nestor
January 28, 2017
Hi,

One can get the length of a string easily. However, since strings are UTF-8, some characters take more than one byte. I would like to know how many bytes a string takes, but this code didn't work as I expected:

import std.stdio;
void main() {
  string mystring1;
  string mystring2 = "A string of just 48 characters for testing size.";
  writeln(mystring1.sizeof);
  writeln( mystring2.sizeof);
}

In both cases the size is 8, so apparently sizeof is giving me just the default size of a string type and not the size of the variable in memory, which is what I want.

Ideas?
January 29, 2017
On 29/01/2017 3:51 AM, Nestor wrote:
> [...]

A few misconceptions going on here.
A string element is not a grapheme; it is a char (a UTF-8 code unit), which is always one byte.

So what you want is mystring.length

Now sizeof is not telling you about the elements; it's telling you how big the reference to them is, specifically length + pointer. It would have been 16 if you had compiled in 64-bit mode, for example.

If you want to know about graphemes and code points that is another story.
For that you'll want std.uni[0] and std.utf[1].

[0] http://dlang.org/phobos/std_uni.html
[1] http://dlang.org/phobos/std_utf.html
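
To illustrate the difference between .sizeof and .length described above, here is a minimal sketch. The variable names are made up for the example; the only assumption is that a slice is stored as a length plus a pointer, as stated above.

```d
import std.stdio : writeln;

void main()
{
    string emptyText;
    string text = "A string of just 48 characters for testing size.";

    // .sizeof is a compile-time property of the type: a slice is a
    // length plus a pointer, no matter what it currently refers to
    // (8 bytes on a 32-bit build, 16 bytes on a 64-bit build).
    static assert(string.sizeof == size_t.sizeof + (char*).sizeof);
    assert(emptyText.sizeof == text.sizeof);

    // .length is the run-time number of char elements, i.e. bytes.
    assert(emptyText.length == 0);
    assert(text.length == 48);

    writeln(text.sizeof, " ", text.length);
}
```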
January 28, 2017
On Saturday, 28 January 2017 at 14:56:03 UTC, rikki cattermole wrote:
> On 29/01/2017 3:51 AM, Nestor wrote:
>> Hi,
>>
>> One can get the length of a string easily, however since strings are
>> UTF-8, sometimes characters take more than one byte. I would like to
>> know then how many bytes does a string take, but this code didn't work
>> as I expected:
>>
>> import std.stdio;
>> void main() {
>>   string mystring1;
>>   string mystring2 = "A string of just 48 characters for testing size.";
>>   writeln(mystring1.sizeof);
>>   writeln( mystring2.sizeof);
>> }
>>
>> In both cases the size is 8, so apparently sizeof is giving me just the
>> default size of a string type and not the size of the variable in
>> memory, which is what I want.
>>
>> Ideas?
>
> A few misconceptions going on here.
> A string element is not a grapheme; it is a char (a UTF-8 code unit), which is always one byte.
>
> So what you want is mystring.length
>
> Now sizeof is not telling you about the elements; it's telling you how big the reference to them is, specifically length + pointer. It would have been 16 if you had compiled in 64-bit mode, for example.
>
> If you want to know about graphemes and code points that is another story.
> For that you'll want std.uni[0] and std.utf[1].
>
> [0] http://dlang.org/phobos/std_uni.html
> [1] http://dlang.org/phobos/std_utf.html

I do not want string length or code points. Perhaps I didn't explain myself.

I want to know the variable's size in memory. For example, say I have a UTF-8 string of only 2 characters, but each of them takes 2 bytes. The string length would be 2, but the content of the string would take 4 bytes in memory (excluding overhead for type size).

How can I get that?
January 28, 2017
On Sat, Jan 28, 2017 at 03:32:33PM +0000, Nestor via Digitalmars-d-learn wrote: [...]
> I do not want string length or code points. Perhaps I didn't explain myself.

The .length property of a string is the number of bytes used to store the string.


> I want to know the variable's size in memory. For example, say I have a UTF-8 string of only 2 characters, but each of them takes 2 bytes. The string length would be 2, but the content of the string would take 4 bytes in memory (excluding overhead for type size).

What you call "string length" is called grapheme count in D. What you want is the .length property.

The number of bytes in a UTF-8 string is the same thing as the number of code units (note: do not confuse with code points, which is something else).
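
As a rough sketch of that distinction, using the two-character example from the question: the string literal below is an assumption chosen for illustration (two copies of U+00E9, each of which takes two UTF-8 code units). Iterating as char visits code units; iterating as dchar makes the language decode code points.

```d
import std.stdio : writeln;

void main()
{
    // Two characters, each encoded as two UTF-8 code units.
    string s = "\u00E9\u00E9";

    size_t codeUnits, codePoints;
    foreach (char c; s)  ++codeUnits;   // no decoding: one step per byte
    foreach (dchar c; s) ++codePoints;  // decoding: one step per code point

    assert(codeUnits == 4);
    assert(codeUnits == s.length);      // .length counts code units (bytes)
    assert(codePoints == 2);

    writeln(codeUnits, " code units, ", codePoints, " code points");
}
```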


--T
January 29, 2017
On 29/01/2017 4:32 AM, Nestor wrote:
> On Saturday, 28 January 2017 at 14:56:03 UTC, rikki cattermole wrote:
>> [...]
>
> I do not want string length or code points. Perhaps I didn't explain
> myself.
>
> I want to know the variable's size in memory. For example, say I have a
> UTF-8 string of only 2 characters, but each of them takes 2 bytes. The
> string length would be 2, but the content of the string would take 4
> bytes in memory (excluding overhead for type size).
>
> How can I get that?

.length

You are misunderstanding: a char will always be exactly one byte in size.

Check[0] for proof.

Keep in mind here is the definition of string[1]:
alias immutable(char)[]  string;

There is nothing fancy going on.
What you were calling "characters" are actually graphemes as per the Unicode standard. A grapheme can span multiple code points, and therefore multiple bytes, but a char cannot.

[0] http://dlang.org/spec/type.html
[1] https://github.com/dlang/druntime/blob/master/src/object.d
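
The claims above can be checked at compile time. This is only a sketch: `std.string.representation` is used here as one way to view the same data as raw bytes, and the variable name is made up.

```d
import std.string : representation;

void main()
{
    // string is just an alias for a slice of immutable chars.
    static assert(is(string == immutable(char)[]));

    // The character types have fixed sizes.
    static assert(char.sizeof == 1);
    static assert(wchar.sizeof == 2);
    static assert(dchar.sizeof == 4);

    string s = "abc";

    // Viewing the string as raw bytes does not change its length.
    static assert(is(typeof(s.representation) == immutable(ubyte)[]));
    assert(s.representation.length == s.length);
    assert(s.length == 3);
}
```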
January 28, 2017
On Saturday, 28 January 2017 at 15:32:33 UTC, Nestor wrote:
> I want to know the variable's size in memory. For example, say I have a UTF-8 string of only 2 characters, but each of them takes 2 bytes. The string length would be 2, but the content of the string would take 4 bytes in memory (excluding overhead for type size).

As said, the byte count is indeed string.length.
The number of code points can be found by std.range.walkLength, but be aware it takes O(answer) time to compute.

Example:

-----
import std.range, std.stdio;
void main () {
	auto s = "Привет!";
	writeln (s.length); // 13 bytes
	writeln (s.walkLength); // 7 code points
}
-----

Ivan Kazmenko.

January 28, 2017
On Saturday, 28 January 2017 at 16:01:38 UTC, Ivan Kazmenko wrote:
> As said, the byte count is indeed string.length.
> The number of code points can be found by std.range.walkLength, but be aware it takes O(answer) time to compute.
>
> Example:
>
> -----
> import std.range, std.stdio;
> void main () {
> 	auto s = "Привет!";
> 	writeln (s.length); // 13 bytes
> 	writeln (s.walkLength); // 7 code points
> }

Thank you Ivan,

I believe I saw somewhere that in D a char was not necessarily the same as a ubyte because chars sometimes take more than one byte. So, since a string is an array of chars, I thought length behaved like walkLength (which I had not seen), in other words, that it simply returned the number of elements in the array.
January 28, 2017
On Saturday, 28 January 2017 at 18:04:58 UTC, Nestor wrote:
> I believe I saw somewhere that in D a char was not necessarily the same as a ubyte because chars sometimes take more than

Not true in the language, but the Phobos library does treat char and ubyte differently, because a single code point can span several chars.

But the built-in .length on a string and indexing all work the same as bytes.

Note that .length on a wstring or dstring (UTF-16 or UTF-32) is not a byte count but a count of code units. So wstring.length = number of wchars = number of 16-bit items, and for dstring the items are 32-bit. It is exactly the same as ushort[].length or int[].length: it is the number of elements, so if you actually want the byte length, you'd cast it first or something.
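
A small sketch of what that looks like in practice. `byteLength` is a hypothetical helper written for this example, not something from Phobos; the cast at the end is the "cast it first" route mentioned above.

```d
// Hypothetical helper (not part of Phobos): bytes occupied by the
// elements of any slice.
size_t byteLength(T)(T[] arr)
{
    return arr.length * T.sizeof;
}

void main()
{
    string  s = "Привет!";   // UTF-8
    wstring w = "Привет!"w;  // UTF-16
    dstring d = "Привет!"d;  // UTF-32

    // .length counts elements (code units), not bytes.
    assert(s.length == 13 && byteLength(s) == 13);
    assert(w.length == 7  && byteLength(w) == 14);
    assert(d.length == 7  && byteLength(d) == 28);

    // Casting a slice to a byte slice rescales the length accordingly.
    assert((cast(const(ubyte)[]) w).length == 14);
}
```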
January 28, 2017
On Saturday, 28 January 2017 at 18:04:58 UTC, Nestor wrote:
> I believe I saw somewhere that in D a char was not necessarily the same as a ubyte because chars sometimes take more than one byte,

In D, a `char` is a UTF-8 code unit. Its size is one byte, exactly and always.

A `char` is not a "character" in the common meaning of the word. There's a more specialized word for "character" as a visual unit: grapheme. For example, 'Ä' is a grapheme (a visual unit, a "character"), but there is no single `char` for it. To encode 'Ä' in UTF-8, a sequence of multiple code units is used.

> so since a string is an array of chars, I thought length behaved like walkLength (which I had not seen), in other words, that it simply returned the amount of elements in the array.

The elements of a `string` are (immutable) `char`s. That is, `string` is an array of UTF-8 code units. It's not an array of graphemes.

A `string`'s .length gives you the number of `char`s in it, i.e. the number of UTF-8 code units, i.e. the number of bytes.
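
To make the three levels concrete, here is a minimal sketch that counts code units, code points, and graphemes for 'Ä'. The two spellings of 'Ä' (precomposed U+00C4, and 'A' followed by the combining diaeresis U+0308) are chosen for illustration; `walkLength` and `byGrapheme` come from Phobos.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string precomposed = "\u00C4";   // 'Ä' as a single code point
    string combining   = "A\u0308";  // 'A' + combining diaeresis

    // Same visual character, different encodings.
    assert(precomposed.length == 2);                 // UTF-8 code units (bytes)
    assert(precomposed.walkLength == 1);             // code points
    assert(precomposed.byGrapheme.walkLength == 1);  // graphemes

    assert(combining.length == 3);
    assert(combining.walkLength == 2);
    assert(combining.byGrapheme.walkLength == 1);
}
```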
January 29, 2017
On Saturday, 28 January 2017 at 19:09:01 UTC, ag0aep6g wrote:
> In D, a `char` is a UTF-8 code unit. Its size is one byte, exactly and always.
>
> A `char` is not a "character" in the common meaning of the word. There's a more specialized word for "character" as a visual unit: grapheme. For example, 'Ä' is a grapheme (a visual unit, a "character"), but there is no single `char` for it. To encode 'Ä' in UTF-8, a sequence of multiple code units is used.
> 
> ...
> 
> The elements of a `string` are (immutable) `char`s. That is, `string` is an array of UTF-8 code units. It's not an array of graphemes.
>
> A `string`'s .length gives you the number of `char`s in it, i.e. the number of UTF-8 code units, i.e. the number of bytes.

Very good explanation.
Thank you all for making this clear to me.