fixedstring: a @safe, @nogc string type
January 10

hi all.

i got fed up with the built-in string type having so many features unavailable in @nogc code, so i made my own.

introducing fixedstring: a templated fixed-length array of chars, compatible with @safe, @nogc, and nothrow code.

licenced under the AGPL-3.0 or later, but i'm open to relicensing if someone really really wants it.

have fun =]

https://github.com/Moth-Tolias/fixedstring

special thanks to snarwin on the d discord for convincing me to post here.

January 10

On Monday, 10 January 2022 at 12:55:28 UTC, Moth wrote:

> hi all.

Good. Let's use betterC.

January 10

On Monday, 10 January 2022 at 12:55:28 UTC, Moth wrote:

> hi all.
>
> i got fed up with the built-in string type having so many features unavailable in @nogc code, so i made my own.
>
> introducing fixedstring: a templated fixed-length array of chars, compatible with @safe, @nogc, and nothrow code.
>
> licenced under the AGPL-3.0 or later, but i'm open to relicensing if someone really really wants it.
>
> have fun =]
>
> https://github.com/Moth-Tolias/fixedstring
>
> [snip]

You might add some examples to the Readme.md

January 10

On Monday, 10 January 2022 at 13:12:13 UTC, jmh530 wrote:

> You might add some examples to the Readme.md

good observation, i'll work on that. in the meantime the examples in the unittests should suffice.

January 10

On Monday, 10 January 2022 at 14:06:27 UTC, Moth wrote:

> On Monday, 10 January 2022 at 13:12:13 UTC, jmh530 wrote:
>
> > You might add some examples to the Readme.md
>
> good observation, i'll work on that. in the meantime the examples in the unittests should suffice.

fixed.

for those who don't want to visit the github just to see the change, here's the example code:

void main() @safe @nogc nothrow
{
	FixedString!14 foo = "clang";
	foo[0] = 'd';
	foo ~= " is cool";
	assert (foo == "dlang is cool");

	foo.length = 9;

	auto bar = FixedString!4("neat");
	assert (foo ~ bar == "dlang is neat");
}

January 11

On Monday, 10 January 2022 at 12:55:28 UTC, Moth wrote:

> have fun =]
>
> https://github.com/Moth-Tolias/fixedstring

I tried FixedString and, to my great relief, I got the results I expected. Thank you, and good luck with your work.

So how can this double-character issue be fixed?

// 'ş' takes two UTF-8 code units, so "şeker" needs 6 chars and "şe" spans 3
FixedString!6 sugar = "şeker"; // in Turkish
assert(sugar[0..3] == "şe");

FixedString!5 şeker = "sugar"; // in English
assert(şeker[0..2] == "su");

assert(sugar.length > şeker.length);

How about adding this member to FixedString?

public size_t usefulCapacity() const pure @nogc @safe
{
	return size - _length;
}

January 11

On Tuesday, 11 January 2022 at 03:20:22 UTC, Salih Dincer wrote:

> [snip]

glad to hear you're finding it useful! =]

hm, i'm not sure how i would go about fixing that double character issue. i know there's currently some weirdness with wchars / dchars equality that needs to be fixed [shouldn't be too much trouble, just need to set aside the time for it], but i think being able to tell how many chars there are in a glyph requires unicode awareness? i'll look into it.

what's your use case for usefulCapacity()?

January 11

On Tuesday, 11 January 2022 at 11:16:13 UTC, Moth wrote:

> On Tuesday, 11 January 2022 at 03:20:22 UTC, Salih Dincer wrote:
>
> > [snip]
>
> glad to hear you're finding it useful! =]
>
> ... i know there's currently some weirdness with wchars / dchars equality that needs to be fixed [shouldn't be too much trouble...

If you try mixing char/wchar/dchar, you need encoding/decoding for UTF-8, UTF-16 and UTF-32 (maybe even LE/BE). It becomes complicated very fast...

January 11

On Tuesday, 11 January 2022 at 11:16:13 UTC, Moth wrote:

> On Tuesday, 11 January 2022 at 03:20:22 UTC, Salih Dincer wrote:
>
> > [snip]
>
> glad to hear you're finding it useful! =]
>
> hm, i'm not sure how i would go about fixing that double character issue. i know there's currently some weirdness with wchars / dchars equality that needs to be fixed [shouldn't be too much trouble, just need to set aside the time for it], but i think being able to tell how many chars there are in a glyph requires unicode awareness? i'll look into it.
>
> [...]

you can relatively easily find out how many bytes a string takes up with std.utf. You can also iterate by code points or graphemes there if you want to translate some kind of character index to byte position.
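As a rough sketch of that suggestion, the snippet below uses `std.utf` to translate a code point index into a byte position before slicing a plain `string`. It says nothing about how FixedString itself would expose this.

```d
import std.utf : count, stride, toUTFindex;

void main()
{
    string sugar = "şeker"; // 'ş' (U+015F) takes two UTF-8 code units

    assert(sugar.length == 6);    // code units (bytes)
    assert(sugar.count == 5);     // code points
    assert(sugar.stride(0) == 2); // bytes used by the code point starting at index 0

    // Translate "the first two code points" into a byte offset, then slice.
    immutable end = sugar.toUTFindex(2);
    assert(end == 3);
    assert(sugar[0 .. end] == "şe");
}
```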

HOWEVER, it's not clear what a character is. For the cases posted here it's no problem, but in languages that combine glyphs to form new glyphs it's no longer clear what counts as a character. Graphemes (grapheme clusters) are probably the closest to what everybody would think a character is, but IIRC they have edge cases a programmer wouldn't expect, like appending a character not increasing the string's character count because it merges with the last grapheme. Additionally, graphemes carry a performance cost compared with simpler units like code points, which fit 98% of string use cases. In D a code point maps 1:1 to a dchar and takes up to 2 wchars or up to 4 chars. You can use std.utf to compute the byte length of a code point in a given string.

I would rather suggest you support FixedString with element types other than char (wchar, dchar; heck, users could even use any arbitrary type and use this as an array type). For languages that commonly use more than 1 byte per code point, or for interop with Win32 Unicode APIs, JavaScript strings, C# strings, UTF-16 files in general, etc., programmers might then opt to use FixedString with wchar.

With D's templates that should be quite easy to do: add a template parameter to the struct, like `struct FixedString(size_t maxSize, CharT = char)`, and replace all uses of char in your code with CharT.
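A minimal sketch of what that extra template parameter could look like. The struct below is a hypothetical stand-in that borrows the suggested signature; its fields and methods are assumptions for illustration, not the library's actual code.

```d
// Hypothetical stand-in, not the real fixedstring implementation.
struct FixedString(size_t maxSize, CharT = char)
{
    private CharT[maxSize] data;
    private size_t _length;

    this(const(CharT)[] s) @safe @nogc nothrow pure
    {
        assert(s.length <= maxSize, "input does not fit");
        data[0 .. s.length] = s[];
        _length = s.length;
    }

    size_t length() const @safe @nogc nothrow pure { return _length; }

    CharT opIndex(size_t i) const @safe @nogc nothrow pure
    {
        assert(i < _length, "index out of range");
        return data[i];
    }
}

void main() @safe @nogc nothrow
{
    auto narrow = FixedString!6("şeker");           // char by default: 6 code units
    auto wide = FixedString!(5, wchar)("şeker"w);   // wchar: 5 code units

    assert(narrow.length == 6);
    assert(wide.length == 5);
    assert(wide[0] == 'ş'); // one wchar holds the whole code point
    static assert(is(typeof(wide[0]) == wchar));
}
```

With wchar elements, indexing and length line up with what the Turkish example expected, at the cost of two bytes per element.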

January 11
On Tue, Jan 11, 2022 at 11:16:13AM +0000, Moth via Digitalmars-d-announce wrote:
> On Tuesday, 11 January 2022 at 03:20:22 UTC, Salih Dincer wrote:
> > [snip]
> 
> glad to hear you're finding it useful! =]

One minor usability issue I found just glancing over the code: many of your methods take char[] as argument. Generally, you want const(char)[] instead, so that it will work with both char[] and immutable(char)[]. No reason why you can't copy some immutable chars into a FixedString, for example.
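To illustrate the difference, here is a small sketch with two hypothetical helper functions (`takesMutable` and `takesConst` are made up for this example) showing which argument types each parameter type accepts:

```d
// Hypothetical helpers, used only to compare parameter types.
size_t takesMutable(char[] s) @safe @nogc nothrow pure { return s.length; }
size_t takesConst(const(char)[] s) @safe @nogc nothrow pure { return s.length; }

void main() @safe nothrow
{
    char[] mutableChars = ['a', 'b', 'c'];
    string immutableChars = "abc"; // string is immutable(char)[]

    // const(char)[] accepts both mutable and immutable data.
    assert(takesConst(mutableChars) == 3);
    assert(takesConst(immutableChars) == 3);

    // char[] accepts only mutable data.
    assert(takesMutable(mutableChars) == 3);
    static assert(!__traits(compiles, takesMutable(immutableChars)));
}
```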

Another potential issue is with the range interface. Your .popFront is implemented by copying the entire buffer 1 char forwards, which can easily become a hidden performance bottleneck. Iteration over a FixedString currently is O(N^2), which is a problem if performance is your concern.

Generally, I'd advise not conflating your containers with ranges over your containers: I'd make .opSlice return a traditional D slice (i.e., const(char)[]) instead of a FixedString, and just require writing `[]` when you need to iterate over the string as a range:

	FixedString!64 mystr;
	foreach (ch; mystr[]) { // <-- iterates over const(char)[]
		...
	}

This way, no redundant copying of data is done during iteration.
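A sketch of that separation, using a hypothetical `Buffer` type rather than the library's own code: the container only hands out a slice, and advancing the slice never touches the stored chars.

```d
// Hypothetical container used only to illustrate the point.
struct Buffer(size_t capacity)
{
    private char[capacity] data;
    private size_t len;

    this(const(char)[] s) @safe @nogc nothrow pure
    {
        assert(s.length <= capacity, "input does not fit");
        data[0 .. s.length] = s[];
        len = s.length;
    }

    // The container hands out a view of its contents; it is not itself a range.
    const(char)[] opSlice() const return @safe @nogc nothrow pure
    {
        return data[0 .. len];
    }
}

void main() @safe @nogc nothrow
{
    auto buf = Buffer!16("dlang");

    size_t visited;
    foreach (char ch; buf[]) // iterates code units of the slice, no copying
        ++visited;
    assert(visited == 5);

    // Advancing the view only re-slices it; the stored chars are not moved.
    auto view = buf[];
    view = view[1 .. $];
    assert(view == "lang");
    assert(buf[] == "dlang");
}
```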

Another issue is the way concatenation is implemented. Since FixedStrings have compile-time size, this potentially means every time you concatenate a string in your code you get another instantiation of FixedString. This can lead to a LOT of template bloat if you're not careful, which may quickly outweigh any benefits you may have gained from not using the built-in strings.
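The effect can be sketched with a hypothetical `Fixed` type and `concat` function (both invented for this example, on the assumption that concatenation returns a type sized to hold both operands): every distinct size that shows up becomes its own instantiation.

```d
// Hypothetical stand-in for a fixed-capacity string; only the size matters here.
struct Fixed(size_t n)
{
    char[n] data = ' ';
}

// The result must be big enough for both operands, so each distinct
// (n, m) pair instantiates this function and a new Fixed!(n + m).
Fixed!(n + m) concat(size_t n, size_t m)(Fixed!n a, Fixed!m b)
{
    Fixed!(n + m) result;
    result.data[0 .. n] = a.data[];
    result.data[n .. $] = b.data[];
    return result;
}

void main()
{
    Fixed!2 a;
    Fixed!3 b;

    auto c = concat(a, b);
    auto d = concat(c, a);
    auto e = concat(d, b);

    // Three concatenations, three additional struct types.
    static assert(is(typeof(c) == Fixed!5));
    static assert(is(typeof(d) == Fixed!7));
    static assert(is(typeof(e) == Fixed!10));
    static assert(!is(typeof(c) == typeof(d)));
}
```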


> hm, i'm not sure how i would go about fixing that double character issue. i know there's currently some weirdness with wchars / dchars equality that needs to be fixed [shouldn't be too much trouble, just need to set aside the time for it], but i think being able to tell how many chars there are in a glyph requires unicode awareness? i'll look into it.
[...]

Yes, you will require Unicode-awareness, and no, it will NOT be as simple as you imagine.

First of all, you have the wide-character issue: if you're dealing with anything outside of the ASCII range, you will need to deal with code points (potentially wchar, dchar).  You can either take the lazy way out (FixedString!(n, wchar), FixedString!(n, dchar)), but that will exacerbate your template bloat very quickly. Plus, it wastes a lot of memory, esp. if you start using dchar[] -- 4 bytes per character potentially makes ASCII strings use up 4x more memory. (And even if you decide using dchar[] isn't a concern, there's still the issue of graphemes -- see below, which requires non-trivial decoding anyway.)

Or you can handle UTF-8, which is a better solution in terms of memory usage. But then you will immediately run into the encoding/decoding problem. Your .opSlice, for example, will not work correctly unless you auto-decode. But that will be a performance hit -- this is one of the design mistakes in hindsight that's still plaguing Phobos today. IMO the better approach is to iterate over the string *without* decoding, but just detecting codepoint boundaries.  Regardless, you will need *some* way of iterating over code points instead of code units in order to deal with this properly.
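One way to detect code point boundaries without decoding is to look for UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx. The helper names below are made up for this sketch, and it assumes the input is valid UTF-8.

```d
/// True if `c` starts a code point, i.e. it is not a continuation byte (10xxxxxx).
bool startsCodePoint(char c) @safe @nogc nothrow pure
{
    return (c & 0xC0) != 0x80;
}

/// Counts code points by counting lead bytes; assumes valid UTF-8.
size_t countCodePoints(const(char)[] s) @safe @nogc nothrow pure
{
    size_t n;
    foreach (char c; s)
        if (startsCodePoint(c))
            ++n;
    return n;
}

/// Byte offset of the n-th code point, or s.length if there are fewer than n + 1.
size_t byteOffset(const(char)[] s, size_t n) @safe @nogc nothrow pure
{
    size_t seen;
    foreach (i, char c; s)
    {
        if (startsCodePoint(c))
        {
            if (seen == n)
                return i;
            ++seen;
        }
    }
    return s.length;
}

void main() @safe @nogc nothrow
{
    const(char)[] sugar = "şeker";
    assert(sugar.length == 6);
    assert(countCodePoints(sugar) == 5);

    // Slice off the first two code points without decoding anything.
    assert(sugar[0 .. byteOffset(sugar, 2)] == "şe");
}
```

Because nothing is decoded, these helpers stay usable in @nogc nothrow code.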

But that's only the beginning of the story. In Unicode, a "code point" is NOT what most people imagine a "character" is. For most European languages the two usually coincide, but once you go outside of that, you'll start finding things like accented characters that are composed of multiple code points.  In Unicode, that's called a Grapheme, and here's the bad news: the length of a Grapheme is technically unbounded (even though in practice it's usually 2 or occasionally 3 -- but you *will* find more on rare occasions). And worst of all, determining the length of a grapheme requires an expensive, non-trivial algorithm that will KILL your performance if you blindly do it every time you traverse your string.

And generally, you don't *want* to do grapheme segmentation anyway -- most code doesn't even care what the graphemes are, it just wants to treat strings as opaque data that you may occasionally want to segment into substrings (and substrings don't necessarily require grapheme segmentation to compute, depending on what the final goal is). But occasionally you *will* need grapheme segmentation (e.g., if you need to know how many visual "characters" there are in a string); for that, you will need std.uni. And no, it's not something you can implement overnight.  It requires some heavy-duty lookup tables and a (very careful!) implementation of TR29 (Unicode Text Segmentation).
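For the occasional case where the visual character count matters, a small sketch with `std.uni.byGrapheme` shows how appending a code point can leave the grapheme count unchanged:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string plain = "e";
    string accented = "e\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    assert(plain.byGrapheme.walkLength == 1);
    // One more code point was appended, yet there is still one grapheme.
    assert(accented.byGrapheme.walkLength == 1);

    assert(accented.walkLength == 2); // code points (decoded on the fly)
    assert(accented.length == 3);     // UTF-8 code units
}
```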

Because of the foregoing, you have at least 4 different definitions of the length of the string:

1. The number of code units it occupies, i.e., the number of chars / wchars / dchars.

2. The number of code points it contains, which, in UTF-8, is a non-trivial quantity that requires iterating over the entire string to compute. Or you can just use wchar[] or dchar[], but then your memory footprint will increase, potentially up to 4x.

3. The number of graphemes it contains, i.e., how many "visual characters" (the way most people understand the word "character") it contains. This requires grapheme segmentation, is expensive to compute, and generally shouldn't be done unless you have some concrete reason why you want to do this.

4. The rendered width of the string, i.e., how much space it occupies if displayed on the screen. Even on a monospace-font text terminal, this is a non-trivial quantity because some Unicode codepoints are double-width (e.g., East Asian block), and some are *zero*-width (e.g., shy hyphens, zero-width breaking spaces). And it depends on how your terminal emulator renders these characters (what Unicode defines as a double-width may not necessarily be rendered that way).  And of course, on a GUI application measuring the length of a string requires font details.
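The first two definitions, and the memory trade-off between encodings, can be seen with plain built-in strings. This is only a sketch of the arithmetic; it does not involve FixedString.

```d
import std.conv : to;

void main()
{
    string s8 = "hello";
    wstring s16 = s8.to!wstring;
    dstring s32 = s8.to!dstring;

    // ASCII text has the same number of code units in every encoding...
    assert(s8.length == 5 && s16.length == 5 && s32.length == 5);

    // ...but the memory footprint grows with the code unit size.
    assert(s8.length * char.sizeof == 5);
    assert(s16.length * wchar.sizeof == 10);
    assert(s32.length * dchar.sizeof == 20); // 4x the UTF-8 size

    // Outside ASCII, the code unit count itself depends on the encoding.
    string t8 = "şeker";
    assert(t8.length == 6);             // UTF-8 code units
    assert(t8.to!wstring.length == 5);  // UTF-16 code units
    assert(t8.to!dstring.length == 5);  // UTF-32 code units, one per code point
}
```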

Welcome to the *cough* wonderful world of Unicode, where everything is possible but nothing is simple. :-D


T

-- 
This sentence is false.