Thread overview
How to pass an InputRange of dchars to a function that wants an Input range of chars?
4 days ago
realhet
4 days ago
bauss
4 days ago
realhet
4 days ago
monkyyy
3 days ago
realhet
4 days ago

Hello,

The problematic line is at the very bottom.

I only managed to make it run by preceding the .byChar with .text, but that is unwanted because it would convert the whole InputRange.

How can I do this dchar->char conversion on only the required number of chars? (The kwSearch function will pop only a few chars; it doesn't want a whole RandomAccessRange.)

import std;

struct KeywordSearchTree
{
	char ch/+root is always 0xff, it is there to hold subNodes.+/;
	int code;
	KeywordSearchTree[] subNodes;
	
	void addSubNode(string s, int scode)
	{
		if(s.length==0) return;
		
		auto idx = subNodes.map!"a.ch".countUntil(s[0]);
		if(idx<0) { idx = subNodes.length; subNodes ~= KeywordSearchTree(s[0]); }
		
		if(s.length==1)	subNodes[idx].code = scode;
		else	subNodes[idx].addSubNode(s[1..$], scode);
	}
	
	static KeywordSearchTree build(string[] keywords)
	{
		KeywordSearchTree root;
		foreach(i, act; keywords)
			root.addSubNode(act, (cast(int)(i+1)));
		return root;
	}
}

int kwSearch(KeywordSearchTree tree, R)(R s)
if(isInputRange!(R, immutable char))
{
	//pragma(msg, __FUNCTION__);
	
	if(s.empty) return 0;
	switch(s.front)
	{
		static foreach(sn; tree.subNodes)
		{
			case sn.ch: {
				s.popFront;
				if(!s.empty) if(const res = kwSearch!sn(s)) return res;
				return sn.code;
			}
		}
		default: return 0;
	}
}

int kwSearch(string[] keywords, R)(R s)
if(isInputRange!(R, immutable char))
{
	enum tree = KeywordSearchTree.build(keywords);
	return kwSearch!tree(s);
}

string[] keywords;

void main()
{
	enum ctKeywords = [
		"!", //Link: shebang https://dlang.org/spec/lex.html#source_text
		"version", "extension", "line", 	//Link: GLSL directives
		"pragma", "warning", "error", "assert", 	//Link: Opencl directives
		"include", "define", "undef", "ifdef", "ifndef", "if", "else", "elif", "endif",  	//Link: Arduino directives
	];
	keywords = ctKeywords;
	
	foreach(kw; keywords)
	{ (kw.dtext~" garbage"d).byChar.kwSearch!ctKeywords.writeln; }
}
4 days ago

On Wednesday, 19 February 2025 at 18:13:24 UTC, realhet wrote:
> ...

std.conv.to can convert for you.

4 days ago

On Wednesday, 19 February 2025 at 19:08:07 UTC, bauss wrote:
> On Wednesday, 19 February 2025 at 18:13:24 UTC, realhet wrote:
>> ...
>
> std.conv.to can convert for you.

Thx!

I tried .map!(to!dchar) instead of .byChar and it still failed.

But then I deleted the constraints where I detect if the parameter is an input range of a char, and it suddenly worked.

So this

if(isInputRange!(R, immutable char))

won't let through obvious ranges that are input ranges and contain chars...

Anyone have a clue why?

What constraint should I use then?

4 days ago
On Wednesday, February 19, 2025 11:13:24 AM MST realhet via Digitalmars-d-learn wrote:
> Hello,
>
> The problematic line is at the very bottom.
>
> I only managed to make it run by preceding the .byChar with .text, but that is unwanted because it would convert the whole InputRange.
>
> How can I do this dchar->char conversion on only the required number of chars? (The kwSearch function will pop only a few chars; it doesn't want a whole RandomAccessRange.)

I don't have time at the moment to decipher your code and figure out what you're doing, but at a glance, it looks like you're expecting strings to be treated as ranges of immutable char, and they're not.

Phobos treats all strings as ranges of dchar. We call it auto-decoding, because it means that the range API is automatically decoding UTF-8 and UTF-16 code units to UTF-32. It was an attempt to make Unicode handling correct by default (by making it harder to accidentally split code points), but it doesn't actually succeed at providing full Unicode-correctness (since graphemes and normalization are a thing), and dealing with auto-decoding can be pretty annoying. So, we'd like to get rid of it in the next major version of Phobos and just treat all arrays as arrays of their actual element types, but for now, we have to deal with the range API treating all arrays of char and wchar as bidirectional ranges of dchar.
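To make the auto-decoding described above concrete, here is a minimal sketch. The helper names codeUnits and codePoints are made up for illustration; the only Phobos pieces used are ElementType and walkLength from std.range.primitives.

```d
import std.range.primitives : ElementType, walkLength;
import std.stdio : writeln;

// As a range, a string yields dchar even though it stores char.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!wstring == dchar));

// Hypothetical helpers, named only for this illustration.
size_t codeUnits(string s)  { return s.length; }      // array view: UTF-8 code units
size_t codePoints(string s) { return s.walkLength; }  // range view: decoded code points

void main()
{
    // U+00E9 takes two UTF-8 code units but is a single code point.
    writeln(codeUnits("\u00E9"), " ", codePoints("\u00E9"));
}
```

The array property .length and the range-based walkLength disagree for any non-ASCII text, which is the mismatch the rest of this reply works around.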

So, if you're looking to do anything with ranges of char, you can't use strings directly. byChar is one way to wrap a string to get a range of char. byCodeUnit would be another so long as it's an array of char specifically (rather than an array of wchar or dchar - for those, byCodeUnit would give you a range of wchar and dchar respectively). Neither of them actually converts the underlying range. Rather, they wrap it and lazily convert the elements as you access them.
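A small sketch of the lazy wrapping described above, assuming a hypothetical helper hasPrefix (not part of the original code). It only pulls as many code units out of the dstring as the prefix needs; nothing else is converted.

```d
import std.algorithm.comparison : equal;
import std.range : take;
import std.range.primitives : ElementType;
import std.traits : Unqual;
import std.utf : byChar, byCodeUnit;

// byChar re-encodes a dstring to UTF-8 code units on demand.
static assert(is(Unqual!(ElementType!(typeof("x"d.byChar))) == char));
// byCodeUnit exposes the array's own code units without decoding.
static assert(is(Unqual!(ElementType!(typeof("x".byCodeUnit))) == char));
static assert(is(Unqual!(ElementType!(typeof("x"d.byCodeUnit))) == dchar));

// Hypothetical helper: compares only the first prefix.length code units.
// The remainder of text is never encoded.
bool hasPrefix(dstring text, string prefix)
{
    return text.byChar.take(prefix.length).equal(prefix.byCodeUnit);
}

void main()
{
    assert(hasPrefix("define garbage"d, "define"));
}
```

Unqual is used in the checks because the wrappers may differ in whether the element type carries an immutable qualifier.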

So you should probably either just make your code operate on ranges of dchar, or you'll need to wrap your ranges using byChar or byCodeUnit in order to get ranges of char. All range-based functions will treat your strings as ranges of dchar. So, if you really need to have strings and be treating them as ranges of char without wrapping them and without potentially creating new strings from wrapped ranges, then you can't use any range-based functions to do what you're doing.

If you're using byCodeUnit (or you use byChar on an array of char, which in turn uses byCodeUnit), then you can use the source member on the result to get the underlying string back at whatever point it is in the iteration, but in general, if you pass a string wrapped by byCodeUnit to any range-based function that returns its own range type, then you can't convert back to a string without using something like std.conv.to to allocate a new string.
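A minimal sketch of getting the string back through the source member mentioned above. The helper dropUnits is hypothetical and exists only to show the round trip.

```d
import std.range : popFrontN;
import std.utf : byCodeUnit;

// Hypothetical helper: skip n code units, then hand back the rest of the
// original string through the wrapper's source member. Nothing is allocated.
string dropUnits(string s, size_t n)
{
    auto r = s.byCodeUnit;
    r.popFrontN(n);
    return r.source;
}

void main()
{
    assert(dropUnits("hello world", 6) == "world");
}
```

This only works while the value is still the byCodeUnit wrapper itself; once it has been passed through a function that returns its own range type, there is no source to ask for.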

Range-based functions which are eager rather than lazy (e.g. find) will return the original range, but a large percentage of range-based functions are lazy and will return wrapped ranges. So, depending on what you're doing, it's going to be difficult to do a bunch of range-based operations and then get a string at the end of the result without allocating a new string - even if strings were treated as ranges of their actual element type.
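The eager/lazy difference can be seen in the result types. The sketch below uses a hypothetical helper upperAscii; it builds the result with std.array.appender just to make the allocation visible, which is one option among several (the reply above mentions std.conv.to).

```d
import std.algorithm.iteration : map;
import std.algorithm.searching : find;
import std.array : appender;
import std.ascii : toUpper;
import std.utf : byCodeUnit;

// find is eager and returns the range type it was given: still a string.
static assert(is(typeof("hello world".find('w')) == string));

// map is lazy and returns its own wrapper type, not a string.
static assert(!is(typeof("abc".byCodeUnit.map!(u => toUpper(u))) == string));

// Hypothetical helper: turning the lazy result back into a string
// requires building a new one.
string upperAscii(string s)
{
    auto app = appender!string();
    foreach (c; s.byCodeUnit.map!(u => toUpper(u)))
        app.put(c);
    return app.data;
}

void main()
{
    assert("hello world".find('w') == "world");
    assert(upperAscii("abc") == "ABC");
}
```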

- Jonathan M Davis



4 days ago

On Wednesday, 19 February 2025 at 20:46:24 UTC, realhet wrote:
> On Wednesday, 19 February 2025 at 19:08:07 UTC, bauss wrote:
>> On Wednesday, 19 February 2025 at 18:13:24 UTC, realhet wrote:
>>> ...
>>
>> std.conv.to can convert for you.
>
> Thx!
>
> I tried .map!(to!dchar) instead of .byChar and it still failed.
>
> But then I deleted the constraints where I detect if the parameter is an input range of a char, and it suddenly worked.
>
> So this
>
> if(isInputRange!(R, immutable char))
>
> won't let through obvious ranges that are input ranges and contain chars...
>
> Anyone have a clue why?
>
> What constraint should I use then?

import std;
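// Reinterpret the string's bytes and map them back to char, so the range
// yields char elements instead of auto-decoded dchar.
auto fixstring(string s) => (cast(immutable(ubyte)[]) s).map!(a => cast(char) a);

void main()
{
	"hello world".fixstring.map!(a => std.ascii.toUpper(a)).each!write;
}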

4 days ago
On Wednesday, February 19, 2025 7:48:48 PM MST Jonathan M Davis via Digitalmars-d-learn wrote:
> So you should probably either just make your code operate on ranges of dchar, or you'll need to wrap your ranges using byChar or byCodeUnit in order to get ranges of char. All range-based functions will treat your strings as ranges of dchar. So, if you really need to have strings and be treating them as ranges of char without wrapping them and without potentially creating new strings from wrapped ranges, then you can't use any range-based functions to do what you're doing.

To add to this, another option is to use std.string.representation to cast a string to immutable(ubyte)[] - or an array of whatever the corresponding integer type is if you're not dealing with immutable(char)[] - and operate on that instead. But you can't use string as-is and have a range of char.
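A short sketch of the representation approach, with a hypothetical helper hasBytePrefix. Because the operands are arrays of ubyte rather than char, the range functions compare them element by element with no decoding involved.

```d
import std.algorithm.searching : startsWith;
import std.string : representation;

// representation reinterprets the same memory as the matching integer type.
static assert(is(typeof("abc".representation) == immutable(ubyte)[]));
static assert(is(typeof("abc"w.representation) == immutable(ushort)[]));

// Hypothetical helper: byte-wise prefix test that never auto-decodes.
bool hasBytePrefix(string text, string prefix)
{
    return text.representation.startsWith(prefix.representation);
}

void main()
{
    assert(hasBytePrefix("#define X", "#define"));
}
```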

- Jonathan M Davis



3 days ago
On Thursday, 20 February 2025 at 03:54:28 UTC, Jonathan M Davis wrote:
> On Wednesday, February 19, 2025 7:48:48 PM MST Jonathan M Davis via Digitalmars-d-learn wrote:
>> So you should probably either just make your code operate on ranges of dchar

Thank You for the deep explanation!

This thing checks whether a range of characters starts with one of a small set of keywords and returns the index+1 of the matching keyword.
It does not fetch a whole identifier; it looks up every character in a tree. The tree is generated as compile-time template functions: every branch is translated to a switch/case statement, and then comes the hardcore optimization by LDC.

Yesterday, at one point I already decided to transition my code from char to dchar.

My actual use case also presents a complicated range of dchars.

And I just don't constrain the range type with isInputRange!(R, immutable dchar). The constraint generates a nasty error message, and if popFront and empty are not supported, the compiler will report that anyway.
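For readers who would still like a constraint: the thread never pins down why the original one rejected the byChar range. One possibility (an assumption, not confirmed above) is that the check is sensitive to the immutable qualifier on the element type. The sketch below sidesteps that by comparing unqualified element types. Both isCharInputRange and countLeadingA are hypothetical names, the latter standing in for kwSearch.

```d
import std.range.primitives : ElementType, isInputRange, empty, front, popFront;
import std.traits : Unqual;
import std.utf : byChar, byCodeUnit;

// Hypothetical trait: any input range whose element type is char,
// ignoring const/immutable qualifiers.
enum isCharInputRange(R) = isInputRange!R && is(Unqual!(ElementType!R) == char);

// Hypothetical stand-in for kwSearch: counts leading 'a' code units.
size_t countLeadingA(R)(R r)
if (isCharInputRange!R)
{
    size_t n;
    while (!r.empty && r.front == 'a')
    {
        ++n;
        r.popFront();
    }
    return n;
}

static assert( isCharInputRange!(typeof("aab"d.byChar)));
static assert( isCharInputRange!(typeof("aab".byCodeUnit)));
static assert(!isCharInputRange!string);   // auto-decoded to dchar
static assert(!isCharInputRange!dstring);

void main()
{
    assert(countLeadingA("aab"d.byChar) == 2);
}
```

Swapping char for dchar in the trait gives the equivalent check for the dchar version of the code.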


On the performance side of 8bit vs 32bit:
It now has 4x more data to work with. (Mostly because the "cmp ax, im8" operation was replaced by "cmp eax, im32". So the earlier I can convert down to 8 bit, the better.)
The keywords I am looking for at the beginning of the 'string' are only 8bit ASCII characters, "#define", "#ifdef", and so on. Because of the dchar, it must ensure that the high 24 bits are zero as well.
But this thing only runs for 1 ms (8bit version 0.5 ms), so it will stay in the dchar form, which, as I have now learned, is the preferred usage at the moment.

I have better understanding now, Thx!