August 02, 2017
On 8/2/17 1:16 PM, kdevel wrote:
> On Wednesday, 2 August 2017 at 15:52:13 UTC, Steven Schveighoffer wrote:
> 
>> If we use the correct code unit sequence (0xc3 0x9c), then [...]
> 
> If I avoid std.string.stripLeft and use std.algorithm.stripLeft(' ') instead it works as expected:

What is expected? What I see on the screen when I run my code is:

[Ü]

What I see when I run your "working" code is:

[?]

You are missing the point that your input string is invalid.

std.algorithm is not validating the entire string, and so it doesn't throw an error like string.stripLeft does. writeln doesn't do any decoding of individual strings. It avoids the problem and just copies your bad data directly.
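To illustrate (this is just a small reconstruction of the difference, not your actual code):

import std.algorithm.mutation : stripChar = stripLeft;
import std.stdio : writeln;
import std.string : stripWS = stripLeft;
import std.utf : byCodeUnit, UTFException;

void main()
{
    // deliberately invalid UTF-8: a bare 0xFC byte (Latin-1 'ü') after a space
    char[] buf = [' ', cast(char) 0xFC];
    string bad = cast(string) buf;

    // std.string.stripLeft decodes (it tests std.uni.isWhite), so it rejects the input
    try
    {
        writeln("[", bad.stripWS, "]");
    }
    catch (UTFException e)
    {
        writeln("std.string.stripLeft threw: ", e.msg);
    }

    // byCodeUnit + std.algorithm.stripLeft compares raw code units against ' ';
    // nothing gets decoded, and writeln copies the bad byte straight through
    writeln("[", bad.byCodeUnit.stripChar(' '), "]");
}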

If you fix the input, both will work correctly.

-Steve
August 02, 2017
On Wednesday, 2 August 2017 at 17:37:09 UTC, Steven Schveighoffer wrote:

> What is expected? What I see on the screen when I run my code is:
>
> [Ü]

Upper case?

> What I see when I run your "working" code is:
>
> [?]

Your terminal is incapable of rendering the Latin-1 encoding. The program prints one byte of value 0xfc. You may pipe the output into hexdump -C:

00000000  5b fc 5d 0a                                       |[ü].|
00000004

> You are missing the point that your input string is invalid.

It's perfectly okay to put any value an octet can take into an octet. I did not claim that the data in the string memory is syntactically valid UTF-8. Read the comment in line 9 of my post of 15:02:22.

> std.algorithm is not validating the entire string,

True, and it should not. So this is what I want.

> and so it doesn't throw an error like string.stripLeft does.

That is the point. You wrote

| I wouldn't expect good performance from this, as there is auto-decoding all
| over the place.

I erroneously thought that using byCodeUnit disables the whole UTF-8 processing and enforces operation on (u)bytes. But this is not the case, at least not for stripLeft, and probably not for other string functions either.

> writeln doesn't do any decoding of individual strings. It avoids the problem and just copies your bad data directly.

That is what I expected.


August 02, 2017
On 08/02/2017 08:28 PM, kdevel wrote:
> It's perfectly okay to put any value an octet can take into an octet. I did not claim that the data in the string memory is syntactically valid UTF-8. Read the comment in line 9 of my post of 15:02:22.

You're claiming that the data is in UTF-8 when you use `string` as the type. For arbitrary octets, use something like `ubyte[]`.
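A rough sketch of that (the Latin-1 bytes here are just made up for illustration):

import std.stdio : writeln;

void main()
{
    // arbitrary octets (here: Latin-1 encoded data) belong in ubyte[], not string
    ubyte[] latin1 = [0x5B, 0xFC, 0x5D];   // "[ü]" in Latin-1

    // Latin-1 code points map 1:1 to the first 256 Unicode code points,
    // so building a valid UTF-8 string is a simple per-byte widening
    string utf8;
    foreach (b; latin1)
        utf8 ~= cast(dchar) b;   // appending a dchar to a string encodes it as UTF-8

    writeln(utf8);   // prints [ü] on a UTF-8 terminal
}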
August 02, 2017
On 8/2/17 2:28 PM, kdevel wrote:
> On Wednesday, 2 August 2017 at 17:37:09 UTC, Steven Schveighoffer wrote:
> 
>> What is expected? What I see on the screen when I run my code is:
>>
>> [Ü]
> 
> Upper case?

Sorry, it should be c3 bc, not c3 9c. I misread the table on that Wikipedia entry.

>> What I see when I run your "working" code is:
>>
>> [?]
> 
> Your terminal is incapable of rendering the Latin-1 encoding. The program prints one byte of value 0xfc. You may pipe the output into hexdump -C:
> 
> 00000000  5b fc 5d 0a                                       |[ü].|
> 00000004

Right, I saw that. But it's still not valid utf8, which is what char and string are.

>> You are missing the point that your input string is invalid.
> 
> It's perfectly okay to put any value an octet can take into an octet. I did not claim that the data in the string memory is syntactically valid UTF-8. Read the comment in line 9 of my post of 15:02:22.

Except a string is utf8, period. char is a utf8 code unit, period.

If you want some other encoding, it has to be defined as a different type. Otherwise, you will get errors when using any D library, all of which should expect char to be a utf8 code-unit.
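If I remember std.encoding correctly, it already provides distinct types and a transcode function for exactly this; roughly (untested sketch):

import std.encoding : Latin1String, transcode;
import std.stdio : writeln;

void main()
{
    // a distinct array type for Latin-1 data, so it can't be mixed up with a (UTF-8) string
    immutable(ubyte)[] raw = [0x5B, 0xFC, 0x5D];   // "[ü]" in Latin-1
    auto latin1 = cast(Latin1String) raw;

    string utf8;
    transcode(latin1, utf8);   // re-encode into valid UTF-8 before using string functions
    writeln(utf8);
}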

>> std.algorithm is not validating the entire string,
> 
> True and it should not. So this is what I want.

But it's not the same as the original. For instance, the original would strip tabs; yours does not.
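Something like this shows it (made-up input, of course):

import std.algorithm.mutation : stripChar = stripLeft;
import std.stdio : writeln;
import std.string : stripWS = stripLeft;
import std.utf : byCodeUnit;

void main()
{
    auto s = "  \thello";                            // leading spaces, then a tab
    writeln("[", s.stripWS, "]");                    // [hello] -- the tab is whitespace too
    writeln("[", s.byCodeUnit.stripChar(' '), "]");  // the spaces go, but the tab stays
}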

>> and so it doesn't throw an error like string.stripLeft does.
> 
> That is the point. You wrote
> 
> | I wouldn't expect good performance from this, as there is auto-decoding all
> | over the place.
> 
> I erroneously thought that using byCodeUnit disables the whole UTF-8 processing and enforces operation on (u)bytes. But this is not the case at least not for stripLeft and probably other string functions.

std.string.stripLeft is still expecting Unicode, as it's testing std.uni.isWhite. So it has to do decoding. std.algorithm.stripLeft (the way you called it anyway) is looking at char instances and doing a direct comparison to ONE char (' '), so it can be much, much faster and does not have to decode. This is an optimization, not a feature. I wouldn't be surprised, for instance, if byCodeUnit threw an error when encountering an invalid sequence in debug mode or something.
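Roughly, the two strategies boil down to something like this (a hand-written sketch, not the actual Phobos code):

import std.uni : isWhite;
import std.utf : decode;

// what std.string.stripLeft has to do: decode each code point and test std.uni.isWhite
string stripLeftUnicode(string s)
{
    size_t i;
    while (i < s.length)
    {
        size_t next = i;
        dchar c = decode(s, next);   // may throw UTFException on malformed input
        if (!isWhite(c))
            break;
        i = next;
    }
    return s[i .. $];
}

// what the single-char std.algorithm call boils down to: raw code-unit comparison
string stripLeftSpace(string s)
{
    while (s.length && s[0] == ' ')
        s = s[1 .. $];
    return s;
}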

If your goal is only to look for that ASCII character, then using byCodeUnit is required to avoid auto-decoding, which is where the unexpected slowdown would come from.

But string functions that are specifically looking for unicode sequences are still going to decode, even if the range isn't doing it proactively.

In any case, the input data is not valid; you should use ubyte[], or some other array type, not strings.

-Steve