Thread overview
Reading binary streams with decoding to Unicode
Oct 15, 2018
Vinay Sajip
Oct 15, 2018
Dukc
Oct 15, 2018
Vinay Sajip
Oct 15, 2018
Nicholas Wilson
Oct 15, 2018
Vinay Sajip
Oct 15, 2018
Nicholas Wilson
Oct 16, 2018
Vinay Sajip
October 15, 2018
Is there a standardised way of reading over buffered binary streams (at least strings, files, and sockets) where you can layer a decoder on top, so you get a character stream you can read one Unicode char at a time? Initially UTF-8, but later also other encodings. I see that std.stream was deprecated, but can't see what other options there are. Can anyone point me in the right direction?


October 15, 2018
On Monday, 15 October 2018 at 10:49:49 UTC, Vinay Sajip wrote:
> Is there a standardised way of reading over buffered binary streams (at least strings, files, and sockets) where you can layer a decoder on top, so you get a character stream you can read one Unicode char at a time? Initially UTF-8, but later also other encodings. I see that std.stream was deprecated, but can't see what other options there are. Can anyone point me in the right direction?

This is done automatically for character arrays, which includes strings. wchar arrays will iterate by UTF-16, and dchar arrays by UTF-32. If you have a byte/ubyte array you know to be Unicode-encoded, convert it to char[] to iterate by code points.

Vice versa, if you want to iterate a character array by code unit, convert it to ubyte[]/ushort[] (depending on code unit length) or use std.utf.byCodeUnit.
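
A minimal sketch of the two iteration modes described above, assuming the bytes are UTF-8. The helper names `countCodePoints` and `countCodeUnits` are made up for illustration. Note that a plain `foreach (c; text)` over a `char[]` yields code units; decoding to code points happens when you ask for `dchar` or go through the range primitives.

```d
import std.stdio : writeln;
import std.utf : byCodeUnit, validate;

/// Hypothetical helper: counts code points in UTF-8 bytes.
/// validate() throws a UTFException if the input is malformed.
size_t countCodePoints(const(ubyte)[] raw)
{
    auto text = cast(const(char)[]) raw; // reinterpret the bytes, no copy
    validate(text);
    size_t n;
    foreach (dchar c; text) // requesting dchar makes foreach decode
        ++n;
    return n;
}

/// Hypothetical helper: counts UTF-8 code units without decoding.
size_t countCodeUnits(const(char)[] text)
{
    size_t n;
    foreach (u; text.byCodeUnit)
        ++n;
    return n;
}

void main()
{
    // "hé" in UTF-8: 'h' is one byte, 'é' is two (0xC3 0xA9).
    immutable ubyte[] raw = [0x68, 0xC3, 0xA9];
    // prints: 2 code points, 3 code units
    writeln(countCodePoints(raw), " code points, ",
            countCodeUnits(cast(const(char)[]) raw), " code units");
}
```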
October 15, 2018
On Monday, 15 October 2018 at 17:55:34 UTC, Dukc wrote:
> This is done automatically for character arrays, which includes strings. wchar arrays will iterate by UTF-16, and dchar arrays by UTF-32. If you have a byte/ubyte array you know to be Unicode-encoded, convert it to char[] to iterate by code points.

Thanks for the response. I was looking for something where I don't have to manage buffers myself (e.g. when handling buffered file or socket I/O). It's really easy to find this functionality in e.g. Python, C#, Go, Kotlin, Java etc. but I'm surprised there doesn't seem to be a ready-to-go equivalent in D. For example, I can find D examples of opening files and reading a line at a time, but no examples of opening a file and reading Unicode chars one at a time. Perhaps I've just missed them?

October 15, 2018
On Monday, 15 October 2018 at 18:57:19 UTC, Vinay Sajip wrote:
> On Monday, 15 October 2018 at 17:55:34 UTC, Dukc wrote:
>> This is done automatically for character arrays, which includes strings. wchar arrays will iterate by UTF-16, and dchar arrays by UTF-32. If you have a byte/ubyte array you know to be Unicode-encoded, convert it to char[] to iterate by code points.
>
> Thanks for the response. I was looking for something where I don't have to manage buffers myself (e.g. when handling buffered file or socket I/O). It's really easy to find this functionality in e.g. Python, C#, Go, Kotlin, Java etc. but I'm surprised there doesn't seem to be a ready-to-go equivalent in D. For example, I can find D examples of opening files and reading a line at a time, but no examples of opening a file and reading Unicode chars one at a time. Perhaps I've just missed them?

import std.file : readText;
import std.uni : byCodePoint, byGrapheme;
// or import std.utf : byCodeUnit, byChar /*utf8*/, byWchar /*utf16*/, byDchar /*utf32*/, byUTF /*templated on the target char type*/;

void main()
{
    // reads the whole file into memory and validates it as UTF-8
    string a = readText("foo");

    foreach(cp; a.byCodePoint)
    {
        // do stuff with code point 'cp'
    }

    foreach(g; a.byGrapheme)
    {
        // do stuff with grapheme 'g'
    }
}
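
To illustrate why the snippet above offers both loops, here is a small sketch of how the three granularities differ on the same string. The sample string is my own choice, not from the thread: an `e` followed by a combining accent is a single user-perceived character but two code points.

```d
import std.range : walkLength;
import std.uni : byCodePoint, byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    assert(s.byCodeUnit.walkLength == 3);  // UTF-8 code units: 1 + 2
    assert(s.byCodePoint.walkLength == 2); // two code points
    assert(s.byGrapheme.walkLength == 1);  // one grapheme cluster
}
```

Which level counts as "one Unicode char at a time" depends on the application; a tokenizer usually wants code points, while anything dealing with what the user sees usually wants graphemes.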

October 15, 2018
On Monday, 15 October 2018 at 19:56:22 UTC, Nicholas Wilson wrote:
>
> import std.file : readText;
> import std.uni : byCodePoint, byGrapheme;
> // or import std.utf : byCodeUnit, byChar /*utf8*/, byWchar /*utf16*/, byDchar /*utf32*/, byUTF /*templated on the target char type*/;
> string a = readText("foo");
>
> foreach(cp; a.byCodePoint)
> {
>     // do stuff with code point 'cp'
> }

Your example shows reading an entire file into memory (string a = readText("foo")), then iterating over that. I know you can iterate over a string; I'm interested in iterating over a stream, which is perhaps read over a network or from another I/O source, where you can't assume you have access to all of it at once - just one Unicode character at a time.
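
For concreteness, this is roughly the behaviour being asked for, written as a rough sketch on top of Phobos rather than a ready-made library facility. It assumes UTF-8 input, and `eachCodePoint` is a hypothetical name. The only subtlety is that a multi-byte sequence can be split across two chunks, so incomplete trailing bytes are carried over to the next read.

```d
import std.range.primitives : isInputRange;
import std.utf : decode, stride;

/// Hypothetical helper: calls `sink` once per code point, given any input
/// range of byte chunks (e.g. File.byChunk, or buffers read from a socket).
void eachCodePoint(R)(R chunks, scope void delegate(dchar) sink)
if (isInputRange!R)
{
    const(ubyte)[] pending; // bytes of a sequence split across chunks
    foreach (chunk; chunks)
    {
        pending ~= chunk; // copies, so a reused chunk buffer is safe
        auto text = cast(const(char)[]) pending;
        size_t i = 0;
        while (i < text.length)
        {
            // stride() gives the sequence length from its lead byte
            // and throws a UTFException on an invalid lead byte.
            if (i + stride(text, i) > text.length)
                break; // incomplete sequence: wait for the next chunk
            sink(decode(text, i)); // decode() advances i past the sequence
        }
        pending = pending[i .. $].dup; // keep only the undecoded tail
    }
    if (pending.length)
        throw new Exception("truncated UTF-8 sequence at end of stream");
}

void main()
{
    // Two chunks that split the two-byte sequence for 'é' (0xC3 0xA9).
    ubyte[][] chunks = [[0x68, 0xC3], [0xA9, 0x21]];
    dchar[] got;
    eachCodePoint(chunks, (dchar c) { got ~= c; });
    assert(got == "h\u00E9!"d);
}
```

For a file, the same function could be fed with `File("foo").byChunk(4096)` from std.stdio. Other encodings would need a different decoding step in place of `stride` and `decode`.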
October 15, 2018
On Monday, 15 October 2018 at 21:48:05 UTC, Vinay Sajip wrote:
> On Monday, 15 October 2018 at 19:56:22 UTC, Nicholas Wilson wrote:
>>
>> import std.file : readText;
>> import std.uni : byCodePoint, byGrapheme;
>> // or import std.utf : byCodeUnit, byChar /*utf8*/, byWchar /*utf16*/, byDchar /*utf32*/, byUTF /*templated on the target char type*/;
>> string a = readText("foo");
>>
>> foreach(cp; a.byCodePoint)
>> {
>>     // do stuff with code point 'cp'
>> }
>
> Your example shows reading an entire file into memory (string a = readText("foo")), then iterating over that. I know you can iterate over a string; I'm interested in iterating over a stream, which is perhaps read over a network or from another I/O source, where you can't assume you have access to all of it at once - just one Unicode character at a time.

Oh, sorry I missed that. Take a look at https://github.com/schveiguy/iopipe
October 16, 2018
On Monday, 15 October 2018 at 22:49:31 UTC, Nicholas Wilson wrote:
> Oh, sorry I missed that. Take a look at https://github.com/schveiguy/iopipe

Great, thanks.
October 16, 2018
On 10/16/18 11:42 AM, Vinay Sajip wrote:
> On Monday, 15 October 2018 at 22:49:31 UTC, Nicholas Wilson wrote:
>> Oh, sorry I missed that. Take a look at https://github.com/schveiguy/iopipe
> 
> Great, thanks.

Let me know if anything doesn't work there. The text processing is pretty robust, but I haven't done a lot of work with sockets.

-Steve