Reading binary streams with decoding to Unicode

Oct 15, 2018

Vinay Sajip

Is there a standardised way of reading over buffered binary streams (at least strings, files, and sockets) where you can layer a decoder on top, so you get a character stream you can read one Unicode char at a time? Initially UTF-8, but later also other encodings. I see that std.stream was deprecated, but can't see what other options there are. Can anyone point me in the right direction?

On Monday, 15 October 2018 at 10:49:49 UTC, Vinay Sajip wrote:
> Is there a standardised way of reading over buffered binary streams (at least strings, files, and sockets) where you can layer a decoder on top, so you get a character stream you can read one Unicode char at a time? Initially UTF-8, but later also other encodings. I see that std.stream was deprecated, but can't see what other options there are. Can anyone point me in the right direction?

This is done automatically for character arrays, which includes strings. wchar arrays will iterate by UTF-16, and dchar arrays by UTF-32. If you have a byte/ubyte array you know to be Unicode-encoded, convert it to char[] to iterate by code points. Vice versa, if you want to iterate a character array by code unit, convert it to ubyte[]/ushort[] (depending on code unit length) or use std.utf.byCodeUnit.
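As a rough illustration of the advice above, here is a minimal sketch that reinterprets a ubyte[] as UTF-8 text and iterates it both ways. The byte values are just an example ("h", "é", "€"). One detail worth noting: a plain foreach over a char[] yields code units, so the loop variable has to be declared as dchar to get decoded code points, while range algorithms such as walkLength decode narrow strings on their own.

```d
import std.range : walkLength;
import std.utf : byCodeUnit, validate;

void main()
{
    // Example bytes, assumed to be UTF-8: "h", "é" (2 bytes), "€" (3 bytes).
    ubyte[] raw = [0x68, 0xC3, 0xA9, 0xE2, 0x82, 0xAC];

    // Reinterpret the bytes as UTF-8 text; the cast neither copies nor checks.
    auto text = cast(const(char)[]) raw;

    // validate throws a UTFException if the bytes are not well-formed UTF-8.
    validate(text);

    // Declaring the loop variable as dchar decodes one code point per step.
    size_t points;
    foreach (dchar cp; text)
        ++points;
    assert(points == 3);

    // Without the explicit type, foreach walks the raw UTF-8 code units.
    size_t units;
    foreach (c; text)
        ++units;
    assert(units == 6);

    // Range algorithms decode narrow strings; byCodeUnit opts out of that.
    assert(text.walkLength == 3);
    assert(text.byCodeUnit.walkLength == 6);
}
```

This only covers data that is already fully in memory; it does not address reading from a stream.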

On Monday, 15 October 2018 at 17:55:34 UTC, Dukc wrote:
> This is done automatically for character arrays, which includes strings. wchar arrays will iterate by UTF-16, and dchar arrays by UTF-32. If you have a byte/ubyte array you know to be unicode-encoded, convert it to char[] to iterate by code points.

Thanks for the response. I was looking for something where I don't have to manage buffers myself (e.g. when handling buffered file or socket I/O). It's really easy to find this functionality in e.g. Python, C#, Go, Kotlin, Java etc. but I'm surprised there doesn't seem to be a ready-to-go equivalent in D. For example, I can find D examples of opening files and reading a line at a time, but no examples of opening a file and reading Unicode chars one at a time. Perhaps I've just missed them?

October 15, 2018

Re: Reading binary streams with decoding to Unicode

Posted by Nicholas Wilson
in reply to Vinay Sajip

On Monday, 15 October 2018 at 18:57:19 UTC, Vinay Sajip wrote:
> On Monday, 15 October 2018 at 17:55:34 UTC, Dukc wrote:
>> This is done automatically for character arrays, which includes strings. wchar arrays will iterate by UTF-16, and dchar arrays by UTF-32. If you have a byte/ubyte array you know to be unicode-encoded, convert it to char[] to iterate by code points.
>
> Thanks for the response. I was looking for something where I don't have to manage buffers myself (e.g. when handling buffered file or socket I/O). It's really easy to find this functionality in e.g. Python, C#, Go, Kotlin, Java etc. but I'm surprised there doesn't seem to be a ready-to-go equivalent in D. For example, I can find D examples of opening files and reading a line at a time, but no examples of opening a file and reading Unicode chars one at a time. Perhaps I've just missed them?

import std.file : readText;
import std.uni : byCodePoint, byGrapheme;
// or import std.utf : byCodeUnit, byChar /*utf8*/, byWchar /*utf16*/, byDchar /*utf32*/, byUTF /*template: byUTF!char, byUTF!wchar, byUTF!dchar*/;

void main()
{
    // readText loads the whole file into memory and validates it as UTF-8
    string a = readText("foo");

    foreach(cp; a.byCodePoint)
    {
        // do stuff with code point 'cp' (a dchar)
    }

    foreach(g; a.byGrapheme)
    {
        // do stuff with grapheme 'g'
    }
}

On Monday, 15 October 2018 at 19:56:22 UTC, Nicholas Wilson wrote:
> import std.file : readText;
> import std.uni : byCodePoint, byGrapheme;
> // or import std.utf : byCodeUnit, byChar /*utf8*/, byWchar /*utf16*/, byDchar /*utf32*/, byUTF /*utf8(?)*/;
> string a = readText("foo");
>
> foreach(cp; a.byCodePoint)
> {
>     // do stuff with code point 'cp'
> }

Your example shows reading an entire file into memory (string a = readText("foo")), then iterating over that. I know you can iterate over a string; I'm interested in iterating over a stream, which is perhaps read over a network or from another I/O source, where you can't assume you have access to all of it at once - just one Unicode character at a time.
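For what it's worth, the streaming case described above can be approximated with Phobos alone by decoding chunk by chunk and carrying over any UTF-8 sequence that is split across a chunk boundary. The following is only a sketch under that assumption: decodeChunks is a hypothetical helper written for this illustration, not a Phobos function, and it handles UTF-8 only.

```d
import std.string : representation;
import std.utf : decode, stride, UTFException;

/// Hypothetical helper: decodes UTF-8 from a range of byte chunks and
/// calls `sink` once for every decoded code point.
void decodeChunks(R)(R chunks, scope void delegate(dchar) sink)
{
    const(ubyte)[] pending; // tail of a sequence split across two chunks

    foreach (chunk; chunks)
    {
        const(ubyte)[] bytes = chunk;
        // Concatenation allocates a fresh array, so slices of it stay
        // valid even if the source reuses its chunk buffer.
        auto buf = cast(const(char)[]) (pending ~ bytes);
        size_t i = 0;
        while (i < buf.length)
        {
            // stride gives the sequence length implied by the lead byte.
            if (i + stride(buf, i) > buf.length)
                break; // incomplete sequence: wait for the next chunk
            sink(decode(buf, i)); // decode advances i past the sequence
        }
        pending = cast(const(ubyte)[]) buf[i .. $];
    }

    if (pending.length != 0)
        throw new UTFException("stream ended inside a UTF-8 sequence");
}

void main()
{
    // "h", "é", "€", cut so that both multi-byte sequences are split.
    auto bytes = "hé€".representation;
    auto chunks = [bytes[0 .. 2], bytes[2 .. 5], bytes[5 .. $]];

    dchar[] seen;
    decodeChunks(chunks, (dchar c) { seen ~= c; });
    assert(seen == "hé€"d);
}
```

For a file, File("foo").byChunk(4096) from std.stdio could supply the chunks in place of the in-memory array. The per-chunk concatenation keeps the sketch short at the cost of an allocation per chunk; a dedicated buffering library would avoid that.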

On Monday, 15 October 2018 at 21:48:05 UTC, Vinay Sajip wrote:
> On Monday, 15 October 2018 at 19:56:22 UTC, Nicholas Wilson wrote:
>> import std.file : readText;
>> import std.uni : byCodePoint, byGrapheme;
>> // or import std.utf : byCodeUnit, byChar /*utf8*/, byWchar /*utf16*/, byDchar /*utf32*/, byUTF /*utf8(?)*/;
>> string a = readText("foo");
>>
>> foreach(cp; a.byCodePoint)
>> {
>>     // do stuff with code point 'cp'
>> }
>
> Your example shows reading an entire file into memory (string a = readText("foo")), then iterating over that. I know you can iterate over a string; I'm interested in iterating over a stream, which is perhaps read over a network or from another I/O source, where you can't assume you have access to all of it at once - just one Unicode character at a time.

Oh, sorry I missed that. Take a look at https://github.com/schveiguy/iopipe

On 10/16/18 11:42 AM, Vinay Sajip wrote:
> On Monday, 15 October 2018 at 22:49:31 UTC, Nicholas Wilson wrote:
>> Oh, sorry I missed that. Take a look at https://github.com/schveiguy/iopipe
>
> Great, thanks.

Let me know if anything doesn't work there. The text processing is pretty robust, but I haven't done a lot of work with sockets.

-Steve
