Need to do some "dirty" UTF-8 handling (page 2)

On 26.06.2011 3:25, Nick Sabalausky wrote:
> "Dmitry Olshansky"<dmitry.olsh@gmail.com> wrote in message news:iu5n32$2vjd$1@digitalmars.com...
>> On 26.06.2011 1:49, Nick Sabalausky wrote:
>>> "Andrej Mitrovic"<andrej.mitrovich@gmail.com> wrote in message news:mailman.1215.1309019944.14074.digitalmars-d-learn@puremagic.com...
>>>> I've had a similar requirement some time ago. I've had to copy and
>>>> modify the phobos function std.utf.decode for a custom text editor
>>>> because the function throws when it finds an invalid code point. This
>>>> is way too slow for my needs. I'm actually displaying invalid code
>>>> points with special marks (just like Scintilla), so I need decoding to
>>>> work as fast as possible.
>>>>
>>>> The new function simply replaces throwing exceptions with flagging a
>>>> boolean.
>>> I think I may end up doing something like that :/
>>>
>>> I was hoping to be able to do something vaguely sensible like this:
>>>
>>> string newStr;
>>> foreach(dchar dc; str)
>>> {
>>>       if(isValidDchar(dc))
>>>           newStr ~= dc;
>>>       else
>>>           newStr ~= 'X';
>>> }
>>> str = newStr;
>>>
>>> But that just blows up in my face.
>>>
>> std.encoding to the rescue?
>> It looks like a well established module that was forgotten for some
>> reason.
>>
>> And here I'm wondering what a function named sanitize could do :)
>>
> Ahh, I didn't even notice that module.

Same here, It's just a couple of days(!) ago I somehow managed to find decode in the wrong place (in std.encoding instead of std.utf). And it looked useful, but I never heard about it. Seriously, how many totally irrelevant old modules we have around here? (hint: std.gregorian!)

> Even if it's imperfect and goes away, it looks like it'll at least get the
> job done for me. And the encoding conversions should even give me an easy
> way to save at least some of the invalid chars (which wasn't really a
> requirement of mine, but it'll still be nice).
>

Yeah, given the amount of necessary work in the Phobos realm it could hang around for quite sometime ;)

--
Dmitry Olshansky

June 26, 2011

Re: Need to do some "dirty" UTF-8 handling

Posted by Nick Sabalausky
in reply to Dmitry Olshansky

"Dmitry Olshansky" <dmitry.olsh@gmail.com> wrote in message news:iu5tan$ets$1@digitalmars.com...
> On 26.06.2011 3:25, Nick Sabalausky wrote:
>> "Dmitry Olshansky"<dmitry.olsh@gmail.com>  wrote in message news:iu5n32$2vjd$1@digitalmars.com...
>>> On 26.06.2011 1:49, Nick Sabalausky wrote:
>>>> "Andrej Mitrovic"<andrej.mitrovich@gmail.com>   wrote in message news:mailman.1215.1309019944.14074.digitalmars-d-learn@puremagic.com...
>>>>> I've had a similar requirement some time ago. I've had to copy and modify the phobos function std.utf.decode for a custom text editor because the function throws when it finds an invalid code point. This is way too slow for my needs. I'm actually displaying invalid code points with special marks (just like Scintilla), so I need decoding to work as fast as possible.
>>>>>
>>>>> The new function simply replaces throwing exceptions with flagging a boolean.
>>>> I think I may end up doing something like that :/
>>>>
>>>> I was hoping to be able to do something vaguely sensible like this:
>>>>
>>>> string newStr;
>>>> foreach(dchar dc; str)
>>>> {
>>>>       if(isValidDchar(dc))
>>>>           newStr ~= dc;
>>>>       else
>>>>           newStr ~= 'X';
>>>> }
>>>> str = newStr;
>>>>
>>>> But that just blows up in my face.
>>>>
>>>>
>>> std.encoding to the rescue?
>>> It looks like a well established module that was forgotten for some
>>> reason.
>>>
>>> And here I'm wondering what a function named sanitize could do :)
>>>
>> Ahh, I didn't even notice that module.
>
> Same here, It's just a couple of days(!) ago I somehow managed to find decode in the wrong place (in std.encoding  instead of std.utf). And it looked useful, but I never heard about it. Seriously, how many totally irrelevant old modules we have around here? (hint: std.gregorian!)
>> Even if it's imperfect and goes away, it looks like it'll at least get the
>> job done for me. And the encoding conversions should even give me an easy
>> way to save at least some of the invalid chars (which wasn't really a
>> requirement of mine, but it'll still be nice).
>>
>>
> Yeah, given the amount of necessary work in the Phobos realm it could hang around for quite sometime ;)
>

Yea, and even when it does go, I can just copy it and include it manually (although it'll probably need some work once typedef goes away).

This seems to get the job done well enough for me, and even manages to save some of the intended chars:

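import std.utf;      // validate, UtfException
import std.encoding; // sanitize, transcode, Windows1252String

string src = ...;
bool valid = true;
try
    validate(src);        // throws if src is not well-formed UTF-8
catch(UtfException e)     // spelled UTFException in later Phobos releases
    valid = false;

if(!valid)
{
    // Treat the raw bytes as Windows-1252, let sanitize replace anything
    // that is not valid in that encoding, then convert back to UTF-8 in src.
    auto tmpStr = sanitize( cast(Windows1252String) src );
    transcode(tmpStr, src);
}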
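
For comparison, here is a minimal sketch of the replace-with-'X' loop quoted earlier in the thread, written so that it does not throw. It assumes a Phobos release recent enough that std.utf.decode accepts the Yes.useReplacementDchar flag, which was not available when this thread was written. The name replaceInvalid and its marker parameter are made up for illustration and are not part of Phobos.

```d
import std.typecons : Yes;
import std.utf : decode, replacementDchar;

/// Hypothetical helper: returns a copy of str in which every sequence that
/// decode reports as malformed is replaced by marker.
string replaceInvalid(string str, dchar marker = 'X')
{
    string result;
    size_t i = 0;
    while (i < str.length)
    {
        // With Yes.useReplacementDchar, decode does not throw on malformed
        // input. It returns replacementDchar (U+FFFD) and advances i past
        // the bad code units; how many it skips is decided by the library.
        immutable dchar dc = decode!(Yes.useReplacementDchar)(str, i);

        // Caveat: a genuine U+FFFD in the input is also turned into marker.
        result ~= (dc == replacementDchar) ? marker : dc;
    }
    return result;
}
```

Well-formed input comes back unchanged, and the result always passes std.utf.validate because only decoded or substituted code points are appended. Unlike the sanitize/transcode route above, this does not try to recover the intended characters.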

On 2011-06-25 17:04, Dmitry Olshansky wrote:
> On 26.06.2011 3:25, Nick Sabalausky wrote:
> > "Dmitry Olshansky"<dmitry.olsh@gmail.com> wrote in message news:iu5n32$2vjd$1@digitalmars.com...
> >
> >> On 26.06.2011 1:49, Nick Sabalausky wrote:
> >>> "Andrej Mitrovic"<andrej.mitrovich@gmail.com> wrote in message news:mailman.1215.1309019944.14074.digitalmars-d-learn@puremagic.com...
> >>>
> >>>> I've had a similar requirement some time ago. I've had to copy and modify the phobos function std.utf.decode for a custom text editor because the function throws when it finds an invalid code point. This is way too slow for my needs. I'm actually displaying invalid code points with special marks (just like Scintilla), so I need decoding to work as fast as possible.
> >>>>
> >>>> The new function simply replaces throwing exceptions with flagging a boolean.
> >>>
> >>> I think I may end up doing something like that :/
> >>>
> >>> I was hoping to be able to do something vaguely sensible like this:
> >>>
> >>> string newStr;
> >>> foreach(dchar dc; str)
> >>> {
> >>>     if(isValidDchar(dc))
> >>>         newStr ~= dc;
> >>>     else
> >>>         newStr ~= 'X';
> >>> }
> >>> str = newStr;
> >>>
> >>> But that just blows up in my face.
> >>
> >> std.encoding to the rescue?
> >> It looks like a well established module that was forgotten for some
> >> reason.
> >>
> >> And here I'm wondering what a function named sanitize could do :)
> >
> > Ahh, I didn't even notice that module.
>
> Same here, It's just a couple of days(!) ago I somehow managed to find
> decode in the wrong place (in std.encoding instead of std.utf). And it
> looked useful, but I never heard about it. Seriously, how many totally
> irrelevant old modules we have around here? (hint: std.gregorian!)
>
> > Even if it's imperfect and goes away, it looks like it'll at least get the job done for me. And the encoding conversions should even give me an easy way to save at least some of the invalid chars (which wasn't really a requirement of mine, but it'll still be nice).
>
> Yeah, given the amount of necessary work in the Phobos realm it could hang around for quite sometime ;)

Oh, it'll probably be around for a while. It'll take time before a replacement is devised. After all, std.stream is still around, isn't it? And there's actually supposedly a plan regarding its replacement's implementation. There's no such thing with regards to std.encoding. I just thought that I should point out that it's likely to be replaced at some point (hopefully with something much better).

- Jonathan M Davis

On Sat, 25 Jun 2011 23:17:37 +0300, Nick Sabalausky <a@a.a> wrote:

> "Vladimir Panteleev" <vladimir@thecybershadow.net> wrote in message
> news:op.vxmuvzqbtuzx1w@cybershadow.mshome.net...
>>
>> string s;
>> foreach (dchar c; r)
>
> That doesn't throw on an invalid sequence?

You use rawToUTF8 to convert an arbitrary array of chars to valid UTF-8. You use UTF8ToRaw to convert the output of rawToUTF8 back to the original string.

--
Best regards,
 Vladimir                            mailto:vladimir@thecybershadow.net
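
A rough sketch of how a reversible pair along these lines could work. The actual rawToUTF8 and UTF8ToRaw are not shown on this page, so the bodies below are an assumption rather than Vladimir's code: each input byte is widened to the code point with the same value (0 to 255), which always encodes to valid UTF-8, and the reverse step narrows each code point back to one byte. The names rawToUTF8Sketch and utf8ToRawSketch are hypothetical.

```d
import std.utf : toUTF8;

/// Assumed forward step: map every byte to the code point of equal value,
/// then encode those code points as UTF-8. The result is always valid.
string rawToUTF8Sketch(const(char)[] raw)
{
    auto wide = new dchar[raw.length];
    foreach (i, char c; raw)   // iterates code units, so nothing is decoded
        wide[i] = c;
    return toUTF8(wide);
}

/// Assumed reverse step: decode the UTF-8 and narrow each code point back
/// to a single byte. Only meaningful on output of rawToUTF8Sketch.
string utf8ToRawSketch(const(char)[] text)
{
    auto raw = new char[text.length];
    size_t n = 0;
    foreach (dchar c; text)    // decoding is safe here: text is valid UTF-8
    {
        assert(c < 0x100, "input was not produced by rawToUTF8Sketch");
        raw[n++] = cast(char) c;
    }
    return raw[0 .. n].idup;
}
```

The intermediate string is safe to hand to functions that decode, at the cost that multi-byte characters in the original appear as several separate code points until the reverse step is applied.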
