May 17, 2018
On Wednesday, 16 May 2018 at 15:47:29 UTC, Jon Degenhardt wrote:
> If you write it in the style of my earlier example and use counters and if-tests it will work. byLine by itself won't try to interpret the characters (won't auto-decode them), so it won't trigger an exception if there are invalid utf-8 characters.

When printing to stdout it seems to skip any validation, but writing to a file does give an exception:

```
    auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
    auto outputFile = new File("output.txt");
    foreach (line; inputStream.byLine(KeepTerminator.yes)) outputFile.write(line);
```
std.exception.ErrnoException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\stdio.d(2877):  (No error)

According to the documentation, byLine can throw a UTFException, so relying on the fact that it doesn't in some cases doesn't seem like a good idea.
May 17, 2018
On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
> For various reasons, that doesn't always hold true like it should, but pretty much all of Phobos is written with that assumption and will generally throw an exception if it isn't.

It's unfortunate that Phobos tells you 'there are problems with the encoding' without providing any means to fix them or even diagnose them. The UTFException doesn't contain what the character in question was. You just have to abort whatever you were trying to do.

On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
> If you're ever dealing with a different encoding (or with invalid Unicode), you really need to use integral types like ubyte

I tried something like byChunk(4096).joiner.splitter(cast(ubyte) '\n') but it turns out splitter wants at least a forward range, even when the separator is a single element.
May 17, 2018
On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently; I thought 1.5 GB would fit, but I get an out-of-memory error when using std.file.read)

Memory mapping should work. That's in core.sys.posix.sys.mman for Posix systems, and Windows has an equivalent; std.mmfile.MmFile wraps both. (But nobody uses Windows, right?)

> - It is dirty (contains invalid Unicode characters, null bytes in the middle of lines)

std.algorithm should generally work with sequences of anything, not just strings. So memory map, cast to ubyte[], and deal with it that way?
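
Untested sketch of what I mean (file name hypothetical; std.mmfile's MmFile wraps the platform APIs on both Posix and Windows):

```
import std.algorithm : splitter;
import std.mmfile : MmFile;
import std.stdio : writeln;

void main()
{
    auto mmf = new MmFile("input.txt");  // hypothetical file name
    auto data = cast(ubyte[]) mmf[];     // whole mapping as raw bytes

    // No decoding happens anywhere here, so invalid UTF-8 can't throw.
    size_t lines;
    foreach (line; data.splitter(cast(ubyte) '\n'))
        ++lines;
    writeln(lines, " pieces");
}
```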

> - When you convert chunks to arrays, you have the risk of a split being in the middle of a character with multiple code units

It's straightforward to scan for the start of a Unicode character; you just skip past bytes where the highest bit is set and the next-highest is not, i.e. the 0b10xx_xxxx continuation bytes. (0b1100_0000 through 0b1111_1110 is the start of a multibyte character; 0b0000_0000 through 0b0111_1111 is a single-byte character.)
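
A minimal sketch of that scan, assuming you're holding raw ubyte data:

```
// A UTF-8 continuation byte has the form 0b10xx_xxxx; skip those to
// land on the next character boundary.
size_t nextCharBoundary(const(ubyte)[] data, size_t i)
{
    while (i < data.length && (data[i] & 0b1100_0000) == 0b1000_0000)
        ++i;
    return i;
}
```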

That said, you seem to only need to split based on a newline character, so you might be able to ignore this entirely, even if you go by chunks.
May 18, 2018
On 05/17/2018 11:40 PM, Neia Neutuladh wrote:
> 0b1100_0000 through 0b1111_1110 is the start of a multibyte character

Nitpick: It only goes up to 0b1111_0100. The highest code point is U+10FFFF. There are no sequences with more than four bytes.
May 17, 2018
On Thursday, 17 May 2018 at 20:08:09 UTC, Dennis wrote:
> On Wednesday, 16 May 2018 at 15:47:29 UTC, Jon Degenhardt wrote:
>> If you write it in the style of my earlier example and use counters and if-tests it will work. byLine by itself won't try to interpret the characters (won't auto-decode them), so it won't trigger an exception if there are invalid utf-8 characters.
>
> When printing to stdout it seems to skip any validation, but writing to a file does give an exception:
>
> ```
>     auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
>     auto outputFile = new File("output.txt");
>     foreach (line; inputStream.byLine(KeepTerminator.yes)) outputFile.write(line);
> ```
> std.exception.ErrnoException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\stdio.d(2877):  (No error)
>
> According to the documentation, byLine can throw an UTFException so relying on the fact that it doesn't in some cases doesn't seem like a good idea.

Instead of:

     auto outputFile = new File("output.txt");

try:

    auto outputFile = File("output.txt", "w");

That works for me. The second arg ("w") opens the file for write. When I omit it, I also get an exception, as the default open mode is for read:

 * If file does not exist:  Cannot open file `output.txt' in mode `rb' (No such file or directory)
 * If file does exist:   (Bad file descriptor)

The second error presumably occurs when writing.

As an aside - I agree with one of your bigger picture observations: it would be preferable to have more control over UTF-8 error handling behavior at the application level.
May 17, 2018
On Thursday, May 17, 2018 21:10:35 Dennis via Digitalmars-d-learn wrote:
> On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
> > For various reasons, that doesn't always hold true like it should, but pretty much all of Phobos is written with that assumption and will generally throw an exception if it isn't.
>
> It's unfortunate that Phobos tells you 'there are problems with the encoding' without providing any means to fix them or even diagnose them. The UTFException doesn't contain what the character in question was. You just have to abort whatever you were trying to do.

UTFException has a sequence member and a len member (which appear to be public but undocumented) which should contain the invalid sequence of code units. In general though, exceptions aren't a great way to deal with this problem. I think that you either want to be calling decode manually (in which case, you have direct access to where the invalid Unicode is and have the freedom to deal with it however is appropriate), or using the Unicode replacement character would be better (which std.utf.decode supports, but it's not what's used by default). Really, what's biting you here is the auto-decoding. With Phobos, you have to fight to keep it from happening by doing things like special-casing your code for strings or using std.string.representation or std.utf.byCodeUnit.
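
A sketch of the replacement-character route (the invalid bytes here are made up):

```
import std.stdio : writefln;
import std.utf : decode, UseReplacementDchar;

void main()
{
    // Deliberately invalid UTF-8: 0xFF can never occur in UTF-8.
    immutable ubyte[] raw = ['a', 'b', 0xFF, 'c'];
    auto s = cast(string) raw;

    size_t i = 0;
    while (i < s.length)
    {
        // With UseReplacementDchar.yes, a bad sequence comes back as
        // U+FFFD instead of throwing a UTFException.
        dchar c = decode!(UseReplacementDchar.yes)(s, i);
        writefln("U+%04X", c);
    }
}
```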

In principle, the way that Unicode would ideally be handled would be to validate all character data when it enters the program (doing whatever is appropriate with invalid Unicode at that point), and then the rest of the program either is always dealing with valid Unicode or is dealing with integral values that it doesn't treat as Unicode (e.g. ubyte[]). But the way that Phobos is written, it ends up decoding and validating all over the place.
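
A minimal sketch of that kind of boundary check:

```
import std.utf : validate, UTFException;

// Validate once on input; after that, the rest of the program can
// assume the data is valid UTF-8.
bool acceptInput(const(char)[] data)
{
    try
    {
        validate(data);  // throws UTFException at the first bad sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```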

> On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
> > If you're ever dealing with a different encoding (or with invalid Unicode), you really need to use integral types like ubyte
>
> I tried something like byChunk(4096).joiner.splitter(cast(ubyte)
> '\n') but it turns out splitter wants at least a forward range,
> even when the separator is a single element.

Actually, I'm pretty sure that splitter currently requires a random-access range (even though it should theoretically work with a forward range). I don't think that it can be made to work with an input range, though, given how the range API works - or at least, if it were made to work, you'd have to deal with the fact that popping front on the splitter range would invalidate anything that had been returned from front. And it would be difficult to implement it @safely if what gets returned by front is not completely independent of the splitter range (which means that it needs save). Basic input ranges in general tend to be extremely limited in what they can do, which can get really annoying when you deal with stuff like files or sockets, where making it a forward range likely means either reading it all into memory or having buffers that potentially have to be dup-ed by each call to save.
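
To illustrate with a quick sketch (file name hypothetical):

```
import std.algorithm : joiner, splitter;
import std.range : isForwardRange;
import std.stdio : File;

void main()
{
    // An array has slicing and length, so splitter is happy with it:
    ubyte[] data = [1, 2, '\n', 3, 4];
    auto parts = data.splitter(cast(ubyte) '\n');

    // byChunk(...).joiner is only an input range - there is no save -
    // so passing it to splitter won't compile:
    auto stream = File("input.txt").byChunk(4096).joiner;
    static assert(!isForwardRange!(typeof(stream)));
    // auto bad = stream.splitter(cast(ubyte) '\n'); // compile error
}
```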

- Jonathan M Davis

May 18, 2018
On Thursday, 17 May 2018 at 20:08:09 UTC, Dennis wrote:
> ```
>     auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
> 	auto outputFile = new File("output.txt");
>     foreach (line; inputStream.byLine(KeepTerminator.yes)) outputFile.write(line);
> ```

Do it old school?
---
// assumes: import std.stdio; import std.algorithm : countUntil;
int line;
auto outputFile = File("output.txt", "wb");
foreach (chunk; inputStream.byChunk(4 << 10))
{
    auto rem = chunk;
    while (rem.length)
    {
        auto i = rem.countUntil('\n');           // -1 if no newline left
        auto len = (i < 0) ? rem.length : i + 1; // include the terminator
        if (i >= 0) line++;
        outputFile.rawWrite(rem[0 .. len]);
        rem = rem[len .. $];
    }
}
---
May 21, 2018
On Thursday, 17 May 2018 at 21:10:35 UTC, Dennis wrote:
> It's unfortunate that Phobos tells you 'there are problems with the encoding' without providing any means to fix them or even diagnose them.

I have to take that back since I found out about std.encoding which has functions like `sanitize`, but also `transcode`. (My file turned out to actually be encoded with ANSI / Windows-1252, not UTF-8)
Documentation is scarce, however, and it requires strings instead of forward ranges.
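
For reference, the transcode route looks something like this (a sketch; the input bytes are made up):

```
import std.encoding : transcode, Windows1252String;
import std.stdio : writeln;

void main()
{
    // "café" in Windows-1252: 'é' is the single byte 0xE9 there.
    immutable ubyte[] raw = ['c', 'a', 'f', 0xE9];
    auto src = cast(Windows1252String) raw;

    string utf8;
    transcode(src, utf8);  // re-encode as valid UTF-8
    writeln(utf8);         // café
}
```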

@Jon Degenhardt
> Instead of:
> 
>      auto outputFile = new File("output.txt");
> 
> try:
> 
>     auto outputFile = File("output.txt", "w");

Wow I really butchered that code. So it is the `drop(4)` that triggers the UTFException? I find Exceptions in range code hard to interpret.

@Kagamin
> Do it old school?

I want to be convinced that range programming works like a charm, but the procedural approaches remain more flexible (and faster too), it seems. Thanks for the example.


May 21, 2018
On Monday, May 21, 2018 15:00:09 Dennis via Digitalmars-d-learn wrote:
> On Thursday, 17 May 2018 at 21:10:35 UTC, Dennis wrote:
> > It's unfortunate that Phobos tells you 'there are problems with the encoding' without providing any means to fix them or even diagnose them.
>
> I have to take that back since I found out about std.encoding
> which has functions like `sanitize`, but also `transcode`. (My
> file turned out to actually be encoded with ANSI / Windows-1252,
> not UTF-8)
> Documentation is scarce however, and it requires strings instead
> of forward ranges.
>
> @Jon Degenhardt
>
> > Instead of:
> >      auto outputFile = new File("output.txt");
> >
> > try:
> >     auto outputFile = File("output.txt", "w");
>
> Wow I really butchered that code. So it is the `drop(4)` that
> triggers the UTFException?

drop is range-based, so if you give it a string, it's going to decode because of the whole auto-decoding mess with std.range.primitives.front and popFront. If you can't have auto-decoding, you either have to be dealing with functions that you know avoid it, or you need to use something like std.string.representation or std.utf.byCodeUnit to get around the auto-decoding. If you're dealing with invalid Unicode, you basically have to either convert it all up front or do something like treat it as binary data, or Phobos is going to try to decode it as Unicode and give you a UTFException.
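
For example (a quick sketch), either of these avoids the decoding entirely:

```
import std.range : drop;
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    string s = "hello";
    // drop on a plain string pops decoded dchars (auto-decoding);
    // these views work on code units instead, so nothing can throw:
    auto bytes = s.representation.drop(2); // immutable(ubyte)[], sliced
    auto units = s.byCodeUnit.drop(2);     // range of char, no decoding
}
```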

> I find Exceptions in range code hard to interpret.

Well, if you just look at the stack trace, it should tell you. I don't see why ranges would be any worse than any other code except for maybe the fact that it's typical to chain a lot of calls, and you frequently end up with wrapper types in the stack trace that you're not necessarily familiar with. The big problem here really is that all you're being told is that your string has invalid Unicode in it somewhere, plus the chain of function calls that resulted in std.utf.decode being called on your invalid Unicode. But even if you weren't dealing with ranges, if you passed invalid Unicode to something completely string-based which did decoding, you'd run into pretty much the same problem. The data is being used outside of its original context, where you could easily figure out what it relates to, so it's going to be a problem by its very nature. The only real solution is to control the decoding yourself, and even then, it's easy to be in a position where it's hard to figure out where in the data the bad data is, unless you've done something like keep track of exactly what index you're at, which really doesn't work well once you're dealing with slicing data.

> @Kagamin
>
> > Do it old school?
>
> I want to be convinced that range programming works like a charm, but the procedural approaches remain more flexible (and faster too), it seems. Thanks for the example.

The whole auto-decoding mess makes things worse than they should be, but if you find procedural examples more flexible, then I would guess that that would be simply a matter of getting more experience with ranges. Ranges are far more composable in terms of how they're used, which tends to inherently make them more flexible. However, it does result in code that's a mixture of functional and procedural programming, which can be quite a shift for some folks. So, there's no question that it takes some getting used to, but D does allow for the more classic approaches, and ranges are not always the best approach.

As for performance, that depends on the code and the compiler. It wouldn't surprise me if dmd didn't optimize out the range stuff as much as it really should, but it's my understanding that ldc typically manages to generate code where the range abstraction didn't cost you anything. If there's an issue, I think that it's frequently an algorithmic one or the fact that some range-processing has a tendency to process the same data multiple times, because that's the easiest, most abstract way to go about it and works in general but isn't always the best solution.

For instance, because of how the range API works, when using splitter, if you iterate through the entire range, you pretty much have to iterate through it twice, because it does look-ahead to find the delimiter and then returns you a slice up to that point, after which, you process that chunk of the data to do whatever it is you want to do with each split piece. At a conceptual level, what you're doing with your code with splitter is then really clean and easy to write, and often, it should be plenty efficient, but it does require going over the data twice, whereas if you looped over the data yourself, looking for each delimiter, you'd only need to iterate over it once. So, in cases like that, I'd fully expect the abstraction to cost you, though whether it costs enough to matter depends on what you're doing.
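
A sketch of that two-pass pattern (the tab-counting task here is made up):

```
import std.algorithm : count, splitter;

// splitter scans ahead to find each '\n' (pass one); count then walks
// the same slice again (pass two).
size_t totalTabs(const(ubyte)[] data)
{
    size_t total;
    foreach (line; data.splitter(cast(ubyte) '\n'))
        total += line.count(cast(ubyte) '\t');
    return total;
}
```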

As is the case when dealing with most abstractions, I think that it's mostly a matter of using it where it makes sense to write cleaner code more quickly and then later figuring out the hot spots where you need to optimize better. In many cases, ranges will be pretty much the same as writing loops, and in others, the abstraction is worth the cost. Where it isn't, you don't use them or implement something yourself rather than using the standard function for it, because you can write something faster for your use case.

Just the other day, I refactored some code to not use splitter, because in that particular case, it was costing too much, but there are still tons of cases where I'd use splitter without thinking twice about it, because it's the simplest, fastest way to get the job done, and it's going to be fast enough in most cases.

- Jonathan M Davis

May 21, 2018
On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote:
> On Monday, May 21, 2018 15:00:09 Dennis via Digitalmars-d-learn wrote:
> drop is range-based, so if you give it a string, it's going to decode because of the whole auto-decoding mess with std.range.primitives.front and popFront.

In this case I used drop to drop lines, not characters. The exception was thrown by the joiner, it turns out.

On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote:
>> I find Exceptions in range code hard to interpret.
>
> Well, if you just look at the stack trace, it should tell you. I don't see why ranges would be any worse than any other code except for maybe the fact that it's typical to chain a lot of calls, and you frequently end up with wrapper types in the stack trace that you're not necessarily familiar with.

Exactly that: stack trace full of weird mangled names of template functions, lambdas etc. And because of lazy evaluation and chains of range functions, the line number doesn't easily show who the culprit is.

On Monday, 21 May 2018 at 17:42:19 UTC, Jonathan M Davis wrote:
> In many cases, ranges will be pretty much the same as writing loops, and in others, the abstraction is worth the cost.

From the benchmarking I did, I found that ranges are easily an order of magnitude slower even with compiler optimizations:

https://run.dlang.io/gist/5f243ca5ba80d958c0bc16d5b73f2934?compiler=ldc&args=-O3%20-release

```
LDC -O3 -release
             Range   Procedural
Stringtest: ["267ns", "11ns"]
Numbertest: ["393ns", "153ns"]


DMD -O -inline -release
              Range   Procedural
Stringtest: ["329ns", "8ns"]
Numbertest: ["1237ns", "282ns"]
```

The first range test is an opcode scanner I wrote for an assembler. The range code is very nice and it works, but it needlessly allocates a new string. So I switched to a procedural version, which runs (and compiles) faster. The procedural version did have some bugs initially, though.

The second test is a simple number calculation. I thought the range code would inline to roughly the same procedural code so it could be optimized the same, but a factor-2 gap remains. I don't know where the difficulty is, but I did notice that switching the maximum number from int to enum makes the procedural version take 0 ns (calculated at compile time), while LDC can't deduce the outcome in the range version (which still runs for >300 ns).