Review of Jesse Phillips's CSV Parser (page 5) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » Review of Jesse Phillips's CSV Parser (page 5)

November 05, 2011

Re: Review of Jesse Phillips's CSV Parser

Posted by Jesse Phillips
in reply to dsimcha

Jesse Phillips

Posted in reply to dsimcha

On Fri, 28 Oct 2011 09:18:27 -0400, dsimcha wrote:

> Formal review of Jesse Phillips's CSV parser module begins today and runs through November .  Following that, a vote will take place for one week.  Please post any comments about this library in this thread.
> 
> Docs: http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
> 
> Code:
> https://github.com/he-the-great/phobos/tree/csv

I've done another update this covers:

- Documentation more complete
- A provided heading can be any InputRange of string
- csvText => csvReader
- RecordList => Records
- csvReader no longer takes a Malformed enum, and instead provides
customization to comma and quote.

Tomorrow I will try and complete one or more of the following with this priority:

- Have row/column information for thrown exceptions.
- Support [string][string] Contents.
- Write a CSV formatter. Support for range of struct, range of [string]
[string], and range of string[].

November 07, 2011

Re: Review of Jesse Phillips's CSV Parser

Posted by Jesse Phillips
in reply to Jesse Phillips

Jesse Phillips

Posted in reply to Jesse Phillips

And only one pending request, csvFormatter. I am still considering how I want to approach this. I don't want to rush it in so I don't plan on having it complete before voting starts.

It has been fun and I look forward to the verdict.

On Sat, 05 Nov 2011 23:37:43 +0000, Jesse Phillips wrote:

> On Fri, 28 Oct 2011 09:18:27 -0400, dsimcha wrote:
> 
>> Formal review of Jesse Phillips's CSV parser module begins today and runs through November .  Following that, a vote will take place for one week.  Please post any comments about this library in this thread.
>> 
>> Docs: http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
>> 
>> Code:
>> https://github.com/he-the-great/phobos/tree/csv
> 
> I've done another update this covers:
> 
> - Documentation more complete - A provided heading can be any InputRange of string - csvText => csvReader - RecordList => Records - csvReader no longer takes a Malformed enum, and instead provides customization to comma and quote.
> 
> Tomorrow I will try and complete one or more of the following with this priority:
> 
> - Have row/column information for thrown exceptions.
> - Support [string][string] Contents.
> - Write a CSV formatter. Support for range of struct, range of [string]
> [string], and range of string[].

November 07, 2011

Re: Review of Jesse Phillips's CSV Parser

Posted by dsimcha
in reply to dsimcha

dsimcha

Posted in reply to dsimcha

On 10/28/2011 9:18 AM, dsimcha wrote:
> Formal review of Jesse Phillips's CSV parser module begins today and
> runs through November . Following that, a vote will take place for one
> week. Please post any comments about this library in this thread.
>
> Docs:
> http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
>
> Code:
> https://github.com/he-the-great/phobos/tree/csv

I apologize.  I just noticed that I forgot to put the end date in this message.  The review of the CSV parser runs the standard two weeks, until November 11.

November 08, 2011

Re: Review of Jesse Phillips's CSV Parser

Posted by Robert Jacques
in reply to Jesse Phillips

Robert Jacques

Posted in reply to Jesse Phillips

On Sun, 06 Nov 2011 19:00:42 -0500, Jesse Phillips <jessekphillips+d@gmail.com> wrote:
> And only one pending request, csvFormatter. I am still considering how I
> want to approach this. I don't want to rush it in so I don't plan on
> having it complete before voting starts.

My primary use of csv files is as a method of saving data to later graph / view / format in Excel. i.e.

csvWrite("filename",["x","y","z"],x,y,z);

where x, y, z are ranges with possibly different lengths and/or types.

I also have some API design concerns. Primarily, most of the type names are overly broad and unclear. For example: Record, Records, Malformed could just as easily be part of a database API. And because of this, not only does one reading the code not know what the object is, but if namespace conflicts can occur, then code reviewers will probably make the wrong assumption half the time as to what Record is and not bother looking it up.

Furthermore, Records and Record, are essentially internal ranges/structs returned by csvReader. In the tradition of std.algorithms these should be defined inside the csvReader function (preferably) or as CsvReader / CsvReader.Record.

November 10, 2011

Re: Review of Jesse Phillips's CSV Parser

Posted by Jonathan M Davis
in reply to dsimcha

Jonathan M Davis

Posted in reply to dsimcha

On Friday, October 28, 2011 09:18:27 dsimcha wrote:
> Formal review of Jesse Phillips's CSV parser module begins today and runs through November .  Following that, a vote will take place for one week.  Please post any comments about this library in this thread.
> 
> Docs: http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
> 
> Code:
> https://github.com/he-the-great/phobos/tree/csv

- Nitpick: You should probably proofread the module documentation again. It has several mispellings and grammar errors.

- The functions and types mentioned in the documentation should use LREF so that they're links to their documentation.

- Shouldn't the term "header" be used rather than "heading?" So, you'd have HeaderMismatchException instead of HeadingMismatchException. Usually, the term heading is only used for stuff like the heading of an article. Usually, the term header is used for stuff like files. RFC 4180 uses the term header but never heading. So, unless I'm missing something, I really think that all of the instances of heading (both in the documentation and in the API) should be changed to header.

- I'd argue against Separator in csvReader being char by default. Individual characters should generally be dchar, not char. Using char by itself is almost always bad practice IMHO.

- I'd say that csvReader should say that it returns a Records struct which is an input range rather than saying that it provides std.range.isInputRange. We just don't talk that way about ranges normally. We say that something is a particular range type, not that it provides the API for a particular range type. If you want to make "is an input range" a link to std.range.isInputRange, that's fine, but what's currently there is not how ranges are typically referred to. I'd also suggest that you on the range functions where you say that they're part of the std.range.isInputRange interface, you should use the term API instead of interface, since interface has a specific meaning in the language which has nothing to do with this context.

- Why does csvNextToken take an Appender!(char[])? That seems overly restrictive. I'd argue that it should take an output range of dchar (which Appender!(char[]) is). Then it's not restricted to Appender, and it's not restricted to char[].

- Also, the fact that you have isSomeChar!(ElementType!Range) concerns me somewhat. Normally, I'd expect something more like is(ElementType!Range == dchar). As it is, you could have a range of char or wchar, which could do funny things. Phobos doesn't normally support that, and I'd be concerned about what would happen if someone tried. ElementType!Range will be dchar for all string types, so they'll work fine either way, but programmers who are overly adventurous or ignorant may try and use a char or wchar range, and I'm willing to be that there would be problems with that. Just all around, I'd advise checking that the ranges are ranges of dchar and using dchar for the separators rather than char.

- You should have unit tests testing ranges _other_ than string. The fact that neither wstring nor dstring is tested is worrisome, and ideally, you'd do stuff like call filter!"true" on a string so that you have an input range which isn't a string. Without at least some of that, I'd be very worried that your code can't actually handle generic ranges of dchar and really only works with string.

- You should have some unit tests with unicode values - preferably some which are multiple code units for both char and wchar. For instance, std.string uses \U00010143 in some of its tests, because it's 4 code units in UTF-8 and 2 in UTF-16. The complete lack of testing with unicode is worrisome in the same way that only testing with string is worrisome. It's too easy to screw something up such that it happens to work with string and ASCII but not in general. So, please add tests which involve unicode - not only in the data but in the separators.

- I find it somewhat odd that you would use a pointer to the range in Record rather than using a slice, but I'd have to examine it in more detail to see whether that's actually necessary or not. But in general, I'd argue that a slice should be used rather than a pointer if it can be.

- Jonathan M Davis


P.S.  Just as a note, since you seem to like not using braces if you don't need to for if statements and the like, I though that I'd point out that try and catch don't require braces in D if htey contain only one statement. You have several one line catches which use braces when they don't need to. I'm not arguing that you should or shouldn't change it, but I thought that I would point out that the braces aren't necessary in case you didn't know (since they _are_ necessary in other languages).

November 11, 2011

Re: Review of Jesse Phillips's CSV Parser

Posted by Jesse Phillips
in reply to Robert Jacques

Jesse Phillips

Posted in reply to Robert Jacques

On Mon, 07 Nov 2011 23:28:20 -0500, Robert Jacques wrote:

> On Sun, 06 Nov 2011 19:00:42 -0500, Jesse Phillips <jessekphillips+d@gmail.com> wrote:
>> And only one pending request, csvFormatter. I am still considering how I want to approach this. I don't want to rush it in so I don't plan on having it complete before voting starts.
> 
> My primary use of csv files is as a method of saving data to later graph / view / format in Excel. i.e.
> 
> csvWrite("filename",["x","y","z"],x,y,z);
> 
> where x, y, z are ranges with possibly different lengths and/or types.

I don't want to do anything with files, I'm thinking an output buffer makes the most sense (should be able to have such a buffer write to a file).

The x,y,z seems interesting but what about the poor sole who has a range of

struct {
    auto x,y,z;
}

If I'm going to make a csvFormatter than making sure something like

csvFormatter(csvReader(txt));

works is number one priority. But you have added a style to consider when
building the interface.

> I also have some API design concerns. Primarily, most of the type names are overly broad and unclear. For example: Record, Records, Malformed could just as easily be part of a database API. And because of this, not only does one reading the code not know what the object is, but if namespace conflicts can occur, then code reviewers will probably make the wrong assumption half the time as to what Record is and not bother looking it up.

The reason it is name like what you would find in a database, is because they are the same thing. It is a row of data. In past discussion about existing name clashes it was concluded that was what modules, static, and named imports wore for.

Malformed, I just don't have any clue what to actually call it. No CSVMalformed doesn't help.

But then you go on to the point below, that you shouldn't even see them.

> Furthermore, Records and Record, are essentially internal ranges/structs returned by csvReader. In the tradition of std.algorithms these should be defined inside the csvReader function (preferably) or as CsvReader / CsvReader.Record.

I was going to explain how wrong you are and how I tried having all the options part of csvReader but it just muddies up the clean API when customizing something. But actually I already found the solution with my previous updates and incorrectly chose to move Malformed customization out.

Long story short the problems I see:

* The documentation still needs to exist and will be a little harder to
do without these
* Records contains the heading field, this is abnormal for an InputRange
and means it won't show as an auto link if it becomes a hidden range.
* Currently you can find exactly which function call result in a specific
error, if I hide the ranges you don't.

I'm expecting that introducing CSVFormatter will require another review, at that time I can try making two forms of documentation to see which is preferred.

Thank you! Very useful food for thought.

November 11, 2011

Re: Review of Jesse Phillips's CSV Parser

Posted by Jesse Phillips
in reply to Jonathan M Davis

Jesse Phillips

Posted in reply to Jonathan M Davis

Thank you, lots of stuff to ponder and comment.

On Thu, 10 Nov 2011 02:10:28 -0800, Jonathan M Davis wrote:

> On Friday, October 28, 2011 09:18:27 dsimcha wrote:
>> Formal review of Jesse Phillips's CSV parser module begins today and runs through November .  Following that, a vote will take place for one week.  Please post any comments about this library in this thread.
>> 
>> Docs: http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
>> 
>> Code:
>> https://github.com/he-the-great/phobos/tree/csv
> 
> - Nitpick: You should probably proofread the module documentation again. It has several mispellings and grammar errors.

Err, I'll try, use some of the great tips you gave me :)

> - The functions and types mentioned in the documentation should use LREF so that they're links to their documentation.

I'll see if I can find them all while proofing.

> - Shouldn't the term "header" be used rather than "heading?" So, you'd have HeaderMismatchException instead of HeadingMismatchException. Usually, the term heading is only used for stuff like the heading of an article. Usually, the term header is used for stuff like files. RFC 4180 uses the term header but never heading. So, unless I'm missing something, I really think that all of the instances of heading (both in the documentation and in the API) should be changed to header.

Used both in development, then chose one, guess I got it wrong. Changing.

> - I'd argue against Separator in csvReader being char by default. Individual characters should generally be dchar, not char. Using char by itself is almost always bad practice IMHO.

The only reason I put defaults on it was because it wouldn't choose the
value for it from my default values. Otherwise it works with what you
give it, be it char, wchar, dchar.

> - I'd say that csvReader should say that it returns a Records struct which is an input range rather than saying that it provides std.range.isInputRange. We just don't talk that way about ranges normally. We say that something is a particular range type, not that it provides the API for a particular range type. If you want to make "is an input range" a link to std.range.isInputRange, that's fine, but what's currently there is not how ranges are typically referred to. I'd also suggest that you on the range functions where you say that they're part of the std.range.isInputRange interface, you should use the term API instead of interface, since interface has a specific meaning in the language which has nothing to do with this context.

Thanks.

> - Why does csvNextToken take an Appender!(char[])? That seems overly
> restrictive. I'd argue that it should take an output range of dchar
> (which Appender!(char[]) is). Then it's not restricted to Appender, and
> it's not restricted to char[].

Appender is used so that the exception thrown can contain the consumed
data. Thinking again on this, I suppose I can catch the exception add the
data and rethrow. Sounds pretty good actually.

> - Also, the fact that you have isSomeChar!(ElementType!Range) concerns me somewhat. Normally, I'd expect something more like is(ElementType!Range == dchar). As it is, you could have a range of char or wchar, which could do funny things. Phobos doesn't normally support that, and I'd be concerned about what would happen if someone tried. ElementType!Range will be dchar for all string types, so they'll work fine either way, but programmers who are overly adventurous or ignorant may try and use a char or wchar range, and I'm willing to be that there would be problems with that. Just all around, I'd advise checking that the ranges are ranges of dchar and using dchar for the separators rather than char.

I was of a different mindset of the time. You are probably right I should
prevent someone from eating their leg off.

> - You should have unit tests testing ranges _other_ than string. The fact that neither wstring nor dstring is tested is worrisome, and ideally, you'd do stuff like call filter!"true" on a string so that you have an input range which isn't a string. Without at least some of that, I'd be very worried that your code can't actually handle generic ranges of dchar and really only works with string.

Did add a wchar input range test, but now it should just be an dchar
input range test since wchar ranges won't be allowed.

> - You should have some unit tests with unicode values - preferably some which are multiple code units for both char and wchar. For instance, std.string uses \U00010143 in some of its tests, because it's 4 code units in UTF-8 and 2 in UTF-16. The complete lack of testing with unicode is worrisome in the same way that only testing with string is worrisome. It's too easy to screw something up such that it happens to work with string and ASCII but not in general. So, please add tests which involve unicode - not only in the data but in the separators.

Didn't create any new tests, just added unicode in to input and as
separators into old tests.

> - I find it somewhat odd that you would use a pointer to the range in Record rather than using a slice, but I'd have to examine it in more detail to see whether that's actually necessary or not. But in general, I'd argue that a slice should be used rather than a pointer if it can be.

Record requires modifying the range, for a true input range this is not a problem, but strings and some other struct based ranges are implicit forward ranges and save upon passing.

Slicing doesn't exist on input ranges, though I'd almost argue [] should
take the place of save() in a forward range.

> - Jonathan M Davis
> 
> 
> P.S.  Just as a note, since you seem to like not using braces if you don't need to for if statements and the like, I though that I'd point out that try and catch don't require braces in D if htey contain only one statement. You have several one line catches which use braces when they don't need to. I'm not arguing that you should or shouldn't change it, but I thought that I would point out that the braces aren't necessary in case you didn't know (since they _are_ necessary in other languages).

Thanks, hopefully I'm consistent in that regard and I'll just leave it. I knew of try, but not catch.

November 11, 2011

Re: Review of Jesse Phillips's CSV Parser

Posted by Jonathan M Davis
in reply to Jesse Phillips

Jonathan M Davis

Posted in reply to Jesse Phillips

On Friday, November 11, 2011 03:17:10 Jesse Phillips wrote:
> > I also have some API design concerns. Primarily, most of the type names are overly broad and unclear. For example: Record, Records, Malformed could just as easily be part of a database API. And because of this, not only does one reading the code not know what the object is, but if namespace conflicts can occur, then code reviewers will probably make the wrong assumption half the time as to what Record is and not bother looking it up.
> 
> The reason it is name like what you would find in a database, is because they are the same thing. It is a row of data. In past discussion about existing name clashes it was concluded that was what modules, static, and named imports wore for.

Actually, I'd argue that Row would be better than Record, since it _is_ a row in a table. Personally, I'd find it to be more immediately clear that way. With Record, I have to figure out what the heck it is a record of, whereas Row is immediately obvious in this context.

- Jonathan M Davis

November 11, 2011

Re: Review of Jesse Phillips's CSV Parser

Posted by Don
in reply to Jacob Carlborg

Don

Posted in reply to Jacob Carlborg

On 02.11.2011 09:02, Jacob Carlborg wrote:
> On 2011-11-01 02:28, Jesse Phillips wrote:
>> On Mon, 31 Oct 2011 19:21:02 +0200, Vladimir Panteleev wrote:
>>
>>> On Fri, 28 Oct 2011 16:18:27 +0300, dsimcha<dsimcha@yahoo.com> wrote:
>>>
>>>> Docs:
>>>> http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
>>>
>>> Checked the new docs today. If I'm reading them right, the top example
>>> prints:
>>>
>>> "Fred works as a Fly and earns $4 per year"
>>>
>>> Is this a pop culture reference I'm not catching?
>>
>> No, not that I know of. Should it be? Should I go with a more
>> professional like example?
>>
>>> An idea I had the other day was to include convenience presets for the
>>> more common flavours of CSV formats, e.g.: csvText!CSVFormat_Excel(...)
>>
>> Well if Excel actually conforms to what it claims:
>>
>> http://office.microsoft.com/en-us/excel-help/excel-formatting-and-
>> features-that-are-not-transferred-to-other-file-formats-HP010014105.aspx?
>> CTT=5&origin=HP010099725#BM4
>>
>> Which I doubt it does, then my parser already defaults to being able to
>> read such formats. In my experience Excel sucks at CSV and follows no
>> rules.
>>
>> The implementation I choose as default is the most common and any other
>> style is just a butchafication and likely to be unreliable, usually stems
>> from "my data has commas, so I'll use colon." Which is great until the
>> data also has colon. But maybe that is just my experience.
>
> Excel + CSV == Pain in the ass
>
> Excel uses different value delimiters as the default setting depending
> on the locale of the Excel application.

Far more horrible than that -- in an old Excel from around 2000 the user interface used US settings for reading short dates, if the day of the month was 12 or less. Otherwise it uses the locale. But it always uses the locale for displaying them.

So 11/1, 12/1, 13/1 gets changed to 1 Nov, 1 Dec, 13 Jan.

I wrote a macro called "November 9" (9/11) for swapping them back if they were in the buggy range.

November 11, 2011

Re: Review of Jesse Phillips's CSV Parser

Posted by Jacob Carlborg
in reply to Don

Jacob Carlborg

Posted in reply to Don

On 2011-11-11 12:50, Don wrote:
> Far more horrible than that -- in an old Excel from around 2000 the user
> interface used US settings for reading short dates, if the day of the
> month was 12 or less. Otherwise it uses the locale. But it always uses
> the locale for displaying them.
>
> So 11/1, 12/1, 13/1 gets changed to 1 Nov, 1 Dec, 13 Jan.
>
> I wrote a macro called "November 9" (9/11) for swapping them back if
> they were in the buggy range.

Haha that's just insane.

-- 
/Jacob Carlborg

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation