October 29, 2011
On 10/28/2011 12:18 PM, Walter Bright wrote:
> Formal review of Jesse Phillips's CSV parser module

Walter,
Asking for high precision documentation is fine.
And since you face it ...
What about having such a high precision for D2 language definition ?

std.csv is pretty cool. No doubt, But shouldn't std.csv be based on std.lexer ?

A+
October 29, 2011
Thank you all.

I've made my first round of updates to the documentation, will be taking another pass and looking at adding to functionality. I'd like some help with Walters comments as described below.

On Fri, 28 Oct 2011 12:18:47 -0700, Walter Bright wrote:

> On 10/28/2011 6:18 AM, dsimcha wrote:
>> Formal review of Jesse Phillips's CSV parser module begins today and runs through November . Following that, a vote will take place for one week. Please post any comments about this library in this thread.
>>
>> Docs: http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
>>
>> Code:
>> https://github.com/he-the-great/phobos/tree/csv
> 
> First off, mucho thanks to Jesse for writing this. It's an important module for Phobos.

Thank you. Fun fact, many popular languages don't include a CSV parser, C# and Java being notable ones (or at least I haven't found it).

> This is a review of the documentation:
> 
> Include this link: http://en.wikipedia.org/wiki/Comma-separated_values so the programmer can learn more about CSV.

Done, added See Also section.

> "This parser will loosely follow the RFC-4180"
> 
> How exactly will it differ from RFC-4180?

I do not know how to address this. After making this statement I list the criteria. I don't see the need to list where it differs:

* Records terminate with LF, CR and not just CRLF.
* Using Double Quotes or Comma can be customized.

I feel that I summarized well enough that listing yet another would be
redundant.

> "No exceptions are thrown due to incorrect CSV."
> 
> What happens, then, when the CSV is incorrect?

Added list.

> "csvText"
> 
> Be more specific about what csvText() returns.

You are right. csvText() return a RecordList, but I explain it as though
it were the Range. I'll think about what detail here, any suggestions are
welcome.

> I'm confused about the heading parameter to csvText().

Well I hope some reorganizing and an added example will help.

> "the Content of the input"
>
> What is that? There's no parameter named "Content".

I've replace all instances of Content with Contents. It is a template
parameter so hopefully this resolves the confusion.

> "RecordList"
> 
> "Range which provides access to CSV Records and Fields."
> 
> What is the Range? Do you mean RecordList is a Range? And what does "provides access" mean? If RecordList is a Range, why is it labeled a "List"?

Considering it is the first documentation line of RecordList, I don't see where the confusion on "what the range is."

I've re-written the line, it's shorter.

Ok, be prepared for confusion.

RecordList is a range of records, it is not named RecordRange because Record is also a range, a range of fields. If you have any suggestions for a better name, such as

Document or TabularData

Please let me know what you'd prefer.

> "Array of the heading contained in the file."
> 
> Now I'm really confused as to what a heading is. I thought it was the first line? Now there's an array of them?

The line did not say "headings," but I have change it to:

"Heading from the input in array form."

> No examples given for constructors.

I provide an example of using RecordList. Any example for the constructors would be the same or similar to the 7 other examples.

Is an example needed for both constructors, and maybe an example for
front too?

> No documentation for front() or empty().

These are range primitives, those programming in D already know "returns the current element of the range." I find it more appropriate to describe what is returned by front() in the description of the range (many ranges have become private and have no documentation at all, not even their existence).

I will of course add documentation, but recommendations on non-pointless documentation would be welcome. Or how to handle the different types which can be assigned to Contents.

> "Record"
> 
> "Returned by a RecordList when Contents is a non-struct."
> 
> RecordList is a struct, it does not return anything.

It is a range, are we not allowed to talk of them as though they were
high-level concepts? Do I leave off where you'd obtain a Record (it is
private). Should I say "Returned by an instance of RecordList when
calling front?"

> No examples.

It is private, removing doc comment for constructor.

> "csvNextToken"
> 
> Example?

Added.

Hopefully I addressed most of the documentation. Keep them coming.
October 29, 2011
On Sat, 29 Oct 2011 15:27:22 -0700, bls wrote:

> On 10/28/2011 12:18 PM, Walter Bright wrote:
>> Formal review of Jesse Phillips's CSV parser module
> 
> Walter,
> Asking for high precision documentation is fine.
> And since you face it ...
> What about having such a high precision for D2 language definition ?

You have to start somewhere.

> std.csv is pretty cool. No doubt, But shouldn't std.csv be based on std.lexer ?
> 
> A+

Well, std.lexer doesn't exist. And if it is reasonable to use one, then it could be done without changing the API.

I think the database API stuff will have much more influence on what needs to be done with std.csv, but I think what is currently implemented must remain as is.
October 30, 2011
On 10/29/2011 03:56 PM, Jesse Phillips wrote:
>> std.csv is pretty cool. No doubt, But shouldn't std.csv be based on
>> >  std.lexer ?
>> >
>> >  A+
> Well, std.lexer doesn't exist. And if it is reasonable to use one, then
> it could be done without changing the API.
>
> I think the database API stuff will have much more influence on what
> needs to be done with std.csv, but I think what is currently implemented
> must remain as is.

There is somebody working on std.lexer (a generic lexer) (as far as I remember the project started as std.range stress test )

However, I have no problem to imagine that f.i. the Tokenizer will be part of the std.lexer module. :)

Regarding std.database.. Please elaborate/  ATM I just can imagine that std.csv will be used for Import/Export.



October 30, 2011
On Saturday, October 29, 2011 22:33:30 Jesse Phillips wrote:
> > On 10/28/2011 6:18 AM, dsimcha wrote:
> > First off, mucho thanks to Jesse for writing this. It's an important
> > module for Phobos.
> 
> Thank you. Fun fact, many popular languages don't include a CSV parser, C# and Java being notable ones (or at least I haven't found it).

CSV parser libraries exist for Java (IIRC Apache has one), but there definitely isn't one in Java's standard library. CSV parsers are not the sort of thing that's generally found in a standard library from what I've seen. So, having one will make us abnormal (albeit in a good way).

- Jonathan M Davis
October 30, 2011
On 10/29/2011 3:27 PM, bls wrote:
> What about having such a high precision for D2 language definition ?

Specific suggestions are welcome. Patches are even more welcome.


> std.csv is pretty cool. No doubt, But shouldn't std.csv be based on std.lexer ?

I don't see a similarity.
October 30, 2011
On 10/29/2011 3:33 PM, Jesse Phillips wrote:
>> "This parser will loosely follow the RFC-4180"
>>
>> How exactly will it differ from RFC-4180?
>
> I do not know how to address this. After making this statement I list the
> criteria. I don't see the need to list where it differs:
>
> * Records terminate with LF, CR and not just CRLF.
> * Using Double Quotes or Comma can be customized.
>
> I feel that I summarized well enough that listing yet another would be
> redundant.

I saw the criteria listed, but I didn't see how it was connected with it being different from RFC-4180. Are those criteria the differences? If so, please clarify that it is.


>> "RecordList"
>>
>> "Range which provides access to CSV Records and Fields."
>>
>> What is the Range? Do you mean RecordList is a Range? And what does
>> "provides access" mean? If RecordList is a Range, why is it labeled a
>> "List"?
>
> Considering it is the first documentation line of RecordList, I don't see
> where the confusion on "what the range is."
>
> I've re-written the line, it's shorter.
>
> Ok, be prepared for confusion.
>
> RecordList is a range of records, it is not named RecordRange because
> Record is also a range, a range of fields. If you have any suggestions
> for a better name, such as
>
> Document or TabularData
>
> Please let me know what you'd prefer.

When it says "List" I think of a specific data structure, i.e. linked list. Since it is not, it shouldn't be labeled as one.

How about simply "Records" ?


>> "Array of the heading contained in the file."
>>
>> Now I'm really confused as to what a heading is. I thought it was the
>> first line? Now there's an array of them?
>
> The line did not say "headings," but I have change it to:
>
> "Heading from the input in array form."

An example would make this much, much clearer.

> Is an example needed for both constructors, and maybe an example for
> front too?

I know that when one is intimately familiar with the code, more examples seem redundant. They aren't to the new user!


>> No documentation for front() or empty().
>
> These are range primitives, those programming in D already know "returns
> the current element of the range." I find it more appropriate to describe
> what is returned by front() in the description of the range (many ranges
> have become private and have no documentation at all, not even their
> existence).
>
> I will of course add documentation, but recommendations on non-pointless
> documentation would be welcome. Or how to handle the different types
> which can be assigned to Contents.

I suggest at a minimum that the docs for these say that they are part of the range interface. I want to press Andrei to write up a doc on "What's a Range" and then you can just provide a link to the right spot in that.


>> "Record"
>>
>> "Returned by a RecordList when Contents is a non-struct."
>>
>> RecordList is a struct, it does not return anything.
>
> It is a range, are we not allowed to talk of them as though they were
> high-level concepts? Do I leave off where you'd obtain a Record (it is
> private). Should I say "Returned by an instance of RecordList when
> calling front?"

I think it is very important to be precise in the use of terms in technical documentation. A Range doesn't "return" anything either, though there is a method of Range which does. A range is a set of elements that can be iterated through.
October 30, 2011
On Sat, 29 Oct 2011 17:19:44 -0700, Jonathan M Davis wrote:

> On Saturday, October 29, 2011 22:33:30 Jesse Phillips wrote:
>> > On 10/28/2011 6:18 AM, dsimcha wrote:
>> > First off, mucho thanks to Jesse for writing this. It's an important
>> > module for Phobos.
>> 
>> Thank you. Fun fact, many popular languages don't include a CSV parser, C# and Java being notable ones (or at least I haven't found it).
> 
> CSV parser libraries exist for Java (IIRC Apache has one), but there definitely isn't one in Java's standard library. CSV parsers are not the sort of thing that's generally found in a standard library from what I've seen. So, having one will make us abnormal (albeit in a good way).
> 
> - Jonathan M Davis

That is all I meant by it. A CSV parser does exist for every language, but none come packaged with the language. C# even chose not to include an ini so people would use XML. It really is sad that instead of something solid they go the way of "discouraging."
October 30, 2011
On Sat, 29 Oct 2011 17:05:01 -0700, bls wrote:

> Regarding std.database.. Please elaborate/  ATM I just can imagine that std.csv will be used for Import/Export.

CSV is tabular data, so it would make sense that you may have a similar interface as you would a database.

csv.select("name").where!"a==4"("years")

Or whatever the interface look like.
October 30, 2011
Thank you for the response, only one more follow-up question.

Records
Record

Don't these look easy to confuse? And funny if we didn't have type inference.

Records records = csvText(str);

foreach(Record record; records) ...

Anyone want to complain about this suggestion? Record is private and can only be obtained from Records (is that appropriate phrasing?).