[Issue 17109] std.csv chokes on empty columns when parsing to struct
January 19, 2017
https://issues.dlang.org/show_bug.cgi?id=17109

Jack Stouffer <jack@jackstouffer.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Hardware|x86                         |All
                 OS|Mac OS X                    |All
           Severity|enhancement                 |normal

--
January 19, 2017
https://issues.dlang.org/show_bug.cgi?id=17109

Sophie <meapineapple@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |meapineapple@gmail.com

--- Comment #1 from Sophie <meapineapple@gmail.com> ---
I think this is the correct behavior. The empty string is not valid as a floating point value, nan or otherwise.

--
January 19, 2017
https://issues.dlang.org/show_bug.cgi?id=17109

--- Comment #2 from Jack Stouffer <jack@jackstouffer.com> ---
(In reply to Sophie from comment #1)
> I think this is the correct behavior. The empty string is not valid as a floating point value, nan or otherwise.

This is going to sound dramatic, but not correctly handling missing values in a CSV makes std.csv dead on arrival. The truth of the matter is that missing values in CSVs are the rule and not the exception.

Python's pandas does this correctly, and so should we.

--
January 19, 2017
https://issues.dlang.org/show_bug.cgi?id=17109

--- Comment #3 from Sophie <meapineapple@gmail.com> ---
It's a missing value, but in the case of numeric types a missing value is simply one example of a malformed value. I think the better approach in the code you used as an example would be to not expect the CSV logic to handle malformed floats. Use the CSV parser to extract strings, and then your code should assume responsibility for validating and handling malformed inputs.
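
A minimal sketch of the approach suggested in this comment, assuming a two-column file. `RawRow`, `Row`, and `parseRows` are hypothetical names, and mapping an empty field to NaN is only one possible application-level choice, not something std.csv does:

```d
import std.conv : to;
import std.csv : csvReader;
import std.math : isNaN;

// Hypothetical raw layout: every column is read as a string,
// so an empty field cannot make the CSV reader throw.
struct RawRow
{
    string name;
    string value;
}

// Hypothetical typed layout produced by the application.
struct Row
{
    string name;
    double value;
}

Row[] parseRows(string text)
{
    Row[] rows;
    foreach (raw; csvReader!RawRow(text))
    {
        // Assumed policy: an empty field becomes NaN. A malformed,
        // non-empty field still throws a ConvException here.
        immutable v = raw.value.length == 0 ? double.nan : raw.value.to!double;
        rows ~= Row(raw.name, v);
    }
    return rows;
}

unittest
{
    auto rows = parseRows("a,1.5\nb,\nc,3");
    assert(rows.length == 3);
    assert(rows[0].value == 1.5);
    assert(rows[1].value.isNaN);
    assert(rows[2].value == 3);
}
```

The conversion policy lives entirely in the caller, which is the trade-off under discussion: it is flexible, but every application has to repeat it.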

--
January 19, 2017
https://issues.dlang.org/show_bug.cgi?id=17109

--- Comment #4 from Jack Stouffer <jack@jackstouffer.com> ---
(In reply to Sophie from comment #3)
> It's a missing value, but in the case of numeric types a missing value is simply one example of a malformed value. I think the better approach in the code you used as an example would be to not expect the CSV logic to handle malformed floats. Use the CSV parser to extract strings, and then your code should assume responsibility for validating and handling malformed inputs.

The problem is, if I just get strings, then std.csv is useless because I can just do this

    import std.algorithm : map, splitter;
    import std.stdio : File;
    auto input = File("file.csv");
    auto data = input.byLine.map!(a => splitter(a, ','));

It would be faster too, as I'm just getting slices over byLine's buffer rather than creating a new Tuple. But returning T.init is probably the wrong choice, because int.init == 0, which is indistinguishable from a genuine zero.

Perhaps the replacement logic can be confined to nullable types and types with nan?

--
January 19, 2017
https://issues.dlang.org/show_bug.cgi?id=17109

--- Comment #5 from Sophie <meapineapple@gmail.com> ---
(In reply to Jack Stouffer from comment #4)
> (In reply to Sophie from comment #3)
> The problem is, if I just get strings, then std.csv is useless because I can
> just do this
> 
>     auto input = File("file.csv");
>     auto data = input.byLine.map!(a => splitter(a, ','));

Not that I don't understand your argument, but be aware the example is not at all equivalent to parsing a CSV; it does not handle quoted columns or escaped metacharacters.
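
A small sketch of the difference, using a made-up input line with a quoted field that contains the separator. `naiveSplit` and `csvFields` are hypothetical helper names:

```d
import std.algorithm : splitter;
import std.array : array;
import std.csv : csvReader;

// Splits on every comma, with no knowledge of quoting.
string[] naiveSplit(string line)
{
    return line.splitter(',').array;
}

// Lets std.csv handle the quotes; collects the fields of all records.
string[] csvFields(string line)
{
    string[] fields;
    foreach (record; csvReader!string(line))
        foreach (field; record)
            fields ~= field;
    return fields;
}

unittest
{
    auto line = `"Smith, John",42`;
    // The naive split cuts the quoted field in two and keeps the quotes.
    assert(naiveSplit(line).length == 3);
    // csvReader yields the two intended fields, with the quotes removed.
    assert(csvFields(line) == ["Smith, John", "42"]);
}
```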

Null for nullable types would work, I think. I don't think using nan is ideal, but it's probably the best solution for floats. The issue is that integers have no such capability, and this would be an unusual inconsistency to have between the numeric types.

Perhaps when a type other than string is expected, it will be made nullable if it's not already and null will be returned when the column is blank?
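
A rough sketch of what that could look like at the level of a single field. `parseField` is a hypothetical helper and not part of std.csv; it only illustrates the idea of wrapping the target type in `Nullable`:

```d
import std.conv : to;
import std.typecons : Nullable, nullable;

// Hypothetical helper: a blank column yields a null Nullable!T,
// anything else is converted with std.conv.to (which may throw).
Nullable!T parseField(T)(string field)
{
    if (field.length == 0)
        return Nullable!T.init; // the "missing" state
    return nullable(field.to!T);
}

unittest
{
    assert(parseField!int("").isNull);
    assert(parseField!int("42").get == 42);
    assert(parseField!double("").isNull);
    assert(parseField!double("2.5").get == 2.5);
}
```

This treats integers and floats the same way, which avoids the inconsistency mentioned above, at the cost of changing the field types the caller sees.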

--
January 20, 2017
https://issues.dlang.org/show_bug.cgi?id=17109

Jon Degenhardt <jrdemail2000-dlang@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jrdemail2000-dlang@yahoo.com

--- Comment #6 from Jon Degenhardt <jrdemail2000-dlang@yahoo.com> ---
There is no standard behavior for empty fields in CSV. Pragmatically, it's an application level decision, meaning that the application that generated the CSV chose what the meaning is, and the application reading it needs to respect this. Different higher level packages make their own choices of course.

R, for example, treats empty fields as "NA", meaning "Not Available" or "Missing". This is a numeric value similar to but distinct from NaN (R borrows a bit from floats and integers to do this). See R's read.table documentation. Pandas treats empty as missing, but uses NaN to represent it. See "Working with missing data" in the Pandas documentation (http://pandas.pydata.org/pandas-docs/stable/missing_data.html), and the pandas.read_csv documentation.

The real key though is that most of these CSV readers provide options controlling interpretation. Depending on use case, error, NaN or some other behavior may be warranted. D's CSV reader would benefit from having similar control.
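
As an illustration only, such control could take the shape of a per-call policy. The `EmptyField` enum and `parseDouble` function below are hypothetical and do not exist in std.csv; they sketch the three behaviors named above (error, NaN, or some other value) for a single field:

```d
import std.conv : to;
import std.exception : assertThrown;
import std.math : isNaN;

// Hypothetical policy for empty fields; std.csv offers no such option.
enum EmptyField { error, nan, defaultValue }

// Converts one field to double, applying the chosen policy when it is empty.
double parseDouble(string field, EmptyField policy, double fallback = 0.0)
{
    if (field.length != 0)
        return field.to!double;

    final switch (policy)
    {
        case EmptyField.error:
            throw new Exception("empty field");
        case EmptyField.nan:
            return double.nan;
        case EmptyField.defaultValue:
            return fallback;
    }
}

unittest
{
    assert(parseDouble("2.5", EmptyField.error) == 2.5);
    assert(parseDouble("", EmptyField.nan).isNaN);
    assert(parseDouble("", EmptyField.defaultValue, -1.0) == -1.0);
    assertThrown(parseDouble("", EmptyField.error));
}
```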

--
December 14, 2019
https://issues.dlang.org/show_bug.cgi?id=17109

berni44 <bugzilla@d-ecke.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla@d-ecke.de
           Severity|normal                      |enhancement

--- Comment #7 from berni44 <bugzilla@d-ecke.de> ---
While I understand the wish for having defaults for empty statements, I don't think this is a bug. The current implementation just does not support this. Therefore I changed this to an enhancement request.

--
December 17, 2022
https://issues.dlang.org/show_bug.cgi?id=17109

Iain Buclaw <ibuclaw@gdcproject.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1                          |P4

--