[DCSV-15] Use Reducers/Transducers for better performance & resource handling Created: 15/Sep/16  Updated: 16/Sep/16

Status: Open
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Major
Reporter: Rick Moynihan Assignee: Jonas Enlund
Resolution: Unresolved Votes: 0
Labels: None


 Description   

One problem with the clojure.data.csv library is that it is built on lazy sequences, which can lead to inefficiencies when processing large amounts of data. For example, even before any transformation is done, baseline parsing of 1 GB of CSV takes about 50 seconds on my machine, whereas other parsers available on the JVM can parse the same quantity of data in under 4 seconds.

I'd like to discuss how we might port clojure.data.csv to a reducer/transducer model for improved performance and resource handling. Broadly speaking, I think there are a few options:

1. Implement this as a secondary alternative API in c.d.csv leaving the existing API and implementation as is for legacy users.
2. Replace the API entirely with no attempt at retaining backwards compatibility.
3. Retain the existing public API contracts while reimplementing them underneath in terms of reducers/transducers: use transducers internally, wrapping them with `sequence` to preserve the current parse-csv lazy-seq contract, while also offering a new pure transducer/reducer-based API for users who don't require a lazy-seq-based implementation.

Options 1 and 3 are essentially the same idea, except that with 3 existing users also get the benefit of a faster underlying implementation. There may be other options too.

I think 3, if possible, would be the best option.

Options 1 and 2 each raise the question of either making no attempt at backwards compatibility or making no attempt to improve the experience for legacy users.

Before delving into the details of how a reducer/transducer implementation would work, I'm curious what the core team thinks about exploring this further.



 Comments   
Comment by Jonas Enlund [ 16/Sep/16 2:18 AM ]

Can you share this benchmark? I did some comparisons when I initially wrote the lib and I didn't see such big differences.

I think that the lazy approach is an important feature in many cases where you don't want all those gigabytes in memory.

If we add some non-lazy parsing for performance reasons, I would argue it should be an addition to the public API.

Comment by Rick Moynihan [ 16/Sep/16 8:46 AM ]

I agree not loading data into memory is a huge benefit, but we shouldn't necessarily conflate that streaming property with laziness/eagerness.

By using reducers/transducers you can still stream through a CSV file row by row and consume a constant amount of memory; for example, reducing into a count of rows wouldn't require holding the file in memory, even though it is eager. Likewise, if we exposed a transducer along with a `CollReduce`-able `CSVFile` object, you could request a lazy seq of results with `sequence`, where the parsing itself pays no laziness tax, or you could load the results into memory eagerly by transducing into a vector.
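To make those consumption patterns concrete, here is a rough sketch in which a naive comma-split stands in for the real CSV-parsing transducer this ticket proposes; the file name is a placeholder, and line-seq is used only as a convenient source (the `CollReduce`-able `CSVFile` described above would avoid lazy seqs entirely):

(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Stand-in for a real CSV-parsing transducer
(defn row-xf []
  (map #(str/split % #",")))

;; Eager but constant memory: count rows without retaining them
(with-open [r (io/reader "big.csv")]
  (transduce (row-xf) (completing (fn [n _] (inc n))) 0 (line-seq r)))

;; Eager, loading all parsed rows into memory
(with-open [r (io/reader "big.csv")]
  (into [] (row-xf) (line-seq r)))

;; Lazy consumption remains possible via `sequence`:
;; (sequence (row-xf) (line-seq r))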

Apologies for not providing any benchmark results with this ticket; it was actually Alex Miller who suggested I file it after discussing things briefly with him on Slack, and he suggested that I needn't provide the timings because the costs of laziness are well known. Regardless, I'll tidy up the code I used to take the timings and put it in a gist or something, maybe later today.





[DCSV-14] Double quote at beginning of cell throws exception Created: 04/Aug/16  Updated: 04/Aug/16

Status: Open
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Dave Kincaid Assignee: Jonas Enlund
Resolution: Unresolved Votes: 0
Labels: None


 Description   

If a cell begins with a doubled double quote (an escaped quotation mark) then an exception is thrown.

For example:
(csv/read-csv "this,\"\"that\"\",the other")

produces:

Exception CSV error (unexpected character: t) clojure.data.csv/read-quoted-cell (csv.clj:36)

but this:
(csv/read-csv "this, \"\"that\"\",the other")

produces this correct output:
(["this" " \"\"that\"\"" "the other"])
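For comparison, the fully quoted RFC 4180 encoding of the same cell (enclose the field in quotes and double the embedded quotes) should parse cleanly; a quick sketch, not verified here against the library:

(csv/read-csv "this,\"\"\"that\"\"\",the other")

which should yield (["this" "\"that\"" "the other"]).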






[DCSV-13] Port data.csv to clojurescript Created: 11/Apr/16  Updated: 11/Apr/16

Status: Open
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Major
Reporter: Erik Assum Assignee: Jonas Enlund
Resolution: Unresolved Votes: 0
Labels: None

Attachments: Text File 0001-port-data.csv-to-clojurescript.patch     Text File 0002-port-data.csv-to-clojurescript.patch     Text File 0003-port-data.csv-to-clojurescript.patch    

 Description   

Make data.csv available for clojurescript users.



 Comments   
Comment by Jonas Enlund [ 11/Apr/16 12:52 PM ]

I'm unable to run the Clojure tests via the command `mvn clojure:test`, which I think the CI server uses. I'm getting the following exception:

Exception in thread "main" java.io.FileNotFoundException: Could not locate clojure/data/test_runner__init.class or clojure/data/test_runner.clj on classpath:
	at clojure.lang.RT.load(RT.java:432)
	at clojure.lang.RT.load(RT.java:400)
	at clojure.core$load$fn__4890.invoke(core.clj:5415)
	at clojure.core$load.doInvoke(core.clj:5414)
	at clojure.lang.RestFn.invoke(RestFn.java:408)
	at clojure.core$load_one.invoke(core.clj:5227)
	at clojure.core$load_lib.doInvoke(core.clj:5264)
	at clojure.lang.RestFn.applyTo(RestFn.java:142)
	at clojure.core$apply.invoke(core.clj:603)
	at clojure.core$load_libs.doInvoke(core.clj:5298)
	at clojure.lang.RestFn.applyTo(RestFn.java:137)
	at clojure.core$apply.invoke(core.clj:603)
	at clojure.core$require.doInvoke(core.clj:5381)
	at clojure.lang.RestFn.invoke(RestFn.java:408)
	at user$eval1.invoke(run-test3909933917568395357.clj:1)
	at clojure.lang.Compiler.eval(Compiler.java:6511)
	at clojure.lang.Compiler.load(Compiler.java:6952)
	at clojure.lang.Compiler.loadFile(Compiler.java:6912)
	at clojure.main$load_script.invoke(main.clj:283)
	at clojure.main$script_opt.invoke(main.clj:343)
	at clojure.main$main.doInvoke(main.clj:427)
	at clojure.lang.RestFn.invoke(RestFn.java:408)
	at clojure.lang.Var.invoke(Var.java:415)
	at clojure.lang.AFn.applyToHelper(AFn.java:161)
	at clojure.lang.Var.applyTo(Var.java:532)
	at clojure.main.main(main.java:37)
Comment by Erik Assum [ 11/Apr/16 12:58 PM ]

Bummer
https://groups.google.com/forum/m/#!topic/clojure-dev/PDyOklDEv7Y

Comment by Jonas Enlund [ 11/Apr/16 2:58 PM ]

Can we resolve the reflection warnings in patch 0002?

$ rlwrap mvn clojure:repl
...
Clojure 1.8.0
user=> (set! *warn-on-reflection* true)
true
user=> (require '[clojure.data.csv :as csv])
Reflection warning, clojure/data/csv.cljc:62:8 - call to method unread on java.io.PushbackReader can't be resolved (argument types: unknown).
Reflection warning, clojure/data/csv.cljc:91:8 - call to method write on java.io.Writer can't be resolved (argument types: unknown).
nil
user=>
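For reference, reflection warnings like these are usually resolved with a type hint on the target object plus an explicit coercion of the argument; a generic sketch of the technique (illustrative helpers, not the actual 0003 patch):

(defn- unread-ch [^java.io.PushbackReader rdr ch]
  (.unread rdr (int ch)))

(defn- write-ch [^java.io.Writer w ch]
  (.write w (int ch)))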
Comment by Erik Assum [ 11/Apr/16 3:13 PM ]

Fixed in the 0003 patch.





[DCSV-12] Add project.clj for easier local development Created: 07/Apr/16  Updated: 07/Apr/16

Status: Open
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Major
Reporter: Erik Assum Assignee: Erik Assum
Resolution: Unresolved Votes: 0
Labels: None

Attachments: Text File 0001-Add-project.clj-to-make-local-development-easier.patch    
Patch: Code




[DCSV-11] Add convenience function to data.csv for returning CSV data as a string. Created: 25/Feb/16  Updated: 25/Feb/16  Resolved: 25/Feb/16

Status: Closed
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Trivial
Reporter: Steven Degutis Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: enhancement

Attachments: Text File 0001-Adding-convenience-function.patch    
Patch: Code and Test

 Description   

A common use case for CSV is to obtain a string containing the data in CSV format. Currently this requires the user to write three lines instead of one, involving a temporary StringWriter. This patch would add that three-line convenience function to the library itself.
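For context, a minimal sketch of the kind of wrapper being proposed (the name write-csv-str is illustrative, not part of the library):

(defn write-csv-str [data & options]
  (let [w (java.io.StringWriter.)]
    (apply clojure.data.csv/write-csv w data options)
    (str w)))

;; e.g. (write-csv-str [["foo" "bar"] ["baz" "quux"]])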



 Comments   
Comment by Ghadi Shayban [ 25/Feb/16 1:13 PM ]

This ticket is in the wrong project. URL is /DCSV

Comment by Steven Degutis [ 25/Feb/16 1:21 PM ]

Ahh, my mistake. Alex moved it for me. Thanks both of you. (Can't figure out how to @mention either of you in JIRA, sorry.)

Comment by Jonas Enlund [ 25/Feb/16 1:38 PM ]

I'm not sure such a convenience function is needed as you can simply do

(with-out-str (csv/write-csv *out* [["foo" "bar"]["baz" "quux"]]))
Comment by Steven Degutis [ 25/Feb/16 1:48 PM ]

@Jonas That does work, but this idea came up in a discussion in the #clojure IRC channel, where it was agreed that it would be nice to have a small canonical convenience function for this, instead of everyone maintaining their own nearly identical wrapper.

Comment by Steven Degutis [ 25/Feb/16 1:50 PM ]

It's been further discussed as being unnecessary and encouraging bad code practices.





[DCSV-10] Specify RFC4180 compatibility in README Created: 18/Mar/15  Updated: 19/Mar/15

Status: Open
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major
Reporter: Leon Grapenthin Assignee: Jonas Enlund
Resolution: Unresolved Votes: 0
Labels: documentation


 Description   

In the README it says: "Follows the RFC4180 specification but is more relaxed."
This is an oxymoron and confusing in other regards. E.g.:

  • What does "relaxed" mean?
  • If it is more "relaxed" than the specification, how can it follow it?
  • Does it follow the specification, or only parts of it?

Problem: If I use this lib to generate CSV for a third party, can I say "this is RFC4180-conformant CSV" and feel safe with it? Or should I add "but it is more relaxed"?

The task could be to add a more specific explanation or, if necessary, a comparison table.



 Comments   
Comment by Jonas Enlund [ 18/Mar/15 10:54 AM ]

"Relaxed" means it will read some files that do not adhere to the RFC4180 spec. Files written with write-csv will follow the spec; if that is not the case it should be considered a bug.

Comment by Leon Grapenthin [ 19/Mar/15 5:13 AM ]

Thanks for the explanation.
Then it should be pointed out in which respects read CSVs don't need to adhere to the spec, whether a strict mode exists or is planned, and whether such a mode is or would be more or less performant.

P.S.: Out of curiosity - Is this definition of relaxed some kind of standard in IT? I googled for it, but couldn't find anything related.

Comment by Jonas Enlund [ 19/Mar/15 5:33 AM ]

According to the RFC4180 spec:

  • lines should end with CRLF; this library also accepts a bare LF
  • cells should be separated by commas; this library also supports other separators

I don't think "relaxed" is a standard term. I would certainly accept a patch that enhances the documentation.





[DCSV-9] write-csv and quote? predicate Created: 21/Oct/14  Updated: 21/Oct/14

Status: Open
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Minor
Reporter: Frank Stebich Assignee: Jonas Enlund
Resolution: Unresolved Votes: 0
Labels: None


 Description   

In version 0.1.2 the quote? predicate is called after the object to be written into a cell has been converted into a string (see line 99). If quote? were instead applied to the original object, write-csv could be called as follows:

(with-open [writer (clojure.java.io/writer "test.csv")]
  (write-csv writer
             [[1 "text"]
              [2 "text"]]
             :quote? string?))

In the current version every cell value is already a string by the time the predicate runs, so a predicate like string? matches every cell.
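A rough sketch of the proposed change (illustrative code, not the library's actual internals): apply quote? to the original value before converting it to a string, so that predicates such as string? become meaningful.

(require '[clojure.string :as str])

(defn- write-cell [^java.io.Writer writer value quote-char quote?]
  (let [s (str value)
        q (quote? value)]  ; proposed: the predicate sees the raw value
    (when q (.write writer (int quote-char)))
    (.write writer (if q
                     (str/escape s {quote-char (str quote-char quote-char)})
                     s))
    (when q (.write writer (int quote-char)))))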






[DCSV-8] Allow read-csv to read files without quoting. Created: 29/May/14  Updated: 29/May/14

Status: Open
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Paul Schulz Assignee: Jonas Enlund
Resolution: Unresolved Votes: 0
Labels: None


 Description   

I would like to be able to read files with the following format:

  • '|' separated
  • Unquoted, e.g. \" can appear in the strings, in particular
    at the beginning, but not at the end.

I need to be able to set a nil (no-op) quote character, but this doesn't currently work.
The following is a workaround, where '.' is unlikely to appear as the first
character of the string.

(csv/read-csv in-file :separator \| :quote \.)

I would like to be able to be explicit:

(csv/read-csv in-file :separator \| :quote nil)






[DCSV-7] data.csv does not handle BOMs Created: 12/Aug/13  Updated: 12/Aug/13

Status: Open
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: John Walker Assignee: Jonas Enlund
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Usually Windows (but also Linux)



 Description   

Sometimes BOMs are prepended to files in Microsoft Land. Data.csv does not handle this edge case, which causes the first field in the header of a csv file to be incorrect. This can be hard to detect, since \ufeff is usually invisible.

http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html
http://www.fileformat.info/info/unicode/char/feff/index.htm



 Comments   
Comment by Jonas Enlund [ 12/Aug/13 11:46 PM ]

This isn't really a CSV-specific problem. I've encountered files with a byte order mark, and I have simply executed (.skip reader 1) before handing the reader over to read-csv. Is this not a good enough solution?
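The .skip approach works when you know a BOM is present; here is a small sketch of a variant that only drops the BOM when it is actually there (the file name and UTF-8 encoding are assumptions):

(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv])

(defn read-csv-skipping-bom [file]
  (with-open [r (java.io.PushbackReader. (io/reader file :encoding "UTF-8"))]
    (let [c (.read r)]
      ;; push the first char back unless it is the BOM (\ufeff)
      (when-not (or (= c -1) (= c 0xFEFF))
        (.unread r c))
      (doall (csv/read-csv r)))))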





[DCSV-6] read-csv cannot handle whitespace at end of line Created: 24/May/13  Updated: 29/May/14

Status: Open
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Cees van Kemenade Assignee: Jonas Enlund
Resolution: Unresolved Votes: 0
Labels: None


 Description   

When whitespace is present after the closing \" read-csv throws a confusing error.
It took me some time to notice it was a whitespace issue, as whitespace is, of course, not visible.

See an example of the error below.

=> (read-csv (java.io.StringReader. "\"a\" " ))
Exception CSV error (unexpected character: ) clojure.data.csv/read-quoted-cell (csv.clj:36)
=> (read-csv (java.io.StringReader. "\"a\"" ))
(["a"])



 Comments   
Comment by Cees van Kemenade [ 24/May/13 4:35 PM ]

To take the issue a little further, the same holds for whitespace in the middle of a line, between the closing quote and the separator; see:
=> (read-csv (java.io.StringReader. "\"a\" , 5\n \"b,b\",\"6\"" ))
Exception CSV error (unexpected character: ) clojure.data.csv/read-quoted-cell (csv.clj:36)

This raises the question of what happens if you put a space between the separator and the opening quote (first the default case):
=> (read-csv (java.io.StringReader. "\"a\", 5\n\"b\",\"6\"" ))
(["a" " 5"] ["b" "6"])

Now adding one additional space:
=> (read-csv (java.io.StringReader. "\"a\", 5\n \"b\",\"6\"" ))
(["a" " 5"] [" \"b\"" "6"])

Interesting: the whitespace is considered to be the start of the cell, and the quote that follows is treated as part of the text value that is read.
The main reason for using quotes is to allow separators inside text, so let us see what happens if we extend the string by putting a separator in it.
=> (read-csv (java.io.StringReader. "\"a\", 5\n \"b,b\",\"6\"" ))
(["a" " 5"] [" \"b" "b\"" "6"])

Now we see that the separator is no longer treated as quoted and, as expected, the line is interpreted to contain three values instead of two.

When using standard libraries the issues mentioned above usually do not appear. However, in custom code that emits CSV files, or when making small manual fixes to a CSV, it is easy to introduce such an issue, and it is then quite tough to analyse correctly.
Therefore I would opt for a mode of operation where whitespace before an opening quote or after a closing quote is ignored (unless it is an escaped quote like "").

Comment by Paul Schulz [ 29/May/14 9:29 AM ]

This is related to DCSV-8.

A quote at the beginning of the string that ends in the middle of the string (e.g. where additional characters appear after the second quote) will cause the same problem.





[DCSV-5] No option for parsing into maps Created: 21/May/13  Updated: 24/May/13

Status: Open
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Major
Reporter: Gary Fredericks Assignee: Jonas Enlund
Resolution: Unresolved Votes: 0
Labels: None


 Description   

I imagine a very common use case for parsing CSVs is to get the output as a sequence of maps. I'm happy to provide a patch for this but wanted to make sure I had the right design.

My initial idea is to add another option to read-csv with the name :headers which can be a sequence of values, or a flag such as :first-row. Presumably though we ought to also support using the first row as keywords rather than strings, so I'm not sure whether that ought to be another option or a different flag (e.g., :first-row-keywords).



 Comments   
Comment by Jonas Enlund [ 21/May/13 1:28 PM ]

I've seen this feature request before so I think that something like this should be added. One approach would be to provide a helper function:

(defn csv-data->maps [vecs]
  (map zipmap (repeat (first vecs)) (rest vecs)))

(csv-data->maps (read-csv reader))
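A small variation on the helper above (not part of the library) that also covers the keyword-headers idea from the description, lower-casing the header strings and turning them into keywords:

(require '[clojure.string :as str])

(defn csv-data->keyword-maps [vecs]
  (let [ks (map (comp keyword str/lower-case) (first vecs))]
    (map #(zipmap ks %) (rest vecs))))

(csv-data->keyword-maps (read-csv reader))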
Comment by Cees van Kemenade [ 24/May/13 12:41 PM ]

I've run into the same question and prepared a small library to do my CSV processing.
It uses data.csv as a workhorse, but puts some additional functionality on top of it, such as:
1. csv-to-map: does the same as the code above, but also maps the strings in the first line to keywords. Furthermore, you can choose to translate the keys to lowercase, which is often needed when submitting the CSV data to a database
2. csv-columnMap: selects a subset of columns and renames them (i.e. rewrites the first line of the CSV data)
3. read-csv: my primary entry point, using data.csv + csv-to-map + csv-columnMap
4. read-csv-lazy: a lazy variant which takes a processing function to be used in the inner loop (to allow large CSV datasets)
5. read-csv-to-db: pumps a CSV into a database
6. map-seq-to-csv: maps a uniform sequence of hashmaps to a dataset that can be written to a CSV (the first line contains the keys)

Feel free to reuse parts of the code. You can find the code here:

https://github.com/cvkem/vinzi.tools/blob/master/vinzi.tools/src/main/clojure/vinzi/tools/vCsv.clj





[DCSV-4] \return as record separator with unquoted fields is read as part of the field Created: 24/Oct/12  Updated: 10/Aug/15  Resolved: 10/Aug/15

Status: Resolved
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: John Hume Assignee: Jonas Enlund
Resolution: Completed Votes: 1
Labels: None

Attachments: Text File return-record-separator.patch    

 Description   

This regards the gray area of being "more forgiving." If I understand RFC 4180 correctly, I want to suggest substituting one bit of forgiveness for another: rather than supporting unquoted, multi-line cell values, I suggest supporting CSVs with just \return as the record-separator. Would you accept a patch for that?

A file with \return as record-separator is interpreted by read-csv as a single row like (["Header1" "Header2\rval1" "val2"]). I believe the RFC only allows fields to contain CR and LF when they're escaped (i.e., surrounded in double quotes). See the ABNF at the end of section 2.

As far as implementation, I believe this would require wrapping any Reader w/o markSupported in one that does, so that the LF following a CR can be consumed when present.
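A rough sketch of that idea, assuming a pushback-capable reader as the eventual patch uses (illustrative name, not the patch itself): after a CR has been consumed, read the next character and push it back unless it is a LF.

(defn- consume-lf-after-cr [^java.io.PushbackReader r]
  (let [c (.read r)]
    (when-not (or (= c -1) (= c (int \newline)))
      (.unread r c))))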

[I've classified this as a major defect because I ran into a \return-delimited file as soon as I passed a CSV from a Linux machine to a Windows machine, so I'm guessing these files are common. Feel free to reclassify.]



 Comments   
Comment by Jonas Enlund [ 24/Oct/12 3:00 PM ]

> rather than supporting unquoted, multi-line cell values, I suggest supporting CSVs with just \return as the record-separator. Would you accept a patch for that?

Sounds good to me.

> As far as implementation, I believe this would require wrapping any Reader w/o markSupported in one that does

I think that's ok, since BufferedReader supports it.

Comment by Paul Stadig [ 10/Aug/15 7:52 AM ]

A patch by myself (Paul Stadig) and Nate Young. We both have CAs on file.

This patch will wrap any reader in a PushbackReader.

When parsing a CSV file, a single return character (ASCII 13) will be treated as a record separator.

We ran into this issue in production. Apparently on OS X, if you export an Excel spreadsheet as CSV it will use a bare return as the record separator; however, if you export it as a "Windows CSV" it will use CRLF. This is a bit too subtle for some users, and it would be preferable to be more flexible when parsing record separators.

Comment by Jonas Enlund [ 10/Aug/15 12:12 PM ]

I released 0.1.3 with this fix. Thanks for the patch!





[DCSV-3] Some minor documentation typos Created: 14/Jun/12  Updated: 15/Jun/12  Resolved: 15/Jun/12

Status: Resolved
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Trivial
Reporter: Trent Ogren Assignee: Jonas Enlund
Resolution: Completed Votes: 0
Labels: docs, documentation, typo

Attachments: Text File 0001-Documentation-typo-fixes.patch    
Patch: Code

 Description   

I found a couple minor typos: one in the README, one in a docstring. I've included a patch.






[DCSV-2] \return characters do not trigger value quoting Created: 10/Feb/12  Updated: 14/Feb/12  Resolved: 13/Feb/12

Status: Resolved
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Giorgio Valoti Assignee: Jonas Enlund
Resolution: Completed Votes: 0
Labels: None
Environment:

Apache Maven 3.0.3 (r1075438; 2011-02-28 18:31:09+0100)
Maven home: /usr/share/maven
Java version: 1.6.0_29, vendor: Apple Inc.
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Default locale: it_IT, platform encoding: MacRoman
OS name: "mac os x", version: "10.7.2", arch: "x86_64", family: "mac"


Attachments: File csv.clj     File csv_test.clj    
Patch: Code and Test

 Description   

If a value contains \return characters it is not quoted when written out. A possible patch is attached.
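For illustration, with the fix in place a value containing a carriage return should be written quoted; a quick sketch (behaviour assumed from the ticket description, not taken from a test run):

(csv/write-csv *out* [["a\rb" "c"]])
;; the first cell should now be emitted quoted, since it contains \return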



 Comments   
Comment by Jonas Enlund [ 13/Feb/12 11:16 PM ]

This is fixed in version 0.1.1. I couldn't accept your patch though, as I didn't find you on the contributor list at http://clojure.org/contributing

Comment by Giorgio Valoti [ 14/Feb/12 12:36 AM ]

Oh, sorry about that. I'd completely forgotten about it because of the problems with JIRA. Glad to hear it was useful, anyway.

BTW, why didn't I receive notifications from JIRA when the tickets were closed? Should I "watch" it?





[DCSV-1] pom.xml directives Created: 10/Feb/12  Updated: 26/Jul/13

Status: Open
Project: data.csv
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Minor
Reporter: Giorgio Valoti Assignee: Jonas Enlund
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Apache Maven 3.0.3 (r1075438; 2011-02-28 18:31:09+0100)
Maven home: /usr/share/maven
Java version: 1.6.0_29, vendor: Apple Inc.
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Default locale: it_IT, platform encoding: MacRoman
OS name: "mac os x", version: "10.7.2", arch: "x86_64", family: "mac"


Attachments: XML File pom.xml    
Patch: Code and Test

 Description   

If you build data.csv alone with the current pom.xml you get a couple of warnings and the tests are not executed. With recent versions of Maven, these warnings can break the build.

A fixed (I hope!) version is attached.





