July 09, 2019 Big Data Ecosystem
Cheers, everybody!

I was wondering what is the current state of affairs of the D ecosystem with respect to Big Data: are there any libraries out there? If so, which?

Thank you,
Edi
July 09, 2019 Re: Big Data Ecosystem
Posted in reply to Eduard Staniloiu

On Tuesday, 9 July 2019 at 16:58:56 UTC, Eduard Staniloiu wrote:
> Cheers, everybody!
>
> I was wondering what is the current state of affairs of the D ecosystem with respect to Big Data: are there any libraries out there? If so, which?
>
> Thank you,
> Edi

Big Data is a broad topic :). You can approach it with specific software like Spark or Kafka, with cloud storage services like AWS S3, or with well-known databases like Postgres.

For Kafka there is a Deimos binding for librdkafka available here: https://github.com/DlangApache/librdkafka. There is also a native implementation for D, but it is unfortunately no longer maintained: https://github.com/tamediadigital/kafka-d.

For AWS services, I prefer the AWS client executable. It accepts JSON input and also outputs JSON. From the official AWS service metadata files (https://github.com/aws/aws-sdk-js/tree/master/apis) you can easily create D structs and classes. It almost feels like the real AWS SDK available e.g. for Python, Java, and C++. For AWS S3 there is also a native D implementation based on vibe.d.

For Postgres you can e.g. use this great library: https://github.com/adamdruppe/arsd/blob/master/postgres.d.

One way or another, Big Data scenarios need HTTP clients and servers. Here too, the ARSD library has some lightweight components. The current GSoC project on dataframes is also an important part of Big Data.

What I currently really miss is the possibility to read/write Parquet files.

Kind regards
Andre
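The CLI-driven approach Andre describes (pipe JSON into the executable, parse the JSON it prints) can be sketched in Python; the same pattern applies from D via std.process. The helper name is made up for illustration, and `cat` stands in for the real AWS CLI so the sketch runs without AWS credentials.

```python
import json
import subprocess

def run_json_cli(argv, payload):
    """Send `payload` as JSON on stdin to the command `argv` and
    parse its stdout as JSON. With the real AWS CLI, argv would be
    something like ["aws", "s3api", "list-objects-v2",
    "--cli-input-json", "file:///dev/stdin"]."""
    proc = subprocess.run(
        argv,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)

# Demo with `cat` as a stand-in command: it echoes the JSON back.
result = run_json_cli(["cat"], {"Bucket": "my-bucket", "MaxKeys": 10})
print(result["Bucket"])
```

Because both sides of the pipe are plain JSON text, the calling language needs no SDK at all, which is exactly what makes this approach attractive from D.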
July 10, 2019 Re: Big Data Ecosystem
Posted in reply to Eduard Staniloiu

On Tuesday, 9 July 2019 at 16:58:56 UTC, Eduard Staniloiu wrote:
> Cheers, everybody!
>
> I was wondering what is the current state of affairs of the D ecosystem with respect to Big Data: are there any libraries out there? If so, which?
>
> Thank you,
> Edi
Dear,

To be fair, if you need something ready to use, go to Scala and Java through Spark, Deeplearning4j, and others. Otherwise, you are welcome to demonstrate to the world the power of D in this field.

Best regards
July 11, 2019 Re: Big Data Ecosystem
Posted in reply to bioinfornatics

On Wednesday, 10 July 2019 at 21:56:19 UTC, bioinfornatics wrote:
> On Tuesday, 9 July 2019 at 16:58:56 UTC, Eduard Staniloiu wrote:
>> Cheers, everybody!
>>
>> I was wondering what is the current state of affairs of the D ecosystem with respect to Big Data: are there any libraries out there? If so, which?
>>
>> Thank you,
>> Edi
>
>
> Dear,
> To be fair, if you need something ready to use, go to Scala and Java through Spark, Deeplearning4j, and others.
In my experience, the performance of Spark in particular leaves
much to be desired when you don't have a large Hadoop cluster.
July 11, 2019 Re: Big Data Ecosystem
Posted in reply to Andre Pany

On Tuesday, 9 July 2019 at 21:16:03 UTC, Andre Pany wrote:
> What I currently really miss is the possibility to read/write Parquet files.
For the record, this *is* something that can be done because there are R packages (like sparklyr) that do it, and that means you can do it from D as well. Now maybe you mean you want an interface written in D, but the functionality is nonetheless easily available to D programs. I've never worked with Parquet files so I can't comment on the details.
July 11, 2019 Re: Big Data Ecosystem
Posted in reply to bachmeier

On Thursday, 11 July 2019 at 18:12:15 UTC, bachmeier wrote:
> On Tuesday, 9 July 2019 at 21:16:03 UTC, Andre Pany wrote:
>
>> What I currently really miss is the possibility to read/write Parquet files.
>
> For the record, this *is* something that can be done because there are R packages (like sparklyr) that do it, and that means you can do it from D as well. Now maybe you mean you want an interface written in D, but the functionality is nonetheless easily available to D programs. I've never worked with Parquet files so I can't comment on the details.

In something like two minutes of googling, I found that Apache Arrow [1] has C bindings [2] for Parquet's C++ read/write utilities. I know nothing about Parquet files, but I imagine this would be faster than calling the R packages.

[1] https://github.com/apache/arrow
[2] https://github.com/apache/arrow/tree/master/c_glib/parquet-glib
July 11, 2019 Re: Big Data Ecosystem
Posted in reply to jmh530

On Thursday, 11 July 2019 at 20:00:19 UTC, jmh530 wrote:
> On Thursday, 11 July 2019 at 18:12:15 UTC, bachmeier wrote:
>> On Tuesday, 9 July 2019 at 21:16:03 UTC, Andre Pany wrote:
>>
>>> What I currently really miss is the possibility to read/write Parquet files.
>>
>> For the record, this *is* something that can be done because there are R packages (like sparklyr) that do it, and that means you can do it from D as well. Now maybe you mean you want an interface written in D, but the functionality is nonetheless easily available to D programs. I've never worked with Parquet files so I can't comment on the details.
>
> In something like two minutes of googling, I found that Apache Arrow [1] has C bindings [2] for parquet's C++ read/write utilities. I know nothing about Parquet files, but I imagine this would be faster than calling the R packages.
>
> [1] https://github.com/apache/arrow
> [2] https://github.com/apache/arrow/tree/master/c_glib/parquet-glib
Thanks. The benefit of Parquet in contrast to e.g. HDF5 is the file size: a 500 MB CSV has a size of 300 MB as HDF5 and 180 MB as Parquet. The file size is important when you need to read from and write to e.g. AWS S3.
Kind regards
Andre
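The size argument is easy to reproduce in spirit with nothing but the standard library: repetitive tabular data compresses dramatically, which is why a compressed columnar format like Parquet beats raw CSV for S3 transfer. The sketch below uses gzip as a stand-in (Parquet itself needs a third-party library), and the generated data and sizes are illustrative, not Andre's measurements.

```python
import gzip

# Fake a repetitive CSV, the kind of data Parquet compresses well.
rows = ["id,category,value"]
rows += [f"{i},sensor_a,{i % 10}" for i in range(10_000)]
csv_bytes = "\n".join(rows).encode()

compressed = gzip.compress(csv_bytes)

# The compressed copy is a small fraction of the original size,
# which directly cuts S3 upload/download time.
print(len(csv_bytes), len(compressed))
```

Parquet does better still in practice because it stores each column contiguously and applies per-column encodings before compression.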
July 12, 2019 Re: Big Data Ecosystem
Posted in reply to Eduard Staniloiu

On Tuesday, 9 July 2019 at 16:58:56 UTC, Eduard Staniloiu wrote:
> Cheers, everybody!
>
> I was wondering what is the current state of affairs of the D ecosystem with respect to Big Data: are there any libraries out there? If so, which?
>
> Thank you,
> Edi
Weka.io, of course, has the world's fastest file system, and I understand ML at scale is one hot market for them. It's simple to get going with, from what I saw, and it's not expensive in the scheme of things. I don't really understand myself why you would use the cloud in many cases, but it does work in the cloud if you want.

I guess you know mir and Lubeck. There's LDA tucked away in there in case you need it.

James Thompson's lightning talk was quite interesting: sometimes doing things efficiently can reduce the need for all the complexity of some of the standard approaches.

I don't know if you consider Postgres part of big data solutions, but with TimescaleDB, maybe. You can quite easily write Foreign Data Wrappers in D to integrate with other data sources, and you can also write server-side functions. I have done maybe half the work for that but haven't had time to finish yet. DPP more or less works for the Postgres headers.

Joyent has an interesting approach to working on big data the UNIX way. They have an object store called Manta that allows you to run code on the same node as the data (stored using ZFS). One could do something similar in D. I wanted to get comfortable with SmartOS, but I don't think it's ready for us today. However, one could do something similar home-rolled with ZFS and Linux containers. I wrapped libzfscore and LXD (alpha quality right now). Not sure if I pushed the latest versions to GitHub yet.

For syncing stuff across a WAN between regions, TCP doesn't have great throughput. You can either strap together a bunch of connections or use something on top of UDP to make it reliable. We found UDT-D gave us 300x faster file transfers between London and HK. It's up on GitHub, though the code isn't very polished.
Copyright © 1999-2021 by the D Language Foundation