
Often in NLP projects the data points contain both text and float embeddings, which is tricky to deal with. CSVs take up a lot of memory and are slow to load, but most other data formats seem to be meant for either pure text or pure numerical data.

There are formats that can handle both data types, but they are generally not flexible for wrangling. For example, with pickle you have to load the entire file into memory if you want to wrangle anything; you can't just append directly to disk the way you can with HDF5, which is very helpful for huge datasets that cannot all be loaded into memory.
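To illustrate what I mean by appending directly to disk, here is a rough sketch of appending mixed text/float rows to an HDF5 file with pandas' HDFStore (the file name and column layout are made up, and it needs the PyTables package installed):

```python
# Sketch: append batches of text + embedding columns to an HDF5 file
# without loading what is already on disk. File name and columns are
# hypothetical placeholders.
import numpy as np
import pandas as pd

store = pd.HDFStore("corpus.h5")  # hypothetical file name

for batch in range(3):  # stand-in for batches produced by a larger pipeline
    df = pd.DataFrame({
        "text": [f"sentence {batch}-{i}" for i in range(1000)],
        "embedding_0": np.random.rand(1000),
        "embedding_1": np.random.rand(1000),
    })
    # format="table" makes the key appendable; min_itemsize reserves
    # space for the string column
    store.append("data", df, format="table", min_itemsize={"text": 100})

store.close()
```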

Also, are there any alternatives to Pandas for wrangling huge datasets? Sometimes you can't load all the data into Pandas without causing a memory crash.

SantoshGupta7

1 Answer


There are different possible ways to handle huge datasets:

  1. If the data is too big to fit fully into RAM, you can iterate over it in chunks with Pandas. You can find a brief explanation in the article Why and How to Use Pandas with Large Data, section 1, "Read CSV file data in chunk size"; there is also a short sketch after this list. Or add more RAM (or use more powerful server hardware) if you want to keep using a single machine.

  2. If the data is really big, it's probably better to store and process it across multiple computers using special software. The specific tool depends on what you want to do with the data:

  • 2.1. You can even keep using Pandas: there is an extension named Dask which wraps the interfaces of Python iterables, NumPy, and Pandas so they can run on multiple machines (see the second sketch after this list). There are other similar tools as well.
  • 2.2. A more independent yet simple approach is to use an analytical, distributed DBMS like Google BigQuery or Cassandra.
  • 2.3. And if you need something more powerful and complex, MapReduce systems like Apache Hadoop or Spark can be used. Moreover, there are many articles and scientific papers that describe using MapReduce for machine learning.
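
As a rough illustration of point 1, here is a minimal sketch of chunked processing with Pandas; the file name and the "label" column are just placeholders for your own data:

```python
# Sketch: process a CSV that does not fit in memory by reading it
# in chunks and combining per-chunk results at the end.
import pandas as pd

chunk_results = []
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    # each chunk is an ordinary DataFrame that fits in memory
    chunk_results.append(chunk.groupby("label").size())

# combine the per-chunk counts into one final result
totals = pd.concat(chunk_results).groupby(level=0).sum()
print(totals)
```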
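
And for point 2.1, roughly the same workflow with Dask might look like this (same placeholder file and column; Dask handles the partitioning and, with a distributed scheduler, where the work runs):

```python
# Sketch: the Pandas-like API of Dask, evaluated lazily over partitions.
import dask.dataframe as dd

# builds a lazy, partitioned DataFrame; nothing is loaded yet
ddf = dd.read_csv("big_dataset.csv")

# familiar Pandas-style operations, applied per partition
counts = ddf.groupby("label").size()

# .compute() triggers the actual (possibly distributed) computation
print(counts.compute())
```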

So, as always, it is a matter of your resources and what you aim to do with the data.

AivanF.