Through a few months of study, I have noticed that the challenges of big data are less intellectual than engineering ones. There are algorithms designed for sampling huge data streams (e.g. this Stanford course), but in my practice the biggest challenges are usually long running times, out-of-resource errors, the resulting difficulty of debugging, and a prolonged development cycle.
Frustration often arises from one minor problem that can stall days of effort.
The key is to break the process down step by step and identify the source of the problem.
This post describes such a case and my solution to it.
I want to train a hierarchical GRU model with attention on 20,000 medical summary documents, then use the model for classification on the test set.
My implementation of attention is a fixed linear layer. Because of the attention mechanism, the number of hidden states at each level of the GRU must be the same for the training set and the test set.
Since the number of hidden states corresponds to the maximum number of sentences per document and the maximum number of tokens per sentence, the easiest approach is to truncate all documents and pad them to the same number of sentences and tokens.
To ensure the training set and test set have the same shape, I need to read them all into memory and truncate them together.
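The truncate-and-pad step can be sketched as follows. This is a minimal illustration, not my actual training code; `MAX_SENTS`, `MAX_TOKENS`, and `PAD` are hypothetical values chosen for the example.

```python
MAX_SENTS = 50    # assumed cap on sentences per document
MAX_TOKENS = 30   # assumed cap on tokens per sentence
PAD = 0           # padding token id

def pad_document(doc):
    """doc is a list of sentences; each sentence is a list of token ids.
    Returns a MAX_SENTS x MAX_TOKENS grid, truncated and padded."""
    doc = doc[:MAX_SENTS]                                 # truncate extra sentences
    doc = [sent[:MAX_TOKENS] for sent in doc]             # truncate extra tokens
    doc = [sent + [PAD] * (MAX_TOKENS - len(sent)) for sent in doc]  # pad tokens
    doc += [[PAD] * MAX_TOKENS] * (MAX_SENTS - len(doc))  # pad missing sentences
    return doc
```

Applying the same function to both the training and test sets guarantees they end up with identical shapes.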
My groupmate has already broken the files down into smaller chunks (Full, Med, and Tiny) for running the program and for testing.
The problem is that since the files are huge, I cannot read them all into memory. The data are stored in pickle files. Possible solutions, phase 1:
- Increase the memory limit on HPC
- Read the pickle file incrementally. But eventually I still need all of it in memory.
- Change to a different storage format.
- The problem lies in the format of the source file: it is a pickle file, which is difficult to read line by line.
- The problem also lies in running Python interactively on the server. The server seems to allocate a small memory limit for interactive mode; for batch submission to the computing nodes, however, the memory limit is higher and adjustable. See here.
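On the incremental-reading idea above: a pickle file can only be streamed if it was written as a sequence of separate `pickle.dump()` records; a single giant pickled object still has to be loaded whole. A minimal sketch, assuming the file was written record by record:

```python
import pickle

def iter_pickle(path):
    """Yield objects one at a time from a file containing consecutive
    pickle.dump() records. pickle.load() raises EOFError at end of file."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
```

This is why the format matters: my files hold one big object each, so this trick does not apply to them directly.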
Possible solutions, phase 2:
- Now I have two memory budgets: one on the login node and one on the cluster (?)
- The original problem becomes: I cannot read big files on the login node interactively
- I can read small files on the login node interactively, so I can debug and test there
- I can request more memory on the cluster, so I can read the big files on a cluster node and break them down into smaller chunks
- I can then read the smaller chunks on the login node interactively
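The chunking step from the list above can be sketched like this. The function name, chunk size, and file naming scheme are illustrative, not from my actual script; the idea is simply to run it once on a high-memory cluster node.

```python
import pickle

def split_pickle(objects, chunk_size, prefix):
    """Write a list of objects into numbered chunk files of at most
    chunk_size items each, so each chunk fits in the login node's memory."""
    paths = []
    for i in range(0, len(objects), chunk_size):
        path = f"{prefix}_{i // chunk_size:03d}.pkl"
        with open(path, "wb") as f:
            pickle.dump(objects[i:i + chunk_size], f)
        paths.append(path)
    return paths
```

Each chunk is itself an ordinary pickle file, so the interactive session on the login node can load one chunk at a time with a plain `pickle.load()`.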
Problem is then solved.
But the solution to this problem is specific, in that:
- I can break the source file down for my later programs; the big files are not really necessary for my program
- I do in fact have access to larger memory
I have not solved the case where there is a hard memory limit and a fixed file size. For that, I might need to store the data in a different format (pickle files are not meant to be read line by line) or design a different algorithm.
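One line-by-line-friendly alternative, assuming the records are JSON-serializable, is the JSON Lines format: one record per line, so a later pass can stream the file with constant memory. A minimal sketch:

```python
import json

def write_jsonl(objects, path):
    """Write objects as JSON Lines: one JSON record per line."""
    with open(path, "w") as f:
        for obj in objects:
            f.write(json.dumps(obj) + "\n")

def iter_jsonl(path):
    """Stream records back one at a time, never holding the whole file."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)
```

The trade-off is that JSON cannot hold arbitrary Python objects the way pickle can, so the records would need to be converted to plain dicts and lists first.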