Common Tips in Data Parallelism
From the practical point of view of a daily practitioner
Moving test code into production is usually tricky, and people underestimate the effort it takes to make it run smoothly across all processors and return output to storage.
Over a year of intensive experience with data parallelism, working in a pixel-based geospatial data modeling research group, I have collected a few tips that help turn messy test code into production code.
1. Define an Independent Process
First of all, data parallelization is built on running the same process on different data subsets. Defining that “same process”, one that can run across the various data subsets, is therefore crucial.
There should be no communication inside an independent process. It is worth thinking carefully about the head and the tail of the process: how is a data subset created in the script, and how are the parallelization results harvested and concatenated?
Good Parallel Case Scenario:
The input of the process is a string, a list, an integer (e.g. an ID), or some other lightweight object. Data is stored in a database whose metadata can be leveraged to retrieve only the relevant subset. The output, in turn, is also a separate subset. In this case, the output subsets can be fully independent across threads or servers.
Bad Parallel Case Scenario:
The input of the process is a single big local file that has to be read in full and then subset inside the process, so loading and subsetting cannot be parallelized at all. The output is then designated to be a single big file composed of the data subsets, so the results of the independent processes unfortunately have to be gathered in a single thread, concatenated into one file, and written out by a single thread too. On top of the I/O, the supposedly independent process shares some huge global variables; to run in a thread, each thread has to copy all of those globals, duplicating the data and occupying memory inefficiently.
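To make the good scenario concrete, here is a minimal sketch; all function names, the path scheme, and the toy computation are hypothetical stand-ins. Each worker receives only a lightweight subset ID, loads and processes its own data, and writes its own output file, with no communication or shared globals between workers.

```python
from multiprocessing import Pool
from pathlib import Path

def process_subset(subset_id: str) -> str:
    """A fully independent unit of work: no shared state, no communication."""
    # Stand-in for a metadata-based retrieval from a database or S3:
    # the subset is identified by its ID, not by slicing one big file.
    data = [len(subset_id)] * 10
    # Stand-in for the actual model: the "same process" applied to every subset.
    result = sum(data)
    # One output file per subset keeps outputs independent across threads/servers.
    out = Path("output") / f"{subset_id}.txt"
    out.parent.mkdir(exist_ok=True)
    out.write_text(str(result))
    return str(out)

if __name__ == "__main__":
    subset_ids = [f"tile_{i:03d}" for i in range(8)]
    with Pool(processes=4) as pool:
        print(pool.map(process_subset, subset_ids))
```

Because each worker writes its own `output/<subset_id>.txt`, nothing has to be gathered and concatenated by a single thread afterwards.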
2. Checkpoint for Output
It is nearly impossible to turn test code into production code and succeed on the first try. Debugging is easy when the code simply stops and returns an error message; the annoying bugs are the ones where some data subsets work and some do not, and the code stops only after computing for hours. This can happen for two reasons.
Anomaly in a Data Subset
Input differs among data subsets, so some subsets work while others do not. When a single data subset fails, the whole production run fails with it.
Server’s Memory or Storage Problem
When the production code is not constructed properly, memory or storage usage can accumulate. This is not detectable in the first few processes, only somewhere in the middle; by the time the server hits its limit and returns a “no space” error, hours or even days have passed.
In these cases, the finished subsets are fine. But if we fix the code and rerun the whole process, it starts from the beginning and all the completed work is in vain.
Therefore, adding an “if” condition that checks whether the output of a data subset already exists solves the problem easily, as in the sketch below. It is worth noting that this also ties back to the bad scenario in 1.: if we write one big single output file, then when this problem occurs we can barely map what has been finished back to the corresponding inputs.
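A minimal sketch of such a checkpoint, assuming the one-output-file-per-subset layout from 1. (the path scheme is hypothetical):

```python
from pathlib import Path

def process_subset_with_checkpoint(subset_id: str) -> str:
    out = Path("output") / f"{subset_id}.txt"
    # Checkpoint: if this subset's output already exists, skip the computation,
    # so a rerun after a crash only processes the unfinished subsets.
    if out.exists():
        print(f"skip {subset_id}: output already exists")
        return str(out)
    # ... the actual (possibly hours-long) computation goes here ...
    out.parent.mkdir(exist_ok=True)
    out.write_text("result")
    return str(out)
```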
3. Designing a Logfile to Track Progress
“Logfile” is a general term for a computer-generated file that records the steps of an operation. With a logfile we can locate errors and back-calculate the timespan of each step, and if it is managed well, a logfile can even serve as a progress bar for the production run. A few tips help in designing a useful one:
Print a Timestamp Alongside Each Step
A timestamp for each step helps analyze the time spent along the chain of components. It is particularly useful at the beginning, while building the production code, for pinpointing the time-consuming steps and judging whether they can be optimized, as in the sketch below.
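A minimal version is a small wrapper around print (Python's standard logging module can achieve the same with a format string):

```python
from datetime import datetime

def log(message: str) -> None:
    # Prefix every line with a timestamp so the duration of each step can
    # be back-calculated from the logfile afterwards.
    print(f"{datetime.now().isoformat(timespec='seconds')} {message}", flush=True)

log("loading subset")
log("running model")
log("writing output")
```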
Customize a Unique Name for the Start/End of a Data Subset
Having a unique name at the start and end of each data subset delimits a single unit of process as a whole. Multiplying the number of subsets by the timespan of one unit gives a rough estimate of when the production run will finish (see the sketch after this subsection).
Don’t underestimate how useful this is! With an estimated finishing time we can, for instance, plan a 3-day holiday because the job will compute until then, or be doomed to find out that the code will take 30 days to finish in its current setup, in which case we should consider a serious effort to restructure the production code.
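Reusing a timestamped log helper, the start/end markers and the back-of-the-envelope arithmetic might look like this (the per-subset timespan, subset count, and worker count below are made-up numbers):

```python
from datetime import datetime

def log(message: str) -> None:
    print(f"{datetime.now().isoformat(timespec='seconds')} {message}", flush=True)

def process_one(subset_id: str) -> None:
    log(f"{subset_id} START")   # unique per-subset start marker
    # ... the unit of work ...
    log(f"{subset_id} END")     # unique per-subset end marker

# Back-of-the-envelope finishing-time estimate (all numbers are hypothetical):
seconds_per_subset = 110   # END minus START for one subset, read off the log
n_subsets = 2400
workers = 4                # parallel threads/servers
hours = seconds_per_subset * n_subsets / workers / 3600
print(f"estimated finish in ~{hours:.1f} hours")   # ~18.3 hours
```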
Use try/except for Anomalies
As mentioned in 2., anomalies are inevitable, and we should not let “one bad apple spoil the bunch.” A try/except structure wraps the production code and skips the errors caused by anomalous data subsets.
Besides, adding an error message that pinpoints the data subset is extremely useful, and it is simple to do with the aforementioned unique name for each subset. These error messages in the logfile eventually turn into a list of failed files, and we can start from there, debugging the abnormal subsets and reprocessing only a tiny proportion of the files.
However, try/except is also a double-edged sword. It should not be put in place before debugging is finished; otherwise, whenever the code fails, all we get back is an exception message.
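A sketch of the wrapper, with a deliberately failing stand-in for the already-debugged unit of work:

```python
def process_one(subset_id: str) -> None:
    # Stand-in for the already-debugged unit of work.
    if subset_id.endswith("013"):
        raise ValueError("anomalous subset")

def safe_process(subset_id: str) -> None:
    try:
        process_one(subset_id)
    except Exception as exc:
        # One "bad apple" no longer stops the run; the unique subset name in
        # the message turns the logfile into a list of failed files.
        print(f"FAILED {subset_id}: {exc}", flush=True)

for sid in ("tile_012", "tile_013", "tile_014"):
    safe_process(sid)
```

Afterwards, `grep FAILED run.log` lists the failed subsets, and piping it through `wc -l` counts them.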
Print Output Paths
If we design the production code to generate one output per data subset (the ideal design according to 1.), we end up with a bunch of output files. Printing each output file path enables an immediate check during the run: simply read the logfile, grab the path, and inspect the file by any means.
I have used the S3 HTTP protocol for my storage. With the S3 path printed in the log, I simply ctrl-click it, download the file to my local machine, and check it locally. Isn’t that simple and efficient?
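A sketch of this pattern, with a hypothetical bucket and key scheme:

```python
def finish_subset(subset_id: str) -> None:
    # Hypothetical S3 URL; the point is that the full output path lands in
    # the logfile, ready for an immediate ctrl-click spot-check.
    output_path = f"https://my-bucket.s3.amazonaws.com/results/{subset_id}.tif"
    # ... upload the subset's result to output_path here ...
    print(f"OUTPUT {output_path}", flush=True)

finish_subset("tile_001")
```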
Last but not least, designing a logfile requires knowing a few basic string-search commands so the logfile can be made properly searchable. “grep” and “wc -l” on Linux/macOS are enough, and then you are ready to start reading all those “print”s in the code.
A few final words…
It is far harder to build production code out of other people’s test code. In my case, I work with a local HPC system with several servers rather than a cloud computing platform, parallelizing not only across threads but also across servers. Imagine the complexity when these concepts escalate to a 2-level parallel structure (threads and servers).
Nonetheless, I am certain these tips remain useful, which is why I am publishing them right here, right now. I admit I am still relatively unsophisticated in parallelization, but I firmly believe that applying these tips to any data-parallel code will take away some of the pain of debugging.