The correct number of child processes will depend on the hardware available to you.
Simplifying a bit, child processes can be in one of two states: waiting for memory or disk access, or running.
If your problem fits comfortably in your computer's memory, processes will spend almost all of their time running. If it is too big for memory, they will periodically have to wait on the disk.
You should use approximately one child process per CPU core. If you are training on a GPU, it depends on whether a single process can make use of the entire GPU at once (in which case, use just one), or whether a "process" here is really more like a CUDA thread (in which case you would want one per CUDA core).
If you think your processes will wait on disk, use more than one per core; about 50% more is a good starting point. You can use a program like `top` to monitor CPU usage and adjust the number of processes accordingly.
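As a rough sketch of that heuristic in Python (the helper name `suggest_worker_count` and the 1.5x factor for disk-bound work are illustrative assumptions, not part of any framework):

```python
import os

def suggest_worker_count(io_bound: bool = False) -> int:
    """Heuristic: one process per CPU core, or ~50% more if the
    workload is expected to spend time waiting on disk."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return int(cores * 1.5) if io_bound else cores

# Example: dataset too large for memory, so workers will wait on disk.
print(suggest_worker_count(io_bound=True))
```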
To answer your question more explicitly:
- Having more child processes (up to the limit discussed above) will increase hardware utilization and make your training run faster. With a Core i7 CPU, for instance, you might be able to run 8 or 16 child processes at a time, so you could train up to 8-16x faster.
- Having more child processes than processing units (CPU cores, CUDA cores) will cause frequent context switching, where a processing unit has to pause to switch between different jobs. Switching jobs is expensive, and ultimately your program cannot train faster than it would by fully using the available hardware. So if you have more processes than processing units, reducing the number should make your program train faster (see the sketch after this list).
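For example, here is a minimal sketch that caps a worker pool at one process per core using the standard library; `train_on_shard` and the shard count are hypothetical placeholders for whatever work each child process actually does:

```python
import os
from multiprocessing import Pool

def train_on_shard(shard_id):
    # Stand-in for the real per-process work, e.g. training on
    # one shard of the dataset.
    return f"finished shard {shard_id}"

if __name__ == "__main__":
    n_workers = os.cpu_count() or 1            # one process per CPU core
    with Pool(processes=n_workers) as pool:    # avoids oversubscribing cores
        results = pool.map(train_on_shard, range(16))
    print(results)
```

With more shards than workers, `Pool.map` simply queues the extra work, so you still use all cores without paying for constant context switches.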