A generic way to start under such circumstances is to try to find an "oracle".
Serial-to-parallel converters have existed for quite some time, and some are open source (e.g. PIPS). The idea is to take serial code from step (1), use the "oracle" to produce parallel code, and that's it: each conversion makes an entry in the dataset.
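As a minimal sketch of what "one conversion, one entry" could look like, assuming the oracle is some command-line converter that reads a serial C file and writes a parallelized one (the `oracle_cmd`, the `-o` flag, and the file layout below are hypothetical, not the actual PIPS interface):

```python
import json
import pathlib
import subprocess

def make_entry(serial_path: str, oracle_cmd: list[str], out_dir: str) -> dict:
    """Run one serial-to-parallel 'oracle' on a serial source file and
    record the (serial, parallel) pair as a dataset entry."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    parallel_path = out / (pathlib.Path(serial_path).stem + "_par.c")

    # Hypothetical oracle invocation, e.g. ["my_converter", "--openmp"];
    # a real converter would have its own flags and output conventions.
    subprocess.run(oracle_cmd + [serial_path, "-o", str(parallel_path)], check=True)

    entry = {
        "serial": pathlib.Path(serial_path).read_text(),
        "parallel": parallel_path.read_text(),
        "oracle": " ".join(oracle_cmd),
    }
    (out / "entry.json").write_text(json.dumps(entry, indent=2))
    return entry
```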
Ensuring the quality of the dataset is critical here. A script generating the dataset should ensure that:

1. the serial code compiles and runs properly;
2. the parallelized code compiles and runs properly too;
3. the serial and parallel programs produce the same result;
4. some metric states objectively how the parallel version does against the serial version;
5. the actual hardware configuration is kept track of.
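A rough sketch of checks (1)-(3), assuming C sources parallelized with OpenMP, `gcc` available on the machine, and programs that print their result on stdout (all assumptions, not a required format):

```python
import pathlib
import subprocess
import tempfile

def build(src: str, exe: str, extra_flags=()) -> bool:
    """Checks (1)/(2): the source must compile cleanly."""
    r = subprocess.run(["gcc", "-O2", *extra_flags, src, "-o", exe],
                       capture_output=True)
    return r.returncode == 0

def run(exe: str) -> tuple[bool, str]:
    """The binary must run properly (exit code 0) within a timeout."""
    r = subprocess.run([exe], capture_output=True, text=True, timeout=60)
    return r.returncode == 0, r.stdout

def validate_pair(serial_src: str, parallel_src: str) -> bool:
    """Checks (1)-(3): both versions build, run, and agree on the output."""
    with tempfile.TemporaryDirectory() as tmp:
        ser_exe = str(pathlib.Path(tmp) / "serial")
        par_exe = str(pathlib.Path(tmp) / "parallel")
        if not build(serial_src, ser_exe):
            return False
        if not build(parallel_src, par_exe, extra_flags=("-fopenmp",)):
            return False
        ok_s, out_s = run(ser_exe)
        ok_p, out_p = run(par_exe)
        return ok_s and ok_p and out_s == out_p
```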
Point (4) is critical to the quality of such a dataset. Parallel programs are not always faster than a serial version: a 1000-iteration loop dumped onto 1000 workers on an 8-core CPU may not do so well compared to 8 workers, so we need to check what the converter is doing. And point (5) ensures we know under which conditions the data is valid.
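For points (4) and (5), a simple wall-clock speedup measurement plus a record of the host could look like the following; the exact metric and fields are just one possible choice:

```python
import os
import platform
import subprocess
import time

def speedup(serial_exe: str, parallel_exe: str, repeats: int = 5) -> float:
    """Point (4): wall-clock speedup of the parallel binary over the serial one,
    taking the best of a few runs to reduce timing noise."""
    def best_time(exe):
        times = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            subprocess.run([exe], check=True, capture_output=True)
            times.append(time.perf_counter() - t0)
        return min(times)
    return best_time(serial_exe) / best_time(parallel_exe)

def hardware_config() -> dict:
    """Point (5): record the conditions under which the measurement is valid."""
    return {
        "machine": platform.machine(),
        "processor": platform.processor(),
        "logical_cores": os.cpu_count(),
        "os": platform.platform(),
    }

# Example: attach both to a dataset entry.
# entry["speedup"] = speedup("serial", "parallel")
# entry["hardware"] = hardware_config()
```

A speedup well below the core count (or below 1) flags conversions that should be inspected or dropped, and the hardware record makes the number interpretable later.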
Using several "oracles" would be even better: it brings diversity and may let the learning algorithm discover the best conversion trade-offs, perhaps better than what the carefully hand-crafted converters can do on their own.