I'm looking to perform two tasks:

  • Train a classifier to classify code as serial or parallel

  • Train a generative algorithm to generate parallel code from serial

For the first task, a simple scraper can collect random C and C++ code from Git repositories. For the second task, however, I would need a reasonably large source of serial-to-parallel code examples. Any ideas or pointers for finding or creating this type of dataset would be greatly appreciated.

JMed

1 Answer

A generic way to start in such circumstances is to try to find an "oracle".

Serial-to-parallel converters have existed for quite some time, and some are open source (e.g. PIPS). The idea is to take serial code from step (1), use the "oracle" to produce parallel code, and that's it: each conversion makes an entry in the dataset.

Ensuring the quality of the dataset is critical here. A script generating the dataset should ensure that (1) the serial code compiles and runs properly, (2) the parallelized code compiles and runs properly too, (3) the serial and parallel programs produce the same result, (4) some metrics state objectively how the parallel version performs against the serial version, and (5) the actual hardware configuration is recorded.
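The five checks above can be sketched as a small Python script. This is only a minimal sketch under assumptions: `run_timed` and `make_entry` are hypothetical names, both programs are assumed to be already compiled (a failed compile would be rejected the same way as a failed run), and "same result" is taken to mean identical standard output, which may be too strict for floating-point programs.

```python
import os
import platform
import subprocess
import time


def run_timed(cmd, timeout=60):
    """Run a command; return (exit code, stdout, wall-clock seconds)."""
    start = time.perf_counter()
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout, time.perf_counter() - start


def make_entry(serial_cmd, parallel_cmd):
    """Return a dataset record, or None if the pair fails checks (1)-(3)."""
    s_code, s_out, s_time = run_timed(serial_cmd)
    p_code, p_out, p_time = run_timed(parallel_cmd)
    if s_code != 0 or p_code != 0:
        return None  # checks (1) and (2): both programs must run cleanly
    if s_out != p_out:
        return None  # check (3): both programs must produce the same result
    return {
        "serial_seconds": s_time,
        "parallel_seconds": p_time,
        "speedup": s_time / p_time,  # check (4): objective comparison
        "hardware": {  # check (5): conditions under which the data is valid
            "cpu_count": os.cpu_count(),
            "machine": platform.machine(),
            "system": platform.system(),
        },
    }
```

A call such as `make_entry(["./serial_prog"], ["./parallel_prog"])` (hypothetical binary names) would then either return a record to store next to the two source files or `None` when the pair should be discarded.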

Point (4) is critical to the quality of such a dataset. Parallel programs are not always faster than their serial version: a 1000-iteration loop spread over 1000 workers on an 8-core CPU may do worse than the same loop on 8 workers, so we need to check what the converter is actually doing. Point (5) ensures we know under which conditions the data is valid.
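One possible way to make point (4) concrete is to record speedup (serial time divided by parallel time) and efficiency (speedup divided by the number of workers). The sketch below is an assumption about which metrics to keep, not something a converter prescribes. `speedup_metrics` is a hypothetical helper, and taking the median of repeated timings is just one way to reduce noise.

```python
from statistics import median


def speedup_metrics(serial_times, parallel_times, workers):
    """Summarize repeated wall-clock timings (in seconds) of both versions."""
    t_serial = median(serial_times)
    t_parallel = median(parallel_times)
    speedup = t_serial / t_parallel
    return {
        "speedup": speedup,               # > 1 means the parallel version is faster
        "efficiency": speedup / workers,  # 1.0 would be perfect scaling
        "workers": workers,
    }
```

A speedup below 1, or an efficiency far below 1, flags an entry where parallelization did not pay off on the recorded hardware.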

Using several "oracles" would be even better, both to bring diversity and, hopefully, to let the learning algorithm discover the best conversion tradeoffs, perhaps better than what the carefully hand-crafted converters can achieve on their own.
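As a rough sketch of the oracle idea with several converters, the loop below writes one dataset entry per successful conversion. Everything here is an assumption for illustration: `build_dataset` is a hypothetical function, and each oracle is assumed to be a Python callable that wraps a real converter, takes serial source text, and returns parallel source text or `None` when it cannot convert the input.

```python
import json
from pathlib import Path


def build_dataset(serial_files, oracles, out_path):
    """Write one JSON line per (serial file, oracle) pair that converts."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in serial_files:
            serial_src = Path(path).read_text(encoding="utf-8")
            for name, convert in oracles.items():
                # `convert` is a hypothetical wrapper around one converter.
                parallel_src = convert(serial_src)
                if parallel_src is None:
                    continue  # this oracle could not handle the input
                record = {
                    "oracle": name,
                    "serial": serial_src,
                    "parallel": parallel_src,
                }
                out.write(json.dumps(record) + "\n")
                written += 1
    return written
```

Keeping the oracle name in each record makes it possible to compare converters later, or to filter the dataset by source.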

Eric Platon
    In addition to PIPS: [DawnCC](http://cuda.dcc.ufmg.br/dawn/index.php) and [bones](http://parse.ele.tue.nl/research/bones/). – Ilya Palachev Jun 26 '18 at 15:32