Train a Machine Learning Model to Scan Through a Page of a Specific Format and Scrape the Text

Question

I want to convert already-printed papers into text that I can reuse. The pages contain multiple-choice questions (MCQs), and I want to process a scanned version of the hard copy and store the questions in a JSON file. How can I train a model to do this? I have no prior knowledge of ML and have never tried it, but I have a solid understanding of programming and algorithms. I want the result to be as accurate as possible, and I am willing to work as hard as needed. What would you advise to get there as fast as possible, and how much time might it take?

score 0 · Answer 1 · answered Aug 18 '23 at 07:05

To me this seems more like a common scripting task than a full-blown AI project from scratch.
I would do this:
Subscribe to ChatGPT Pro and activate its Code Interpreter plugin. Give it pseudocode for your scripting task, or a basic shell script, and/or some commented-out instructions (e.g., "TODO: remove multiple newlines from file", "TODO: ..."), then say "convert this code to Python", and it will generate a Python program for you and execute it.

If runtime errors occur, the Code Interpreter will retry until the script runs. This takes about 2 minutes.

The Code Interpreter UI has an "Upload file" button (unlike the standard ChatGPT GUI), so you can upload some sample input files.

You will need to fix the generated script in places: for example, change Python 2 idioms to Python 3, format its output the way you want it, etc.

But chances are high that 95% of your work will be done by the AI.
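To give an idea of what such a script might look like after cleanup, here is a minimal sketch. It assumes the OCR step has already been done by some other tool and that the resulting text uses a layout like "1. question" followed by "A) option" lines. That layout, the function names `collapse_blank_lines` and `parse_mcq`, and the JSON field names are all assumptions made up for illustration; they would need to be adapted to the real pages.

```python
import json
import re

# Assumed (hypothetical) layout: "1. Question text", then "A) option" lines.
QUESTION_RE = re.compile(r"^(\d+)[.)]\s+(.*)$")
OPTION_RE = re.compile(r"^([A-Da-d])[.)]\s+(.*)$")

# Made-up sample standing in for OCR output.
SAMPLE = """1. What is 2 + 2?


A) 3
B) 4
C) 5

2. Which planet is known as
the Red Planet?
A) Venus
B) Mars
"""


def collapse_blank_lines(text: str) -> str:
    """The 'TODO: remove multiple newlines' step: drop runs of blank lines."""
    return re.sub(r"\n\s*\n+", "\n", text.strip())


def parse_mcq(text: str) -> list:
    """Turn cleaned text into a list of question dicts (field names are assumptions)."""
    questions = []
    for raw_line in collapse_blank_lines(text).splitlines():
        line = raw_line.strip()
        if not line:
            continue
        q_match = QUESTION_RE.match(line)
        o_match = OPTION_RE.match(line)
        if q_match:
            questions.append(
                {"number": int(q_match.group(1)),
                 "question": q_match.group(2),
                 "options": {}}
            )
        elif o_match and questions:
            questions[-1]["options"][o_match.group(1).upper()] = o_match.group(2)
        elif questions:
            # Continuation of a wrapped line: attach it to the most recent
            # option if there is one, otherwise to the question text.
            current = questions[-1]
            if current["options"]:
                last_key = list(current["options"])[-1]
                current["options"][last_key] += " " + line
            else:
                current["question"] += " " + line
    return questions


if __name__ == "__main__":
    print(json.dumps(parse_mcq(SAMPLE), indent=2))
```

Real OCR output is usually messier than this sample (misread option letters, stray characters), so the regular expressions are the part most likely to need tuning against your own scans.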

Yes. You asked how to achieve a task as fast as possible, and I tried to sketch a solution to do just that. I'm sorry, but it is not clear to me what the ML aspect is here. Do you want to build your own OCR solution, do you just want to do document management (build a collection of multiple-choice questions and answers), or do you want to apply some learning algorithm to the text collection after gathering it? - knb, Aug 18 '23 at 09:12
