Converting DjVu to PDF: trouble with this OCR-preserving code

Question

I want to convert djvu to pdf while preserving OCR. This page describes how to do so, but I am getting a blank html file.

In /home/steven/Documents/djvu2pdf/1/, djvu2hocr -p 1 Intro.djvu gives me:

Converting 'Intro.djvu':
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta name="ocr-system" content="djvu2hocr 0.7.9" />
  <meta name="ocr-capabilities" content="ocr_carea ocr_page ocr_par ocrx_block ocrx_line ocrx_word" />
  <title>DjVu hidden text layer</title>
</head>
<body>
*** [1-11711] Failed to open 'Intro.djvu': No such file or directory.
*** (ByteStream.cpp:693)
*** 'DJVU::GUTF8String DJVU::ByteStream::Stdio::init(const DJVU::GURL&, const char*)'


</body>
</html>
Traceback (most recent call last):
  File "/usr/bin/djvu2hocr", line 7, in <module>
    _.main(sys.argv)
  File "/usr/share/ocrodjvu/lib/cli/djvu2hocr.py", line 325, in main
    djvused.wait()
  File "/usr/share/ocrodjvu/lib/ipc.py", line 114, in wait
    raise CalledProcessError(return_code, self.__command)
subprocess.CalledProcessError: Command 'djvused' returned non-zero exit status 10

This leaves me with a blank HTML file, so when I run

sed 's/ocrx/ocr/g' > pg1.html

it just seems to run in an endless loop.

I also have a secondary program called djvu2pdf which I found at http://0x2a.at/s/projects/djvu2pdf, but

djvu2pdf Intro.djvu

gives me

-e Error: /usr/bin/djvu2pdf: File 'Intro.djvu' not found

The OCR file opens fine.

user140393 · Answer 1 · 2013-03-27T02:00:01.877

I fixed the file /home/steven/Documents/djvu2pdf/1/Intro.djvu. It turns out that none of my DjVu files had extensions, but Linux was opening them anyway.

Testing with a single page document

I first ran cd /home/steven/Documents/djvu2pdf/1/

Then ran: djvu2hocr -p 1 1.djvu

DjVu hidden text layer



Page #1

Traceback (most recent call last):
  File "/usr/bin/djvu2hocr", line 7, in <module>
    _.main(sys.argv)
  File "/usr/share/ocrodjvu/lib/cli/djvu2hocr.py", line 323, in main
    process_page(page_zone, options)
  File "/usr/share/ocrodjvu/lib/cli/djvu2hocr.py", line 263, in process_page
    result = process_zone(None, page_text, last=True, options=options)
  File "/usr/share/ocrodjvu/lib/cli/djvu2hocr.py", line 238, in process_zone
    parent.append(child)
AttributeError: 'NoneType' object has no attribute 'append'

The command djvu2hocr -p 1 1.djvu > tmp.html did the same thing:

Converting '1.djvu':
- Page #1
Traceback (most recent call last):
  File "/usr/bin/djvu2hocr", line 7, in <module>
    _.main(sys.argv)
  File "/usr/share/ocrodjvu/lib/cli/djvu2hocr.py", line 323, in main
    process_page(page_zone, options)
  File "/usr/share/ocrodjvu/lib/cli/djvu2hocr.py", line 263, in process_page
    result = process_zone(None, page_text, last=True, options=options)
  File "/usr/share/ocrodjvu/lib/cli/djvu2hocr.py", line 238, in process_zone
    parent.append(child)
AttributeError: 'NoneType' object has no attribute 'append'

I then ran:

sed 's/ocrx/ocr/g' tmp.html > pg1.html

I got an HTML file and tmp.html, both containing this:

http://pastebin.com/APwUPwk4

I had to post it there because, for whatever absurd reason, this site won't let me post that code; html, code, and pre tags all failed. Also, what exactly is a pipe, and where did I miss it in that post? I am new to the Linux terminal and am learning through Google searches.

score 0 · Answer 2 · answered Mar 25 '13 at 10:50

First, try running the program with the full path to the file. Run the following command to make absolutely sure that your file exists:

file /home/steven/Documents/djvu2pdf/1/Intro.djvu

and then try

djvu2hocr -p 1 /home/steven/Documents/djvu2pdf/1/Intro.djvu

Second, there is a problem with the following command by itself:

sed 's/ocrx/ocr/g' > pg1.html

This will not run "in an infinite loop" but will just wait for standard input (keyboard in your case), since you are not running sed with an argument or as a part of a pipe. sed does not know which file you want to process.

The page you refer to clearly specifies that you should run it as a part of a pipe. Alternatively, you can do it as follows:

 djvu2hocr -p 10 /home/steven/Documents/djvu2pdf/1/Intro.djvu > tmp.html
 sed 's/ocrx/ocr/g' tmp.html > pg10.html
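
To illustrate what "part of a pipe" means here, the sketch below feeds sed in both ways: through a pipe and from a file. The sample line is a hypothetical stand-in for djvu2hocr output, used so that the snippet runs without a DjVu file at hand; with the real tool, the pipe form would presumably be djvu2hocr -p 10 Intro.djvu | sed 's/ocrx/ocr/g' > pg10.html.

```shell
# Hypothetical stand-in for one line of djvu2hocr output (not real output).
sample='<span class="ocrx_word">Intro</span>'

# Pipe form: the "|" connects the output of the command on the left
# to the standard input of sed, so sed no longer waits for the keyboard.
piped=$(printf '%s\n' "$sample" | sed 's/ocrx/ocr/g')

# File form: sed reads from the file named as its last argument.
tmp=$(mktemp)
printf '%s\n' "$sample" > "$tmp"
from_file=$(sed 's/ocrx/ocr/g' "$tmp")
rm -f "$tmp"

# Both forms produce the same text.
echo "$piped"
echo "$from_file"
```

The pipe form avoids the temporary file; the file form leaves tmp.html around, which is handy for checking what djvu2hocr actually wrote.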

January

This is very weird: it exists in the GUI but doesn't exist in the terminal. It says "No such file or directory", but that clearly isn't the case in this pic: http://i.imgur.com/Rg7wszn.png – user140393 Mar 25 '13 at 12:51
Hey, but it is called "Intro" and not "Intro.djvu"! Extensions in Linux are more or less optional, and they exist for the users more than for the system. Try the command line with "Intro" as the file name. – January Mar 27 '13 at 08:49
Yes, I noticed that and changed all files to *.djvu with pyRenamer; however, I encountered further problems and posted them below. – user140393 Mar 28 '13 at 02:55
Fixed it: some other command I ran on some of my documents had corrupted them. This command also corrupted them when entered incorrectly. I thought the problem was different because I was cross-testing on different documents. – user140393 Mar 29 '13 at 09:52
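
The renaming mentioned in the comments was done with pyRenamer; as a rough terminal-only sketch of the same idea, the hypothetical helper add_djvu_ext below appends .djvu to every extensionless regular file in a directory. It does not inspect file contents, so it assumes the directory holds nothing but DjVu files, and the demo uses empty placeholder files in a scratch directory.

```shell
# Hypothetical helper (not part of any DjVu tool): append ".djvu" to each
# regular file in the given directory whose name contains no dot.
add_djvu_ext() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue        # skip directories and unmatched globs
        name=${f##*/}                  # strip the directory part
        case $name in
            *.*) ;;                    # already has an extension: leave it
            *)   mv -- "$f" "$f.djvu" ;;
        esac
    done
}

# Demo on a scratch directory with empty placeholder files.
demo=$(mktemp -d)
touch "$demo/Intro" "$demo/1" "$demo/notes.txt"
add_djvu_ext "$demo"
ls "$demo"
```

After the rename, commands such as djvu2hocr -p 1 Intro.djvu refer to a name that actually exists on disk.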
