Sunday, June 1, 2008

Adding OCR to a djvu file

For each page in the file "cake.djvu", we can use tesseract to process the page image:
djvused -e "select ${page};save-page-with \"cake_page.djvu\"" cake.djvu
convert cake_page.djvu cake.tif
tesseract cake.tif cake_box batch.nochop makebox
tesseract cake.tif cake_txt batch.nochop
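The makebox run writes "cake_box.txt" with one line per character: the character followed by its left, bottom, right and top pixel coordinates. As a rough illustration of how such lines combine into one word box, here is a small sketch. The file name "sample_box.txt" and the coordinates in it are made up for this example and do not come from a real scan.

```shell
# Made-up box data for illustration; real values come from cake_box.txt.
cat > sample_box.txt <<'EOF'
c 100 200 120 230
a 122 200 140 225
k 142 200 160 240
e 162 200 180 226
EOF

# Merge the character boxes into one word box:
# smallest left/bottom, largest right/top, characters joined in order.
word=$(awk 'NR == 1 { xmin = $2; ymin = $3; xmax = $4; ymax = $5 }
     { if ($2 < xmin) xmin = $2; if ($3 < ymin) ymin = $3
       if ($4 > xmax) xmax = $4; if ($5 > ymax) ymax = $5
       w = w $1 }
     END { printf "(word %d %d %d %d \"%s\")\n", xmin, ymin, xmax, ymax, w }' sample_box.txt)
echo "$word"
```

With the sample data this prints (word 100 200 180 240 "cake"), the same kind of entry that djvused expects for a word in the hidden text.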
The two tesseract runs produce the text structure (lines and words, in "cake_txt.txt") and the positioning (coordinates of each character, in "cake_box.txt"). To convert this information to the hidden-text format for use with djvused, use
perl<<'EOL'>cake_text.txt
open TXT, "<:utf8", "cake_txt.txt";
open BOX, "<:utf8", "cake_box.txt";
$pxn = 1000000;
$pxx = 0;
$pyn = 1000000;
$pyx = 0;
$pagebuf = "";
while ($line = <TXT>) {
chomp $line;
@words = split /\s+/, $line;
next if $#words < 0;
$lxn = 1000000;
$lxx = 0;
$lyn = 1000000;
$lyx = 0;
$linebuf = "";
foreach $word (@words) {
$xmin = 1000000;
$xmax = 0;
$ymin = 1000000;
$ymax = 0;
$w = "";
for ($i = 0; $i < length($word); $i ++) {
$c = substr($word, $i, 1);
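do {
$cline = <BOX>;
# stop instead of looping forever if the box file runs out
die "box file ended before matching '$c'\n" unless defined $cline;
} while (substr($cline, 0, 1) ne $c);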
($xn, $yn, $xx, $yx) = substr($cline, 2) =~ /\S+/g;
$w = $w . '\\' if $c eq '"';
$w = $w . '\\' if $c eq '\\';
$w = $w . substr($cline, 0, 1);
$xmin = $xn if ($xmin > $xn);
$xmax = $xx if ($xmax < $xx);
$ymin = $yn if ($ymin > $yn);
$ymax = $yx if ($ymax < $yx);
}
$wline = '(word ' . $xmin . ' ' . $ymin . ' ' . $xmax . ' ' . $ymax . ' "' . $w . '")';
$linebuf = $linebuf . "\n  " . $wline;
$lxn = $xmin if ($lxn > $xmin);
$lxx = $xmax if ($lxx < $xmax);
$lyn = $ymin if ($lyn > $ymin);
$lyx = $ymax if ($lyx < $ymax);
}
$pagebuf = $pagebuf . "\n (line $lxn $lyn $lxx $lyx" . $linebuf . ')';
$pxn = $lxn if ($pxn > $lxn);
$pxx = $lxx if ($pxx < $lxx);
$pyn = $lyn if ($pyn > $lyn);
$pyx = $lyx if ($pyx < $lyx);
}
close BOX;
close TXT;
binmode(STDOUT, ":utf8");
print "(page $pxn $pyn $pxx $pyx", $pagebuf, ')', "\n";
EOL
which generates "cake_text.txt" in the required format. The hidden text can be saved back to the djvu file with
djvused -e "select ${page};set-txt \"cake_text.txt\";save" cake.djvu
We just need to repeat this for all the desired pages.
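
One possible way to do the repetition is a small shell function that loops over the pages. This is only a sketch under some assumptions: the Perl code above is assumed to be saved in a file called "box2djvutxt.pl" (a name chosen here for illustration), the function name ocr_all_pages is likewise made up, and the page count is taken from "djvused -e n".

```shell
#!/bin/sh
# Sketch: run the per-page steps for every page of a djvu file.
# Assumption: the Perl code from this post is saved as box2djvutxt.pl
# (hypothetical name); it reads cake_txt.txt and cake_box.txt and
# prints the hidden text to standard output.
ocr_all_pages() {
    doc=$1
    # "djvused -e n" prints the number of pages in the document.
    npages=$(djvused -e n "$doc") || return 1
    page=1
    while [ "$page" -le "$npages" ]; do
        djvused -e "select ${page};save-page-with \"cake_page.djvu\"" "$doc" &&
        convert cake_page.djvu cake.tif &&
        tesseract cake.tif cake_box batch.nochop makebox &&
        tesseract cake.tif cake_txt batch.nochop &&
        perl box2djvutxt.pl > cake_text.txt &&
        djvused -e "select ${page};set-txt \"cake_text.txt\";save" "$doc" || return 1
        page=$((page + 1))
    done
}

# Usage: ocr_all_pages cake.djvu
```

The function stops at the first page where any step fails, so a bad page does not get an empty or stale text layer written over it. Afterwards, djvused -e "select 1;print-txt" cake.djvu prints the hidden text of the first page as a quick check.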

2 comments:

Anonymous said...

After converting the text information to the hidden-text format for use with djvused:

syntax error at - line 8, near "= ) "
syntax error at - line 26, near "= ;"
syntax error at - line 49, near "}"
Execution of - aborted due to compilation errors.

:(

What should I do with this?

cjj said...

The <TXT> and <BOX> in my code were treated as HTML tags. It should be fixed now.