Welcome to Appropedia! Thanks for the contributions on the Low_cost_computer_guide.
Looking forward to seeing more of you!
I've been noticing your excellent work on the porting of PATB pages! Great stuff! I'm curious... I don't know much about OCR. Does the output of the OCR tool retain formatting (things like bold, italics, etc.)?
If the formatting is not retained, I recently also found an easier process. Acrobat allows "export as text", which I experimented with only briefly. It loses the formatting, but it's still better than the slow "copy/paste" option. And, of course, the basic Acrobat is free. But if the OCR tool actually retains formatting, then it is a great thing!
Whatever the answer, it's really great to see all your productivity and the articles!
Thanks, and let's keep in touch! --CurtB 16:58, 19 September 2007 (PDT)
Yes, the formatting is preserved, but the OCR software that I have does not detect that it could extract the text directly from the PDF, and the OCR pass can introduce textual errors. Since I realized too late that the PATB files contain text, I did the first document as a full OCR. For the second document I OCRed the full document but did not do a rigorous spell check. In the resulting document I replaced the text blocks with the text from Adobe. This way I had both the structure and the text correct. The second time I did the formatting in OpenOffice 2.3.0 until I was satisfied. Then I exported the document to MediaWiki format. There are some things that the exporter does not do the way I would like, but maybe I should investigate how I would have to format them in OpenOffice.
In the OCRed document the elements look correct but do not carry the styles that the exporter requires. The resulting MediaWiki export looks correct, but, for example, headings are only bold lines and not headings of the correct level. So I make some formatting changes:
- change the style of the heading lines to the correct heading style
- change the style of bulleted lists to the bulleted list style
- add the captions to images and tables
- reformat tables if the cells contain text on multiple lines
Then I export to MediaWiki. The resulting text file contains some things that I do not like. Short lines, such as those that occur in addresses, are formatted as separate paragraphs. I change this before I create the article in Appropedia. I assume one could do the formatting without OpenOffice, but I think I'm faster this way. --LeissKG 23:50, 19 September 2007 (PDT)
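For anyone who would rather fix the heading problem after the export instead of in OpenOffice, here is a minimal Python sketch. It assumes that a "fake" heading shows up in the exported wikitext as a line consisting of nothing but one bold run (`'''Text'''`), and that level 2 is the right heading level; both are guesses, and the function name `promote_bold_lines` is made up for illustration.

```python
import re

# Assumption: a fake heading is a line that is entirely one bold run,
# e.g. '''Introduction'''. Lines with bold text inside a sentence are left alone.
BOLD_LINE = re.compile(r"^'''([^']+)'''\s*$")


def promote_bold_lines(wikitext: str, level: int = 2) -> str:
    """Turn bold-only lines into MediaWiki headings of the given level."""
    marker = "=" * level
    out = []
    for line in wikitext.splitlines():
        match = BOLD_LINE.match(line)
        if match:
            out.append(f"{marker} {match.group(1).strip()} {marker}")
        else:
            out.append(line)
    return "\n".join(out)
```

The heading level would still have to be checked by hand, since the script cannot tell a section title from a subsection title.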
What I forgot to mention is that if you have source documents in Word or HTML, you should be able to port them with minimal effort this way. OpenOffice can open both and then export as MediaWiki. --LeissKG 00:02, 20 September 2007 (PDT)
Fantastic work - Curt emailed me a month ago about your good ideas and good work, but I've only now looked at it. I'm currently tied up with setting up the new forums and blog for Appropedia, but I'll try and stay up to date with what you're doing, and give some more serious help within a few weeks.
One thought about OCR and spell-checking. If you run the document through OCR and create it as an article in Wikipedia, you could then export as text, click edit on the wiki page, and paste the text in as though you were going to replace the whole article. Then, instead of clicking "Save", just click "Show changes". If I'm correct, this will automatically determine which paragraphs are the same, but it will show you the errors (highlighted as red changes). You could then open the edit window again, in a new browser tab or window, in order to do the corrections.
I think it could be worth a try, anyway - might be useful for some documents if you find the OCR approach has advantages.
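The same comparison could also be tried offline before anything is pasted into the wiki. Below is a minimal sketch using Python's standard `difflib` module. It assumes the OCR output and the text exported from Acrobat are both available as plain strings; the function name `ocr_differences` is made up for illustration.

```python
import difflib


def ocr_differences(ocr_text: str, reference_text: str) -> list[tuple[str, str]]:
    """Return (ocr_words, reference_words) pairs wherever the two texts differ."""
    # Compare word by word so that differing line breaks do not count as changes.
    ocr_words = ocr_text.split()
    ref_words = reference_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=ref_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(ocr_words[i1:i2]), " ".join(ref_words[j1:j2])))
    return diffs
```

Each pair shows what the OCR produced next to what the exported text says at that spot, so likely OCR errors can be checked one at a time. An empty string on one side means words are missing from that version.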