Revision as of 06:50, 20 September 2007

Welcome!

Hi LeissKG,

Welcome to Appropedia! Thanks for the contributions on the Low_cost_computer_guide.

Looking forward to seeing more of you!

-CurtB


Wow!

Hi again,

I've been noticing your excellent work on the porting of PATB pages! Great stuff! I'm curious... I don't know much about the OCR stuff. Does the output of the OCR tool retain formatting (things like bold, italics, etc.)?

If the formatting is not retained, I recently also found an easier process. Acrobat allows "export as text", which I experimented with only briefly. It loses the formatting, but it's still better than the slow "copy/paste" option. And, of course, the basic Acrobat is free. But if the OCR tool actually retains formatting, then it is a great thing!

Whatever the answer, it's really great to see all your productivity and the articles!

Thanks, and let's keep in touch! --CurtB 16:58, 19 September 2007 (PDT)

Hello Curt

Yes, the formatting is preserved, but the OCR software I have does not detect that it could extract the text directly from the PDF, and the OCR pass can introduce textual errors. Since I realized too late that the PATB files contain text, I did the first document as a full OCR. For the second document I OCRed the whole document but did not do a rigorous spell check. In the resulting document I replaced the text blocks with the text from Adobe. This way I had both the structure and the text correct. The second time I did the formatting in OpenOffice 2.3.0 until I was satisfied. Then I exported the document to MediaWiki format. There are some things the exporter does not do the way I would like, but maybe I should investigate how I would have to format them in OpenOffice.

In the OCRed document the elements are visually correct but do not have the styles that the exporter requires. The resulting export to MediaWiki looks correct, but headings, for example, are only bold lines and not headings of the correct level. So I make some formatting changes:

  • change the style of the header lines to the correct heading style
  • change the style of bulleted lists to the bulleted-list style
  • add the captions to images and tables
  • reformat tables if the cells contain text on multiple lines

Then I export to MediaWiki. In the resulting text file there are some things that I do not like. Short lines, such as those in addresses, are formatted as separate paragraphs. I change this before I create the article in Appropedia. I assume one could do the formatting without OpenOffice, but I think I'm faster this way. --LeissKG 23:50, 19 September 2007 (PDT)
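
For anyone who wants to try the cleanup without OpenOffice, here is a rough sketch of the two fixes described above, applied directly to the exported wikitext. It is only an illustration, not the process actually used for the PATB pages. The function names are made up, it assumes the exporter writes a would-be heading as a line containing nothing but '''bold''' text, and the 40-character limit for a "short line" is an arbitrary guess.

```python
import re

# A line that is nothing but bold wikitext, e.g. '''Introduction'''.
# Titles containing an apostrophe are skipped on purpose, to avoid
# misreading lines such as '''a''' and '''b''' as a single heading.
BOLD_LINE = re.compile(r"^'''([^']+)'''\s*$")

# Prefixes of lines that are already wiki markup and must be left alone.
MARKUP_PREFIXES = ("=", "*", "#", ":", "{", "|", "'''")


def bold_lines_to_headings(wikitext, level=2):
    """Turn bold-only lines into MediaWiki headings of the given level."""
    marks = "=" * level
    out = []
    for line in wikitext.splitlines():
        m = BOLD_LINE.match(line)
        if m:
            out.append(f"{marks} {m.group(1).strip()} {marks}")
        else:
            out.append(line)
    return "\n".join(out)


def join_short_paragraphs(wikitext, max_len=40):
    """Merge runs of short one-line paragraphs (e.g. an address) with <br />."""
    paragraphs = re.split(r"\n\s*\n", wikitext)
    merged = []
    run = []  # consecutive short paragraphs collected so far

    def flush():
        if run:
            merged.append("<br />\n".join(run))
            run.clear()

    for p in paragraphs:
        is_short = (
            "\n" not in p
            and 0 < len(p) <= max_len
            and not p.startswith(MARKUP_PREFIXES)
        )
        if is_short:
            run.append(p)
        else:
            flush()
            merged.append(p)
    flush()
    return "\n\n".join(merged)
```

Running bold_lines_to_headings first and join_short_paragraphs second keeps the new headings out of the merged runs, because lines starting with "=" are never treated as short paragraphs. The result would still need a read-through before saving the article.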
