Help:Porting PDF files to MediaWiki: Difference between revisions

Revision as of 07:08, 19 September 2015

This is still a work in progress - you can help by trying these methods and adding any information about what works. Contact Chriswaterguy or Curt if you have questions.

Methods

At present, we do not have a simple one-step PDF-to-MediaWiki translation process which retains the desired text formatting. We have many multi-step approaches which retain text formatting, all of which can be broken down into two main steps:

Convert (save) the PDF to a more workable intermediate format that supports interesting formatting, and
Convert the resulting file into MediaWiki format.

The various options for each of these main steps are described below.

The OpenOffice 3 approach may come closest to a straightforward solution, if we can get it to work.

Alas, we do not yet have an automated way to transfer the images (though mw:Extension:MultiUpload may make it easier). Help is welcome!

1. Save as formatted text

Convert to a Word, RTF or HTML file.

It is preferable to use:

An open source solution, which can be used by anyone and can be improved if needed.
If this doesn't work, Acrobat Professional - some academics, students and business people will have access to it, and it is likely to work better than freeware or web services.

(When searching for solutions, note that word combinations like PDF export get a lot of false hits - mainly exporting to PDF, and also very many commercial programs. So, try this search:

export OR convert pdf-to freeware -demo -free-trial images formatted OR layout)

Open source options

There is an extension for OpenOffice 3 Beta (presumably works with OO3) that facilitates import of PDF documents: Sun PDF Import Extension (Beta). If this works well, combined with OpenOffice's existing MediaWiki export functionality, this may be a one-stop tool for PDF to MediaWiki conversion. Chriswaterguy is trying this out now, but having trouble with bugs preventing acceptance of the EULA. 21:24, 14 December 2008 (UTC)

Alternatively:

Use http://pdftohtml.sourceforge.net/ or xpdf to convert to HTML.
Clean up with htmltidy. It should now be ready to convert to MediaWiki.
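
The two steps above can be chained in a small script. The following is a minimal Python sketch, assuming the `pdftohtml` and `tidy` command-line tools are installed and on the path. The function names are made up for illustration, and the exact output file naming may differ between pdftohtml versions.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the two command lines: pdftohtml first, then tidy."""
    pdf = Path(pdf_path)
    html = Path(out_dir) / (pdf.stem + ".html")
    # -noframes asks pdftohtml for one HTML file instead of a frameset.
    convert = ["pdftohtml", "-noframes", str(pdf), str(html)]
    # -m edits the file in place, -q keeps tidy quiet, -asxhtml emits XHTML.
    clean = ["tidy", "-m", "-q", "-asxhtml", str(html)]
    return convert, clean


def pdf_to_clean_html(pdf_path, out_dir="."):
    """Convert a PDF to HTML, tidy it, and return the HTML file name."""
    convert, clean = build_commands(pdf_path, out_dir)
    subprocess.run(convert, check=True)
    # tidy exits with status 1 when it only printed warnings,
    # so only a status above 1 is treated as a failure here.
    result = subprocess.run(clean)
    if result.returncode > 1:
        raise RuntimeError("tidy reported errors")
    return clean[-1]
```

Calling `pdf_to_clean_html("report.pdf")` should leave a tidied `report.html` in the current folder, ready for the second main step.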

Scanned documents:

Sometimes scanned documents have the actual text embedded in the document.
- The pdftotext command extracts raw text: "pdftotext file.pdf" without the quotes.
- Evince (and possibly other PDF viewers) lets you select and copy the text.
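
To check quickly whether a scanned PDF has usable embedded text, something like the following Python sketch might help. It assumes `pdftotext` is installed. The function names and the threshold of 50 visible characters per page are arbitrary assumptions, not tested values.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return the text ('-' sends the output to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def has_embedded_text(text, min_chars_per_page=50):
    """Guess whether pdftotext output contains real text.

    pdftotext ends each page with a form feed, so the form feeds
    give a rough page count.
    """
    pages = text.split("\f")
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty piece after the final form feed
    visible = sum(1 for ch in text if not ch.isspace())
    return visible / max(len(pages), 1) >= min_chars_per_page
```

If `has_embedded_text(extract_text("file.pdf"))` is false, the document is probably image-only and OCR may be the only route.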

Acrobat Professional (i.e. paid version)

This is the best option so far. Check if your school/college/company has this program (you might have to ask for access).

In the non-free Adobe Acrobat there is an option to save to rich text formats - see Save a PDF file as a Word document, HTML file, or image. Fatima has used this and found it helpful. We should experiment and find the best path - saving to which format, and then using which method to translate to wiki markup. [expansion needed]
The free program Acrobat Reader has an "export as text" function, but only plain text. Copying and pasting also only gives plain text.

We could also test other readers to see if any allow copying with formatting. The following only do plain text: Evince Document Viewer 2.24 for Linux.

"Freeware"

Options that are free (as in free beer) but not open source:

Sorax PDF SDK DLL Edition 1.1 - "export PDF files to... XML." (image or text?)
Okay, so I (CurtB) have poked around with the Sorax DLL, and it looks interesting. I figured this commentary makes more sense on the article page than the talk page, and yet this is almost a discussion at this point, hence my chatty tone. It turns out that the DLL is indeed free, and the license for usage is generous, essentially do what you want with it as long as you don't reverse engineer it or hack it into something else. It also comes with a "demo" application, which could be very useful. Using this demo program, one can open a PDF document and export it into XML. It exports all the text (no images, sorry) into an xml file with useful formatting information, including font name, font size, italic or bold (true or false for those last two). There is a fair bit of other info that might not be interesting (paging, for example) and would need to be stripped off. Nevertheless, it is quite conceivable that a PERL or Python-based tool could quickly be written to strip the undesired stuff away, and convert the remainder to wiki form. Maybe even some clever SED scripts could do the bulk of the work! Yay!

It should be noted that the DLL is really intended for use by developers, and most particularly for Visual C developers, since the tool includes a "vcproj" file, which is a Visual C project file. Python developers also may be able to make use of the tool, based on the information in the included (PDF, of course) document, and with some help from this page I found. Writing an actual application that could use the DLL would be ideal, since it would allow bulk translation of PDFs, instead of the one-by-one conversion process that would be offered by the Demo application plus xml-to-wiki tool. Hmm. New thought. How well does OpenOffice convert XML to Wiki? Be right back! Nope, not much help. Okay, done for now. CurtB 00:29, 5 February 2008 (PST)

Does Sorax do the formatting for image placement? (As wikEd does when converting HTML.)

Free PDF to Word Doc Converter - reviews and comments[1][2] suggest that this is "nagware" (i.e. freeware that hassles you and adds extra steps) and that Zamzar (online service, below) gives better results.
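
The XML-to-wiki tool imagined in the Sorax notes above might look roughly like this Python sketch. The real Sorax XML schema is not documented on this page, so the element name `text` and the attributes `size`, `bold` and `italic` are assumptions for illustration only, as is the 16-point heading threshold. Anything that is not a text run (paging info, for example) is simply ignored.

```python
import xml.etree.ElementTree as ET


def xml_to_wikitext(xml_string, heading_size=16.0):
    """Turn text runs from a (hypothetical) XML export into wiki markup."""
    root = ET.fromstring(xml_string)
    lines = []
    # Only <text> elements are read, so page-level data is dropped.
    for run in root.iter("text"):
        content = (run.text or "").strip()
        if not content:
            continue
        size = float(run.get("size", "0"))
        if size >= heading_size:
            # Assumption: large type marks a section heading.
            lines.append("== " + content + " ==")
            continue
        bold = run.get("bold") == "true"
        italic = run.get("italic") == "true"
        # Wiki markup: ''italic'', '''bold''', '''''both'''''.
        quotes = "'" * ((3 if bold else 0) + (2 if italic else 0))
        lines.append(quotes + content + quotes)
    return "\n\n".join(lines)
```

Once the actual element and attribute names are known, only the lookups inside the loop should need changing.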

Free online services

Check these (and do a search to make sure you've got the latest version):

docq - upload the file and it will convert online. DocQ provides online PDF editing, highlighting, and e-signing. Free account trials available.
Zamzar (review) - upload the file and receive an email with a link to the output file. Works well, some hassle and hiccups. Formatting may need extra work, e.g. double line-breaks need replacing with single line-breaks for best results. This is the only free solution known to work so far.
Adobe's online conversion service - appears broken. After a long period (e.g. 75 min) it still displays "In progress".
Form Swift
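
The line-break cleanup mentioned for Zamzar output can be done with a short script rather than by hand. This is a minimal Python sketch; the function name is made up, and it assumes the converted text has already been loaded into a string.

```python
import re


def collapse_double_breaks(text):
    """Replace each double line-break with a single one."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A "double break" here is two newlines with only spaces or tabs between.
    return re.sub(r"\n[ \t]*\n", "\n", text)
```

Each pair of breaks is halved, so a run of four breaks becomes two and real paragraph gaps still show up as a blank line.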

Commercial programs (apart from Adobe Acrobat)

Question: are there free trial versions that do what we need? Help by trying them out. (These programs are not guaranteed - do some Googling to make sure they're safe, and make sure you've got good anti-spyware and anti-virus.)

These are not ideal, as:

we can't invite everybody to help out without paying lots of money or stretching/breaking the licensing agreements,
they usually take an extra step, via Word, and
they're only for Windows.

But for reference (in case of desperation):

ABC Amber PDF Convertor - $13, or $40 for multi-user (company) license. Best price of the commercial programs.
http://www.coolutils.com/product.php?product=TotalPDFConverter - $40, trial download, saves images
http://www.adorepdf.com/ - $20
http://www.pdfgrabber.com/ - EUR 32,77, + trial version.
http://www.quickpdftoword.com/ $30, + free trial
http://www.topshareware.com/PDF-Export-Kit-download-40390.htm - $49, + free trial version
http://www.convert-in.com/pdfekit.htm $29 just for PDF-to-HTML. Demo available (functional?)
http://www.docudesk.com/deskUNPDF-PRO-PDF-Converter.shtml PDF-to-Word. Also supports PDF to XLS and optional OCR.
Quick-PDF PDF to Word - $29 + 10-day free trial version

OCR

When a PDF file (or other format) is image based rather than text-based, this may be helpful. See User talk:LeissKG for a discussion of this technique.

OCR should probably be limited to those cases when text is only available as an image, as it will inevitably introduce some errors. It seems likely to be more difficult as well. [verification needed] Nevertheless, if proven out, this could be a useful tool for creating wiki versions of out-of-print articles or texts. Care must be taken, however, that copyright permissions are handled appropriately!

Here are some resources for OCR:

GOCR is an OCR (Optical Character Recognition) program
Article on Tesseract: an Open-Source Optical Character Recognition Engine and the software is here
List of free OCR programs here
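
For an image-only PDF, one possible OCR workflow is to render each page to an image and pass it to Tesseract. The Python sketch below assumes the `pdftoppm` and `tesseract` command-line tools are installed. The function names, the 300 dpi setting and the `page` file prefix are illustrative choices, and this has not been tried on Appropedia content.

```python
import subprocess
from pathlib import Path


def render_command(pdf_path, prefix, dpi=300):
    """Command that renders every PDF page to <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path, lang="eng"):
    """Command that OCRs one image; 'stdout' makes tesseract print the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, work_dir, lang="eng"):
    """Render the PDF into work_dir, OCR each page, return the joined text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_command(pdf_path, work / "page"), check=True)
    texts = []
    # pdftoppm zero-pads page numbers, so sorted() keeps the page order.
    for image in sorted(work.glob("page-*.png")):
        result = subprocess.run(
            tesseract_command(image, lang),
            check=True, capture_output=True, text=True,
        )
        texts.append(result.stdout)
    return "\n".join(texts)
```

The result is plain text only, so formatting still has to be added by hand, and the output should be proofread for recognition errors.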

2. Convert from formatted text to MediaWiki

There are several options, notably using wikEd, or OpenOffice (version 2.3 or higher). See Appropedia:Porting formatted content to MediaWiki for full details.
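
To give an idea of what this second step involves, here is a very small Python sketch that maps a few HTML tags to wiki markup. It is not how wikEd or OpenOffice work internally. It only handles headings, paragraphs, bold, italic and flat bullet lists, and all names in it are made up for illustration.

```python
import re
from html.parser import HTMLParser

INLINE = {"b": "'''", "strong": "'''", "i": "''", "em": "''"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class WikiConverter(HTMLParser):
    """Collects wiki markup pieces while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in INLINE:
            self.parts.append(INLINE[tag])
        elif tag in HEADINGS:
            # <h2> becomes "== ... ==", <h3> becomes "=== ... ===", etc.
            self.parts.append("\n" + "=" * int(tag[1]) + " ")
        elif tag == "li":
            self.parts.append("\n* ")
        elif tag == "p":
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in INLINE:
            self.parts.append(INLINE[tag])
        elif tag in HEADINGS:
            self.parts.append(" " + "=" * int(tag[1]) + "\n")

    def handle_data(self, data):
        # Collapse runs of whitespace, as a browser would.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_wikitext(html):
    converter = WikiConverter()
    converter.feed(html)
    converter.close()
    text = "".join(converter.parts)
    # Leading spaces would trigger preformatted text in MediaWiki.
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Tables, links, images and nested lists are ignored here, which is why the dedicated tools are the better choice for real pages.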

Manual formatting - old method

This is not recommended, but if you have problems with the other methods and need to try it, see Help:Porting PDF files to MediaWiki (old method, manual formatting).

Images

Images must be saved and uploaded.

Until now, this has been done as described at Help:Porting PDF files to MediaWiki (old method, manual formatting) #Transfer the images. There may be easier ways now, but there is still useful info and tips there, e.g. don't try too hard to match the layout of the original... PDFs are fixed size, while the layout of the wiki article will flex based on several variables. So invest some energy in layout, but don't overdo it.
In PDF-to-HTML conversion the images will be output in the same folder. (However, with Zamzar, each page's images are turned into a single image taking up the whole page - the text fits around it.)
In PDF-to-Word conversion the images will be integrated in the document.
Acrobat: Images are apparently saved automatically during file export:
- Exporting to HTML/XML with Acrobat, at least.
- Exporting PDFs as (formatted) text with Acrobat - help.adobe.com. Note: "Images in the PDF are saved by default in JPEG format." Is their location saved? Is there a way of smoothing the process of putting the image in the right place in the wiki page?

Question: Which of the formats include tags to indicate image location?
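
HTML output, at least, marks image positions with `img` tags. As a starting point, the Python sketch below lists the images referenced in a converted HTML file, in document order, and suggests a wiki image tag for each. It assumes each image will be uploaded under its original file name; the function and class names are made up, and the `thumb` option is only one possible choice.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath


class ImageLister(HTMLParser):
    """Records the src of every <img> tag in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_placeholders(html):
    """Return one [[File:...]] tag per image found in the HTML."""
    lister = ImageLister()
    lister.feed(html)
    lister.close()
    # Keep only the file name, since uploads have no folder structure.
    return ["[[File:" + PurePosixPath(src).name + "|thumb]]"
            for src in lister.sources]
```

The list doubles as a checklist of files to upload. Placing each tag at the right spot in the article is still a manual step.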

@@ Line 1: / Line 1: @@
-'''This page is for exploring how to speed this process of porting PDF documents. The main help page (which needs updating) is at: [[Help:Porting content from PDF format]].'''
+'''This is still a work in process - you can help by trying these methods and adding any information about what works. Contact [[User:Chriswaterguy|Chriswaterguy]] or [[User:Curtbeckmann|Curt]] if you have questions.'''
-==Possible methods==
+==Methods==
+At present, we do not have a simple one-step PDF-to-MediaWiki translation process which retains the desired text formatting.  We have many multi-step approaches which retain text formatting, all of which can be broken down into two main steps:
+#Convert (save) the PDF to a more workable intermediate format that supports interesting formatting, and
+#Convert the resulting file into MediaWiki format.
+The various options for each of these main steps are described below.
-* Acrobat Reader has an "export as text" function, but only plain text. Copying and pasting also only gives plain text
+The OpenOffice 3 approach may come closest to a straightforward solution, if we can get it to work.
-** Test other readers to see if any do formatting. The following only do plain text: Evince Document Viewer 2.00 for Linux...)
-* In the non-free Adobe Acrobat there is an option to save to Word format[http://www.adobe.com/products/acrobatpro/acrobatstd.html] - but apparently not in the free Adobe Reader. (If you have access to the larger non-free program Adobe Acrobat - v 5.0 or higher should work[http://www.library.mcgill.ca/edrs/services/publications/howto/PDFtoXLS/PDFtoExcel.html] - please try this and let us know here if it works. Your workplace, university or school may be able to give you access to it. In the free program, (in Linux and Windows version) it seems to only offer plain text. <!--If you open this in OpenOffice v 2.3 or greater, you should be able to export it as MediaWiki format. (Does this work smoothly?)-->
-* Use [[wikEd]] - this doesn't work yet, as the formatting is not saved when pasting into the edit box. Are there PDF readers or editors (or any other program which can open these files) which allow the formatting to be copied and pasted?
-* [[User talk:LeissKG]] - discussion on alternative techniques and issues in porting PDFs - OCR, text export.
-===Freeware & free online services===
+Alas, we do not yet have an automated way to transfer the images (though [[mw:Extension:MultiUpload]] may make it easier).  Help is welcome!
+===1. Save as formatted text===
+Convert to a Word, RTF or HTML file.
+It is preferable to use:
+# An open source solution, which can be used by anyone and can be improved if needed.
+# If this doesn't work, Acrobat Professional - some academics, students and business people will have access to it, and it is likely to work better than freeware or web services.
+(When searching for solutions, note that word combinations like ''PDF export'' gets a lot of false hits - mainly exporting ''to'' PDF, and also very many commercial programs. So, try this search:
+:''export OR convert pdf-to freeware -demo -free-trial images formatted OR layout'')
+==== Open source options ====
+There is an extension for OpenOffice 3 Beta (presumably works with OO3) that facilitates import of pdf documents [http://extensions.services.openoffice.org/project/pdfimport Sun PDF Import Extension (Beta)]. If this works well, combined with OpenOffice's existing MediaWiki export functionality, '''this may be a one stop tool for PDF to MediaWiki conversion'''. ''[[User:Chriswaterguy|Chriswaterguy]] is trying this out now, but having trouble with bugs preventing acceptance of the EULA. 21:24, 14 December 2008 (UTC)''
+Alternatively:
+# use http://pdftohtml.sourceforge.net/ or xpdf to convert to html
+# clean up with htmltidy. It should now be ready to [[Appropedia:Porting formatted content to MediaWiki|convert to MediaWiki]].
+Scanned documents:
+* Sometimes scanned documents have the actual text embedded in the document.
+** The pdftotext command extracts raw text: "pdftotext file.pdf" without the quotes.
+** Evince PDF viewer (and possibly others) allow you to select and copy.
+==== Acrobat Professional (i.e. paid version) ====
+This is the best option so far. Check if your school/college/company has this program (you might have to ask for access).
+* In the non-free Adobe Acrobat there is an option to save to rich text formats - see [http://www.adobe.com/designcenter/tutorials/acr7at_savepdfas/ Save a PDF file as a Word document, HTML file, or image]. [[User:Fatima|Fatima]] has used this and found it helpful. We should experiment and find the best path - saving to which format, and then using which method to translate to wiki markup.{{sp}}
+* The free program Acrobat Reader has an "export as text" function, but only plain text. Copying and pasting also only gives plain text.
+We could also test other readers to see if any allow copying with formatting. The following only do plain text: Evince Document Viewer 2.24 for Linux.
+==== "Freeware" ====
+Options that are free (as in free beer) but not open source:
+* [http://www.download.com/Sorax-PDF-SDK-DLL-Edition/3000-2070_4-10596381.html?tag=lst-6  Sorax PDF SDK DLL Edition 1.1] - "export PDF files to... XML." (image or text?)
+*:Okay, so I ([[User:Curtbeckmann|CurtB]]) have poked around with the Sorax DLL, and it looks interesting.  I figured this commentary makes more sense on the article page than the talk page, and yet this is almost a discussion at this point, hence my chatty tone.  It turns out that the DLL is indeed free, and the license for usage is generous, essentially do what you want with it as long as you don't reverse engineer it or hack it into something else.  It also comes with a "demo" application, which could be very useful.  Using this demo program, one can open a PDF document and export it into XML.  It exports all the text (no images, sorry) into an xml file with useful formatting information, including font name, font size, italic or bold (true or false for those last two).  There is a fair bit of other info that might not be interesting (paging, for example) and would need to be stripped off.  Nevertheless, it is quite conceivable that a PERL or Python-based tool could quickly be written to strip the undesired stuff away, and convert the remainder to wiki form.  Maybe even some clever SED scripts could do the bulk of the work!  Yay!
+*:It should be noted that the DLL is really intended for use by developers, and most particularly for Visual C developers, since the tool includes a "vcproj" file, which is a Visual C project file.  Python developers also may be able to make use of the tool, based on the information in the included (PDF, of course) document, and with some help from [http://www.thescripts.com/forum/thread23518.html this page I found].  Writing an actual application that could use the DLL would be ideal, since it would allow bulk translation of PDFs, instead of the one-by-one conversion process that would be offered by the Demo application plus xml-to-wiki tool.  Hmm.  New thought.  How well does OpenOffice convert XML to Wiki?  Be right back!  Nope, not much help.  Okay, done for now. [[User:Curtbeckmann|CurtB]] 00:29, 5 February 2008 (PST)
+::Does Sorax do the formatting for image placement? (As wikEd does when converting HTML.)
+* [http://hellopdf.com/tutorial.php Free PDF to Word Doc Converter] - reviews and comments[http://lifehacker.com/342694/free-pdf-to-word-doc-converter-is-exactly-that][http://www.freewaregenius.com/2008/01/09/convert-pdf-to-word-with-free-pdf-to-word-doc-converter/] suggest that this is "nagware" (i.e. freeware hassles you, adds extra steps) and that Zamzar (online service, below) gives better results.
+==== Free online services====
 Check these (and do a search to make sure you've got the latest version):
-* [http://zamzar.com/ Zamzar] - upload the file and receive an email with a link to the output file. Works well, some hassle and hiccups. Formatting may need extra work, e.g. double line-breaks need replacing with single line-breaks for best results. '''This is the only solution known to work so far.'''
+* [http://docq.com/ docq] - upload the file and it will convert online.  DocQ provides online PDF editing, highlighting, and e-signing.  Free account trials available.
-* [http://www.download.com/Sorax-PDF-SDK-DLL-Edition/3000-2070_4-10596381.html?tag=lst-6  Sorax PDF SDK DLL Edition 1.1] - "export PDF files to... XML." (image or text?)
+* [http://zamzar.com/ Zamzar] ([http://www.freewaregenius.com/2007/02/01/zamzar/ review]) - upload the file and receive an email with a link to the output file. Works well, some hassle and hiccups. Formatting may need extra work, e.g. double line-breaks need replacing with single line-breaks for best results. '''This is the only free solution known to work so far.'''
-* [http://www.adobe.com/products/acrobat/access_onlinetools.html Adobe's online conversion service] - tends to be slow - if it works at all.
+* [http://www.adobe.com/products/acrobat/access_onlinetools.html Adobe's online conversion service] - appears broken. After a long period (e.g. 75 min) it still displays "In progress".
-* [http://hellopdf.com/tutorial.php Free PDF to Word Doc Converter] - reviews and comments[http://lifehacker.com/342694/free-pdf-to-word-doc-converter-is-exactly-that] suggest that this is "nagware" (i.e. freeware hassles you, adds extra steps) and that Zamzar (above) gives better results.
+* [http://formswift.com/convert-pdf-to-word Form Swift]
-===Commercial programs (apart from Adobe Acrobat)===
+====Commercial programs (apart from Adobe Acrobat)====
-Question: are there free trial versions that do what we need? Help by trying them out. (These programs are not guaranteed - do some Googling to make sure they're safe, and/or make sure you've got good anti-spyware and anti-virus.)
+Question: are there free trial versions that do what we need? Help by trying them out. (These programs are not guaranteed - do some Googling to make sure they're safe, and make sure you've got good anti-spyware and anti-virus.)
-These are not ideal, as 1. we can't invite everybody to help out without paying lots of money or stretching/breaking the licensing agreements, 2. they usually take an extra step, via Word, and 3. They're only for Windows.
+These are not ideal, as:
+#we can't invite everybody to help out without paying lots of money or stretching/breaking the licensing agreements,
+#they usually take an extra step, via Word, and
+#They're only for Windows.
 But for reference (in case of desperation):
+* [http://www.processtext.com/abcpdf.html ABC Amber PDF Convertor] - $13, or $40 for multi-user (company) license. Best price of the commercial programs.
 * http://www.coolutils.com/product.php?product=TotalPDFConverter - $40, trial download, saves images
 * http://www.adorepdf.com/ - $20
@@ Line 30: / Line 78: @@
 * http://www.topshareware.com/PDF-Export-Kit-download-40390.htm - $49, + free trial version
 * http://www.convert-in.com/pdfekit.htm $29 just for [http://www.convert-in.com/pdf2word.htm PDF-to-HTML]. Demo available (functional?)
+* http://www.docudesk.com/deskUNPDF-PRO-PDF-Converter.shtml [http://www.docudesk.com/deskUNPDF-PRO-PDF-Converter.shtml PDF-to-Word]. Also supports PDF to XLS and optional OCR.
+* [http://www.quick-pdf.com/pdf-to-word.htm Quick-PDF PDF to Word] - $29 + 10-day free trial version
+====OCR====
+When a PDF file (or other format) is image based rather than text-based, this may be helpful. See [[User talk:LeissKG]] for a  discussion of this technique.
+OCR should probably be limited to those cases when text is only available as an image, as it will inevitably introduce some errors. It seems likely to be more difficult as well.{{fact}}  Nevertheless, if proven out, this could be a useful tool for creating wiki versions of out-of-print articles or texts.  Care must be taken, however, that copyright permissions are handled appropriately!
+Here are some resources for OCR:
+* [http://jocr.sourceforge.net/ GOCR is an OCR (Optical Character Recognition) program]
+* [http://www.linuxjournal.com/article/9676 Article on Tesseract: an Open-Source Optical Character Recognition Engine] and the software is [http://code.google.com/p/tesseract-ocr/ here]
+* List of free OCR programs[http://www.thefreecountry.com/utilities/ocr.shtml here]
+=== 2. Convert from formatted text to MediaWiki ===
+{{main|Appropedia:Porting formatted content to MediaWiki}}
+There are several options, notably using [[wikEd]], or OpenOffice (version 2.3 or higher). See [[Appropedia:Porting formatted content to MediaWiki]] for full details.
+=== Manual formatting - old method ===
+This is not recommended, but if you have problems with the other methods and need to try it, see [[Help:Porting PDF files to MediaWiki (old method, manual formatting)]].
 ==Images==
-*Are images saved automatically during file export? It appears so for [http://help.adobe.com/en_US/Acrobat/8.0/Standard/help.html?content=WS58a04a822e3e50102bd615109794195ff-7ef2.html exporting to HTML/XML], at least. Do any of the formats include tags to indicate image location?
+Images must be saved and uploaded.
-*[http://help.adobe.com/en_US/Acrobat/8.0/Professional/help.html?content=WS58a04a822e3e50102bd615109794195ff-7eeb.html Export PDFs as (formated) text] - help.adobe.com. Note: "Images in the PDF are saved by default in JPEG format." Is their location saved? Is there a way of smoothing the process of putting the image in right place in the wiki page?
+* Until now, this has been done as described at [[Help:Porting PDF files to MediaWiki (old method, manual formatting) #Transfer the images]]. There may be easier ways now, but there are still useful info and tips there, e.g. ''don't try ''too'' hard to match the layout of the original... PDF's are fixed size, while the layout of the wiki article will flex based on several variables. So invest some energy in layout, but don't overdo it''.
+* In PDF-to-HTML conversion the images will be output in the same folder. (However, with Zamzar, each page's images are turned into a single image taking up the whole page - the text fits around it.)
+* In PDF-to-Word conversion the images will be integrated in the document.
+* Acrobat: Images are apparently saved automatically during file export:
+**  [http://help.adobe.com/en_US/Acrobat/8.0/Standard/help.html?content=WS58a04a822e3e50102bd615109794195ff-7ef2.html Exporting to HTML/XML with Acrobat], at least.
+** [http://help.adobe.com/en_US/Acrobat/8.0/Professional/help.html?content=WS58a04a822e3e50102bd615109794195ff-7eeb.html Exporting PDFs as (formated) text with Acrobat] - help.adobe.com. Note: "Images in the PDF are saved by default in JPEG format." Is their location saved? Is there a way of smoothing the process of putting the image in right place in the wiki page?
+'''Question''': Which of the formats include tags to indicate image location?
 [[Category:Porting]]