Appropedia:Porting PDF files to MediaWiki (old method, manual formatting)
Note: This page describes an old method. In most cases, the new methods will be much faster and easier - see Help:Porting PDF files to MediaWiki.
Adobe's PDF (Portable Document Format) was designed to provide an application- and OS-independent format that could be viewed freely (in both the "no cost" and "no constraint" senses) by anyone. The encoding of PDF documents does not always lend itself to easy migration to the "wiki text" format (or even to other formats). The steps below offer a relatively straightforward, low-stress way to approach porting. Don't worry that this page is a tad long... It's not as challenging as the length might suggest. Most of the ideas here are quite simple, and the stylistic approach was to "overcommunicate" rather than to be succinct.
The steps below assume that you have downloaded a recent version of Adobe Acrobat, which includes some features that the earlier versions did not include.
Please also review Help:How to port pages to be sure that you handle copyright permissions and Appropedia conventions correctly.
Steps in porting PDF to Wikitext
- Choosing a reasonable source document
- Choose a name and create your article
- Prepare for migration
- Transfer the text first
- Format the text
- Upload and embed the images
- Do a final compare between the article and the original content
- Update the templates and categorization to show completion
Choosing a reasonable source document
Before you begin the porting process, be sure that your wikitext skills are a match for the document content that you wish to port. If you have a collection of documents that are available for porting (as often happens), begin by porting some simpler documents. Content features that require a bit of experience are listed below, along with a link to the MediaWiki help for each:
- Tables. See http://meta.wikimedia.org/wiki/Help:Table
- Images. See Help:Images and http://en.wikipedia.org/wiki/Wikipedia:Extended_image_syntax
- Web links. See http://meta.wikimedia.org/wiki/Help:Link
- Footnotes. See http://meta.wikimedia.org/wiki/Help:Footnotes
Once you're comfortable with the wiki formatting side of things, the tricky bit is often in extracting the content from the PDF document. The best approach is to be patient, break it down into phases, and move through it step by step. You can port a fairly simple 5 page PDF document in 15 minutes, but complexity will add time; 5 pages could take an hour or more if there are lots of tables, footnotes, images and links.
Choose a name and create your article
Choosing a name is usually just a matter of taking the title of the original work, then making sure it's unique at Appropedia. Enter the name in the search box at left and click "Go". If the title is already used, or if for some reason it is not a clear reference to the subject, you'll need to modify the name. Or it may be that all articles ported from a particular source will get a certain tag in their names. In any case, choose a reasonable name. There is no perfect name, and the article can be moved to a different name later if necessary.
When you have entered a unique name and clicked "Go", the screen will say There is no page titled "XYZ". You can create this page. Click on the "create this page" to begin editing the article.
Prepare for migration
This can be a simple matter of keeping the source document open alongside the destination window. That may mean using two browser windows, or you may prefer to download the PDF document to your personal machine and open the doc without the browser. Another alternative may be to work offline by transferring the PDF textual content to a text file or other word processing format. (This can be a good approach if you have a slow or somehow flaky link. There's less chance of losing a lot of work!) In that case, you wouldn't have any browsers open.
Transfer the text first
Transferring the text is pretty straightforward, but there are pitfalls. The basic approach is copy and paste. That sounds obvious, but there can be surprises. You can copy/paste directly into an Appropedia edit window, or into an intermediate document (such as Word) if you prefer.
UPDATE: Acrobat has an "export as text" function which makes this easier - save to RTF or Word format. If you open this in OpenOffice v 2.3 or greater, you should be able to export as MediaWiki format.
In Acrobat, to mark a section of text you must first "pick" (click on) the "select text" tool (this step is not always required in other applications). With the text-select tool chosen, select the text by clicking the mouse at the beginning of a text block (say, a paragraph) and dragging your mouse to the end of the text block. This should be obvious, but frequently PDF documents are produced with complex layouts (multiple columns, embedded text boxes and images, etc) that can result in strange text sequences, so it's best to keep alert for strange selection behavior, since you may not get what you expect.
Copy and paste
Once the text is selected, use the "copy" function (there are many ways to invoke this function) to put the selected text on a "clipboard", then use the "paste" function to put the text into your destination. As soon as you have pasted, it's a good practice to check a few places in the destination to see that you actually transferred what you expected to. If not, you can usually use the "undo" feature to back up and try again.
If you're copy/pasting table contents, or highlighted sections, or if there are images embedded in the document, it's often helpful to insert some reminder text, like " (highlight starts here)" or " (table text ends here)" or " Insert Figure 1 here". Put this kind of reminder on a line by itself, and start with a space to cause special formatting to make it easy to see when you view the page later.
Format the text
Formatting the text is pretty simple, except for tables and footnotes. Mostly it consists of identifying headings and formatting them appropriately. Sometimes that takes a little bit of thought if the heading style of the source is not clear. But give it a try and it can easily be tweaked later if it needs to be. It's good to make these heading changes first since it helps when viewing and navigating the page for subsequent edits. You can, for example, edit subsections rather than the entire page.
Anything bold or italic in the original will be lost, and will therefore need some extra attention.
Bulleted and numbered lists are another level of edit, pretty straightforward.
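If you prepare text offline before pasting it into the edit window, the basic markup for headings, emphasis and lists can be generated with a few small helpers. This is only a minimal sketch in Python: the function names are made up for illustration and are not part of any Appropedia or MediaWiki tool. The markup itself (equals signs for headings, apostrophes for bold and italic, `*` and `#` for lists) is standard wikitext.

```python
def heading(title, level=2):
    """Return a wikitext heading. Level 2 (== Title ==) is the usual top-level section."""
    marks = "=" * level
    return f"{marks} {title.strip()} {marks}"


def bold(text):
    """Wrap text in three apostrophes, which wikitext renders as bold."""
    return f"'''{text}'''"


def italic(text):
    """Wrap text in two apostrophes, which wikitext renders as italic."""
    return f"''{text}''"


def bullet_list(items):
    """One '* ' line per item gives a bulleted list."""
    return "\n".join(f"* {item}" for item in items)


def numbered_list(items):
    """One '# ' line per item gives a numbered list."""
    return "\n".join(f"# {item}" for item in items)
```

For example, `heading("Format the text")` gives `== Format the text ==`, and `numbered_list(["first", "second"])` gives two lines starting with `#`. Use "Show Preview" to check the result, as with hand-typed markup.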
All the items in this section are easily learned. If you're not already familiar with them, take a quick look at http://meta.wikimedia.org/wiki/Help:Contents. This page (and the related pages) will be a close friend. It's really well written, with great examples. Often it's handy to keep two browser windows open, one that you're using to edit, and one that's looking at the help page. You can do some quick copy/paste from the help examples into the article you're editing, then use the "Show Preview" below to see how it comes out.
Working with "line breaks"
When copy/pasting from PDF docs, you get a "free line break" with every line. The wiki formatting tool will often ignore line breaks, so it's not essential that the "extra line breaks" get removed. On the other hand, there are times where the line breaks influence formatting (such as within a URL, see below), and it can also be annoying to edit pages which have numerous short lines, which can happen when porting from a PDF document that has narrow columns. Use your judgment about removing line breaks. Look at the resulting document and you may see some cases where line breaks have affected formatting in ways you don't want.
You may also find cases where you want an explicit line break. Usually two line breaks in the wiki text will have the desired effect, but you can also insert <br> to force a line break.
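If the pasted text has a great many short lines, removing the extra line breaks by hand gets tedious. The sketch below shows one possible way to do it offline in Python, assuming that a blank line in the pasted text marks a real paragraph break and that every other line break is just PDF line wrapping. The function name `unwrap_lines` is hypothetical, and the result still needs a read-through: a word hyphenated across lines, or a URL split across lines, will end up with a stray space in it.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines into paragraphs; blank lines remain paragraph breaks."""
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A run of blank (or whitespace-only) lines separates paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for para in paragraphs:
        # Drop leading/trailing spaces on each line, then join with single spaces.
        lines = [line.strip() for line in para.split("\n") if line.strip()]
        joined.append(" ".join(lines))
    # Two line breaks between paragraphs, which wikitext treats as a new paragraph.
    return "\n\n".join(joined)
```

Do not run this over text that already contains tables, lists, or reminder lines that start with a space, since those rely on their line breaks.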
URLs (Uniform Resource Locators) are basically web addresses. The wiki formatter recognizes pretty much anything that starts with "http://" or "https://" (up to a space) as being a URL, and will display it as a link, meaning that if you click on it with the mouse, the browser will go to that web location. Often URLs in imported documents may not include the "http://", in which case you should add that (as well as removing embedded line breaks, if any).
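The two URL repairs described above (removing embedded line breaks and adding a missing "http://") can be sketched as a small Python helper. The name `ensure_scheme` is made up for this example, and defaulting to "http://" is an assumption; if you know the site uses "https://", pass that instead.

```python
def ensure_scheme(url, scheme="http://"):
    """Strip embedded whitespace and line breaks, then add a scheme if none is present."""
    # str.split() with no arguments splits on any whitespace, including newlines,
    # so joining the pieces removes every embedded break or space.
    cleaned = "".join(url.split())
    if cleaned.lower().startswith(("http://", "https://")):
        return cleaned
    return scheme + cleaned
```

For instance, `ensure_scheme("www.example.org")` returns `http://www.example.org`, and a URL that already starts with "https://" is returned unchanged apart from the whitespace cleanup.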
Hyperlinks are text which has a hidden (normally undisplayed) URL associated with it. Probably all hyperlinks in a PDF that is being imported will be external links. Internal links (within Appropedia) and inter-wiki links (between Appropedia and many common wikis) are formatted differently than external links, but don't worry about them here. External links are simple to format. They should look like this in the edit window:
[hiddenURLtext displayedtext which can have spaces]
Extracting the "displayedtext" from the PDF doc is easy using the copy/paste techniques. Copying the "hiddenURLtext" is a little trickier. One technique is to "right click" with your mouse on the link in the PDF, which should give you a brief "context menu", including the choice of "copy link location", which will put the hiddenURLtext onto the clipboard, and you can then paste it into the Appropedia edit window. If that doesn't work for some reason, you can use the "crude but effective" method of actually clicking on the hyperlink in the document, which should open a browser at the URL location, and you can then capture the URL from the browser. (This is not always perfect, because sometimes there are auto-redirects, so that the URL that is displayed in the browser might not be the URL that was embedded in the original hyperlink.)
More info at: http://meta.wikimedia.org/wiki/Help:Link
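To make the bracket format concrete, here is a minimal Python sketch that assembles an external link from the two pieces you copied out of the PDF. The function name `external_link` is hypothetical. It assumes the URL is already clean (no spaces or line breaks) and that the displayed text contains no square brackets, which would confuse the wiki formatter.

```python
def external_link(url, label=None):
    """Build wikitext for an external link: [url displayed text]."""
    url = url.strip()
    if any(ch.isspace() for ch in url):
        # The first space ends the URL in wikitext, so whitespace here is a mistake.
        raise ValueError("URL must not contain spaces or line breaks")
    if not label:
        # A bare URL is turned into a link automatically.
        return url
    # Collapse line breaks and repeated spaces picked up when copying from the PDF.
    label = " ".join(label.split())
    return f"[{url} {label}]"
```

Calling `external_link("http://meta.wikimedia.org/wiki/Help:Link", "link help")` gives `[http://meta.wikimedia.org/wiki/Help:Link link help]`, which displays as the clickable text "link help".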
Footnotes are not too much trouble. Typically in PDF documents, a footnote appears as a number in superscript in the body of the article, and then at the bottom of the page or the end of the article there is a corresponding reference to a book, or website, or sometimes a brief explanation. You can achieve this end result fairly easily with footnotes/references. The details are explained at http://meta.wikimedia.org/wiki/Help:Footnotes, but basically what happens is that you need to take the text (an explanation, or a reference to a book or a URL), and add some decorations (like "<ref>" at the beginning and "</ref>" at the end of the embedded footnote text), and embed this decorated text at the point where the reference or footnote is being added. All this takes a bit of editing, but is otherwise quite straightforward. Seriously, though, it's all explained pretty clearly at the help link above.
If you want to be really fancy (this is certainly not required), you can look for repeated references to the same document or website, and coalesce them into footnotes that point to the same reference. This can be done by using the <ref name=XYZ> construction, also described at the link above.
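As an illustration of the footnote "decorations", the Python sketch below wraps footnote text in ref tags, with an optional name so that later citations of the same source can point back to it. The helper names are invented for this example and the sample footnote text is made up. The tag forms themselves (`<ref>`, `<ref name="...">`, the self-closing reuse form, and `<references />` to mark where the footnote list appears) are the standard ones described on the help page.

```python
# Place this tag where the collected footnotes should be listed, usually near the end.
REFERENCES_TAG = "<references />"


def ref(text, name=None):
    """Wrap footnote text in <ref> tags, optionally naming it for reuse."""
    # Collapse line breaks left over from the PDF copy/paste.
    text = " ".join(text.split())
    if name is None:
        return f"<ref>{text}</ref>"
    return f'<ref name="{name}">{text}</ref>'


def ref_reuse(name):
    """Point a later citation at a footnote that was already defined with this name."""
    return f'<ref name="{name}" />'
```

The first time a source is cited you would paste something like `ref("Example Author, Example Title", name="example")` into the body text, and each later citation of the same source becomes `ref_reuse("example")`.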
Tables are...what can I say? Tables are a bit complicated. Even simple tables seem involved, and complex tables (where some "cells" span multiple columns or multiple rows, etc) are more involved. There is very good news, though; the help document gives good examples, which can be copy/pasted to create a foundation for what you want, and then you can fairly easily massage them into something that achieves the goal.
And that, basically, is the sum of wisdom for this section:
- look at the help page
- find the closest match for a table
- copy/paste the example into your document
- cut/paste the text into the table
- use "show preview" regularly to see your results
In addition to the help pages, you might want to look around at articles at Appropedia and see if any of them offer tables that look like what you want. Then simply click "edit" on the page (you can always cancel) and copy the table formatting from the source.
Last option: Punt the problem to someone else. This is a perfectly acceptable option (and indeed this option can fairly be applied to most problems at Appropedia...do what you know how to do, then let others take over). Merely leave some information on the discussion (or "Talk") page by clicking on the discussion tab at the top of the page. (If you leave a note, please "sign" it with 4 ~'s.) If you'd like to learn how someone else handled it, you can check back now and then, or you can choose to "watch" the page, which will send you an email whenever the page is modified. See: http://meta.wikimedia.org/wiki/Help:Table
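For a plain grid with one header row and no spanning cells, the table markup is regular enough to generate. The following is a minimal Python sketch, with a made-up function name, that turns a header and a list of rows into basic wikitext table markup using the common `class="wikitable"` styling. It assumes that no cell contains a `|` character or a line break; anything fancier is better built from the help page examples.

```python
def wikitable(header, rows):
    """Build a simple wikitext table from a header row and a list of data rows."""
    lines = ['{| class="wikitable"']           # table start
    lines.append("! " + " !! ".join(str(h) for h in header))  # header cells
    for row in rows:
        lines.append("|-")                      # row separator
        lines.append("| " + " || ".join(str(cell) for cell in row))
    lines.append("|}")                          # table end
    return "\n".join(lines)
```

For example, `wikitable(["Item", "Qty"], [["Nails", 10], ["Screws", 25]])` produces a header line `! Item !! Qty` followed by one `|-` separator and one `|` line per row. Paste the result into the edit window and use "show preview" to check it.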
Transfer the images
Extracting images from PDF documents onto the "clipboard"
There are two schemes for getting images out of PDF documents. Both involve copying the image to the "clipboard", then saving it to a file. Which method you choose depends on how the image was embedded in the PDF document.
The easier method should be tried first. Pick the "select" tool from the toolbar, then click on the image. This tool seems to behave a little inconsistently, but if you're lucky, a small graphic will appear giving you the option of copying the image to the clipboard. Simply click on the graphic to do the copy function.
If you can't convince the PDF viewer to offer you the "copy to clipboard" option, then you can use Acrobat's "snapshot" tool to select a rectangular area which will be copied to the clipboard when you release the mouse button.
A last option, which works in Windows, is to press "PrtSc", which will copy the entire screen to the clipboard (pressing "Alt" and "PrtSc" together copies only the active window). This is fairly clumsy, can degrade the image quality, and includes a lot of stuff you don't want. But it can be used in a pinch.
Saving images on the "clipboard" to a file, Windows version
In either case, you will need to get the image from the clipboard to a file. A technique that works on Windows computers is to paste the image into Microsoft Paint. Microsoft Paint is a simple image processing application that is included free with Windows. Look under "Programs/Accessories". This process is simplest when the pasted image fills the whole Paint image, and that can be done most easily by first setting the Paint image attributes to a small image size. To do this, use the Image menu and select the "Attributes" menu item. Set the dimensions to something small, like 16 x 16. All images that you paste will likely be larger than 16 x 16, and so Paint will readjust the dimensions to match the pasted image.
Once you have the image in Paint, you can save it to disk in a variety of formats via the "File/Save as..." command, then select the appropriate file extension and enter the filename you want before clicking "Save".
(If you are aware of techniques that work in other Operating Systems, please improve this help page by adding those techniques. Thanks!)
Loading images into Appropedia
Loading images from your computer is very straightforward. (If they're not on your computer, you will need to copy them there.) Choose an image name that is likely to be unique. Best approach is to start the image name based on the source where the image came from. When you know what you want to call the image, click on "Upload a file" in the tool box at the left and follow the instructions.
Embedding images into the article
See Help:Images to learn how to write the source that will embed an image. Word of wisdom: in placing and sizing images in your article, don't try too hard to match the layout of the original. Spend a little time, but PDFs have a fixed page size, while the layout of the wiki article will flex based on several variables. So invest some energy in layout, but don't overdo it.
One ramification of this layout challenge is that it may be necessary to make minor changes to the text wherever there are references to images. For example, the text may reference "the picture at right", but you may find that you cannot guarantee that the appropriate picture appears to the right of the text. In such cases, the picture caption can be made to begin with "Picture XYZ:" or similar, and the text reference can be adjusted accordingly so that clarity is not lost.
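The embed markup for an uploaded image follows a regular pattern, so it can be sketched as a small Python helper. The function name and the file name in the example are hypothetical. The `File:` prefix and the `thumb`, alignment, width and caption options are standard MediaWiki image syntax, but check Help:Images for the conventions Appropedia prefers.

```python
def image_markup(filename, caption="", width_px=None, align="right"):
    """Build wikitext that embeds an uploaded image as a captioned thumbnail."""
    if align not in ("left", "right", "center", "none"):
        raise ValueError("align must be left, right, center or none")
    parts = [f"File:{filename}", "thumb", align]
    if width_px:
        parts.append(f"{width_px}px")   # optional display width in pixels
    if caption:
        parts.append(caption)           # the caption goes last
    return "[[" + "|".join(parts) + "]]"
```

For example, `image_markup("Sourcename-figure1.png", "Picture 1: Example caption", 300)` gives `[[File:Sourcename-figure1.png|thumb|right|300px|Picture 1: Example caption]]`. Starting the caption with "Picture 1:" keeps text references clear even when the image does not land where it did in the PDF.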