Appropedia:Platform/Archive/Wikilink bot

Note: this page may be out of date - contact Chriswaterguy for the latest news, and working scripts and tweaks.

Background

Goal: To improve linking between articles by using an automated or semi-automated bot or program.

It is believed this will:

greatly improve usability (navigability) and stickiness of the site;
improve the usefulness of "What links here", which helps admins and others who categorize and manage the site;
make article tidying more efficient, with less time spent adding wikilinks;
improve search rankings (especially for key/topic articles).

Platform:

meta:Pywikipedia - it's what we know, and it has many tools, scripts and features designed for MediaWiki.
- mailing list and project admins
- IRC chat channel: irc://freenode/pywikipediabot (can find helpful people there).

The working bot!

Docs in progress.

This is a very useful bot for maintaining Appropedia, but we could use more hands on deck. If you'd like to help with it, read on, and contact Chris with questions.

This uses Pywikipediabot.

The following command almost completely works. It searches for either CCAT or Campus Center for Appropriate Technology, whichever comes first, wikilinks it, and does so only once per page:

python replace.py -regex "(?si)\b((?:CCAT|Campus Center for Appropriate Technology))\b(.*$)" "[[\\1]]\\2" -exceptinsidetag:link -exceptinsidetag:hyperlink -exceptinsidetag:header -exceptinsidetag:nowiki -exceptinsidetag:ref -excepttext:"(?si)\[\[((?:CCAT|Campus Center for Appropriate Technology)[\|\]])" -namespace:0 -namespace:102 -namespace:4 -summary:"[[Appropedia:Wikilink bot]] adding double square brackets to: CCAT|Campus Center for Appropriate Technology." -log -xml:currentdump.xml

Notes:

The excepttext string includes [\|\]], which might be confusing. That last term means the match can be closed on the right by either a ] or a | (i.e. in a piped link). (The whole excepttext argument, -excepttext:"(?si)\[\[((?:CCAT|Campus Center for Appropriate Technology)[\|\]])", is the test that makes it skip pages which have already linked one of the terms.)
The exceptinsidetag arguments are used to ensure it doesn't link inside a wikilink, hyperlink or header. (E.g. for the wikilink - if you're linking [[solar]], you don't want to replace [[solar power]] with [[[[solar]] power]].)
There is still a problem with external links. It doesn't match inside the URL (thanks to the "-exceptinsidetag:hyperlink" argument), but it can match inside the text part of an external link, e.g. match "solar" in [http://example.com Solar blah blah]. Or maybe there is now an exceptinsidetag option for that too? (It's been many months since I've experimented with that problem --Chriswaterguy 11:38, 16 June 2011 (PDT))
Similar to the previous point - we don't want it to link inside gallery descriptions... there seems to be a problem with wikilinks inside galleries. Need to check if this is still a problem. If so, maybe another exceptinsidetag option can be created, with a bit of python work.
\\1, \\2 etc refer to groups in the match string, in order. The groups are made by enclosing part of the pattern in round brackets (parentheses). Normally they would be \1, \2 etc, but as these commands are run in a terminal, the backslash has to be "escaped" with a second backslash.
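The search and excepttext patterns from the command above can be tried outside the bot with Python's standard re module. This is only a minimal sketch for experimenting, not how replace.py works internally: the function name link_first_occurrence is made up for illustration, and none of the -exceptinsidetag handling is reproduced.

```python
import re

TERMS = r"(?:CCAT|Campus Center for Appropriate Technology)"

# Same idea as the -regex argument: group 1 is the term, group 2 is the rest of the page.
SEARCH = re.compile(r"(?si)\b(" + TERMS + r")\b(.*$)")

# Same idea as the -excepttext argument: a term that already opens a wikilink,
# closed on the right by either "]" or "|".
ALREADY_LINKED = re.compile(r"(?si)\[\[(" + TERMS + r"[\|\]])")


def link_first_occurrence(text):
    """Wikilink the first matching term, or return text unchanged if a term is already linked."""
    if ALREADY_LINKED.search(text):
        return text
    # Inside Python, the groups are written \1 and \2 (no terminal escaping needed).
    return SEARCH.sub(r"[[\1]]\2", text, count=1)
```

Calling link_first_occurrence on a page's wikitext returns the text with at most one new link, which makes it easy to compare against what the bot proposes.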

Regex

Understanding regex (regular expressions, a kind of wildcard pattern) is very useful if you're doing anything other than simple uses of this bot with different strings.

The best place to test your regex strings is on http://myregextester.com/ - it's much quicker than running the bot and gives much better feedback. You just have to choose a suitable page or make up a string (with the text that you want it to replace, or to ignore), and put it in the "Source" box. Remember to choose "Replace" under "OPERATION" so that you can enter a replace string. You'll have to change \\1 etc to \1 etc to make them work on this site.

Regex terms used:

(?si) - the i means case-insensitive. The s means that "." also matches line breaks, so a pattern can match across lines. This is important in the last term (.*$), which matches everything to the end of the file (page), in effect ensuring that only the first occurrence of the term on the page is wikilinked.
\\1 - the first group in the search string (\1 when not typed in a terminal).
\b - word boundary - useful for finding whole words.
Any string followed by (.*$) means "from here to the end". With the (?s) term at the beginning, "." also crosses line breaks, so this runs to the end of the file; without it, "." stops at a line break. Using this in the search string stops more than the first occurrence of the key word/phrase being replaced. Of course the captured text has to be put back in the replacement, e.g. by using "\\2" when it's the second group in the search string.
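To see what \b and the trailing (.*$) group do, the two patterns below can be compared in a Python session. This is a small sketch using the standard re module; the sample page text is made up.

```python
import re

page = "A solarium is not solar cooking.\nMore about solar.\n"

# With (?s) and the trailing (.*$) group, the first match swallows the rest
# of the page, so only the first whole-word occurrence is linked.
first_only = re.sub(r"(?si)\b(solar)\b(.*$)", r"[[\1]]\2", page)

# Without the trailing group, every whole-word occurrence is linked.
every = re.sub(r"(?i)\b(solar)\b", r"[[\1]]", page)

print(first_only)
print(every)
```

In both cases "solarium" is left alone, because \b requires a word boundary straight after "solar".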

Useful tools

There is an edit option in the replace.py dialog - using this depends on installing an extra component.

Questions/problems

Can we make it ignore text within a template / transclusion?

Issues

Why not on Wikipedia?

Why hasn't it been done on Wikipedia? See this discussion on a user talk page:

You aren't the first one to suggest the idea, see Repeated article-link bot and Wikilink bot. But a fully-automated bot would stir up a lot of controversy and likely never get approved. However, the proposal stand a lot more of a chance of being implemented in the AWB since all edits are "checked" by humans. —Dispenser 07:26, 19 July 2007 (UTC)

Also, there seems less need on Wikipedia - more editors relative to the size of the content. --Chriswaterguy · talk 03:47, 4 November 2007 (PST)

Alternatives (suggestions welcome)

How much of this can AutoWikiBrowser do? It mightn't have sophisticated features such as ignoring occurrences within a wikilink. We need to try it.

What works (old section to be revamped)

Chriswaterguy, with the help of many helpful people on the Pywikipedia-L list, has worked out the following:

Basic wikilinking

This command will replace only the first occurrence of a term ("sustainability" in this case) with a wikilinked term ([[sustainability]] in this case), and skip it if it's already linked or in a header:

python replace.py -regex "(?s)sustainability(.*$)" "[[sustainability]]\\1" -xml:currentdump.xml -exceptinsidetag:link -exceptinsidetag:hyperlink -exceptinsidetag:header

There are remaining issues:

It links the first unlinked occurrence of the text, not the first occurrence. What we want is for it to skip the article if the term already occurs there as a link.
Sometimes the exceptions fail and it offers to change within a header (-exceptinsidetag:header fails) or external link (-exceptinsidetag:hyperlink fails).
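One possible way around both issues, outside replace.py, is to check for an existing link first and then ignore any match that falls inside a header, wikilink or external link. The following is a rough sketch under those assumptions, not the bot's own exception handling: link_term and PROTECTED are illustrative names, and the patterns only cover simple single-line wikitext (no templates, galleries or nowiki).

```python
import re

# Spans where a term should never be linked.
PROTECTED = re.compile(
    r"\[\[.*?\]\]"               # wikilinks, piped or not
    r"|\[https?://[^\]\n]*\]"    # external links, including their label text
    r"|^=+.*?=+[ \t]*$",         # section headers
    re.MULTILINE,
)


def link_term(text, term):
    """Link the first unprotected occurrence of term, unless the page already links it."""
    escaped = re.escape(term)
    if re.search(r"\[\[" + escaped + r"[\|\]]", text, re.IGNORECASE):
        return text  # already linked somewhere: skip the whole page
    protected = [m.span() for m in PROTECTED.finditer(text)]
    for m in re.finditer(r"\b" + escaped + r"\b", text, re.IGNORECASE):
        inside = any(start <= m.start() and m.end() <= end for start, end in protected)
        if not inside:
            return text[:m.start()] + "[[" + m.group(0) + "]]" + text[m.end():]
    return text
```

For example, link_term(wikitext, "solar") leaves "Solar" alone in a header or in the label of [http://example.com Solar blah blah] and links the first plain-text occurrence instead.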

Vinay's thoughts

As far as your interwiki link effort, if you can make a python program that absorbs a list of keywords and the links they should map to, in some nice format like:

<wikipage1> <keyword1> <keyword2> <keyword3>
<wikipage2> <keyword4> <keyword5> <keyword6>

(as it absorbs this control file, it should probably check to confirm that both the wikipages and the keywords are unique... and as I think of it, the keywords should probably also support phrases. Maybe you always require quotes?)

Once the mapping file is absorbed, then make the file perform the mapping function on some basic .txt file (containing wikicode, I guess) on your machine (placing the modified result in another file to make it easy to re-run). Once you can make python do that, then having it modify one or a series of actual wiki pages is not too complex.

...If any kind of incremental approach, then you'll need to have some scheme for tracking where you left off, or which files have been checked in the last N days, or something...
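The control-file idea above could be prototyped roughly as follows. This is a sketch under several assumptions that are not in the original suggestion: one wiki page per line followed by its keywords, quotes around anything containing spaces (parsed with the standard shlex module), case-insensitive keywords, and at most one link per target page. The names parse_mapping and apply_mapping are made up for illustration.

```python
import re
import shlex


def parse_mapping(control_text):
    """Parse lines of the form: page keyword [keyword ...], checking uniqueness."""
    keyword_to_page = {}
    pages = set()
    for line in control_text.splitlines():
        tokens = shlex.split(line)  # honours quotes, so phrases stay together
        if not tokens:
            continue
        page, keywords = tokens[0], tokens[1:]
        if page in pages:
            raise ValueError("duplicate page: " + page)
        pages.add(page)
        for keyword in keywords:
            key = keyword.lower()
            if key in keyword_to_page:
                raise ValueError("duplicate keyword: " + keyword)
            keyword_to_page[key] = page
    return keyword_to_page


def apply_mapping(text, keyword_to_page):
    """Link the first keyword found for each page, in a single pass over the text."""
    if not keyword_to_page:
        return text
    # Longest keywords first, so a phrase wins over a shorter keyword inside it.
    keys = sorted(keyword_to_page, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b", re.IGNORECASE)
    linked_pages = set()

    def replace(match):
        found = match.group(1)
        page = keyword_to_page.get(found.lower())
        if page is None or page in linked_pages:
            return found  # leave later occurrences untouched
        linked_pages.add(page)
        return "[[" + page + "]]" if found == page else "[[" + page + "|" + found + "]]"

    return pattern.sub(replace, text)
```

Running apply_mapping on the contents of a local .txt file and writing the result to a second file gives the easy-to-re-run setup described above. The sketch does not skip existing links or headers, so it would still need exception handling before being pointed at real pages.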