Appropedia:Wikilink bot
Goal
To improve linking between articles by using an automated or semi-automated bot or program.
It is believed this will:
- greatly improve the usability (navigability) and stickiness of the site;
- improve the usefulness of "What links here", which helps admins and others categorize and manage the site;
- make article tidying more efficient, with less time spent adding wikilinks;
- improve search rankings (especially for key/topic articles).
Enthusiasts
- Chriswaterguy · talk is pushing it.
- Mel and Curt know some Python and are being helpful.
- Vinay has offered to help when he has time.
Proposed platform
- meta:Pywikipedia - it's what we know. (Curt said: "Note, you can use this set of bot scripts without knowing any python. I used it quite a bit before diving into the coding side of things.")
- mailing list and project admins
- IRC chat channel: irc://freenode/pywikipediabot (can find helpful people there).
Other possibilities (suggestions welcome):
- How does AutoWikiBrowser (semi-automated) work, in terms of search and replace?
Existing bots
See "Re: Bot for automatically linking between pages" (msg#00027). It may not do everything the way we want, but it looks like a very good start. Contact Lorenzo 'paulatz' Paulatto for more information.
Solve disambiguation.py
This could be adapted for the wikilink task, perhaps? It would display the potential link and the surrounding words, and the user would confirm or go to the next article. It would be semi-manual, one link at a time, slower but avoids some of the potential problems.
Extract from Wikipedia:User:Commander_Keane/Disambiguation:
- The pywikipedia bot
- I use the pywikipedia bot (specifically Solve disambiguation.py) for much of the link repair that I do. All of the discussion here is about using the bot for link repair.
- How it works
- Solve disambiguation.py visits a disambiguation page that you specify, and numbers each of the wikilinks.
- It then visits pages from the "What links here" of the dab, and presents them to you one at a time
- It shows you 60 characters around the link that needs to be repaired, and lets you type in an option. You can unlink it, skip it, or type in a number corresponding to a wikilink from the dab
- The bot will find the next link to be repaired in the article, or save and move onto the next article.
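For the wikilink task, the "show the surrounding words and let the user decide" step might look roughly like the sketch below. It is only an illustration in plain Python, not code from Solve disambiguation.py; the function name `snippets`, the `>>>`/`<<<` markers and the way the 60-character window is split evenly around the match are all assumptions.

```python
import re

def snippets(text, phrase, width=60):
    """Return a context snippet for each whole-word occurrence of phrase.

    Hypothetical helper: roughly width characters of context are shown,
    half before and half after the candidate.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    results = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width // 2)
        end = min(len(text), match.end() + width // 2)
        # Mark the candidate so a reviewer can spot it inside the snippet.
        results.append(
            text[start:match.start()]
            + ">>>" + match.group(0) + "<<<"
            + text[match.end():end]
        )
    return results
```

A semi-manual tool could print each snippet and wait for "link", "skip" or "next article" before touching the page.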
Replace.py
Special conditions:
- suggested to use Regex - how does that help...?
The following are edited (cut up and in one place rearranged marginally) bits of advice and analysis from a few people:
Advice from a bot god
The following is an extract from advice obtained on irc://freenode/pywikipediabot (I'm not sure if the bot god in question wishes to be named, so I'll just use <bot_god> ):
Q: I'm looking for help on developing a wiki bot (or finding an existing bot, Python or other, that can be adapted)... I want to turn keywords into wikilinks
- <bot_god> Sounds like replace.py to me
- <bot_god> http://meta.wikimedia.org/wiki/Replace.py
- <Chriswaterguy> can that ensure that e.g. [[foo]] is not replaced by [[[[foo]]]]?
- <bot_god> sure
- <bot_god> "foo " or " foo" or " foo " (I was going to add r "foo." but realized it's covered by " foo"... I think.)
- <bot_god> its only limited by your creativity. ;)
- <bot_god> Sound good?
- <Chriswaterguy> sounds very promising! and can ensure it only replaces the first occurrence in an article?
- (pause)
- <Chriswaterguy> (ideally the first occurrence of a number of related keywords)
- Chriswaterguy is reading about replace.py
- <bot_god> that is doable, yes
- <bot_god> -regex
- <Chriswaterguy> cool!
- <Chriswaterguy> I have more questions (by which I meant Curt's analysis, below: look for the text "<wikipage1> <keyword1> <keyword2> <keyword3>") but I should read up first.
- <bot_god> we'll be here
- <Chriswaterguy> wonderful, thanks!
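To make the regex idea from the chat concrete, here is a minimal sketch using Python's standard `re` module rather than replace.py itself (whose exact options are not shown here). The function name `link_first` and the choice of lookarounds are assumptions: the pattern refuses a keyword that directly touches `[`, `]`, `|` or another word character, and `count=1` limits the change to the first occurrence.

```python
import re

def link_first(text, keyword, target=None):
    """Wikilink the first free-standing occurrence of keyword.

    Hypothetical helper. target is the page to link to; it defaults to
    the keyword itself.
    """
    target = target or keyword
    pattern = re.compile(
        r"(?<![\[|\w])" + re.escape(keyword) + r"(?![\]|\w])"
    )

    def to_link(match):
        word = match.group(0)
        if word == target:
            return "[[%s]]" % word
        return "[[%s|%s]]" % (target, word)

    # count=1: only the first occurrence in the text is changed.
    return pattern.sub(to_link, text, count=1)
```

This covers the [[foo]] versus [[[[foo]]]] worry, but it is not a parser: a keyword sitting in the middle of a longer existing link (with spaces on both sides) would still be matched, so it is a starting point rather than a safe final rule.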
Vinay's thoughts
(I'm presuming Vinay and Curt don't mind being quoted here)
- With a little monkeying, I think I could get you an interface which was basically a text file of words to detect and link, and the pages they should link to.
- We should probably do some thinking about this in detail: roughly how many words are we hoping to alter automatically in this way, can we figure out how to do this safely (i.e. duplicate database,
- I was thinking that we'd trial with smaller sets of pages and smaller sets of words (perhaps just one target page, with 2 or 3 linkable words/phrases).--Chriswaterguy · talk 05:08, 7 November 2007 (PST)
- One thing is to run with manual check first - and this option would be in the interface. What with the time for thinking plus time for making the change and getting the next page, this is less than optimal for a final solution, but it's a huge improvement over what we (don't) have now. And after getting to the point of doing a hundred straight approvals of proposed changes, we'll figure it must be pretty close to ready to roll. --Chriswaterguy · talk 05:08, 7 November 2007 (PST)
- run script on copy, manually check for disasters, flip databases) etc. - Vinay (copied from email)
- That sounds like a great idea - better than manual checking, even.
Curt's response:
- Is "replace.py" smart enough not to replace within things like " xxx ", etc? But for sure, "replace.py" is the root of the solution, and where I was planning to start.
- Well, it's a regex tool, not a formal parser of Wikitext, *as far as I can see.* --Curt (copied from email)
- So it's going to make mistakes. A parser would understand the page the way that MediaWiki does, but it would also encapsulate a lot of the complexity *of* MediaWiki, whereas replace.py is a bit more manageable. --Vinay (copied from email)
Wiki parser?
Following from Vinay's mention of a wiki parser - could a parser be used as part of a bot, to better recognize what should and shouldn't be replaced?
Curt's analysis
(Email conversation, response to Chriswaterguy · talk)
Chris asked: Have you found an auto wikilinking script? I.e. to take a list of keywords, and wikilink them the first time they appear in any article.
Curt replied: I have not seen that yet. I think when we talked about it before, we may have said that the wiki-linker would offer suggestions...can't recall for sure. I don't know that a knowledge of python would be "required"...unless of course you want to code it in python. As I think about it, this bot seems fairly straightforward to actually implement. Previously, it seemed straightforward in the abstract :-)
As to details, I'm thinking we would want a page (probably in the MediaWiki namespace, where admins live and work)
- Maybe userspace - most tools that people develop on Wikipedia seem to be there... it makes it a bit less "official" and intimidating for people who want to try something different, too. --Chriswaterguy · talk
that provides the words and phrases associated with a page.
- Adobe "adobe construction" adobe
Something like that.
Question: if there is already a link to "Adobe" using one phrase, do we continue searching for occurrences of other matching phrases? That is, is it the first occurrence of each phrase? Or does the presence of a link, potentially as part of another phrase, satisfy?
- Chris: I'd say:
- any link would satisfy it. Or alternatively...
- any link which uses any of the matching phrases - this would ensure that there's a clear link to the page, even if there's already a link using a phrase that might not be clear. Or...
- If there's already a link, the page (with the term used in the link, and surrounding words) is added to a list, to be checked manually. If an extra link is thought to be needed, that can be done (manually at first, but if the process is doing what we want, we might want to semi-automate this step as well, so each change can be approved with a click.) --Chriswaterguy · talk
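The first two options above can be compared with a small sketch. It assumes plain `[[Page]]` and `[[Page|label]]` link syntax and MediaWiki's default title handling (first letter case-insensitive, underscores equivalent to spaces); the names `link_labels` and `needs_link` and the policy strings are made up for illustration.

```python
import re

# Matches [[Page]], [[Page#Section]] and [[Page|label]] (no nesting).
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]*))?\]\]")

def normalize(title):
    """Normalize a page title the way MediaWiki does by default."""
    title = title.strip().replace("_", " ")
    return title[:1].upper() + title[1:]

def link_labels(text, target):
    """Return the visible text of every existing link to target."""
    labels = []
    for match in LINK_RE.finditer(text):
        page, label = match.group(1), match.group(2)
        if normalize(page) == normalize(target):
            labels.append(label if label else page.strip())
    return labels

def needs_link(text, target, phrases, policy="any"):
    """policy 'any': any existing link to target is enough.
    policy 'phrase': the link text must be one of the listed phrases."""
    labels = [label.lower() for label in link_labels(text, target)]
    if policy == "any":
        return not labels
    wanted = {phrase.lower() for phrase in phrases}
    return not any(label in wanted for label in labels)
```

For the third option, the labels returned by `link_labels` are what would go onto the list for manual checking.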
We can work on it... the more I think of it, the more I think suggestions would be best. The easy approach would not be very clever about comments, section headings, captions, whatever. -- (end of Curt's comments)
(Another email conversation, response to Chriswaterguy · talk)
I did some thinking about that, and decided it was fairly complicated, though definitely do-able. I had some scheme worked out in my mind, but drowned in other stuff. I think you're very ambitious to attempt it in a week, but potentially if you're a fast learner (just making python do what you want) and have a solid amount of time, you can get there.
- Hmm. Change of goal perhaps: I want to have a clear strategy and hopefully know basically how we're going to do this, in about a week. Was a bit too hopeful, perhaps. I'm starting a web search now to see if someone else has done it; otherwise will look for forums, and start a page on Meta. I'm hoping I can do it the way I've done VB programming - with just a bit of hacking and guesswork and help from people who know more than me.
- Your specific suggestions are great - well thought through. Even if someone else has got something, these are good things to check. --Chriswaterguy · talk
As for batch files... no, don't use them, unless you mean a python file is a batch file. I haven't used pywikipedia in a long time, but I don't really recall using the sysop param. Maybe I just specified my name? Whatever. Was able to login as me for some activity when I needed sysop privileges. Once logged in, I stayed logged in, so it wasn't repetitive within a single session.
As far as your interwiki link effort, if you can make a python program that absorbs a list of keywords and the links they should map to, in some nice format like:
<wikipage1> <keyword1> <keyword2> <keyword3>
<wikipage2> <keyword4> <keyword5> <keyword6>
(as it absorbs this control file, it should probably check to confirm that both the wikipages and the keywords are unique... and as I think of it, the keywords should probably also support phrases. Maybe you always require quotes?)
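A minimal sketch of absorbing such a control file, assuming one page per line with the page name first and multi-word phrases (or page names) in double quotes. The function name `parse_link_map` and the choice to compare keywords case-insensitively are assumptions; `shlex.split` from the standard library does the quote handling.

```python
import shlex

def parse_link_map(lines):
    """Return {page: [phrases]} from control-file lines.

    Raises ValueError if a page is listed twice or a phrase is claimed
    by more than one page (compared case-insensitively).
    """
    link_map = {}
    owner = {}  # lowercased phrase -> page that claimed it
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        # shlex keeps quoted phrases such as "adobe construction" together.
        page, *phrases = shlex.split(line)
        if page in link_map:
            raise ValueError("line %d: duplicate page %r" % (number, page))
        for phrase in phrases:
            key = phrase.lower()
            if key in owner:
                raise ValueError(
                    "line %d: %r already maps to %r"
                    % (number, phrase, owner[key])
                )
            owner[key] = page
        link_map[page] = phrases
    return link_map
```

One limitation to be aware of: `shlex` treats apostrophes as quotes too, so a phrase with an unquoted apostrophe would need to be wrapped in double quotes.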
Once the mapping file is absorbed, then make the file perform the mapping function on some basic .txt file (containing wikicode, I guess) on your machine (placing the modified result in another file to make it easy to re-run). Once you can make python do that, then having it modify one or a series of actual wiki pages is not too complex.
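The local dry run described here could be sketched as follows. It assumes the mapping is already in memory as a dict of page name to phrase list; `link_text` and `link_file` are hypothetical names, and the matching is deliberately naive (it only avoids phrases that directly touch brackets, pipes or other word characters).

```python
import re
from pathlib import Path

def link_text(text, link_map):
    """Link the first occurrence of any phrase for each page."""
    for page, phrases in link_map.items():
        # Longest phrase first so "adobe construction" wins over "adobe".
        ordered = sorted(phrases, key=len, reverse=True)
        alternation = "|".join(re.escape(p) for p in ordered)
        pattern = re.compile(
            r"(?<![\[|\w])(?:%s)(?![\]|\w])" % alternation, re.IGNORECASE
        )
        text = pattern.sub(
            lambda m, page=page: "[[%s|%s]]" % (page, m.group(0)),
            text,
            count=1,
        )
    return text

def link_file(source, destination, link_map):
    """Read wikicode from source and write the linked result to destination."""
    wikitext = Path(source).read_text(encoding="utf-8")
    Path(destination).write_text(link_text(wikitext, link_map), encoding="utf-8")
```

Because the input file is never modified, the run can be repeated as often as needed while the mapping is being tuned.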
But I guess you should have a strategy for the actual modification as well. Will there be no human intervention? Will humans need to approve changes before they take effect? Or go ahead and make the changes, but then prompt a human to check the changes after the fact, and do fixes? Should the bot try to tackle the whole wiki, or do 100 pages per night? (Just to ease the burden on human checking/reversion?) If any kind of incremental approach, then you'll need to have some scheme for tracking where you left off, or which files have been checked in the last N days, or something. Lots of options there, I'm sure.
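One possible scheme for the incremental approach is a small state file recording when each page was last checked. The sketch below assumes a JSON file mapping page titles to ISO dates; the function names, the 30-day recheck window and the default batch size of 100 are placeholders, not decisions anyone has made.

```python
import json
from datetime import date, timedelta
from pathlib import Path

def load_state(path):
    """Load {page: 'YYYY-MM-DD'}; an absent file means nothing checked yet."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))

def save_state(path, state):
    Path(path).write_text(
        json.dumps(state, indent=2, sort_keys=True), encoding="utf-8"
    )

def next_batch(all_pages, state, today, batch_size=100, recheck_days=30):
    """Pick pages never checked, or not checked in the last recheck_days."""
    cutoff = today - timedelta(days=recheck_days)
    due = [
        page for page in all_pages
        if page not in state or date.fromisoformat(state[page]) <= cutoff
    ]
    # Never-checked pages (empty key) sort first, then the oldest checks.
    due.sort(key=lambda page: state.get(page, ""))
    return due[:batch_size]

def mark_checked(state, pages, today):
    for page in pages:
        state[page] = today.isoformat()
```

A nightly run would then be: load the state, take `next_batch`, process those pages, `mark_checked`, and save the state again.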
Gotchas:
- Don't change words within URLs (depending on how you search for keywords, this may not be an issue).
- Don't change within comment fields
- Don't change within existing links (though potentially you could check and see if existing links happen to conflict with the link map)
- Don't change within template / transclusions
- Seems like there are a bunch of other gotchas
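One way to respect these gotchas with regexes alone is to find the "protected" spans first and only search the text between them. The sketch below is an assumption-laden illustration: the `PROTECTED` pattern covers comments, nowiki blocks, templates, existing wikilinks, bracketed external links, bare URLs and headings, but it does not handle nesting (a template inside a template, or a link inside an image caption), which is where a real wikitext parser would do better.

```python
import re

PROTECTED = re.compile(
    r"<!--.*?-->"                 # HTML comments
    r"|<nowiki>.*?</nowiki>"      # nowiki blocks
    r"|\{\{.*?\}\}"               # templates / transclusions (not nested)
    r"|\[\[.*?\]\]"               # existing wikilinks
    r"|\[https?://[^\]]*\]"       # bracketed external links
    r"|https?://\S+"              # bare URLs
    r"|^=+[^\n]*?=+[ \t]*$",      # section headings
    re.DOTALL | re.MULTILINE,
)

def link_outside_protected(text, keyword, target):
    """Link the first occurrence of keyword that lies outside protected spans."""
    word = re.compile(r"\b" + re.escape(keyword) + r"\b")

    def replacement(match):
        return "[[%s|%s]]" % (target, match.group(0))

    pieces, position, done = [], 0, False
    for match in PROTECTED.finditer(text):
        free = text[position:match.start()]
        if not done:
            free, count = word.subn(replacement, free, count=1)
            done = count > 0
        pieces.append(free)
        pieces.append(match.group(0))  # protected span copied verbatim
        position = match.end()
    tail = text[position:]
    if not done:
        tail = word.sub(replacement, tail, count=1)
    pieces.append(tail)
    return "".join(pieces)
```

Any gotcha added to the list later (captions, tables, and so on) would become another alternative in `PROTECTED`, as long as it can be recognized without nesting.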
Other types of searches
At Wikipedia it has been proposed to:
- Add wikilinks to all mainspace articles in the same category as the article on the first time the article's name was mentioned. It would only do this with existing text.[1]
This might not work as well here, as most topic pages are actually category pages; most mainspace pages are not actually topics, and unlikely to receive many references that aren't consciously put in as links. (i.e. we won't search for unlinked occurrences of AEF greywater - they just aren't at all likely to occur).
The other issue, pointed out in the above Wikipedia discussion, is that this is "perhaps computationally intense. (some categories have) thousands of entries." Not yet a big problem for Appropedia.
Might need a rethink when Semantic MediaWiki makes regular categories obsolete. --Chriswaterguy · talk 03:47, 4 November 2007 (PST)
Why not on Wikipedia?
Why hasn't it been done on Wikipedia? See this discussion on a user talk page:
- You aren't the first one to suggest the idea, see Repeated article-link bot and Wikilink bot. But a fully-automated bot would stir up a lot of controversy and likely never get approved. However, the proposal stands a lot more of a chance of being implemented in AWB since all edits are "checked" by humans. --Dispenser 07:26, 19 July 2007 (UTC)
Also, there seems less need on Wikipedia - more editors relative to the size of the content. --Chriswaterguy · talk 03:47, 4 November 2007 (PST)
