References in MOOC articles in Wikipedia

At the moment, I’m doing a lot of writing: catching up on writing up papers from my PhD and other loose ends, and also drafting potential funding bids. It’s going well, but I find that I can’t sit and write non-stop all day. I need little side projects to dip into when I want a break.

Something I’ve had in mind as an interesting side project for a while now is looking at the similarities and differences between pages about MOOCs in different language versions of Wikipedia. Initially, I was going to scrape the references from a sample of pages and see which articles or resources are cited in different versions of Wikipedia. However, I quickly discovered that there isn’t much consistency in referencing: the same article could have a full citation in one version but just a hyperlink in another, for example. I would also be relying on auto-translation, which adds another layer of uncertainty.

But then I decided it would be better to look at the URLs of the references, which would solve both issues. Happily, there is a way to retrieve all of the external links from a Wikipedia page automatically, using the Wikimedia API. For example (substitute whichever page name you’re interested in for ‘Coffee’):
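A minimal sketch of such a request in Python, assuming the MediaWiki Action API’s `action=query` endpoint with `prop=extlinks`. The function names and the User-Agent string are made up for illustration, and this is not necessarily the exact query used for this post.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint pattern for each language edition of Wikipedia.
API_TEMPLATE = "https://{lang}.wikipedia.org/w/api.php"


def build_params(title, cont=None):
    """Query parameters asking for the external links on one page."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": "max",
        "format": "json",
    }
    # Continuation values returned by the API are passed back unchanged.
    params.update(cont or {})
    return params


def extract_links(data):
    """Pull the URLs out of one decoded API response."""
    pages = data.get("query", {}).get("pages", {})
    if isinstance(pages, dict):  # the default JSON format keys pages by id
        pages = pages.values()
    links = []
    for page in pages:
        for link in page.get("extlinks", []):
            # The key is "*" or "url" depending on the response format.
            url = link.get("url") or link.get("*")
            if url:
                links.append(url)
    return links


def fetch_external_links(title, lang="en"):
    """Collect every external link on a page, following continuation."""
    links, cont = [], None
    while True:
        query = urllib.parse.urlencode(build_params(title, cont))
        request = urllib.request.Request(
            API_TEMPLATE.format(lang=lang) + "?" + query,
            headers={"User-Agent": "extlinks-sketch/0.1 (illustrative)"},
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        links.extend(extract_links(data))
        cont = data.get("continue")
        if not cont:
            return links


if __name__ == "__main__":
    for url in fetch_external_links("Coffee", lang="en"):
        print(url)
```

Changing `lang` to `de`, `fr`, `nl` or `sv` (with the local page title) should give the equivalent list for the other editions.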

With this, it was simply a case of copying and pasting the links into a spreadsheet, and importing that into Gephi to visualise the network of co-cited links. I included the Dutch, English, French, German, and Swedish Wikipedia MOOC pages. These are the five largest editions by number of articles that have an entry for MOOCs (the Cebuano edition is the second largest overall, but doesn’t have one):


I haven’t included labels in the static picture above as there were too many and they were too long, but the network can be explored in more detail in the interactive version here (opens in a new tab).
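The copy-and-paste step could also be scripted. The sketch below writes an edge list with `Source` and `Target` columns, which is a layout Gephi’s spreadsheet import accepts. It assumes the links have already been gathered into a dictionary keyed by edition; the function name and the example URLs are placeholders rather than real data.

```python
import csv
import io


def write_edge_list(links_by_edition, out):
    """Write one 'edition -> URL' row per distinct link in each edition."""
    writer = csv.writer(out)
    writer.writerow(["Source", "Target"])
    for edition, urls in sorted(links_by_edition.items()):
        # set() drops URLs that appear more than once on the same page.
        for url in sorted(set(urls)):
            writer.writerow([edition, url])


if __name__ == "__main__":
    # Placeholder data, not the links from the actual MOOC pages.
    example = {
        "en": ["http://a.example/1", "http://b.example/2"],
        "de": ["http://a.example/1"],
    }
    buffer = io.StringIO()
    write_edge_list(example, buffer)
    print(buffer.getvalue())
```

A URL cited by more than one edition then shows up as a single node with several incoming edges, which is what makes the overlap visible in the network.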

I also edited the URLs to the domain level, which reveals a greater level of overlap:


Again, labels aren’t included here as I haven’t managed to get a good layout with readable labels yet. But the nodes here are scaled according to out-degree, so the domains which are cited more often are larger. The cluster in the middle represents the most frequently used sites (check out the interactive version for more detail and labels).
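Reducing the URLs to the domain level can be sketched as follows. This is a simplified approach that takes the hostname and strips a leading `www.`; it does not merge other subdomains, and the degree calculation in this post was done in Gephi, not with this code. The function names and example URLs are illustrative only.

```python
from collections import Counter
from urllib.parse import urlsplit


def to_domain(url):
    """Return the lowercased hostname of a URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_edition_counts(links_by_edition):
    """Count how many editions cite each domain at least once."""
    counts = Counter()
    for urls in links_by_edition.values():
        # A set means each edition contributes at most one count per domain.
        for domain in {to_domain(url) for url in urls}:
            if domain:
                counts[domain] += 1
    return counts


if __name__ == "__main__":
    # Placeholder data, not the links from the actual MOOC pages.
    example = {
        "en": ["http://a.example/1", "http://a.example/2", "http://b.example/"],
        "de": ["https://www.a.example/3"],
    }
    for domain, n in domain_edition_counts(example).most_common():
        print(domain, n)
```

Two different articles from the same site collapse into one node at this level, which is why the overlap between editions looks greater than it does for full URLs.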

My initial thoughts on this are that I was surprised that there isn’t more overlap; that the online Higher Education media is an important source of information (e.g. The Chronicle has a higher degree than any of the MOOC platforms); and that cMOOC thinking also comes through strongly. I’d be very interested to hear your thoughts!


