Description
In this project, you will collect information about articles belonging to
different categories (such as airports, artists, politicians, sportspeople,
etc.). Based on this information, you will then try to automatically
cluster and classify these articles into the correct categories.
We will use two sources of information, namely:
- the Wikipedia online encyclopedia (see https://www.wikipedia.org);
- the Wikidata knowledge base (see https://www.wikidata.org/wiki/Wikidata:Main_Page).
Deadline
The deadline for submission is May 15th, 2022. This is a strict deadline. Late submissions will be penalised (-0.20 points per day past the deadline).
Defense
You will defend your project on May 23rd, 2022. The defense is composed of a 10-minute presentation (using slides) followed by a 10-minute discussion with the jury.
Exercise 1 – Corpus extraction (10 points)
The goal of this exercise is to compile a parallel corpus from the Wikipedia online encyclopedia. This corpus will be made of:
(i) plain text sentences, selected so as to have a roughly balanced corpus in terms of training data (each target category should be associated with the same number of sentences);
(ii) key-value pairs corresponding to each article's infobox (if any);
(iii) triples coming from each article's corresponding Wikidata page.
We will focus on the following categories:
- Airports
- Foods
- Artists
- Transport
- Astronauts
- Monuments_and_memorials
- Building
- Politicians
- Astronomical_objects
- Sports_teams
- City
- Sportspeople
- Comics_characters
- Universities_and_colleges
- Companies
- Written_communication
Step 1 is related to lecture 6 of the course.
The following query, for instance, retrieves 100 articles which belong to the Comedy_films category:

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?film WHERE {
  ?film dcterms:subject/skos:broader* dbc:Comedy_films .
}
LIMIT 100
Concretely, you are required to implement a Python program which takes as input:
- a number k of articles per category,
- a number n of sentences per article (articles whose content is too short should be ignored).
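The command-line interface for these two inputs might be sketched as follows (the option names `-k` and `-n` mirror the specification; defaults and the function name are illustrative, not prescribed):

```python
import argparse

def parse_args(argv=None):
    # k: number of articles per category; n: sentences per article.
    parser = argparse.ArgumentParser(
        description="Extract a corpus from Wikipedia and Wikidata")
    parser.add_argument("-k", type=int, default=20,
                        help="number of articles per category")
    parser.add_argument("-n", type=int, default=10,
                        help="number of sentences per article")
    return parser.parse_args(argv)

# Example: parse_args(["-k", "50", "-n", "5"])
```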
Step 2(a) is related to lecture 3 of the course.
The following code snippet, for instance, retrieves the Wikipedia page content of a given article:

>>> import wptools
>>> page = wptools.page('Stephen Fry', silent=True)
>>> page.get_query()
>>> page.data['extract']
'<p><b>Stephen Fry</b>…'
This extraction task can be realised through the following steps:
1. For each category c from the list above, retrieve k Wikipedia articles belonging to c.
Note that you can use DBpedia to find Wikipedia articles belonging to a given category.
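Assuming you query the public DBpedia SPARQL endpoint directly over HTTP (one possible route; function names are illustrative), step 1 might be sketched as:

```python
import json
import urllib.parse
import urllib.request

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

def category_query(category, k):
    # Build a SPARQL query retrieving k articles under the given category,
    # following the category hierarchy with dcterms:subject/skos:broader*
    # as in the example query shown earlier.
    return f"""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX dbc: <http://dbpedia.org/resource/Category:>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT DISTINCT ?article WHERE {{
      ?article dcterms:subject/skos:broader* dbc:{category} .
    }}
    LIMIT {k}
    """

def articles_for_category(category, k):
    # Send the query to the public endpoint and return article URIs.
    params = urllib.parse.urlencode({
        "query": category_query(category, k),
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(f"{DBPEDIA_ENDPOINT}?{params}",
                                timeout=60) as resp:
        data = json.load(resp)
    return [b["article"]["value"] for b in data["results"]["bindings"]]
```

In practice you may prefer a dedicated client such as SPARQLWrapper; the plain-HTTP version is shown to keep the sketch dependency-free.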
Step 2(b) is related to lecture 3 of the course.
The following code snippet, for instance, retrieves the Wikipedia infobox of a given article:

>>> import wptools
>>> page = wptools.page('Stephen Fry', silent=True)
>>> page.get_parse()
>>> page.data['infobox']
{'name': 'Stephen Fry', 'image': 'Stephen …'}
2. For each selected Wikipedia article, retrieve:
(a) the corresponding Wikipedia page content,
(b) the corresponding infobox,
(c) the corresponding Wikidata statements (triples).
Step 2(c) is related to lecture 3 of the course.
The following code snippet, for instance, retrieves the Wikidata statements of a given article:

>>> import wptools
>>> page = wptools.page('Stephen Fry', silent=True)
>>> page.get_wikidata()
>>> page.data['wikidata']
{…
'place of birth (P19)': 'Hampstead (Q25610)',
'religion (P140)': 'atheism (Q7066)',
'sex or gender (P21)': 'male (Q6581097)',
…}
At the end of this process, you will have in memory, for each article you selected in step 1:
- the text of the corresponding Wikipedia page;
- the infobox content;
- the Wikidata statements (triples).
Store these in a CSV or JSON file and save it on your hard drive.
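A minimal JSON-based save/load pair might look like this (the record keys are one possible schema, not a prescribed one):

```python
import json

def save_corpus(records, path):
    # records: list of dicts, e.g. with keys "category", "text",
    # "infobox" and "wikidata" for each selected article.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

def load_corpus(path):
    # Read the corpus back for the pre-processing / clustering steps.
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```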
NB: there is some overlap with lab session 1.
Exercise 2 – Pre-processing, Clustering and Classifying (20 points)
Pre-processing (8 points)
Before applying machine learning algorithms to your data, you first need to process the text. For each Wikipedia text collected, do the following:
- tokenize the text
- lowercase the tokens
- remove punctuation and function words
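The three steps above can be sketched as follows (the stop-word list here is a small illustrative sample; in practice you would use a fuller list, e.g. the one shipped with NLTK or spaCy):

```python
import re

# Small illustrative set of English function words (assumption, not exhaustive).
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and",
                  "or", "is", "was", "to", "for", "with"}

def preprocess(text):
    # Tokenize on alphanumeric runs (which also drops punctuation),
    # lowercase, then remove function words.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in FUNCTION_WORDS]
```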
Do the same for the Wikidata description and store the results in a pandas DataFrame containing the following columns.
- column 1: person
- column 2: Wikipedia page text
- column 3: Wikipedia page text after preprocessing
- column 4: Wikidata description
- column 5: Wikidata description after preprocessing
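Building that DataFrame might be sketched as follows (the column names and the minimal `clean` stand-in for the pre-processing step are illustrative assumptions):

```python
import re
import pandas as pd

def clean(text):
    # Minimal stand-in for the pre-processing step:
    # lowercased word tokens joined back into a string.
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def build_dataframe(rows):
    # rows: iterable of (person, wikipedia_text, wikidata_description).
    df = pd.DataFrame(rows, columns=["person", "wiki_text", "wikidata_desc"])
    # Insert the preprocessed Wikipedia text as column 3,
    # and the preprocessed Wikidata description as column 5.
    df.insert(2, "wiki_text_clean", df["wiki_text"].map(clean))
    df["wikidata_desc_clean"] = df["wikidata_desc"].map(clean)
    return df
```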
Here is an example query which retrieves Barack Obama's (Q76) description from Wikidata:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX schema: <http://schema.org/>
SELECT ?o
WHERE
{
  wd:Q76 schema:description ?o .
  FILTER ( lang(?o) = "en" )
}

Output:
44th president of the United States, from 2009 to 2017
Note. To improve clustering and classification results, feel free to add further pre-processing steps (e.g. named entity recognition, POS tagging and extraction of, e.g., nouns and verbs), and/or to also pre-process the Wikidata statements and the infobox content (note, however, that these are not standard text).
Clustering
The goal of this exercise is to use the collected data (text and Wikidata descriptions) to automatically cluster the Wikipedia documents, first using 16 clusters and second experimenting with different numbers of clusters.
Your code should include the following functions:
- a function to train a clustering algorithm on some data using N clusters;
- a function to compute both intrinsic (silhouette coefficient) and extrinsic (homogeneity, completeness, V-measure, adjusted Rand index) evaluation scores for clustering results;
- a function to visualise those metric values for values of N ranging from 2 to 16.
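The first two functions might be sketched with scikit-learn as follows (TF-IDF vectorisation and k-means are one possible choice, not a requirement; names are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics

def cluster_texts(texts, n_clusters):
    # Vectorise the documents with TF-IDF, then fit k-means.
    vectors = TfidfVectorizer().fit_transform(texts)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(vectors)
    return labels, vectors

def clustering_scores(vectors, labels, gold):
    # Intrinsic (silhouette) and extrinsic scores against gold categories.
    return {
        "silhouette": metrics.silhouette_score(vectors, labels),
        "homogeneity": metrics.homogeneity_score(gold, labels),
        "completeness": metrics.completeness_score(gold, labels),
        "v_measure": metrics.v_measure_score(gold, labels),
        "adjusted_rand": metrics.adjusted_rand_score(gold, labels),
    }
```

Looping `cluster_texts` over N = 2 … 16 and plotting each score (e.g. with matplotlib) gives the required visualisation.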
Classifying
Since you know which category each document in your dataset belongs to, you can also learn a classifier and check how well it can predict the category of each document in your dataset.
Your code should include:
- a function which outputs accuracy, a confusion matrix, precision, recall and F1 for the results of your classifier;
- a function which outputs a visualisation of the accuracy of your classifier per category.
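The evaluation function might be sketched with scikit-learn as follows (macro averaging is one reasonable choice for multi-class precision/recall/F1; the function name is illustrative):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate_classifier(gold, predicted):
    # Accuracy, confusion matrix and macro-averaged precision/recall/F1
    # for the classifier's predictions against the gold categories.
    precision, recall, f1, _ = precision_recall_fscore_support(
        gold, predicted, average="macro", zero_division=0)
    return {
        "accuracy": accuracy_score(gold, predicted),
        "confusion_matrix": confusion_matrix(gold, predicted),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Per-category accuracy can then be read off the diagonal of the confusion matrix (diagonal counts divided by row sums) and plotted as a bar chart.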
Note. For both clustering and classifying, feel free to experiment beyond the given requirements, e.g. providing additional information about the data through visualisation techniques of your choice, using additional features (e.g. learning a topic model and using topic information as an additional feature), analysing and visualising your results (e.g. plotting the loss for dev and train against the quantity of data used for training), comparing results when using all text data versus using only the Wikidata description or only the Wikipedia text, etc.
Expected documents, Presentations and grading
At the end of the session, please provide us (one per group) with a zipped directory (zip archive) containing:
- your commented Python source code;
- a README file explaining how to install and run your code;
- a REQUIREMENTS file listing the libraries used by your code and their version numbers;
- your extracted corpus;
- other optional useful information (e.g. a figure representing the model of your database).
Your zipped archive should be uploaded to Arche in the Group Project repository. Please name this zip archive using your logins at UL (example: dupont2_martin5_toussaint9.zip). Please also write your first and last names within your code!
Grading will take into account the following aspects:
- Replicability (how easily your code can be run / adapted to other input data). Make sure we can run your code and inspect results easily by giving us precise information about how to run the program, hardcoded file names (if any) and where to find input/output.
- Code quality / readability (highly related to code comments). Make sure we can understand your code easily. Use meaningful variable and procedure names. Include clear comments.






