Data Science Project 1

In this project, you will collect information about articles belonging to different categories (such as airports, artists, politicians, sportspeople, etc.). Based on this information, you will then try to automatically cluster and classify these articles into the correct categories.

We will use two sources of information, namely:

  • the Wikipedia online encyclopedia (see https://www.wikipedia.org);
  • the Wikidata knowledge base (see https://www.wikidata.org/wiki/Wikidata:Main_Page).

Deadline

The deadline for submission is May 15th, 2022. This is a strict deadline. Late submissions will be penalised (-0.20 points per day past the deadline).

Defense

You will defend your project on May 23rd, 2022. The defense is composed of a 10-minute presentation (using slides) followed by a 10-minute discussion with the jury.

Exercise 1 – Corpus extraction (10 points)

The goal of this exercise is to compile a parallel corpus from the Wikipedia online encyclopedia. This corpus will be made of:

(i) plain text sentences, selected so as to have a roughly balanced corpus in terms of training data (each target category should be associated with the same number of sentences),

(ii) key-value pairs corresponding to articles' infoboxes (if any),

(iii) triples coming from articles’ corresponding wikidata page.

We will focus on the following categories:

  • Airports
  • Foods
  • Artists
  • Transport
  • Astronauts
  • Monuments_and_memorials
  • Building
  • Politicians
  • Astronomical_objects
  • Sports_teams
  • City
  • Sportspeople
  • Comics_characters
  • Universities_and_colleges
  • Companies
  • Written_communication

Step 1 is related to lecture 6 of the course. The following clause, for instance, retrieves 100 articles which belong to the Comedy_films category (the skos prefix is declared here as well, since the query uses skos:broader):

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?film WHERE {
  ?film dcterms:subject/skos:broader* dbc:Comedy_films .
}
LIMIT 100

Concretely, you are required to implement a Python program which takes as input:

  • a number k of articles per category,
  • a number n of sentences per article (articles whose content is too short should be ignored).

Step 2(a) is related to lecture 3 of the course. The following code snippet, for instance, retrieves the Wikipedia page content of a given article:

>>> import wptools
>>> page = wptools.page('Stephen Fry', silent=True)
>>> page.get_query()
>>> page.data['extract']
'<p><b>Stephen Fry</b>…'

This extraction task can be realised through the following steps:

  1. For each category c from the list above, retrieve k Wikipedia articles belonging to c. Note that you can use DBpedia to find Wikipedia articles belonging to a given category.
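Step 1 could, for instance, be scripted against the public DBpedia SPARQL endpoint using only the standard library. The function names and the use of the endpoint's `format` parameter below are an illustrative sketch, not a required design:

```python
import json
import urllib.parse
import urllib.request

def build_category_query(category, k):
    """Build a SPARQL query retrieving up to k articles in a DBpedia category."""
    return (
        "PREFIX dcterms: <http://purl.org/dc/terms/> "
        "PREFIX dbc: <http://dbpedia.org/resource/Category:> "
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> "
        "SELECT ?article WHERE { "
        f"?article dcterms:subject/skos:broader* dbc:{category} . "
        f"}} LIMIT {k}"
    )

def articles_in_category(category, k, endpoint="https://dbpedia.org/sparql"):
    """Run the query against the public DBpedia endpoint; return article URIs."""
    url = endpoint + "?" + urllib.parse.urlencode({
        "query": build_category_query(category, k),
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [b["article"]["value"] for b in data["results"]["bindings"]]
```

In practice you may prefer a dedicated client such as SPARQLWrapper; the logic is the same.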

Step 2(b) is related to lecture 3 of the course. The following code snippet, for instance, retrieves the Wikipedia infobox of a given article:

>>> import wptools
>>> page = wptools.page('Stephen Fry', silent=True)
>>> page.get_parse()
>>> page.data['infobox']
{'name': 'Stephen Fry', 'image': 'Stephen …'}

  2. For each selected Wikipedia article, retrieve:

(a) the corresponding Wikipedia page content,
(b) the corresponding infobox,
(c) the corresponding Wikidata statements (triples).

Step 2(c) is related to lecture 3 of the course. The following code snippet, for instance, retrieves the Wikidata statements of a given article:

>>> import wptools
>>> page = wptools.page('Stephen Fry', silent=True)
>>> page.get_wikidata()
>>> page.data['wikidata']
{…
 'place of birth (P19)': 'Hampstead (Q25610)',
 'religion (P140)': 'atheism (Q7066)',
 'sex or gender (P21)': 'male (Q6581097)',
 …}

At the end of this process, you will have in memory, for each article you selected in step 1:

  • the text of the corresponding Wikipedia page;
  • the infobox content;
  • the wikidata statements (triples).

Store these into a CSV or a JSON file and save it on your hard drive.
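The storage step can be sketched as below, assuming each article's data has already been collected into a dictionary; the field names ('category', 'sentences', 'infobox', 'wikidata') are an illustrative choice, not prescribed by the assignment:

```python
import json

def save_corpus(records, path="corpus.json"):
    """Persist the extracted corpus as a single JSON file.

    `records` maps each article title to a dict with (illustrative) keys
    'category', 'sentences', 'infobox' and 'wikidata'.
    """
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

def load_corpus(path="corpus.json"):
    """Reload a corpus previously written by save_corpus."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A CSV layout works equally well; JSON is simply convenient here because the infobox and Wikidata fields are nested.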

NB: Note there is some overlap with lab session 1.

Exercise 2 – Pre-processing, Clustering and Classifying (20 points)

Pre-processing (8 points)

Before applying machine learning algorithms to your data, you first need to process the text. For each Wikipedia text collected, do the following:

  • tokenize the text
  • lowercase the tokens
  • remove punctuation and function words

Do the same for the Wikidata description and store the results in a pandas dataframe containing the following columns:

  • column 1: person
  • column 2: Wikipedia page text
  • column 3: Wikipedia page text after preprocessing
  • column 4: Wikidata description
  • column 5: Wikidata description after preprocessing
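The three pre-processing operations above can be sketched as follows; the tiny stop-word list is purely illustrative (in practice you would use a full list, e.g. from NLTK), and the regex-based tokenizer is one simple option among many:

```python
import re

# Illustrative stop-word list; replace with e.g. nltk.corpus.stopwords in practice.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "was"}

def preprocess(text):
    """Tokenize, lowercase, and drop punctuation and function words."""
    # \w+-style matching both tokenizes and strips punctuation in one pass.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in FUNCTION_WORDS]
```

Each preprocessed column of the dataframe would then simply hold `preprocess(...)` applied to the corresponding raw column.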

Here is an example query which retrieves Barack Obama's (Q76) description from Wikidata:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX schema: <http://schema.org/>

SELECT ?o
WHERE
{
  wd:Q76 schema:description ?o .
  FILTER ( lang(?o) = "en" )
}

Output:

44th president of the United States, from 2009 to 2017
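The query above can be issued programmatically against the public Wikidata query service; the sketch below uses only the standard library, and the function names and User-Agent string are illustrative assumptions:

```python
import json
import urllib.parse
import urllib.request

def build_description_query(qid, lang="en"):
    """Build a SPARQL query retrieving the schema:description of an entity."""
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/> "
        "PREFIX schema: <http://schema.org/> "
        f"SELECT ?o WHERE {{ wd:{qid} schema:description ?o . "
        f'FILTER ( lang(?o) = "{lang}" ) }}'
    )

def wikidata_description(qid, endpoint="https://query.wikidata.org/sparql"):
    """Fetch the English description of a Wikidata entity, or None."""
    url = endpoint + "?" + urllib.parse.urlencode(
        {"query": build_description_query(qid), "format": "json"})
    # The Wikidata query service expects a descriptive User-Agent header.
    req = urllib.request.Request(url, headers={"User-Agent": "ue803-project/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    bindings = data["results"]["bindings"]
    return bindings[0]["o"]["value"] if bindings else None
```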

Note. To improve clustering and classification results, feel free to add further pre-processing steps (e.g. named entity recognition, POS tagging and extraction of, e.g., nouns and verbs); and/or to also pre-process the Wikidata statements and the infobox content (note that these are not standard text, however).

Clustering

The goal of this exercise is to use the collected data (text and Wikidata descriptions) to automatically cluster the Wikipedia documents: first using 16 clusters, and second experimenting with different numbers of clusters.

Your code should include the following functions:

  • a function to train a clustering algorithm on some data using N clusters;
  • a function to compute both intrinsic (Silhouette coefficient) and extrinsic (homogeneity, completeness, v-measure, adjusted Rand index) evaluation scores for clustering results;
  • a function to visualise those metric values for values of N ranging from 2 to 16.
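The first two functions might look as follows; this is a minimal sketch assuming scikit-learn, TF-IDF features and k-means, which are common choices but not prescribed by the assignment:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (adjusted_rand_score, completeness_score,
                             homogeneity_score, silhouette_score,
                             v_measure_score)

def cluster_texts(texts, n_clusters):
    """Vectorise the texts with TF-IDF and cluster them with k-means."""
    features = TfidfVectorizer().fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(features)
    return features, labels

def clustering_scores(features, labels, gold):
    """Intrinsic (Silhouette) and extrinsic scores against gold categories."""
    return {
        "silhouette": silhouette_score(features, labels),
        "homogeneity": homogeneity_score(gold, labels),
        "completeness": completeness_score(gold, labels),
        "v_measure": v_measure_score(gold, labels),
        "adjusted_rand": adjusted_rand_score(gold, labels),
    }
```

The visualisation function can then simply loop `cluster_texts` over N = 2 … 16 and plot each score against N (e.g. with matplotlib).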

Classifying

Since you know which category each document in your dataset belongs to, you can also learn a classifier and check how well it can predict the category of each document in your dataset.

Your code should include:

  • a function which outputs accuracy, a confusion matrix, precision, recall and F1 for the results of your classifier;
  • a function which outputs a visualisation of the accuracy of your classifier per category.
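The evaluation function could be sketched as below, assuming scikit-learn and a TF-IDF + logistic-regression pipeline (one reasonable choice; any classifier would fit the same skeleton):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def evaluate_classifier(texts, labels, test_size=0.25, seed=0):
    """Train a TF-IDF + logistic-regression classifier; report the metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels)
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_te, pred, average="macro", zero_division=0)
    return {
        "accuracy": accuracy_score(y_te, pred),
        "confusion_matrix": confusion_matrix(y_te, pred),
        "precision": precision, "recall": recall, "f1": f1,
    }
```

The per-category accuracy visualisation can be read off the diagonal of the confusion matrix (divided by each row's sum) and plotted as a bar chart.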

Note. For both clustering and classifying, feel free to experiment outwith the given requirements, e.g. providing additional information about the data through visualisation techniques of your choice, using additional features (e.g., learning a topic model and using topic information as an additional feature), analysing and visualising your results (e.g. plotting the loss for dev and train against the quantity of data used for training), comparing results when using all text data vs. using only the wikidata description or only the Wikipedia text, etc.

Expected documents, Presentations and grading

At the end of the session, please provide us (for each group) with a zipped directory (zip archive) containing:

  • your commented Python source code;
  • a README file explaining how to install and run your code;
  • a REQUIREMENTS file listing the libraries used by your code and their version numbers;
  • your extracted corpus;
  • other optional useful information (e.g., a figure representing the model of your database).

Your zipped archive should be uploaded to Arche in the Group Project repository. Please name this zip archive using your logins at UL (example: dupont2_martin5_toussaint9.zip). Please also write down your first and last names within your code!

Grading will take into account the following aspects:

  • Replicability (how easily your code can be run / adapted to other input data). Make sure we can run your code and inspect results easily by giving us precise information about how to run the program, hardcoded file names (if any) and where to find input/output.

  • Code quality / readability (highly related to code comments). Make sure we can understand your code easily. Use meaningful variable and procedure names. Include clear comments.