Extracting and manipulating article metadata (RDF) from Het Laatste Nieuws

This is a very rough tutorial demonstrating the use of Python libraries for manipulating article metadata from Het Laatste Nieuws. We extract the metadata with an RDFa parser, demonstrate SPARQL queries against the metadata of a single article and of multiple articles (using pandas to work with SPARQL query results), and finish with a silly example visualising query results.

1. Setup

Install the requirements for manipulating RDF and RDFa. Optionally, pandas is a useful library for manipulating SPARQL results as a dataset.

pip install rdflib html5lib sparql-client pandas
pip install git+git://github.com/RDFLib/pyrdfa3.git

# load the required libraries
import pyRdfa, sparql, pandas

2. Load and query RDF for a single article

Load the metadata embedded in the article HTML by using the RDFa parser. Store it in a Graph object g.

p = pyRdfa.pyRdfa()
article_url = 'http://www.hln.be/hln/nl/957/Binnenland/article/detail/1565024/2013/01/19/Charles-Michel-Als-N-VA-en-PS-zo-voortdoen-blokkeert-alles.dhtml'
g = p.graph_from_source(article_url)

Serialize the graph containing the article metadata using the human-readable Turtle format.

print g.serialize(format='turtle')

Select and print the article title and author (i.e., the author URL) using a SPARQL query on the graph g.

q = """
SELECT ?title ?author_url
WHERE { ?url <http://ogp.me/ns#title> ?title .
        ?url <http://ogp.me/ns/article#author> ?author_url . }"""
for record in g.query(q).bindings:
    print record.get('title'), record.get('author_url')

3. Query RDF over multiple articles

I have collected the RDF metadata for a set of 200 articles over the last two weeks and loaded it into a locally running Fuseki instance. Instead of using the built-in rdflib SPARQL client to query a local graph g, we query the Fuseki SPARQL endpoint using the Python sparql-client library.
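A minimal sketch of that collection step, assuming a Fuseki dataset named ds with its Graph Store endpoint at http://localhost:3030/ds/data (matching the query endpoint used below) and using the requests library (not included in the pip install above):

# Sketch only: gather RDFa metadata per article and push it into Fuseki.
# Assumptions: dataset 'ds', Graph Store endpoint http://localhost:3030/ds/data,
# and a list of article URLs collected beforehand.
import pyRdfa, requests

article_urls = [article_url]  # in practice: the full list of ~200 collected article URLs

p = pyRdfa.pyRdfa()
for url in article_urls:
    g = p.graph_from_source(url)  # parse the RDFa embedded in the article page
    requests.post('http://localhost:3030/ds/data?default',  # add the triples to the default graph
                  data=g.serialize(format='turtle'),
                  headers={'Content-Type': 'text/turtle'})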

As an example query we select all articles that have "N-VA" in the title and return the link and title for each of the 167 articles found, optionally together with the author and the publishing date.

We convert the SPARQL results to a pandas DataFrame and use pandas and IPython together to render the first 20 results inline as HTML.

q = """
SELECT ?url ?title ?author ?pubdate
WHERE {
    ?url <http://ogp.me/ns#title> ?title .
    OPTIONAL {
        ?url <http://ogp.me/ns/article#author> ?author .
        ?url <http://ogp.me/ns/article#published_time> ?pubdate.
    }
    FILTER regex(?title, "N-VA", "i") }
"""
result = sparql.query('http://localhost:3030/ds/query', q)
#for row in result:
#   print row

df = pandas.DataFrame(result.fetchall(), columns=result.variables)
print 'Nr. of articles:', len(df)
from IPython.core.display import HTML
HTML(df[:20].to_html())

4. Process results of multiple articles

We demonstrate a use of the basic article metadata with a silly example: for all the parsed HLN articles that contain 'N-VA' in their title, we show the image associated with the article.

It gives you a brief glimpse of the actors associated with such topics, in the form of a very basic visualisation.

q = """
SELECT ?img
WHERE {
    ?url <http://ogp.me/ns#title> ?title .
    ?url <http://ogp.me/ns#image> ?img .
    FILTER regex(?title, "N-VA", "i") }
"""
result = sparql.query('http://localhost:3030/ds/query', q)
img_urls = [row[0] for row in result.fetchall()]  # first (and only) column: the image URL
img_urls = [str(url) for url in img_urls if 'logo' not in str(url)]  # drop the site logo images
html = ''
for url in img_urls:
    html = html + '<img src="' + url + '"/>'
HTML(html)