
Media, Politics and Graphs

My dear friend and Neo4j community member Ron recently pointed me to an amazing piece of work. Thomas Boeschoten, of the Utrecht Data School among many other things, published an analysis of the Dutch talk shows from different perspectives, using Gephi as one of his tools. Some of his results are nothing short of fascinating, and very cool to look at.

[Image: netwerk]

…​

I will not try to help you understand the depths of Thomas' research; I would just like to take this dataset, which he kindly shared, for a spin using Neo4j.

Importing the dataset

Rik originally imported the 20x larger dataset from Gephi, but this GraphGist uses a sampled version of the original.

…​

However, when I fired up the server, I soon found out that I would have to do some work :) … the graph that Thomas created did not really have a "database-like" model (the model was not normalised, for instance), and the Neo4j browser looked a bit boring:

[Image: Screen Shot 2014-03-23 at 19.28.11]

I needed to add some structure to all of this in order to be able to query it meaningfully.

Adding a model

After browsing around through the data, I decided that the model that I would be playing with would look something like this:

[Image: Screen Shot 2014-03-23 at 19.34.51]

You can see that it is not a very big graph:

MATCH (n)
RETURN head(labels(n)) as labels,count(*) as count

but it is quite densely connected - it has a lot of relationships between the nodes:

MATCH (n)-[r]->(m)
RETURN head(labels(n)) as start, type(r) as rel, head(labels(m)) as end, count(*) as count

So now I can run some more interesting queries on the data and see if, like in Thomas' research, I can find out some interesting stuff about this dataset.

Take it for a spin: Cypher queries!

Let’s start with some simple queries. Let’s figure out how many people have visited the different shows:

match (g:GUEST)-[v:VISITED]->(sh:SHOW)
return sh.id as Show, count(v) as NrOfVisits
order by NrOfVisits desc;

And we immediately get a feel for the dominant talkshows:
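
The query above counts VISITED relationships, so if a guest can appear on the same show more than once, that guest is counted once per visit. A minimal variation, assuming the same GUEST, VISITED and SHOW model, returns the number of distinct guests next to the number of visits (the NrOfGuests column name is just an illustrative choice):

```cypher
// Count distinct guests alongside the raw number of visits per show.
// Assumes the GUEST / VISITED / SHOW model used in the query above.
MATCH (g:GUEST)-[v:VISITED]->(sh:SHOW)
RETURN sh.id AS Show,
       count(DISTINCT g) AS NrOfGuests,
       count(v) AS NrOfVisits
ORDER BY NrOfVisits DESC;
```

If the two columns are identical for every show, each guest visited each show at most once in this sampled dataset.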

But then let’s see how many of these talkshow guests are politicians (or have political affiliations at least). Let’s expand the query a bit:

match (g:GUEST)-[v:VISITED]->(sh:SHOW),
(g)-[:AFFILIATED_WITH]->(p:PARTY)
return sh.id as Show, count(v) as NrOfVisits
order by NrOfVisits desc;

And see if there is any difference in the way the shows are ranked:

Interesting. There are indeed some differences, as you can see.

Now let’s look at another perspective in our dataset: Gender. Let’s look at the distribution of male/female guests to all of these shows:

match (g:GUEST)-[:HAS_GENDER]->(gen:GENDER),
(g)-[v:VISITED]->(sh:SHOW)
return gen.name, count(v)
order by gen.name ASC;

We can clearly still see the dominance of men in these shows:
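
The same pattern can also be broken down per show. This is only a sketch that reuses the labels and properties from the queries above (sh.id and gen.name); the column aliases are illustrative:

```cypher
// Break the gender counts down per show instead of over the whole graph.
MATCH (g:GUEST)-[:HAS_GENDER]->(gen:GENDER),
      (g)-[v:VISITED]->(sh:SHOW)
RETURN sh.id AS Show, gen.name AS Gender, count(v) AS NrOfVisits
ORDER BY Show ASC, Gender ASC;
```

Grouping on both the show and the gender gives one row per combination, which makes it easier to spot shows that deviate from the overall distribution.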

If we then add the political dimension again, and look at gender distribution for the political visitors to the shows:

match (g:GUEST)-[:HAS_GENDER]->(gen:GENDER),
(g)-[v:VISITED]->(sh:SHOW),
(g)-[:AFFILIATED_WITH]->(p:PARTY)
return gen.name, count(v)
order by gen.name ASC;

then we can see that the distribution is broadly the same:

I am sure there are plenty of other queries to think of, but let me do one more in this post: let’s see what the overlap is - in terms of guests visiting them - between the different shows. To do that, all we need to do is calculate some paths between two shows: DWDD and P&W.

match p = allShortestPaths((s1:SHOW {id:"DWDD"})-[*..2]-(s2:SHOW {id:"P&W"}))
return nodes(p)
limit 5;

The result is exactly what you would expect: a HUGE amount of overlap, at least between these two shows (the two largest, as seen above). Hence the "limit 5" in the query, so that my poor Neo4j browser would survive:
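
To put a number on that overlap without rendering any paths, one option is to count the guests directly. This sketch assumes that the two-hop paths between the shows run through GUEST nodes via VISITED relationships, as in the model above; SharedGuests is an illustrative alias:

```cypher
// Count the guests who visited both shows, rather than listing paths.
MATCH (s1:SHOW {id:"DWDD"})<-[:VISITED]-(g:GUEST)-[:VISITED]->(s2:SHOW {id:"P&W"})
RETURN count(DISTINCT g) AS SharedGuests;
```

Because this returns a single row, it does not need the "limit 5" safeguard.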

Wrap-up

That’s about all I have at this point. You can download the database from over here, and the queries that I used above are all on GitHub.

From my perspective, I think these kinds of datasets are extremely interesting and powerful. I would love to see more work like Thomas', from my own country or abroad, and look at this from an even broader perspective. In any case, I would like to thank and compliment Thomas on his work - and look forward to your feedback.

Hope this was useful.

Cheers

Rik

Link to the original post again