Analysing football transfers with Neo4j
Analysing football transfers with Neo4j
In this guide, we’ll look at how we can import and analyse football transfers using Neo4j.
Looking at the Data
We’ve written a bunch of transfers into a CSV file, so let’s have a quick look at the first few rows. Run the following query:
WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
RETURN row
LIMIT 5
This query returns the first five rows of the CSV file.
Importing the data: Constraints
Before we import anything, we’ll create some constraints to make sure we don’t insert duplicate records into the database.
CREATE CONSTRAINT ON (player:Player)
ASSERT player.id IS UNIQUE;
CREATE CONSTRAINT ON (club:Club)
ASSERT club.id IS UNIQUE;
CREATE CONSTRAINT ON (transfer:Transfer)
ASSERT transfer.id IS UNIQUE;
CREATE CONSTRAINT ON (country:Country)
ASSERT country.name IS UNIQUE;
Let’s check they all got created:
CALL db.constraints()
Looks good. Let’s get the data imported.
Importing the data: Players
First we’ll import the players. Run the following query:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS row
MERGE (player:Player {id: row.playerUri})
ON CREATE SET player.name = row.playerName, player.position = row.playerPosition
Let’s check that the players imported correctly. Run the following query:
MATCH (player:Player)
RETURN player
LIMIT 25
Looks good!
Importing the data: Player Countries
Next we’ll import the countries that those players represent. Run the following query:
WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row WHERE row.playerNationality <> ''
WITH DISTINCT row.playerNationality AS nationality
MERGE (country:Country {name: nationality })
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS row
WITH row WHERE row.playerNationality <> ''
MATCH (player:Player {id: row.playerUri})
MATCH (country:Country {name: row.playerNationality })
MERGE (player)-[:PLAYS_FOR]->(country)
Importing the data: Clubs
Next up, we’ll import the clubs. The way the data is structured for clubs is quite interesting. We have buying and selling clubs so the easiest approach is to write two import queries, one for each:
WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row where row.sellerClubUri <> ''
MERGE (club:Club {id: row.sellerClubUri})
ON CREATE SET club.name = row.sellerClubName
MERGE (country:Country {name: row.sellerClubCountry})
MERGE (club)-[:PART_OF]->(country)
WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
WITH row where row.buyerClubUri <> ''
MERGE (club:Club {id: row.buyerClubUri})
ON CREATE SET club.name = row.buyerClubName
MERGE (country:Country {name: row.buyerClubCountry})
MERGE (club)-[:PART_OF]->(country)
It’s a bit annoying that we have to write two queries to do this. Maybe there’s another way…
Importing the data: Clubs II
Indeed there is! Our good friend UNWIND
can help us out here:
WITH 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
UNWIND [
{uri: row.sellerClubUri, name: row.sellerClubName, country: row.sellerClubCountry},
{uri: row.buyerClubUri, name: row.buyerClubName, country: row.buyerClubCountry}
] AS club
WITH club WHERE club.uri <> ''
WITH DISTINCT club
MERGE (c:Club {id: club.uri})
ON CREATE SET c.name = club.name
MERGE (country:Country {name: club.country })
MERGE (c)-[:PART_OF]->(country)
Importing the data: Transfers
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'https://s3-eu-west-1.amazonaws.com/football-transfers.neo4j.com/transfers-all.csv' AS row
MATCH (player:Player {id: row.playerUri})
MATCH (source:Club {id: row.sellerClubUri})
MATCH (destination:Club {id: row.buyerClubUri})
MERGE (t:Transfer {id: row.transferUri})
ON CREATE SET t.season = row.season,
t.fee = row.transferFee,
t.timestamp = toInteger(row.timestamp)
MERGE (t)-[ofPlayer:OF_PLAYER]->(player) SET ofPlayer.age = row.playerAge
MERGE (t)-[:FROM_CLUB]->(source)
MERGE (t)-[:TO_CLUB]->(destination);
Now it’s time to clean up the transfer fees so that we can compare different transfers more easily.
Cleaning the data: Transfer fees
Let’s have a look what transfer fees look like at the moment. Run the following query:
MATCH (transfer:Transfer)
RETURN transfer.fee, COUNT(*) AS occurrences
ORDER BY occurrences DESC
LIMIT 100
There’s lots of different values here, but it looks like if the value is "?" or "-" then we don’t have any idea what the transfer fee actually was.
Let’s see if there are any valid transfers with those values. Run the following query:
MATCH (t:Transfer)
WHERE t.fee contains "?" or t.fee contains "-"
RETURN t.fee, count(*)
Doesn’t look like it! Let’s exclude those transfers:
MATCH (t:Transfer)
WHERE t.fee contains "?" or t.fee contains "-"
REMOVE t:Transfer
SET t:TransferWithoutFee
Tagging the loan transfers
There are some transfers in our dataset where a player is only
temporarily transferred between teams. This is a good time to make use of a 2nd label. Let’s add the label Loan
to those transfers:
MATCH (t:Transfer)
WHERE t.fee STARTS WITH 'Loan'
SET t:Loan
Cleaning the data: Transfer fees
Now what we’ve got left are all values that we can translate into a numeric value.
MATCH (transfer:Transfer)
RETURN transfer.fee, COUNT(*) AS occurrences
ORDER BY occurrences DESC
LIMIT 100
Let’s add a new property that has a numeric value for each transfer fee. Run the following query:
MATCH (t:Transfer)
WITH t, replace(replace(replace(replace(t.fee, "k", ""), "m", ""), "Loan fee:", ""), "£", "") AS rawNumeric
WITH t,
CASE
WHEN t.fee ENDS WITH "k" THEN toFloat(apoc.number.exact.mul(trim(rawNumeric),"1000"))
WHEN trim(t.fee) IN ["Free transfer", "ablösefrei ", "gratuito", "free", "free transfer", "Ablösefrei", "transfervrij", "ablöserei", "Free Transfer", "Libre", "gratutito", "ablsöefrei", "ablösefrei", "ablösefei", "abösefrei", "Loan", "draft", "Swap deal", "trade", "ablösefrei", "ablösefreei", "Free", "ablosefrei", "Draft", "Trade", "Libre para traspaso", "bez odstępnego", "ablossefrei", "Bez odstępnego", "Gratuito", "ablödefrei", "Bonservissiz", "ablösfrei", "ablõsefrei", "ablösefre", "custo zero", "ablösefrei!", "ablösefreo", "svincolato", "Ablösfrei", "livre", "libre", "Leihe", "abolsfrei", "ablösefrai", "ablösefreil", "abllösefrei", "abölsefrei", "ablöserfrei", "abklösefrei", "ablöaefrei", "Ablosefrei", "Nessuno", "ablösesfrei", "Free Tranfer", "abblösefrei", "Spielertausch", "ablösebrei", "abslösefrei", "spielertausch", "a", "ablöseferi", "ablöserfei", "Tausch"] THEN 0
WHEN NOT(exists(t.fee)) THEN 0
WHEN rawNumeric = '' THEN 0
WHEN t.fee ENDS WITH "m" THEN toFloat(apoc.number.exact.mul(trim(rawNumeric),"1000000"))
ELSE toFloat(trim(rawNumeric))
END AS numericFee
SET t.numericFee = numericFee
There are still a few transfers left which have annoying values so let’s exclude those from the dataset:
MATCH (t:Transfer)
WHERE not exists(t.numericFee)
REMOVE t:Transfer
SET t:TransferWithoutFee
Cleaning the data: Floating point numbers
You might have noticed that we’re using the APOC function apoc.number.exact.mul
to multiply transfer fees, but why can’t we just do that calculation in pure Cypher?
Floating point fun!
WITH "8.37" as rawNumeric
RETURN toFloat(rawNumeric) * 1000000 AS numeric
We’d expect to get back 8370000
but we didn’t! Let’s try that same calculation with the APOC function:
WITH "8.37" as rawNumeric
RETURN apoc.number.exact.mul(rawNumeric,"1000000") AS apocConversion
That works but it’s still a String, so we need to convert it back to a numeric value:
WITH "8.37" as rawNumeric
RETURN toFloat(apoc.number.exact.mul(rawNumeric,"1000000")) AS apocConversion
Now we’re ready to query the graph.
The top transfers
We’ll start by finding the most expensive transfers.
MATCH (transfer:Transfer)-[:OF_PLAYER]->(player),
(from)<-[:FROM_CLUB]-(transfer)-[:TO_CLUB]->(to)
RETURN player.name, from.name, to.name, transfer.numericFee
ORDER BY transfer.numericFee DESC
LIMIT 10
Transfers from teams
Now let’s narrow in and find the transfers involving a specific team:
MATCH (from:Club)<-[:FROM_CLUB]-(transfer:Transfer)-[:TO_CLUB]->(to:Club),
(transfer)-[:OF_PLAYER]->(player)
WHERE from.name = "FC Barcelona"
RETURN player.name, to.name, transfer.numericFee, transfer.season
ORDER BY transfer.numericFee DESC
Brexit means Brexit
In these days of Brexit, we can write a query that finds transfers of English players between English teams:
MATCH (to:Club)<-[:TO_CLUB]-(t:Transfer)-[:FROM_CLUB]-(from:Club),
(t)-[:OF_PLAYER]->(player:Player)-[:PLAYS_FOR]->(country),
(to)-[:PART_OF]->(country:Country)<-[:PART_OF]-(from)
WHERE country.name = "England"
RETURN player.name, from.name, to.name, t.numericFee, t.season
ORDER BY t.numericFee DESC
LIMIT 10
We could also easily change the country and look at transfers in other countries as well.
Players with the biggest transfer footprint
We can also write an aggregate query to find the players that have had the most money spent on them:
MATCH (t:Transfer)-[:OF_PLAYER]->(p:Player)
WITH p, sum(t.numericFee) as moneyTrace, COUNT(*) AS numberOfTransfers
RETURN p.name, apoc.number.format(moneyTrace), numberOfTransfers
ORDER BY moneyTrace desc
LIMIT 10
Adjacent transfers
So far, our queries haven’t been particularly graphy. We’ve done a few queries that had multiple joins, but we haven’t really used the power of the graph yet.
One way we can do that is by introducing NEXT
relationships between adjacent transfers involving individual players. Run the following query:
MATCH (p:Player)<-[:OF_PLAYER]-(transfer)
WHERE transfer.numericFee > 0
WITH p, transfer
ORDER BY p.name, transfer.timestamp
WITH p, collect(transfer) AS transfers
WHERE size(transfers) > 1
UNWIND range(0, size(transfers)-2) AS idx
WITH transfers[idx] AS t1, transfers[idx+1] AS t2
MERGE (t1)-[:NEXT]->(t2)
Most profit made
We can now write a query to see which club made the most profit on a player:
MATCH (p:Player)<-[:OF_PLAYER]-(t1)-[:NEXT]->(t2),
(initial)<-[:FROM_CLUB]-(t1)-[:TO_CLUB]->(club1)<-[:FROM_CLUB]-(t2)-[:TO_CLUB]->(club2)
WHERE none(t in [t1, t2] where t:Loan)
RETURN p.name as player, club1.name AS profitMaker , initial.name as buysFrom, club2.name AS sellsTo, t2.numericFee - t1.numericFee as profit, (t2.timestamp - t1.timestamp) / 60 / 60 / 24 AS daysAtClub
ORDER BY profit DESC
Profit per day
We can then go even further and work out how much profit was made for each day that a player was owned by a club:
MATCH (p:Player)<-[:OF_PLAYER]-(t1)-[:NEXT]->(t2),
(club0)<-[:FROM_CLUB]-(t1)-[:TO_CLUB]->(club1)<-[:FROM_CLUB]-(t2)-[:TO_CLUB]->(club2)
WHERE none(t in [t1, t2] where t:Loan)
WITH p, club1.name AS profitMaker, club0.name AS buysFrom, club2.name AS sellsTo, t2.numericFee - t1.numericFee as profit, (t2.timestamp - t1.timestamp) / 60 / 60 / 24 AS daysAtClub
RETURN p.name AS player, profitMaker, buysFrom, sellsTo, profit, daysAtClub, profit / daysAtClub AS profitPerDay
ORDER BY profitPerDay DESC
Which shows us some pretty strange-looking transfers! Andrea Bertolacci seems to have been transferred twice in consecutive days at a massive profit. Presumably these transfers were all organised beforehand?
Loop transfers
We can also find players who have returned to the club that initially sold them. Run the following query:
MATCH (p:Player)<-[:OF_PLAYER]-(t:Transfer)
MATCH path = (t)-[:NEXT*]->(t2)
MATCH (t)-[:FROM_CLUB]->(club)<-[:TO_CLUB]-(t2)
WHERE none(t in [t, t2] where t:Loan)
WITH p, t.numericFee - t2.numericFee AS profit, [transfer in nodes(path) | [(from)<-[:FROM_CLUB]-(transfer)-[:TO_CLUB]->(to) | from.name + "->" + to.name][0]] AS transfers, path
RETURN p.name, apoc.number.format(profit), transfers, (nodes(path)[-1].timestamp - nodes(path)[0].timestamp) / 60 / 60 / 24 AS days
ORDER BY profit DESC
Money flow
What was the money flow between clubs in a particular season?
MATCH (t:Transfer)
WITH DISTINCT t.season AS season
MATCH (seller)<-[:FROM_CLUB]-(t:Transfer)-[:TO_CLUB]->(buyer)
WHERE t.season = season AND t.numericFee > 0
WITH season, seller, buyer, sum(t.numericFee) AS cash_flow, count(t) AS player_count
RETURN buyer, cash_flow, player_count, season, seller
ORDER BY cash_flow DESC
LIMIT 10
Let’s persist this in the graph so that we don’t need to recalculate it each time:
MATCH (t:Transfer)
WITH DISTINCT t.season AS season
MATCH (seller)<-[:FROM_CLUB]-(t:Transfer)-[:TO_CLUB]->(buyer)
WHERE t.season = season AND t.numericFee > 0
WITH season, seller, buyer, sum(t.numericFee) AS cash_flow, count(t) AS player_count
MERGE (buyer)-[:CASH_FLOW {total: cash_flow, playerCount: player_count, season: season}]->(seller)
Who received money from Real Madrid in 2010/2011?
This new relationship type allows us to find out where the cash was flowing in a particular season - e.g. we can see who Real Madrid bought players from in 2010/2011.
MATCH path = (buyer:Club)-[:CASH_FLOW]->(seller:Club)
WHERE buyer.name = "Real Madrid" AND all(f in relationships(path) WHERE f.season="2010/2011")
RETURN *
Largest sums of money transferred
We can write a query to find out which clubs participated in the biggest transfers of money:
MATCH (buyer:Club)-[f:CASH_FLOW]->(seller:Club)
RETURN f.season AS season, buyer.name AS buyer, seller.name AS seller, f.total AS amount, f.playerCount AS playersTransferred
ORDER BY amount DESC limit 10
It’s surprising to see Man City/Monaco in there, who was transferred between those two teams?
MATCH (from:Club)<-[:FROM_CLUB]-(transfer:Transfer)-[:TO_CLUB]->(to:Club),
(transfer)-[:OF_PLAYER]->(player)
WHERE from.name = "Monaco" AND to.name = "Man City"
RETURN player.name, to.name, transfer.numericFee, transfer.season
ORDER BY transfer.numericFee DESC
Largest sums in a specific season
We can also drill down to a specific season and only look at the cash flows for that season:
MATCH (buyer:Club)-[f:CASH_FLOW]->(seller:Club)
WHERE f.season = "2016/2017"
RETURN f.season AS season, buyer.name AS buyer, seller.name AS seller, f.total AS amount, f.playerCount AS playersTransferred
ORDER BY amount DESC limit 10
MATCH (from:Club)<-[:FROM_CLUB]-(transfer:Transfer)-[:TO_CLUB]->(to:Club),
(transfer)-[:OF_PLAYER]->(player)
WHERE from.name = "Valencia CF" AND to.name = "FC Barcelona" AND transfer.season = "2016/2017"
RETURN player.name, to.name, transfer.numericFee, transfer.season
ORDER BY transfer.numericFee DESC
Is this page helpful?