Centrality Algorithms
This Jupyter notebook is hosted here in the Neo4j Graph Data Science Client Github repository.
Centrality algorithms are used to understand the role or influence of
particular nodes in a graph. The notebook shows the application of
centrality algorithms using the graphdatascience
library on the
Airline travel reachability network dataset that can be downloaded
here.
This notebook will show how you can apply eigenvector centrality, betweenness centrality, degree centrality and closeness centrality on a graph dataset.
1. Setup
We start by importing our dependencies and setting up our GDS client connection to the database.
# Install necessary dependencies
%pip install graphdatascience pandas
from graphdatascience import GraphDataScience
import pandas as pd
import os
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
NEO4J_AUTH = (
os.environ.get("NEO4J_USER"),
os.environ.get("NEO4J_PASSWORD"),
)
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)
from graphdatascience.server_version.server_version import ServerVersion
assert gds.server_version() >= ServerVersion(1, 8, 0)
2. Importing the dataset
We import the dataset as a pandas dataframe first. We deal with two
files here. The file reachability-meta.csv.gz
stores the names of
the cities and their information while the file reachability.txt.gz
stores the edges of the graph. An edge exists from city i
to city
j
if the estimated airline travel time is less than a threshold.
nodes_info_df = pd.read_csv("https://snap.stanford.edu/data/reachability-meta.csv.gz", compression="gzip")
nodes_info_df.head()
routes_df = pd.read_csv(
"https://snap.stanford.edu/data/reachability.txt.gz",
sep=" ",
skiprows=6,
header=None,
compression="gzip",
names=["Origin", "Destination", "Weight"],
)
routes_df.head()
Since this graph is very small, a straight-forward Cypher UNWIND
query is the simplest way to create our graph in the database.
Larger graphs may need a more sophisticated importing technique like
batching, neo4j-admin import
or Arrow CREATE DATABASE
.
gds.run_cypher(
"UNWIND $nodes AS node CREATE (n:City {node_id: node.node_id, name: node.name, population: node.metro_pop})",
params={"nodes": nodes_info_df.to_dict("records")},
)
gds.run_cypher(
"""
UNWIND $rels AS rel
MATCH (source:City {node_id: rel.Origin}), (target:City {node_id: rel.Destination})
CREATE (source)-[:HAS_FLIGHT_TO]->(target)
""",
params={"rels": routes_df.to_dict("records")},
)
G, result = gds.graph.project("airline", "City", "HAS_FLIGHT_TO")
print(f"The projection took {result['projectMillis']} ms")
# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")
print(f"Graph '{G.name()}' relationship count: {G.relationship_count()}")
3. Eigenvector Centrality
Eigenvector centrality measures the importance or influence of a node based on its connections to other nodes in the network. A higher eigenvector centrality score suggests that a node is more central and influential within the network.
For our dataset, eigenvector centrality can help identify airports that are not only well-connected themselves but also have connections to other important airports. Nodes with high eigenvector centrality are likely to be major hubs or airports with extensive connectivity.
eigenvector_centrality_result = gds.eigenvector.mutate(G, maxIterations=100, mutateProperty="eigenvectorCentrality")
# We can verify that the eigenvectorCentrality was mutated
G.node_properties()
We can see if our implementation converged or not and if converged, the number of iterations it took using the below code:
if eigenvector_centrality_result.didConverge:
print(
f"The number of iterations taken by Eigenvector Centrality to run is {eigenvector_centrality_result.ranIterations}."
)
else:
print("Algorithm did not converge!")
We can also see the distribution of the eigenvector centrality measures using the below code. This will show us the minimum, maximum, mean and other statistical values for our centrality measure.
eigenvector_centrality_result.centralityDistribution
We will now write our results back to the database.
gds.graph.nodeProperties.write(G, ["eigenvectorCentrality"])
Using the results from eigenvector centrality, we can now look up the top 20 cities with airports that have major hubs or airports with extensive connectivity.
def display_top_20_cities(centrality_measure):
"""
Function to execute the Cypher query to retrieve the top 20 cities with the highest centrality measure.
"""
query = f"""
MATCH (n:City)
RETURN n.node_id AS node_id, n.name AS name, n.population AS population, n.{centrality_measure} AS {centrality_measure}
ORDER BY n.{centrality_measure} DESC
LIMIT 20
"""
result = gds.run_cypher(query)
# Display the result
print(result)
display_top_20_cities("eigenvectorCentrality")
4. Betweenness Centrality
Betweenness Centrality quantifies the importance of a node as a bridge or intermediary in the network. It measures how often a node lies on the shortest path between other pairs of nodes.
For our dataset, cities/airports with high betweenness centrality serve as crucial transfer points or connecting hubs between airports that might not have direct flights between them. They play a significant role in facilitating the flow of air travel and can be vital for overall network connectivity.
betweenness_centrality_result = gds.betweenness.mutate(G, mutateProperty="betweennessCentrality")
# We can verify that the betweennessCentrality was mutated
G.node_properties()
We can also see the distribution of the betweenness centrality measures using the below code. This will show us the minimum, maximum, mean and other statistical values for our centrality measure.
betweenness_centrality_result.centralityDistribution
We will now write our results back to the database.
gds.graph.nodeProperties.write(G, ["betweennessCentrality"])
Using the results from betweenness centrality, we can now look up the top 20 cities with airports that serve as crucial transfer points or connecting hubs between airports that might not have direct flights between them.
display_top_20_cities("betweennessCentrality")
5. Degree Centrality
Degree Centrality measures the number of connections (edges) a node has in the network.
For our dataset, cities with high degree centrality have a large number of direct flight connections to other cities. They represent cities that have many direct destinations or are frequently used for direct travel. Degree centrality provides insights into the prominence and connectivity of individual airports within the network.
degree_centrality_result = gds.degree.mutate(G, mutateProperty="degreeCentrality")
# We can verify that the degreeCentrality was mutated
G.node_properties()
Similar to above, we can also see the distribution of the degree centrality measures using the below code. This will show us the minimum, maximum, mean and other statistical values for our centrality measure.
degree_centrality_result.centralityDistribution
We will now write our results back to the database.
gds.graph.nodeProperties.write(G, ["degreeCentrality"])
Finally, using the results from degree centrality, we can now look up the top 20 cities with airports that have a large number of direct flights.
display_top_20_cities("degreeCentrality")
6. Cleanup
Before finishing we can clean up the example data from both the GDS in-memory state and the database.
# Cleanup GDS
G.drop()
# Cleanup database
gds.run_cypher("MATCH (n:City) DETACH DELETE n")
7. References
-
For the network: Brendan J. Frey and Delbert Dueck. “Clustering by passing messages between data points.” Science 315.5814 (2007): 972-976.
-
For the city metadata (metropolitan population, latitude, and longitude): Austin R. Benson, David F. Gleich, and Jure Leskovec. “Higher-order Organization of Complex Networks.” Science, 353.6295 (2016): 163–166.
-
Link to the dataset: https://snap.stanford.edu/data/reachability.html
-
Notebook contributed by Kedar Ghule