Visualizing the network of Hillary Clinton’s emails
Courtesy of the US State Department, we have access to some of Hillary Clinton’s emails. Using graph visualization, we will explore this data, focusing not on the content of the emails but on their metadata. Let’s see what we can uncover about Hillary Clinton’s professional network.
Clinton, then US Secretary of State, used a personal email server to exchange professional emails. When this was revealed, it caused a public controversy.
The emails later became the subject of a public records request. The State Department reviewed them to decide which were too sensitive to be released. The rest were published on a monthly basis as PDF documents. WSJ journalists, Ben Hamner and others have undertaken the task of converting them into a more usable format. For this article we will use a cleaned version of the data prepared by Ryan Boyd. It consists of a single CSV file.
The script below (courtesy of Boyd) turns the CSV file into a Neo4j graph database:
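// Creating the graph
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://s3-us-west-2.amazonaws.com/neo4j-datasets-public/Emails-refined.csv" AS line
// Sender and recipient: prefer the metadata field, fall back to the extracted one, then to an empty alias
MERGE (fr:Person {alias: COALESCE(line.MetadataFrom, line.ExtractedFrom, '')})
MERGE (to:Person {alias: COALESCE(line.MetadataTo, line.ExtractedTo, '')})
// One Email node per CSV row; properties are set only when the node is first created
MERGE (em:Email { id: line.Id })
ON CREATE SET em.foia_doc=line.DocNumber, em.subject=line.MetadataSubject, em.to=line.MetadataTo, em.from=line.MetadataFrom, em.text=line.RawText, em.ex_to=line.ExtractedTo, em.ex_from=line.ExtractedFrom
// Link the email to its recipient and its sender
MERGE (to)<-[:TO]-(em)-[:FROM]->(fr)
// Direct person-to-person link, counting the emails sent in that direction
MERGE (fr)-[r:HAS_EMAILED]->(to)
ON CREATE SET r.count = 1
ON MATCH SET r.count = r.count +1;
// Store on each person the number of emails sent or received
MATCH (a:Person)-[r]-(b:Email) WITH a, count(r) as count SET a.count = count;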
The result is a graph of 8,278 nodes and 16,335 edges.
Our graph has two types of nodes: persons and emails. Each email is linked to its sender by a FROM relationship and to its recipient by a TO relationship. In addition, two persons are directly linked by a HAS_EMAILED relationship when one has emailed the other.
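To check that the import produced the expected model, the nodes and relationships can be counted by type. This is a minimal sketch that relies only on the labels and relationship types created by the import script; the two statements are meant to be run one after the other.

```cypher
// Nodes per label (the import script creates Person and Email nodes)
MATCH (n)
RETURN labels(n) AS labels, count(*) AS nodes;

// Relationships per type (the import script creates TO, FROM and HAS_EMAILED)
MATCH ()-[r]->()
RETURN type(r) AS relType, count(*) AS relationships;
```

The totals across the rows of each result should add up to the node and edge counts of the graph.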
Now that we have prepared the data, we can use Linkurious to start investigating it. First let’s look up Hillary Clinton.
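Outside of Linkurious, the same lookup can be done in Cypher. The article does not say which alias the dataset uses for Clinton, so rather than guess it, the sketch below lists the persons with the most emails. It assumes that Clinton, as the owner of the mailbox, appears at or near the top; note that the import script also creates a person with an empty alias for emails whose sender or recipient is missing, which may rank highly as well.

```cypher
// The count property is set by the last statement of the import script
MATCH (p:Person)
WHERE p.count IS NOT NULL
RETURN p.alias AS alias, p.count AS emails
ORDER BY emails DESC
LIMIT 5;
```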
Time to explore what Clinton is connected to. Instead of visualizing all 7,945 emails she has sent or received, let’s focus on the people she is connected to.
Clinton has exchanged emails with 210 people. Already there are some interesting things to notice. We have a lot of isolated nodes (nodes which are only connected to Clinton) in the top right corner of the screen. At the bottom we have a group of nodes which are highly interconnected. Among them are Cheryl Mills, former Counselor and Chief of Staff, and Lona Valmoro, Special Assistant. The people in this group are in contact with one another and form a community. These are Clinton’s closest professional contacts.
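The split between isolated contacts and the interconnected group can also be approximated with a query. This is a sketch under assumptions: `$alias` is a query parameter standing for Clinton's alias in the dataset (not given in the article), and a contact is treated as isolated when it has no HAS_EMAILED link to anyone other than Clinton.

```cypher
// Clinton's direct contacts, ignoring direction and any self-loop
MATCH (c:Person {alias: $alias})-[:HAS_EMAILED]-(p:Person)
WHERE p <> c
WITH DISTINCT c, p
// For each contact, look for links to people other than Clinton
OPTIONAL MATCH (p)-[:HAS_EMAILED]-(o:Person)
WHERE o <> c AND o <> p
WITH p, count(DISTINCT o) AS otherContacts
RETURN otherContacts = 0 AS onlyLinkedToClinton, count(p) AS people;
```

The two rows together give the total number of people Clinton has exchanged emails with.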
In this network, who are the most active people? Let’s map the size of the nodes to the number of emails sent and received.
We can see that Cheryl Mills, Huma Abedin and Jake Sullivan are the biggest nodes and thus the most active participants (after Clinton) in the network.
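The node sizes are driven by the count property that the import script stores on each person, so the same ranking can be read directly from the database. A minimal sketch, again assuming `$alias` is a query parameter holding Clinton's alias:

```cypher
// Rank Clinton's contacts by emails sent and received
MATCH (c:Person {alias: $alias})-[:HAS_EMAILED]-(p:Person)
WHERE p <> c AND p.count IS NOT NULL
RETURN DISTINCT p.alias AS person, p.count AS emails
ORDER BY emails DESC
LIMIT 10;
```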
Let’s shift our focus to the isolated nodes. They represent people who exchanged emails with Clinton but were not involved in her day-to-day work. For example, Cherie Blair, wife of former British PM Tony Blair, is one of these isolated nodes connected to Clinton.
When we expand Blair’s connections, we see that she received four emails from Clinton with the subjects “Confidential”, “Get well soon”, “Sorry to miss you” and “Get well soon”.
We can select the “Confidential” email and read the content.
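Expanding a person's emails and reading one of them corresponds to following the FROM and TO relationships between two persons. In the sketch below, `$sender` and `$recipient` are query parameters standing for the aliases of Clinton and Blair; the actual alias strings are not given in the article and would need to be looked up in the data first.

```cypher
// Emails sent by one person to another, with subject and raw text
MATCH (sender:Person {alias: $sender})<-[:FROM]-(em:Email)-[:TO]->(recipient:Person {alias: $recipient})
RETURN em.id AS id, em.subject AS subject, em.text AS text;
```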
We don’t have to read Clinton’s emails, though, to learn more about her activity at the State Department. Graph visualization helps us turn the emails’ metadata into a clear view of Clinton’s network. We can identify key people and communities quickly and easily.