How Whitepages turned the phone book into a graph
Whitepages is now offering developers access to a Graph API. This is the final step in a complete overhaul of the way Whitepages structures its data. Here is the story of how Whitepages switched to graph technologies and transformed its business.
If you were born in the 1990s or earlier, you are familiar with phone books. These books listed the phone numbers of the people living in a given area. When you wanted to contact someone whose name you knew, the phone book could help you find their number. Before people switched phones regularly and stopped caring about having a landline, this was important.
Needless to say, the phone book was a limited tool. It was impossible, for example, to use it to find out who a certain number belonged to. The data was there but hard to extract unless you were Rain Man.
In fact, until 2008 Whitepages was still using the digital equivalent of phone books to store its data. The Whitepages engineering team recently shared the story of its journey into graph databases. It turns out that the company used multiple RDBMS silos (PostgreSQL 7.4, 8.0, MySQL, Oracle), and each silo stored a large flat listing of data. Good old tables.
The paradox is that while Whitepages relied on RDBMS, its data is naturally a graph. People and businesses are connected to addresses and phone numbers. Moreover, in the real world many people or businesses can be connected to the same address, and some phone numbers are shared. Of course, this can be modeled with tables in a relational database, but it leads to serious issues once queries have to follow those connections.
The phone book addressed one use case: finding someone’s number given their name. The relational stack used by Whitepages could answer that type of query but struggled with more advanced questions like:
- Who is behind a name? What are the present and historical addresses and phone numbers of a new customer?
- What businesses exist in my area? Can I get a list of all the restaurant owners in Chicago with addresses and phone numbers?
- Who does this phone number belong to? Who is the owner, where do they live, and is it a telemarketer or a legitimate individual?
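The gap between the phone book's one use case and the last question in the list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names, numbers and helper functions below are hypothetical and are not taken from the Whitepages system. A listing keyed by name answers "what is this person's number?" directly, but "who owns this number?" forces a scan of every entry unless a second index is built and kept in sync.

```python
from collections import defaultdict

# Hypothetical listings in the spirit of a phone book: name -> phone number.
phone_book = {
    "Jane Smith": "206-555-0101",
    "John Smith": "206-555-0101",  # a shared household line
    "Acme Pizza": "312-555-0199",
}


def number_for(name):
    """Forward lookup: the one query a phone book is organised for."""
    return phone_book.get(name)


def owners_by_scan(number):
    """Reverse lookup on the flat listing: every entry must be inspected."""
    return sorted(name for name, n in phone_book.items() if n == number)


# A second index makes the reverse lookup direct, but it is one more
# structure to maintain for each new kind of question.
owners_index = defaultdict(list)
for name, n in phone_book.items():
    owners_index[n].append(name)

print(number_for("Jane Smith"))
print(owners_by_scan("206-555-0101"))
print(sorted(owners_index["312-555-0199"]))
```

Every additional question (by address, by business category, by history) would need its own index or its own scan, which is roughly the situation the following paragraphs describe.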
Whitepages customers needed answers to these questions to find new leads, identify potential fraudsters or update customer listings. The relational technologies struggled to answer questions that required looking for connections in the data. According to ProgrammableWeb:
WhitePages started building their Contact Graph platform after they saw regular visitors to WhitePages.com coming from IP addresses associated with businesses. WhitePages found that business teams like call centers and fraud departments were relying on their web site. Call centers wanted to know how to quickly spell a caller’s name. Fraud departments checked to see if a customer name was really associated with a phone number and shipping address.
From 2008 to 2013, a team led by Devin Ben-Hur, Senior Architect at Whitepages, tried different solutions to solve that problem. Solr, HBase, Scala and Kraken were all tried, but the problem remained: providing fast answers to customers looking for the connections within the Whitepages data.
Whitepages needed something that could meet its strict requirements:
- Scalable: distributed solution; just add nodes
- Available: AP design; robust fault tolerance
- High performance: > 30,000 vertices/sec
- High ingest rate: 200+ updates/sec
The system would have to support a dataset that is naturally connected. It would also have to support queries that are centered on the exploration of relationships between entities. Finally it would have to be agile enough to adapt to new business/customer requirements.
The team led by Devin Ben-Hur settled on Titan. Titan is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.
Whitepages decided to use Titan together with Cassandra. It tested this architecture in three steps:
- Local deployment (single-node cluster on a Mac)
- Small cluster (5 nodes in AWS, 7.5-10 GB of data)
- “Full” cluster (60 nodes in AWS, 3 TB of data)
The tests proved conclusive: at 400 requests per second, Titan delivered results in 47 ms, whereas the old system took 140 ms. Adding 200 simultaneous writes barely affected performance.
Whitepages is moving to Titan to handle a billion+ entities. When the project is complete, the Whitepages graph will store the most comprehensive and accurate data for people and businesses in North America, including the best mobile data available anywhere.
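To put the two latency figures side by side, here is a quick back-of-the-envelope calculation. It uses only the 47 ms and 140 ms numbers quoted above; the variable names are ours.

```python
# Latencies reported for the test at 400 requests per second.
old_latency_ms = 140  # previous system
new_latency_ms = 47   # Titan on Cassandra

speedup = old_latency_ms / new_latency_ms
reduction = (old_latency_ms - new_latency_ms) / old_latency_ms

print(f"speedup: {speedup:.2f}x")             # speedup: 2.98x
print(f"latency reduction: {reduction:.1%}")  # latency reduction: 66.4%
```

In other words, the new stack answered roughly three times faster at the same request rate.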
Using a graph database like Titan allows Whitepages to use a more natural model to store its data.
In a graph database, the entities are stored as individual nodes. Here we see a graph with 2 phone nodes, 3 person nodes and 2 address nodes. They are linked together by relationships. For example, Jane Smith is linked to a current address and to a previous address.
Simply by looking at the graph visualization above, we see that the Smith household is located in Seattle and is composed of two parents and (most probably) their daughter. It is simply a matter of following relationships. Querying the data (what is called a “graph traversal”) is similar and just as easy. Finding someone’s phone number, neighbors or spouse is a matter of looking for specific relationships within the graph.
The actual graph schema used by Whitepages is more sophisticated. It is designed to handle merged entities and out-of-order updates.
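A traversal of this kind can be sketched in plain Python. This is a minimal illustration, not Titan's API and not the Whitepages schema: the edge labels, the two Smiths other than Jane, and the helper functions are all hypothetical. The toy graph mirrors the shape described above, with 3 person nodes, 2 phone nodes and 2 address nodes.

```python
from collections import defaultdict

# Hypothetical toy graph. Each edge is (source, label, target);
# the labels are invented for illustration.
edges = [
    ("Jane Smith", "current_address", "Seattle address"),
    ("Jane Smith", "previous_address", "Previous address"),
    ("John Smith", "current_address", "Seattle address"),
    ("Amy Smith", "current_address", "Seattle address"),
    ("Jane Smith", "has_phone", "Mobile phone"),
    ("Seattle address", "has_phone", "Landline phone"),
]

# Index edges in both directions so relationships can be followed either way.
outgoing = defaultdict(list)
incoming = defaultdict(list)
for src, label, dst in edges:
    outgoing[src].append((label, dst))
    incoming[dst].append((label, src))


def follow(node, label):
    """Nodes reached from `node` along edges carrying `label`."""
    return sorted(dst for lbl, dst in outgoing[node] if lbl == label)


def follow_back(node, label):
    """Nodes that point at `node` through edges carrying `label`."""
    return sorted(src for lbl, src in incoming[node] if lbl == label)


def household(person):
    """Two-hop traversal: person -> current address -> other residents."""
    members = set()
    for address in follow(person, "current_address"):
        members.update(follow_back(address, "current_address"))
    members.discard(person)
    return sorted(members)


def reachable_phones(person):
    """Phones linked to the person directly or through a current address."""
    phones = set(follow(person, "has_phone"))
    for address in follow(person, "current_address"):
        phones.update(follow(address, "has_phone"))
    return sorted(phones)


print(household("Jane Smith"))
print(reachable_phones("Jane Smith"))
```

Each question becomes a short walk along labelled relationships, with no extra index needed per question. A graph database such as Titan exposes the same idea through its own traversal language and distributes the storage across a cluster.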
The impact of moving to a graph database is huge for Whitepages. It is not simply a new way to offer the same products and services with better performance: the graph infrastructure used at Whitepages is now open to outside developers. Whitepages has released WhitePages PRO API 2.0, an API that makes the Contact Graph available to anyone.
Whitepages is embracing an open philosophy and making its data available to developers everywhere. An overview of the WhitePages API is publicly available and accessing the API only requires an email verification.
Whitepages now serves a growing number of customers via its API. These customers use the Contact Graph for:
- fraud prevention
- lead qualification and contact data completion
- identity verification and normalization
- personalization and retargeting
Using a flexible, easy-to-understand, high-performance graph back end allowed Whitepages to turn its 17 years of experience in the data collection business into a platform used by companies like Amazon, Microsoft and Twilio.
Whitepages is following other companies that are betting heavily on graphs. Last year, Facebook announced Graph Search, a search engine smart enough to understand the connections between Facebook’s users, places, tastes, etc. Crunchbase recently announced the Business Graph; it now uses the Neo4j graph database to make its data about startups more easily available to the world.
Whitepages had a challenge: storing hundreds of millions of relationships between various entities and finding connections quickly in this dataset. Like more and more companies, it turned to a graph database to solve this problem. The results were good enough to enable Whitepages to give direct access to its data through a public API. This shows how graph technologies create new opportunities to find value in existing data.