Here are the top ten questions we get asked by our customers about data sources, quality, coverage, and more.
1. Where do you get your data?
We aggregate our data from more than 100 different sources all over the world; there is a great diversity in the types of sources contributing to our database. We are not able to reveal exactly from whom we source our data, but we can provide the types of companies we work with. The most common data sources for us are telcos, local postal authorities (e.g., USPS), cable and utilities, directory assistance, and credit bureaus.
Additionally, we are able to leverage our extensive user base of more than 50M unique monthly users to crowdsource interesting data points to help improve the accuracy and coverage of our Identity Graph.
2. Why don’t you have data on 100% of people?
First, we only publish data on the adult population of a given country, so by default we will not have information on people under the age of 18.
Second, there are people who have made conscious decisions to stay “off the grid” by not leveraging traditional lines of communication and/or banking, or who have gone through the steps of removing their information from public records.
Lastly, there are more than 250M adults in the US. While we have information on about 95 percent of them, there are always going to be data points that slip through the cracks. This doesn’t deter us from trying to improve our coverage, but we will never have data on 100% of the people in a country.
3. What is your coverage?
We get asked this a lot, and while it is a valid and understandable question, it is a question we always need to further qualify. For example, do you want coverage of Person/Phone links, or Person/Email? Also, in which countries are you interested?
Our coverage numbers vary by entity type (e.g., in the US Person-Phone is ~80% and Person-Email is ~70%) and by country (e.g., Person-Phone in Australia is ~30% and Person-Phone in Brazil is ~60%), so when you ask your Whitepages Pro contact this question, please be specific about which numbers you are looking for.
4. How do you keep your data up to date?
We approach this challenge through two primary mechanisms:
- Data sources we store locally in our own environment: these sources provide us updates of the files on a pre-determined schedule (e.g., weekly or monthly). In these instances we are diligent about selecting an update schedule that makes the most sense for our customers, and then making that data available to our customers as soon as we receive the updates.
- Data points that are updated rapidly in the real world (e.g., proxy IP addresses): we leverage real-time APIs that receive these updates as they happen. This allows us to query for and receive the most up-to-date data available. APIs are not always ideal due to the additional latency they add for our customers, but in certain instances it is a tradeoff we make in order to access data in real time.
5. How do you measure data accuracy?
Our primary method for measuring accuracy is call downs, in which operators call a sample of our data records to verify that the information is correct. For example, they would call 206-555-1234 to verify that John Smith is linked to that phone number and lives at 456 Main Street. We do this on statistically significant samples so that we can apply what we learn from these call downs to our entity resolution algorithms and improve the accuracy of our links.
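To make the idea of a statistically significant sample concrete, here is a minimal sketch of how a call-down result could be turned into an accuracy estimate with a margin of error. It uses the standard Wilson score interval; the function name and the sample counts are hypothetical and are not taken from our actual process.

```python
import math


def accuracy_interval(verified, sampled, z=1.96):
    """Estimate accuracy from a call-down sample.

    verified: records the operators confirmed as correct (hypothetical input)
    sampled:  total records called
    z:        z-score for the confidence level (1.96 is roughly 95%)

    Returns (point_estimate, lower_bound, upper_bound) using the
    Wilson score interval for a binomial proportion.
    """
    if sampled <= 0 or not 0 <= verified <= sampled:
        raise ValueError("need 0 <= verified <= sampled and sampled > 0")

    p = verified / sampled
    denom = 1 + z ** 2 / sampled
    centre = (p + z ** 2 / (2 * sampled)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / sampled + z ** 2 / (4 * sampled ** 2)) / denom
    )
    return p, centre - half_width, centre + half_width


# Illustrative numbers only: 900 of 1,000 called records were confirmed.
estimate, low, high = accuracy_interval(900, 1000)
print(f"accuracy ~ {estimate:.1%} (95% interval {low:.1%} to {high:.1%})")
```

A larger sample narrows the interval, which is why the size of the call-down sample matters as much as the headline percentage.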
Of course, there is certain data that cannot be measured with call downs. Take for example an email/name link. In these situations we will use a variety of other mechanisms to ensure we are producing accurate data. Examples include using truth sets or double or triple corroboration from other sources.
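As an illustration of double or triple corroboration, the sketch below keeps an email/name link only when enough independent sources report it. The source labels, the record layout, and the simple lowercase normalization are all assumptions made for the example, not a description of our entity resolution algorithms.

```python
from collections import defaultdict


def corroborated_links(observations, min_sources=2):
    """Return the (email, name) links reported by at least `min_sources` sources.

    observations: iterable of (source, email, name) tuples (hypothetical layout).
    """
    sources_by_link = defaultdict(set)
    for source, email, name in observations:
        # Normalize lightly so trivial formatting differences still match.
        key = (email.strip().lower(), name.strip().lower())
        sources_by_link[key].add(source)

    return {
        link
        for link, sources in sources_by_link.items()
        if len(sources) >= min_sources
    }


# Made-up observations from three hypothetical sources.
observations = [
    ("source_a", "jsmith@example.com", "John Smith"),
    ("source_b", "JSmith@example.com", "john smith"),
    ("source_c", "jsmith@example.com", "Jane Smith"),
]
print(corroborated_links(observations))                  # double corroboration
print(corroborated_links(observations, min_sources=3))   # triple corroboration
```

Raising `min_sources` from 2 to 3 trades some coverage for more confidence in each link that remains.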
6. How accurate is your data?
This is a question that comes up once in a while, and it has a fairly nuanced answer. One of the key factors we consider when building the Identity Graph is how to balance the coverage and the accuracy of our data. For example, we could limit our data to only the records we know to be 100% accurate, but the coverage would drop significantly and we would lose a lot of great data that is still accurate and useful. On the flip side, if we wanted to append a name to every phone number in the US we could do that, but the accuracy of that data would plummet.
At the end of the day the key charter of our Data Services team is to strike the right balance of coverage and accuracy. We do this by understanding our customer use cases and what data we realistically have available, and by testing various versions of our entity resolution algorithm. The end result is that our accuracy will actually vary over time based on what needs our customers have and what data we are using.
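The coverage and accuracy tradeoff can be sketched with a simple confidence threshold: publish only the links whose score clears the threshold, then measure how much data remains and how much of it is correct. The scores, the labels, and the function name below are invented for illustration; this is one simple way to picture the tradeoff, not our production scoring.

```python
def coverage_and_accuracy(scored_links, threshold):
    """Evaluate one publishing threshold against a labeled truth set.

    scored_links: list of (confidence_score, is_correct) pairs (hypothetical).
    threshold:    minimum score a link needs in order to be published.

    Returns (coverage, accuracy). Accuracy is None when nothing is published.
    """
    if not scored_links:
        return 0.0, None

    published = [is_correct for score, is_correct in scored_links if score >= threshold]
    coverage = len(published) / len(scored_links)
    accuracy = sum(published) / len(published) if published else None
    return coverage, accuracy


# Made-up truth set: six links with a confidence score and a verified label.
truth_set = [
    (0.95, True), (0.90, True), (0.70, True),
    (0.60, False), (0.40, True), (0.20, False),
]

for threshold in (0.0, 0.5, 0.8):
    coverage, accuracy = coverage_and_accuracy(truth_set, threshold)
    print(f"threshold={threshold:.1f} coverage={coverage:.0%} accuracy={accuracy:.0%}")
```

In this toy data, raising the threshold pushes accuracy up while coverage falls, which is the balance described above.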
7. Why is your domestic data different from your international data?
There are a number of reasons for this:
- In the United States we are fortunate to have a very mature public data industry in which there are many different types of companies that have built businesses on data monetization. This provides companies like Whitepages access to vast amounts of data to use for identity verification and fraud prevention. Outside of the US, this industry is at mixed stages of maturation on a country by country basis. Some countries, like the UK, are pretty close in terms of the necessary infrastructure and data companies to facilitate this type of market, while others, like Mexico, are still up and coming.
- Data privacy restrictions vary significantly by country, and that can have a huge impact on our ability to source data in these countries. It is fairly well known that the US has pretty relaxed data privacy rules, while on the other side of the Atlantic, European countries are traditionally much more closed off with their data. As a result there is far more data available in the US than in most EU countries.
- While we have been hard at work over the past few years on increasing our international data coverage, we started on the US data over 20 years ago. Just by virtue of our roots dating back to the late 90s, we have built up a massive database of US people and businesses that few other companies can compete with. That said, we’re making big strides on a daily basis with our international data and are always looking at different ways to improve our coverage and accuracy abroad.
8. Why is your data better than the “next guy’s”?
There are three primary reasons why the Whitepages Pro data is the best in the market.
- We have a rigorous data sourcing process that involves significant due diligence on where our partners source their data, their viability as a long-term partner for Whitepages, and of course the accuracy/coverage of the data. I go into our process in more depth in a prior blog post.
- We invest a significant amount of money and human resources into data science and engineering. By far our largest engineering team is our Data Services team, which is responsible for onboarding new data sources, developing entity resolution algorithms, and building the technology that powers our Identity Graph. With multiple PhDs and high-level data architects, we have a high-powered team that works tirelessly every day to improve our Identity Graph and provide our customers with the best data.
- Lastly, we have a very strong product management team that is obsessed with solving customers’ problems by leveraging the Identity Graph, sophisticated data science, and machine learning to build world-class identity verification products. This team frequently meets with our customers to clearly understand their challenges with identity verification, fraud prevention, risk modeling, etc. This constant feedback directly influences what data we’re sourcing, how best to ingest it, and what insights we deliver for our customers.
9. How secure is it to use your data?
While we cannot publish the specifics of how our platform runs internally, we can say that it is of the utmost importance that our data and the data of our customers are kept safe. We have a team of engineers dedicated to our data security, and they have built the Whitepages Pro API to run over a secure, TLS-encrypted interface that ensures our customer queries and responses are secured end to end.
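For readers integrating from their own systems, the following is a minimal client-side sketch of what querying an API over TLS can look like using only the Python standard library. The URL is a placeholder and the helper names are hypothetical; nothing here describes the real Whitepages Pro endpoints or how our platform is configured.

```python
import ssl
import urllib.request


def build_tls_context():
    """Create a client TLS context that verifies the server certificate."""
    context = ssl.create_default_context()  # certificate and hostname checks on
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumption: refuse older TLS
    return context


def fetch_over_tls(url, timeout=5.0):
    """Fetch `url`, refusing anything that is not HTTPS."""
    if not url.lower().startswith("https://"):
        raise ValueError("refusing to send a query over an unencrypted connection")
    with urllib.request.urlopen(url, timeout=timeout, context=build_tls_context()) as resp:
        return resp.read()


# Placeholder URL for illustration only; substitute the endpoint from your
# own API documentation before calling fetch_over_tls().
EXAMPLE_URL = "https://api.example.com/lookup?phone=2065551234"
```

Rejecting plain `http://` URLs up front keeps a misconfigured client from ever sending a query in the clear.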
10. Do you capture or store my data?
One of the areas in which we put a lot of focus is the privacy of our customers’ data, and a big part of this is how we address capturing and storing this data. In short, yes, we capture and store this data, but we take the privacy of the data very seriously, as addressed in the prior question. We have the ability to not store data if absolutely necessary as part of an InfoSec requirement, but by default this data is stored and used for only a couple of critical functions. The first is basic accounting and audit needs. The second is feeding into our fraud and identity validation models to help us better serve our customers through the likes of Identity Check Confidence Score.