With big data, the challenges stem from both the nature and the volume of the data. Whether it’s a data leak or a financial company’s internal records, the amount of data involved is considerable. Journalists working on the Paradise Papers leak dealt with about 1.4 TB of data, and some organizations gather dozens of terabytes every month.
To complicate things, investigations usually start from raw, unstructured data, and it’s impossible to automate or scale an investigation without a predefined data model or some kind of organizational logic. The files obtained by Süddeutsche Zeitung included millions of loan agreements, financial statements, emails, trust deeds and other paperwork dating back nearly 50 years.
The sheer volume of the data and its unstructured form are the first difficulty. Organizations have to turn these large volumes of raw data into machine-readable information that can be organized, stored and analyzed.
“Depending on the source, we had different formats and many of those were not machine-readable,” said Pierre Romera, ICIJ’s Chief Technology Officer.
The second obstacle is related to the way we store data. The success of a fraud investigation depends on finding connections between entities. In many investigations, however, data is kept in silos, which makes it difficult to cross-reference records and highlight connections. For the Paradise Papers, ICIJ’s reporters worked with data from the leak but also from public databases. To make siloed data talk, it’s essential to bring everything together.
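To make the idea of cross-referencing silos concrete, here is a minimal Python sketch. It is only an illustration, not ICIJ’s actual pipeline: the source names (`leak`, `public_registry`), the record layout, the entity names and the `normalize` and `cross_reference` helpers are all hypothetical, and real investigations rely on far more robust entity matching.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace
    so that simple spelling variants of a name compare as equal."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(cleaned.split())


def cross_reference(sources: dict) -> dict:
    """Return the normalized names that appear in more than one source,
    mapped to the sorted list of sources where they were found."""
    index = defaultdict(set)
    for source_name, records in sources.items():
        for record in records:
            index[normalize(record["name"])].add(source_name)
    return {
        name: sorted(found_in)
        for name, found_in in index.items()
        if len(found_in) > 1
    }


# Hypothetical placeholder records standing in for two separate silos.
sources = {
    "leak": [{"name": "Example Holdings Ltd."}, {"name": "Sample Trust"}],
    "public_registry": [{"name": "EXAMPLE HOLDINGS LTD"}, {"name": "Other Co"}],
}

matches = cross_reference(sources)
print(matches)
```

Once every source is indexed in one place, an entity that shows up in both the leak and a public database surfaces immediately, which is exactly the kind of connection that stays hidden while the data sits in separate silos.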
Finally, data-driven investigations raise a problem of accessibility. For an organization like ICIJ, making data exploration accessible to non-tech-savvy reporters is both a challenge and a necessity. Without that, data-driven investigations would be nearly impossible to conduct without an army of data analysts and database specialists.
According to Romera, “one of the key challenges is to make our technology user-friendly for the journalists so that everyone around the world is able to use it.”