How do you make sense of 11.9 million documents, including 10,000-page PDFs and years' worth of emails, written in many different languages? Thanks to several powerful technology solutions, including the Linkurious investigation platform, ICIJ was able to unravel the stories in this massive leak in just over one year, a remarkable feat considering only 4% of the files were structured to begin with.
First, ICIJ had to identify the files containing beneficial ownership information and structure that data. They combined individual spreadsheets into master spreadsheets. For PDFs and other document files, ICIJ used programming languages like Python to automate data extraction and structuring where possible. For more complex cases, ICIJ used machine learning and tools such as Fonduer and scikit-learn to identify and separate certain forms from longer documents.
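To make the spreadsheet-combining step concrete, here is a minimal sketch of merging several tables with inconsistent headers into one master table. It is only an illustration, not ICIJ's actual pipeline: the function name `combine_spreadsheets`, the file labels, and the sample rows are all invented for this example, and it uses only Python's standard library.

```python
import csv
import io


def combine_spreadsheets(sources):
    """Merge several CSV tables into one master table.

    `sources` maps a source label to CSV text. Columns that a given
    source lacks are left blank in the master table.
    """
    rows = []
    columns = ["source"]
    for label, text in sources.items():
        for record in csv.DictReader(io.StringIO(text)):
            # Normalize header names so "Company" and "company" line up.
            clean = {
                key.strip().lower(): (value or "").strip()
                for key, value in record.items()
            }
            clean["source"] = label
            for key in clean:
                if key not in columns:
                    columns.append(key)
            rows.append(clean)
    # Fill gaps so every row has every column.
    master = [{col: row.get(col, "") for col in columns} for row in rows]
    return columns, master


# Hypothetical sample data, invented for illustration only.
sources = {
    "provider_a.csv": "Company,Owner\nAcme Holdings,Jane Doe\n",
    "provider_b.csv": "company,owner,jurisdiction\nBlue Sky Ltd,John Roe,BVI\n",
}
columns, master = combine_spreadsheets(sources)
```

Keeping a `source` column on every row means each record in the master table can still be traced back to the file it came from.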
After the data was filtered and structured, the Linkurious Enterprise investigation platform and the Neo4j graph database helped the journalists search, explore, and visualize this huge quantity of data.
Linkurious Enterprise is built for this kind of work. It lets analysts and investigators explore and visualize data through a network analysis approach, so they can quickly understand the complex direct and indirect connections within huge amounts of data.
A longtime user of Linkurious through the Linkurious for Good program, ICIJ also used the investigation platform for the Panama Papers, Paradise Papers, and FinCEN Files investigations. “Linkurious is very user-friendly. It’s easy for anyone to use, even without a technical background. Yet if you are an advanced user, you can also turn it into something extremely powerful. It’s a very versatile tool,” explains Miguel Fiandor, data specialist at ICIJ.
With Linkurious, the journalists were able to collaboratively establish a precise picture of the tens of thousands of businesses and ultimate beneficial owners (UBOs) implicated in the leak. Through intuitive network visualizations, the journalists could explore and understand the connections between all of these entities across providers. They were also able to bring in external data sets, like sanctions lists, previous leaks, and public records, to help identify the most interesting stories and give extra context to the leaked data.
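As a rough illustration of what cross-referencing leaked names against an external list can involve, the sketch below normalizes names before comparing them. It is an assumption-laden example, not a description of how ICIJ or Linkurious match records: the helper names and every name in the sample data are invented, and exact matching on normalized names will miss many real-world variants that fuzzy matching and manual review would catch.

```python
import unicodedata


def normalize_name(name):
    """Reduce a name to a comparable form.

    Strips accents, ignores case and punctuation, and sorts the tokens
    so that "PEREZ, Jose" and "José Pérez" compare as equal.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.casefold())
    return " ".join(sorted(cleaned.split()))


def flag_matches(owners, watchlist):
    """Return the owners whose normalized name appears on the watchlist."""
    wanted = {normalize_name(entry) for entry in watchlist}
    return [owner for owner in owners if normalize_name(owner) in wanted]


# Hypothetical names, invented for illustration only.
owners = ["Jane Doe", "José Pérez", "John Roe"]
watchlist = ["PEREZ, Jose", "Richard Miles"]
flagged = flag_matches(owners, watchlist)
```

A match found this way is a lead to verify, not a conclusion: two different people can share a name.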
“We have complex data sets and our partners can explore them independently on Linkurious,” says Emilia Diaz-Struck, research editor at ICIJ. “In this investigation, we focused on beneficial owners. If you know who the real person behind a company is, you can understand whether or not they’re potentially connected to any controversial activity. Using graph technology, it’s easy to understand corporate structures and who is behind a given company.”
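The idea of finding the real person behind a company can be sketched as a walk up a chain of ownership links. The example below is a plain-Python illustration with invented entities and a hypothetical `beneficial_owners` helper. In a graph database such as Neo4j, a traversal like this would typically be expressed as a query rather than hand-written code.

```python
from collections import deque

# Hypothetical ownership records, invented for illustration: (owner, owned entity).
OWNERSHIP = [
    ("Jane Doe", "Acme Holdings"),
    ("Acme Holdings", "Blue Sky Ltd"),
    ("John Roe", "Blue Sky Ltd"),
    ("Blue Sky Ltd", "Seaside Villa SA"),
]
COMPANIES = {"Acme Holdings", "Blue Sky Ltd", "Seaside Villa SA"}


def beneficial_owners(company, ownership, companies):
    """Walk up the ownership chain and collect every non-company owner."""
    owners_of = {}
    for owner, owned in ownership:
        owners_of.setdefault(owned, []).append(owner)

    found = []
    seen = {company}
    queue = deque([company])
    while queue:
        current = queue.popleft()
        for owner in owners_of.get(current, []):
            if owner in seen:
                continue  # Guards against circular ownership structures.
            seen.add(owner)
            if owner in companies:
                queue.append(owner)  # Intermediate company: keep climbing.
            else:
                found.append(owner)  # A person at the top of a chain.
    return sorted(found)


result = beneficial_owners("Seaside Villa SA", OWNERSHIP, COMPANIES)
```

Here both people sit behind "Seaside Villa SA", one of them only through two intermediate companies, which is the kind of indirect connection that is hard to spot in a spreadsheet.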
Some of the visualizations created for the investigation are available to the public on ICIJ’s website.
Note: This article was written with information available as of October 6, 2021, and may be updated as new information becomes available.