WikiLeaks data in numbers

Published on August 1, 2019

We have uploaded all the WikiLeaks data to Intelligence X and created a new category. You do not need an account or license to search through the WikiLeaks data using our site.

Try it out here! https://intelx.io/?s=cnn.com&b=leaks.public.wikileaks

Source & Challenges

Most of the raw data is available via https://file.wikileaks.org/file/ as well as torrents. The data comes with several organizational and technical challenges:

  • The files are mostly unstructured and there is no clear index.
  • The number of published raw files does not always match what is published on wikileaks.org.
  • The chaotic way in which the raw data is published likely reflects the chaos within the WikiLeaks organization (see Conway’s Law).
  • The file types vary. Some documents are PDFs or Word files (DOC, DOCX), some are images (JPG), and some are images embedded in PDFs. This makes it tricky to extract reliable metadata (such as the title and creation date) and the data itself (the text).
  • Spam: Sometimes the data contains pure spam. A human is required to go through each folder and decide whether the data may be of interest or not.
  • Some data is compressed (7Z, RAR, ZIP) with many sub-files and folders.
  • Fake news: Some data contains fake information (for example, a fake medical report claiming that Steve Jobs had an HIV+ diagnosis).
  • Pornographic content: Some attachments of emails contain pornographic content.
  • Duplicates: Some files are duplicated.
  • Extreme violence: Some content depicts extreme violence.
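One of the challenges above, duplicated files, lends itself to a short illustration. The following is a minimal sketch of how duplicates could be found by comparing content hashes. It is not the method Intelligence X uses; the function names `sha256_of` and `find_duplicates` are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group all files below root by content hash.

    Only groups with two or more files are returned, so every value
    in the result is a set of byte-identical files.
    """
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates(Path("wikileaks-raw"))` on a local copy (the folder name is a placeholder) would return one entry per group of identical files. Files that differ by a single byte, such as the same document re-saved in another format, are not detected by this approach.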

Counting the Input

  • Count of files: 43,374
  • Count of folders: 4,423
  • Total size: 28.3 GB

Intelligence X Statistics

The Intelligence X statistics list more files than the input because the compressed files (ZIP and others) contain many files that are extracted and stored separately.

  • Count of items: 5,664,971
  • Count of unique selectors: 368,818
  • Count of total extracted selectors: 41,213,169
  • Size of data files (total): 471 GB

The above statistics mean that there are 368,818 distinct search terms (selectors, such as domain names and email addresses) that can be searched across 5,664,971 items.
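The difference between "unique selectors" and "total extracted selectors" can be illustrated with a small sketch. It assumes that a selector counts once toward the total for every occurrence and once toward the unique count per distinct value. The regular expressions are deliberately simplified and are not the extraction rules Intelligence X uses; `extract_selectors` and `selector_stats` are hypothetical names.

```python
import re
from collections import Counter

# Simplified patterns for illustration only; real-world extraction
# needs far more careful validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_selectors(text: str) -> list[tuple[str, str]]:
    """Return (type, value) pairs for every selector occurrence in text."""
    selectors = [("email", match.lower()) for match in EMAIL_RE.findall(text)]
    selectors += [("url", match) for match in URL_RE.findall(text)]
    return selectors


def selector_stats(documents: list[str]) -> tuple[int, int]:
    """Return (total extracted selectors, unique selectors)."""
    counts: Counter = Counter()
    for text in documents:
        counts.update(extract_selectors(text))
    return sum(counts.values()), len(counts)
```

For two documents that both mention the same email address, the address adds two to the total but only one to the unique count, which is why the total can be much larger than the number of unique selectors.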

Of the 368,818 unique selectors, the largest share, not surprisingly, is email addresses at 46%. Credit cards come next with 19%, followed by URLs with 15%.
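The percentages and totals in this post can be turned into rough absolute numbers. The sketch below only does arithmetic on the published figures. Because the percentages are rounded, the resulting counts are approximations, and the names used here are illustrative.

```python
# Figures as published in this post.
UNIQUE_SELECTORS = 368_818
TOTAL_SELECTORS = 41_213_169
ITEMS = 5_664_971
INPUT_GB = 28.3
STORED_GB = 471

# Rounded shares of the unique selectors, as stated in the text.
SHARES = {"email addresses": 0.46, "credit cards": 0.19, "URLs": 0.15}


def approximate_counts(total: int, shares: dict[str, float]) -> dict[str, int]:
    """Convert rounded percentage shares into approximate absolute counts."""
    return {name: round(total * share) for name, share in shares.items()}


def remaining_share(shares: dict[str, float]) -> float:
    """Share of selectors not covered by the listed types."""
    return 1.0 - sum(shares.values())


selectors_per_item = TOTAL_SELECTORS / ITEMS   # average extracted selectors per item
size_expansion = STORED_GB / INPUT_GB          # stored size relative to raw input
```

With these inputs, email addresses come to roughly 170,000 unique selectors, about 20% of the selectors fall into types not named in the text, each item yields a little over 7 extracted selectors on average, and the stored data is roughly 16 to 17 times the size of the raw input.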

[Figure: Selector breakdown for the WikiLeaks data]
[Figure: Same data, different visualization]

Cryptome

Update 8/9/2019: We uploaded the Cryptome data into the WikiLeaks bucket.

  • Count of items: 93,234
  • Count of unique selectors: 333,122
  • Count of total extracted selectors: 539,908
  • Size of data files (total): 39 GB
