Open Government 🏛

Published on November 20, 2020

Open government is the governing doctrine which holds that citizens have the right to access the documents and proceedings of the government to allow for effective public oversight.

Wikipedia

After the recent firing of Christopher Krebs, the first director of the Cybersecurity and Infrastructure Security Agency (CISA), we decided to leap into action and preserve public data from US government websites.

New Category

We created a new category, “Government US”, which includes all Public Web data from these TLDs:

  • .GOV
  • .MIL

As a result, you can now access historical versions of the CISA website via https://intelx.io/?did=3dab2dd6-724f-4c66-916f-62c586ab7037. New copies are made every few days, so the archive will catch any changes and preserve content that might be deleted or altered by the current or a future administration. This data is available for free (you don’t even need an account).

Why historical websites look plain: For security reasons, we remove any JavaScript, images, and external references, including the CSS files that contain the style sheet information. As a result, you only see the bare HTML content without backgrounds, colors, and images.
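
To illustrate the kind of stripping described above, here is a minimal sketch using only Python's standard library. It is not our actual pipeline: the class name `PlainArchiveSanitizer`, the function `sanitize`, and the exact set of removed tags and attributes are assumptions chosen for the example.

```python
from html import escape
from html.parser import HTMLParser

# Assumption for this sketch: which elements count as "unsafe" or "external".
DROPPED_WITH_CONTENT = {"script"}   # remove the element and everything inside it
DROPPED_TAGS = {"img", "link"}      # remove the tag itself (images, CSS links)


class PlainArchiveSanitizer(HTMLParser):
    """Hypothetical sanitizer that keeps only bare HTML structure and text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in DROPPED_TAGS:
            return
        # Drop inline event handlers such as onload= or onclick=.
        kept = [(k, v) for k, v in attrs if not k.startswith("on")]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit the start tag only.
        if tag in DROPPED_WITH_CONTENT:
            return
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in DROPPED_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in DROPPED_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped (convert_charrefs=True), so escape it again.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text: str) -> str:
    """Return html_text without scripts, images, CSS links, or on* attributes."""
    parser = PlainArchiveSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling `sanitize()` on a page leaves the headings, paragraphs, and links in place but drops everything that would load or execute external content. Comments and the doctype are also dropped, because the parser's default handlers ignore them. A production sanitizer would need to cover more cases (for example `javascript:` links and inline styles), which this sketch leaves out.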

To search for historical versions of a particular US government domain, select the “Government US” category in the Advanced menu.

You will then see the website with all crawled URLs visualized as a tree.
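
For readers curious how a flat list of crawled URLs can be turned into such a tree, the following is a small sketch under our own assumptions, not the code behind the interface. The function names `build_url_tree` and `render` are hypothetical, and any URLs passed in are up to the caller.

```python
from urllib.parse import urlsplit


def build_url_tree(urls):
    """Group URLs into nested dicts: host first, then one level per path segment."""
    tree = {}
    for url in urls:
        parts = urlsplit(url)
        node = tree.setdefault(parts.netloc, {})
        # Skip empty segments produced by leading, trailing, or double slashes.
        for segment in filter(None, parts.path.split("/")):
            node = node.setdefault(segment, {})
    return tree


def render(tree, indent=0):
    """Return the tree as a list of indented lines, sorted by name at each level."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render(tree[name], indent + 1))
    return lines
```

Joining the result of `render(build_url_tree(urls))` with newlines gives a plain-text outline in which each path segment is indented below its parent. Query strings and fragments are ignored in this sketch, so URLs that differ only in those parts collapse into the same node.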

At Intelligence X, transparency is paramount. Our users have full access to our dataset, and we are transparent about where the data comes from. If you click on a search result, a “Metadata” tab shows you all the details.

Data Size

At the time of writing, the crawlers have been running for less than 24 hours, yet the dataset is already growing quickly:

  • 160 GB of data
  • 10+ million selectors
  • 29,791 active .GOV domains
  • 13,208 active .MIL domains
