At Intelligence X, we value quality over quantity. Our goal is continuous improvement, sustainability, and stability. As we cross the mark of 25 billion records with 100+ TB of storage, it is time to set sail for 100 billion records.
A single record is an extracted selector (search term) like “test.com”. A search term might appear in one or multiple documents. Intelligence X is not just a search engine; it is also a data archive that stores copies of the actual search result (i.e. the original document) in its cache.
Therefore, each record has one or multiple documents linked. Any change of the dataset, documents, the indexing algorithm, or meta-data is not a simple operation. We have developed our own proprietary storing, indexing, and searching algorithms that allow us to scale infinitely into billions of records and Petabytes of data while maintaining a fast search & retrieval speed (within ms).
There are multiple constraints that limit the amount of records and content that can be indexed:
IOPS is the limiting factor when it comes to indexing the speed of new websites within a single bucket. The amount of single input/output operations (per disk) essentially limit how fast new websites (HTML copies) can be stored and processed for indexing.
Spam is a big reason why we don’t just flip the switch and crawl the whole public web as of right now. It is both a challenge on the darknet and the public web. Operators may create endless sub-pages, sub-domains, and register many spam domains. We previously published the blog post, “Combating spam websites from Tor”. A recent example of spam related to Tor hidden services:
There is 0 value in spam content for Intelligence X and our customers so we blacklist and delete any content that we deem as spam. Here is the overview of active vs. inactive vs. spam onion domains; it reveals that there are more spam domains than active ones, highlighting the need and importance for spam detection algorithms:
Another example on the public web is spam detected in our “Web: Germany” bucket: The domain “anki-overdrive-kaufen24.de” had 3,362,715 subdomains in our index – a typical example of endless auto-generated subdomains with a catch-all DNS record.
While developing spam detection algorithms may be a relatively easy task, operating such an algorithm on millions (or billions) of records makes it a delicate operation whenever the algorithm is changed.
Our public web crawling started in October 2019 and is currently limited to these top-level domains per bucket:
As we are gradually expanding our crawling operations, new TLDs to be indexed will be announced soon.
According to DomainTools, these are the TLDs with the most domains:
We are always looking to add new data sources and expanding the ones we already have. We are reviewing these new sources (in no particular order):
The cool thing about adding new data sources is that any newly indexed data is available in real-time through the entire Intelligence X ecosystem: The main search intelx.io, the API, our phonebook.cz, and of course all integrations by 3rd party tools.
June 2021: New Usenet data category We added the new data category Usenet. It contains historical and current data from Usenet, which is “a worldwide distributed discussion system”. Today, Usenet is mostly used for piracy. This new category stores currently 209,469,453 selectors and is expected to grow substantially. Improved inline statistics We have improved the
Intelligence X supports Peernet – Founder’s Statement I am excited to announce Peernet, a decentralized network that allows sharing of data freely without censorship and restrictions. Here is the pitch deck: https://peernet.org/dl/Peernet%20Deck.pdf Peernet is making quick progress from its inception as I am finalizing the whitepaper and developing the core library. I would like to
February 2021: Launch of the European Internet Archive The European Internet Archive just launched! 🎉🥳 ➡ https://archive.eu/ 225 TLDs added to the list of web crawling We have added 225 top-level domains (TLDs) to the list of web crawling. Find the full list and how we are categorizing them in this blog post. Our dataset