When you use one of the popular search engines, such as Google,
Bing or Yahoo, the search results reflect only a very thin layer of the information available on the Internet. It is like
fishing the shallow surface layer of the ocean. There is an enormous volume of information available in the "deep waters"
of the Internet. There are several reasons for this:
1. Search engines usually cannot index web pages that are dynamically created from a
database depending on specific user requests. What is in the database is invisible to a search engine crawler.
2. Search engines typically scan and index only a few levels of hierarchically organized
websites. What is hidden away deep down in a large website is not necessarily indexed.
3. Search engines cannot keep up with content that is constantly updated. Their crawlers typically
visit web pages only from time to time, so their index is always outdated to some extent.
4. Search engines usually do not index web pages for which the web developer has set up
certain crawler restrictions. Websites of government agencies, the military, intelligence organizations, certain large
corporations, and research centers rigorously control what the crawlers of search engines can "see" and index.
5. Finally, there are also special technologies available that can make websites and certain
activities on the Internet "invisible" or at least extremely hard to trace (Tor, IP-hiding, etc.).
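Reasons 2 and 4 can be illustrated with a small sketch. The Python snippet below is not how any real search engine works; it is a toy crawler, written under stated assumptions, that walks a made-up in-memory website. The site map `SITE`, the rules in `ROBOTS_TXT`, the crawler name `ExampleBot` and the depth limit of 2 are all hypothetical values chosen for the illustration. Only `urllib.robotparser` from the Python standard library is used to interpret the crawler restrictions.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical website: each path maps to the paths it links to.
SITE = {
    "/": ["/about", "/catalog", "/internal/"],
    "/about": [],
    "/catalog": ["/catalog/2020"],
    "/catalog/2020": ["/catalog/2020/report"],
    "/catalog/2020/report": ["/catalog/2020/report/data"],
    "/catalog/2020/report/data": [],
    "/internal/": ["/internal/staff"],
    "/internal/staff": [],
}

# Hypothetical crawler restrictions set up by the site owner (reason 4).
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/
"""


def crawl(site, robots_txt, max_depth, agent="ExampleBot"):
    """Breadth-first crawl from "/" that obeys robots rules and a depth limit."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    indexed = []
    seen = {"/"}
    queue = deque([("/", 0)])
    while queue:
        path, depth = queue.popleft()
        # Reason 4: skip anything the site owner has disallowed.
        if not rules.can_fetch(agent, path):
            continue
        indexed.append(path)
        # Reason 2: do not follow links below the depth limit.
        if depth == max_depth:
            continue
        for link in site.get(path, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return indexed


indexed = crawl(SITE, ROBOTS_TXT, max_depth=2)
missed = sorted(set(SITE) - set(indexed))
print("indexed:", indexed)
print("missed: ", missed)
```

In this toy example the crawler indexes only the four pages closest to the start page. The two pages deeper in the catalog and the two pages under `/internal/` stay out of the index, even though they exist and are linked.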
One group of websites in the Deep Web are the on-line shopping, real-estate and auction
sites, such as eBay, REALTOR, or Amazon, which have huge amounts of content (such as product information, ads, reviews
etc.) stored in databases. Of course, one can find these websites in search
engines, but not necessarily the specific content they publish from their database. Fortunately, these commercial sites are not relevant in our context.
Another group of websites and web activities that usually cannot be found with the common
search engines are those that intentionally obscure or hide their existence. This "Dark Web" consists of the sites and
services of criminals, terrorist groups or Internet users who utilize certain technologies to hide or obscure their
identity and activities.
Secret activities of intelligence agencies on the Internet, military communications and the
protected systems used by law enforcement are also part of the "Dark Web" and cannot be found
through a simple Google search.
However, there are also websites of libraries, registers, government agencies,
research networks or international organizations that have large amounts of information in specialized databases. Deep down
in these public websites there are vast amounts of textual, numerical, visual or
acoustic data - hidden away in databases or deeply layered link collections. The information in these on-line databases
can be most interesting and relevant to all kinds of research activities. Unfortunately, the data are hard to find,
because they can only be accessed through their respective websites, which are often rather obscure, badly organized, or
excessively hard to use. Some governments, international organizations and even private companies have therefore tried to
consolidate their various databases and make them more easily accessible through "open data" platforms. For instance, the
World Bank is accumulating all kinds of data from United Nations Agencies to provide a unified open data platform. But
there are numerous such initiatives from governments, organizations and private businesses, so that in the end a data
analyst has to know and visit hundreds of such sites for a complete picture of what data are available.
Below is a preliminary list of some of these websites that
contain particularly large amounts of data and other kinds of information, which cannot be found through the usual search