How Do Search Engines Find Websites? How to Google Like a Pro – Google Dorks

Search engines serve as massive indexes, enabling us to access information scattered across the internet. Behind the scenes, they perform extensive work through automated programs, often referred to as “spiders” or “crawlers.” These crawlers search for web content and map it, allowing us to find relevant information effortlessly.

How Crawlers Work

Crawlers employ different methods to discover content on the internet. One approach involves visiting a URL and extracting information from the page to return to the search engine. Another method resembles the way a virus spreads: the crawler follows any URLs found on previously visited websites, propagating outward to explore new ones.

When a search engine encounters a new website, like jcawl.com, it indexes the entire domain and stores keywords in a dictionary. For instance, this blog post may have keywords like “crawlers” and “google dork”. These stored keywords help match relevant content to user queries, ensuring the blog post appears in search results.
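As a rough illustration of that idea, the keyword-to-page mapping can be thought of as a dictionary. The sketch below (simplified Python with a made-up URL, nothing like a real search engine’s implementation) builds a tiny inverted index and looks a keyword up:

# A toy inverted index: maps each keyword to the set of URLs it appears on.
index = {}

def index_page(url, text):
    # Record every word on the page under the page's URL.
    for word in text.lower().split():
        index.setdefault(word, set()).add(url)

def search(keyword):
    # Return every page that mentioned the keyword.
    return index.get(keyword.lower(), set())

index_page("https://jcawl.com/google-dorks", "how crawlers and google dorks work")
print(search("crawlers"))  # {'https://jcawl.com/google-dorks'}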

Since most blog posts reference external sources, once a crawler finishes indexing a website, it attempts to propagate to the other sites referenced there. This continuous process allows crawlers to index millions of web pages around the clock, contributing to the rapid growth of search engine databases.
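To make this crawl-and-follow behavior concrete, here is a heavily simplified sketch using only Python’s standard library. The seed URL is just an example, and a real crawler would also respect robots.txt, prioritize pages, and index their content:

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    # Collects the href attribute of every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=10):
    queue, visited = [seed], set()
    while queue and len(visited) < limit:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # unreachable or non-HTML page, skip it
        parser = LinkExtractor()
        parser.feed(html)
        # Follow every link found on the current page: the propagation described above.
        queue.extend(urljoin(url, link) for link in parser.links)
    return visited

print(crawl("https://jcawl.com"))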

Robots.txt

While crawlers aim to retrieve as much information as possible for enhanced search results, some websites may contain sensitive files that should not be publicly indexed. A robots.txt file is the first file that a crawler will scan, and it is used primarily to manage crawler traffic to a site. It tells search engine crawlers which URLs the crawler can access but is not a mechanism for keeping a web page out of Google.

User-agent: Specifies which crawler the following rules apply to (or * for all crawlers).

Allow: Specifies the directories or file(s) that the crawler can index.

Disallow: Specifies the directories or file(s) that the crawler cannot index.

Sitemap: Provides a reference to where the sitemap is located (improves SEO).

Wildcard patterns such as * and $ (rather than full regular expressions) can be used to disallow all files with the extension .ini, .conf, or any other sensitive file type. Careful consideration is essential when designing a robots.txt file to avoid inadvertently exposing vital website configuration details.
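For illustration, a minimal robots.txt combining these directives might look like the following (the paths and sitemap URL are hypothetical):

User-agent: *
Allow: /blog/
Disallow: /admin/
Disallow: /*.ini$
Disallow: /*.conf$
Sitemap: https://jcawl.com/sitemap.xml

Keep in mind that every Disallow line also advertises the existence of that path to anyone who reads the file, which is exactly the exposure risk described above.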

Sitemap

Search engines operate continuously, and a well-structured sitemap greatly benefits them. When a website publishes a clear, organized sitemap in XML format, crawlers can read the list of URLs directly instead of having to discover each webpage by following links. Sitemaps provide a roadmap of the content routes within the domain, facilitating effective crawling. While sitemaps themselves are not inherently risky, it is crucial to ensure the proper configuration of the robots.txt file to maintain the confidentiality of sensitive configuration files.
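For reference, a sitemap is typically an XML file that simply lists the URLs a crawler should visit; a minimal example (with made-up URLs and dates) looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://jcawl.com/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://jcawl.com/blog/google-dorks</loc>
  </url>
</urlset>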

Google Dorking

While Google is a valuable search engine, the sheer volume of content it indexes means that many searches surface clickbait websites and irrelevant information. There is no need to complicate a search for puppy videos, but delving into niche topics requires a more precise approach. By employing advanced search techniques, we can refine our searches and minimize the need to sift through ad-filled and unproductive websites.

Here are some dorks that will refine the content that appears during a search:

site:jcawl.com google dorks - Search within a particular website.
filetype:pdf tacos - Search for a particular file type.
ext:pdf tacos - Search for a particular file extension.
@youtube tacos - Search a specific social media platform.
define:tacos - Provide a definition.
source:thehackernews - Find news from a particular source.
cache:exploit-db.com - Show the cached version of a website.

Find documents that follow patterns or contain an exact text string:

intitle:tacos - Search for results with "tacos" in the title.
intext:"how to use google dorks" - Find websites containing the exact string.
inurl:googledorks - Find links containing the string in the URL.

“Dorking” refers to the technique of leveraging Google’s search engine capabilities to identify vulnerable web applications and servers. The Google Hacking Database (GHDB) is a comprehensive collection of Google dorking commands. It encompasses a vast array of specialized search queries that can be used to uncover sensitive information, vulnerabilities, and hidden content on web applications and servers. When using Google dorks, it’s crucial to remember that Google can identify your actions, so use these techniques responsibly. Yes, the information is public, but exploiting sensitive information found through Google dorking is illegal and can lead to being labeled a cybercriminal.

I won’t get into all of the use cases, but a simple one would be finding Excel files that contain email lists. To do this, we can specify the file type and a URL keyword with this search:

filetype:xls inurl:"email"

Browsing through the GHDB can provide valuable insights into various payloads and the specific results they produce. If you find this topic intriguing, I recommend exploring the database and examining the different dorks available. You’ll likely discover a lot of useful information and techniques to enhance your understanding in this area.

Conclusion

While Google is undoubtedly a powerful search engine for general use, it also serves as an invaluable tool for cyber professionals. Its capabilities extend beyond conventional searches, enabling experts to employ advanced techniques like Google dorking to identify vulnerabilities, hidden files, and potential security risks.