The robots.txt file is a really important file for any site or blog because it controls what the search engine spiders do. With robots.txt you decide what the spiders can see and index, and what is blocked for them and can’t be indexed by any search engine.
Jeff Star has written a pretty good robots.txt tutorial showing which files should be blocked from the search engines and which should be shown. One of the folders that must be blocked is wp-admin. You don’t want Google showing your admin page in its search results, for one reason: security. If your admin page shows up in the search engines, it becomes easier for hackers to find it and mess with it.
Be sure to check out Jeff’s tutorial, but for now stay with us, because we are going to show you what kind of robots.txt huge sites and blogs use.
WP.com has a pretty good robots.txt. It’s not that simple, but it’s pretty effective.
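To see what a rule like that does in practice, here is a minimal sketch using Python’s standard urllib.robotparser module. The rules and the example.com URLs are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration: block the WordPress admin folder.
RULES = """\
User-agent: *
Disallow: /wp-admin/
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may crawl url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain, not a real WordPress site.
admin_ok = is_allowed(RULES, "Googlebot", "http://example.com/wp-admin/options.php")
post_ok = is_allowed(RULES, "Googlebot", "http://example.com/my-first-post/")
print(admin_ok, post_ok)  # the admin URL is blocked, the post is not
```

Keep in mind that this only tells you what a well-behaved crawler would do. Nothing in robots.txt stops a person or a badly behaved bot from requesting the URL anyway.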
# This file was generated on Sat, 28 Dec 2013 03:21:37 +0000
# If you are regularly crawling WordPress.com sites, please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.

Sitemap: http://en.wordpress.com/sitemap/
Sitemap: http://wordpress.com/news-sitemap.xml

User-agent: IRLbot
Crawl-delay: 3600

User-agent: *
Disallow: /next/

User-agent: *
Disallow: /mshots/v1/

# har har
User-agent: *
Disallow: /activate/

User-agent: *
Disallow: /wp-login.php

User-agent: *
Disallow: /signup/

User-agent: *
Disallow: /related-tags.php

User-agent: *
Disallow: /public-api/

# MT refugees
User-agent: *
Disallow: /cgi-bin/

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
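The Crawl-delay line for IRLbot asks that one crawler to wait 3600 seconds between requests. Here is a rough sketch of how a crawler written in Python could read that value with urllib.robotparser (crawl_delay needs Python 3.6 or newer). The rules are a shortened excerpt of the file above, and "SomeOtherBot" is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the WordPress.com rules (only two groups kept for brevity).
RULES = """\
User-agent: IRLbot
Crawl-delay: 3600

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# crawl_delay() returns the delay in seconds, or None when no delay applies.
irlbot_delay = parser.crawl_delay("IRLbot")
other_delay = parser.crawl_delay("SomeOtherBot")  # hypothetical crawler name
print(irlbot_delay, other_delay)
```

Only IRLbot gets a delay here. Every other crawler falls into the "User-agent: *" group, which sets none.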
John Chow actually has the best-looking robots.txt, because the file follows all the guidelines suggested by Google. Check it out.
sitemap: http://www.johnchow.com/sitemap.xml

User-agent: *
Disallow: /cgi-bin/
Disallow: /go/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /author/
Disallow: /page/
Disallow: /category/
Disallow: /wp-images/
Disallow: /images/
Disallow: /backup/
Disallow: /banners/
Disallow: /archives/
Disallow: /trackback/
Disallow: /feed/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Mediapartners-Google
Allow: /

User-agent: duggmirror
Disallow: /
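What makes this file interesting is that different crawlers get different answers for the same URL. Here is a small sketch, again with Python’s urllib.robotparser and a shortened excerpt of the rules above. "SomeOtherBot" is a made-up name standing in for any crawler without its own group, and the result reflects how Python’s parser reads the file, which may not match every search engine exactly.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the rules above (most Disallow lines left out).
RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /feed/

User-agent: Mediapartners-Google
Allow: /

User-agent: duggmirror
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

URL = "http://www.johnchow.com/feed/"

# Ask the same question on behalf of three different crawlers.
results = {
    agent: parser.can_fetch(agent, URL)
    for agent in ("SomeOtherBot", "Mediapartners-Google", "duggmirror")
}
print(results)
```

A crawler that has its own group follows only that group. That is why Mediapartners-Google can fetch the feed even though the "User-agent: *" group disallows it.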
User-agent: *
Disallow:
And some just block everything with a “Disallow: /” rule (an empty “Disallow:” blocks nothing). Seriously, this is a really bad idea, because your site is not visible to Google, which means you won’t get search traffic, and that will eventually cause the death of the site.
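The difference between allowing everything and blocking everything comes down to a single slash, so it is worth checking before you upload the file. A minimal sketch with Python’s urllib.robotparser, where "AnyBot" and the example.com URL are placeholders:

```python
from urllib.robotparser import RobotFileParser


def can_crawl(rules: str, url: str) -> bool:
    """Parse robots.txt text and ask whether a generic crawler may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("AnyBot", url)  # "AnyBot" is a hypothetical crawler name


ALLOW_ALL = "User-agent: *\nDisallow:"    # empty value: nothing is blocked
BLOCK_ALL = "User-agent: *\nDisallow: /"  # a single slash: everything is blocked

URL = "http://example.com/any/page.html"  # placeholder URL
allow_all_result = can_crawl(ALLOW_ALL, URL)
block_all_result = can_crawl(BLOCK_ALL, URL)
print(allow_all_result, block_all_result)
```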
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Smashing Magazine blocks all crawlers from indexing its RSS feeds, and shuts out a handful of crawlers completely.
Sitemap: http://www.smashingmagazine.com/post-sitemap.xml
Sitemap: http://www.smashingmagazine.com/page-sitemap.xml
Sitemap: http://www.smashingmagazine.com/category-sitemap.xml
Sitemap: http://www.smashingmagazine.com/post_tag-sitemap.xml

User-agent: *
Disallow: /wp-rss.php
Disallow: /wp-rss2.php

User-agent: MSIECrawler
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Fasterfox
Disallow: /

User-agent: Xenu
Disallow: /

User-agent: SiteSucker
Disallow: /
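The Sitemap lines are not tied to any user agent, so a crawler can read them no matter which group applies to it. Here is a sketch of pulling them out with Python’s urllib.robotparser (site_maps needs Python 3.8 or newer). The rules are a shortened excerpt of the file above and "AnyBot" is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the Smashing Magazine rules above.
RULES = """\
Sitemap: http://www.smashingmagazine.com/post-sitemap.xml
Sitemap: http://www.smashingmagazine.com/page-sitemap.xml

User-agent: *
Disallow: /wp-rss.php
Disallow: /wp-rss2.php
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns None when the file lists no sitemaps, hence the fallback.
sitemaps = parser.site_maps() or []

# "AnyBot" is a hypothetical crawler name.
feed_allowed = parser.can_fetch("AnyBot", "http://www.smashingmagazine.com/wp-rss2.php")
home_allowed = parser.can_fetch("AnyBot", "http://www.smashingmagazine.com/")

print(sitemaps)
print(feed_allowed, home_allowed)
```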
Mashable basically blocks search engines from indexing its theme and plugin folders. It also blocks a number of ad-related paths and a lot of HTML files instead of just deleting them.
User-agent: *
Disallow: /adcentric
Disallow: /adinterax
Disallow: /atlas
Disallow: /doubleclick
Disallow: /eyereturn
Disallow: /eyewonder
Disallow: /klipmart
Disallow: /pointroll
Disallow: /smartadserver
Disallow: /unicast
Disallow: /viewpoint
Disallow: /addineyeV2.html
Disallow: /canvas.html
Disallow: /DARTIframe.html
Disallow: /interim.html
Disallow: /oggiPlayerLoader.htm
Disallow: /videoeggbackup.html
Disallow: /facebook_xd_receiver.html
Disallow: /readme.html
Disallow: /rpc_relay.html
Disallow: /twitterlists/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
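One detail worth knowing about rules like these: a Disallow value is matched as a prefix, so "Disallow: /adcentric" without a trailing slash covers every path that starts with /adcentric. A quick sketch with Python’s urllib.robotparser, using a shortened excerpt of the rules above. The test paths and the "AnyBot" crawler name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the Mashable rules above.
RULES = """\
User-agent: *
Disallow: /adcentric
Disallow: /readme.html
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
"""

# Hypothetical paths chosen to exercise the rules; not real Mashable URLs.
PATHS = [
    "/adcentric",
    "/adcentric-banner.js",
    "/readme.html",
    "/wp-content/plugins/some-plugin/script.js",
    "/wp-content/uploads/photo.jpg",
]

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Collect every path a generic crawler ("AnyBot", hypothetical) may not fetch.
blocked = [
    path for path in PATHS
    if not parser.can_fetch("AnyBot", "http://mashable.com" + path)
]
print(blocked)
```

Only the uploads path gets through, because no rule in the excerpt starts with /wp-content/uploads/.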