Posted June 2nd, 2010 by sharkdancer
On the web, you will find many types of websites where you can get the information you need. You can simply use a search engine, and it will return a list of web pages that may match your needs. Anyone can also have their own site and publish whatever content they want to share with other people. However, a number of people create web pages that they do not intend others to find through search. For this, they use the robots.txt file.
You may have pages that you do not want others to view because they are not yet finished, or because they contain information that is irrelevant to most people. In that case, you can use the file to keep search engine crawlers away from those pages.
A search engine crawler generally follows the robots exclusion protocol, that is, the robots.txt file, if one is present on the server. The main use of this file is to determine which pages or sections of a website may be accessed by the search engine and which may not. This keeps Web robots from crawling pages whose content is not intended for other viewers. However, the file only prevents crawlers from accessing the pages; it does not by itself keep the site from being indexed.
Some people run into problems when their sites are not listed in the search engines. When this is the case, incorrect use of the robots.txt file is often to blame. A misconfigured file can be the reason why certain sites are not listed in the search engines or why crawlers cannot access the whole site. Once you fix the problem with the file, the site can be indexed and may soon see better traffic.
When a person does not use the file correctly or writes the rules the wrong way, the result will be the opposite of what was intended. Thus, you should know what the file is for and how to make it work according to your needs. People who want their entire website to be found by others do not need to block anything with the robots.txt file. However, for those who want certain unfinished or confidential pages not to be indexed, proper use of the file will be a big help.
The robots.txt file has its limits, since there is no way to stop other sites from linking to yours. Still, the file lets you keep search engines from accessing some web pages or your whole website; the site remains available, just not through the search engines. Use the file carefully so that the results are what you want.
Posted May 31st, 2010 by sharkdancer
robots.txt is possibly the most misunderstood file that a website can contain.
Many people think that by using a robots.txt file on their website they are protecting pages and folders from thieves and hackers. In fact it is totally the opposite! robots.txt can open up an enormous security hole that hackers and thieves will use to easily find the parts of your website that you don’t want them to reach.
What is robots.txt?
robots.txt is a file that you create and upload to your website’s root directory. Search engine spiders use it to determine which parts of your website they should index and which folders/pages you, the website owner, don’t want listed in search engine indexes.
Why would you not want pages indexed?
There are many reasons why you might not want search engines to index pages on your website, such as private membership pages, exclusive training pages and the like.
If you are an Internet Marketer selling your own ebook or other digital product, you wouldn’t want your thank you pages indexed either!
And this is where the misunderstanding comes to the fore, and robots.txt becomes your foe.
Many online marketers who provide ebooks or other digital products for instant download will list their download thank you pages in the robots.txt file because they obviously don’t want those pages indexed in search engines.
By using robots.txt this way though, you will be opening up your product to anyone who has a slight bit of knowledge about how the file works.
robots.txt is easily readable by any human who opens a browser and types in http://www.yourdomain.com/robots.txt. If you have listed your thank you pages, all they have to do is go to that URL and take your product(s)!
It’s that easy!
And I’m living proof that this works, as this is exactly what happened to me. I had listed my thank you pages in robots.txt and thought that they were safe from hackers and thieves, then one day I was checking my web site stats and BAM, someone had been to every single thank you page, and taken everything.
The moral is, don’t list any URL in robots.txt that you don’t want humans to have free access to. Use robots.txt with great caution and secure your thank you pages using dedicated software.
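To see how little effort this takes, here is a rough Python sketch of what a curious visitor could do with the contents of a robots.txt file. The function name listed_paths and the sample paths are made up for illustration; they are not taken from any real site.

```python
def listed_paths(robots_txt):
    """Return every non-empty path named on a Disallow line."""
    paths = []
    for raw in robots_txt.splitlines():
        # Drop anything after a '#' comment marker, then trim whitespace.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file contents, as anyone could read them in a browser.
sample = """\
User-agent: *
Disallow: /thank-you/ebook-download.html
Disallow: /members/
"""

exposed = listed_paths(sample)
for path in exposed:
    print("Listed for anyone to read:", path)
```

Every path the file asks robots to skip is printed in plain sight, which is exactly why a download page should never be listed there.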
Posted May 30th, 2010 by sharkdancer
Despite the importance of the robots.txt file in getting your website indexed with the major search engines, many webmasters don’t offer one on their site. What is the robots.txt file, you ask? If you don’t know, you are far from alone. The robots.txt file is a simple text file (no HTML) that is placed in your website’s root directory in order to tell the search engines which pages to index and which to skip.
When a search engine sends its webcrawler to your site, one of the first things the webcrawler will do is search the root directory for the robots.txt file. A correctly formatted robots.txt file will consist of one or more records, each providing instructions for a particular search-bot. A record generally consists of two components. The first is the user-agent line, where the name of the search-bot is listed. The second consists of one or more “disallow” lines. These lines tell the webcrawler which files or folders should not be indexed (e.g. a cgi-bin folder).
If you currently have a website and do not have a robots.txt file, you can create one easily. As mentioned earlier, the files are plain text, so just open up Notepad and save the file as robots.txt. Most webmasters can use one record that will apply to all of the search engine crawlers. Once you have opened Notepad, enter the following:
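The record structure described above can be sketched in a few lines of Python. This is only a rough illustration of the idea, not how any particular search engine reads the file; the function name parse_records and the sample rules are made up for the example.

```python
def parse_records(robots_txt):
    """Group a robots.txt text into records of user-agents and disallow lines."""
    records = []
    current = None
    for raw in robots_txt.splitlines():
        if not raw.strip():
            # A blank line ends the current record.
            current = None
            continue
        # Drop comments, then skip lines that are not "field: value".
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Start a new record unless we are still listing user-agents.
            if current is None or current["disallow"]:
                current = {"user_agents": [], "disallow": []}
                records.append(current)
            current["user_agents"].append(value)
        elif field == "disallow" and current is not None:
            current["disallow"].append(value)
    return records


# Hypothetical sample: one record for Googlebot, one for everyone else.
sample = """\
User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: *
Disallow:
"""

records = parse_records(sample)
for record in records:
    print(record["user_agents"], "->", record["disallow"])
```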
User-agent: *
Disallow:
The “*” applies this rule to all bots. In this example, there is nothing listed in the disallow line. This tells the robot to index the entire site. You can also enter a folder path here such as “/private” if there is a folder that shouldn’t be indexed. This can be very useful if you are still testing a portion of your website or if a section is still under construction.
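If you would like to check how a well-behaved robot might read these two variations, Python’s standard urllib.robotparser module can be used as a stand-in. This is a small sketch only; the bot name ExampleBot and the www.example.com URLs are assumptions made for the example, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def build_parser(rules_text):
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


# An empty Disallow line: nothing is off-limits.
open_site = build_parser("User-agent: *\nDisallow:\n")
# A Disallow line naming the /private folder.
private_blocked = build_parser("User-agent: *\nDisallow: /private\n")

page = "http://www.example.com/private/test.html"  # hypothetical URL
open_result = open_site.can_fetch("ExampleBot", page)
blocked_result = private_blocked.can_fetch("ExampleBot", page)
home_result = private_blocked.can_fetch("ExampleBot", "http://www.example.com/index.html")

print("Empty Disallow, private page allowed:", open_result)
print("Disallow: /private, private page allowed:", blocked_result)
print("Disallow: /private, home page allowed:", home_result)
```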
Now that you know what should go into your robots.txt file, here are several common mistakes people make when creating these files. Avoid scattering notes through the file: the standard allows comments that begin with “#”, but any other stray text can confuse the webcrawler. Also, the format should always be the user-agent on the first line, followed by the disallow(s). Do not reverse the order. Another common mistake involves using the incorrect case. If the disallowed folder is /private, make sure your robots.txt file does not list the folder as /Private. It seems like a very minor issue, but it will cause problems if done incorrectly. Finally, the original standard has no Allow command, so you cannot count on telling every webcrawler what to look at, only what not to look at; some major crawlers, such as Googlebot, accept Allow as an extension.
If you are still curious about the robots.txt file, you can find many more complex examples online. Just try one of your favorite websites and look for their robots.txt file. For example, you can go to http://www.cnn.com/robots.txt. If you need help creating a robots.txt file for your site, there are plenty of places online that will create the file for you for free. One example is http://www.seochat.com/seo-tools/robots-generator/. Despite its apparent simplicity, this file can make or break your site’s chances with the search engines. Make sure you have your robots.txt file in place and correctly formatted today.
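The case mistake is easy to demonstrate with Python’s standard urllib.robotparser module. Treat this as a sketch under assumptions: the bot name ExampleBot and the www.example.com URL are invented for the example, and the module only approximates what a real crawler does.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules_text, url, agent="ExampleBot"):
    """Return True if the given rules let the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, url)


# The real folder on the (hypothetical) site is lowercase /private.
page = "http://www.example.com/private/page.html"

correct_case = is_allowed("User-agent: *\nDisallow: /private\n", page)
wrong_case = is_allowed("User-agent: *\nDisallow: /Private\n", page)

print("Allowed with /private listed:", correct_case)
print("Allowed with /Private listed:", wrong_case)
```

With the wrong case, the rule simply never matches the real folder, so the page you meant to hide stays open to robots.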
Posted May 30th, 2010 by sharkdancer
If you are a website owner, you know the reasoning behind that question. No, we are not talking about physical robots in general, but rather the language of robots. Anyone familiar with the famous Google robot, Googlebot, knows how important it can be to understand the language of robots to help protect your website. Not everyone, though, is as savvy in the art of speaking robot.
It can be intimidating for some website owners to think they have to learn to use the language effectively, but there are tools available to help the less robot-savvy communicators. Most of us have probably relied on Googlebot to stay out of sections and parts of our websites that we don’t want invaded. Those who are familiar with the robots.txt language can simply fire off a file to him and he will always deliver what we need. But if you are unsure of your abilities in the art of speaking robot, there is something that can help you.
There is a new Webmaster tool available that acts as a translator for robots.txt files. It helps you build the file to use, and all you have to do is enter the areas you do not want robots to crawl through. You can also make it very specific, blocking only certain types of robots from certain types of files. After you use the generator tool, you can take it for a test drive by using the analysis tool. After you have seen that your test file is ready to go, you can simply save the new file in the root directory of your website and sit back.
When creating and using the robots files, you should consider the following two tips:
1. Robots.txt files are not supported by all search engines. Googlebot and some other robots can understand the files, but other robots may not be able to understand the generated files.
2. Keep in mind that robots.txt files are only a method of asking that your site be protected from crawling. You simply generate the file, and robots that are not as scrupulous as others can choose to ignore it and get in. Make sure you use password protection for the files you really need blocked.
This can be a great tool for those who are not as confident in their robot language skills, and it can help create a safe haven for the files on your website that you need protected from unsavory robots. It can substantially help you in your quest to protect your website and the files within by helping you generate the file in the correct format for the robot. As always, there are options out there if you need further guidance: you can check out the help center for Webmaster tools or seek answers from a help group of Webmasters.
Posted May 30th, 2010 by sharkdancer
Robots.txt is a text file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt implements the Robots Exclusion Protocol, which allows you, as a web manager, to define what parts of your site are off-limits to search engine crawlers. For example, Web managers can disallow access to cgi, private, and temporary directories and other areas with pages they do not want accessed or indexed.
The robots.txt file is made up of two parts, the User-agent and the Disallow. The User-agent line specifies which robots a record applies to, and the Disallow lines specify which directories those robots should not crawl. The robots.txt file is a gentleman’s agreement, and some crawlers may simply ignore it.
The structure of a robots.txt is pretty simple. This example allows all robots to visit all files:
User-agent: *
Disallow:
Example of a recommended robots.txt file blocking crawling of the scripts and images directories:
User-agent: *
Disallow: /scripts/
Disallow: /images/
If you have a particular robot in mind, such as the Google image search robot, which collects images on your site for the Google Image search engine, you may include lines like the following:
User-agent: Googlebot-Image
Disallow: /
This means that the Google image search robot should not try to access any file in the root directory or any of its subdirectories.
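A quick way to convince yourself that such a record only affects the robot it names is to run it through Python’s standard urllib.robotparser module. This is a sketch, not a statement of how Google’s crawlers actually behave; the www.example.com URL and the second bot name, OtherBot, are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# The record from the text, blocking only the Google image robot.
rules = """\
User-agent: Googlebot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

image_url = "http://www.example.com/images/logo.png"  # hypothetical URL
image_bot_allowed = parser.can_fetch("Googlebot-Image", image_url)
other_bot_allowed = parser.can_fetch("OtherBot", image_url)

print("Googlebot-Image allowed:", image_bot_allowed)
print("OtherBot allowed:", other_bot_allowed)
```

Robots that are not named in any record, and for which there is no “*” record, are treated as having no restrictions.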
You can create the robots.txt file manually, using any text editor. It should be an ASCII-encoded text file, not an HTML file and the filename should be lowercase. Include the robots.txt file in your server’s root directory. This is standard web management practice. It must be in the main directory because otherwise user agents (search engines) will not be able to find it – they do not search the whole site for a file named robots.txt. Instead, they look first in the main directory and if they don’t find it there, they simply assume that this site does not have a robots.txt file and therefore they index everything they find along the way.
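Since robots only look in the main directory, the address of the file can be worked out from any page address on the same site. The following Python sketch shows the idea; the function name robots_url is made up for the example, and the page address uses the placeholder domain www.yourdomain.com.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_url("http://www.yourdomain.com/blog/2010/post.html?page=2")
print(location)
```

A robots.txt file saved anywhere else, such as inside the blog folder, would never be requested.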
All search engines, or at least all the important ones, now look for a robots.txt file as soon as their spiders reach your web site. So, even if you currently do not need to exclude the spiders from any part of your site, having a robots.txt file is still a good idea; it can act as a sort of invitation into your site.