

Google Custom Search API

Selecting Sites to Search

The Defining Your Search Engine Specifications page showed you how to define the specifications for your search engine using the context file. This page shows you how to define the coverage of your search engine using a TSV file or an XML file.


Overview

Sure, you can add sites one at a time in the control panel; however, that gets tedious if you're building a large search engine. In addition, managing a large collection of sites in the control panel isn't fun either. So the best way to add and manage a lot of sites is by listing them in an annotations file and uploading it. Besides, annotations files—particularly XML ones—let you have far greater control over the ranking of search results.

An annotations file is simply a list of annotations. Each annotation has two components: the site and its associated labels. The label tells Custom Search how to handle a site; that is, whether a site should be included, excluded, promoted, or demoted. In the context file, you define labels; in the annotations file, you tag sites with the appropriate labels.

Annotations files can be in any of the following formats:

  • OPML
  • TSV (tab-separated values)
  • Custom Search XML

When you start editing your annotations file, start out with a small number of annotations, and then test some search queries in the Preview tab of the control panel. It's easier to test and troubleshoot your search engine with a handful of annotations. When you get the results that you expect, incrementally add more annotations.

You can either upload the annotations file to the control panel or host it on your own website. For details about file limits, see the Following the Annotations Limits section.


Choosing the Right Format

Before you start creating annotations, determine which file format best suits your needs. If your search engine increases in complexity, you can consider using multiple annotations files, even files of different formats. For example, you can upload OPML annotations files generated by other sites and XML annotations files you created. Custom Search combines all the annotations files in all your search engines into a single XML annotations file.

Use the following comparison to pick the appropriate format:

  • OPML format
    • Use it to create: A search engine from an existing OPML file (a feed-based search engine).
    • Because: You do not need to recreate annotations if you already have OPML files with URL patterns. You can upload the existing file directly to the control panel.
    • Limitations: You cannot directly fine-tune the ranking of search results, but you might want to use this format if you already have a list of websites in OPML files.
    • For more information, see: Using the OPML Format
  • TSV format
    • Use it to create: A search engine that does not need all the advanced features.
    • Because: You can create and manage the annotations in a more readable format. You can use a spreadsheet editor. You can take advantage of many advanced features, such as applying labels, associating scores, and adding comments. You can create your own attributes. However, they are mostly for your own use; Custom Search does not do anything with them.
    • Limitations: You cannot refer to another Custom Search file, and this is not the best option for programmatically created search engines.
    • For more information, see: Using the TSV Format
  • Custom Search XML format
    • Use it to create: A complex and heavily customized search engine.
    • Because: It's the most powerful format. It is appropriate for developers who want to create advanced search engines with bells and whistles. It gives you more flexibility and greater control over the ranking of your search results. If you programmatically generate custom search engines or use third-party tools to generate them, and you want to host the specifications on your own website, you have to use this format.
    • Limitations: It is the most complex format.
    • For more information, see: Using the Custom Search XML Format


Using the OPML Format

OPML is a type of XML format that was originally developed for defining ordered lists of elements or outlines, but it is now also commonly used for web feeds. You can learn more about OPML by reading its specifications.

If you have OPML files from some feed aggregators, you can upload the OPML file without bothering with typing each site. Custom Search grabs the value of the OPML attribute htmlUrl and adds it to the list of sites to search. You can upload multiple OPML files for each of your search engines.

The following is an example of an OPML file:

<opml version="1.0">
   <head>
     <title>Bicycles</title>
     <dateCreated>Fri Mar 14 23:21:11 PDT 2008</dateCreated>
     <dateModified>Fri Mar 14 23:21:11 PDT 2008</dateModified>
   </head>

   <body>
     <outline type="rss" text="Road Bikes" xmlUrl="http://www.google.com/exampleurl.opml" htmlUrl="http://www.google.com/sampleurl1.opml"/>
     <outline type="rss" text="Mountain Bikes" xmlUrl="http://www.google.com/exampleurl2.opml" htmlUrl="http://www.google.com/sampleurl2.opml"/>
   </body>
</opml>    

When you upload an OPML file in the control panel, Custom Search automatically converts OPML to Custom Search XML. It adds search engine labels (<Label name="_cse_example"/>) and scores (score="1"). You can learn more about scores in the Changing the Ranking of Your Search Results page.

The following is an example of an OPML file that has been converted to Custom Search XML:

<GoogleCustomizations>
   <Annotations>
     <Annotation about="www.google.com/sampleurl1.opml" score="1">
       <Label name="_cse_example"/>
     </Annotation>
     <Annotation about="www.google.com/sampleurl2.opml" score="1">
       <Label name="_cse_example"/>
     </Annotation>
   </Annotations>
</GoogleCustomizations>
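
The conversion described above can be approximated in a few lines of Python. This is only a sketch of the idea, not the code that Custom Search runs: the function name opml_to_annotations is hypothetical, the default label is the one from the example, and dropping the URL scheme is an assumption based on the converted sample.

```python
import xml.etree.ElementTree as ET


def opml_to_annotations(opml_text, label="_cse_example"):
    """Build a GoogleCustomizations tree from the htmlUrl attributes of an OPML file."""
    opml_root = ET.fromstring(opml_text)
    root = ET.Element("GoogleCustomizations")
    annotations = ET.SubElement(root, "Annotations")
    for outline in opml_root.iter("outline"):
        html_url = outline.get("htmlUrl")
        if not html_url:
            continue  # outlines without htmlUrl contribute no site
        # Assumption: drop the scheme, as the converted sample does.
        about = html_url.split("://", 1)[-1]
        annotation = ET.SubElement(annotations, "Annotation", about=about, score="1")
        ET.SubElement(annotation, "Label", name=label)
    return root
```

Calling ET.tostring(opml_to_annotations(text), encoding="unicode") returns the result as an XML string that you can inspect before uploading anything.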


Using the TSV Format

If you don't plan to host files in your own website and you don't have OPML files, you can create annotations using a text file with tab-separated values (TSV).

You can use a plain text editor or a spreadsheet editor to create the file. It does not matter what you name the file, so long as you save it with the file extension .tsv (for example, cse_bicycles.tsv). If you are using a plain text editor, separate each element by a single tab character. Do not try to prettify and align the lines with multiple tab characters. If you are using a spreadsheet editor, allocate a column for each of the fields.

Each line of text in your TSV file can list a site and its associated labels.

Elements of a Custom Search TSV

Your TSV files must begin with a heading that enumerates the fields that you will be using in the subsequent annotation lines. The headings are case-sensitive, so follow the capitalization in this guide. The order of the heading elements doesn't really matter, but the annotation lines that follow the heading must follow the order of the headings. When you create the headings, you are essentially creating columns of data, so you can't just plug the annotation data any which way.

A heading has the following fields:

  • URL - The URL pattern of the site.

  • Label - The search engine label or refinement label that should be applied to the site. You can get the labels for your search engine from the Context section of the Advanced tab in the control panel. You'll find at least two search engine or background labels: one for adding sites to your custom search engine and one for excluding sites from it. If you have not changed the search engine label, the label for including sites is in the form of _cse_xxxxxxxxxxx, where x is a character, and the label for excluding sites is in the form of _cse_exclude_xxxxxxxxxxx. To avoid errors, copy and paste these labels instead of typing them by hand.
  • Comment - Optional. Notes about each annotation.
  • Score - Optional. Discussed in detail in the Changing the Ranking of Your Search Results page.
  • Custom Field - Optional. Your own attributes. To create an attribute, just prefix it with "A=". For example, to create a date attribute, use "A=Date". Custom Search does not process these fields.

Each subsequent line corresponds to an annotation. It provides the values for the fields that were defined in the headings.


TSV Example

Let's look at an example of a basic TSV file.

URL        Label
www.webmd.com/hw/*     _cse_Ansi-stoubiq
www.webmd.com/hw/cancer/*     _cse_exclude_Ansi-stoubiq

The example has a heading with the two required fields: URL and Label. The two annotation lines supply the values for the fields. The label in the first annotation line, _cse_Ansi-stoubiq, adds the site, www.webmd.com/hw/*, to the search engine. The other label, _cse_exclude_Ansi-stoubiq, excludes the site, www.webmd.com/hw/cancer/*, from the search engine.

You can add more fields to your TSV annotations, as in the following example, which has a Comment field and a custom field, A=Date.

URL     Label     Comment     A=Date
www.cancer.gov/cancertopics/types/liver/*     _cse_Ansi-stoubiq	government site     20060504
www.medicinenet.com/liver_cancer/*     _cse_Ansi-stoubiq     site on symptoms     20060504
www.webmd.com/hw/cancer/*     _cse_Ansi-stoubiq     great site for patients!     20060504
www.oncologychannel.com/*/treatment     _cse_Ansi-stoubiq     20060504

Even though you added new fields in the header, you are not obligated to supply values for all of them, which is why it's fine for the last line to not have a comment. But that's not the case for URL and Label, which are required fields.
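
If you generate the TSV file from a script rather than a spreadsheet, a minimal Python sketch might look like the following. The function name write_tsv is hypothetical, and writing an omitted optional value as an empty field (so that later columns stay aligned with the heading) is an assumption about how the file is read, not documented behavior.

```python
import csv
import io


def write_tsv(rows, fields=("URL", "Label", "Comment", "A=Date")):
    """Return TSV text: one heading line, then one line per annotation."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(fields), delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # URL and Label are the two required fields.
        if not row.get("URL") or not row.get("Label"):
            raise ValueError("URL and Label are required fields")
        writer.writerow(row)  # missing optional fields become empty cells
    return out.getvalue()
```

Each value is separated by exactly one tab character, which matches the advice above about not aligning columns with extra tabs.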


Using the Custom Search XML Format

If you want to take advantage of all the features available in the Custom Search API, XML is the way to go. You can create XML annotations files in three ways, which are described in the following list. It's just a matter of preference, so you should not worry too much about picking the right way to annotate your search engine. If you change your mind, you can reorganize your annotations by cutting and pasting.

  • An external annotations file for each search engine
    • Use this strategy if you want to: Keep the annotations for each search engine separate.
    • Because: Custom Search merges all annotations into a single annotations file, but you can create and upload them separately. Each file pertains to a search engine.
    • Limitations: If there's overlap between search engines, you might end up managing the same sites in multiple places.
    • For more information, see the section on: Getting to Know XML Annotations
  • One external annotations file shared by all search engines
    • Use this strategy if you want to: Pool all annotations across all your search engines in a single place.
    • Because: Having all annotations in a single file lets you manage annotations across all search engines. A communal annotations file enables you to list sites only once, yet have the flexibility to change inclusion, exclusion, and ranking of the same sites for various search engines. For example, one of your search engines could restrict its search to five sites, another could eliminate those sites, and yet another could promote those sites.
    • Limitations: If you have a lot of annotations, it could be hard to manage the file. You always have to verify that you are changing the annotations for the right search engine.
    • For more information, see the section on: Getting to Know XML Annotations
  • Context files with inline annotations
    • Use this strategy if you want to: Host the files on your website and keep both the context and annotation data of the search engine in a single file.
    • Because: A single file is easier to manage than a search engine that has a context file and an external annotations file. Just create the annotations section right after the context specification.
    • Limitations: You can use this format only if you are hosting the file on your website. If you have multiple search engines that are fairly similar, you might end up managing the same sites in multiple places.
    • For more information, see the section on: Creating Inline Annotations

When you upload your files in the control panel, Custom Search merges all your annotations into a single annotations file that is shared by all your search engines. So when you download the annotations file, you will find all annotations across all your search engines in that file. You can distinguish the annotations by their search engine labels (the value of the name attribute in the Label element), as in the following annotation. Of course, if you have only one search engine, none of this matters because you have only a single annotations file anyway.

<Annotation about="http://www.solarenergy.org/*">
   <Label name="_cse_abcdefghijk"/>
</Annotation>                                  

If you prefer to keep the annotations for each search engine separate, you should maintain the original annotations files and upload them to the control panel when you make changes. To keep things simple, stick with using the XML format. Do not alternate between using the XML format and the Sites tab in the control panel to include or exclude sites, because changes made to the Sites tab are appended to the communal annotations file and you'll have to copy these new annotations to your copy of the annotations file.


Getting to Know XML Annotations

The following is an example of XML annotations. It is roughly the XML version of the TSV example in the previous section. It includes the same elements, except for custom attributes, which are available only in the TSV format. This annotations file tells Custom Search to include everything under www.cancer.gov/cancertopics/types/liver/* and to exclude the other three URL patterns, each of which carries the _cse_exclude_ label.

  <Annotations>

    <Annotation about="www.cancer.gov/cancertopics/types/liver/*">
      <Label name="_cse_Ansi-stoubiq"/>
      <Comment>government site</Comment>
    </Annotation>

    <Annotation about="www.medicinenet.com/liver_cancer/">
      <Label name="_cse_exclude_Ansi-stoubiq"/>
      <Comment>site on symptoms</Comment>
    </Annotation>

    <Annotation about="www.webmd.com/hw/cancer/*">
      <Label name="_cse_exclude_Ansi-stoubiq"/>
      <Comment>great sites for patients!</Comment>
    </Annotation>

    <Annotation about="www.oncologychannel.com/*/treatment">
      <Label name="_cse_exclude_Ansi-stoubiq"/>
    </Annotation>

  </Annotations> 

The annotations file has four elements in the following hierarchy:

  • Annotations (root element)
    • Annotation
      • Label
      • Comment (optional)

Creating External Annotations

To list the sites you want your search engine to cover, do the following:

  1. Start the file with the <Annotations></Annotations> root element.
  2. Create an annotation by adding the <Annotation></Annotation> tags, and then define the about attribute with the URL pattern of the site.

    Note: the second-level annotation tags are different from the top-level annotations tags in that they take the singular form; that is, there is no "s" after "Annotation".

    <Annotations>
       <Annotation about="www.webmd.com/hw/cancer/*">
       </Annotation>
    </Annotations>   
  3. Associate the site with the search engine by using the <Label name=" "/> tag, and specify how that site should be treated by the search engine. You can get the labels for your search engine from the Context section of the Advanced tab in the control panel. You'll find two labels: one for adding sites to your custom search engine and one for excluding sites from it. If you have not changed the name of the search engine label in the context file, the label for including sites is in the form of _cse_xxxxxxxxxxx, where x is a character, and the label for excluding sites is in the form of _cse_exclude_xxxxxxxxxxx. To avoid errors, copy and paste these labels instead of typing them by hand.
    <Annotations>
       <Annotation about="http://www.solarenergy.org/*">
         <Label name="_cse_abcdefghijk"/>
       </Annotation>
    </Annotations>             

    A single site can have multiple labels associated with it, and it can be treated differently by different search engines. For example, the site http://www.solarenergy.org/ can be included in your solar energy search engine and excluded from your bike search engine. The same site can also be ranked differently in the result pages of different search engines.

    If you have changed the name of the label in the context file, remember to update the Label name values in your annotations files.

  4. To add more sites, create and define another Annotation element.
  5. Save the XML file.
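
The steps above can also be scripted. The following Python sketch builds the same Annotations structure from a list of tuples; the function name build_annotations and the tuple layout are illustrative assumptions, and the labels you pass in must be the ones copied from your own control panel.

```python
import xml.etree.ElementTree as ET


def build_annotations(sites):
    """sites: iterable of (url_pattern, label_name, comment_or_None) tuples."""
    root = ET.Element("Annotations")  # step 1: the root element
    for url_pattern, label_name, comment in sites:
        # Step 2: one Annotation (singular) per site, with the about attribute.
        annotation = ET.SubElement(root, "Annotation", about=url_pattern)
        # Step 3: the Label ties the site to a search engine.
        ET.SubElement(annotation, "Label", name=label_name)
        if comment:
            ET.SubElement(annotation, "Comment").text = comment
    return ET.tostring(root, encoding="unicode")
```

Writing the returned string to a file with an .xml extension corresponds to the final step of saving the XML file.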


Creating Inline Annotations

An inline annotation is just like an external annotation, except that it is embedded inside the context file. In essence, you are creating a Custom Search file with two sections: the CustomSearchEngine section, which houses the context or search engine specification, and the Annotations section, which houses the annotations or sites information. You can use files of this format only if you are hosting them from your website. You will not be able to upload this file in the control panel.

When you combine the context and annotations in one file, you have to start with the GoogleCustomizations root element. The file has the following structure:

  • GoogleCustomizations (root element)
    • CustomSearchEngine
      • Title
      • Description
      • Context
        • BackgroundLabels
          • Label
      • LookAndFeel
    • Annotations
      • Annotation
        • Label
        • Comment (optional)

The following is an example of inline annotations.

<GoogleCustomizations>

  <CustomSearchEngine>
   <!--For brevity, other elements have been excluded....--> 
   
    <Context>
      <BackgroundLabels>
        <Label name="_cse_solar_example" mode="FILTER"/>
        <Label name="_cse_exclude_solar_example" mode="ELIMINATE"/>
      </BackgroundLabels>

    </Context>
  </CustomSearchEngine>

  <Annotations>
    
    <!--Include this site in the search results--> 
    <Annotation about="http://www.solarenergy.org/*">
      <Label name="_cse_solar_example"/>
    </Annotation>

    <!--Include this site in the search results-->
    <Annotation about="http://www.solarfacts.net/*">
      <Label name="_cse_solar_example"/>
    </Annotation>

    <!--Exclude this site from the search results--> 
    <Annotation about="http://en.wikipedia.org/wiki/*">
      <Label name="_cse_exclude_solar_example"/>
    </Annotation>

   </Annotations>
</GoogleCustomizations> 
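
Because an inline file holds both the label definitions and the annotations that use them, a mistyped label is easy to catch before you publish the file. The following Python sketch is one way to do that; the function name undeclared_labels is hypothetical, and it checks only labels declared under BackgroundLabels, so any label defined elsewhere in the context would be reported as undeclared.

```python
import xml.etree.ElementTree as ET


def undeclared_labels(xml_text):
    """Return label names used in Annotations but not declared in BackgroundLabels."""
    root = ET.fromstring(xml_text)
    declared = {
        label.get("name")
        for label in root.iterfind("./CustomSearchEngine/Context/BackgroundLabels/Label")
    }
    used = {
        label.get("name")
        for label in root.iterfind("./Annotations/Annotation/Label")
    }
    return sorted(used - declared)
```

An empty list means that every annotation refers to a label that the context section declares.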


Improving Search Coverage

Custom Search is built on top of the Google index. This means that webpages that are in the Google index are available to your search engine; conversely, webpages that have not been crawled by Google will not show up in your search results. If you want your custom search engine to include sites that are not in the Google index, submit a Sitemap to the Indexing tab of the control panel or directly to Google Webmaster Tools.

A Sitemap includes a list of pages in your site, as well as information about the update frequency of the webpages and their importance relative to each other. Submitting a Sitemap helps Google discover your webpages and improve the crawling schedule. To learn more about Sitemaps, see the Webmaster Help Center and Using the Sitemap Protocol. If you are interested in building fancier Sitemaps, see http://www.sitemaps.org/protocol.php.
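
As a rough illustration of what such a file contains, the following Python sketch writes a small Sitemap using the element names and namespace defined by the Sitemap protocol at sitemaps.org. The function name build_sitemap and the tuple layout are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """pages: iterable of (loc, lastmod, changefreq, priority) string tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc                # page address
        ET.SubElement(url, "lastmod").text = lastmod        # e.g. "2009-12-01"
        ET.SubElement(url, "changefreq").text = changefreq  # update frequency
        ET.SubElement(url, "priority").text = priority      # relative importance, "0.0" to "1.0"
    return ET.tostring(urlset, encoding="unicode")
```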

Submitting Sitemaps is particularly helpful if your site has the following:

  • Dynamic content
  • Webpages that aren't easily discovered by Googlebot (Google's web crawler), such as pages with rich AJAX or Flash features
  • Few websites linking to it

    Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it is hard for the crawler to discover it. If your website is new, probably not many websites are pointing to your site.

  • A large archive of content pages that does not have a strong network of cross-linking

Google can index only the pages it can access. So, if you use a robots.txt file or robots meta tags in your webpages, make sure that they don't give instructions that block crawlers.

Improved coverage is not instantaneous, as it takes some time for the pages to be crawled and indexed. But once your webpages are in the index, they could appear in both Google search and your custom search engine.


Improving Search Freshness

If you are not creating or maintaining a search engine that searches just your website, you can skip this section. You cannot apply the strategies discussed in this section to websites that you do not own or manage.

After you submit a Sitemap, Google will start crawling some or all of the webpages, and, over time, the search results for your website will improve. But if you can't wait and you want certain webpages crawled and indexed within the next 24 hours, you can expedite the crawling of your most important webpages by going to the Indexing tab of the control panel and clicking the Index Now button under the On-Demand Indexing section.

Note: Custom Search and Google search use different selection criteria, so submitting pages for on-demand indexing will not make them appear any faster in the Google search index.

For each search engine in your account, you can submit one Sitemap for on-demand indexing. Custom Search will crawl the 10 webpages that you have marked with the highest priority values in your Sitemap. If more than 10 webpages share the highest priority value, Custom Search will crawl those with the most recent last modified date. If you have upgraded to Google Site Search, you have a higher limit for on-demand indexing. The limit, which starts at 50 webpages, varies according to your account level.
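
To predict which pages a given Sitemap would put forward, you can approximate the selection rule described above in Python. This is only a reading of that rule (highest priority first, ties broken by the most recent last modified date), not Custom Search's actual implementation, and the function name and tuple layout are hypothetical.

```python
from datetime import date


def pick_for_on_demand_indexing(pages, limit=10):
    """pages: list of (url, priority, last_modified) with a float priority and a date."""
    # Highest priority first; among equal priorities, most recently modified first.
    ranked = sorted(pages, key=lambda page: (page[1], page[2]), reverse=True)
    return [url for url, _priority, _last_modified in ranked[:limit]]
```

The default limit of 10 matches the standard quota stated above; pass a different limit if your account level allows more.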

If you have more than 10 webpages that you want indexed immediately, you can resubmit an updated Sitemap with the next 10 webpages marked with the highest priority value. If you have just clicked the Index Now button in the control panel, wait until 24 hours have elapsed before you click it again. If you have multiple Sitemaps for a custom search engine, you can submit your most important Sitemap first, wait 24 hours, and then submit the next Sitemap. You can keep submitting the rest of your Sitemaps in 24-hour cycles.

As time passes, new pages on your site will eventually get crawled and included in the main Google index. This frees up your on-demand indexing quota so you can submit new pages.


Hosting the Annotations Files Yourself

You might want to host the annotations files yourself instead of uploading them in the control panel if you want to do any of the following:

  • Have more than 5,000 annotations.
  • Update the annotations frequently.
  • Manage the annotations files without using the control panel.
  • Use scripts to create custom search engines.

    If you have fast-changing data, you could use scripts to convert XML output into XML annotations files, and Custom Search will just grab the updated annotation data from your site. Your script could get data from anywhere, such as a database, RSS feeds, Atom feeds, iCal feeds, and Open Directory.

If you want to host and manage the annotations files in your website, you have to tell Custom Search where to find them. You have to create and upload a root annotations file that points to the hosted files.

The following is an example of a root annotation that refers to an annotations file hosted on a website:

<GoogleCustomizations>
   <Include type="Annotations" href="http://www.yoursite.com/cse_bacon_annotations.xml" />
</GoogleCustomizations>

Your root annotations file does not have to be sparse. You can have one or more full-blown annotations files that refer to other annotations files.

The following example shows how you can refer to a hosted annotations file inside a full-blown annotations file.

<GoogleCustomizations>

  <Annotations file="livercancer-annotations.xml">
    <Annotation about="www.cancer.gov/cancertopics/types/liver/*">
      <Label name="_cse_Ansi-stoubiq"/>
      <Label name="symptoms"/>
      <Comment>This labels this url as symptoms.</Comment>
    </Annotation>
  </Annotations>

  <Include type="Annotations" href="http://mysite.com/myannofile.xml" />
  
</GoogleCustomizations>

Note: Include is a child element of GoogleCustomizations, not Annotations.

The annotations files that you include could themselves use Include tags to refer to more files. In fact, you can have up to five levels of nested Include tags. Regardless of your nesting structure, you can include up to 50 annotations files.
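
If a script maintains your hosted files, it can also write the root annotations file. The following Python sketch emits one Include element per hosted file; the function name build_root_annotations is hypothetical, and the check covers only the files listed directly in this root file, not files pulled in through nested Include tags.

```python
import xml.etree.ElementTree as ET

MAX_INCLUDED_FILES = 50  # limit stated in this guide


def build_root_annotations(hrefs):
    """Return a root annotations file that points to hosted annotations files."""
    hrefs = list(hrefs)
    if len(hrefs) > MAX_INCLUDED_FILES:
        raise ValueError("at most 50 annotations files can be included")
    root = ET.Element("GoogleCustomizations")
    for href in hrefs:
        # Include is a direct child of GoogleCustomizations, not of Annotations.
        ET.SubElement(root, "Include", type="Annotations", href=href)
    return ET.tostring(root, encoding="unicode")
```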


Following the Annotations Limits

The following list gives the maximum allowed for annotations files that are uploaded to the Advanced tab of the control panel (hosted on Google Custom Search) and for annotations files that are hosted on your website:

  • File size (context or annotations files)
    • Hosted on Google Custom Search: 30 KB
    • Hosted on your site: 3 MB
  • Number of files
    • Hosted on Google Custom Search: As many files as you need, so long as you do not exceed the global annotations limit (5,000)
    • Hosted on your site: 50
  • Number of annotations per file
    • Hosted on Google Custom Search: 2,000
    • Hosted on your site: As many as you need, so long as the file size does not exceed 3 MB and the total size of all the files does not exceed 10 MB
  • Total number of annotations for all your search engines
    • Hosted on Google Custom Search: 5,000
    • Hosted on your site: As many as you need, so long as the aggregate size of all files does not exceed 10 MB

Tip: If you find your search engines outgrowing the 5,000-annotation limit, consider consolidating individual URLs into URL patterns.
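
Before uploading a file to the control panel, you can check it against the per-file limits above with a short script. The following Python sketch is an illustration only: the function name check_upload_limits is hypothetical, and treating 30 KB as 30 × 1,024 bytes is an assumption.

```python
import xml.etree.ElementTree as ET

MAX_UPLOAD_BYTES = 30 * 1024      # 30 KB; assumes 1 KB = 1,024 bytes
MAX_ANNOTATIONS_PER_FILE = 2000   # per-file limit for uploaded files


def check_upload_limits(xml_bytes):
    """Return a list of limit violations for one uploaded annotations file."""
    problems = []
    if len(xml_bytes) > MAX_UPLOAD_BYTES:
        problems.append("file larger than 30 KB")
    count = sum(1 for _ in ET.fromstring(xml_bytes).iter("Annotation"))
    if count > MAX_ANNOTATIONS_PER_FILE:
        problems.append("more than 2,000 annotations")
    return problems
```

An empty list means the file is within both per-file limits; the 5,000-annotation total across all search engines still has to be tracked separately.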


Taking the Next Step

After you have defined the search engine specifications and created a list of sites for your search engine, test the search results in the Preview tab of the control panel. If the ranking of the search results does not suit your needs, you can start tweaking it. However, if you are satisfied with your search engine, you can start designing its look and feel.

 
