Overview

Introduction

Note: This documentation is currently still under development. Expect improvements in the near future.

Google Safe Browsing v5 is an evolution of Google Safe Browsing v4. The two key changes made in v5 are data freshness and IP privacy. In addition, the API surface has been improved to increase flexibility and efficiency and to reduce bloat. Furthermore, Google Safe Browsing v5 is designed to make migration from v4 easy.

Currently, Google offers both v4 and v5, and both are considered production ready. You may use either v4 or v5. We have not announced a date for sunsetting v4; if we do, we will give a minimum notice of one year. This page describes v5 and provides a migration guide from v4 to v5; the complete v4 documentation remains available.

Data Freshness

One significant improvement of Google Safe Browsing v5 over v4 (specifically, the v4 Update API) is data freshness and coverage. Since protection depends heavily on the client-maintained local database, the delay and size of local database updates are the main contributors to missed protection. In v4, the typical client takes 20 to 50 minutes to obtain the most up-to-date version of the threat lists. Unfortunately, phishing attacks spread fast: as of 2021, 60% of sites that deliver attacks live less than 10 minutes. Our analysis shows that around 25-30% of missing phishing protection is due to such data staleness. Further, some devices are not equipped to manage the entirety of the Google Safe Browsing threat lists, which continue to grow larger over time.

In v5, we introduce a mode of operation known as real-time protection, which circumvents the data staleness problem above. In v4, clients are expected to download and maintain a local database, perform checks against the locally downloaded threat lists, and then, when there is a partial prefix match, perform a request to download the full hash. In v5, although clients should continue to download and maintain a local database of threat lists, clients are now also expected to download a list of likely-benign sites (called the Global Cache), perform both a local check against this Global Cache and a local threat list check, and finally, when there is either a partial prefix match for threat lists or a no-match in the Global Cache, perform a request to download the full hashes. (For details on the local processing required by the client, please see the procedure provided below.) This represents a shift from allow-by-default to check-by-default, which can improve protection in light of faster propagation of threats on the web. In other words, this protocol is designed to provide near-real-time protection: we aim to have clients benefit from fresher Google Safe Browsing data.

IP Privacy

Google Safe Browsing (v4 or v5) does not process anything associated with a user’s identity in the course of serving requests. Cookies, if sent, are ignored. The originating IP addresses of the requests are known to Google, but Google only uses the IP addresses for essential networking needs (i.e. for sending responses) and for anti-DoS purposes.

Concurrently with v5, we introduce a companion API known as the Safe Browsing Oblivious HTTP Gateway API. This uses Oblivious HTTP to hide end users' IP addresses from Google. It works by having a non-colluding third party handle an encrypted version of the user request and then forward it to Google. As a result, the third party only has access to the IP addresses, and Google only has access to the content of the request. The third party operates an Oblivious HTTP Relay (such as this service by Fastly), and Google operates the Oblivious HTTP Gateway. This is an optional companion API. When it is used in conjunction with Google Safe Browsing, end users' IP addresses are no longer sent to Google.

Appropriate Usage

Permitted Use

The Safe Browsing API is for non-commercial use only (meaning “not for sale or revenue generating purposes”). If you need a solution for commercial purposes, please refer to Web Risk.

Pricing

All Google Safe Browsing APIs are free of charge.

Quotas

Developers are allocated a default usage quota upon enabling the Safe Browsing API. Current allocation and usage can be viewed in the Google Developer Console. If you expect to use more than your currently allocated quota, you may request additional quota from the Developer Console's Quota interface. We review these requests and require a contact when applying for an increased quota to ensure that our service availability meets the needs of all users.

Appropriate URLs

Google Safe Browsing is designed to act on URLs that would be displayed in a browser's address bar. It is not designed to be used to check against subresources (such as a JavaScript or image referenced by an HTML file, or a WebSocket URL initiated by JavaScript). Such subresource URLs should not be checked against Google Safe Browsing.

If visiting a URL results in a redirect (such as HTTP 301), it is appropriate for the redirected URL to be checked against Google Safe Browsing. Client-side URL manipulation such as History.pushState does not result in new URLs to be checked against Google Safe Browsing.

User Warnings

If you use Google Safe Browsing to warn users about risks from particular webpages, the following guidelines apply.

These guidelines help protect both you and Google from misunderstandings by making clear that the page is not known with 100% certainty to be an unsafe web resource, and that the warnings merely identify possible risk.

  • In your user-visible warning, you must not lead users to believe that the page in question is, without a doubt, an unsafe web resource. When you refer to the page being identified or the potential risks it may pose to users, you must qualify the warning using terms such as: suspected, potentially, possible, likely, may be.
  • Your warning must enable the user to learn more by reviewing Google's definition of various threats. The following links are suggested:
  • When you show warnings for pages identified as risky by the Safe Browsing Service, you must give attribution to Google by including the line "Advisory provided by Google" with a link to the Safe Browsing Advisory. If your product also shows warnings based on other sources, you must not include the Google attribution in warnings derived from non-Google data.
  • In your product documentation, you must provide a notice to let users know that the protection offered by Google Safe Browsing is not perfect. It must let them know that there is a chance of both false positives (safe sites flagged as risky) and false negatives (risky sites not flagged). We suggest using the following language:

    Google works to provide the most accurate and up-to-date information about unsafe web resources. However, Google cannot guarantee that its information is comprehensive and error-free: some risky sites may not be identified, and some safe sites may be identified in error.

The Modes of Operation

Google Safe Browsing v5 allows clients to choose from three modes of operation.

Real-Time Mode

When clients choose to use Google Safe Browsing v5 in real-time mode, clients will maintain in their local database: (i) a Global Cache of likely-benign sites, formatted as SHA256 hashes of host-suffix/path-prefix URL expressions, and (ii) a set of threat lists, formatted as SHA256 hash prefixes of host-suffix/path-prefix URL expressions. The high-level idea is that whenever the client wishes to check a particular URL, a local check is performed using the Global Cache. If that check passes, a local threat lists check is performed. Otherwise, the client continues with the real-time hash check as detailed below.

Besides the local database, the client will maintain a local cache. Such a local cache need not be in persistent storage and may be cleared in case of memory pressure.

A detailed specification of the procedure is available below.

Local List Mode

When clients choose to use Google Safe Browsing v5 in this mode, the client behavior is similar to the v4 Update API, except that it uses the improved API surface of v5. Clients will maintain in their local database a set of threat lists formatted as SHA256 hash prefixes of host-suffix/path-prefix URL expressions. Whenever the client wishes to check a particular URL, a check is performed using the local threat list. If and only if there is a match, the client connects to the server to continue the check.

As with the above, the client will also maintain a local cache that need not be in persistent storage.

No-Storage Real-Time Mode

When clients choose to use Google Safe Browsing v5 in the no-storage real-time mode, the client need not maintain any local database. Whenever the client wishes to check a particular URL, the client always connects to the server to perform a check. This mode is similar to what clients of the v4 Lookup API may implement.

Checking URLs

This section contains detailed specifications of how clients check URLs.

Canonicalization of URLs

Before any URLs are checked, the client is expected to perform some canonicalization on that URL.

To begin, we assume that the client has parsed the URL and made it valid according to RFC 2396. If the URL uses an internationalized domain name (IDN), the client should convert the URL to the ASCII Punycode representation. The URL must include a path component; that is, it must have at least one slash following the domain (http://google.com/ instead of http://google.com).

First, remove tab (0x09), CR (0x0d), and LF (0x0a) characters from the URL. Do not remove escape sequences for these characters (e.g. %0a).

Second, if the URL ends in a fragment, remove the fragment. For example, shorten http://google.com/#frag to http://google.com/.

Third, repeatedly percent-unescape the URL until it has no more percent-escapes. (This may render the URL invalid.)

To canonicalize the hostname:

Extract the hostname from the URL and then:

  1. Remove all leading and trailing dots.
  2. Replace consecutive dots with a single dot.
  3. If the hostname can be parsed as an IPv4 address, normalize it to 4 dot-separated decimal values. The client should handle any legal IP-address encoding, including octal, hex, and fewer than four components.
  4. If the hostname can be parsed as a bracketed IPv6 address, normalize it by removing unnecessary leading zeroes in the components and collapsing zero components by using the double-colon syntax. For example [2001:0db8:0000::1] should be transformed into [2001:db8::1]. If the hostname is one of the two following special IPv6 address types, transform them into IPv4:
    • An IPv4-mapped IPv6 address, such as [::ffff:1.2.3.4], which should be transformed into 1.2.3.4;
    • A NAT64 address using the well-known prefix 64:ff9b::/96, such as [64:ff9b::1.2.3.4], which should be transformed into 1.2.3.4.
  5. Lowercase the whole string.
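
The hostname steps above can be sketched in Python as follows. This is a minimal illustration and not a reference implementation: the function names canonicalize_host and normalize_ipv4 are hypothetical, and the IPv4 parsing rules (a 0x prefix for hex, a leading 0 for octal, the last component filling the remaining bytes) are an assumption about what "any legal IP-address encoding" covers.

```python
import ipaddress
import re

NAT64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")
IPV4_PART = re.compile(r"0[xX][0-9a-fA-F]+|[0-9]+")


def normalize_ipv4(host):
    """Return dotted-decimal IPv4 for a decimal/octal/hex encoding, else None."""
    parts = host.split(".")
    if not 1 <= len(parts) <= 4:
        return None
    numbers = []
    for part in parts:
        if not IPV4_PART.fullmatch(part):
            return None
        if part[:2].lower() == "0x":
            numbers.append(int(part, 16))
        elif len(part) > 1 and part[0] == "0":
            if not re.fullmatch(r"[0-7]+", part):
                return None
            numbers.append(int(part, 8))
        else:
            numbers.append(int(part))
    # Leading parts are single bytes; the last part fills the remaining bytes.
    *leading, last = numbers
    if any(n > 255 for n in leading) or last >= 256 ** (4 - len(leading)):
        return None
    value = last
    for index, n in enumerate(leading):
        value |= n << (8 * (3 - index))
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def canonicalize_host(host):
    # Steps 1 and 2: trim leading/trailing dots, collapse runs of dots.
    host = re.sub(r"\.{2,}", ".", host.strip("."))
    # Step 3: IPv4 in any supported encoding.
    ipv4 = normalize_ipv4(host)
    if ipv4 is not None:
        return ipv4
    # Step 4: bracketed IPv6, with the two IPv4-embedding special cases.
    if host.startswith("[") and host.endswith("]"):
        try:
            address = ipaddress.IPv6Address(host[1:-1])
        except ValueError:
            return host.lower()
        if address.ipv4_mapped is not None:
            return str(address.ipv4_mapped)
        if address in NAT64_PREFIX:
            return str(ipaddress.IPv4Address(int(address) & 0xFFFFFFFF))
        return f"[{address.compressed}]"
    # Step 5: lowercase.
    return host.lower()
```

For instance, under these assumptions canonicalize_host("[2001:0db8:0000::1]") yields [2001:db8::1], matching the example in step 4.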

To canonicalize the path:

  1. Resolve the sequences /../ and /./ in the path by replacing /./ with /, and removing /../ along with the preceding path component.
  2. Replace runs of consecutive slashes with a single slash character.

Do not apply these path canonicalizations to the query parameters.

In the URL, percent-escape all characters that are <= ASCII 32, >= 127, #, or %. The escapes should use uppercase hex characters.
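
The URL-level steps (character removal, fragment removal, repeated unescaping, path resolution, and final escaping) can be sketched as below. The helper names are hypothetical, the helpers operate on bytes so that non-ASCII data survives unescaping, and the handling of a path that ends in /. or /.. is an assumption not stated in the text above. Hostname canonicalization and the splitting of the URL into host, path, and query are left out.

```python
from urllib.parse import unquote_to_bytes


def strip_and_unescape(url: str) -> bytes:
    # First: drop literal tab, CR, and LF (escape sequences such as %0a stay).
    for char in ("\t", "\r", "\n"):
        url = url.replace(char, "")
    # Second: drop the fragment.
    url = url.split("#", 1)[0]
    # Third: percent-unescape until nothing changes.
    data = url.encode("utf-8")
    while True:
        unescaped = unquote_to_bytes(data)
        if unescaped == data:
            return data
        data = unescaped


def canonicalize_path(path: bytes) -> bytes:
    # Resolve "." and ".." segments and collapse runs of slashes.
    kept = []
    for segment in path.split(b"/"):
        if segment in (b"", b"."):
            continue
        if segment == b"..":
            if kept:
                kept.pop()
            continue
        kept.append(segment)
    result = b"/" + b"/".join(kept)
    # Assumption: a trailing "/", "/." or "/.." denotes a directory.
    if kept and path.endswith((b"/", b"/.", b"/..")):
        result += b"/"
    return result


def escape(data: bytes) -> str:
    # Escape bytes <= 32, >= 127, '#', and '%' using uppercase hex.
    return "".join(
        f"%{byte:02X}" if byte <= 32 or byte >= 127 or byte in b"#%" else chr(byte)
        for byte in data
    )
```

In a full client these helpers would be applied in the order described above, with canonicalize_path applied only to the path and never to the query parameters.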

Host-Suffix Path-Prefix Expressions

Once the URL is canonicalized, the next step is to create the suffix/prefix expressions. Each suffix/prefix expression consists of a host suffix (or full host) and a path prefix (or full path).

The client will form up to 30 different possible host suffix and path prefix combinations. These combinations use only the host and path components of the URL. The scheme, username, password, and port are discarded. If the URL includes query parameters, then at least one combination will include the full path and query parameters.

For the host, the client will try at most five different strings. They are:

  • If the hostname is not an IPv4 or IPv6 literal, up to four hostnames formed by starting with the eTLD+1 domain and adding successive leading components. The determination of eTLD+1 should be based on the Public Suffix List. For example, a.b.example.com would result in the eTLD+1 domain of example.com as well as the host with one additional host component b.example.com.
  • The exact hostname in the URL. Following the previous example, a.b.example.com would be checked.

For the path, the client will try at most six different strings. They are:

  • The exact path of the URL, including query parameters.
  • The exact path of the URL, without query parameters.
  • The four paths formed by starting at the root (/) and successively appending path components, including a trailing slash.

The following examples illustrate the check behavior:

For the URL http://a.b.com/1/2.html?param=1, the client will try these possible strings:

a.b.com/1/2.html?param=1
a.b.com/1/2.html
a.b.com/
a.b.com/1/
b.com/1/2.html?param=1
b.com/1/2.html
b.com/
b.com/1/

For the URL http://a.b.c.d.e.f.com/1.html, the client will try these possible strings:

a.b.c.d.e.f.com/1.html
a.b.c.d.e.f.com/
c.d.e.f.com/1.html
c.d.e.f.com/
d.e.f.com/1.html
d.e.f.com/
e.f.com/1.html
e.f.com/
f.com/1.html
f.com/

(Note: skip b.c.d.e.f.com, since we'll take only the last five hostname components, and the full hostname.)

For the URL http://1.2.3.4/1/, the client will try these possible strings:

1.2.3.4/1/
1.2.3.4/

For the URL http://example.co.uk/1, the client will try these possible strings:

example.co.uk/1
example.co.uk/
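
The generation of these expressions can be sketched as follows. The function names are hypothetical. Because determining the eTLD+1 requires the Public Suffix List, this sketch takes it as a caller-supplied argument (None for IP literals) rather than computing it; the host and path are assumed to be canonicalized already.

```python
def host_candidates(host, etld_plus_one):
    """Exact host first, then up to four suffixes built up from the eTLD+1."""
    if etld_plus_one is None:  # IPv4 or IPv6 literal: only the exact host
        return [host]
    labels = host.split(".")
    base = len(etld_plus_one.split("."))
    # Suffixes from the eTLD+1 upward, excluding the exact host itself.
    suffixes = [".".join(labels[-n:]) for n in range(base, len(labels))]
    # Keep the four shortest, listed longest first as in the examples above.
    return [host] + suffixes[:4][::-1]


def path_candidates(path, query=""):
    """Exact path with and without query, then up to four directory prefixes."""
    result = []
    if query:
        result.append(f"{path}?{query}")
    result.append(path)
    prefix = "/"
    directories = [prefix]
    for component in path.split("/")[1:-1]:
        prefix += component + "/"
        directories.append(prefix)
    for directory in directories[:4]:
        if directory not in result:
            result.append(directory)
    return result


def expressions(host, etld_plus_one, path, query=""):
    return [
        h + p
        for h in host_candidates(host, etld_plus_one)
        for p in path_candidates(path, query)
    ]
```

Calling expressions("a.b.com", "b.com", "/1/2.html", "param=1") reproduces the eight strings of the first example, in the same order.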

Hashing

Google Safe Browsing exclusively uses SHA256 as the hash function. This hash function should be applied to the above expressions.

The full 32-byte hash will, depending on the circumstances, be truncated to 4 bytes, 8 bytes, or 16 bytes:

  • When using the hashes.search method, we currently require the hashes in the request to be truncated to exactly 4 bytes. Sending additional bytes in this request will compromise user privacy.

  • When downloading the lists for the local database using the hashList.get method or the hashLists.batchGet method, the length of the hashes sent by the server is influenced by both the nature of the list and the client's preference of the hash length, communicated by the desired_hash_length parameter.
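
As a small illustration, hashing and truncation could look like this in Python. The function names are hypothetical, and encoding the expression as UTF-8 before hashing is an assumption (canonicalized expressions are expected to be ASCII after percent-escaping).

```python
import hashlib


def full_hash(expression: str) -> bytes:
    # The full 32-byte SHA256 hash of a suffix/prefix expression.
    return hashlib.sha256(expression.encode("utf-8")).digest()


def hash_prefix(expression: str, length: int = 4) -> bytes:
    # Truncate to the leading bytes; hashes.search expects exactly 4.
    if length not in (4, 8, 16, 32):
        raise ValueError("unsupported hash length")
    return full_hash(expression)[:length]
```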

The Real-Time URL Check Procedure

This procedure is used when the client chooses the real-time mode of operation.

This procedure takes a single URL u and returns SAFE, UNSAFE or UNSURE. If it returns SAFE, the URL is deemed safe by Google Safe Browsing. If it returns UNSAFE, the URL is deemed potentially unsafe by Google Safe Browsing and appropriate action should be taken, such as showing a warning to the end user, moving a received message to the spam folder, or requiring extra confirmation by the user before proceeding. If it returns UNSURE, the following local-check procedure should be used afterwards.

  1. Let expressions be a list of suffix/prefix expressions generated by the URL u.
  2. Let expressionHashes be a list, where the elements are SHA256 hashes of each expression in expressions.
  3. For each hash of expressionHashes:
    1. If hash can be found in the global cache, return UNSURE.
  4. Let expressionHashPrefixes be a list, where the elements are the first 4 bytes of each hash in expressionHashes.
  5. For each expressionHashPrefix of expressionHashPrefixes:
    1. Look up expressionHashPrefix in the local cache.
    2. If the cached entry is found:
      1. Determine whether the current time is greater than its expiration time.
      2. If it is greater:
        1. Remove the found cached entry from the local cache.
        2. Continue with the loop.
      3. If it is not greater:
        1. Remove this particular expressionHashPrefix from expressionHashPrefixes.
        2. Check whether the corresponding full hash within expressionHashes is found in the cached entry.
        3. If found, return UNSAFE.
        4. If not found, continue with the loop.
    3. If the cached entry is not found, continue with the loop.
  6. Send expressionHashPrefixes to the Google Safe Browsing v5 server using RPC SearchHashes or the REST method hashes.search. If an error occurred (including network errors, HTTP errors, etc), return UNSURE. Otherwise, let response be the response received from the SB server, which is a list of full hashes together with some auxiliary information identifying the nature of the threat (social engineering, malware, etc), as well as the cache expiration time expiration.
  7. For each fullHash of response:
    1. Insert fullHash into the local cache, together with expiration.
  8. For each fullHash of response:
    1. Let isFound be the result of finding fullHash in expressionHashes.
    2. If isFound is False, continue with the loop.
    3. If isFound is True, return UNSAFE.
  9. Return SAFE.
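
A minimal Python sketch of this procedure follows. Everything about its shape is an assumption made for illustration: the function name, the Global Cache as a set of full hashes, the local cache as a dict from 4-byte prefix to an (expiration, set of full hashes) pair, and search_hashes as a caller-supplied callable that returns (full_hashes, expiration) or raises on error. It also skips the server request when no prefixes remain, which gives the same result as sending an empty list.

```python
import hashlib
import time


def real_time_check(expressions, global_cache, local_cache, search_hashes, now=None):
    now = time.time() if now is None else now
    # Steps 1-2: hash every expression.
    hashes = [hashlib.sha256(e.encode("utf-8")).digest() for e in expressions]
    # Step 3: a Global Cache hit defers to the local threat list procedure.
    if any(h in global_cache for h in hashes):
        return "UNSURE"
    # Steps 4-5: consult the local cache, dropping expired entries.
    pending = []
    for full_hash in hashes:
        prefix = full_hash[:4]
        entry = local_cache.get(prefix)
        if entry is not None:
            expiration, cached = entry
            if now <= expiration:
                if full_hash in cached:
                    return "UNSAFE"
                continue  # fresh entry without a match: nothing to send
            del local_cache[prefix]
        if prefix not in pending:
            pending.append(prefix)
    if not pending:
        return "SAFE"
    # Step 6: ask the server; any error yields UNSURE.
    try:
        full_hashes, expiration = search_hashes(pending)
    except Exception:
        return "UNSURE"
    # Step 7: remember the returned full hashes until they expire.
    for returned in full_hashes:
        _, cached = local_cache.setdefault(returned[:4], (expiration, set()))
        cached.add(returned)
    # Steps 8-9: any returned hash that matches an expression is a hit.
    return "UNSAFE" if any(r in hashes for r in full_hashes) else "SAFE"
```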

The Local Threat List URL Check Procedure

This procedure is used when the client opts for the local list mode of operation. It is also used when the RealTimeCheck procedure above returns the value of UNSURE.

This procedure takes a single URL u and returns SAFE or UNSAFE.

  1. Let expressions be a list of suffix/prefix expressions generated by the URL u.
  2. Let expressionHashes be a list, where the elements are SHA256 hashes of each expression in expressions.
  3. Let expressionHashPrefixes be a list, where the elements are the first 4 bytes of each hash in expressionHashes.
  4. For each expressionHashPrefix of expressionHashPrefixes:
    1. Look up expressionHashPrefix in the local cache.
    2. If the cached entry is found:
      1. Determine whether the current time is greater than its expiration time.
      2. If it is greater:
        1. Remove the found cached entry from the local cache.
        2. Continue with the loop.
      3. If it is not greater:
        1. Remove this particular expressionHashPrefix from expressionHashPrefixes.
        2. Check whether the corresponding full hash within expressionHashes is found in the cached entry.
        3. If found, return UNSAFE.
        4. If not found, continue with the loop.
    3. If the cached entry is not found, continue with the loop.
  5. For each expressionHashPrefix of expressionHashPrefixes:
    1. Look up expressionHashPrefix in the local threat list database.
    2. If the expressionHashPrefix cannot be found in the local threat list database, remove it from expressionHashPrefixes.
  6. Send expressionHashPrefixes to the Google Safe Browsing v5 server using RPC SearchHashes or the REST method hashes.search. If an error occurred (including network errors, HTTP errors, etc), return SAFE. Otherwise, let response be the response received from the SB server, which is a list of full hashes together with some auxiliary information identifying the nature of the threat (social engineering, malware, etc), as well as the cache expiration time expiration.
  7. For each fullHash of response:
    1. Insert fullHash into the local cache, together with expiration.
  8. For each fullHash of response:
    1. Let isFound be the result of finding fullHash in expressionHashes.
    2. If isFound is False, continue with the loop.
    3. If isFound is True, return UNSAFE.
  9. Return SAFE.
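
The local list variant can be sketched in the same way. As before, the names and data shapes are assumptions for illustration: threat_prefixes stands for the set of 4-byte prefixes in the local threat list database, local_cache maps a prefix to an (expiration, set of full hashes) pair, and search_hashes is a caller-supplied callable. The cache loop and the database filter are merged into one pass, which does not change the outcome because every cache lookup still happens before any request is sent.

```python
import hashlib
import time


def local_list_check(expressions, threat_prefixes, local_cache, search_hashes, now=None):
    now = time.time() if now is None else now
    hashes = [hashlib.sha256(e.encode("utf-8")).digest() for e in expressions]
    pending = []
    for full_hash in hashes:
        prefix = full_hash[:4]
        # Step 4: a fresh cache entry answers this prefix without a request.
        entry = local_cache.get(prefix)
        if entry is not None:
            expiration, cached = entry
            if now <= expiration:
                if full_hash in cached:
                    return "UNSAFE"
                continue
            del local_cache[prefix]
        # Step 5: only prefixes present in the local threat lists are sent.
        if prefix in threat_prefixes and prefix not in pending:
            pending.append(prefix)
    if not pending:
        return "SAFE"  # no local match, so no server contact is needed
    # Step 6: unlike the real-time procedure, an error yields SAFE.
    try:
        full_hashes, expiration = search_hashes(pending)
    except Exception:
        return "SAFE"
    # Step 7: cache the returned full hashes.
    for returned in full_hashes:
        _, cached = local_cache.setdefault(returned[:4], (expiration, set()))
        cached.add(returned)
    # Steps 8-9.
    return "UNSAFE" if any(r in hashes for r in full_hashes) else "SAFE"
```

Dropping the threat_prefixes filter from this sketch gives the shape of the no-storage real-time procedure described next.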

The Real-Time URL Check Procedure Without a Local Database

This procedure is used when the client chooses the no-storage real-time mode of operation.

This procedure takes a single URL u and returns SAFE or UNSAFE.

  1. Let expressions be a list of suffix/prefix expressions generated by the URL u.
  2. Let expressionHashes be a list, where the elements are SHA256 hashes of each expression in expressions.
  3. Let expressionHashPrefixes be a list, where the elements are the first 4 bytes of each hash in expressionHashes.
  4. For each expressionHashPrefix of expressionHashPrefixes:
    1. Look up expressionHashPrefix in the local cache.
    2. If the cached entry is found:
      1. Determine whether the current time is greater than its expiration time.
      2. If it is greater:
        1. Remove the found cached entry from the local cache.
        2. Continue with the loop.
      3. If it is not greater:
        1. Remove this particular expressionHashPrefix from expressionHashPrefixes.
        2. Check whether the corresponding full hash within expressionHashes is found in the cached entry.
        3. If found, return UNSAFE.
        4. If not found, continue with the loop.
    3. If the cached entry is not found, continue with the loop.
  5. Send expressionHashPrefixes to the Google Safe Browsing v5 server using RPC SearchHashes or the REST method hashes.search. If an error occurred (including network errors, HTTP errors, etc), return SAFE. Otherwise, let response be the response received from the SB server, which is a list of full hashes together with some auxiliary information identifying the nature of the threat (social engineering, malware, etc), as well as the cache expiration time expiration.
  6. For each fullHash of response:
    1. Insert fullHash into the local cache, together with expiration.
  7. For each fullHash of response:
    1. Let isFound be the result of finding fullHash in expressionHashes.
    2. If isFound is False, continue with the loop.
    3. If isFound is True, return UNSAFE.
  8. Return SAFE.

Local Database Maintenance

Google Safe Browsing v5 expects the client to maintain a local database, except when the client chooses the No-Storage Real-Time Mode. The format and storage of this local database are up to the client. The contents of this local database can conceptually be thought of as a folder containing various lists as files, and the contents of these files are SHA256 hashes or hash prefixes.

Database Updates

The client will regularly call the hashList.get method or the hashLists.batchGet method to update the database. Since the typical client will want to update multiple lists at a time, it is recommended to use the hashLists.batchGet method.

Lists are identified by their distinct names. The names are short ASCII strings a few characters long.

Unlike v4, where lists are identified by the tuple of threat type, platform type, and threat entry type, in v5 lists are simply identified by name. This provides flexibility when multiple v5 lists share the same threat type. Platform types and threat entry types are removed in v5.

Once a name has been chosen for a list, it will never be renamed. Furthermore, once a list has appeared, it will never be removed (if the list is no longer useful, it will become empty but will continue to exist). Therefore, it is appropriate to hard code these names in the Google Safe Browsing client code.

Both the hashList.get method and the hashLists.batchGet method support incremental updates. Using incremental updates saves bandwidth and improves performance. Incremental updates work by delivering a delta between the client's version of the list and the latest version of the list. (If a client is newly deployed and does not have any versions available, a full update is available.) The incremental update contains removal indices and additions. The client is first expected to remove the entries at the specified indices from its local database, and then apply the additions.

Finally, to prevent corruption, the client should check the stored data against the checksum provided by the server. Whenever the checksum does not match, the client should perform a full update.
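
Applying an incremental update might be sketched as follows. Several details here are assumptions not stated above: the list is held in memory as a sorted Python list of byte strings, removal indices refer to positions in that sorted list, and the checksum is taken to be the SHA256 of the concatenated sorted entries. The function names are hypothetical; consult the API reference for the actual checksum definition.

```python
import hashlib


def apply_update(current, removal_indices, additions):
    # Remove first, using indices into the list as it was before the update.
    removed = set(removal_indices)
    kept = [entry for index, entry in enumerate(current) if index not in removed]
    # Then apply the additions, keeping the list sorted.
    return sorted(kept + list(additions))


def checksum_matches(entries, expected_checksum):
    # Assumed checksum: SHA256 over the sorted entries, concatenated.
    return hashlib.sha256(b"".join(entries)).digest() == expected_checksum
```

If checksum_matches returns False after an update, the client would discard the list and request a full update, as described above.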

Decoding the List Content

All lists are delivered using a special encoding to reduce size. This encoding works by recognizing that Google Safe Browsing lists contain, conceptually, a set of hashes or hash prefixes, which are statistically indistinguishable from random integers. If we were to sort these integers and take their adjacent difference, such adjacent difference is expected to be "small" in a sense. Golomb-Rice encoding then exploits this smallness.

Google Safe Browsing v5 has four distinct types to handle 4-byte, 8-byte, 16-byte, and 32-byte data. Consider an example in which three numerically consecutive 4-byte integers are encoded. Let the Rice parameter, denoted by k, be 3. The quotient part of the encoding is the adjacent difference shifted right by k bits. Since the integers are consecutive, each adjacent difference is 1, and after shifting right by 3 bits the quotient is zero. The remainder is the least significant k bits of the difference, which are 001. The zero quotient is encoded as a single 0 bit. The remainder, 1, is encoded as 100 (least significant bit first). This is repeated for the second difference to form the bitstream 01000100. The resulting bitstream is encoded using little endian as 00100010. It therefore corresponds to the following data:

rice_parameter: 3
entries_count: 2
encoded_data: "\x22"

After the above decoding step for 32-bit integers, the results are directly usable as either removal indices or additions. Unlike v4, there is no need to perform a byte-swap afterwards.
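
A decoder consistent with the worked example above can be sketched as follows. The bit order (bits consumed from the least significant bit of each byte, unary quotient terminated by a 0 bit, remainder stored least significant bit first) is inferred from that example only. The first_value parameter is an assumption: the example lists only rice_parameter, entries_count, and encoded_data, so the starting integer has to come from elsewhere.

```python
def rice_decode(first_value, rice_parameter, entries_count, encoded_data):
    position = 0  # index of the next bit to read

    def read_bit():
        nonlocal position
        byte = encoded_data[position // 8]
        bit = (byte >> (position % 8)) & 1  # least significant bit first
        position += 1
        return bit

    values = [first_value]
    for _ in range(entries_count):
        # Quotient: unary, a run of 1 bits ended by a 0 bit.
        quotient = 0
        while read_bit() == 1:
            quotient += 1
        # Remainder: k bits, least significant bit first.
        remainder = 0
        for shift in range(rice_parameter):
            remainder |= read_bit() << shift
        delta = (quotient << rice_parameter) | remainder
        values.append(values[-1] + delta)
    return values
```

With rice_parameter 3, entries_count 2, and encoded_data b"\x22", both decoded deltas are 1, so a hypothetical first value of 5 yields 5, 6, 7.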

Available Lists

The following lists are recommended for use in v5alpha1:

List Name | Corresponding v4 ThreatType Enum | Description
gc | None | This list is a Global Cache list. It is a special list only used in the Real-Time mode of operation.
se | SOCIAL_ENGINEERING | This list contains threats of the SOCIAL_ENGINEERING threat type.
mw | MALWARE | This list contains threats of the MALWARE threat type for desktop platforms.
uws | UNWANTED_SOFTWARE | This list contains threats of the UNWANTED_SOFTWARE threat type for desktop platforms.
uwsa | UNWANTED_SOFTWARE | This list contains threats of the UNWANTED_SOFTWARE threat type for Android platforms.
pha | POTENTIALLY_HARMFUL_APPLICATION | This list contains threats of the POTENTIALLY_HARMFUL_APPLICATION threat type for Android platforms.

Additional lists will become available at a later date, at which time the above table will be expanded.

Update Frequency

The client should inspect the server's returned value in the field minimum_wait_duration and use that to schedule the next update of the database. This value is possibly zero, in which case the client should immediately perform another update.

Example Requests

This section documents some examples of directly using the HTTP API to access Google Safe Browsing. It is generally recommended to use a generated language binding because it will automatically handle encoding and decoding in a convenient way. Please refer to the documentation for that binding.

Here is an example HTTP request using the hashes.search method:

GET https://safebrowsing.googleapis.com/v5/hashes:search?key=INSERT_YOUR_API_KEY_HERE&hashPrefixes=WwuJdQ

The response body is a protocol-buffer formatted payload that you may then decode.

Here is an example HTTP request using the hashLists.batchGet method:

GET https://safebrowsing.googleapis.com/v5alpha1/hashLists:batchGet?key=INSERT_YOUR_API_KEY_HERE&names=se&names=mw

The response body is, once again, a protocol-buffer formatted payload that you may then decode.
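
For illustration, the hashes.search URL above could be assembled in Python as follows. The function name is hypothetical. The hashPrefixes value in the example (WwuJdQ) decodes as unpadded base64 to the four bytes 5b 0b 89 75, so this sketch assumes that each prefix is sent as base64 with the padding stripped; that encoding is inferred from the example rather than stated in the text.

```python
import base64
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://safebrowsing.googleapis.com/v5/hashes:search"


def build_search_url(api_key, hash_prefixes):
    params = [("key", api_key)]
    for prefix in hash_prefixes:
        if len(prefix) != 4:
            raise ValueError("hash prefixes must be exactly 4 bytes")
        # Assumed encoding: base64 with '=' padding removed.
        encoded = base64.b64encode(prefix).decode("ascii").rstrip("=")
        params.append(("hashPrefixes", encoded))
    # urlencode percent-escapes any '+' or '/' produced by base64.
    return SEARCH_ENDPOINT + "?" + urlencode(params)
```

Several prefixes are sent by repeating the hashPrefixes parameter, in the same way that the batchGet example repeats names.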

Migration Guide

If you are currently using the v4 Update API, there is a seamless migration path from v4 to v5 without having to reset or erase the local database. This section documents how to do that.

Converting List Updates

In v4, one would use the threatListUpdates.fetch method to download lists. In v5, one would switch to the hashLists.batchGet method.

The following changes should be made to the request:

  1. Remove the v4 ClientInfo object altogether. Instead of supplying a client's identification using a dedicated field, simply use the well-known User-Agent header. While there is no prescribed format for supplying the client identification in this header, we suggest simply including the original client ID and client version separated by a space character or a slash character.
  2. For each v4 ListUpdateRequest object:
    • Look up the corresponding v5 list name in the table above and supply that name in the v5 request.
    • Remove unneeded fields such as threat_entry_type or platform_type.
    • The state field in v4 is directly compatible with the v5 versions field. The same byte string that would be sent to the server using the state field in v4 can simply be sent in v5 using the versions field.
    • For the v4 constraints, v5 uses a simplified version called SizeConstraints. Additional fields such as region should be dropped.

The following changes should be made to the response:

  1. The v4 enum ResponseType is simply replaced by a boolean field named partial_update.
  2. The minimum_wait_duration field can now be zero or omitted. If it is, the client is requested to immediately make another request. This only happens when the client specifies in SizeConstraints a smaller constraint on max update size than the max database size.
  3. The Rice decoding algorithm for 32-bit integers will need to be adjusted. The difference is that the encoded data are encoded with a different endianness. In both v4 and v5, 32-bit hash prefixes are sorted lexicographically. But in v4, those prefixes are treated as little endian when sorted, whereas in v5 those prefixes are treated as big endian when sorted. This means that the client does not need to do any sorting, since lexicographic sorting is identical to numeric sorting with big endian. An example of this sort in the Chromium implementation of v4 can be found here. Such sorting can be removed.
  4. The Rice decoding algorithm will need to be implemented for other hash lengths.
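
The endianness point in item 3 can be checked with a few lines of Python. The three sample prefixes are arbitrary values chosen for illustration: sorting the byte strings lexicographically gives the same order as sorting them as big-endian integers, while a little-endian interpretation orders them differently.

```python
# Three arbitrary 4-byte prefixes, chosen only to make the difference visible.
prefixes = [bytes.fromhex(h) for h in ("00000002", "01000000", "00000100")]

# Python compares bytes objects lexicographically.
lexicographic = sorted(prefixes)

# Numeric order when each prefix is read as a big-endian integer (v5).
big_endian = sorted(prefixes, key=lambda p: int.from_bytes(p, "big"))

# Numeric order when each prefix is read as a little-endian integer (v4).
little_endian = sorted(prefixes, key=lambda p: int.from_bytes(p, "little"))

same_as_big = lexicographic == big_endian
same_as_little = lexicographic == little_endian
```

Here same_as_big is True and same_as_little is False, which is why a v5 client can use the decoded values in order without the extra sort that v4 required.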

Converting Hash Searches

In v4, one would use the fullHashes.find method to get full hashes. The equivalent method in v5 is the hashes.search method.

The following changes should be made to the request:

  1. Structure the code to only send hash prefixes that are exactly 4 bytes in length.
  2. Remove the v4 ClientInfo objects altogether. Instead of supplying a client's identification using a dedicated field, simply use the well-known User-Agent header. While there is no prescribed format for supplying the client identification in this header, we suggest simply including the original client ID and client version separated by a space character or a slash character.
  3. Remove the client_states field. It is no longer necessary.
  4. It is no longer necessary to include threat_types and similar fields.

The following changes should be made to the response:

  1. The minimum_wait_duration field has been removed. The client can always issue a new request on an as-needed basis.
  2. The v4 ThreatMatch object has been simplified into the FullHash object.
  3. Caching has been simplified into a single cache duration. See the above procedures for interacting with the cache.