|
FetchingURLsFromWebServices
Examples of fetching URLs from web services and processing them within Google Refine
IntroductionGoogle Refine supports the ability to fetch JSON from any web service based on values in a Google Refine project and create a new column out of it. Once you have the JSON, you can use the parseJson() function to create new columns, transform cell values, or even create new facets. In this article, I'll show a detailed example how it's done, starting with a very simple project with only 3 columns and 19 rows. I'll illustrate how to call the Facebook Graph API and the Google Translate API to annotate the project with extra information you can use to manipulate, analyze, and cleanse your dataset. DetailsHere's the initial dataset, created from information found here:
First, let's get more information about these pages using the Facebook Open Graph API. For example, you retrieve the JSON for the first row with this URL: http://graph.facebook.com/105911579453084. Google Refine can retrieve this information for all the rows at once using the "Add Column By Fetching URLs..." command, accessed via the dropdown menu to the left of the Facebook ID column:
Next, enter the information in the "Fetch URLs" dialog as follows:
Note the following:
When it's done, it will look something like this:
There is a lot of information in the JSON you might be interested in, but let's extract the information from the "link" element and the "plot_outline" element. To do this, click on the dropdown menu for Graph JSON and use the "Edit Column > Add Column Based on This Column..." command. Use the following expression: value.parseJson()["link"] When it's done, there will be a new column with the contents of the "link" element extracted from the JSON returned from the Graph API:
Note that these pages do not appear to be in English. But what language(s) are they in? The Google Language APIs can help determine an answer. First, we need to find an element in the JSON that has sufficient text. The plot_outline element is a good candidate, although not every page has one. We can extract that into a new column using the following expression: value.parseJson()["plot_outline"] Note that Google Refine has taken care of any text encoding during the parsing/extraction process:
The RESTful URL syntax for the Google Language APIs isn't as well documented as it is for using it in languages such as Javascript, but it's not too hard to figure out. The URL syntax for the Google Language Detection API looks like this: http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&key=<api_key>&q=<text> That looks easy enough, but there are some things to keep in mind:
Use the following expression for the "Add Column By Fetching URLs..." command on the Plot Outline column, replacing your own API key for <api_key> and using a throttle delay of 200 ms: "http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&key=<api_key>&q=" + escape(value.substring(0,128),"url") After Google Refine calls the URL for each row, it should look something like this:
Note that the information we really want is inside the responseData "envelope" so let's extract it using the following expression with the "Edit Cells > Transform..." command on the Lang Detect column: value.parseJson()["responseData"]
(This step is optional, but makes it easier to type subsequent expressions to extract data from JSON.) In practice, you'll want to pay attention to the confidence percentage and the isReliable flag, which is true if confidence is > 50% (0.5). But for now, we have everything we need; let's group our dataset by ISO 639-1 language code, using the expression value.parseJson()["language"] with the "Facet > Custom Text Facet..." command on the Lang Detect column (the Plot Outline and Graph JSON columns have been collapsed for readability):
That's it! Google Refine makes it easy to annotate datasets with data fetched from any web service which returns JSON -- a very powerful and useful feature. | |
Thanks this is a great tutorial
What if the web page I am going to fetch is not UTF-8? How change character set from original to UTF-8 during the "Add Column By Fetching URLs..." command?
Is there any way of only fetching a URL if the cell in the source column contains a particular value?
What if the web service doesn't allow the query value to be at the very end of the request URL, for example the Google geocoding API requires an extra parameter after the query value: http://maps.googleapis.com/maps/api/geocode/xml?address=Mountain+View,+CA&sensor=true
I can't figure out how to write the GREL expression to fetch that.
@tom - use if
@krall - use "http://maps.googleapis.com/maps/api/geocode/xml?address=" + value+ "&sensor=true"