Introduction

The Google Knowledge Graph Search API was recently made publicly available. It allows you to use Schema.org types to limit search results to entities that have been categorized under a given type.

For example, a search may be limited to the schema.org type "Person" or "TVSeries" to enhance the quality of the results. Interestingly, the data is returned as JSON-LD, which provides a mapping between the JSON keys and URIs - giving each key a unique, dereferenceable identifier that clarifies its meaning.

Further, JSON-LD (the LD is for Linked Data) is interoperable with the Resource Description Framework (RDF), which offers a graph-based representation of knowledge built on subject-predicate-object "triples." For an API called "Knowledge Graph", it makes sense to use a format compatible with this W3C standard, which underlies the Web's knowledge representation formalism: the Web Ontology Language (OWL).
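To make the triple idea concrete, here is a minimal sketch (using a made-up example.org identifier) of a single subject-predicate-object statement, built with the rdflib library introduced below:

import rdflib

# One hypothetical triple: (subject, predicate, object)
g = rdflib.Graph()
g.add((rdflib.URIRef("http://example.org/Archer"),  # subject
       rdflib.URIRef("http://schema.org/name"),     # predicate
       rdflib.Literal("Archer")))                   # object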

What this buys us is the ability to use other W3C standards for working with the results of the Knowledge Graph, such as the SPARQL Protocol and RDF Query Language. In this post, I am going to demonstrate how we can use a few Python libraries to:

  • access the Knowledge Graph API
  • convert the JSON-LD results into RDF as a Turtle document
  • query the RDF graph using SPARQL

Python Libraries

First we need a few libraries for handling requests to the Knowledge Graph API (requests), converting the JSON-LD results into RDF (pyld), and parsing the RDF into a queryable graph (rdflib). You can find these libraries at the links below:

  • requests: http://docs.python-requests.org
  • pyld: https://github.com/digitalbazaar/pyld
  • rdflib: https://github.com/RDFLib/rdflib

In [92]:
import os
import json

import pyld
import rdflib
import requests

Accessing the Knowledge Graph API

Here we first need to obtain a key from the Google Developers Console and enable access. You won't see the Knowledge Graph API show up in the list of popular APIs, so you'll need to search for it in the API Manager. Once you enable the API and generate a key, you may need to add your IP address to the section on "Accept requests from these server IP addresses".

Below, I've saved my key to a file and read it in. Just for fun, I am searching the schema.org type "TVSeries" for a cartoon series called "Archer" - of course, you can search for something more serious, but one thing I found is that many schema.org types are not supported. For example, an error was raised when I searched the schema.org types "Drug", "AnatomicalStructure", and "MedicalStudy".

In [94]:
kg_key = open(os.path.join(os.path.expanduser('~'), '.knowledge-graph-key')).read()
r = requests.get("https://kgsearch.googleapis.com/v1/entities:search", 
                 params=dict(query="Archer", key=kg_key, types="TVSeries"))
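Before parsing the response, it is worth a quick sanity check that the request succeeded (an optional step, not part of the original flow):

# Raise an exception if the API returned a non-2xx status
r.raise_for_status()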

Parsing and Examining the Results

This is pretty straightforward. We just parse the returned JSON-LD using the standard json library - note that JSON-LD can be treated as plain old JSON.

You can see in the results that the initial section includes an @context that maps each shorthand key to the vocabulary it comes from. There are a number of returned 'hits', each with an associated score. Here, the top score of '43.221546' belongs to the TV show we are interested in - Archer.

In [100]:
jsonld = json.loads(r.text)
print(r.text)
{
  "@context": {
    "@vocab": "http://schema.org/",
    "goog": "http://schema.googleapis.com/",
    "EntitySearchResult": "goog:EntitySearchResult",
    "detailedDescription": "goog:detailedDescription",
    "resultScore": "goog:resultScore",
    "kg": "http://g.co/kg"
  },
  "@type": "ItemList",
  "itemListElement": [
    {
      "@type": "EntitySearchResult",
      "result": {
        "@id": "kg:/m/06_wvhl",
        "name": "Archer",
        "@type": [
          "TVSeries",
          "Thing"
        ],
        "description": "American animated series",
        "image": {
          "contentUrl": "http://t3.gstatic.com/images?q=tbn:ANd9GcQwUXmJt_InhAr39HEyyv8l4CIiom0RvTvNYcf-JoCN8cpXOyon",
          "url": "https://en.wikipedia.org/wiki/Archer_(TV_series)",
          "license": "http://creativecommons.org/licenses/by/2.0"
        },
        "detailedDescription": {
          "articleBody": "Archer is an American adult animated television series created by Adam Reed for the FX network. A preview of the series aired on September 17, 2009. The first season premiered on January 14, 2010. ",
          "url": "http://en.wikipedia.org/wiki/Archer_(TV_series)",
          "license": "https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License"
        },
        "url": "http://www.fxnetworks.com/archer"
      },
      "resultScore": 43.221546
    },
...
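Since this is plain JSON, we can already pick out the top-scoring hit programmatically - a small sketch using the keys shown above:

# Grab the highest-scoring hit straight from the parsed JSON
top = max(jsonld["itemListElement"], key=lambda hit: hit["resultScore"])
print(top["result"]["name"], top["resultScore"])  # -> Archer 43.221546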

Conversion to RDF

Now, we can of course just parse the JSON object above using Python, or test out our JavaScript skills - which is actually what makes JSON-LD nice: you can encode these Semantic Web ideas into an API without forcing developers to know anything about the Semantic Web technology stack.
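As a quick illustration of what the @context buys us, pyld can "expand" the document so that every shorthand key is replaced by the full URI it maps to (a small sketch; the output is truncated for brevity):

# Expand the JSON-LD: shorthand keys become the full URIs from @context
expanded = pyld.jsonld.expand(jsonld)
print(json.dumps(expanded, indent=2)[:300])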

To convert to RDF, we use the PyLD library to normalize into a format called N-Quads, which can include an additional URI in each statement indicating the graph that the triple belongs to. Here there is no such named graph, so the output is effectively N-Triples.

In [101]:
normalized = pyld.jsonld.normalize(jsonld, {'algorithm': 'URDNA2015', 'format': 'application/nquads'})
print(normalized)
<http://g.co/kg/g/11cknytgw6> <http://schema.googleapis.com/detailedDescription> _:c14n4 .
<http://g.co/kg/g/11cknytgw6> <http://schema.org/name> "Cassius & Clay" .
<http://g.co/kg/g/11cknytgw6> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/TVSeries> .
<http://g.co/kg/g/11cknytgw6> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Thing> .
<http://g.co/kg/m/0264z1w> <http://schema.googleapis.com/detailedDescription> _:c14n35 .
<http://g.co/kg/m/0264z1w> <http://schema.org/description> "British television show" .
<http://g.co/kg/m/0264z1w> <http://schema.org/name> "The Dame Edna Experience" .
<http://g.co/kg/m/0264z1w> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/TVSeries> .
<http://g.co/kg/m/0264z1w> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Thing> .
<http://g.co/kg/m/02qcfrx> <http://schema.googleapis.com/detailedDescription> _:c14n20 .
...

Parse the RDF into a Queryable Graph

Next, we use the RDFLib library to parse this data into a graph (the 'n3' parser handles it, since Notation3 is a superset of N-Triples) that we can then serialize into any of the RDF formats. In this case, I show an example of serializing into the Turtle format, which is a bit easier to read. Note that RDF does not maintain any ordering of triples, so our show is no longer listed as the first element. Also, you will see that the @context section is gone, replaced by @prefix declarations identifying the namespace each term comes from.

In [103]:
g = rdflib.Graph()
g.parse(data=normalized, format='n3')
print(g.serialize(format='turtle'))
@prefix ns1: <http://schema.org/> .
@prefix ns2: <http://schema.googleapis.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://g.co/kg/m/06_wvhl> a ns1:TVSeries,
        ns1:Thing ;
    ns2:detailedDescription [ ns1:articleBody "Archer is an American adult animated television series created by Adam Reed for the FX network. A preview of the series aired on September 17, 2009. The first season premiered on January 14, 2010. " ;
            ns1:license "https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License" ;
            ns1:url "http://en.wikipedia.org/wiki/Archer_(TV_series)" ] ;
    ns1:description "American animated series" ;
    ns1:image [ ns1:contentUrl "http://t3.gstatic.com/images?q=tbn:ANd9GcQwUXmJt_InhAr39HEyyv8l4CIiom0RvTvNYcf-JoCN8cpXOyon" ;
            ns1:license "http://creativecommons.org/licenses/by/2.0" ;
            ns1:url "https://en.wikipedia.org/wiki/Archer_(TV_series)" ] ;
    ns1:name "Archer" ;
    ns1:url "http://www.fxnetworks.com/archer" .

[] a ns1:ItemList ;
    ns1:itemListElement [ a ns2:EntitySearchResult ;
            ns2:resultScore 1.214568e+01 ;
            ns1:result <http://g.co/kg/m/06w21l8> ],
        [ a ns2:EntitySearchResult ;
            ns2:resultScore 1.142771e+00 ;
            ns1:result <http://g.co/kg/m/0522jn> ],
...
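With the graph in hand, we don't even need SPARQL for simple lookups; rdflib can fetch a single value directly (a sketch using the entity identifier from the output above):

# Direct lookup with the rdflib API, no SPARQL required
schema = rdflib.Namespace("http://schema.org/")
archer = rdflib.URIRef("http://g.co/kg/m/06_wvhl")
print(g.value(archer, schema.name))         # -> Archer
print(g.value(archer, schema.description))  # -> American animated series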

Querying the Knowledge Graph with SPARQL

Now that we have our graph loaded into RDFLib, we can issue a SPARQL query against it. Note that this query is only being issued against the results we got back from the API, not the entire Knowledge Graph, but it will give you a flavor of the query syntax.

To break this down a little, we have four keywords to parse: SELECT, WHERE, ORDER BY DESC, and LIMIT. SELECT is similar to SQL and contains our (unintuitively named) 'projection' criteria - what we want returned to us as a table. The real meat is in the WHERE clause, which provides our 'selection' criteria using graph pattern matching over the triples in the graph. Here I first look up the score of each search result, then gather the associated name, description, url, and image. ORDER BY DESC sorts the results with the highest score at the top, and LIMIT grabs just the top result. (Note that we can use the ns1 and ns2 prefixes without declaring them because rdflib bound them to the graph when we serialized it above.)

In [117]:
q = """SELECT ?name ?description ?url ?score ?image
       WHERE {?b a ns2:EntitySearchResult ;
                 ns2:resultScore ?score ;
                 ns1:result ?result .
              ?result ns1:description ?description ;
                      ns1:name ?name ;
                      ns1:url  ?url ;
                      ns1:image ?b_image .
              ?b_image ns1:contentUrl ?image .}
       ORDER BY DESC(?score)
       LIMIT 1
"""
print(g.query(q).serialize(format='csv'))
name,description,url,score,image
Archer,American animated series,http://www.fxnetworks.com/archer,43.221546,http://t3.gstatic.com/images?q=tbn:ANd9GcQwUXmJt_InhAr39HEyyv8l4CIiom0RvTvNYcf-JoCN8cpXOyon
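Besides serializing to CSV, the query result can also be iterated row by row, with each SELECT variable available as an attribute:

# Iterate the result rows as Python objects instead of serializing
for row in g.query(q):
    print(row.name, row.score)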

Summary

So here I've shown how you can get started with converting data from the Knowledge Graph into RDF and querying it using the SPARQL query language. This only scratches the surface, as the real power of these technologies is integrating data from external sources. For example, as a next step we may want to query DBpedia using the URL we acquired here to gain additional information about actors or episodes, and further explore the information retrieved as Linked Data. Of course, we are only looking at a TV show here, but as additional schema.org types are supported by the Knowledge Graph API, I imagine there will be links and hooks that enable searching across drug databases, interoperating the data retrieved from Google with more traditional sources of Linked Data from bioinformatics databases, and forming a Giant Global Graph of linked data.
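As a taste of that next step, here is a minimal sketch of querying DBpedia's public SPARQL endpoint with the SPARQLWrapper library. The endpoint URL, resource URI, and property here are my assumptions for illustration - they are not returned by the Knowledge Graph API:

# Hypothetical follow-up: ask DBpedia for the English abstract of the show
# (assumes the public endpoint and SPARQLWrapper: pip install SPARQLWrapper)
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/Archer_(TV_series)>
          <http://dbpedia.org/ontology/abstract> ?abstract .
      FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["abstract"]["value"])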
