The Resource Description Framework (RDF) is a data model and language that is quickly gaining momentum in the open-data and data-integration worlds. In SAILS we’re developing a prototype for RDF data manipulation and querying; as a consequence, over the last few weeks I’ve been surveying the available tools and libraries for programming RDF applications.
There are dozens of blog posts and articles that discuss these issues online; among them, it is worth mentioning the ones created in the context of SPQR (another JISC-funded linked data project): “Linked data tools”, “Assessing Linked Data Tools for SPQR” and “Assessing Jena and Sesame”. The SPQR guys seem to have gone for the Java solution, which many choose because it provides one of the richest and most tested suites of functionality. But it’s not the only one. In the case of SAILS in particular, we’re aiming at building a prototype application in quite a short amount of time (I’m working on this two days a week, by the way), so we need an environment that can get us going pretty fast, which is not always the case with statically typed, heavily structured languages such as Java. Moreover, ideally we would like to focus a bit on the user-interface side of things too, so it’d be nice to use an advanced web programming framework that will take care of repetitive tasks and let us focus on the interface design.
The choice has thus fallen on Python and Django as a base platform for SAILS: two environments that allow quick development and prototyping, and that are flexible, easy to use and widespread among people in different communities. Moreover, I have already done some work with these tools in the past, so I thought I could make use of this experience quite productively.
Not surprisingly, there are various possible solutions for Python RDF programming. In what follows I have therefore tried to gather information about all the existing libraries and frameworks and present it in a more ‘digestible’ way (I’m currently examining these solutions in more depth, as in the near future I’ll have to choose one of them and move on with the project; stay tuned for the results).
1. Python libraries for working with RDF
RdfLib (download) is a pretty solid and extensive rdf-programming kit for python. It contains parsers and serializers for RDF/XML, N3, NTriples, Turtle, TriX and RDFa. The library presents a Graph interface which can be backed by any one of a number of store implementations, including memory, MySQL, Redland, SQLite, Sleepycat, ZODB and SQLObject.
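To give a feel for what a graph of triples with pattern matching looks like, here is a minimal sketch written with the Python standard library only. It is a toy, not RdfLib's Graph class or its API: the `TinyGraph` class, the `EX` namespace and the sample data are all made up for illustration, and a real store adds parsing, serialization and persistent backends on top of this idea.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


class TinyGraph:
    """Toy in-memory triple store (hypothetical; NOT RdfLib's Graph)."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, triple: Triple) -> None:
        """Store one (subject, predicate, object) triple."""
        self._triples.add(triple)

    def triples(self, pattern: Pattern) -> Iterator[Triple]:
        """Yield stored triples matching the pattern; None acts as a wildcard."""
        for triple in sorted(self._triples):
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                yield triple


EX = "http://example.org/"  # hypothetical namespace, for illustration only

g = TinyGraph()
g.add((EX + "alice", EX + "knows", EX + "bob"))
g.add((EX + "alice", EX + "name", "Alice"))
g.add((EX + "bob", EX + "name", "Bob"))

# All objects of the "name" predicate, whatever the subject.
names = [o for _, _, o in g.triples((None, EX + "name", None))]
```

Swapping the Python set for a database table is, roughly speaking, what a pluggable store implementation does behind a graph interface of this kind.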
The latest release is RdfLib 3.0, although I have the feeling that many are still using the previous release, 2.4. One big difference between the two is that in 3.0 some libraries have been separated into another package (called rdfextras); among these libraries there’s also the one you need for processing sparql queries (the rdf query language), so it’s likely that you want to install that too.
A short overview of the differences between these two recent releases of RdfLib can be found here. The API documentation for RdfLib 2.4 is available here, while the one for RdfLib 3.0 can be found here. Finally, there are also some other (a bit older, but possibly useful) docs on the wiki.
Next thing, you might want to check out these tutorials:
Getting data from the Semantic Web: a nice example of how to use RdfLib and python in order to get data from DBPedia, the Semantic Web version of Wikipedia.
How can I use the Ordnance Survey Linked Data: shows how to install RdfLib and query the linked data offered by Ordnance Survey.
A quick and dirty guide to YOUR first time with RDF: another example of querying UK government data found on data.gov.uk using RdfLib and Berkeley/Sleepycat DB.
The goal of RDFAlchemy (install | apidocs | usergroup) is to give anyone who uses python an object-type API for accessing an RDF triplestore. In a nutshell, the same way that SQLAlchemy is an ORM (Object Relational Mapper) for relational database users, RDFAlchemy is an ORM (Object RDF Mapper) for semantic web users.
RdfAlchemy can also work in conjunction with other datastores, including rdflib, Sesame, and Jena. Support for SPARQL is present, although it seems less stable than the rest of the library.
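The object-RDF mapping idea can be sketched in a few lines using a Python descriptor. This is only an illustration of the general technique and not RDFAlchemy's actual API: the `RdfProperty` and `Person` classes, the list-based store and the example URIs are all hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]
FOAF = "http://xmlns.com/foaf/0.1/"


class RdfProperty:
    """Descriptor mapping a Python attribute to one RDF predicate."""

    def __init__(self, predicate: str) -> None:
        self.predicate = predicate

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        # Return the first object found for (subject, predicate), if any.
        for s, p, o in obj.store:
            if s == obj.uri and p == self.predicate:
                return o
        return None

    def __set__(self, obj, value: str) -> None:
        # Replace any existing value for this subject/predicate pair.
        obj.store[:] = [t for t in obj.store
                        if not (t[0] == obj.uri and t[1] == self.predicate)]
        obj.store.append((obj.uri, self.predicate, value))


class Person:
    """Hypothetical mapped class: attributes are backed by triples."""

    name = RdfProperty(FOAF + "name")
    mbox = RdfProperty(FOAF + "mbox")

    def __init__(self, uri: str, store: List[Triple]) -> None:
        self.uri = uri
        self.store = store


store: List[Triple] = [("http://example.org/alice", FOAF + "name", "Alice")]
alice = Person("http://example.org/alice", store)
alice.mbox = "mailto:alice@example.org"  # adds a triple to the store
```

Reading `alice.name` looks up a triple and assigning `alice.mbox` writes one, so application code never touches the triples directly.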
FuXi is a Python-based, bi-directional logical reasoning system for the semantic web. It requires rdflib 2.4.1 or 2.4.2 and it is not compatible with rdflib 3. FuXi aims to be the ‘engine for contemporary expert systems based on the Semantic Web technologies’. The documentation can be found here; it might be useful also to look at the user-manual and the discussion group.
In general, it looks as if FuXi can offer a complete solution for knowledge representation and reasoning over the semantic web; it is quite sophisticated and well documented (partly via several academic articles). The downside is that, if your goal is hacking together a linked data application, FuXi is probably just too complex and difficult to learn.
ORDF (download | docs) is the Open Knowledge Foundation’s library of support infrastructure for RDF. It is based on RDFLib and contains an object-description mapper, support for multiple back-end indices, message passing, revision history and provenance, a namespace library and a variety of helper functions and modules to ease integration with the Pylons framework.
Django-RDF (download | faq | discussiongroup) is an RDF engine implemented in a generic, reusable Django app, providing complete RDF support to Django projects without requiring any modifications to existing framework or app source code. The philosophy is simple: do your web development using Django just like you’re used to, then turn the knob and – with no additional effort – expose your project on the semantic web.
Django-RDF can expose models from any other app as RDF data. This makes it easy to write new views that return RDF/XML data, and/or query existing models in terms of RDFS or OWL classes and properties using (a variant of) the SPARQL query language. SPARQL in, RDF/XML out – two basic semantic web necessities. Django-RDF also implements an RDF store using its internal models such as Concept, Predicate, Resource, Statement, Literal, Ontology, Namespace, etc. The SPARQL query engine returns query sets that can freely mix data in the RDF store with data from existing Django models.
The major downside of this library is that it doesn’t seem to be maintained anymore; the last release is from 2008, and there seem to be various conflicts with recent versions of Django. A real shame!
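To make the idea of exposing application models as RDF a bit more concrete, here is a rough standard-library sketch that turns a record into N-Triples lines. It does not use Django or Django-RDF: the `Book` dataclass stands in for a Django model, and the `BASE` namespace and the URI layout are assumptions made up for the example.

```python
from dataclasses import dataclass, fields

BASE = "http://example.org/"  # hypothetical namespace for subjects and predicates


@dataclass
class Book:
    """Stand-in for a Django model (hypothetical)."""
    id: int
    title: str
    author: str


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(obj) -> str:
    """Serialize every field except `id` as one N-Triples statement."""
    subject = "<{}{}/{}>".format(BASE, type(obj).__name__.lower(), obj.id)
    lines = []
    for field in fields(obj):
        if field.name == "id":
            continue
        predicate = "<{}vocab/{}>".format(BASE, field.name)
        literal = escape_literal(str(getattr(obj, field.name)))
        lines.append('{} {} "{}" .'.format(subject, predicate, literal))
    return "\n".join(lines)


output = to_ntriples(Book(1, "Dune", "Frank Herbert"))
```

A real mapping would also reuse existing vocabularies for the predicates and emit typed literals, but the principle of one statement per model field stays the same.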
Djubby (download | docs) is a Linked Data frontend for SPARQL endpoints for the Django Web framework, adding a Linked Data interface to any existing SPARQL-capable triple store.
Djubby is quite inspired by Richard Cyganiak’s Pubby (written in Java): it provides a Linked Data interface to local or remote SPARQL protocol servers, it provides dereferenceable URIs by rewriting URIs found in the SPARQL-exposed dataset into the djubby server’s namespace, and it provides a simple HTML interface showing the data available about each resource, taking care of handling 303 redirects and content negotiation.
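The 303 redirect plus content negotiation pattern mentioned above can be sketched as follows. This is not Djubby's code: the `/resource/`, `/page/` and `/data/` URL layout is an assumption borrowed from the way Pubby-style frontends are usually set up, and the Accept-header parsing is deliberately simplified.

```python
from typing import Tuple

# Media types treated as "the client wants RDF" (an illustrative selection).
RDF_TYPES = ("application/rdf+xml", "text/turtle", "text/n3")


def prefers_rdf(accept_header: str) -> bool:
    """Return True if the highest-ranked media type in Accept is an RDF one."""
    best_type, best_q = "", -1.0
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type, q = parts[0].lower(), 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        # On ties the type listed first wins.
        if media_type and q > best_q:
            best_type, best_q = media_type, q
    return best_type in RDF_TYPES


def redirect_for(resource_id: str, accept_header: str) -> Tuple[int, str]:
    """Answer a request for /resource/<id> with a 303 to the matching document."""
    target = "data" if prefers_rdf(accept_header) else "page"
    return 303, "/{}/{}".format(target, resource_id)


browser = redirect_for("Berlin", "text/html,application/xhtml+xml;q=0.9")
crawler = redirect_for("Berlin", "application/rdf+xml")
```

Here `browser` points at the HTML page and `crawler` at the RDF document, while the `/resource/Berlin` URI itself keeps identifying the thing being described.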
Redland (download | docs | discussiongroup) is an RDF library written in C and including several high-level language APIs providing RDF manipulation and storage. Redland also makes available a Python interface (intro | apidocs) that can be used to manipulate RDF triples.
This library seems to be quite complete and is actively maintained; the only potential downside is the installation process. In order to use the python bindings you need to install the C library too (which in turn depends on other C libraries), so (depending on your programming experience and operating system used) just getting up and running might become a challenge.
SuRF (install | docs) is an Object – RDF Mapper based on the RDFLIB python library. It exposes the RDF triple sets as sets of resources and seamlessly integrates them into the Object Oriented paradigm of python in a similar manner as ActiveRDF does for ruby.
Other smaller (but possibly useful) python libraries for rdf:
Sparql Interface to python: a minimalistic solution for querying sparql endpoints using python (download | apidocs)
PySparql: again, a minimal library that does SELECT and ASK queries on an endpoint which implements the HTTP (GET or POST) bindings of the SPARQL Protocol (code page)
SPARQL Endpoint interface to Python: another little utility for talking to a SPARQL endpoint, including having select-results mapped to rdflib terms or returned in JSON format (download page)
Sparta: a simple, resource-centric API for RDF graphs, built on top of RDFLIB.
Oort: another Python toolkit for accessing RDF graphs as plain objects, based on RDFLIB. The project homepage hasn’t been updated for a while, although there are traces of recent activity on its google project page.
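What the small SPARQL clients in this list do is, at heart, quite simple: encode a query into an HTTP request and decode the results. Here is a minimal sketch with the standard library only, assuming an endpoint that speaks the SPARQL Protocol over HTTP GET and returns the standard SPARQL JSON results format. The endpoint URL and the sample payload are made up, and no network call is made.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode


def build_query_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as an HTTP GET URL (the `query` parameter)."""
    return endpoint + "?" + urlencode({"query": query})


def parse_select_results(payload: str) -> List[Dict[str, str]]:
    """Turn a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(payload)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append({v: binding[v]["value"] for v in variables if v in binding})
    return rows


# Hypothetical endpoint, for illustration only.
url = build_query_url("http://example.org/sparql",
                      "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")

# Hand-written sample of what an endpoint might send back.
sample = ('{"head": {"vars": ["s"]}, "results": {"bindings": '
          '[{"s": {"type": "uri", "value": "http://example.org/a"}}]}}')
rows = parse_select_results(sample)
```

To actually run the query you would fetch `url` (for example with `urllib.request.urlopen`), asking for `application/sparql-results+json` in the Accept header, and pass the response body to `parse_select_results`.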
2. RDF Triplestores that are python-friendly
An important component of a linked-data application is the triplestore (that is, an RDF database): many commercial and non-commercial triplestores are available, but only a few offer out-of-the-box python interfaces. Here’s a list of them:
AllegroGraph RDFStore is a high-performance, persistent RDF graph database. AllegroGraph uses disk-based storage, enabling it to scale to billions of triples while maintaining superior performance. Unfortunately, the official version of AllegroGraph is not free, but it is possible to get a free version of it (it limits the DB to 50 million triples, so although useful for testing or development it doesn’t seem a good solution for a production environment).
The Allegro Graph Python API (download | docs | reference) offers convenient and efficient access to an AllegroGraph server from a Python-based application. This API provides methods for creating, querying and maintaining RDF data, and for managing the stored triples.
A hands-on overview of what it’s like to work with AllegroGraph and python can be found here: Getting started with AllegroGraph.
Virtuoso Universal Server is a middleware and database engine hybrid that combines the functionality of a traditional RDBMS, ORDBMS, virtual database, RDF, XML, free-text, web application server and file server in a single system. Rather than having dedicated servers for each of these functions, Virtuoso is a “universal server”: a single multithreaded server process that implements multiple protocols. The open source edition of Virtuoso Universal Server is also known as OpenLink Virtuoso.
Virtuoso from Python is intended to be a collection of modules for interacting with OpenLink Virtuoso from python. The goal is to provide drivers for `SQLAlchemy` and `RDFLib`. The package is installable from the Python Package Index and source code for development is available in a mercurial repository on BitBucket.
A possibly useful example of using Virtuoso from python: SPARQL Guide for Python Developer.
Sesame is an open-source framework for querying and analyzing RDF data (download | documentation). Sesame supports two query languages: SeRQL and Sparql. Sesame’s API differs from comparable solutions in that it offers a (stackable) interface through which functionality can be added, and the storage engine is abstracted from the query interface (many other Triplestores can in fact be used through the Sesame API).
It looks as if the best way to interact with Sesame is by using Java; however there is also a pythonic API called pySesame. This is essentially a python wrapper for Sesame’s REST HTTP API, so the range of operations supported (Log in, Log out, Request a list of available repositories, Evaluate a SeRQL-select, RQL or RDQL query, Extract/upload/remove RDF from a repository) is somewhat limited (for example, there does not seem to be any native SPARQL support).
A nice introduction to using Sesame with Python (without pySesame though) can be found in this article: Getting Started with RDF and SPARQL Using Sesame and Python.
The Talis Platform (faq | docs) is an environment for building next generation applications and services based on Semantic Web technologies. It is a hosted system which provides an efficient, robust storage infrastructure. Both arbitrary documents and RDF-based semantic content are supported, with sophisticated query, indexing and search features. Data uploaded on the Talis platform are organized into stores: a store is a grouping of related data and metadata. For convenience each store is assigned one or more owners who are the people who have rights to configure the access controls over that data and metadata. Each store provides a uniform REST interface to the data and metadata it manages.
Stores don’t come free of charge, but through the Talis Connected Commons scheme it is possible to have quite large amounts of store space for free. The scheme is intended to support a wide range of different forms of data publishing: for example, scientific researchers seeking to share their research data; dissemination of public domain data from a variety of different charitable, public sector or volunteer organizations; or open data enthusiasts compiling data sets to be shared with the web community.
Good news for pythonistas too: pynappl is a simple client library for the Talis Platform. It relies on rdflib 3.0 and draws inspiration from other similar client libraries. Currently it is focussed mainly on managing data loading and manipulation of Talis Platform stores (this blog post says more about it).
Before trying out the Talis platform you might find this blog post useful: Publishing Linked Data on the Talis Platform.
4store (download | features | docs) is a database storage and query engine that holds RDF data. It has been used by Garlik as their primary RDF platform for three years, and has proved itself to be robust and secure.
4store’s main strengths are its performance, scalability and stability. It does not provide many features over and above RDF storage and SPARQL queries, but if you are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist.
4store offers a number of client libraries, two of which are for python. The first, HTTP4Store, is a client for the 4Store httpd service that allows for easy handling of sparql results and for adding, appending and deleting graphs. The second is py4s, although this seems to be a much more experimental library (geared towards multi-process queries).
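As a rough idea of what adding, appending and deleting graphs over HTTP involves, the sketch below builds (but does not send) the corresponding requests with the standard library. It is not HTTP4Store's API. The server address, the `/data/` path, and the `graph`, `data` and `mime-type` form fields reflect my understanding of the 4store HTTP server's conventions and should be treated as assumptions to check against the 4store documentation.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:8000"  # assumed address of a local 4store HTTP server
TURTLE = "application/x-turtle"  # assumed media type label for Turtle data


def replace_graph(graph_uri: str, turtle: str) -> Request:
    """PUT: create the named graph or replace its whole content."""
    return Request(
        "{}/data/{}".format(BASE, graph_uri),
        data=turtle.encode("utf-8"),
        headers={"Content-Type": TURTLE},
        method="PUT",
    )


def append_to_graph(graph_uri: str, turtle: str) -> Request:
    """POST: append triples to the named graph via form fields."""
    body = urlencode({"graph": graph_uri, "data": turtle, "mime-type": TURTLE})
    return Request(BASE + "/data/", data=body.encode("ascii"), method="POST")


def delete_graph(graph_uri: str) -> Request:
    """DELETE: remove the named graph."""
    return Request("{}/data/{}".format(BASE, graph_uri), method="DELETE")


GRAPH = "http://example.org/g"  # hypothetical graph name
put_req = replace_graph(GRAPH, "<http://example.org/a> <http://example.org/b> 1 .")
post_req = append_to_graph(GRAPH, "<http://example.org/a> <http://example.org/c> 2 .")
del_req = delete_graph(GRAPH)
```

Passing any of these objects to `urllib.request.urlopen` would perform the operation against a running server.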
Furthermore, there is also an application for the Django web framework called django-4store that makes it easier to query and load rdf data into 4store when running Django. The application offers some support for constructing sparql-based Django views.
This blog post shows how to install 4store: Getting Started with RDF and SPARQL Using 4store and RDF.rb.
——————————
End of the survey.. have I missed out on something? Please let me know if I did – I’ll try to keep adding stuff to this list as I move on with the project work!
p.s.
A modified version of this post has been published also here