Elasticsearch & Django Tutorial

Search is now an integral part of almost every application we build. Everything from a blog to a social media app to a business intelligence dashboard requires search capability. Beyond finding specific information, end users expect search to be fast, real-time, and able to provide deeper insight into their data. Adding to the complexity is the exponential growth of web content: hundreds of terabytes of structured and unstructured content are becoming the norm.

At OpenCrowd we have been using Elasticsearch to build our clients' search infrastructure. One of our largest deployments has been Govini, where you can search and analyze government procurement data. We chose Elasticsearch over the alternatives because, in addition to providing full-text search, it provides meaningful real-time data analytics and is highly scalable, with strong support for a clustered data infrastructure. It also provides APIs that enable easy integration into a custom application environment.

During our implementation we faced several challenges as we optimized the search infrastructure: for example, how best to implement fast search in an MVC model, or how much data to cache (and how) to avoid hitting the database multiple times for a single search. To demonstrate some of the cool capabilities of Elasticsearch and share some lessons learned, I have put together a three-part tutorial series. These tutorials should get you started with building a search application that takes advantage of Elasticsearch's capabilities.

Luckily for us, Django (and more broadly Python) gives us a lot of cool tools out of the box that make implementing Elasticsearch as our datastore relatively easy. The best news is, I'm going to show you how in this series, the first of three parts. In this part, I will go over building the models.py functions that will serve as our data sources. Later parts will cover views and templates.

For the purposes of this part, I am going to assume you know how to set up a Django project; if you don't, consult the Django documentation. You also need access to an Elasticsearch cluster or node. If you don't already have one, it's easy to set up: just follow the Elasticsearch documentation. The developers wrote an excellent walk-through that takes you from downloading the software to running a cluster.
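
The model functions below all call a get_connection() helper to obtain a pyes connection, along with an ELASTICSEARCH_INDEX setting. Neither is shown in this excerpt, so here is a minimal sketch of what they might look like, assuming pyes and a local node; the specific values are assumptions, so adjust them to match your own setup:

# Minimal sketch of the connection helper the functions below rely on.
# ELASTICSEARCH_URL and ELASTICSEARCH_INDEX are assumed settings; change
# them to match your own cluster and index name.
from pyes import ES

ELASTICSEARCH_URL = '127.0.0.1:9200'  # host:port of an Elasticsearch node
ELASTICSEARCH_INDEX = 'blog'          # the index used throughout this tutorial


def get_connection():
    # pyes manages the underlying HTTP connections; constructing the ES
    # object per call keeps the example simple.
    return ES(ELASTICSEARCH_URL)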

So to keep things simple, we'll create a straightforward Elasticsearch-based site: in the time-honored tradition of web tutorials, a blogging engine. The only RDBMS functionality we will use for this exercise is the django.contrib.auth app with a SQLite3 backend.
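
For reference, a minimal settings.py sketch for that setup might look like the following; the database file name and app name are hypothetical, and your project's settings will contain more than this:

# Illustrative settings.py excerpt for the setup described above.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'blog.sqlite3',  # hypothetical file name
    }
}

INSTALLED_APPS = (
    'django.contrib.auth',
    'django.contrib.contenttypes',  # required by django.contrib.auth
    'django.contrib.sessions',
    'blogengine',  # hypothetical name for the tutorial app
)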

As a result of using Elasticsearch as our data source, we will be unable to use regular Django models; however, we can create the necessary data-access functions. For consistency, we’ll create these functions in models.py.

Ok, confession time: I designed this Elasticsearch schema/mapping in a contrived manner. I made the "post" document type a child document of "author". Why? Mostly because later in this series we'll delve into working with parent-child documents, and once you've defined a mapping you can't change it without deleting the mapping (and any documents of that type). I also did this because parent-child documents aren't really all that scary to work with, and it's easier if we just get them out of the way. For the most part, you can safely ignore the parent-child relationship in the mapping; however, for a few types of search queries, parent/child can be extremely handy.
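
To make the relationship concrete, here is a sketch of how such a parent-child mapping might be declared with pyes. The field names are illustrative only; the repo contains the actual mapping used in this tutorial:

# Sketch of creating the index and declaring "post" as a child of
# "author". Field names are illustrative; see the repo for the real
# mapping.
es = get_connection()
es.create_index(ELASTICSEARCH_INDEX)
es.put_mapping('author', {'author': {
    'properties': {
        'name': {'type': 'string'},
        'email': {'type': 'string', 'index': 'not_analyzed'},
    },
}}, [ELASTICSEARCH_INDEX])
es.put_mapping('post', {'post': {
    '_parent': {'type': 'author'},  # this makes "post" a child type
    'properties': {
        'title': {'type': 'string'},
        'body': {'type': 'string'},
        'created_on': {'type': 'date'},
    },
}}, [ELASTICSEARCH_INDEX])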

You can get a copy of this tutorial code by cloning the GitHub repo.

We'll start with the commenting portion of this blog engine, and the two model-related functions: get_comments and get_comment_count. These two functions serve as entry points for querying the comment data type; they support some of the concepts of a Django queryset. Namely, you can pass the query parameters in the function call as named parameters, like so: get_comments(is_spam=False) finds all non-spam comments that have been posted.

# These query classes come from the pyes library; _parse_datetime and
# get_reference are small helpers defined elsewhere in models.py.
from pyes.filters import ANDFilter, TermFilter
from pyes.query import FilteredQuery, MatchAllQuery


def get_comments(*order_by, **kwargs):
    # Each keyword argument becomes an exact-match filter on that field.
    filters = [TermFilter(key, value) for key, value in kwargs.items()]
    if filters:
        q = FilteredQuery(MatchAllQuery(), ANDFilter(filters))
    else:
        q = MatchAllQuery()
    # Translate Django-style ordering names ('-field' means descending).
    ordering = {}
    for field in order_by:
        if field.startswith('-'):
            field = field[1:]
            ordering[field] = 'desc'
        else:
            ordering[field] = 'asc'
    if ordering:
        q = q.search(sort=ordering)
    es = get_connection()
    rs = es.search(q, ELASTICSEARCH_INDEX, 'comment')
    results = []
    for row in rs:
        # Attach the referenced document and parse the timestamp.
        row['reference'] = get_reference(row['reference_to'], row['reference_type'])
        row['created_on'] = _parse_datetime(row['created_on'])
        results.append(row)
    return results


def get_comment_count(**kwargs):
    # Simple but not cheap: this fetches the matching comments and counts
    # them client-side.
    return len(get_comments(**kwargs))
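
As a quick, hypothetical usage example (the is_spam field is an assumption carried over from the example above):

# Hypothetical usage; field names assume the comment mapping described above.
recent = get_comments('-created_on')                # all comments, newest first
ham = get_comments('-created_on', is_spam=False)    # non-spam comments only
total = get_comment_count(is_spam=False)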

How useful would a blog engine be without the ability to get to the actual posts? Not very. So, we have two functions in models.py for this: get_posts and get_post. As their names imply, get_posts fetches all the post objects in the system, while get_post fetches a specific post.

from datetime import datetime


def get_post(author, blog_id):
    es = get_connection()
    # "post" is a child of "author", so the get request must be routed by
    # the parent id to reach the shard the document lives on.
    obj = es.get(ELASTICSEARCH_INDEX, 'post', blog_id, routing=author)
    obj['author'] = get_author(id=author)
    if 'created_on' in obj:
        obj['created_on'] = _parse_datetime(obj['created_on'])
    else:
        obj['created_on'] = datetime.utcnow()
    if 'updated_on' in obj:
        obj['updated_on'] = _parse_datetime(obj['updated_on'])
    else:
        obj['updated_on'] = datetime.utcnow()
    return obj


def get_posts(*order_by):
    es = get_connection()
    query = MatchAllQuery()
    # Same Django-style ordering translation as in get_comments.
    ordering = {}
    for field in order_by:
        if field.startswith('-'):
            field = field[1:]
            ordering[field] = 'desc'
        else:
            ordering[field] = 'asc'
    if ordering:
        query = query.search(sort=ordering)
    else:
        query = query.search()
    # Cache author lookups so each author is fetched only once per call.
    author_cache = {}
    rs = es.search(query, ELASTICSEARCH_INDEX, 'post', scan=True)
    results = [row for row in rs]
    for row in results:
        try:
            row['author'] = author_cache[row['author']]
        except KeyError:
            author = get_author(id=row['author'])
            author_cache[row['author']] = author
            row['author'] = author
        except TypeError:
            pass
        if 'created_on' in row:
            row['created_on'] = _parse_datetime(row['created_on'])
        else:
            row['created_on'] = datetime.utcnow()
        if 'updated_on' in row:
            row['updated_on'] = _parse_datetime(row['updated_on'])
        else:
            row['updated_on'] = datetime.utcnow()
    return results
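
A hypothetical usage example; the author and post ids here are purely illustrative:

# Hypothetical usage of the two post functions defined above.
posts = get_posts('-created_on')                     # every post, newest first
post = get_post('some-author-id', 'some-post-id')    # one specific post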

This function lets us fetch and order posts. Like get_comments above, it supports some of the concepts of a Django queryset. You might note that the code that parses out the ordering parameters is the same as in get_comments. The duplication is for readability; in actual practice you would want to condense it.
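
For example, the ordering logic could be pulled into a small shared helper; _build_ordering here is a hypothetical name, shown only as a sketch of that refactoring:

# A hypothetical helper that get_comments and get_posts could share.
def _build_ordering(order_by):
    ordering = {}
    for field in order_by:
        if field.startswith('-'):
            ordering[field[1:]] = 'desc'
        else:
            ordering[field] = 'asc'
    return ordering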

Another requirement is the ability to add both blog posts and comments, so in models.py we have two more functions for this: index_post and index_comment. (Note: I refer to these as index_XXX because adding new documents in Elasticsearch is called "indexing".)

Here are the two functions that handle storing the data in the Elasticsearch nodes. The main differences between them are the parent-child nature of the "post" data type and the Elasticsearch type each one references. Because posts are children of authors, index_post requires an "author" id from Elasticsearch to be able to properly store the document in the Elasticsearch nodes.

def index_post(author, post):
    es = get_connection()
    author_id = author.get_meta().id
    post['author'] = author_id
    # Child documents must be indexed with their parent's id so they land
    # on the same shard as the parent "author" document.
    return es.index(post, ELASTICSEARCH_INDEX, 'post', parent=author_id)._meta._id


def index_comment(comment):
    es = get_connection()
    # Comments are a top-level type, so no parent routing is needed.
    return es.index(comment, ELASTICSEARCH_INDEX, 'comment')._meta._id
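
Hypothetical usage, assuming you already have an author document in hand; the field values are illustrative:

# Hypothetical usage; get_author and the field names are as defined above.
author = get_author(id='some-author-id')
post_id = index_post(author, {'title': 'Hello', 'body': 'First post!'})
comment_id = index_comment({'reference_to': post_id,
                            'reference_type': 'post',
                            'body': 'Nice post.',
                            'is_spam': False})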

We don't do much with authors in this file apart from providing a way to get a particular author. Again, as with get_comments, we support some of the Django queryset named-parameter concept, though here we dumb it down even further. If you were going to refactor get_posts and get_comments to share a separate function for this parameter handling, I would suggest including get_author in that mini-project as well. The function get_top_authors simply finds the authors that have the most "post" objects attached to them.

def get_author(**kwargs):
    es = get_connection()
    # Fetching by id is a direct document get; anything else becomes a
    # filtered search.
    if 'id' in kwargs:
        return es.get(ELASTICSEARCH_INDEX, 'author', kwargs['id'])
    filters = [TermFilter(key, value) for key, value in kwargs.items()]
    q = FilteredQuery(MatchAllQuery(), ANDFilter(filters))
    return es.search(q, ELASTICSEARCH_INDEX, 'author')


def get_top_authors():
    # A terms facet on the "author" field of posts gives us author ids
    # ranked by how many posts each one has.
    q = MatchAllQuery()
    q = q.search()
    q.facet.add_term_facet('author')
    es = get_connection()
    facets = es.search(q, ELASTICSEARCH_INDEX, 'post').facets
    authors = []
    for term in facets['author']['terms']:
        authors.append(get_author(id=term['term']))
    return authors
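
A quick hypothetical usage example; the "name" field assumes the author mapping sketched earlier:

# Hypothetical usage of the author functions defined above.
author = get_author(id='some-author-id')    # direct get by id
matches = get_author(name='John')           # filtered search on any field
top = get_top_authors()                     # authors ranked by post count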

In part 2 of this series, we'll build up view functions using Elasticsearch as the data source.
