Wednesday, November 20, 2013

Integrating SQLAlchemy into Django

In my previous post I described some of the limitations I've hit while using Django's ORM for our RESTful API here at HireVue.  In this post I'll describe how I integrated SQLAlchemy into our Django app for read-only (GET) requests, to handle the queries that the Django ORM can't express.

Let's say we are running with a single database instance, so our Django settings look something like this:

DATABASES = {
    'default' : {
        'ENGINE' : 'django.db.backends.postgresql_psycopg2',
        'NAME' : 'mydatabase',
        'USER' : 'mydatabaseuser',
        'PASSWORD' : 'mypassword',
        'HOST' : '127.0.0.1',
        'PORT' : '5432',
    }
}

We will continue to let Django handle all of the connection management, so when SQLAlchemy needs a database connection, we just hand it the connection Django is already using.  The function that fetches it looks like this:

# custom connection factory, so we can share with django
def get_conn():
    from django.db import connections
    conn = connections['default']
    return conn.connection
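
One caveat: Django opens database connections lazily, so if nothing in the request has touched the database yet, conn.connection may still be None.  A simple guard (just a sketch; it isn't needed if the ORM has already run a query during the request) is to force Django to connect by opening a cursor first:

# variant of get_conn that forces Django to open a connection
# if it hasn't done so already (Django connects lazily, on first use)
def get_conn():
    from django.db import connections
    conn = connections['default']
    if conn.connection is None:
        conn.cursor()  # opening a cursor makes Django establish the connection
    return conn.connection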

Now we want SQLAlchemy to call get_conn whenever it needs a new connection.  In addition, we need to keep SQLAlchemy from trying to pool the connection, clean up after itself, etc.  We essentially need it to do absolutely no connection handling.  To accomplish this, we create the SQLAlchemy engine with a custom connection pool that looks like

from sqlalchemy.pool import NullPool

# custom connection pool that doesn't close connections, and uses our
# custom connection factory
class SharingPool(NullPool):
    def __init__(self, *args, **kwargs):
        NullPool.__init__(self, get_conn, reset_on_return=False,
                          *args, **kwargs)

    def status(self):
        return 'Sharing Pool'

    def _do_return_conn(self, conn):
        pass

    def _do_get(self):
        return self._create_connection()

    def _close_connection(self, connection):
        pass

    def recreate(self):
        return self.__class__(self._creator,
                              recycle=self._recycle,
                              echo=self.echo,
                              logging_name=self._orig_logging_name,
                              use_threadlocal=self._use_threadlocal,
                              reset_on_return=False,
                              _dispatch=self.dispatch,
                              _dialect=self._dialect)

    def dispose(self):
        pass

The magic is where we pass get_conn to the NullPool constructor, and then we override most of the other methods to do nothing.  This allows SQLAlchemy to borrow the connection that Django is managing, without interfering.

Now we can create an engine with this pool like so:

from sqlalchemy import create_engine

# create engine using our custom pool and connection creation logic
engine = create_engine(db_url, poolclass=SharingPool)

Since we want to stay DRY, we build db_url from Django's settings rather than duplicating the connection info:

from django.conf import settings

db_url = 'postgresql+psycopg2://{0}:{1}@{2}:{3}/{4}'.format(
            settings.DATABASES['default']['USER'],
            settings.DATABASES['default']['PASSWORD'],
            settings.DATABASES['default']['HOST'],
            settings.DATABASES['default']['PORT'],
            settings.DATABASES['default']['NAME'])

We also don't want to maintain a separate set of table definitions alongside our Django models, so we use SQLAlchemy reflection to build the metadata:

from sqlalchemy import MetaData

# inspect the db metadata to build tables
meta = MetaData(bind=engine)
meta.reflect()

That's basically it!  Now you can grab the tables you want to query from the meta object, then build and run selects; they will execute on the connection Django provides.  Don't do any transaction handling from the SQLAlchemy side, though, or you may interfere with Django's own transaction management.
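
For example, a simple read-only query might look like this (the interview table and its status column are just placeholders for whatever is actually in your schema):

from sqlalchemy.sql import select

# grab a reflected table by name and build a query against it
interview = meta.tables['interview']

query = select([interview.c.id, interview.c.status]) \
            .where(interview.c.status == 'completed') \
            .limit(10)

# this executes on the connection Django is already managing
rows = engine.execute(query).fetchall()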

Also, make sure you call reflect() once at startup, so the tables are ready to go before the first request needs them.
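
One way to do that (the module and import paths here are just an example layout) is to put all of the setup in a module that Django imports when the app loads, so reflection runs exactly once:

# myapp/sqla.py -- example module; import it from somewhere Django always
# loads at startup (models.py, urls.py, etc.) so reflection happens once
from django.conf import settings
from sqlalchemy import create_engine, MetaData

from myapp.pool import SharingPool  # wherever you defined SharingPool

_db = settings.DATABASES['default']

# same db_url as above, built with named placeholders for brevity
db_url = 'postgresql+psycopg2://{USER}:{PASSWORD}@{HOST}:{PORT}/{NAME}'.format(**_db)

engine = create_engine(db_url, poolclass=SharingPool)

meta = MetaData(bind=engine)
meta.reflect()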

Monday, October 28, 2013

Limitations of Django's ORM

I'm currently working at HireVue, where we have a public RESTful API that we consume for our website. I have done quite a bit of work on the API, and in the process, have run into some of the limitations of the Django ORM.

Difficulty controlling joins

Let's say I have an Interview table for storing information about candidate interviews, an Evaluation table for storing interviewer evaluations, and a User table for information on the interviewers/evaluators.  The Evaluation table is a many-to-many join table connecting Users and Interviews, with the evaluation details stored on it as well.
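
Roughly, the models look something like this (the concrete fields are just for illustration, and I'm using Django's built-in User for simplicity):

from django.contrib.auth.models import User
from django.db import models

class Interview(models.Model):
    candidate_name = models.CharField(max_length=255)  # illustrative field

class Evaluation(models.Model):
    # one row per (user, interview) evaluation, plus the evaluation data itself
    interview = models.ForeignKey(Interview)
    user = models.ForeignKey(User)
    rating = models.IntegerField()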


Now I want to write a query that returns interviews that a specific user has evaluated highly.  I try something like this:

Interview.objects.filter(evaluation__rating__gt=5, evaluation__user__username='specificuser')

I cross my fingers and hope that Django only joins to the evaluation table once, and uses that same join for both filter conditions.  It seems to work in the cases I have tried, but there is no way for me to tell Django explicitly that this is what I want.

On the other hand, let's say I want to write a query that returns interviews that a specific user has evaluated low, but others have evaluated high.  How do I tell Django I need two different joins?

Interview.objects.filter(evaluation__rating__lt=3, evaluation__user__username='specificuser') \
        .filter(evaluation__rating__gt=5).distinct()

Will this work?  I have had mixed luck, and find myself banging my head against the Django ORM black box, wishing I could tell it more explicitly what I want.  The Django docs are very reassuring on this matter:
"Conditions in subsequent filter() or exclude() calls that refer to the same relation may end up filtering on different linked objects."

I'm also limited when I want to filter on the same relationship from different parts of my code.  Let's say I have some 'permissions' functions that apply a filter so that the request can only see interviews that the current user has evaluated, and then other functions that filter on search criteria, like whether the evaluation was high or low:

query = Interview.objects.all() 
query = apply_permissions(query) # filter(evaluation__user__username='specificuser')
query = apply_search(query) # filter(evaluation__rating__gt=5)

I don't know of any way to tell Django that the filters applied by these different functions should use the same joined table.  This makes it difficult to break code out into logical components like this.

Inability to force outer joins

This is the bigger problem, in my opinion:  I am unaware of any way to tell Django to use an outer join.  Others seem to agree that this is not possible.

What if I would like to return all interviews, and, for any interview the current user has evaluated, include the rating?  This is easily done with a LEFT OUTER JOIN in SQL.  Django, however, does not allow a query like this, leaving us to issue extra queries and do the joining in code.  That may not seem like a big deal in most cases, but if you are querying a large data set and want to sort by the ratings in that left-outer-joined table, things get even more difficult.
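
For the record, this is exactly the kind of query that is easy to express directly, for example with SQLAlchemy Core against reflected tables as described in the follow-up post (the table names, column names, and current_user_id are placeholders):

from sqlalchemy.sql import select, and_

interview = meta.tables['interview']
evaluation = meta.tables['evaluation']

# LEFT OUTER JOIN only the current user's evaluations; interviews the user
# hasn't evaluated still come back, with rating as NULL
join = interview.outerjoin(
    evaluation,
    and_(evaluation.c.interview_id == interview.c.id,
         evaluation.c.user_id == current_user_id))  # current_user_id: id of the requesting user

query = select([interview, evaluation.c.rating]) \
            .select_from(join) \
            .order_by(evaluation.c.rating.desc())

rows = engine.execute(query).fetchall()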

In my experience, these are my biggest complaints with Django's ORM.  It's frustrating to be held back by limitations in the ORM when I know that the database can handle the problem easily.  I find myself jumping through hoops or coding around these limitations far too often, and thought I'd share for others considering Django.

I'll follow up this post with another describing how I recently bolted SQLAlchemy onto our existing Django app, to handle the 'tough' queries.