Ibis Project Blog

Python productivity framework for the Apache Hadoop ecosystem. Development updates, use cases, and internals.

Using other compute engines with Ibis

Several people have asked me about using Ibis with execution engines other than Impala. The purpose of this post is to explain how one can make Ibis work with other systems and what that might mean for the actual users.

Context

A primary goal of Ibis is to enable users to productively solve data analysis problems at any scale using 100% Python code, while leveraging the existing ecosystem of 3rd party libraries and high performance computing tools as much as possible. Personally, I'd like to see folks using tools like Cython to accelerate mission-critical analytics on a petabyte of data, and we simply are not there yet.

"100% Python" is not a gimmick; it means that Ibis, among other things, will also need to enable full utilization of the data analysis capabilities provided by SQL without writing any actual SQL code. The "writing SQL [in Python] without writing SQL" is such an important, and frankly nuanced, topic that I'm going to write a dedicated follow up blog post about it.

Starting in late 2014, I began developing a joint roadmap with the Impala team at Cloudera to solve the painful performance and usability problems that Python programmers experience at large scale (multi-terabyte to petabyte workloads). Since I've been focused on support and integration with Impala (which I understand not everyone has available yet), people have asked me about using Ibis with other systems like:

  • Other SQL-on-Hadoop compute engines: Hive, Presto, and others
  • Other "full-stack" (storage + compute) SQL engines: PostgreSQL, MySQL, Vertica, etc.
  • Generalized compute engines: Spark

Short answer: yes. If someone writes an Ibis expression translator for one of these platforms, Ibis can target it with no problems. Where things get dicey is making Python user-defined functions (which can't be compiled to SQL or some other form) run fast, or run at all; that is something we are doing internal work in Impala to enable.

I'm really excited in particular to see more SQL-based backends built for Ibis so that data analysts can just use Python instead of a mix of Python and SQL (the unfortunate status quo). It's not as much work as you might think.

Building new SQL backends for Ibis

Ibis contains a fully-decoupled data expression API that implements a superset of SQL semantics. My view is that if you can write it in SQL, you should be able to write it in a composable, reusable, and testable Pythonic (or pandas-ic?) way with Ibis.
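
To make that concrete, here is a minimal sketch of that composability, assuming a table handle t with the same columns as the functional_alltypes table used in the example below; the fragment names are mine:

# Reusable expression fragments; nothing executes until .execute() is called.
# Column names mirror functional_alltypes; is_large / avg_float are assumed names.
is_large = t.double_col > t.double_col.mean()
avg_float = t.float_col.mean().name('avg_float')

# Compose the fragments: filter, then group and aggregate.
expr = (t[is_large]
        .group_by('string_col')
        .aggregate([avg_float, t.count().name('n')]))

result = expr.execute()  # only now is SQL generated and run on the engine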

When you build and execute an Ibis expression using Impala SQL, here's what happens:

  1. The expression is analyzed and validated (types and relational semantics) while you're building it, so by the time you want to execute it you can be sure that it is valid (see the sketch after this list). All of this code lives in ibis/expr in the codebase. I'll definitely write some follow-up posts and documentation going into detail on the expression API design.

  2. The expression is "compiled" to identify each component corresponding to the parts of a SQL statement. See ibis/sql/compiler.py (NB. this module may move in the future).

    • In this step, further analysis takes place to expand higher-level Ibis constructs into concrete low-level SQL primitives.
    • Note this even handles all the SQL "weird stuff" like correlated subqueries; I'll write more technical detail about this in the future.
  3. The compiled SELECT data structure is translated into a valid SQL query. This falls into two buckets of functionality:

    • ibis/sql/ddl.py: the primary SQL statement builder classes.
    • ibis/sql/exprs.py: translates Ibis operations and arguments into concrete SQL function calls. This includes all built-in analytical functions, case statements, "synthetic" operations like bucketing, window functions, and so forth.
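
As a hedged illustration of step 1, the expression layer rejects ill-typed expressions while they are being built, before any SQL exists; the exact exception class may vary between Ibis versions, so it is caught generically here:

# Validation happens at expression-construction time, not at query time.
try:
    bad = t.string_col - t.int_col   # subtracting an integer from a string
except Exception as e:               # Ibis raises a type error while building
    print(type(e).__name__, e)

# A well-typed expression, once constructed, is known to be valid, so the
# compilation in steps 2 and 3 can proceed without surprises.
ok = (t.int_col + t.bigint_col).log()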

To give you a concrete example, the Ibis code

In [37]: bucket = t.float_col.bucket([0, 5, 10])

In [38]: double_pct = t.double_col / t.double_col.sum()

In [39]: expr = (t.mutate(bucket=bucket, double_pct=double_pct)
   ....:         .group_by(['bucket', 'string_col'])
   ....:         .double_pct.summary())

In [40]: expr.execute()
Out[40]:
   bucket string_col  count  nulls       min       max       sum      mean  approx_nunique
0       0          1    730      0  0.000030  0.000030  0.022222  0.000030               1
1       0          4    730      0  0.000122  0.000122  0.088889  0.000122               1
2       1          6    730      0  0.000183  0.000183  0.133333  0.000183               1
3       1          9    730      0  0.000274  0.000274  0.200000  0.000274               1
4       0          3    730      0  0.000091  0.000091  0.066667  0.000091               1
5       0          0    730      0  0.000000  0.000000  0.000000  0.000000               1
6       1          5    730      0  0.000152  0.000152  0.111111  0.000152               1
7       1          7    730      0  0.000213  0.000213  0.155556  0.000213               1
8       1          8    730      0  0.000244  0.000244  0.177778  0.000244               1
9       0          2    730      0  0.000061  0.000061  0.044444  0.000061               1

internally runs the SQL query

SELECT bucket, string_col, count(double_pct) AS `count`,
       sum(double_pct IS NULL) AS `nulls`, min(double_pct) AS `min`,
       max(double_pct) AS `max`, sum(double_pct) AS `sum`,
       avg(double_pct) AS `mean`, ndv(double_pct) AS `approx_nunique`
FROM (
  SELECT *,
    CASE
      WHEN (float_col >= 0) AND (float_col < 5) THEN 0
      WHEN (float_col >= 5) AND (float_col <= 10) THEN 1
      ELSE NULL
    END AS `bucket`, double_col / sum(double_col) OVER () AS `double_pct`
  FROM ibis_testing.`functional_alltypes`
) t0
GROUP BY 1, 2

Ibis's SQL translation toolchain is extensively tested in ibis/sql/tests.

Adapting Ibis to work with other systems that use the SQL language primarily involves addressing engine-specific differences in Step 3. There may be some nuances in Step 2; as a concrete example, not all databases have explicit semi-joins, which Ibis uses to implement Top-N filter operations.
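
For example, here is a hedged sketch of a Top-N filter in Ibis (column names mirror the example above; the SQL in the comments is illustrative, not Ibis's verbatim output):

# Keep only rows from the 3 string_col groups with the largest total double_col.
top_groups = t.string_col.topk(3, by=t.double_col.sum())
expr = t[top_groups]

# On an engine with semi-join support (e.g. Impala), the filter can be rendered
# roughly as:
#   SELECT t0.* FROM functional_alltypes t0
#   LEFT SEMI JOIN (top-3 groups by sum(double_col)) t1
#     ON t0.string_col = t1.string_col
# A dialect without explicit semi-joins needs an equivalent rewrite, such as an
# IN (...) subquery or an inner join against the distinct top keys.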

If you want to enable Ibis to use a SQL engine that you have available, like Vertica or PostgreSQL, I am happy to guide you in the efforts to begin building a test suite and refactoring Ibis's SQL translation subsystem to target multiple SQL dialects. The hard part (and why I won't promise to build the integrations myself) is thorough integration testing; think of it as "code coverage" for SQL. It's not enough to generate the queries; you have to make sure they work.

Executing Ibis expressions with pandas?

This is hypothetically possible, but there are some feature gaps in pandas, and pandas's loose type system (and known issues with NULL / NA values) would be a barrier. If someone wants to discuss this with me in more detail please get in touch.
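
As one small illustration of the NULL / NA issue, a sketch of how pandas silently upcasts an integer column the moment a missing value appears:

import pandas as pd

s = pd.Series([1, 2, 3], dtype='int64')
s_with_null = s.where(s != 2)   # introduce a single missing value

print(s.dtype)            # int64
print(s_with_null.dtype)  # float64: the integer column was silently upcast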

I expect that an in-memory backend for Ibis will exist at some point in the future, but it's unlikely to use pandas. Interoperability with pandas is a separate issue, and using Ibis alongside pandas is a primary use case.

Why Ibis doesn't use SQLAlchemy (yet)

Several people have asked me, "Wes, why aren't you using SQLAlchemy to generate SQL code in Ibis? It does away with all the database-specific differences!" This is only half true. Here are the main reasons:

  • Impala will soon have complex types and its own SQL syntax for expressing queries involving arrays, structs, and maps as first-class value types. Engines like Presto also have this functionality and their own syntax for writing queries on such data. SQLAlchemy support for complex types is unlikely to land in time (i.e. within the next 6 months).
  • We have to write interface code to wrap all the SQL built-in functions, and many of these are database specific. SQLAlchemy provides an extension API for wrapping functions (see the sketch after this list), but it was easier, at least initially, to handle all the function wrapping within Ibis.
  • Impala, like Hive, has various big data-specific DDL commands and modifiers; we are generating more than only SELECT, INSERT, and CREATE TABLE statements.
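
For reference on the second point, here is a minimal sketch of what wrapping a database-specific function (Impala's ndv()) looks like with SQLAlchemy's extension API; this is illustrative only and not code from Ibis:

from sqlalchemy import Integer, func
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.functions import GenericFunction


class ndv(GenericFunction):
    """Impala's approximate distinct count, exposed as func.ndv(...)."""
    type = Integer()


@compiles(ndv)
def _compile_ndv(element, compiler, **kw):
    # Render the call verbatim; another dialect could register a different rule.
    return "ndv(%s)" % compiler.process(element.clauses, **kw)

# Usage: func.ndv(some_column) now renders as ndv(...) in generated SQL.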

That being said, I would like to see a SQLAlchemy version of Ibis's SQL translation toolchain, in particular for targeting traditional RDBMSs like PostgreSQL or MySQL.

SQLAlchemy handles a number of things, like input sanitization, that Ibis does not yet, so there are likely some code reuse opportunities there.

Why Impala is so key to the Ibis vision

Among the production-grade open source big data systems one may consider, Impala is the only one that does not run on the JVM. So, SQL performance / interactivity considerations aside, we (myself and the Impala team) are working on shared memory / binary interoperability with Python that eliminates the data serialization and memory use overhead that has plagued Python extension APIs (for user-defined functions) on top of JVM-based runtimes. Most importantly, it will allow users to use scientific Python tools (like Cython) to write high performance code that operates on data flowing around the Impala C++/LLVM runtime. If you have to pay a high transfer cost to move data to and from Python, the benefits of such tools can be largely negated in many use cases.

Will Ibis be useful without Impala?

Oh, yes, incredibly useful. The value of writing Python code (high level) instead of SQL code (low level) is extremely high. More on this in a future post.

About Blaze (Python project)

Blaze (from my friends at Continuum Analytics) is a Python project whose goals overlap with Ibis's. Both use the same general technical approach of building a decoupled expression API with a separate compiler-executor. Overall, I would describe them as two distinct domain-specific languages for structured data with different priorities (based on what I see on GitHub).

While it's too nuanced for this blog post, I will state my views on how Blaze is similar to / different from Ibis as follows:

  • Blaze has varying levels of support for lots of backends. The price of this is that any given backend is likely missing substantial functionality in Blaze's expression API (I know this to be true from reading the test suite). We started Ibis with comprehensive coverage of one backend (Impala, for which we are developing an efficient Python extension API) as a blueprint for additional backends to be contributed by the community.
  • I don't believe it was ever a goal of Blaze to replace SQL (the language) in user workflows.
  • Blaze's expression API is designed fairly differently. I may do an in-depth comparison of how the differences play out in real code (sooner if a lot of people are interested).

Complete coverage of the functionality provided by one SQL engine is truthfully a fairly difficult problem. There are canonical database benchmark suites like TPC-H and TPC-DS that you can use to put a lot of stress on any given DSL that can emit SQL queries.

If you're trying to decide which project to use, look at the projects (the test suites, in particular) and make your own decisions.

Summary

With the current completeness of the Impala SQL backend in Ibis, the work of integrating with other data engines that speak SQL is relatively straightforward and incremental. I will happily welcome well-tested contributions to the project.

In an upcoming blog post, I will write in some more detail about how Ibis will help bring about a post-SQL world and why that will be a productivity boon for data analysts everywhere.