ScalaSql: a New SQL Database Query Library for the com-lihaoyi Scala Ecosystem

Posted 2024-01-15

So, What's So Special About The Mill Scala Build Tool?12 years of the com.lihaoyi Scala Platform

This blog post discusses the introduction of the ScalaSql query library. Why it is needed, what it brings to the table over the many existing database query libraries in the Scala ecosystem, and what makes it fit nicely into the com-lihaoyi philosophy of "Executable Scala Pseudocode that's Easy, Boring, and Fast"

About the Author: Haoyi is a software engineer, and the author of many open-source Scala tools such as the Ammonite REPL and the Mill Build Tool. If you enjoyed the contents on this blog, you may also enjoy Haoyi's book Hands-on Scala Programming

ScalaSql was released on January 1st 2024, but had been in the works for several months before that. It's a library that lets you define normal Scala case classes, write normal-looking Scala collection operations, and use them to perform SQL queries. The readme example from the ScalaSql Github repository should illustrate how it works from end-to-end:

import scalasql._, SqliteDialect._

// Define your table model classes
case class City[T[_]](
    id: T[Int],
    name: T[String],
    countryCode: T[String],
    district: T[String],
    population: T[Long]
)
object City extends Table[City]

// Connect to your database (example uses in-memory sqlite, org.xerial:sqlite-jdbc:3.43.0.0)
val dataSource = new org.sqlite.SQLiteDataSource()
dataSource.setUrl(s"jdbc:sqlite:file.db")
lazy val dbClient = new scalasql.DbClient.DataSource(
  dataSource,
  config = new scalasql.Config {
    override def nameMapper(v: String) = v.toLowerCase() // Override default snake_case mapper
    override def logSql(sql: String, file: String, line: Int) = println(s"$file:$line $sql")
  }
)

dbClient.transaction{ db =>
  // Initialize database table schema and data
  db.updateRaw(os.read(os.Path("scalasql/test/resources/world-schema.sql", os.pwd)))
  db.updateRaw(os.read(os.Path("scalasql/test/resources/world-data.sql", os.pwd)))

  // Adding up population of all cities in China
  val citiesPop = db.run(City.select.filter(_.countryCode === "CHN").map(_.population).sum)
  // SELECT SUM(city0.population) AS res FROM city city0 WHERE city0.countrycode = ?
  println(citiesPop)
  // 175953614

  // Finding the 5-8th largest cities by population
  val fewLargestCities = db.run(
    City.select
        .sortBy(_.population).desc
        .drop(5).take(3)
        .map(c => (c.name, c.population))
  )
  // SELECT city0.name AS res__0, city0.population AS res__1
  // FROM city city0 ORDER BY res__1 DESC LIMIT ? OFFSET ?
  println(fewLargestCities)
  // Seq((Karachi, 9269265), (Istanbul, 8787958), (Ciudad de México, 8591309))
}

The Missing Database Library in the com-lihaoyi Ecosystem?

ScalaSql is meant to plug a hole in the com-lihaoyi ecosystem.

The com-lihaoyi libraries are specifically designed for ease of use. Most people do not need the ultimate level of performance or extreme type-safety, and these libraries are performant enough and type-safe enough for the vast majority of use cases. But apart from being easy to use, the com-lihaoyi libraries are also small and self-contained: you can use them to perform useful real-world work without needing to pull in complicated external dependencies: scrape websites, automate HTTP JSON APIs, juggle files on disk or work with subprocesses, set up your own web/API servers, static website generators, or write your own parallel web crawlers. All of this can be done by using - and re-using - the same set of core libraries: uPickle to deal with JSON, OS-Lib for files and subprocesses, Requests-Scala for making HTTP requests, and so on.

This small set of easy-to-learn building blocks can be combined and re-combined in a myriad of different ways, allowing you to accomplish many different real-world tasks. This lets you minimize your up-front onboarding cost while maximizing what you are able to accomplish. That is the basis of my book Hands-on Scala Programming.

But there was perhaps one big hole in this com-lihaoyi world: any real-world backend system or service needs to store data. Directly storing serialized data on disk can be done, but is fragile, risky, and inflexible. Storing data in cloud blob storage like Amazon S3 works for large static files but not for fast-changing metadata. Most real-world systems use some kind of external database service to store their data, and perhaps the greater majority of these are SQL databases like MySql, PostgreSql, or Microsoft Sql Server.

Before ScalaSql, the com-lihaoyi ecosystem provided no way to query such a database: you were on your own to find your own solutions.

Other Scala Database Libraries

The Scala community has no shortage of database libraries. Here are some in no particular order,

These are a small, but representative, sample of the major Scala database libraries out there. There are countless others, including Java libraries like JOOQ that can be used from Scala as well.

However, none of these existing database libraries really satisfy my requirements for what I want someone using the com-lihaoyi ecosystem to experience: easy, small, and self-contained. I'll discuss two of them in more detail: Quill and SLICK

Quill

Quill is perhaps my favorite of the existing Scala database query libraries. It is unique in that it translates your Scala code snippets at compile-time to SQL queries. This means:

Zero runtime overhead: your Quill queries do not do any complex data structure manipulation at runtime, and are as fast as passing raw SQL strings to the database
Compile-time query logging: you can extract a comprehensive list of all the SQL queries in your application just by compiling it. This makes it very easy to figure out what SQL your Scala code is generating, or to pinpoint the Scala code that is responsible for a particular query you see in production

Quill was my original choice to fill the gap in the com-lihaoyi ecosystem: in my blog posts, in Chapter 15 of my book, I use Quill in concert with the Cask webserver and other com-lihaoyi libraries. In my professional work, I've deployed distributed systems with multi-terabyte SQL databases running thousands of queries per second, with all the database traffic going through Quill.

However, Quill has some limitations that are difficult to square with the com-lihaoyi philosophy:

It is really complicated internally. As a compile-time Scala-to-SQL translator, it does does not look like any "normal" library you may encounter elsewhere. Rather than boring classes and methods and interfaces, we are instead reasoning about exciting language features like macros, quasiquotes, refinement-types. And the whole thing needed to be rewritten from Scratch in Scala 3, with a whole new set of exciting language features: inline functions, quote pattern-matching, and other things. This means a high barrier to contributing, or even understanding what Quill is doing, which is an inevitable necessity when something misbehaves and you are trying to debug whether it's your fault or the library's. I like to think myself an advanced Scala user and metaprogramming expert, and even I find this stuff pretty scary
Quill's user-facing abstraction is leaky. Like Scala.js, Quill compiles Scala to another language at compile time. But unlike Scala.js, Quill only supports a tiny subset of the Scala language, and it's not easy to figure out what exactly that subset is. You can perform .<tab> completion on a String in Quill and see a hundred different operations, but only 7 of them are actually valid, and even after you figure that out you don't know what SQL those 7 valid operations translate into. This isn't the fault of the implementation, but is a fundamental issue with the design: it is simply not feasible to translate the full Scala language and library to SQL, the way that Scala.js can translate it to Javascript. The languages Scala and SQL are just too different, and trying to paper over the difference causes confusion.

Combined, these two issues make my personal experience using Quill very un-even:

When Quill works, it works great: the code is fast, it looks beautiful, it logs the SQL at compile time for inspection, and so on. The skies open and angels sing and everything is right in the world.
But when Quill doesn't work, it fails hard: it generates SQL you didn't expect, and the only thing you can do is trial-and-error to make it generate what you want. Or it crashes with some obscure macro compile-time query-optimization error that you have no hope of understanding, and again you are back to trial and error to figure out how to make it not blow up.

All libraries have bugs, all libraries having missing docs, and all libraries can be mis-used causing mis-behavior and user confusion. But I found Quill's complexity and advanced implementation techniques definitely work against it in cases where things go wrong: I cannot just jump-to-definition, I cannot just read the sources. And if even myself as an advanced Scala user have trouble with it, I expect the newbies that the com-lihaoyi ecosystem is designed for to have even more trouble.

ScalaSql is by no means a simple library, and it may never have the nice compile-time query generation and zero-runtime-overhead performance benefits that Quill provides. But I would expect that users would have an easier time working with it, especially in the inevitable scenarios where they need to peek under the covers and see how the library really works.

SLICK

SLICK is an old project dating back to 2009, long before I started using Scala, and originally called ScalaQuery before it got renamed. It has a similar design as ScalaSql: constructing query data structures at runtime using familiar collection-like operations that are then translated down to SQL and sent to the database.

Perhaps the biggest way that SLICK diverges from ScalaSql is its use of the DBIO[T] monad. This is intended for managing transactions and asynchronous workflows. However the DBIO[T] monad adds a ton of complexity: e.g. the book Essential Slick contains a whole chapter on manipulating DBIO[T] values, which distracts from the essential task of querying the database.

Method	Arguments	Result Type
`map` (EC)	`T => R`	`DBIO[R]`
`flatMap` (EC)	`T => DBIO[R]`	`DBIO[R]`
`filter` (EC)	`T => Boolean`	`DBIO[T]`
`named`	`String`	`DBIO[T]`
`zip`	`DBIO[R]`	`DBIO[(T,R)]`
`asTry`		`DBIO[Try[T]]`
`andThen` or `>>`	`DBIO[R]`	`DBIO[R]`
`andFinally`	`DBIO[_]`	`DBIO[T]`
`cleanUp` (EC)	`Option[Throwable]=>DBIO[_]`	`DBIO[T]`
`failed`		`DBIO[Throwable]`

This is especially bad because SLICK's DBIO[T] monad has map, filter, flatMap operations that look very similar to the map, filter and flatMap operations that SLICK's Query[T] type have. Not only does a user need to familiarize themselves with a whole new set of collection-like operations on DBIO[T], they now also need to constantly disambiguate their DBIO[T] operations from the Query[T] operations with the exact same name and function signature! Consider this example, again from the book Essential Slick, that demonstrates how in idiomatic usage you can be expected to have both Query[T].map and DBIO[T].map right next to each other, just asking for user confusion:

//                                    Query[T].map
val text: DBIO[Option[String]] = messages.map(_.content).result.headOption

//                                      DBIO[T].map                            Option[T].map
val backwards: DBIO[Option[String]] = text.map(optionalContent => optionalContent.map(_.reverse))

Some of this complexity can be papered over by tools like Blocking Slick, but much of it cannot, especially when it comes to the maintainability of SLICK itself and its internals. I would guess that SLICK's internal complexity has also costed it potential contributors and other improvements that on a surface level seem unrelated to the whole DBIO/asynchronous-execution design.

ScalaSql differs from SLICK in that it ignores asynchrony entirely, and thus avoids the monadic wrapper that SLICK requires. Most applications are never going to have such extreme performance requirements that asynchronous computation is really necessary, and those that do can always copy SLICK and run their ScalaSql queries on a thread pool themselves. Dropping asynchrony and the monadic wrapper means two less major topics that a user (or maintainer!) needs to become familiar with, so that they can focus on the core task of querying their SQL database.

ScalaSql

And so we come to ScalaSql. Where Quill spends its effort and complexity on "compile time translation", and SLICK spends its effort and complexity on "monadic asynchrony", ScalaSql spends its effort and complexity on ease of use. It is hard to summarize several months of "make things easier" work into a single blog post, but here are two facets of the ScalaSql design I would like to highlight:

Higher-Kinded Case Classes as Database Rows

ScalaSql defines your database rows as higher-kinded case classes:

case class City[T[_]](
    id: T[Int],
    name: T[String],
    countryCode: T[String],
    district: T[String],
    population: T[Long]
)
object City extends Table[City]

Here, City can be one of:

City[Sc]: a City in application code, where type Sc[T] = T. Or
City[Expr]: a City in your query, where Expr[T] values have a different set of operations available than plain Ts, representing the operations available in your specific database.

This means you can easily share your "database row" classes between the queries fetching data and the application code using that data, without needing to list out all the fields/columns twice (or more times!) like you have to do in many other libraries like SLICK

case class City(
    id: Int,
    name: String,
    countryCode: String,
    district: String,
    population: Long
)
class Cities(tag: Tag) extends Table[City](tag, "city"){
   def id = column[Int]("id")
   def name = column[String]("name")
   def countryCode = column[String]("country_code")
   def district = column[String]("district")
   def population = column[Long]("population")
}

SLICK also allows you to define mappings between query-time Row classes and application-level Row classes, but that necessitates yet another level of duplication for your schema:

case class LiftedCity(
    id: Rep[Int],
    name: Rep[String],
    countryCode: Rep[String],
    district: Rep[String],
    population: Rep[Long]
)
implicit object BShape extends CaseClassShape(LiftedCity.tupled, City.tupled)

In general, any "lifted" DSL has this issue, not just SLICK: the "lifted" case classes with fields lifted into Rep[T] or Expr[T] are different from the "vanilla" case classes with vanilla fields. This means you have to define your schema 2-3 times to accommodate both flavors. Quill solves this by having a "direct" style without lifting, though that comes with its own issues described earlier. By using higher-kinded case classes to model rows, ScalaSql is able to avoid the boilerplate of SLICK's lifted case class definitions, while also avoiding the confusion around Quill's direct-style use of vanilla case classes in queries. The only cost is a small amount of boilerplate defining each case class with a T[_] higher-kinded type parameter.

While many database libraries, SLICK included, allow you to set up code generation pipelines to generate this boilerplate for you, ScalaSql instead makes the code simple enough you don't need codegen in the first place. Write your case class, annotate it a bit with T[_] wrappers, and you're off to the races. Easy!

Object Oriented Queries

ScalaSql queries are just a bunch of bog-standard OO-style objects. That makes understanding them and extending them something that you do not need special Scala expertise to manage; anyone with any programming experience should know how the extends and override keywords work.

For example, consider the OnConflict.Update query class:

object OnConflict{
  class Update[Q, R](...) extends Query.DelegateQuery[R] {
    protected def renderSql(ctx: Context) = {
      implicit val implicitCtx = Context.compute(ctx, Nil, Some(table))
      val str = Renderable.renderSql(query)
      val columnsStr = SqlStr.join(columns.map(c => SqlStr.raw(c.name)), SqlStr.commaSep)
      val updatesStr = SqlStr.join(
        updates.map { case assign => SqlStr.raw(assign.column.name) + sql" = ${assign.value}" },
        SqlStr.commaSep
      )
      str + sql" ON CONFLICT ($columnsStr) DO UPDATE SET $updatesStr"
    }
       
    // ...
  }
}

Most databases have this syntax, but some don't. e.g. MySql uses ON DUPLICATE KEY UPDATE with a somewhat different syntax. In ScalaSql, it's just a matter of defining a new class and providing your own implementation and overrides:

object MySqlDialect{
  class OnConflictUpdate[Q, R](...) extends Query.DelegateQuery[R] {

    protected def renderSql(ctx: Context) = {
      implicit val implicitCtx = Context.compute(ctx, Nil, Some(table))
      val str = Renderable.renderSql(insert.query)

      val updatesStr = SqlStr.join(
        updates.map { case assign => SqlStr.raw(assign.column.name) + sql" = ${assign.value}" },
        SqlStr.commaSep
      )
      str + sql" ON DUPLICATE KEY UPDATE $updatesStr"
    }
      
    // ...
  }
}

The code here might be a bit verbose and unfamiliar, but it's not hard. It's just vanilla object-oriented inheritance and subclassing, the same kind that people learn about in programming 101. If someone unfamiliar with Scala or ScalaSql hit jump to definition and ended up here, they would probably be able to guess correctly what this code was doing, and they could probably fumble their way through any changes they needed to make.

Conclusion

Database access is a non-negotiable part of any real system, but the com-lihaoyi ecosystem has in the past not provided any libraries to support that. I have used other Scala community libraries as stopgaps in the past, and they work, though I had always felt them to be just a bit to "advanced and powerful" when what I really want is "easy and boring".

ScalaSql aims to plug that gap. It provides a database access library with many sacrifices: we lose Quill's zero-runtime-overhead performance, we lose SLICK's asynchronous design to support ultra-scalable ultra-high-concurrency services. But in exchange, we get something that is easy and boring, while hopefully still being performant enough and scalable enough for the vast majority of use cases out there. This opens up a whole new set of use cases that can be satisfied entirely by the com-lihaoyi ecosystem: interactive websites, API servers, data analysis and ETL work, etc. that would previously have required choosing and learning about a third-party library with its own philosophy and conventions.

ScalaSql is still a new library. Despite it's thorough internal test suite, it is not as battle-tested as existing libraries like Quill and SLICK with their 10-15 year histories. It supports 4 databases backends, not the dozen that Quill or SLICK support. Some features may be missing, e.g. at time of writing Scala 3 support is still a work in progress.

But I hope people are willing to give ScalaSql a try: work through the tutorial, skim the cheat sheet, report issues, start discussions, and generally participate. The Scala ecosystem has no shortage of database libraries, but perhaps ScalaSql is the first to really exemplify the com-lihaoyi philosophy: "Executable Scala Pseudocode that's Easy, Boring, and Fast".