How to work with Files in Scala

Posted 2019-06-02


Working with files and the filesystem is one of the most common things you do when programming. This tutorial will walk through how to easily work with files in the Scala programming language, in a way that scales from interactive usage in the REPL, to your first Scala scripts, to usage in a production system or application.

About the Author: Haoyi is a software engineer, and the author of many open-source Scala tools such as the Ammonite REPL and the Mill Build Tool. If you enjoyed the content on this blog, you may also enjoy Haoyi's book Hands-on Scala Programming.

OS-Lib: a simple filesystem library

The easiest way to work with the filesystem in Scala is through the OS-Lib filesystem library. OS-Lib is available on Maven Central for you to use with any version of Scala:

// SBT
"com.lihaoyi" %% "os-lib" % "0.2.7"

// Mill
ivy"com.lihaoyi::os-lib:0.2.7"

OS-Lib also comes bundled with Ammonite, and can be used within the REPL and *.sc script files.

All functionality within this library comes from the os package, e.g. os.Path, os.read, os.list, and so on. To begin with, I will install Ammonite:

$ sudo sh -c '(echo "#!/usr/bin/env sh" && curl -L https://github.com/lihaoyi/Ammonite/releases/download/1.6.7/2.12-1.6.7) > /usr/local/bin/amm && chmod +x /usr/local/bin/amm'

And open the Ammonite REPL, using os.<tab> to see the list of available operations:

$ amm
Loading...
Welcome to the Ammonite Repl 1.6.7
(Scala 2.12.8 Java 11.0.2)
@ os.<tab>
/                                RelPath                          list
BasePath                         ResourceNotFoundException        makeDir
BasePathImpl                     ResourcePath                     move
BasicStatInfo                    ResourceRoot                     mtime
Bytes                            SeekableSource                   owner
CommandResult                    SegmentedPath                    perms
FilePath                         Shellable                        proc
...

From there, we can begin our tutorial.

Paths

Most operations we will be working with involve filesystem paths: we read data from a path, write data to a path, copy files from one path to another, or list a folder path to see what files are inside it. Such paths are represented by the os.Path type.

Constructing Paths

By default, you have a few paths available: os.pwd, os.root, os.home:

@ os.pwd
res0: os.Path = root/'Users/'lihaoyi/'Github/'blog

@ os.root
res1: os.Path = root

@ os.home
res2: os.Path = root/'Users/'lihaoyi

These refer to your process working directory, filesystem root, and user home folder respectively. To refer to paths relative to an existing path, you can use the / operator to add additional path segments:

@ os.pwd / "post"
res3: os.Path = root/'Users/'lihaoyi/'Github/'blog/'post

@ os.home / "Github" / "blog"
res4: os.Path = root/'Users/'lihaoyi/'Github/'blog

Note that you can only append single path segments to a path using the / operator on strings, e.g. this is not allowed:

@ os.home / "Github/blog"
os.PathError$InvalidSegment: [Github/blog] is not a valid path segment.
[/] is not a valid character to appear in a path segment. If you want to parse
an absolute or relative path that may have multiple segments, e.g. path-strings
coming from external sources use the Path(...) or RelPath(...) constructor calls
to convert them.

As the error message suggests, you need to use the os.RelPath constructor in order to construct a relative path of more than one segment. This helps avoid confusion between working with individual path segments (as Strings) and working with more general relative paths (as os.RelPaths).
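
As a minimal sketch of that distinction, the snippet below parses a multi-segment string into an os.RelPath before appending it. It assumes OS-Lib is on the classpath (for example inside an Ammonite script or REPL); the variable names are only illustrative.

```scala
// Assumes OS-Lib is available, e.g. when run as an Ammonite *.sc script.

// A string containing "/" must be parsed into an os.RelPath first.
val nested = os.RelPath("Github/blog")

// Appending the parsed relative path is equivalent to appending
// each segment one at a time.
val viaRelPath = os.home / nested
val viaSegments = os.home / "Github" / "blog"

assert(viaRelPath == viaSegments)
```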

You can also use the special os.up path segment to move up one level:

@ os.pwd
res5: os.Path = root/'Users/'lihaoyi/'Github/'blog

@ os.pwd / os.up
res6: os.Path = root/'Users/'lihaoyi/'Github

@ os.pwd / os.up / os.up
res7: os.Path = root/'Users/'lihaoyi

@ os.pwd / os.up / os.up / os.up
res8: os.Path = root/'Users

@ os.pwd / os.up / os.up / os.up / os.up
res9: os.Path = root

You can construct os.Paths from strings:

@ os.Path("/")
res10: os.Path = root

@ os.Path("/Users/lihaoyi")
res11: os.Path = root/'Users/'lihaoyi

This is helpful when paths are coming in from elsewhere, e.g. read from a file or command-line arguments.

Note that by default this only allows absolute paths:

@ os.Path("post")
java.lang.IllegalArgumentException: requirement failed: post is not an absolute path

If you want to take in a path that is relative, you have to provide a base path from which that relative path begins:

@ os.Path("post", base = os.pwd)
res13: os.Path = root/'Users/'lihaoyi/'Github/'blog/'post

@ os.Path("../Ammonite", base = os.pwd)
res14: os.Path = root/'Users/'lihaoyi/'Github/'Ammonite
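
When a path string comes from outside the program, it may be either absolute or relative. The sketch below wraps the two-argument os.Path constructor in a small helper; `resolve` and the sample paths are hypothetical names chosen for illustration, and the absolute example assumes a Unix-style filesystem.

```scala
// Assumes OS-Lib is available, e.g. when run as an Ammonite *.sc script.

// Hypothetical helper: turn a user-supplied path string, which may be
// absolute or relative, into an absolute os.Path.
def resolve(input: String, base: os.Path): os.Path =
  os.Path(input, base)

val base = os.root / "Users" / "lihaoyi" / "Github" / "blog"

// Relative input is resolved against the base.
assert(resolve("post", base) == base / "post")

// ".." segments are resolved as part of construction.
assert(resolve("../Ammonite", base) == base / os.up / "Ammonite")

// Absolute input is returned as-is; the base is not used.
assert(resolve("/tmp/data", base) == os.root / "tmp" / "data")
```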

Relative Paths

If you want to model relative paths, you want an os.RelPath:

@ os.RelPath("post")
res20: os.RelPath = 'post

@ os.RelPath("../hello/world")
res21: os.RelPath = up/'hello/'world

This helps ensure you do not mix up what you are working with: os.Paths are always absolute, and os.RelPaths are always relative. To convert a relative path to an absolute path, you can use the same / operator:

@ val postFolder = os.RelPath("post")
postFolder: os.RelPath = 'post

@ os.pwd / postFolder
res23: os.Path = root/'Users/'lihaoyi/'Github/'blog/'post

@ val helloWorldFolder = os.RelPath("../hello/world")
helloWorldFolder: os.RelPath = up/'hello/'world

@ os.home / helloWorldFolder
res25: os.Path = root/'Users/'hello/'world

If you want the relative path between two absolute paths, you can use .relativeTo:

@ val githubPath = os.Path("/Users/lihaoyi/Github")
githubPath: os.Path = root/'Users/'lihaoyi/'Github

@ val usersPath = os.Path("/Users")
usersPath: os.Path = root/'Users

@ githubPath.relativeTo(usersPath)
res36: os.RelPath = 'lihaoyi/'Github

@ usersPath.relativeTo(githubPath)
res37: os.RelPath = up/up
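
A common use of .relativeTo is to re-root a path from one folder under another. The following is a small sketch of that idea, assuming OS-Lib is on the classpath; the folder and file names are only examples.

```scala
// Assumes OS-Lib is available, e.g. when run as an Ammonite *.sc script.

val src = os.root / "Users" / "lihaoyi" / "Github" / "blog" / "post"
val dest = os.root / "Users" / "lihaoyi" / "Github" / "blog" / "post-copy"
val file = src / "Reimagining" / "GithubSearch.png"

// The location of `file` as seen from `src`...
val rel = file.relativeTo(src)

// ...re-rooted under `dest`.
val mirrored = dest / rel
assert(mirrored == dest / "Reimagining" / "GithubSearch.png")

// Appending the relative path back onto its base recovers the original.
assert(src / rel == file)
```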

Paths are always canonicalized

os.Paths always resolve any .. segments:

@ val githubPathOne = os.Path("/Users/lihaoyi/Github/../Github")
githubPathOne: os.Path = root/'Users/'lihaoyi/'Github

@ val githubPathTwo = os.Path("/Users/lihaoyi/Github/../Github/../Github")
githubPathTwo: os.Path = root/'Users/'lihaoyi/'Github

@ githubPathOne == githubPathTwo
res17: Boolean = true

os.Paths also drop redundant or unnecessary /s, whether in the middle of a path or trailing:

@ os.Path("/Users/lihaoyi////Github/")
res18: os.Path = root/'Users/'lihaoyi/'Github

@ os.Path("/Users/lihaoyi/Github") == os.Path("/Users/lihaoyi////Github/")
res19: Boolean = true

Thus, you can be sure that an os.Path is always in its canonical representation, and can be easily printed, compared, sorted, de-duplicated, etc.

Relative os.RelPaths are also canonical:

@ val helloPathOne = os.RelPath("../hello/world")
helloPathOne: os.RelPath = up/'hello/'world

@ val helloPathTwo = os.RelPath("../hello/../hello/world//../world")
helloPathTwo: os.RelPath = up/'hello/'world

@ helloPathOne == helloPathTwo
res29: Boolean = true
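
Because equal paths always have the same representation, ordinary collection operations can be used to de-duplicate them. Here is a minimal sketch, assuming OS-Lib is on the classpath and a Unix-style filesystem; the input strings are made up for illustration.

```scala
// Assumes OS-Lib is available, e.g. when run as an Ammonite *.sc script.

// Several spellings of the same two locations.
val raw = Seq(
  "/Users/lihaoyi/Github",
  "/Users/lihaoyi/Github/",
  "/Users/lihaoyi//Github",
  "/Users/lihaoyi/Github/../Github",
  "/Users/lihaoyi/blog"
)

// Parsing canonicalizes each string, so .distinct collapses the duplicates.
val unique = raw.map(s => os.Path(s)).distinct

assert(unique == Seq(
  os.root / "Users" / "lihaoyi" / "Github",
  os.root / "Users" / "lihaoyi" / "blog"
))
```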

Type-safe extension

Given an absolute path and a relative path:

@ val githubPath = os.Path("/Users/lihaoyi/Github")
githubPath: os.Path = root/'Users/'lihaoyi/'Github

@ val postPath = os.RelPath("post")
postPath: os.RelPath = 'post

You can extend an absolute path with a relative path:

@ githubPath / postPath
res32: os.Path = root/'Users/'lihaoyi/'Github/'post

Or a relative path with another relative path:

@ postPath / postPath
res33: os.RelPath = 'post/'post

But you cannot extend an absolute path with an absolute path:

@ githubPath / githubPath
cmd34.sc:1: type mismatch;
 found   : os.Path
 required: os.RelPath
val res34 = githubPath / githubPath
                         ^
Compilation Failed

Or a relative path with an absolute path:

@ postPath / githubPath
cmd34.sc:1: type mismatch;
 found   : os.Path
 required: os.RelPath
val res34 = postPath / githubPath
                       ^
Compilation Failed

It basically never makes sense to extend something with an absolute path, and the os.Path type makes sure you do not do so by accident.

Filesystem Operations

Queries

The first thing you may want to do is see what's available in a particular folder, which you can do using os.list:

@ os.list(os.pwd)
res38: WrappedArray[os.Path] = ArrayBuffer(
  root/'Users/'lihaoyi/'Github/'blog/".gitignore",
  root/'Users/'lihaoyi/'Github/'blog/"build.sc",
  root/'Users/'lihaoyi/'Github/'blog/"favicon.png",
  root/'Users/'lihaoyi/'Github/'blog/'post,
)

os.walk for a recursive listing:

@ os.walk(os.pwd)
res40: IndexedSeq[os.Path] = ArrayBuffer(
  root/'Users/'lihaoyi/'Github/'blog/"build.sc",
  root/'Users/'lihaoyi/'Github/'blog/'post,
  root/'Users/'lihaoyi/'Github/'blog/'post/"9 - Micro-optimizing your Scala code.md",
  root/'Users/'lihaoyi/'Github/'blog/'post/"24 - How to conduct a good Programming Interview.md",
  root/'Users/'lihaoyi/'Github/'blog/'post/"23 - Scala Vector operations aren't \"Effectively Constant\" time.md",
  root/'Users/'lihaoyi/'Github/'blog/'post/'Reimagining,
  root/'Users/'lihaoyi/'Github/'blog/'post/'Reimagining/"GithubSearch.png",
  ...

You can also use os.stat, os.isFile, os.size, etc. to read metadata of individual files or folders.
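
The sketch below tries a few of these metadata queries against a scratch file. It assumes OS-Lib is on the classpath and uses os.temp.dir() to create a throwaway folder, so nothing in your working directory is touched; the file name is arbitrary.

```scala
// Assumes OS-Lib is available, e.g. when run as an Ammonite *.sc script.

// Work inside a throwaway folder.
val dir = os.temp.dir()
val file = dir / "hello.txt"
os.write(file, "Hello World")

// Simple yes/no queries.
assert(os.exists(file))
assert(os.isFile(file))
assert(!os.isDir(file))
assert(os.isDir(dir))

// "Hello World" is 11 bytes long.
assert(os.size(file) == 11L)

// os.stat bundles several pieces of metadata into one result.
val info = os.stat(file)
assert(info.size == 11L)
assert(!info.isDir)
```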

Actions

os.read to read a file:

@ os.read(os.pwd / ".gitignore")
res39: String = """target/
scratch/
*.iml
.idea
.settings
.classpath
.project
.cache
.sbtserver
project/.sbtserver
tags
"""

os.write to write a file:

@ os.write(os.pwd / "new.txt", "Hello World")

@ os.list(os.pwd)
res42: collection.mutable.WrappedArray[os.Path] = ArrayBuffer(
  root/'Users/'lihaoyi/'Github/'blog/".gitignore",
  root/'Users/'lihaoyi/'Github/'blog/"build.sc",
  root/'Users/'lihaoyi/'Github/'blog/"favicon.png",
  root/'Users/'lihaoyi/'Github/'blog/"new.txt",
  root/'Users/'lihaoyi/'Github/'blog/'post,
)

@ os.read(os.pwd / "new.txt")
res43: String = "Hello World"

os.move to move a file:

@ os.move(os.pwd / "new.txt", os.pwd / "newer.txt")

@ os.list(os.pwd)
res45: collection.mutable.WrappedArray[os.Path] = ArrayBuffer(
  root/'Users/'lihaoyi/'Github/'blog/".gitignore",
  root/'Users/'lihaoyi/'Github/'blog/"build.sc",
  root/'Users/'lihaoyi/'Github/'blog/"favicon.png",
  root/'Users/'lihaoyi/'Github/'blog/"newer.txt",
  root/'Users/'lihaoyi/'Github/'blog/'post,
)

os.copy to copy a file:

@ os.copy(os.pwd / "newer.txt", os.pwd / "newer-2.txt")

@ os.list(os.pwd)
res47: collection.mutable.WrappedArray[os.Path] = ArrayBuffer(
  root/'Users/'lihaoyi/'Github/'blog/".gitignore",
  root/'Users/'lihaoyi/'Github/'blog/"build.sc",
  root/'Users/'lihaoyi/'Github/'blog/"favicon.png",
  root/'Users/'lihaoyi/'Github/'blog/"newer-2.txt",
  root/'Users/'lihaoyi/'Github/'blog/"newer.txt",
  root/'Users/'lihaoyi/'Github/'blog/'post,
)

os.remove to remove a file:

@ os.remove(os.pwd / "newer.txt")

@ os.list(os.pwd)
res49: collection.mutable.WrappedArray[os.Path] = ArrayBuffer(
  root/'Users/'lihaoyi/'Github/'blog/".gitignore",
  root/'Users/'lihaoyi/'Github/'blog/"build.sc",
  root/'Users/'lihaoyi/'Github/'blog/"favicon.png",
  root/'Users/'lihaoyi/'Github/'blog/"newer-2.txt",
  root/'Users/'lihaoyi/'Github/'blog/'post,
)

os.makeDir to create a new folder:

@ os.makeDir(os.pwd / "new-folder")

@ os.list(os.pwd)
res51: collection.mutable.WrappedArray[os.Path] = ArrayBuffer(
  root/'Users/'lihaoyi/'Github/'blog/".gitignore",
  root/'Users/'lihaoyi/'Github/'blog/"build.sc",
  root/'Users/'lihaoyi/'Github/'blog/"favicon.png",
  root/'Users/'lihaoyi/'Github/'blog/"new-folder",
  root/'Users/'lihaoyi/'Github/'blog/"newer-2.txt",
  root/'Users/'lihaoyi/'Github/'blog/'post,
)

Many of these commands take flags that let you configure the operation, e.g. os.read lets you pass in an offset to read from and a count of characters to read. Many also have variants: os.read.bytes reads binary data and os.read.lines reads lines, os.makeDir.all recursively creates any necessary folders, os.remove.all recursively removes a folder and its contents, and so on.
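
To make a few of these variants concrete, here is a short sketch that runs them against a scratch folder. It assumes OS-Lib is on the classpath; the folder and file names are arbitrary, and os.write.over is used here on the assumption that overwriting an existing file is what you want.

```scala
// Assumes OS-Lib is available, e.g. when run as an Ammonite *.sc script.

val dir = os.temp.dir()

// makeDir.all creates the intermediate folders "a" and "a/b" as needed.
val nested = dir / "a" / "b" / "c"
os.makeDir.all(nested)

val file = nested / "notes.txt"
os.write(file, "first\nsecond\nthird")

// Read the same file as lines and as raw bytes.
assert(os.read.lines(file) == Seq("first", "second", "third"))
assert(os.read.bytes(file).length == 18)

// Append to the file, then replace its contents entirely.
os.write.append(file, "\nfourth")
assert(os.read.lines(file).length == 4)
os.write.over(file, "replaced")
assert(os.read(file) == "replaced")

// remove.all deletes the folder "a" together with everything inside it.
os.remove.all(dir / "a")
assert(!os.exists(dir / "a"))
```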

The linked documentation for each command goes into more detail of what you can do with each one.

Streaming

Many operations expose a .stream variant, which allows you to process its output in a streaming fashion. This avoids accumulating all the output in memory, letting you process large results without causing memory issues.

For example, os.read.lines.stream to stream the lines of a file:

@ os.read.lines.stream(os.pwd / ".gitignore").foreach(println)
target/
scratch/
*.iml
.idea
.settings
.classpath
.project
.cache
.sbtserver
project/.sbtserver
tags

os.list.stream for streaming the contents of a folder:

@ os.list.stream(os.pwd).foreach(println)
/Users/lihaoyi/Github/blog/build.sc
/Users/lihaoyi/Github/blog/post
/Users/lihaoyi/Github/blog/target
/Users/lihaoyi/Github/blog/favicon.png
/Users/lihaoyi/Github/blog/pages.sc
/Users/lihaoyi/Github/blog/.gitignore
/Users/lihaoyi/Github/blog/new-folder
/Users/lihaoyi/Github/blog/newer-2.txt
/Users/lihaoyi/Github/blog/blog.iml
/Users/lihaoyi/Github/blog/.git
/Users/lihaoyi/Github/blog/pageStyles.sc
/Users/lihaoyi/Github/blog/.idea

*.stream operations return a Generator type. These are similar to iterators, except they ensure that resources are always released after processing. This helps avoid leaking file handles or other filesystem resources. Other than that, most collection operators like .foreach, .map, .filter, .toArray, etc. all apply.
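
As a small sketch of chaining collection operators on a streamed result, the snippet below filters a log-like file line by line and only materializes the matching lines. It assumes OS-Lib is on the classpath; the file name and its contents are invented for the example.

```scala
// Assumes OS-Lib is available, e.g. when run as an Ammonite *.sc script.

val dir = os.temp.dir()
val log = dir / "app.log"
os.write(log, "INFO start\nERROR disk full\nINFO retry\nERROR timeout\n")

// Lines are read and filtered one at a time; only the lines that
// survive the filter end up in the resulting array.
val errors = os.read.lines.stream(log)
  .filter(_.startsWith("ERROR"))
  .map(_.stripPrefix("ERROR "))
  .toArray

assert(errors.toList == List("disk full", "timeout"))
```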

Use Case: Find Largest 5 Files

Now that we've gone over the basic operations that you can perform on a filesystem, let's walk through a simple use case.

Often when your disk is full, you want to look for the biggest files that you can remove to free up space. We can do this in a few steps:

First we list all the files and folders in a particular folder (for now just using os.pwd):

@ val allPaths = os.walk(os.pwd)
allPaths: IndexedSeq[os.Path] = ArrayBuffer(
  root/'Users/'lihaoyi/'Github/'blog/"build.sc",
  root/'Users/'lihaoyi/'Github/'blog/'post,
  root/'Users/'lihaoyi/'Github/'blog/'post/"9 - Micro-optimizing your Scala code.md",
  root/'Users/'lihaoyi/'Github/'blog/'post/"24 - How to conduct a good Programming Interview.md",
  root/'Users/'lihaoyi/'Github/'blog/'post/"23 - Scala Vector operations aren't \"Effectively Constant\" time.md",
  ...

Next, we can filter out the folders so we're only looking at files:

@ val allFiles = allPaths.filter(os.isFile)
allFiles: IndexedSeq[os.Path] = ArrayBuffer(
  root/'Users/'lihaoyi/'Github/'blog/"build.sc",
  root/'Users/'lihaoyi/'Github/'blog/'post/"9 - Micro-optimizing your Scala code.md",
  root/'Users/'lihaoyi/'Github/'blog/'post/"24 - How to conduct a good Programming Interview.md",
  root/'Users/'lihaoyi/'Github/'blog/'post/"23 - Scala Vector operations aren't \"Effectively Constant\" time.md",

Find out how big each file is by using .map:

@ val sizedFiles = allFiles.map(path => (os.size(path), path))
sizedFiles: IndexedSeq[(Long, os.Path)] = ArrayBuffer(
  (8134L, root/'Users/'lihaoyi/'Github/'blog/"build.sc"),
  (73028L, root/'Users/'lihaoyi/'Github/'blog/'post/"9 - Micro-optimizing your Scala code.md"),
  (49727L, root/'Users/'lihaoyi/'Github/'blog/'post/"24 - How to conduct a good Programming Interview.md"),
  (17269L, root/'Users/'lihaoyi/'Github/'blog/'post/"23 - Scala Vector operations aren't \"Effectively Constant\" time.md"),
  ...

Lastly, sort by size and take the last 5, which are the largest:

@ sizedFiles.sortBy(_._1).takeRight(5)
res59: IndexedSeq[(Long, os.Path)] = ArrayBuffer(
  (5499949L, root/'Users/'lihaoyi/'Github/'blog/'target/'post/'slides/"Why-You-Might-Like-Scala.js.pdf"),
  (6008395L, root/'Users/'lihaoyi/'Github/'blog/'post/'SmartNation/"routes.json"),
  (6008395L, root/'Users/'lihaoyi/'Github/'blog/'target/'post/'SmartNation/"routes.json"),
  (6340270L, root/'Users/'lihaoyi/'Github/'blog/'post/'Reimagining/"GithubHistory.gif"),
  (6340270L, root/'Users/'lihaoyi/'Github/'blog/'target/'post/'Reimagining/"GithubHistory.gif")
)

Here, we can see the 5 largest files: in this folder, they are a number of large GIFs, JSON datasets, and a PDF document. You can do all this in one command using:

@ os.walk(os.pwd).filter(os.isFile).map(path => (os.size(path), path)).sortBy(_._1).takeRight(5)
res60: IndexedSeq[(Long, os.Path)] = ArrayBuffer(
  (5499949L, root/'Users/'lihaoyi/'Github/'blog/'target/'post/'slides/"Why-You-Might-Like-Scala.js.pdf"),
  (6008395L, root/'Users/'lihaoyi/'Github/'blog/'post/'SmartNation/"routes.json"),
  (6008395L, root/'Users/'lihaoyi/'Github/'blog/'target/'post/'SmartNation/"routes.json"),
  (6340270L, root/'Users/'lihaoyi/'Github/'blog/'post/'Reimagining/"GithubHistory.gif"),
  (6340270L, root/'Users/'lihaoyi/'Github/'blog/'target/'post/'Reimagining/"GithubHistory.gif")
)
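
The same pipeline can be wrapped in a reusable function. This is only a sketch, assuming OS-Lib is on the classpath: `largestFiles` is a hypothetical name, and unlike the one-liner above it sorts in descending order so the largest file comes first.

```scala
// Assumes OS-Lib is available, e.g. when run as an Ammonite *.sc script.

// Hypothetical helper: the `n` largest files under `dir`, largest first.
def largestFiles(dir: os.Path, n: Int): Seq[(Long, os.Path)] =
  os.walk(dir)
    .filter(p => os.isFile(p))            // drop folders
    .map(path => (os.size(path), path))   // pair each file with its size
    .sortBy(-_._1)                        // descending by size
    .take(n)

// Try it on a scratch folder containing files of 1, 10 and 100 bytes.
val dir = os.temp.dir()
os.makeDir.all(dir / "sub")
os.write(dir / "small.txt", "x" * 1)
os.write(dir / "medium.txt", "x" * 10)
os.write(dir / "sub" / "large.txt", "x" * 100)

val top2 = largestFiles(dir, 2)
assert(top2.map(_._1) == Seq(100L, 10L))
assert(top2.head._2 == dir / "sub" / "large.txt")
```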

Use Case: Folder Syncing

Let's walk through a second use case: write a program that will take a source and destination folder, and efficiently update the destination folder to look like the source folder as files are added to it or modified (for simplicity, we will ignore deletions).

@ val src = os.pwd / "post"; val dest = os.pwd / "post-copy"
src: os.Path = root/'Users/'lihaoyi/'Github/'blog/'post
dest: os.Path = root/'Users/'lihaoyi/'Github/'blog/"post-copy"

Let's also assume that simply deleting the destination and re-copying the source over is too inefficient:

@ os.remove.all(dest)

@ os.copy(src, dest)

And we want to do it on a per-file/folder basis.

To begin with, we need to recursively walk all contents of the source folder:

@ val srcContents = os.walk(src)
srcContents: IndexedSeq[os.Path] = ArrayBuffer(
  root/'Users/'lihaoyi/'Github/'blog/'post/"9 - Micro-optimizing your Scala code.md",
  root/'Users/'lihaoyi/'Github/'blog/'post/"24 - How to conduct a good Programming Interview.md",
  root/'Users/'lihaoyi/'Github/'blog/'post/"23 - Scala Vector operations aren't \"Effectively Constant\" time.md",
  root/'Users/'lihaoyi/'Github/'blog/'post/'Reimagining,
  root/'Users/'lihaoyi/'Github/'blog/'post/'Reimagining/"GithubSearch.png",
...

Then, we iterate over every entry, and see if it's a file or folder:

@ for(path <- srcContents) println(os.isDir(path))
false
false
false
true
false
false

For simplicity, we'll ignore the presence of symbolic links, detectable via os.isLink.

We can find the corresponding isDir for the destination path using:

@ for(path <- srcContents) println(os.isDir(dest / path.relativeTo(src)))
false
false
false
false
false
false

For now, the destination folder doesn't exist, so isDir returns false on all of the paths.

Next, we walk over the srcContents and the corresponding paths in dest together, and if they differ, delete the destination sub-path and copy the source sub-path over:

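@ for(srcSubPath <- srcContents) {
    val destSubPath = dest / srcSubPath.relativeTo(src)
    (os.isDir(srcSubPath), os.isDir(destSubPath)) match{
      // One side is a folder and the other is not: replace the destination
      case (false, true) | (true, false) => os.copy.over(srcSubPath, destSubPath)
      // Neither is a folder: copy if the destination is missing or its contents differ.
      // Arrays compare by reference with !=, so compare their elements instead.
      case (false, false)
        if !os.exists(destSubPath)
        || !os.read.bytes(srcSubPath).sameElements(os.read.bytes(destSubPath))  =>

        os.copy.over(srcSubPath, destSubPath, createFolders = true)

      case _ => // do nothing
    }
  }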

Now, we can walk the dest path and see all our contents in place:

@ os.walk(dest)
res13: IndexedSeq[os.Path] = ArrayBuffer(
  root/'Users/'lihaoyi/'Github/'blog/"post-copy"/"9 - Micro-optimizing your Scala code.md",
  root/'Users/'lihaoyi/'Github/'blog/"post-copy"/"24 - How to conduct a good Programming Interview.md",
  root/'Users/'lihaoyi/'Github/'blog/"post-copy"/"23 - Scala Vector operations aren't \"Effectively Constant\" time.md",
  root/'Users/'lihaoyi/'Github/'blog/"post-copy"/'Reimagining,
  root/'Users/'lihaoyi/'Github/'blog/"post-copy"/'Reimagining/"GithubSearch.png",
  root/'Users/'lihaoyi/'Github/'blog/"post-copy"/'Reimagining/"GithubBrowsing.gif",

We can wrap this all in a function for easy usage:

@ def sync(src: os.Path, dest: os.Path) = {
    val srcContents = os.walk(src)
    for(srcSubPath <- srcContents) {
      val destSubPath = dest / srcSubPath.relativeTo(src)
      (os.isDir(srcSubPath), os.isDir(destSubPath)) match{
        case (false, true) | (true, false) => os.copy.over(srcSubPath, destSubPath)
        // Compare file contents element-wise; != on arrays only compares references
        case (false, false)
          if !os.exists(destSubPath)
          || !os.read.bytes(srcSubPath).sameElements(os.read.bytes(destSubPath))  =>

          os.copy.over(srcSubPath, destSubPath, createFolders = true)

        case _ => // do nothing
      }
    }
  }

defined function sync

To test incremental updates, we can try adding an entry to the src folder:

@ os.write(src / "ABC.txt", "Hello World")

Running the sync:

@ sync(src, dest)

We can then see our file has been synced over to dest:

@ os.exists(dest / "ABC.txt")
res29: Boolean = true

@ os.read(dest / "ABC.txt")
res30: String = "Hello World"

And modifications to that file also get synced over:

@ os.write.append(src / "ABC.txt", "\nI am Cow")

@ sync(src, dest)

@ os.read(dest / "ABC.txt")
res33: String = """Hello World
I am Cow"""

This use case is greatly simplified so that it fits within a blog post: we do not consider deletions, syncing permissions, sub-file level syncing of data (e.g. Dropbox famously syncs in 4MB blocks), or concurrency/parallelism concerns. Nevertheless, it should give you a good sense of how working with the filesystem via Scala's OS-Lib library works, and you can easily extend it if you need more functionality.
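
As one example of such an extension, the sketch below handles deletions by removing anything in the destination that no longer has a counterpart in the source. It is only an illustration under simple assumptions (no symbolic links, no concurrent modification); `removeStale` is a hypothetical helper, not part of OS-Lib or of the sync function above.

```scala
// Assumes OS-Lib is available, e.g. when run as an Ammonite *.sc script.

// Hypothetical helper: remove anything under `dest` that has no
// counterpart under `src`.
def removeStale(src: os.Path, dest: os.Path): Unit = {
  if (os.exists(dest)) {
    // os.walk returns a fully materialized listing, so it is safe to
    // delete entries while looping over it.
    for (destSubPath <- os.walk(dest)) {
      val srcSubPath = src / destSubPath.relativeTo(dest)
      // The os.exists(destSubPath) check skips children of folders
      // that an earlier iteration already removed.
      if (os.exists(destSubPath) && !os.exists(srcSubPath)) {
        os.remove.all(destSubPath)
      }
    }
  }
}
```

Calling removeStale(src, dest) after a sync would then bring the destination in line with deletions made in the source.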

Conclusion

While we have only covered two use cases in this post, the OS-Lib Cookbook has several other use cases you can browse to see how file handling works in a wider variety of situations.

This is only a quick tour of how to work with the filesystem in various ways. The library documentation has a much more thorough reference for all the things you can do and how to do them:

Documentation

Dealing with files and folders in Scala doesn't need to be difficult or verbose. With the OS-Lib library, querying information about the filesystem is both convenient and safe: you can accomplish what you want in very little code, while the compiler and library help you check your logic and make sure you aren't e.g. messing up your path handling.

While OS-Lib is a third-party library, it is available on Maven Central and easy to use in any Scala environment: whether built using SBT, Maven, Mill, or directly in Ammonite's REPL or Scripts. All systems end up needing to interact with the filesystem for various miscellaneous tasks, and in Scala such interactions can be quick, easy, and safe.