Pandoc is the Swiss army knife for document conversions, and pretty much a necessary component of almost any system that requires flexible document exports. Unfortunately, the Amazon Linux image used by Lambda doesn’t include Pandoc. However, with a bit of tweaking, you can run almost any binary inside it. In this tutorial, you’ll learn how to create a flexible and scalable document conversion service using Pandoc, deployed to AWS Lambda. You’ll also learn how to deploy third-party binaries with your Lambda function, so you can install similar tools yourself.
TL;DR, just give me the files
If you are in a hurry, just want to get started quickly, and don’t care about how things work, check out the final code in Pandoc S3 Converter Example Project.
If you don’t want to use Node.js, but just want the precompiled Pandoc binary, download it from https://github.com/claudiajs/pandoc-aws-lambda-binary.
Step-by-step tutorial
Here’s the plan:
- Step 1: compile a static binary package for Lambda
- Step 2: bundle the binary with a Lambda function
- Step 3: create a Lambda function to download and upload files to S3
- Step 4: send everything up to Lambda
- Step 5: connect to an S3 event source
- Step 6: convert files
Step 1: compiling the binary package
Lambda runs on Amazon Linux, a modified version of CentOS, so it’s best to use that same image for compiling the static binary. You can find out the exact Amazon Machine Image ID for your region on the Lambda Execution Environment page. For example, at the time when this tutorial was written, the AMI in the US-East region was ami-60b6c60a. Launch the corresponding AMI in EC2, then connect to it using ssh.
For compiling Pandoc, you’ll need at least a t2.medium instance. Smaller instances may not have enough memory for everything. You’ll need this machine only to compile the binary, so don’t worry about the cost too much.
Pandoc is written in Haskell, so that will make it even more challenging to create a binary than with a typical Unix tool. David Tchepak wrote a nice guide on how to produce static binaries with Haskell on Linux a while ago, so we’ll use that as a guide and just make sure to use updated versions. For some more background info, check out the Static Build of Pandoc thread on GitHub.
First, we need to install a bunch of development tools:
Installing the Haskell packages directly from YUM won’t help, because the version available for the Amazon instances is quite outdated. We’ll install the ghc compiler directly from the Haskell repository. At the time when I wrote this, even the latest binary version of ghc didn’t work out of the box on the Amazon Linux AMIs used for Lambda, because it was looking for an older version of libgmp. This is the error you’ll get trying to run it:
So we’re going to cheat a bit. This step may not be necessary in the future, if ghc stops asking for version 3 of libgmp, but with 8.0.1 you’ll need it to proceed:
Now we can install the Haskell compiler. Check out the GHC Download Page to find the latest version, and pick up the variant for x86_64-centos. At the time when I wrote this, the current version was 8.0.1. Here’s a quick script to install it:
Next, we’ll need the Cabal packaging tool. Find the most recent version on the Cabal download page (you most likely need the x86_64-unknown-linux variant) and unpack it to a directory in the executable path:
We can now finally compile pandoc. To make sure it doesn’t scatter files around the system, we’ll use the Cabal sandbox feature:
You can now copy the pandoc.gz file over to your local disk (using SCP, for example), then shut down the EC2 instance.
Create a directory for your project files, and then save pandoc.gz into the vendor subdirectory.
Step 2: bundle the binary with a Lambda function
The statically linked Pandoc binary is roughly 50 MB. Lambda has a limit of 50 MB for the total package size, so we’ll send up a compressed version. Gzip brings the binary size down to roughly 11 MB, which is a lot better. We’ll then decompress it when the Lambda function executes. The decompressed binary has to go into the /tmp directory, because that’s the only place a Lambda function can write.
To avoid latency and save money on Lambda execution, we should check first if Lambda is reusing an old VM container before decompressing. If the uncompressed binary is already there, we can just use it again. There’s no guarantee when and how Lambda decides to reuse the VMs, but this trick saves a lot of time in practice.
We’ll use child_process.exec to decompress because it runs the command in a shell, which allows redirecting standard output into a file (convenient for calling gzip without having to copy the compressed file to /tmp first). On the other hand, we don’t want to use .exec to run pandoc. exec buffers the output and error streams in memory, and the child process gets killed if the buffer limit is exceeded. This means that, in case pandoc blows up with a massive error dump, we might lose valuable troubleshooting data. Instead, we’ll use child_process.spawn to execute pandoc and stream back any output and errors to the Lambda console. I like to use promises for asynchronous work, so I’ll first create two simple Promise wrappers around those methods:
child-process-promise.js
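A minimal sketch of what such wrappers might look like, assuming a Node.js runtime with native promises. The names execPromise and spawnPromise and the exact error handling are illustrative choices, not necessarily those of the example project:

```javascript
'use strict';
const childProcess = require('child_process');

// Run a command in a shell; resolve when it exits cleanly, reject otherwise.
// exec buffers output in memory, so keep this for short, quiet commands.
const execPromise = function (command) {
  return new Promise((resolve, reject) => {
    childProcess.exec(command, (err) => (err ? reject(err) : resolve()));
  });
};

// Run a binary without a shell and forward its output to the console as it
// arrives, so nothing is lost if the process prints a lot before failing.
const spawnPromise = function (command, args) {
  return new Promise((resolve, reject) => {
    const child = childProcess.spawn(command, args);
    child.stdout.on('data', (chunk) => console.log(chunk.toString()));
    child.stderr.on('data', (chunk) => console.error(chunk.toString()));
    child.on('error', reject); // e.g. the binary could not be started
    child.on('close', (code) => {
      if (code === 0) {
        resolve();
      } else {
        reject(new Error(command + ' exited with code ' + code));
      }
    });
  });
};

module.exports = { exec: execPromise, spawn: spawnPromise };
```

Rejecting with an Error that includes the exit code keeps failures easy to spot in the Lambda logs.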
Now we can just wrap the whole process of detecting a pre-existing binary, unzipping if it can’t be found, and invoking pandoc into a nice, convenient JavaScript function. This code assumes that the pandoc.gz file is in the vendor subdirectory.
pandoc.js
Step 3: create a Lambda function to download and upload files to S3
We’ll wire up the conversion process to simply listen for S3 events, and when a new file is uploaded, convert it locally and re-upload to S3 under a different name. S3 events can be associated with a prefix, which allows us to nicely use a single bucket for both incoming and outgoing files. We’ll set up the Lambda to work on anything uploaded to the /in directory, and ignore all other uploads. That way, sending a converted file back to the same bucket in a different directory won’t trigger the Lambda conversion function again.
First we need two utility functions: one to grab files from S3 and save them to the local /tmp directory, and one to upload local files back to S3.
s3-util.js
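A possible shape for these two helpers, sketched under the assumption that the caller passes in an AWS SDK for JavaScript v2 client (new AWS.S3()). Passing the client as the first argument is an illustrative choice that keeps the file free of a hard aws-sdk dependency; the function names are hypothetical:

```javascript
'use strict';
const fs = require('fs');

// Stream an S3 object to a local file; resolves with the local path.
// `s3` is assumed to be an AWS SDK v2 S3 client supplied by the caller.
const downloadFromS3 = function (s3, bucket, key, localPath) {
  return new Promise((resolve, reject) => {
    const file = fs.createWriteStream(localPath);
    const stream = s3.getObject({ Bucket: bucket, Key: key }).createReadStream();
    stream.on('error', reject);
    file.on('error', reject);
    file.on('finish', () => resolve(localPath));
    stream.pipe(file);
  });
};

// Stream a local file to S3; resolves with the SDK's upload result.
const uploadToS3 = function (s3, bucket, key, localPath) {
  return s3.upload({
    Bucket: bucket,
    Key: key,
    Body: fs.createReadStream(localPath)
  }).promise();
};

module.exports = { downloadFromS3, uploadToS3 };
```

Streaming in both directions avoids holding a whole document in memory, which matters for larger files on small Lambda configurations.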
Next, we can write the conversion workflow, calling the pandoc wrapper we created earlier:
convert.js
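The workflow could be sketched roughly as follows. To keep the example self-contained, the download, pandoc and upload helpers are passed in through a hypothetical steps object, and crypto.randomUUID (Node.js 14.17 or later) stands in for the uuid library. The outputKey helper and the cleanup step are likewise assumptions:

```javascript
'use strict';
const fs = require('fs');
const os = require('os');
const path = require('path');
const crypto = require('crypto');

// Map an incoming key such as in/report.md to out/report.docx.
const outputKey = function (inputKey, extension) {
  const baseName = path.posix.basename(inputKey, path.posix.extname(inputKey));
  return 'out/' + baseName + '.' + extension;
};

// `steps` bundles the helpers from the other files (hypothetical shape):
//   download(bucket, key, localPath), pandoc(inPath, outPath),
//   upload(bucket, key, localPath); each returns a promise.
const convert = function (bucket, key, steps, extension) {
  const targetExtension = extension || 'docx';
  const id = crypto.randomUUID(); // unique temporary file name
  // Keep the source extension so pandoc can detect the input format.
  const inPath = path.join(os.tmpdir(), id + path.extname(key));
  const outPath = path.join(os.tmpdir(), id + '.' + targetExtension);
  // Reused containers keep their temporary files, so remove them either way.
  const cleanup = () => [inPath, outPath].forEach((file) => fs.rmSync(file, { force: true }));

  return steps.download(bucket, key, inPath)
    .then(() => steps.pandoc(inPath, outPath))
    .then(() => steps.upload(bucket, outputKey(key, targetExtension), outPath))
    .then(
      (result) => { cleanup(); return result; },
      (err) => { cleanup(); throw err; }
    );
};

module.exports = { convert, outputKey };
```

Unique temporary names keep concurrent or repeated invocations in a reused container from overwriting each other’s files.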
The main Lambda code then just needs to call the convert function and pass the right bucket name and file key. When S3 events trigger a Lambda function, the event will contain a Records array, and each record will have .s3.bucket.name and .s3.object.key fields. If you want to find out more about the other event fields, just dump the event to console.log before processing.
main.js
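A sketch of how the entry point might read the event, assuming a Node.js runtime that accepts promise-returning handlers. The parseRecords and makeHandler names are hypothetical. The decoding step is there because S3 URL-encodes object keys in event payloads and represents spaces as +:

```javascript
'use strict';

// Pull the bucket name and the decoded object key out of each event record.
const parseRecords = function (event) {
  return (event.Records || []).map((record) => ({
    bucket: record.s3.bucket.name,
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '))
  }));
};

// Build a Lambda handler around a convert(bucket, key) function that returns
// a promise. The handler resolves once every record has been processed.
const makeHandler = function (convert) {
  return function handler(event) {
    console.log('processing', JSON.stringify(event));
    const files = parseRecords(event);
    return Promise.all(files.map((file) => convert(file.bucket, file.key)));
  };
};

// In the deployed function this would be wired up along these lines,
// assuming convert.js exports a (bucket, key) function:
//   exports.handler = makeHandler(require('./convert'));
module.exports = { parseRecords, makeHandler };
```

Without the decoding step, a file uploaded as "my notes.md" would arrive as "my+notes.md", and the download from S3 would fail because no object has that key.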
Step 4: send everything up to Lambda
To simplify uploading this package to Lambda, we’ll use claudia. Install it as a global utility if you’ve not done that already.
Claudia can grab all dependencies using NPM, so initialize a package.json in the directory with the source files, if you’ve not done that already.
We’re using the uuid library in the conversion workflow to generate temporary file names, and aws-sdk to access S3, so you’ll need to add them as production dependencies:
Now we can deploy the function to Lambda:
Step 5: connect to an S3 event source
Now we need to create a bucket on S3 for the files. Use the AWS SDK to create a new bucket – in this example, I’ll call it pandoc-test-bucket:
Claudia has a handy shortcut that sets up an S3 event source for a Lambda function, enables the Lambda function to read and write to a bucket, and enables the bucket to invoke the Lambda function:
Step 6: convert files
The service is now live, wired up, and ready to go. It will convert any files you upload to the /in directory of your S3 bucket to the docx format, and save them back to the /out folder.
Send a test file, for example a markdown file, to your bucket using the S3 console or the AWS CLI tools. The command lines below assume the bucket is called pandoc-test-bucket, so adjust the commands for your bucket name accordingly.
Wait a few seconds, and then check the /out folder of your S3 bucket for the converted file.
Download the file with the same base name, but the docx extension, from the /out folder:
Grab the code
To avoid copying and pasting, you can get the code from this tutorial directly from the Claudia.js Example Projects GitHub Repository.
The pandoc binary and the wrapper function are also available as a separate Node.js module directly from NPM. Install them for your project using NPM.