How to run Pandoc in AWS Lambda

Pandoc is the Swiss army knife of document conversion, and a near-indispensable component of any system that needs flexible document exports. Unfortunately, the Amazon Linux image used by Lambda doesn’t include Pandoc. However, with a bit of tweaking, you can run almost any binary inside it. In this tutorial, you’ll learn how to create a flexible and scalable document conversion service using Pandoc, deployed to AWS Lambda. You’ll also learn how to deploy third-party binaries with a Lambda function, so you can install similar tools yourself.

TL;DR, just give me the files

If you are in a hurry, just want to get started quickly, and don’t care about how things work, check out the final code in Pandoc S3 Converter Example Project.

If you don’t want to use Node.js, but just want the precompiled Pandoc binary, download it from https://github.com/claudiajs/pandoc-aws-lambda-binary.

Step-by-step tutorial

Here’s the plan: first, compile a static Pandoc binary on an EC2 instance running the same Amazon Linux image as Lambda; then bundle the compressed binary with a Lambda function that downloads files from S3, converts them, and uploads the results; finally, deploy everything with Claudia and wire the function up to S3 events.

Step 1: compiling the binary package

Lambda runs on Amazon Linux, a modified version of CentOS, so it’s best to use that same image for compiling the static binary. You can find the exact Amazon Machine Image ID for your region on the Lambda Execution Environment page. For example, at the time this tutorial was written, the AMI in the US East region was ami-60b6c60a. Launch the corresponding AMI in EC2, then connect to it using SSH.

For compiling Pandoc, you’ll need at least a t2.medium instance. Smaller instances may not have enough memory for everything. You’ll need this machine only to compile the binary, so don’t worry about the cost too much.
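
If you prefer doing this from the command line instead of the EC2 console, here is a rough sketch using the AWS CLI. The key pair name (my-key) is an assumption, and you’ll need a security group that allows inbound SSH; the AMI ID is the one mentioned above.

# launch the build instance from the Lambda AMI noted above
aws ec2 run-instances --image-id ami-60b6c60a --instance-type t2.medium --key-name my-key

# connect once the instance is running; ec2-user is the default Amazon Linux user
ssh -i my-key.pem ec2-user@<instance-public-ip>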

Pandoc is written in Haskell, which makes producing a static binary more challenging than with a typical Unix tool. David Tchepak wrote a nice guide on producing static binaries with Haskell on Linux a while ago, so we’ll follow that approach, just with updated versions. For some more background info, check out the Static Build of Pandoc thread on GitHub.

First, we need to install a bunch of development tools:

$ sudo yum -y install gmp-devel freeglut-devel python-devel zlib-devel gcc m4

Installing the Haskell packages directly from yum won’t help, because the version available for Amazon instances is quite outdated. Instead, we’ll install the ghc compiler directly from the Haskell repository. At the time I wrote this, even the latest binary version of ghc wouldn’t work out of the box on the Amazon Linux AMIs used for Lambda, because it looks for an older version of libgmp. This is the error you’d get trying to run it:

... 
utils/ghc-pwd/dist-install/build/tmp/ghc-pwd: error while loading shared libraries: libgmp.so.3: cannot open shared object file: No such file or directory
configure: error: cannot determine current directory

So we’re going to cheat a bit. This step may not be necessary in the future, if ghc stops asking for version 3 of libgmp, but with 8.0.1 you’ll need it to proceed:

$ sudo ln -s /usr/lib64/libgmp.so.10 /usr/lib64/libgmp.so.3 && sudo ldconfig

Now we can install the Haskell compiler. Check the GHC Download Page to find the latest version, and pick the x86_64-centos variant. At the time I wrote this, the current version was 8.0.1. Here’s a quick script to install it:

$ curl -LO https://downloads.haskell.org/~ghc/8.0.1/ghc-8.0.1-x86_64-centos67-linux.tar.xz && \
  tar xf ghc* && \
  cd ghc* && \
  ./configure --prefix=/usr && \
  sudo make install && \
  cd ..

Next, we’ll need the Cabal packaging tool. Find the most recent version on the Cabal download page (you most likely need the x86_64-unknown-linux variant), and unpack it into a directory on the executable path:

$ mkdir bin && \
  cd bin && \
  curl -LO https://www.haskell.org/cabal/release/cabal-install-1.24.0.0/cabal-install-1.24.0.0-x86_64-unknown-linux.tar.gz && \
  tar xf cabal* && \
  cd ..
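
Amazon Linux normally adds ~/bin to the PATH at login, but because the directory was only just created, your current shell session may not pick it up. If the cabal command isn’t found, add the directory to the path manually:

export PATH=$HOME/bin:$PATH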

We can now finally compile pandoc. To make sure it doesn’t scatter files around the system, we’ll use the Cabal sandbox feature:

$ cabal sandbox init && \
  cabal update && \
  cabal install hsb2hs && \
  cabal install --disable-documentation pandoc -fembed_data_files && \
  mv .cabal-sandbox/bin/pandoc ~ && \
  gzip ~/pandoc

You can now copy the pandoc.gz file over to your local disk using SCP, for example, then shut down the EC2 instance.

Create a directory for your project files, and then save pandoc.gz into the vendor subdirectory.
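
For example, assuming the same my-key key pair as before, and a project directory called pandoc-converter (an arbitrary name), the copy could look roughly like this:

# create the project layout locally
mkdir -p pandoc-converter/vendor

# pull the compressed binary off the build instance
scp -i my-key.pem ec2-user@<instance-public-ip>:~/pandoc.gz pandoc-converter/vendor/pandoc.gz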

Step 2: bundle the binary with a Lambda function

The statically linked Pandoc binary is roughly 50 MB, and Lambda has a 50 MB limit on the total package size, so we’ll upload a compressed version instead. Gzip brings the binary down to roughly 11 MB, which is a lot better. We’ll then decompress it once the Lambda function executes, into the /tmp directory, because that’s the only place a Lambda function can write.

To avoid latency and save money on Lambda execution, we should first check whether Lambda is reusing an old container before decompressing. If the uncompressed binary is already there, we can just use it again. There’s no guarantee about when and how Lambda decides to reuse containers, but this trick saves a lot of time in practice.

We’ll use child_process.exec to decompress, because it allows redirecting standard output into a file (which is convenient for calling gzip without first copying the compressed file to /tmp). On the other hand, we don’t want to use exec to run pandoc. exec buffers output and error data in memory, and kills the child process if the buffer limit is exceeded. This means that, in case pandoc blows up with a massive error dump, we might lose valuable troubleshooting data. Instead, we’ll use child_process.spawn to execute pandoc asynchronously and stream any output and errors back to the Lambda console. I like to use promises for asynchronous work, so I’ll first create two simple Promise wrappers around those methods:

child-process-promise.js

var childProcess = require('child_process'),
	execPromise = function (command) {
		'use strict';
		return new Promise(function (resolve, reject) {
			childProcess.exec(command, function (err, result) {
				console.log('exec complete', err, result);
				if (err) {
					reject(err);
				} else {
					resolve();
				}
			});
		});
	},
	spawnPromise = function (command, options) {
		'use strict';
		return new Promise(function (resolve, reject) {
			var child = childProcess.spawn(command, options);
			child.stdout.on('data', console.log);
			child.stderr.on('data', console.error);
			child.on('close', function (code) {
				console.log('spawn ended', code);
				if (code !== 0) {
					reject(code);
				} else {
					resolve();
				}
			});
		});
	};
module.exports = {
	exec: execPromise,
	spawn: spawnPromise
};

Now we can wrap the whole process of detecting a pre-existing binary, unzipping it if it can’t be found, and invoking pandoc into a nice, convenient JavaScript function. This code assumes that the pandoc.gz file is in the vendor subdirectory.

pandoc.js

var os = require('os'),
	path = require('path'),
	fs = require('fs'),
	cp = require('./child-process-promise'),
	exists = function (target) {
		'use strict';
		return new Promise(function (resolve, reject) {
			fs.access(target, function (err) {
				if (err) {
					reject(target);
				} else {
					resolve(target);
				}
			});
		});
	},
	makeExecutable = function (target) {
		'use strict';
		return new Promise(function (resolve, reject) {
			fs.chmod(target, '0700', function (err) {
				if (err) {
					reject(target);
				} else {
					resolve(target);
				}
			});
		});
	},
	unzip = function (targetPath) {
		'use strict';
		return cp.exec('cat ' + path.join(__dirname, 'vendor', 'pandoc.gz') + 
                   ' | gzip -d  > ' + targetPath).then(function () {
			return makeExecutable(targetPath);
		});
	},
	findUnpackedBinary = function () {
		'use strict';
		return exists(path.join(os.tmpdir(), 'pandoc')).catch(unzip);
	};

module.exports = function pandoc(inPath, outPath, additionalOptions) {
	'use strict';
	return findUnpackedBinary().then(function (commandPath) {
		return cp.spawn(commandPath, [inPath, '-o', outPath].concat(additionalOptions || []));
	});
};

Step 3: create a Lambda function to download and upload files to S3

We’ll wire up the conversion process to simply listen for S3 events, and when a new file is uploaded, convert it locally and re-upload to S3 under a different name. S3 events can be associated with a prefix, which allows us to nicely use a single bucket for both incoming and outgoing files. We’ll set up the Lambda to work on anything uploaded to the /in directory, and ignore all other uploads. That way, sending a converted file back to the same bucket in a different directory won’t trigger the Lambda conversion function again.

First, we need two utility functions: one to grab files from S3 and save them to the local /tmp directory, and another to upload local files back to S3.

s3-util.js

var aws = require('aws-sdk'),
	path = require('path'),
	fs = require('fs'),
	os = require('os'),
	uuid = require('uuid'),
	s3 = new aws.S3(),
	downloadFromS3 = function (bucket, fileKey) {
		'use strict';
		console.log('downloading', bucket, fileKey);
		return new Promise(function (resolve, reject) {
			var filePath = path.join(os.tmpdir(), uuid.v4() + path.extname(fileKey)),
				file = fs.createWriteStream(filePath),
				stream = s3.getObject({
					Bucket: bucket,
					Key: fileKey
				}).createReadStream();

			stream.setEncoding('utf8');

			stream.on('error', reject);
			file.on('error', reject);
			file.on('finish', function () {
				console.log('downloaded', bucket, fileKey);
				resolve(filePath);
			});
			stream.pipe(file);
		});
	}, uploadToS3 = function (bucket, fileKey, filePath, acl) {
		'use strict';
		console.log('uploading', bucket, fileKey, filePath, acl);
		return new Promise(function (resolve, reject) {
			s3.upload({
				Bucket: bucket,
				Key: fileKey,
				Body: fs.createReadStream(filePath),
				ACL: acl || 'private'
			}, function (error, result) {
				if (error) {
					reject(error);
				} else {
					resolve(result);
				}
			});
		});
	};

module.exports = {
	download: downloadFromS3,
	upload: uploadToS3
};

Next, we can write the conversion workflow, calling the pandoc wrapper we created earlier:

convert.js

var path = require('path'),
	fs = require('fs'),
	os = require('os'),
	uuid = require('uuid'),
	pandoc = require('./pandoc'),
	s3 = require('./s3-util');

module.exports = function convert(bucket, fileKey) {
	'use strict';
	var targetPath, sourcePath;
	console.log('converting', bucket, fileKey);
	return s3.download(bucket, fileKey).then(function (downloadedPath) {
		sourcePath = downloadedPath;
		targetPath = path.join(os.tmpdir(), uuid.v4() + '.docx');
		return pandoc(sourcePath, targetPath);
	}).then(function () {
		var uploadKey = fileKey.replace(/^in/, 'out').replace(/\.[A-Za-z0-9]+$/, '.docx');
		console.log('got to upload', targetPath, sourcePath);
		return s3.upload(bucket, uploadKey, targetPath);
	}).then(function () {
		console.log('deleting', targetPath, sourcePath);
		fs.unlinkSync(targetPath);
		fs.unlinkSync(sourcePath);
	});
};

The main Lambda code then just needs to call the convert function with the right bucket name and file key. When S3 events trigger a Lambda function, the event contains a Records array, and each record has s3.bucket.name and s3.object.key fields. If you want to find out more about the other event fields, just dump the event to console.log before processing.
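
For orientation, here is a trimmed-down sketch of such an event, showing only the fields this tutorial uses; a real record carries many more:

{
	"Records": [
		{
			"eventSource": "aws:s3",
			"s3": {
				"bucket": { "name": "pandoc-test-bucket" },
				"object": { "key": "in/example.md" }
			}
		}
	]
}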

main.js

var convert = require('./convert');
exports.handler = function (event, context) {
	'use strict';
	var eventRecord = event.Records && event.Records[0];
	if (!eventRecord) {
		return context.fail('no records in the event');
	}
	if (eventRecord.eventSource !== 'aws:s3' || !eventRecord.s3) {
		return context.fail('unsupported event source');
	}
	convert(eventRecord.s3.bucket.name, eventRecord.s3.object.key).then(context.done, context.fail);
};

Step 4: send everything up to Lambda

To simplify uploading this package to Lambda, we’ll use claudia. Install it as a global utility if you’ve not done that already.
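
If you don’t have Claudia yet, install it globally from NPM:

npm install claudia -g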

Claudia can grab all dependencies using NPM, so initialize a package.json in the directory with the source files, if you’ve not done that already.

npm init

We’re using the uuid library in the conversion workflow to generate temporary file names, and aws-sdk to access S3, so you’ll need to add them as production dependencies:

npm install uuid aws-sdk -S

Now we can deploy the function to Lambda:

claudia create --region us-east-1 --handler main.handler

Step 5: connect to an S3 event source

Now we need to create an S3 bucket for the files. Use the AWS CLI to create a new bucket – in this example, I’ll call it pandoc-test-bucket:

aws s3 mb s3://pandoc-test-bucket

Claudia has a handy shortcut that sets up an S3 event source for a Lambda function, allows the Lambda function to read from and write to the bucket, and allows the bucket to invoke the Lambda function:

claudia add-s3-event-source --bucket pandoc-test-bucket --prefix in

Step 6: convert files

The service is now live, wired up, and ready to go. It will convert any files you upload to the /in directory of your S3 bucket to docx format, and save them back to the /out folder.

Send a test file, for example a markdown file, to your bucket using the S3 console, or the AWS CLI tools. The command lines below assume the bucket is called pandoc-test-bucket, so adjust the commands for your bucket name accordingly.

aws s3 cp example.md s3://pandoc-test-bucket/in/example.md

Wait a few seconds, and then check the /out folder of your S3 bucket:

aws s3 ls s3://pandoc-test-bucket/out/

Download the file with the same base name, but the docx extension, from the /out folder:

aws s3 cp s3://pandoc-test-bucket/out/example.docx .

Grab the code

To avoid copying and pasting, you can get the code from this tutorial directly from the Claudia.js Example Projects Github Repository.

The pandoc binary and the wrapper function are also available as a separate Node.js module directly from NPM. Install it in your project using:

npm install pandoc-aws-lambda-binary
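
The exact interface is documented in the module’s README; assuming it mirrors the (inputPath, outputPath, additionalOptions) wrapper from the pandoc.js file in this tutorial, which is an assumption rather than something verified here, usage would look roughly like this:

var pandoc = require('pandoc-aws-lambda-binary');

// hypothetical usage sketch: the (inputPath, outputPath, additionalOptions)
// signature is assumed to mirror the pandoc.js wrapper shown earlier;
// check the module README for the actual interface
pandoc('/tmp/example.md', '/tmp/example.docx', ['--toc'])
	.then(function () {
		console.log('conversion finished');
	})
	.catch(console.error);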
