SEO and JavaScript with PhantomJS server-side rendering

Two colleagues (Adam Parrish and Charles Fulnecky) and I, while working for a major media company, developed what we believe to be a novel approach to the problem of dynamic, client-side content being ignored by search engines. This post is a primer describing our work. If there's interest, I can write more detailed tutorials. Here is a GitHub repo with a few examples of our implementation.

PhantomJS is a headless WebKit browser that can be invoked from the command line. Our team discovered that PhantomJS can be integrated into a stack so that it pre-renders HTML/JavaScript/CSS pages before the content is served to the client. This means that when a page is loaded, all of the initial JavaScript has already been executed, and it looks to the client as though that content is part of a static page. We have, in effect, turned client-side JavaScript into a server-side language. In the sections that follow, I’ll describe a trivialized Node.js example and outline the challenges that lie ahead.

The example is made of three parts: an index page that includes a jQuery reference and displays a basic message, a PhantomJS rendering script, and a Node.js server that calls the PhantomJS script and serves the web page.

Our content is a simple index.html page that uses jQuery to set a heading tag.

<html>
<head>
  <script type="text/javascript" src="jquery.js"></script>
  <script type="text/javascript">
    $(function(){
      $('.heading').html('Gorilla!'.toUpperCase());
    });
  </script>
</head>
<body>
  <h1 class="heading"></h1>
</body>
</html>

The content is served to the client by a simple static Node.js server. When pre-rendering is enabled, the server passes the path of the HTML file being served to a PhantomJS script named "render.js".

var http = require("http"),
    url = require("url"),
    path = require("path"),
    fs = require("fs"),
    spawn = require('child_process').spawn,
    port = (process.argv[2] || 8888),
    prerender = process.argv[3] || false;

// Map file extensions to the Content-Type sent to the client
var mimeTypes = {
  "htm": "text/html",
  "html": "text/html",
  "jpeg": "image/jpeg",
  "jpg": "image/jpeg",
  "png": "image/png",
  "gif": "image/gif",
  "js": "text/javascript",
  "css": "text/css"
};

// Optional mapping of a top-level URL segment to a directory outside the cwd
var virtualDirectories = {
  //"images": "../images/"
};

http.createServer(function(request, response) {

  var uri = url.parse(request.url).pathname,
      filename = path.join(process.cwd(), uri),
      root = uri.split("/")[1],
      virtualDirectory;

  virtualDirectory = virtualDirectories[root];
  if (virtualDirectory) {
    uri = uri.slice(root.length + 1, uri.length);
    filename = path.join(virtualDirectory, uri);
  }

  // path.exists was removed from Node; fs.exists is deprecated but still available
  fs.exists(filename, function(exists) {
    if (!exists) {
      response.writeHead(404, {"Content-Type": "text/plain"});
      response.write("404 Not Found\n");
      response.end();
      console.error('404: ' + filename);
      return;
    }

    if (fs.statSync(filename).isDirectory()) filename += '/index.html';

    fs.readFile(filename, "binary", function(err, file) {
      if (err) {
        response.writeHead(500, {"Content-Type": "text/plain"});
        response.write(err + "\n");
        response.end();
        console.error('500: ' + filename);
        return;
      }

      // Fall back to a generic type for extensions missing from mimeTypes
      var mimeType = mimeTypes[path.extname(filename).split(".")[1]] || "application/octet-stream";
      response.writeHead(200, {"Content-Type": mimeType});

      if (prerender && mimeType === mimeTypes.html) {
        // Run the page through PhantomJS and stream the rendered HTML to the client
        var phantom = spawn('phantomjs', ['render.js', filename]);

        phantom.stdout.on('data', function (data) {
          // Output may arrive in several chunks, so do not end the response here
          response.write(data, "utf8");
        });
        phantom.stderr.on('data', function (data) {
          console.log('stderr: ' + data);
        });

        // 'close' fires once PhantomJS has exited and its output streams are closed
        phantom.on('close', function (code) {
          response.end();
          console.log('200: ' + filename + ' as ' + mimeType);
          console.log('child process exited with code ' + code);
        });
      } else {
        response.write(file, "binary");
        response.end();
        console.log('200: ' + filename + ' as ' + mimeType);
      }
    });
  });
}).listen(parseInt(port, 10));

console.log("Running on http://localhost:" + port + " with pre-render " + (prerender ? 'enabled' : 'disabled') + " \nCTRL + C to shutdown");
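
The pre-render branch of the server forwards whatever PhantomJS prints to stdout. Output from a child process can arrive in several chunks, so one option is to buffer all of it and respond only after the process has closed. The following is a minimal sketch of that idea, not part of our implementation: `collectOutput` is a hypothetical helper name, and it assumes the child object exposes `stdout` and emits `close` the way the result of Node's `child_process.spawn` does.

```javascript
// Hypothetical helper: buffer everything a child process writes to stdout
// and hand it to the callback once the process has closed.
function collectOutput(child, callback) {
  var chunks = [];

  child.stdout.on('data', function (chunk) {
    // Chunks may be Buffers or strings depending on the stream's encoding
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(String(chunk), 'utf8'));
  });

  // 'close' fires after the process has ended and its stdio streams are closed
  child.on('close', function (code) {
    var output = Buffer.concat(chunks).toString('utf8');
    if (code !== 0) {
      // Treat any non-zero (or missing) exit code as a failed render
      callback(new Error('renderer exited with code ' + code), output);
      return;
    }
    callback(null, output);
  });
}
```

In a server like the one above, this could wrap the result of `spawn('phantomjs', ['render.js', filename])`, which would let the server wait for the complete page and answer with a 500 when the callback receives an error.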

The "render.js" Phantom script processes the page and outputs rendered HTML.

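var system = require('system'),
    fs = require('fs'),
    page = require('webpage').create(),
    url = system.args[1],      // page to render
    output = system.args[2],   // optional file to write the result to
    result;

page.open(url, function (status) {
  if (status !== 'success') {
    console.log('FAILED to load the url');
    // Exit with a non-zero code so the caller can detect the failure
    phantom.exit(1);
    return;
  }

  // Serialize the DOM after the page's initial scripts have run
  result = page.evaluate(function () {
    return document.documentElement.outerHTML;
  });

  if (output) {
    // Write the rendered HTML to the requested file
    var rendered = fs.open(output, 'w');
    rendered.write(result);
    rendered.flush();
    rendered.close();
  } else {
    // No output file given: print the rendered HTML to stdout
    console.log(result);
  }

  phantom.exit();
});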

The output of this PhantomJS script is then served to the client with all of the initializing JavaScript already executed.

As mentioned at the beginning of this post, this example is trivialized. Within our team, we've built functional prototypes that pre-render thousands of lines of non-trivial, enterprise-grade JavaScript. In order to accommodate PhantomJS, we created an infrastructure that differs from the example in this post. Our HTML5 pages are served by nginx. Between nginx and the client is a Varnish cache layer, though we could have used the built-in nginx caching. If the cache layer has the requested page, then the response is the cached, pre-rendered page. If the cache does not have the page, the cache layer passes the request down to nginx, which in turn fires PhantomJS through a Java-based origin app. The results from PhantomJS are sent to the client via the cache, where they remain until the cache entry expires.
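
The cache-then-render flow described above can be sketched in a few lines. This is an in-memory illustration only, not Varnish or nginx configuration: `createRenderCache`, `render`, `ttlMs`, and `now` are hypothetical names, and `render` stands in for whatever invokes PhantomJS.

```javascript
// Hypothetical in-memory stand-in for the cache layer.
// render(url) stands in for the PhantomJS call; now() returns the time in ms.
function createRenderCache(render, ttlMs, now) {
  var entries = Object.create(null);
  now = now || Date.now;

  return function get(url) {
    var entry = entries[url];
    var time = now();

    // Cache hit: the page was rendered earlier and has not expired yet
    if (entry && time < entry.expires) {
      return entry.html;
    }

    // Cache miss or expired entry: render again and store the result
    var html = render(url);
    entries[url] = { html: html, expires: time + ttlMs };
    return html;
  };
}
```

The clock is passed in so that expiry can be exercised without waiting; a real cache layer would also need to bound memory use and handle concurrent requests for the same page.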

Here is a diagram of our setup.

Deciding how to fire the PhantomJS process should be based on your infrastructure. There is no reason the process must be triggered by a client request. For instance, PhantomJS could run on a scheduler, or could be hooked into some adjacent content monitor. In a production environment, one could even build an nginx or Apache module that functions as a pre-processor. The possibilities are wide open. We really believe we are on to something interesting, and hope the community gets involved in improving our work.
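
As an illustration of decoupling rendering from client requests, a scheduler or content monitor could hand a list of pages to a function like the one below, which only works out the `phantomjs` invocations. It is a sketch under assumptions: `buildRenderJobs`, the output directory, and the page paths are hypothetical. The only thing taken from this post is that `render.js` accepts a source path and an optional output path.

```javascript
var path = require('path');

// Hypothetical helper: describe one PhantomJS run per page.
// pages are relative paths; outputDir is where rendered copies would go.
function buildRenderJobs(pages, outputDir) {
  return pages.map(function (page) {
    return {
      command: 'phantomjs',
      // render.js reads its first argument as the source
      // and its second as the file to write the rendered HTML to
      args: ['render.js', page, path.join(outputDir, page)]
    };
  });
}
```

Each job could then be passed to `spawn(job.command, job.args)` from a timer or a file-change hook. The sketch assumes the output directories already exist.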

javascript, html5, seo, phantomjs, nodejs

Published on September 30, 2012 by James Wanga.