Replacing CouchDB views with ElasticSearch 20 May 2012
I edited this post to provide more context, so that the references to the project internals actually make sense to those who didn’t work on it.
This internal post eventually led to a public blog post, but this is the journal of my concrete experiences with ElasticSearch.
This devlog follows directly on from my post about performance improvements.
The gist of the previous post was that I was having trouble fixing some bugs that occurred during a bulk import, because the process ran at a glacial pace, and that pace was directly attributable to needing to generate incredibly large CouchDB views.
What started as a straightforward task forced me to try some very interesting approaches, eventually turning into a completely experimental branch of the project that I have been fiddling with in my free time over the last weekend.
We have ElasticSearch available, let’s use it.
Since I last wrote about ElasticSearch, we have implemented it for the search functionality on the background and analysis pages. Knowing the kind of indexes we were building this incredibly slow view for in CouchDB, I thought it might be faster, and perhaps more straightforward, to simply index the data with ElasticSearch as well, allowing us to make use of its extensive query capabilities.
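As a rough sketch of what "making use of its query capabilities" looks like in practice: the kind of per-district lookup a CouchDB view would serve can be expressed as an ElasticSearch query body. The field names here (`dataset_id`, `district_id`, `year`) are my own illustrative assumptions, not the project's actual schema.

```python
# Hypothetical sketch: build an ElasticSearch query body that fetches all
# indicator values for one district within one dataset, newest year first.
# Field names are illustrative assumptions, not the project's real mapping.
def comparison_query(dataset_id, district_id, size=100):
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"dataset_id": dataset_id}},
                    {"term": {"district_id": district_id}},
                ]
            }
        },
        "sort": [{"year": {"order": "desc"}}],
        "size": size,
    }
```

Because the query is just a filter over indexed fields, ElasticSearch answers it directly from its index instead of forcing a full view rebuild the way a changed CouchDB view does.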
One concession I did make in my experiment was to save the ‘materialized’ latest values and their respective years inside the object in CouchDB. This not only made the queries and indexing simpler, but also made a whole bunch of the code around comparisons and display cleaner. It would probably have simplified things for the maps too, so I am of the opinion we should have done this a long time ago regardless.
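To illustrate the ‘materialized’ latest-value idea, here is a minimal sketch: given an indicator's sparse mapping of year to value, store the most recent value and its year directly on the document before saving it. The field names (`indicators`, `latest`) are my own placeholders, not the project's actual schema.

```python
# Sketch of materializing latest values, assuming a document shape like
# {"indicators": {name: {year: value, ...}, ...}}. For each indicator we
# embed the newest value and its year under a "latest" key, so readers of
# the document never have to scan the sparse per-year data themselves.
def materialize_latest(doc):
    for name, values_by_year in doc.get("indicators", {}).items():
        if not values_by_year:
            continue  # indicator has no recorded years; nothing to materialize
        latest_year = max(values_by_year)
        doc.setdefault("latest", {})[name] = {
            "year": latest_year,
            "value": values_by_year[latest_year],
        }
    return doc
```

The trade-off is the usual one for denormalization: the latest values must be kept in sync on every write, in exchange for much simpler reads.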
About the data
The data for each of the school districts is split into around 65 indicators, which are then split (sparsely) into up to 10 values for each recorded year. The most complex view we have is used to compare each of the school districts to each other, based on these indicators. We end up with 57059 entries in the view for every dataset that is uploaded, and there are multiple datasets in the system at any one time.
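To make the multiplicative blow-up concrete, here is a toy sketch of how a comparison view of this shape generates entries: one per (district, indicator, year) combination that actually has a value, keyed so that all districts' values for the same indicator and year collate together. This is an illustration of the scaling, not the project's actual map function.

```python
# Illustrative sketch: emit one view entry per (district, indicator, year)
# that has a recorded value. Entry counts therefore grow with
# districts x indicators x years, which is how a dataset ends up with
# tens of thousands of view entries.
def view_entries(districts):
    """districts: {district_id: {indicator: {year: value}}}"""
    entries = []
    for district_id, indicators in districts.items():
        for indicator, values_by_year in indicators.items():
            for year, value in values_by_year.items():
                # Key by (indicator, year) so districts collate for comparison.
                entries.append(((indicator, year), district_id, value))
    return entries
```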
Having this amount of data in CouchDB and the views is not a problem in itself, but being in a situation where there are user-initiated batch imports into the system paints a very different performance picture than the traditional user-contributed content workflow. What was killing us was having to import > 16k records all at once, not simply having > 16k records in the database.