One problem I have when trying new technologies is actually seeing them in realistic situations – which translates to having a real use case for them. Think about this: when you build a “hello world” application you actually couldn’t be further away from the real world! I guess that’s why nobody calls it “hello real world” anyway… The whole experience you get from the proof of concept is just random installation trivia if you’re not using it to prove a point. That’s why it’s called “proof of concept” – it should have a concept, silly.
I guess this idea deserves at least its own blog post, but I don't consider myself enough of a writer to properly put it in words… so whether you agree or not, I'll move on to the technicalities.
There's this one web application I have seen in three different incarnations over many years. For a while I was involved in different parts of it as a developer, and for a much longer while I experienced it as a user/implementer. I guess that's how I got to know quite a bit about its "whys and hows" – enough to use its concepts as a workbench for new ideas and new technologies.
At the heart of this application is a sizeable SQL database storing data sets, and the problem with them is that they are slightly different. To enable complex searches involving Item1 and Item2 – where the two item types differ in 20 fields but share 30 – one big table with 70 columns was needed to accommodate them all: 70 = 30 common fields + 20 specific to Item1 + 20 specific to Item2. That, or crazy joins between tables. What was the way to grow when Item3 came along? Add more columns, of course… it works, performance is OK for the amount of data, and no major changes were ever undertaken to alter the status quo.
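A rough sketch of that wide table, just to make the trade-off concrete (all column names here are made up for illustration):

```sql
-- Hypothetical wide table: ~30 shared columns plus ~20 per item type;
-- the columns of the "other" type are simply left NULL on each row.
CREATE TABLE items (
    id            INT PRIMARY KEY,
    item_type     VARCHAR(10) NOT NULL,  -- 'ITEM1', 'ITEM2', later 'ITEM3'...
    -- ~30 fields common to all item types
    name          VARCHAR(100),
    created_at    TIMESTAMP,
    -- ...28 more common columns...
    -- ~20 fields specific to Item1
    item1_field_a VARCHAR(50),
    -- ...19 more Item1-specific columns...
    -- ~20 fields specific to Item2
    item2_field_a VARCHAR(50)
    -- ...19 more Item2-specific columns...
);

-- A cross-type search then stays within one table, no joins needed:
SELECT id, name
FROM items
WHERE created_at > '2014-01-01'
  AND (item1_field_a = 'x' OR item2_field_a = 'x');
```

Ugly, but every search hits a single table – which is exactly why it survived three incarnations.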
But can't we do better nowadays? Maybe I could have my cake and eat it too, without a schema – or at least with less schema in it – just like Cassandra promises. DataStax offers a handy Cassandra distribution which would be a breeze to integrate into the test application I built before, wouldn't it? And indeed, after a few trivial module and vert.x upgrades, switching the existing MongoDB persistor module to Cassandra really was a breeze.
It's just… maybe Cassandra is not exactly the right choice for me. Reading around, fiddling, and trying to model some trivial data structures, it dawned on me that you can only search on indexed columns. And I need to search on maaaanyyyy fields… Declaring everything part of the primary key, or creating secondary indexes in code every time I need to search on another field, doesn't look very performant to me. And what if the fields have high cardinality, making those secondary indexes dead slow, as the Cassandra documentation warns? Shortly put: you can't query arbitrary columns.
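To illustrate the limitation – a minimal CQL sketch with hypothetical table and column names (the exact error text varies by Cassandra version):

```sql
-- Hypothetical CQL table
CREATE TABLE items (
    id     uuid PRIMARY KEY,
    name   text,
    status text
);

-- Filtering on a column that is neither part of the primary key nor
-- indexed is rejected; Cassandra asks for ALLOW FILTERING, which
-- amounts to a full scan:
--   SELECT * FROM items WHERE name = 'foo';

-- A secondary index makes the column queryable...
CREATE INDEX items_name_idx ON items (name);
SELECT * FROM items WHERE name = 'foo';
-- ...but on high-cardinality columns such an index performs poorly,
-- which is exactly my case with many free-form searchable fields.
```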
Well, I should have known better: if I need very flexible searches, I should use a tool built for search – Lucene, with Elasticsearch or Solr on top! An hour or so of pseudo-random googling (or duckduckgoing, to be more precise) decided the comparison in favor of Elasticsearch. I cannot say whether my decision was the best, but I don't think it's completely off either, so let's start reading about ES.
Background noise: installing ES was again a no-brainer, except for a minor quirk tied to the echo command on Windows. The hello world example worked out of the box (I have learned to appreciate this), then I started fiddling in Marvel and working through tutorials…
The main takeaway for my case is that I could base the whole web application on searches: showing a list is a search, running a custom report is a search, showing one item is a search. KISS! And (later) I could get notifications on new incoming items using ES percolation – how cool is that! Given the known size of the handled data, I won't need clustering any time soon, all the better because some people are quite disappointed by ES's clustering – the Jepsen analysis by Aphyr is a must-read here. What I will certainly need is a reliable snapshot/restore mechanism of some sort, because my items require an audit trail which would be very, very unpleasant to lose. And before you suggest it: separating the audit trail into some other (maybe write-only) database won't help, as I need to integrate historical data into searches and reporting as well.
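To make the "everything is a search" idea concrete: the item list, a custom report, and the detail view would all just be different query bodies sent to the `_search` endpoint of a hypothetical items index (field names invented for illustration):

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "status": "open" } },
        { "range": { "created_at": { "gte": "2014-01-01" } } }
      ]
    }
  },
  "sort": [ { "created_at": { "order": "desc" } } ],
  "size": 20
}
```

The same bool query with different clauses covers the detail view (a lookup on the id) and the custom reports (aggregations on top of the filtered result).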
Now it's time to work on my test application… in the next post.