Sunday, May 19, 2013

Sorry, but no, even merge sort sucks in scala

This is a reply to an article which compares merge sort in Scala and Java. As anyone who knows their sorting algorithms, merge-sort is the poster boy for a functional sorting -- as its a super simple stable divide-and-conquer algorithm, that enjoys a great worst-case complexity. If there's any that scala would be good at, this is it.


So first looking at the Java code, its 40 lines of pretty ugly and verbose code, but it's well commented and not too bad to follow. Just looking at the implementation, it's not easy to see if bug-free (it is, I tested it) but its pretty see that its quite efficient. It however does a great job of showing what's wrong with Java, and why I hate it.


So now lets jump to the Scala code. First off, as someone with experience in both Scala and Java, the Scala code is *a lot* easier to read, and more obviously correct. The only problem, is this code is incredibly misleading.

Not using an array

This is quite subtle, but one of the big wins in his scala code is being able to pattern match. However, had he been directly comparable to the Java implementation, it would've also used Arrays which would require far uglier/complex pattern matching, namely case (Array(x, xs1 @ _*), Array(y, ys1 @ _*)) and what's worse, is the type of xs1 and ys1 are now Seq[Int] and not Array[Int] which precludes the recursive call...

It's uses O(n) stack space

Which is a complete non-starter. Even a small list with cause a stack overflow.

So lets write it tail-recursive

object MergeSort {
  def apply(xs: Seq[Int]): Seq[Int] = {
    val n = xs.length / 2
    if (n == 0) xs
    else {
      def merge(xs: Seq[Int], ys: Seq[Int], res: Seq[Int]) =
        (xs, ys) match {
          case (Seq(), ys) => res ++ ys
          case (xs, Seq()) => res ++ xs
          case (Seq(x, xs1 @ _*), Seq(y, ys1 @ _*)) =>
            if (x < y) merge(xs1, ys, res :+ x)
            else merge(xs, ys1, res :+ y)
        }
      val (left, right) = xs splitAt (n)
      merge(apply(left), apply(right), Seq())
    }
  }
}
The way this works, is by using the fact that Scala arguments are strict -- so instead of building up a big stack of cons's -- I can collapse it each time immediately. And to make it more comparable with Java's, it now uses Seq instead of a linked list. It's definitely not as elegant, but I still find it quite readable.

The problem

It now is honest and works, but it's slow as shit. My quick benchmarks with 100k Ints
Timing Java MergeSort with random ... ..took 19.97ms
Timing Java MergeSort with ordered ... ..took 4.72ms
Timing Java MergeSort with reversed ... ..took 5.21ms
Timing Spire Merge sort with random ... ..took 86.92ms
Timing Spire Merge sort with ordered ... ..took 4.4ms
Timing Spire Merge sort with reversed ... ..took 3.78ms
Timing Scala MergeSort random ... ..took 365.83s
Timing Scala MergeSort ordered ... ..took 91.8s
Timing Scala MergeSort reverse ... ..took 91.31s
AKA its so slow it is completely unusable. On linked lists, on arrays, on everything. And the variability of its slowness completely depends on the implementations of collections that are passed into it.

Conclusion

Scala is an awesome language compared to Java. But lets be honest with our praise, writing sort algorithms is a domain that Scala thoroughly sucks at. (But fortunately we can write imperative code, like Java, but neater)

If anyone's interested, all code / benchmarks are availabe at my playground

Monday, December 19, 2011

What I want from github.com

I like github, it's probably the primary reason I use git. Here's what I'd like about it to change:


Add a forum/mailing-list. It should be like a simple web forum, but also work as a mailing list via email



When you fork or branch a github repository, you lose both the wiki and issue tracker. For this reason, I think the wiki and issue tracker should be stored in a _directory_ in github. For bonus points, they should be very readable in a plain-text format.

The wiki is lame, especially when you're using git. Instead of a wiki, it should html pages. You'd still keep the WYSIWYG interface (like blogger), but it should make a commit to the actual repository (ala google code) and the underlying stuff should be plain-old html (probably only allow a subset of it). Html has a bad rep, but it's completely undeserved (maybe a separate blog post on that) and has the huge advantage of I can actually view it locally. It's quite important (imho) to be on a per-branch basis.

Also the current website concept is very lame. And almost entirely because its in it's own (shared-nothing) branch (gh-pages), which makes it very difficult to locally look at the website (when I'm coding), as well to make it track my dev branches. This could be unified with the html-wiki somehow.

Monday, December 5, 2011

Git is Ghetto

My biggest complaints with git, is it's full of fanatics, fanboys and otherwise clueless people who pretend: 
   * They know how to use it. 
   * They're getting productivity boosts
   * It's awesome

I think most of these people actually believe what they say, and justify pretending to make themselves look smart/trendy.   I've known people who've started a conversation with how fantastic git is, and end up asking how branching works.

But for this blog, I'm going to focus on a more technical issue, merges. For those that don't know, merges are the bread and butter of a distributed version control system like git. Yet it's unjustifiably atrocious, even compared to Bazaar.

Edit: I was under the wrong impression that Mercurial was any better, but it appears not.

Here's my example to illustrate the ghettoness of git. Imagine I'm managing a simple software project. In the morning I wake up, and there's been two new (unrelated) fixes. Perhaps they need some some mild formatting changes, but otherwise they're both pretty good.


Here's what our situation looks like:
Ok. So, obviously I want to add Daniela's and Bob's changes to my official line. The traditional git workflow would be to merge them in:

 

 Ewwwwwwww! Our history is a mess. Any one who looks at this, has little idea what's going on. If you look at this, you have no ability to go "back" in the *official* history. Like, you have no way of knowing if 12cca34 or 12ce39 was ever at the HEAD of the repository at any point in time. Not very useful for HISTORY. This might not look so bad now, but let's see how terribly it scales. Here's a (very pretty) graphical representation of what this looks like in a more complex project, managed by none-other than the original author of git):



No one sane can say this is acceptable. Look at the time stamps to get even further confused. The history almost useless for everything other than forensic investigation.  Clearly merging with a way that loses a linearized branch state is unacceptable. Fortunately git is quite flexible, and we can work around the issue. Using rebase (or something like it) we can do this:

Ok. This is hot. Super nice history, and how I recommend people try keep their repositories, but it's not without drawbacks:

We've changed the sha's. The first changed, (maybe because we fixed some white space). And the second one changed because it's now based on a new parent commit.

This is (slightly) problematic for anyone with the "old" sha's, and needs to be manually resolved (normally via a hard reset). Furthermore, if I need to make changes to the commit (to either resolve conflicts, or to fix trivial issues) it will look as if the initial commiter did that as part of their commit. If during this process I introduce a bug, this is obviously very not nice. And worst, if it's not noticed for a while -- there's nothing that says "Here is the initial commit" and "Here is how I included it in". Knowing those two parts is very important for fixing a bad merge.


Also, this whole approach works rather poorly with commits that have been cryptographically signed (because when we rebase, we're actually changing the commit in one shape or another). Also, while I can deal with it -- I really don't like the idea of someone making changes to my commits (without me having reviewed it first) and attributing it to me  [Which reminds me of a ghetto idea called "Google Wave"]


So both have pretty steep draw backs, but here's what makes it the worst. Conceptually there's a very simple solution. Merges that preserve a linear history of the state of each branch!

e.g.






And this is how bazaar does it. Graphically this can be awesomely represented (a linear list of changes to a branch, and for any change you can drill-down and see how it was developed in a completely lossless fashion).


This leads me to my conclusion: git is ghetto, and anyone who tells you otherwise is either lying or linus (or both).