We implemented the gradient computation in both C++ and Scala on Spark. The figure shows the results of this operation over 64MB of input data: C++ is almost 51 times faster than Scala. Surprised by these numbers, we set out to determine the cause of the slowdown: we decompiled the JVM bytecode that Scala generated into Java, rewrote this code to remove its overheads step by step, and recompiled
it. The poor performance has three major causes. First, since
Scala's generic methods cannot use primitive types (e.g., they must use the \texttt{Double} class rather than a \texttt{double}), every generic method call allocates a new object for the value, boxes the value in it, un-boxes it for the operation, and deallocates the object. In addition to the cost of a \texttt{malloc} and \texttt{free}, this creates millions of tiny objects for the garbage collector to process.
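To make this first overhead concrete, the following is a minimal sketch in Java (the form our decompiled bytecode takes); the class and method names are ours and do not come from the benchmark code. At the JVM level a generic type parameter erases to \texttt{Object}, so every call through the generic path boxes the primitive.

```java
// Hypothetical sketch (names are ours): what an erased generic method
// looks like after decompilation, versus a monomorphic primitive version.
public class BoxingSketch {
    // Generic: T erases to Object, so a double argument must be boxed
    // into a java.lang.Double on every call.
    static <T> T identityGeneric(T x) { return x; }

    // Monomorphic: the value stays an unboxed primitive double.
    static double identityPrimitive(double x) { return x; }

    public static void main(String[] args) {
        double sum = 0.0;
        for (int i = 0; i < 1000; i++) {
            // Each call auto-boxes i into a Double and un-boxes the
            // result, leaving a short-lived object for the collector.
            sum += identityGeneric((double) i);
        }
        System.out.println(sum); // prints 499500.0
    }
}
```

Run in a tight loop, the generic path turns every arithmetic step into an allocation, which is exactly the garbage-collector pressure described above.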
Boxing and un-boxing account for 85\% of logistic regression's CPU cycles. Second, Spark's resilient distributed datasets (RDDs) force methods to allocate new arrays, write into them,
and discard the source array. For example, a \texttt{map} method that increments a field in a dataset cannot perform the increment in-place and must instead create a whole new dataset.
This data duplication contributes a further factor of ~2x slowdown.
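The copy-on-transform pattern can be sketched as follows; this is an illustration with plain Java arrays standing in for a dataset partition (Spark itself is not used, and the names are ours), contrasted with the in-place update a C++ implementation would perform.

```java
// Hypothetical sketch (names are ours): RDD-style copy-on-transform
// versus an in-place update over the same data.
public class CopySketch {
    // RDD-style map: must allocate a fresh array, write into it, and
    // leave the source array for the garbage collector.
    static double[] mapIncrement(double[] src) {
        double[] dst = new double[src.length]; // extra allocation + copy
        for (int i = 0; i < src.length; i++) dst[i] = src[i] + 1.0;
        return dst;
    }

    // In-place version: no allocation, no discarded source array.
    static void incrementInPlace(double[] data) {
        for (int i = 0; i < data.length; i++) data[i] += 1.0;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0};
        double[] copy = mapIncrement(data); // data itself is unchanged
        System.out.println(copy[0]);        // prints 2.0
        incrementInPlace(data);             // mutates data directly
        System.out.println(data[0]);        // prints 2.0
    }
}
```

The immutable version touches twice the memory (read the source, write the copy) for the same logical work, which is consistent with the ~2x factor measured above.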
Third, running on the Java Virtual Machine imposes a further ~3x slowdown relative to C++. This result is in line with prior studies, which have reported factors of 1.9x-3.7x for computationally dense code [Loop Recognition in C++/Java/Go/Scala; A Java vs. C++ performance evaluation: a 3D modeling benchmark]. Together, these three factors result in Spark code running 51 times slower than C++.