Updated Feb 11, 2008 by nuggetwheat
Why We Chose C++ Over Java

This document clarifies our position on C++ vs. Java as the choice of implementation language. There are two fundamental reasons why C++ is superior to Java for this particular application.

  1. Hypertable is memory (malloc) intensive. Hypertable caches all updates in an in-memory data structure (e.g. an STL map). Periodically, these in-memory data structures get spilled to disk. These spilled disk files get merged together to form larger files when their number reaches a certain threshold. The performance of the system is, in large part, dictated by how much memory it has available to it. Less memory means more spilling and merging, which increases load on the network and the underlying DFS. It also increases the CPU work required of the system, in the form of extra heap-merge operations. Java is a poor choice for memory-hungry applications. In particular, in managing a large in-memory map of key/value pairs, Java's memory performance is poor in comparison with C++. It's on the order of two to three times worse (if you don't believe me, try it).
  2. Hypertable is CPU intensive. There are several places where Hypertable is CPU intensive. The first place is the in-memory maps of key/value pairs. Traversing and managing those maps can consume a lot of CPU. Plus, given Java's inefficient use of memory with regard to these maps, the processor caches become much less effective. A recent run of the tool Calibrator (http://monetdb.cwi.nl/Calibrator/) on one of our 2GHz Opterons yields the following statistics:

     caches:
     level  size    linesize   miss-latency        replace-time
       1     64 KB   64 bytes    6.06 ns =  12 cy    5.60 ns =  11 cy
       2    768 KB  128 bytes   74.26 ns = 149 cy   75.90 ns = 152 cy

     You can pack a fair amount of work into 150 clock cycles. Another place where Hypertable is CPU intensive is compression. All of the data that is inserted into a Hypertable gets compressed at least twice, and on average three times. Once when writing data to the commit log, once during minor compaction and then once for every merging or major compaction. And the amount of decompression that happens can be considerably more depending on the amount of query workload the table sees. It's arguable that the native Java implementation of zlib is comparable to the C implementation, but as soon as you start experimenting with different compression techniques (e.g. Bentley-McIlroy long common strings), you need to either implement them in Java which yields unacceptable performance (if you don't believe me, try it), or implement them in C/C++ and use JNI. With this second option, all of the benefits of Java get thrown out the window and there is significant overhead in invoking a method via JNI.
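
The spill-and-merge cycle described in point 1 can be sketched roughly as follows. This is a minimal illustration, not Hypertable's actual code: CellCache, Run, heap_merge, the entry-count spill threshold, and the "newest run wins" rule for duplicate keys are all assumptions made for the example (a real system would track bytes and write runs to the DFS rather than keep them in memory).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// One sorted run of key/value pairs; stands in for a spilled disk file.
using Run = std::vector<std::pair<std::string, std::string>>;

// Hypothetical in-memory cache that spills to a sorted run once it holds
// max_entries keys.
class CellCache {
public:
    explicit CellCache(std::size_t max_entries) : max_entries_(max_entries) {
        assert(max_entries > 0);
    }

    void insert(const std::string& key, const std::string& value) {
        map_[key] = value;
        if (map_.size() >= max_entries_) spill();
    }

    // Copy the (already sorted) map contents into a new run, then clear.
    void spill() {
        if (map_.empty()) return;
        runs_.emplace_back(map_.begin(), map_.end());
        map_.clear();
    }

    const std::vector<Run>& runs() const { return runs_; }

private:
    std::size_t max_entries_;
    std::map<std::string, std::string> map_;
    std::vector<Run> runs_;
};

// K-way heap merge of sorted runs. When a key appears in several runs, the
// value from the newest run (highest index) is kept (an assumed policy).
Run heap_merge(const std::vector<Run>& runs) {
    struct Cursor { std::size_t run; std::size_t pos; };
    auto later = [&runs](const Cursor& a, const Cursor& b) -> bool {
        const std::string& ka = runs[a.run][a.pos].first;
        const std::string& kb = runs[b.run][b.pos].first;
        if (ka != kb) return ka > kb;  // smallest key on top of the heap
        return a.run < b.run;          // equal keys: newest run on top
    };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(later)> heap(later);
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push(Cursor{r, 0});

    Run merged;
    while (!heap.empty()) {
        Cursor c = heap.top();
        heap.pop();
        const auto& entry = runs[c.run][c.pos];
        // Skip older duplicates of the key that was just emitted.
        if (merged.empty() || merged.back().first != entry.first)
            merged.push_back(entry);
        if (c.pos + 1 < runs[c.run].size()) heap.push(Cursor{c.run, c.pos + 1});
    }
    return merged;
}
```

In this sketch, a smaller threshold produces more runs for the same data, and every merged entry then costs a heap operation over a larger heap, which is the kind of extra CPU work referred to above.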

What about the Hadoop DFS and Map-reduce framework?

Given that the bulk of the work performed by the Hadoop DFS and Map-reduce framework is I/O, Java is probably an acceptable language for those applications. There are some places where Java is sub-optimal. In particular, at scale, there will be considerable memory pressure in the Namenode of the DFS. Java is a poor choice for this type of memory hungry application. Another place where the use of Java is sub-optimal is the post-map sorting in preparation for the reduce phase. This is CPU-intensive and involves the type of CPU work that Java is not good at.


Comment by milesosb, Feb 12, 2008

if only Hadoop was in C++ too, things would be so much simpler.

Comment by nc.good.guy, Apr 04, 2008

Would love to see the Oracle/Tangosol Coherence team discuss this as they pretty much built their product as a large, distributed hashmap.

Comment by eddy73, Apr 04, 2008

Interesting but somewhat debatable, though if you have the time and money an app written in C/C++ will outperform Java, but it takes a lot of effort these days to do that. Hadoop not being written in C++ is probably quite a good thing, as new functionality tends to be easier to implement and test in Java.

Some interesting comparative articles.

http://www.ddj.com/cpp/184401976?pgno=1

http://www.idiom.com/~zilla/Computer/javaCbenchmark.html

Comment by vicaya, Apr 04, 2008

These microbenchmarks practically remove Java memory management overhead completely, because of the nature of GC, so they do much less work, i.e. they are not comparing apples to apples. A fairer comparison is to run the C++ code to find out peak memory usage, set the Java max VM heap setting to that or even 2x that, and watch it suffer. In the real world, such a limit is determined by the amount of physical RAM on the machine.

Read this peer reviewed paper (vs the simplistic microbenchmarks) http://www.cs.umass.edu/%7Eemery/pubs/gcvsmalloc.pdf for more details.

Comment by j...@ieee.org, Apr 04, 2008

Given that Yahoo has a dedicated team for Hadoop and is committed to using Hadoop for its own infrastructure, does anyone know if Yahoo is equally committed to using HBase too?

Comment by james.time4tea, Apr 05, 2008

shame shame. you can convince yourself that c++ is "much more efficient" if it saves 150 clock cycles every so often. shame your programmer efficiency will fall through the floor.

shame you'll lose all access to decent refactoring tools.

shame that network latency will eat those 150 cycles for breakfast.

shame that a very interesting project is being sidelined at the start by marginalising itself. (c++, git(!)).

even the investment banks don't write their models in c++ any more as it's so hard, and not even any quicker. micro optimizations are easy, but macro ones much harder.

otherwise a really interesting project!!!

Comment by vicaya, Apr 05, 2008

@james, the "hardness" of C++ is a myth that we don't share. Both Doug and I have done enough programming in both languages to say that productivity in the two languages is pretty much the same in steady state mode. Java has a higher initial environment setup cost and a later maintenance cost when you need to combat performance problems. The lack of quality peer reviewed libraries like Boost in Java is also a factor.

There is no need to be emo about the technical choices. Feel free to use and contribute to Hadoop and HBase, it's an open source project as well.

Long live open source and choices.

Comment by kunthar, Aug 13, 2008

Thanks to all thinker geeks of this universe.

You keep on movin... Deep Purple Song for you guys

Comment by cameron.purdy, May 16, 2009

> Would love to see the Oracle/Tangosol Coherence team discuss this as they pretty much built their product as a large, distributed hashmap.

Language choice tends to be very personal, and usually defaults to what a programmer knows best, or is most comfortable with.

Despite working in C/C++ for years before Java, and continuing to work in C++ (no C anymore) today -- alongside C# and Java -- I would still default to Java for most things.

I don't know if the rationale given here makes sense, but in general, for a tightly focused product and a small development team that is comfortable working in C++, then C++ can certainly have a performance edge over Java (for most use cases).

Java is significantly better at "global optimizations" from a development perspective, which is to say that within a process (e.g. JVM), the boundary cost of Java (libraries, APIs, down to the interface level) is far lower than the cost of C++ as a project grows. (It's hard to explain if you haven't worked with large scale C++ projects that pull together multiple libraries and work from different groups.) The side-effect of this is often that the cost of development in Java does not rise as fast in correlation to the complexity of the product, and that the fn(quality,performance) does not drop as fast. However, while I would testify to this under oath, you will find the opposite opinion to be just as strongly believed by another programmer.

Peace,

Cameron Purdy | Oracle Coherence

Comment by vicaya, May 18, 2009

BTW, there is a C++ implementation of dfs/mapreduce called Sector/Sphere from UIC by Dr. Yunhong Gu et al. The terasort benchmarks showed that Sector/Sphere is about 3-5x faster than Hadoop for sorting 1TB.

Comment by tdhutt, May 27, 2009

Surely the development time arguments only apply if no-one really uses your software. As long as enough people use your software then even a marginal performance benefit is worth significant programming effort.

You wouldn't like it if Microsoft wrote Word in C# because it was slightly easier would you?

Besides, using modern libraries like Boost and Qt makes C++ development just as easy as Java/C#. And less annoying, cough StringBuilder? cough.

Comment by runtime.yha, Jun 08, 2009

tdhutt mentioned the two critical points that the typical C++ vs Java argument does not mention: boost and Qt. Both are mature and, to a degree, complementary libraries (although I am a 'boost' user) that extend the C++ core language with reference counting, synchronization, compile time algorithms, sophisticated type safety and typecasting, multi-paradigm programming (such as signals/slots), networking services (such as boost::asio), and serialization.

I just wish that a CORBA implementation such as omniORB would be made into a better synchronous and asynchronous communication framework -- so that most distributed systems would use it as a messaging system instead of constantly inventing their own.

I personally think that in terms of the 'core' language, C++ loses to Java in 'Reflection' (I think the C++ standard shared_ptr is now a perfectly capable reference counting garbage collector without losing the performance of C++, and therefore the argument about 'pointers' is not a valid one).

And in terms of 'library' features -- a web server framework and database access are the only two things that are missing from boost (compared to Java's library features, for example).

Of course, the advantages of C++ have been stated above. But in terms of productivity -- C++ with boost and Qt offers reference counted memory alloc, compile time code generator, compile time error checking, native performance on pretty much all of the well used CPU/OS combinations with single-source tree -- I believe, are far more important than Java's claim-to-fame garbage collection.

Comment by cameron.purdy, Jun 08, 2009

Hi tdhutt, runtime.yha,

We've done recent tests with C++ using Boost versus Java, and Sun's current Hotspot Java implementation is about 3x faster than Boost's smart pointer implementation; basically, Java is able to run code at C pointer speed, while smart pointers always have measurable overhead.

In other words, C++ with Boost may be easier than C++, but you have to choose between 3x slower than Java or a very light use of smart pointers. (C++ shared_ptr will be a similar cost.)

Also, regarding productivity of development, according to our developers (who program in all of Java, C# and C++, including C++ on Windows/x86, Linux/x86, Mac OSX/x86, Solaris/Sparc, and Solaris/x86), Java wins hands down over both C++ due to the ease (and instancy) of iterative development and the advanced tooling. Our fastest C++ build takes an hour (full builds take from 4 hours to 20 hours depending on the platform), while our Java full builds take less than 10 minutes. The fact that Java/C# has class level compilation granularity and no linking step obviously helps significantly reduce iteration time.

Further, the complexity of achieving both thread safety and high concurrency with C++ and Boost is significantly more difficult than it is with Java or C#.

However, like I said before, "while I would testify to this under oath, you will find the opposite opinion to be just as strongly believed by another programmer."

Peace,

Cameron Purdy | Oracle Coherence

Comment by alan.wit...@teleatlas.com, Jul 02, 2009

Admittedly, this becomes a religious issue and there is not necessarily a "right or wrong". But I want to note that C++ execution is quite deterministic, where Java's, particularly in light of runtime garbage collection, is not. In fact, with more and more object management being automated in the STL I honestly can't even recall the last time I had any kind of leak or corruption issue - which usually is the complaint folks bring against C++. Furthermore, RAII (resource acquisition is initialization) is pretty meaningless in Java with garbage collection, whereas in C++ with scope-defined (automatic) variables, it becomes a powerful way to drop any resources that go out of scope. In our application we are able to finely control our memory and file handles simply by ensuring things go out of scope (which is often as easy as ensuring our methods are fine-grained enough - good programming practice to begin with). It seems most Java code I've seen lately has post-initialize and pre-destruction methods which must be called, in order to get around the non-determinism of garbage collection. Which means, of course, that now as an undesired side-effect of something which is supposed to make the system more bug-free, the programmer actually ends up explicitly managing resources instead! It seems to me as if the quality pendulum is swinging back - with the many open source libraries, STL, and the ability to rest on the strength of C++ automatics, C++ is actually becoming the more effective and bug-free language for heavy resource manipulation!
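
The scope-based cleanup described in this comment can be sketched as below. Handle and process are hypothetical names, and the static counter exists only so that acquisition and release can be observed; it is not how a real file-handle wrapper would be written, and it is not taken from the commenter's application.

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for a scarce resource such as a file handle. The resource is
// acquired in the constructor and released in the destructor.
class Handle {
public:
    Handle() { ++live_; }
    ~Handle() { assert(live_ > 0); --live_; }
    Handle(const Handle&) = delete;
    Handle& operator=(const Handle&) = delete;

    // Number of handles currently alive (for observation only).
    static int live() { return live_; }

private:
    static int live_;
};
int Handle::live_ = 0;

// Acquires two handles as automatic variables. Both are released when the
// function exits, whether by a normal return or by a caught exception.
int process(bool fail) {
    Handle a;
    Handle b;
    if (fail) throw std::runtime_error("simulated failure");
    return Handle::live();  // evaluated while a and b are still in scope
}
```

No explicit cleanup call appears anywhere in process; keeping the function small is what bounds the lifetime of the two handles.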

Comment by cameron.purdy, Jul 06, 2009

Alan -

I do agree with several of your points. Hard real-time determinism in execution is just not possible today in Java. Years ago, I remember the same complaint being leveled at C++, but generally speaking you can accomplish real-time behavior in C++ now (it's amazing what a few orders of magnitude increase in CPU speed can do for a language ;-). The obvious pitfall is still primarily I/O, and secondarily (to a far lesser extent) memory management -- particularly in multi-threaded environments.

Regarding RAII, its wide adoption as a best practice in C++ is certainly a huge improvement, but as you note it is entirely incomparable with automated garbage collection. Personally, I hope to see a union of the concepts in the future ..

I don't agree with your conclusion, but like you said: "this becomes a religious issue and there is not necessarily a 'right or wrong'". I personally hope there's a continued train of improvements that move us well beyond the current "state of the art" in programming without any unnecessary sacrifice of performance or determinism.

Peace,

Cameron Purdy | Oracle Coherence

Comment by aviad.in...@gmail.com, Aug 07, 2009

cameron,

Speaking of being deterministic, I would check the Real-Time Specification for Java, and maybe Sun's implementation (http://java.sun.com/javase/technologies/realtime/index.jsp). I've no experience with it myself, but I've seen some great stuff people have implemented using it...

Aviad.

Comment by EdLinuxGuru, Sep 23, 2009

To demonstrate the misconceptions of the world: I have had people who have never written a java application before tell me 'java is slow' :). Before we started using hadoop I made a non map/reduce wordCount application in java and in perl. Guess which one finished faster? Java. And then someone said, "I bet I can write it in C and it would rock".... I am still waiting for that c wordcount program from that person.

