Why We Chose C++ Over Java
This document clarifies our position on the choice of C++ over Java as the implementation language. There are two fundamental reasons why C++ is superior to Java for this particular application.
caches:
level  size    linesize   miss-latency        replace-time
1      64 KB   64 bytes   6.06 ns = 12 cy     5.60 ns = 11 cy
2      768 KB  128 bytes  74.26 ns = 149 cy   75.90 ns = 152 cy
You can pack a fair amount of work into 150 clock cycles.
Another place where Hypertable is CPU intensive is compression. All of the data that is inserted into a Hypertable gets compressed at least twice, and on average three times. Once when writing data to the commit log, once during minor compaction and then once for every merging or major compaction. And the amount of decompression that happens can be considerably more depending on the amount of query workload the table sees. It's arguable that the native Java implementation of zlib is comparable to the C implementation, but as soon as you start experimenting with different compression techniques (e.g. Bentley-McIlroy long common strings), you need to either implement them in Java which yields unacceptable performance (if you don't believe me, try it), or implement them in C/C++ and use JNI. With this second option, all of the benefits of Java get thrown out the window and there is significant overhead in invoking a method via JNI.
What about Hadoop DFS and Map-reduce framework?
Given that the bulk of the work performed by the Hadoop DFS and Map-reduce framework is I/O, Java is probably an acceptable language for those applications. There are some places where Java is sub-optimal. In particular, at scale, there will be considerable memory pressure in the Namenode of the DFS. Java is a poor choice for this type of memory hungry application. Another place where the use of Java is sub-optimal is the post-map sorting in preparation for the reduce phase. This is CPU-intensive and involves the type of CPU work that Java is not good at.
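The cycle counts in the cache table above follow from the nanosecond latencies once a clock rate is fixed. The sketch below is only an illustration of that arithmetic: the 2.0 GHz clock is an assumption inferred from the table (12 cy / 6.06 ns is roughly 2 GHz), and `cycles_from_ns` and `kAssumedClockGHz` are hypothetical names, not part of Hypertable.

```cpp
#include <cmath>

// Assumption: a 2.0 GHz clock, inferred from the table rather than stated on the page.
constexpr double kAssumedClockGHz = 2.0;

// Converts a latency in nanoseconds to whole clock cycles.
// 1 GHz is one cycle per nanosecond, so cycles = ns * GHz, rounded to the nearest cycle.
long cycles_from_ns(double latency_ns, double clock_ghz = kAssumedClockGHz) {
    return std::lround(latency_ns * clock_ghz);
}
```

Under that assumption, the level 2 miss latency of 74.26 ns works out to the 149 cycles quoted in the table, which is the "150 clock cycles" the text refers to.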
If only Hadoop were in C++ too, things would be so much simpler.
Would love to see the Oracle/Tangosol Coherence team discuss this as they pretty much built their product as a large, distributed hashmap.
Interesting but somewhat debatable. If you have the time and money, an app written in C/C++ will outperform Java, but it takes a lot of effort these days to do that. Hadoop not being written in C++ is probably quite a good thing, as new functionality tends to be easier to implement and test in Java.
Some interesting comparative articles.
http://www.ddj.com/cpp/184401976?pgno=1
http://www.idiom.com/~zilla/Computer/javaCbenchmark.html
These microbenchmarks practically remove Java's memory management overhead completely, because of the nature of GC, so they do much less work, i.e. they are not comparing apples to apples. A fairer comparison is to run the C++ code to find its peak memory usage, set the Java max heap setting to that (or even 2x that), and watch it suffer. In the real world, such a limit is determined by the amount of physical RAM on the machine.
Read this peer reviewed paper (vs the simplistic microbenchmarks) http://www.cs.umass.edu/%7Eemery/pubs/gcvsmalloc.pdf for more details.
Given that Yahoo has a dedicated team for Hadoop and is committed to using Hadoop for its own infrastructure, does anyone know if Yahoo is equally committed to using HBase too?
shame shame. you can convince yourself that c++ is "much more efficient" if it saves 150 clock cycles every so often. shame your programmer efficiency will fall through the floor.
shame you'll lose all access to decent refactoring tools.
shame that network latency will eat those 150 cycles for breakfast.
shame that a very interesting project is being sidelined at the start by marginalising itself (c++, git(!)).
even the investment banks don't write their models in c++ any more as it's so hard, and not even any quicker. micro optimizations are easy, but macro ones much harder.
otherwise a really interesting project!!!
@james, the "hardness" of C++ is a myth that we don't share. Both Doug and I have done enough programming in both languages to say that productivity in the two is pretty much the same in steady state. Java has a higher initial environment setup cost, and a later maintenance cost when you need to combat performance problems. The lack of quality peer-reviewed libraries like Boost in Java is also a factor.
There is no need to be emo about the technical choices. Feel free to use and contribute to Hadoop and HBase, it's an open source project as well.
Long live open source and choices.
Thanks to all thinker geeks of this universe.
You keep on movin... Deep Purple Song for you guys
> Would love to see the Oracle/Tangosol Coherence team discuss this as they pretty much built their product as a large, distributed hashmap.
Language choice tends to be very personal, and usually defaults to what a programmer knows best, or is most comfortable with.
Despite working in C/C++ for years before Java, and continuing to work in C++ (no C anymore) today -- alongside C# and Java -- I would still default to Java for most things.
I don't know if the rationale given here makes sense, but in general, for a tightly focused product and a small development team that is comfortable working in C++, then C++ can certainly have a performance edge over Java (for most use cases).
Java is significantly better at "global optimizations" from a development perspective, which is to say that within a process (e.g. JVM), the boundary cost of Java (libraries, APIs, down to the interface level) is far lower than the cost of C++ as a project grows. (It's hard to explain if you haven't worked with large scale C++ projects that pull together multiple libraries and work from different groups.) The side-effect of this is often that the cost of development in Java does not rise as fast in correlation to the complexity of the product, and that the fn(quality,performance) does not drop as fast. However, while I would testify to this under oath, you will find the opposite opinion to be just as strongly believed by another programmer.
Peace,
Cameron Purdy Oracle Coherence
BTW, there is a C++ implementation of dfs/mapreduce called Sector/Sphere from UIC by Dr. Yunhong Gu et al. The terasort benchmarks showed that Sector/Sphere is about 3-5x faster than Hadoop for sorting 1TB.
Surely the development time arguments only apply if no-one really uses your software. As long as enough people use your software then even a marginal performance benefit is worth significant programming effort.
You wouldn't like it if Microsoft wrote Word in C# because it was slightly easier would you?
Besides, using modern libraries like Boost and Qt makes C++ development just as easy as Java/C#. And less annoying (cough, StringBuilder, cough).
tdhutt mentioned the two critical points that the typical C++ vs Java argument leaves out: Boost and Qt. Both are mature and, to a degree, complementary libraries (although I am a Boost user) that extend the core C++ language with reference counting, synchronization, compile-time algorithms, sophisticated type safety and typecasting, multi-paradigm programming (such as signals/slots), networking services (such as boost::asio), and serialization.
I just wish that a CORBA implementation such as omniORB would be made into a better synchronous and asynchronous communication framework, so that most distributed systems would use it as a messaging system instead of constantly inventing their own.
I personally think that in terms of the 'core' language C++ loses to Java on 'Reflection'. (I think the C++ standard shared_ptr is now a perfectly capable reference-counting garbage collector that does not lose the performance of C++, and therefore the argument about 'pointers' is not a valid one.)
And in terms of 'library' features, a web server framework and database access are the only two things that are missing from Boost (compared to Java's library features, for example).
Of course, the advantages of C++ have been stated above. But in terms of productivity, C++ with Boost and Qt offers reference-counted memory allocation, compile-time code generation, compile-time error checking, and native performance on pretty much all of the widely used CPU/OS combinations from a single source tree. These, I believe, are far more important than Java's claim-to-fame garbage collection.
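For readers unfamiliar with the reference counting mentioned in this comment, here is a minimal sketch of how `std::shared_ptr` tracks owners and destroys the object as soon as the last owner goes away. `Buffer`, `Counts` and `observe_counts` are hypothetical names made up for this illustration, and the sketch says nothing about the runtime cost of the counting.

```cpp
#include <memory>

// Hypothetical resource that counts how many instances are alive.
struct Buffer {
    static int live;
    Buffer() { ++live; }
    ~Buffer() { --live; }
};
int Buffer::live = 0;

// Reference counts observed at each step of the demonstration.
struct Counts {
    long after_make;
    long after_copy;
    long after_reset;
    int live_at_end;
};

Counts observe_counts() {
    Counts c{};
    {
        auto a = std::make_shared<Buffer>();  // one owner
        c.after_make = a.use_count();
        auto b = a;                           // copying adds a second owner
        c.after_copy = a.use_count();
        a.reset();                            // a gives up ownership; b keeps the Buffer alive
        c.after_reset = b.use_count();
    }                                         // b leaves scope: the Buffer is destroyed here
    c.live_at_end = Buffer::live;
    return c;
}
```

The point of the sketch is the last step: destruction happens at a known place in the code, when the final owner leaves scope, rather than at some later collection pass.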
Hi tdhutt, runtime.yha,
We've done recent tests with C++ using Boost versus Java, and Sun's current Hotspot Java implementation is about 3x faster than Boost's smart pointer implementation; basically, Java is able to run code at C pointer speed, while smart pointers always have measurable overhead.
In other words, C++ with Boost may be easier than C++, but you have to choose between 3x slower than Java or a very light use of smart pointers. (C++ shared_ptr will be a similar cost.)
Also, regarding productivity of development, according to our developers (who program in all of Java, C# and C++, including C++ on Windows/x86, Linux/x86, Mac OSX/x86, Solaris/Sparc, and Solaris/x86), Java wins hands down over C++ due to the ease (and immediacy) of iterative development and the advanced tooling. Our fastest C++ build takes an hour (full builds take from 4 hours to 20 hours depending on the platform), while our Java full builds take less than 10 minutes. The fact that Java/C# has class-level compilation granularity and no linking step obviously helps significantly reduce iteration time.
Further, achieving both thread safety and high concurrency with C++ and Boost is significantly more difficult than it is with Java or C#.
However, like I said before, "while I would testify to this under oath, you will find the opposite opinion to be just as strongly believed by another programmer."
Peace,
Cameron Purdy | Oracle Coherence
Admittedly, this becomes a religious issue and there is not necessarily a "right or wrong". But I want to note that C++ execution is quite deterministic, where Java's, particularly in light of runtime garbage collection, is not. In fact, with more and more object management being automated in the STL, I honestly can't even recall the last time I had any kind of leak or corruption issue, which is usually the complaint folks bring against C++. Furthermore, RAII (resource acquisition is initialization) is pretty meaningless in Java with garbage collection, whereas in C++ with scope-defined (automatic) variables, it becomes a powerful way to drop any resources that go out of scope. In our application we are able to finely control our memory and file handles simply by ensuring things go out of scope (which is often as easy as ensuring our methods are fine-grained enough, good programming practice to begin with). It seems most Java code I've seen lately has post-initialize and pre-destruction methods which must be called in order to get around the non-determinism of garbage collection. Which means, of course, that as an undesired side-effect of something which is supposed to make the system more bug-free, the programmer actually ends up explicitly managing resources instead! It seems to me as if the quality pendulum is swinging back: with the many open source libraries, the STL, and the ability to rest on the strength of C++ automatics, C++ is actually becoming the more effective and bug-free language for heavy resource manipulation!
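The scope-based cleanup described in this comment can be sketched as follows. This is an illustration only, not the commenter's code: `HandlePool` is a hypothetical stand-in for an OS resource table such as file handles, and `ScopedHandle` and `open_inside_and_after` are names invented for the example.

```cpp
#include <cassert>

// Hypothetical stand-in for a table of OS resources (file handles, sockets, ...).
class HandlePool {
public:
    void acquire() { ++open_; }
    void release() { assert(open_ > 0); --open_; }
    int open_count() const { return open_; }
private:
    int open_ = 0;
};

// RAII wrapper: acquires in the constructor, releases in the destructor.
class ScopedHandle {
public:
    explicit ScopedHandle(HandlePool& pool) : pool_(pool) { pool_.acquire(); }
    ~ScopedHandle() { pool_.release(); }
    ScopedHandle(const ScopedHandle&) = delete;             // a handle has exactly one owner
    ScopedHandle& operator=(const ScopedHandle&) = delete;
private:
    HandlePool& pool_;
};

// Records how many handles are open inside the inner scope and returns
// how many remain open after it ends.
int open_inside_and_after(HandlePool& pool, int& inside) {
    {
        ScopedHandle a(pool);
        ScopedHandle b(pool);
        inside = pool.open_count();
    }  // b and then a are destroyed here, releasing both handles
    return pool.open_count();
}
```

No explicit close or cleanup call appears anywhere in the caller; leaving the scope, whether normally or through an exception, is what releases the handles.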
Alan -
I do agree with several of your points. Hard real-time determinism in execution is just not possible today in Java. Years ago, I remember the same complaint being leveled at C++, but generally speaking you can accomplish real-time behavior in C++ now (it's amazing what a few orders of magnitude increase in CPU speed can do for a language ;-). The obvious pitfall is still primarily I/O, and secondarily (to a far lesser extent) memory management -- particularly in multi-threaded environments.
Regarding RAII, its wide adoption as a best practice in C++ is certainly a huge improvement, but as you note it is entirely incomparable with automated garbage collection. Personally, I hope to see a union of the concepts in the future ..
I don't agree with your conclusion, but like you said: "this becomes a religious issue and there is not necessarily a 'right or wrong'". I personally hope there's a continued train of improvements that move us well beyond the current "state of the art" in programming without any unnecessary sacrifice of performance or determinism.
Peace,
Cameron Purdy | Oracle Coherence
cameron,
Speaking of being deterministic, I would check the Java Real Time Spec, and maybe Sun's implementation (http://java.sun.com/javase/technologies/realtime/index.jsp). I've no experience with it myself, but I've seen some great stuff people have implemented using it...
Aviad.
To demonstrate the misconceptions of the world: I have had people who have never written a java application before tell me 'java is slow' :). Before we started using hadoop I made a non map/reduce wordCount application in java and in perl. Guess which one finished faster? Java. And then someone said, "I bet I can write it in C and it would rock".... I am still waiting for that c wordcount program from that person.
Java is quite slow at certain things. It's quite fast at others. It's not hard to find examples online in which Java is better than C++ for some task. However, having some experience writing applications that are by their nature both CPU and memory allocation intensive, I can say C++ is by far the better choice. This is especially true if data structure access can be laid out to minimize cache misses. Ultimately though, C++ and Java have such different underlying structures that neither can be said to be the better of the two outside a specific context.
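As a rough illustration of laying data out to minimize cache misses, the sketch below sums one field over two layouts of the same records. Everything here is hypothetical (`RecordAoS`, `RecordsSoA`, the 56-byte payload), and the comments assume the 64-byte L1 line size from the cache table at the top of the page. It shows the layout idea only and makes no timing claim.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs: every record drags cold data through the cache with it.
struct RecordAoS {
    double value;      // the only field the hot loop reads
    char payload[56];  // hypothetical cold data; makes the record 64 bytes
};

// Struct-of-arrays: hot values are contiguous, cold data is stored apart.
struct RecordsSoA {
    std::vector<double> values;
    std::vector<char> payloads;  // 56 bytes per record
};

double sum_aos(const std::vector<RecordAoS>& records) {
    double total = 0.0;
    for (const RecordAoS& r : records) total += r.value;  // about one cache line per value
    return total;
}

double sum_soa(const RecordsSoA& records) {
    double total = 0.0;
    for (double v : records.values) total += v;  // eight values share each 64-byte line
    return total;
}

// Builders that fill both layouts with the same values 0, 1, ..., n-1.
std::vector<RecordAoS> make_aos(std::size_t n) {
    std::vector<RecordAoS> out(n);  // value-initialized, so payloads start zeroed
    for (std::size_t i = 0; i < n; ++i) out[i].value = static_cast<double>(i);
    return out;
}

RecordsSoA make_soa(std::size_t n) {
    RecordsSoA out;
    out.values.resize(n);
    out.payloads.resize(n * 56);
    for (std::size_t i = 0; i < n; ++i) out.values[i] = static_cast<double>(i);
    return out;
}
```

Both functions compute the same sum; the difference is how many cache lines the loop has to pull in to do it, which is the kind of control over layout this comment is referring to.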
Do you Java geeks realize that the run time engine for Java (JRE) is written in C? All of your Java comments tell me that you don't even understand that your byte code is ultimately executed by the C language. In other words, Java is merely an abstraction on top of the C language. It's like a cocky child saying that their parents suck when in reality they are nothing more than a manifestation of their parents' genes. A programmer that really understands this has respect for both languages. In fact, do you know what came before C? It was the language B...and guess what...before that it was A ... all developed by AT&T. Have some respect for your elders, you cocky kids. If it wasn't for your parents and your grandparents you would be single-celled creatures floating aimlessly in the ocean of life. Same goes for you C++ programmers...have some respect for Java...it's a great language when used by a good programmer.
Just to give you an update from the industry and academia that nobody seems to have paid attention to so far:
Please meet...
Allocation wall: a limiting factor of Java applications on emerging multi-core platforms http://portal.acm.org/citation.cfm?id=1640089.1640116
It seems that garbage collection comes at a very high price, that price being Java saturating the memory bus with its memory usage algorithms, heuristics and whatnot.
It's an interesting read.
M
Hi ele7ven - you said "C++ is by far the better choice. This is especially true if data structure access can be laid out to minimize cache misses." This is a good point, that with C/C++, you control structure and array layout, and can optimize for cache locality. For massive crunching, this approach is unbeatable, but it is rare that you will have the combination of this problem domain and a competent programmer who can actually take advantage of what C has to offer ;-)
Hi jmama507 - the runtime engine is typically a compiler that converts Java "byte code" to native code, so "the JRE" itself (since I assume you mean the JVM implementation) doesn't run much at all other than to initially compile some byte code into x86 code (for example). As for "B" and "A" and respect for your elders, in reality what came before C was assembly, i.e. C replaced assembly, not "B" or "A" (since basically no one used "B" or "A").
Hi mateuszb - there are definitely challenges as the number of cores increases faster than the bus bandwidth does, but it's not an issue unique to Java. OTOH, Java does tend to use 2-10x as much memory as C (for example), so it's going to run into any such "wall" much faster.
Peace,
Cameron Purdy | Oracle Coherence
@jmama507 Your comment was so retarded that for a while my sarcasm-meter was on overload. Then I realized that you're just ignorant of how compilers and VMs work so shoo along, kid and let the men do some work.
@cameron purdy I programmed in B so don't say no one used it. C came next and was considered a Mid-Level language. Assembler was a Low Level language, but yes C was a great replacement for assembler. But technically C was an upgrade to B, and I used it years ago.
Java/C/C++ all have a place. I certainly wouldn't write Unix commands in Java nor would I do a huge Web application in C++....
I did enough research on this, so please allow me to give some input. To me this is an absurd comparison. C/C++ generates native code, which is obviously much faster than Java's byte code. Java is used for typical business application development and is currently used more in universities, simply because it is easy to use and even easier to maintain. Java is definitely easier to understand, code and learn.
But when it comes to high-performance software, no idiot will ever use Java. All the world's fastest, most successful, powerful/famous, popular software is written in C/C++ and will always be written in C/C++. Microsoft Office, Mac OS X, Windows, Android, Adobe Photoshop/Fireworks/Flash, Autocad, VLC, iOS, Symbian, 3D Games (FarCry, Quake, Doom...etc.), Java's own JRE, Oracle Database, MySQL, GIMP, ...in short, every successful piece of software that millions of people use is written in C++.
Here's a question. Name any famous/widely used/respected Java-based application (other than Eclipse and Netbeans)? Answer is simple, there is none.
Dear browserspot,
Incorrect.
Java programs almost always run as native code, for example on Mac OSX (Apple JVM based on Sun JVM), Linux (IBM, Sun and jRockit JVMs), Windows (Sun and jRockit JVMs), AIX (IBM JVM), Z/OS (IBM JVM) and Solaris (Sun JVM).
Amazon. eBay. Google AdWords. Gmail. Berkeley DB. Apache Tomcat. Major chunks of the NYSE, LSX, CME, etc. Trading systems, risk systems, etc. at basically every large bank.
Java is quite successful for large-scale applications.
Perhaps what you meant is that Java has not taken off for small-scale (desktop client) applications? While that is true, that does not (in and of itself) imply that C/C++ do not scale.
Also, here are Quake and Doom in Java .. from back in 2002!
http://java-emu.emuunlim.com/quake/quake.html
http://java-emu.emuunlim.com/doomcott/doom.html
You have apparently not yet done enough research on this.
Hi, both have advantages and disadvantages... It is you who needs to choose what suits your app or project best. I would never say to create a device driver in Java; they are in C, even assembly. If you are thinking of something complex like NYSE, it should be Java. Java is for business, and that is true. So just don't fight, and take what you want.
No offense, but the quality of your product is in serious doubt if you claim that high throughput applications will have faster memory management in C++ than in a garbage collected system. This is a complete misconception, especially with newer GC algorithms such as Sun's G1GC. A garbage collected JVM running G1GC will easily beat the pants off a C++ implemented system using plain old malloc. And that's before you start looking at the fragmentation problems you are going to get in the C++ system, which will be horrendous at high throughputs.
I'd say you probably don't understand memory management too well if you stick by your counter claims. oops!!!!
Java stinks for high performance low level computing no matter how you slice it.
It is made for a specific purpose; protect the coder from the OS and make it easier to complete a task. This is more relevant today since many programmers aren't as aware of what happens under the hood as some should be, though many pretend to be. The former is fine for most application work.
Java has grown to be a useful, stable, and capable tool for the task it was intended. However, it was never intended as a low level, CPU intensive, high performance one.
Whoever referred to Amazon or eBay using it for their systems, I rest my case. They are the slowest clunkers out there. Then again maybe they use Windows too, lol!
Or maybe Google wrote Bigtable in C++ just so it would take longer for them to develop it and not for performance or stability reasons... ;)
Let's start with the opening claim:
What exactly is this fluffy concept you call "high performance low level computing"? Because depending on how you define that (were someone to ask you to be less vaporous), you may accidentally be correct. I wouldn't write device drivers in Java, for example.
Google is one of the largest Java shops. They built a custom web server (based on Tomcat) that they use extensively. Adwords (you know, the thing that makes them all their money) is Java. The list goes on and on.
Hi there, I just wanted to drop a note regarding Java and Doom:
Those examples posted by Cameron (the "Doom" and "Quake" projects by Julien Frelat) are outdated (they are from 2002), non-functional and never ever came close to demonstrating that either of those games were feasible in Java.
Case in point, I decompiled Doomcott's source code here, and proved that it does nothing more than displaying screen wipes. However it really does read IWAD resources like a proper source port, but it's nowhere near functional enough to call it a port, sorry.
It's really a pity that it is still touted as the definitive "Doom in Java". I don't know about the "Quake 2 port" by the same author, but I bet it's in a similar status.
However, there ARE legitimate ports of Quake (2) and Doom in Java:
There's Jake2 for Quake 2, written in pure Java, using OpenGL, and dating back to 2005. Performance was 60-70% of a native source port, sometimes even higher.
Also, only in 2010 did a complete source port of Doom in pure Java appear: Mocha Doom. I am the author of that port, and let me tell you, it took far more effort than just running the C source code through a cross-compiler, making a "Doom like" game with a different engine or using an automated virtualization tool like Adobe Alchemy. This is actually a pure Java port with the original code and internal functionality mostly preserved but adapted to a proper Java OO approach etc. and it DOES work, loads custom levels etc.
As for performance, it can actually outperform certain modern sourceports like prBoom or Chocolate Doom, and has about 50-60% of the performance of what is considered the most optimized source port today, prBoom+ (measured using timed demos).
Just my 2c ;-)
I don't believe the issue is whether, on any given "for instance" test, one can make either C++ or Java look more efficient. The issue is which is more efficient in terms of the total cost of development. By that I mean the cost of development staff, the cost of sustaining staff, and the time from product inception to GA. How well C++ optimizes code is a function of who wrote the compiler optimizer and how long ago; ditto with Java. Engineering staff keeps getting more expensive. Competent C++ developers are getting fewer in number and more expensive. Competent Java developers are getting greater in number and less expensive. Machine performance changes on a monthly basis; let Moore's Law deal with that aspect, as it is irrelevant to the cost of development.