Java performance tuning tips or everything you want to know about Java performance in 15 minutes

Last updated: 13 September 2014

This is a summary of the Java performance tuning tips described on the java-performance.info website. This page is updated whenever a new article is published on the site.

Unlike most Java performance books, this guide focuses on tuning your Java code rather than your JVM settings. This makes it useful for developers of any low-latency or high-throughput application (especially in the high frequency trading area).

Tools

UPDATED: Introduction to JMH: an overview of JMH - a new microbenchmarking framework from Oracle. I have covered most of the essential functionality of the framework. JMH profilers will be the subject of a separate article. This article covers JMH 1.0.

Tags: JMH, microbenchmarking.

  • JMH is useful for all sorts of microbenchmarking - from nanoseconds to seconds per test. It takes care of all the measurement logic, leaving you with just the task of writing the test method(s). JMH also has built-in support for all sorts of multithreaded tests - both uniform (all threads run the same code) and non-uniform (several groups of threads, each running its own code).
  • If you have to remember just one JMH rule, it should be: always read test input from @State objects and return the result of your calculations (either explicitly or via a Blackhole object) - see the sketch after this list.
  • JMH is launched differently since JMH 0.5: you add one more dependency to your pom file and use the maven-shade-plugin, which generates a target/benchmarks.jar file containing all the code required to run every test in your project.
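A minimal sketch of such a benchmark (hypothetical class and field names; assumes the JMH 1.0 annotation and Blackhole packages):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.infra.Blackhole;

    @State(Scope.Thread)            // test input lives in a @State object
    public class ConcatBenchmark {
        String a = "foo";
        String b = "bar";

        @Benchmark
        public String returnResult() {
            return a + b;           // returning the result prevents dead code elimination
        }

        @Benchmark
        public void sinkResult(Blackhole bh) {
            bh.consume(a + b);      // alternatively, sink the result into a Blackhole
        }
    }

After 'mvn package', both methods are picked up by java -jar target/benchmarks.jar as described above.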

Introduction to JMH Profilers: an overview of the profilers bundled with JMH - a new microbenchmarking framework from Oracle. They give you insight into your microbenchmarks that ordinary profilers cannot provide, because your tests may be too short-lived and you need to separate test code from framework code.

Tags: JMH, profiler, microbenchmarking.

  • JMH profilers are a cheap and convenient way to find the bottlenecks in your microbenchmarks: they try to exclude JMH's own code from measurement as much as possible, and you can use any subset of them simultaneously.
  • The STACK profiler takes a thread dump every 10 ms by default; you may want to decrease this interval a little on powerful boxes.
  • The JIT compiler profilers (COMP / HS_COMP) are recommended for most benchmarks - they will tell you whether your code was insufficiently warmed up (see the command-line sketch below).
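For example (a hedged sketch; profiler names as of JMH 1.0), profilers are attached to the generated benchmarks.jar with the -prof option:

    java -jar target/benchmarks.jar -prof stack     # sampled thread stacks
    java -jar target/benchmarks.jar -prof comp      # JIT compiler statistics
    java -jar target/benchmarks.jar -lprof          # list the profilers available on this machine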

JDK classes

Core Java 7 Change Log: a list of all changes in the core Java classes of JDK7 releases.

Tags: Java 7, changes.

I keep track of all performance-related changes in the core JDK7 classes on that page. All JDK updates up to Java 7u45 are now covered.

Using double/long vs BigDecimal for monetary calculations: double, long, java.math.BigDecimal, java.lang.String:

Tags: finance, money, HFT, low latency.

  • If you want to implement fast and correct monetary arithmetic in Java, stick to the following rules (a short sketch follows this list):
    1. Store monetary values in the smallest currency units (for example, cents) in long variables.
    2. When you do use double, avoid non-integral values (still calculate in the smallest currency units).
    3. Add/subtract using long.
    4. Round any multiplication/division results using Math.round/rint/ceil/floor (per your system requirements).
    5. Your calculations should fit into 52 bits (double precision).
  • Always pass a MathContext to BigDecimal multiplication and division in order to avoid an ArithmeticException on infinitely long decimal results. Do not use MathContext.UNLIMITED for this purpose - it is equivalent to no context at all.
  • Do not construct a BigDecimal from a double; construct it from a String whenever possible.
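A minimal sketch of these rules (hypothetical helper class; assumes two decimal places, i.e. cents):

    import java.math.BigDecimal;

    public final class MoneyUtil {
        // convert external input ("12.34") into cents exactly once, at the boundary
        public static long toCents(String amount) {
            return new BigDecimal(amount).movePointRight(2).longValueExact();
        }

        // addition/subtraction stay in long - always exact
        public static long add(long cents, long otherCents) {
            return cents + otherCents;
        }

        // multiplication by a rate goes through double and is rounded back to cents
        public static long applyRate(long cents, double rate) {
            return Math.round(cents * rate);
        }
    }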

Changes to String internal representation made in Java 1.7.0_06: java.lang.String, java.util.HashMap, java.util.Hashtable, java.util.HashSet, java.util.LinkedHashMap, java.util.LinkedHashSet, java.util.WeakHashMap and java.util.concurrent.ConcurrentHashMap:

Tags: String.substring, Java 7, memory consumption, low latency.

  • Since Java 1.7.0_06 String.substring always creates a new underlying char[] for every String it returns, so the method now has linear complexity instead of the previous constant complexity. The advantages of this change are a slightly smaller memory footprint per String (8 bytes less than before) and a guarantee against the memory leaks previously caused by String.substring (see String packing part 1: converting characters to bytes for more details on Java object memory layout).
  • Java 7u6+ functionality. Removed in Java 8. Starting from the same Java update, the String class got a second hashing method called hash32. This method is not public and, without reflection, can be accessed only via the sun.misc.Hashing.stringHash32(String) call. It is used by 7 JDK hash-based collections once their size exceeds the jdk.map.althashing.threshold system property. This is an experimental feature and I currently do not recommend using it in your code.
  • Java 7u6 (inclusive) to Java 7u40 (exclusive) functionality. Not applicable to Java 8. All standard JDK non-concurrent maps and sets in all Java versions between Java 7u6 (inclusive) and Java 7u40 (exclusive) are affected by a performance bug caused by the new hashing implementation. The bug affects only multithreaded applications that create large numbers of maps per second. See this article for more details. The problem was fixed in Java 7u40.

NEW: String deduplication feature (from Java 8 update 20): this article describes the string deduplication feature added in Java 8 update 20. It lets you reclaim the memory occupied by duplicate strings without writing a single line of Java code. While it is not the most efficient memory saving tool in absolute terms, it is definitely a winner in terms of savings per unit of developer effort.

Tags: String, Java 8, memory consumption.

  • The string deduplication feature was added in Java 8 update 20. It is a part of the G1 garbage collector, so it must be enabled together with G1: -XX:+UseG1GC -XX:+UseStringDeduplication
  • String deduplication is an optional G1 phase; whether it runs depends on the current system load.
  • String deduplication looks for strings with the same contents and canonicalizes the underlying char[] holding the string characters. You don't need to write any code to use this feature, but you are still left with distinct String objects, each occupying 24 bytes. Sometimes it is worth interning strings explicitly via String.intern instead.
  • String deduplication does not process strings that are too young. The minimal age of processed strings is controlled by the -XX:StringDeduplicationAgeThreshold=3 JVM parameter (3 is the default value).

Performance of various methods of binary serialization in Java: java.nio.ByteBuffer, sun.misc.Unsafe, java.io.DataInputStream, java.io.DataOutputStream, java.io.ByteArrayInputStream, java.io.ByteArrayOutputStream: comparison of binary serialization performance using various classes:

Tags: serialization in Java, unsafe memory access in Java, high throughput, low latency.

  • It is extremely slow to write single bytes to direct byte buffers. You should avoid using direct byte buffers for writing records with mostly single byte fields.
  • If you have primitive array fields, always use bulk methods to process them. The performance of ByteBuffer bulk methods is close to that of Unsafe (though ByteBuffer methods are always a little slower). To store/load any primitive array other than byte[], obtain a typed view via ByteBuffer.as[YourType]Buffer(), call its put(array) bulk method and then update the byte buffer position yourself (see the sketch after this list). Do not call the ByteBuffer.put[YourType] methods in a loop!
  • The longer your average field, the slower a heap byte buffer becomes and the faster a direct byte buffer becomes. Unsafe access, even to individual fields, is still faster.
  • In Java 7 many types of ByteBuffer accesses were seriously optimized compared to Java 6.
  • Always try to serialize primitive arrays using a direct byte buffer with your platform's native byte order - its performance is very close to that of Unsafe, and unlike Unsafe code it is portable.
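A sketch of the bulk-write advice above (hypothetical helper method):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // bulk-copy a long[] into a direct buffer with the platform native byte order
    static ByteBuffer writeLongs(long[] values) {
        ByteBuffer buf = ByteBuffer.allocateDirect(values.length * 8)
                                   .order(ByteOrder.nativeOrder());
        buf.asLongBuffer().put(values);                    // one bulk copy, no per-element put calls
        buf.position(buf.position() + values.length * 8);  // the view buffer does not advance the parent position
        return buf;
    }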

Java collections overview: all JDK 1.6/1.7 standard collections are described and categorized in this overview.

Tags: Java 1.6 collections, Java 1.7 collections, Java collections guide, overview.

Here is a very brief summary of all JDK collections:

Lists
  Single-threaded:
  • ArrayList - generic array-based
  • LinkedList - do not use
  • Vector - deprecated
  Concurrent:
  • CopyOnWriteArrayList - seldom updated, often traversed
Queues / deques
  Single-threaded:
  • ArrayDeque - generic array-based
  • Stack - deprecated
  • PriorityQueue - sorted retrieval operations
  Concurrent:
  • ArrayBlockingQueue - bounded blocking queue
  • ConcurrentLinkedDeque / ConcurrentLinkedQueue - unbounded linked queue (CAS-based)
  • DelayQueue - queue with a delay on each element
  • LinkedBlockingDeque / LinkedBlockingQueue - optionally bounded linked queue (lock-based)
  • LinkedTransferQueue - may transfer elements without storing them
  • PriorityBlockingQueue - concurrent PriorityQueue
  • SynchronousQueue - an Exchanger with a Queue interface
Maps
  Single-threaded:
  • HashMap - generic map
  • EnumMap - enum keys
  • Hashtable - deprecated
  • IdentityHashMap - keys compared with ==
  • LinkedHashMap - keeps insertion order
  • TreeMap - sorted keys
  • WeakHashMap - useful for caches
  Concurrent:
  • ConcurrentHashMap - generic concurrent map
  • ConcurrentSkipListMap - sorted concurrent map
Sets
  Single-threaded:
  • HashSet - generic set
  • EnumSet - set of enums
  • BitSet - set of bits/dense integers
  • LinkedHashSet - keeps insertion order
  • TreeSet - sorted set
  Concurrent:
  • ConcurrentSkipListSet - sorted concurrent set
  • CopyOnWriteArraySet - seldom updated, often traversed

java.util.ArrayList performance guide: java.util.ArrayList:

Tags: low latency, high throughput, CPU cache friendly, Java collections, CPU optimization, memory optimization.

Try to follow these rules while using ArrayList:

  • Add elements to the end of the list
  • Remove elements from the end too
  • Avoid contains, indexOf and remove(Object) methods
  • Avoid the removeAll and retainAll methods even more
  • Use the subList(int, int).clear() idiom to quickly clear a part of the list (see the sketch below)
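A short sketch of these rules (hypothetical method and contents):

    import java.util.ArrayList;
    import java.util.List;

    static void arrayListUsage() {
        List<String> list = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            list.add("item" + i);            // add at the tail - amortized O(1)
        }
        list.remove(list.size() - 1);        // remove from the tail - O(1), no array copying
        // clear the first 100 elements in one pass instead of calling remove(0) 100 times
        list.subList(0, 100).clear();
    }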

java.util.LinkedList performance: java.util.LinkedList, java.util.ArrayDeque:

Tags: Java collections, CPU optimization, avoid it.

If you need to write fast LinkedList code, try to stick to these rules:

  • Consider using ArrayDeque for queue-based algorithms
  • Use ListIterator with LinkedList
  • Avoid any LinkedList methods which accept or return the index of an element - they require a linear traversal of the list and have nothing to do with performance
  • Check whether you really need the LinkedList.remove/removeFirst/removeLast methods; use pollFirst/pollLast instead
  • Try to process LinkedList contents in batches (see the iterator sketch below)
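A small sketch of iterator-based batch removal (hypothetical method):

    import java.util.Iterator;
    import java.util.LinkedList;

    // remove matching elements through the iterator: O(n) overall,
    // while calling list.remove(int) in a loop would be O(n^2)
    static void removeEmpty(LinkedList<String> list) {
        for (Iterator<String> it = list.iterator(); it.hasNext(); ) {
            if (it.next().isEmpty()) {
                it.remove();
            }
        }
    }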

Bit sets: java.util.BitSet, java.util.Set<Integer>: representing a set of integers in the most compact form; using bit sets to store a set of Long/long values:

Tags: low latency, high throughput, CPU cache friendly, Java collections, CPU optimization, memory optimization.

  • Do not forget about bit sets when you need to map a large number of integer keys to boolean flags.
  • In many cases a Set of integer values should be replaced with a bit set in order to save a lot of memory (see the sketch below).
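A minimal sketch (hypothetical method; assumes dense non-negative int keys):

    import java.util.BitSet;

    // membership of dense non-negative int keys: 1 bit per possible value
    // instead of a boxed Integer entry per element in a HashSet<Integer>
    static BitSet markSeen(int[] ids) {
        BitSet seen = new BitSet();
        for (int id : ids) {
            seen.set(id);
        }
        return seen;      // seen.get(id) answers "contains", seen.clear(id) removes
    }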

java.util.IdentityHashMap: a discussion of why IdentityHashMap is so special and what alternatives it has.

Tags: Java collections, object graph, avoid it.

  • java.util.IdentityHashMap uses System.identityHashCode to obtain the object identity hash code. Avoid IdentityHashMap if your objects have a primary key field (use it as the key of an ordinary HashMap); if you need your own equals and hashCode logic but cannot modify the objects you work with, use a Trove map with a custom hashing strategy instead.
  • Do not rely on iterating IdentityHashMap contents - the iteration order differs on every run of your program, making your program results inconsistent.
  • Accessing the object identity hash code is a very cheap Java intrinsic operation.
  • Beware that once an object's identity hash code has been calculated, the object can no longer be used for biased locking. While rare in normal circumstances, you may end up in this situation if your lock object is visited by an object graph traversal algorithm (serialization, for example).

Regexp-related methods of String: java.util.regex.Pattern, java.util.regex.Matcher, java.lang.String: pattern/matcher logic:

Tags: low latency, high throughput, CPU optimization.

  • Always (or nearly always) replace the String.matches, split, replaceAll and replaceFirst methods with a precompiled Pattern and its Matcher - this saves you repeated pattern compilation (see the sketch after this list).
  • In Java 7 String.split is optimized for splitting by a single character that is not special in regular expressions. Use String.split for such patterns in Java 7.
  • In other simple cases consider hand-writing parsing methods for the time-critical code. You can easily gain a 10x speedup by replacing Pattern methods with handcrafted parsing.
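A short sketch of pattern reuse (hypothetical constants and methods):

    import java.util.regex.Pattern;

    // compile patterns once; String.split/matches/replaceAll recompile them on every call
    private static final Pattern COMMA  = Pattern.compile(",");
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static String[] splitLine(String line) {
        return COMMA.split(line);
    }

    static boolean isNumber(String s) {
        return NUMBER.matcher(s).matches();
    }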

java.util.Date, java.util.Calendar and java.text.SimpleDateFormat performance: java.util.Date, java.util.Calendar, java.text.SimpleDateFormat: date storage, parsing and converting back to string:

Tags: low latency, high throughput, finance, CPU optimization, memory optimization.

  • Do not use java.util.Date unless you have to use it. Use an ordinary long instead.
  • java.util.Calendar is useful for all sorts of date calculations and i18n, but avoid storing many such objects or creating them extensively - they consume a lot of memory and are expensive to create.
  • java.text.SimpleDateFormat is useful for general-case datetime parsing, but avoid it if you have to parse a lot of dates in the same format (especially dates without a time part). Implement a parser manually instead (see the sketch below).
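A sketch of a hand-written parser (hypothetical method; assumes a fixed "yyyyMMdd" format and ignores timezone subtleties):

    import java.util.Calendar;

    // reuses a single Calendar instance; not thread-safe - keep one instance per thread
    private static final Calendar CAL = Calendar.getInstance();

    static long parseYyyyMmDd(String s) {
        int year  = Integer.parseInt(s.substring(0, 4));
        int month = Integer.parseInt(s.substring(4, 6));
        int day   = Integer.parseInt(s.substring(6, 8));
        CAL.clear();
        CAL.set(year, month - 1, day);   // Calendar months are 0-based
        return CAL.getTimeInMillis();
    }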

Joda Time library performance: org.joda.time.DateTime, org.joda.time.format.DateTimeFormat, org.joda.time.format.DateTimeFormatter.
This is a comparison of the performance of Joda Time library classes with the standard JDK classes (java.util.Date, java.util.Calendar, java.text.SimpleDateFormat). I advise you to read this article in conjunction with the java.util.Date, java.util.Calendar and java.text.SimpleDateFormat performance article. This article was tested on Joda Time ver 2.1-2.3.

Tags: low latency, high throughput, finance, CPU optimization, memory optimization.

  • All Joda Time date/time objects are built on top of a long timestamp, so it is cheap to construct those objects from a long.
  • Joda Time ver 2.1-2.3 is affected by a performance issue in the timezone offset calculation logic - all years after the last daylight saving rule change in a given timezone use a slow calculation path (European timezones are hit particularly badly). In essence this means that every zone performs badly for all years after the release date of the Joda Time version you are using.
  • Construction of date/time objects and date/time arithmetic in Joda are 1.5-3 times faster than with GregorianCalendar for years not affected by the above-mentioned issue. For affected years, Joda date operations plummet and can be 4 times slower than GregorianCalendar.
  • Joda does not keep the human time (year/month/day/hour/min/sec) inside its objects, unlike GregorianCalendar. This means that accessing human-time fields of Joda objects is more expensive if you need more than one field.
  • Date/time parsing in Joda works a little faster than with JDK SimpleDateFormat. The advantage of Joda parsing is that constructing a parser (a DateTimeFormatter object) is extremely cheap, unlike the expensive SimpleDateFormat, so you no longer have to cache parsers.

JSR 310 - Java 8 Date/Time library performance (as well as Joda Time 2.3 and j.u.Calendar): an overview of a new Java 8 date/time implementation also known as JSR-310 and its performance comparison with Joda Time 2.3 and j.u.GregorianCalendar.

Tags: Java 8, overview, CPU optimization, memory optimization.

  • Java 8 date/time classes are built on top of human time - year/month/day/hour/minute/second/nanos. This makes them fast for human datetime arithmetic/conversion. Nevertheless, if you are processing computer time (a.k.a. millis since epoch), especially computer time in a short date range (a few days), a manual implementation based on int/long values would be much faster.
  • Date/time component getters like getDayOfMonth have O(1) complexity in the Java 8 implementation. Joda getters require the computer-to-human time calculation on every getter call, which makes Joda a bottleneck in such scenarios.
  • Parsing of OffsetDateTime/OffsetTime/ZonedDateTime is very slow in Java 8 ea b121 due to exceptions thrown and caught internally in the JDK.

java.io.ByteArrayOutputStream: java.io.ByteArrayOutputStream, java.nio.ByteBuffer: why you should not use ByteArrayOutputStream in the performance critical code.

Tags: Java IO, avoid it.

  • For performance-critical code, try to use ByteBuffer instead of ByteArrayOutputStream. If you still want to use ByteArrayOutputStream, get rid of its synchronization.
  • If you are writing some sort of message to an unknown OutputStream, write the message to a ByteArrayOutputStream first and use its writeTo(OutputStream) method afterwards (see the sketch below). In the rare cases when you build a String from its byte representation, do not forget about the ByteArrayOutputStream.toString methods.
  • In most cases avoid the ByteArrayOutputStream.toByteArray method - it creates a copy of the internal byte array. Garbage collecting these copies may take noticeable time if your application already uses a few gigabytes of memory (see the Inefficient byte[] to String constructor article for another example).
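A sketch of the writeTo pattern (hypothetical method and parameters):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    // build the message once, then copy it into the target stream
    // without the extra byte[] copy made by toByteArray()
    static void sendMessage(byte[] header, byte[] body, OutputStream target) throws IOException {
        ByteArrayOutputStream msg = new ByteArrayOutputStream(header.length + body.length);
        msg.write(header);
        msg.write(body);
        msg.writeTo(target);
    }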

java.io.BufferedInputStream and java.util.zip.GZIPInputStream: java.io.BufferedInputStream, java.util.zip.GZIPInputStream, java.nio.channels.FileChannel: some minor performance pitfalls in these two streams.

Tags: high throughput, CPU optimization, memory optimization, data compression.

  • Both BufferedInputStream and GZIPInputStream have internal buffers. The default size is 8192 bytes for the former and 512 bytes for the latter. It is generally worth increasing either of these to at least 65536.
  • Do not wrap the input of a GZIPInputStream in a BufferedInputStream; instead set the GZIPInputStream buffer size explicitly in its constructor (see the sketch below). Keeping an existing BufferedInputStream is still safe, though.
  • If you have a new BufferedInputStream( new FileInputStream( file ) ) object and you call its available method rather often (for example, once or twice per input message), consider overriding the BufferedInputStream.available method. This can greatly speed up file reading.
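A one-line sketch of the buffer-size advice (hypothetical method):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    // a 64K decompression buffer instead of the 512-byte default;
    // no BufferedInputStream needed in front of the GZIPInputStream
    static InputStream openGzip(String fileName) throws IOException {
        return new GZIPInputStream(new FileInputStream(fileName), 65536);
    }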

java.lang.Byte, Short, Integer, Long, Character (boxing and unboxing): java.lang.Byte, java.lang.Short, java.lang.Integer, java.lang.Long, java.lang.Character:

Tags: low latency, high throughput, CPU optimization, memory optimization.

  • Never call the valueOf(String) methods of the java.lang.Number subclasses. If you need a primitive value, call parse[Type]. If you want an instance of a wrapper class, still call the parse[Type] method and rely on autoboxing, which caches the most frequently used values. Never call wrapper class constructors - they always allocate a new object, bypassing the cache (see the sketch after the table). Here is a summary of the caching support in the wrapper classes:
Byte, Short, Long: from -128 to 127
Character: from 0 to 127
Integer: from -128 to java.lang.Integer.IntegerCache.high or 127, whichever is bigger
Float, Double: no caching
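A small sketch of the parse-and-box rule (hypothetical variable names):

    static void parseExamples(String s) {
        int n = Integer.parseInt(s);          // primitive result - no allocation at all
        Integer boxed = Integer.parseInt(s);  // autoboxed via Integer.valueOf - small values come from the cache
        Integer worst = new Integer(s);       // always allocates a new object - avoid
    }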

Map.containsKey/Set.contains: java.util.Map, java.util.Set and most of their implementations:

Tags: low latency, high throughput, CPU optimization, Java collections.

  • For sets, a contains+add or contains+remove call pair should be replaced with a single add/remove call (checking its boolean result), even if some extra logic was guarded by the contains call.
  • For maps, a containsKey+get pair should always be replaced with get followed by a null check of the result. A containsKey+remove pair should be replaced with a single remove call and a check of its result (see the sketch below).
  • The same ideas apply to Trove maps and sets.
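A short sketch (hypothetical methods; assumes the map does not store null values):

    import java.util.Map;

    // one lookup instead of two: get + null check replaces containsKey + get
    static String lookup(Map<String, String> map, String key) {
        String value = map.get(key);
        if (value != null) {
            return value;
        }
        return "default";
    }

    // remove + result check replaces containsKey + remove
    static boolean removeIfPresent(Map<String, String> map, String key) {
        return map.remove(key) != null;
    }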

java.util.zip.CRC32 and java.util.zip.Adler32 performance: java.util.zip.CRC32, java.util.zip.Adler32 and java.util.zip.Checksum:

Tags: CPU optimization, checksum.

  • If you can choose the checksum implementation, try Adler32 first; if its quality is sufficient for your purpose, use it instead of CRC32. In any case, access the Adler32/CRC32 logic through the Checksum interface (see the sketch below).
  • Try to update the checksum in blocks of at least 500 bytes. Shorter blocks make the time spent in JNI calls noticeable.
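A sketch of block-wise updates through the Checksum interface (hypothetical method):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.Adler32;
    import java.util.zip.Checksum;

    // feed the checksum in 64K blocks - one JNI call per block instead of one per byte
    static long checksum(InputStream in) throws IOException {
        Checksum sum = new Adler32();        // swap in new CRC32() behind the same interface if needed
        byte[] buf = new byte[65536];
        int read;
        while ((read = in.read(buf)) != -1) {
            sum.update(buf, 0, read);
        }
        return sum.getValue();
    }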

hashCode method performance tuning: java.lang.String, java.util.HashMap, java.util.HashSet, java.util.Arrays:

Tags: low latency, high throughput, CPU optimization, memory optimization.

  • Try to improve the distribution of your hashCode results - this is far more important than optimizing the method's speed. Never write a hashCode method that returns a constant.
  • The distribution of String.hashCode results is nearly perfect, so you can sometimes substitute Strings with their hash codes. If you are working with sets of strings, try to end up with BitSets, as described in this article - the performance of your code will improve greatly.

Creating an exception in Java is very slow: why it is so expensive to create exceptions in Java and how you can avoid those costs: java.lang.Throwable, java.lang.Exception, java.lang.RuntimeException, sun.misc.BASE64Decoder, java.lang.NumberFormatException:

Tags: low latency, high throughput, CPU optimization.

  • Never use exceptions as a return-code replacement or for events that are likely to happen. Creating an exception is too expensive - the cost starts at approximately 1 microsecond per exception.
  • There are three ways to avoid exception costs: refactor your code not to use them, cache an exception instance, or override its fillInStackTrace method (see the sketch after this list).
  • Avoid the Number subclass parse*/valueOf methods if you call them for each piece of your data and expect a lot of non-numeric input. Parse such values manually for top performance.
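A sketch of the fillInStackTrace override (hypothetical exception class; note that a cached instance carries no useful stack trace):

    // a "light" exception for likely failure paths: no stack trace is captured,
    // so creating (or reusing) it is cheap
    public class ParseFailedException extends RuntimeException {
        public static final ParseFailedException INSTANCE = new ParseFailedException();

        @Override
        public synchronized Throwable fillInStackTrace() {
            return this;   // skip the expensive stack walk
        }
    }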

Java logging performance pitfalls: how to lose as little as possible performance while writing log messages: java.util.logging.Logger, java.util.logging.Handler, java.util.logging.Formatter, java.text.MessageFormat:

Tags: low latency, high throughput, CPU optimization, logging.

  • If you make expensive calculations while preparing data for log messages, either guard them with Logger.isLoggable and do all the data preparation inside that check, or pass an object that does the calculations in its toString method (see the sketch after this list).
  • Never call toString yourself to obtain a log message argument - just pass the original object; the logging framework will call toString on it when (and if) the message is actually formatted.
  • Do not concatenate user-provided data into the format string itself - a maliciously crafted string concatenated into the format could let a user break your logging or access data that was not meant to be exposed; pass such data as log arguments instead.
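A sketch with java.util.logging (hypothetical logger name and helper method):

    import java.util.logging.Level;
    import java.util.logging.Logger;

    private static final Logger LOG = Logger.getLogger("app");

    static void logState(Object order) {
        // guard expensive message preparation
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("state=" + buildExpensiveDump());
        }
        // constant format string, data passed as an argument - formatted only if published
        LOG.log(Level.FINE, "Processing order {0}", order);
    }

    static String buildExpensiveDump() {
        return "...";   // placeholder for an expensive computation
    }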

Base64 encoding and decoding performance: an overview of several well-known Base64 Java implementations from the performance perspective: sun.misc.BASE64Encoder, sun.misc.BASE64Decoder, java.util.Base64 (Java 8 specific), javax.xml.bind.DatatypeConverter (Java 6+), org.apache.commons.codec.binary.Base64, com.google.common.io.BaseEncoding (Google Guava), http://iharder.net/base64, MiGBase64:

Tags: low latency, high throughput, CPU optimization, serialization in Java, Java 8.

  • If you are looking for a fast and reliable Base64 codec, do not look outside the JDK. There is a new codec in Java 8 (java.util.Base64) and one hidden from many eyes since Java 6 (javax.xml.bind.DatatypeConverter). Both are fast, reliable and do not suffer from integer overflows (see the sketch after this list).
  • Two of the four third-party codecs described here are very fast: MiGBase64 and IHarder. However, if you need to process hundreds of megabytes at a time, only Google Guava lets you decode 2G of data in one call (versus 360MB for MiGBase64 and 720MB for IHarder and Apache Commons). Unfortunately, Guava does not support byte[] -> byte[] encoding.
  • Do not call String.getBytes(Charset) on huge strings if your charset is a multibyte one - you may run into the whole gamut of integer overflow related exceptions.
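A short usage sketch of the two JDK codecs (hypothetical method):

    import java.util.Base64;
    import javax.xml.bind.DatatypeConverter;

    static void base64Examples(byte[] data) {
        // Java 8
        byte[] encoded = Base64.getEncoder().encode(data);
        byte[] decoded = Base64.getDecoder().decode(encoded);
        // Java 6+: tucked away in the JAXB API
        String text = DatatypeConverter.printBase64Binary(data);
        byte[] back = DatatypeConverter.parseBase64Binary(text);
    }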

A possible memory leak in the manual MultiMap implementation: an overview of multimap implementations in Java 8, Google Guava and Scala 2.10 as well as a description of a possible memory leak you can have while manually implementing a multimap using Java 6 or 7.

Tags: Java collections, Java 8, Scala, Google Guava.

  • As you have seen, it is quite easy to overlook a memory leak while implementing a multilevel map. You need to be careful and separate read and write accesses to the outer map.
  • Newer frameworks and languages - Google Guava, Java 8 and Scala - already provide more convenient syntax and a wider choice of collections, allowing you to avoid such memory leaks in multilevel maps (see the sketch below).
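A Java 8 sketch of the leak-free pattern (hypothetical method; the leak appears when empty inner lists are left in the outer map):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    static void addAndRemove(Map<String, List<String>> multimap, String key, String value) {
        // one call creates the inner list on demand - no manual get/put on the outer map
        multimap.computeIfAbsent(key, k -> new ArrayList<>()).add(value);

        // removal: drop the inner list once it becomes empty, otherwise empty lists accumulate
        List<String> values = multimap.get(key);
        if (values != null && values.remove(value) && values.isEmpty()) {
            multimap.remove(key);
        }
    }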

java.util.Random and java.util.concurrent.ThreadLocalRandom in multithreaded environments: an overview of java.util.Random and java.util.concurrent.ThreadLocalRandom in single and multithreaded environments as well as some low level analysis of their performance.

Tags: Java Random, Java 7, ThreadLocalRandom, multithreading, CAS.

  • Never share an instance of java.util.Random between several threads; wrap it in a ThreadLocal instead.
  • From Java 7 on, prefer java.util.concurrent.ThreadLocalRandom to java.util.Random in all circumstances - it is backwards compatible with the existing API, but uses cheaper operations internally (see the sketch below).
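A one-method sketch (hypothetical method name):

    import java.util.concurrent.ThreadLocalRandom;

    static int rollDie() {
        // per-thread generator, no contention; never cache the returned instance in a field
        return ThreadLocalRandom.current().nextInt(1, 7);
    }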

Charset encoding and decoding in Java 7/8: we will check how fast the Charset encoders/decoders are in Java 7 and what performance improvements Java 8 brings.

Tags: Charset, ISO-8859-1, Java 8.

  • Always prefer national charsets like windows-1252 or Shift_JIS to UTF-8: they produce a more compact binary representation (as a rule) and are faster to encode/decode (there are some exceptions in Java 7, but this becomes the rule in Java 8).
  • ISO-8859-1 always works faster than US-ASCII in Java 7 and 8. Choose ISO-8859-1 if you don't have a solid reason to use US-ASCII.
  • You can write a very fast String->byte[] conversion for US-ASCII/ISO-8859-1, but you cannot beat the JDK decoders - they have direct access to the output String they create.

String switch performance: we will check how fast various ways of implementing a string-based switch are.

Tags: switch, Java 8, String.equals/equalsIgnoreCase.

  • A string switch in Java 7 is implemented using a fixed-size map with roughly 20 slots, which means you can freely use it in most cases without worrying about its performance at all. As the article shows, a string switch performs identically to manually implemented maps when the switch has under 100 cases.
  • String.equals/equalsIgnoreCase perform great as a replacement for a string switch when all your case strings have different lengths (or at least few cases share a length) and the number of cases is not too big. This works because String.equals/equalsIgnoreCase compares string lengths before comparing the actual contents.

Memory optimization

An overview of memory saving techniques in Java: this article gives you the basic advice on memory optimization in Java. Most other Java memory optimization techniques are based on it.

Tags: low latency, high throughput, finance, CPU optimization, memory optimization.

  • Prefer primitive types to their Object wrappers. The main cause of wrapper type usage is JDK collections, so consider using a primitive collection framework such as Trove.
  • Try to minimize the number of objects you keep. For example, prefer array-based structures like ArrayList/ArrayDeque to pointer-based structures like LinkedList.

Memory consumption of popular Java data types - part 1: this article will describe the memory consumption of enums and EnumMap / EnumSet / BitSet / ArrayList / LinkedList / ArrayDeque JDK classes in Java 7.

Tags: low latency, high throughput, finance, CPU optimization, memory optimization, collections.

The following table summarizes the storage occupied per stored value, assuming that a Java object reference occupies 4 bytes. Note that you must spend 4 bytes per object reference in any case, so subtract 4 bytes from the values below to find out the storage overhead.

EnumSet, BitSet - 1 bit per value
EnumMap - 4 bytes per value (nothing for the key)
ArrayList - 4 bytes per value (more if the ArrayList capacity is noticeably larger than its size)
LinkedList - 24 bytes per value (fixed)
ArrayDeque - 4 to 8 bytes per value, 6 bytes on average

Memory consumption of popular Java data types - part 2: this article will describe the memory consumption of HashMap / HashSet, LinkedHashMap / LinkedHashSet, TreeMap / TreeSet and PriorityQueue JDK classes in Java 7 as well as their Trove replacements.

Tags: low latency, high throughput, finance, CPU optimization, memory optimization, collections, primitive collections.

  • Always try to replace a HashMap with a Trove THashMap, a HashSet with a THashSet and a LinkedHashSet with a Trove TLinkedHashSet. Such a replacement requires adding a single letter ('T') plus an import statement, with no other code changes, and gives you significant memory savings - see the table below.
  • The following table summarizes the storage occupied per stored value, assuming that a reference occupies 4 bytes. Note that you must spend 4 bytes per object reference in any case, so subtract 4 bytes from the values below to find out the storage overhead (subtract 8 bytes for maps, because there is a key as well as a value).
JDK collection (size per value) -> possible Trove substitution (size per value)
HashMap: 32 * SIZE + 4 * CAPACITY bytes -> THashMap: 8 * CAPACITY bytes
HashSet: 32 * SIZE + 4 * CAPACITY bytes -> THashSet: 4 * CAPACITY bytes
LinkedHashMap: 40 * SIZE + 4 * CAPACITY bytes -> no Trove substitution
LinkedHashSet: 40 * SIZE + 4 * CAPACITY bytes -> TLinkedHashSet: 8 * CAPACITY bytes
TreeMap, TreeSet: 40 * SIZE bytes -> no Trove substitution
PriorityQueue: 4 * CAPACITY bytes -> no Trove substitution

A few more memory saving techniques in Java: this article describes the advantages of static inner classes, string pooling, boolean flag collections as well as special classes for tiny collections in JDK.

Tags: low latency, high throughput, finance, CPU optimization, memory optimization, collections.

  • Make all your inner classes static by default; remove the static qualifier only when you have to.
  • If you keep a large number of generally small collections, use the java.util.Collections.empty*/singleton* methods for memory-efficient storage of the tiny ones.
  • Prefer a BitSet to boolean arrays/lists and to dense sets of any integer type: bit sets are both memory and CPU cache friendly.

String.intern in Java 6, 7 and 8 - string pooling: this article describes how the String.intern() method was implemented in Java 6 and what changes were made to it in Java 7 and Java 8 (which finally made it extremely useful).

Tags: CPU optimization, memory optimization.

  • Stay away from the String.intern() method in Java 6, because the JVM string pool lives in a fixed-size memory area (PermGen).
  • Java 7 and 8 place the string pool in the heap, so string pooling is limited only by the whole application memory.
  • Use the -XX:StringTableSize JVM parameter in Java 7 and 8 to set the string pool map size. The size is fixed, because the pool is implemented as a hash map with lists in the buckets. Estimate the number of distinct strings you intend to intern and set the pool size to a prime number close to twice that value (to reduce the likelihood of collisions). This lets String.intern run in constant time at a rather small memory cost per interned string (an explicitly used Java WeakHashMap would consume 4-5 times more memory for the same task).
  • The default value of the -XX:StringTableSize parameter is 1009 in Java 6 and in Java 7 up to Java 7u40, where it was increased to 60013 (Java 8 uses the same value).
  • If you are not sure about your string pool usage, try the -XX:+PrintStringTableStatistics JVM argument - it prints string pool statistics when your program terminates.

String.intern in Java 6, 7 and 8 - multithreaded access: This article describes the performance impact of the multithreaded calls to String.intern().

Tags: CPU optimization, memory optimization.

  • Feel free to use String.intern() in multithreaded code. In my tests the "8 writers" scenario has only 17% overhead compared to the "1 writer" (single-threaded) scenario, and the "1 writer, 7 readers" scenario has 9% overhead compared to the single-threaded results.
  • The JVM string pool is NOT thread-local: each string added to the pool is available to all other threads in the JVM, which further improves the program's memory consumption.

String.intern in Java 6, 7 and 8 - part 3: String.intern() usage best practices.

Tags: CPU optimization, memory optimization.

  • Despite the serious optimizations made to the String.intern() implementation in Java 7+, it still takes noticeable time to run (noticeable for CPU-sensitive applications) - the simple example in this article runs 3.5 times faster without calls to String.intern(). Do not use String.intern() as a safety net that every long-lived string is passed through. Instead, intern only fields with a limited number of possible distinct values (for example, states/provinces when processing addresses) - there the memory savings will definitely pay off the initial CPU cost of String.intern().

UPDATED: Trove library: using primitive collections for performance: this is an overview of the Trove primitive collection library, together with guidelines for migrating your code from JDK collections to Trove.

Tags: low latency, high throughput, finance, CPU optimization, memory optimization, primitive collections, CPU cache friendly.

  • The main reason to use Trove maps/sets is their seriously reduced memory consumption compared to the JDK maps/sets. If you have a large list/set/map whose keys or values could be a primitive type, it is worth replacing it with a Trove collection; maps from a primitive type to a primitive type are especially worth replacing.
  • Trove maps and sets support custom hashing strategies, which let you implement map/set-specific equals and hashCode logic - for example an identity set or map.
  • Trove collections provide several additional methods, like grep, retainEntries and adjustOrPutValue, which reduce the code required for many common tasks.
  • JDK-to-Trove migration is quite easy - in a lot of cases all you need to do is add the letter 'T' at initialization: new HashMap becomes new THashMap (see the sketch below). Migration to primitive-based collections requires a bit more work, but that work is paid back by a massive reduction in memory consumption.
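A small sketch (hypothetical method; assumes Trove 3 package names):

    import gnu.trove.map.hash.THashMap;
    import gnu.trove.map.hash.TIntIntHashMap;
    import java.util.Map;

    static void troveUsage() {
        // object map: one-letter change from new HashMap<>(), same Map interface
        Map<Integer, Integer> objectMap = new THashMap<>();
        objectMap.put(1, 42);

        // primitive map: no boxing of keys or values at all
        TIntIntHashMap counters = new TIntIntHashMap();
        counters.adjustOrPutValue(1, 1, 1);   // increment if present, otherwise store 1
    }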

Various types of memory allocation in Java: how to allocate a large memory buffer in Java and how to write any Java types into such buffer.

Tags: low latency, high throughput, finance, CPU optimization, low level memory access in Java.

  • Array length in Java is limited by the biggest int value, 2^31 - 1. That does not limit an array to 2G - 1 bytes, though - you may allocate a long[] of that length, which occupies 8 times more memory (16G - 8 bytes).
  • You may use sun.misc.Unsafe.allocateMemory(long size) to allocate a buffer larger than 2G, but you will have to free such buffers yourself (see the sketch below).
  • You can use the sun.misc.Unsafe memory access methods to read/write any Java datatype from/to both Java arrays and Unsafe buffers in a uniform manner.
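A sketch of off-heap allocation and access (hypothetical method; Unsafe is an internal, non-portable API):

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    static void offHeapExample() throws Exception {
        // Unsafe is not a public API: obtain the singleton via reflection
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long size = 4L * 1024 * 1024 * 1024;          // 4G - larger than any byte[] can be
        long address = unsafe.allocateMemory(size);
        try {
            unsafe.putLong(address, 42L);             // write/read any primitive at an absolute address
            long value = unsafe.getLong(address);
        } finally {
            unsafe.freeMemory(address);               // no GC for this buffer - free it yourself
        }
    }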

Memory introspection using sun.misc.Unsafe and reflection: how to find out Java object memory layout using sun.misc.Unsafe and reflection.

Tags: memory usage in Java, memory allocation in Java.

  • You can use the following sun.misc.Unsafe methods for obtaining Java object layout information: objectFieldOffset, arrayBaseOffset and arrayIndexScale.
  • The Java object reference size depends on your environment: it may be 4 or 8 bytes depending on your JVM settings and the amount of memory given to your JVM. It is always 8 bytes for heaps over 32G; for smaller heaps it is 4 bytes unless you turn compressed references off with the -XX:-UseCompressedOops JVM setting.

Protobuf data encoding for numeric datatypes: what type of numeric data encoding is used in Google Protobuf, how it impacts the compressed data size and how fast it is.

Tags: low latency, high throughput, finance, CPU optimization, memory optimization, serialization in Java.

  • Switch to integer datatypes if you find that some double/float fields in your serializable collections actually contain only integral values. This will increase the compression ratio of general-purpose algorithms on your binary data.
  • Use Google protobuf or a similar encoding for your integer data if a noticeable part of your values is small (non-negative or small by absolute value). You will get a noticeable data size reduction at a very low CPU cost, which lets you store and later read more messages per time unit.

Use case: compacting price field disk representation: double, short, java.math.BigDecimal: an example of compacting your data:

Tags: high throughput, finance, memory optimization.

  • Try to avoid storing double values in your on-disk data structures. The same information can often be represented in fewer bytes (compare, for example, 0.01 converted with writePriceUnsigned against 3f 84 7a e1 47 ae 14 7b - the raw binary representation of 0.01).
  • Analyze the properties of the data stored in your largest data structures and try to identify cases where most of your data fits into a more compact data type than the original one.
  • See Use case: how to compact a long-to-long mapping for other ideas of compacting your data.

Use case: how to compact a long-to-long mapping: a use case where we try to identify some long-2-long mapping properties in order to represent it in the most compact form.

Tags: low latency, high throughput, CPU optimization, memory optimization, data compression.

  • Analyze the properties of the data stored in your largest data structures and try to identify cases where most of your data fits into a more compact data type than the original one. Number-to-number maps can be compacted especially effectively if you notice that the keys are nearly consecutive values - in that case the map can be converted into an array.

String packing part 1: converting characters to bytes: we discuss Java object memory layout and consumption, and then try to pack a String into a more compact representation, minimizing the use of objects.

Tags: low latency, high throughput, finance, CPU optimization, memory optimization, data compression.

  • java.lang.String objects were designed to be fast and flexible: they can share their internal char[] with other strings, and they cache the calculated hash code because strings are often used as HashMap keys or HashSet values. But these properties add a large memory-footprint penalty for short strings. We can avoid this penalty by implementing our own string replacement objects.
  • Oracle developers tried to solve the same problem in late Java 6 releases by introducing the -XX:+UseCompressedStrings option. Unfortunately, it is no longer supported in Java 7, perhaps because the memory savings were smaller than one might expect.

String packing part 2: converting Strings to any other objects: we discuss how and when to convert a String into various more compact Java objects for temporary string representation.

Tags: low latency, high throughput, finance, CPU optimization, memory optimization, data compression.

  • If you keep a big collection of strings (or objects with String fields) in memory and you know that a reasonable share of them (at least 10% of cases, for example) can actually be converted to primitive values, you may replace the String fields with Object fields and use the provided pack/unpack methods to convert Strings to those objects and back, thus saving memory.
  • If a string cannot be converted to a primitive, consider converting it into a UTF-8 encoded byte[]. This conversion is lossless, so you can always convert the byte[] back into the original string.

Small tricks

I/O bound algorithms: SSD vs HDD: This article investigates the impact of modern SSDs on I/O-bound algorithms from the HDD era.

Tags: low latency, high throughput, finance, CPU optimization, hardware, Java IO.

  • Replacing HDD storage with an SSD can turn your application from I/O-bound into CPU-bound. This applies especially to pure streaming read/write operations, because modern SSDs can process data at speeds over 300MB/sec (few applications produce data that fast).
  • Modern operating systems try to write data in the background without blocking your application. Your write requests block only if the OS cannot write data as fast as your application produces it.
  • SSD seek time is dramatically lower than HDD seek time (~100 seeks/sec for an HDD versus over 2000 seeks/sec for an SSD). On the other hand, even an SSD seek is very slow compared to modern CPU speed: on my laptop the CPU can execute about a million instructions while the SSD performs one seek. Always try to arrange your data as a stream.

Forbidden Java actions: object assignments, type conversions etc on the low level in Java: This article reveals a few details of the low-level Java memory layout: we will see how to implement object assignments using just primitive types, look at what is hidden in the array header, and convert an array of one type into an array of another type.

Tags: memory usage in Java, memory allocation in Java, unsafe memory access in Java.

  • All Java object references occupy 4 bytes on heaps under 32G. You can use sun.misc.Unsafe to treat such references as int fields.
  • Java arrays store the element type as an int at offset 8 in the array header and the length (int) at offset 12. Changing these values is possible, but take care not to extend the updated array beyond the initially allocated memory.

Forbidden Java actions: updating final and static final fields: This article will discuss how you can update final or static final fields in Java using reflection and sun.misc.Unsafe.

Tags: memory usage in Java, memory allocation in Java, unsafe memory access in Java.

  • If you want to update a private and/or final field using Java reflection, make the Method or Field accessible via Method/Field.setAccessible( true ) and then set the new field value.
  • If you want to update a static final field using reflection, you need two steps: make the field itself accessible, then make the 'modifiers' field of the Field object accessible and remove the final flag from its modifiers. Such updates to static final primitive/String fields initialized with compile-time expressions will not be visible to clients, because static fields initialized with constant expressions (JLS 15.28) are inlined at the use site.
  • You may also update final and static final fields using sun.misc.Unsafe: use the Unsafe.put[Type]( base, offset, newValue ) methods. base is the field owner for instance fields and the Class object for static fields; offset is obtained with Unsafe.objectFieldOffset( Field ) for instance fields and Unsafe.staticFieldOffset( Field ) for static fields.

Forbidden Java actions: not declaring a checked exception; avoiding a constructor while creating an object: In this article we will see how to throw a checked exception in Java without declaring it in the method throws clause and how to create an object without calling any of its constructors.

Tags: memory usage in Java, memory allocation in Java, unsafe memory access in Java.

  • There are several ways to avoid declaring a checked exception in a method which throws it: you can use Thread.stop(Throwable), Class.newInstance (throwing the exception from the constructor), sun.misc.Unsafe.throwException, or exploit generic type erasure to avoid the checked exception declaration in the throws clause. We do not recommend any of these practices :)
  • In Java you may create an object without calling any of its constructors. There are two legal ways to do so - cloning and deserialization. sun.misc.Unsafe also lets you create an uninitialized instance, bypassing its constructors, via the Unsafe.allocateInstance method.

Static constructor code is not JIT-optimized in a lot of cases: static constructor code is generally executed in interpreted mode, even if it performs heavy calculations. There is, however, a way to force it to run in compiled mode:

Tags: Java pitfalls, avoid it.

  • If you need to execute CPU-expensive logic in a class static constructor, check whether it takes excessive time to execute. If it does, try to move that logic into a separate helper class.

Inefficient byte[] to String constructor: be careful when using public String(byte bytes[], int offset, int length, Charset charset) constructor in Java 6:

Tags: Java pitfalls, avoid it.

  • Always copy out the part of the byte array you want to convert into a String first; otherwise this constructor makes a temporary copy of your full original buffer.
  • Try to avoid unnecessary memory allocations - they may hurt the performance of your program if it already uses a lot of memory (1G+).

Java varargs performance issues: a short review of the actual varargs implementation in Java 5+.

  • Varargs are great for most application code because they shorten program code, but they should be replaced with precomputed constant arrays when all the varargs members are known constants (each varargs call allocates a fresh array).

Primitive types to String conversion and String concatenation: a description of various types of string concatenation in Java as well as a few JVM options helping us to make the string concatenation even faster.

  • Never use concatenation with an empty string ("" + value) as a "to string" conversion. Use the appropriate String.valueOf or wrapper type toString(value) method instead.
  • Whenever possible, use StringBuilder for string concatenation. Check old code and get rid of StringBuffer where possible (see the sketch below).
  • Use the -XX:+OptimizeStringConcat option, introduced in Java 6 update 20, to improve string concatenation performance. It is turned on by default in most Java 7 releases, but is still turned off in Java 6u41.
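A small sketch of both rules (hypothetical method):

    static String describe(int id, String name) {
        String idText = String.valueOf(id);      // not "" + id
        // explicit StringBuilder (not StringBuffer) for multi-part concatenation in hot code
        return new StringBuilder(32)
                .append("id=").append(idText)
                .append(", name=").append(name)
                .toString();
    }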

Use cases

In this set of articles we apply the principles discussed in the other articles to "real world" problems.

Use case: FIX messages processing. Part 1: Writing a simple FIX parser and Use case: FIX messages processing. Part 2: Composing a message out of fields: possible gateway implementation: tag-based FIX message parsing and composing are described in these two articles. In essence, we parse a 0x0001-separated string into a list of name=value tags, which are then converted to the actual datatypes. In the second part we discuss the best way to compose these messages back into String form as part of a gateway implementation.

Tags: low latency, high throughput, finance, CPU optimization.

  • When processing messages, always try to cache parsed dates that have no time component: the number of distinct dates in modern financial data is very low.
  • String.split should usually be avoided. The only exception is a single-character pattern in Java 7. Even in that case you can write faster code, but you would have to add some parsing logic to a hand-written splitting loop.
  • Never parse a "field=value" pair with String.split - String.indexOf(char) on the separator character is a far better alternative.
  • Always try to avoid "binary/string -> Java type -> binary/string" round trips for short-lived objects. It is better to keep the original message (if you don't modify it) or the original fields (if you modify only some of them) and reuse them when composing the output message, rather than converting Java types back into the binary/text form. Besides saving CPU cycles on data conversions, you also avoid unnecessary memory allocations for the converted values.

Use case: Optimizing memory footprint of a read only csv file (Trove, Unsafe, ByteBuffer, data compression): we will see how to optimize memory consumption of a Java program which has to store a large set of readonly data in memory (using ByteBuffer or sun.misc.Unsafe). We will also try replacing index fields with their hash codes, still supporting the case of hash code collisions.

Tags: low latency, high throughput, finance, CPU optimization, memory optimization.

  • If you want to keep a large number of read-only records in memory, consider storing them in the most compact form possible inside a ByteBuffer (or via sun.misc.Unsafe).
  • Use Trove maps for indices.
  • Check whether you need to keep your ID field values as map keys (you can still verify the ID field stored in the ByteBuffer) or whether a hash code is sufficient as a key.
  • If you know the possible query keys in advance, don't store IDs at all - a hash code is a sufficient replacement for the map key in a lot of situations.

Single file vs multi file storage: a short study of the file system cache implementation in Windows 7 and Linux.

Tags: hardware, file system, CPU optimization.

  • In general, avoid storing your data in a large number of data files. At the very least, limit the number of such files so that data growth does not linearly increase the number of files you have to process.
  • If you still need to handle a large number of small files, use Linux and any of its native file systems which allow you to use 4K sectors (thus limiting the storage and read/write overhead).
  • Once you have read the files (at least on Linux), you can expect to find them in the file cache, especially if the overall memory consumption on your system is not too high. The same applies when one application writes these files to disk and another reads them - chances are high that the second program will read them from the OS file cache instead of actually reading them from the disk.

Static code compilation in Groovy 2.0: we will see how static compilation in Groovy makes it as fast as Java.

Tags: Groovy, dynamic languages, CPU optimization.

  • Groovy is a dynamic JVM language that uses dynamic dispatch for its method calls. Dynamic dispatch in Groovy 2.1.9 is approximately 3 times slower than a normal Java method call, because the method name and argument types (the method signature) have to be obtained and matched against the cached java.lang.reflect.Method.
  • Groovy 2.0 added static compilation via the @CompileStatic annotation, which compiles most Groovy method calls into direct JVM bytecode method calls, avoiding all the dynamic dispatch overhead. Besides the performance improvement, static compilation also type-checks your Groovy code, letting you discover a lot of typos/mistakes at compile time and reducing the need for extensive-coverage unit tests.

Implementing a high performance Money class: how to implement an efficient Money class capable of dealing with arbitrary precision calculations.

Tags: money, monetary calculations, double, BigDecimal, finance, HFT, low latency.

  • You should avoid using BigDecimal for monetary calculations - its operations run 10+ times slower than the equivalent long/double operations.
  • The following three observations will help you implement an efficient long <-> double conversion layer (see the sketch after this list):
  • 1. A double variable contains an integral value if casting it to long and back to double yields the original value. Both casts are cheap operations.
  • 2. If we have a double value that is possibly one ulp off the actual result, we can multiply it by a sufficiently large power of 10 and check whether the result is an integral value (see rule 1). If it is not, add or subtract one ulp from the original value (via Math.nextAfter) and multiply it by the power of 10 again.
  • 3. Dividing a long value by a double that is a non-negative power of 10 (1, 10, 100, 1000, ...) never introduces an error; the same holds for dividing a double containing a long value by a long non-negative power of 10. We cannot, however, replace division by 10^N with multiplication by 10^-N - the result will be off by one ulp in many cases, which is a pity, because multiplication is still significantly faster than division on modern CPUs. We can safely multiply by 10^-N only if we are allowed to round the result using Math.round.
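A sketch of rules 1 and 2 (hypothetical method names):

    // rule 1: a double holds an integral value iff the long round trip is lossless
    static boolean isInteger(double value) {
        return (double) (long) value == value;
    }

    // rule 2: scale a possibly-one-ulp-off value by a power of 10 and look for an integral result
    static long toScaledLong(double value, double powerOf10) {
        double scaled = value * powerOf10;
        if (isInteger(scaled)) {
            return (long) scaled;
        }
        double up = Math.nextAfter(value, Double.POSITIVE_INFINITY) * powerOf10;
        if (isInteger(up)) {
            return (long) up;
        }
        double down = Math.nextAfter(value, Double.NEGATIVE_INFINITY) * powerOf10;
        if (isInteger(down)) {
            return (long) down;
        }
        return Math.round(scaled);   // fall back to rounding
    }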