My Technology blog…

Juicy Java Tidbits picked up from everywhere

Archive for the ‘Java Collections’ Category

Java generics simplified

Posted by tanvis on August 24, 2007

Java generics. Much talked about these days. So lets take a look at what generics means. Anyone who has use java has come across the java Collections Framework.Very convenient, fun to use pass around , manipulate play with and all that. Except for one little detail So far you could add objects of any type into a collection. For example:import java.util.List;
import java.util.ArrayList;

List guestList = new ArrayList();
personList.add(“Tshah”);
personList.add(new Integer(10));

And my code will happily compile and add the two objects in. Except, this brings two major issues to the fore:

1. Java collections so far lacked any type-checking unless you implemented it manually.
2. When you access a collection element, things could go horribly wrong unless you knew which class to typecast the collection element to, causing a ton of possible runtime errors.

Generics provides a way for you to communicate the type of a collection to the compiler, so that it can be checked. Once the compiler knows the element type of the collection, the compiler can check that you have used the collection consistently and can insert the correct casts on values being taken out of the collection. The compiler can now verify at compile time that the type constraints are not violated at run time, thereby guaranteeing more robustness.


The main difference in the two methodologies is that while in the first case the code and the compiler tell us what the developer “thinks” is true at a given point of execution in the code(which the VM checks only at run-time) versus, what a compiler “knows” and has “verified” to be true at the same given point in time.

While the primary use of generics is collections, there are many other uses. “Holder classes,” such as WeakReference and ThreadLocal, have all been generified, that is, they have been retrofitted to make use of generics. More surprisingly, class Class has been generified. Class literals now function as type tokens, providing both run-time and compile-time type information.Generics are implemented by type erasure: generic type information is present only at compile time, after which it is erased by the compiler. This is necessary to achieve total interoperability between generic code and legacy code that uses non-parameterized types (which are technically known as raw types). Doing so would however imply that parameter type information is not available at run time, meaning the new automatically generated casts may fail when interoperating with ill-behaved legacy code(One is type-checked the other is not, imagine what happens when the types dont match up?). There is, however, a way to achieve guaranteed run-time type safety for generic collections even when interoperating with ill-behaved legacy code.

This problem has been solved by adding wrapper classes to java.util.Collections which allow us to “wrap” a collection in a type-safe class and thereby provide guaranteed run-time type safety. They are similar in structure to the synchronized and unmodifiable wrappers.

Suppose you have a set of strings, s, into which some legacy code is mysteriously inserting an integer. Without the wrapper, you will not find out about the problem until you read the problem element from the set, and an automatically generated cast to String fails. At this point, it is too late to determine the source of the problem. If, however, you replace the declaration:

    Set<String> s = new HashSet<String>();

with this declaration:

    Set<String> s = Collections.checkedSet(new HashSet<String>(), String.class);

the collection will throw a ClassCastException at the point where the legacy code attempts to insert the integer. The resulting stack trace will allow you to diagnose and repair the problem. Meaning, we have now implemented type checking at the point-of-entry for previously unchecked code!

Posted in Collections, Generics, Java, Java Collections | Leave a Comment »

Collections trivia

Posted by tanvis on July 10, 2007

How would you preserve the order of insertion of entries into a Map?public class LinkedHashMap
extends HashMap

Hash table and linked list implementation of the Map interface, with predictable iteration order. This implementation differs from HashMap in that it maintains a doubly-linked list running through all of its entries. This linked list defines the iteration ordering, which is normally the order in which keys were inserted into the map (insertion-order). Note that insertion order is not affected if a key is re-inserted into the map. (A key k is reinserted into a map m if m.put(k, v) is invoked when m.containsKey(k) would return true immediately prior to the invocation.)

This implementation spares its clients from the unspecified, generally chaotic ordering provided by HashMap (and Hashtable), without incurring the increased cost associated with TreeMap. It can be used to produce a copy of a map that has the same order as the original, regardless of the original map’s implementation:

void foo(Map m) {
Map copy = new LinkedHashMap(m);

}
This technique is particularly useful if a module takes a map on input, copies it, and later returns results whose order is determined by that of the copy. (Clients generally appreciate having things returned in the same order they were presented.)

A special constructor is provided to create a linked hash map whose order of iteration is the order in which its entries were last accessed, from least-recently accessed to most-recently (access-order). This kind of map is well-suited to building LRU caches. Invoking the put or get method results in an access to the corresponding entry (assuming it exists after the invocation completes). The putAll method generates one entry access for each mapping in the specified map, in the order that key-value mappings are provided by the specified map’s entry set iterator. No other methods generate entry accesses. In particular, operations on collection-views do not affect the order of iteration of the backing map.

The removeEldestEntry(Map.Entry) method may be overridden to impose a policy for removing stale mappings automatically when new mappings are added to the map.

This class provides all of the optional Map operations, and permits null elements. Like HashMap, it provides constant-time performance for the basic operations (add, contains and remove), assuming the the hash function disperses elements properly among the buckets. Performance is likely to be just slightly below that of HashMap, due to the added expense of maintaining the linked list, with one exception: Iteration over the collection-views of a LinkedHashMap requires time proportional to the size of the map, regardless of its capacity. Iteration over a HashMap is likely to be more expensive, requiring time proportional to its capacity.

A linked hash map has two parameters that affect its performance: initial capacity and load factor. They are defined precisely as for HashMap. Note, however, that the penalty for choosing an excessively high value for initial capacity is less severe for this class than for HashMap, as iteration times for this class are unaffected by capacity.

Note that this implementation is not synchronized. If multiple threads access a linked hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. This is typically accomplished by synchronizing on some object that naturally encapsulates the map. If no such object exists, the map should be “wrapped” using the Collections.synchronizedMapmethod. This is best done at creation time, to prevent accidental unsynchronized access:

Map m = Collections.synchronizedMap(new LinkedHashMap(…));
A structural modification is any operation that adds or deletes one or more mappings or, in the case of access-ordered linked hash maps, affects iteration order. In insertion-ordered linked hash maps, merely changing the value associated with a key that is already contained in the map is not a structural modification. In access-ordered linked hash maps, merely querying the map with get is a structural modification.)

The iterators returned by the iterator methods of the collections returned by all of this class’s collection view methods are fail-fast: if the map is structurally modified at any time after the iterator is created, in any way except through the iterator’s own remove method, the iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the Iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.

Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it would be wrong to write a program that depended on this exception for its correctness: the fail-fast behavior of iterators should be used only to detect bugs.

This class is a member of the Java Collections Framework.

Posted in Collections, Java Collections, Uncategorized | Leave a Comment »

Collections trivia – static methods on Collections

Posted by tanvis on July 10, 2007

Most of us know only of the Collections interface. But there is also a collections class which has a bunch of really useful static methods that can be applied to any collection from the java Collections Framework. Take a look here.
public class Collections
extends Object

 

This class consists exclusively of static methods that operate on or return collections. It contains polymorphic algorithms that operate on collections, “wrappers”, which return a new collection backed by a specified collection, and a few other odds and ends.

The methods of this class all throw a NullPointerException if the collections provided to them are null.

The documentation for the polymorphic algorithms contained in this class generally includes a brief description of the implementation. Such descriptions should be regarded as implementation notes, rather than parts of the specification. Implementors should feel free to substitute other algorithms, so long as the specification itself is adhered to. (For example, the algorithm used by sort does not have to be a mergesort, but it does have to be stable.)

The “destructive” algorithms contained in this class, that is, the algorithms that modify the collection on which they operate, are specified to throw UnsupportedOperationException if the collection does not support the appropriate mutation primitive(s), such as the set method. These algorithms may, but are not required to, throw this exception if an invocation would have no effect on the collection. For example, invoking the sort method on an unmodifiable list that is already sorted may or may not throw UnsupportedOperationException.

This class is a member of the Java Collections Framework.

Posted in Java Basics., Java Collections | Leave a Comment »

Identity Hash Maps

Posted by tanvis on July 10, 2007

public class IdentityHashMap
extends AbstractMap
implements Map, Serializable, Cloneable

This class implements the Map interface with a hash table, using reference-equality(has to be the same instance of the object) in place of object-equality(two individual objects which are equal in every respect) when comparing keys (and values). In other words, in an IdentityHashMap, two keys k1 and k2 are considered equal if and only if (k1==k2). (In normal Map implementations (like HashMap) two keys k1 and k2 are considered equal if and only if (k1==null ? k2==null : k1.equals(k2)))

This class is not a general-purpose Map implementation! While this class implements the Map interface, it intentionally violates Map's general contract, which mandates the use of the equals method when comparing objects. This class is designed for use only in the rare cases wherein reference-equality semantics are required (generally when checking for single-copies – Consider an algorithm to find a loop in a linked list).

A typical use of this class is topology-preserving object graph transformations, such as serialization or deep-copying. To perform such a transformation, a program must maintain a “node table” that keeps track of all the object references that have already been processed. The node table must not equate distinct objects even if they happen to be equal. Another typical use of this class is to maintain proxy objects. For example, a debugging facility might wish to maintain a proxy object for each object in the program being debugged.

This class provides all of the optional map operations, and permits null values and the null key. This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.

This class provides constant-time performance for the basic operations (get and put), assuming the system identity hash function (System.identityHashCode(Object)) disperses elements properly among the buckets.

 

This class has one tuning parameter (which affects performance but not semantics): expected maximum size. This parameter is the maximum number of key-value mappings that the map is expected to hold. Internally, this parameter is used to determine the number of buckets initially comprising the hash table. The precise relationship between the expected maximum size and the number of buckets is unspecified.

If the size of the map (the number of key-value mappings) sufficiently exceeds the expected maximum size, the number of buckets is increased Increasing the number of buckets (“rehashing”) may be fairly expensive, so it pays to create identity hash maps with a sufficiently large expected maximum size. On the other hand, iteration over collection views requires time proportional to the the number of buckets in the hash table, so it pays not to set the expected maximum size too high if you are especially concerned with iteration performance or memory usage.

Note that this implementation is not synchronized. If multiple threads access this map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with a key that an instance already contains is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the map. If no such object exists, the map should be “wrapped” using the Collections.synchronizedMap method. This is best done at creation time, to prevent accidental unsynchronized access to the map:

     Map m = Collections.synchronizedMap(new HashMap(...));

The iterators returned by all of this class’s “collection view methods” are fail-fast: if the map is structurally modified at any time after the iterator is created, in any way except through the iterator’s own remove or add methods, the iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.

Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it would be wrong to write a program that depended on this exception for its correctness: fail-fast iterators should be used only to detect bugs.

Implementation note: This is a simple linear-probe hash table, as described for example in texts by Sedgewick and Knuth. The array alternates holding keys and values. (This has better locality for large tables than does using separate arrays.) For many JRE implementations and operation mixes, this class will yield better performance than HashMap (which uses chaining rather than linear-probing).

Chaining

 

Hash collision resolved by chaining.

Hash collision resolved by chaining.

 

In the simplest chained hash table technique, each slot in the array references a linked list of inserted records that collide to the same slot. Insertion requires finding the correct slot, and appending to either end of the list in that slot; deletion requires searching the list and removal.

 

Chained hash tables have advantages over open addressed hash tables in that the removal operation is simple and resizing the table can be postponed for a much longer time because performance degrades more gracefully even when every slot is used. Indeed, many chaining hash tables may not require resizing at all since performance degradation is linear as the table fills. For example, a chaining hash table containing twice its recommended capacity of data would only be about twice as slow on average as the same table at its recommended capacity.

 

Chained hash tables inherit the disadvantages of linked lists. When storing small records, the overhead of the linked list can be significant. An additional disadvantage is that traversing a linked list has poor cache performance.

 

Alternative data structures can be used for chains instead of linked lists. By using a self-balancing tree, for example, the theoretical worst-case time of a hash table can be brought down to O(log n) rather than O(n). However, since each list is intended to be short, this approach is usually inefficient unless the hash table is designed to run at full capacity or there are unusually high collision rates, as might occur in input designed to cause collisions. Dynamic arrays can also be used to decrease space overhead and improve cache performance when records are small.

 

Some chaining implementations use an optimization where the first record of each chain is stored in the table. Although this can increase performance, it is generally not recommended: chaining tables with reasonable load factors contain a large proportion of empty slots, and the larger slot size causes them to waste large amounts of space.

 

 

Open addressing

 

Hash collision resolved by linear probing (interval=1).

Hash collision resolved by linear probing (interval=1).

 

Open addressing hash tables store the records directly within the array. This approach is also called closed hashing. A hash collision is resolved by probing, or searching through alternate locations in the array (the probe sequence) until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table. Well known probe sequences include:

linear probing
in which the interval between probes is fixed–often at 1.
quadratic probing
in which the interval between probes increases linearly (hence, the indices are described by a quadratic function).
double hashing
in which the interval between probes is fixed for each record but is computed by another hash function.

The main tradeoffs between these methods are that linear probing has the best cache performance but is most sensitive to clustering, while double hashing has poor cache performance but exhibits virtually no clustering; quadratic probing falls in-between in both areas. Double hashing can also require more computation than other forms of probing. Some open addressing methods, such as last-come-first-served hashing and cuckoo hashing move existing keys around in the array to make room for the new key. This gives better maximum search times than the methods based on probing.

 

A critical influence on performance of an open addressing hash table is the load factor; that is, the proportion of the slots in the array that are used. As the load factor increases towards 100%, the number of probes that may be required to find or insert a given key rises dramatically. Once the table becomes full, probing algorithms may even fail to terminate. Even with good hash functions, load factors are normally limited to 80%. A poor hash function can exhibit poor performance even at very low load factors by generating significant clustering. What causes hash functions to cluster is not well understood, and it is easy to unintentionally write a hash function which causes severe clustering.

 

 

Posted in Java Basics., Java Collections | Leave a Comment »