Vault of Thoughts

2006-07-13

Please visit my new blog at vaultofthoughts.net

GetHashCode of the string class

Being a curious person I have wondered how the GetHashCode method of the string class is implemented so I have performed some research. The results are interesting at least.


First thing I have checked is if the GetHashCode method for the same string say "xxx" on two separate machines returns the same value - and in fact it does. This has led me to use the Reflector to see the internals of the GetHashCode of the string class. What I have found there? It turns out that the hash code for any given string is calculated based on the characters making the string - that's why on two machines, the hash code is the same. But the question arises: how does this impact performance? A quick test has shown that for a very long string, calling the GetHashCode method can take a considerable amount of time because the result is not cached after the first call. For short strings it my not be a problem but for a larger ones? And what about a hash table with strings? There I suppose the GetHashCode is called very often.


Now I ask myself: why the hash code is calculated using the strings that the string is made of? Since strings are using the mechanism called interning (which more or less makes ensures that any particular string representation is kept in memory only in one place and string objects are only pointing to it). Given this fact, it would be orders of magnitude more performant to use the memory addres of such an interned string as a base for the hash code.


The same question araises for the Length property. It is also taking longer the longer the string, and subsequent calls are not faster.


So as for the algorithm for generating string hash code I have no idea why it uses characters and not some memory address. As for why the results of the calculation are not caches I suppose it has to do with memory consumption - caching a hash code would reqiore an additional integer variable. If anyone has some more insight on this topic, I would gladly read it - in the comments section. :-)

If you liked this article why not support its author by making a donation?

3 Comments:

  • And how does interning work? By doing a hash of the string and checking if the string is already in the internment-store. So, to get the hash code from the interned string, you have to intern it, which takes a call to GetHashCode(). I'll argue for simplicity when possible...

    By Blogger IDisposable, at 10:34 PM  

  • A very resonable answer. I wonder where did you get this information since for example the property IsStringInterned is defined as follows:

    internal extern string IsStringInterned(string str);

    and does not provide a managed implementation.

    By Blogger Mikeon, at 10:49 PM  

  • As to why mem address shouldnt be used for hash code :

    Suppose I implement hash table OTS using GetHashCode(), I would want same output for the similar string. That is not possible unless I take into account only characters in string.

    By Anonymous Anonymous, at 10:43 AM  

Post a Comment

<< Home