talk.tex 12 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384
  1. % Copyright (C) 2017 Koz Ross <koz.ross@retro-freedom.nz>
  2. % This work is licensed under the Creative Commons Attribution-ShareAlike 4.0
  3. % International License. To view a copy of this license, visit
  4. % http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative
  5. % Commons, PO Box 1866, Mountain View, CA 94042, USA.
  6. \documentclass{beamer}
  7. \usepackage[utf8]{inputenc}
  8. \usecolortheme{seahorse}
  9. \usefonttheme{professionalfonts}
  10. \usepackage{fontspec}
  11. \setmainfont{DejaVu Sans}
  12. \setbeamertemplate{navigation symbols}{}
  13. \usepackage{caption}
  14. \usepackage[font=itshape, begintext=``, endtext='']{quoting}
  15. \usepackage{fancyvrb}
  16. \usepackage{color}
  17. \usepackage{booktabs}
  18. \usepackage[noend]{algpseudocode}
  19. \title{Hash tables}
  20. \subtitle{Or: why maths {\em matters\/}}
  21. \titlegraphic{\includegraphics[scale=0.8]{img/logo.pdf}}
  22. \author{Yuan Yuan}
  23. \date{20th July, 2017}
  24. \begin{document}
  25. \begin{frame}[c]
  26. \titlepage{}
  27. \end{frame}
  28. \begin{frame}[c]
  29. \frametitle{Outline}
  30. \tableofcontents
  31. \end{frame}
  32. \section{The dictionary problem}
  33. \begin{frame}[c]
  34. \frametitle{The dictionary problem}
  35. \begin{definition}
  36. An {\em entry\/} is a pair of {\em key\/} and {\em value}.
  37. \end{definition}\pause{}
  38. \begin{definition}
  39. A {\em dictionary\/} is a set of entries, such that no two entries have the
  40. same key.
  41. \end{definition}\pause{}
  42. \begin{definition}
  43. The {\em dictionary problem\/} requires us to maintain a dictionary $D$,
  44. with the following operations:\pause{}
  45. \begin{description}[$\mathrm{put}(D,k,v)$:]
  46. \item[$\mathrm{len}(D)$:] Return the number of entries in $D$\pause{}
  47. \item[$\mathrm{put}(D, k, v)$:] Add a new entry to $D$ with key $k$ and
  48. value $v$; if an entry with key $k$ already exists, replace its value
  49. with $v$\pause{}
  50. \item[$\mathrm{get}(D, k)$:] Return the value of the entry whose key is
  51. $k$, or $\mathrm{null}$ if no such entry exists
  52. \end{description}
  53. \end{definition}
  54. \end{frame}
  55. \begin{frame}[c]
  56. \frametitle{Why we care about the dictionary problem}
  57. \pause{}
  58. \begin{itemize}
  59. \item Databases (keys are identifiers, values are data)\pause{}
  60. \item A solution to this problem can implement almost any other data
  61. structure:\pause{}
  62. \begin{itemize}
  63. \item Arrays and lists (keys are consecutive integers)\pause{}
  64. \item Sets (keys and values are the same as each other)\pause{}
  65. \item Trees, graphs, etc\pause{}
  66. \item At least one programming language (Lua) does exactly this!\pause{}
  67. \end{itemize}
  68. \item Allows us to give {\em names\/} to data, and use those names instead
  69. of the data itself (c.f.\ any programming language ever)\pause{}
  70. \end{itemize}
  71. \medskip{}
  72. In short, {\em we want good solutions to the dictionary problem!\/}\pause{} We
  73. also don't want to impose too many constraints on the data beyond equality
  74. being defined (so ordering shouldn't matter, for example).
  75. \end{frame}
  76. \begin{frame}[c]
  77. \frametitle{First attempt: Array or list}
  78. We can store our entries in an array or list.\pause{} We can remember the
  79. number of elements we store, and modify it on all operations in constant time,
  80. which makes $\mathrm{len}$ a $O(1)$ operation.\pause{}
  81. \medskip
  82. For $\mathrm{put}$, we scan the array or list left-to-right, comparing each
  83. entry's key with $k$. If we get a match, replace that entry's value with $v$;
  84. otherwise, add a new entry with key $k$ and value $v$ at the end.\pause{} This
  85. is $O(n)$, because we might have to scan the entire array or list.\pause{}
  86. \medskip
  87. We can do $\mathrm{get}$ similarly, except that we return the value if we find
  88. a match, or $\mathrm{null}$ otherwise.\pause{} This is also $O(n)$.\pause{}
  89. \bigskip
  90. \alert{Verdict:} Not very good at all.
  91. \end{frame}
  92. \begin{frame}[c]
  93. \frametitle{Can we do better?}
  94. Ideally, we want the following:\pause{}
  95. \begin{itemize}
  96. \item $\mathrm{get}$ and $\mathrm{put}$ to be {\em asymptotically\/} fast
  97. (as close to $O(1)$ as possible)\pause{}
  98. \item All operations to be {\em practically\/} fast (i.e.\ no asymptotic
  99. `abuse')\pause{}
  100. \item A data structure which is as simple as possible (because programming
  101. is hard enough already)\pause{}
  102. \end{itemize}
  103. \medskip
  104. There {\em is\/} a structure which achieves all of the above.\pause{} First,
  105. we need to do a bit of preparation\ldots
  106. \end{frame}
  107. \section{Hash functions}
  108. \begin{frame}[c]
  109. \frametitle{Preliminaries}
  110. Let $\mathbb{N} = \{0, 1, 2, \ldots\}$ be the set of {\em natural
  111. numbers}. We use $\mathbb{N}_{k}$ to represent the set $\{x \in \mathbb{N}
  112. \mid x \text{ can be represented using } k \text{ bits }\}$, for $k \in
  113. \mathbb{N}$.\pause{} For example, $\mathbb{N}_{0} = \{\}$ (no number can be
  114. represented in $0$ bits) and $\mathbb{N}_{1} = \{0, 1\}$.\pause{}
  115. \medskip
  116. Let $A,B$ be sets. A {\em function\/} $f: A \rightarrow B$ is a set of pairs
  117. such that for any $(x,y) \in f$, $x \in A, y \in B$, and for every $x \in
  118. A$, there exists exactly one $y \in B$ such that $(x,y) \in f$.\pause{} In
  119. this case, $A$ is the {\em domain\/} of $f$ and $B$ is the {\em codomain\/} of
  120. $f$.\pause{} For any $x \in A$, we denote by $f(x)$ (the {\em value of $f$
  121. at $x$\/}) the $y$ such that $(x,y) \in f$.\pause{}
  122. \medskip
  123. An example function $f: \{a, b, c\} \rightarrow \{1, 2, 3\}$ is $\{(a,
  124. 1), (b, 1), (c, 2)\}$.\pause{} In this case, the domain of $f$ is $\{a, b,
  125. c\}$, the codomain of $f$ is $\{1, 2, 3\}$ and $f(a) = 1$.
  126. \end{frame}
  127. \begin{frame}[c]
  128. \frametitle{More about functions}
  129. Let $f: A \rightarrow B$ be a function. We say $f$ is {\em one-to-one\/} if,
  130. for any $x, y \in A$, if $x \neq y$, then $f(x) \neq f(y)$.\pause{} An example
  131. one-to-one function $f: \{a, b, c\} \rightarrow \{1, 2, 3\}$ would be
  132. $\{(a, 1), (b, 2), (c, 3)\}$.\pause{}
  133. \medskip
  134. \begin{lemma}
  135. Let $A$ be an infinite set and $B$ be a finite set. There is no function $f:
  136. A \rightarrow B$ such that $f$ is one-to-one.
  137. \end{lemma}\pause{}
  138. \medskip
  139. \begin{definition}
  140. Let $A$ be a set, and let $k \in \mathbb{N}$. A {\em hash function for $A$\/}
  141. is some $f: A \rightarrow \mathbb{N}_{k}$. We call the value of $f(x)$ the
  142. {\em hash\/} of $x$.
  143. \end{definition}\pause{}
  144. \medskip
  145. You can think of a hash function as producing a {\em fixed-length summary\/}
  146. of its input.
  147. \end{frame}
  148. \section{The hash table}
  149. \begin{frame}[c]
  150. \frametitle{The hash table}
  151. Let $K$ be a set of keys, $V$ be a set of values, and $k \in \mathbb{N}$.\pause{}
  152. A {\em hash table\/} $H$ for $K,V$ consists of:\pause{}
  153. \begin{itemize}
  154. \item A hash function $H.\mathrm{hash}: K \rightarrow \mathbb{N}_{k}$\pause{}
  155. \item An array $H.\mathrm{buckets}$ of {\em buckets}, capable of storing
  156. elements of $V$. Additionally, $\mathrm{len}(H.\mathrm{buckets}) \leq
  157. 2^{k}$, initially full of $\mathrm{null}$s.\pause{}
  158. \end{itemize}
  159. \medskip
  160. As $H.\mathrm{buckets}$ is an array, we store and update its length similarly
  161. to our original solution, giving us $\mathrm{len}$ in $O(1)$ time.\pause{}
  162. \end{frame}
  163. \begin{frame}[c, fragile]
  164. \frametitle{$\mathrm{put}$ for a hash table}
  165. As $\mathrm{hash}$ produces a number, we can convert the hash of any key $x$ into
  166. a valid index for $\mathrm{buckets}$ by taking $\mathrm{hash}(x)$ modulo
  167. $\mathrm{len}(\mathrm{buckets})$.\pause{} If there is nothing at that index,
  168. we simply store our value there; otherwise, we replace the value there with
  169. the one we are given.\pause{}
  170. \medskip
  171. \begin{algorithmic}
  172. \Function{$\mathrm{put}$}{$H, k, v$}
  173. \State{}$i \gets{}$ \Call{$H.\mathrm{hash}$}{$k$} $\mathrm{\%}$
  174. \Call{$\mathrm{len}$}{$H.\mathrm{buckets}$}
  175. \If{$H.\mathrm{buckets}[i] = \mathrm{null}$}
  176. \State{}\Call{$\mathrm{len}$}{$H$} $\gets$ \Call{$\mathrm{len}$}{$H$} $+
  177. 1$
  178. \EndIf{}
  179. \State{}$H.\mathrm{buckets}[i] \gets{} v$
  180. \EndFunction{}
  181. \end{algorithmic}\pause{}
  182. \medskip
  183. This only requires a constant amount of time, plus however long it takes to
  184. call $\mathrm{hash}$.
  185. \end{frame}
  186. \begin{frame}[c, fragile]
  187. \frametitle{$\mathrm{get}$ for a hash table}
  188. This is very similar to $\mathrm{put}$.\pause{}
  189. \medskip
  190. \begin{algorithmic}
  191. \Function{$\mathrm{get}$}{$H, k$}
  192. \State{}$i \gets{}$ \Call{$H.\mathrm{hash}$}{$k$} $\mathrm{\%}$
  193. \Call{$\mathrm{len}$}{$H.\mathrm{buckets}$}
  194. \State{}\Return{}$H.\mathrm{buckets}[i]$
  195. \EndFunction{}
  196. \end{algorithmic}\pause{}
  197. \medskip
  198. This also requires a constant amount of time, plus the call time of
  199. $\mathrm{hash}$.\pause{}
  200. \medskip
  201. This looks great!\pause{} However, this is too simplistic, and won't work in
  202. practice.
  203. \end{frame}
  204. \begin{frame}[c]
  205. \frametitle{Problems with our hash table}
  206. Our design assumes two things:\pause{}
  207. \begin{itemize}
  208. \item We will never try to $\mathrm{put}$ more entries into $H$
  209. than $\mathrm{len}(H.\mathrm{buckets})$\pause{}
  210. \item Given two different keys, $H.\mathrm{hash}$ will produce two different
  211. hashes\pause{}
  212. \end{itemize}
  213. \medskip
  214. Both of these are false in general:\pause{} the former obviously so, the latter
  215. because of our prior lemma, and the fact that most interesting sets of keys
  216. (e.g.\ all strings) are infinite.\pause{} Thus, what will inevitably happen at
  217. some point is that our $\mathrm{put}$ procedure will assign the same index to
  218. two different keys.\pause{} This is called a {\em collision}, and it really
  219. ruins our day (and design).\pause{}
  220. \medskip
  221. Collisions are inevitable --- based on this, we have to design with them in
  222. mind.\pause{} Luckily, our design is easy to fix to take collisions into
  223. account.
  224. \end{frame}
  225. \begin{frame}[c]
  226. \frametitle{Hash chaining}
  227. Instead of $\mathrm{buckets}$ storing values, each index stores a linked list
  228. of entries, which start out empty (a {\em bucket list\/}).\pause{} After we calculate an index, we
  229. work with the list at that position (just like with our first attempt), for both
  230. of our operations.\pause{}
  231. \medskip
  232. As long as our bucket lists fill up roughly evenly, the time required for
  233. $\mathrm{get}$ and $\mathrm{put}$ will be roughly
  234. $O(\frac{n}{m})$, where $m$ is the size of our bucket array.\pause{} This is
  235. pretty good in practice, as long as we don't try to crowd too many entries into
  236. our hash table.\pause{}
  237. \medskip
  238. How can we be sure that our bucket lists will fill evenly?\pause{} A function
  239. which hashes {\em everything\/} to $1$ is a hash function, and
  240. most certainly {\em won't\/} cause our bucket lists to fill evenly!
  241. \end{frame}
  242. \begin{frame}[c]
  243. \frametitle{{\em Good\/} hash functions}
  244. \pause{}
  245. In order to make sure our buckets fill evenly, we need more from our hash
  246. functions.\pause{} Specifically, for a hash function $f: A \rightarrow
  247. \mathbb{N}_{k}$, we would like $f$ to have one (or both!) of the following
  248. properties:\pause{}
  249. \medskip
  250. \begin{definition}
  251. $f$ is {\em uniform\/} if, given a random $x \in A$ and any specific $y \in
  252. \mathbb{N}_{k}$, there is a $\frac{1}{2^k}$ probability that $f(x) = y$.
  253. \end{definition}\pause{}
  254. \begin{definition}
  255. $f$ {\em exhibits the avalanche effect\/} if, for any $x,y \in A$ which differ
  256. by $1$ bit, $f(x), f(y)$ will differ by $\frac{k}{2}$ bits.
  257. \end{definition}\pause{}
  258. \medskip
  259. These guarantee that, when the number of entries gets large, there is a high
  260. probability that they will be distributed evenly over all bucket lists.
  261. \end{frame}
  262. \begin{frame}[c]
  263. \frametitle{Limitations of hash tables}
  264. \pause{}
  265. \begin{itemize}
  266. \item For small hash tables, the cost of hashing overwhelms everything else
  267. (making them slower than array or list-based approaches)\pause{}
  268. \item Finding a good hash function can be difficult, especially for
  269. variable-length keys (but easier now with
  270. cryptographic hashing and universal hash functions)\pause{}
  271. \item No order for entries (or rather, no {\em specific\/} order)\pause{}
  272. \item Cache-unfriendly due to bucket lists (but alternative approaches exist
  273. which don't require them)\pause{}
  274. \end{itemize}
  275. \medskip
  276. Overall, hash tables work best for dictionaries with large numbers of entries
  277. and simple fixed-length keys. This is not necessarily a problem, as a lot of
  278. real data fits these requirements.\pause{} However, as always, {\em know your
  279. data and your tradeoffs\/}!
  280. \end{frame}
  281. \section{Questions}
  282. \begin{frame}[fragile, c]
  283. \frametitle{Questions?}
  284. \begin{center}
  285. \includegraphics[scale=0.27]{img/questions.pdf}
  286. \end{center}
  287. \end{frame}
  288. \end{document}