From: François Fleuret Date: Thu, 18 Jan 2024 15:53:52 +0000 (+0100) Subject: Update. X-Git-Url: https://ant.fleuret.org/cgi-bin/gitweb/gitweb.cgi?a=commitdiff_plain;h=1e87b0a30c1b32eb50429af1340ea9706e3ccab6;p=tex.git Update. --- diff --git a/inftheory.tex b/inftheory.tex new file mode 100644 index 0000000..33ccfe5 --- /dev/null +++ b/inftheory.tex @@ -0,0 +1,192 @@ +%% -*- mode: latex; mode: reftex; mode: flyspell; coding: utf-8; tex-command: "pdflatex.sh" -*- + +\documentclass[10pt,a4paper,twoside]{article} +\usepackage[paperheight=18cm,paperwidth=10cm,top=5mm,bottom=20mm,right=5mm,left=5mm]{geometry} +%\usepackage[a4paper,top=2.5cm,bottom=2cm,left=2.5cm,right=2.5cm]{geometry} +\usepackage[utf8]{inputenc} +\usepackage{amsmath,amssymb,dsfont} +\usepackage[pdftex]{graphicx} +\usepackage[colorlinks=true,linkcolor=blue,urlcolor=blue,citecolor=blue]{hyperref} +\usepackage{tikz} +\usetikzlibrary{arrows,arrows.meta,calc} +\usetikzlibrary{patterns,backgrounds} +\usetikzlibrary{positioning,fit} +\usetikzlibrary{shapes.geometric,shapes.multipart} +\usetikzlibrary{patterns.meta,decorations.pathreplacing,calligraphy} +\usetikzlibrary{tikzmark} +\usetikzlibrary{decorations.pathmorphing} +\usepackage[round]{natbib} +%\usepackage{cmbright} +%\usepackage{showframe} + +\usepackage{mleftright} + +\newcommand{\setmuskip}[2]{#1=#2\relax} +\setmuskip{\thinmuskip}{1.5mu} % by default it is equal to 3 mu +\setmuskip{\medmuskip}{2mu} % by default it is equal to 4 mu +\setmuskip{\thickmuskip}{3.5mu} % by default it is equal to 5 mu + +\setlength{\parindent}{0cm} +\setlength{\parskip}{12pt} +%\renewcommand{\baselinestretch}{1.3} +%\setlength{\tabcolsep}{0pt} +%\renewcommand{\arraystretch}{1.0} + +\def\argmax{\operatornamewithlimits{argmax}} +\def\argmin{\operatornamewithlimits{argmin}} +\def\expect{\mathds{E}} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%% The \todo command +\newcounter{nbdrafts} +\setcounter{nbdrafts}{0} +\makeatletter +\newcommand{\checknbdrafts}{ +\ifnum \thenbdrafts > 0 +\@latex@warning@no@line{*WARNING* The document contains \thenbdrafts \space draft note(s)} +\fi} +\newcommand{\todo}[1]{\addtocounter{nbdrafts}{1}{\color{red} #1}} +\makeatother +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\begin{document} + +Information Theory is awesome so here is a TL;DR about Shannon's entropy. + +This field is about quantifying the amount ``of information'' contained +in a signal and how much can be transmitted under certain conditions. + +What makes it awesome IMO is that it is very intuitive, and like thermodynamics in Physics it give exact bounds about what is possible or not. + +\section{Shannon's Entropy} + +This is the key concept from which everything is defined. + +Imagine that you have a distribution of probabilities p on a finite set of symbols and that you generate a stream of symbols by sampling them one after another independently with that distribution. + +To transmit that stream, for instance with bits over a communication line, you can design a coding that takes into account that the symbols are not all as probable, and decode on the other side. + +For instance if $P('\!\!A')=1/2$, $P('\!\!B')=1/4$, and +$P('\!\!C')=1/4$ you would transmit ``0'' for a ``A'' and ``10'' for a +``B'' and ``11'' for a ``C'', 1.5 bits on average. + +If the symbol is always the same, you transmit nothing, if they are equiprobable you need $\log_2$(nb symbols) etc. + +Shannon's Entropy (in base 2) is the minimum number of bits you have to emit on average to transmit that stream. + +It has a simple formula: +% +\[ + H(p) = - \sum_k p(k) \log_2 p(k) +\] +% +where by convention $o \log_2 0 = 0$. + +It is often seen as a measure of randomness since the more deterministic the distribution is, the less you have to emit. + +The codings above are "Huffman coding", which reaches the Entropy +bound only for some distributions. The "Arithmetic coding" does it +always. + +From this perspective, many quantities have an intuitive +value. Consider for instance sending pairs of symbols (X, Y). + +If these two symbols are independent, you cannot do better than sending one and the other separately, hence +% +\[ +H(X, H) = H(X) + H(Y). +\] + +However, imagine that the second symbol is a function of the first Y=f(X). You just have to send X since Y can be computed from it on the other side. + +Hence in that case +% +\[ +H(X, Y) = H(X). +\] + +An associated quantity is the mutual information between two random +variables, defined with +% +\[ +I(X;Y) = H(X) + H(Y) - H(X,Y), +\] +% +that quantifies the amount of information shared by the two variables. + +\section{Conditional Entropy} + +Okay given the visible interest for the topic, an addendum: Conditional entropy is the average of the entropy of the conditional distribution: +% +\begin{align*} +&H(X \mid Y)\\ + &= \sum_y p(Y=y) H(X \mid Y=y)\\ + &= \sum_y P(Y=y) \sum_x P(X=x \mid Y=y) \log P(X=x \mid Y=y) +\end{align*} + +Intuitively it is the [minimum average] number of bits required to describe X given that Y is known. + +So in particular, if X and Y are independent +% +\[ + H(X \mid Y)=H(X) +\] + +if X is a deterministic function of Y then +% +\[ + H(X \mid Y)=0 +\] + +And since if you send the bits for Y and then the bits to describe X given that X is known you have sent (X, Y), we have the chain rule: +% +\[ +H(X, Y) = H(Y) + H(X \mid Y). +\] + +And then we get +% +\begin{align*} +I(X;Y) &= H(X) + H(Y) - H(X,Y)\\ + &= H(X) + H(Y) - (H(Y) + H(X \mid Y))\\ + &= H(X) - H(X \mid Y). +\end{align*} + +\section{Kullback-Leibler divergence} + +Imagine that you encode your stream thinking it comes from +distribution $q$ while it comes from $p$. You would emit more bits than +the optimal $H(p)$, and that supplement is $D_{KL}(p||q)$ the +Kullback-Leibler divergence between $p$ and $q$. + +In particular if $p=q$ +% +\[ + D_{KL}(p\|q)=0, +\] +% +and if there is a symbol $x$ with $q(x)=0$ and $p(x)>0$, you cannot encode it and +% +\[ + D_{KL}(p\|q)=+\infty. +\] + +Its formal expression is +% +\[ +D_{KL}(p\|q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right) +\] +% +that can be understood as a value called the cross-entropy between $p$ and $q$ +% +\[ +H(p,q) = -\sum_x p(x) \log q(x) +\] +% +minus the entropy of p +\[ +H(p) = -\sum_x p(x) \log p(x). +\] + +Notation horror: if $X$ and $Y$ are random variables $H(X, Y)$ is the entropy of their joint law, and if $p$ and $q$ are distributions, $H(p,q)$ is the cross-entropy between them. +\end{document}