John Duchi
October 22, 2024

Contents

1  Introduction and setting  10
   1.1  Information theory  10
   1.2  Moving to statistics and machine learning  11
   1.3  Outline and chapter discussion  12
   1.4  A remark about measure theory  14

2  An information theory review  15
   2.1  Basics of Information Theory  15
        2.1.1  Definitions  15
        2.1.2  Chain rules and related properties  20
        2.1.3  Data processing inequalities  22
   2.2  General divergence measures and definitions  23
        2.2.1  Partitions, algebras, and quantizers  23
        2.2.2  KL-divergence  24
        2.2.3  f-divergences  26
        2.2.4  Inequalities and relationships between divergences  28
        2.2.5  Convexity and data processing for divergence measures  32
   2.3  First steps into optimal procedures: testing inequalities  33
        2.3.1  Le Cam's inequality and binary hypothesis testing  33
        2.3.2  Fano's inequality and multiple hypothesis testing  35
   2.4  A first operational result: entropy and source coding  37
        2.4.1  The source coding problem  37
        2.4.2  The Kraft-McMillan inequalities  38
        2.4.3  Entropy rates and longer codes  41
   2.5  Bibliography  43
   2.6  Exercises  43

3  Exponential families and statistical modeling  48
   3.1  Exponential family models  48
   3.2  Why exponential families?  50
        3.2.1  Fitting an exponential family model  53
   3.3  Divergence measures and information for exponential families  54
   3.4  Generalized linear models and regression  55
        3.4.1  Fitting a generalized linear model from a sample  58
        3.4.2  The information in a generalized linear model  59
   3.5  Lower bounds on testing a parameter's value  61
   3.6  Deferred proofs  63
        3.6.1  Proof of Proposition 3.2.2  63
   3.7  Bibliography  65
   3.8  Exercises  65

I  Concentration, information, stability, and generalization

4  Concentration Inequalities  67
   4.1  Basic tail inequalities  67
        4.1.1  Sub-Gaussian random variables  69
        4.1.2  Sub-exponential random variables  73
        4.1.3  Orlicz norms  77
        4.1.4  First applications of concentration: random projections  80
        4.1.5  A second application of concentration: codebook generation  81
   4.2  Martingale methods  83
        4.2.1  Sub-Gaussian martingales and Azuma-Hoeffding inequalities  84
        4.2.2  Examples and bounded differences  85
   4.3  Matrix concentration  88
   4.4  Technical proofs  91
        4.4.1  Proof of Theorem 4.1.11  91
        4.4.2  Proof of Theorem 4.1.15  92
        4.4.3  Proof of Theorem 5.1.6  93
        4.4.4  Proof of Proposition 4.3.2  93
   4.5  Bibliography  95
   4.6  Exercises  95

5  E