Probability density estimation: Parametric methods


Probability density estimation: Parametric methods. Václav Hlaváč, Czech Technical University in Prague, Czech Institute of Informatics, Robotics and Cybernetics, 166 36 Prague 6, Jugoslávských partyzánů 1580/3, Czech Republic, http://people.ciirc.cvut.cz/hlavac, vaclav.hlavac@cvut.cz, also Center for Machine Perception, http://cmp.felk.cvut.cz. Courtesy: V. Franc, R. Gutierrez-Osuna. Outline of the talk: Three taxonomies. Maximum likelihood (ML) estimation. Examples: Gaussian, discrete distribution. ML for pattern recognition, a coin example. Logistic regression. Bayesian estimate, a coin example.

What if the statistical model is unknown? Bayesian and non-Bayesian methods allow designing the optimal classifier provided we have a statistical model describing the observation x ∈ X and the hidden state y ∈ Y. Bayesian methods require the posterior probability p(y|x). Non-Bayesian methods require the conditional probability p(x|y). In most practical situations, the statistical model is unknown and must be estimated from the training multiset of examples {(x_1, y_1), ..., (x_m, y_m)} → p̂(x, y). The estimated p̂(x, y) replaces the true distribution p(x, y). Note: Estimation (in statistics) is also called learning (in pattern recognition and machine learning).

Taxonomy 1: according to the statistical model, parametric vs. nonparametric. Parametric: A particular form of the density function (e.g., Gaussian) is assumed to be known and only its parameters θ ∈ Θ need to be estimated (e.g., the mean, the covariance). Sought: p(x; θ); performed: {x_1, ..., x_n} → θ̂. Nonparametric: No assumption on the form of the distribution. However, many more training examples are required in comparison with parametric methods. Sought: p(x); performed: {x_1, ..., x_n} → p̂(x).

Taxonomy 2: according to the input information, supervised, unsupervised & semi-supervised. The goal is to estimate the probability density p(x, y) or p(x). The probability density estimation methods can be categorized according to the information available in the training multiset. Supervised methods: Completely labeled examples are available. Sought: p(x, y; θ); performed: {(x_1, y_1), ..., (x_n, y_n)} → θ̂. Unsupervised methods: Only unlabeled examples of observations are available, used, e.g., for clustering. Sought: p(x; θ); performed: {x_1, ..., x_n} → θ̂. Semi-supervised methods: Both n labeled and m ≫ n unlabeled examples are available. Sought: p(x, y; θ); performed: {(x_1, y_1), ..., (x_n, y_n), x_{n+1}, ..., x_{n+m}} → θ̂.

Taxonomy 3: according to the estimation principle. Maximum-likelihood estimation: Sought: p(x, y; θ), where θ is fixed but unknown. Performed: θ̂ = argmax_θ p(dataset; θ). In words: the ML estimate is the parameter value that best explains the dataset X according to the likelihood function. Bayesian estimation: Sought: p(x, y; θ), where θ is a random variable with a known prior p(θ). In words: the Bayesian method estimates the parameter θ of the given probability density so as to maximize the posterior probability density p(θ|X). Minimax estimation (suggested by M.I. Schlesinger, not explicated here): Sought: p(x, y; θ). Performed: θ̂ = argmax_θ min_{i=1,...,m} p(x_i; θ).

This lecture: supervised methods. This lecture covers supervised methods that estimate a probability distribution of known parametric form, i.e., parametric methods. Separate lectures in this course are devoted to nonparametric methods (such as histogram-based estimation, kernel density estimation, Parzen window estimation, and k-nearest-neighbor estimation) and to unsupervised methods (such as K-means clustering and the EM algorithm).

Maximum likelihood estimation: assumptions. The density function p(x; θ) is known up to a parameter θ ∈ Θ; the column vector of parameters is θ = [θ_1, ..., θ_d]^⊤ ∈ R^d. i.i.d. assumption: measured data (a set of examples) D = {x_1, ..., x_n}, drawn independently from the identical distribution p(x; θ*), are available. The true parameter θ* ∈ Θ is unknown but fixed. The probability that the n examples D were generated (measured) for a given θ is called the likelihood function, p(D; θ) = ∏_{i=1}^{n} p(x_i; θ). Maximum likelihood estimation seeks the parameter θ̂_ML = argmax_{θ∈Θ} p(D; θ) which best explains the examples D.

Log-likelihood function L(θ). It is convenient to maximize the log-likelihood function L(θ) = log p(D; θ) = log ∏_{i=1}^{n} p(x_i; θ) = ∑_{i=1}^{n} log p(x_i; θ). Because the logarithm is monotonically increasing, it holds that θ̂_ML = argmax_{θ∈Θ} p(D; θ) = argmax_{θ∈Θ} ∑_{i=1}^{n} log p(x_i; θ) = argmax_{θ∈Θ} L(θ). Maximizing a sum of terms is an easier task than maximizing a product; cf. the difficulty of differentiating a long product of terms. In particular, the logarithm of a Gaussian is very easy to handle.
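A minimal numerical illustration of this point (not part of the slides; the Gaussian model and all constants below are arbitrary choices): for a moderately sized dataset the raw likelihood, a product of many small densities, underflows to zero in double precision, while the log-likelihood, a sum of log-terms, stays well behaved.

```python
import numpy as np

def log_likelihood(data, mu, sigma):
    """L(theta) = sum_i log p(x_i; mu, sigma) for a 1-D Gaussian model."""
    return np.sum(-np.log(sigma) - 0.5 * np.log(2 * np.pi)
                  - (data - mu) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=1000)

# The product of 1000 densities underflows to 0.0 in double precision,
# whereas the sum of their logarithms is a perfectly usable number.
raw_likelihood = np.prod(np.exp(-0.5 * ((data - 1.0) / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi)))
print(raw_likelihood)                    # 0.0 (underflow)
print(log_likelihood(data, 1.0, 2.0))    # a finite negative number
```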

ML estimate via the log-likelihood function. The ML estimate can often be obtained by finding a stationary point of the log-likelihood function, ∇_θ L = 0, i.e., for each component θ_j:
∂L(θ)/∂θ_j = ∂/∂θ_j ∑_{i=1}^{n} log p(x_i; θ) = ∑_{i=1}^{n} ∂ log p(x_i; θ)/∂θ_j = ∑_{i=1}^{n} (1/p(x_i; θ)) ∂p(x_i; θ)/∂θ_j = 0.
Provided L(θ) is concave, the stationary point found is the ML estimate θ̂_ML.

Example: Normal distribution, ML estimate (1). Input: D = {x_1, x_2, ..., x_n}, x_i ∈ R, and the model
p(x; θ = (μ, σ)) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)),   θ ∈ {(μ, σ) : μ ∈ R, σ ∈ R_+}.
[Figure: the data points x_i on the real axis around μ, and the likelihood surface over (μ, σ).]
Likelihood function p(D; θ) and its logarithm, the log-likelihood function L(θ):
p(D; θ) = ∏_{i=1}^{n} p(x_i; θ) = ∏_{i=1}^{n} (1/(σ√(2π))) exp(−(x_i − μ)²/(2σ²)),
L(θ) = log p(D; θ) = ∑_{i=1}^{n} (−log σ − log √(2π) − (x_i − μ)²/(2σ²)).

Example: Normal distribution, ML estimate (2). To compute θ̂_ML = argmax_θ L(θ), we solve ∇_θ L = ∂L(θ)/∂θ = 0 for
L(θ) = log p(D; θ) = ∑_{i=1}^{n} (−log σ − log √(2π) − (x_i − μ)²/(2σ²)).
Mean value μ̂_ML: ∂L/∂μ = ∑_{i=1}^{n} 2(x_i − μ)/(2σ²) = 0  ⟹  μ̂_ML = (1/n) ∑_{i=1}^{n} x_i.
Variance σ̂²_ML: ∂L/∂σ = ∑_{i=1}^{n} (−1/σ + (x_i − μ)²/σ³) = 0  ⟹  (1/σ³) ∑_{i=1}^{n} (−σ² + (x_i − μ)²) = 0  ⟹  σ̂²_ML = (1/n) ∑_{i=1}^{n} (x_i − μ̂_ML)².
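The closed-form result translates directly to code. A small sketch (illustrative only); note that the ML variance divides by n, which is what np.var computes with ddof=0.

```python
import numpy as np

def gaussian_ml_estimate(data):
    """Closed-form ML estimates for a 1-D Gaussian: mu_hat is the sample mean,
    sigma2_hat averages the squared deviations over n (not n - 1)."""
    mu_hat = np.mean(data)
    sigma2_hat = np.mean((data - mu_hat) ** 2)   # same as np.var(data, ddof=0)
    return mu_hat, sigma2_hat

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.5, size=10_000)
print(gaussian_ml_estimate(data))    # approximately (3.0, 2.25)
```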

Properties of the estimates (1). θ̂ is a random variable; consequently, we can talk about its mean value and its variance. Bias: tells how close the estimate is, on average, to the true value. Unbiased estimate: E(θ̂) = θ*. Variance: how much the estimate changes for different runs, e.g., for different datasets. Bias-variance tradeoff: in most cases, one can only decrease either the bias or the variance at the expense of the other.

Properties of the estimates (2). Efficiency of the estimate: var(θ̂) = E[(θ̂ − θ*)²]; the best (most efficient) estimate has the minimal variance. Consistent estimate: E(θ̂) → θ* and var(θ̂) → 0 as n → ∞. With a growing number of samples, the statistical characteristics converge to the true value (a consequence of the law of large numbers).
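A short simulation sketch of the bias statement (assuming a 1-D Gaussian with a known true variance; the sample sizes and run count are arbitrary): the ML variance estimate from the previous slides is biased low by the factor (n − 1)/n, and the bias vanishes as n grows, in line with consistency.

```python
import numpy as np

rng = np.random.default_rng(2)
true_var = 4.0

for n in (5, 50, 500):
    runs = 20_000
    x = rng.normal(0.0, np.sqrt(true_var), size=(runs, n))
    ml_var = np.var(x, axis=1, ddof=0)    # ML estimate, divides by n
    # Empirical mean of the estimate vs. the theoretical value (n-1)/n * sigma^2
    print(n, round(ml_var.mean(), 3), round((n - 1) / n * true_var, 3))
```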

2D illustration of the estimate bias and efficiency. [Figure-only slide.]

ML estimate of the m-variate Gaussian. One data sample is x_i ∈ X ⊆ R^m, x_i = (x_{i1}, x_{i2}, ..., x_{im}). It can be derived similarly that the maximum likelihood parameter estimates are the sample mean vector and the sample covariance matrix:
Sample mean vector: μ̂_ML = (1/n) ∑_{i=1}^{n} x_i.
Sample covariance matrix: Σ̂_ML = (1/n) ∑_{i=1}^{n} (x_i − μ̂_ML)(x_i − μ̂_ML)^⊤.
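A brief sketch of the multivariate case (illustrative only): the same formulas with a mean vector and an outer-product covariance, again divided by n.

```python
import numpy as np

def gaussian_ml_estimate_mv(X):
    """ML estimates for an m-variate Gaussian from an (n, m) data matrix:
    the sample mean vector and the sample covariance matrix (divided by n)."""
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    sigma_hat = centered.T @ centered / X.shape[0]   # same as np.cov(X.T, bias=True)
    return mu_hat, sigma_hat

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[0.0, 2.0],
                            cov=[[2.0, 0.6], [0.6, 1.0]], size=5_000)
mu_hat, sigma_hat = gaussian_ml_estimate_mv(X)
print(mu_hat)        # approximately [0.0, 2.0]
print(sigma_hat)     # approximately [[2.0, 0.6], [0.6, 1.0]]
```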

Example: Discrete distribution, ML estimate (1). D = {x_1, x_2, ..., x_n}, with a discrete range (quantization levels) x ∈ X = {1, 2, ..., m}. The model is p(x; θ): p(x = 1) = θ_1, p(x = 2) = θ_2, ..., p(x = m) = θ_m. Likelihood function, where δ(·) is the indicator (Kronecker delta) function and n(x) denotes the number of examples equal to x:
p(x_1, ..., x_n; θ) = ∏_{i=1}^{n} p(x_i) = ∏_{x∈X} ∏_{i=1}^{n} p(x)^{δ(x_i = x)} = ∏_{x∈X} p(x)^{n(x)}.
Logarithm of the likelihood function: L(θ) = log p(x_1, ..., x_n; θ) = ∑_{x∈X} n(x) log p(x).

Example: Discrete distribution, ML estimate (2). The logarithm of the likelihood function, L(θ) = ∑_{x∈X} n(x) log p(x), is maximized under the constraints ∑_{x∈X} p(x) = 1 and p(x) ≥ 0. The method of Lagrange multipliers is used:
F(θ) = ∑_{x∈X} n(x) log p(x) + λ (∑_{x∈X} p(x) − 1),
∂F(θ)/∂p(x) = n(x)/p(x) + λ = 0, for each x ∈ X.

Example: Discrete distribution, ML estimate (3). Rewritten from the previous slide: ∂F(θ)/∂p(x) = n(x)/p(x) + λ = 0 for each x ∈ X, hence n(x) = −λ p(x), i.e., p(x) = n(x)/(−λ). It follows from the constraint ∑_{x∈X} p(x) = 1 that ∑_{x∈X} n(x)/(−λ) = 1, hence −λ = ∑_{x∈X} n(x) = n, and therefore p̂(x) = n(x)/n.
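The result p̂(x) = n(x)/n is plain counting. A tiny sketch (illustrative; it indexes the range as {0, ..., m−1} instead of {1, ..., m} to match array indexing):

```python
import numpy as np

def discrete_ml_estimate(data, m):
    """ML estimate of a discrete distribution over {0, ..., m-1}:
    p_hat(x) = n(x) / n, the relative frequency of each value."""
    counts = np.bincount(data, minlength=m)
    return counts / counts.sum()

data = np.array([0, 2, 1, 2, 2, 0, 1, 2])    # n(0)=2, n(1)=2, n(2)=4, n=8
print(discrete_ml_estimate(data, m=3))       # [0.25 0.25 0.5]
```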

Maximum likelihood estimate when there is no analytical solution. If an analytical solution to θ̂_ML = argmax_θ L(θ) does not exist, the ML estimation task can be solved numerically: 1. Using standard optimization techniques such as gradient methods or Newton methods. 2. Using the Expectation-Maximization (EM) algorithm when p(x; θ) = ∑_{h∈H} p(x, h; θ), for example in the case of mixture models, p(x; θ) = ∑_{h=1}^{H} p(x|h; θ) p(h; θ).
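A sketch of option 1, standard numerical optimization (illustrative, not from the slides): the location parameter of a Cauchy distribution is a simple model with no closed-form ML estimate, so the negative log-likelihood is minimized with SciPy's general-purpose optimizer; Nelder-Mead is just one possible choice of method.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, data):
    """-L(theta) for a Cauchy location model p(x; theta) = 1 / (pi (1 + (x - theta)^2)),
    a case with no analytical ML solution."""
    loc = theta[0]
    return np.sum(np.log(np.pi) + np.log1p((data - loc) ** 2))

rng = np.random.default_rng(4)
data = rng.standard_cauchy(1_000) + 2.0      # true location is 2.0

result = minimize(neg_log_likelihood, x0=[np.median(data)],
                  args=(data,), method="Nelder-Mead")
print(result.x)                               # close to 2.0
```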

Properties of the ML estimate. Pros: ML has favorable statistical properties; it is asymptotically (i.e., for n → ∞) unbiased, asymptotically efficient (it has the smallest variance), and consistent. ML leads to a simple analytical solution for many simple distributions, as demonstrated in the Gaussian example. Cons: There can be several solutions with the same quality; the globally optimal value of the criterion is unique, but it may be difficult to find a parameter attaining it. There is no analytical solution for a mixture of Gaussians; in this case the EM algorithm is often used (and mentioned later in this lecture). Note: The ML estimate is often credited with the above properties. In general, however, they cannot be guaranteed, because the estimated values need not be real numbers, and then we cannot make claims about their distribution.

ML estimation in pattern recognition (1). D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} ∈ (X × Y)^n, where x ∈ X is discrete (a finite set) or continuous (an infinite set), typically X ⊆ R^m, and y ∈ Y is discrete (a finite set). The model factorizes as p(x, y; θ) = p(x|y; θ_y) p(y; θ_A), with the parameter vector θ = (θ_1, θ_2, ..., θ_|Y|, θ_A). The likelihood function p(D; θ) reads
p(D; θ) = ∏_{j=1}^{n} p(x_j, y_j; θ) = ∏_{j=1}^{n} p(x_j|y_j; θ_{y_j}) p(y_j; θ_A).

ML estimation in pattern recognition (2).
L(θ) = log p(D; θ) = ∑_{j=1}^{n} log p(x_j|y_j; θ_{y_j}) + ∑_{j=1}^{n} log p(y_j; θ_A) = ∑_{y∈Y} ∑_{j∈{i : y_i = y}} log p(x_j|y; θ_y) + ∑_{j=1}^{n} log p(y_j; θ_A).
The original maximization task decomposes into |Y| + 1 simpler independent tasks:
θ̂_y = argmax_{θ_y} ∑_{j∈{i : y_i = y}} log p(x_j|y; θ_y), for each y ∈ Y,
θ̂_A = argmax_{θ_A} ∑_{j=1}^{n} log p(y_j; θ_A).
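Because the task decomposes, the prior and the per-class parameters can be estimated completely separately. A sketch of this decomposition (illustrative; it additionally assumes Gaussian class-conditional densities, which the slide does not prescribe):

```python
import numpy as np

def supervised_ml_estimate(X, y):
    """Decomposed supervised ML estimate: the prior p(y) from label frequencies
    (theta_A) and, independently, Gaussian parameters (mu_y, Sigma_y) from the
    examples of each class y only (theta_y)."""
    priors, params = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        mu = Xc.mean(axis=0)
        params[c] = (mu, (Xc - mu).T @ (Xc - mu) / len(Xc))
    return priors, params
```

Each entry of the two dictionaries corresponds to one of the |Y| + 1 independent subtasks above.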

Example: Is the coin fair? (ML case 1). Assume a coin-flipping experiment with the training set D = {head, head, tail, head, ..., tail}. Let n be the length of the experiment (the number of tosses) and k the number of heads observed. The probabilistic model of the unknown coin is p(x = head|θ) = θ and p(x = tail|θ) = 1 − θ. The likelihood of θ ∈ Θ = [0, 1] is given by the Bernoulli distribution,
p(D|θ) = ∏_{j=1}^{n} p(x_j|θ) = θ^k (1 − θ)^{n−k}.

Example: Is the coin fair? (ML case 2). Repeated from the previous slide: p(D|θ) = ∏_{j=1}^{n} p(x_j|θ) = θ^k (1 − θ)^{n−k}. The ML estimate is θ̂ = argmax_θ p(D|θ) = k/n; the optimal value is illustrated as the red dot. Example: what if only m = 2 samples are available?
[Figure: likelihood curves p(D|θ) over θ ∈ [0, 1] for the three cases p(m=2, k=2|θ) = θ², p(m=2, k=0|θ) = (1 − θ)², and p(m=2, k=1|θ) = θ(1 − θ).]
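A quick numerical cross-check of θ̂_ML = k/n (illustrative; k, n and the grid resolution are arbitrary): evaluating the Bernoulli likelihood on a grid of θ values and taking the argmax reproduces the closed-form answer.

```python
import numpy as np

def coin_likelihood(theta, k, n):
    """Bernoulli likelihood p(D | theta) = theta^k * (1 - theta)^(n - k)."""
    return theta ** k * (1.0 - theta) ** (n - k)

k, n = 7, 10
thetas = np.linspace(0.0, 1.0, 1001)
theta_ml = thetas[np.argmax(coin_likelihood(thetas, k, n))]
print(theta_ml, k / n)     # both 0.7: the grid maximum matches k/n
```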

Use of ML in unsupervised learning (1). D_X = {x_1, x_2, ..., x_n}: only the observable states are available. The statistical model p(x, y; θ) = p(x|y; θ_y) p(y; θ_A) considers an unknown hidden state y. The likelihood is
p(D_X; θ) = ∏_{j=1}^{n} p(x_j; θ) = ∏_{j=1}^{n} ∑_{y∈Y} p(x_j, y; θ) = ∏_{j=1}^{n} ∑_{y∈Y} p(x_j|y; θ_y) p(y; θ_A),
and the log-likelihood is
L(θ) = log p(D_X; θ) = ∑_{j=1}^{n} log ∑_{y∈Y} p(x_j|y; θ_y) p(y; θ_A).

Use of ML in unsupervised learning (2). The quality of the solution, copied from the previous slide, is
L(θ) = ∑_{j=1}^{n} log ∑_{y∈Y} p(x_j|y; θ_y) p(y; θ_A).   (1)
The equation ∇_θ L(θ) = 0 does not have an analytical solution. The Expectation-Maximization (EM) algorithm is often used for this unsupervised learning task. It is an iterative method which can be used to find the ML estimate of Equation (1), and it can be used if we are able to perform the ML estimate from the completed data (D_X with the hidden states filled in). One lecture later in this course is dedicated to the EM algorithm.

Logistic regression (1). Given examples D = {(x_1, y_1), ..., (x_n, y_n)} ∈ (R^n × {+1, −1})^n, assumed to be drawn i.i.d. The goal is to find the posterior model p(y|x) needed for classification. Assume that the logarithm of the likelihood ratio is a linear function,
log ( p(y = +1|x) / p(y = −1|x) ) = ⟨x, w⟩ + b.
Under this assumption, p(y|x) is the logistic (sigmoid) function
p(y|x; w, b) = 1 / (1 + exp(−y(⟨x, w⟩ + b))).
The likelihood of the parameters θ = (w, b) given the examples D reads
p(D; w, b) = ∏_{i=1}^{n} p(x_i, y_i; w, b) = ∏_{i=1}^{n} p(y_i|x_i; w, b) p(x_i).

Logistic regression (2). The Bayes classifier does not require the model of p(x), hence
(ŵ, b̂) = argmax_{w,b} ∏_{i=1}^{n} p(y_i|x_i; w, b) p(x_i) = argmax_{w,b} ∑_{i=1}^{n} ( log p(y_i|x_i; w, b) + log p(x_i) ) = argmax_{w,b} ∑_{i=1}^{n} log p(y_i|x_i; w, b).
The maximization task is concave in the variables w, b; however, it does not have an analytical solution. The Bayes classifier minimizing the classification error leads to the linear rule
q(x) = +1 if ⟨x, w⟩ + b ≥ 0, and q(x) = −1 if ⟨x, w⟩ + b < 0,
since log ( p(y = +1|x) / p(y = −1|x) ) = ⟨x, w⟩ + b.
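Since no analytical solution exists, the concave log-likelihood is in practice maximized numerically. A minimal gradient-ascent sketch for this model with labels in {+1, −1} (illustrative; the learning rate, iteration count, and synthetic data are arbitrary choices, and real implementations would typically use a second-order or quasi-Newton method):

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.5, iters=3000):
    """Maximize (1/n) sum_i log p(y_i | x_i; w, b) by plain gradient ascent.
    Labels y must take values in {+1, -1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        coef = y / (1.0 + np.exp(margins))   # y_i * sigmoid(-margin_i)
        w += lr * (X.T @ coef) / n           # gradient w.r.t. w
        b += lr * coef.mean()                # gradient w.r.t. b
    return w, b

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0.0, 1, -1)
w, b = fit_logistic_regression(X, y)
print(np.mean(np.sign(X @ w + b) == y))      # training accuracy, close to 1.0
```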

Bayesian estimate, the introduction. In the Bayesian approach, our uncertainty about the parameters θ is represented by a probability distribution. Before we observe the data, the parameters are described by a prior density p(θ), which is typically very broad, reflecting the fact that we know little about the true value. Once we obtain data, we use the Bayes theorem to find the posterior p(θ|X). Ideally, we want the data to sharpen the posterior p(θ|X); said differently, we want to reduce our uncertainty about the parameters. However, it has to be kept in mind that our goal is to estimate the density p(x) or, more precisely, the density p(x|X), the density given the evidence provided by the dataset X.

Bayesian estimate of the probability density parameters (1). D = {x_1, x_2, ..., x_n} is the observed data and p(D, θ) is the joint probability distribution of D and θ. The parameter θ ∈ Θ is understood as a realization of a random variable (in contrast to the ML estimate) with a known prior distribution p(θ). The likelihood function is p(D|θ) = ∏_{j=1}^{n} p(x_j|θ). Bayesian estimation seeks the posterior distribution of the parameter θ best explaining the observed data D; it is obtained by the Bayes formula,
p(θ|D) = p(D|θ) p(θ) / p(D).

Use of Bayesian estimation. What is the posterior distribution p(θ|D) = p(D|θ) p(θ) / p(D) good for? 1. Marginalizing over the parameter θ (numerically expensive):
p(x|D) = ∫_{θ∈Θ} p(x, θ|D) dθ = ∫_{θ∈Θ} p(x|θ) p(θ|D) dθ.
2. Computing a point estimate by minimizing the expected risk:
θ̂ = argmin_{θ̃} ∫_{θ∈Θ} W(θ̃, θ) p(θ|D) dθ,
where W : Θ × Θ → R is a penalty function penalizing incorrectly estimated parameters.

Bayesian estimation: Examples of loss functions. The zero-one penalty function W(θ̃, θ) = 0 if θ̃ = θ and 1 if θ̃ ≠ θ leads to the Maximum A Posteriori (MAP) estimate θ̂ = argmax_θ p(θ|D). Note: The ML estimate θ̂ = argmax_θ p(D|θ) can be seen as the MAP estimate with an uninformative prior p(θ). The quadratic penalty function W(θ̃, θ) = (θ̃ − θ)² leads to the estimate given by the expected value of the parameter,
θ̂ = ∫_{θ∈Θ} θ p(θ|D) dθ.

Bayesian estimate of the probability density parameters (2). The estimate θ̂ = Ψ(D) is formulated as in Bayesian decision making, with the penalty function W : Θ × Θ → R. The Bayesian risk is
R(Ψ) = ∫_D ∫_{θ} p(D, θ) W(θ, Ψ(D)) dθ dD.
The optimal decision function (strategy, decision rule) is
Ψ*(D) = argmin_Ψ R(Ψ), obtained for each D by minimizing ∫_{θ} p(D, θ) W(θ, Ψ(D)) dθ.

Bayesian estimate for the quadratic penalty W(θ, Ψ(D)) = (θ − Ψ(D))². After the substitution, Ψ*(D) minimizes ∫_{θ} p(D, θ) (θ − Ψ)² dθ, so it must hold that
∂/∂Ψ ∫_{θ} p(D, θ) (θ − Ψ)² dθ = 0  ⟹  Ψ ∫_{θ} p(D, θ) dθ = ∫_{θ} θ p(D, θ) dθ,
Ψ*(D) = ∫_{θ} θ p(D, θ) dθ / p(D) = ∫_{θ} θ p(θ|D) p(D) dθ / p(D) = ∫_{θ} θ p(θ|D) dθ.

Example: Is the coin fair? (Bayesian case 1). (The same setting as in the ML case; the earlier slide is repeated here.) Assume a coin-flipping experiment with the training set D = {head, head, tail, head, ..., tail}. Let n be the length of the experiment (the number of tosses) and k the number of heads observed. The probabilistic model of the unknown coin is p(x = head|θ) = θ and p(x = tail|θ) = 1 − θ. The likelihood of θ ∈ Θ = [0, 1] is given by the Bernoulli distribution,
p(D|θ) = ∏_{j=1}^{n} p(x_j|θ) = θ^k (1 − θ)^{n−k}.

Example: Is the coin fair? (Bayesian case 2). Assume the uniform prior p(θ) = 1 for θ ∈ [0, 1] and 0 otherwise. Then
θ̂_B = p(x = head|D) = Ψ*(D) = ∫_0^1 θ p(D|θ) p(θ) dθ / p(D) = ( ∫_0^1 θ^{k+1} (1 − θ)^{n−k} dθ ) / ( ∫_0^1 θ^k (1 − θ)^{n−k} dθ ) = (k + 1)/(n + 2).
Note 1: θ̂_ML = k/n. Note 2: For n → ∞, it also holds that θ̂_B → θ̂_ML.
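A numerical cross-check of (k + 1)/(n + 2) (illustrative; k, n and the grid are arbitrary): with a uniform prior, the posterior mean computed on a dense grid of θ values matches the closed form and differs from the ML estimate k/n for small n.

```python
import numpy as np

k, n = 2, 2                                    # two heads out of two tosses
thetas = np.linspace(0.0, 1.0, 100_001)
likelihood = thetas ** k * (1.0 - thetas) ** (n - k)
posterior = likelihood / likelihood.sum()      # uniform prior, grid-normalized

theta_bayes = np.sum(thetas * posterior)       # posterior mean on the grid
print(theta_bayes)                             # about 0.75 = (k + 1) / (n + 2)
print(k / n)                                   # 1.0, the ML estimate
```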

Example: Is the coin fair? (Bayesian case 3). The ML estimate θ̂_ML = k/n (red dot) versus the Bayesian estimate θ̂_B = (k + 1)/(n + 2) (black dot). Example: what if only m = 2 samples are available?
[Figure: the curves p(m=2, k=2|θ) = θ², p(m=2, k=0|θ) = (1 − θ)², and p(m=2, k=1|θ) = θ(1 − θ) over θ ∈ [0, 1], with the ML and Bayesian estimates marked.]

Relationship between maximum likelihood and Bayesian estimation. By definition, the likelihood p(X|θ) peaks at the ML estimate. If this peak is relatively sharp and the prior is broad, then the posterior is dominated by the region around the ML estimate, and the Bayesian estimate approximates the ML solution. When the number of available data samples increases, the posterior p(θ|X) tends to sharpen. Therefore, the Bayesian estimate of p(x) approaches the ML estimate as the number of samples N → ∞. In practice, when only a limited number of observations is available, the two approaches can yield different results.