Vapnik-Chervonenkis learning theory


Vapnik-Chervonenkis learning theory
Václav Hlaváč
Czech Technical University in Prague, Czech Institute of Informatics, Robotics and Cybernetics, 160 00 Prague 6, Jugoslávských partyzánu 1580/3, Czech Republic
http://people.ciirc.cvut.cz/hlavac, vaclav.hlavac@cvut.cz
also Center for Machine Perception, http://cmp.felk.cvut.cz
Courtesy: M.I. Schlesinger.
Outline of the talk:
Classifier design.
Mathematical formulation of the risk describing the process of learning.
Upper bound = guaranteed risk.
VC-dimension calculation.
Structural risk minimization.

Classifier design (1) 2/22
The object of interest is characterized by observable properties x ∈ X and by its class membership (unobservable, hidden state) y ∈ Y, where X is the space of observations and Y the set of hidden states.
The objective of classifier design is to find the optimal decision function q : X → Y.
Bayesian decision theory solves the problem of minimizing the Bayesian risk
$R(q) = \sum_{x,y} p_{XY}(x, y)\, W(y, q(x))$
given the following quantities:
p_{XY}(x, y), x ∈ X, y ∈ Y — the statistical model of the dependence of the observable properties (measurements) on the class membership.
W(y, q(x)) — the loss of decision q(x) if the true class is y.
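To make the risk definition concrete, the following minimal sketch (an addition, not part of the slides) evaluates the Bayesian risk of a classifier for a small discrete joint distribution p_XY under a 0/1 loss; the probability table is an illustrative assumption only.

```python
# Illustrative joint distribution p_XY(x, y) over X = {0, 1, 2}, Y = {0, 1} (assumed numbers).
p_xy = {(0, 0): 0.30, (0, 1): 0.05,
        (1, 0): 0.10, (1, 1): 0.20,
        (2, 0): 0.05, (2, 1): 0.30}

def loss(y, decision):
    """0/1 loss W(y, q(x)): 1 for a wrong decision, 0 otherwise."""
    return 0.0 if y == decision else 1.0

def bayesian_risk(q):
    """R(q) = sum_{x,y} p_XY(x, y) * W(y, q(x))."""
    return sum(p * loss(y, q[x]) for (x, y), p in p_xy.items())

# The Bayes-optimal classifier picks, for each x, the y maximizing p_XY(x, y).
q_bayes = {x: max((0, 1), key=lambda y: p_xy[(x, y)]) for x in (0, 1, 2)}
print(q_bayes, bayesian_risk(q_bayes))   # {0: 0, 1: 1, 2: 1}, risk = 0.05 + 0.10 + 0.05 = 0.20
```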

Classifier design (2) 3/22
Constraints or penalties for different errors depend on the application problem formulation. However, in applications typically:
None of the probabilities is known, i.e., neither the class-conditional likelihoods p(x|y) nor the priors p(y), x ∈ X, y ∈ Y.
The designer is only given a training multi-set T = {(x_1, y_1), ..., (x_L, y_L)}, where L is the length (size) of the training multi-set.
The desired properties of the classifier q(x) are assumed.
Note: Non-Bayesian decision theory offers a solution to the problem if p(x|y), x ∈ X, y ∈ Y are known, but the priors p(y) are unknown (or do not exist).

Classifier design via parameter estimation 4/22
Assume p(x, y) has a particular form, e.g., a mixture of Gaussians, piece-wise constant, etc., with a finite (i.e., small) number of parameters Θ_y.
Estimate the parameters Θ_y from the training multi-set T.
Solve the classifier design problem (i.e., minimize the risk) by substituting the estimated p̂(x, y) for the true (and unknown) probabilities p(x, y).
− There is no direct relationship between the known properties of the estimated p̂(x, y) and the properties (typically the risk) of the obtained classifier q*(x).
− If the true p(x, y) is not of the assumed form, then q*(x) may be arbitrarily bad, even if the size of the training set L approaches infinity!
+ Implementation is often straightforward, especially if the parameters Θ_y for each class are assumed independent.
+ Performance on real data can be predicted empirically from the performance on the training data (split into a training set and a validation set, e.g., cross-validation).
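As a concrete illustration of the plug-in approach (a sketch added here, not from the slides), the snippet below fits one Gaussian per class to synthetic 1-D data and classifies by substituting the estimated densities and priors into the Bayes rule; the data-generating parameters and names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D training multi-set: two classes with (unknown to the learner) Gaussian likelihoods.
x0 = rng.normal(-1.0, 1.0, size=200)          # class 0 samples
x1 = rng.normal(+1.5, 0.8, size=300)          # class 1 samples
x_train = np.concatenate([x0, x1])
y_train = np.concatenate([np.zeros(200, int), np.ones(300, int)])

# Parameter estimation: mean, standard deviation and prior per class (the assumed parametric form).
params = {}
for y in (0, 1):
    xs = x_train[y_train == y]
    params[y] = (xs.mean(), xs.std(ddof=1), len(xs) / len(x_train))

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def classify(x):
    """Plug-in Bayes rule: argmax_y p̂(x | y) p̂(y)."""
    scores = {y: gaussian_pdf(x, mu, sigma) * prior for y, (mu, sigma, prior) in params.items()}
    return max(scores, key=scores.get)

print(classify(-2.0), classify(2.0))   # expected: 0, 1
```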

Learning in statistical pattern recognition 5/22
Choose a class Q of decision functions (classifiers) q: X → Y.
Find q* ∈ Q by minimizing some criterion function on the training set that approximates the risk R(q) (which cannot be computed).
The learning paradigm is defined by the approximating criterion function:
1. Maximizing the likelihood. Example: estimating the probability density.
2. Using a non-random training set. Example: image analysis.
3. Empirical risk minimization, in which the true risk is approximated by the error rate on the training set. Examples: Perceptron, neural nets (back-propagation), etc.
4. Structural risk minimization. Example: SVM (Support Vector Machines).

Overfitting and underfitting 6/22
How rich a class Q of classifiers q(x, Θ) should be used?
The problem of generalization is a key problem of pattern recognition: a small empirical risk R_emp need not imply a small true expected risk R!
(Figure: three fits of the same data illustrating underfit, fit, and overfit.)
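A small numerical illustration of this point (an added sketch, not part of the slides): fitting polynomials of increasing degree to noisy 1-D data typically drives the training error toward zero while the error on held-out data grows once the class becomes too rich; the data-generating function and the degrees below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=n)   # assumed ground truth plus noise
    return x, y

x_train, y_train = sample(15)
x_test, y_test = sample(200)

for degree in (1, 3, 9, 12):
    coeffs = np.polyfit(x_train, y_train, degree)            # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```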

Asymptotic behavior 7/22
For infinite training data, the law of large numbers assures $\lim_{L \to \infty} R_{emp}(\Theta) = R(\Theta)$.
In general, unfortunately, minimizing the empirical risk gives no guarantee with respect to the expected risk, because
$\operatorname{argmin}_\Theta R_{emp}(\Theta) \neq \operatorname{argmin}_\Theta R(\Theta)$.
Performance on training data is often better than on test data (i.e., than the real performance).

The idea of the guaranteed risk 8/22
Idea: add a prior (also called a regularizer). This regularizer favors a simpler strategy, cf. Occam's razor.
Vapnik-Chervonenkis learning theory introduces a guaranteed risk J(Θ) with R(Θ) ≤ J(Θ), holding with the probabilistic confidence η.
The upper bound J(Θ) may be so large (i.e., pessimistic) that it can be useless.
(Figure: curves of R(Θ), R_emp(Θ) and J(Θ) over Θ, with the minima of R, R_emp and J marked.)

The upper bound of the true risk 9/22
The upper bound was derived by Chervonenkis and Vapnik in the 1970s. With confidence η, 0 ≤ η ≤ 1,
$R(\Theta) \;\leq\; J(\Theta) \;=\; R_{emp}(\Theta) + \sqrt{\frac{h\left(\log\frac{2L}{h} + 1\right) - \log\frac{\eta}{4}}{L}}\,,$
where L is the length of the training multi-set and h is the VC-dimension of the class of strategies q(x, Θ).
Note that the above upper bound is independent of the true p(x, y)! It is the worst-case upper bound, valid for all possible p(x, y).
Structural risk minimization means minimizing the upper bound J(Θ). (We will return to structural risk minimization after we explain how to compute the VC-dimension.)
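For intuition about the size of this bound, here is a small sketch (added, not from the slides) that evaluates the capacity term and the guaranteed risk J for a few training-set lengths L; the chosen h, η and R_emp values are purely illustrative.

```python
import math

def vc_capacity_term(L, h, eta):
    """sqrt( (h*(log(2L/h) + 1) - log(eta/4)) / L ), the penalty added to R_emp."""
    return math.sqrt((h * (math.log(2 * L / h) + 1) - math.log(eta / 4)) / L)

def guaranteed_risk(r_emp, L, h, eta=0.05):
    """Upper bound J = R_emp + capacity term (with confidence eta, following the slide's convention)."""
    return r_emp + vc_capacity_term(L, h, eta)

r_emp, h = 0.10, 3          # illustrative empirical risk and VC-dimension
for L in (50, 500, 5000, 50000):
    print(f"L = {L:6d}: J = {guaranteed_risk(r_emp, L, h):.3f}")
```

As expected, the capacity term shrinks slowly (roughly like sqrt(h log L / L)), so the bound becomes informative only for training sets that are large relative to h.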

Vapnik-Chervonenkis dimension 10/22
It is a number characterizing a decision strategy. Abbreviated VC-dimension.
Named after Vladimir Vapnik and Alexey Chervonenkis. (It appeared in their book in Russian: V. Vapnik, A. Chervonenkis: Pattern Recognition Theory, Statistical Learning Problems, Nauka, Moskva, 1974.)
It is one of the core concepts in the Vapnik-Chervonenkis theory of learning. In the original 1974 publication, it was called the capacity of a class of strategies.
The VC-dimension is a measure of the capacity of a statistical classification algorithm.

VC-dimension, the idea informally 11/22
(Figure: graphs of polynomials of increasing degree, e.g., f_1(x) = (x−1), f_2(x) = (x−1)(x+2), f_3(x) = (x−2)(x−1)(x+2), f_6(x) = (x−2)(x−1)x(x+1)(x+2)(x+3); light green circles symbolize data points.)
The VC-dimension (capacity) of a classification strategy tells how complicated it can be.
An example: thresholding a high-degree polynomial. If a very high-degree polynomial is used, it can be very wiggly and can fit a training set exactly (overfit). Such a polynomial has a high capacity and problems with generalization. A linear function, e.g., has a low VC-dimension.

Shattering 12/22
Consider a classification strategy q with some parameter vector Θ.
The strategy q can shatter a set of data points x_1, x_2, ..., x_n if, for all possible assignments of labels y ∈ Y to the data points, there exists a parameter Θ such that the model q makes no errors when evaluating that set of data points.
Shattering example: q is a line in a 2D feature space.
(Figure: 3 points can be shattered; 4 points cannot be divided for some labelings.)

VC-dimension h, definition 13/22
Consider a set of dichotomic strategies q(x, Θ) ∈ Q.
A set consisting of h data points (observations) can be labelled in 2^h possible ways. If, for every such labelling, a strategy q ∈ Q exists which assigns all labels correctly, the set is shattered. (The process of finding, for all possible configurations, a strategy with correctly assigned labels is called shattering.)
VC-dimension (definition): the maximal number h of data points (observations) that can be shattered.

VC-dimension of a linear strategy in a 2D feature space 14/22
A set of parameters Θ = {Θ_0, Θ_1, Θ_2}. A linear strategy q(x, Θ) = Θ_1 x_1 + Θ_2 x_2 + Θ_0.
Shattering example (revisited): (Figure: 3 points shattered; 4 points indivisible.)
3 points in a 2D space (n = 2) can be shattered. A counterexample shows that 4 points cannot be shattered.
VC-dimension h = 3.
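The claim h = 3 for lines in 2D can be checked by brute force. The sketch below (added, not from the slides) enumerates every labelling of a point set and tests linear separability with a small feasibility LP; scipy is assumed to be available.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Feasibility LP: find (w1, w2, b) with y_i * (w . x_i + b) >= 1 for all i."""
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered_by_lines(points):
    """True if every +/-1 labelling of the points is realizable by a line."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]             # 3 points in general position
four = [(0, 0), (1, 0), (0, 1), (1, 1)]      # 4 points (a square)
print(shattered_by_lines(three))   # True  -> 3 points can be shattered
print(shattered_by_lines(four))    # False -> e.g. the XOR labelling fails
```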

VC-dimension for a linear strategy in an n-dimensional space 15/22
A special case, n = 2: VC-dimension = 3.
Generalization to n dimensions for linear classifiers: a hyperplane in the space R^n shatters any set of h = n + 1 points in general position (affinely independent points).
Consequently, the VC-dimension of linear decision strategies is h = n + 1.

VC-dimension in a 2D space for a circular strategy 16/22
At most 4 data points in R^2 can be shattered by a circular strategy ⇒ VC-dimension h = 4.
(Figure: 4 points x_1, ..., x_4 and 8 of their labelings realized by circular decision boundaries.)

Small number of parameters, VC-dimension = ∞ 17/22
Counterexample by E. Levin, J. S. Denker (Vapnik 1995): a sinusoidal 1D classifier, q(x, Θ) = sign(sin(Θ x)), x, Θ ∈ R.
For any given number L ∈ N, the points x_i = 10^{−i}, i = 1, ..., L, can be chosen and arbitrary labels y_i ∈ {−1, 1} can be assigned to them.
Then q(x, Θ) produces the correct labelling if
$\Theta = \pi \left( 1 + \sum_{i=1}^{L} \frac{(1 - y_i)\, 10^i}{2} \right).$
Example: L = 3, y_1 = 1, y_2 = −1, y_3 = 1, which gives Θ = π (1 + 100) ≈ 317.301.
(Figures: two plots of the decision function sin(Θ x) for Θ = 317.301, over x ∈ [0.01, 0.1] and zoomed in around x ≈ 0.01.)
Thus the VC-dimension of this decision strategy is infinite.
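This construction is easy to verify numerically; the short sketch below (an addition, not part of the slides) picks random labels, computes Θ by the formula above and checks that sign(sin(Θ x_i)) reproduces them.

```python
import numpy as np

rng = np.random.default_rng(42)

L = 6
x = 10.0 ** (-np.arange(1, L + 1))                 # x_i = 10^-i, i = 1..L
y = rng.choice([-1, 1], size=L)                    # arbitrary labels

# Theta = pi * (1 + sum_i (1 - y_i) * 10^i / 2) realizes exactly these labels.
theta = np.pi * (1 + np.sum((1 - y) * 10.0 ** np.arange(1, L + 1) / 2))

predicted = np.sign(np.sin(theta * x))
print(y, predicted, np.array_equal(y, predicted))  # True: all labels reproduced
```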

Examples of other VC-dimension = ∞ strategies 18/22
Nearest-neighbor classifier: any number of observations, labeled arbitrarily, will be classified correctly. Thus VC-dimension = ∞, and also R_emp = 0. The VC-dimension provides no information in this particular case.
Convex polygons classifying observations lying on a circle: VC-dimension = ∞.
SVM classifiers with a Gaussian (or RBF, ...) kernel: VC-dimension = ∞.

Structural risk minimization 19/22
Minimize the guaranteed risk J(Θ), that is, the upper bound
$R(\Theta) \;\leq\; J(\Theta) \;=\; R_{emp}(\Theta) + \sqrt{\frac{h\left(\log\frac{2L}{h} + 1\right) - \log\frac{\eta}{4}}{L}}\,.$
For each model i in the list of hypotheses:
Compute its VC-dimension h_i.
Θ*_i = argmin_{Θ_i} R_emp(Θ_i).
Compute J_i(Θ*_i, h_i).
Choose the model with the lowest J_i(Θ*_i, h_i).
Preferably, optimize directly over both: (Θ*, h*) = argmin_{Θ, h} J(Θ, h).
Gap-tolerant linear classifiers minimize R_emp(Θ) while maximizing the margin. The Support Vector Machine does just that.
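A minimal sketch of the model-selection loop (added here, not from the slides): given hypothetical models with their VC-dimensions and the empirical risks of their best parameters, pick the one with the smallest guaranteed risk; the model names and numbers are invented for illustration.

```python
import math

def guaranteed_risk(r_emp, L, h, eta=0.05):
    """J = R_emp + sqrt( (h*(log(2L/h) + 1) - log(eta/4)) / L )."""
    return r_emp + math.sqrt((h * (math.log(2 * L / h) + 1) - math.log(eta / 4)) / L)

L = 2000   # length of the training multi-set (illustrative)

# Hypothetical nested models: richer models fit the training data better (lower R_emp)
# but have a larger VC-dimension h.
models = [
    {"name": "linear",         "h": 3,  "r_emp": 0.18},
    {"name": "quadratic",      "h": 6,  "r_emp": 0.11},
    {"name": "cubic",          "h": 10, "r_emp": 0.09},
    {"name": "degree-10 poly", "h": 66, "r_emp": 0.05},
]

for m in models:
    m["J"] = guaranteed_risk(m["r_emp"], L, m["h"])
    print(f'{m["name"]:>14}: R_emp = {m["r_emp"]:.2f}, h = {m["h"]:3d}, J = {m["J"]:.3f}')

best = min(models, key=lambda m: m["J"])
print("selected model:", best["name"])   # a mid-capacity model wins the trade-off
```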

Structural risk minimization pictorially 20/22
(Figure: risks plotted against the VC-dimension h over a space of nested hypotheses; the guaranteed risk J(Θ, h) is the sum of the empirical risk, which decreases with h, and the regularizer term, which grows with h.)

VC-dimension, a practical view 21/22
Bad news:
Computing the guaranteed risk is useless in many practical situations.
The VC-dimension cannot be accurately estimated for non-linear models such as neural networks.
Structural risk minimization may lead to a non-linear optimization problem.
The VC-dimension may be infinite (e.g., for a nearest-neighbor classifier), requiring an infinite amount of training data.
Good news:
Structural risk minimization can be applied to linear classifiers.
It is especially useful for Support Vector Machines.

Empirical risk minimization, notes 22/22
Is empirical risk minimization (= minimization of the training-set error, e.g., neural networks with back-propagation) then dead? No!
The guaranteed risk J may be so large that this upper bound becomes useless. Find a tighter bound and you will be famous! It is not impossible!
+ Vapnik and Chervonenkis suggest learning with progressively more complex classes of decision strategies Q.
+ Vapnik-Chervonenkis theory justifies using empirical risk minimization on classes of functions with a reasonable VC-dimension.
− Empirical risk minimization is computationally hard (impossible for large L). Nevertheless, the classes of decision functions Q for which empirical risk minimization can (at least locally) be organized efficiently are often useful.
Where does the nearest-neighbor classifier fit into this picture?