Chapter 7: Computational Learning Theory

Sample complexity for infinite hypothesis spaces.

  • The Vapnik-Chervonenkis dimension of H, VC(H), typically allows us to place a fairly tight bound on the sample complexity
  • A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S (2^|S| in total), there exists some hypothesis in H consistent with that dichotomy. See Figure 7.3.
  • VC(H) of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H
  • Note: if arbitrarily large finite subsets of X can be shattered by H, then VC(H) ≡ ∞
  • Note: for finite H, VC(H) ≤ log₂|H|
  • If X = {real numbers} and H = {intervals on the real line}, then VC(H) = 2 (a brute-force check of this example appears after this list)
  • If X = {points on the x,y plane} and H = {linear decision surfaces} then VC(H) = 3. See Figure 7.4.
  • If X = {conjunctions of exactly 3 boolean literals} and H = {conjunctions of up to 3 boolean literals} then VC(H) = 3
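The sketch below is my own illustration, not from the chapter: a brute-force check, with hypothetical helper names, that intervals on the real line can shatter a 2-point set but no 3-point set, confirming VC(H) = 2 for that example.

```python
# Minimal brute-force check (illustration only, not from the chapter) that
# intervals on the real line shatter some 2-point set but no 3-point set.
from itertools import product

def interval_can_realize(points, labels):
    """Is there an interval [a, b] whose indicator matches `labels` on `points`?"""
    pts = sorted(points)
    # Candidate endpoints: outside the points, or midway between adjacent points.
    cuts = [pts[0] - 1] + [(pts[i] + pts[i + 1]) / 2 for i in range(len(pts) - 1)] + [pts[-1] + 1]
    return any(
        all((a <= x <= b) == bool(y) for x, y in zip(points, labels))
        for a in cuts for b in cuts if a <= b
    )

def shattered_by_intervals(points):
    """True if intervals realize every one of the 2^|S| dichotomies of `points`."""
    return all(interval_can_realize(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered_by_intervals([0.0, 1.0]))       # True  -> VC >= 2
print(shattered_by_intervals([0.0, 1.0, 2.0]))  # False -> the +,-,+ dichotomy fails
```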

Sample Complexity Bounds

  • Upper bound: m ≥ (1/ε)[4 log₂(2/δ) + 8 VC(H) log₂(13/ε)] examples suffice (see the numeric sketch after this list)
  • Lower bound: consider any concept class C such that VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100. Then there exists a distribution D and target concept in C such that if L observes fewer examples than max[(1/ε)log(1/δ), (VC(C) - 1)/(32ε)], then with probability at least δ, L outputs a hypothesis h having error_D(h) > ε
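To get a feel for these expressions, here is a small numeric sketch (my own, not from the chapter), assuming a hypothesis space with VC(H) = 3 and values of ε and δ that satisfy the lower-bound theorem's conditions.

```python
# Numeric illustration (my own, not from the chapter) of the two bounds above,
# for an assumed hypothesis space with VC(H) = 3.
import math

def vc_upper_bound(eps, delta, vc):
    """Examples sufficient: (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return (1 / eps) * (4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / eps))

def vc_lower_bound(eps, delta, vc):
    """Examples necessary: max((1/eps)*log(1/delta), (VC(C) - 1) / (32*eps))."""
    return max((1 / eps) * math.log(1 / delta), (vc - 1) / (32 * eps))

eps, delta, vc = 0.05, 0.005, 3   # eps < 1/8 and delta < 1/100, as the theorem requires
print(f"{vc_upper_bound(eps, delta, vc):.0f} examples suffice")
print(f"{vc_lower_bound(eps, delta, vc):.0f} examples are necessary")
```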

Mistake Bound of Learning

  • How many mistakes will the learner make in its predictions before it learns the target concept?
  • The learner must predict c(x) before being told the answer
  • This is useful in applications such as predicting fraudulent credit card purchases
  • In this section, we are interested in learning the target concept exactly
  • Consider the FIND-S algorithm applied to conjunctions of Boolean literals: initialize h to a₁ ⋀ ¬a₁ ⋀ … ⋀ aₙ ⋀ ¬aₙ
  • For each positive training instance x, remove from h any literal that is not satisfied by x

The largest number of mistakes this learner can make before it exactly learns the target concept (its mistake bound) is n + 1. A sketch of the learner appears below.
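The following is a minimal sketch of this learner, assuming instances are vectors of n Boolean attributes; the function names and the toy target concept are my own, but the update rule is the one described above.

```python
# A minimal sketch of the learner above, assuming instances are vectors of n
# Boolean attributes a_0 .. a_{n-1}; the target concept below is a made-up example.
def learn_conjunction(stream, n):
    """stream yields (x, label) pairs; returns the final hypothesis and mistake count."""
    # h is a set of literals: ('pos', i) stands for a_i, ('neg', i) for NOT a_i.
    h = {('pos', i) for i in range(n)} | {('neg', i) for i in range(n)}
    mistakes = 0
    for x, label in stream:
        def satisfied(lit):
            kind, i = lit
            return x[i] if kind == 'pos' else not x[i]
        prediction = all(satisfied(lit) for lit in h)
        if prediction != label:
            mistakes += 1
        if label:  # positive example: drop every literal the example does not satisfy
            h = {lit for lit in h if satisfied(lit)}
    return h, mistakes

def target(x):                 # hypothetical target concept: a_0 AND NOT a_2
    return x[0] and not x[2]

xs = [(True, False, False), (True, True, False), (False, True, True),
      (True, False, True), (True, True, False)]
h, m = learn_conjunction(((x, target(x)) for x in xs), n=3)
print(sorted(h), m)            # m never exceeds n + 1 = 4
```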

Halving Algorithm

  • Use a version space
  • The goal is to reduce the number of viable hypotheses to 1
  • The classification is determined using a majority vote
  • The mistake bound is log₂|H| (a sketch of the algorithm follows this list)
  • Again, this is an upper bound - it is possible to learn without making any mistakes
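Below is a minimal sketch of the Halving algorithm over an explicitly enumerated toy hypothesis space; the choice of hypothesis space (all 16 Boolean functions of two Boolean inputs) and the target concept are my own illustration.

```python
# A minimal sketch of the Halving algorithm over an explicitly enumerated toy
# hypothesis space: all 16 Boolean functions of two Boolean inputs (my own example).
from itertools import product
import math

def halving(hypotheses, stream):
    """Predict by majority vote of the version space, then drop inconsistent hypotheses."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in version_space)
        prediction = 2 * votes >= len(version_space)     # majority vote, ties -> positive
        if prediction != label:
            mistakes += 1
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes

def make_hypothesis(table):
    return lambda x: table[x]

inputs = list(product([False, True], repeat=2))
hypotheses = [make_hypothesis(dict(zip(inputs, outputs)))
              for outputs in product([False, True], repeat=4)]

def target(x):                                           # concept to be learned: AND
    return x[0] and x[1]

vs, m = halving(hypotheses, [(x, target(x)) for x in inputs])
print(len(vs), m, math.log2(len(hypotheses)))            # m never exceeds log2|H| = 4
```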

Optimal Mistake Bounds

  • Let C be an arbitrary non-empty concept class
  • Let M_A(c) denote the maximum, over all possible sequences of training examples, of the number of mistakes made by algorithm A to exactly learn c
  • Let M_A(C) ≡ max_{c ∈ C} M_A(c)
  • The optimal mistake bound for C, Opt(C), is the minimum over all possible learning algorithms A of M_A(C)
  • VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log₂|C|

Weighted Majority Algorithm

  • Table 7.1 shows the algorithm; a rough sketch in code also follows this list
  • Theorem (relative mistake bound for Weighted-Majority): let D be any sequence of training examples, let A be any set of n prediction algorithms, and let k be the minimum number of mistakes made by any algorithm in A on the training sequence D. Then the number of mistakes over D made by the Weighted-Majority algorithm using β = 1/2 is at most 2.4(k + log₂ n)
  • Note: the weighted-majority idea underlies many ensemble learners that use boosting (such as AdaBoost) to obtain better classification accuracy
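The sketch below follows the Weighted-Majority update with β = 1/2; the pool of "expert" predictors and the data stream are made-up toys rather than the contents of Table 7.1, and the printed inequality is the relative mistake bound from the theorem above.

```python
# A minimal sketch of Weighted-Majority with beta = 1/2; the pool of "expert"
# predictors and the data stream are made-up toys, not from Table 7.1 itself.
import math

def weighted_majority(experts, stream, beta=0.5):
    weights = [1.0] * len(experts)
    wm_mistakes = 0
    expert_mistakes = [0] * len(experts)
    for x, label in stream:
        predictions = [e(x) for e in experts]
        pos = sum(w for w, p in zip(weights, predictions) if p)
        neg = sum(w for w, p in zip(weights, predictions) if not p)
        if (pos >= neg) != label:
            wm_mistakes += 1
        for i, p in enumerate(predictions):
            if p != label:
                expert_mistakes[i] += 1
                weights[i] *= beta       # shrink the weight of every expert that erred
    return wm_mistakes, min(expert_mistakes)

experts = [lambda x: x > 0, lambda x: x > 5, lambda x: True, lambda x: False]
stream = [(x, x > 0) for x in [-3, 2, 7, -1, 4, 9, -8, 1]]
wm, k = weighted_majority(experts, stream)
print(wm, "<=", 2.4 * (k + math.log2(len(experts))))     # relative mistake bound
```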



Sample complexity of linear regression.

Here, we'll look at linear regression from a statistical learning theory perspective. In particular, we'll derive the number of samples necessary to achieve a certain level of regression error. We'll also see a technique called "discretization" that lets us prove things about infinite hypothesis classes by relying on results for finite ones.

Problem setup

Consider a very standard linear regression setup. We're given a set of $d$-dimensional data points and 1-dimensional labels (or "response variables"), $\{(x_i, y_i)\}_{i=1}^n$, and we're tasked with finding a linear map $w \in \mathbb{R}^d$ that minimizes the squared error between the predictions and the truth. In particular, we'd like to minimize $\sum\limits_{i = 1}^n (w^\top x_i - y_i)^2$.

Without loss of generality, assume for all $i$ that $\|x_i\|_2 \leq 1$ and $|y_i| \leq 1$, and that the weight vector satisfies $\|w\|_2 \leq 1$.

(The following results will hold true for any constant bound, and we can always rescale as needed.)

Learnability for finite hypothesis classes

Recall that a hypothesis class is essentially the set of functions over which we're searching in any given statistical learning problem. In our current example, the hypothesis class is the set of all linear functions with bounded norm on their weights. Said another way, each hypothesis in our hypothesis class corresponds to a weight vector $w$, where $h_w(x) = w^\top x$ and $\|w\|_2 \leq 1$.

In the PAC learning setting, a hypothesis class is considered "learnable" if there exists an algorithm that, for any $\epsilon$ and $\delta$, returns a hypothesis with error at most $\epsilon$ with probability at least $1 - \delta$, provided it observes enough samples.

An important result in PAC learning is that all finite hypothesis classes are (agnostically) PAC learnable with sample complexity

$$m_{\mathcal{H}}(\epsilon, \delta) \leq \frac{2\log(2|\mathcal{H}|/\delta)}{\epsilon^2},$$

where $|\mathcal{H}|$ is the cardinality of the hypothesis class.

This can be shown using a Hoeffding bound. Note that the "agnostic" part of learnability means that the algorithm returns a hypothesis whose error is at most $\epsilon$ worse than that of the best hypothesis in the class $\mathcal{H}$, rather than at most $\epsilon$ in absolute terms.

However, notice that in the linear regression setting, the hypothesis class is infinite: even though the weight vector's norm is bounded, it can still take an infinite number of values. Can we somehow leverage the result for finite classes here? This brings us to an important point: we can discretize the hypothesis class and bound how much error we incur by doing so.

After discretizing, we need to account for error arising from two places:

  • The usual estimation error $\epsilon$ from learning with finitely many samples.
  • Additional error from restricting ourselves to the discretized set of hypotheses.

Discretizing hypothesis classes

We now set out to "approximate" our infinite hypothesis class with a finite one. We'll do this in the most straightforward way: simply choose a resolution at which to discretize, and split each dimension into equally spaced bins. Mathematically, if we choose any $\epsilon'$, then we can represent the discretized hypothesis class as $\mathcal{H}' = \{h_w \in \mathcal{H} : w \in \epsilon' \mathbb{Z}^d\}$.

Recall that we constrained $w$ such that $\|w\|_2 \leq 1$,

so each dimension of $w$ lives in the interval $[-1, 1]$. Thus, there are at most $\left(\frac{2}{\epsilon'}\right)^d$ hypotheses in $\mathcal{H}'$. Using the generic sample complexity for finite hypothesis classes, this means that the sample complexity to learn this class is

$$m_{\mathcal{H}'}(\epsilon, \delta) \leq \frac{2\log(2|\mathcal{H}'|/\delta)}{\epsilon^2} = \frac{2\left(d\log\frac{2}{\epsilon'} + \log\frac{2}{\delta}\right)}{\epsilon^2}.$$
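To get a feel for these quantities, the sketch below (my own arithmetic, under the norm bounds assumed earlier and with illustrative values of $d$, $\epsilon$, and $\delta$) counts the grid hypotheses and evaluates the finite-class bound.

```python
# Rough arithmetic (my own, under the norm bounds assumed above) for the size of
# the discretized class H' and the resulting finite-class sample complexity.
import math

def discretized_class_size(d, eps_prime):
    """Grid hypotheses when each of the d coordinates of w in [-1, 1] is split into bins of width eps_prime."""
    return (2 / eps_prime) ** d

def finite_class_sample_complexity(class_size, eps, delta):
    """Agnostic PAC bound for a finite class: 2 * log(2|H| / delta) / eps^2."""
    return 2 * math.log(2 * class_size / delta) / eps ** 2

d, eps, delta = 10, 0.1, 0.05
eps_prime = eps / (4 * d)     # the discretization level chosen later in the post
size = discretized_class_size(d, eps_prime)
print(f"|H'| ~ {size:.2e}")
print(f"samples: {finite_class_sample_complexity(size, eps, delta):.0f}")
```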

Quantifying error

After discretizing the set of hypotheses, we won’t be able to find the optimal hypothesis in the original continuous set. Let’s now quantify how much error we incur by doing this. If $\tilde{w}$ is the learned weight vector after discretizing, then $\tilde{w}^\top x$ are the “predictions”. Quantifying the error here, we have for any $x$ in the training set:

\begin{align} |\tilde{w}^\top x - w^\top x| &= \Big|\sum\limits_j (\tilde{w}^{(j)} - w^{(j)}) x^{(j)}\Big| && (x^{(j)} \text{ is the } j\text{th coordinate}) \\ &\leq \epsilon' \|x\|_1 && (|\tilde{w}^{(j)} - w^{(j)}| \leq \epsilon' \text{ for every } j) \\ &\leq d\epsilon'. \end{align}

Using the Cauchy-Schwarz inequality, we also have $|w^\top x| \leq \|w\|_2 \|x\|_2 \leq 1$, so every prediction lies in $[-1, 1]$ and every residual $w^\top x - y$ lies in $[-2, 2]$.

Recall that our original goal was to minimize the squared error. The error in the discretized setting will be $(\tilde{w}^\top x - y)^2$, and in the original continuous setting, it will be $(w^\top x - y)^2$. We'd like to quantify the difference between these two errors. An important note is that the function $f(u) = u^2$ is 4-Lipschitz on the interval $[-2, 2]$ where our residuals live, i.e., for any $u, v \in [-2, 2]$, $|u^2 - v^2| \leq 4|u - v|$. Thus, we have

\begin{align} |(\tilde{w}^\top x - y)^2 - (w^\top x - y)^2| &\leq 4|(\tilde{w}^\top x - y) - (w^\top x - y)| \\ &= 4|\tilde{w}^\top x - w^\top x| \\ &\leq 4 d\epsilon’.\end{align}

If we now set $\epsilon' = \frac{\epsilon}{4d}$ (remember, we can choose $\epsilon'$ to be any level of discretization we like), we obtain that

$$|(\tilde{w}^\top x - y)^2 - (w^\top x - y)^2| \leq 4d \cdot \frac{\epsilon}{4d} = \epsilon.$$

To sum up, we incur at most $\epsilon$ extra error when we discretize the hypothesis class, on top of the $\epsilon$ error already incurred in the original setting. This means the total error will be at most $2\epsilon$. Using our sample complexity computed above, and plugging in $\epsilon' = \frac{\epsilon}{4d}$, we have:

$$m \leq \frac{2\left(d\log\frac{8d}{\epsilon} + \log\frac{2}{\delta}\right)}{\epsilon^2}.$$
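The snippet below (again my own arithmetic, not from the original derivation) evaluates this final bound for a few values of $d$ to make the roughly $d\log d$ growth visible.

```python
# Rough arithmetic (my own, not from the original post) evaluating the final bound
# m <= 2 * (d * log(8d / eps) + log(2 / delta)) / eps^2 to see the ~ d*log(d) growth.
import math

def linreg_sample_complexity(d, eps, delta):
    return 2 * (d * math.log(8 * d / eps) + math.log(2 / delta)) / eps ** 2

eps, delta = 0.1, 0.05
for d in [1, 10, 100, 1000]:
    print(d, round(linreg_sample_complexity(d, eps, delta)))
```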

Here, we used the discretization trick to compute the sample complexity of linear regression. Interestingly, the sample complexity increases proportionally to $d\log d$ (ignoring the $\epsilon$ terms). This was a surprising result to me, as this is relatively slow growth.

References

  • Prof. Elad Hazan's Theoretical Machine Learning course
  • Understanding Machine Learning: From Theory to Algorithms, by Shai Shalev-Shwartz and Shai Ben-David


Computational Learning Theory

Sample Complexity for Finite Hypothesis Spaces

  • The growth in the number of required training examples with problem size is called the sample complexity of the learning problem.
  • We will consider only consistent learners, which are those that maintain a training error of 0.
  • We can derive a bound on the number of training examples required by any consistent learner!
  • Fact: Every consistent learner outputs a hypothesis belonging to the version space.
  • Therefore, we need to bound the number of examples needed to assure that the version space contains no unacceptable hypothesis; a numeric sketch of the resulting bound follows this list.
  • The version space $VS_{H,D}$ is said to be ε-exhausted with respect to $c$ and $\mathcal{D}$ if every hypothesis $h$ in $VS_{H,D}$ has error less than ε with respect to $c$ and $\mathcal{D}$: \[(\forall h \in VS_{H,D})\; error_{\mathcal{D}}(h) < \epsilon \]
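A standard consequence, not stated explicitly above, is that $m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$ examples suffice for the version space to be ε-exhausted with probability at least $1 - \delta$. The sketch below simply evaluates this bound for conjunctions of Boolean literals; the function name is hypothetical.

```python
# Sketch of the standard finite-H bound that follows from eps-exhaustion (a known
# corollary, not stated explicitly above): m >= (1/eps) * (ln|H| + ln(1/delta))
# examples suffice for eps-exhaustion with probability at least 1 - delta.
import math

def consistent_learner_sample_complexity(h_size, eps, delta):
    return math.ceil((1 / eps) * (math.log(h_size) + math.log(1 / delta)))

# Example: conjunctions of Boolean literals over n variables; each variable appears
# positively, negatively, or not at all, so |H| = 3**n.
n = 10
print(consistent_learner_sample_complexity(3 ** n, eps=0.1, delta=0.05))
```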

José M. Vidal.

