COUSERA Machine Learning Week05

Ats

27 min readNov 30, 2020

ソフトウェアエンジニアが Cousera の機械学習コーシ（5週目）に参加して学んだことをメモ代わりに共有します。そろそろ数学的にわからないところが増えてきました。

4週目の記事もあるので良かったら読んでみてください。

COUSERA Machine Learning Week04

ソフトウェアエンジニアが Cousera…

atsss.medium.com

Cost Function and Backpropagation

この章では、ニューラルネットワークのコスト関数をどのように計算するかを説明します。

Cost Function

このコースのニューラルネットワークでは、2つの種類の分類問題を扱っていきます。一つ目はバイナリ分類です。それは y の取りうる値は 0 か 1 のどちらかだけの場合の分類問題です。バイナリ分類なら、最終的な出力ユニットは一つだけです。 2つ目のタイプの分類問題はマルチクラスの分類問題です。分類するクラス数を K 個と表現します。Kは通常は、3以上なのが、マルチクラス分類問題です。何故なら、もし 2 クラスしか無い時は one vs all法を使う必要が無く、バイナリ分類で十分だからです。また、ニューラルネットワークの問題では、L はレイヤー数、sl は l 番目のユニット数を表します。

次に、ニューラルネットワークのコスト関数を説明します。ニューラルネットワークで使うコスト関数は、ロジスティック回帰で使った物を一般化し、下記の数式で表現できます。

正則化の項では、ロジスティック回帰の時と同様に i=0 のバイアス項を除いて計算します。

最後にこの節のサマリーを載せておきます。

Let’s first define a few variables that we will need to use:
- L = total number of layers in the network
- sl = number of units (not counting bias unit) in layer l
- K = number of output units/classes
Recall that in neural networks, we may have many output nodes. We denote hΘ(x)k as being a hypothesis that results in the kth output. Our cost function for neural networks is going to be a generalization of the one we used for logistic regression. Recall that the cost function for regularized logistic regression was:

For neural networks, it is going to be slightly more complicated:

We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.
In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.

Backpropagation Algorithm

この節では、コスト関数の最小化を試みるアルゴリズムの説明をします。ニューラルネットワークのコスト関数の最小化には、バックプロパゲーション（誤差逆伝搬法）というアルゴリズムが使われています。正直、すごい難しいです。僕は、講義だけでは分からなかったので、ひたすら下記の記事を参考にしました。僕の理解では、ニューラルネットワークでも誤差を最小化するために最急降下法を使います。その最急降下法を行う時に起こる偏微分は、層が何層になっても、各重みに対するその微分はその次の層（より出力側の層）の微分を使って求めることができます。これを、これを微分の連鎖律（チェーンルール）などと呼びます。なので、出力側から重みであるΘを求めると、自ずと入力側の重み（Θ）も求まります。僕が数学的に証明できないので、気になる人は下記の記事を読むと良いと思います。

誤差逆伝播法をはじめからていねいに - Qiita

この記事は、私が機械学習、特に「誤差逆伝播法」 (あるいは「バックプロパゲーション (Backpropagation)」) を勉強するうえで辿った道筋を記録しておくものです。誤差逆伝播法に関する分かりやすい解説は、ここ Qiita…

qiita.com

バックプロパゲーション（誤差逆伝搬法）をイチから理解する

Deep…

imagingsolution.net

中学生でもなんとなくわかるDeepLearningのアルゴリズム

人工知能（AI）や DeepLearning（ディープラーニング），この頃よく聞きますよね．しかし，いまいち何なのかよくわからないという人は多いのではないでしょうか．…

mieruca-ai.com

Coursera Machine Learning -Week5-のまとめ（ニューラルネットワークの学習） - Qiita

この記事についてこの記事は、CourseraのMachine Learningのビデオ講義＋宿題についてまとめたものです。今週はWeek5（ニューラルネットワークの学習）についてです。 Week4（ニューラルネットワークの表...

qiita.com

理解が乏しいですが、バックプロパゲーションを用いたニューラルネットワークの最急降下法の手順をまとめます。（数式をうまく説明できてなくてすみません。。。）

1. まず、フィードフォワードプロセスを実行します。具体的には、入力層から出力層に向かって各レイヤにおけるz,aを求めます。具体的には、以下のように計算します。

2. ここから、バックプロパゲーションが始まります。まず、出力層における、誤差を計算します。

3. 次に、レイヤ２（隠れ層）における誤差を次式で計算します。

ここで、例えばg(z)がシグモイド関数なら、その微分g’(z)は次式のように表せます。

そして、1〜3を繰り返し、各トレーニングデータでレイヤ毎に計算した誤差を偏微分すると、下記の数式で表現できます。（ここが分かってないところです。。。）

よってコスト関数の偏微分が手に入ったので、最急降下法を使うことができます。オーバーフィッティングを防ぐために、j≧1の場合は正規化項を足しています。

最後にこの節のサマリーを載せておきます。

“Backpropagation” is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression. Our goal is to compute:

That is, we want to minimize our cost function J using an optimal set of parameters in theta. In this section we’ll look at the equations we use to compute the partial derivative of J(Θ):

To do so, we use the following algorithm:

Back propagation Algorithm
Given training set (x^{(1)}, y^{(1)}) … (x^{(m)}, y^{(m)})

For training example t =1 to m:

Where L is our total number of layers and a^{(L)}a(L) is the vector of outputs of the activation units for the last layer. So our “error values” for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y. To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:

The delta values of layer l are calculated by multiplying the delta values in the next layer with the theta matrix of layer l. We then element-wise multiply that with a function called g’, or g-prime, which is the derivative of the activation function g evaluated with the input values given by z^{(l)}.
The g-prime derivative terms can also be written out as:

Hence we update our new Δ matrix.

The capital-delta matrix D is used as an “accumulator” to add up our values as we go along and eventually compute our partial derivative. Thus we get \frac \partial {\partial \Theta_{ij}^{(l)}} J(\Theta)∂Θij(l)∂J(Θ)= D_{ij}^{(l)}Dij(l)

Backpropagation Intuition

この節では、バックプロパゲーションを数学的ではなく直感的に理解できるように説明します。おそらく多くの人が、バックプロパゲーションのアルゴリズムは難しいと感じていると思います。ただ、講義内で Andrew 氏も「実の所、私も長年、ひょっとしたら今日ですら、バックプロパゲーションが何やってるのかいまいち直感的に理解出来てないなぁ、と思う事はあるけれど、使う分には問題なくとてもしっかりと使えてきました。」と仰っています。機械学習の実装や運用に置いては、アルゴリズムの特性だけ抑えて、数学的な証明にはそんなに気を使わなくて良いのかもなと感じました。

上記の図が直感的な理解を示す図です。今回求めるコスト関数を δ(4)1 とすると、それは δ(4)1 = y-a(4)1 で表されます。ニューラルネットワークのフォワードプロパゲーションを使うと、x1 から sigmoid 関数の z を通じて、a1 へと情報が伝播していきます。この特性を利用して、δ(3)1 を δ(4)1 と Θ(2) を使って計算できます。それで計算できた δ を各レイヤーで最小になるように計算するのがバックプロパゲーションの数式の意味です。計算方法は具体的には説明しないですが、フォワードプロパゲーションで計算してることを、同じ係数（Θ）を使って逆向きにし、各レイヤーの δ を求めているというのが直感的な理解だと思います。

最後にこの節のサマリーを載せておきます。

Recall that the cost function for a neural network is:

If we consider simple non-multiclass classification (k = 1) and disregard regularization, the cost is computed with:

Intuitively, δj(l) is the “error” for aj(l) (unit j in layer l). More formally, the delta values are actually the derivative of the cost function:

Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are. Let us consider the following neural network below and see how we could calculate some δj(l):

Backpropagation in Practice

この章では具体的に、どうやってニューラルネットワークのバックプロパゲーションを実装していくかを説明します。

Unrolling Parameters

この節はニューラルネットワークのバックプロパゲーションにおいて、どのように fminunc() 関数を使うのかを説明します。３週目以前の Octave の fminunc() 関数では、下記のようにΘ とコスト関数と gradient（コスト関数の偏微分）が求められていました。そして、それらはベクトルでした。

fminunc(@costFunction, initialTheta, options)
function [jval, gradient] = costFunction ...

ニューラルネットワークでは、Θ と gardient がベクトルではなく、行列になるので、それらを fminunc 関数が受け取れる形に変換する必要があります。その方法が、下記の Octave の行列展開と reshape 関数を使った方法です。

thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)

このようにして、行列を一旦ベクトルに変換して、それをまた行列に戻すという方法で、fminunc に引数として渡します。

最後にこの節のサマリーを載せておきます。

With neural networks, we are working with sets of matrices:
Θ(1),Θ(2),Θ(3),…D(1),D(2),D(3),…
In order to use optimizing functions such as “fminunc()”, we will want to “unroll” all the elements and put them into one long vector:

thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]

If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11, then we can get back our original matrices from the “unrolled” versions as follows:

Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)

To summarize:

Gradient Checking

この節ではグラディアントチェッキングとういう方法を説明します。バックプロパゲーションを使って、D（微分）を求めるの間違えやすい方法らしいです。この D が間違っていると、次に続く最急降下法に影響を及ぼします。なので、この D が間違っているかどうかを確かめる方法を、グラディアントチェッキングという方法です。具体的には、おおよそ 10 のマイナス４乗程度の ε を用いて、下記の数式を計算します。

この右辺が、おおよそ求めたい D（微分）の結果と一緒になります。これを使って、各 D が間違った値でないのを、小数点第3位ぐらいまで確かめます。最急降下法を使う場合は、このグラディアントチェッキングを毎回するのがおすすめらしいです。もしうまく行かなかった時に、問題の切り分けができるので、確かめ算として持っておくのは便利だなと思います。

ただ、このグラディアントチェッキングを使って、D を代替してしまうのは良くない方法です。この方法では、計算量的にとても高価で、微分を近似しようとするのには凄い遅いやり方です。それとは対照的に、バックプロパゲーションは微分を計算するのに、とても効率的な方法です。なので、グラディアントチェッキングを使っていると、モデルの計算がとても遅くなります。ニューラルネットワークにおいて、最急降下法を繰り返し実行する前には、グラディアントチェッキングのコードを消すことを忘れないようにした方が良いです。

最後にこの節のサマリーを載せておきます。

Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:

With multiple theta matrices, we can approximate the derivative with respect to Θj as follows:

A small value for ϵ such as ϵ=10^−4, guarantees that the math works out properly. If the value for ϵ is too small, we can end up with numerical problems.
Hence, we are only adding or subtracting epsilon to the Θj matrix. In octave we can do it as follows:

epsilon = 1e-4;for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon)
end;

We previously saw how to calculate the deltaVector. So once we compute our gradApprox vector, we can check that gradApprox ≈ deltaVector.
Once you have verified once that your backpropagation algorithm is correct, you don’t need to compute gradApprox again. The code to compute gradApprox can be very slow.

Random Initialization

この節では、Θ のランダム化について説明します。線形回帰やロジスティック回帰では、θ をゼロで初期化して初めても全く問題ありませんでした。しかし、ニューラルネットワークでは Θ をゼロで初期化することが出来ません。何故なら、Θ をゼロで初期化すると、hiden layer（隠れ層）とinput layer が変わらずに output layer まで伝播するからです。つまり、線形回帰やロジスティック回帰では、もともと一層しかないので、ゼロで初期化しても問題ないです。一方、ニューラルネットワークでは、何層にも渡って計算が行われるが、その隠れ層が全く結果に影響を及ぼしていないことになります。なので、ニューラルネットワークでは Θ をゼロで初期化することが出来ません。もう少し具体的に説明すると、ゼロが問題であるのでは対照性があるのが問題です。なので、Θ を 1 で初期化しても、同じ問題に行き着きます。そこで、Θ をランダム化する必要が出てきます。そのために、Octave では下記のコードを使って初期化します。

If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

ちなみに、このコードの INIT_EPSILON は、前節の ε とは全く関係がありません。初期化するために使う適当な小さい実数です。

最後にこの節のサマリーを載せておきます

Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. Instead we can randomly initialize our weights for our Θ matrices using the following method:

Hence, we initialize each Θij(l) to a random value between [−ϵ,ϵ]. Using the above formula guarantees that we get the desired bound. The same procedure applies to all the Θ’s. Below is some working code you could use to experiment.

If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

rand(x,y) is just a function in octave that will initialize a matrix of random real numbers between 0 and 1.(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)

Putting it Together

この節では、今までニューラルネットワークについて学んだことを一つにまとめようと思います。まずは、ニューラルネットワークのレイヤーの選び方を説明します。

ニューラルネットワークを、「何層にして、各層に何個のユニットを置くか」をアーキテクチャと言います。このアーキテクチャは、基本的に層を増やして、ユニット数を増やすと精度が良くなります。しかし、デフォルトのアーキテクチャもあるのでそれを説明します。まず、入力層については、入力ユニットの総数はかっちりと定義されています。何故なら、フィーチャーの集まり x を決定してしまえば、入力ユニットの数は、フィーチャーx(i)の次元によって決定されるからです。同様にして、出力層もあなたの扱ってる分類問題のクラスの数によって決められます。そして、隠れユニットと隠れレイヤーの数の選択については、もっとも普通の基本形は隠れレイヤは一層にします。また、ユニット数はだいたい x の次元と同じ程度、あるいは x の次元の2〜4倍程度の適当な数が良いです。問題がない限りは、x の次元数と同じにしておくと良いと思います。そして、隠れレイヤー数を2層以上にする場合、普通の基本形は各レイヤの隠れユニットは同じ数にします。

次に、バックプロパゲーションを用いた最急降下法の計算手順を下記にまとめます。

Θ をランダム化して初期化します
フォワードプロパゲーションを用いて最終的な y まで計算します
コスト関数を実装します
バックプロパゲーションを用いて各レイヤーの偏微分を算出します
グラディアントチェッキングを使って、バックプロパゲーションの偏微分の結果が誤りでないことを確認します
fminunc へ各引数を渡します

上記の手順でニューラルネットワークのアルゴリズムを実装します。

最後にこの節のサマリーを載せておきます

First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.
- Number of input units = dimension of features x(i)
- Number of output units = number of classes
- Number of hidden units per layer = usually more the better (must balance with cost of computation as it increases with more hidden units)
- Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.
Training a Neural Network
1. Randomly initialize the weights
2. Implement forward propagation to get hΘ(x(i)) for any x(i)
3. Implement the cost function
4. Implement backpropagation to compute partial derivatives
5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
When we perform forward and back propagation, we loop on every training example:

for i = 1:m,
  Perform forward propagation and backpropagation using example (x(i),y(i))
  (Get activations a(l) and delta terms d(l) for l = 2,...,L

The following image gives us an intuition of what is happening as we are implementing our neural network:

Ideally, you want hΘ(x(i)) ≈ y(i). This will minimize our cost function. However, keep in mind that J(Θ) is not convex and thus we can end up in a local minimum instead.

最後に今週の分の宿題のコードを載せておきます。ベクトルと行列の計算が混同してとても時間がかかりました。

atsss/cousera_ml

You can't perform that action at this time. You signed in with another tab or window. You signed out in another tab or…

github.com

6週目の記事もあるので良かったら読んでみてください。

COUSERA Machine Learning Week06

ソフトウェアエンジニアが Cousera…

atsss.medium.com

COUSERA Machine Learning Week05

COUSERA Machine Learning Week04

ソフトウェアエンジニアが Cousera…

Cost Function and Backpropagation

Cost Function

Backpropagation Algorithm

誤差逆伝播法をはじめからていねいに - Qiita

この記事は、私が機械学習、特に「誤差逆伝播法」 (あるいは「バックプロパゲーション (Backpropagation)」) を勉強するうえで辿った道筋を記録しておくものです。誤差逆伝播法に関する分かりやすい解説は、ここ Qiita…

バックプロパゲーション（誤差逆伝搬法）をイチから理解する

Deep…

中学生でもなんとなくわかるDeepLearningのアルゴリズム

人工知能（AI）や DeepLearning（ディープラーニング），この頃よく聞きますよね．しかし，いまいち何なのかよくわからないという人は多いのではないでしょうか．…

Coursera Machine Learning -Week5-のまとめ（ニューラルネットワークの学習） - Qiita

この記事についてこの記事は、CourseraのMachine Learningのビデオ講義＋宿題についてまとめたものです。今週はWeek5（ニューラルネットワークの学習）についてです。 Week4（ニューラルネットワークの表...

Backpropagation Intuition

Backpropagation in Practice

Unrolling Parameters

Gradient Checking

Random Initialization

Putting it Together

atsss/cousera_ml

You can't perform that action at this time. You signed in with another tab or window. You signed out in another tab or…

COUSERA Machine Learning Week06

ソフトウェアエンジニアが Cousera…

Written by Ats

No responses yet

COUSERA Machine Learning Week05

COUSERA Machine Learning Week04

ソフトウェアエンジニアが Cousera…

Cost Function and Backpropagation

Cost Function

Backpropagation Algorithm

誤差逆伝播法をはじめからていねいに - Qiita

この記事は、私が機械学習、特に「 誤差逆伝播法」 (あるいは「バックプロパゲーション (Backpropagation)」) を勉強するうえで辿った道筋を記録しておくものです。 誤差逆伝播法に関する分かりやすい解説は、ここ Qiita…

バックプロパゲーション（誤差逆伝搬法）をイチから理解する

Deep…

中学生でもなんとなくわかるDeepLearningのアルゴリズム

人工知能（AI）や DeepLearning（ディープラーニング），この頃よく聞きますよね． しかし，いまいち何なのかよくわからないという人は多いのではないでしょうか．…

Coursera Machine Learning -Week5-のまとめ（ニューラルネットワークの学習） - Qiita

この記事について この記事は、CourseraのMachine Learningのビデオ講義＋宿題についてまとめたものです。 今週はWeek5（ニューラルネットワークの学習）についてです。 Week4（ニューラルネットワークの表...

Backpropagation Intuition

Backpropagation in Practice

Unrolling Parameters

Gradient Checking

Random Initialization

Putting it Together

atsss/cousera_ml

You can't perform that action at this time. You signed in with another tab or window. You signed out in another tab or…

COUSERA Machine Learning Week06

ソフトウェアエンジニアが Cousera…

Written by Ats

No responses yet

この記事は、私が機械学習、特に「誤差逆伝播法」 (あるいは「バックプロパゲーション (Backpropagation)」) を勉強するうえで辿った道筋を記録しておくものです。誤差逆伝播法に関する分かりやすい解説は、ここ Qiita…

人工知能（AI）や DeepLearning（ディープラーニング），この頃よく聞きますよね．しかし，いまいち何なのかよくわからないという人は多いのではないでしょうか．…

この記事についてこの記事は、CourseraのMachine Learningのビデオ講義＋宿題についてまとめたものです。今週はWeek5（ニューラルネットワークの学習）についてです。 Week4（ニューラルネットワークの表...