Neural tangent kernel

In the study of artificial neural networks (ANNs), the neural tangent kernel (NTK) is a kernel that describes the evolution of deep artificial neural networks during their training by gradient descent. It allows ANNs to be studied using theoretical tools from kernel methods, such as support vector machines.

For most neural network architectures, the NTK becomes constant in the limit of large layer width. This allows simple closed-form statements to be made about neural network predictions, training dynamics, generalization, and loss surfaces. For example, it guarantees that sufficiently wide ANNs converge to a global minimum when trained to minimize an empirical loss. The NTK of large-width networks is also related to several other large-width limits of neural networks.

The NTK was introduced in 2018 by Arthur Jacot, Franck Gabriel and Clément Hongler.^[1] It was also implicit in some contemporaneous work.^[2]^[3]^[4]

Definition

Scalar output case

An ANN with scalar output consists of a family of functions $f\left(\cdot ;\theta \right):\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ parametrized by a vector of parameters $\theta \in \mathbb {R} ^{P}$ .

The NTK is a kernel $\Theta :\mathbb {R} ^{n_{\mathrm {in} }}\times \mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ defined by $\Theta \left(x,y;\theta \right)=\sum _{p=1}^{P}\partial _{\theta _{p}}f\left(x;\theta \right)\partial _{\theta _{p}}f\left(y;\theta \right).$

In the language of kernel methods, the NTK $\Theta$ is the kernel associated with the feature map $\left(x\mapsto \partial _{\theta _{p}}f\left(x;\theta \right)\right)_{p=1,\ldots ,P}$ .
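The definition can be made concrete with a small numerical sketch. The example below assumes a hypothetical one-hidden-layer tanh network $f\left(x;\theta \right)=\sum _{j}a_{j}\tanh \left(W_{j}\cdot x\right)$ with hand-picked parameters; the names `hidden`, `f`, `grad_f` and `ntk` are illustrative and are not taken from any library. It evaluates the feature map analytically and forms the kernel as an inner product of gradients.

```python
import math

def hidden(W, x):
    """Pre-activations h_j = sum_k W[j][k] * x[k] of the hidden layer."""
    return [sum(w * xk for w, xk in zip(row, x)) for row in W]

def f(x, W, a):
    """Scalar output f(x; theta) = sum_j a[j] * tanh(h_j)."""
    return sum(aj * math.tanh(hj) for aj, hj in zip(a, hidden(W, x)))

def grad_f(x, W, a):
    """Feature map x -> (d f / d theta_p)_p, flattened as [dW..., da...]."""
    h = hidden(W, x)
    dW = [aj * (1.0 - math.tanh(hj) ** 2) * xk
          for aj, hj in zip(a, h) for xk in x]
    da = [math.tanh(hj) for hj in h]
    return dW + da

def ntk(x, y, W, a):
    """Theta(x, y; theta) = sum_p d_p f(x; theta) * d_p f(y; theta)."""
    return sum(gx * gy for gx, gy in zip(grad_f(x, W, a), grad_f(y, W, a)))

# Hypothetical parameters: 3 hidden units, 2 inputs, so P = 9.
W = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
a = [1.0, -0.5, 0.25]
x, y = [1.0, 2.0], [-0.5, 0.4]

# Gram matrix of the kernel on the two inputs.
gram = [[ntk(u, v, W, a) for v in (x, y)] for u in (x, y)]
print(gram)
```

Because the kernel is an inner product of feature vectors, the resulting Gram matrix is symmetric and positive semi-definite for any choice of parameters.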

Vector output case

An ANN with vector output of size $n_{\mathrm {out} }$ consists of a family of functions $f\left(\cdot ;\theta \right):\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R} ^{n_{\mathrm {out} }}$ parametrized by a vector of parameters $\theta \in \mathbb {R} ^{P}$ .

In this case, the NTK $\Theta :\mathbb {R} ^{n_{\mathrm {in} }}\times \mathbb {R} ^{n_{\mathrm {in} }}\to {\mathcal {M}}_{n_{\mathrm {out} }}\left(\mathbb {R} \right)$ is a matrix-valued kernel, taking values in the space of $n_{\mathrm {out} }\times n_{\mathrm {out} }$ matrices, defined by $\Theta _{k,l}\left(x,y;\theta \right)=\sum _{p=1}^{P}\partial _{\theta _{p}}f_{k}\left(x;\theta \right)\partial _{\theta _{p}}f_{l}\left(y;\theta \right).$
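As an illustration of the matrix-valued case, the sketch below assumes a hypothetical network $f_{k}\left(x;\theta \right)=\sum _{j}A_{k,j}\tanh \left(W_{j}\cdot x\right)$ with two outputs. The helper names `jacobian` and `ntk_matrix` are made up for this example. Each entry $\Theta _{k,l}\left(x,y;\theta \right)$ is the inner product of row $k$ of the Jacobian at $x$ with row $l$ of the Jacobian at $y$.

```python
import math

def jacobian(x, W, A):
    """Rows are d f_k / d theta for f_k(x) = sum_j A[k][j] * tanh(W[j] . x).

    Parameters are flattened as [W entries..., A entries...].
    """
    t = [math.tanh(sum(w * xk for w, xk in zip(row, x))) for row in W]
    rows = []
    for k in range(len(A)):
        dW = [A[k][j] * (1.0 - t[j] ** 2) * xk
              for j in range(len(W)) for xk in x]
        # f_k depends only on row k of A, so other rows contribute zeros.
        dA = [t[j] if l == k else 0.0
              for l in range(len(A)) for j in range(len(W))]
        rows.append(dW + dA)
    return rows

def ntk_matrix(x, y, W, A):
    """Theta_{k,l}(x, y; theta) = sum_p d_p f_k(x) * d_p f_l(y)."""
    Jx, Jy = jacobian(x, W, A), jacobian(y, W, A)
    return [[sum(p * q for p, q in zip(rk, rl)) for rl in Jy] for rk in Jx]

# Hypothetical parameters: 2 inputs, 3 hidden units, 2 outputs.
W = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
A = [[1.0, -0.5, 0.25], [0.3, 0.6, -0.9]]
x, y = [1.0, 2.0], [-0.5, 0.4]

theta_xy = ntk_matrix(x, y, W, A)  # a 2 x 2 matrix
print(theta_xy)
```

Swapping the two inputs transposes the matrix, that is, $\Theta _{k,l}\left(x,y;\theta \right)=\Theta _{l,k}\left(y,x;\theta \right)$.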

Derivation

When the parameters $\theta \in \mathbb {R} ^{P}$ of an ANN are optimized to minimize an empirical loss through gradient descent, the NTK governs the dynamics of the ANN output function $f_{\theta }$ throughout training.

Scalar output case

For a dataset $\left(x_{i}\right)_{i=1,\ldots ,n}\subset \mathbb {R} ^{n_{\mathrm {in} }}$ with scalar labels $\left(z_{i}\right)_{i=1,\ldots ,n}\subset \mathbb {R}$ and a loss function $c:\mathbb {R} \times \mathbb {R} \to \mathbb {R}$ , the associated empirical loss, defined on functions $f:\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ , is given by ${\mathcal {C}}\left(f\right)=\sum _{i=1}^{n}c\left(f\left(x_{i}\right),z_{i}\right).$

When the ANN $f\left(\cdot ;\theta \right):\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ is trained to fit the dataset (that is, to minimize ${\mathcal {C}}$ ) via continuous-time gradient descent, the parameters $\left(\theta \left(t\right)\right)_{t\geq 0}$ evolve according to the ordinary differential equation:

$\partial _{t}\theta \left(t\right)=-\nabla _{\theta }{\mathcal {C}}\left(f\left(\cdot ;\theta \left(t\right)\right)\right).$

During training, the ANN output function follows an evolution differential equation given in terms of the NTK:

$\partial _{t}f\left(x;\theta \left(t\right)\right)=-\sum _{i=1}^{n}\Theta \left(x,x_{i};\theta \left(t\right)\right)\partial _{w}c\left(w,z_{i}\right){\Big |}_{w=f\left(x_{i};\theta \left(t\right)\right)}.$

This equation shows how the NTK drives the dynamics of $f\left(\cdot ;\theta \left(t\right)\right)$ in the space of functions $\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ during training.
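The step from the parameter dynamics to this function-space equation is an application of the chain rule. The following is a sketch, assuming that $f$ is differentiable in $\theta$ and that $c$ is differentiable in its first argument. By the chain rule,

$\partial _{t}f\left(x;\theta \left(t\right)\right)=\sum _{p=1}^{P}\partial _{\theta _{p}}f\left(x;\theta \left(t\right)\right)\,\partial _{t}\theta _{p}\left(t\right),$

and each component of the gradient flow on the empirical loss is

$\partial _{t}\theta _{p}\left(t\right)=-\sum _{i=1}^{n}\partial _{\theta _{p}}f\left(x_{i};\theta \left(t\right)\right)\,\partial _{w}c\left(w,z_{i}\right){\Big |}_{w=f\left(x_{i};\theta \left(t\right)\right)}.$

Substituting the second expression into the first and exchanging the order of summation gives

$\partial _{t}f\left(x;\theta \left(t\right)\right)=-\sum _{i=1}^{n}\left[\sum _{p=1}^{P}\partial _{\theta _{p}}f\left(x;\theta \left(t\right)\right)\partial _{\theta _{p}}f\left(x_{i};\theta \left(t\right)\right)\right]\partial _{w}c\left(w,z_{i}\right){\Big |}_{w=f\left(x_{i};\theta \left(t\right)\right)},$

where the bracketed sum over $p$ is exactly $\Theta \left(x,x_{i};\theta \left(t\right)\right)$ .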

Vector output case

For a dataset $\left(x_{i}\right)_{i=1,\ldots ,n}\subset \mathbb {R} ^{n_{\mathrm {in} }}$ with vector labels $\left(z_{i}\right)_{i=1,\ldots ,n}\subset \mathbb {R} ^{n_{\mathrm {out} }}$ and a loss function $c:\mathbb {R} ^{n_{\mathrm {out} }}\times \mathbb {R} ^{n_{\mathrm {out} }}\to \mathbb {R}$ , the corresponding empirical loss on functions $f:\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R} ^{n_{\mathrm {out} }}$ is defined by:

${\mathcal {C}}\left(f\right)=\sum _{i=1}^{n}c\left(f\left(x_{i}\right),z_{i}\right).$

Training $f_{\theta \left(t\right)}$ through continuous-time gradient descent yields the following evolution in function space, driven by the NTK:

$\partial _{t}f_{k}\left(x;\theta \left(t\right)\right)=-\sum _{i=1}^{n}\sum _{l=1}^{n_{\mathrm {out} }}\Theta _{k,l}\left(x,x_{i};\theta \left(t\right)\right)\partial _{w_{l}}c\left(\left(w_{1},\ldots ,w_{n_{\mathrm {out} }}\right),z_{i}\right){\Big |}_{w=f\left(x_{i};\theta \left(t\right)\right)}.$

Interpretation

The NTK $\Theta \left(x,x_{i};\theta \right)$ represents the influence of the loss gradient $\partial _{w}c\left(w,z_{i}\right){\big |}_{w=f\left(x_{i};\theta \right)}$ with respect to example $i$ on the evolution of the ANN output $f\left(x;\theta \right)$ through a gradient descent step. In the scalar case, this reads:

$f\left(x;\theta \left(t+\epsilon \right)\right)-f\left(x;\theta \left(t\right)\right)\approx -\epsilon \sum _{i=1}^{n}\Theta \left(x,x_{i};\theta \left(t\right)\right)\partial _{w}c\left(w,z_{i}\right){\big |}_{w=f\left(x_{i};\theta \left(t\right)\right)}.$

In particular, each data point $x_{i}$ influences the evolution of the output $f\left(x;\theta \right)$ for every $x$ throughout training, in a way that is captured by the NTK $\Theta \left(x,x_{i};\theta \right)$ .
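This one-step reading can be checked numerically. The sketch below makes several assumptions that are not in the text above: a hypothetical three-unit tanh network with scalar input, a two-point dataset, and the squared loss $c\left(w,z\right)=\left(w-z\right)^{2}/2$ , for which $\partial _{w}c\left(w,z\right)=w-z$ . It takes one explicit gradient descent step of size $\epsilon$ and compares the actual change in the output at a test point with the NTK prediction, which is $-\epsilon$ times the kernel-weighted sum of loss derivatives, in agreement with the sign of the evolution equation in the derivation.

```python
import math

def f(x, w, a):
    """Scalar network f(x; theta) = sum_j a[j] * tanh(w[j] * x)."""
    return sum(aj * math.tanh(wj * x) for wj, aj in zip(w, a))

def grad_f(x, w, a):
    """Gradient of f with respect to theta = (w, a), as one flat list."""
    t = [math.tanh(wj * x) for wj in w]
    return [aj * (1.0 - tj * tj) * x for aj, tj in zip(a, t)] + t

def ntk(x, y, w, a):
    """Theta(x, y; theta) as an inner product of parameter gradients."""
    return sum(p * q for p, q in zip(grad_f(x, w, a), grad_f(y, w, a)))

# Hypothetical data and parameters, chosen only for illustration.
xs, zs = [-1.0, 0.5], [0.3, -0.2]
w, a = [0.4, -1.1, 0.7], [0.9, 0.2, -0.6]
eps, x = 1e-4, 0.8

# Loss derivative d_w c(w, z_i) at w = f(x_i; theta), for c = (w - z)^2 / 2.
resid = [f(xi, w, a) - zi for xi, zi in zip(xs, zs)]

# Gradient of the empirical loss: sum_i resid_i * d_p f(x_i) for each p.
grad_C = [sum(r * g for r, g in zip(resid, col))
          for col in zip(*(grad_f(xi, w, a) for xi in xs))]

# One gradient descent step: theta <- theta - eps * grad C(theta).
theta = [p - eps * g for p, g in zip(w + a, grad_C)]
w_new, a_new = theta[:len(w)], theta[len(w):]

actual = f(x, w_new, a_new) - f(x, w, a)
predicted = -eps * sum(ntk(x, xi, w, a) * r for xi, r in zip(xs, resid))
print(actual, predicted)
```

The two numbers should agree up to terms of order $\epsilon ^{2}$ , since the NTK expression is the first-order Taylor expansion of the output in the step size.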

Large-width limit

Recent theoretical and empirical work in deep learning has shown that the performance of ANNs strictly improves as the width of their layers increases.^[5]^[6] For various ANN architectures, the NTK yields precise insight into training in this large-width regime.^[1]^[7]^[8]^[9]^[10]^[11]
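The tendency of the kernel to stay close to its initial value in wide networks can be probed with a small experiment. The sketch below is only an illustration under stated assumptions: a one-hidden-layer tanh network in the NTK parametrization $f\left(x;\theta \right)=m^{-1/2}\sum _{j=1}^{m}a_{j}\tanh \left(w_{j}x\right)$ with standard Gaussian initialization, a hypothetical two-point dataset, squared loss, and plain gradient descent. The function `relative_ntk_change` and the chosen widths, seeds, step count and learning rate are all made up for this example. It measures how much $\Theta \left(x_{1},x_{1};\theta \right)$ moves during training, averaged over a few seeds.

```python
import math
import random

def f(x, w, a):
    """f(x; theta) = m**-0.5 * sum_j a[j] * tanh(w[j] * x)."""
    return sum(aj * math.tanh(wj * x) for wj, aj in zip(w, a)) / math.sqrt(len(w))

def grad_f(x, w, a):
    """Gradient of f with respect to theta = (w, a), as one flat list."""
    s = 1.0 / math.sqrt(len(w))
    t = [math.tanh(wj * x) for wj in w]
    return ([s * aj * (1.0 - tj * tj) * x for aj, tj in zip(a, t)]
            + [s * tj for tj in t])

def ntk(x, y, w, a):
    return sum(p * q for p, q in zip(grad_f(x, w, a), grad_f(y, w, a)))

def relative_ntk_change(m, seed, steps=50, lr=0.1):
    """Relative change of Theta(x_1, x_1) after `steps` gradient steps."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(m)]
    a = [rng.gauss(0.0, 1.0) for _ in range(m)]
    xs, zs = [-1.0, 0.5], [0.3, -0.2]  # hypothetical dataset
    before = ntk(xs[0], xs[0], w, a)
    for _ in range(steps):
        # Squared loss c = (w - z)^2 / 2, so the loss derivative is f - z.
        resid = [f(xi, w, a) - zi for xi, zi in zip(xs, zs)]
        grads = [grad_f(xi, w, a) for xi in xs]
        theta = [p - lr * sum(r * g[k] for r, g in zip(resid, grads))
                 for k, p in enumerate(w + a)]
        w, a = theta[:m], theta[m:]
    return abs(ntk(xs[0], xs[0], w, a) - before) / before

seeds = range(5)
narrow = sum(relative_ntk_change(10, s) for s in seeds) / len(seeds)
wide = sum(relative_ntk_change(400, s) for s in seeds) / len(seeds)
print(narrow, wide)
```

Under these assumptions, one would expect the averaged relative change to be noticeably smaller for the wider network, which is the finite-width counterpart of the kernel becoming constant in the limit. A single small experiment of this kind is suggestive only and does not replace the theoretical results cited above.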

References

[1] Jacot, Arthur; Gabriel, Franck; Hongler, Clement (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (PDF). In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K. (eds.). Advances in Neural Information Processing Systems 31. Curran Associates, Inc. pp. 8571–8580. Bibcode:2018arXiv180607572J. arXiv:1806.07572. Retrieved 27 November 2019.
[2] Li, Yuanzhi; Liang, Yingyu (2018). "Learning overparameterized neural networks via stochastic gradient descent on structured data". Advances in Neural Information Processing Systems.
[3] Allen-Zhu, Zeyuan; Li, Yuanzhi; Song, Zhao (2018). "A convergence theory for deep learning via overparameterization". International Conference on Machine Learning.
[4] Du, Simon S; Zhai, Xiyu; Poczos, Barnabas; Singh, Aarti (2019). "Gradient descent provably optimizes over-parameterized neural networks". International Conference on Learning Representations.
[5] Novak, Roman; Bahri, Yasaman; Abolafia, Daniel A.; Pennington, Jeffrey; Sohl-Dickstein, Jascha (15 February 2018). "Sensitivity and Generalization in Neural Networks: an Empirical Study". Bibcode:2018arXiv180208760N. arXiv:1802.08760.
[6] Canziani, Alfredo; Paszke, Adam; Culurciello, Eugenio (4 November 2016). "An Analysis of Deep Neural Network Models for Practical Applications". Bibcode:2016arXiv160507678C. arXiv:1605.07678.
[7] Allen-Zhu, Zeyuan; Li, Yuanzhi; Song, Zhao (9 November 2018). "A Convergence Theory for Deep Learning via Over-Parameterization". International Conference on Machine Learning: 242–252. arXiv:1811.03962.
[8] Du, Simon; Lee, Jason; Li, Haochuan; Wang, Liwei; Zhai, Xiyu (24 May 2019). "Gradient Descent Finds Global Minima of Deep Neural Networks". International Conference on Machine Learning: 1675–1685. arXiv:1811.03804.
[9] Lee, Jaehoon; Xiao, Lechao; Schoenholz, Samuel S.; Bahri, Yasaman; Novak, Roman; Sohl-Dickstein, Jascha; Pennington, Jeffrey (2019). "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent". arXiv:1902.06720.
[10] Arora, Sanjeev; Du, Simon S; Hu, Wei; Li, Zhiyuan; Salakhutdinov, Russ R; Wang, Ruosong (2019). "On Exact Computation with an Infinitely Wide Neural Net". NeurIPS: 8139–8148. arXiv:1904.11955.
[11] Huang, Jiaoyang; Yau, Horng-Tzer (17 September 2019). "Dynamics of Deep Neural Networks and Neural Tangent Hierarchy". arXiv:1909.08156.

