Posts in the 'Deep Learning/Natural Language Processing' category

These are my study notes on neural-network-based natural language processing (recent trends).

2001 | Neural Language Models
2008 | Multi-task Learning
2013 | Word Embeddings
2013 | Neural networks for NLP
2014 | Sequence-to-sequence Models
2015 | Attention
2015 | Memory-based Networks
2018 | Pretrained Language Models

2013 | Word Embeddings

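Sparse vector representations of text, the so-called bag-of-words model, have a long history in NLP. Dense vector representations of words, i.e. word embeddings, were already in use in 2001. The main innovation proposed by Mikolov et al. in 2013 was to make the training of these word embeddings more efficient by removing the hidden layer and approximating the objective.

These changes were simple, but together with the efficient word2vec implementation they made large-scale training of word embeddings possible.

Word2vec comes in two variants: CBOW (Continuous Bag-of-Words) and Skip-gram.

The two differ in their training objective.

1. CBOW

- Predicts the center word from the surrounding words

2. Skip-gram

- Predicts the surrounding words from the center word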

Figure 3: Continuous bag-of-words and skip-gram architectures (Mikolov et al., 2013a; 2013b)
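
The difference between the two objectives can be seen in the training examples each one extracts from a sentence. The following is a minimal sketch, not the word2vec implementation itself; the function names `cbow_pairs` and `skipgram_pairs` and the `window` parameter are illustrative assumptions.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, center_word): CBOW predicts the center from its context."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, center


def skipgram_pairs(tokens, window=2):
    """Yield (center_word, context_word): skip-gram predicts each context word from the center."""
    for context, center in cbow_pairs(tokens, window):
        for word in context:
            yield center, word


# Toy sentence (an assumption for illustration only)
sentence = "the quick brown fox jumps".split()
cbow = list(cbow_pairs(sentence, window=1))
skipgram = list(skipgram_pairs(sentence, window=1))

print(cbow[1])       # (['the', 'brown'], 'quick')
print(skipgram[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

CBOW produces one example per position with the whole context as input, while skip-gram produces one example per (center, context) pair.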

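These embeddings are conceptually no different from those learned with a feed-forward neural network, but training on a very large corpus lets them capture relations between words such as gender, verb tense, and country-capital.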

Figure 4: Relations captured by word2vec (Mikolov et al., 2013a; 2013b)

These relations and the meaning behind them sparked interest in word embeddings, and many studies have investigated the origin of these linear relationships (Arora et al., 2016; Mimno & Thompson, 2017; Antoniak & Mimno, 2018; Wendlandt et al., 2018).
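
A linear relationship of this kind is usually illustrated with vector arithmetic such as king - man + woman ≈ queen. The sketch below uses hand-made three-dimensional toy vectors (an assumption chosen so the arithmetic works out, not trained word2vec embeddings), and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made toy vectors, NOT trained embeddings: roughly (royalty, gender, food)
toy_vectors = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [-0.5, 0.0, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the query words."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("man", "king", "woman", toy_vectors)
print(result)  # queen
```

With real embeddings the match is only approximate, which is why the nearest neighbor by cosine similarity is taken rather than an exact equality.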

What cemented word embeddings as a staple of current NLP is that using pretrained embeddings as initialization improves performance across a wide range of downstream tasks.

word2vec has an intuitive, almost magical quality, but later studies showed that there is nothing inherently special about it: word embeddings can also be learned via matrix factorization (Pennington et al., 2014; Levy & Goldberg, 2014).

With proper tuning, classic matrix factorization approaches such as SVD and LSA achieve similar results (Levy et al., 2015).
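
To give a feel for the matrix factorization view, the sketch below builds the kind of matrix such methods start from: word-context co-occurrence counts reweighted with positive pointwise mutual information (PPMI). Levy & Goldberg (2014) relate skip-gram with negative sampling to factorizing a shifted PMI matrix; the factorization step itself (e.g. SVD) is omitted here, and the toy corpus and function names are assumptions for illustration.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=1):
    """Count (word, context) pairs within a symmetric window."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1
    return pair_counts


def ppmi(pair_counts):
    """PPMI(w, c) = max(0, log(P(w, c) / (P(w) * P(c)))) for every observed pair."""
    total = sum(pair_counts.values())
    word_totals, context_totals = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        word_totals[w] += n
        context_totals[c] += n
    scores = {}
    for (w, c), n in pair_counts.items():
        pmi = math.log((n * total) / (word_totals[w] * context_totals[c]))
        scores[(w, c)] = max(0.0, pmi)
    return scores


# Toy corpus (an assumption for illustration only)
corpus = [["cats", "chase", "mice"], ["dogs", "chase", "cats"]]
matrix = ppmi(cooccurrence_counts(corpus))
```

Each row of the resulting sparse matrix describes a word by the contexts it appears in; a low-rank factorization of that matrix would then yield dense vectors.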

Since then, a great deal of work has explored different facets of word embeddings.

Details: ruder.io/word-embeddings-2017/ (Word embeddings in 2017: Trends and future directions)

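Despite many advances in NLP, word2vec remains a popular choice and is still widely used. Its reach has also extended beyond the word level: skip-gram with negative sampling, an objective for learning embeddings based on local context, has been applied to learn sentence representations (Mikolov & Le, 2014; Kiros et al., 2015), and beyond NLP to networks (Grover & Leskovec, 2016) and biological sequences (Asgari & Mofrad, 2015).

One particularly interesting direction is projecting the word embeddings of different languages into the same space to enable (zero-shot) cross-lingual transfer. It is becoming increasingly feasible to learn a good projection in a completely unsupervised way (at least for similar languages), which opens up applications for low-resource languages and unsupervised machine translation (Lample et al., 2018; Artetxe et al., 2018).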

The content of these notes may contain inaccuracies.

If anything needs correcting, please leave a comment (reference recommendations are also welcome).

Thank you.

Other posts in the 'Deep Learning > Natural Language Processing' category

History of Natural Language Processing(NLP) - Chapter.03 (0)	2021.03.25
History of Natural Language Processing(NLP) - Chapter.02 (0)	2021.03.25
History of Natural Language Processing(NLP) - Chapter.01 (0)	2021.03.25


Language modeling is typically the task of choice when applying RNNs.

http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (The Unreasonable Effectiveness of Recurrent Neural Networks)

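Many of you have probably read Andrej Karpathy's blog. It was one of the resources I relied on while studying RNNs. The language modeling presented in that simple post has, through many advances, led to the models we have today.

Word embeddings: the word2vec objective is a simplification of language modeling.

Sequence-to-sequence models: these models generate an output sequence by predicting one word at a time.

Pretrained language models: these methods use representations from language models for transfer learning.

Recent progress in NLP has largely centered on language model development, which suggests that other methods and models are needed for "real" NLP (learning only from the raw form of text is expected to have its limits).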

2008 | Multi-task Learning

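Multi-task learning (MTL) is a general method for sharing parameters between models trained on multiple tasks. In neural networks this is easily done by tying the weights of different layers. The idea was proposed by Rich Caruana in 1993 and was applied at the time to road-following and pneumonia prediction.

Intuitively, MTL encourages a model to learn representations that are useful for many tasks. This is particularly helpful for learning general, low-level representations, for focusing a model's attention, and in settings with limited training data.

MTL was first applied to neural networks for NLP by Collobert and Weston in 2008.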

Figure 1: Sharing of word embedding matrices (Collobert & Weston, 2008; Collobert et al., 2011)

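As the figure shows, the look-up tables (word embedding matrices) are shared between two models trained on different tasks.

Sharing the word embeddings lets the models share the general low-level information in the word embedding matrix, which typically accounts for the largest number of parameters in a model. Collobert and Weston's work also spearheaded ideas such as pretraining word embeddings and using CNNs for text, which have been widely adopted over the last few years.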
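
The sharing described above can be sketched as two task models that hold a reference to the same look-up table, so an update made while training one task is immediately visible to the other. This is a simplified illustration with a linear scorer per task, not Collobert and Weston's architecture; the class names `SharedEmbedding` and `TaskModel`, the task names, and all hyperparameters are hypothetical.

```python
import random


class SharedEmbedding:
    """Look-up table mapping each word to a trainable vector."""

    def __init__(self, vocab, dim, seed=0):
        rng = random.Random(seed)
        self.table = {w: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for w in vocab}

    def lookup(self, word):
        return self.table[word]


class TaskModel:
    """Hypothetical task-specific linear scorer on top of an embedding table."""

    def __init__(self, embedding, dim, seed):
        rng = random.Random(seed)
        self.embedding = embedding  # a reference to the shared table, not a copy
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(dim)]

    def score(self, word):
        vector = self.embedding.lookup(word)
        return sum(w * x for w, x in zip(self.weights, vector))

    def sgd_step(self, word, target, lr=0.1):
        """One squared-error gradient step on the task weights and the word vector."""
        vector = self.embedding.lookup(word)
        error = self.score(word) - target
        grad_w = [error * x for x in vector]
        grad_x = [error * w for w in self.weights]
        for i in range(len(vector)):
            self.weights[i] -= lr * grad_w[i]
            vector[i] -= lr * grad_x[i]  # in-place update, visible to every task


shared = SharedEmbedding(["cat", "dog"], dim=4)
pos_tagger = TaskModel(shared, dim=4, seed=1)
chunker = TaskModel(shared, dim=4, seed=2)

vector_before = list(shared.lookup("cat"))
pos_tagger.sgd_step("cat", target=1.0)   # train on task A only
vector_after = chunker.embedding.lookup("cat")  # task B sees the updated vector
```

Only the embedding table is shared here; each task keeps its own `weights`, which mirrors the idea of sharing low-level parameters while keeping task-specific layers separate.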

Facebook researchers win Test of Time Award at ICML 2018 - Facebook Research (research.fb.com): Facebook research scientists Ronan Collobert and Jason Weston won the Test of Time Award at ICML 2018.

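MTL is now used across a wide range of NLP tasks and has become a useful tool in the NLP repertoire. Although the parameter sharing is usually predefined, different sharing patterns can also be learned. As models are increasingly evaluated on multiple tasks to gauge their generalization ability, MTL is gaining importance, and dedicated benchmarks have been proposed (Wang et al., 2018; McCann et al., 2018).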

More on MTL: ruder.io/multi-task/ (An Overview of Multi-Task Learning for Deep Learning)

2001 | Neural Language Models

Language modeling is the task of predicting the next word in a text given the previous words.

Classic approaches are based on n-grams and use smoothing to handle unseen n-grams (Kneser & Ney, 1995).

The first neural language model, a feed-forward neural network, was proposed by Bengio et al.
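
As a rough illustration of why smoothing matters, the sketch below estimates bigram probabilities with add-one (Laplace) smoothing. This is a deliberately simpler technique than the Kneser-Ney smoothing cited above, swapped in to keep the example short; the toy text and function names are assumptions.

```python
from collections import Counter


def train_bigram(tokens):
    """Collect history counts, bigram counts, and the vocabulary from a token list."""
    histories = Counter(tokens[:-1])            # how often each word precedes another
    bigrams = Counter(zip(tokens, tokens[1:]))  # how often each word pair occurs
    vocab = set(tokens)
    return histories, bigrams, vocab


def bigram_prob(prev, word, histories, bigrams, vocab):
    """P(word | prev) with add-one smoothing, so unseen bigrams get non-zero probability."""
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))


tokens = "the cat sat on the mat".split()
histories, bigrams, vocab = train_bigram(tokens)

seen = bigram_prob("the", "cat", histories, bigrams, vocab)    # 2/7
unseen = bigram_prob("the", "sat", histories, bigrams, vocab)  # 1/7
```

Without smoothing, the unseen bigram "the sat" would get probability zero, and so would any sentence containing it.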

Figure 1: Neural architecture (Bengio et al., 2001; 2003)

This model is a feed-forward neural network with one hidden layer that predicts the next word in a sequence.

Training is achieved by looking for $\theta$ that maximizes the training corpus penalized log-likelihood:

$$L = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, \ldots, w_{t-n+1}; \theta) + R(\theta),$$

where $R(\theta)$ is a regularization term

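The model's output is $f(w_t, w_{t-1}, \ldots, w_{t-n+1})$, that is, the probability $p(w_t \mid w_{t-1}, \ldots, w_{t-n+1})$ computed by a softmax.

*where $n$ is the n-gram order, so the $n-1$ previous words $w_{t-1}, \ldots, w_{t-n+1}$ are fed into the model.

The concept we now call word embeddings is said to have been introduced and used by Professor Bengio from this point on.

The architecture has evolved gradually since then, and models are still designed around three building blocks: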

1. Embedding Layer

- A layer that produces word embeddings by multiplying an index vector with the word embedding matrix.

2. Intermediate Layer(s)

- One or more layers that produce an intermediate representation of the input

ex) a fully-connected layer that applies a non-linearity to the concatenation of the word embeddings of the previous words

3. Softmax layer

- The final layer, which produces a probability distribution over words
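
The three building blocks above can be sketched as a single forward pass. This is a minimal illustration with untrained random weights and no bias terms, not Bengio et al.'s exact model; the vocabulary, the dimensions, and names such as `next_word_distribution` are assumptions.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution (max subtracted for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


vocab = ["<s>", "the", "cat", "sat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
rng = random.Random(0)
embed_dim, hidden_dim, context_size = 3, 5, 2

embedding = random_matrix(len(vocab), embed_dim, rng)                # 1. embedding layer
hidden_w = random_matrix(hidden_dim, context_size * embed_dim, rng)  # 2. intermediate layer
output_w = random_matrix(len(vocab), hidden_dim, rng)                # 3. softmax layer


def next_word_distribution(context):
    # 1. Row look-up, equivalent to multiplying a one-hot index vector by the embedding matrix
    concatenated = [x for word in context for x in embedding[word_to_id[word]]]
    # 2. Fully-connected layer with a tanh non-linearity over the concatenated embeddings
    hidden = [math.tanh(h) for h in matvec(hidden_w, concatenated)]
    # 3. Softmax over one score per vocabulary word
    return softmax(matvec(output_w, hidden))


probs = next_word_distribution(["the", "cat"])
```

Because the weights are random, `probs` is close to uniform; training would adjust all three matrices to maximize the log-likelihood objective given earlier.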

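Professor Bengio also points out two issues (areas for improvement):

1. The intermediate layer can be replaced with an LSTM.

2. The cost of computing the softmax layer is proportional to the vocabulary size, so it can become a bottleneck when the vocabulary is large (hundreds of thousands to millions of words).

Accordingly, reconciling the softmax over a large vocabulary with its computational cost was identified as one of the key challenges in building language models.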
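
To make the bottleneck concrete, the softmax can be written out as follows. The notation is an assumption not taken from the original post: $y_i$ denotes the unnormalized score the network assigns to word $i$, and $V$ is the vocabulary size.

$$p(w_t = i \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{\exp(y_i)}{\sum_{j=1}^{V} \exp(y_j)}$$

The denominator sums over all $V$ words, so computing even a single probability requires a score for every word in the vocabulary, which is why the cost grows linearly with $V$.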
