Emotion Recognition from Tweets
Jakub Sydor1 , Szymon Cwynar1
1
    Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, POLAND


                                             Abstract
These days we face internet bullying more and more often. Our goal was to develop software that recognizes emotions from bare text. Our project is based on Twitter posts, but it could also be used on any platform in which users communicate via text messages. We use a few solutions to make our program as accurate as it possibly can be. Firstly, we picked a large database to get the broadest possible context, we used Word2Vec to represent words as vectors, and lastly we used a neural network to predict the output for sentences beyond our database. Our article is mostly about different versions of the algorithm and their comparison, in order to choose the best approach to the problem. As we learned, the biggest difference-makers were the number of hidden layers and the number of neurons inside each of them, the type of activation function, and the training algorithm. We attach a large number of plots to visualize each of our attempts. In this article we show our approaches and the data connected to them. We created functions to monitor our error: the accuracy function to sum up how efficient our algorithm is, the precision function to diagnose what proportion of identifications was correct, recall as the fraction of relevant instances that were retrieved, and F1, which combines precision and recall into an average valued from 0 to 1.

                                             Keywords
                                             Artificial neural network, Word2vec, emotion, tweets



1. Introduction

The assumption of our project was to create an algorithm based on an artificial neural network. Its main goal was to recognize whether an entry is neutral, negative, or positive. The algorithm learns on a base that contains 1.6M tweets, using the backpropagation algorithm. We decided to use neural networks as classifiers, as they have been reported in various interesting applications [1, 2, 3, 4].

In [5] neural networks are used in federated systems in which they share information with each other during training. Models of neural networks are also very efficient in detecting threats over the internet [6]. We can also find them as classifiers of images [7] and in IoT systems that detect the position of people [8, 9, 10].

We got our database from Kaggle, but it was full of unnecessary data such as the date or the user. We cleaned it and left only 2 columns, target and text, getting rid of the columns that contained information such as the date, the user, or the tweet id.

To make our algorithm work we needed to divide it into a few subsections. The first of them is the section connected to the database. Firstly, as mentioned before, we dropped most of the columns, but secondly, we needed to make sure that our data did not contain unused data or data which could possibly make our algorithm less reliable. Therefore we cleared it of things such as names, links, and mentions.

The next algorithm used in our program is Word2Vec, which is responsible for translating our sentences and words into numbers. Every word is represented by a 10-dimensional vector. The algorithm behind Word2Vec is nothing else than an artificial neural network, which will be explained later. Because of the length of our one-word vector and the maximum length of a Twitter post (280 words), we created an input layer whose size is simply the product of those 2 values, which is 2800 neurons.

Then we move to the heart of our program, the artificial neural network. The whole structure is handwritten by us; we don't use any libraries. Its main functions are run and addlayer, which are responsible for adding layers and running the whole algorithm. The run function returns 2 output neurons, which represent, through softmax, the probability of each label. The first neuron gives the probability of a positive output and the second one of a negative output. We also add a function that checks the absolute value of their difference; if it is small enough, then the output is considered neutral. The artificial network includes an input layer with 2800 neurons, a first hidden layer with 600 neurons, a second hidden layer with 200 neurons, a third hidden layer with 20 neurons, and an output layer which consists of 2 neurons.
SYSTEM 2021 @ Scholar's Yearly Symposium of Technology, Engineering and Mathematics. July 27–29, 2021, Catania, IT
jakusyd988@student.polsl.pl (J. Sydor); szymcwy664@student.polsl.pl (S. Cwynar)
https://github.com/Harasz/ (J. Sydor); https://github.com/SzymCwy/ (S. Cwynar)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


2. Data Base

Our database consists of 1 600 000 tweets, each item represented by 5 columns.



One record includes the date of the tweet, the nickname of the posting user, the id of the tweet, the content of the tweet, and the label stating whether it is positive or negative. We needed to modify our database to contain only 2 of the 5 columns, because only the text and the labels will be used. Apart from that, we needed to adapt the database to our use and clear it of meaningless text.

So firstly we loaded the database using pandas and provided our data frame with labels to make access easier. We also implemented a function whose main task was to clear any irrelevant text, such as pronouns, conjunctions, links, and mentions, to make sure our algorithm will learn properly.
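A minimal sketch of this preprocessing step, assuming a Sentiment140-style CSV layout from Kaggle (the column names, file name, and regular expressions below are our illustrative choices, not the project's exact code):

    import re
    import pandas as pd

    # Assumed column layout of the raw CSV: target, id, date, flag, user, text.
    columns = ["target", "id", "date", "flag", "user", "text"]
    df = pd.read_csv("tweets.csv", encoding="latin-1", names=columns)

    # Keep only the two columns the model actually uses.
    df = df[["target", "text"]]

    def clean_tweet(text: str) -> str:
        """Remove mentions, links, and other irrelevant characters from a tweet."""
        text = re.sub(r"@\w+", " ", text)          # mentions
        text = re.sub(r"https?://\S+", " ", text)  # links
        text = re.sub(r"[^a-zA-Z\s]", " ", text)   # digits, punctuation
        return re.sub(r"\s+", " ", text).strip().lower()

    df["text"] = df["text"].astype(str).apply(clean_tweet)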
3. Algorithm overall

Our algorithm's inputs are sentences that in the next steps are converted into words. Using Word2Vec, each word in the sentence is converted into a ten-dimensional vector, and those vectors are inserted into an array, each vector as a separate element. Those words are easily available thanks to two mechanisms, word to id and id to word. Then, using the artificial network, a weighted sum, and the activation function, the algorithm fills every single neuron with proper values. In the output layer we have 2 neurons that, at the end of the algorithm, return two values between 0 and 1. Because of softmax, those values can be identified as the probability of each label. During learning those two values are compared with the expected outcomes. That way we get the distance between our result and the real label, and we use those values in the backward propagation algorithm to change all of the weights, so that our program gets more and more precise with each iteration.
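A small sketch of how a tokenized tweet could be packed into the fixed 2800-value input described in the introduction (the zero-padding and truncation below are our own assumptions about how the 280-word limit is handled):

    import numpy as np

    VECTOR_SIZE = 10                       # dimensionality of a single word vector
    MAX_WORDS = 280                        # assumed maximum number of words per tweet
    INPUT_SIZE = VECTOR_SIZE * MAX_WORDS   # 2800 input neurons

    def build_input(word_vectors):
        """Concatenate the word vectors and zero-pad them to the fixed input length."""
        flat = np.concatenate(word_vectors) if word_vectors else np.zeros(0)
        flat = flat[:INPUT_SIZE]           # truncate overly long tweets
        padded = np.zeros(INPUT_SIZE)
        padded[:flat.size] = flat
        return padded

    # Example: a 3-word tweet becomes 30 values followed by 2770 zeros.
    example = [np.random.rand(VECTOR_SIZE) for _ in range(3)]
    print(build_input(example).shape)      # (2800,)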
4. Word2Vec algorithm

We are using the gensim library to implement the Word2Vec algorithm. This part of our code lets us change the words in the database into vectors, so they can be used in our calculations. As input data the algorithm takes the whole data frame with all sentences, each row represented as one sentence. Firstly the sentences need to be divided into words. Next we count how many times each word occurs in the text and, based on that information, we create 2 dictionaries, word to id and id to word, which make the conversion from text to id and back easier. In the built-in function we need to specify the size of the vector, the minimal number of occurrences, the window, and the source of words. In our example we set the minimum occurrence to 1, the size of the vector to 10, and the window to 7, to make sure our dictionary would be big, to connect large numbers of words with each other, and also because we needed a 10-dimensional vector for every word so it would fit our input layer. Word2Vec is nothing else than an artificial neural network, and it allows us to make mathematical operations on words. Gensim's Word2Vec implements two approaches: Continuous Bag Of Words (CBOW) and Skip-Gram.

In the CBOW model the surrounding words are combined to predict the word they surround, while in Skip-Gram we use a word to predict the context.
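A minimal sketch of this step with gensim, using the settings quoted above (the snippet uses the gensim 4 parameter name vector_size; older gensim 3 releases call it size, and the project's exact code may differ):

    from gensim.models import Word2Vec

    # In the project, each cleaned tweet from the data frame becomes one list of words;
    # a tiny literal corpus is used here so the snippet is self-contained.
    sentences = [["i", "love", "to", "write", "scripts"],
                 ["i", "love", "tweets"]]

    # vector size 10, window 7, minimum occurrence 1, as described in the text.
    w2v = Word2Vec(sentences, vector_size=10, window=7, min_count=1)

    # The two dictionaries mentioned above: word -> id and id -> word.
    word_to_id = w2v.wv.key_to_index
    id_to_word = w2v.wv.index_to_key

    vector = w2v.wv["love"]   # the 10-dimensional vector of a word seen in training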
5. Mathematical representation of Skip-Gram model

Mathematically, we can describe an n-word sentence w_1, ..., w_n using skip-grams with the following formula:

    \mathrm{SkipGram} = \{ w_{i_1}, w_{i_2}, \ldots, w_{i_n} \mid \sum_{j=1}^{n} i_j - i_{j-1} < k \}    (1)

where k is the maximum skip distance and n the subsequence length.

For example, when we have the sentence "I love to write scripts" and k is equal to 1 and n to 2, that means we will connect 2 words which have a maximum of one word between them. Those connections would be: {I, love}, {I, to}, {love, to}, {love, write}, {to, write}, {to, scripts}, {write, scripts}.
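A small sketch that reproduces the worked example above for n = 2 (word pairs); the helper name is ours:

    from itertools import combinations

    def skip_gram_pairs(sentence, k):
        """All pairs of words separated by at most k intermediate words (n = 2)."""
        pairs = []
        for i, j in combinations(range(len(sentence)), 2):
            if j - i - 1 <= k:             # at most k words between the two picked words
                pairs.append((sentence[i], sentence[j]))
        return pairs

    print(skip_gram_pairs("I love to write scripts".split(), k=1))
    # [('I', 'love'), ('I', 'to'), ('love', 'to'), ('love', 'write'),
    #  ('to', 'write'), ('to', 'scripts'), ('write', 'scripts')]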
6. Backward propagation

The backward propagation algorithm is used in our program to modify the weights of each neuron to get the best results. Our neurons have pregenerated weights from 0 to 1. To make our algorithm more precise, the backward propagation algorithm corrects the weights by analysing the errors, starting from the end of our artificial neural network. As input it takes the probability of each label and the expected label. It calculates the error of each of the output neurons, and those errors are propagated to the previous layers. Each weight in our network is modified based on the value of the error. This algorithm has its limits, so you need to be careful while setting its number of iterations. After a few runs the values are modified to a lesser extent, so when those changes become minor, that is the sign to stop the algorithm.
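A compact sketch of one way such an update rule can be implemented for a toy network (our own simplification, assuming a cross-entropy loss on top of softmax; it is not the project's exact code):

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Tiny illustrative network: 4 inputs -> 3 hidden (tanh) -> 2 outputs (softmax).
    W1 = rng.random((3, 4)); b1 = np.zeros(3)
    W2 = rng.random((2, 3)); b2 = np.zeros(2)
    lr = 0.1

    x = rng.random(4)
    target = np.array([1.0, 0.0])              # expected label, e.g. "positive"

    for step in range(200):
        h = np.tanh(W1 @ x + b1)               # forward pass
        y = softmax(W2 @ h + b2)

        delta_out = y - target                 # error at the output neurons
        delta_hid = (W2.T @ delta_out) * (1 - h ** 2)   # propagated through tanh

        W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
        W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid

        if np.abs(delta_out).max() < 1e-3:     # stop once the corrections become minor
            break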







Figure 1: Graphical representation of the CBOW model and Skip-gram model [11].




Figure 2: Pseudo-code of the back-propagation algorithm in training ANN [12].



7. Activation function

The activation function is an inseparable element of every artificial neural network. There are lots of them available, but each of them is different. We use an s-shaped function, the hyperbolic tangent, as our activation function. It determines the output of the artificial neural network. All of the output values are between -1 and 1. The advantage of our activation function is its mapping: all strongly positive and negative values will be presented as strong values, and those which are close to 0 will stay close to 0 on the tanh graph. We also chose the tanh function because it is strongly advised when a neural network has only 2 outputs.


8. Maths behind activation function

Our activation function, the hyperbolic tangent, might be represented as:

    \tanh x = \frac{\sinh x}{\cosh x}    (2)

    \tanh x = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}    (3)




Applying the quotient rule to (3), its derivative is:

    \frac{\partial}{\partial x}\tanh x = \frac{(e^{x} + e^{-x})(e^{x} + e^{-x}) - (e^{x} - e^{-x})(e^{x} - e^{-x})}{(e^{x} + e^{-x})^{2}}    (4)



Figure 3: Comparison between sigmoid and tanh activation functions [13].

Its domain is the range from -1 to 1. It is a monotonic function whose derivative is non-monotonic. The derivative of tanh simplifies to:

    \frac{\partial}{\partial x}\tanh x = 1 - \tanh^{2} x    (5)
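As a quick numerical check of equations (2)-(5):

    import numpy as np

    def tanh(x):
        """Hyperbolic tangent, equations (2)-(3)."""
        return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

    def tanh_derivative(x):
        """Simplified derivative from equation (5)."""
        return 1.0 - np.tanh(x) ** 2

    x = np.array([-2.0, 0.0, 2.0])
    print(tanh(x))              # values stay strictly between -1 and 1
    print(tanh_derivative(x))   # largest at 0, vanishing for large |x|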
9. Artificial Neural Network

Our neural network algorithm is divided into two main classes: the NeuralNetwork class and the Neuron class.

NeuralNetwork has 2 variables, layers and weights, which store respectively the arrays of Neurons and their weights. The first function is addlayer, which was written to allow creating layers with a specific number of neurons inside, given as an argument. When used, it adds elements of the Neuron class into the array of layers. The get size function is used to return the number of neurons in the whole artificial neural network; thanks to that function we are able to properly use generate weights. With the result of the previous function we use generate weights to create an array of randomly generated numbers from 0 to 1. The load weights function is responsible for assigning weights to neurons. The last of them is run; it firstly checks whether the algorithm has the same number of input neurons as the inputs given by the user. If that check passes, it starts to assign values to neurons.

The next class, Neuron, is responsible for calculating the weighted sum and using the activation function to assign a value to each neuron.
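A simplified sketch of how this two-class design could look; the method names follow the description above, while the bodies are our own illustrative assumptions rather than the project's code:

    import math
    import random

    class Neuron:
        """Holds a weight vector and computes tanh(weighted sum) of its inputs."""
        def __init__(self):
            self.weights = []

        def activate(self, inputs):
            return math.tanh(sum(w * x for w, x in zip(self.weights, inputs)))

    class NeuralNetwork:
        def __init__(self):
            self.layers = []                  # list of lists of Neuron objects

        def addlayer(self, size):
            self.layers.append([Neuron() for _ in range(size)])

        def get_size(self):
            return sum(len(layer) for layer in self.layers)

        def generate_weights(self):
            # One weight per neuron of the previous layer, drawn from [0, 1).
            return [[[random.random() for _ in prev] for _ in layer]
                    for prev, layer in zip(self.layers, self.layers[1:])]

        def load_weights(self, weights):
            for layer, layer_weights in zip(self.layers[1:], weights):
                for neuron, w in zip(layer, layer_weights):
                    neuron.weights = w

        def run(self, inputs):
            if len(inputs) != len(self.layers[0]):
                return None                   # input size must match the input layer
            values = list(inputs)
            for layer in self.layers[1:]:
                values = [neuron.activate(values) for neuron in layer]
            return values                     # raw last-layer outputs (before softmax)

    net = NeuralNetwork()
    for size in (2800, 600, 200, 20, 2):      # the layer sizes described in the introduction
        net.addlayer(size)
    net.load_weights(net.generate_weights())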
10. Inference

Inference in our algorithm is simply choosing the option with the higher probability. Thanks to softmax we get on our output layer two neurons with probabilities for each label. Firstly we need to convert the output, as it comes in a form that cannot be compared directly to the label from our database. The output is a two-dimensional array, where the first element is the probability of a positive tweet and the second of a negative one. So we need to make a variable 'expected', so that it is also represented as a two-dimensional array. Next we check whether the absolute value of the difference between the two outputs is bigger than 0.1; if it is not, we mark the entry as neutral. If one of the values is big enough, we assign the respective label. As we compare these two values we also calculate the accuracy of our algorithm. Below we present the pseudocode of inference.

Input data: sentence label k, array of vectors j (the sentence represented as an array of vectors). Output: the label of the sentence.
    if k == 0: expected = [0, 1], otherwise expected = [1, 0]
    neo = artificial neural net output as the probability of each label
    absolute = absolute value of the difference between both output values
    if absolute < 0.1: Neutral; otherwise if neo[0] > neo[1]: Positive, else: Negative
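A minimal sketch of this decision rule (following the convention above that the first output is the positive probability):

    def infer(neo, threshold=0.1):
        """Map the two softmax outputs to a label, using the neutrality threshold."""
        if abs(neo[0] - neo[1]) < threshold:
            return "neutral"
        return "positive" if neo[0] > neo[1] else "negative"

    print(infer([0.53, 0.47]))   # neutral: the two probabilities are too close
    print(infer([0.85, 0.15]))   # positive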
11. SoftMax

Softmax is an exponential function which normalizes the values of our 2 output neurons so that they sum to 1. We use that function in our program to represent both of our output neuron values as the probability of getting a positive or a negative label:

    \mathrm{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j}\exp(x_j)}    (6)

Additionally, apart from calculating softmax, we need its derivative. It is used by the backpropagation function when calculating the difference between the expected values and the outputs from our net. We start by computing the derivatives separately, first for the first neuron:

    \frac{\partial S(z_1)}{\partial z_1} = \frac{\frac{\partial e^{z_1}}{\partial z_1}(e^{z_1} + e^{z_2}) - \frac{\partial}{\partial z_1}(e^{z_1} + e^{z_2})\, e^{z_1}}{(e^{z_1} + e^{z_2})^{2}}    (7)







so we have:

    \frac{\partial}{\partial z_1} S(z_1) = S(z_1) \times (1 - S(z_1))    (8)

Now for the second one:

    \frac{\partial S(z_2)}{\partial z_1} = \frac{\frac{\partial e^{z_2}}{\partial z_1}(e^{z_1} + e^{z_2}) - \frac{\partial}{\partial z_1}(e^{z_1} + e^{z_2})\, e^{z_2}}{(e^{z_1} + e^{z_2})^{2}}    (9)

so we have:

    \frac{\partial}{\partial z_1} S(z_2) = -S(z_1) \times S(z_2)    (10)

Generalizing to N outputs, where

    S(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}    (11)

the general formula for the softmax derivative is:

    \frac{\partial}{\partial z_j} S(z_i) = \begin{cases} S(z_i) \times (1 - S(z_i)) & \text{if } i = j \\ -S(z_i) \times S(z_j) & \text{if } i \neq j \end{cases}    (12)

If we are computing \frac{\partial}{\partial z_i} S(z_i), the output is always S(z_i) \times (1 - S(z_i)); however, when we are computing \frac{\partial}{\partial z_j} S(z_i) with i \neq j, the output changes to -S(z_i) \times S(z_j).
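A short numerical check of equations (6) and (12):

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))            # shifted for numerical stability
        return e / e.sum()

    def softmax_jacobian(z):
        """Matrix of dS(z_i)/dz_j from equation (12)."""
        s = softmax(z)
        return np.diag(s) - np.outer(s, s)   # diagonal: s_i(1-s_i); off-diagonal: -s_i*s_j

    z = np.array([0.4, -1.2])
    print(softmax(z).sum())                  # 1.0
    print(softmax_jacobian(z))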
12. Precision function

Precision is a function which shows the proportion of true positive identifications. If we analyse information retrieval, precision is the fraction of correct results divided by all returned results. We calculate precision from two variables, TP and FP, which respectively stand for true positives and false positives. By the term true positive we mean an outcome where the model correctly predicted the positive class, and a false positive is an incorrect prediction of the positive class.

Precision is given by the following formula:

    Precision = \frac{TP}{TP + FP}    (13)

When the precision rate is equal to 1.0, it means that the model produces no false positives.

Example of calculating precision for the following data:

    True Positives: 10    False Positives: 3
    False Negatives: 2    True Negatives: 15

    Precision = \frac{TP}{TP + FP}    (14)

    Precision = \frac{10}{10 + 3}    (15)

    Precision = 10/13 \approx 0.769    (16)


13. Recall function

The recall function is very similar to the precision function. The only difference is that we compare the true positive values to the false negatives (incorrect predictions, where the model incorrectly predicts the negative class). Our model is most efficient when the recall factor is 1.0, which means there are no false negatives.

The equation is also very similar:

    Recall = \frac{TP}{TP + FN}    (17)

And when we calculate recall on the same set of data as precision, this is our outcome:

    True Positives: 10    False Positives: 3
    False Negatives: 2    True Negatives: 15

    Recall = \frac{TP}{TP + FN}    (18)

    Recall = \frac{10}{10 + 2}    (19)

    Recall = 10/12 \approx 0.833    (20)


14. Comparison of recall and precision

The comparison of those 2 functions is very difficult because of the tension between them: if you improve one of them, the other one deteriorates. Using the data above we got:

    Precision ≈ 0.769
    Recall ≈ 0.833

For the data:

    True Positives: 10    False Positives: 3
    False Negatives: 2    True Negatives: 15

when we decrease the number of FP and the number of FN increases, we get:

    True Positives: 10    False Positives: 1
    False Negatives: 4    True Negatives: 15

    Precision ≈ 0.91
    Recall ≈ 0.71







And when we do the opposite thing, decreasing the number of FN and increasing the number of FP:

    True Positives: 10    False Positives: 4
    False Negatives: 1    True Negatives: 15

we get:

    Precision ≈ 0.71
    Recall ≈ 0.91

So we came to the conclusion that the two are not directly comparable, but there is another method which uses both of them in its calculation, named the F1 score.

15. F1 score

The name of the F1 score, also known as the F-measure, is believed to refer to a different F function from Van Rijsbergen's book, under which name it was introduced at the Fourth Message Understanding Conference.

The F1 score is a measurement of a test's accuracy. It is calculated from recall and precision. The F-measure is the harmonic mean (the reciprocal of the arithmetic mean of the reciprocals of the given set of observations) of precision and recall. It can be modified by additional weights, valuing precision or recall more than the other.

The highest value of the F1 score is 1.0, which indicates the best precision and recall, while 0 indicates that one of precision or recall is equal to 0.

The F-measure is also known as the Sørensen–Dice coefficient or Dice similarity coefficient (DSC).

    F_1 = \frac{2}{recall^{-1} + precision^{-1}}    (21)

    F_1 = 2 \times \frac{precision \times recall}{recall + precision}    (22)

    F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)}    (23)

Example of calculating the F1 score:

    True Positives: 10    False Positives: 3
    False Negatives: 2    True Negatives: 15

    F_1 = \frac{10}{10 + \frac{1}{2}(3 + 2)}    (24)

    F_1 = \frac{10}{12.5}    (25)

    F_1 = 0.8

The F_\beta score is used when we want recall to be considered \beta times more important than precision, where \beta is a positive real factor.

    F_\beta = (1 + \beta^{2}) \times \frac{precision \times recall}{(\beta^{2} \times precision) + recall}    (26)

    F_\beta = \frac{(1 + \beta^{2}) \times TP}{(1 + \beta^{2}) \times TP + \beta^{2} \times FN + FP}    (27)

When \beta is equal to 2, recall is weighted higher than precision; when it is equal to 0.5, precision is weighted higher than recall. Example of calculating the F_\beta score for:

    True Positives: 10    False Positives: 3
    False Negatives: 2    True Negatives: 15

with \beta = 2:

    F_\beta = \frac{(1 + 4) \times 10}{(1 + 4) \times 10 + 4 \times 2 + 3}    (28)

    F_\beta = \frac{50}{61}    (29)

    F_\beta ≈ 0.82
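A short sketch that recomputes the worked examples above from the confusion-matrix counts:

    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    def f_beta(tp, fp, fn, beta=1.0):
        """Equation (27); beta = 1 gives the ordinary F1 score."""
        b2 = beta ** 2
        return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

    tp, fp, fn, tn = 10, 3, 2, 15             # the confusion matrix used in the examples
    print(round(precision(tp, fp), 3))        # 0.769
    print(round(recall(tp, fn), 3))           # 0.833
    print(round(f_beta(tp, fp, fn), 3))       # 0.8  (F1)
    print(round(f_beta(tp, fp, fn, 2), 3))    # 0.82 (F2, i.e. 50/61)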
Figure 4: Sigmoid activation function [13].


16. Experiments

We experimented with:

• Activation function
• Artificial neural network learning algorithms
• Structure of the artificial neural network







We chose the hyperbolic tangent as our main activation function. We decided on that function after a comparison of three functions, ReLU, Sigmoid, and Tanh, as we thought it would fit our algorithm best. We tested all of them with the accuracy and F1 score functions. We also read articles proving that this type of function is best for a neural network with 2 neurons in the output layer. The thing that settled the decision was its shape: thanks to the tanh function we are able to easily spot negative values and those which are close to 0.

Figure 5: ReLU activation function [14].

Figure 6: Comparison of activation functions [13].

We also used the softmax function to represent the values of our output neurons as the probability of each label. Apart from that, we used the derivative of softmax in the backpropagation algorithm to decrease the error.

We tried Particle Swarm Optimization and backward propagation as our learning algorithms. After reading articles and running some tests, we decided to use backward propagation, as it was easier to use with softmax and also more efficient than PSO.

After many tries we ended our tests with 3 hidden layers: the first with 600 neurons, the second with 200, and the third with 10; the numbers of neurons in the input and output layers are constant, 2800 inputs and 2 outputs.

Figure 7: Accuracy/Cost for Test Over Time [15].


17. Conclusions

In our work we have tested the application of neural networks for word-processing purposes. We used a special library to work with tweets. Our idea was tested, and the results show that we have a good model which is able to work with tweets. In future works we will try to develop our project further, to make it also compare tweets between various authors. We will also work on applying other models and ideas, to compare them with the presented neural network.


References

 [1] S. Brusca, G. Capizzi, G. Lo Sciuto, G. Susi, A new design methodology to predict wind farm energy production by means of a spiking neural network-based system, International Journal of Numerical Modelling: Electronic Networks, Devices and Fields 32 (2019). doi:10.1002/jnm.2267.
 [2] G. Capizzi, C. Napoli, L. Paternò, An innovative hybrid neuro-wavelet method for reconstruction of missing data in astronomical photometric surveys, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7267 LNAI (2012) 21–29. doi:10.1007/978-3-642-29347-4_3.
 [3] G. Capizzi, F. Bonanno, C. Napoli, Hybrid neural networks architectures for soc and voltage prediction of new generation batteries storage, 2011. doi:10.1109/ICCEP.2011.6036301.
 [4] C. Napoli, F. Bonanno, G. Capizzi, An hybrid neuro-wavelet approach for long-term prediction of solar wind, Proceedings of the International Astronomical Union 6 (2010) 153–155.
 [5] D. Połap, M. Woźniak, Meta-heuristic as manager







     in federated learning approaches for image process-
     ing purposes, Applied Soft Computing 113 (2021)
     107872.
 [6] M. Wozniak, J. Silka, M. Wieczorek, M. Alrashoud,
     Recurrent neural network model for iot and net-
     working malware threat detection, IEEE Transac-
     tions on Industrial Informatics 17 (2021) 5583–5594.
 [7] X. Liu, S. Chen, L. Song, M. Woźniak, S. Liu, Self-
     attention negative feedback network for real-time
     image super-resolution, Journal of King Saud
     University-Computer and Information Sciences
     (2021).
 [8] G. Capizzi, C. Napoli, S. Russo, M. Woźniak, Lessen-
     ing stress and anxiety-related behaviors by means
     of ai-driven drones for aromatherapy, volume 2594,
     2020, pp. 7–12.
 [9] M. Woźniak, M. Wieczorek, J. Siłka, D. Połap, Body
     pose prediction based on motion sensor data and
     recurrent neural network, IEEE Transactions on
     Industrial Informatics 17 (2020) 2101–2111.
[10] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vac-
     caro, Yolov3-based mask and face recognition al-
     gorithm for individual protection applications, in:
     CEUR Workshop Proc., 2020, pp. 41–45.
[11] T. Mikolov, Q. V. Le, I. Sutskever, Exploiting simi-
     larities among languages for machine translation,
     arXiv preprint arXiv:1309.4168 (2013).
[12] H. Guo, H. Nguyen, D.-A. Vu, X.-N. Bui, Forecast-
     ing mining capital cost for open-pit mining projects
     based on artificial neural network approach, Re-
     sources Policy (2019) 101474.
[13] S. Sharma, S. Sharma, A. Athaiya, Activation func-
     tions in neural networks, towards data science 6
     (2017) 310–316.
[14] K. Sarkar, Relu: Not a differentiable function: Why
     used in gradient based optimization? and other
     generalizations of relu, Data Science Group, IITR
     (2018).
[15] J. D. Seo, Unfair back propagation with tensorflow
     [manual back propagation with tf], 2018.



