diff --git a/04_naive_bayes.Rmd b/04_naive_bayes.Rmd index 7bcd4557..7ba713fb 100644 --- a/04_naive_bayes.Rmd +++ b/04_naive_bayes.Rmd @@ -4,7 +4,7 @@ editor_options: wrap: 72 --- -# Spam filter +# Spam filter: An introductory classification model ## Naive Bayes: Spam or Ham? @@ -24,7 +24,7 @@ using Random Random.seed!(123) ``` -Nobody likes spam emails. How can Bayes help? In this chapter, we'll +Nobody likes spam emails. How can Bayes' theorem help? In this chapter, we'll keep expanding our data science knowledge with a practical example. A simple yet effective way of using Bayesian probability to create a spam filter from scratch will be introduced. The filter will examine emails @@ -32,11 +32,13 @@ and classify them as either spam or ham (the word for non-spam emails) based on their content. What we will be implementing here is a *supervised learning model*, in -other words, a classification model that has been trained on previously -classified data. Think of it like a machine to which you can give some -input, like an email, and will give you some label to that input, like -spam or ham. This machine has a lot of tiny knobs, and based on their -particular configuration it will output some label for each input. +other words, a classification model that can "learn" +to associate the target variable (email type) with the input variables +(words contained in the email). Think of it like a machine that is fed +an email and outputs a label associated with it, like +spam or ham. This machine has a lot of tiny knobs (also called parameters of +the model), and based on their particular configuration, the model will output +one label or another. Supervised learning involves iteratively finding the right configuration of these knobs by letting the machine make a guess with some pre-classified data, checking if the guess matches the true label, and @@ -44,22 +46,31 @@ if not, tune the knobs in some controlled way. 
The way our machine will make predictions is based on the underlying mathematical model. For a spam filter, a *naive Bayes* approach has proven to be effective, and you will have the opportunity to verify that yourself at the end of the -chapter. In a naive Bayes model, Bayes' theorem is the main tool for +chapter. In such a model, Bayes' theorem is the main tool for classifying, and it is *naive* because we make very loose assumptions -about the data we are analyzing. This will be clearer once we dive into -the implementation. +about the data we are analyzing. When creating models, we often make +decisions to make our lives easier, and these usually come at the cost +of a simpler model. It is good practice to start by building +simple models and add complexity only as needed. This will be clearer once +we dive into the implementation. +For instance, a successful spam filter model might infer from the training +data that emails containing the word “discount” have a high probability +of being spam. -## The Training Data +## The data For the Bayesian spam filter to work correctly, we need to feed it some good training data. In this context, that means having a large enough -corpus of emails that have been pre-classified as spam or ham. The +corpus[^1] of emails that have been pre-classified as spam or ham. The emails should be collected from a sufficiently heterogeneous group of people. After all, spam is a somewhat subjective category: one person's spam may be another person's ham. The proportion of spam vs. ham in our data should also be somewhat representative of the real proportion of emails we receive. +[^1]: A large and structured set of text data, very often used to train +language models. + Fortunately, there are a lot of very good datasets available online. 
We'll use the "Email Spam Classification Dataset CSV" from [Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv), @@ -198,16 +209,16 @@ discover $P(email|spam)$. The new email looks like this: The new email contains the words *win* and *product*, which are rather common in our example's training data. We would therefore expect $P(email|spam)$, the probability of the new email being generated by the -words encountered in the training spam email set, to be relatively high. +words encountered in the training spam email set, to be relatively high.[^2] -(The word \\emph{win} appears in the form \\emph{won} in the training +[^2]: The word **win** appears in the form **won** in the training set, but that's OK. The standard linguistic technique of -\\emph{lemmatization} groups together any related forms of a word and -treats them as the same word.) +**lemmatization** groups together any related forms of a word and +treats them as the same word. Mathematically, the way to calculate $P(email|spam)$ is to take each word in our target email, calculate the probability of it appearing in -spam emails based on our training set, and multiply those probabilties +spam emails based on our training set, and multiply those probabilities together. $P(email|spam) = \prod_{i=1}^{n}P(word_i|spam)$ @@ -218,9 +229,9 @@ the training ham email set: $P(email|ham) = \prod_{i=1}^{n}P(word_i|ham)$ -The multiplication of each of the probabilities associated with a -particular word here stems from the naive assumption that all the words -in the email are statistically independent. In reality, this assumption +The multiplication of each of the word probabilities here stems from the +naive assumption that all the words in the email are conditionally +independent given the class (spam or ham). In reality, this assumption isn't necessarily true. In fact, it's most likely false. 
Words in a language are never independent from one another, but this simple assumption seems to be enough for the level of complexity our problem @@ -229,8 +240,15 @@ requires. The probability of a given word $word_i$ being in a given category is calculated like so: -$$P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}$$ -$$P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}$$ +\begin{equation} + \tag{1.1} + P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}} +\end{equation} + +\begin{equation} + \tag{1.2} + P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}} +\end{equation} These formulas tell us exactly what we have to calculate from our data. We need the numbers $N_{word_i|spam}$ and $N_{word_i|ham}$ for each @@ -242,7 +260,8 @@ in the dataset. The variable $\alpha$ is a smoothing parameter that prevents the probability of a given word being in a given category from going down to zero. If a given word hasn't appeared in the spam category in our training dataset, for example, we don't want to assign it zero -probability of appearing in new spam emails. +probability of appearing in new spam emails. See the [appendix](#appendix-alpha) +for more details. As all of this information will be specific to our dataset, a clever way to aggregate it is to use a Julia *struct*, with attributes for the @@ -298,7 +317,7 @@ modifies its arguments in-place (in this case, the spam filter struct itself). This function *fits* our model to the data, a typical procedure in data science and machine learning areas. -```{julia, results = TRUE} +```{julia, results = FALSE} function fit!(model::BayesSpamFilter, x_train, y_train, voc) model.vocabulary = voc model.words_count_ham = words_count(x_train, model.vocabulary, y_train, 0) @@ -337,9 +356,9 @@ testing portion to evaluate the model's accuracy later. Now that we have our model, we can use it to make some spam vs. 
ham predictions and assess its performance. We'll define a few more functions to help with this process. First, we need a function -implementing the TAL formula that we discussed earlier. +implementing formulas $(1.1)$ and $(1.2)$. -```{julia, results = TRUE} +```{julia, results = FALSE} function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α) ham_prob = (words_count_ham[word] + α) / (N_ham + α * (n_vocabulary)) spam_prob = (words_count_spam[word] + α) / (N_spam + α * (n_vocabulary)) @@ -432,14 +451,16 @@ five emails in the test data. predictions[1:5] ``` -Of the first five emails, one (the third) was classified as spam, and +Of the first five emails, the third and the fifth were classified as spam, while the rest were classified as ham. ## Evaluating the Accuracy Looking at the predictions themselves is pretty meaningless; what we -really want to know is the model's accuracy. We'll define another -function to calculate this. +really want is a metric that lets us evaluate the effectiveness +of our model quantitatively. Usually, the first step is +to calculate the model's accuracy. +We'll define another function for this calculation. 
```{julia, results = FALSE} function spam_filter_accuracy(predictions, actual) @@ -484,16 +505,14 @@ function that builds a confusion matrix for our spam filter: ```{julia, results = FALSE} function spam_filter_confusion_matrix(y_test, predictions) - # 2x2 matrix is instantiated with zeros confusion_matrix = zeros((2, 2)) - confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test)) - confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test)) - confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test)) - confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test)) + confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in eachindex(y_test)) + confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in eachindex(y_test)) + confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in eachindex(y_test)) + confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in eachindex(y_test)) - # Now we convert the confusion matrix into a DataFrame - confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[]) + confusion_df = DataFrame(prediction=String[], ham_mail=Integer[], spam_mail=Integer[]) confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2])) confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2])) @@ -549,7 +568,7 @@ functions to fit the spam filter object to the data. Finally, we made predictions on new data and evaluated our model's performance by calculating the accuracy and making a confusion matrix. 
-## Appendix - A little more about alpha +## Appendix - A little more about alpha {#appendix-alpha} As we have seen, to calculate the probability of the email being a spam email, we should use diff --git a/04_naive_bayes/tmp.jl b/04_naive_bayes/tmp.jl index 0354f58f..8278d1a1 100644 --- a/04_naive_bayes/tmp.jl +++ b/04_naive_bayes/tmp.jl @@ -128,13 +128,13 @@ function spam_filter_confusion_matrix(y_test, predictions) # 2x2 matrix is instantiated with zeros confusion_matrix = zeros((2, 2)) - confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test)) - confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test)) - confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test)) - confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test)) + confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in eachindex(y_test)) + confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in eachindex(y_test)) + confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in eachindex(y_test)) + confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in eachindex(y_test)) # Now we convert the confusion matrix into a DataFrame - confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[]) + confusion_df = DataFrame(prediction=String[], ham_mail=Integer[], spam_mail=Integer[]) confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2])) confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2])) diff --git a/docs/404.html b/docs/404.html index 3850d081..2713a0a2 100644 --- a/docs/404.html +++ 
b/docs/404.html @@ -1,25 +1,24 @@ -
The page you requested cannot be found (perhaps it was moved or renamed).
-You may want to try searching to find the page's new location, or use - the table of contents to find the page you are looking for.
-The page you requested cannot be found (perhaps it was moved or renamed).
+You may want to try searching to find the page's new location, or use +the table of contents to find the page you are looking for.
+
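As a worked illustration of formula $(1.1)$ from the `04_naive_bayes.Rmd` hunks above, here is a small numeric sketch of what the smoothing parameter does. The counts are hypothetical, chosen only for illustration and not taken from the dataset: assume $\alpha = 1$, $N_{spam} = 6$, $N_{vocabulary} = 4$, and a word that never appears in the spam training emails, so $N_{word_i|spam} = 0$.

$$P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}} = \frac{0 + 1}{6 + 1 \cdot 4} = \frac{1}{10} = 0.1$$

Without smoothing ($\alpha = 0$), the same word would get $\frac{0}{6} = 0$. Since $P(email|spam)$ is a product of the word probabilities, that single zero would force the whole product to zero, which is the situation the chapter's $\alpha$ term is meant to avoid.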