diff --git a/04_naive_bayes.Rmd b/04_naive_bayes.Rmd index 7bcd4557..7ba713fb 100644 --- a/04_naive_bayes.Rmd +++ b/04_naive_bayes.Rmd @@ -4,7 +4,7 @@ editor_options: wrap: 72 --- -# Spam filter +# Spam filter: An introductory classification model ## Naive Bayes: Spam or Ham? @@ -24,7 +24,7 @@ using Random Random.seed!(123) ``` -Nobody likes spam emails. How can Bayes help? In this chapter, we'll +Nobody likes spam emails. How can Bayes' theorem help? In this chapter, we'll keep expanding our data science knowledge with a practical example. A simple yet effective way of using Bayesian probability to create a spam filter from scratch will be introduced. The filter will examine emails @@ -32,11 +32,13 @@ and classify them as either spam or ham (the word for non-spam emails) based on their content. What we will be implementing here is a *supervised learning model*, in -other words, a classification model that has been trained on previously -classified data. Think of it like a machine to which you can give some -input, like an email, and will give you some label to that input, like -spam or ham. This machine has a lot of tiny knobs, and based on their -particular configuration it will output some label for each input. +other words, a classification a model that can "learn" +to associate the target variable (email type) with the input variables +(words contained in the email). Think of it like a machine that is feeded +with an email and will output a label associated to it, like +spam or ham. This machine has a lot of tiny knobs -also called parameters of +the model- and based on their particular configuration, the model will output +some label or another. Supervised learning involves iteratively finding the right configuration of these knobs by letting the machine make a guess with some pre-classified data, checking if the guess matches the true label, and @@ -44,22 +46,31 @@ if not, tune the knobs in some controlled way. The way our machine will make predictions is based on the underlying mathematical model. For a spam filter, a *naive Bayes* approach has proven to be effective, and you will have the opportunity to verify that yourself at the end of the -chapter. In a naive Bayes model, Bayes' theorem is the main tool for +chapter. In such a model, Bayes' theorem is the main tool for classifying, and it is *naive* because we make very loose assumptions -about the data we are analyzing. This will be clearer once we dive into -the implementation. +about the data we are analyzing. When creating models we usually take +decisions to make our lives easier, and they usually come at the expense +of making the model simpler. It is usually a good practice to start building +simpler models and add complexity only as needed. This will be clearer once +we dive into the implementation. +In summary, a successful spam filter model might infer from the training +data that emails containing the word “discount” have a high probability +of being spam. -## The Training Data +## The data For the Bayesian spam filter to work correctly, we need to feed it some good training data. In this context, that means having a large enough -corpus of emails that have been pre-classified as spam or ham. The +corpus [^1] of emails that have been pre-classified as spam or ham. The emails should be collected from a sufficiently heterogeneous group of people. After all, spam is a somewhat subjective category: one person's spam may be another person's ham. The proportion of spam vs. ham in our data should also be somewhat representative of the real proportion of emails we receive. +[^1]: A large and structured set of text data, very oftenly used to train +language models. + Fortunately, there are a lot of very good datasets available online. We'll use the "Email Spam Classification Dataset CSV" from [Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv), @@ -198,16 +209,16 @@ discover $P(email|spam)$. The new email looks like this: The new email contains the words *win* and *product*, which are rather common in our example's training data. We would therefore expect $P(email|spam)$, the probability of the new email being generated by the -words encountered in the training spam email set, to be relatively high. +words encountered in the training spam email set, to be relatively high. [^2] -(The word \\emph{win} appears in the form \\emph{won} in the training +[^2]: The word **win** appears in the form **won** in the training set, but that's OK. The standard linguistic technique of -\\emph{lemmatization} groups together any related forms of a word and -treats them as the same word.) +**lemmatization** groups together any related forms of a word and +treats them as the same word. Mathematically, the way to calculate $P(email|spam)$ is to take each word in our target email, calculate the probability of it appearing in -spam emails based on our training set, and multiply those probabilties +spam emails based on our training set, and multiply those probabilities together. $P(email|spam) = \prod_{i=1}^{n}P(word_i|spam)$ @@ -218,9 +229,9 @@ the training ham email set: $P(email|ham) = \prod_{i=1}^{n}P(word_i|ham)$ -The multiplication of each of the probabilities associated with a -particular word here stems from the naive assumption that all the words -in the email are statistically independent. In reality, this assumption +The multiplication of each of the word probabilities here stands from the +naive supposition that all the words in the email are conditionally +independent given the class (spam or ham). In reality, this assumption isn't necessarily true. In fact, it's most likely false. Words in a language are never independent from one another, but this simple assumption seems to be enough for the level of complexity our problem @@ -229,8 +240,15 @@ requires. The probability of a given word $word_i$ being in a given category is calculated like so: -$$P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}$$ -$$P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}$$ +\begin{equation} + \tag{1.1} + P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}} +\end{equation} + +\begin{equation} + \tag{1.2} + P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}} +\end{equation} These formulas tell us exactly what we have to calculate from our data. We need the numbers $N_{word_i|spam}$ and $N_{word_i|ham}$ for each @@ -242,7 +260,8 @@ in the dataset. The variable $\alpha$ is a smoothing parameter that prevents the probability of a given word being in a given category from going down to zero. If a given word hasn't appeared in the spam category in our training dataset, for example, we don't want to assign it zero -probability of appearing in new spam emails. +probability of appearing in new spam emails. See the [appendix](#appendix-alpha) +for more details. As all of this information will be specific to our dataset, a clever way to aggregate it is to use a Julia *struct*, with attributes for the @@ -298,7 +317,7 @@ modifies its arguments in-place (in this case, the spam filter struct itself). This function *fits* our model to the data, a typical procedure in data science and machine learning areas. -```{julia, results = TRUE} +```{julia, results = FALSE} function fit!(model::BayesSpamFilter, x_train, y_train, voc) model.vocabulary = voc model.words_count_ham = words_count(x_train, model.vocabulary, y_train, 0) @@ -337,9 +356,9 @@ testing portion to evaluate the model's accuracy later. Now that we have our model, we can use it to make some spam vs. ham predictions and assess its performance. We'll define a few more functions to help with this process. First, we need a function -implementing the TAL formula that we discussed earlier. +implementing formulas $(1.1)$ and $(1.2)$. -```{julia, results = TRUE} +```{julia, results = FALSE} function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α) ham_prob = (words_count_ham[word] + α) / (N_ham + α * (n_vocabulary)) spam_prob = (words_count_spam[word] + α) / (N_spam + α * (n_vocabulary)) @@ -432,14 +451,16 @@ five emails in the test data. predictions[1:5] ``` -Of the first five emails, one (the third) was classified as spam, and +Of the first five emails, the third and the fifth were classified as spam, while the rest were classified as ham. ## Evaluating the Accuracy Looking at the predictions themselves is pretty meaningless; what we -really want to know is the model's accuracy. We'll define another -function to calculate this. +really want is to have some metric which can help us evaluate the effectiveness +of our model in a quantitative manner. Usually, the first approach to this is +to calculate the model's accuracy. +We'll define another function for this calculation. ```{julia, results = FALSE} function spam_filter_accuracy(predictions, actual) @@ -484,16 +505,14 @@ function that builds a confusion matrix for our spam filter: ```{julia, results = FALSE} function spam_filter_confusion_matrix(y_test, predictions) - # 2x2 matrix is instantiated with zeros confusion_matrix = zeros((2, 2)) - confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test)) - confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test)) - confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test)) - confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test)) + confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in eachindex(y_test)) + confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in eachindex(y_test)) + confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in eachindex(y_test)) + confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in eachindex(y_test)) - # Now we convert the confusion matrix into a DataFrame - confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[]) + confusion_df = DataFrame(prediction=String[], ham_mail=Integer[], spam_mail=Integer[]) confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2])) confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2])) @@ -549,7 +568,7 @@ functions to fit the spam filter object to the data. Finally, we made predictions on new data and evaluated our model's performance by calculating the accuracy and making a confusion matrix. -## Appendix - A little more about alpha +## Appendix - A little more about alpha {#appendix-alpha} As we have seen, to calculate the probability of the email being a spam email, we should use diff --git a/04_naive_bayes/tmp.jl b/04_naive_bayes/tmp.jl index 0354f58f..8278d1a1 100644 --- a/04_naive_bayes/tmp.jl +++ b/04_naive_bayes/tmp.jl @@ -128,13 +128,13 @@ function spam_filter_confusion_matrix(y_test, predictions) # 2x2 matrix is instantiated with zeros confusion_matrix = zeros((2, 2)) - confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test)) - confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test)) - confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test)) - confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test)) + confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in eachindex(y_test)) + confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in eachindex(y_test)) + confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in eachindex(y_test)) + confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in eachindex(y_test)) # Now we convert the confusion matrix into a DataFrame - confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[]) + confusion_df = DataFrame(prediction=String[], ham_mail=Integer[], spam_mail=Integer[]) confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2])) confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2])) diff --git a/docs/404.html b/docs/404.html index 3850d081..2713a0a2 100644 --- a/docs/404.html +++ b/docs/404.html @@ -1,25 +1,24 @@ - Page not found | Data Science in Julia for Hackers - + - + - + - + - + @@ -27,282 +26,99 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + @@ -314,335 +130,165 @@
@@ -659,12 +305,12 @@

-
-

Page not found

-

The page you requested cannot be found (perhaps it was moved or renamed).

-

You may want to try searching to find the page's new location, or use - the table of contents to find the page you are looking for.

-
+
+

Page not found

+

The page you requested cannot be found (perhaps it was moved or renamed).

+

You may want to try searching to find the page's new location, or use +the table of contents to find the page you are looking for.

+
@@ -674,71 +320,71 @@

Page not found

- - - - - - - - - - - - + + + + + + + + + + + + - \ No newline at end of file + diff --git a/docs/reference-keys.txt b/docs/reference-keys.txt index fe912b21..32b16235 100644 --- a/docs/reference-keys.txt +++ b/docs/reference-keys.txt @@ -139,3 +139,11 @@ making-predictions evaluating-the-accuracy appendix---a-little-more-about-alpha section +appendix +appendix-a +appendix-a---lemmatization +appendix-b---a-little-more-about-alpha +appendix-b +appendix-alpha +spam-filter-an-introductory-classification-model +the-data diff --git a/docs/search_index.json b/docs/search_index.json index 0759aae4..c56ebe62 100644 --- a/docs/search_index.json +++ b/docs/search_index.json @@ -1,12 +1 @@ -[ - [ - "spam-filter.html", - "Chapter 4 Spam filter 4.1 Naive Bayes: Spam or Ham? 4.2 The Training Data 4.3 Preprocessing the Data 4.4 The Naive Bayes Approach 4.5 Training the Model 4.6 Making Predictions 4.7 Evaluating the Accuracy 4.8 Summary 4.9 Appendix - A little more about alpha", - " Chapter 4 Spam filter 4.1 Naive Bayes: Spam or Ham? Nobody likes spam emails. How can Bayes help? In this chapter, we’ll keep expanding our data science knowledge with a practical example. A simple yet effective way of using Bayesian probability to create a spam filter from scratch will be introduced. The filter will examine emails and classify them as either spam or ham (the word for non-spam emails) based on their content. What we will be implementing here is a supervised learning model, in other words, a classification model that has been trained on previously classified data. Think of it like a machine to which you can give some input, like an email, and will give you some label to that input, like spam or ham. This machine has a lot of tiny knobs, and based on their particular configuration it will output some label for each input. Supervised learning involves iteratively finding the right configuration of these knobs by letting the machine make a guess with some pre-classified data, checking if the guess matches the true label, and if not, tune the knobs in some controlled way. The way our machine will make predictions is based on the underlying mathematical model. For a spam filter, a naive Bayes approach has proven to be effective, and you will have the opportunity to verify that yourself at the end of the chapter. In a naive Bayes model, Bayes’ theorem is the main tool for classifying, and it is naive because we make very loose assumptions about the data we are analyzing. This will be clearer once we dive into the implementation. 4.2 The Training Data For the Bayesian spam filter to work correctly, we need to feed it some good training data. In this context, that means having a large enough corpus of emails that have been pre-classified as spam or ham. The emails should be collected from a sufficiently heterogeneous group of people. After all, spam is a somewhat subjective category: one person’s spam may be another person’s ham. The proportion of spam vs. ham in our data should also be somewhat representative of the real proportion of emails we receive. Fortunately, there are a lot of very good datasets available online. We’ll use the “Email Spam Classification Dataset CSV” from Kaggle, a website where data science enthusiasts and practitioners publish datasets, participate in competitions, and share their knowledge. The dataset’s description included online helps us make sense of its contents: The .csv file contains 5,172 rows, one row for each email. There are 3,002 columns. The first column indicates Email name. The name has been set with numbers and not recipients’ name to protect privacy. The last column has the labels for prediction: for spam, for not spam. The remaining 3,000 columns are the 3,000 most common words in all the emails, after excluding the non-alphabetical characters/words. For each row, the count of each word(column) in that email(row) is stored in the respective cells. Let’s take a look at the data. The following code snippet outputs a view of the first and last rows of the dataset. raw_df = CSV.read("./04_naive_bayes/data/emails.csv", DataFrame) ## 5172×3002 DataFrame ## Row │ Email No. the to ect and for of a you ho ⋯ ## │ String15 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 In ⋯ ## ──────┼───────────────────────────────────────────────────────────────────────── ## 1 │ Email 1 0 0 1 0 0 0 2 0 ⋯ ## 2 │ Email 2 8 13 24 6 6 2 102 1 ## 3 │ Email 3 0 0 1 0 0 0 8 0 ## 4 │ Email 4 0 5 22 0 5 1 51 2 ## 5 │ Email 5 7 6 17 1 5 2 57 0 ⋯ ## 6 │ Email 6 4 5 1 4 2 3 45 1 ## 7 │ Email 7 5 3 1 3 2 1 37 0 ## 8 │ Email 8 0 2 2 3 1 2 21 6 ## ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ ## 5166 │ Email 5166 1 0 1 0 3 1 12 1 ⋯ ## 5167 │ Email 5167 1 0 1 1 0 0 4 0 ## 5168 │ Email 5168 2 2 2 3 0 0 32 0 ## 5169 │ Email 5169 35 27 11 2 6 5 151 4 ## 5170 │ Email 5170 0 0 1 1 0 0 11 0 ⋯ ## 5171 │ Email 5171 2 7 1 0 2 1 28 2 ## 5172 │ Email 5172 22 24 5 1 6 5 148 8 ## 2993 columns and 5157 rows omitted As you can see, the output informs the amount of rows and columns and the type of each column and allows us to see a sample of the data. 4.3 Preprocessing the Data Before we use the data to train our filter, we need to preprocess it a little bit. First, we should filter out very common words, such as articles and pronouns, which will most likely add noise rather than information to our classification algorithm. all_words = names(raw_df)[2:end-1] all_words_text = join(all_words, " ") document = StringDocument(all_words_text) prepare!(document, strip_articles) prepare!(document, strip_pronouns) vocabulary = split(TextAnalysis.text(document)) clean_words_df = raw_df[!, vocabulary] data_matrix = Matrix(clean_words_df)' In the first line, we create a variable all_words to store a list of all the words present in the emails. As our dataset has a column for each word, we do this by storing the names of every column with the names function, except for the first and last column, which are for email id and for the spam or ham label, respectively. Let’s move on to the second and third lines of the code. We would like to filter out some words that are very common in the English language, such as articles and pronouns, which will most likely add noise rather than information to our classification algorithm. For this we will use two Julia packages that are specially designed for working with texts of any type. These are Languages.jl and TextAnalysis.jl. In the third line, we create a StringDocument, which is a struct provided by TextAnalysis.jl and we use its built-in methods to remove articles and pronouns from the list of words we created before. This is done by calling the prepare function two times, with two different flags: strip_articles and strip_pronouns. What follows is just code to recover our original DataFrame with only the relevant columns, i.e., the words that were not filtered. A clean_words_df DataFrame is created selecting those columns only. Finally, we turn our DataFrame into a matrix with its rows and columns transposed. This is just the convention used by the packages we are working with to make our analysis; each column is one data realization. Next, we need to divide the data in two: a training set and a testing set. This is standard practice when working with models that learn from data, like the one we’re going to implement. We’re going to train the model on the training set, and then evaluate the model’s accuracy by having it make predictions on the testing set. In Julia, the package MLDataUtils.jl has some nice functionalities for data manipulations like this. labels = raw_df.Prediction (x_train, y_train), (x_test, y_test) = splitobs(shuffleobs((data_matrix, labels)), at = 0.7) The function splitobs splits our dataset into a training set and a testing set, and shuffleobs randomizes the order of the data in the split. We pass a labels array to our split function so it knows how to properly split the dataset. Now we can turn our attention to building the spam filter. 4.4 The Naive Bayes Approach As we mentioned, what we are facing here is a classification problem, and we will code from scratch and use a supervised learning algorithm to find a solution with the help of Bayes’ theorem. We’re going to use a naive Bayes classifier to create our spam filter. We’re going to use a classifier to create our spam filter. This method is going to treat each email just as a collection of words, with no regard for the order in which they appear. This means we won’t take into account semantic considerations like the particular relationship between words and their context. Our strategy will be to estimate a probability of an incoming email being ham or spam and make a decision based on that. Our general approach can be summarized as: \\(P(spam|email) \\propto P(email|spam)P(spam)\\) \\(P(ham|email) \\propto P(email|ham)P(ham)\\) Notice we use the \\(\\propto\\) sign, meaning proportional to, instead of the = sign because the denominator from Bayes’s theorem is missing. In this case, we won’t need to calculate it, as it’s the same for both probabilities and all we’re going to care about is a comparison of these two probabilities. In this naive approach, where semantics aren’t taken into account and each email is just a collection of words, the conditional probability \\(P(email|spam)\\) means the probability that a given email can be generated with the collection of words that appear in the spam category of our data. Let’s take a quick example. Imagine for a moment that our training set of emails consists just of these three emails, all labeled as spam: Email 1: ‘Are you interested in buying my product?’ Email 2: ‘Congratulations! You’ve won $1000!’ Email 3: ‘Check out this product!’ Also imagine we receive a new, unclassified email and we want to discover \\(P(email|spam)\\). The new email looks like this: New email: ‘Apply and win all these products!’ The new email contains the words win and product, which are rather common in our example’s training data. We would therefore expect \\(P(email|spam)\\), the probability of the new email being generated by the words encountered in the training spam email set, to be relatively high. (The word \\emph{win} appears in the form \\emph{won} in the training set, but that’s OK. The standard linguistic technique of \\emph{lemmatization} groups together any related forms of a word and treats them as the same word.) Mathematically, the way to calculate \\(P(email|spam)\\) is to take each word in our target email, calculate the probability of it appearing in spam emails based on our training set, and multiply those probabilties together. \\(P(email|spam) = \\prod_{i=1}^{n}P(word_i|spam)\\) We use a similar calculation to determine \\(P(email|ham)\\), the probability of the new email being generated by the words encountered in the training ham email set: \\(P(email|ham) = \\prod_{i=1}^{n}P(word_i|ham)\\) The multiplication of each of the probabilities associated with a particular word here stems from the naive assumption that all the words in the email are statistically independent. In reality, this assumption isn’t necessarily true. In fact, it’s most likely false. Words in a language are never independent from one another, but this simple assumption seems to be enough for the level of complexity our problem requires. The probability of a given word \\(word_i\\) being in a given category is calculated like so: \\[P(word_i|spam) = \\frac{N_{word_i|spam} + \\alpha}{N_{spam} + \\alpha N_{vocabulary}}\\] \\[P(word_i|ham) = \\frac{N_{word_i|ham} + \\alpha}{N_{ham} + \\alpha N_{vocabulary}}\\] These formulas tell us exactly what we have to calculate from our data. We need the numbers \\(N_{word_i|spam}\\) and \\(N_{word_i|ham}\\) for each word, meaning the number of times that \\(word_i\\) is used in the spam and ham categories, respectively. \\(N_{spam}\\) and \\(N_{ham}\\) are the total number of words used in the spam and ham categories (including all word repetitions), and \\(N_{vocabulary}\\) is the total number of unique words in the dataset. The variable \\(\\alpha\\) is a smoothing parameter that prevents the probability of a given word being in a given category from going down to zero. If a given word hasn’t appeared in the spam category in our training dataset, for example, we don’t want to assign it zero probability of appearing in new spam emails. As all of this information will be specific to our dataset, a clever way to aggregate it is to use a Julia struct, with attributes for the pieces of data we’ll need to access over and over during the prediction process. Here’s the implementation: mutable struct BayesSpamFilter words_count_ham::Dict{String, Int64} words_count_spam::Dict{String, Int64} N_ham::Int64 N_spam::Int64 vocabulary::Array{String} BayesSpamFilter() = new() end The relevant attributes of the struct are words_count_ham and words_count_spam, two dictionaries containing the frequency of appearance of each word in the ham and spam datasets; N_ham and N_spam, the total number of words appearing in each category; and vocabulary, an array of all the unique words in the dataset. The line BayesSpamFilter() = new() is the constructor of this struct. Because the constructor is empty, all the attributes will be undefined when we instantiate the filter. We’ll have to define some functions to fill these variables with values that are relevant to our particular problem. First, here’s a function word_count that counts the occurrences of each word in the ham and spam categories. Now we are going to define some functions that will be important for our filter implementation. function words_count(word_data, vocabulary, labels, spam=0) count_dict = Dict{String,Int64}() n_emails = size(word_data)[2] for (i, word) in enumerate(vocabulary) count_dict[word] = sum([word_data[i, j] for j in 1:n_emails if labels[j] == spam]) end return count_dict end The function word_count counts the occurrences of each word in the ham and spam categories. One of its parameters is word_data, which we defined before and is a matrix where each column is an email and each row is a word. Next, we’ll define a fit! function for our spam filter struct. Notice we’re using the bang (!) convention here to indicate a function that modifies its arguments in-place (in this case, the spam filter struct itself). This function fits our model to the data, a typical procedure in data science and machine learning areas. function fit!(model::BayesSpamFilter, x_train, y_train, voc) model.vocabulary = voc model.words_count_ham = words_count(x_train, model.vocabulary, y_train, 0) model.words_count_spam = words_count(x_train, model.vocabulary, y_train, 1) model.N_ham = sum(values(model.words_count_ham)) model.N_spam = sum(values(model.words_count_spam)) return end ## fit! (generic function with 1 method) What we mean by fitting the model to the data is mainly filling all the undefined parameters in our struct with values informed by the training data. To do this, we use the words_count function we defined earlier. Notice that we’re only fitting the model to the training portion of the data, since we’re reserving the testing portion to evaluate the model’s accuracy. 4.5 Training the Model Now it’s time to instantiate our spam filter and fit the model to the training data. With the struct and helper functions we’ve defined, the process is quite straightforward. spam_filter = BayesSpamFilter() fit!(spam_filter, x_train, y_train, vocabulary) We create an instance of our BayesSpamFilter struct and pass it to our fit! function along with the data. Notice that we’re only passing in the training portion of the dataset, since we want to reserve the testing portion to evaluate the model’s accuracy later. 4.6 Making Predictions Now that we have our model, we can use it to make some spam vs. ham predictions and assess its performance. We’ll define a few more functions to help with this process. First, we need a function implementing the TAL formula that we discussed earlier. function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α) ham_prob = (words_count_ham[word] + α) / (N_ham + α * (n_vocabulary)) spam_prob = (words_count_spam[word] + α) / (N_spam + α * (n_vocabulary)) return ham_prob, spam_prob end ## word_spam_probability (generic function with 1 method) This function calculates \\(P(word_i|spam)\\) and \\(P(word_i|ham)\\) for a given word. We’ll call it for each word of an incoming email within another function, spam_predict, to calculate the probability of that email being spam or ham. function spam_predict(email, model::BayesSpamFilter, α, tol=100) ngrams_email = ngrams(StringDocument(email)) email_words = keys(ngrams_email) n_vocabulary = length(model.vocabulary) ham_prior = model.N_ham / (model.N_ham + model.N_spam) spam_prior = model.N_spam / (model.N_ham + model.N_spam) if length(email_words) > tol word_freq = values(ngrams_email) sort_idx = sortperm(collect(word_freq), rev=true) email_words = collect(email_words)[sort_idx][1:tol] end email_ham_probability = BigFloat(1) email_spam_probability = BigFloat(1) for word in intersect(email_words, model.vocabulary) word_ham_prob, word_spam_prob = word_spam_probability(word, model.words_count_ham, model.words_count_spam, model.N_ham, model.N_spam, n_vocabulary, α) email_ham_probability *= word_ham_prob email_spam_probability *= word_spam_prob end return ham_prior * email_ham_probability, spam_prior * email_spam_probability end This function takes as input a new email that we want to classify as spam or ham, our fitted model, an \\(α\\) value (which we’ve already discussed), and a tolerance value tol. The latter sets the maximum number of unique words in an email that we’ll look at. We saw that the calculations for \\(P(email|spam)\\) and \\(P(email|ham)\\) require the multiplication of each \\(P(word_i|spam)\\) and \\(P(word_i|ham)\\) term. When emails consist of a large number of words, this multiplication may lead to very small probabilities, up to the point that the computer interprets those probabilities as zero. This isn’t desirable; we need values of \\(P(email|spam)\\) and \\(P(email|ham)\\) that are larger than zero in order to multiply them by \\(P(spam)\\) and \\(P(ham)\\), respectively, and compare these values to make a prediction. To avoid probabilities of zero, we’ll only consider up to the tol most frequently used words in the email. Finally, we arrive to the point of actually testing our model. We create another function to manage the process. This function classifies each email into Ham (represented by the number 0) or Spam (represented by the number 1) function get_predictions(x_test, y_test, model::BayesSpamFilter, α, tol=200) N = length(y_test) predictions = Array{Int64,1}(undef, N) for i in 1:N email = string([repeat(string(word, " "), N) for (word, N) in zip(model.vocabulary, x_test[:, i])]...) pham, pspam = spam_predict(email, model, α, tol) pred = argmax([pham, pspam]) - 1 predictions[i] = pred end predictions end This function takes in the testing portion of the data and our trained model. We call our spam_predict function for each email in the testing data and use the maximum (argmax) of the two returned probability values to predict (pred) if the email is spam or ham. We return the predictions as an array of values, which will contain zeros for ham emails, and ones for spam emails. Here we call the function to make predictions about the test data: predictions = get_predictions(x_test, y_test, spam_filter, 1) Let’s take a look at the predicted classifications of just the first five emails in the test data. predictions[1:5] ## 5-element Vector{Int64}: ## 0 ## 0 ## 1 ## 0 ## 1 Of the first five emails, one (the third) was classified as spam, and the rest were classified as ham. 4.7 Evaluating the Accuracy Looking at the predictions themselves is pretty meaningless; what we really want to know is the model’s accuracy. We’ll define another function to calculate this. function spam_filter_accuracy(predictions, actual) N = length(predictions) correct = sum(predictions .== actual) accuracy = correct / N accuracy end This function compares the predicted classifications with the actual classifications of the test data, counts the number of correct predictions, and divides this number by the total number of test emails, giving us an accuracy measurement. Here we call the function: spam_filter_accuracy(predictions, y_test) ## 0.9510309278350515 The output indicates our model is about 95 percent accurate. It appears our model is performing very well! Such a high accuracy rate is quite astonishing for a model so naive and simple. In fact, it may be a little too good to be true, because we have to take into account one more thing. Our model classifies emails into spam or ham, but the amount of ham emails in our data set is considerably higher than the spam ones. Let’s see the percentages: sum(raw_df[!, :Prediction])/length(raw_df[!, :Prediction]) ## 0.2900232018561485 To calculate the proportion of spam to ham emails, we sum over the Prediction column of the dataset remembering it only consists of 0s and 1s, and then we divide by the total amount of emails.This type of classification problem, where there’s an unequal distribution of classes in the dataset, is called imbalanced. With unbalanced data, a better way to see how the model is performing is to construct a confusion matrix, an \\(N \\times N\\) matrix, where \\(N\\) is the number of target classes (in our case, 2, for spam and ham). The matrix compares the actual values for each class with those predicted by the model. Here’s a function that builds a confusion matrix for our spam filter: function spam_filter_confusion_matrix(y_test, predictions) # 2x2 matrix is instantiated with zeros confusion_matrix = zeros((2, 2)) confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test)) confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test)) confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test)) confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test)) # Now we convert the confusion matrix into a DataFrame confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[]) confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2])) confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2])) return confusion_df end Now let’s call our function to build the confusion matrix for our model. confusion_matrix = spam_filter_confusion_matrix(y_test[:], predictions) ## 2×3 DataFrame ## Row │ prediction ham_mail spam_mail ## │ String Float64 Float64 ## ─────┼─────────────────────────────────────────── ## 1 │ Model predicted Ham 1056.0 33.0 ## 2 │ Model predicted Spam 43.0 420.0 Row 1 of the confusion matrix shows us all the times our model classified emails to be ham; 1,056 of those classifications were correct and 36 were incorrect. Similarly, the spam_mail column shows us the classifications for all the spam emails; 36 were misidentified as ham, and 427 were correctly identified as spam. Now that we have the confusion matrix, we can calculate the accuracy of the model segmented by category. ham_accuracy = confusion_matrix[1, :ham_mail] / (confusion_matrix[1, :ham_mail] + confusion_matrix[2, :ham_mail]) ## 0.9608735213830755 spam_accuracy = confusion_matrix[2, :spam_mail] / (confusion_matrix[1, :spam_mail] + confusion_matrix[2, :spam_mail]) ## 0.9271523178807947 With these values now we have a more fine-grained measure of the accuracy of our model. Now we know that our spam filter doesn’t have the same degree of accuracy for spam and for ham emails. As a consequence of the imbalance in our data, ham emails will be classified as such more accurately than spam emails. Still, with both percentages above 90, the accuracy is pretty good for a model so simple and naive. Models like these can be used like a baseline for creating more complex ones on top of them. 4.8 Summary In this chapter, we’ve used a naive Bayes approach to build a simple email spam filter. We walked through the whole process of training, testing, and evaluating a learning model. First, we obtained a dataset of emails already classified as spam or ham and preprocessed the data. Then we considered the theoretical framework for our naive analysis. Using Bayes’s theorem on the data available, we assigned a probability of belonging to a spam or ham email to each word of the email dataset. The probability of a new email being classified as spam is therefore the product of the probabilities of each of its constituent words. We defined a Julia struct for the spam filter object and created functions to fit the spam filter object to the data. Finally, we made predictions on new data and evaluated our model’s performance by calculating the accuracy and making a confusion matrix. 4.9 Appendix - A little more about alpha As we have seen, to calculate the probability of the email being a spam email, we should use \\(P(email|spam)=∏ni=1P(wordi|spam)=P(word0|spam)P(word1|spam)...P(wordnp|spam)\\) where P(wordnp|spam) stands for the probability of the word that is not presen t in our dataset. What probability should be assigned to this word? One way to handle this could be to simply ignore that term in the multiplication. In other words, assigning P(wordnp|spam)=1. Without thinking about it too much, we can conclude that this doesn’t make any sense, since that would mean that the probability to find that word in a spam (or ham, too) email would be equal to 1. A more logically consistent approach would be to assign 0 probability to that word. But there is a problem: with \\(P(wordnp\\|spam)=0\\) we can quickly see that \\(P(word0|spam)P(word1|spam)...P(wordnp|spam)=0\\) This is the motivation for introducing the smoothing parameter α into our equation. In a real-word scenario, we should expect that words not present in our training set will appear, and altough it makes sense that they don’t have a high probability, it can’t be 0. When such a word appears, the probability assigned for it will be simply \\(P(wordnp|spam)=Nwordnp|spam+αNspam+αNvocabulary=0+αNspam+αNvocabulary=αNspam+αNvocabulary\\) In summary, α is just a smoothing parameter, so that the probability of finding a word that is not in our dataset, doesn’t go down to 0. Since we want to keep the probability for these words low enough, it makes sense to use α=1 " - ], - [ - "404.html", - "Page not found", - " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. " - ] -] \ No newline at end of file +[["spam-filter-an-introductory-classification-model.html", "Chapter 4 Spam filter: An introductory classification model 4.1 Naive Bayes: Spam or Ham? 4.2 The data 4.3 Preprocessing the Data 4.4 The Naive Bayes Approach 4.5 Training the Model 4.6 Making Predictions 4.7 Evaluating the Accuracy 4.8 Summary 4.9 Appendix - A little more about alpha", " Chapter 4 Spam filter: An introductory classification model 4.1 Naive Bayes: Spam or Ham? Nobody likes spam emails. How can Bayes’ theorem help? In this chapter, we’ll keep expanding our data science knowledge with a practical example. A simple yet effective way of using Bayesian probability to create a spam filter from scratch will be introduced. The filter will examine emails and classify them as either spam or ham (the word for non-spam emails) based on their content. What we will be implementing here is a supervised learning model, in other words, a classification a model that can “learn” to associate the target variable (email type) with the input variables (words contained in the email). Think of it like a machine that is feeded with an email and will output a label associated to it, like spam or ham. This machine has a lot of tiny knobs -also called parameters of the model- and based on their particular configuration, the model will output some label or another. Supervised learning involves iteratively finding the right configuration of these knobs by letting the machine make a guess with some pre-classified data, checking if the guess matches the true label, and if not, tune the knobs in some controlled way. The way our machine will make predictions is based on the underlying mathematical model. For a spam filter, a naive Bayes approach has proven to be effective, and you will have the opportunity to verify that yourself at the end of the chapter. In such a model, Bayes’ theorem is the main tool for classifying, and it is naive because we make very loose assumptions about the data we are analyzing. When creating models we usually take decisions to make our lives easier, and they usually come at the expense of making the model simpler. It is usually a good practice to start building simpler models and add complexity only as needed. This will be clearer once we dive into the implementation. In summary, a successful spam filter model might infer from the training data that emails containing the word “discount” have a high probability of being spam. 4.2 The data For the Bayesian spam filter to work correctly, we need to feed it some good training data. In this context, that means having a large enough corpus 1 of emails that have been pre-classified as spam or ham. The emails should be collected from a sufficiently heterogeneous group of people. After all, spam is a somewhat subjective category: one person’s spam may be another person’s ham. The proportion of spam vs. ham in our data should also be somewhat representative of the real proportion of emails we receive. Fortunately, there are a lot of very good datasets available online. We’ll use the “Email Spam Classification Dataset CSV” from Kaggle, a website where data science enthusiasts and practitioners publish datasets, participate in competitions, and share their knowledge. The dataset’s description included online helps us make sense of its contents: The .csv file contains 5,172 rows, one row for each email. There are 3,002 columns. The first column indicates Email name. The name has been set with numbers and not recipients’ name to protect privacy. The last column has the labels for prediction: for spam, for not spam. The remaining 3,000 columns are the 3,000 most common words in all the emails, after excluding the non-alphabetical characters/words. For each row, the count of each word(column) in that email(row) is stored in the respective cells. Let’s take a look at the data. The following code snippet outputs a view of the first and last rows of the dataset. raw_df = CSV.read("./04_naive_bayes/data/emails.csv", DataFrame) ## 5172×3002 DataFrame ## Row │ Email No. the to ect and for of a you ho ⋯ ## │ String15 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 In ⋯ ## ──────┼───────────────────────────────────────────────────────────────────────── ## 1 │ Email 1 0 0 1 0 0 0 2 0 ⋯ ## 2 │ Email 2 8 13 24 6 6 2 102 1 ## 3 │ Email 3 0 0 1 0 0 0 8 0 ## 4 │ Email 4 0 5 22 0 5 1 51 2 ## 5 │ Email 5 7 6 17 1 5 2 57 0 ⋯ ## 6 │ Email 6 4 5 1 4 2 3 45 1 ## 7 │ Email 7 5 3 1 3 2 1 37 0 ## 8 │ Email 8 0 2 2 3 1 2 21 6 ## ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ ## 5166 │ Email 5166 1 0 1 0 3 1 12 1 ⋯ ## 5167 │ Email 5167 1 0 1 1 0 0 4 0 ## 5168 │ Email 5168 2 2 2 3 0 0 32 0 ## 5169 │ Email 5169 35 27 11 2 6 5 151 4 ## 5170 │ Email 5170 0 0 1 1 0 0 11 0 ⋯ ## 5171 │ Email 5171 2 7 1 0 2 1 28 2 ## 5172 │ Email 5172 22 24 5 1 6 5 148 8 ## 2993 columns and 5157 rows omitted As you can see, the output informs the amount of rows and columns and the type of each column and allows us to see a sample of the data. 4.3 Preprocessing the Data Before we use the data to train our filter, we need to preprocess it a little bit. First, we should filter out very common words, such as articles and pronouns, which will most likely add noise rather than information to our classification algorithm. all_words = names(raw_df)[2:end-1] all_words_text = join(all_words, " ") document = StringDocument(all_words_text) prepare!(document, strip_articles) prepare!(document, strip_pronouns) vocabulary = split(TextAnalysis.text(document)) clean_words_df = raw_df[!, vocabulary] data_matrix = Matrix(clean_words_df)' In the first line, we create a variable all_words to store a list of all the words present in the emails. As our dataset has a column for each word, we do this by storing the names of every column with the names function, except for the first and last column, which are for email id and for the spam or ham label, respectively. Let’s move on to the second and third lines of the code. We would like to filter out some words that are very common in the English language, such as articles and pronouns, which will most likely add noise rather than information to our classification algorithm. For this we will use two Julia packages that are specially designed for working with texts of any type. These are Languages.jl and TextAnalysis.jl. In the third line, we create a StringDocument, which is a struct provided by TextAnalysis.jl and we use its built-in methods to remove articles and pronouns from the list of words we created before. This is done by calling the prepare function two times, with two different flags: strip_articles and strip_pronouns. What follows is just code to recover our original DataFrame with only the relevant columns, i.e., the words that were not filtered. A clean_words_df DataFrame is created selecting those columns only. Finally, we turn our DataFrame into a matrix with its rows and columns transposed. This is just the convention used by the packages we are working with to make our analysis; each column is one data realization. Next, we need to divide the data in two: a training set and a testing set. This is standard practice when working with models that learn from data, like the one we’re going to implement. We’re going to train the model on the training set, and then evaluate the model’s accuracy by having it make predictions on the testing set. In Julia, the package MLDataUtils.jl has some nice functionalities for data manipulations like this. labels = raw_df.Prediction (x_train, y_train), (x_test, y_test) = splitobs(shuffleobs((data_matrix, labels)), at = 0.7) The function splitobs splits our dataset into a training set and a testing set, and shuffleobs randomizes the order of the data in the split. We pass a labels array to our split function so it knows how to properly split the dataset. Now we can turn our attention to building the spam filter. 4.4 The Naive Bayes Approach As we mentioned, what we are facing here is a classification problem, and we will code from scratch and use a supervised learning algorithm to find a solution with the help of Bayes’ theorem. We’re going to use a naive Bayes classifier to create our spam filter. We’re going to use a classifier to create our spam filter. This method is going to treat each email just as a collection of words, with no regard for the order in which they appear. This means we won’t take into account semantic considerations like the particular relationship between words and their context. Our strategy will be to estimate a probability of an incoming email being ham or spam and make a decision based on that. Our general approach can be summarized as: \\(P(spam|email) \\propto P(email|spam)P(spam)\\) \\(P(ham|email) \\propto P(email|ham)P(ham)\\) Notice we use the \\(\\propto\\) sign, meaning proportional to, instead of the = sign because the denominator from Bayes’s theorem is missing. In this case, we won’t need to calculate it, as it’s the same for both probabilities and all we’re going to care about is a comparison of these two probabilities. In this naive approach, where semantics aren’t taken into account and each email is just a collection of words, the conditional probability \\(P(email|spam)\\) means the probability that a given email can be generated with the collection of words that appear in the spam category of our data. Let’s take a quick example. Imagine for a moment that our training set of emails consists just of these three emails, all labeled as spam: Email 1: ‘Are you interested in buying my product?’ Email 2: ‘Congratulations! You’ve won $1000!’ Email 3: ‘Check out this product!’ Also imagine we receive a new, unclassified email and we want to discover \\(P(email|spam)\\). The new email looks like this: New email: ‘Apply and win all these products!’ The new email contains the words win and product, which are rather common in our example’s training data. We would therefore expect \\(P(email|spam)\\), the probability of the new email being generated by the words encountered in the training spam email set, to be relatively high. 2 Mathematically, the way to calculate \\(P(email|spam)\\) is to take each word in our target email, calculate the probability of it appearing in spam emails based on our training set, and multiply those probabilities together. \\(P(email|spam) = \\prod_{i=1}^{n}P(word_i|spam)\\) We use a similar calculation to determine \\(P(email|ham)\\), the probability of the new email being generated by the words encountered in the training ham email set: \\(P(email|ham) = \\prod_{i=1}^{n}P(word_i|ham)\\) The multiplication of each of the word probabilities here stands from the naive supposition that all the words in the email are conditionally independent given the class (spam or ham). In reality, this assumption isn’t necessarily true. In fact, it’s most likely false. Words in a language are never independent from one another, but this simple assumption seems to be enough for the level of complexity our problem requires. The probability of a given word \\(word_i\\) being in a given category is calculated like so: \\[\\begin{equation} \\tag{1.1} P(word_i|spam) = \\frac{N_{word_i|spam} + \\alpha}{N_{spam} + \\alpha N_{vocabulary}} \\end{equation}\\] \\[\\begin{equation} \\tag{1.2} P(word_i|ham) = \\frac{N_{word_i|ham} + \\alpha}{N_{ham} + \\alpha N_{vocabulary}} \\end{equation}\\] These formulas tell us exactly what we have to calculate from our data. We need the numbers \\(N_{word_i|spam}\\) and \\(N_{word_i|ham}\\) for each word, meaning the number of times that \\(word_i\\) is used in the spam and ham categories, respectively. \\(N_{spam}\\) and \\(N_{ham}\\) are the total number of words used in the spam and ham categories (including all word repetitions), and \\(N_{vocabulary}\\) is the total number of unique words in the dataset. The variable \\(\\alpha\\) is a smoothing parameter that prevents the probability of a given word being in a given category from going down to zero. If a given word hasn’t appeared in the spam category in our training dataset, for example, we don’t want to assign it zero probability of appearing in new spam emails. See the appendix for more details. As all of this information will be specific to our dataset, a clever way to aggregate it is to use a Julia struct, with attributes for the pieces of data we’ll need to access over and over during the prediction process. Here’s the implementation: mutable struct BayesSpamFilter words_count_ham::Dict{String, Int64} words_count_spam::Dict{String, Int64} N_ham::Int64 N_spam::Int64 vocabulary::Array{String} BayesSpamFilter() = new() end The relevant attributes of the struct are words_count_ham and words_count_spam, two dictionaries containing the frequency of appearance of each word in the ham and spam datasets; N_ham and N_spam, the total number of words appearing in each category; and vocabulary, an array of all the unique words in the dataset. The line BayesSpamFilter() = new() is the constructor of this struct. Because the constructor is empty, all the attributes will be undefined when we instantiate the filter. We’ll have to define some functions to fill these variables with values that are relevant to our particular problem. First, here’s a function word_count that counts the occurrences of each word in the ham and spam categories. Now we are going to define some functions that will be important for our filter implementation. function words_count(word_data, vocabulary, labels, spam=0) count_dict = Dict{String,Int64}() n_emails = size(word_data)[2] for (i, word) in enumerate(vocabulary) count_dict[word] = sum([word_data[i, j] for j in 1:n_emails if labels[j] == spam]) end return count_dict end The function word_count counts the occurrences of each word in the ham and spam categories. One of its parameters is word_data, which we defined before and is a matrix where each column is an email and each row is a word. Next, we’ll define a fit! function for our spam filter struct. Notice we’re using the bang (!) convention here to indicate a function that modifies its arguments in-place (in this case, the spam filter struct itself). This function fits our model to the data, a typical procedure in data science and machine learning areas. function fit!(model::BayesSpamFilter, x_train, y_train, voc) model.vocabulary = voc model.words_count_ham = words_count(x_train, model.vocabulary, y_train, 0) model.words_count_spam = words_count(x_train, model.vocabulary, y_train, 1) model.N_ham = sum(values(model.words_count_ham)) model.N_spam = sum(values(model.words_count_spam)) return end What we mean by fitting the model to the data is mainly filling all the undefined parameters in our struct with values informed by the training data. To do this, we use the words_count function we defined earlier. Notice that we’re only fitting the model to the training portion of the data, since we’re reserving the testing portion to evaluate the model’s accuracy. 4.5 Training the Model Now it’s time to instantiate our spam filter and fit the model to the training data. With the struct and helper functions we’ve defined, the process is quite straightforward. spam_filter = BayesSpamFilter() fit!(spam_filter, x_train, y_train, vocabulary) We create an instance of our BayesSpamFilter struct and pass it to our fit! function along with the data. Notice that we’re only passing in the training portion of the dataset, since we want to reserve the testing portion to evaluate the model’s accuracy later. 4.6 Making Predictions Now that we have our model, we can use it to make some spam vs. ham predictions and assess its performance. We’ll define a few more functions to help with this process. First, we need a function implementing formulas \\((1.1)\\) and \\((1.2)\\). function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α) ham_prob = (words_count_ham[word] + α) / (N_ham + α * (n_vocabulary)) spam_prob = (words_count_spam[word] + α) / (N_spam + α * (n_vocabulary)) return ham_prob, spam_prob end This function calculates \\(P(word_i|spam)\\) and \\(P(word_i|ham)\\) for a given word. We’ll call it for each word of an incoming email within another function, spam_predict, to calculate the probability of that email being spam or ham. function spam_predict(email, model::BayesSpamFilter, α, tol=100) ngrams_email = ngrams(StringDocument(email)) email_words = keys(ngrams_email) n_vocabulary = length(model.vocabulary) ham_prior = model.N_ham / (model.N_ham + model.N_spam) spam_prior = model.N_spam / (model.N_ham + model.N_spam) if length(email_words) > tol word_freq = values(ngrams_email) sort_idx = sortperm(collect(word_freq), rev=true) email_words = collect(email_words)[sort_idx][1:tol] end email_ham_probability = BigFloat(1) email_spam_probability = BigFloat(1) for word in intersect(email_words, model.vocabulary) word_ham_prob, word_spam_prob = word_spam_probability(word, model.words_count_ham, model.words_count_spam, model.N_ham, model.N_spam, n_vocabulary, α) email_ham_probability *= word_ham_prob email_spam_probability *= word_spam_prob end return ham_prior * email_ham_probability, spam_prior * email_spam_probability end This function takes as input a new email that we want to classify as spam or ham, our fitted model, an \\(α\\) value (which we’ve already discussed), and a tolerance value tol. The latter sets the maximum number of unique words in an email that we’ll look at. We saw that the calculations for \\(P(email|spam)\\) and \\(P(email|ham)\\) require the multiplication of each \\(P(word_i|spam)\\) and \\(P(word_i|ham)\\) term. When emails consist of a large number of words, this multiplication may lead to very small probabilities, up to the point that the computer interprets those probabilities as zero. This isn’t desirable; we need values of \\(P(email|spam)\\) and \\(P(email|ham)\\) that are larger than zero in order to multiply them by \\(P(spam)\\) and \\(P(ham)\\), respectively, and compare these values to make a prediction. To avoid probabilities of zero, we’ll only consider up to the tol most frequently used words in the email. Finally, we arrive to the point of actually testing our model. We create another function to manage the process. This function classifies each email into Ham (represented by the number 0) or Spam (represented by the number 1) function get_predictions(x_test, y_test, model::BayesSpamFilter, α, tol=200) N = length(y_test) predictions = Array{Int64,1}(undef, N) for i in 1:N email = string([repeat(string(word, " "), N) for (word, N) in zip(model.vocabulary, x_test[:, i])]...) pham, pspam = spam_predict(email, model, α, tol) pred = argmax([pham, pspam]) - 1 predictions[i] = pred end predictions end This function takes in the testing portion of the data and our trained model. We call our spam_predict function for each email in the testing data and use the maximum (argmax) of the two returned probability values to predict (pred) if the email is spam or ham. We return the predictions as an array of values, which will contain zeros for ham emails, and ones for spam emails. Here we call the function to make predictions about the test data: predictions = get_predictions(x_test, y_test, spam_filter, 1) Let’s take a look at the predicted classifications of just the first five emails in the test data. predictions[1:5] ## 5-element Vector{Int64}: ## 0 ## 0 ## 1 ## 0 ## 1 Of the first five emails, the third and the fifth were classified as spam, while the rest were classified as ham. 4.7 Evaluating the Accuracy Looking at the predictions themselves is pretty meaningless; what we really want is to have some metric which can help us evaluate the effectiveness of our model in a quantitative manner. Usually, the first approach to this is to calculate the model’s accuracy. We’ll define another function for this calculation. function spam_filter_accuracy(predictions, actual) N = length(predictions) correct = sum(predictions .== actual) accuracy = correct / N accuracy end This function compares the predicted classifications with the actual classifications of the test data, counts the number of correct predictions, and divides this number by the total number of test emails, giving us an accuracy measurement. Here we call the function: spam_filter_accuracy(predictions, y_test) ## 0.9510309278350515 The output indicates our model is about 95 percent accurate. It appears our model is performing very well! Such a high accuracy rate is quite astonishing for a model so naive and simple. In fact, it may be a little too good to be true, because we have to take into account one more thing. Our model classifies emails into spam or ham, but the amount of ham emails in our data set is considerably higher than the spam ones. Let’s see the percentages: sum(raw_df[!, :Prediction])/length(raw_df[!, :Prediction]) ## 0.2900232018561485 To calculate the proportion of spam to ham emails, we sum over the Prediction column of the dataset remembering it only consists of 0s and 1s, and then we divide by the total amount of emails.This type of classification problem, where there’s an unequal distribution of classes in the dataset, is called imbalanced. With unbalanced data, a better way to see how the model is performing is to construct a confusion matrix, an \\(N \\times N\\) matrix, where \\(N\\) is the number of target classes (in our case, 2, for spam and ham). The matrix compares the actual values for each class with those predicted by the model. Here’s a function that builds a confusion matrix for our spam filter: function spam_filter_confusion_matrix(y_test, predictions) confusion_matrix = zeros((2, 2)) confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in eachindex(y_test)) confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in eachindex(y_test)) confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in eachindex(y_test)) confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in eachindex(y_test)) confusion_df = DataFrame(prediction=String[], ham_mail=Integer[], spam_mail=Integer[]) confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2])) confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2])) return confusion_df end Now let’s call our function to build the confusion matrix for our model. confusion_matrix = spam_filter_confusion_matrix(y_test[:], predictions) ## 2×3 DataFrame ## Row │ prediction ham_mail spam_mail ## │ String Real Real ## ─────┼─────────────────────────────────────────── ## 1 │ Model predicted Ham 1056.0 33.0 ## 2 │ Model predicted Spam 43.0 420.0 Row 1 of the confusion matrix shows us all the times our model classified emails to be ham; 1,056 of those classifications were correct and 36 were incorrect. Similarly, the spam_mail column shows us the classifications for all the spam emails; 36 were misidentified as ham, and 427 were correctly identified as spam. Now that we have the confusion matrix, we can calculate the accuracy of the model segmented by category. ham_accuracy = confusion_matrix[1, :ham_mail] / (confusion_matrix[1, :ham_mail] + confusion_matrix[2, :ham_mail]) ## 0.9608735213830755 spam_accuracy = confusion_matrix[2, :spam_mail] / (confusion_matrix[1, :spam_mail] + confusion_matrix[2, :spam_mail]) ## 0.9271523178807947 With these values now we have a more fine-grained measure of the accuracy of our model. Now we know that our spam filter doesn’t have the same degree of accuracy for spam and for ham emails. As a consequence of the imbalance in our data, ham emails will be classified as such more accurately than spam emails. Still, with both percentages above 90, the accuracy is pretty good for a model so simple and naive. Models like these can be used like a baseline for creating more complex ones on top of them. 4.8 Summary In this chapter, we’ve used a naive Bayes approach to build a simple email spam filter. We walked through the whole process of training, testing, and evaluating a learning model. First, we obtained a dataset of emails already classified as spam or ham and preprocessed the data. Then we considered the theoretical framework for our naive analysis. Using Bayes’s theorem on the data available, we assigned a probability of belonging to a spam or ham email to each word of the email dataset. The probability of a new email being classified as spam is therefore the product of the probabilities of each of its constituent words. We defined a Julia struct for the spam filter object and created functions to fit the spam filter object to the data. Finally, we made predictions on new data and evaluated our model’s performance by calculating the accuracy and making a confusion matrix. 4.9 Appendix - A little more about alpha As we have seen, to calculate the probability of the email being a spam email, we should use \\(P(email|spam)=∏ni=1P(wordi|spam)=P(word0|spam)P(word1|spam)...P(wordnp|spam)\\) where P(wordnp|spam) stands for the probability of the word that is not presen t in our dataset. What probability should be assigned to this word? One way to handle this could be to simply ignore that term in the multiplication. In other words, assigning P(wordnp|spam)=1. Without thinking about it too much, we can conclude that this doesn’t make any sense, since that would mean that the probability to find that word in a spam (or ham, too) email would be equal to 1. A more logically consistent approach would be to assign 0 probability to that word. But there is a problem: with \\(P(wordnp\\|spam)=0\\) we can quickly see that \\(P(word0|spam)P(word1|spam)...P(wordnp|spam)=0\\) This is the motivation for introducing the smoothing parameter α into our equation. In a real-word scenario, we should expect that words not present in our training set will appear, and altough it makes sense that they don’t have a high probability, it can’t be 0. When such a word appears, the probability assigned for it will be simply \\(P(wordnp|spam)=Nwordnp|spam+αNspam+αNvocabulary=0+αNspam+αNvocabulary=αNspam+αNvocabulary\\) In summary, α is just a smoothing parameter, so that the probability of finding a word that is not in our dataset, doesn’t go down to 0. Since we want to keep the probability for these words low enough, it makes sense to use α=1 A large and structured set of text data, very oftenly used to train language models.↩︎ The word win appears in the form won in the training set, but that’s OK. The standard linguistic technique of lemmatization groups together any related forms of a word and treats them as the same word.↩︎ "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]] diff --git a/docs/spam-filter.html b/docs/spam-filter.html index 60e4bdde..be5bd6e8 100644 --- a/docs/spam-filter.html +++ b/docs/spam-filter.html @@ -6,7 +6,7 @@ Chapter 4 Spam filter | Data Science in Julia for Hackers - + @@ -190,7 +190,7 @@
  • 4.6 Making Predictions
  • 4.7 Evaluating the Accuracy
  • 4.8 Summary
  • -
  • 4.9 Appendix - A little more about alpha
  • +
  • 4.9 Appendix - A little more about alpha
  • 5 Probabilistic programming
  • 6.3 Summary
  • +
  • 6.4 References
  • 7 Football simulation
  • 7.2 Simulate possible realities
  • 7.3 Summary
  • -
  • 7.4 References
  • +
  • 7.4 References
  • 8 Basketball shots
  • 9 Optimal pricing @@ -241,7 +242,7 @@
  • 9.3 Maximizing profit
  • 9.4 Summary
  • -
  • 9.5 References
  • +
  • 9.5 References
  • 10 Image classification
  • 11 Ultima online
  • 11.2 Summary
  • -
  • 11.3 References
  • +
  • 11.3 References
  • 12 Ultima continued
  • 12.3 Summary
  • -
  • 12.4 References
  • +
  • 12.4 References
  • 13 Time series
  • 13.3 Summary
  • -
  • 13.4 References
  • +
  • 13.4 References
  • Published with bookdown
  • @@ -315,11 +316,13 @@

    4.1 Naive Bayes: Spam or Ham?

    What we will be implementing here is a supervised learning model, in -other words, a classification model that has been trained on previously -classified data. Think of it like a machine to which you can give some -input, like an email, and will give you some label to that input, like -spam or ham. This machine has a lot of tiny knobs, and based on their -particular configuration it will output some label for each input. +other words, a classification a model that can “learn” +to associate the target variable (email type) with the input variables +(words contained in the email). Think of it like a machine that is feeded +with an email and will output a label associated to it, like +spam or ham. This machine has a lot of tiny knobs -also called parameters of +the model- and based on their particular configuration, the model will output +some label or another. Supervised learning involves iteratively finding the right configuration of these knobs by letting the machine make a guess with some pre-classified data, checking if the guess matches the true label, and @@ -327,10 +330,16 @@

    4.1 Naive Bayes: Spam or Ham?naive Bayes approach has proven to be effective, and you will have the opportunity to verify that yourself at the end of the -chapter. In a naive Bayes model, Bayes’ theorem is the main tool for +chapter. In such a model, Bayes’ theorem is the main tool for classifying, and it is naive because we make very loose assumptions -about the data we are analyzing. This will be clearer once we dive into -the implementation.

    +about the data we are analyzing. When creating models we usually take +decisions to make our lives easier, and they usually come at the expense +of making the model simpler. It is usually a good practice to start building +simpler models and add complexity only as needed. This will be clearer once +we dive into the implementation. +In summary, a successful spam filter model might infer from the training +data that emails containing the word “discount” have a high probability +of being spam.

    4.2 The Training Data

    @@ -482,11 +491,7 @@

    4.4 The Naive Bayes ApproachThe new email contains the words win and product, which are rather common in our example’s training data. We would therefore expect \(P(email|spam)\), the probability of the new email being generated by the -words encountered in the training spam email set, to be relatively high.

    -

    (The word \emph{win} appears in the form \emph{won} in the training -set, but that’s OK. The standard linguistic technique of -\emph{lemmatization} groups together any related forms of a word and -treats them as the same word.)

    +words encountered in the training spam email set, to be relatively high.
    1

    Mathematically, the way to calculate \(P(email|spam)\) is to take each word in our target email, calculate the probability of it appearing in spam emails based on our training set, and multiply those probabilties @@ -496,17 +501,23 @@

    4.4 The Naive Bayes Approach

    \(P(email|ham) = \prod_{i=1}^{n}P(word_i|ham)\)

    -

    The multiplication of each of the probabilities associated with a -particular word here stems from the naive assumption that all the words -in the email are statistically independent. In reality, this assumption +

    The multiplication of each of the word probabilities here stands from the +naive supposition that all the words in the email are conditionally +independent given the class (spam or ham). In reality, this assumption isn’t necessarily true. In fact, it’s most likely false. Words in a language are never independent from one another, but this simple assumption seems to be enough for the level of complexity our problem requires.

    The probability of a given word \(word_i\) being in a given category is calculated like so:

    -

    \[P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}\] -\[P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}\]

    +

    \[\begin{equation} + \tag{1.1} + P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}} +\end{equation}\]

    +

    \[\begin{equation} + \tag{1.2} + P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}} +\end{equation}\]

    These formulas tell us exactly what we have to calculate from our data. We need the numbers \(N_{word_i|spam}\) and \(N_{word_i|ham}\) for each word, meaning the number of times that \(word_i\) is used in the spam and @@ -517,7 +528,8 @@

    4.4 The Naive Bayes Approach +probability of appearing in new spam emails. See the appendix +for more details.

    As all of this information will be specific to our dataset, a clever way to aggregate it is to use a Julia struct, with attributes for the pieces of data we’ll need to access over and over during the prediction @@ -568,7 +580,6 @@

    4.4 The Naive Bayes Approach model.N_spam = sum(values(model.words_count_spam)) return end

    -
    ## fit! (generic function with 1 method)

    What we mean by fitting the model to the data is mainly filling all the undefined parameters in our struct with values informed by the training data. To do this, we use the words_count function we defined earlier. @@ -581,8 +592,8 @@

    4.5 Training the Model
    spam_filter = BayesSpamFilter()
    -fit!(spam_filter, x_train, y_train, vocabulary)
    +
    spam_filter = BayesSpamFilter()
    +fit!(spam_filter, x_train, y_train, vocabulary)

    We create an instance of our BayesSpamFilter struct and pass it to our fit! function along with the data. Notice that we’re only passing in the training portion of the dataset, since we want to reserve the @@ -593,40 +604,39 @@

    4.6 Making Predictions
    function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α)
    -    ham_prob = (words_count_ham[word] + α) / (N_ham + α * (n_vocabulary))
    -    spam_prob = (words_count_spam[word] + α) / (N_spam + α * (n_vocabulary))
    -    return ham_prob, spam_prob
    -end
    -
    ## word_spam_probability (generic function with 1 method)
    +implementing formulas \((1.1)\) and \((1.2)\).

    +
    function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α)
    +    ham_prob = (words_count_ham[word] + α) / (N_ham + α * (n_vocabulary))
    +    spam_prob = (words_count_spam[word] + α) / (N_spam + α * (n_vocabulary))
    +    return ham_prob, spam_prob
    +end

    This function calculates \(P(word_i|spam)\) and \(P(word_i|ham)\) for a given word. We’ll call it for each word of an incoming email within another function, spam_predict, to calculate the probability of that email being spam or ham.

    -
    function spam_predict(email, model::BayesSpamFilter, α, tol=100)
    -    ngrams_email = ngrams(StringDocument(email))
    -    email_words = keys(ngrams_email)
    -    n_vocabulary = length(model.vocabulary)
    -    ham_prior = model.N_ham / (model.N_ham + model.N_spam)
    -    spam_prior = model.N_spam / (model.N_ham + model.N_spam)
    -
    -    if length(email_words) > tol
    -        word_freq = values(ngrams_email)
    -        sort_idx = sortperm(collect(word_freq), rev=true)
    -        email_words = collect(email_words)[sort_idx][1:tol]
    -    end
    -
    -    email_ham_probability = BigFloat(1)
    -    email_spam_probability = BigFloat(1)
    -
    -    for word in intersect(email_words, model.vocabulary)
    -        word_ham_prob, word_spam_prob = word_spam_probability(word, model.words_count_ham, model.words_count_spam, model.N_ham, model.N_spam, n_vocabulary, α)
    -        email_ham_probability *= word_ham_prob
    -        email_spam_probability *= word_spam_prob
    -    end
    -    return ham_prior * email_ham_probability, spam_prior * email_spam_probability
    -end
    +
    function spam_predict(email, model::BayesSpamFilter, α, tol=100)
    +    ngrams_email = ngrams(StringDocument(email))
    +    email_words = keys(ngrams_email)
    +    n_vocabulary = length(model.vocabulary)
    +    ham_prior = model.N_ham / (model.N_ham + model.N_spam)
    +    spam_prior = model.N_spam / (model.N_ham + model.N_spam)
    +
    +    if length(email_words) > tol
    +        word_freq = values(ngrams_email)
    +        sort_idx = sortperm(collect(word_freq), rev=true)
    +        email_words = collect(email_words)[sort_idx][1:tol]
    +    end
    +
    +    email_ham_probability = BigFloat(1)
    +    email_spam_probability = BigFloat(1)
    +
    +    for word in intersect(email_words, model.vocabulary)
    +        word_ham_prob, word_spam_prob = word_spam_probability(word, model.words_count_ham, model.words_count_spam, model.N_ham, model.N_spam, n_vocabulary, α)
    +        email_ham_probability *= word_ham_prob
    +        email_spam_probability *= word_spam_prob
    +    end
    +    return ham_prior * email_ham_probability, spam_prior * email_spam_probability
    +end

    This function takes as input a new email that we want to classify as spam or ham, our fitted model, an \(α\) value (which we’ve already discussed), and a tolerance value tol. The latter sets the maximum @@ -645,18 +655,18 @@

    4.6 Making Predictions
    function get_predictions(x_test, y_test, model::BayesSpamFilter, α, tol=200)
    -    N = length(y_test)
    -    predictions = Array{Int64,1}(undef, N)
    -    for i in 1:N
    -        email = string([repeat(string(word, " "), N) for (word, N) in zip(model.vocabulary, x_test[:, i])]...)
    -        pham, pspam = spam_predict(email, model, α, tol)
    -        pred = argmax([pham, pspam]) - 1
    -        predictions[i] = pred
    -    end
    -
    -    predictions
    -end
    +
    function get_predictions(x_test, y_test, model::BayesSpamFilter, α, tol=200)
    +    N = length(y_test)
    +    predictions = Array{Int64,1}(undef, N)
    +    for i in 1:N
    +        email = string([repeat(string(word, " "), N) for (word, N) in zip(model.vocabulary, x_test[:, i])]...)
    +        pham, pspam = spam_predict(email, model, α, tol)
    +        pred = argmax([pham, pspam]) - 1
    +        predictions[i] = pred
    +    end
    +
    +    predictions
    +end

    This function takes in the testing portion of the data and our trained model. We call our spam_predict function for each email in the testing data and use the maximum (argmax) of the two returned probability @@ -664,17 +674,17 @@

    4.6 Making Predictions
    predictions = get_predictions(x_test, y_test, spam_filter, 1)
    +
    predictions = get_predictions(x_test, y_test, spam_filter, 1)

    Let’s take a look at the predicted classifications of just the first five emails in the test data.

    -
    predictions[1:5]
    +
    predictions[1:5]
    ## 5-element Vector{Int64}:
     ##  0
     ##  0
     ##  1
     ##  0
     ##  1
    -

    Of the first five emails, one (the third) was classified as spam, and +

    Of the first five emails, the third and the fifth were classified as spam, while the rest were classified as ham.

    @@ -682,17 +692,17 @@

    4.7 Evaluating the AccuracyLooking at the predictions themselves is pretty meaningless; what we really want to know is the model’s accuracy. We’ll define another function to calculate this.

    -
    function spam_filter_accuracy(predictions, actual)
    -    N = length(predictions)
    -    correct = sum(predictions .== actual)
    -    accuracy = correct / N
    -    accuracy
    -end
    +
    function spam_filter_accuracy(predictions, actual)
    +    N = length(predictions)
    +    correct = sum(predictions .== actual)
    +    accuracy = correct / N
    +    accuracy
    +end

    This function compares the predicted classifications with the actual classifications of the test data, counts the number of correct predictions, and divides this number by the total number of test emails, giving us an accuracy measurement. Here we call the function:

    -
    spam_filter_accuracy(predictions, y_test)
    +
    spam_filter_accuracy(predictions, y_test)
    ## 0.9510309278350515

    The output indicates our model is about 95 percent accurate. It appears our model is performing very well! Such a high accuracy rate is quite @@ -701,7 +711,7 @@

    4.7 Evaluating the Accuracy -
    sum(raw_df[!, :Prediction])/length(raw_df[!, :Prediction])
    +
    sum(raw_df[!, :Prediction])/length(raw_df[!, :Prediction])
    ## 0.2900232018561485

    To calculate the proportion of spam to ham emails, we sum over the Prediction column of the dataset remembering it only consists of 0s @@ -713,27 +723,25 @@

    4.7 Evaluating the Accuracy -
    function spam_filter_confusion_matrix(y_test, predictions)
    -    # 2x2 matrix is instantiated with zeros
    -    confusion_matrix = zeros((2, 2))
    -
    -    confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test))
    -    confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test))
    -    confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test))
    -    confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test))
    -
    -    # Now we convert the confusion matrix into a DataFrame 
    -    confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[])
    -    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2]))
    -    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2]))
    -
    -    return confusion_df
    -end
    +
    function spam_filter_confusion_matrix(y_test, predictions)
    +    confusion_matrix = zeros((2, 2))
    +
    +    confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in eachindex(y_test))
    +    confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in eachindex(y_test))
    +    confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in eachindex(y_test))
    +    confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in eachindex(y_test))
    +
    +    confusion_df = DataFrame(prediction=String[], ham_mail=Integer[], spam_mail=Integer[])
    +    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2]))
    +    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2]))
    +
    +    return confusion_df
    +end

    Now let’s call our function to build the confusion matrix for our model.

    -
    confusion_matrix = spam_filter_confusion_matrix(y_test[:], predictions)
    +
    confusion_matrix = spam_filter_confusion_matrix(y_test[:], predictions)
    ## 2×3 DataFrame
     ##  Row │ prediction            ham_mail  spam_mail
    -##      │ String                Float64   Float64
    +##      │ String                Real      Real
     ## ─────┼───────────────────────────────────────────
     ##    1 │ Model predicted Ham     1056.0       33.0
     ##    2 │ Model predicted Spam      43.0      420.0
    @@ -744,9 +752,9 @@

    4.7 Evaluating the Accuracy

    Now that we have the confusion matrix, we can calculate the accuracy of the model segmented by category.

    -
    ham_accuracy = confusion_matrix[1, :ham_mail] / (confusion_matrix[1, :ham_mail] + confusion_matrix[2, :ham_mail])
    +
    ham_accuracy = confusion_matrix[1, :ham_mail] / (confusion_matrix[1, :ham_mail] + confusion_matrix[2, :ham_mail])
    ## 0.9608735213830755
    -
    spam_accuracy = confusion_matrix[2, :spam_mail] / (confusion_matrix[1, :spam_mail] + confusion_matrix[2, :spam_mail])
    +
    spam_accuracy = confusion_matrix[2, :spam_mail] / (confusion_matrix[1, :spam_mail] + confusion_matrix[2, :spam_mail])
    ## 0.9271523178807947

    With these values now we have a more fine-grained measure of the accuracy of our model. Now we know that our spam filter doesn’t have the @@ -773,8 +781,8 @@

    4.8 Summary -

    4.9 Appendix - A little more about alpha

    +
    +

    4.9 Appendix - A little more about alpha

    As we have seen, to calculate the probability of the email being a spam email, we should use

    \(P(email|spam)=∏ni=1P(wordi|spam)=P(word0|spam)P(word1|spam)...P(wordnp|spam)\)

    @@ -801,6 +809,15 @@

    4.9 Appendix - A little more abou sense to use α=1

    +

    +
    +
    +
      +
    1. The word win appears in the form won in the training +set, but that’s OK. The standard linguistic technique of +lemmatization groups together any related forms of a word and +treats them as the same word.↩︎

    2. +