Multinomial Naive Bayes Classifier
multinomial_naive_bayes is used to fit the Multinomial Naive Bayes model.
Arguments
- x
numeric matrix with integer predictors (matrix or dgCMatrix from Matrix package).
- y
class vector (character/factor/logical).
- prior
vector with prior probabilities of the classes. If unspecified, the class proportions for the training set are used. If present, the probabilities should be specified in the order of the factor levels.
- laplace
value used for Laplace smoothing (additive smoothing). Defaults to 0.5.
- ...
not used.
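The ordering rule for prior can be checked before fitting. The following is a minimal base R sketch with made-up labels (the vector y below is an assumption for illustration, mirroring the spam example in the Examples section); it only shows the factor level order and the training-set class proportions, and does not call the package itself.

```r
# Hypothetical labels, mirroring the spam example in the Examples section
y <- c(rep("spam", 100), rep("non-spam", 9900))

# Order in which a user-supplied prior would have to be given
class_levels <- levels(factor(y))
print(class_levels)

# Class proportions in the training set (described above as the default prior)
default_prior <- prop.table(table(factor(y)))
print(default_prior)
```

With these labels the levels sort as "non-spam" and then "spam", so a prior such as c(0.9, 0.1) would assign 0.9 to non-spam and 0.1 to spam.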
Value
multinomial_naive_bayes returns an object of class "multinomial_naive_bayes", which is a list with the following components:
- data
list with two components: x (matrix with predictors) and y (class variable).
- levels
character vector with values of the class variable.
- laplace
amount of Laplace smoothing (additive smoothing).
- params
matrix with class conditional parameter estimates.
- prior
numeric vector with prior probabilities.
- call
the call that produced this object.
Details
This is a specialized version of the Naive Bayes classifier, where the features represent frequencies generated by a multinomial distribution.
Sparse matrices of class "dgCMatrix" (Matrix package) are supported in order to speed up calculation times.
Note that the Multinomial Naive Bayes model is not available through the naive_bayes function.
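To make the role of the laplace argument concrete, here is a minimal sketch of additively smoothed multinomial estimates. It assumes the common form (class word count + laplace) / (class total count + laplace * number of features), which is consistent with the coef() output in the Examples but is not taken from the package source. The helper estimate_params and the toy data are hypothetical.

```r
# Hypothetical helper, not part of the naivebayes package
estimate_params <- function(x, y, laplace = 0.5) {
  y <- factor(y)
  # Sum word counts per class: result is classes x features
  counts <- rowsum(x, group = y)
  # Additive smoothing, then normalise within each class
  smoothed <- counts + laplace
  params <- smoothed / rowSums(smoothed)
  # Return features x classes, matching the layout of the coef() output
  t(params)
}

# Tiny made-up count matrix: 4 documents, 2 words
x <- matrix(c(2, 0,
              1, 1,
              0, 3,
              0, 1), ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("word1", "word2")))
y <- c("a", "a", "b", "b")

params <- estimate_params(x, y, laplace = 1)
print(params)
print(colSums(params))  # each class column sums to 1
```

In the toy data, word1 never occurs in class "b", yet its smoothed estimate is 1/6 rather than 0, which is the purpose of the smoothing term.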
References
Manning, C.D., Raghavan, P., & Schütze, H. (2008). An Introduction to Information Retrieval. Cambridge: Cambridge University Press (Chapter 13). Available at https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Author
Michal Majka, michalmajka@hotmail.com
Examples
library(naivebayes)
### Simulate the data:
set.seed(1)
cols <- 3 # words
rows <- 10000 # all documents
rows_spam <- 100 # spam documents
prob_word_non_spam <- prop.table(runif(cols))
prob_word_spam <- prop.table(runif(cols))
M1 <- t(rmultinom(rows_spam, size = cols, prob = prob_word_spam))
M2 <- t(rmultinom(rows - rows_spam, size = cols, prob = prob_word_non_spam))
M <- rbind(M1, M2)
colnames(M) <- paste0("word", 1:cols)
rownames(M) <- paste0("doc", 1:rows)
head(M)
#> word1 word2 word3
#> doc1 3 0 0
#> doc2 2 0 1
#> doc3 0 0 3
#> doc4 1 1 1
#> doc5 1 1 1
#> doc6 1 1 1
y <- c(rep("spam", rows_spam), rep("non-spam", rows - rows_spam))
### Train the Multinomial Naive Bayes
laplace <- 1
mnb <- multinomial_naive_bayes(x = M, y = y, laplace = laplace)
summary(mnb)
#>
#> =========================== Multinomial Naive Bayes ============================
#>
#> - Call: multinomial_naive_bayes(x = M, y = y, laplace = laplace)
#> - Laplace: 1
#> - Classes: 2
#> - Samples: 10000
#> - Features: 3
#> - Prior probabilities:
#> - non-spam: 0.99
#> - spam: 0.01
#>
#> --------------------------------------------------------------------------------
# Classification
head(predict(mnb, newdata = M, type = "class")) # head(mnb %class% M)
#> [1] non-spam non-spam non-spam non-spam non-spam non-spam
#> Levels: non-spam spam
# Posterior probabilities
head(predict(mnb, newdata = M, type = "prob")) # head(mnb %prob% M)
#> non-spam spam
#> doc1 0.9184347 0.081565299
#> doc2 0.9622849 0.037715133
#> doc3 0.9924244 0.007575636
#> doc4 0.9927723 0.007227708
#> doc5 0.9927723 0.007227708
#> doc6 0.9927723 0.007227708
# Parameter estimates
coef(mnb)
#> non-spam spam
#> word1 0.2190688 0.4521452
#> word2 0.3099014 0.1188119
#> word3 0.4710299 0.4290429
# Compare
round(cbind(non_spam = prob_word_non_spam, spam = prob_word_spam), 3)
#> non_spam spam
#> [1,] 0.219 0.452
#> [2,] 0.307 0.100
#> [3,] 0.473 0.447
### Sparse data: train the Multinomial Naive Bayes
library(Matrix)
M_sparse <- Matrix(M, sparse = TRUE)
class(M_sparse) # dgCMatrix
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
# Fit the model with sparse data
mnb_sparse <- multinomial_naive_bayes(M_sparse, y, laplace = laplace)
# Classification
head(predict(mnb_sparse, newdata = M_sparse, type = "class"))
#> [1] non-spam non-spam non-spam non-spam non-spam non-spam
#> Levels: non-spam spam
# Posterior probabilities
head(predict(mnb_sparse, newdata = M_sparse, type = "prob"))
#> non-spam spam
#> doc1 0.9184347 0.081565299
#> doc2 0.9622849 0.037715133
#> doc3 0.9924244 0.007575636
#> doc4 0.9927723 0.007227708
#> doc5 0.9927723 0.007227708
#> doc6 0.9927723 0.007227708
# Parameter estimates
coef(mnb_sparse)
#> non-spam spam
#> word1 0.2190688 0.4521452
#> word2 0.3099014 0.1188119
#> word3 0.4710299 0.4290429
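The posterior probabilities above can be approximately reproduced by hand from the printed prior and parameter estimates. The sketch below assumes the standard multinomial Naive Bayes posterior (prior times the product of parameters raised to the word counts, normalised over classes); the numbers are copied from the output on this page, and the variable names are illustrative only.

```r
# Values copied from the summary() and coef() output above
prior  <- c("non-spam" = 0.99, "spam" = 0.01)
params <- matrix(c(0.2190688, 0.4521452,
                   0.3099014, 0.1188119,
                   0.4710299, 0.4290429),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("word", 1:3), c("non-spam", "spam")))

# Word counts of doc1 as printed by head(M)
doc1 <- c(word1 = 3, word2 = 0, word3 = 0)

# Unnormalised log-posterior: log prior + sum of counts * log parameters
# (the multinomial coefficient is the same for every class and cancels)
log_post <- log(prior) + as.vector(doc1 %*% log(params))

# Normalise on the log scale for numerical stability
posterior <- exp(log_post - max(log_post))
posterior <- posterior / sum(posterior)
print(round(posterior, 4))
```

The result should agree closely with the doc1 row of the predict(..., type = "prob") output, up to rounding of the copied parameters.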