Data Science - Individual Assignment 1 Week 7
Topic | Naïve Bayesian |
---|---|
Subtopic | Implementation |
Reference | LN |
Below is the data of employees of a private company who commute to the office using either private or public transportation.
No. | Gender | Employee Age | Salary | Marital Status | Transportation |
---|---|---|---|---|---|
1 | Male | 20 | 8,000,000 | Single | Private Vehicle |
2 | Male | 35 | 14,000,000 | Single | Public Transport |
3 | Female | 26 | 10,000,000 | Single | Public Transport |
4 | Female | 27 | 12,000,000 | Married | Private Vehicle |
5 | Male | 21 | 9,000,000 | Single | Private Vehicle |
6 | Male | 22 | 11,000,000 | Single | Private Vehicle |
7 | Female | 32 | 15,000,000 | Married | Public Transport |
8 | Female | 26 | 8,000,000 | Married | Public Transport |
9 | Male | 25 | 9,000,000 | Single | Public Transport |
10 | Female | 20 | 10,000,000 | Single | Private Vehicle |
11 | Female | 27 | 12,000,000 | Single | ? |
12 | Male | 35 | 14,000,000 | Married | ? |
Using the Naïve Bayesian method, determine the type of transportation for the two new data entries (No. 11 and No. 12). Please use or modify the code from the Week 7 LN material.
Note: Provide the program code and a screenshot of the program output.
Solution
Naïve Bayesian for Employee Transportation Classification:
1. Introduction: In this case, we use the Naïve Bayesian method to classify the type of transportation used by employees of a company. The input attributes are gender, age, salary, and marital status; the transportation type (Private Vehicle or Public Transport) is the class label to be predicted.
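For reference, the decision rule behind the method can be written as below. This is the standard textbook form rather than a formula quoted from the lecture notes, and the symbols C (the class) and x_i (the attribute values of one employee) are notation introduced here for illustration.

```latex
\[
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
\]
```

The product form is where the "naïve" assumption enters: each attribute is treated as independent of the others once the class is known.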
2. Employee Data: The model is trained on entries 1 to 10 of the employee table given in the assignment above. Entries 11 and 12, whose transportation type is unknown, are the two new data entries to classify.
3. Naïve Bayesian Implementation:
3.1. Import Library and Tokenization Function: In this step, we import the necessary libraries and define a specific function for tokenization.
```python
from typing import NamedTuple, Set
import re
```
The `tokenize` function lowercases the text, extracts every run of letters, digits, and apostrophes, and returns the result as a set of unique tokens. Each employee row is written as a short text string, so tokenization turns its attribute values into the features that the Naïve Bayesian model counts. Note that the pattern does not match commas, so a salary such as 8,000,000 is split into the tokens 8 and 000.
```python
def tokenize(text: str) -> Set[str]:
    text = text.lower()                         # Convert text to lowercase,
    all_words = re.findall("[a-z0-9']+", text)  # extract words, and
    return set(all_words)                       # return a set of unique words.
```
This function contributes to data preparation for model training and prediction by providing a token representation of the text.
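To make the effect of tokenization concrete, the sketch below applies the same pattern to the attributes of entry 1. The function is repeated only so that the snippet runs on its own; the variable name `row_1` is introduced here for illustration.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Same tokenizer as above, repeated so this snippet is self-contained."""
    return set(re.findall("[a-z0-9']+", text.lower()))


# Attributes of entry 1 from the table (gender, age, salary, marital status).
row_1 = "Male 20 8,000,000 Single"
tokens = tokenize(row_1)

# The comma is not matched, so the salary becomes the tokens "8" and "000";
# the repeated "000" collapses because the result is a set.
print(sorted(tokens))  # ['000', '20', '8', 'male', 'single']
```

Because every salary in the table ends in ",000,000", the token 000 appears in every row and carries no information; only the leading part of the salary (the amount in millions) distinguishes the rows.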
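3.2. Definition of Message Class and NaiveBayesClassifier: In this step, we define the `Message` class, which pairs the attribute text of one employee with a Boolean label, and the `NaiveBayesClassifier` class, which implements the Naïve Bayesian model. The code is reused from a spam-filter example, so the label field is still named `is_spam`; in this assignment `is_spam = True` stands for Public Transport and `is_spam = False` for Private Vehicle.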
```python
from typing import List, Tuple, Dict, Iterable
import math
from collections import defaultdict


class Message(NamedTuple):
    text: str
    is_spam: bool  # True = Public Transport, False = Private Vehicle


class NaiveBayesClassifier:
    def __init__(self, k: float = 0.5) -> None:
        self.k = k  # smoothing pseudocount
        self.tokens: Set[str] = set()
        self.token_spam_counts: Dict[str, int] = defaultdict(int)
        self.token_ham_counts: Dict[str, int] = defaultdict(int)
        self.spam_messages = self.ham_messages = 0

    def train(self, messages: Iterable[Message]) -> None:
        for message in messages:
            # Count the rows in each class
            if message.is_spam:
                self.spam_messages += 1
            else:
                self.ham_messages += 1

            # Count, per class, the rows in which each token occurs
            for token in tokenize(message.text):
                self.tokens.add(token)
                if message.is_spam:
                    self.token_spam_counts[token] += 1
                else:
                    self.token_ham_counts[token] += 1

    def _probabilities(self, token: str) -> Tuple[float, float]:
        """Return the smoothed P(token | spam) and P(token | ham)."""
        spam = self.token_spam_counts[token]
        ham = self.token_ham_counts[token]

        p_token_spam = (spam + self.k) / (self.spam_messages + 2 * self.k)
        p_token_ham = (ham + self.k) / (self.ham_messages + 2 * self.k)

        return p_token_spam, p_token_ham

    def predict(self, text: str) -> float:
        """Return P(spam | text), i.e. the probability of the class labelled True."""
        text_tokens = tokenize(text)
        log_prob_if_spam = log_prob_if_ham = 0.0

        # Every token in the vocabulary contributes, whether present or absent
        for token in self.tokens:
            prob_if_spam, prob_if_ham = self._probabilities(token)

            if token in text_tokens:
                log_prob_if_spam += math.log(prob_if_spam)
                log_prob_if_ham += math.log(prob_if_ham)
            else:
                log_prob_if_spam += math.log(1.0 - prob_if_spam)
                log_prob_if_ham += math.log(1.0 - prob_if_ham)

        prob_if_spam = math.exp(log_prob_if_spam)
        prob_if_ham = math.exp(log_prob_if_ham)
        return prob_if_spam / (prob_if_spam + prob_if_ham)
```
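The classifier stores per-class token counts during training, smooths them with the pseudocount k when estimating P(token | class), and adds log-probabilities in `predict` instead of multiplying many small probabilities. It does not use class priors, which is harmless here because both classes have five training rows.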
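As a worked example of the smoothing formula used in `_probabilities`, the sketch below evaluates it by hand for the token "married". The helper name `smoothed_probability` is introduced here for illustration and is not part of the lecture-note code; the counts are read directly from the table (entries 7 and 8 are married and use public transport, entry 4 is married and uses a private vehicle).

```python
def smoothed_probability(token_count: int, class_size: int, k: float = 0.5) -> float:
    """Estimate P(token | class) with pseudocount k: (count + k) / (class size + 2k)."""
    return (token_count + k) / (class_size + 2 * k)


# "married" occurs in 2 of the 5 Public Transport rows and 1 of the 5 Private Vehicle rows.
p_married_public = smoothed_probability(2, 5)   # 2.5 / 6
p_married_private = smoothed_probability(1, 5)  # 1.5 / 6

# A token never seen in a class still gets a small non-zero probability,
# so math.log() in predict() never receives 0.
p_unseen = smoothed_probability(0, 5)           # 0.5 / 6

print(round(p_married_public, 4), p_married_private, round(p_unseen, 4))
# prints 0.4167 0.25 0.0833
```

Without the pseudocount, any token unseen in one class would force that class's probability to zero regardless of the other attributes.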
3.3. Training Model: Next, we train the Naïve Bayesian model using the tokenized employee data.
```python
# Training data from the table: attribute text plus the class as a Boolean label
# (True = Public Transport, False = Private Vehicle). The transportation words are
# left out of the text so that the label does not leak into the features.
training_data = [
    ("male 20 8,000,000 single", False),
    ("male 35 14,000,000 single", True),
    ("female 26 10,000,000 single", True),
    ("female 27 12,000,000 married", False),
    ("male 21 9,000,000 single", False),
    ("male 22 11,000,000 single", False),
    ("female 32 15,000,000 married", True),
    ("female 26 8,000,000 married", True),
    ("male 25 9,000,000 single", True),
    ("female 20 10,000,000 single", False),
]

# Organizing training data into Message format
messages = [Message(text, is_spam) for text, is_spam in training_data]

# Training the model
model = NaiveBayesClassifier(k=0.5)
model.train(messages)
```
Training counts how often each token occurs in each class. These counts are all the model learns: Naïve Bayes treats the attributes as conditionally independent given the class, so it does not model correlations between them.
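The counts that training produces can be reproduced independently, which is a useful check on the model. The following is a minimal sketch using `collections.Counter` with explicit class names instead of the Boolean label; the names `rows`, `class_sizes`, and `token_counts` are introduced here for illustration.

```python
import re
from collections import Counter

# Entries 1 to 10 of the table as (attribute text, transportation) pairs.
rows = [
    ("male 20 8,000,000 single", "Private Vehicle"),
    ("male 35 14,000,000 single", "Public Transport"),
    ("female 26 10,000,000 single", "Public Transport"),
    ("female 27 12,000,000 married", "Private Vehicle"),
    ("male 21 9,000,000 single", "Private Vehicle"),
    ("male 22 11,000,000 single", "Private Vehicle"),
    ("female 32 15,000,000 married", "Public Transport"),
    ("female 26 8,000,000 married", "Public Transport"),
    ("male 25 9,000,000 single", "Public Transport"),
    ("female 20 10,000,000 single", "Private Vehicle"),
]

class_sizes = Counter()
token_counts = {"Private Vehicle": Counter(), "Public Transport": Counter()}

for text, label in rows:
    class_sizes[label] += 1
    # Same tokenization rule as tokenize(): unique lowercase tokens per row.
    token_counts[label].update(set(re.findall("[a-z0-9']+", text.lower())))

# Print "token, rows with Public Transport, rows with Private Vehicle".
for token in ("female", "single", "married", "26"):
    print(token, token_counts["Public Transport"][token], token_counts["Private Vehicle"][token])
```

Both classes contain five rows, and a token such as 26 that occurs only among public-transport users is exactly the case the pseudocount k is there to handle.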
4. Prediction for New Data: Subsequently, we use the trained model to predict the type of transportation for two new employee data entries.
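```python
text_data_11 = "Female 27 12,000,000 Single ?"
text_data_12 = "Male 35 14,000,000 Married ?"

# predict() returns the probability of the class labelled True in the
# training data, which is Public Transport.
prediction_11 = model.predict(text_data_11)
prediction_12 = model.predict(text_data_12)

# Threshold at 0.5: probability > 0.5 means public transport; otherwise private vehicle.
result_11 = "Public Transport" if prediction_11 > 0.5 else "Private Vehicle"
result_12 = "Public Transport" if prediction_12 > 0.5 else "Private Vehicle"

print("Prediction for data entry 11:", result_11)
print("Prediction for data entry 12:", result_12)
```
The value returned by `predict` is the probability of Public Transport, because that is the class labelled True in the training data, so a value above 0.5 maps to Public Transport. The trailing "?" contributes no token, since the tokenizer only matches letters, digits, and apostrophes.
5. Prediction Results: With the training data above, the model gives entry 11 a Public Transport probability of about 0.08 and entry 12 a probability of about 0.95. Entry 11 is therefore classified as Private Vehicle and entry 12 as Public Transport. This agrees with the closest training rows: entry 11 matches entry 4 (Private Vehicle) and entry 12 matches entry 2 (Public Transport) in every attribute except marital status.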
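As a cross-check on the predictions, the posterior can be recomputed without the classifier class. The sketch below is an independent re-implementation under the same assumptions as the code above (same tokenization, k = 0.5, no class prior); the names `ROWS`, `p_public`, `p_11`, and `p_12` are introduced here for illustration. It accumulates the log-odds directly instead of exponentiating two separate log scores.

```python
import math
import re
from collections import Counter

PUBLIC, PRIVATE = "Public Transport", "Private Vehicle"

# Entries 1 to 10 of the table as (attribute text, transportation) pairs.
ROWS = [
    ("male 20 8,000,000 single", PRIVATE),
    ("male 35 14,000,000 single", PUBLIC),
    ("female 26 10,000,000 single", PUBLIC),
    ("female 27 12,000,000 married", PRIVATE),
    ("male 21 9,000,000 single", PRIVATE),
    ("male 22 11,000,000 single", PRIVATE),
    ("female 32 15,000,000 married", PUBLIC),
    ("female 26 8,000,000 married", PUBLIC),
    ("male 25 9,000,000 single", PUBLIC),
    ("female 20 10,000,000 single", PRIVATE),
]


def tokenize(text: str) -> set:
    return set(re.findall("[a-z0-9']+", text.lower()))


def p_public(text: str, k: float = 0.5) -> float:
    """Return P(Public Transport | text) for the two-class model sketched above."""
    sizes = Counter(label for _, label in ROWS)
    counts = {PUBLIC: Counter(), PRIVATE: Counter()}
    for row_text, label in ROWS:
        counts[label].update(tokenize(row_text))

    vocabulary = set(counts[PUBLIC]) | set(counts[PRIVATE])
    present = tokenize(text)

    log_odds = 0.0  # log of P(text | public) / P(text | private)
    for token in vocabulary:
        p_pub = (counts[PUBLIC][token] + k) / (sizes[PUBLIC] + 2 * k)
        p_pri = (counts[PRIVATE][token] + k) / (sizes[PRIVATE] + 2 * k)
        if token in present:
            log_odds += math.log(p_pub) - math.log(p_pri)
        else:
            log_odds += math.log(1.0 - p_pub) - math.log(1.0 - p_pri)

    # Logistic function turns log-odds into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


p_11 = p_public("Female 27 12,000,000 Single ?")
p_12 = p_public("Male 35 14,000,000 Married ?")
print(f"Entry 11: P(public) = {p_11:.3f} -> {PUBLIC if p_11 > 0.5 else PRIVATE}")
print(f"Entry 12: P(public) = {p_12:.3f} -> {PUBLIC if p_12 > 0.5 else PRIVATE}")
```

Working with the log-odds is a design choice rather than a requirement for this dataset: with only about twenty tokens in the vocabulary, exponentiating the two log scores separately gives the same answer.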
References
Lecture Notes, Data Science, Week 7: Naïve Bayesian.
Rebekz. (n.d.). datascience_course/naive_bayes.ipynb at main · rebekz/datascience_course. GitHub. https://github.com/rebekz/datascience_course/blob/main/naive_bayes.ipynb
Jayadi, R., Firmantyo, H. M., Dzaka, M. T., Suaidy, M. F., & Putra, A. M. (2019). Employee Performance Prediction using Naïve Bayes. International Journal of Advanced Trends in Computer Science and Engineering, 8(6), 3031–3035. https://doi.org/10.30534/ijatcse/2019/59862019
Miniconda — miniconda documentation. (n.d.).