Data Science - Individual Assignment 1 Week 7


Problem Statement

Topic: Naïve Bayesian
Subtopic: Implementation
Reference: LN (lecture notes)

Below is the data of employees of a private company who commute to the office using either private or public transportation.

No.  Gender  Employee Age  Salary      Marital Status  Transportation
1    Male    20             8,000,000  Single          Private Vehicle
2    Male    35            14,000,000  Single          Public Transport
3    Female  26            10,000,000  Single          Public Transport
4    Female  27            12,000,000  Married         Private Vehicle
5    Male    21             9,000,000  Single          Private Vehicle
6    Male    22            11,000,000  Single          Private Vehicle
7    Female  32            15,000,000  Married         Public Transport
8    Female  26             8,000,000  Married         Public Transport
9    Male    25             9,000,000  Single          Public Transport
10   Female  20            10,000,000  Single          Private Vehicle
11   Female  27            12,000,000  Single          ?
12   Male    35            14,000,000  Married         ?

Using the Naïve Bayesian method, determine the type of transportation for the 2 new data entries (No. 11 and 12). Please use or modify the code from the Week 7 LN material.

Note: Provide the program code and a screenshot of the program output.

 

Solution

Naïve Bayesian for Employee Transportation Classification:

1. Introduction: In this case, we use the Naïve Bayesian method to classify the type of transportation used by employees of a company. The attributes are gender, age, salary, and marital status; the transportation choice is the class to predict.

 

2. Employee Data: The model is trained on entries 1 to 10 of the table in the Problem Statement, which have a known transportation type. Entries 11 and 12, marked "?", are the two new data entries to classify.

3. Naïve Bayesian Implementation:

3.1. Import Library and Tokenization Function: In this step, we import the necessary libraries and define a specific function for tokenization.

from typing import NamedTuple, Set
import re

The tokenize function lowercases the text, extracts runs of letters, digits, and apostrophes, and returns them as a set of unique tokens. Each employee row is written as a short text, so every attribute value (gender, age, salary, marital status) becomes one or more tokens that the Naïve Bayesian model can count.

def tokenize(text: str) -> Set[str]:
   text = text.lower()                         # Convert text to lowercase,
   all_words = re.findall("[a-z0-9']+", text)  # extract words, and
   return set(all_words)                       # return a set of unique words.

Because tokenize returns a set, each token counts at most once per row. Any character outside letters, digits, and apostrophes (such as the comma in a salary or the "?" placeholder) acts as a separator and is discarded.
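To make the tokenization concrete, the short sketch below applies the same tokenize function to one employee row. It also shows tokenize_row, a hypothetical variant that is not part of the original code and is included only to illustrate how a salary could be kept as a single token.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    text = text.lower()
    all_words = re.findall("[a-z0-9']+", text)
    return set(all_words)


def tokenize_row(text: str) -> Set[str]:
    # Hypothetical variant (an assumption, not in the original code):
    # drop thousands separators first so each salary stays one token.
    return tokenize(text.replace(",", ""))


row = "Male 20 8,000,000 Single"
print(sorted(tokenize(row)))      # ['000', '20', '8', 'male', 'single']
print(sorted(tokenize_row(row)))  # ['20', '8000000', 'male', 'single']
```

With the original function, the salary 8,000,000 is split at the commas into "8" and "000", and "000" is shared by every salary in the table, so it carries no information about the class.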

 

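3.2. Definition of Message Class and NaiveBayesClassifier: In this step, we define the Message class to represent one labeled training row, and the NaiveBayesClassifier class, a two-class implementation of the Naïve Bayesian model. The spam/ham names reflect the spam-filter setting this classifier was written for. Here the is_spam field serves as the class label: True stands for Public Transport and False for Private Vehicle, as in the training data of step 3.3.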

from typing import List, Tuple, Dict, Iterable
import math
from collections import defaultdict

class Message(NamedTuple):
   text: str
   is_spam: bool

class NaiveBayesClassifier:
   def __init__(self, k: float = 0.5) -> None:
       self.k = k
       self.tokens: Set[str] = set()
       self.token_spam_counts: Dict[str, int] = defaultdict(int)
       self.token_ham_counts: Dict[str, int] = defaultdict(int)
       self.spam_messages = self.ham_messages = 0

   def train(self, messages: Iterable[Message]) -> None:
       for message in messages:
           if message.is_spam:
               self.spam_messages += 1
           else:
               self.ham_messages += 1

           for token in tokenize(message.text):
               self.tokens.add(token)
               if message.is_spam:
                   self.token_spam_counts[token] += 1
               else:
                   self.token_ham_counts[token] += 1

   def _probabilities(self, token: str) -> Tuple[float, float]:
       spam = self.token_spam_counts[token]
       ham = self.token_ham_counts[token]
       p_token_spam = (spam + self.k) / (self.spam_messages + 2 * self.k)
       p_token_ham = (ham + self.k) / (self.ham_messages + 2 * self.k)
       return p_token_spam, p_token_ham

   def predict(self, text: str) -> float:
       text_tokens = tokenize(text)
       log_prob_if_spam = log_prob_if_ham = 0.0

       for token in self.tokens:
           prob_if_spam, prob_if_ham = self._probabilities(token)

           if token in text_tokens:
               log_prob_if_spam += math.log(prob_if_spam)
               log_prob_if_ham += math.log(prob_if_ham)
           else:
               log_prob_if_spam += math.log(1.0 - prob_if_spam)
               log_prob_if_ham += math.log(1.0 - prob_if_ham)

       prob_if_spam = math.exp(log_prob_if_spam)
       prob_if_ham = math.exp(log_prob_if_ham)
       return prob_if_spam / (prob_if_spam + prob_if_ham)

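The train method counts, for each class, how many messages contain each token. The _probabilities method turns those counts into smoothed estimates using the pseudocount k. The predict method adds up log-probabilities over every token in the vocabulary, whether present in the text or absent, and returns the probability of the is_spam class. Class priors are not used, which amounts to treating both classes as equally likely.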
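The smoothing and scoring steps in _probabilities and predict can be written out as follows. The symbols n_{t,c}, N_c, x_t, and V are notation introduced here for illustration. The numeric line uses the token "married", which occurs in 2 of the 5 Public Transport rows and 1 of the 5 Private Vehicle rows of the table.

```latex
% Smoothed probability that a row of class c contains token t
% (n_{t,c}: rows of class c containing t, N_c: rows of class c, k: pseudocount)
P(t \mid c) = \frac{k + n_{t,c}}{2k + N_c}

% Example with k = 0.5 and N_c = 5 for both classes
P(\text{married} \mid \text{public}) = \frac{0.5 + 2}{1 + 5} \approx 0.417,
\qquad
P(\text{married} \mid \text{private}) = \frac{0.5 + 1}{1 + 5} = 0.25

% Log-likelihood of a row x over the vocabulary V (x_t = 1 if t is present)
\log L_c(x) = \sum_{t \in V} \bigl[ x_t \log P(t \mid c)
            + (1 - x_t) \log\bigl(1 - P(t \mid c)\bigr) \bigr]

% Value returned by predict
P(\text{spam} \mid x) = \frac{L_{\text{spam}}(x)}{L_{\text{spam}}(x) + L_{\text{ham}}(x)}
```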

 

3.3. Training Model: Next, we train the Naïve Bayesian model using the tokenized employee data.

# Training data from the table (rows 1-10).
# Each entry is (attribute text, label). The label reuses is_spam:
# True = Public Transport, False = Private Vehicle.
# The transportation column is the label, so it is left out of the text.
training_data = [
  ("male 20 8,000,000 single", False),
  ("male 35 14,000,000 single", True),
  ("female 26 10,000,000 single", True),
  ("female 27 12,000,000 married", False),
  ("male 21 9,000,000 single", False),
  ("male 22 11,000,000 single", False),
  ("female 32 15,000,000 married", True),
  ("female 26 8,000,000 married", True),
  ("male 25 9,000,000 single", True),
  ("female 20 10,000,000 single", False),
]

# Organizing training data into Message format
messages = [Message(text, is_spam) for text, is_spam in training_data]

# Training the model
model = NaiveBayesClassifier(k=0.5)
model.train(messages)

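This step creates the model and trains it: for each class, it counts how many training rows contain each token. Naïve Bayes treats the attributes as conditionally independent given the class, so it does not model correlations between attributes.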
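The counts gathered during training can be inspected directly. The following is a minimal standalone sketch, not the classifier itself: ROWS, public_counts, and private_counts are names chosen here for illustration, and the rows are rewritten from the table with the transportation column kept as a separate boolean.

```python
import re
from collections import Counter

# (attribute text, uses_public_transport) for rows 1-10 of the table
ROWS = [
    ("male 20 8,000,000 single", False),
    ("male 35 14,000,000 single", True),
    ("female 26 10,000,000 single", True),
    ("female 27 12,000,000 married", False),
    ("male 21 9,000,000 single", False),
    ("male 22 11,000,000 single", False),
    ("female 32 15,000,000 married", True),
    ("female 26 8,000,000 married", True),
    ("male 25 9,000,000 single", True),
    ("female 20 10,000,000 single", False),
]


def tokenize(text):
    return set(re.findall("[a-z0-9']+", text.lower()))


# Count, per class, how many rows contain each token.
public_counts, private_counts = Counter(), Counter()
for text, is_public in ROWS:
    target = public_counts if is_public else private_counts
    target.update(tokenize(text))

for token in ("male", "female", "single", "married"):
    print(token, public_counts[token], private_counts[token])
# male 2 3
# female 3 2
# single 3 4
# married 2 1
```

Both classes have 5 rows each, so no single attribute separates them cleanly; the age and salary tokens, which appear in only one or two rows, end up driving most of the prediction.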

 

4. Prediction for New Data: Subsequently, we use the trained model to predict the type of transportation for two new employee data entries.

# The "?" placeholder is left out; the tokenizer would discard it anyway.
text_data_11 = "Female 27 12,000,000 Single"
text_data_12 = "Male 35 14,000,000 Married"

prediction_11 = model.predict(text_data_11)
prediction_12 = model.predict(text_data_12)

# predict() returns P(is_spam), and is_spam=True encodes Public Transport in the
# training data, so a probability above 0.5 means Public Transport.
result_11 = "Public Transport" if prediction_11 > 0.5 else "Private Vehicle"
result_12 = "Public Transport" if prediction_12 > 0.5 else "Private Vehicle"

print("Prediction for data entry 11:", result_11)
print("Prediction for data entry 12:", result_12)

This step provides an example of using the model to predict the type of transportation for two new employee data entries based on the trained model.
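As a cross-check of which class a high probability refers to, the sketch below recomputes the probability of Public Transport for both new entries without the classifier class. It is a compact illustration under the same assumptions (set-based tokens, pseudocount k = 0.5, equal class priors); ROWS and p_public are hypothetical names introduced here.

```python
import math
import re
from collections import Counter

PUBLIC, PRIVATE = "Public Transport", "Private Vehicle"

# Rows 1-10 of the table, with the label kept apart from the attributes.
ROWS = [
    ("male 20 8,000,000 single", PRIVATE),
    ("male 35 14,000,000 single", PUBLIC),
    ("female 26 10,000,000 single", PUBLIC),
    ("female 27 12,000,000 married", PRIVATE),
    ("male 21 9,000,000 single", PRIVATE),
    ("male 22 11,000,000 single", PRIVATE),
    ("female 32 15,000,000 married", PUBLIC),
    ("female 26 8,000,000 married", PUBLIC),
    ("male 25 9,000,000 single", PUBLIC),
    ("female 20 10,000,000 single", PRIVATE),
]


def tokenize(text):
    return set(re.findall("[a-z0-9']+", text.lower()))


def p_public(text, k=0.5):
    """Estimate P(Public Transport | text) with smoothed token counts."""
    n_rows = Counter(label for _, label in ROWS)
    counts = {PUBLIC: Counter(), PRIVATE: Counter()}
    for row, label in ROWS:
        counts[label].update(tokenize(row))
    vocab = set(counts[PUBLIC]) | set(counts[PRIVATE])
    present = tokenize(text)

    log_like = {}
    for label in (PUBLIC, PRIVATE):
        total = 0.0
        for token in vocab:
            p = (counts[label][token] + k) / (n_rows[label] + 2 * k)
            # Present tokens contribute log p, absent ones log(1 - p).
            total += math.log(p if token in present else 1.0 - p)
        log_like[label] = total

    pub = math.exp(log_like[PUBLIC])
    priv = math.exp(log_like[PRIVATE])
    return pub / (pub + priv)


p11 = p_public("female 27 12,000,000 single")
p12 = p_public("male 35 14,000,000 married")
print(f"Entry 11: P(public) = {p11:.3f}")
print(f"Entry 12: P(public) = {p12:.3f}")
```

Under these assumptions the probability of Public Transport comes out at about 0.08 for entry 11 and about 0.95 for entry 12. Entry 11 shares its age and salary with row 4 (Private Vehicle), and entry 12 shares them with row 2 (Public Transport), which explains the direction of both results.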

 

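5. Prediction Results: Because is_spam=True encodes Public Transport, model.predict returns the estimated probability of Public Transport. For entry 11 this probability is about 0.08, so the prediction is Private Vehicle. For entry 12 it is about 0.95, so the prediction is Public Transport. With only ten training rows, these estimates are rough.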

 

References

Lecture Notes, Data Science, Week 7: Naïve Bayesian.

Rebekz. (n.d.). datascience_course/naive_bayes.ipynb at main · rebekz/datascience_course. GitHub. https://github.com/rebekz/datascience_course/blob/main/naive_bayes.ipynb

Jayadi, R., Firmantyo, H. M., Dzaka, M. T., Suaidy, M. F., & Putra, A. M. (2019). Employee Performance Prediction using Naïve Bayes. International Journal of Advanced Trends in Computer Science and Engineering, 8(6), 3031–3035. https://doi.org/10.30534/ijatcse/2019/59862019

Miniconda — miniconda documentation. (n.d.).https://docs.conda.io/projects/miniconda/en/latest/

Author: Dikhi Martin, Software Engineer