Data Science - Individual Assignment 1 Week 7

10 months ago by Dikhi Martin, Binus

Problem Statement

Topic	Naïve Bayesian
Subtopic	Implementation
Reference	LN

Below is the data of employees of a private company who commute to the office using either private or public transportation.

No.	Gender	Employee Age	Salary	Marital Status	Transportation
1	Male	20	8,000,000	Single	Private Vehicle
2	Male	35	14,000,000	Single	Public Transport
3	Female	26	10,000,000	Single	Public Transport
4	Female	27	12,000,000	Married	Private Vehicle
5	Male	21	9,000,000	Single	Private Vehicle
6	Male	22	11,000,000	Single	Private Vehicle
7	Female	32	15,000,000	Married	Public Transport
8	Female	26	8,000,000	Married	Public Transport
9	Male	25	9,000,000	Single	Public Transport
10	Female	20	10,000,000	Single	Private Vehicle
11	Female	27	12,000,000	Single	?
12	Male	35	14,000,000	Married	?

Using the Naïve Bayesian method, determine the type of transportation for the two new data entries (No. 11 and 12). Please use or modify the code from the LN material for Week 7.

Note: Provide the program code and a screenshot of the program output.

Solution

Naïve Bayesian for Employee Transportation Classification:

1. Introduction: In this case, we use the Naïve Bayesian method to classify the type of transportation used by employees of a company based on four attributes: gender, age, salary, and marital status. The transportation type (Private Vehicle or Public Transport) is the class label to be predicted, not an input attribute.
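As a reminder of what the method computes, here is the standard Naïve Bayes posterior for a class C given attribute values x_1, ..., x_n. The symbols are chosen here for illustration and do not come from the lecture notes.

```latex
P(C \mid x_1, \dots, x_n)
  = \frac{P(C)\,\prod_{i=1}^{n} P(x_i \mid C)}
         {\sum_{C'} P(C')\,\prod_{i=1}^{n} P(x_i \mid C')}
```

The "naïve" part is the product: each attribute is assumed to be conditionally independent of the others given the class.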

2. Employee Data: The model is trained on the ten labeled rows (No. 1 to 10) of the table in the Problem Statement and is then used to predict the type of transportation for the two unlabeled rows (No. 11 and 12).

3. Naïve Bayesian Implementation:

3.1. Import Library and Tokenization Function: In this step, we import the necessary libraries and define a specific function for tokenization.

from typing import NamedTuple, Set
import re

The tokenize function lowercases the text, extracts every run of letters, digits, and apostrophes, and returns the result as a set of unique tokens. Tokenization turns each employee record, written as one line of text, into a set of features that the Naïve Bayesian model can count. Note that punctuation splits tokens, so a salary such as 8,000,000 becomes the two tokens 8 and 000.

def tokenize(text: str) -> Set[str]:
    text = text.lower()                         # Convert text to lowercase,
    all_words = re.findall("[a-z0-9']+", text)  # extract words, and
    return set(all_words)                       # return a set of unique words.

This function contributes to data preparation for model training and prediction by providing a token representation of the text.
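To make the effect of tokenization concrete, the following small check applies the same tokenize function to the attributes of row 1. It is only an illustration; the variable name row_tokens is not part of the original code.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Same tokenizer as above: lowercase, then keep runs of [a-z0-9']."""
    text = text.lower()
    return set(re.findall("[a-z0-9']+", text))


# Attributes of row 1 from the table (label left out on purpose).
row_tokens = tokenize("Male 20 8,000,000 Single")

# The commas split the salary, and the set keeps "000" only once.
print(sorted(row_tokens))  # ['000', '20', '8', 'male', 'single']
```

Because every salary in the table ends in ,000,000, the token 000 appears in every row and carries no information about the class.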

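3.2. Definition of Message Class and NaiveBayesClassifier: In this step, we define the Message class, which represents one labeled training record, and the NaiveBayesClassifier class, which implements the Naïve Bayesian model. The identifiers is_spam, spam, and ham reflect the spam-filter origin of this code; in this solution is_spam=True stands for Public Transport and is_spam=False stands for Private Vehicle.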

from typing import List, Tuple, Dict, Iterable
import math
from collections import defaultdict

class Message(NamedTuple):
    text: str
    is_spam: bool

class NaiveBayesClassifier:
    def __init__(self, k: float = 0.5) -> None:
        self.k = k
        self.tokens: Set[str] = set()
        self.token_spam_counts: Dict[str, int] = defaultdict(int)
        self.token_ham_counts: Dict[str, int] = defaultdict(int)
        self.spam_messages = self.ham_messages = 0

    def train(self, messages: Iterable[Message]) -> None:
        for message in messages:
            if message.is_spam:
                self.spam_messages += 1
            else:
                self.ham_messages += 1

            for token in tokenize(message.text):
                self.tokens.add(token)
                if message.is_spam:
                    self.token_spam_counts[token] += 1
                else:
                    self.token_ham_counts[token] += 1

    def _probabilities(self, token: str) -> Tuple[float, float]:
        spam = self.token_spam_counts[token]
        ham = self.token_ham_counts[token]
        p_token_spam = (spam + self.k) / (self.spam_messages + 2 * self.k)
        p_token_ham = (ham + self.k) / (self.ham_messages + 2 * self.k)
        return p_token_spam, p_token_ham

    def predict(self, text: str) -> float:
        text_tokens = tokenize(text)
        log_prob_if_spam = log_prob_if_ham = 0.0

        for token in self.tokens:
            prob_if_spam, prob_if_ham = self._probabilities(token)

            if token in text_tokens:
                log_prob_if_spam += math.log(prob_if_spam)
                log_prob_if_ham += math.log(prob_if_ham)
            else:
                log_prob_if_spam += math.log(1.0 - prob_if_spam)
                log_prob_if_ham += math.log(1.0 - prob_if_ham)

        prob_if_spam = math.exp(log_prob_if_spam)
        prob_if_ham = math.exp(log_prob_if_ham)
        return prob_if_spam / (prob_if_spam + prob_if_ham)

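The classifier keeps, for each token, the number of training records of each class that contain it. The _probabilities method converts these counts into smoothed estimates of P(token | class), where the pseudocount k prevents zero probabilities for tokens never seen with a class. The predict method combines these estimates in log space and returns the probability of the is_spam=True class.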
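The smoothing formula used in _probabilities can be checked by hand. The sketch below recomputes it for the token male, which appears in 2 of the 5 Public Transport rows and 3 of the 5 Private Vehicle rows of the table. The function name smoothed_probability is a stand-in introduced here, not part of the original class.

```python
def smoothed_probability(token_count: int, class_total: int, k: float = 0.5) -> float:
    """Estimate P(token | class) with pseudocount k, as in _probabilities."""
    return (token_count + k) / (class_total + 2 * k)


# "male" occurs in rows 2 and 9 (Public Transport) and rows 1, 5, 6 (Private Vehicle).
p_male_public = smoothed_probability(2, 5)
p_male_private = smoothed_probability(3, 5)

# A token never seen with a class still gets a small non-zero probability.
p_unseen = smoothed_probability(0, 5)

print(round(p_male_public, 4), round(p_male_private, 4), round(p_unseen, 4))
# 0.4167 0.5833 0.0833
```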

3.3. Training Model: Next, we train the Naïve Bayesian model using the tokenized employee data.

# Training data from the table (rows 1 to 10).
# Each text holds only the attributes; the class label is the boolean:
# True = Public Transport, False = Private Vehicle.
# The label words are kept out of the text so they do not leak in as features.
training_data = [
    ("male 20 8,000,000 single", False),
    ("male 35 14,000,000 single", True),
    ("female 26 10,000,000 single", True),
    ("female 27 12,000,000 married", False),
    ("male 21 9,000,000 single", False),
    ("male 22 11,000,000 single", False),
    ("female 32 15,000,000 married", True),
    ("female 26 8,000,000 married", True),
    ("male 25 9,000,000 single", True),
    ("female 20 10,000,000 single", False),
]

# Organizing training data into Message format
messages = [Message(text, is_spam) for text, is_spam in training_data]

# Training the model
model = NaiveBayesClassifier(k=0.5)
model.train(messages)

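This step creates the model and trains it on the tokenized data. Training only counts how often each token occurs in each class; because Naïve Bayes treats the attributes as conditionally independent given the class, it does not model correlations between attributes.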
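Since the tokenizer splits on punctuation and mixes all attributes into one bag of tokens, one possible refinement is to encode each attribute with its own prefix before tokenizing. The sketch below is an assumption added for illustration: Employee and encode_row are hypothetical names, and the prefixes avoid underscores because the tokenizer's pattern would split on them.

```python
import re
from typing import NamedTuple, Set


class Employee(NamedTuple):
    # Hypothetical record type for one row of the table, without the label.
    gender: str
    age: int
    salary: int
    marital_status: str


def tokenize(text: str) -> Set[str]:
    """Same tokenizer as in section 3.1."""
    return set(re.findall("[a-z0-9']+", text.lower()))


def encode_row(row: Employee) -> str:
    """Build one token per attribute, e.g. 'age20' or 'salary8000000'."""
    return " ".join([
        f"gender{row.gender}",
        f"age{row.age}",
        f"salary{row.salary}",          # no thousands separators, so no split
        f"marital{row.marital_status}",
    ])


encoded = encode_row(Employee("Male", 20, 8_000_000, "Single"))
print(sorted(tokenize(encoded)))
# ['age20', 'gendermale', 'maritalsingle', 'salary8000000']
```

With this encoding an age of 20 can never be confused with a salary fragment, but numeric values are still matched exactly, so an age or salary that does not occur in the training rows contributes no evidence of its own.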

4. Prediction for New Data: Next, we use the trained model to predict the type of transportation for the two new employee data entries.

text_data_11 = "Female 27 12,000,000 Single ?"
text_data_12 = "Male 35 14,000,000 Married ?"

prediction_11 = model.predict(text_data_11)
prediction_12 = model.predict(text_data_12)

# model.predict returns the probability of the is_spam=True class, which encodes Public Transport.
# Threshold: probability > 0.5 means Public Transport; probability <= 0.5 means Private Vehicle.
result_11 = "Public Transport" if prediction_11 > 0.5 else "Private Vehicle"
result_12 = "Public Transport" if prediction_12 > 0.5 else "Private Vehicle"

print("Prediction for data entry 11:", result_11)
print("Prediction for data entry 12:", result_12)

This step provides an example of using the model to predict the type of transportation for two new employee data entries based on the trained model.
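The last lines of predict turn two log-likelihood sums into a single probability, implicitly assuming equal class priors (which holds here, with five rows per class). With many tokens, exponentiating the sums directly can underflow to zero. The sketch below shows a numerically safer variant of that final step; posterior_from_logs is a hypothetical helper name, not part of the lecture code.

```python
import math


def posterior_from_logs(log_prob_if_spam: float, log_prob_if_ham: float) -> float:
    """Return P(spam | text) from two log-likelihoods, assuming equal priors.

    Subtracting the larger log value first keeps both exponents <= 0,
    so neither overflow nor a 0/0 division can occur.
    """
    largest = max(log_prob_if_spam, log_prob_if_ham)
    spam = math.exp(log_prob_if_spam - largest)
    ham = math.exp(log_prob_if_ham - largest)
    return spam / (spam + ham)


# Same result as the direct formula when the values are moderate ...
p_moderate = posterior_from_logs(math.log(0.3), math.log(0.1))

# ... and still well defined when math.exp(-2000) would underflow to 0.0.
p_extreme = posterior_from_logs(-2000.0, -2001.0)

print(round(p_moderate, 4), round(p_extreme, 4))  # 0.75 0.7311
```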

5. Prediction Results: Running the code above prints the type of transportation that the trained Naïve Bayesian model predicts for the new data entries 11 and 12.

References

Lecture Notes, Data Science, Week 7: Naïve Bayesian.

Rebekz. (n.d.). datascience_course/naive_bayes.ipynb at main · rebekz/datascience_course. GitHub. https://github.com/rebekz/datascience_course/blob/main/naive_bayes.ipynb

Jayadi, R., Firmantyo, H. M., Dzaka, M. T., Suaidy, M. F., & Putra, A. M. (2019). Employee Performance Prediction using Naïve Bayes. International Journal of Advanced Trends in Computer Science and Engineering, 8(6), 3031–3035. https://doi.org/10.30534/ijatcse/2019/59862019

Miniconda — miniconda documentation. (n.d.). https://docs.conda.io/projects/miniconda/en/latest/

Author

Dikhi Martin, Software Engineer

Script Savvy