Python ML Workflow: Data Prep to KNN Modeling


1. Essential Python Libraries

Import necessary libraries for data manipulation, modeling, and visualization:

  • import numpy as np
  • import pandas as pd
  • import matplotlib.pyplot as plt
  • from sklearn.datasets import load_breast_cancer
  • from sklearn.compose import ColumnTransformer
  • from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
  • from sklearn.impute import SimpleImputer
  • from sklearn.decomposition import PCA
  • from sklearn.neighbors import KNeighborsClassifier
  • from sklearn.svm import SVC
  • from sklearn.pipeline import Pipeline
  • from sklearn.model_selection import train_test_split, GridSearchCV
  • from sklearn.metrics import accuracy_score, confusion_matrix

2. Load Data and Define X, Y

Load the dataset and define the feature matrix (X) and target vector (y). For this example, we assume the use of a standard dataset:

data = load_breast_cancer()
X, y = data.data, data.target

3. Handle Missing Values (Imputation)

Check for missing values and apply imputation if necessary. We use the mean strategy for numerical features:

pd.DataFrame(X).isnull().sum()

imputer = SimpleImputer(strategy='mean') # Or 'most_frequent'
X_imputed = imputer.fit_transform(X)
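To see what the imputer does, here is a minimal self-contained sketch on a hypothetical toy matrix (the data and variable names below are illustrative, not part of the workflow above):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value per column (illustrative only)
X_demo = np.array([[1.0, np.nan],
                   [3.0, 4.0],
                   [np.nan, 8.0]])

imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X_demo)
print(X_filled)
# Column means fill the gaps: (1+3)/2 = 2.0 and (4+8)/2 = 6.0
```

With strategy='most_frequent', the mode of each column would be used instead, which also works for categorical columns.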

4. Data Transformation and Scaling

Define numerical (num) and categorical (cat) feature lists. Set up the preprocessing pipeline:

preprocessor = ColumnTransformer([
    ('num', MinMaxScaler(), num),
    ('cat', OneHotEncoder(drop='first'), cat)])

X_tr = preprocessor.fit_transform(X)

If the target variable (y) is categorical text, binarize it:

# Example for binarizing a target variable (if applicable)
y_bin = y.apply(lambda x: 0 if x == '<=50K' else 1)
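The ColumnTransformer step can be sketched end to end on a hypothetical two-column DataFrame (the frame and column names below are made up for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame: one numeric and one categorical column
df = pd.DataFrame({'age': [20, 30, 40],
                   'city': ['A', 'B', 'A']})
num, cat = ['age'], ['city']

preprocessor = ColumnTransformer([
    ('num', MinMaxScaler(), num),
    ('cat', OneHotEncoder(drop='first'), cat)])

X_tr = preprocessor.fit_transform(df)
# 'age' is scaled to [0, 1]; 'city' becomes a single dummy
# column for 'B' after drop='first' removes the 'A' column.
print(X_tr)
```

drop='first' avoids the redundant dummy column, which matters for linear models; for KNN it is mostly a matter of keeping the feature space small.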

Dimensionality Reduction with PCA

Apply Principal Component Analysis (PCA) to reduce feature space:

pca = PCA(n_components=2) 
X_pca = pca.fit_transform(X_tr)
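It is worth checking how much variance the two components retain before committing to n_components=2. A minimal sketch on synthetic data (the matrix below is fabricated so that its true rank is 2, purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 5 features that are all linear combinations
# of 2 latent factors, so 2 components capture (almost) everything
base = rng.normal(size=(100, 2))
X_demo = np.hstack([base, base @ rng.normal(size=(2, 3))])

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_demo)
print(X_pca.shape)  # (100, 2)
# Fraction of total variance kept by the 2 components:
print(pca.explained_variance_ratio_.sum())
```

On real data the explained-variance ratio will be below 1.0; if it is low, consider keeping more components.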

5. Split Data into Train and Test Sets

Separate the processed data for training and testing the model:

X_train, X_test, y_train, y_test = train_test_split(X_pca, y, random_state=1234, test_size=0.2)

6. K-Nearest Neighbors (KNN) Hyperparameter Tuning

Determine the optimal number of neighbors (k) using cross-validation or simple testing:

k_values = range(1, 21)  # test a reasonable range of k values
accuracies = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy = knn.score(X_test, y_test)
    accuracies.append(accuracy)

plt.plot(k_values, accuracies, marker='o')
plt.title('KNN Accuracy vs. k Value')
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

# Select the best k
best_k = k_values[np.argmax(accuracies)]
print(f"The best k value is: {best_k}")
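Note that the loop above picks k by scoring on the test set, which leaks test information into model selection. GridSearchCV (already in the import list) does the search with cross-validation on the training data only. A sketch on synthetic stand-in data (make_classification here replaces the processed features purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the processed feature matrix (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     random_state=1234)
X_tr_, X_te_, y_tr_, y_te_ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234)

# 5-fold cross-validated search over k on the training split only
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': range(1, 21)},
                    cv=5, scoring='accuracy')
grid.fit(X_tr_, y_tr_)

print(grid.best_params_['n_neighbors'])
print(grid.score(X_te_, y_te_))  # held-out accuracy of the refit best model
```

grid.best_estimator_ is already refit on the full training split, so it can be used directly for the final predictions in the next step.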

7. Final Model Training and Prediction

Train the final KNN model using the best k:

knn_best = KNeighborsClassifier(n_neighbors=best_k)
knn_best.fit(X_train, y_train)

y_pred = knn_best.predict(X_test)

Model Performance Assessment

Calculate the accuracy score:

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

8. Confusion Matrix Analysis

Generate and display the confusion matrix to evaluate classification performance:

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Interpretation:
# [[TN FP] 
#  [FN TP]]
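The four cells can be unpacked directly with .ravel(), which follows the [[TN, FP], [FN, TP]] layout shown above. A minimal sketch on made-up labels (the arrays below are illustrative only):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative): 0 = negative class, 1 = positive class
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 1, 1, 0, 0])

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_demo).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```

From these counts, metrics such as precision (tp / (tp + fp)) and recall (tp / (tp + fn)) follow directly.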
