Python ML Workflow: Data Prep to KNN Modeling
1. Essential Python Libraries
Import necessary libraries for data manipulation, modeling, and visualization:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
From sklearn, import the modules needed for preprocessing, dimensionality reduction, modeling, and model selection:
from sklearn.datasets import load_breast_cancer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
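Together, these imports cover the whole workflow in this document. As a preview, here is a minimal sketch that chains them into a single scikit-learn Pipeline on the same breast-cancer dataset used below (the hyperparameters n_components=2 and n_neighbors=5 are illustrative choices, not tuned values):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Load data and hold out a test set
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=1234)

# Impute -> scale -> reduce -> classify, all in one estimator
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler()),
    ('pca', PCA(n_components=2)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.2f}")
```

A Pipeline guarantees that imputation, scaling, and PCA are fit on the training data only and then reapplied to the test data, which avoids leakage compared with transforming everything up front.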
2. Load Data and Define X, Y
Load the dataset. For this example, we assume the use of a standard dataset:
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
3. Handle Missing Values (Imputation)
Check for missing values and apply imputation if necessary. We use the mean strategy for numerical features:
X.isnull().sum()
imputer = SimpleImputer(strategy='mean') # Or 'most_frequent'
X_imputed = imputer.fit_transform(X)
4. Data Transformation and Scaling
Define numerical (num) and categorical (cat) feature lists. Set up the preprocessing pipeline:
preprocessor = ColumnTransformer([
    ('num', MinMaxScaler(), num),
    ('cat', OneHotEncoder(drop='first'), cat)
])
X_tr = preprocessor.fit_transform(X)
If the target variable (y) is categorical text, binarize it:
# Example for binarizing a target variable (if applicable)
y_bin = y.apply(lambda x: 0 if x == '<=50K' else 1)  # note: ('<=50K') is a plain string, not a tuple, so use == rather than in
Dimensionality Reduction with PCA
Apply Principal Component Analysis (PCA) to reduce feature space:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_tr)
5. Split Data into Train and Test Sets
Separate the processed data for training and testing the model:
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, random_state=1234, test_size=0.2)
6. K-Nearest Neighbors (KNN) Hyperparameter Tuning
Determine the optimal number of neighbors (k). Here we use simple testing against the held-out set; note that selecting k this way peeks at the test data, so cross-validation on the training set is the more rigorous choice:
k_values = range(1, 4)
accuracies = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy = knn.score(X_test, y_test)
    accuracies.append(accuracy)
plt.plot(k_values, accuracies, marker='o')
plt.title('KNN Accuracy vs. k Value')
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()
# Select the best k
best_k = k_values[np.argmax(accuracies)]
print(f"The best k value is: {best_k}")
7. Final Model Training and Prediction
Train the final KNN model using the best k:
knn_best = KNeighborsClassifier(n_neighbors=best_k)
knn_best.fit(X_train, y_train)
y_pred = knn_best.predict(X_test)
Model Performance Assessment
Calculate the accuracy score:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
8. Confusion Matrix Analysis
Generate and display the confusion matrix to evaluate classification performance:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# Interpretation:
# [[TN FP]
# [FN TP]]
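As an alternative to the manual loop in step 6, GridSearchCV (already among the imports) selects k by cross-validation on the training data alone, so the test set is touched only once at the end. A minimal sketch, assuming the same dataset and split used above and an illustrative search range of k = 1..20:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=1234)

# 5-fold cross-validated search over k, fit on the training set only
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': range(1, 21)},
    cv=5,
    scoring='accuracy')
grid.fit(X_train, y_train)

print("Best k:", grid.best_params_['n_neighbors'])
print(f"Test accuracy: {grid.score(X_test, y_test):.2f}")
```

grid.best_estimator_ is already refit on the full training set with the best k, so it can be used directly for prediction in place of knn_best.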