Data Science Career Transition & Predictive Modeling
Introduction: A Data Science Journey
My name is Amit Kadam, and I currently reside in Mumbai. I completed my Bachelor of Engineering (B.E.) degree in 2021. After graduation, the pandemic limited job opportunities, and my family faced financial challenges, so I took my first opportunity at Sterling as a Senior Associate, where I worked for 2.5 years.
Initially, I was responsible for document verification, but I was soon promoted to manage drug health screening processes. In this role, I handled candidate health reports, prepared data for analysis, and developed strong attention to detail and data-handling skills.
During this time, a friend who had successfully transitioned into data science encouraged me to explore the field. I started with YouTube tutorials and taught myself Python, which quickly sparked my interest in data science. To go further, I joined Exceller Classes, where I gained a clear and organized understanding of data science concepts and techniques. This motivated me to switch careers, combining my analytical skills with my passion for using data to solve problems.
Recently, during an internship, I worked on two projects: one to predict solar power generation and another to forecast crude oil prices using time series analysis. These projects helped me use machine learning techniques to build accurate predictive models. I’m now eager to apply my data science skills, including machine learning, statistical analysis, and data visualization, to make meaningful contributions in this field.
Solar Power Generation Prediction Project
Project Methodology & Data Preprocessing
I followed a structured approach for this project:
- Data Import & Initial Exploration: I imported the dataset into Jupyter Notebook using the Pandas library. I began by exploring the data structure with the describe() function to understand key statistics.
- Missing Value Handling: I handled missing values by replacing them with the mean, a common approach for numerical data. Although the median is another option, I chose the mean in this case.
- Exploratory Data Analysis (EDA): I conducted EDA using box plots to check for outliers and confirmed there were none. I then computed a correlation matrix to examine relationships between variables and used scatter plots to visualize the relationship between pairs of continuous variables. This helped me understand how the environmental factors interacted.
- Data Partitioning: Once I had a good understanding of the data, I partitioned it into training and testing sets using the train_test_split() function from Scikit-learn.
- Feature Scaling: To put all features on a comparable scale, I standardized the independent variables using StandardScaler(), which centers the data by removing the mean and scales it to unit variance.
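The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the data is synthetic, the column names (temperature, humidity, wind_speed, power_generated) are hypothetical, and the 80/20 split and random seeds are assumptions. The box plot and scatter plot steps are omitted to keep the sketch free of plotting dependencies.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the solar dataset; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "temperature": rng.normal(25, 5, n),
    "humidity": rng.uniform(20, 90, n),
    "wind_speed": rng.uniform(0, 15, n),
})
df["power_generated"] = (
    1000 * df["temperature"] - 50 * df["humidity"] + rng.normal(0, 500, n)
)
# Blank out a few values to mimic missing data.
df.loc[df.sample(10, random_state=0).index, "humidity"] = np.nan

# Initial exploration: summary statistics per column.
print(df.describe())

# Missing value handling: replace NaNs with each column's mean.
df = df.fillna(df.mean())

# EDA: pairwise correlations between all variables.
corr = df.corr()
print(corr["power_generated"])

# Data partitioning (the 80/20 split is an assumption).
X = df.drop(columns="power_generated")
y = df["power_generated"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature scaling: fit on the training set only, then apply to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training set alone and reusing it for the test set keeps test-set statistics from leaking into the model inputs.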
Machine Learning Model Development & Evaluation
After preparing the data, I moved on to building and evaluating various machine learning models:
- Linear Regression: I started with Linear Regression, but the accuracy was not satisfactory.
- Random Forest: Next, I implemented a Random Forest model, which initially gave around 70% accuracy. To improve performance, I applied hyperparameter tuning using GridSearchCV and optimized parameters like n_estimators, max_depth, and max_leaf_nodes. After tuning, the Random Forest model achieved an R² of 0.9881 on the training set and 0.8878 on the testing set.
- Lasso & Ridge Regression: I also explored Lasso and Ridge Regression models, but their accuracy was lower, around 65%.
- Gradient Boosting: Finally, I applied a Gradient Boosting model and tuned it to achieve the best results. The tuned Gradient Boosting model gave an R² score of 0.9476 on the training set and 0.9098 on the testing set, with an RMSE of 2907.03, lower than that of the Random Forest model.
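The Random Forest tuning step could look something like the sketch below. It uses synthetic data from make_regression rather than the solar dataset, and the grid values, the 3-fold cross-validation, and the R² scoring choice are all assumptions made for illustration; the grids actually searched in the project are not given here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the solar dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid over the three parameters named in the text.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
    "max_leaf_nodes": [20, 50],
}

# Exhaustive search over the grid with cross-validation on the training set.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the full training set with the best parameters.
best_rf = search.best_estimator_
print("best parameters:", search.best_params_)
print("train R2:", best_rf.score(X_train, y_train))
print("test R2:", best_rf.score(X_test, y_test))
```

The same pattern applies to GradientBoostingRegressor by swapping the estimator and adjusting the grid to its parameters.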
Conclusion & Model Selection
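Given the superior performance of the tuned Gradient Boosting model, with a test R² above 0.90 and a lower RMSE than the Random Forest, I decided that this model would be the most suitable for deployment to provide reliable power generation forecasts. Its gap between training and testing R² (0.9476 vs. 0.9098) is also smaller than the Random Forest's (0.9881 vs. 0.8878), which suggests it overfits less.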
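A comparison like the one behind this choice can be sketched as below. The data is synthetic, the models use default settings rather than the tuned ones, and picking the winner by test R² is an assumed selection rule, so the printed numbers will not match the figures reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the solar dataset.
X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def evaluate(model):
    """Fit the model and return train/test R2 and test RMSE."""
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    return {
        "train_r2": r2_score(y_train, pred_train),
        "test_r2": r2_score(y_test, pred_test),
        "test_rmse": float(np.sqrt(mean_squared_error(y_test, pred_test))),
    }


models = {
    "random_forest": RandomForestRegressor(random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
results = {name: evaluate(model) for name, model in models.items()}

for name, metrics in results.items():
    # A large train/test gap is a sign of overfitting.
    gap = metrics["train_r2"] - metrics["test_r2"]
    print(
        f"{name}: train R2={metrics['train_r2']:.4f}, "
        f"test R2={metrics['test_r2']:.4f}, "
        f"RMSE={metrics['test_rmse']:.2f}, gap={gap:.4f}"
    )

# Assumed selection rule: highest R2 on the held-out test set.
best_name = max(results, key=lambda name: results[name]["test_r2"])
print("selected model:", best_name)
```

RMSE is in the units of the target variable, so it is only meaningful when compared across models evaluated on the same test set.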