{"id":31363,"date":"2021-01-19T07:50:21","date_gmt":"2021-01-19T12:50:21","guid":{"rendered":"https:\/\/centricconsulting.com\/?p=31363"},"modified":"2024-03-07T11:03:24","modified_gmt":"2024-03-07T16:03:24","slug":"machine-learning-my-journey-to-the-predictive-world-part-2","status":"publish","type":"post","link":"https:\/\/centricconsulting.com\/blog\/machine-learning-my-journey-to-the-predictive-world-part-2\/","title":{"rendered":"Machine Learning: My Journey to the Predictive World, Part 2"},"content":{"rendered":"
As I described in part 1 of my story<\/a>, in only a matter of months, my colleagues and I at Centric India<\/a> became machine learning<\/a> (ML) certified, created a team of people with diverse data and analytics<\/a> skillsets, and began preparing to participate as Centric India\u2019s first-ever ML team for our company-wide, 2020 Expedition: Data<\/a> event.<\/p>\n My teammates \u2014 Seema Bansal, Mehani Hakim, Shiv Mohan and Akshat Kulshrestha \u2014 and I had bonded and learned a lot about ML quickly, but to move ahead, we needed to create a vision for our project, test various solutions, and develop our ML strategies.<\/p>\nPlanning and Execution<\/h2>\n Our goal was to help Centric\u2019s client, a digital marketing firm, identify the characteristics of consultants who were at risk of leaving the firm. We wanted to develop ML and data-science-based retention programs for those consultants. Building on this basic idea, we created a set of guidelines for ourselves.<\/strong><\/p>\n Let us walk through each phase of our project.<\/p>\nCreating Our Data Pipeline<\/h3>\n To create the data pipeline, we used Azure Data Factory<\/a> (ADF). ADF can integrate the various tables used in a query into a single dataset. We used this dataset in the pipeline, which then used a Copy Activity to write the data to a comma-separated values (CSV) file.<\/strong> Once we ran the pipeline, we could retrieve the file from Storage Explorer.<\/p>\nFinding and Preparing Our Data<\/h3>\n Data preparation is key to any successful ML solution because whatever dataset we create must support the end decision.<\/p>\n The question for our team was, \u201cWhat exactly do we want to derive from the data? What is our desired result?\u201d Drawing on Seema\u2019s experience, we had already started analyzing the data by going through the tables and columns to sample it.<\/p>
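The kind of first-pass sampling and assessment described above can be sketched in pandas. This is a minimal, hypothetical example: the column names (job_title, salary, commute_km, left_firm) and the inline sample rows are stand-ins for the CSV the ADF pipeline exported.

```python
import pandas as pd
import numpy as np

# In practice this frame would come from the pipeline's exported CSV,
# e.g. df = pd.read_csv("consultants.csv"). Columns are hypothetical.
df = pd.DataFrame({
    "consultant_id": [1, 2, 3, 4, 5],
    "job_title": ["Analyst", "Senior", "Analyst", None, "Manager"],
    "salary": [55000, 72000, np.nan, 61000, 90000],
    "commute_km": [12.0, 3.5, 40.0, 8.0, np.nan],
    "left_firm": [0, 1, 0, 0, 1],
})

# First-pass assessment: column types and missing values per column.
missing = df.isna().sum()
print(df.dtypes)
print(missing)

# Simple numeric summary to spot outliers and skew.
print(df[["salary", "commute_km"]].describe())
```

A summary like this quickly surfaces the incorrect, inconsistent, or missing values that the later cleansing passes have to deal with.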
We wanted to find the crucial factors that could affect consultants\u2019 decisions to stay with or leave the organization, such as their job start\/stop dates, job titles, salary structures, commute distances, commissions and office culture.<\/p>\n The first challenge we faced was that our data failed to join correctly across the tables. So, we reached out to business experts for more clarification of the data.<\/strong><\/p>\n After juggling many tables and columns, we built one master query using common table expressions (CTEs) and stored procedures to collect the data in one dataset. Now we needed to assess the data\u2019s condition, including trends, outliers and exceptions, as well as incorrect, inconsistent, missing or skewed information.<\/p>\n We conducted an iterative process of data analysis and data cleansing. We then completed a round of testing by dividing the query into smaller pieces to ensure that whatever data we obtained from the master query was correct. Finally, we were ready to export the aggregated data into a CSV file.<\/p>\n A major step at this point was formatting the data. Because we were aggregating data from various sources, we risked anomalies such as inconsistent abbreviations. We formatted the data to ensure that the entire data set used the same input formatting protocols. Finally, we created a query that included various aggregate functions and fetched the data required for consultant retention.<\/strong><\/p>\n With some initial problems solved and our data gathered and cleaned, we were ready to analyze it more deeply and ensure validity.<\/p>\nAnalyzing and Validating Our Data<\/h3>\n We used exploratory data analysis (EDA) at this stage to better sort and understand our data. EDA helped us sort data into bands and provided important visualization tools.<\/p>\n EDA is an approach to analyzing data sets to summarize their main characteristics, often using visual methods.<\/p>
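The banding step of EDA can be sketched as follows. The column names (age, service_years, left_firm) and the sample values are hypothetical stand-ins for the aggregated consultant dataset; in the real project the resulting rates were charted rather than printed.

```python
import pandas as pd

# Hypothetical extract of the aggregated dataset: age, years of
# service, and whether the consultant left (1) or stayed (0).
df = pd.DataFrame({
    "age": [24, 29, 34, 41, 47, 52, 27, 38, 45, 58],
    "service_years": [1, 2, 5, 8, 12, 20, 1, 6, 10, 25],
    "left_firm": [1, 1, 0, 0, 0, 0, 1, 0, 1, 0],
})

# Sort the continuous Age feature into bands, as in the EDA step.
df["age_band"] = pd.cut(df["age"], bins=[20, 30, 40, 50, 60],
                        labels=["20-30", "30-40", "40-50", "50-60"])

# Attrition rate per age band; observed=False keeps empty bands too.
attrition_by_age = df.groupby("age_band", observed=False)["left_firm"].mean()
print(attrition_by_age)
```

The same pattern applies to a Service Duration band; plotting the per-band rates (for example with `attrition_by_age.plot.bar()`) gives the kind of visualization shown below.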
Whether or not you use a statistical model, EDA helps reveal what the data can tell us beyond the formal modeling or hypothesis-testing task.<\/p>\n In our project, the main EDA task was to sort the data into bands and visualize the results.<\/strong><\/p>\n For example, below is our data visualization for the Service Duration and Age bands:<\/p>\n <\/a><\/p>\n <\/a><\/p>\n Similarly, we created EDA visualizations for the various features, which allowed us to filter the data further and discover other key features that led to higher attrition. We were then ready to make sure the algorithm would understand our data.<\/p>\nEnriching and Transforming our Data<\/h2>\n We used a process called feature engineering to make our outputs understandable to the algorithm. It resulted in improved model accuracy on data that the algorithm had not yet seen. We followed various techniques in our feature engineering efforts.<\/p>\n The heat map below visualizes the correlation among our features, listed along the left and bottom sides of the image. Lighter colors indicate strongly correlated features, and darker colors indicate weakly correlated features.<\/p>\n <\/a><\/p>\n <\/a><\/p>\nOperationalizing Our Data Pipeline<\/h2>\n After performing the feature engineering operations, we were ready to split the data into subsets and apply the model.<\/p>\n At this stage, data scientists typically split data into two subsets: training data and testing data. The goal is to fit the model to the training data and then make predictions on the test data.<\/p>\n When fitting our model, we faced the challenge of overfitting or underfitting. Overfitting<\/strong> occurs when you train a model \u201ctoo well.\u201d It is highly accurate on the training data, but it may not be accurate on unseen or new data. That means we can\u2019t generalize the results or make any inferences from other data. In contrast, underfitting<\/strong> occurs when the model does not fit even the training data.<\/p>
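The split-and-check routine can be sketched with scikit-learn. The data here is synthetic and the model settings are assumptions; the point is the pattern of comparing training accuracy against held-out test accuracy to spot overfitting or underfitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared attrition dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
# The label depends on the first two features, plus some noise.
y = ((X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=400)) > 0).astype(int)

# Hold out a test set so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# A large gap between the two scores suggests overfitting;
# low scores on both suggest underfitting.
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```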
Therefore, it misses the trends in the data and cannot be generalized to new information.<\/p>\n Finally, we were ready to apply the model to our training data and predict the result for the Expedition: Data problem.<\/p>\nDeveloping and Optimizing Our ML Model<\/h2>\n With our understanding of the various approaches to classification problems \u2014 such as logistic regression, support vector machines, naive Bayes classifiers, random forest classifiers and decision trees, along with validation techniques such as k-fold cross-validation \u2014 we were ready to compare our results.<\/p>\n We decided to try logistic regression, followed by k-fold cross-validation and then a random forest classifier.<\/p>\n Let us talk briefly about these three approaches:<\/p>\n Logistic regression models the probability that an observation belongs to a class, which makes it a natural fit for a binary outcome such as stay or leave. K-fold cross-validation splits the training data into k folds and trains the model k times, each time holding out a different fold for validation, giving a more reliable estimate of performance than a single split. A random forest classifier builds many decision trees on random subsets of the rows and features and combines their votes, which usually improves accuracy and reduces overfitting.<\/p>\n With the help of regression, cross-validation and random-forest classification, we created a confusion matrix and a receiver operating characteristic (ROC) curve, along with their scores and accuracy. The confusion matrix allows us to visualize an algorithm\u2019s performance and reveals the number of correct predictions for each class:<\/p>\n <\/a><\/p>\n Below are the results and the ROC curve created from our predictions for our use case:<\/p>\n <\/a><\/p>\n <\/a><\/p>\n
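Putting the three approaches together, a scikit-learn sketch on synthetic stand-in data might look like this. The real features came from the master query, and the fold count and model settings here are assumptions, not the team's exact configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score

# Synthetic stand-in for the attrition dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
y = ((X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=500)) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# 1) Logistic regression with 5-fold cross-validation on the training set.
logit = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(
    logit, X_train, y_train,
    cv=KFold(n_splits=5, shuffle=True, random_state=0))

# 2) Random forest, evaluated on the held-out test set.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
pred = forest.predict(X_test)

# Confusion matrix: rows are true classes, columns are predictions.
cm = confusion_matrix(y_test, pred)
# ROC AUC from predicted probabilities for the positive class.
auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])

print("CV accuracy per fold:", np.round(cv_scores, 3))
print("confusion matrix:\n", cm)
print("ROC AUC:", round(auc, 3))
```

Plotting the ROC curve itself (for example with `sklearn.metrics.RocCurveDisplay`) produces the kind of chart shown in the results above.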
Developing and Optimizing our ML Model with Azure Auto ML<\/h2>\n