Behind Ranking First in BUSINESS & AI’s 2nd Online Machine Learning Competition


The importance of data preprocessing

The performance of a machine learning (ML) model on a given task depends on many factors, and the representation and quality of the sample data come first. Pattern recognition during the training phase is more challenging when the data contains a lot of redundant and irrelevant information — in other words, when it is noisy and unreliable. It is also widely recognized that data preparation and filtering account for a significant share of the total time spent on an ML problem.

Data preprocessing includes feature extraction and selection, feature and data normalization, and data cleaning. Its product is the final training set. It would therefore be very convenient if a single set of preprocessing algorithms delivered the best possible performance on every data set — but that's a real challenge!

The problem that needs to be solved!

“A challenging anonymized regression problem with multiple numerical, categorical, and textual variables will be central to the competition. The data has been anonymized to ‘discourage’ participants from developing data-set-specific ideas and to ‘nudge’ them toward new approaches that automate their predictive modeling and training processes. Textual features are encrypted without harming their predictive power; they can still be informative if you use the right way to extract information from them. The main objective of this challenge is to find the ‘best’ regression process that predicts the target variable ‘y’ using the data provided, as described below.”

My motivation to participate in the contest.

As an intern at BUSINESS & AI, I had the opportunity to learn new skills, delve into the world of business analytics and data science, and acquire skills in automated data cleaning and predictive modeling.

In fact, BUSINESS & AI periodically organizes contests open to everyone in the world. They push interns to their limits and give AI enthusiasts an opportunity to excel. So there was no way I could pass up the chance to participate; first place certainly looked tempting to me!

My approach to solving the problem.

To solve the problem efficiently, these are the steps I followed:

  1. Understand the problem and the data set.
  2. Data preprocessing: data cleaning, outlier removal, normalization, standardization, and creation of dummy variables.
  3. Feature engineering: feature selection, feature transformation, and feature creation.
  4. Select the modeling algorithm.
  5. Tune the parameters with GridSearchCV.
  6. Build the ML model pipeline.
  7. Apply an advanced learning approach.
  8. Check the results by submitting them.
For the second and third points (steps 2 and 3), I worked on the categorical, numerical, and textual features separately. And I can tell you: it really made a difference! This resulted in three different pipelines, one preprocessing each type of variable.
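A minimal sketch of what those three per-type pipelines could look like, using scikit-learn's ColumnTransformer. The column names are hypothetical, since the competition data was anonymized, and the exact preprocessing steps are illustrative rather than the ones used in the contest:

```python
# Sketch: one preprocessing pipeline per feature type, combined with
# a ColumnTransformer. Column names below are hypothetical placeholders.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# TfidfVectorizer expects a single 1-D column of strings, so it is
# mapped to one column name rather than a list of names.
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, ["num_feat_1", "num_feat_2"]),  # hypothetical columns
    ("cat", categorical_pipe, ["cat_feat_1"]),            # hypothetical column
    ("txt", TfidfVectorizer(), "text_feat_1"),            # hypothetical column
])
```

Keeping each feature type in its own sub-pipeline makes it easy to tune or swap one preprocessing branch without touching the others.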

At the end, a pipeline I called ML_pipeline brought them all together with the regression technique I used, ElasticNet.
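A hedged sketch of how such a combined pipeline with ElasticNet could be assembled and tuned with GridSearchCV. The `preprocessor` here is a simple stand-in for the combined feature pipelines, and the parameter grid values are illustrative, not the ones used in the contest:

```python
# Sketch: final pipeline (preprocessing + ElasticNet) tuned with GridSearchCV.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Stand-in for the combined per-type feature pipelines.
preprocessor = StandardScaler()

ML_pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("regressor", ElasticNet(max_iter=10_000)),
])

# Illustrative grid: "regressor__" routes each parameter to the ElasticNet step.
param_grid = {
    "regressor__alpha": [0.01, 0.1, 1.0],
    "regressor__l1_ratio": [0.2, 0.5, 0.8],
}

search = GridSearchCV(
    ML_pipeline, param_grid, cv=5, scoring="neg_mean_squared_error"
)
```

Calling `search.fit(X, y)` then cross-validates every parameter combination and refits the best pipeline on the full training data.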

Tools used during the competition.

  1. Google Colab (as a framework).
  2. Python (as the programming language).
  3. scikit-learn (as the core library, built on NumPy and SciPy, supporting a number of techniques such as support vector machines, random forests, and k-nearest neighbors).

What was the most challenging?

The most challenging part for me was dealing with the textual variables. Their content had been encrypted by the company without harming their predictive power, so I had to add a cleaning step before I could run the pipeline I implemented for these features.
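The actual encrypted tokens required their own handling, but the idea of a cleaning step before vectorization can be sketched with a simple, hypothetical helper like this:

```python
import re

def clean_text(s: str) -> str:
    """Hypothetical cleaning step: normalize a raw text value
    before it is fed to a vectorizer such as TfidfVectorizer."""
    s = s.lower()
    s = re.sub(r"[^a-z0-9\s]", " ", s)  # replace punctuation/symbols with spaces
    s = re.sub(r"\s+", " ", s).strip()  # collapse runs of whitespace
    return s
```

In a scikit-learn pipeline, a function like this can be wrapped in a `FunctionTransformer` (applied element-wise) or passed as the vectorizer's `preprocessor` argument, so the cleaning travels with the rest of the pipeline.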

Final notes

If you are passionate about data science, set big goals, challenge yourself, and never stop learning!
The secret is to have self-discipline and a strong will!

Make a difference!
