Data Preprocessing Comes First

Photo by M. B. M. on Unsplash

Although I’m new to the data science field, juniors often ask me questions like these.

  1. Which should I learn first, supervised or unsupervised learning?
  2. Should I work on a regression problem or a classification problem?
  3. Which model would be best for my dataset?
  4. I’ve heard that random forest works better. Is that true?

So I have something to say to them. How well your model performs depends mostly on how you process the dataset. In my view, almost 70% of a data science project’s result depends on preprocessing. Why do I say this?

First of all, whatever your dataset is, if you don’t understand it well, you can’t process it well, and so your model can’t predict well. Even trying more advanced approaches such as a stacking classifier or bagging won’t change the result. Whatever model you apply, if your dataset is not well prepared, the model is not going to learn well.

So how do we understand a dataset? I’ll cover it briefly.

  • Look at the missing values in your dataset: are they random, or do they follow a pattern?
  • Does your dataset have outliers?
  • What relationships exist between your features? How strong are they?
  • What is your target variable? Does your data need any encoding?
  • Does your data need scaling?
  • Is your target variable balanced?
  • Are your independent variables enough, or do you need some feature engineering?
  • Which features matter most for your target variable?
  • And finally, which model are you going for? Is your data suited to the model you selected?
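A few of these checks can be sketched in pandas. This is only a rough illustration, not a complete workflow: the toy DataFrame, its column names, and the `summarize` helper are all made up for this example, the 1.5 × IQR rule is just one common way to flag outliers, and the correlation check assumes a numeric target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the columns and values are invented for illustration.
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 35, 38, 40, np.nan, 95],
    "income": [21.0, 25.0, 26.0, 32.0, np.nan, 36.0, 41.0, 43.0, 45.0, 47.0],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "A"],
    "bought": [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
})


def summarize(data, target):
    """Collect a few quick checks about a dataset (hypothetical helper)."""
    # Numeric feature columns, excluding the (numeric) target.
    numeric = data.select_dtypes(include="number").drop(columns=[target])

    # Missing values per column.
    missing = data.isna().sum()

    # Outliers per numeric column, flagged with the 1.5 * IQR rule.
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    outliers = is_outlier.sum()

    # Linear relationship between each numeric feature and the target.
    correlation = numeric.corrwith(data[target])

    # Share of each class in the target, to see whether it is balanced.
    balance = data[target].value_counts(normalize=True)

    # Non-numeric columns are the ones that will likely need encoding.
    categorical = data.select_dtypes(exclude="number").columns.tolist()

    return {
        "missing": missing,
        "outliers": outliers,
        "correlation": correlation,
        "balance": balance,
        "categorical": categorical,
    }


report = summarize(df, "bought")
for name, value in report.items():
    print(f"--- {name} ---")
    print(value)
```

On this toy data the report shows one missing value each in `age` and `income`, one flagged outlier in `age`, an 80/20 split in the target, and `city` as the column that needs encoding. Whether a flagged value is a real outlier or a data entry error still needs a human look.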

After analyzing all of this, you can move on to modeling and prediction, and hopefully a better result is waiting for you there. I’m no expert. I just shared what I can, mostly as a reminder to myself.
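Once the checks are done, the preprocessing decisions can be tied to the model in one place. Here is a minimal sketch with scikit-learn, assuming a toy dataset with made-up columns. Median imputation, standard scaling, one-hot encoding, and logistic regression with balanced class weights are my choices for illustration only; your own analysis should decide what fits your data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the columns and values are invented for illustration.
X = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 35, 38, 40, np.nan, 45],
    "income": [21.0, 25.0, 26.0, 32.0, np.nan, 36.0, 41.0, 43.0, 45.0, 47.0],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "A"],
})
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

# Numeric columns: fill missing values with the median, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column type to its own preprocessing.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age", "income"]),
    # Categorical column: one-hot encode, ignoring categories unseen in training.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# class_weight="balanced" is one simple way to account for an imbalanced target.
model = Pipeline([
    ("preprocess", preprocess),
    ("classify", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Keeping preprocessing inside the pipeline means the same steps are applied at training and prediction time, so you can swap the final model later without redoing the preparation by hand.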




I’m a data science enthusiast. I always try to keep up with new technologies. Connect Me through

Md. Hasibul Islam
