Data Preprocessing Comes First

Photo by M. B. M. on Unsplash

I'm new to the data science field, but juniors often ask me questions like these:

  1. Which should I learn first, supervised or unsupervised learning?
  2. Should I work on a regression problem or a classification problem?
  3. Which model would be best for my dataset?
  4. I've heard that random forest works better. Is that true?

So I have something to say to them. How well your model does depends mostly on how you process the dataset. In my view, roughly 70% of a data science project's result depends on preprocessing. Why am I saying this?

Whatever your dataset is, if you don't understand it well, you can't process it well, and so your model can't predict well. Even switching to a more advanced model, such as a stacking or bagging classifier, won't change the result. Whatever model you apply, if your dataset is not well prepared, the model isn't going to learn well.

So how do we understand a dataset? I'll cover it briefly.

  • Look for missing values in your dataset. Are they missing at random, or do they follow a pattern?
  • Does your dataset have outliers?
  • How are the features in your dataset related? How strong are those relationships?
  • What is your target variable? Does any of the data need encoding?
  • Does your data need scaling?
  • Is your target variable balanced?
  • Are your independent variables enough, or do you need some feature engineering?
  • Which features matter most for your target variable?
  • And finally, which model are you going for? Is your data suited to the model you selected?
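
A few of the checks above can be sketched in plain Python. This is only a minimal illustration, not a full preprocessing pipeline: the toy `rows` dataset, its column names, and the helper functions are all made up for this example, and the outlier check uses the common 1.5 × IQR rule, which is just one of several possible choices. In a real project you would more likely reach for a library such as pandas or scikit-learn.

```python
import math
import statistics
from collections import Counter

# Hypothetical toy dataset: one dict per record, None marks a missing value.
rows = [
    {"age": 25, "income": 30000, "label": "no"},
    {"age": 32, "income": 42000, "label": "no"},
    {"age": None, "income": 39000, "label": "no"},
    {"age": 41, "income": 58000, "label": "yes"},
    {"age": 38, "income": 500000, "label": "no"},  # suspiciously large income
    {"age": 29, "income": 35000, "label": "no"},
]


def missing_counts(rows):
    """Count missing (None) entries per column."""
    counts = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] += 1
    return dict(counts)


def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def class_balance(rows, target):
    """Share of each class in the target column."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


incomes = [row["income"] for row in rows]
print("missing per column:", missing_counts(rows))
print("income outliers:", iqr_outliers(incomes))
print("target balance:", class_balance(rows, "label"))
print("scaled incomes:", min_max_scale(incomes))
```

Even on a toy dataset like this, the checks already raise questions worth answering before choosing a model: one age is missing, one income looks like an outlier, and the target is heavily skewed toward a single class.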

Once you've analyzed all of this, you can move on to modeling and prediction, and hopefully a better result is waiting for you there. I'm no expert. I've just shared what I can, mostly as a reminder to myself.




I'm a data science enthusiast who always tries to keep up with new technologies.

Md. Hasibul Islam
