Jul 12, 2020

Data Preprocessing Comes First

Although I’m new to the data science field, juniors often ask me questions like these:

  1. Which should I learn first, supervised or unsupervised learning?
  2. Should I work on a regression problem or a classification problem?
  3. Which model would be best for my dataset?
  4. I’ve heard that random forest works better. Is that true?

So I have something to say to them. How well your model performs depends mostly on how you process the dataset. In my view, almost 70% of a data science project’s result depends on preprocessing the dataset. Why do I say this?

First of all, whatever your dataset is, if you don’t understand it well, you can’t process it well, and so your model can’t predict well. Even trying more advanced models, such as a stacking classifier or a bagging ensemble, won’t change the result. Whatever model you apply, if your dataset is not well prepared, the model is not going to learn well.

So how do we understand a dataset? I’ll cover it briefly.

  • Look at the missing values in your dataset: are they missing at random, or do they follow some pattern?
  • Does your dataset have outliers?
  • What are the relationships between your features? How strong are they?
  • What is your target variable? Does your data need any encoding?
  • Does your data need scaling?
  • Is your target variable balanced?
  • Are your independent variables sufficient, or do you need some feature engineering?
  • Which features are most important for your target variable?
  • And finally, which model are you going for? Is your data suited to the model you selected?
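Three of the checks above (missing values, outliers, and target balance) can be sketched in a few lines of plain Python. This is only a rough illustration, not a full workflow. The toy `rows` data, the column names, and the helper names `missing_counts`, `iqr_outliers`, and `class_balance` are all made up for this example. The 1.5 × IQR rule is one common rule of thumb for outliers, not the only option.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical toy dataset: one dict per record, None marks a missing value.
rows = [
    {"age": 25, "income": 32000, "label": "no"},
    {"age": 31, "income": 41000, "label": "no"},
    {"age": None, "income": 38000, "label": "no"},
    {"age": 42, "income": 45000, "label": "yes"},
    {"age": 38, "income": 400000, "label": "no"},
    {"age": 29, "income": None, "label": "no"},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            if value is None:
                counts[col] += 1
    return dict(counts)


def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    present = [v for v in values if v is not None]
    # method="inclusive" treats the given values as the whole population.
    q1, _, q3 = quantiles(present, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in present if v < low or v > high]


def class_balance(rows, target):
    """Return the share of each class in the target column."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


if __name__ == "__main__":
    print("missing per column:", missing_counts(rows))
    print("income outliers:", iqr_outliers([r["income"] for r in rows]))
    print("target balance:", class_balance(rows, "label"))
```

On a real project you would likely run the same checks with a dataframe library, but the questions stay the same: where are the gaps, which values look suspicious, and how skewed is the target?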

After analyzing all of this, you can move on to modeling, and hopefully a better result is waiting for you there. I’m no expert. I just shared what I can, mostly as a reminder to myself.