Data Science: A short introduction.
This workflow is adapted from IBM; you can access the original article here.
Basically, the workflow below is a TL;DR version of the article with a little spin of my own.
Data Ingestion/Collection: Nothing fancy here, just where you are getting your data from. Whether you are scraping the data from an API, reading it off a custom-built IoT water sensor, or even handing out handwritten survey forms, this is the beginning of all workflows. Without data, there is no data science to talk about.
Data Storage: Be it a CSV file, a SQL or NoSQL database, or any other form of storage, the data has to be persistently stored for later retrieval and use. You should always keep the raw data, in case there is a problem with the processing or the processing itself changes.
Data Processing: Processing means taking the raw data and cleaning it, essentially making it ready for use. There are a number of things you can do during processing, such as dealing with missing values or transforming values.
Data Analysis: This is where you perform exploratory data analysis, either to answer the questions you care about or to check whether the data is ready for downstream processes such as modelling.
Data Modelling: (Sometimes data analysis is enough to answer the main questions, so you could skip this part.) Usually you attempt modelling for predictive analytics. There is a lot to do here, such as preprocessing the data to ensure it is suitable for your models, and model optimization, amongst other things.
Data Communication: This final step is communicating your findings to your stakeholders. Even if the analysis is good, if you can't present it well enough for them to understand it, all the preceding steps can be rendered moot.
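To make the steps above a bit more concrete, here is a minimal Python sketch of the whole workflow on a toy example. It is only an illustration and not something from the IBM article: the water-sensor readings are made up, the function names (store_raw, load_and_clean, summarize, fit_line, run_pipeline) are hypothetical, and a straight-line fit stands in for "modelling". It uses only the standard library, so it should run as-is.

```python
import sqlite3
import statistics

# Ingestion: made-up readings from a hypothetical water sensor, as (hour, level_cm).
# None marks a reading the sensor failed to record.
RAW_READINGS = [(0, 10.0), (1, 12.0), (2, None), (3, 16.0), (4, 18.0)]


def store_raw(conn, readings):
    """Storage: persist the raw data untouched so it can be reprocessed later."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_readings (hour INTEGER, level_cm REAL)"
    )
    conn.executemany("INSERT INTO raw_readings VALUES (?, ?)", readings)
    conn.commit()


def load_and_clean(conn):
    """Processing: read the raw rows and drop the ones with missing values."""
    rows = conn.execute(
        "SELECT hour, level_cm FROM raw_readings ORDER BY hour"
    ).fetchall()
    return [(hour, level) for hour, level in rows if level is not None]


def summarize(clean):
    """Analysis: a few summary statistics as a stand-in for exploratory analysis."""
    levels = [level for _, level in clean]
    return {
        "count": len(levels),
        "mean": statistics.mean(levels),
        "min": min(levels),
        "max": max(levels),
    }


def fit_line(clean):
    """Modelling: least-squares fit of level = slope * hour + intercept."""
    xs = [hour for hour, _ in clean]
    ys = [level for _, level in clean]
    x_mean, y_mean = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    return slope, y_mean - slope * x_mean


def run_pipeline(readings):
    """Run storage, processing, analysis and modelling in order."""
    conn = sqlite3.connect(":memory:")  # in-memory database, just for the demo
    store_raw(conn, readings)
    clean = load_and_clean(conn)
    summary = summarize(clean)
    slope, intercept = fit_line(clean)
    conn.close()
    return summary, slope, intercept


if __name__ == "__main__":
    summary, slope, intercept = run_pipeline(RAW_READINGS)
    # Communication: report the result in plain language for stakeholders.
    print(
        f"Across {summary['count']} valid readings the water level averaged "
        f"{summary['mean']:.1f} cm and rose by about {slope:.1f} cm per hour."
    )
```

Note that the raw table keeps the missing reading; it only gets dropped in load_and_clean. That way, if you later decide to fill the gap instead of dropping it, you can change the processing step without having to collect the data again.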
So that's my short summary. I will be updating this as I gain more experience in this field, but I hope it was useful.
I highlighted data modelling in orange because most people think it is the only part, or the most important part. The more experience I gain in this field, the more I personally realize the importance of the entire pipeline; none of the steps can do without the others.
So that's my two cents on this issue.
Feel free to drop me some feedback if you agree or disagree with what I have written so far.
In the meantime, have fun doing data science stuff!