Data science, ‘explained in under a minute’, looks like this.
You have data. To use this data to inform your decision-making, it needs to be relevant, well-organised, and preferably digital. Once your data is coherent, you proceed with analysing it, creating dashboards and reports to understand your business’s performance better. Then you set your sights on the future and start generating predictive analytics. With predictive analytics, you assess potential future scenarios and predict consumer behaviour in creative ways. Author’s note: You can learn more about how data science and business interact in our article 5 Business Basics for Data Scientists. But let’s start at the beginning.
The Data in Data Science
Before anything else, there is always data. Data is the foundation of data science; it is the material on which all the analyses are based. In the context of data science, there are two types of data: traditional data and big data.
Traditional data is structured data stored in databases that analysts can manage from one computer; it is in table format and contains numeric or text values. Actually, the term “traditional” is something we are introducing for clarity. It helps emphasize the distinction between big data and other types of data.
Big data, on the other hand, is… bigger than traditional data, and not in the trivial sense. From variety (numbers, text, but also images, audio, mobile data, etc.), to velocity (retrieved and computed in real time), to volume (measured in tera-, peta-, and exabytes), big data is usually distributed across a network of computers.
That said, let’s define the what, where, and who that characterize each type of data in data science.
What do you do to Data in Data Science?
Traditional data in Data Science
Traditional data is stored in relational database management systems. Before it is ready for processing, however, all data goes through pre-processing. This is a necessary group of operations that converts raw data into a format that is more understandable and hence more useful for further processing. Common processes are:
- Collect raw data and store it on a server
- Class-label the observations
- Data cleansing/data scrubbing
- Data balancing
- Data shuffling
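The steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a definitive pipeline: the records and field names are made up, "class-labelling" is assumed to mean tagging each column as numerical or categorical, cleansing is reduced to dropping rows with a missing value, and balancing is done by simple undersampling.

```python
import random
from collections import defaultdict

# Hypothetical raw records, as they might arrive from a server.
raw = [
    {"customer": "Ann", "age": "34", "churned": "yes"},
    {"customer": "Bob", "age": "", "churned": "no"},  # missing age
    {"customer": "Cy", "age": "51", "churned": "no"},
    {"customer": "Dee", "age": "29", "churned": "no"},
    {"customer": "Eve", "age": "42", "churned": "yes"},
    {"customer": "Fay", "age": "38", "churned": "no"},
]

def cleanse(records):
    """Drop records with a missing age and convert age to an integer."""
    return [{**r, "age": int(r["age"])} for r in records if r["age"].strip()]

def label_columns(records):
    """Tag each column as 'numerical' or 'categorical' (assumed meaning)."""
    labels = {}
    for col in records[0]:
        numeric = all(isinstance(r[col], (int, float)) for r in records)
        labels[col] = "numerical" if numeric else "categorical"
    return labels

def balance(records, key, rng):
    """Undersample so that every value of `key` has the same number of records."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    n = min(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, n))
    return balanced

def shuffle(records, rng):
    """Return a randomly reordered copy, leaving the input untouched."""
    out = list(records)
    rng.shuffle(out)
    return out

rng = random.Random(0)  # fixed seed so the run is repeatable
clean = cleanse(raw)
labels = label_columns(clean)
balanced = balance(clean, "churned", rng)
prepared = shuffle(balanced, rng)
```

Undersampling is only one way to balance a dataset; it throws observations away, so with small datasets oversampling the rarer category may be the better choice.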
Big Data in Data Science
When it comes to big data and data science, there is some overlap with the approaches used in traditional data handling, but there are also a lot of differences. First of all, big data is stored on many servers and is far more complex. In order to do data science with big data, pre-processing is even more crucial, because the data is so much more complex. You will notice that conceptually, some of the steps are similar to traditional data pre-processing, but that’s inherent to working with data.
- Collect the data
- Class-label the data
- Data cleansing
- Data masking
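Data masking, the last step in the list, can take many forms. As a rough sketch of one of them, the snippet below replaces sensitive fields with a salted hash so that records can still be grouped by the same person without exposing who that person is. The field names, the salt, and the 12-character pseudonym length are all assumptions made for illustration.

```python
import hashlib

def mask_value(value: str, salt: str) -> str:
    """Replace a sensitive value with a short, repeatable pseudonym."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]  # truncated for readability; an illustrative choice

def mask_records(records, sensitive_fields, salt):
    """Return copies of the records with the sensitive fields masked."""
    return [
        {
            key: mask_value(str(val), salt) if key in sensitive_fields else val
            for key, val in record.items()
        }
        for record in records
    ]

# Hypothetical records containing private details.
records = [
    {"email": "ann@example.com", "city": "Oslo", "purchases": 3},
    {"email": "bob@example.com", "city": "Lima", "purchases": 7},
    {"email": "ann@example.com", "city": "Oslo", "purchases": 1},
]

masked = mask_records(records, sensitive_fields={"email"}, salt="demo-salt")
```

Because the same input always maps to the same pseudonym, the first and third records can still be recognised as belonging to one customer, while the original email address no longer appears in the data.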