Data science, ‘explained in under a minute’, looks like this.
You have data. To use this data to inform your decision-making, it needs to be relevant, well-organised, and preferably digital. Once your data is coherent, you proceed with analysing it, creating dashboards and reports to understand your business’s performance better. Then you set your sights on the future and start generating predictive analytics. With predictive analytics, you assess potential future scenarios and predict consumer behaviour in creative ways.
Author’s note: You can learn more about how data science and business interact in our article 5 Business Basics for Data Scientists.
But let’s start at the beginning.
The Data in Data Science
Before anything else, there is always data. Data is the foundation of data science; it is the material on which all the analyses are based. In the context of data science, there are two types of data: traditional data and big data.
Traditional data is data that is structured and stored in databases which analysts can manage from one computer; it is in table format, containing numeric or text values. Actually, the term “traditional” is something we are introducing for clarity. It helps emphasize the distinction between big data and other types of data.
Big data, on the other hand, is… bigger than traditional data, and not in the trivial sense. From variety (numbers, text, but also images, audio, mobile data, etc.), to velocity (retrieved and computed in real time), to volume (measured in tera-, peta-, exa-bytes), big data is usually distributed across a network of computers.
That said, let’s define the What-Where-and-Who that characterizes each of them in data science.
What do you do to Data in Data Science?
Traditional data in Data Science
Traditional data is stored in relational database management systems.
- Collect raw data and store it on a server
- Class-label the observations
- Data cleansing/data scrubbing
- Data balancing
- Data shuffling
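To make the steps above a little more concrete, here is a minimal Python sketch of class-labelling, cleansing, balancing, and shuffling on a handful of in-memory records. Everything in it is an assumption made for illustration: the field names (`customer`, `purchases`, `segment`), the labelling rule, and the choice of undersampling as the balancing method are not prescribed by the steps themselves, and real projects would typically read the rows from a database rather than a Python list.

```python
import random
from collections import Counter, defaultdict


def cleanse(rows):
    """Drop records with missing values and tidy up text fields."""
    cleaned = []
    for row in rows:
        has_gap = any(
            v is None or (isinstance(v, str) and not v.strip())
            for v in row.values()
        )
        if has_gap:
            continue  # incomplete record: leave it out
        cleaned.append(
            {k: v.strip().lower() if isinstance(v, str) else v
             for k, v in row.items()}
        )
    return cleaned


def class_label(rows):
    """Attach a category to each observation (hypothetical rule)."""
    return [
        {**row, "segment": "repeat" if row["purchases"] > 1 else "new"}
        for row in rows
    ]


def balance(rows, label_key, rng):
    """Undersample so every class is as large as the smallest one."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


def shuffle_rows(rows, rng):
    """Return a shuffled copy so row order carries no pattern."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


# Illustrative raw data; two records are deliberately incomplete.
raw = [
    {"customer": " Alice ", "purchases": 5},
    {"customer": "bob", "purchases": 1},
    {"customer": "", "purchases": 3},
    {"customer": "Carol", "purchases": None},
    {"customer": "dave", "purchases": 7},
    {"customer": "Erin", "purchases": 2},
]

rng = random.Random(0)  # fixed seed so the run is repeatable
prepared = shuffle_rows(
    balance(class_label(cleanse(raw)), "segment", rng), rng
)
print(Counter(row["segment"] for row in prepared))
```

Undersampling is only one way to balance a dataset; it is used here because it is the shortest to write down, at the cost of discarding observations from the larger class.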
Big Data in Data Science
When it comes to big data and data science, there is some overlap with the approaches used in traditional data handling, but there are also a lot of differences. First of all, big data is stored on many servers and is considerably more complex.
- Collect the data
- Class-label the data
- Data cleansing
- Data masking
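Of the steps above, data masking is the one with no counterpart in the traditional list, so a small illustration may help. The Python sketch below hides personal details while keeping records usable for analysis: names are replaced with a keyed hash, and e-mail addresses keep only their first character and domain. The field names, the `SECRET_KEY` value, and the two masking rules are hypothetical choices made for this example. In a real big data setting the same logic would typically run across many machines rather than on a single in-memory record.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real one would be stored securely.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymise(value, key=SECRET_KEY):
    """Replace a value with a short keyed hash.

    The same input always yields the same token, so records can
    still be grouped or joined without exposing the original value.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_email(email):
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * max(len(local) - 1, 0) + "@" + domain


def mask_record(record):
    """Return a copy of the record with personal fields masked."""
    masked = dict(record)
    masked["name"] = pseudonymise(record["name"])
    masked["email"] = mask_email(record["email"])
    return masked


record = {"name": "Alice Smith", "email": "alice@example.com", "spend": 120}
masked = mask_record(record)
print(masked["email"])
```

Non-personal fields such as `spend` pass through untouched, which is the point of masking: the analysis can proceed on the figures while the private details stay hidden.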