Book review: BECOMING A DATA HEAD
How to think, speak, and understand Data Science, Statistics, and Machine Learning
Genres:
- Information management
- Data Science
- Data Analytics
Review posted on:
31.10.2022
The number of pages:
272 pages
Book rating:
3/5
Year the book was published:
First edition published 2021
Who should read this book:
- Data analysts who want to get into data science.
Why did I pick up this book and what did I expect to get out of it:
The authors promise not to get too deep into the technical side, meaning statistics and programming languages; instead, they focus more on the mindset and on preparing for a new data project or taking over an existing one. From experience, I have seen that many overlook (some intentionally, others for lack of knowledge) the quality of the data, what can be done with the data you have, and what data you need to accomplish your project. I was no different when I started out in data analytics. In many cases, only later, once I got to know the data and its origin, did I start seeing what a report lacked or where it was not telling the real story. Many data analytics books overlook this crucial phase.
I expect to get some insights on how to prepare for a new data project and what to look for when taking over an existing one. I also expect to learn which method is best suited for a certain type of report.
My thoughts about the book:
I love that the authors start by focusing on the thought processes and the mindset a data analyst needs to acquire, instead of going directly into theory about data, models, analyses, etc. Right "off the bat", the authors give us a very simple thought experiment showing how a lack of information, context, or data influences our predictions. I very much liked that the authors added "thought exercises" to each chapter, so you get examples of what they want to point out. In many cases, they highlight what many beginner or intermediate data analysts may miss in the data, or in the lack of it.
Overall, the book was fairly easy to read, considering that data science is not an easy subject to write about in this manner. Still, certain segments felt a little repetitive, and you could get lost in them.
If you are starting out in data analytics or data science, this is one of those books that will help you out, because the authors successfully point out some "traps" you can easily fall into. You also get a good overview of what it takes to successfully execute a data project. On the other hand, the book lacks a concrete, step-by-step example of delivering a project. What I mean is that I missed a "main" case study running from start to end, containing all the steps and the optimal tools to solve the problem and present the results.
If you picked up this book please let me know what you think about it in the comment section.
A short summary of the book:
The premise of "Becoming a Data Head" is to teach you to ask more probing questions about what the problem actually is, rather than about which methodology will be used or which app or tool the results will be presented with. You will be "put" into various roles at different points. The introductory chapter starts you off with the "data science industrial complex". The book is mostly about the mindset of a "data head" and the thought processes behind making analyses and predictions. It consists of four parts.
In part one of the book, you will learn to think critically and ask the right questions about the data projects your organization takes on: Why is this problem important? Who does this problem affect? When is the project over? What if you don't have the right data? And what happens if you don't like the results? These are very important questions that you need answered before you start any big data project.
In the second part, you will learn to actively participate in important conversations about a data project and to question statistical claims. The authors share tips that will help you dig deeper into the issues surrounding your data: for example, is anyone misinterpreting correlations, and what was done with outliers and missing data? You will also learn about probability and the issues it "brings to the table".
In the third part of "Becoming a Data Head," you will learn about machine learning, artificial intelligence, and deep learning. This will help you search for "hidden gems" in your data. But not all is as it seems: there are hidden pitfalls, not just gems. In this part of the book, you will go through different types of analysis and learning techniques and their pros and cons, such as clustering, decision trees, regression, and text classification.
In the fourth part, you will learn from the mistakes others have made in their projects. The authors share the dangers of biases and elaborate on the most common ones, as well as the most common pitfalls. One whole chapter in this part is dedicated to communication gaps, as these are one of the main reasons why projects fail. You will also learn about the people and personalities involved in data projects.
Most chapters contain case studies to help you understand the premise of each chapter. Sometimes you will feel there is no need for a case study, but once you have read it in full you will see that, without the authors' experience, you would probably overlook some of the pitfalls most data analysts also overlook.
My notes from the book:
- Many times data projects are undertaken because companies like the sound of what they are implementing without fully understanding why the project itself is important.
- When starting a data project start with the problem to be solved, not the technology to be used.
- The five questions you should ask and answer before trying to solve a data problem:
1. Why is this problem important?
2. Who does this problem affect?
3. What if we don't have the right data?
4. When is the project over?
5. What if we don't like the results?
- By asking each stakeholder "Why is this problem important?" you will get to understand how each person sees the problem.
- When you enter a project, keep in mind that not having the right data is a possibility. Create contingencies to pivot and collect better data, or, if the data doesn't exist, go back to the original question and attempt to redefine the project scope.
- If raw data is bad no amount of data cleaning wizardry, statistical methodology, or machine learning can help you.
- Data workers make hundreds of tiny decisions during projects. The cumulative effect can be substantial. Left to their own devices and without the guidance of domain expertise, data workers may continue chipping away at the data, removing complex and nuanced cases, until the data is too detached from the reality it's trying to capture to be useful.
- When dealing with probabilities be mindful and recognize that your intuition can play tricks on you. Be careful when assuming independence. Know that all probabilities are conditional and ensure the probabilities have meaning.
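A quick sketch (not from the book, with invented numbers) of why assuming independence can mislead and why conditioning matters:

```python
# Two dependent events: rain (A) and carrying an umbrella (B).
# Hypothetical joint distribution over (A, B), purely for illustration.
joint = {
    (True, True): 0.25,    # rain and umbrella
    (True, False): 0.05,   # rain, no umbrella
    (False, True): 0.10,   # no rain, umbrella
    (False, False): 0.60,  # no rain, no umbrella
}

p_a = sum(p for (a, _), p in joint.items() if a)  # P(rain) = 0.30
p_b = sum(p for (_, b), p in joint.items() if b)  # P(umbrella) = 0.35
p_both = joint[(True, True)]                      # P(rain and umbrella) = 0.25

# Naive independence assumption badly underestimates the joint probability:
assumed = p_a * p_b                               # 0.105, far below the true 0.25
# The conditional probability P(umbrella | rain) reveals the dependence:
p_b_given_a = p_both / p_a                        # about 0.83
```

Here "assuming independence" undercounts the joint event by more than half; checking the conditional probability against the marginal is what exposes the error.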
- All supervised learning problems follow the same paradigm. Data with inputs and outputs called training data is fed into an algorithm that exploits the relationships between the inputs and outputs to create a model that makes predictions. The model can then take a new input and map it to a predicted output. When the output is a number, the supervised model is called a regression model. When the output is a label the model is called a classification model.
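The paradigm above can be sketched in a few lines. A 1-nearest-neighbour "model" is used here only because it is about the simplest possible learner; the data points and labels are made up for illustration:

```python
def nearest_neighbour(train, x_new):
    """Return the output of the training example whose input is closest to x_new."""
    return min(train, key=lambda pair: abs(pair[0] - x_new))[1]

# Classification: the outputs are labels.
labelled = [(1.0, "cheap"), (2.0, "cheap"), (8.0, "expensive"), (9.0, "expensive")]
print(nearest_neighbour(labelled, 7.5))  # -> expensive

# Regression: the outputs are numbers.
numeric = [(1.0, 10.0), (2.0, 20.0), (8.0, 80.0), (9.0, 90.0)]
print(nearest_neighbour(numeric, 7.5))   # -> 80.0
```

The same algorithm becomes a classifier or a regressor depending purely on whether the training outputs are labels or numbers, which is exactly the distinction the note draws.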
- Subject matter expertise and getting the right data into a supervised learning model are key to having a successful model.
- When creating and testing your model, it's recommended to train and learn from 80% of the observations in a dataset and test the model's performance on the other 20%.
- AI is reinforcing patterns from data collected in the past. It's not about creating something resembling human consciousness.
- When models make predictions, they perpetuate and reinforce underlying biases and stereotypes already manifested in the data.
- Questions you should ask about the data:
1. Who collected the data and how was the data collected?
2. Is there any sampling bias and what did you do with the outliers?
3. How did you deal with missing values?
4. Can data measure what you want it to measure?
- When two variables are correlated it does not mean that one is causing the other. For example: ice cream sales are correlated with shark attacks; both spike in the summer months. But reducing ice cream sales will not mitigate shark attacks.
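The ice cream/shark example can be made concrete with a few invented numbers: both series below are functions of a third variable (monthly temperature), so they correlate perfectly without either causing the other.

```python
# All numbers are invented for illustration; only the structure matters.
temps = [5, 7, 12, 18, 24, 29, 31, 30, 25, 17, 10, 6]  # monthly temperatures
ice_cream = [2.0 * t + 3.0 for t in temps]             # sales track temperature
sharks = [0.5 * t + 1.0 for t in temps]                # attacks track temperature

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

print(pearson(ice_cream, sharks))  # approximately 1.0: correlated, zero causation
```

The correlation is (numerically) perfect, yet intervening on ice cream sales would obviously leave shark attacks unchanged; the confounder (temperature/season) drives both.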
- Data is messy, so the resulting principal components will often lack clear meaning and may not lend themselves to descriptive nicknames. When others present already-named components, challenge their definitions by asking to see the equations behind the groupings.
- Supervised learning is both powered and limited by its training data. Unfortunately, many companies spend more time worrying about the latest supervised learning algorithm than about how to collect relevant, accurate, and sufficient data to feed it.