Book review: FROM DATA TO DECISION
A Handbook for the Modern Business Analyst
Genres:
- Information management
- Data Processing
- Business & Finance
- Analytics
Review posted on:
13.03.2022
The number of pages:
326 pages
Book rating:
4/5
Year the book was published:
First edition published 2018
Who should read this book:
- People interested in data and predictive analytics (marketing) with a basic or intermediate knowledge of data analytics and the R programming language (or similar).
- Business analysts.
Why did I pick up this book and what did I expect to get out of it:
Picking up a book like this that has zero reviews is always tricky, but that shouldn’t discourage you from at least briefly leafing through it to see if the content suits you. Based on the table of contents and the introductory chapter, I expected that the authors would take me on a “journey” on which I would learn how to approach a project or a report, look at the data, and take into account the needs and questions of the stakeholders. After that, I expected to learn which analyses are even possible based on the data I have in front of me and the question I’m trying to answer. I also expected to learn the downsides and upsides of each analysis and how they compare to other analyses that might answer the same question.
To make the book even more satisfying for me, the authors would have to include case studies with the steps I need to take for each analysis, so that I don’t have to guess how they got their results (especially if my results differ from theirs). At the end of the book or of each chapter, there should also be some tasks to practice what you have learned, along with the right answers so you can check whether you solved them correctly. And based on the title “From Data to Decision”, I expected that by the end of the book I would be able to execute the right analysis and present the results in such a way that the end-users of the report can make the right decision. Or, if the data is insufficient, that I would be able to explain why I can’t do the analysis and which data I would need to perform it.
My thoughts about the book:
The thing I liked the most about “From Data to Decision” is that the authors didn’t go too deep into statistical formulas when they explained the theory and case studies of each analysis. I have read a few books about data analytics, and in most of them the authors go too far in-depth into formulas, which is OK if you are looking at the analysis from a statistical point of view, but for a person like me who has no background in statistics, it’s confusing. For most data analysts today, that level of detail is also not needed. As the authors mention in this book, you as a data analyst don’t need that much knowledge of statistical formulas, as most analytical tools such as Python or RStudio already have them included in their libraries and functions. As a data analyst you must know which data you need, how to clean and transform it so it can be used in the analysis, and how to do the analysis itself. Don’t get me wrong, I’m not saying that we don’t need people who master statistics. Of course we do, as those are the key people who developed the models and analyses we use “today” and will use “tomorrow”. All I’m saying is that there is a difference between analysts who design new types of models and analysts who use already-developed models to do the needed analyses.
Also, I was glad to see the authors used RStudio to do the analysis, as I also do my analyses in RStudio and not in Python (which is more popular today). I also liked that the chapters had the same structure. At the start of each chapter, you first read what you are going to learn and get a link to the data if it is available online. Then you read some theory about the subject matter and a case study or two. At the end of the chapter, there is a “Quiz”, but here I have to point out that I was disappointed that the authors didn’t provide the correct answers against which you could check your solutions.
If you picked up this book please let me know what you think about it in the comment section.
A short summary of the book:
The book starts out with an introduction to analytics and types of data. In the first couple of chapters, you go over the fundamentals of data, data strategies, frameworks, and processes. The authors help you get a bigger picture of what you have to keep in mind and which skills you need to master as a data analyst to be capable of developing a data strategy for a complex analytical project or to just do a simple analysis.
After you go over the above-mentioned basics, the authors first introduce you to some Basic Analysis and continue with Linear and Logistic Regression, Decision Trees, Multi-Dimensional Scaling (MDS), Principal Component Analysis and Factor Analysis, Cluster Analysis, Time Series Analysis, and Text Analytics, for which the authors provide in-depth steps backed up with case studies. In the end, the authors also describe the concept and workings of Neural Networks and Machine Learning but do not go as in-depth as they did with the other analyses.
For each of the analyses, the authors first explain what you will learn in the current chapter, followed by some theory about the analysis: what it is best used for and when, what its downsides are, and which alternatives would be better, if there are any. After the theory, you read a case study of a situation in marketing where you mostly learn about the steps of creating a model for the analysis at hand.
In most case studies you get a link to a dataset so you can do the analysis yourself or follow the authors’ steps while going through the chapter. At the end of each chapter, there is also a “Quiz” you can go through to see if there is anything you didn’t fully understand. But the downside (at least for me) is that you have no answers from the authors to check whether you were correct. The authors promote the use of analytical programming languages such as R or Python, and in the book itself they help the reader by providing the appropriate R libraries for each analysis.
My notes from the book:
- Analyze and model in multiple stages. Analytics is an iterative process. Initial analyses may not give any interesting insights. It is common to analyze a data set in many different ways before one discovers any insights.
- Combining different data sources can often bring unique perspectives and uncover new insights.
- To leverage data to get business insights and make better decisions you need to first establish what data is available and what data you deem necessary. If it is not available but you deem it necessary or valuable you need to find out where and how to get it and how to manage it.
- Analysts often spend considerable time talking about and presenting what their analyses and models say but rarely what their models don't say. Communicating what the analyses can't confirm will help you build credibility.
- It is possible for an insight to be known and credible yet not accepted. The reason may be a lack of actionability, a lack of alignment with short-term objectives, a lack of alignment with the decision maker's agenda, or a lack of ownership.
- The more you can make the decision-makers co-producers of the insights, the more likely it is that the insights will be acted on.
- When you have generated a model you should assess its quality on four criteria:
1. Does the model have significant coefficients and how stable are they?
2. Are the directions of these coefficients intuitive?
3. What is the predictive accuracy of the overall model?
4. What is the hold-out predictive accuracy of the overall model?
- When calculating the ROI in marketing, the main objective is to understand whether and "how much to spend" on certain marketing activities and how these activities will affect sales. There are several dynamics you have to consider:
1. Direct effects
2. Carry-over and long-term effects
3. Indirect effects
4. Diminishing return effects
5. Feedback effects
6. Interaction effects (the combined effect of two marketing activities is bigger than the sum of their individual effects)
7. Time-varying effects (what is important changes over time)
- According to the authors, for a simple neural network model, increasing the number of layers provides better performance than increasing the number of nodes per layer.
- The mean can be heavily influenced by outliers. If you have outliers, use the median.
- If your data is skewed to one side in a scatterplot, you need to transform it so you can make a better comparison. Some options for transforming data are: log, powers, roots, reciprocals, and the Box-Cox transformation.
- When dealing with outliers, test the differences in medians instead of the differences in means, so that the outliers have little effect on the results. Use the Wilcoxon test (wilcox.test in R).
- A common way to smooth a time series is by using a moving average.
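The notes on mean versus median and on moving-average smoothing are easy to try for yourself. Below is a minimal sketch in Python using only the standard library. The book itself works in R, and the sales figures and the `moving_average` helper are made up for illustration, not taken from the book.

```python
from statistics import mean, median

# Hypothetical weekly sales figures; the last value is an obvious outlier.
sales = [10, 12, 11, 13, 12, 11, 95]

avg = mean(sales)      # pulled upwards by the outlier
mid = median(sales)    # barely affected by it


def moving_average(values, window):
    """Simple moving average: the mean of each run of `window` values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


smoothed = moving_average(sales, window=3)

print(f"mean={avg:.2f}, median={mid}")
print("3-week moving average:", [round(v, 2) for v in smoothed])
```

With one extreme value in the data, the mean ends up roughly twice the median, which is the reason the note recommends the median when outliers are present. The moving average is also dragged up by the outlier, but only in the windows that contain it.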
- Analyses and models:
1. LINEAR REGRESSION - is a simple yet powerful predictive model designed for predicting continuous numerical values such as "product price".
2. LASSO and RIDGE REGRESSION - are used when there are too many independent variables and high multicollinearity involved in the model (in that situation, plain Linear Regression tends to have low accuracy since the model is confused by too much irrelevant information).
3. TOBIT MODEL - is used when the dependent variable has a range restriction. An example of such a dependent variable is "market share", which cannot fall below 0% or exceed 100%.
4. SPATIAL REGRESSION - is designed specifically to run on spatial data. It takes into consideration the geometric or geographical structure.
5. LOGISTIC REGRESSION - is an approach to classification problems, which means you are predicting a decision or, equivalently, a categorical output (yes/no). With it, you can do analyses such as "Key Driver Analysis", where you look at how to improve your products based on customer feedback.
6. THE SWITCHABLE CONSUMER approach can predict future market shares. It does so by calculating the switchable consumers and the at-risk consumers. From this, you can calculate the market share a brand would gain if all the switchable consumers were to come to it, and the market share it would lose if all at-risk customers were to defect from the brand.
7. MULTI-CLASS LOGISTIC REGRESSION - is used if you want to predict which product out of X a customer will use. A simple solution is the "One-versus-all" strategy.
8. DECISION TREES - a predictive model that is frequently used in marketing analytics such as market segmentation. They produce a hierarchical structure of data and can handle classification and regression problems. Decision trees make no assumptions that are typical for Logistic regression. Instead, they simply partition the data into smaller and smaller homogeneous chunks.
9. MULTI-DIMENSIONAL SCALING (MDS) - is a group of techniques that seek to find a graphical representation of a matrix with similarities or distances.
10. PRINCIPAL COMPONENT ANALYSIS (PCA) - is used for dimensionality reduction. The goal is to reduce the data to only its fundamental dimensions to make analysis more manageable and interpretable. The idea is to combine many variables into new variables, which are called principal components.
11. CLUSTER ANALYSIS - is a set of techniques that groups observations into homogeneous groups in such a way that the units within a group are similar and units between groups are dissimilar. The main applications in marketing are market segmentation, market structure analysis, and data reduction.
12. CONFIRMATORY FACTOR ANALYSIS (CFA) - is a technique to verify that the relationship between the original variables and the extracted latent variables exists. This is usually done via structural equation modeling.
13. TIME SERIES ANALYSIS - is a popular tool for understanding past trends and predicting future values. It can be used for forecasting, market-mix modeling, customer lifetime value models, and trend spotting.
14. NEURAL NETWORK ANALYSIS - is a set of analytical approaches that fall under the umbrella of machine learning. Machine learning could be defined as a set of advanced and often computationally heavy automated algorithms that handle massive data using complex models. Machine learning algorithms are often non-parametric, which means that no specific relationship between variables is assumed at the start of the analysis.
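
The market-share arithmetic behind the switchable consumer approach (item 6) can be sketched in a few lines of Python. This is only my reading of the idea, not the authors' exact method: the function name, its parameters, and the example percentages are all hypothetical, and it assumes every share is expressed as a percentage of the same total market.

```python
def share_bounds(current_share, switchable_share, at_risk_share):
    """Return (best_case, worst_case) market share in percent.

    Assumption (not from the book): shares are simple percentages of the
    total market, so gains and losses can be added and subtracted directly.
    """
    if at_risk_share > current_share:
        raise ValueError("at-risk consumers cannot exceed the current share")
    # Best case: every switchable consumer comes to the brand.
    best_case = min(100.0, current_share + switchable_share)
    # Worst case: every at-risk customer defects from the brand.
    worst_case = current_share - at_risk_share
    return best_case, worst_case


# Hypothetical brand: 20% share today, 5% of the market could switch to it,
# and 4% of the market are current customers at risk of defecting.
best, worst = share_bounds(20.0, 5.0, 4.0)
print(f"best case: {best}%, worst case: {worst}%")
```

The two numbers give a rough range within which the brand's future share could move, which is how I understood the purpose of the approach.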