100+ Concepts of Data Science/Analytics/Engineering in 5 minutes. (9/10)
Here are another 10 terms (81–90) in 5 minutes to give you an overview of the data world from Data Science to Data Analytics to Data Engineering:
81. Bayesian Statistics
Bayesian Statistics or Bayesian Thinking is a field of statistics that concerns itself with the likelihood of an event happening given other events.
A great example would be trying to figure out the likelihood that someone you’re going on a first date with, likes Star Wars.
Your estimate is that about 60% of the population likes Star Wars. On the date, it comes out that your date went to see the latest Star Wars movie.
You know that basically all Star Wars fans saw the movie, but not everyone who saw the movie is a Star Wars fan, so you update your estimate: it’s now roughly 80% likely that your date is a Star Wars fan.
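To make that concrete, here’s a tiny sketch of the Bayes’ rule calculation behind that update (the 60% prior, the assumption that essentially every fan saw the movie, and the ~37.5% of non-fans who saw it are all made-up numbers for illustration):

```python
# Bayes' rule: P(fan | saw movie) = P(saw | fan) * P(fan) / P(saw)
# All of these numbers are illustrative assumptions, not real survey data.
p_fan = 0.60                 # prior: share of the population that likes Star Wars
p_saw_given_fan = 1.0        # "basically all fans saw the movie"
p_saw_given_not_fan = 0.375  # assumed share of non-fans who still saw it

# Total probability that a random person saw the movie
p_saw = p_saw_given_fan * p_fan + p_saw_given_not_fan * (1 - p_fan)

# Posterior: updated belief after learning your date saw the movie
p_fan_given_saw = p_saw_given_fan * p_fan / p_saw
print(round(p_fan_given_saw, 2))  # -> 0.8
```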
82. Data Distribution
A data distribution describes how the values in a dataset are spread out numerically.
The most common type of distribution is the Gaussian or Normal distribution which is a precondition for running a lot of statistical tests.
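To get a feel for what a distribution looks like, here’s a minimal sketch that draws purely synthetic data from a Normal distribution and plots the familiar bell curve:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic example: 10,000 draws from a Normal (Gaussian) distribution
rng = np.random.default_rng(42)
samples = rng.normal(loc=0, scale=1, size=10_000)

# A histogram shows how the values are spread out: the bell curve
plt.hist(samples, bins=50)
plt.title("Normal (Gaussian) distribution")
plt.show()
```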
83. Selection Bias
One really important statistical concept that can affect the accuracy of your machine learning algorithms is selection bias.
This is the type of bias introduced when the individuals or instances selected for a group are not genuinely randomized, which leads to poor accuracy in your algorithms.
An example would be a train/test split done poorly, such that your training set is not representative of your overall sample.
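One practical guard against this kind of bias is a properly randomized (and, for classification, stratified) train/test split. Here’s a minimal sketch using scikit-learn on synthetic, imbalanced data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 90% class 0, 10% class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
y = np.array([0] * 900 + [1] * 100)

# Shuffling (the default) randomizes the split; stratify=y keeps the class
# proportions the same in train and test, so neither side is skewed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(y_train.mean(), y_test.mean())  # both ~0.10
```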
84. Bootstrapping
Bootstrapping is one way to fight selection bias and also try and make your algorithms work when you don’t have that much data to work with.
Bootstrapping is a procedure where you repeatedly draw random samples with replacement from your data, each the same size as the original, and recompute your statistic on every resample.
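Here’s a minimal sketch of bootstrapping with NumPy, using it to put a rough confidence interval around the mean of a small synthetic sample:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=50, scale=10, size=30)  # small synthetic sample

# Draw many resamples *with replacement*, each the same size as the data,
# and record the statistic of interest (here, the mean) every time.
boot_means = [
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(5_000)
]

# The spread of the bootstrap means gives an approximate 95% confidence interval
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean ≈ {data.mean():.1f}, 95% CI ≈ ({low:.1f}, {high:.1f})")
```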
85. Hypothesis Testing
Hypothesis Testing in statistics is a way for you to test the results of a survey or experiment to see if you have meaningful results.
You’re basically testing whether your results are valid by figuring out the odds that your results might have happened by chance.
If your results could easily have happened by chance, the experiment won’t be repeatable, so it has little use.
This is very useful for data scientists who want to generalize beyond their sample and say whether something they’ve done is effective for the whole population.
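Here’s a minimal sketch of a hypothesis test in Python: a two-sample t-test on synthetic A/B data, where the p-value estimates how likely a difference this large would be if there were really no effect:

```python
import numpy as np
from scipy import stats

# Synthetic A/B experiment: group B has a slightly higher true mean
rng = np.random.default_rng(7)
group_a = rng.normal(loc=100, scale=15, size=200)
group_b = rng.normal(loc=105, scale=15, size=200)

# Two-sample t-test; the null hypothesis is "the groups have the same mean"
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# A small p-value means the observed difference is unlikely to be pure chance
if p_value < 0.05:
    print(f"p = {p_value:.4f}: reject the null, the difference looks real")
else:
    print(f"p = {p_value:.4f}: not enough evidence of a real difference")
```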
86. Data Analyst
If you want to break into the world of data science, becoming a data analyst is a great way to get started.
It was my first choice, and it taught me a lot of the skills that eventually got me into data science in a gentler way than diving straight into hardcore data science.
A data analyst generally works on understanding business needs in a process called requirements gathering, checking whether the data is available, analyzing it, and presenting the results to business leaders.
87. Data Storytelling
At the end of the day, a data analyst is involved in the practice of data storytelling.
What’s the point of analyzing data?
It is to drive some form of change in an organization or group of people. You can inundate people with numbers, facts, and figures, but you’re much more likely to win them over to your way of thinking if you can tell a story that communicates the message using data.
There are seven different data stories that you can tell.
I. Narrate change over time.
An example of this could be a time-lapse of the Amazon being deforested over time.
II. Start big and drill down.
This is a great way to give context to the smaller data points you want to communicate.
If you wanted to communicate how infrastructurally underdeveloped North Korea is, you could show a map of all of Asia and how well-lit it is at night, then zoom into the one dark patch: North Korea.
III. Start small and zoom out.
This is the method a lot of news outlets will use when discussing an issue.
They’ll find an individual affected by the issue to anchor the audience with a sympathetic person then zoom out to explain how the problem affects a much larger population.
IV. Highlight Contrasts.
This is a great way of showing how problem areas tend to cluster around one another.
If we look at an index of national health, we’ll notice that less healthy nations tend to cluster in certain regions of the world, in contrast to healthier ones.
V. Explore the intersection.
This can be how different phenomena develop in response to stimuli in their environment.
A line graph of North Korean vs. South Korean GDP per capita shows an intersection in the 1950s, followed by South Korea accelerating into the stratosphere while North Korea eventually stagnates.
VI. Dissect the factors.
Oftentimes, we want to know what the constituent parts of observed phenomena are.
Take global GDP figures by themselves: there isn’t much of a story there, but as you dissect them by country you start to see the rise of certain countries at different points in history.
VII. Profile the Outliers.
Sometimes the outliers are what we’re truly interested in.
If you’re working in fraud detection, normal transactions are not something you’re interested in as much as the outliers are.
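As a toy illustration of profiling outliers, here’s a simple z-score rule applied to synthetic transaction amounts (real fraud detection uses far richer features and models, so treat this as a sketch of the idea only):

```python
import numpy as np

# Synthetic transaction amounts with a few extreme values mixed in
rng = np.random.default_rng(3)
amounts = np.concatenate([rng.normal(50, 15, size=500), [900, 1_200, 2_500]])

# Flag anything more than 3 standard deviations from the mean
z_scores = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z_scores) > 3]
print(outliers)  # the handful of transactions worth a closer look
```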
88. Business Intelligence
Business Intelligence is concerned with taking (generally tabular) data and transforming and visualizing it in order to explain business performance.
It’s less about analyzing the data as a data analyst does and more about communicating what the data says, in formats that help leaders understand how the business is performing.
89. BI Tools
Business Intelligence Professionals use a class of tools called BI Tools.
These are tools with graphical user interfaces that connect to many different data sources and make visualizing that data easy; Tableau and Power BI are well-known examples.
90. matplotlib / Seaborn / ggplot2
If you’re trying to create a visualization in Python, you’ll typically use matplotlib or Seaborn.
Or ggplot2 if you use R.
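Here’s a minimal sketch of both Python options on synthetic data: a plain matplotlib scatter plot, then the same idea in Seaborn, which adds a fitted regression line for free:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data: two noisy, related variables
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

# Pure matplotlib scatter plot
plt.scatter(x, y, alpha=0.5)
plt.xlabel("x")
plt.ylabel("y")
plt.title("matplotlib scatter")
plt.show()

# The same data with Seaborn, which also fits and draws a regression line
sns.regplot(x=x, y=y)
plt.title("Seaborn regplot")
plt.show()
```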
Thank you so much for reading. This is part 9 out of 10. If you liked this story, don’t forget to press that clap icon, and keep following for the final part and more stories like this.