Imagine you want to analyze the data from a company. It could be a large multinational corporation, but it could also be just a small local pet shop. If the pet shop is organized, it will have a table on its internal system to list its customers, another one to list the pets, another one to list the purchases, and maybe others to list employees, expenses, stores, inventory, and so on.
You can not analyze the tables separately. You need to join them with each other to find out things like "Do Labrador owners purchase more dog food than Pit Bull owners?" because the breed information is in the 'pet' table while the purchase information is in the 'purchase' table. So, eventually, you will need to merge the two tables into one. In this case, you would even have to use the 'customer' table to make the connection between 'pet' and 'purchase', because probably it's not the pet that is making the purchases, but its owner, who is a customer at the pet shop.
There are multiple ways for you to join tables, depending on the desired outcome. Maybe you are only interested in the records that match both tables ("inner join"), or maybe you want to also have the records that belong to just one of them ("left join"), and so on. And there are also multiple software and languages for making that manipulation. Three of the most popular are SQL, Spark and Python pandas.
In the image below, there is a reference guide for the different ways to join (or "merge") two tables (or "datasets") and how is the corresponding syntax in SQL, Spark and Python pandas.
This is hardly the only one online. For decades similar images have been used. This one, on the other hand, is the most complete because it has the cross join and all 3 languages syntaxes and is the most aesthetic (but maybe I'm biased because I created it).
Save this image on your computer 💻 or mobile 📱. Copy to wherever you want. Save this page on your favorite bookmark 🧡. Only, if you are going to make money 💰 out of it, please cut me in the deal.
Enjoy!