Analyzing Airbnb Data: Mexico City
Taking advantage of the FIFA World Cup atmosphere, I will analyze data from one of the cities that will host the competition in 2026, Mexico City.
Mexico City, besides being the capital of Mexico, is one of the world’s great tourist centers, registering 31.9 million international tourists in 2021.
When it comes to tourism, it is essential to mention Airbnb, one of the largest lodging rental platforms today.
The startup, founded in 2008, is present in more than 220 countries and regions and has more than 4 million registered hosts, offering alternatives to traditional lodging by connecting owners willing to rent out their residences with guests looking for a place to stay.
Data on Airbnb listings in major cities around the globe is available through Inside Airbnb, an independent project whose mission is to show the platform's impact on residential communities. I will use this portal to extract and analyze the data for Mexico City.
Data Collection and Analysis
As mentioned, all data was collected through the Inside Airbnb portal. The imported dataset consists of entries (rows) and variables (columns): 22,948 entries and 18 variables in total. For easier reference, the Dictionary of Variables below lists each variable's name and meaning.
Dictionary of Variables
- id: Identification number of the property.
- name: Name of the advertised property.
- host_id: Identification number of the owner (host) of the property.
- host_name: Name of the property owner.
- neighbourhood_group: Neighborhood group of the advertised property.
- neighbourhood: Neighborhood name.
- latitude: Latitude coordinate of the property.
- longitude: Longitude coordinate of the property.
- room_type: Type of room offered.
- price: Rental price of the property.
- minimum_nights: Minimum number of nights for a reservation.
- number_of_reviews: Number of reviews the property has.
- last_review: Date of the last review.
- reviews_per_month: Number of reviews per month.
- calculated_host_listings_count: Number of properties from the same host.
- availability_365: Number of days of availability within 365 days.
- number_of_reviews_ltm: Number of reviews within the last 12 months.
- license: Municipality registration code.
Before starting any analysis, here is a sample of our Data Frame, where we can check how the data came in by default and inspect its content.
Missing Values
Even at first contact with our Data Set, the absence of values in some fields, such as neighbourhood_group and license, is evident.
Missing data is usually caused by the user forgetting to fill in the data properly, a programming error, loss of data during transfer from another source, or even by the user’s choice when filling in the data.
Regardless of the cause of the missing data, it is important to address it properly to ensure everything from accuracy when visualizing data in dashboards to the application of advanced machine learning models.
Here’s a better visualization of the missing values by column:
As shown in the chart above, the neighbourhood_group and license columns are entirely empty, while last_review and reviews_per_month each have 17.72% of their entries missing.
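The per-column percentages above can be computed along these lines. This is a minimal sketch using pandas on a toy table: the values are invented for illustration, and only the column names come from the data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values are invented for illustration).
df = pd.DataFrame({
    "name": ["Loft A", "Room B", "Studio C", "Flat D"],
    "neighbourhood_group": [np.nan, np.nan, np.nan, np.nan],
    "price": [950, 400, 1200, 700],
    "reviews_per_month": [1.2, np.nan, 0.4, 2.1],
})

# isnull() marks missing cells; the column mean of those booleans is the
# fraction missing, which we convert to a percentage and sort, largest first.
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(2))
```

On the real data, a column showing 100% here would be a candidate for removal, while partially missing columns call for a case-by-case decision.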
Cleaning our Data Set
Since the neighbourhood_group and license fields are entirely empty, and keeping the data as clean as possible matters, we can remove them from our Data Set.
I also chose to remove the id and host_id fields, as they are identifiers and will not be useful for our analysis.
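As a rough illustration of this cleaning step, the four columns can be dropped in a single call. The toy table below is an assumption made for the example, and the Notebook may organize this step differently.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values are invented for illustration).
df = pd.DataFrame({
    "id": [101, 102],
    "host_id": [9001, 9002],
    "neighbourhood_group": [np.nan, np.nan],
    "license": [np.nan, np.nan],
    "price": [950, 400],
})

# Drop the two empty columns and the two identifier columns.
columns_to_drop = ["neighbourhood_group", "license", "id", "host_id"]
df_clean = df.drop(columns=columns_to_drop)
print(list(df_clean.columns))
```

Assigning the result to a new name (df_clean, a hypothetical name) keeps the original table available for comparison.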
Identifying and Removing Outliers
In an analytical context, an outlier is an atypical value in a data set: one that stands apart from the rest of the values and commonly distorts visualizations of the data.
When first viewing the data distribution in our Data Frame, some inconsistencies were noticed, as can be observed here:
In the histograms above, the price and minimum_nights variables are heavily concentrated in a single region of the graph, which prevents a clear view of the data.
To handle the outliers in price, I applied the Interquartile Range (IQR) method: compute the IQR (the difference between the third and first quartiles), derive limits from it, and treat entries beyond those limits as atypical. More in-depth details can be found in the Notebook used for the analysis.
For the variable minimum_nights, I used a reference found in the Inside Airbnb portal for Mexico City, where listings with minimum_nights below 30 are considered “Short-Term Rentals”. That threshold defines the cutoff for outliers.
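The two filters described above could be sketched as follows. The 1.5 multiplier on the IQR is the conventional choice and is an assumption here, since the article does not state which factor the Notebook uses. The toy values are also invented for illustration.

```python
import pandas as pd

# Toy stand-in for the listings table (values are invented for illustration).
df = pd.DataFrame({
    "price": [300, 450, 500, 620, 700, 850, 900, 1100, 25000],
    "minimum_nights": [1, 2, 2, 3, 1, 30, 2, 365, 1],
})

# IQR limits for price. The 1.5 factor is the usual convention (assumption).
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep prices inside the IQR limits and stays below the 30-night threshold.
price_ok = df["price"].between(lower, upper)
short_term = df["minimum_nights"] < 30
df_clean = df[price_ok & short_term]
print(df_clean)
```

With real price data the lower limit is often negative, in which case only the upper limit removes anything.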
Displaying Data Without Outliers
After removing the outliers from our Data Set, we can display the data again to better see how it is distributed and extract some insights.
Is there any correlation between the variables?
Correlation allows you to measure the relationship between variables and what they represent, by means of correlation coefficients. For example, it is possible to measure the correlation between Air Pollution Increase x Respiratory Illnesses, or Unemployment Rate x Crime Index.
The strength of a correlation is expressed as a value between -1 and 1: values near the ends of the interval indicate stronger correlations, and values near 0 indicate weaker ones.
Looking at the heatmap above, there is not much correlation between the variables. The strongest is between number_of_reviews and reviews_per_month, with a positive correlation of 0.60, which can be better visualized in the scatter plot below:
This correlation makes sense, considering that the number of monthly reviews of a property is directly related to its total number of reviews.
A plausible hypothesis is that the correlation is weakened because reviews do not accumulate at a constant rate: the number of monthly reviews a property receives varies from month to month.
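For reference, a correlation matrix like the one behind the heatmap can be obtained with pandas. This sketch assumes Pearson correlation, which the article does not name explicitly, and uses invented toy values.

```python
import pandas as pd

# Toy stand-in for the numeric columns (values are invented for illustration).
df = pd.DataFrame({
    "price": [500, 800, 650, 1200, 400, 950],
    "number_of_reviews": [10, 50, 5, 120, 0, 30],
    "reviews_per_month": [0.5, 2.0, 0.3, 3.1, 0.0, 1.8],
})

# Pairwise Pearson coefficients; each cell lies between -1 and 1.
corr = df.corr(method="pearson")
print(corr.round(2))

# A single pair can be read directly from the matrix.
pair = corr.loc["number_of_reviews", "reviews_per_month"]
print(f"number_of_reviews vs reviews_per_month: {pair:.2f}")
```

The resulting matrix can then be passed to a plotting function, such as seaborn's heatmap, to produce the visualization.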
What is the average rental price in Mexico City?
The average nightly price for an Airbnb listing in Mexico City is Mx$978.65.
What is the most popular type of property on Airbnb?
It can be seen in the data above that:
- Shared rooms and hotel rooms are uncommon on the platform, accounting for 1.5% and 0.7% of listings, respectively.
- Entire home/apt properties make up 60.1% of available listings, while Private room properties represent 37.5%. This means that hosts more often rent out their entire property than individual rooms within it.
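Shares like the ones listed above can be computed from the room_type column. This is a minimal sketch on invented toy data, so its percentages do not match the real figures.

```python
import pandas as pd

# Toy stand-in for the room_type column (counts are invented for illustration).
df = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 6 + ["Private room"] * 3 + ["Shared room"],
})

# normalize=True returns fractions instead of counts; scale to percentages.
share = df["room_type"].value_counts(normalize=True) * 100
print(share.round(1))
```

value_counts sorts by frequency, so the most common room type appears first.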
What are the most expensive locations in Mexico City?
According to the data, the most expensive neighborhood is Miguel Hidalgo.
The neighborhood is known for an interesting mix of modern buildings and large landmarks dating from colonial times, which gives it a cosmopolitan atmosphere for its inhabitants and travelers.
In addition to the historic buildings, many old mansions house some of the best hotels and restaurants in the country, adding to the neighborhood's luxury and, in turn, to its average rental cost.
The list below shows the 10 most expensive neighborhoods in Mexico City and their average prices per night:
- Miguel Hidalgo: Mx$1217.88
- Cuauhtémoc: Mx$1108.15
- Cuajimalpa de Morelos: Mx$1073.93
- Álvaro Obregón: Mx$836.29
- Benito Juárez: Mx$808.75
- La Magdalena Contreras: Mx$779.72
- Coyoacán: Mx$766.09
- Xochimilco: Mx$666.26
- Iztacalco: Mx$652.01
- Tlalpan: Mx$634.75
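A ranking like the one above can be produced by grouping listings by neighbourhood and averaging price. This sketch uses invented toy prices, so the numbers differ from the real averages.

```python
import pandas as pd

# Toy stand-in for the listings table (prices are invented for illustration).
df = pd.DataFrame({
    "neighbourhood": ["Miguel Hidalgo", "Miguel Hidalgo", "Cuauhtémoc",
                      "Cuauhtémoc", "Tlalpan"],
    "price": [1300, 1100, 1000, 1200, 600],
})

# Mean price per neighbourhood, highest first, limited to the top 10.
top = (
    df.groupby("neighbourhood")["price"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
print(top.round(2))
```

Neighbourhoods with very few listings can produce unstable averages, so it may be worth checking the group sizes as well.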
With the help of the latitude and longitude variables, we can also plot the properties by location and price:
What is the most common minimum stay?
The data shows that 41.34% of all listings require a minimum stay of 2 to 3 nights.
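A share like this one can be computed by flagging the listings whose minimum_nights falls in the range and averaging the flags. This is a minimal sketch on invented toy values.

```python
import pandas as pd

# Toy stand-in for the minimum_nights column (values invented for illustration).
df = pd.DataFrame({"minimum_nights": [1, 1, 2, 2, 3, 2, 5, 7, 1, 3]})

# between() is inclusive on both ends, so this flags stays of 2 or 3 nights.
in_range = df["minimum_nights"].between(2, 3)

# The mean of a boolean column is the fraction of True values.
share = in_range.mean() * 100
print(f"{share:.2f}% of listings require a minimum of 2 to 3 nights")
```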
Conclusion
This analysis, although superficial, aims to show that even with a limited data set it is possible to apply good data cleaning and treatment practices and to extract insights through statistical analysis and data visualization, which can help in decision-making.
Even in this small sample, we found outliers and missing data that had to be cleaned to ensure greater accuracy and better visualization.
It is also important to note that the sample used for this project is a summarized version of the data, suitable only for an initial approach and for extracting basic information. For a deeper analysis, a larger data set with complete data and a greater variety of columns is recommended.
To view the complete project, you can visit my GitHub. Also, you can find me on LinkedIn.