Analyzing Airbnb Data: Mexico City

Taking advantage of the FIFA World Cup atmosphere, I will analyze data from one of the cities that will host the competition in 2026, Mexico City.

Mexico City, besides being the capital of Mexico, is one of the world’s great tourist centers; the country as a whole registered 31.9 million international tourists in 2021.

When it comes to tourism, it is essential to mention Airbnb, one of the largest lodging rental platforms today.

The startup, founded in 2008, is present in more than 220 countries and has more than 4 million registered hosts, offering alternatives to traditional lodging by connecting owners willing to rent their residences and tenants looking for a place to stay.

Data on Airbnb listings in major cities around the globe is made available by Inside Airbnb, an independent project whose mission is to show the platform’s impact on residential communities. I will use this portal to extract and analyze the data from Mexico City.

Data Collection and Analysis

As mentioned, all data was collected through the Inside Airbnb portal. Within this imported dataset, we can find several entries, each representing a row, and variables, corresponding to columns. In total, we have 22948 entries to be analyzed, together with 18 variables. For better understanding, a Dictionary of Variables is available with their names and meaning.

  • id — Identification number of the property.
  • name — Name of the advertised property.
  • host_id — Identification number of the owner (host) of the property.
  • host_name — Name of the property owner.
  • neighbourhood_group — Neighborhood group of the announced property.
  • neighbourhood — Neighborhood name.
  • latitude — Latitude coordinate of the property.
  • longitude — Longitude coordinate of the property.
  • room_type — Tells you what type of room is offered.
  • price — Rental price of the property.
  • minimum_nights — Minimum number of nights for a reservation.
  • number_of_reviews — Number of reviews the property has.
  • last_review — Date of last review.
  • reviews_per_month — Number of reviews per month.
  • calculated_host_listings_count — Number of properties listed by the same host.
  • availability_365 — Number of days of availability within 365 days.
  • number_of_reviews_ltm — Number of reviews within the last 12 months.
  • license — Municipality registration code.

Before starting any analysis, here is a sample of our Data Frame, where we can check how the data came in by default and inspect its content.

First entries of our Data Frame
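A minimal sketch of this first look, assuming the data is handled with pandas. The inline CSV is a made-up stand-in for the file downloaded from Inside Airbnb (the file name in the comment is an assumption), and it only carries a few of the 18 columns.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file; with the real data this
# would be something like pd.read_csv("listings.csv").
sample_csv = io.StringIO(
    "id,name,neighbourhood_group,neighbourhood,room_type,price,minimum_nights,license\n"
    "1,Loft A,,Cuauhtémoc,Entire home/apt,950,2,\n"
    "2,Room B,,Coyoacán,Private room,400,1,\n"
    "3,Flat C,,Miguel Hidalgo,Entire home/apt,1300,3,\n"
)
df = pd.read_csv(sample_csv)

# Each row is an entry and each column is a variable.
n_entries, n_variables = df.shape
print(f"{n_entries} entries, {n_variables} variables")

# First rows, as they come in by default.
print(df.head())
```

Empty fields in the CSV are read as missing values, which is why columns like neighbourhood_group and license already stand out at this stage.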

Even at the initial contact with our Data Set, the absence of values in some fields, such as neighbourhood_group and license, is evident.

Missing data is usually caused by the user forgetting to fill in the data properly, a programming error, loss of data during transfer from another source, or even by the user’s choice when filling in the data.

Regardless of the cause of the missing data, it is important to address it properly to ensure everything from accuracy when visualizing data in dashboards to the application of advanced machine learning models.

Here’s a better visualization of the missing values by column:

Absence and Fulfillment of Variables

As shown in the chart above, the neighbourhood_group and license columns are completely empty, while last_review and reviews_per_month are each missing 17.72% of their entries.
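The percentages behind a chart like this can be computed in one line. This is a small sketch on invented data, not the notebook's actual code: the column names follow the Dictionary of Variables, but the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample: one fully empty column and one partially empty column.
df = pd.DataFrame({
    "price": [950, 400, 1300, 700],
    "neighbourhood_group": [np.nan] * 4,
    "reviews_per_month": [1.2, np.nan, 0.4, 2.0],
})

# isnull() flags each missing cell; the mean of those flags per column
# is the share of missing entries, shown here as a percentage.
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(2))
```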

Since the neighbourhood_group and license fields are completely empty, we can remove them from our Data Set, which keeps the data as clean as possible.

I also chose to remove the id and host_id fields, as they are identifier fields and will not be useful for our analysis.
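Dropping the four columns could look like the sketch below. The sample rows are invented, and the variable names df_clean and cols_to_drop are my own choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample with the four columns that will be removed.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["Loft A", "Room B"],
    "host_id": [10, 20],
    "neighbourhood_group": [np.nan, np.nan],
    "price": [950, 400],
    "license": [np.nan, np.nan],
})

# Two fully empty columns plus two identifier columns.
cols_to_drop = ["neighbourhood_group", "license", "id", "host_id"]

# drop() returns a new Data Frame, leaving the original untouched.
df_clean = df.drop(columns=cols_to_drop)
print(df_clean.columns.tolist())
```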

In an analytical context, an outlier is an atypical value in a data set: it stands out from the rest of the values and commonly distorts the visualization of the data.

When first viewing the data distribution in our Data Frame, some inconsistencies were noticed, as can be observed here:

Histograms of the numerical variables in our Data Set, with price and minimum_nights concentrating all their values in a single bar due to outliers.
Biased Data Distribution Histograms

In the histograms above, the price and minimum_nights variables show a high concentration in only one region of the graph, not allowing an optimal visualization of the data.

To handle the outliers in price, the Interquartile Range (IQR) method was applied: the IQR is calculated and used as a reference to set a limit, and entries that exceed this limit are treated as atypical values. More in-depth details can be found in the Notebook used for the analysis.
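As a rough illustration of the idea, here is a sketch that assumes the common 1.5 x IQR rule for the limits. The multiplier is my assumption, as the exact limit used in the Notebook is not stated here, and the prices are invented.

```python
import pandas as pd

# Made-up prices with one clearly atypical value.
prices = pd.Series(
    [300, 450, 500, 620, 700, 800, 950, 1100, 25000], name="price"
)

# The IQR is the distance between the first and third quartiles.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Assumption: the usual 1.5 x IQR fences around the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the entries inside the fences.
inliers = prices[(prices >= lower) & (prices <= upper)]
print(f"IQR = {iqr}, limits = [{lower}, {upper}]")
print(f"{len(prices) - len(inliers)} outlier(s) removed")
```

Since a price cannot be negative, only the upper limit matters in practice for this variable.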

For the variable minimum_nights, I used a reference found in the portal Inside Airbnb: Mexico City, where entries with minimum_nights below 30 are considered “Short-Term Rentals”. This defines the cutoff line for outliers.
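This cutoff translates into a simple filter. The sketch below uses invented rows, and the constant name SHORT_TERM_LIMIT is only illustrative.

```python
import pandas as pd

# Made-up listings with short and long minimum stays.
df = pd.DataFrame({
    "name": ["Loft A", "Room B", "Flat C", "House D", "Studio E"],
    "minimum_nights": [1, 2, 30, 365, 29],
})

# Listings with fewer than 30 minimum nights count as short-term rentals.
SHORT_TERM_LIMIT = 30
short_term = df[df["minimum_nights"] < SHORT_TERM_LIMIT]
print(short_term)
```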

After applying these methods to remove outliers from our Data Set, we can plot the histograms again to see how the data is distributed and extract some insights from our source.

Data Distribution Histograms

Correlation allows you to measure the relationship between variables and what they represent, by means of correlation coefficients. For example, it is possible to measure the correlation between Air Pollution Increase x Respiratory Illnesses, or Unemployment Rate x Crime Index.

Correlation Strength Index

The intensity of the correlation between variables is expressed as a value in the interval from -1 to 1: values near the extremes of the interval indicate stronger correlations, and values near 0 indicate weaker ones.
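A correlation matrix of this kind can be sketched with pandas as follows. I am assuming the Pearson coefficient here, which is the pandas default, and the numbers are invented for illustration.

```python
import pandas as pd

# Made-up numerical columns.
df = pd.DataFrame({
    "number_of_reviews": [2, 10, 25, 40, 80],
    "reviews_per_month": [0.1, 0.5, 0.9, 2.0, 2.4],
    "price": [900, 1200, 650, 1000, 800],
})

# Pairwise Pearson coefficients; every value lies between -1 and 1,
# and each variable correlates perfectly (1.0) with itself.
corr = df.corr(method="pearson")
print(corr.round(2))
```

A matrix like this is what a heatmap visualizes, with one colored cell per pair of variables.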

Correlation Heatmap

Looking at the heatmap above, there is not much correlation between the variables. The strongest is between number_of_reviews and reviews_per_month, a positive correlation of 0.60, which can be better visualized through the scatter plot below:

Correlation number_of_reviews x reviews_per_month

This correlation makes sense, considering that the number of monthly reviews of a property is directly related to its total number of reviews.

A plausible hypothesis is that the correlation is weakened by the non-linear way in which properties accumulate reviews, since the number of monthly reviews of a property varies from month to month.

The average daily price for renting a house in Mexico City is Mx$978.65.

Distribution by Room Type

It can be seen in the data above that:

  • Shared rooms and hotel rooms are not common rental types within the platform, accounting for 1.5% and 0.7% of listings, respectively.
  • Entire home/apt properties correspond to 60.1% of available properties, while Private room properties represent 37.5%. This suggests that hosts often prefer to rent out their entire properties rather than renting them room by room.
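Shares like the ones above can be obtained with value_counts. The sketch uses ten invented listings, so the percentages differ from the real ones.

```python
import pandas as pd

# Made-up room types for ten listings.
room_types = pd.Series(
    ["Entire home/apt"] * 6 + ["Private room"] * 3 + ["Shared room"],
    name="room_type",
)

# normalize=True returns proportions instead of raw counts,
# sorted from the most to the least common category.
share = room_types.value_counts(normalize=True) * 100
print(share.round(1))
```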
Chapultepec Castle — Miguel Hidalgo, Mexico City

According to the data, the most expensive neighborhood is Miguel Hidalgo.

The neighborhood is known for its mix of modern buildings and large monuments dating from colonial times, which gives a cosmopolitan atmosphere to its inhabitants and visitors.

In addition to the historic buildings, many old mansions house some of the best hotels and restaurants in the country, adding to the neighborhood’s luxury and therefore to its average rental cost.

You can check the list below of the top 10 most expensive neighborhoods in Mexico City and their average prices per night:

  1. Miguel Hidalgo: Mx$1217.88
  2. Cuauhtémoc: Mx$1108.15
  3. Cuajimalpa de Morelos: Mx$1073.93
  4. Álvaro Obregón: Mx$836.29
  5. Benito Juárez: Mx$808.75
  6. La Magdalena Contreras: Mx$779.72
  7. Coyoacán: Mx$766.09
  8. Xochimilco: Mx$666.26
  9. Iztacalco: Mx$652.01
  10. Tlalpan: Mx$634.75
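A ranking like this one comes from grouping the listings by neighborhood and averaging the price. Below is a sketch on invented rows: the neighborhood names are real, but the prices are made up.

```python
import pandas as pd

# Made-up listings in three neighborhoods.
df = pd.DataFrame({
    "neighbourhood": [
        "Miguel Hidalgo", "Miguel Hidalgo",
        "Cuauhtémoc", "Cuauhtémoc",
        "Tlalpan",
    ],
    "price": [1400, 1000, 1200, 1000, 600],
})

# Average price per neighborhood, most expensive first, top 10 only.
avg_price = (
    df.groupby("neighbourhood")["price"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
print(avg_price.round(2))
```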

With the help of the latitude and longitude variables, we can also visually display the properties by price:

Scatter Plot of Properties by Price
El Ángel de la Independencia — Paseo de la Reforma, Mexico City

The data shows that 41.34% of the total lodgings require a minimum of 2 to 3 nights to be booked.
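A share like this can be computed with a boolean mask. The sketch below uses ten invented values, so the result differs from the real figure.

```python
import pandas as pd

# Made-up minimum stays for ten listings.
min_nights = pd.Series([1, 2, 2, 3, 5, 1, 2, 7, 3, 1], name="minimum_nights")

# between() is inclusive on both ends; the mean of the boolean mask
# is the fraction of listings requiring 2 to 3 nights.
share_2_to_3 = min_nights.between(2, 3).mean() * 100
print(f"{share_2_to_3:.2f}% of listings require 2 to 3 nights")
```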

Conclusion

This analysis, although superficial, aims to show that even in a limited data set it is possible to apply good data cleaning and treatment practices, and to extract insights through statistical analysis and data visualization, which can help in decision-making.

Even in a limited data set, we found outliers and missing data, which we had to clean in order to ensure greater data accuracy and better visualization.

It is also important to note that the sample used for this project is a summarized version of the data, suitable only for an initial approach and for extracting basic information. For a deeper analysis, a larger data set with complete data and a greater variety of columns is recommended.

To view the complete project, you can visit my GitHub. Also, you can find me on LinkedIn.

Mateus Friedmann

Passionate about technology and communication, focused on delivering business value through technology by analyzing and developing solutions.