<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Real Estate Prediction and Analysis</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.9734/AJRCOS/2023/v16i2339</article-id>
      <title-group>
        <article-title>Machine Learning Engine for Real Estate Price Estimation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Youssef Roman</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdul-Rahman Mawlood-Yunis</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Wilfrid Laurier University</institution>
          ,
          <addr-line>75 University Ave W, Waterloo, ON N2L 3C5</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <day>29</day>
        <month>9</month>
        <year>2022</year>
      </pub-date>
      <volume>9</volume>
      <issue>2021</issue>
      <history>
        <date date-type="accepted">
          <day>30</day>
          <month>8</month>
          <year>2022</year>
        </date>
        <date date-type="received">
          <day>14</day>
          <month>7</month>
          <year>2022</year>
        </date>
        <date date-type="revised">
          <day>22</day>
          <month>8</month>
          <year>2022</year>
        </date>
      </history>
      <abstract>
        <p>Accurate price estimation is crucial for informed decision-making in the real estate industry. This study applies machine learning (ML) to estimate the prices of detached homes sold in the Halton Region from 2022 to 2023, using a dataset of over 7,000 detached home transactions. Data preprocessing involved feature engineering, including economic indicators such as prime rates. Exploratory data analysis revealed transaction patterns and market shifts linked to interest rate changes. ML techniques, including linear regression, Random Forest, and XGBoost, were employed, with models achieving R-squared values between 0.93 and 0.997. Decision Tree and Random Forest models were the most effective in capturing price variability. Additionally, a Flask-based price estimation tool was developed and trained on several regions of the Greater Toronto Area (GTA), allowing users to predict home prices based on specific property features. The study demonstrates the value of ML in enhancing real estate market efficiency by providing reliable price predictions, benefiting stakeholders such as homebuyers, sellers, and investors.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Multiple Listing Services (MLS)</kwd>
        <kwd>Feature Engineering</kwd>
        <kwd>XGBoost</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Given the close relationship between property values and economic conditions, appraising houses is
an essential task for stakeholders such as developers, investors, homeowners, and appraisers.
Informed decision-making in real estate transactions is made possible by accurate forecasts, which
are also essential for investment planning and market stability.</p>
      <p>Previous research has employed various machine learning (ML) techniques, such as Linear
Regression [1, 2, 3], Random Forest [4, 2, 5], and Recurrent Neural Networks (RNNs) [6], to forecast
housing prices, demonstrating the importance of feature engineering and the integration of diverse
algorithms for improved accuracy. Linear regression and ensemble methods such as XGBoost have
shown promise by incorporating factors like the LSTAT score and crime rate per capita [5], while
logistic regression and deep learning methods, namely convolutional neural networks and long
short-term memory (LSTM) networks, have been employed to predict prices from real estate
characteristics and time-series data [7]. Time-dependent factors were further analyzed using
Autoregressive Moving Average (ARMA) models [7], highlighting the significance of temporal
patterns in price prediction.</p>
      <p>However, much of the current research overlooks external market factors, such as shifts in
prime rates, which are essential to understanding variations in housing prices. This study seeks
to close that gap by adding external economic variables to the forecasting model. To improve
forecast accuracy and relevance, our model incorporates economic indicators, such as the Prime
Rate of the Bank of Canada, in addition to property-specific attributes.</p>
      <p>The remainder of this article is structured as follows: Section 2 discusses the raw dataset and
the engineered features derived from it. Section 3 presents the exploratory data analysis
and key insights gained from the data. In Section 4, various ML models are developed, and
their performance is evaluated using appropriate metrics. Finally, Section 5 summarizes the
research findings and provides recommendations for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data Preparation and Preprocessing</title>
      <p>The dataset employed in this study is derived from the REALM MLS Software, specifically
targeting detached homes sold within the Halton Region for the years 2022 and 2023. The
Halton Region, encompassing the cities of Oakville, Burlington, Milton, and Halton Hills, serves
as the focal geographical area for this analysis. The dataset contains over 7,000 records,
providing a substantial basis for a comprehensive examination of the real estate market trends
and dynamics in this region during the specified period. Table 1 presents a brief description of
the raw dataset. Each row in the table represents a home feature and its corresponding
description obtained from the REALM MLS records. The additional engineered features, such
as Canadian prime rates and the numeric price per square foot (PPSQFT), are described in Section 2.2.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Data Cleaning</title>
      <p>The data cleaning procedure began with extracting the dataset into an Excel file.
The data was then imported as a data frame using the pandas library in the Google Colab
environment, which simplified cleaning, processing, and visualization. To preserve data
integrity, duplicate rows based on address were systematically removed from the dataset.
This step ensured that every property in the analysis was represented exactly once.</p>
      <sec id="sec-3-1">
        <p>Any row in which a required value was Null was removed.</p>
        <p>The date column was converted to datetime format for ease of analysis and
manipulation. This conversion facilitated chronological analysis and time-based visualizations.</p>
        <p>Additionally, during the cleaning process, the bedrooms column was parsed to extract
total bedroom counts, which were then converted into numeric values representing the total
number of bedrooms. Similarly, the square footage column and a second column, both initially
presented as ranges, were processed to create new numeric features. Each range was thereby
reduced to a single numerical value, enhancing compatibility with machine learning algorithms.</p>
        <p>These steps ensured that the dataset was prepared thoroughly, addressing issues of
duplication, missing data, and format inconsistencies, thereby laying a solid foundation for
subsequent exploratory data analysis (EDA) and modeling phases.</p>
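        <p>The cleaning steps described above can be sketched in plain Python as follows. This is a minimal illustration rather than the study's code: the field names (address, sold_date, bedrooms, sqft_range), the "3+1" bedroom notation, the date format, and the rule of dropping any row with a missing value are all assumptions made for the example. The same steps map onto the equivalent pandas operations.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from datetime import datetime


def total_bedrooms(raw):
    """Sum a bedroom string such as "3+1" (assumed: main floors + basement) into 4."""
    return sum(int(part) for part in raw.split("+") if part.strip())


def range_midpoint(raw):
    """Convert a range string such as "2000-2500" into its average, 2250.0."""
    low, high = (float(part) for part in raw.split("-"))
    return (low + high) / 2


def clean(rows):
    """Drop duplicate addresses and rows with missing values, then convert types."""
    seen, cleaned = set(), []
    for row in rows:
        has_missing = any(value in (None, "") for value in row.values())
        if row["address"] in seen or has_missing:
            continue
        seen.add(row["address"])
        cleaned.append({
            "address": row["address"],
            "sold_date": datetime.strptime(row["sold_date"], "%Y-%m-%d"),
            "total_bedrooms": total_bedrooms(row["bedrooms"]),
            "sqft_numeric": range_midpoint(row["sqft_range"]),
        })
    return cleaned


# Hypothetical records: one duplicate address and one row with a missing date.
rows = [
    {"address": "1 Example St", "sold_date": "2022-03-14",
     "bedrooms": "3+1", "sqft_range": "2000-2500"},
    {"address": "1 Example St", "sold_date": "2022-03-14",
     "bedrooms": "3+1", "sqft_range": "2000-2500"},
    {"address": "2 Example Ave", "sold_date": None,
     "bedrooms": "4", "sqft_range": "2500-3000"},
]
cleaned = clean(rows)
```
]]></preformat>
        <p>Only the first record survives in this example, since the second repeats its address and the third lacks a sale date.</p>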
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2. Feature Engineering</title>
      <p>A new feature called "ppsqft" was introduced; it is the price per square foot and
offers insight into the dynamics of property pricing. Prime rates were also added to the
dataset to make it more comprehensive for analysis. These prime rates, which provide vital
context about the state of the economy and the interest rate environment during
the sale period, are expressed as the percentage Prime Rate of the Bank of Canada at the
time of sale. This augmentation enriched the dataset by adding further variables to
consider when analyzing the dynamics of the Halton Region real estate market over the
study period.</p>
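      <p>As a rough sketch of these two engineered features, the example below computes the price per square foot and looks up the prime rate in effect on a sale date. The rate schedule is a placeholder: its effective dates and the intermediate value are invented for illustration and are not the series used in the study, and the function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from bisect import bisect_right
from datetime import date

# Hypothetical schedule of (effective date, prime rate in percent).
# Placeholder values for illustration only.
PRIME_SCHEDULE = [
    (date(2022, 1, 1), 2.5),
    (date(2022, 7, 1), 4.5),
    (date(2023, 1, 1), 6.5),
]


def prime_rate_on(sale_date):
    """Return the most recent rate that took effect on or before sale_date."""
    effective_dates = [effective for effective, _ in PRIME_SCHEDULE]
    index = bisect_right(effective_dates, sale_date) - 1
    if index < 0:
        raise ValueError("sale date precedes the rate schedule")
    return PRIME_SCHEDULE[index][1]


def ppsqft(sold_price, sqft):
    """Price per square foot: selling price divided by total square footage."""
    return sold_price / sqft


# Hypothetical sale used to build one enriched record.
sale = {"sold_price": 1_500_000, "sqft": 2500, "sold_date": date(2022, 3, 15)}
sale["ppsqft"] = ppsqft(sale["sold_price"], sale["sqft"])
sale["prime_rate"] = prime_rate_on(sale["sold_date"])
```
]]></preformat>
      <p>Looking up the rate by effective date, rather than by calendar month, keeps each sale paired with the rate that actually applied on the day it closed.</p>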
      <p>Furthermore, the Walk Score of each property was added using the Redfin Walk Score
API. The Walk Score ranges from 0 to 100, indicating the walkability of a location. The
scores are categorized as follows:
• Car-Dependent: 0-49
• Somewhat Walkable: 50-69
• Very Walkable: 70-89
• Walker's Paradise: 90-100</p>
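      <p>The categorization above amounts to a simple threshold lookup. The sketch below assumes that the boundary scores 50, 70, and 90 belong to the higher category; the function name is hypothetical, and the call to the Walk Score API itself is not shown.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def walk_category(score):
    """Map a Walk Score (0 to 100) to its descriptive category."""
    if not 0 <= score <= 100:
        raise ValueError("Walk Score must be between 0 and 100")
    if score >= 90:
        return "Walker's Paradise"
    if score >= 70:
        return "Very Walkable"
    if score >= 50:
        return "Somewhat Walkable"
    return "Car-Dependent"


# Hypothetical scores, one per category.
categories = [walk_category(score) for score in (12, 55, 83, 95)]
```
]]></preformat>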
      <sec id="sec-4-1">
        <p>Total bedrooms: the total number of bedrooms in the detached home, including basement
bedrooms, represented as a numeric value.</p>
        <p>ppsqft: price per square foot, calculated as the selling price of the detached home
divided by its total square footage, providing insight into pricing
dynamics based on size.</p>
        <p>Prime rate: the percentage Prime Rate of the Bank of Canada at the time of sale,
offering context on economic conditions during the sale period.</p>
      </sec>
      <sec id="sec-4-2">
        <p>Walk Score: a score from 0 to 100 indicating the walkability of the location, which influences its attractiveness and value.</p>
      </sec>
      <sec id="sec-4-3">
        <p>Walk Score category: describes the walkability score category, such as Car-Dependent,
Somewhat Walkable, Very Walkable, or Walker's Paradise.</p>
        <p>The engineered features and the features that were added to the property features to enrich
the dataset are described in Table 2 above. The walkability feature adds information
about how convenient and accessible the properties are, which makes the dataset more
complete. Incorporating Walk Score data supports a deeper understanding of the relationship
between walkability, property values, and buyer appeal.</p>
        <p>The dataset underwent several further improvements to maximize its usefulness for machine
learning models. First, label encoding facilitated the integration of categorical data into the
machine learning algorithms. The cities, namely Oakville, Milton, Burlington, and Halton Hills,
were encoded to populate a new numeric column. A numeric square footage column was generated
by converting each range of square footage values into its average. Similarly, a further column
was converted to numeric form, with certain entries assigned a value of 0. One notable
augmentation involved the inclusion of economic indicators, such as prime rates, that are
known to influence housing prices. This comprehensive feature engineering process aimed to
refine the dataset, empowering machine learning models with enhanced predictive capabilities
by incorporating nuanced real estate factors and economic conditions.</p>
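        <p>Label encoding of the city names can be illustrated with a short sketch. It assigns integer codes in sorted order, which is the same convention scikit-learn's LabelEncoder follows; whether the study used that class or another approach is not stated, so the function below is only an assumed stand-in.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def label_encode(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


# Hypothetical city column with one repeated value.
cities = ["Oakville", "Milton", "Burlington", "Halton Hills", "Oakville"]
city_codes, city_mapping = label_encode(cities)
```
]]></preformat>
        <p>Integer codes impose an arbitrary ordering on the cities. Tree-based models are largely insensitive to this, whereas linear models may benefit from one-hot encoding instead.</p>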
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3. Exploratory Data Analysis</title>
      <p>The bar graphs in Fig. 1 illustrate the number of homes sold in 2022 and 2023. Remarkably,
the general pattern for the quantity of homes sold is the same in both years, with a significant
percentage of sales concentrated early in the year, especially in February, March, and April.
Monthly volumes nevertheless changed over the course of the two years. Despite the 2022-2023
increases in Prime Rates, with interest rates exceeding 6.45% in 2023, there is no discernible
decrease in the overall volume of transactions: only about 3% fewer transactions were
recorded in 2023 than in 2022. This observation suggests no strong correlation between
changes in interest rates and annual market activity, although the monthly figures indicate
that financial factors still influence buyer behavior and market dynamics. Fig. 2
portrays the percentage change in the quantity of homes sold between corresponding months of
2022 and 2023, indicating spikes in June and October. A 40% increase in homes
sold in June and a 30% increase in October are observed, likely because Prime Rates did not
increase significantly in these periods, with only a 0.25% increase in interest rates.</p>
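      <p>The month-by-month comparison is a plain percentage change between the two years. In the sketch below the monthly counts are hypothetical, chosen only so that the results reproduce the 40% and 30% increases reported for June and October; the actual counts are not given in the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def percent_change(old, new):
    """Percentage change from old to new, relative to old."""
    return (new - old) * 100 / old


# Hypothetical monthly sales counts (not the study's data).
sold_2022 = {"Jun": 250, "Oct": 200}
sold_2023 = {"Jun": 350, "Oct": 260}

changes = {
    month: percent_change(sold_2022[month], sold_2023[month])
    for month in sold_2022
}
```
]]></preformat>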
      <p>Moreover, the analysis extends to broader trends in the real estate market.
The histogram in Fig. 3 shows an overall downward trajectory in the
number of homes sold, with a decline of over 50% from January to December of 2022.
Concurrently, the line plot in Figure 4 shows a discernible negative
correlation between prime rates, which rose from 2.5% at the start of 2022
to 6.5%, and home prices. This observation sheds light on the interplay between
macroeconomic factors and real estate dynamics, in particular the relationship between
interest rates, market sentiment, and property valuations.</p>
      <p>Figure 5 illustrates the average home sold price in 2022 and 2023, revealing a modest
decrease of 3.94%. In tandem, Figure 6 shows the average sold price per square foot in
2022 and 2023, with a reduction of 6.39%. These observations suggest that while the
increase in interest rates may have exerted some minor influence on home prices, its effect
appears to have been limited. Construction costs attributed to heightened inflation and supply
chain disruptions likely played significant roles in stabilizing home prices. As a result,
despite the uptick in Prime Rates, the impact on the prices of detached homes in the Halton
Region remained relatively subdued.</p>
    </sec>
    <sec id="sec-6">
      <title>4. Price Prediction Model and Evaluation</title>
      <p>Several models were built to predict housing prices during the ML model development
phase, each with its own benefits and trade-offs. The correlation matrix in Figure 7
identifies the four characteristics most strongly correlated with the sold price: home
square footage (0.84), number of bathrooms (0.54), total number of bedrooms (0.33), and
city in the Halton Region (0.27). Linear Regression achieved an R-squared value of
approximately 0.93. Such models offer simplicity and interpretability but may struggle to
capture complex non-linear relationships in the data. Decision Tree, Random Forest, and
XGBoost models were subsequently explored to leverage more sophisticated modeling techniques.
The Decision Tree model achieved an R-squared value nearing 0.99, suggesting an impressive
ability to capture intricate patterns within the dataset. However, single decision trees are
prone to overfitting, which motivates the use of ensembles of trees.</p>
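      <p>Each entry of a correlation matrix such as the one described above is a Pearson correlation coefficient between two columns. The sketch below computes it directly, assuming Pearson correlation was the measure used (the text does not say); the sample values are hypothetical and are not taken from the dataset.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical square footage and sold price values.
sqft = [1500, 2000, 2500, 3000]
price = [900_000, 1_150_000, 1_500_000, 1_700_000]
sqft_price_corr = pearson(sqft, price)
```
]]></preformat>
      <p>A coefficient near 1 indicates a strong positive linear relationship, which is why square footage (0.84) stands out among the listed features.</p>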
      <p>The Support Vector Machine (SVM) exhibited limited predictive capability, with an
R-squared value of only 0.0877 compared with the values of 0.93 and 0.99 achieved by
Linear Regression and Decision Tree, respectively. The ensemble learning technique
Random Forest produced the strongest result, with an R-squared value of about 0.997.
This model is a popular choice for regression tasks because it combines multiple decision trees
to improve predictive accuracy and reduce overfitting. The gradient boosting algorithm
XGBoost also performed well, with an R-squared score of about 0.95. XGBoost is designed to
optimize predictive performance by iteratively improving upon weak learners.</p>
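      <p>For reference, the R-squared metric used to compare these models can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses hypothetical values and is not tied to the study's evaluation code.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((actual - predicted) ** 2
                 for actual, predicted in zip(y_true, y_pred))
    ss_tot = sum((actual - mean_y) ** 2 for actual in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical sold prices (in thousands) and two sets of predictions.
actual = [900, 1100, 1300]
perfect_fit = r_squared(actual, [900, 1100, 1300])
mean_only = r_squared(actual, [1100, 1100, 1100])
```
]]></preformat>
      <p>A model that always predicts the mean price scores 0, so a value such as 0.0877 indicates a model that is only marginally better than that baseline, while values near 1 indicate that almost all of the variance in price is explained.</p>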
      <p>Decision Tree and Random Forest models perform better than the others in terms of
R-squared, demonstrating their effectiveness in capturing the variance in housing prices. For
the purpose of creating the online tool with Flask Web Service, Random Forest was determined
to be the most accurate and suitable model. Its robust predictive capabilities allow
users to estimate the price of their home based on property features such as walk score, prime
rate, square footage, number of bedrooms, number of bathrooms, and home type.</p>
      <p>In conclusion, this study highlights the effectiveness of machine learning models, particularly
Random Forest, in predicting real estate prices with remarkable accuracy, achieving R-squared
values as high as 0.997. What distinguishes this work is the integration of both
property-specific features and external economic indicators, such as prime rates, into the
predictive models. This approach fills a critical gap in existing research by accounting for
broader market conditions, resulting in more comprehensive and reliable price forecasts.</p>
      <p>The novelty of this work lies in the incorporation of external economic factors, providing
a more comprehensive approach to real estate price estimation. Future research could
explore a web application interface built on the ML model, allowing stakeholders
to integrate their own datasets for custom model training and price prediction.</p>
      <p>[1] X. Li, Prediction and analysis of housing price based on the generalized linear regression model.</p>
      <p>[6] Authors, Real-estate price prediction with deep neural network and principal component
analysis, 10, 2022.</p>
      <p>prediction, Institute of High Performance Computing (IHPC), Agency for Science Technology</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>