=Paper=
{{Paper
|id=Vol-2873/paper4
|storemode=property
|title=Demo: Knowledge Graph-Based Housing Market Analysis
|pdfUrl=https://ceur-ws.org/Vol-2873/paper4.pdf
|volume=Vol-2873
|authors=Ziping Hu,Zepei Zhao,Mohammad Rostami,Filip Ilievski,Basel Shbita
|dblpUrl=https://dblp.org/rec/conf/esws/HuZRIS21
}}
==Demo: Knowledge Graph-Based Housing Market Analysis==
Ziping Hu1, Zepei Zhao1, Mohammad Rostami2, Filip Ilievski2, and Basel Shbita2

1 University of Southern California, Los Angeles, CA 90007, USA, {zipinghu,zepeizha}@usc.edu

2 Information Sciences Institute, Marina del Rey, CA 90292, USA, {mrostami,ilievski,shbita}@isi.edu

Abstract. The housing market is complex and multi-faceted, which makes its analysis challenging for users and professionals. We develop a four-step knowledge graph-based knowledge extraction approach to the housing market for efficient and accurate data analysis, consisting of data acquisition and cleaning, entity linking, ontology mapping, and question answering. The proposed system allows one to summarize the housing information for a selected geographical area, analyze the surroundings by collecting census data, assess medical safety based on COVID-19 data, and gauge the area's attractiveness based on celebrity data from DBpedia. Our system can provide personalized recommendations given keywords, as well as information about the market over time. A user-based evaluation demonstrates the utility of our system.

Keywords: Knowledge graph, DBpedia, COVID-19, Recommendation

1 Introduction

Analyzing the housing market involves collecting data from various websites, narrowing down the search based on geographical or economic constraints, and iteratively searching over the collected information. This process is highly time-consuming and laborious, and becomes infeasible when the data size is large. In this paper, we develop a knowledge graph (KG)-based tool for flexible analysis of the housing market given user constraints. Specifically:

C1 We develop a KG pipeline for integrating information about housing items and their nearby environment, including economic data, COVID-19 data, nearby celebrities, transit stops, etc.
C2 We develop a demo tool for a geography of interest, which uses three representative query-matching methods based on: keywords, entity resolution and rule-based matching, and Latent Dirichlet Allocation. The tool supports flexible user searches: fixed search, free-form queries, and a combination of the two.

C3 We demonstrate the utility of the tool with two use cases: rental recommendations based on user queries and analysis of the spatio-temporal dynamics of the market.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Fig. 1. Housing KG Pipeline.

KG technology plays a central role in this application. Data enrichment, whether through adding new data from other sources or through further extraction from the original sources, is native to KGs [2]. As an integrated data source, a KG makes analysis and recommendation algorithms easy to develop. Existing KG visualization functionality helps us study correlations in the market and supports predictive market analysis.

2 Proposed Knowledge Integration Approach

The goal of our system is to produce an integrated knowledge graph. Our architecture is shown in Fig. 1. First, data acquisition and data cleaning prepare usable and abundant data. Then, data from different sources is linked based on similarity, allowing one to explore implicit relationships between the sources. After defining appropriate nodes, relationships, classes, and properties, we map the extracted knowledge to this ontology. Once the knowledge graph is ready, we build a question answering module to support intuitive queries over our graph. Finally, all the data interactions are integrated into a user-friendly web application developed with full-stack technology.

Without loss of generality, we implement the above pipeline for the use case of apartment finding in the state of California, to consider a geographical constraint. We describe the design and implementation of the pipeline below.
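The four-step pipeline can be sketched as a chain of stand-in functions. This is a minimal illustration with hypothetical function names and record format, not the paper's actual implementation:

```python
def acquire_and_clean(raw_records):
    """Stand-in for the crawling/cleaning stage: drop records with
    missing required fields (hypothetical 'name'/'zip' keys)."""
    return [r for r in raw_records if r.get("name") and r.get("zip")]

def link_entities(records):
    """Stand-in for entity linking: collapse records whose
    (name, zip) pair matches after case-folding."""
    seen, linked = set(), []
    for r in records:
        key = (r["name"].lower(), r["zip"])
        if key not in seen:
            seen.add(key)
            linked.append(r)
    return linked

def map_to_ontology(records):
    """Stand-in for ontology mapping: tag each record with a class
    (class name 'Apartment' is illustrative only)."""
    return [dict(r, rdf_class="Apartment") for r in records]

def build_kg(raw_records):
    """Chain the stages: acquire/clean -> link -> map to ontology."""
    return map_to_ontology(link_entities(acquire_and_clean(raw_records)))
```

In the real system, the linking step uses weighted string similarity rather than exact key matching, and the mapped records are loaded into Neo4j.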
2.1 Data Acquisition and Cleaning

Four sources of data are leveraged in our project: 1) over 24k records of apartment information from Apartments (https://www.apartments.com) and ApartmentFinder (https://www.apartmentfinder.com), including properties such as location, floor plan, phone, and rating; 2) more than 55k records of celebrity information, such as names, birthplaces, death places, hometowns, residences, and alma maters, from DBpedia (https://www.dbpedia.org); and 3) over 3k records of US Census data and COVID-19 data from the Census Bureau (https://www.census.gov) and the Los Angeles Times (https://www.latimes.com/projects), including properties such as race distribution, average family income, employment rate, the number of confirmed COVID-19 cases, and the number of deaths.

We circumvent crawling challenges such as anti-crawler measures, dynamic web pages, and limits on the number of returned records by setting sleep times, using different user-agents, and generating dynamic IPs. We then normalize the original data: duplicates and missing values are handled by dropping records or by filling them with the average or median. The cleaned data are used for further data integration.

2.2 Entity Linking

Entity linking is the process of establishing identities between representations in different sources, which is critical to the semantic integration of sources. It is challenging due to name variations and ambiguity [4]. The same apartment may have different names and address formats on different websites, so a similarity calculation is needed to link the records that represent the same entity. We compute the Levenshtein similarity [5] of the apartment name and location:

Sim = α * Sim_Location + (1 − α) * Sim_Title

In our work, we set α to 0.9, giving higher weight to location, as records with the same location are always identical.
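The weighted similarity above can be sketched in plain Python. The paper uses RLTK; the Levenshtein implementation, function names, and candidate check below are our own, kept self-contained for illustration:

```python
ALPHA = 0.9        # weight on location, as in the paper
THRESHOLD = 0.8    # candidate threshold, as in the paper

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def lev_similarity(a: str, b: str) -> float:
    """Normalize edit distance to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def record_similarity(loc_a, title_a, loc_b, title_b) -> float:
    """Sim = alpha * Sim_Location + (1 - alpha) * Sim_Title."""
    return (ALPHA * lev_similarity(loc_a, loc_b)
            + (1 - ALPHA) * lev_similarity(title_a, title_b))

def is_candidate(loc_a, title_a, loc_b, title_b) -> bool:
    """Two records are linking candidates when Sim exceeds 0.8."""
    return record_similarity(loc_a, title_a, loc_b, title_b) > THRESHOLD
```

With α = 0.9, a near-exact location match dominates the score even when the apartment titles differ substantially.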
We set the candidate threshold to 0.8: if Sim for two entities is greater than 0.8, they are counted as candidates, and the record with the highest similarity is then marked as linked. We use RLTK (https://rltk.readthedocs.io) to implement the similarity functions.

Since the calculation is time-consuming, with millions of pairs to be compared, blocking is used to reduce complexity [1]. We use zip codes to divide apartments into blocks, so that only apartments within the same county are compared. Thus, the time consumed per county is reduced tenfold. The US Census data, COVID-19 data, and celebrity data are linked to apartments by zip code.

2.3 Ontology Mapping

Ontology mapping is defined as a process to find semantic correspondences between similar elements of different ontologies [3]. The semantic model of our KG is shown in Fig. 2. We have 10 classes with 46,200 instances. The main classes are: ApartmentFinder Info, Apartments Info, University, Celebrity, and Location Info. There are 160,824 edges belonging to 13 types. COVID-19 and US Census data, such as confirmed cases, confirmed deaths, crime rates, married ratio, and population ratio, are set as properties of the location information. We store our graph in Neo4j (http://neo4j.org).

Fig. 2. The Semantic Model of our KG.

Table 1. Examples of Categories, with their Keywords and Indicators.

{| class="wikitable"
! Category !! Example keywords !! Indicator
|-
| education || university, education, ... || high school degree (%) and bachelor degree (%); number of nearby universities
|-
| security || safe, crime, security, ... || confirmed COVID-19 cases and deaths; crime rate
|-
| economy || rich, economy, ... || employment rate; mean family income
|-
| age of apartment || new, built, time, ... || construction year of the apartment
|-
| social || friend, young, ... || median age; proportion of married residents
|-
| race distribution || Asian, black, ... || proportion of Asian/Black/white residents
|-
| transportation || transit, traffic, ... || number of nearby transit stops
|}

2.4 Query Matching

Three methods are used to process user queries: keywords, entity recognition/rule-based matching, and topic modeling.

Keywords: We attempt to map user keywords to defined categories, as shown in Table 1. Some queries can be run directly on the KG; e.g., if shopping is mentioned, we order the apartments by the number of nearby shopping malls. Further indicators are defined by ourselves; e.g., for places with high security, we require the number of confirmed COVID-19 cases, confirmed deaths, and crime rates to all be below the median.

ER and Rule-based Matching: We use pattern-based extraction to find the address, location name, and person name mentioned in the user query. The patterns include the surrounding words, whether they contain numbers or upper case, the 'GPE' or 'LOC' entity type, etc.

Topic Modeling: If the query consists of paragraphs of description, we try to detect its topic and return apartments that belong to that topic. To achieve this, we use the apartment.description property as raw data. NLTK (https://www.nltk.org) is applied for preprocessing, such as tokenization, removing stop words, lemmatizing, stemming, and filtering out words that appear infrequently. We use Gensim (https://pypi.org/project/gensim) to train LDA models with bag-of-words and TF-IDF representations. We evaluate the models and fine-tune parameters based on the average scores for the most-matched topics. For instance, for the data in Los Angeles, we divided the apartments into 20 topics, and the highest average score is 68.5%. Some of the topic keywords are "Hollywood", "furnished", and "park".

3 Analyses

We show the utility of the proposed system with three use cases: intuitive visualization of housing information, analysis of housing aspects over time, and personalized housing recommendations.

Fig. 3. Visualization of a Relevant Subgraph for a Query.
Fig. 4. Heatmap of COVID-19 Deaths and Average House Price Changes: (a) Deaths in Nov.; (b) Deaths in Feb.; (c) Price in Nov.; (d) Price in Feb.

Visualization of Apartment and Surrounding Information. An intuitive view of an apartment and its neighborhood can be easily visualized based on user queries. As shown in Fig. 3, the URLs of apartments are connected with nearby shopping malls, parks, universities, the location (zip-code) node, and so on. Celebrities are linked to location information, while all the COVID-19 and US Census data are set as properties of the zip-code node.

Factor Comparison over Time. Analyses of how data change over time, and of the correlations between them, can be easily performed. The proposed system can be used to track and analyze data changes between different counties, states, or countries, and the convenience of adding new data allows us to continuously expand the data over time. These capabilities may contribute to analyzing issues in marketing or social science, including factors that impact housing prices, the impact of COVID-19, and changes in living standards. Fig. 4 compares the changes in the number of COVID-19 deaths and the average house price per district between November 2020 and February 2021, for different counties in California. A darker color means a higher proportion of deaths or a higher average house price.

Apartment Recommendation System. We also provide a recommendation interface to search for rental properties (https://github.com/ZepeiZhao/KG_APPLICATION). The user interface supports three exploration modes: 1) fixed search, e.g., "an apartment with 3 bedrooms, less than 3500 dollars"; 2) free-form queries, e.g., "a modern designed apartment near 6201 Hollywood Blvd or near Bernard Cooper, with high security. I can go for a walk in the nearby park on weekends. And the people here are highly educated and relatively rich"; and 3) a combination of both.
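The fixed-search mode can be sketched as a structured filter over apartment records. Field names here are hypothetical; the actual system queries the Neo4j graph:

```python
def fixed_search(apartments, bedrooms=None, max_price=None):
    """Keep apartments that satisfy every provided constraint
    (hypothetical 'bedrooms'/'price' fields), then rank cheapest
    first, mirroring a score-ordered result list."""
    hits = []
    for apt in apartments:
        if bedrooms is not None and apt["bedrooms"] != bedrooms:
            continue
        if max_price is not None and apt["price"] >= max_price:
            continue
        hits.append(apt)
    return sorted(hits, key=lambda a: a["price"])
```

For example, the fixed-search query "an apartment with 3 bedrooms, less than 3500 dollars" would map to `fixed_search(data, bedrooms=3, max_price=3500)`.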
The specific information about the apartments/houses that meet the requirements, and about the nearby areas, is returned, ordered by score.

For evaluation, we designed a user study targeting the above three modes. Nine scenarios were assigned to each user, who was asked to formulate their own input on our website based on the scenarios. We then compared the top 50 results from 10 users against gold-standard results and calculated precision and recall. The precision of fixed search, queries, and their combination is 1.0, 0.43, and 0.71, respectively, and the corresponding recall is 1.0, 0.92, and 0.97. As expected, fixed search provides more accurate results, while information extracted from free-form queries is less stable. The evaluation results show that our implementation is useful.

4 Conclusion

In this paper, we developed a knowledge graph-based housing analysis pipeline and tool comprising data crawling, data cleaning, entity linking, ontology mapping, and question answering. The advantages of building a KG include systematic data integration, reuse, and personalized recommendation given input queries. The use cases we provide demonstrate that our system can be used for a wide range of housing market analysis tasks. We expect that our approach could easily be reused for novel use cases, such as analyzing the impact of regional education and climate on China's housing prices, by merely adapting the data sources.

References

1. Fabio Azzalini, Songle Jin, Marco Renzi, and Letizia Tanca. Blocking techniques for entity linkage: A semantics-based approach. In Data Science and Engineering. Springer, 2020.

2. Xiao Huang, Dingcheng Li, and Ping Li. Knowledge graph embedding based question answering. Pages 105–113, 2019.

3. Ming Mao. Ontology mapping: An information retrieval and interactive activation network based approach. Pages 931–935, 2007.

4. Wei Shen, Jianyong Wang, and Jiawei Han. Entity linking with a knowledge base: Issues, techniques, and solutions.
IEEE Transactions on Knowledge and Data Engineering, 27(2):443–460, 2014.

5. Shengnan Zhang, Yan Hu, and Guangrong Bian. Research on string similarity algorithm based on Levenshtein distance. In IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference, pages 2247–2251, 2017.

Demo code: https://github.com/ZepeiZhao/KG_APPLICATION