A Natural Language Based Data Retrieval Engine for Automated Digital Data Extraction for Civil Infrastructure Projects Grant


  • This research project will create new knowledge and resources to significantly enhance the reusability of digital data during the life cycle of civil infrastructure assets. The rapid development of digital technologies is transforming how civil infrastructure asset data and information are produced, exchanged, and managed throughout the asset life cycle. Despite the growing availability of digital data, such data cannot be fully exploited without the ability to infer meaning from the varying terminology that practitioners use when entering it. The lack of a common understanding of the same data, or of similar data expressed in different terms, precludes data exchange or can lead to extraction of the wrong data and to misinterpretation. This research project will leverage advances in linguistics and computer science to develop a novel approach that can recognize users' intentions from their natural language input and automatically extract the desired data from heterogeneous datasets. The results of this research will benefit the construction industry by accelerating its transition to digital data-based project delivery and asset management. The research will also broaden engineering education by creating advanced course materials at both the undergraduate and graduate levels. Diversity in data terminology is a major hurdle for computer-to-computer communication and places a heavy burden on end users, who must act as middleware in digital data exchange. This issue exists throughout the life cycle of a civil infrastructure asset. This project will develop a computational theory and a platform for its implementation to analyze users' plain English data requirements and automatically match their intentions to the data entities in heterogeneous source datasets based on semantic equivalence.
To accomplish this goal, the research team will: a) use Natural Language Processing and machine learning techniques to recognize users' intentions from their natural language queries, b) translate text-based domain knowledge into an extensive machine-readable civil engineering dictionary that defines the meanings of technical terms, using a text-based automated ontology learning method, c) design an algorithm that finds the most semantically relevant data entities in digital datasets for a given keyword input, and d) test the accuracy of the algorithm using civil infrastructure text documents such as technical specifications, design manuals, and guidelines. The research outcomes will provide fundamental tools and resources that other researchers and industry professionals can use in various text-mining and intelligence-inference systems. They will also facilitate seamless data exchange between the various proprietary software applications used during the life cycle of civil infrastructure assets, including applications for design evaluation and selection, digital model construction, and regulation compliance checking.
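As a rough illustration of step c), the sketch below ranks data entity names against a keyword query after mapping synonyms to a shared concept. It is not the project's algorithm: the small synonym list stands in for the machine-readable dictionary of step b), Jaccard overlap is a simple stand-in for a semantic relevance measure, and every name and term in it (`SYNONYM_SETS`, `rank_entities`, the sample entities) is hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical mini-dictionary: each set groups terms treated as equivalent.
# The terms are illustrative only and are not taken from the project.
SYNONYM_SETS: List[FrozenSet[str]] = [
    frozenset({"width", "breadth"}),
    frozenset({"road", "roadway", "highway"}),
    frozenset({"pavement", "surfacing"}),
]


def canonical(token: str) -> str:
    """Map a token to a fixed representative of its synonym set, if any."""
    for group in SYNONYM_SETS:
        if token in group:
            return min(group)  # alphabetically first term, so the choice is stable
    return token


def concepts(text: str) -> FrozenSet[str]:
    """Lowercase, split on non-alphanumeric characters, canonicalize tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return frozenset(canonical(tok) for tok in cleaned.split())


def rank_entities(query: str, entities: List[str]) -> List[Tuple[str, float]]:
    """Rank entity names by Jaccard similarity between concept sets."""
    q = concepts(query)
    scored = []
    for name in entities:
        e = concepts(name)
        union = q | e
        score = len(q & e) / len(union) if union else 0.0
        scored.append((name, score))
    # sorted() is stable, so ties keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    fields = ["pavement_thickness", "HIGHWAY_BREADTH", "lane_count"]
    for name, score in rank_entities("roadway width", fields):
        print(f"{name}: {score:.2f}")
```

In this toy setup the query "roadway width" shares no literal word with the field `HIGHWAY_BREADTH`, yet the two match because both reduce to the same pair of concepts. A dictionary learned from specifications and design manuals would replace the hand-written synonym sets.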

date/time interval

  • 2018 - 2020