Statistical modeling of New York City building construction times with the NYC Department of Design and Construction
by Sara Venkatraman
[ Department of Statistics and Data Science, Cornell University ]
The Challenge
The New York City Department of Design and Construction (DDC) is a city government agency that manages the construction of public buildings, such as libraries and courthouses, and infrastructure, such as roadwork and electrical systems, throughout the city’s five boroughs. The density and diversity of urban landscapes within the United States’ most populous city present unique challenges to planning construction projects. In particular, accurately estimating the time required for construction is important for allocating resources and for minimizing disruption to the neighborhood surrounding a project.
The Discovery and Exploration Process
This summer, I worked with the DDC as a Siegel Family Endowment PiTech PhD Impact Fellow to develop statistical models for forecasting the duration of construction for public buildings and infrastructure projects in NYC. The DDC has a vast database of details about each construction project undertaken within the last 30 years, so my first objective was to use this data to identify which attributes of a construction project are most strongly associated, statistically, with its duration. The data includes information about each project’s location, allocated budget, sponsor (i.e., the government agency funding the construction), type of work (e.g., new construction, major or minor interior or exterior renovations, plumbing) and attributes that may make the project more complex, such as whether hazardous materials or demolition are involved or whether adjacent buildings complicate construction.
Extracting meaningful insights from a large dataset is a challenging task in any domain, and there is no recipe for the right analytical steps to take to identify interesting relationships among the potentially hundreds of variables the dataset may comprise. I believe domain expertise is a particularly important part of this process, so I began by talking with project managers and analysts at the DDC about which aspects of a project had previously been used to estimate construction times and about the reasons construction can be delayed. Because the model I would eventually build to forecast construction times could have an impact on future project planning and scheduling, I wanted to ensure that the variables I used to construct it were both statistically justified and intuitively sensible to construction managers. Balancing predictive accuracy with contextual interpretability is a central challenge in building statistical models.
My Analysis
My work began with a computational analysis of the variables available to me for each construction project. I studied the highly skewed distributions of construction durations and construction budgets, observing that most public building construction projects in New York take between one and three years but can occasionally take as long as six to eight years. Major renovations were the most common type of project, followed by roof replacements and HVAC upgrades. I then sought to understand how construction duration was related to other attributes of a building. Many of the correlations between these attributes and construction duration were surprisingly low, possibly suggesting nonlinear relationships between them, an interpretation also supported by the limited accuracy of the multivariable linear regression models I tried fitting to this data. I then performed clustering analyses on the set of construction projects, with the aim of identifying the variables by which “similar” projects were grouped. It appeared that the borough in which construction took place, the budget bracket, and the NYC government agency to which the project belonged were the three attributes along which projects were primarily clustered, which seemed intuitively reasonable.
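To illustrate why a low correlation need not mean a weak relationship, here is a minimal sketch on synthetic data. It does not use the DDC records, and the variable names, the quadratic relationship, and the choice of the correlation ratio as a nonlinear measure are all assumptions made for this example. The Pearson correlation is close to zero even though the attribute explains most of the variance in the outcome.

```python
import random
from statistics import fmean

random.seed(0)

# Synthetic, hypothetical data: NOT the DDC dataset.
n = 2000
x = [random.uniform(-1.0, 1.0) for _ in range(n)]    # a standardized project attribute
y = [xi**2 + random.gauss(0.0, 0.05) for xi in x]    # outcome depends on x nonlinearly


def pearson(a, b):
    """Pearson correlation coefficient: measures linear association only."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5


def correlation_ratio(a, b, bins=10):
    """Share of the variance of b explained by its means within equal-count bins of a."""
    order = sorted(range(len(a)), key=lambda i: a[i])
    size = len(a) / bins
    groups = [[] for _ in range(bins)]
    for rank, i in enumerate(order):
        groups[min(int(rank / size), bins - 1)].append(b[i])
    grand = fmean(b)
    between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups if g)
    total = sum((bi - grand) ** 2 for bi in b)
    return between / total


r = pearson(x, y)
eta2 = correlation_ratio(x, y)
print(f"Pearson r = {r:.3f}, correlation ratio = {eta2:.3f}")
```

A gap like this between the two measures is one sign that a linear model may underfit, which is consistent with trying more flexible models next.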
After a variety of other analyses exploring the relationships between construction duration and project attributes, I sought an appropriate statistical model to describe duration as a function of these attributes. I tried a variety of algorithms, including classification and regression trees, random forests, gradient boosting, and support vector machines, and experimented with modeling construction duration as a continuous value versus a binned (categorical) value.
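The difference between the continuous and binned framings can be sketched as follows. This is an illustration on synthetic durations, not the models or data used at the DDC: the bin edges, the labels, and the trivial baseline predictors are all hypothetical. The point is only that each framing comes with its own target and its own evaluation metric.

```python
import random
from bisect import bisect_right
from collections import Counter
from statistics import fmean, median

random.seed(1)

# Synthetic, right-skewed durations in years: NOT the DDC data.
durations = [random.lognormvariate(0.6, 0.5) for _ in range(1000)]

# Hypothetical bin edges (assumption): under 1 year, 1 up to 3 years, 3 years or more.
EDGES = [1.0, 3.0]
LABELS = ["short", "typical", "long"]


def bin_duration(years):
    """Map a continuous duration in years to a categorical label."""
    return LABELS[bisect_right(EDGES, years)]


train, test = durations[:800], durations[800:]

# Continuous framing: predict the training median, score by mean absolute error.
pred_years = median(train)
mae = fmean(abs(d - pred_years) for d in test)

# Binned framing: predict the most common training class, score by accuracy.
train_labels = [bin_duration(d) for d in train]
pred_label = Counter(train_labels).most_common(1)[0][0]
accuracy = fmean(bin_duration(d) == pred_label for d in test)

print(f"continuous baseline: MAE = {mae:.2f} years")
print(f"binned baseline: predicts '{pred_label}', accuracy = {accuracy:.2f}")
```

Baselines like these give a floor against which any fitted regressor or classifier can be compared under the corresponding framing.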