Applied Machine Learning (COMP598), Fall 2014, McGill University
We gave the following instructions to our students. Here's what they came up with.
There is a significant effort towards moving much of the data from the city of Montreal into an Open Data format. This data can be accessed here:
The goal of this project is to use this data to identify an interesting prediction question that can be tackled using machine learning methods, and solve the problem using appropriate machine learning algorithms and methodology. You are not restricted to using only this data (though you should use some of it). You can incorporate data from other sources, or collect additional data (e.g. new test set) if appropriate. The choice of prediction task and dataset to use is open. Try to pick a prediction question that is relevant and important to the citizens or administrators of the city. Remember to design a prediction task that is well suited to your choice of dataset; and vice versa, pick the right data for tackling your prediction question.
One of the most important stages most people go through at one point in their lives involves buying or selling a home. Real estate is the largest asset the average person will ever own, and it often functions not only as shelter but also as an investment. The more information parties on either side of the transaction can obtain, the better their decisions will be and the lower the risk of losing large sums of money. While many models have been applied to this market over the years, few have investigated the role municipal infrastructure plays in pricing these homes. Using data available from the city of Montreal and a snapshot of the real estate market on the island from select agencies, we investigate the impact that leveraging municipal data can have on pricing and compare the performance of Linear Regression, Lasso Regression and K nearest neighbours within this setting. The best result obtained with one of the algorithms was a mean absolute error of $44,000. Results show that much of the city data played little role in the output of the learners used, and the poor performance of the algorithms suggests that the problem space is truly non-linear. (GitHub)
In this paper, we present a machine learning approach to classifying photographs of nine Montreal neighborhoods acquired from Google Street View. We show results using a perceptron, a stacked denoising autoencoder (to pretrain a neural network), and a convolutional neural network. We put these results in perspective by comparing them with human performance on the same task; the mean human error rate is 71.6%. The convolutional neural network outperforms both human and other algorithmic classifiers, with an error rate of 42.61%. We conclude by discussing the implications of our results and future avenues of research. (GitHub)
Food safety inspections are an integral part of public health, but are costly to cities and municipalities. Targeting inspections by estimating the risk of severe food safety violations via publicly available data could potentially reduce the financial burden of food safety inspections and improve public health and safety. Several American cities are already exploring this approach, with Chicago leading the way. In this report, we apply logistic regression, random forests, and support vector machines to a carefully assembled dataset to explore the ability to predict the likelihood of a severe violation. We use data from Toronto, Ontario (Canada). We find that baseline methods with regularization perform similarly to methods that are much less interpretable to a public health audience.
In this report, we analyze a dataset of black and white aerial images from the city of Montreal. We conduct a supervised patch-wise classification of these images into water, farmland, forest, and residential categories. Using a combination of auto-encoders and convolutional neural networks, we are able to achieve accuracy rates as high as 83.2%. We also show preliminary results for unsupervised patch-wise classification. The intended end-use is to describe the change in land use of the island of Montreal over time; however, the process can easily be generalized to any area by using Google Maps data.
Using public traffic camera data, we aim to quantify the amount of traffic in a given video using a combination of computer vision and machine learning techniques. We compare the results of casting the problem as a batch or online problem in order to better understand how to produce adequate results with the least amount of human intervention. While the best results were obtained with a convolutional neural network with both time derivatives and normal frames, we were able to produce reasonably good predictions through even baseline methods using raw frame data and PCA.
Recommender systems have been known to greatly improve users’ access to relevant products by making personalized suggestions solely based on their rating history. The purpose of our work is to describe a library book recommender system that uses information extraction techniques and machine-learning algorithms to make a binary “buy/do not buy” decision. In particular, we study the effectiveness of various content-based filtering methods as well as the effects of various feature representations and preprocessing techniques. Initial empirical results demonstrate that this approach can produce recommendations that are far better than random or even what the libraries’ current acquisition policies are able to obtain. We set the benchmark at the library’s median lending rate, such that any result above 50% would represent an improvement over the current performance. Using a k-NN classifier and a minimal number of features that best encode the book’s content, we obtained a score of 78%, i.e., 28 percentage points above our benchmark. (GitHub)
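The gap between the reported 78% score and the 50% benchmark can be read in two ways: as an absolute difference in percentage points or as a relative gain over the benchmark. The following is a minimal sketch of that arithmetic only; the function name `improvement_over_benchmark` is hypothetical and not taken from the students' code.

```python
def improvement_over_benchmark(score_pct, benchmark_pct):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    if benchmark_pct <= 0:
        raise ValueError("benchmark must be positive")
    absolute_points = score_pct - benchmark_pct
    relative_percent = 100.0 * absolute_points / benchmark_pct
    return absolute_points, relative_percent


# Figures quoted in the abstract: a 78% score against a 50% benchmark.
points, relative = improvement_over_benchmark(78, 50)
print(f"{points} percentage points above the benchmark "
      f"({relative:.0f}% relative gain)")
```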
In this machine learning paper, we analyzed real estate property prices in Montreal. The information on the real estate listings was extracted from Centris.ca and duProprio.com. We predicted both asking and sold prices of real estate properties based on features such as geographical location, living area, and number of rooms. Additional geographical features such as the nearest police station and fire station were extracted from the Montreal Open Data Portal. We used and compared regression methods such as linear regression, Support Vector Regression (SVR), k-Nearest Neighbours (kNN), and Regression Tree/Random Forest Regression. We predicted the asking price with an error of 0.0985 using an ensemble of kNN and Random Forest algorithms. In addition, where applicable, the final sold price was also predicted with an error of 0.023 using Random Forest Regression. We present the details of the prediction questions, the analysis of the real estate listings, and the testing and validation results for the different algorithms in this paper. We also discuss the significance of our approach and methodology. (GitHub)
In this paper we attempt to predict the number of bicycle rides on each of ten different streets in Montreal on a given day. We apply and compare Linear Regression, k Nearest Neighbour, Decision Trees and Support Vector Regression. We found that using a Decision Tree Regressor with AdaBoost gives the best result, with a mean absolute error of 530.4 on a held-out test set. We use a number of features such as day of the year, day of the week, weather, air pollution, holidays, festivals, hockey and football games. Our results show that the day of the week is the most important feature for predicting bike counts in Montreal.
Bike share networks are an emerging trend in major cities, but considerable difficulties lie in managing their operations. In particular, the movements of bikes throughout networks do not tend to balance with one another, and so must be balanced manually using trucks. Here we investigate the feasibility of using models to predict imbalances in bike share networks, using data from Montreal’s Bixi system. We approach the problem in two ways. First, we attempt to model the need for re-balancing directly, by detecting re-balancing events undertaken by operators of the Bixi network. Based on these observed re-balancing events, we find that certain stations act as sources and others as sinks. We demonstrate that it is possible to classify stations with about 70% accuracy, based on features of nearby traffic patterns. Next, we create a dynamical model of the usage patterns for Bixi stations while including historical weather data. The cyclical usage patterns do admit regression to cyclical time based models with good fit. We then investigate the use of traffic data to explain differences in the dynamical behavior of stations, but no strong dependencies could be found. Our results show that it is difficult to predict bike share network imbalances, but do suggest that for-purpose data collection might yield a generalizable predictive model of bike share network imbalance.
The rise of open data has created new opportunities for city planners and community leaders to improve the lives of citizens. Access to bike accident data can be used to predict the location of future accidents. We thus compare datasets from New York and Montreal using different tools and models to illustrate the impact that greater access to data can have on the lives of citizens. After exploring Hidden Markov Models, Random Forests and Neural Networks, we give a neural network structure which is able to accurately predict the incidence rate of bike accidents in New York.
Cycling is one of the most important and beautiful facets of Montreal culture. Nonetheless, despite all the advantages that cycling brings to the island, many people end up injured or even killed as a result of bicycle accidents every year. In order to reduce the number of these accidents, we decided to use a Hidden Markov Model (HMM) to predict future collisions. For this purpose, we used a data set collected by Robert Rocha in Montreal between 2006 and 2010, which consists of a record of bicycle accidents during those years with their respective date, time, and location. We fit Gaussian HMMs with various numbers of states and found that a six-state Gaussian HMM modeled bicycle accidents best.
In this project, we propose a regression framework to predict STM bus intervals based on traffic data at city intersections. Linear regression, k-nearest neighbor, support vector regression, neural net and Random Forest methods were used. The Random Forest showed the best result, with an RMSE of 5.68 minutes. However, none of the methods can perfectly predict the output. We hypothesize that this error can be greatly reduced with a richer feature set.
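For readers unfamiliar with the metric quoted above, root mean squared error (RMSE) is the square root of the average squared difference between predicted and observed values, so it is expressed in the same unit as the target (minutes here). The following is a minimal sketch; the interval values are made up for illustration and are not the students' data.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference between two sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical bus intervals in minutes (illustrative values only).
observed_intervals = [10, 12, 8, 15]
predicted_intervals = [12, 9, 8, 20]

rmse = root_mean_squared_error(observed_intervals, predicted_intervals)
print(f"RMSE: {rmse:.2f} minutes")
```

Because errors are squared before averaging, a single large miss (the 5-minute error in the last pair) weighs more heavily than it would under mean absolute error.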
Cycling is an environmentally friendly mode of transportation. Encouraging inhabitants of the city of Montreal to use bicycles should be a strategy for the city. Bixi is a nonprofit public bicycle sharing system developed in Montreal, Canada. Each Bixi station has a pay station, bikes, and bike docks. Bike stations are mobile: they can be installed or removed in half an hour. They are controlled by a real-time management system. Bixi made available data sets of the bike counts in each station for the past couple of years. Although the company recently announced bankruptcy, new methods of optimization and prediction may help revive the idea. Predicting the number of available bikes and the number of available slots at a given station can help the end user head to the right station and facilitate the search for a station with available bikes or empty docks. Our approach is to predict the bike count at any given station; from that, knowing the size of the station, we can also deduce the number of empty slots. The predicted count is calculated using the history of bike counts at this specific station and nearby stations. Time series prediction has been approached with different models and algorithms. In this paper we explored and optimized Support Vector Regression, ARMA, Volterra and HMM to predict the bike counts. ARMA and Volterra were found to be the best approaches for this type of application.
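The step from a predicted bike count to a number of empty slots is a simple subtraction from the station size. The following is a minimal sketch of that step under stated assumptions: the function name, the rounding, and the clamping of the prediction to the range from zero to the station capacity are illustrative choices, not details taken from the paper.

```python
def predicted_empty_docks(predicted_bikes, station_capacity):
    """Derive empty docks from a predicted bike count and the station size.

    A regression model may output fractional or out-of-range values, so the
    prediction is rounded and clamped to [0, station_capacity] first
    (an assumption made for this sketch).
    """
    if station_capacity < 0:
        raise ValueError("station capacity cannot be negative")
    bikes = min(max(round(predicted_bikes), 0), station_capacity)
    return station_capacity - bikes


# Hypothetical station with 20 docks and a predicted count of 12.4 bikes.
print(predicted_empty_docks(12.4, 20))
```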
BIXI is a popular public bicycle sharing system in Montreal, Canada, accounting for more than one million trips annually. The goal of this project is to use machine learning methods to predict the future vacancy status of BIXI stations in Montreal, using data that has been shared with the public by the city. A compilation of minute-by-minute readings of bicycle availability at all BIXI stations between April and August 2012 was obtained from the HackTaVille website and used to train classifiers of varied complexity. The experimental results show that the methods implemented in this project provide a promising way of forecasting the future vacancy status of BIXI stations.
In this report, the authors approach the “method of travel classification” task using different classifiers and preprocessing steps. The dataset is an augmented version of Concordia University’s private TRIP Dataset combined with gaseous pollutant data from Montreal OpenData and weather network data. The performance of logistic regression, a feed forward neural network, a Support Vector Machine (SVM), and a random forest is compared. The random forest shows the best accuracy (72.7%), followed by the neural network (54%), logistic regression (53.5%), and the SVM with a radial basis function (RBF) kernel (52.7%).
Using bike usage, bike accident, traffic density, traffic light, and weather data we attempt to gain insights on bike usage and cyclist safety issues. We constructed models to predict 1) bike usage for specific weather conditions, and 2) bike accident likelihood for a given intersection. The results showed some correlation between features and output but the nature of the data available didn’t allow for conclusive decisions.
In this project, we tackled the issue of automatic re-colorization of old gray-scale pictures. Previous works obtained interesting results with learning methods such as SVM and efficient feature selection, but are limited by the fact that the user has to choose one training image for each target image. Here we try to use deep-learning methods for this problem, using regression or classification, so as to be able to color any image with the same training data. However, our methods performed poorly given our limited computational resources and the size of the data we could use, and our results are therefore quite disappointing.
Object recognition has been tackled with some success. In this paper, we compare and analyze several methods of object recognition in natural scenes, with an emphasis on their application to the historical BAnQ Dataset (Montreal), namely: SVM, KNN-based, GentleBoost and HMAX (Hierarchical Model and X) based neural networks. The highest accuracy is achieved using a patch-based GentleBoost algorithm.
Since 2007, the Montreal Public Library has been tracking book loans and making the resulting data available to the public. We explore how this data can be used to solve real world problems. The first application is to better distribute books across branches based on historical loan data. Next, we explore the closely related problem of predicting, for a new book, how many copies the library should stock. We use the OpenData Library Loans Dataset (OLLD), and explore augmenting it with sales data from Amazon. We find that the second problem is more difficult than the first, and that using the sales data from Amazon helps improve predictive accuracy. We train several regression models on the data, and find that the best model has a cross-validated Mean Absolute Error (MAE) of a little over 1, allowing it to predict the number of loans to within 2 loans over 90% of the time. We note, however, that these results are overly optimistic and that further work needs to be done before such a system can be practically used.
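The two figures quoted above, a mean absolute error and the share of predictions falling within 2 loans of the true value, measure different things: the first is an average miss, the second a hit rate at a fixed tolerance. The following is a minimal sketch of how both can be computed; the loan counts are toy values invented for illustration, not the students' data or results.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def fraction_within(actual, predicted, tolerance):
    """Share of predictions whose absolute error is at most `tolerance`."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance)
    return hits / len(actual)


# Toy loan counts (invented for this sketch).
actual_loans = [3, 0, 7, 2, 5, 1, 4, 10, 2, 6]
predicted_loans = [2, 1, 6, 2, 7, 1, 3, 5, 2, 6]

mae = mean_absolute_error(actual_loans, predicted_loans)
within_two = fraction_within(actual_loans, predicted_loans, tolerance=2)
print(f"MAE: {mae:.2f} loans, within 2 loans: {within_two:.0%}")
```

In this toy data a single large miss raises the average error without changing the hit rate much, which is why reporting both numbers is informative.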
Housing prices experience significant variability across geography and time. This variability presents a challenge to prediction models, and the lack of clarity in the contributing factors adds another layer of difficulty. Much study has been devoted to the analysis of housing prices across time, and those studies that attempt to model prices spatially consider restricted feature sets. This work attempts to analyze the effect of the spatial distribution of municipal amenities, services, and structures on housing prices in the city of Montreal. Through an extensive regression analysis on data recently made accessible through Montreal’s OpenData initiative, we attempt to identify the spatial features strongly predictive of housing prices. The results from our study will shed some light on areas where Montreal’s administration can direct investment, to improve not only the housing market but also the welfare of its citizens.
The restaurant and hotel industry is a fast-paced business. There are new restaurants opening every day. Due to the overabundance of restaurant choices, restaurant selection has become really difficult. There are a number of websites that offer a platform for consumers to post specific remarks about a particular restaurant, but most of them fail to take into account the comments and suggestions given by the food inspector. In this project, we aim to address this problem by classifying restaurants based on opinions from two different schools of thought: first, the consumer, who takes into account front-end operations, and second, the food inspector, whose remarks focus on the back-end operations of the hotel business.
Load forecasting is crucial for energy generation and dispatching in a smart city. This report focuses on possible methods for load forecasting. Methods for short-term, medium-term and long-term load forecasting are proposed and analyzed. Feed forward neural network (FFNN), recurrent neural network (RNN), support vector regression (SVR), autoregressive moving average (ARMA), and hidden Markov model (HMM) are adopted for load forecasting. Related data processing and feature selection are discussed in detail. Electric vehicles (EVs) are another promising technology that will see fast development in the near future. However, high EV adoption would place a heavy burden on the power system and may negatively affect its stability. A framework for determining the impacts of electric vehicles on the power system is also proposed. The proposed load forecasting methods and impact analysis framework are tested with Toronto historical load consumption data. However, with substitution of related data, these methods would also be useful for other cities like Montreal.