Machine learning for text classification in building management systems


In building management systems (BMS), a medium-sized building may have between 200 and 1000 sensor points. Their labels need to be translated into a naming standard so that the BMS platform can recognise them automatically. In current industrial practice this translation, known as the tagging process, is often done manually and takes around 8 hours for every 100 points. We introduce an AI-based multi-stage text classification approach that translates BMS points into formatted BMS labels. After comparing five text classification techniques (logistic regression, random forests, XGBoost, multinomial Naive Bayes and linear support vector classification), we demonstrate that XGBoost is the top performer with 90.29% of true positives, and we use the prediction confidence to filter out false positives. This approach can be applied to sensor networks in various applications where manual pre-processing of free-text data remains cumbersome.
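The idea of classifying free-text point names and then filtering predictions by confidence can be illustrated with a minimal sketch. It is not the authors' pipeline: to stay dependency-free it uses a small hand-written multinomial Naive Bayes (one of the five techniques compared) instead of XGBoost, and the point names, tag labels, character trigram features and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split a raw point name into overlapping lowercase character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class TinyMultinomialNB:
    """Multinomial Naive Bayes over character n-gram counts (Laplace smoothing)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict_proba(self, text):
        # n-grams never seen in training are ignored, as a fixed vocabulary would do
        grams = [g for g in char_ngrams(text) if g in self.vocab]
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for g in grams:
                score += math.log((counts[g] + 1) / denom)
            log_scores[label] = score
        # normalise log scores into probabilities that sum to one
        top = max(log_scores.values())
        exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(exp_scores.values())
        return {k: v / norm for k, v in exp_scores.items()}


def tag_point(model, text, threshold=0.8):
    """Return (label, confidence); label is None when confidence is below threshold."""
    proba = model.predict_proba(text)
    label = max(proba, key=proba.get)
    confidence = proba[label]
    return (label if confidence >= threshold else None), confidence


# Hypothetical training data: raw BMS point names and simplified target tags.
TRAIN = [
    ("AHU1 Supply Air Temp", "air temp sensor"),
    ("AHU2 Return Air Temp", "air temp sensor"),
    ("AHU3 SA Temperature", "air temp sensor"),
    ("CHW Pump 1 Status", "pump run status"),
    ("HW Pump 2 Run Status", "pump run status"),
    ("Pump 3 Sts", "pump run status"),
    ("Zone 4 CO2 Level", "co2 sensor"),
    ("Rm 12 CO2 ppm", "co2 sensor"),
]

model = TinyMultinomialNB().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])

for name in ["AHU5 Supply Temp", "Pump 7 Status", "Lift 2 Alarm"]:
    label, conf = tag_point(model, name)
    print(f"{name!r}: label={label!r}, confidence={conf:.2f}")
```

Points whose label comes back as `None` would be routed to manual review rather than tagged automatically. Raising the threshold trades fewer false positives for more manual work.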

Keywords: free-text classification, building management systems, Haystack data standard, sensor tagging

How to Cite
Mesa-Jiménez, J. J., Stokes, L., Yang, Q., & Livina, V. N. (2022). Machine learning for text classification in building management systems. Journal of Civil Engineering and Management, 28(5), 408–421.
Published in Issue
May 12, 2022
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

