The original paper “Explainable automatic industrial carbon footprint estimation from bank transaction classification using natural language processing” was published by the Institute of Electrical and Electronics Engineers on December 2, 2022.
Jaime González González. IT Group, atlanTTic, TES University of Vigo, Spain.
Silvia García Méndez. IT Group, atlanTTic, TES University of Vigo, Spain.
Francisco De Arriba Pérez. IT Group, atlanTTic, TES University of Vigo, Spain.
Francisco J. González Castaño. IT Group, atlanTTic, TES University of Vigo, Spain.
Óscar Barba Seara. Coinscrap Finance S.L., Pontevedra, Spain.
Concerns about the effect of greenhouse gases have motivated the development of certification protocols to quantify the industrial carbon footprint. These protocols are manual, labor-intensive, and expensive, which has led to a shift towards automatic data-driven approaches to carbon footprint estimation, including Machine Learning solutions.
Introduction to carbon footprint estimation
Concerns about climate change, which is linked to the increasing emission of greenhouse gases, led 187 countries to sign the Paris Agreement in 2015. This accord expressed the need for policies and regulations on emissions of greenhouse gases such as carbon dioxide (CO2). The so-called carbon footprint can be defined as the amount of greenhouse gases released to the atmosphere throughout the life cycle of a product or human activity.
The motivations for calculating the carbon footprint are diverse. Compliance with environmental legislation and the certification of industrial sustainability (ISO 14064) are two of the most relevant. Another incentive is self-checking to avoid environmental taxes and attract funding from ecologically minded investors.
Moreover, individuals, especially young people, have pressing concerns regarding the effects of climate change. Consequently, diverse tracking applications allow end users to estimate and reduce their carbon footprint.
Carbon footprint estimation solutions can be divided into manual and automatic approaches. In this paper, we propose an explainable automatic solution for industrial carbon footprint estimation based on a supervised bank transaction classification model. The training set was labeled with COICOP (Classification of Individual Consumption According to Purpose) classes.
Building on a categorization model that combines Machine Learning with Natural Language Processing techniques, the main contribution of this study is the automatic explanation of carbon footprint estimation decisions.
Methodology for carbon footprint estimation
The features used as input data for the classification task were engineered from textual bank transaction data. For this purpose, the text was processed using the following Natural Language Processing techniques:
· Number removal.
· Term reconstruction.
· Removal of symbols and diacritic marks.
· Stop words and code removal.
· Text lemmatization.
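The preprocessing steps listed above can be sketched as a small pipeline. This is only a minimal illustration using the Python standard library, not the authors' implementation: the lookup tables `ABBREVIATIONS`, `STOP_WORDS`, and `LEMMAS` are hypothetical toy stand-ins (a real system would rely on a proper lemmatizer and stop-word list), the order of the steps is an assumption, and treating any token that contains a digit as a number or code is a simplification.

```python
import re
import unicodedata

# Hypothetical toy resources, for illustration only (not from the paper).
ABBREVIATIONS = {"gasol": "gasolinera", "transf": "transferencia"}
STOP_WORDS = {"de", "la", "el", "en", "por", "a"}
LEMMAS = {"facturas": "factura", "pagos": "pago"}


def strip_diacritics(text):
    """Remove diacritic marks by decomposing characters and dropping accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(description):
    """Turn a raw transaction description into a list of cleaned terms."""
    text = strip_diacritics(description.lower())
    # Number and code removal (assumption: any token containing a digit).
    tokens = [t for t in text.split() if not any(c.isdigit() for c in t)]
    # Symbol removal: keep only letters and spaces, then re-tokenize.
    tokens = re.sub(r"[^a-z\s]", " ", " ".join(tokens)).split()
    # Term reconstruction: expand known abbreviations.
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    # Stop word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Lemmatization via a toy lookup table (stand-in for a real lemmatizer).
    return [LEMMAS.get(t, t) for t in tokens]


terms = preprocess("Pago de facturas GASOL. Núñez ref A123B 2022/05")
print(terms)
```

The resulting term list is what a text classifier would then receive as input, for example after vectorization.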
Once the processed bank transaction descriptions contain mostly semantically meaningful terms, the classification task is performed. When the transactions are classified, the proposed system automatically obtains their estimated carbon footprint from the formulae of the sectors to which they are predicted to belong and the bank transaction amount.
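As a rough illustration of this last step, the sketch below maps a predicted sector and a transaction amount to an estimated footprint. The sector formulae are not given here, so the code assumes the simplest possible form, a linear factor per euro spent. The name `SECTOR_FORMULAE`, the function `estimate_footprint`, and all numeric factors are hypothetical placeholders, not values from the paper.

```python
from typing import Callable, Dict

# Placeholder formulae keyed by predicted sector. The factors (kg CO2 per euro)
# are invented for illustration and carry no real-world meaning.
SECTOR_FORMULAE: Dict[str, Callable[[float], float]] = {
    "car and transport": lambda amount_eur: 0.5 * amount_eur,
    "enterprise expenditures": lambda amount_eur: 0.25 * amount_eur,
    "commodities": lambda amount_eur: 0.125 * amount_eur,
}


def estimate_footprint(predicted_sector: str, amount_eur: float) -> float:
    """Apply the formula of the predicted sector to the transaction amount."""
    formula = SECTOR_FORMULAE[predicted_sector]
    # Assumption: expenses may be stored as negative amounts, so use the magnitude.
    return formula(abs(amount_eur))


print(estimate_footprint("car and transport", -100.0))
```

Keeping one callable per sector, rather than a single factor table, leaves room for non-linear sector formulae without changing the calling code.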
Experimental evaluation and discussion
The dataset is composed of 25,853 bank transactions issued by Spanish banks and compiled by Coinscrap Finance S.L. Note that this dataset is comparable in size to that in our previous study on bank transaction classification. It was down-sampled using the FuzzyWuzzy Python library to keep only those entries that were sufficiently representative and distinguishable.
Samples whose descriptions had a similarity greater than 90% were discarded. The down-sampling process resulted in 2,619 transaction archetypes, with an average length of 10 words/73 characters. The transactions are divided into three main categories (car and transport, enterprise expenditures, and commodities) and several subcategories.
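The similarity-based down-sampling can be sketched as follows. To keep the example dependency-free, it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for FuzzyWuzzy, so the scores may differ from those of the original experiments. The greedy strategy (keep a description only if it is not too similar to any description already kept) and the function names are assumptions, since the exact procedure is not detailed here.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity between two strings on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def downsample(descriptions, threshold=90.0):
    """Greedily keep descriptions that are not near-duplicates of kept ones."""
    kept = []
    for description in descriptions:
        # Discard the sample if it is more than `threshold` percent similar
        # to any description that has already been retained.
        if all(similarity(description, k) <= threshold for k in kept):
            kept.append(description)
    return kept


samples = [
    "pago gasolinera repsol madrid",
    "pago gasolinera repsol madrit",  # near-duplicate of the first sample
    "compra material oficina",
]
print(downsample(samples))
```

This pairwise comparison is quadratic in the number of retained samples, which is usually acceptable for a dataset of a few tens of thousands of short descriptions.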
Conclusions of carbon footprint estimation
In this study, a novel explainable solution for automatic industrial carbon footprint estimation from bank transactions is proposed, addressing the lack of transparent decision explanation methodologies for this problem.
Explanations are especially important for building trust in the outcome of automatic processes, so that these can replace more expensive alternatives such as consultancy analytics. Beyond the fact that automatic explainability has not been tackled in this domain, the study of the state of the art also revealed no previous works or commercial solutions for automatic industrial carbon footprint estimation based on bank transactions. The original data source includes more than 25,000 bank transactions and was annotated for classification using COICOP categories.
In future work, the authors plan to extend this research to other main languages, enrich explanations with complementary enterprise information, and study the effect of hierarchical methodologies on categorization by leveraging the relations between target classes.
We also plan to move towards a semi-supervised approach by combining the current solution with a rule scheme, such as those proposed by other authors. Another possible line of research is the comparison of the model-agnostic approach to explainability with model-specific methodologies.