Věra Kůrková · Yannis Manolopoulos · Barbara Hammer · Lazaros Iliadis · Ilias Maglogiannis (Eds.)

Artificial Neural Networks and Machine Learning – ICANN 2018
27th International Conference on Artificial Neural Networks
Rhodes, Greece, October 4–7, 2018
Proceedings, Part III

Lecture Notes in Computer Science 11141
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zurich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology Madras, Chennai, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany

More information about this series at http://www.springer.com/series/7407
Editors
Věra Kůrková, Czech Academy of Sciences, Prague, Czech Republic
Yannis Manolopoulos, Open University of Cyprus, Latsia, Cyprus
Barbara Hammer, CITEC, Bielefeld University, Bielefeld, Germany
Lazaros Iliadis, Democritus University of Thrace, Xanthi, Greece
Ilias Maglogiannis, University of Piraeus, Piraeus, Greece

ISSN 0302-9743 (print) · ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-030-01423-0 · ISBN 978-3-030-01424-7 (eBook)
https://doi.org/10.1007/978-3-030-01424-7
Library of Congress Control Number: 2018955577
LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues

© Springer Nature Switzerland AG 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

Technological advances in artificial intelligence (AI) are leading the rapidly changing world of the twenty-first century. We have already passed from machine learning to deep learning, with numerous applications. The contribution of AI so far to the improvement of our quality of life is profound. Major challenges, but also risks and threats, come with it. Brain-inspired computing explores, simulates, and imitates the structure and function of the human brain, achieving high-performance modeling plus visualization capabilities.

The International Conference on Artificial Neural Networks (ICANN) is the annual flagship conference of the European Neural Network Society (ENNS). It features the main tracks “Brain-Inspired Computing” and “Machine Learning Research,” with strong cross-disciplinary interactions and applications. All research fields dealing with neural networks are present. The 27th ICANN was held during October 4–7, 2018, at the Aldemar Amilia Mare five-star resort and conference center in Rhodes, Greece.
The previous ICANN events were held in Helsinki, Finland (1991), Brighton, UK (1992), Amsterdam, The Netherlands (1993), Sorrento, Italy (1994), Paris, France (1995), Bochum, Germany (1996), Lausanne, Switzerland (1997), Skövde, Sweden (1998), Edinburgh, UK (1999), Como, Italy (2000), Vienna, Austria (2001), Madrid, Spain (2002), Istanbul, Turkey (2003), Budapest, Hungary (2004), Warsaw, Poland (2005), Athens, Greece (2006), Porto, Portugal (2007), Prague, Czech Republic (2008), Limassol, Cyprus (2009), Thessaloniki, Greece (2010), Espoo-Helsinki, Finland (2011), Lausanne, Switzerland (2012), Sofia, Bulgaria (2013), Hamburg, Germany (2014), Barcelona, Spain (2016), and Alghero, Italy (2017).

Following a long-standing tradition, these volumes belong to Springer’s Lecture Notes in Computer Science series. They contain the papers accepted for oral or poster presentation at the 27th ICANN conference.

The 27th ICANN Program Committee was delighted by the overwhelming response to the call for papers. All papers went through a peer-review process by at least two, and often by three or four, independent academic referees to resolve any conflicts. In total, 360 papers were submitted to the 27th ICANN. Of these, 139 (38.3%) were accepted as full papers for oral presentation of 20 minutes, with a maximum length of 10 pages, whereas 28 were accepted as short contributions to be presented orally in 15 minutes and included in the proceedings with 8 pages. Also, 41 papers (11.4%) were accepted as full papers for poster presentation (up to 10 pages long), whereas 11 were accepted as short papers for poster presentation (maximum length of 8 pages).
The accepted papers of the 27th ICANN conference are related to the following thematic topics:

AI and Bioinformatics
Bayesian and Echo State Networks
Brain-Inspired Computing
Chaotic Complex Models
Clustering, Mining, Exploratory Analysis
Coding Architectures
Complex Firing Patterns
Convolutional Neural Networks
Deep Learning (DL)
– DL in Real Time Systems
– DL and Big Data Analytics
– DL and Big Data
– DL and Forensics
– DL and Cybersecurity
– DL and Social Networks
Evolving Systems – Optimization
Extreme Learning Machines
From Neurons to Neuromorphism
From Sensation to Perception
From Single Neurons to Networks
Fuzzy Modeling
Hierarchical ANN
Inference and Recognition
Information and Optimization
Interacting with the Brain
Machine Learning (ML)
– ML for Bio-Medical Systems
– ML and Video-Image Processing
– ML and Forensics
– ML and Cybersecurity
– ML and Social Media
– ML in Engineering
Movement and Motion Detection
Multilayer Perceptrons and Kernel Networks
Natural Language
Object and Face Recognition
Recurrent Neural Networks and Reservoir Computing
Reinforcement Learning
Reservoir Computing
Self-Organizing Maps
Spiking Dynamics/Spiking ANN
Support Vector Machines
Swarm Intelligence and Decision-Making
Text Mining
Theoretical Neural Computation
Time Series and Forecasting
Training and Learning

The authors of submitted papers came from 34 different countries all over the globe, namely: Belgium, Brazil, Bulgaria, Canada, China, Czech Republic, Cyprus, Egypt, Finland, France, Germany, Greece, India, Iran, Ireland, Israel, Italy, Japan, Luxembourg, The Netherlands, Norway, Oman, Pakistan, Poland, Portugal, Romania, Russia, Slovakia, Spain, Switzerland, Tunisia, Turkey, UK, and USA.

Four keynote speakers were invited, and they gave lectures on timely aspects of AI. We hope that these proceedings will help researchers worldwide to understand and to be aware of timely evolutions in AI, and more specifically in artificial neural networks.
We believe that they will be of major interest for scientists all over the globe and that they will stimulate further research.

October 2018

Věra Kůrková
Yannis Manolopoulos
Barbara Hammer
Lazaros Iliadis
Ilias Maglogiannis

Organization

General Chairs
Věra Kůrková, Czech Academy of Sciences, Czech Republic
Yannis Manolopoulos, Open University of Cyprus, Cyprus

Program Co-chairs
Barbara Hammer, Bielefeld University, Germany
Lazaros Iliadis, Democritus University of Thrace, Greece
Ilias Maglogiannis, University of Piraeus, Greece

Steering Committee
Věra Kůrková, Czech Academy of Sciences, Czech Republic (President of ENNS)
Cesare Alippi, Università della Svizzera Italiana, Switzerland
Guillem Antó i Coma, Pompeu Fabra University, Barcelona, Spain
Jeremie Cabessa, Université Paris 2 Panthéon-Assas, France
Wlodzislaw Duch, Nicolaus Copernicus University, Poland
Petia Koprinkova-Hristova, Bulgarian Academy of Sciences, Bulgaria
Jaakko Peltonen, University of Tampere, Finland
Yifat Prut, The Hebrew University, Israel
Bernardete Ribeiro, University of Coimbra, Portugal
Stefano Rovetta, University of Genoa, Italy
Igor Tetko, German Research Center for Environmental Health, Munich, Germany
Alessandro Villa, University of Lausanne, Switzerland
Paco Zamora-Martínez, das-Nano, Spain

Publication Chair
Antonis Papaleonidas, Democritus University of Thrace, Greece

Communication Chair
Paolo Masulli, Technical University of Denmark, Denmark

Program Committee
Najem Abdennour, Higher Institute of Computer Science and Multimedia (ISIMG), Gabes, Tunisia
Tetiana Aksenova, Atomic Energy Commission (CEA), Grenoble, France
Zakhriya Alhassan, Durham University, UK
Tayfun Alpay, University of Hamburg, Germany
Ioannis Anagnostopoulos, University of Thessaly, Greece
Cesar Analide, University of Minho, Portugal
Annushree Bablani, National Institute of Technology Goa, India
Costin Badica, University of Craiova, Romania
Pablo Barros, University of Hamburg, Germany
Adam Barton, University of Ostrava, Czech Republic
Lluís Belanche, Polytechnic University of Catalonia, Spain
Bartlomiej Beliczynski, Warsaw University of Technology, Poland
Kostas Berberidis, University of Patras, Greece
Ege Beyazit, University of Louisiana at Lafayette, USA
Francisco Elanio Bezerra, University Ninth of July, São Paulo, Brazil
Varun Bhatt, Indian Institute of Technology, Bombay, India
Marcin Blachnik, Silesian University of Technology, Poland
Sander Bohte, National Research Institute for Mathematics and Computer Science (CWI), The Netherlands
Simone Bonechi, University of Siena, Italy
Farah Bouakrif, University of Jijel, Algeria
Meftah Boudjelal, Mascara University, Algeria
Andreas Bougiouklis, National Technical University of Athens, Greece
Martin Butz, University of Tübingen, Germany
Jeremie Cabessa, Université Paris 2, France
Paulo Vitor Campos Souza, Federal Center for Technological Education of Minas Gerais, Brazil
Angelo Cangelosi, Plymouth University, UK
Yanan Cao, Chinese Academy of Sciences, China
Francisco Carvalho, Federal University of Pernambuco, Brazil
Giovanna Castellano, University of Bari, Italy
Jheymesson Cavalcanti, University of Pernambuco, Brazil
Amit Chaulwar, Technical University Ingolstadt, Germany
Sylvain Chevallier, University of Versailles St. Quentin, France
Stephane Cholet, University of Antilles, Guadeloupe
Mark Collier, Trinity College, Ireland
Jorg Conradt, Technical University of Munich, Germany
Adriana Mihaela Coroiu, Babes-Bolyai University, Romania
Paulo Cortez, University of Minho, Portugal
David Coufal, Czech Academy of Sciences, Czech Republic
Juarez Da Silva, University of Vale do Rio dos Sinos, Brazil
Vilson Luiz Dalle Mole, Federal University of Technology Parana, Brazil
Debasmit Das, Purdue University, USA
Bodhisattva Dash, International Institute of Information Technology, Bhubaneswar, India
Eli David, Bar-Ilan University, Israel
Konstantinos Demertzis, Democritus University of Thrace, Greece
Antreas Dionysiou, University of Cyprus, Cyprus
Sergey Dolenko, Lomonosov Moscow State University, Russia
Xiao Dong, Chinese Academy of Sciences, China
Shirin Dora, University of Amsterdam, The Netherlands
Jose Dorronsoro, Autonomous University of Madrid, Spain
Ziad Doughan, Beirut Arab University, Lebanon
Wlodzislaw Duch, Nicolaus Copernicus University, Poland
Gerrit Ecke, University of Tübingen, Germany
Alexander Efitorov, Lomonosov Moscow State University, Russia
Manfred Eppe, University of Hamburg, Germany
Deniz Erdogmus, Northeastern University, USA
Rodrigo Exterkoetter, LTrace Geophysical Solutions, Florianopolis, Brazil
Yingruo Fan, The University of Hong Kong, SAR China
Maurizio Fiasché, Polytechnic University of Milan, Italy
Lydia Fischer, Honda Research Institute Europe, Germany
Andreas Fischer, University of Fribourg, Germany
Qinbing Fu, University of Lincoln, UK
Ninnart Fuengfusin, Kyushu Institute of Technology, Japan
Madhukar Rao G., Indian Institute of Technology, Dhanbad, India
Mauro Gaggero, National Research Council, Genoa, Italy
Claudio Gallicchio, University of Pisa, Italy
Shuai Gao, University of Science and Technology of China, China
Artur Garcez, City University of London, UK
Michael Garcia Ortiz, Aldebaran Robotics, France
Angelo Genovese, University of Milan, Italy
Christos Georgiadis, University of Macedonia, Thessaloniki, Greece
Alexander Gepperth, HAW Fulda, Germany
Peter Gergeľ, Comenius University in Bratislava, Slovakia
Daniel Gibert, University of Lleida, Spain
Eleonora Giunchiglia, University of Genoa, Italy
Jan Philip Goepfert, Bielefeld University, Germany
George Gravanis, Democritus University of Thrace, Greece
Ingrid Grenet, University of Côte d’Azur, France
Jiri Grim, Czech Academy of Sciences, Czech Republic
Xiaodong Gu, Fudan University, China
Alberto Guillén, University of Granada, Spain
Tatiana Valentine Guy, Czech Academy of Sciences, Czech Republic
Myrianthi Hadjicharalambous, KIOS Research and Innovation Centre of Excellence, Cyprus
Petr Hajek, University of Pardubice, Czech Republic
Xue Han, China University of Geosciences, China
Liping Han, Nanjing University of Information Science and Technology, China
Wang Haotian, National University of Defense Technology, China
Kazuyuki Hara, Nihon University, Japan
Ioannis Hatzilygeroudis, University of Patras, Greece
Stefan Heinrich, University of Hamburg, Germany
Tim Heinz, University of Siegen, Germany
Catalina Hernandez, District University of Bogota, Colombia
Alex Hernández García, University of Osnabrück, Germany
Adrian Horzyk, AGH University of Science and Technology in Krakow, Poland
Wenjun Hou, China Agricultural University, China
Jian Hou, Bohai University, China
Haigen Hu, Zhejiang University of Technology, China
Amir Hussain, University of Stirling, UK
Nantia Iakovidou, King’s College London, UK
Yahaya Isah Shehu, Coventry University, UK
Sylvain Jaume, Saint Peter’s University, Jersey City, USA
Noman Javed, Namal College Mianwali, Pakistan
Maciej Jedynak, University of Grenoble Alpes, France
Qinglin Jia, Peking University, China
Na Jiang, Beihang University, China
Wenbin Jiang, Huazhong University of Science and Technology, China
Zongze Jin, Chinese Academy of Sciences, China
Jacek Kabziński, Lodz University of Technology, Poland
Antonios Kalampakas, American University of the Middle East, Kuwait
Jan Kalina, Czech Academy of Sciences, Czech Republic
Ryotaro Kamimura, Tokai University, Japan
Andreas Kanavos, University of Patras, Greece
Savvas Karatsiolis, University of Cyprus, Cyprus
Kostas Karatzas, Aristotle University of Thessaloniki, Greece
Ioannis Karydis, Ionian University, Greece
Petros Kefalas, University of Sheffield, International Faculty City College, Thessaloniki, Greece
Nadia Masood Khan, University of Engineering and Technology Peshawar, Pakistan
Gul Muhammad Khan, University of Engineering and Technology, Peshawar, Pakistan
Sophie Klecker, University of Luxembourg, Luxembourg
Taisuke Kobayashi, Nara Institute of Science and Technology, Japan
Mario Koeppen, Kyushu Institute of Technology, Japan
Mikko Kolehmainen, University of Eastern Finland, Finland
Stefanos Kollias, University of Lincoln, UK
Ekaterina Komendantskaya, Heriot-Watt University, UK
Petia Koprinkova-Hristova, Bulgarian Academy of Sciences, Bulgaria
Irena Koprinska, University of Sydney, Australia
Dimitrios Kosmopoulos, University of Patras, Greece
Costas Kotropoulos, Aristotle University of Thessaloniki, Greece
Athanasios Koutras, TEI of Western Greece, Greece
Konstantinos Koutroumbas, National Observatory of Athens, Greece
Giancarlo La Camera, Stony Brook University, USA
Jarkko Lagus, University of Helsinki, Finland
Luis Lamb, Federal University of Rio Grande, Brazil
Ángel Lareo, Autonomous University of Madrid, Spain
René Larisch, Chemnitz University of Technology, Germany
Nikos Laskaris, Aristotle University of Thessaloniki, Greece
Ivano Lauriola, University of Padua, Italy
David Lenz, Justus Liebig University, Giessen, Germany
Florin Leon, Technical University of Iasi, Romania
Guangli Li, Chinese Academy of Sciences, China
Yang Li, Peking University, China
Hongyu Li, Zhongan Technology, Shanghai, China
Diego Ettore Liberati, National Research Council, Rome, Italy
Aristidis Likas, University of Ioannina, Greece
Annika Lindh, Dublin Institute of Technology, Ireland
Junyu Liu, Huiying Medical Technology, China
Ji Liu, Beihang University, China
Doina Logofatu, Frankfurt University of Applied Sciences, Germany
Vilson Luiz Dalle Mole, Federal University of Technology – Paraná (UTFPR), Campus Toledo, Brazil
Sven Magg, University of Hamburg, Germany
Ilias Maglogiannis, University of Piraeus, Greece
George Magoulas, Birkbeck College, London, UK
Christos Makris, University of Patras, Greece
Kleanthis Malialis, University of Cyprus, Cyprus
Kristína Malinovská, Comenius University in Bratislava, Slovakia
Konstantinos Margaritis, University of Macedonia, Thessaloniki, Greece
Thomas Martinetz, University of Lübeck, Germany
Gonzalo Martínez-Muñoz, Autonomous University of Madrid, Spain
Boudjelal Meftah, University Mustapha Stambouli, Mascara, Algeria
Stefano Melacci, University of Siena, Italy
Nikolaos Mitianoudis, Democritus University of Thrace, Greece
Hebatallah Mohamed, Roma Tre University, Italy
Francesco Carlo Morabito, Mediterranean University of Reggio Calabria, Italy
Giorgio Morales, National Telecommunications Research and Training Institute (INICTEL), Peru
Antonio Moran, University of Leon, Spain
Dimitrios Moschou, Aristotle University of Thessaloniki, Greece
Cristhian Motoche, National Polytechnic School, Ecuador
Phivos Mylonas, Ionian University, Greece
Anton Nemchenko, UCLA, USA
Roman Neruda, Czech Academy of Sciences, Czech Republic
Amy Nesky, University of Michigan, USA
Hoang Minh Nguyen, Korea Advanced Institute of Science and Technology, South Korea
Giannis Nikolentzos, Ecole Polytechnique, Palaiseau, France
Dimitri Nowicki, National Academy of Sciences, Ukraine
Stavros Ntalampiras, University of Milan, Italy
Luca Oneto, University of Genoa, Italy
Mihaela Oprea, University Petroleum-Gas of Ploiesti, Romania
Sebastian Otte, University of Tübingen, Germany
Jun Ou, Beijing University of Technology, China
Basil Papadopoulos, Democritus University of Thrace, Greece
Harris Papadopoulos, Frederick University, Cyprus
Antonios Papaleonidas, Democritus University of Thrace, Greece
Krzysztof Patan, University of Zielona Góra, Poland
Jaakko Peltonen, University of Tampere, Finland
Isidoros Perikos, University of Patras, Greece
Alfredo Petrosino, University of Naples Parthenope, Italy
Duc-Hong Pham, Vietnam National University, Vietnam
Elias Pimenidis, University of the West of England, UK
Vincenzo Piuri, University of Milan, Italy
Mirko Polato, University of Padua, Italy
Yifat Prut, The Hebrew University, Israel
Jielin Qiu, Shanghai Jiao Tong University, China
Chhavi Rana, Maharshi Dayanand University, India
Marina Resta, University of Genoa, Italy
Bernardete Ribeiro, University of Coimbra, Portugal
Riccardo Rizzo, National Research Council, Rome, Italy
Manuel Roveri, Polytechnic University of Milan, Italy
Stefano Rovetta, University of Genoa, Italy
Araceli Sanchis de Miguel, Charles III University of Madrid, Spain
Marcello Sanguineti, University of Genoa, Italy
Kyrill Schmid, University of Munich, Germany
Thomas Schmid, University of Leipzig, Germany
Friedhelm Schwenker, Ulm University, Germany
Neslihan Serap Sengor, Istanbul Technical University, Turkey
Will Serrano, Imperial College London, UK
Jivitesh Sharma, University of Agder, Norway
Rafet Sifa, Fraunhofer IAIS, Germany
Sotir Sotirov, University Prof. Dr. Asen Zlatarov, Burgas, Bulgaria
Andreas Stafylopatis, National Technical University of Athens, Greece
Antonino Staiano, University of Naples Parthenope, Italy
Ioannis Stephanakis, Hellenic Telecommunications Organisation, Greece
Michael Stiber, University of Washington Bothell, USA
Catalin Stoean, University of Craiova, Romania
Rudolf Szadkowski, Czech Technical University, Czech Republic
Mandar Tabib, SINTEF, Norway
Kazuhiko Takahashi, Doshisha University, Japan
Igor Tetko, Helmholtz Center Munich, Germany
Yancho Todorov, Aalto University, Espoo, Finland
César Torres-Huitzil, National Polytechnic Institute, Victoria, Tamaulipas, Mexico
Athanasios Tsadiras, Aristotle University of Thessaloniki, Greece
Nicolas Tsapatsoulis, Cyprus University of Technology, Cyprus
George Tsekouras, University of the Aegean, Greece
Matus Tuna, Comenius University in Bratislava, Slovakia
Theodoros Tzouramanis, University of the Aegean, Greece
Juan Camilo Vasquez Tieck, FZI, Karlsruhe, Germany
Nikolaos Vassilas, ATEI of Athens, Greece
Petra Vidnerová, Czech Academy of Sciences, Czech Republic
Alessandro Villa, University of Lausanne, Switzerland
Panagiotis Vlamos, Ionian University, Greece
Thanos Voulodimos, National Technical University of Athens, Greece
Roseli Wedemann, Rio de Janeiro State University, Brazil
Stefan Wermter, University of Hamburg, Germany
Zhihao Ye, Guangdong University of Technology, China
Hujun Yin, University of Manchester, UK
Francisco Zamora-Martinez, Veridas Digital Authentication Solutions, Spain
Yongxiang Zhang, Sun Yat-Sen University, China
Liu Zhongji, Chinese Academy of Sciences, China
Rabiaa Zitouni, Tunis El Manar University, Tunisia
Sarah Zouinina, Université Paris 13, France

Keynote Talks

Cognitive Phase Transitions in the Cerebral Cortex – John Taylor Memorial Lecture
Robert Kozma, University of Massachusetts Amherst

Abstract.
Everyday subjective experience of the stream of consciousness suggests continuous cognitive processing in time and smooth underlying brain dynamics. Brain monitoring techniques with markedly improved spatio-temporal resolution, however, show that relatively smooth periods in brain dynamics are frequently interrupted by sudden changes and intermittent discontinuities, evidencing singularities. There are frequent transitions between periods of large-scale synchronization and intermittent desynchronization at alpha-theta rates. These observations support the hypothesis of a cinematic model of cognitive processing, according to which higher cognition can be viewed as multiple movies superimposed in time and space. The metastable spatial patterns of field potentials manifest the frames, and the rapid transitions provide the shutter from each pattern to the next. Recent experimental evidence indicates that the observed discontinuities are not merely important aspects of cognition; they are key attributes of intelligent behavior, representing the cognitive “Aha” moment of sudden insight and deep understanding in humans and animals. The discontinuities can be characterized as phase transitions in graphs and networks. We introduce computational models to implement these insights in a new generation of devices with robust artificial intelligence, including oscillatory neuromorphic memories and self-developing autonomous robots.

On the Deep Learning Revolution in Computer Vision
Nathan Netanyahu, Bar-Ilan University, Israel

Abstract. Computer Vision (CV) is an interdisciplinary field of Artificial Intelligence (AI), which is concerned with the embedding of human visual capabilities in a computerized system. The main thrust of CV, essentially, is to generate an “intelligent” high-level description of the world for a given scene, such that, when interfaced with other thought processes, it can ultimately elicit appropriate action.
In this talk we will review several central CV tasks and the traditional approaches taken for handling these tasks for over 50 years. Noting the limited performance of the standard methods applied, we briefly survey the evolution of artificial neural networks (ANN) during this extended period, and focus, specifically, on the ongoing revolutionary performance of deep learning (DL) techniques for the above CV tasks during the past few years. In particular, we also provide an overview of our DL activities, in the context of CV, at Bar-Ilan University. Finally, we discuss future research and development challenges in CV in light of further employment of prospective DL innovations.

From Machine Learning to Machine Diagnostics
Marios Polycarpou, University of Cyprus

Abstract. During the last few years, there has been remarkable progress in utilizing machine learning methods in several applications that benefit from deriving useful patterns among large volumes of data. These advances have attracted significant attention from industry due to the prospect of reducing the cost of predicting future events and making intelligent decisions based on data from past experiences. In this context, a key area that can benefit greatly from the use of machine learning is the task of detecting and diagnosing abnormal behaviour in dynamical systems, especially in safety-critical, large-scale applications. The goal of this presentation is to provide insight into the problem of detecting, isolating, and self-correcting abnormal or faulty behaviour in large-scale dynamical systems, to present some design methodologies based on machine learning, and to show some illustrative examples. The ultimate goal is to develop the foundation of the concept of machine diagnostics, which would empower smart software algorithms to continuously monitor the health of dynamical systems during the lifetime of their operation.
Multimodal Deep Learning in Biomedical Image Analysis Sotirios Tsaftaris University of Edinburgh, UK Abstract. Nowadays images are typically accompanied by additional informa- tion. At the same time, for example, magnetic resonance imaging exams typi- cally contain more than one image modality: they show the same anatomy under different acquisition strategies revealing various pathophysiological information. The detection of disease, segmentation of anatomy and other classical analysis tasks, can benefit from a multimodal view to analysis that leverages shared information across the sources yet preserves unique information. It is without surprise that radiologists analyze data in this fashion, reviewing the exam as a whole. Yet, when aiming to automate analysis tasks, we still treat different image modalities in isolation and tend to ignore additional information. In this talk, I will present recent work in learning with deep neural networks, latent embeddings suitable for multimodal processing, and highlight opportunities and challenges in this area. Contents – Part III Recurrent ANN Policy Learning Using SPSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 R. Ramamurthy, C. Bauckhage, R. Sifa, and S. Wrobel Simple Recurrent Neural Networks for Support Vector Machine Training . . . 13 Rafet Sifa, Daniel Paurat, Daniel Trabold, and Christian Bauckhage RNN-SURV: A Deep Recurrent Model for Survival Analysis . . . . . . . . . . . . 23 Eleonora Giunchiglia, Anton Nemchenko, and Mihaela van der Schaar Do Capsule Networks Solve the Problem of Rotation Invariance for Traffic Sign Classification? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Jan Kronenberger and Anselm Haselhoff Balanced and Deterministic Weight-Sharing Helps Network Performance. . . . 41 Oscar Chang and Hod Lipson Neural Networks with Block Diagonal Inner Product Layers . . . . . . . . . . . . 51 Amy Nesky and Quentin F. 
Stout Training Neural Networks Using Predictor-Corrector Gradient Descent . . . . . 62 Amy Nesky and Quentin F. Stout Investigating the Role of Astrocyte Units in a Feedforward Neural Network . . . 73 Peter Gergel’ and Igor Farkaŝ Interactive Area Topics Extraction with Policy Gradient. . . . . . . . . . . . . . . . 84 Jingfei Han, Wenge Rong, Fang Zhang, Yutao Zhang, Jie Tang, and Zhang Xiong Implementing Neural Turing Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Mark Collier and Joeran Beel A RNN-Based Multi-factors Model for Repeat Consumption Prediction . . . . . 105 Zengwei Zheng, Yanzhen Zhou, Lin Sun, and Jianping Cai Practical Fractional-Order Neuron Dynamics for Reservoir Computing. . . . . . 116 Taisuke Kobayashi An Unsupervised Character-Aware Neural Approach to Word and Context Representation Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 Giuseppe Marra, Andrea Zugarini, Stefano Melacci, and Marco Maggini XXIV Contents – Part III Towards End-to-End Raw Audio Music Synthesis. . . . . . . . . . . . . . . . . . . . 137 Manfred Eppe, Tayfun Alpay, and Stefan Wermter Real-Time Hand Prosthesis Biomimetic Movement Based on Electromyography Sensory Signals Treatment and Sensors Fusion. . . . . . . . . 147 João Olegário de Oliveira de Souza, José Vicente Canto dos Santos, Rodrigo Marques de Figueiredo, and Gustavo Pessin An Exploration of Dropout with RNNs for Natural Language Inference . . . . . 157 Amit Gajbhiye, Sardar Jaf, Noura Al Moubayed, A. Stephen McGough, and Steven Bradley Neural Model for the Visual Recognition of Animacy and Social Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 Mohammad Hovaidi-Ardestani, Nitin Saini, Aleix M. Martinez, and Martin A. Giese Attention-Based RNN Model for Joint Extraction of Intent and Word Slot Based on a Tagging Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
178 Dongjie Zhang, Zheng Fang, Yanan Cao, Yanbing Liu, Xiaojun Chen, and Jianlong Tan Using Regular Languages to Explore the Representational Capacity of Recurrent Neural Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 Abhijit Mahalunkar and John D. Kelleher Learning Trends on the Fly in Time Series Data Using Plastic CGP Evolved Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 Gul Mummad Khan and Durr-e-Nayab Noise Masking Recurrent Neural Network for Respiratory Sound Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 Kirill Kochetov, Evgeny Putin, Maksim Balashov, Andrey Filchenkov, and Anatoly Shalyto Lightweight Neural Programming: The GRPU . . . . . . . . . . . . . . . . . . . . . . 218 Felipe Carregosa, Aline Paes, and Gerson Zaverucha Towards More Biologically Plausible Error-Driven Learning for Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 Kristína Malinovská, Ľudovít Malinovský, and Igor Farkaš Online Carry Mode Detection for Mobile Devices with Compact RNNs . . . . 232 Philipp Kuhlmann, Paul Sanzenbacher, and Sebastian Otte Contents – Part III XXV Deep Learning Deep CNN-ELM Hybrid Models for Fire Detection in Images . . . . . . . . . . . 245 Jivitesh Sharma, Ole-Christopher Granmo, and Morten Goodwin Siamese Survival Analysis with Competing Risks . . . . . . . . . . . . . . . . . . . . 260 Anton Nemchenko, Trent Kyono, and Mihaela Van Der Schaar A Survey on Deep Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270 Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu Cloud Detection in High-Resolution Multispectral Satellite Imagery Using Deep Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280 Giorgio Morales, Samuel G. 
Huamán, and Joel Telles
Metric Embedding Autoencoders for Unsupervised Cross-Dataset Transfer Learning . . . 289
Alexey Potapov, Sergey Rodionov, Hugo Latapie, and Enzo Fenoglio
Classification of MRI Migraine Medical Data Using 3D Convolutional Neural Network . . . 300
Hwei Geok Ng, Matthias Kerzel, Jan Mehnert, Arne May, and Stefan Wermter
Deep 3D Pose Dictionary: 3D Human Pose Estimation from Single RGB Image Using Deep Convolutional Neural Network . . . 310
Reda Elbasiony, Walid Gomaa, and Tetsuya Ogata
FiLayer: A Novel Fine-Grained Layer-Wise Parallelism Strategy for Deep Neural Networks . . . 321
Wenbin Jiang, Yangsong Zhang, Pai Liu, Geyan Ye, and Hai Jin
DeepVol: Deep Fruit Volume Estimation . . . 331
Hongyu Li and Tianqi Han
Graph Matching and Pseudo-Label Guided Deep Unsupervised Domain Adaptation . . . 342
Debasmit Das and C. S. George Lee
fNIRS-Based Brain–Computer Interface Using Deep Neural Networks for Classifying the Mental State of Drivers . . . 353
Gauvain Huve, Kazuhiko Takahashi, and Masafumi Hashimoto
Research on Fight the Landlords' Single Card Guessing Based on Deep Learning . . . 363
Saisai Li, Shuqin Li, Meng Ding, and Kun Meng
Short-Term Precipitation Prediction with Skip-Connected PredNet . . . 373
Ryoma Sato, Hisashi Kashima, and Takehiro Yamamoto
An End-to-End Deep Learning Architecture for Classification of Malware's Binary Content . . .
383
Daniel Gibert, Carles Mateu, and Jordi Planes
Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio . . . 392
Stanislaw Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey
Data Correction by a Generative Model with an Encoder and its Application to Structure Design . . . 403
Takaya Ueda, Masataka Seo, and Ikuko Nishikawa
PMGAN: Paralleled Mix-Generator Generative Adversarial Networks with Balance Control . . . 414
Xia Xiao and Sanguthevar Rajasekaran
Modular Domain-to-Domain Translation Network . . . 425
Savvas Karatsiolis, Christos N. Schizas, and Nicolai Petkov
OrieNet: A Regression System for Latent Fingerprint Orientation Field Extraction . . . 436
Zhenshen Qu, Junyu Liu, Yang Liu, Qiuyu Guan, Chunyu Yang, and Yuxin Zhang
Avoiding Degradation in Deep Feed-Forward Networks by Phasing Out Skip-Connections . . . 447
Ricardo Pio Monti, Sina Tootoonian, and Robin Cao
A Deep Predictive Coding Network for Inferring Hierarchical Causes Underlying Sensory Inputs . . . 457
Shirin Dora, Cyriel Pennartz, and Sander Bohte
Type-2 Diabetes Mellitus Diagnosis from Time Series Clinical Data Using Deep Learning Models . . . 468
Zakhriya Alhassan, A. Stephen McGough, Riyad Alshammari, Tahani Daghstani, David Budgen, and Noura Al Moubayed
A Deep Learning Approach for Sentence Classification of Scientific Abstracts . . .
479
Sérgio Gonçalves, Paulo Cortez, and Sérgio Moro
Weighted Multi-view Deep Neural Networks for Weather Forecasting . . . 489
Zahra Karevan, Lynn Houthuys, and Johan A. K. Suykens
Combining Articulatory Features with End-to-End Learning in Speech Recognition . . . 500
Leyuan Qu, Cornelius Weber, Egor Lakomkin, Johannes Twiefel, and Stefan Wermter
Estimation of Air Quality Index from Seasonal Trends Using Deep Neural Network . . . 511
Arjun Sharma, Anirban Mitra, Sumit Sharma, and Sudip Roy
A Deep Learning Approach to Bacterial Colony Segmentation . . . 522
Paolo Andreini, Simone Bonechi, Monica Bianchini, Alessandro Mecocci, and Franco Scarselli
Sparsity and Complexity of Networks Computing Highly-Varying Functions . . . 534
Věra Kůrková
Deep Learning Based Vehicle Make-Model Classification . . . 544
Burak Satar and Ahmet Emir Dirik
Detection and Recognition of Badgers Using Deep Learning . . . 554
Emmanuel Okafor, Gerard Berendsen, Lambert Schomaker, and Marco Wiering
SPSA for Layer-Wise Training of Deep Networks . . . 564
Benjamin Wulff, Jannis Schuecker, and Christian Bauckhage
Dipolar Data Aggregation in the Context of Deep Learning . . . 574
Leon Bobrowski and Magdalena Topczewska
Video Surveillance of Highway Traffic Events by Deep Learning Architectures . . . 584
Matteo Tiezzi, Stefano Melacci, Marco Maggini, and Angelo Frosini
Augmenting Image Classifiers Using Data Augmentation Generative Adversarial Networks . . .
594
Antreas Antoniou, Amos Storkey, and Harrison Edwards
DeepEthnic: Multi-label Ethnic Classification from Face Images . . . 604
Katia Huri, Eli (Omid) David, and Nathan S. Netanyahu
Handwriting-Based Gender Classification Using End-to-End Deep Neural Networks . . . 613
Evyatar Illouz, Eli (Omid) David, and Nathan S. Netanyahu
A Deep Learning Approach for Sentiment Analysis in Spanish Tweets . . . 622
Gerson Vizcarra, Antoni Mauricio, and Leonidas Mauricio
Location Dependency in Video Prediction . . . 630
Niloofar Azizi, Hafez Farazi, and Sven Behnke

Brain Neurocomputing Modeling

State-Space Analysis of an Ising Model Reveals Contributions of Pairwise Interactions to Sparseness, Fluctuation, and Stimulus Coding of Monkey V1 Neurons . . . 641
Jimmy Gaudreault and Hideaki Shimazaki
Sparse Coding Predicts Optic Flow Specifities of Zebrafish Pretectal Neurons . . . 652
Gerrit A. Ecke, Fabian A. Mikulasch, Sebastian A. Bruijns, Thede Witschel, Aristides B. Arrenberg, and Hanspeter A. Mallot
Brain-Machine Interface for Mechanical Ventilation Using Respiratory-Related Evoked Potential . . . 662
Sylvain Chevallier, Guillaume Bao, Mayssa Hammami, Fabienne Marlats, Louis Mayaud, Djillali Annane, Frédéric Lofaso, and Eric Azabou
Effectively Interpreting Electroencephalogram Classification Using the Shapley Sampling Value to Prune a Feature Tree . . . 672
Kazuki Tachikawa, Yuji Kawai, Jihoon Park, and Minoru Asada
EEG-Based Person Identification Using Rhythmic Brain Activity During Sleep . . .
682
Athanasios Koutras and George K. Kostopoulos
An STDP Rule for the Improvement and Stabilization of the Attractor Dynamics of the Basal Ganglia-Thalamocortical Network . . . 693
Jérémie Cabessa and Alessandro E. P. Villa
Neuronal Asymmetries and Fokker-Planck Dynamics . . . 703
Vitor Tocci F. de Luca, Roseli S. Wedemann, and Angel R. Plastino

Robotics/Motion Detection

Learning-While Controlling RBF-NN for Robot Dynamics Approximation in Neuro-Inspired Control of Switched Nonlinear Systems . . . 717
Sophie Klecker, Bassem Hichri, and Peter Plapper
A Feedback Neural Network for Small Target Motion Detection in Cluttered Backgrounds . . . 728
Hongxin Wang, Jigen Peng, and Shigang Yue
De-noise-GAN: De-noising Images to Improve RoboCup Soccer Ball Detection . . . 738
Daniel Speck, Pablo Barros, and Stefan Wermter
Integrative Collision Avoidance Within RNN-Driven Many-Joint Robot Arms . . . 748
Sebastian Otte, Lea Hofmaier, and Martin V. Butz
An Improved Block-Matching Algorithm Based on Chaotic Sine-Cosine Algorithm for Motion Estimation . . . 759
Bodhisattva Dash and Suvendu Rup
Terrain Classification with Crawling Robot Using Long Short-Term Memory Network . . . 771
Rudolf J. Szadkowski, Jan Drchal, and Jan Faigl
Mass-Spring Damper Array as a Mechanical Medium for Computation . . . 781
Yuki Yamanaka, Takaharu Yaguchi, Kohei Nakajima, and Helmut Hauser
Kinematic Estimation with Neural Networks for Robotic Manipulators . . .
795
Michail Theofanidis, Saif Iftekar Sayed, Joe Cloud, James Brady, and Fillia Makedon

Social Media

Hierarchical Attention Networks for User Profile Inference in Social Media Systems . . . 805
Zhezhou Kang, Xiaoxue Li, Yanan Cao, Yanmin Shang, Yanbing Liu, and Li Guo
A Topological k-Anonymity Model Based on Collaborative Multi-view Clustering . . . 817
Sarah Zouinina, Nistor Grozavu, Younès Bennani, Abdelouahid Lyhyaoui, and Nicoleta Rogovschi
A Credibility-Based Analysis of Information Diffusion in Social Networks . . . 828
Sabina-Adriana Floria, Florin Leon, and Doina Logofătu
Author Index . . . 839

Recurrent ANN

Policy Learning Using SPSA

R. Ramamurthy¹,², C. Bauckhage¹,², R. Sifa¹,², and S. Wrobel¹,²

¹ Department of Computer Science, University of Bonn, Bonn, Germany
ramamurt@iai.uni-bonn.de
² Fraunhofer Center for Machine Learning, Sankt Augustin, Germany

Abstract. We analyze the use of simultaneous perturbation stochastic approximation (SPSA), a stochastic optimization technique, for solving reinforcement learning problems. In particular, we consider settings of partial observability and leverage the short-term memory capabilities of echo state networks (ESNs) to learn parameterized control policies. Using SPSA, we propose three different variants to adapt the weight matrices of an ESN to the task at hand. Experimental results on classic control problems with both discrete and continuous action spaces reveal that ESNs trained using SPSA approaches outperform conventional ESNs trained using temporal difference and policy gradient methods.
Keywords: Echo state networks · Recurrent neural networks · Reinforcement learning · Stochastic optimization

1 Introduction

Creating systems that learn to solve complex tasks from interactions with their environment is one of the primary goals of artificial intelligence research. Recently, much progress has been made in this regard, mainly through modern reinforcement learning (RL) techniques [1,21]. Recent successes include systems that exceed human-level performance in playing console-based Atari games [12] or navigate 3D virtual environments [11]; AlphaGo Zero [17] became the first program to beat world-class Go players by learning from self-play only. Combining function approximators such as deep neural networks with off-policy bootstrapping methods such as Q-learning used to be unstable (this combination was referred to as the "deadly triad" [20]), but has since proven to be a competent approach thanks to techniques such as experience replay [8], which stabilizes learning with the help of a large replay memory.

Spurred by these successes, another line of recent research has considered alternative approaches to RL using black-box optimization methods which do not require back-propagation of gradients. Corresponding contributions include systems [10,14] that are trained using so-called evolution strategies and achieve competitive performance in playing Atari games. Similar performance was obtained in [19], where genetic algorithms were found to scale better than evolution strategies. This has revived interest in black-box methods for solving RL problems, as these can be parallelized on modern distributed architectures.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 3–12, 2018. https://doi.org/10.1007/978-3-030-01424-7_1
However, most real-world systems must deal with limited and noisy state information, resulting in the partial observability encountered in partially observable Markov decision processes (POMDPs). To learn policies under such circumstances, systems need internal memory. Recurrent RL methods that cope with partial observability have therefore been investigated recently, but were found to be difficult to train [4].

In this paper, we focus on this kind of problem and consider RL in partially observable environments. Since echo state networks [5] are known for their simple architecture and short-term memorization capabilities, we choose them to train parameterized control policies. In particular, we propose simultaneous perturbation stochastic approximation (SPSA), a gradient approximation technique, as the training algorithm; at each iteration it requires only two evaluations of the objective function, regardless of the dimension of the parameter vector. Using SPSA, we devise three types of ESN training that differ in how the weight matrices are chosen in each iteration. Finally, we use such ESNs to learn policies and test them against baselines on classic control problems.

Previous work on black-box methods for training echo state networks sought to combine genetic algorithms for training the internal weights of the reservoir with stochastic gradient descent for training the output weights [3,15]. Similar work was done in [6], where output weights and the spectral radii of internal weight matrices were evolved. More recent work [16] pursued a different learning strategy, using Hebbian learning rules to adapt reservoir matrices. An interesting hybrid of Hebbian learning and temporal difference learning was later proposed in [7] to adapt actor-critic ESNs.
In contrast to these previous approaches, we use SPSA to optimize the entire set of network weights, which has several noteworthy properties: (i) it requires only two loss measurements at each iteration, (ii) it does not require back-propagation of gradients, (iii) it does not require maintaining a population of candidate solutions as genetic algorithms do, and (iv) it can handle stochastic returns and hence does not require averaging over multiple measurements to account for noisy returns.

2 Simultaneous Perturbation Stochastic Approximation

In this short section, we briefly recall the main ideas behind simultaneous perturbation stochastic approximation (SPSA) for derivative-free optimization; readers familiar with this technique may safely skip ahead.

Consider the general problem of maximizing a differentiable objective function $f : \mathbb{R}^d \to \mathbb{R}$, that is, consider the problem of finding $\theta^* = \arg\max_\theta f(\theta)$. For many complex systems, the gradient $\partial f / \partial \theta$ cannot be computed directly, so that $\partial f / \partial \theta = 0$ can often not be solved. It is, however, typically possible to evaluate $f(\theta)$ at various values of $\theta$, which in turn allows for computing stochastic approximations of the gradient. One such method is SPSA due to Spall [18], which iteratively updates estimates of the optimal $\theta$ as

$$\theta_{k+1} = \theta_k + l_k \, \hat{g}_k(\theta_k) \tag{1}$$

where $\hat{g}_k(\theta_k)$ is an estimator of the gradient at $\theta_k$ and $l_k$ is the learning rate in iteration $k$. To estimate the gradient, two perturbations are generated, namely $\theta_k + c_k \delta_k$ and $\theta_k - c_k \delta_k$, where $\delta_k$ is a perturbation vector and $c_k$ is a scaling parameter. Then, the possibly noisy objective function $F(\cdot) = f(\cdot) + \text{noise}$ is measured at $\theta_k + c_k \delta_k$ and $\theta_k - c_k \delta_k$, and the gradient is estimated using the two-sided approximation

$$\hat{g}_k(\theta_k) = \frac{F(\theta_k + c_k \delta_k) - F(\theta_k - c_k \delta_k)}{2 \, c_k \, \delta_k} \tag{2}$$

where the division by the vector $\delta_k$ is understood element-wise. The convergence of the SPSA algorithm critically depends on the choice of its parameters $l_k$, $c_k$, and $\delta_k$.
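For concreteness, Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is our own illustration rather than the authors' code: the gain sequences anticipate the common choices discussed below, and the quadratic test objective is an arbitrary toy function.

```python
import numpy as np

def spsa_maximize(f, theta0, iters=2000, l=0.1, L=10, alpha=0.602,
                  c=0.1, gamma=0.101, seed=0):
    """Maximize f via SPSA: two evaluations of f per iteration, Eqs. (1)-(2)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, iters + 1):
        lk = l / (L + k) ** alpha                  # learning-rate sequence l_k
        ck = c / k ** gamma                        # scaling sequence c_k
        delta = rng.choice([-1.0, 1.0], size=theta.shape)  # +/-1 perturbation
        # two-sided gradient estimate, divided element-wise by delta (Eq. 2)
        g_hat = (f(theta + ck * delta) - f(theta - ck * delta)) / (2.0 * ck * delta)
        theta = theta + lk * g_hat                 # gradient *ascent* step (Eq. 1)
    return theta

# toy check: maximize f(theta) = -||theta - 3||^2, whose optimum is at (3, 3)
f = lambda th: -np.sum((th - 3.0) ** 2)
theta_star = spsa_maximize(f, np.zeros(2))
```

Note that only two evaluations of `f` occur per iteration, independently of the dimension of `theta`, which is the property the paper exploits.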
In particular, the learning rate $l_k$ must meet the Robbins–Monro conditions [13], namely $l_k > 0$ and $\sum_{k=1}^{\infty} l_k = \infty$, and a common choice in practice therefore is $l_k = l / (L + k)^{\alpha}$ where $l, \alpha, L > 0$. Similarly, the scaling factor must satisfy $\sum_{k=1}^{\infty} (l_k / c_k)^2 < \infty$, so that a good choice amounts to $c_k = c / k^{\gamma}$ where $c, \gamma > 0$. Finally, each element of the perturbation vector $\delta_k$ is sampled from a uniform distribution over the set $\{-1, +1\}$.

3 Learning Policies Using Echo State Networks

In this section, we first briefly review policy learning under partial observability as well as echo state networks, and then introduce our approach towards policy learning using echo state networks trained via SPSA.

3.1 Partial Observability

Consider an agent interacting with an environment. At any time $t$, the agent observes the state $s_t$ of the environment and performs an action $a_t$ by following a policy $\pi(a_t \mid s_t)$, which maps the state $s_t$ to the probability of choosing action $a_t$. In return, the environment responds with a reward $r_t$ and finds itself in a new state $s_{t+1}$. In environments that are only partially observable, however, the agent does not receive all relevant state information because of limited sensory inputs. In this case, the state $s_t$ does not satisfy the Markov property: it does not summarize what has happened in the past, so an informed decision cannot be taken. For such non-Markovian states, it is necessary to make the policy dependent on a history of states $h_t = \{s_t, s_{t-1}, \ldots\}$ rather than on the current state $s_t$ only. Hence, the policy becomes $\pi(a_t \mid h_t)$. This, however, becomes impractical to compute whenever different tasks require arbitrary lengths of histories. In situations like these, an echo state network can be used to integrate the required history in its reservoir states. In this way, we are able to parameterize the policy with the weights $\theta$ of an echo state network as $\pi(a_t \mid s_t, \theta)$, which takes the current state $s_t$ as input and returns probabilities of actions by compacting the history of input states in the reservoir memory.

3.2 Echo State Networks

We next briefly recall the notion of echo state networks. These belong to the reservoir computing paradigm, in which a large reservoir of recurrently interconnected neurons processes sequential input data. In our setup, the state of the environment $s_t \in \mathbb{R}^{n_s}$ is given as input to the network, and the hidden states and output of our policy network are $h_t \in \mathbb{R}^{n_h}$ and $\pi_t \in \mathbb{R}^{n_a}$, respectively. The temporal evolution of such a network is governed by the non-linear dynamical system

$$h_t = (1 - \beta) \, h_{t-1} + \beta \, f_h\bigl(W^h h_{t-1} + W^s s_t\bigr) \tag{3}$$

$$\pi_t = f_\pi\bigl(W^a h_t\bigr) \tag{4}$$

where $\beta \in [0, 1]$ is called the leaking rate and $W^s$, $W^h$, and $W^a$ are the input, reservoir, and output weight matrices, respectively. The function $f_h(\cdot)$ is understood to act component-wise on its argument and is typically a sigmoidal activation function. For the output layer, however, $f_\pi(\cdot)$ is usually just a linear or softmax function, depending on the application context.

3.3 Policy Learning Using Echo State Networks

At any time, the goal of the agent is to maximize the expected cumulative reward, or return, received over a period of time, defined as $R_T = \sum_{t=1}^{T} r_t$. Hence, the objective function to be maximized is $f(\theta) = E_{\pi_\theta}\bigl[R_T\bigr]$, and finding an optimal policy amounts to finding $\theta^* = \arg\max_\theta f(\theta)$, where we now write $\theta$ to denote the set of weights of an echo state network used to approximate the policy $\pi(a_t \mid s_t, \theta)$.

According to our discussion in Sect. 2, we can then iteratively learn an optimal $\theta$ according to a stochastic gradient ascent rule that follows the gradient $\nabla_\theta E_{\pi_\theta}\bigl[R_T\bigr]$. In particular, we can resort to SPSA in order to approximate this gradient as

$$\nabla_\theta E_{\pi_\theta}\bigl[R_T\bigr] \approx \frac{F(\theta + \epsilon) - F(\theta - \epsilon)}{2\,\epsilon} \tag{5}$$

where $F(\cdot)$ is the stochastic return obtained from the environment by running an episode in which, at each step, the agent follows the policy $\pi(a_t \mid s_t, \theta)$ approximated by the ESN, and where $\epsilon$ is the perturbation generated by SPSA. A summary of this learning method can be found in Algorithm 1.

3.4 Deterministic and Stochastic Policies

An agent's policy can be either deterministic or stochastic. In a discrete action space, the agent may apply a deterministic, greedy, "winner-takes-all" strategy to select an action, i.e. $a_t = \arg\max_a \pi(a \mid s_t, \theta)$. However, in order to encourage exploration, the agent can instead follow a stochastic softmax policy in which actions are sampled according to the action probabilities given by the policy $\pi(a_t \mid s_t, \theta)$, i.e. $a_t \sim f_\pi$ where $f_\pi$ is the softmax function. In a continuous action space, the agent's actions are sampled from a Gaussian policy parameterized by mean and variance neurons, that is, $f_\pi$ is a Gaussian probability distribution.

Algorithm 1. Learn policies using SPSA
Input: SPSA parameters $l, c, L, \alpha, \gamma$ and initial weights $\theta_0$
for $k = 1$ to $k_{\max}$ do
    $l_k = l / (L + k)^{\alpha}$
    $c_k = c / k^{\gamma}$
    $\delta_k \sim U\{-1, +1\}$
    $\theta^+ = \theta_k + c_k \delta_k$
    $\theta^- = \theta_k - c_k \delta_k$
    Compute returns $F(\theta^+)$ and $F(\theta^-)$ by running one episode each with weights $\theta^+$ and $\theta^-$
    $\hat{g}_k(\theta_k) = \dfrac{F(\theta^+) - F(\theta^-)}{2 \, c_k \, \delta_k}$
    $\theta_{k+1} = \theta_k + l_k \, \hat{g}_k(\theta_k)$
end

3.5 Three Variants of Echo State Network Training

Typically, only the output weight matrix $W^a$ of an echo state network is optimized. Some tasks, however, require tuning of the input and reservoir weights $W^s$ and $W^h$ in order to extract relevant information from observations or to reconstruct missing state information. We therefore consider three variants of our SPSA algorithm that use different choices of $\theta$ at each iteration:

1. output spsa: at each iteration, we optimize only the output weight matrix, that is, we let $\theta = W^a$
2.
all spsa: at each iteration, all of the weight matrices are updated at once, that is, we let $\theta = \{W^s, W^h, W^a\}$
3. alternating spsa: at each iteration, we update one of these matrices and move on to the next one in the subsequent iteration.

4 Experiments and Results

We evaluated the above SPSA variants on a benchmark of classic control problems available from OpenAI Gym [2] and compared them against temporal difference and policy gradient learning methods.

4.1 Acrobot and Mountain Car

We considered two classic problems, namely Mountain Car and Acrobot, with both discrete and continuous action selection. For both problems, we restrict state observations to positional information and exclude velocities, so that the agent has to infer velocity information in order to recover the full state. An illustration of these OpenAI Gym problems and their state-action spaces is given in Fig. 1.

Fig. 1. Test environments: (a) description of observation and action space for the acrobot and mountain car tasks; (b), (c) task illustrations from OpenAI Gym. The state and action spaces are: Acrobot — state: cosine and sine of the two joint angles; action: applying +1, 0, or −1 torque on the joint between the two links. Mountain Car — state: the one-dimensional position of the car; action: push left, no push, or push right in the discrete version, and a scalar force in the continuous version.

4.2 Implementation Details

We used the same echo state network architecture, consisting of 40 reservoir neurons with tanh activation functions, for our SPSA variants and their RL baselines. The number of input and output neurons and the output activation function are chosen depending on the task and the type of policy being learned. The weight matrices are initialized according to parameters such as sparsity, scaling, and spectral radius, which are carefully set as per the guidelines in [9].
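To make Eqs. (3) and (4) and this initialization scheme concrete, a minimal NumPy sketch of such a policy network follows. Matrix names mirror the text, but the sparsity/connectivity settings are omitted for brevity, and all function names and parameter values here are our own illustrative assumptions rather than the authors' exact code.

```python
import numpy as np

def init_esn(n_s, n_h, n_a, rho=1.0, in_scale=0.5, out_scale=0.1, seed=0):
    """Initialize input, reservoir, and output weights; rescale the reservoir
    matrix W_h so that its spectral radius equals rho."""
    rng = np.random.default_rng(seed)
    W_s = rng.uniform(-in_scale, in_scale, (n_h, n_s))
    W_h = rng.uniform(-0.5, 0.5, (n_h, n_h))
    W_h *= rho / np.abs(np.linalg.eigvals(W_h)).max()   # set spectral radius
    W_a = rng.uniform(-out_scale, out_scale, (n_a, n_h))
    return W_s, W_h, W_a

def esn_step(h, s, W_s, W_h, W_a, beta=0.3):
    """One leaky-integrator update (Eq. 3) and softmax policy output (Eq. 4)."""
    h = (1.0 - beta) * h + beta * np.tanh(W_h @ h + W_s @ s)
    z = W_a @ h
    pi = np.exp(z - z.max())          # numerically stable softmax
    pi /= pi.sum()
    return h, pi

# one policy step from a zero reservoir state; the sizes are hypothetical
W_s, W_h, W_a = init_esn(n_s=2, n_h=40, n_a=3)
h, pi = esn_step(np.zeros(40), np.array([0.5, -0.2]), W_s, W_h, W_a)
```

Because `h` is carried from step to step, successive observations leave a trace in the reservoir, which is how the network compacts the state history discussed in Sect. 3.1.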
The input and reservoir matrices are drawn from a uniform distribution over $[-0.5, 0.5]$, whereas the output scaling is chosen separately for each task. The initial spectral radius of the reservoir matrix and the leaking rate are set to 1.0 and 0.3, respectively, for all tasks. The SPSA parameters (learning rate, scaling factor, and decay rates) and, likewise, the parameters of the reinforcement learning baselines (discount factors and learning rates) are tuned for each experiment. Table 1 lists all hyperparameters and their values.

4.3 Results

First, we tested our algorithms on training deterministic greedy policies for the discrete versions of the acrobot and mountain car tasks and found that the SPSA variants are able to solve both tasks. In a quantitative evaluation, we computed mean learning curves over 10 different random seeds and compared them to similar curves obtained using echo state networks trained with temporal difference methods, namely Q-learning and SARSA learning with stochastic gradient descent. Figures 2(a) and (b) show the learning curves in terms of the evolution of episodic total reward during learning (the higher the better). As we can observe, all SPSA variants find better policies than Q-learning or SARSA learning.

Table 1. Hyperparameters and their values for different experiments.
                                   Deterministic               Stochastic
Category  Parameter                Acrobot     Mountain car    Acrobot     Mountain car
                                   (discrete)  (discrete)      (discrete)  (continuous)
SPSA      Learning rate (l)        1e-6        1e-3            5e-5        5e-3
          Scaling factor (c)       1e-1        1e-1            1e-1        1e-1
          L                        10          100             10          100
          alpha                    0.102       0.602           0.102       0.602
          gamma                    0.101       0.101           0.101       0.101
ESN       Reservoir size           40          40              40          40
          Input connectivity       0.7         0.3             0.3         0.7
          Reservoir connectivity   0.7         0.3             0.7         0.7
          Output scaling           0.1         0.1             1e-5        1e-2
          Spectral radius          1.0         1.0             1.0         1.0
          Leaking rate             0.3         0.3             0.3         0.3
RL        Discount factor          0.99        1.0             0.99        0.99
          Learning rate            1e-2        1e-2            1e-3        1e-3

Next, we tested our algorithms on learning stochastic policies for a discrete version of the acrobot task and a continuous version of the mountain car task. We found that the SPSA variants are able to solve these by finding a softmax policy and a Gaussian policy, respectively. In a quantitative evaluation, we again computed mean learning curves over 10 different random seeds and compared them to data obtained using actor-critic methods. In the actor-critic method, two echo state networks are used, one to learn the policy (policy network) and one to learn the state value function (value network); both act with limited state information, as in our SPSA variants. Figures 2(c) and (d) show the learning curves; the SPSA variants again perform better than the actor-critic methods. Next, in order to visualize the Gaussian policy learned for the mountain car task, we plotted action probabilities for selected input states. As can be seen in Fig. 3(b), for the same input states, the resulting action probability distribution is a mixture of Gaussians, meaning that actions are sampled from appropriate mixture components based on the hidden states of the network, which reconstruct the missing velocity information. Our most important evaluation results are summarized in Fig. 3(a), which shows average episodic total rewards over the last 100 iterations with 10 different random seeds.
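The training pipeline behind these experiments, Algorithm 1 combined with the variant choices of Sect. 3.5, can be sketched as follows. To keep the snippet self-contained we use a deliberately trivial stand-in for the Gym episode return, and all function names and parameter values below are our own assumptions, not the authors' implementation.

```python
import numpy as np

MATRICES = ["W_s", "W_h", "W_a"]

def pick(variant, k):
    """Which weight matrices constitute theta at iteration k (Sect. 3.5)."""
    return {"output_spsa": ["W_a"],
            "all_spsa": list(MATRICES),
            "alternating_spsa": [MATRICES[k % 3]]}[variant]

def spsa_step(weights, names, F, k, l=0.05, L=10, alpha=0.602,
              c=0.1, gamma=0.101, rng=None):
    """One iteration of Algorithm 1, perturbing only the selected matrices."""
    rng = rng if rng is not None else np.random.default_rng(k)
    lk, ck = l / (L + k) ** alpha, c / k ** gamma
    deltas = {n: rng.choice([-1.0, 1.0], size=weights[n].shape) for n in names}
    plus  = {n: w + ck * deltas[n] if n in deltas else w for n, w in weights.items()}
    minus = {n: w - ck * deltas[n] if n in deltas else w for n, w in weights.items()}
    diff = F(plus) - F(minus)                  # two episode returns, Eq. (5)
    for n in names:                            # ascent update, Eqs. (1)-(2)
        weights[n] = weights[n] + lk * diff / (2.0 * ck * deltas[n])
    return weights

# toy stand-in for an episode return: "reward" peaks when W_a is all ones
F = lambda w: -float(np.sum((w["W_a"] - 1.0) ** 2))
weights = {n: np.zeros((2, 2)) for n in MATRICES}
rng = np.random.default_rng(0)
for k in range(1, 301):
    weights = spsa_step(weights, pick("output_spsa", k), F, k, rng=rng)
```

With `"output_spsa"` only `W_a` moves toward the optimum while `W_s` and `W_h` stay fixed; switching the variant string reproduces the other two training schemes.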
Here we observe: (i) training only the output weight matrix using SPSA yields better performance than its RL counterparts in all experiments, which indicates that SPSA is a powerful alternative to common RL methods; (ii) updating all weight matrices at once gives the best performance in all tasks, although training in an alternating fashion also seems to be a promising approach that warrants further investigation; (iii) for the acrobot tasks, SPSA evidently works better when learning a deterministic policy than a stochastic one. The reason could be that it is not necessary to also introduce stochasticity into the action space, since exploration already happens in parameter space in terms of perturbations.

Fig. 2. Learning curves: (a), (c) evolution of episodic total reward when learning deterministic policies for the discrete versions of acrobot and mountain car; (b), (d) evolution of episodic total reward when learning a softmax policy for the discrete acrobot task and a Gaussian policy for the continuous mountain car task. The SPSA variants perform better than the RL methods.

Fig. 3. Performance summary: (a) average episodic total reward over the last 100 iterations of policy learning for the different variants and their baselines (the higher the value, the better the performance); (b) visualization of the Gaussian policy learned for the mountain car task. The results shown in (a) are:

                    Deterministic               Stochastic
                    Acrobot     Mountain Car    Acrobot     Mountain Car
                    (discrete)  (discrete)      (discrete)  (continuous)
all spsa            -105.56     -121.61         -121.83     85.34
alternating spsa    -110.07     -124.70         -131.01     80.41
output spsa         -109.16     -144.88         -141.25     80.24
output q            -123.72     -150.69         -           -
output sarsa        -132.28     -163.94         -           -
actor critic        -           -               -193.95     72.64
This concurs with the work in [14], whose authors also choose to learn deterministic policies when using black-box methods. Nevertheless, our approach demonstrates the general feasibility of learning both deterministic and stochastic policies.

5 Conclusion

In this paper, we considered the use of SPSA for training echo state networks to solve action selection tasks under partial observability. We proposed three variants that perform gradient updates without using back-propagation. Experiments on classic problems indicate that SPSA is a powerful alternative to the reinforcement learning methods commonly used for policy learning. In future work, we intend to extend the ideas reported here to LSTM units in order to solve more complex RL problems that require long-term dependencies. We also plan to examine the alternating SPSA variant further to verify its applicability to training deep recurrent neural networks.

References

1. Bertsekas, D.P.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)
2. Brockman, G., et al.: OpenAI Gym. arXiv:1606.01540 (2016)
3. Chatzidimitriou, K.C., Mitkas, P.A.: A NEAT way for evolving echo state networks. In: Proceedings of the European Conference on Artificial Intelligence (2010)
4. Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: Proceedings of the International Conference on Machine Learning (2016)
5. Jäger, H.: The "echo state" approach to analysing and training recurrent neural networks. Technical Report 148, GMD (2001)
6. Jiang, F., Berry, H., Schoenauer, M.: Supervised and evolutionary learning of echo state networks. In: Proceedings of the International Conference on Parallel Problem Solving from Nature (2008)
7. Koprinkova-Hristova, P.: Three approaches to train echo state network actors of adaptive critic design. In: Proceedings of the International Conference on Artificial Neural Networks (2016)
8.
Lin, L.J.: Reinforcement learning for robots using neural networks. Technical report CMU-CS-93-103, Carnegie Mellon University (1993)
9. Lukoševičius, M.: A practical guide to applying echo state networks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 659–686. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_36
10. Mania, H., Guy, A., Recht, B.: Simple random search provides a competitive approach to reinforcement learning. arXiv:1803.07055 (2018)
11. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: Proceedings of the International Conference on Machine Learning (2016)
12. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529 (2015)
13. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
14. Salimans, T., Ho, J., Chen, X., Sutskever, I.: Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864 (2017)
15. Schmidhuber, J., Wierstra, D., Gagliolo, M., Gomez, F.: Training recurrent networks by Evolino. Neural Comput. 19(3), 757–779 (2007)
16. Schrauwen, B., Wardermann, M., Verstraeten, D., Steil, J.J., Stroobandt, D.: Improving reservoirs using intrinsic plasticity. Neurocomputing 71(7–9), 1159–1171 (2008)
17. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354 (2017)
18. Spall, J.C.: Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Autom. Control 37(3), 332–341 (1992)
19. Such, F.P., Madhavan, V., Conti, E., Lehman, J., Stanley, K.O., Clune, J.: Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv:1712.06567 (2017)
20. Sutton, R.: Introduction to reinforcement learning with function approximation.
In: Tutorial at the Conference on Neural Information Processing Systems (2015)
21. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)

Simple Recurrent Neural Networks for Support Vector Machine Training

Rafet Sifa1,2,3(B), Daniel Paurat1,2, Daniel Trabold1,2, and Christian Bauckhage1,2,3
1 Fraunhofer Center for Machine Learning, Sankt Augustin, Germany
2 Fraunhofer IAIS, Sankt Augustin, Germany
{rafet.sifa,daniel.paurat,daniel.trabold,christian.bauckhage}@iais.fraunhofer.de
3 B-IT, University of Bonn, Bonn, Germany

Abstract. We show how to implement a simple procedure for support vector machine training as a recurrent neural network. Invoking the fact that support vector machines can be trained using Frank-Wolfe optimization, which in turn can be seen as a form of reservoir computing, we obtain a model that is of simpler structure and can be implemented more easily than those proposed in previous contributions.

1 Introduction

Support vector machines can be seen as neural networks with a single hidden layer (see Fig. 1). Since this insight is not new but dates back to work by Cortes and Vapnik [6], it seems odd that the literature on neurocomputing approaches to SVM training is rather scarce [1,7,11,13,16–18]. Moreover, while these contributions show that SVMs can be trained using recurrent neural networks, they are mainly concerned with continuous dynamical systems and, curiously, with how to implement those in electronic circuits. In this paper, we propose to train support vector machines by means of much simpler, time-discrete recurrent neural networks. We base our arguments on recent work in [2], where it was shown that recurrent neural networks can implement the Frank-Wolfe algorithm [8] for constrained convex optimization. That is, we show how the Frank-Wolfe algorithm allows for SVM training and how this approach can be interpreted in terms of neural reservoir computation.
For mathematical convenience, we focus on L2 support vector machines [12]; not because our approach would not work for classical SVMs, but because the equations for the dual problem of L2 SVM training are particularly easy to work with.

We begin our presentation with a brief review of L2 support vector machines for binary classification; in particular, we point out differences between L2 and classical L1 SVMs and clarify to what extent SVMs can be understood as neural networks. We then show how the Frank-Wolfe algorithm can train SVMs and how this process can be implemented by means of recurrent neural networks. We present and discuss didactic practical examples to illustrate this idea and conclude with a discussion of implications and suggestions for practical implementations.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 13–22, 2018. https://doi.org/10.1007/978-3-030-01424-7_2

2 L2 Support Vector Machines

Next, we briefly review the idea of L2 support vector machines for binary classification. The likely unfamiliar matrix-vector notation we introduce in passing is intended to simplify our subsequent discussion.

Consider a set of labeled training data {(xᵢ, yᵢ)}, i = 1, …, n, where the data xᵢ ∈ Rᵐ have been sampled from two distinct classes and the labels yᵢ ∈ {−1, +1} indicate class membership. Training a linear L2 support vector classifier

    y(x) = sign(xᵀw − θ)    (1)

is to determine suitable parameters w and θ. In its primal form, this problem consists in solving

    argmin_{w,θ,ξ}  wᵀw + θ² − ρ + C Σᵢ ξᵢ²
    s.t.  yᵢ (wᵀxᵢ − θ) ≥ ρ − ξᵢ    (2)

and we note that, contrary to classical L1 SVMs [6], the slack variables ξᵢ enter the objective in squared form. While this may improve generalization [12,15], our interest in L2 SVMs mainly stems from the fact that their Lagrangian duals are easy to work with. Introducing a data matrix X = [x₁, …, xₙ], a label vector y = [y₁, …, yₙ]ᵀ, and three n × n matrices

    Y = diag(y)    (3)
    Z = YᵀXᵀXY  ⇔  Zᵢⱼ = yᵢ xᵢᵀxⱼ yⱼ    (4)
    M = Z + yyᵀ + (1/C) I    (5)

it is straightforward to see [3] that—when written as a minimization problem—the dual problem to the one in (2) consists in solving

    argmin_α  αᵀMα
    s.t.  1ᵀα = 1,  α ≥ 0    (6)

where α = [α₁, …, αₙ]ᵀ is a vector of Lagrange multipliers. Once (6) has been solved, those elements αₛ of α that exceed zero identify the support vectors in X and thus allow for computing both parameters of the support vector machine

    w = Σ_{αₛ>0} αₛ yₛ xₛ = XYα    (7)
    θ = −Σ_{αₛ>0} αₛ yₛ = −1ᵀYα.    (8)

Fig. 1. Support vector machines are specific basis function networks. For an input vector x ∈ Rᵐ, they compute y(x) = sign(Σᵢ βᵢ φᵢ(x)) using basis functions φᵢ(x) = k(x, xᵢ) + 1 where k(x, xᵢ) is a linear or non-linear kernel function. In either case, the xᵢ denote training data and the weight vector β = Yα results from training the machine on this data.

Plugging these training results into (1) provides a classifier which, written in the matrix-vector notation introduced above, reads

    y(x) = sign(xᵀXYα + 1ᵀYα) = sign((xᵀX + 1ᵀ) Yα).    (9)

Introducing the shorthand β = Yα, we can think of this classifier as a basis function network [4]. In other words, writing (9) as

    y(x) = sign(Σᵢ βᵢ φᵢ(x))    (10)

we recognize it as an instance of the neural architecture in Fig. 1, where the basis functions in the hidden layer in our case are given by φᵢ(x) = xᵀxᵢ + 1. We further observe that, during training and application of this machine, i.e. in the expressions Z = YᵀXᵀXY and xᵀw = xᵀXYα, all (training) data vectors occur in the form of inner products. This of course allows for invoking the kernel trick, where inner products are replaced by kernel evaluations, so that the approach becomes applicable to non-linear settings.
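The matrices in (3)-(5) are straightforward to assemble. The following numpy sketch is our illustration (the helper name `l2_svm_matrices` is ours); it builds them for the linear kernel K = XᵀX, with one training point per column of X as in the text:

```python
import numpy as np

def l2_svm_matrices(X, y, C=10.0):
    """Assemble the matrices of the L2 SVM dual, Eqs. (3)-(5).
    X: (m, n) data matrix with one training point per column,
    y: (n,) labels in {-1, +1}."""
    n = y.size
    Y = np.diag(y.astype(float))             # Eq. (3): Y = diag(y)
    K = X.T @ X                              # linear kernel Gram matrix
    Z = Y @ K @ Y                            # Eq. (4): Z_ij = y_i x_i^T x_j y_j
    M = Z + np.outer(y, y) + np.eye(n) / C   # Eq. (5)
    return Y, Z, M
```

Since Z = (XY)ᵀ(XY) is positive semi-definite and (1/C) I is positive definite, M is positive definite, which is what makes the dual (6) a well-behaved quadratic minimization problem.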
In other words, given an appropriate Mercer kernel k : Rᵐ × Rᵐ → R, a non-linear L2 support vector classifier can be trained by letting Z = YᵀKY where Kᵢⱼ = k(xᵢ, xⱼ), and the trained classifier becomes

    y(x) = sign(Σ_{αₛ>0} k(x, xₛ) yₛ αₛ − θ) = sign(kᵀ(x) Yα + 1ᵀYα)    (11)
         = sign((kᵀ(x) + 1ᵀ) Yα)    (12)

where kᵢ(x) = k(x, xᵢ). Using β = Yα and φᵢ(x) = k(x, xᵢ) + 1, this classifier, too, can be expressed as in (10) and therefore is nothing but another instance of the neural network shown in Fig. 1. Finally, we note that a linear SVM is a kernel SVM where K = XᵀX and k(x) = Xᵀx. Henceforth, we will thus drop this distinction and only discuss the kernel case.

3 Frank-Wolfe Training of Support Vector Machines

One of the favorable properties of L2 support vector machines is that the dual training problem in (6) is of comparatively simple nature. Just as in the case of L1 SVMs, the minimization objective f(α) = αᵀMα is a quadratic form. However, in contrast to L1 SVMs, the two constraints 1ᵀα = 1 and α ≥ 0 constitute a simplicial rather than a box constraint. That is, the feasible set of the L2 SVM training problem in (6) is the standard simplex

    Δⁿ⁻¹ = { α ∈ Rⁿ | 1ᵀα = 1 ∧ α ≥ 0 }.    (13)

In other words, we are dealing with a quadratic minimization problem over an arguably simple compact convex set. The Frank-Wolfe algorithm shown in Algorithm 1 is an efficient iterative solver for this kind of problem.

Algorithm 1. Frank-Wolfe algorithm for iteratively solving (6)
  guess an initial, feasible point α₀ ∈ Δⁿ⁻¹, for instance, α₀ = (1/n) 1
  for t = 0, …, t_max do
    determine sₜ = argmin_{s ∈ Δⁿ⁻¹} ∇f(αₜ)ᵀ s = argmin_{s ∈ Δⁿ⁻¹} sᵀMαₜ
    update the learning rate ηₜ = 2/(t + 2)
    update the current estimate αₜ₊₁ = αₜ + ηₜ (sₜ − αₜ)
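Algorithm 1 can be sketched in a few lines of numpy. This is our illustration of the generic procedure (the helper name and the `t_max` default are ours); it exploits the fact, explained in Sect. 4, that the linear subproblem over the simplex is solved exactly by picking a single vertex:

```python
import numpy as np

def frank_wolfe_simplex(M, t_max=500):
    """Frank-Wolfe for  min_{alpha in simplex} alpha^T M alpha  (Algorithm 1).
    The linear subproblem over the simplex is solved exactly by picking
    the vertex (standard basis vector) with the smallest gradient entry."""
    n = M.shape[0]
    alpha = np.full(n, 1.0 / n)           # feasible initial point alpha_0 = (1/n) 1
    for t in range(t_max):
        grad = M @ alpha                  # gradient up to the constant factor 2
        j = np.argmin(grad)               # s_t = e_j, the minimizing vertex
        eta = 2.0 / (t + 2.0)             # step size eta_t = 2/(t+2)
        alpha = (1.0 - eta) * alpha + eta * np.eye(n)[j]
    return alpha
```

Because every update is a convex combination of feasible points, the iterate never leaves the simplex, without any projection step.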
Given an initial feasible guess αₜ₌₀ = (1/n) 1 for the solution, the basic idea in our setting is to determine the sₜ ∈ Δⁿ⁻¹ that minimizes sᵀ∇f(αₜ) and to apply conditional gradient updates αₜ₊₁ = αₜ + ηₜ (sₜ − αₜ), where the learning rate ηₜ ∈ [0, 1] decreases over time. This guarantees that updates will not leave the feasible set, and the efficiency of the algorithm stems from the fact that it turns a quadratic optimization problem into a series of simple linear optimization problems. Moreover, one can show that after t iterations the current estimate αₜ is O(1/t) from the optimal solution [5], which provides a convenient criterion for choosing the number t_max of iterations to be performed. For further details on the Frank-Wolfe algorithm, its properties and applications, we refer to [10] and [14].

4 Neural Training of Support Vector Machines

For the gradient of the objective function in (6), we simply have ∇f(α) = 2Mα, so that each iteration of the Frank-Wolfe algorithm has to compute

    sₜ = argmin_{s ∈ Δⁿ⁻¹} sᵀMαₜ    (14)

where we dropped the factor 2 as it exerts no influence on the outcome of argmin. Clearly, the expression on the right of (14) is linear in s and needs to be minimized over a compact convex set. Since the minima of a linear function over a compact convex set are necessarily attained at a vertex of that set, sₜ on the left of (14) must coincide with a vertex of Δⁿ⁻¹. Hence, as the vertices of the standard simplex in Rⁿ correspond to the standard basis vectors eⱼ ∈ Rⁿ, we can rewrite (14) as

    sₜ = argmin_{eⱼ} eⱼᵀMαₜ    (15)
       ≈ σ(Mαₜ).    (16)

Here, the non-linear, vector-valued function σ(z) introduced in (16) denotes the softmin operator whose components are given by

    σ(z)ᵢ = e^(−βzᵢ) / Σⱼ e^(−βzⱼ)    (17)

and we note that

    lim_{β→∞} σ(z) = eᵢ = argmin_{eⱼ} eⱼᵀz.    (18)

Given the relaxed optimization step in (16), we can rewrite the Frank-Wolfe updates for our problem as

    αₜ₊₁ = αₜ + ηₜ (sₜ − αₜ)    (19)
         = (1 − ηₜ) αₜ + ηₜ sₜ    (20)
         ≈ (1 − ηₜ) αₜ + ηₜ σ(Mαₜ).    (21)

But this is then to say that—by choosing an appropriate parameter β for the softmin function—the following non-linear dynamical system

    αₜ₊₁ = (1 − ηₜ) αₜ + ηₜ σ((YᵀKY + yyᵀ + (1/C) I) αₜ)    (22)
    βₜ = Yαₜ    (23)

mimics the Frank-Wolfe algorithm up to arbitrary precision and can therefore accomplish support vector machine training.

The equivalence of the Frank-Wolfe algorithm for SVM training and the non-linear dynamical system in (22), (23) is the main result of this paper. From the point of view of neural network research, the system in (22), (23) is of interest because it is structurally equivalent to the governing equations of the simple recurrent architectures considered in the area of reservoir computing [9]. In other words, we can think of this system in terms of a reservoir of n neurons whose synaptic connections are encoded in the matrix YᵀKY + yyᵀ + (1/C) I. The system evolves without inputs, its output weights are given by Y, and the learning rate ηₜ assumes the role of the leaking rate of the reservoir. At each time t, the next internal state αₜ₊₁ of the network is a convex combination of the current state and a nonlinear transformation of the synaptically weighted current state. Since ηₜ decays towards zero, states will stabilize and the output is guaranteed to approach a fixed point α* = lim_{t→∞} αₜ.

What is further worth noting about the system in (22), (23) is that the weight matrices YᵀKY + yyᵀ + (1/C) I and Y depend on the training data for the problem under consideration. From the point of view of a learning system, they could thus be interpreted as a form of short term memory. At the beginning of a learning episode, data is loaded into this memory and used to determine crucial properties (support vectors) of the problem at hand.
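A minimal numpy sketch of the dynamical system in (22), (23); this is our illustration, not the authors' implementation. The function names and parameter defaults are ours, and the max-shift inside the softmin is a standard numerical precaution not mentioned in the text:

```python
import numpy as np

def softmin(z, beta):
    """Softmin operator of (16)-(17); for beta -> inf it approaches
    the standard basis vector at the minimal entry of z."""
    e = np.exp(-beta * (z - z.min()))     # shift for numerical stability
    return e / e.sum()

def neural_svm_train(K, y, C=10.0, beta=100.0, t_max=100):
    """Run the system (22), (23): a softmin 'reservoir' whose fixed
    point solves the L2 SVM dual. Returns alpha and beta_out = Y alpha."""
    n = y.size
    Y = np.diag(y.astype(float))
    W = Y @ K @ Y + np.outer(y, y) + np.eye(n) / C   # reservoir weight matrix
    alpha = np.full(n, 1.0 / n)
    for t in range(t_max):
        eta = 2.0 / (t + 2.0)                        # decaying 'leaking rate'
        alpha = (1.0 - eta) * alpha + eta * softmin(W @ alpha, beta)
    return alpha, Y @ alpha
```

Given the returned output weights b = Yα, the classifier of (11) evaluates as sign(kᵀ(x) b + 1ᵀb) for a vector of kernel evaluations k(x).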
At the end of a learning episode, only those data points and labels required for decision making, i.e. those xₛ and yₛ for which αₛ > 0, need to be persisted in a long term memory to be able to compute the decision function in (11). In order for this memorization to be efficient, it would be desirable if α was sparse, because then only a few basis functions φᵢ and weights βᵢ could solve the problem satisfactorily.

5 Practical Examples

In this section, we consider several examples to investigate the behavior of the system in (22), (23) for training support vector machines. Note that, in order for these examples to be intuitive and interpretable, they are deliberately simple.

Figure 2 shows three training sets of 200 two-dimensional data points each. It also visualizes how a support vector machine with a Gaussian kernel solves the corresponding classification problem after having been trained using the Frank-Wolfe algorithm or, equivalently, the system in (22), (23) if the parameter of the softmin activation function of the reservoir neurons is set to β = ∞.

Figures 3 and 4 illustrate intermediate steps in learning such decision functions. Here, we considered a polynomial kernel and a Gaussian kernel and also replaced the sign function in the output of the classifier by tanh, so as to see more clearly how class regions and margins evolve over time. What is noticeable is that, in either case, the simple recurrent neural network model discussed in this paper is able to train robust classifiers in only moderately many, i.e. 100, iterations.

Fig. 2. Didactic data sets and support vector classifiers using Gaussian kernels: (a) xor, (b) nested circles, (c) two moons.

Fig. 3.
Evolution of a support vector classifier over 100 iterations of the system in (22), (23) using a 5th-degree polynomial kernel k(x, xᵢ) = (xᵀxᵢ + 1)⁵. Panels (a)-(j) show iterations t = 0, 1, 2, 3, 4, 10, 25, 50, 75, 100.

Fig. 4. Evolution of a support vector classifier over 100 iterations of the system in (22), (23) using a Gaussian kernel k(x, xᵢ) = exp(−(1/2σ²) ‖x − xᵢ‖²) where σ = 1/2. Panels (a)-(j) show iterations t = 0, 1, 2, 3, 4, 10, 25, 50, 75, 100.

A natural question to ask is then: how sensitive is neural SVM training to different choices of the parameter β of the reservoir activation function? To investigate this, we randomly created 1000 different training sets for the xor, nested circles, and two moons problems, used the system in (22), (23) to train SVMs with polynomial and Gaussian kernels, and plotted the average training error (measured in terms of 0–1 loss) over 100 training iterations. Somewhat surprisingly, we observe in Fig. 5 that the choice of β does not impact the capabilities of the corresponding networks to quickly reduce the training error. Again somewhat surprisingly, the figure also shows that networks with the theoretically optimal choice of β = ∞ need more time to converge to a good solution.

Fig. 5.
Evolution of average training errors (0–1 loss) over 100 iterations of the system in (22), (23), with curves for β = 10, 100, 1000, and ∞. Panels: (a) xor, polynomial kernel; (b) xor, Gaussian kernel; (c) nested circles, polynomial kernel; (d) nested circles, Gaussian kernel; (e) moons, polynomial kernel; (f) moons, Gaussian kernel. Regardless of the choice of the softmin activation parameter β, training errors decrease quickly.

However, Fig. 6 indicates that the quick learning behavior for parameters β < ∞ comes at a price. Here we plot the average number of support vectors (as a percentage of all training data) identified in each iteration of the training process. What is noticeable is that running the recurrent network in (22), (23) with larger values of β yields sparser solutions, and letting β = ∞ yields much sparser solutions and thus more efficient classifiers.

All in all, these experiments illustrate that the simple dynamical system in (22), (23) or, equivalently, rather simple recurrent neural network models known from reservoir computing can train SVMs. This appears to be independent of the choice of kernel function, but care needs to be exercised when choosing the activation function of the neurons in the reservoir.

Fig. 6. Evolution of the average number (percentage of training data) of support vectors identified in 100 iterations of the system in (22), (23), with curves for β = 10, 100, 1000, and ∞. Panels as in Fig. 5.
6 Conclusion

Building on work in [2], we considered Frank-Wolfe optimization for the task of training support vector machines and showed how to interpret this as a form of reservoir computing. In other words, we showed that a recurrent reservoir of neurons governed by simple dynamics can identify support vectors. Since support vector machines themselves are basis function networks, our results underline that both training and running an SVM are forms of neurocomputing. Moreover, the mechanism discussed in this paper is interpretable in terms of short- and long-term memory processes. At the beginning of a learning episode, data is encoded in the weights of a neural architecture for training; upon convergence of a learning episode, crucial information is persisted in the basis functions and weights of a neural architecture for classification.

With respect to practical application, we note that our experimental results were obtained from direct implementations of the matrix-vector equations and softmin activations discussed throughout the text. However, we mainly used this notation because it seamlessly reveals that to train an SVM is to run a dynamical system. For real-world applications, training will be more efficient when using (15) rather than (16). Likewise, implementations of the resulting classifier should use the equation in the middle of (11) rather than Eq. (12).

References

1. Anguita, D., Boni, A.: Improved neural network for SVM learning. IEEE Trans. Neural Netw. 13(2), 1243–1244 (2002)
2. Bauckhage, C.: A neural network implementation of Frank-Wolfe optimization. In: Lintas, A., Rovetta, S., Verschure, P., Villa, A. (eds.) ICANN 2017. LNCS, vol. 10613, pp. 219–226. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68600-4_26
3. Bauckhage, C.: The dual problem of L2 SVM training. Technical report, ResearchGate (2018)
4. Bishop, C.: Neural Networks for Pattern Recognition.
Oxford University Press, Oxford (1995)
5. Clarkson, K.: Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Trans. Algorithms 6(4), 63:1–63:30 (2010)
6. Cortes, C., Vapnik, V.: Support vector networks. Mach. Learn. 20(3), 273–297 (1995)
7. Duch, W.: Support vector neural training. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 67–72. Springer, Heidelberg (2005). https://doi.org/10.1007/11550907_11
8. Frank, M., Wolfe, P.: An algorithm for quadratic programming. Nav. Res. Logist. Q. 3(1–2), 95–110 (1956)
9. Jäger, H., Haas, H.: Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science 304(5667), 78–80 (2004)
10. Jaggi, M.: Revisiting Frank-Wolfe: projection-free sparse convex optimization. J. Mach. Learn. Res. 28(1), 427–435 (2013)
11. Jändel, M.: Biologically relevant neural network architectures for support vector machines. Neural Netw. 49, 39–50 (2014)
12. Koshiba, Y., Abe, S.: Comparison of L1 and L2 support vector machines. In: Proceedings IJCNN (2003)
13. Perfetti, R., Ricci, E.: Analogue neural network for support vector machine learning. IEEE Trans. Neural Netw. 17(4), 1085–1091 (2006)
14. Sifa, R.: An overview of Frank-Wolfe optimization for stochasticity constrained interpretable matrix and tensor factorization. In: ICANN 2018 (2018)
15. Tang, Y.: Deep learning using linear support vector machines. arXiv:1306.0239 [cs.LG] (2013)
16. Vincent, P., Bengio, Y.: A neural support vector network architecture with adaptive kernels. In: Proceedings IJCNN (2000)
17. Xia, Y.: A new neural network for solving linear and quadratic programming problems. IEEE Trans. Neural Netw. 7(6), 1544–1547 (1996)
18. Yang, Y., He, Q., Hu, X.: A compact neural network for training support vector machines.
Neurocomputing 86, 193–198 (2012)

RNN-SURV: A Deep Recurrent Model for Survival Analysis

Eleonora Giunchiglia1(B), Anton Nemchenko2, and Mihaela van der Schaar2,3,4
1 DIBRIS, Università di Genova, Genova, Italy
eleonora.giunchiglia@icloud.com
2 Department of Electrical and Computer Engineering, UCLA, Los Angeles, USA
3 Department of Engineering Science, University of Oxford, Oxford, UK
4 Alan Turing Institute, London, UK

Abstract. Current medical practice is driven by clinical guidelines which are designed for the “average” patient. Deep learning is enabling medicine to become personalized to the patient at hand. In this paper we present a new recurrent neural network model for personalized survival analysis called rnn-surv. Our model is able to exploit censored data to compute both the risk score and the survival function of each patient. At each time step, the network takes as input the features characterizing the patient and the identifier of the time step, creates an embedding, and outputs the value of the survival function in that time step. Finally, the values of the survival function are linearly combined to compute the unique risk score. Thanks to the model structure and the training designed to exploit two loss functions, our model achieves a better concordance index (C-index) than state of the art approaches.

1 Introduction

Healthcare is moving from a population-based model, in which the decision making process is targeted at the “average” patient, to an individual-based model, in which each diagnosis is based on the features characterizing the given patient. This process has been boosted by recent developments in the Deep Learning field, which has been proven not only to get impressive results in its traditional areas, but also to perform very well in medical tasks.
In particular, in the medical field, the study of the time-to-event, i.e., the expected duration of time until one or more events happen, such as death or recurrence of a disease, is of vital importance. Nevertheless, it is often made more complicated by the presence of censored data, i.e., data in which the information about the time-to-event is incomplete, as happens, e.g., when a patient drops out of a clinical trial. Traditionally, these issues are tackled in a field called Survival Analysis, a branch of statistics in which special models have been proposed to predict the time-to-event exploiting censored data, while only a few deep learning approaches have such an ability (e.g., [13,28]). About the latter, it is interesting to note that most of the encountered deep learning approaches are based on feedforward neural networks and, at least so far, there do not seem to exist published results deploying recurrent neural networks despite the sequential nature of the problem.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 23–32, 2018. https://doi.org/10.1007/978-3-030-01424-7_3

In this paper we present a new recurrent neural network model handling censored data and computing, for each patient, both a survival function and a unique risk score. The survival function is computed by considering a series of binary classification problems, each leading to the estimation of the survival probability in a given interval of time, while the risk score is obtained through a linear combination of the estimates. rnn-surv's three main features are:

1. its ability to model the possible time-variant effects of the covariates,
2. its ability to model the fact that the survival probability estimate at time t is a function of each survival probability estimate at t′ with t′ < t, and
3. its ability to compute a highly interpretable risk score.
The first two are given by the recurrent structure, while the last is given by the linear combination of the estimates.

rnn-surv is tested on three small publicly available datasets and on two large heart transplantation datasets. On these datasets rnn-surv performs significantly better than the state of the art models, always resulting in a higher C-index (up to 28.4%). We further show that if we simplify the model we always get worse performances, hence showing the significance of rnn-surv's different features.

This paper is structured as follows. We start with the analysis of the related work (Sect. 2), followed by background on Survival Analysis (Sect. 3). Then, we present our model (Sect. 4), followed by the experimental analysis (Sect. 5) and, finally, the conclusions (Sect. 6).

2 Related Work

The problem of survival analysis has attracted the attention of many machine learning scientists, giving birth to models such as random survival forests [11], dependent logistic regressors [26], multi-task learning models for survival analysis [17], semi-proportional hazard models [27] and support vector regressors for censored data [21], none of which are based on neural networks.

The works that have been done in the field of Survival Analysis using Deep Learning techniques can be divided into three main subcategories, which stemmed from just as many seminal papers:

(1) Faraggi and Simon [7] generalized the Cox Proportional Hazards model (CPH) [5], allowing non-linear functions instead of the traditional linear combinations of covariates by modeling the relationship between the input covariates and the corresponding risk with a single hidden layer feedforward neural network. This work has been later resumed in [13] and [28]. Contrarily to rnn-surv, CPH and the models in [13] and [28] assume time-invariant effects of the covariates.
(2) Liestbl, Andersen and Andersen [18] subdivided time into K intervals, assumed the hazard to be constant in each interval and proposed a feedforward neural network with a single hidden layer that for each patient outputs the conditional event probabilities pₖ = P(T ≥ tₖ | T ≥ tₖ₋₁) for k = 1, …, K, T being the time-to-event of the given patient. This work was then expanded in [2], but even in this later work the value of the estimate pₖ₋₁ for a given patient is not exploited for the computation of the estimate pₖ for the same patient. On the contrary, rnn-surv, thanks to the presence of recurrent layers, is able to capture the intrinsic sequential nature of the problem.

(3) Buckley and James [4] developed a linear regression model that deals with each censored datapoint by computing its most likely value on the basis of the available data. This approach was then generalized using neural networks in various ways (e.g., [6]). Unlike rnn-surv, in [4] and the works following it, estimated and known data are treated in the same way during the regression phase.

3 Background on Survival Analysis

Consider a patient i; we are interested in estimating the duration Tᵢ of the interval between the event of interest for i and the time t₀ at which we start to measure time for i. We allow for right censored data, namely, data for which we do not know when the event occurred, but only that it did not occur before a censoring time Cᵢ. The observed time Yᵢ is defined as Yᵢ = min(Tᵢ, Cᵢ), and each datapoint corresponds to the pair (Yᵢ, δᵢ), where δᵢ = 0 if the event is censored (in which case Yᵢ = Cᵢ) and δᵢ = 1 otherwise.

In Survival Analysis, the standard functions used to describe Tᵢ are the survival function and the hazard function [15].

1. The survival function Sᵢ(t) is defined as:

    Sᵢ(t) = Pr(Tᵢ > t)    (1)

with Sᵢ(t₀) = 1.

2. The hazard function hᵢ(t) is defined as:

    hᵢ(t) = lim_{dt→0} Pr(t ≤ Tᵢ < t + dt | Tᵢ ≥ t) / dt.    (2)

Further, in order to offer a fast understanding of the conditions of a patient, a common practice in the field is to create a risk score rᵢ for each patient i: the higher the score, the higher the risk of the occurrence of the event of interest.

4 RNN-SURV

In order to transform the survival analysis problem into a series of binary decision problems, we assume that the maximal observed time is divided into K intervals (t₀, t₁], …, (t_{K−1}, t_K] and that the characteristic function modeling Tᵢ is constant within each interval (tₖ₋₁, tₖ] with k = 1, …, K. Given a patient i, the purpose of our model is to output both an estimate ŷᵢ⁽ᵏ⁾ of the survival probability Sᵢ for the kth time interval and a risk score rᵢ.

Fig. 1. rnn-surv with N1 = 2 feedforward layers, followed by N2 = 2 recurrent layers.

4.1 The Structure of the Model

The overall structure of rnn-surv is represented in Fig. 1 and is described and motivated below:

1. the input of each layer is given by the features xᵢ of each patient i together with the time interval identifier k. Thanks to this input, rnn-surv is able to capture the time-variant effect of each feature over time,
2. taking the idea from the natural language processing field, the input is then elaborated by N1 embedding layers. Thanks to the embeddings we are able to create a more meaningful representation of our data, and
3. the output of the embedding layers is then passed through N2 recurrent layers and a sigmoid non-linearity. This generates the estimates ŷᵢ⁽¹⁾, …, ŷᵢ⁽ᴷ⁾, from which we can compute the risk score with the following equation:

    r̂ᵢ = Σ_{k=1}^K wₖ ŷᵢ⁽ᵏ⁾    (3)

where wₖ for k = 1, …, K are the parameters of the last layer of rnn-surv. Thanks to the linear combination, the risk score, whose quality is evaluated with the C-index [9], is highly interpretable.
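Since the quality of the risk score is evaluated with the C-index, the following sketch shows the standard definition of that metric as our illustration of the textbook formula from [9], not the authors' evaluation code:

```python
def c_index(time, event, risk):
    """Concordance index: among comparable pairs (i with an observed event
    and j with a longer observed time), the fraction where the model
    assigns the earlier event the higher risk. Ties in risk count 1/2."""
    num, den = 0.0, 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                      # censored subjects cannot anchor a pair
        for j in range(n):
            if time[i] < time[j]:         # j observed longer than i's event time
                den += 1.0
                if risk[i] > risk[j]:
                    num += 1.0
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den
```

A perfect ranking yields 1.0, a fully reversed one 0.0, and random scores about 0.5, which is why C-index improvements over baselines are reported as the headline metric here.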
Further, in order to handle the vanishing gradient problem, the feedforward layers use the ReLU non-linearity [19], while the recurrent layers are constituted of LSTM cells [10], which are defined as:

   i_t = σ(W_i [w_t, h_{t−1}] + b_i)
   f_t = σ(W_f [w_t, h_{t−1}] + b_f)
   o_t = σ(W_o [w_t, h_{t−1}] + b_o)    (4)
   g_t = f(W_g [w_t, h_{t−1}] + b_g)
   c_t = f_t ∗ c_{t−1} + i_t ∗ g_t
   h_t = o_t ∗ f(c_t).

4.2 Training

Since the neural network predicts both the discrete survival function and the risk score for each datapoint, it is trained to jointly minimize two different loss functions:

1. The first one is a modified cross-entropy function able to take into account the censored data, defined as:

   L1 = − Σ_{k=1}^{K} Σ_{i∈U_k} [ I(Y_i > t_k) log ŷ_i^(k) + (1 − I(Y_i > t_k)) log(1 − ŷ_i^(k)) ]    (5)

   where U_k = {i | δ_i = 1 or C_i > t_k} represents the set of individuals that are uncensored throughout the entire observation time or for which censoring has not yet happened at the end of the kth time interval.

2. The second one is an upper bound of the negative C-index [23], defined as:

   L2 = − (1/|C|) Σ_{(i,j)∈C} [ 1 + log σ(r̂_j − r̂_i) / log 2 ]    (6)

   where C is the set of pairs {(i, j) | δ_i = 1 and Y_i ≤ Y_j}. The advantage of minimizing (6) instead of the negative C-index is that the former still leads to good results [23], while the latter is far more expensive to compute and would have made the experimental evaluation impractical.

The two losses L1 and L2 are then linearly combined, with the hyperparameters of the sum optimized during the validation phase.

In order to avoid overfitting, we apply dropout to both the feedforward layers [22] and the recurrent layers [8], together with a holdout-based early stopping as described in [20]. Further, we add L2-regularization to the linear combination of the losses. The entire neural network is trained using mini-batching and the Adam optimizer [14].
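The two losses can be sketched directly from Eqs. (5) and (6). The following NumPy functions are a pedagogical reconstruction, not the authors' implementation; the index sets U_k and the pair set C are passed in explicitly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_l1(y_hat, Y, t, U):
    """Censored cross-entropy of Eq. (5).
    y_hat: (n, K) survival estimates, Y: (n,) observed times,
    t: (K,) interval upper bounds, U: list of K index lists U_k."""
    total = 0.0
    for k in range(len(t)):
        for i in U[k]:
            target = float(Y[i] > t[k])            # I(Y_i > t_k)
            p = y_hat[i, k]
            total += target * np.log(p) + (1.0 - target) * np.log(1.0 - p)
    return -total

def loss_l2(r_hat, C_pairs):
    """Bound-based ranking loss of Eq. (6);
    C_pairs = {(i, j) | delta_i = 1 and Y_i <= Y_j}."""
    terms = [1.0 + np.log(sigmoid(r_hat[j] - r_hat[i])) / np.log(2.0)
             for (i, j) in C_pairs]
    return -float(np.mean(terms))
```

A training step would then minimize a weighted sum of the two, with the weighting tuned on the validation set as described above.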
5 Experimental Analysis

All our experiments are conducted on two large datasets, UNOS Transplant and UNOS Waitlist, from the United Network for Organ Sharing (UNOS)1, and on three publicly available, small datasets: AIDS2, FLCHAIN and NWTCO.2 In each experiment we use a 60/20/20 split into training, validation and test sets, and early stopping is triggered after no validation gain for 25 consecutive epochs. The main characteristics of these datasets are shown in Table 1, while the structure of rnn-surv for each dataset is shown in Table 2. The performance of our model is measured using the C-index [9].3

1 https://www.unos.org/data/.

Table 1. Datasets description

Dataset          Num. features   Num. patients   % Censored   Missing data
UNOS Transplant  53              60400           51.3         Yes
UNOS Waitlist    27              36329           48.9         Yes
NWTCO            9               4028            85.8         No
FLCHAIN          26              7874            72.5         Yes
AIDS2            12              2843            38.1         No

Table 2. Structure of the model for each experiment.

                        UNOS Transplant   UNOS Waitlist   NWTCO   FLCHAIN   AIDS2
# FF layers             2                 2               3       3         2
# recurrent layers      2                 2               2       2         2
# neurons I FF layer    53                33              18      45        22
# neurons II FF layer   51                35              18      40        25
# neurons III FF layer  -                 -               18      35        -
LSTM state size         55                26              17      32        15

5.1 Preprocessing

Our datasets present missing data and thus require a preprocessing phase. UNOS Transplant and UNOS Waitlist contain data about patients who registered in order to undergo heart transplantation during the years from 1985 to 2015. In particular, UNOS Transplant contains data about patients who have already undergone the surgery, while UNOS Waitlist contains data about patients who are still waitlisted. From the complete datasets, we discard 12 features that can be obtained only after transplantation and all the features for which more than 10% of the patients have missing information. In order to deal with the missing data on the remaining 53 and 27 features, we conduct 10 multiple imputations using Multiple Imputation by Chained Equations (MICE) [24].
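For reference, Harrell's C-index counts, among comparable patient pairs, those in which the patient with the earlier observed event also has the higher risk score. The experiments use the lifelines implementation; the following is a self-contained pedagogical sketch (tie-handling conventions vary between implementations):

```python
def c_index(Y, delta, risk):
    """Harrell's C-index: fraction of comparable pairs (Y_i < Y_j with
    the event at Y_i observed) whose risk scores are correctly ordered;
    ties in risk count as 0.5."""
    concordant, comparable = 0.0, 0
    n = len(Y)
    for i in range(n):
        if delta[i] != 1:
            continue                   # the earlier time must be an observed event
        for j in range(n):
            if i == j or Y[i] >= Y[j]:
                continue
            comparable += 1
            if risk[i] > risk[j]:
                concordant += 1.0
            elif risk[i] == risk[j]:
                concordant += 0.5
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking, which is why the tables below report values between roughly 0.5 and 0.9.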
The three small datasets contain the following data:

1. NWTCO: contains data from the National Wilm's Tumor Study [3],
2. FLCHAIN: contains half of the data collected during a study [16] about the possible relationship between serum FLC and mortality, and
3. AIDS2: contains data on patients diagnosed with AIDS in Australia [25].

2 https://vincentarelbundock.github.io/Rdatasets/datasets.html/.
3 Implementation by the lifelines package.

Table 3. Performance, in terms of C-index, of rnn-surv, CPH, AAH, deep-surv, rfs and mtlsa, together with the 95% confidence interval for the mean C-index. The * indicates a p-value < 0.05, ** a p-value < 0.01.

           UNOS Transp.           UNOS Waitlist          NWTCO                 FLCHAIN                AIDS2
CPH        0.566** (0.565–0.567)  0.642** (0.637–0.647)  0.706 (0.687–0.725)   0.883* (0.879–0.887)   0.558 (0.546–0.570)
AAH        0.561** (0.557–0.565)  0.636** (0.632–0.640)  0.710 (0.601–0.719)   0.885 (0.879–0.891)    0.557 (0.542–0.572)
deep-surv  0.566** (0.560–0.572)  0.645* (0.638–0.652)   0.706 (0.686–0.726)   0.835 (0.774–0.896)    0.558 (0.532–0.584)
rfs        0.563** (0.561–0.565)  0.646* (0.642–0.650)   0.663* (0.648–0.678)  0.828 (0.765–0.891)    0.501** (0.489–0.513)
mtlsa      0.484** (0.480–0.488)  0.529** (0.525–0.533)  0.595* (0.567–0.623)  0.696** (0.688–0.704)  0.520* (0.500–0.540)
rnn-surv   0.587 (0.583–0.591)    0.656 (0.652–0.660)    0.724 (0.697–0.751)   0.894 (0.886–0.902)    0.573 (0.553–0.593)

For these datasets, we complete the missing data using the mean value for the continuous features and the most frequent value for the categorical ones. Once the missing data are completed, we use one-hot encoding for the categorical features and standardize each feature so that it has mean μ = 0 and standard deviation σ = 1.
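The imputation and standardization steps just described can be sketched as follows; the toy matrix and the missing-value markers are illustrative assumptions, not the actual data:

```python
import numpy as np

# Toy dataset: column 0 continuous (NaN marks missing), column 1
# categorical codes (-1 marks missing). Values are illustrative only.
X = np.array([[1.0, 0.0],
              [np.nan, 1.0],
              [3.0, 1.0],
              [2.0, -1.0]])

# 1. Mean-impute the continuous feature.
cont = X[:, 0].copy()
cont[np.isnan(cont)] = np.nanmean(X[:, 0])

# 2. Mode-impute the categorical feature (most frequent observed value).
cat = X[:, 1].copy()
observed = cat[cat != -1].astype(int)
cat[cat == -1] = np.bincount(observed).argmax()

# 3. One-hot encode the categorical feature.
categories = np.unique(cat)
onehot = (cat[:, None] == categories[None, :]).astype(float)

# 4. Standardize the continuous feature to mean 0, std 1.
z = (cont - cont.mean()) / cont.std()
```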
5.2 Comparison with Other Models

We have compared rnn-surv with two traditional Survival Analysis models, CPH and the Aalen Additive Hazards model (AAH) [1], and with three recent models that combine Machine Learning with Survival Analysis: rfs [11], deep-surv [13] and mtlsa [17]. Both CPH and AAH have been implemented using the lifelines package4, while we deployed the randomForestSRC package5 for rfs, the deepsurv package6 for deep-surv and the mtlsa package7 for mtlsa.

The results shown in Table 3 are obtained using k-fold cross validation (with k = 5). As can be seen from the table, rnn-surv outperforms the other models on all the datasets. In particular, the biggest improvements are obtained with respect to mtlsa, with a peak of 28.4% on the FLCHAIN dataset.

5.3 Estimating the Survival Curves

To further demonstrate the good results obtained by rnn-surv, in Fig. 2 we show some of the survival curves obtained on the largest dataset available, the UNOS Transplant dataset.

4 https://github.com/CamDavidsonPilon/lifelines/.
5 https://cran.r-project.org/web/packages/randomForestSRC/.
6 https://github.com/jaredleekatzman/DeepSurv/.
7 https://github.com/yanlirock/MTLSA/.

Fig. 2. Performance of rnn-surv on the UNOS Transplant dataset on a 36-month horizon on the test set. (a) Average survival function obtained with rnn-surv and the Kaplan-Meier curve [12]. (b) Average survival functions obtained with rnn-surv and Kaplan-Meier curves for two subgroups of patients: patients who experienced an infection and patients who did not. (c) Kaplan-Meier curve together with the survival curves of two different patients (P1: Patient 1, P2: Patient 2).

Figure 2 shows that our model is able to capture the average trend of the survival curves, both for the whole population and for subsets of it. Further, rnn-surv demonstrates great discriminative power: it is able to plot a unique survival function for each patient and, as shown in Fig.
2(c), the survival curves can be very different one from another and from the average survival curve. 5.4 Analysis of the Model We now analyze how the different main components of rnn-surv contribute to its good performances. In particular, we consider the model without the three main features of the model: 1. We first consider the case in which we do not have the feedforward layers, i.e., with N1 = 0; 2. Then the case in which the interval identifier k as input to the feedforward layer is always set to 1; 3. Finally the case in which the model has only one likelihood, i.e., L2. The C-index of the various versions and of the complete model on the different datasets are shown in Table 4. In the Table the best results are in bold, while the worst results are underlined. As it can be seen, the best performances are always obtained by the complete model, meaning that all the different components have a positive contribution. Interestingly, the worst performances are obtained when we disable the L1 score on the large datasets and the feedforward layers in the small ones. The explanation for the very positive contribution of using both the L1 and L2 scores on the two large datasets is that L1 allows to take into account the intermediate performances of the network when computing (1) (K)ŷi , . . . , ŷi . On the other hand, for the small datasets, the positive contribution of using the two scores is superseded by the feedforward layers and this can be explained by the characteristics of the datasets presenting a majority of discrete features. RNN-SURV: A Deep Recurrent Model for Survival Analysis 31 Table 4. Performances, in terms of C-index, of the complete model compared with its incomplete versions. 
Dataset          Without k input   Without L1   Without FF   rnn-surv
UNOS Transplant  0.583             0.501        0.562        0.587
UNOS Waitlist    0.653             0.516        0.623        0.656
NWTCO            0.683             0.665        0.578        0.724
FLCHAIN          0.874             0.874        0.865        0.894
AIDS2            0.558             0.542        0.535        0.573

6 Conclusions

In this paper we have presented rnn-surv: a new recurrent neural network model for predicting a personalized risk score and survival probability function for each patient in the presence of censored data. The proposed model has three main distinguishing features, each having a positive impact on its performance on two large and three small, publicly available datasets. Our experiments show that rnn-surv always performs much better than competing approaches when considering the C-index, improving the state of the art by up to 28.4%.

References

1. Aalen, O.: A model for nonparametric regression analysis of counting processes. In: Klonecki, W., Kozek, A., Rosiński, J. (eds.) Mathematical Statistics and Probability Theory. LNS, vol. 2, pp. 1–25. Springer, New York (1980). https://doi.org/10.1007/978-1-4615-7397-5_1
2. Biganzoli, E., Boracchi, P., Mariani, L., Marubini, E.: Feed forward neural networks for the analysis of censored survival data: a partial logistic regression approach. Stat. Med. 17, 1169–1186 (1998)
3. Breslow, N.E., Chatterjee, N.: Design and analysis of two-phase studies with binary outcome applied to Wilm's tumour prognosis. Appl. Stat. 48, 457–468 (1999)
4. Buckley, J., James, I.: Linear regression with censored data. Biometrika 66(3), 429–436 (1979)
5. Cox, D.R.: Partial likelihood. Biometrika 62(2), 269 (1975)
6. Dezfouli, H.N.: Improving gastric cancer outcome prediction using time-point artificial neural networks models. Cancer Inform. 16, 117693511668606 (2017)
7. Faraggi, D., Simon, R.: A neural network model for survival data. Stat. Med. 14(1), 73–82 (1995)
8. Gal, Y., Ghahramani, Z.: A theoretically grounded application of dropout in recurrent neural networks. In: 29th NIPS, pp. 1019–1027 (2016)
9.
Harrell, F.J., Califf, R., Pryor, D., Lee, K., Rosati, R.: Evaluating the yield of medical tests. JAMA 247(18), 2543–2546 (1982)
10. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
11. Ishwaran, H., Kogalur, U.B., Blackstone, E.H., Lauer, M.S.: Random survival forests. Ann. Appl. Stat. 2, 841–860 (2008)
12. Kaplan, E.L., Meier, P.: Nonparametric estimation from incomplete observations. J. Am. Stat. Assoc. 53, 457–481 (1958)
13. Katzman, J., Shaham, U., Cloninger, A., Bates, J., Jiang, T., Kluger, Y.: Deep survival: a deep Cox proportional hazards network. CoRR (2016)
14. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014). http://arxiv.org/abs/1412.6980
15. Klein, J.P., Moeschberger, M.L.: Survival Analysis: Techniques for Censored and Truncated Data, 2nd edn. Springer, New York (2003)
16. Kyle, R., et al.: Use of monoclonal serum immunoglobulin free light chains to predict overall survival in the general population. N. Engl. J. Med. 354, 1362–1369 (2006)
17. Li, Y., Wang, J., Ye, J., Reddy, C.K.: A multi-task learning formulation for survival analysis. In: 22nd ACM SIGKDD, KDD 2016, pp. 1715–1724. ACM, New York (2016)
18. Liestøl, K., Andersen, P.K., Andersen, U.: Survival analysis and neural nets. Stat. Med. 13(12), 1189–1200 (1994)
19. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: 27th ICML, pp. 807–814 (2010)
20. Prechelt, L.: Early stopping—but when? In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 53–67. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_5
21. Shivaswamy, P.K., Chu, W., Jansche, M.: A support vector approach to censored targets. In: Proceedings of 7th IEEE ICDM, ICDM 2007, pp. 655–660. IEEE Computer Society (2007). https://doi.org/10.1109/ICDM.2007.93
22.
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
23. Steck, H., Krishnapuram, B., Dehing-Oberije, C., Lambin, P., Raykar, V.C.: On ranking in survival analysis: bounds on the concordance index. In: Platt, J.C., Koller, D., Singer, Y., Roweis, S.T. (eds.) 20th NIPS, pp. 1209–1216. Curran Associates Inc., New York (2008)
24. Van Buuren, S., Oudshoorn, K.: Flexible multivariate imputation by MICE. TNO, Leiden (1999)
25. Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S. Springer, Heidelberg (2002). https://doi.org/10.1007/978-0-387-21706-2
26. Yu, C.N., Greiner, R., Lin, H.C., Baracos, V.: Learning patient-specific cancer survival distributions as a sequence of dependent regressors. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) 24th NIPS, pp. 1845–1853. Curran Associates Inc., New York (2011)
27. Zhang, J., Chen, L., Vanasse, A., Courteau, J., Wang, S.: Survival prediction by an integrated learning criterion on intermittently varying healthcare data. In: 30th AAAI, AAAI 2016, pp. 72–78. AAAI Press (2016)
28. Zhu, X., Yao, J., Huang, J.: Deep convolutional neural network for survival analysis with pathological images. In: IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2016, Shenzhen, China, pp. 544–547 (2016)

Do Capsule Networks Solve the Problem of Rotation Invariance for Traffic Sign Classification?

Jan Kronenberger and Anselm Haselhoff
Computer Science Institute, Hochschule Ruhr West, Mülheim, Germany
{jan.kronenberger,anselm.haselhoff}@hs-ruhrwest.de

Abstract. Detecting and classifying traffic signs is a very important step toward future autonomous driving. In contrast to earlier approaches with handcrafted features, modern neural networks learn the representation of the classes themselves.
Current convolutional neural networks achieve very high accuracy when classifying images, but their robustness to shift and rotation is a big weakness. In this work an evaluation of a new technique, Capsule Networks, is performed and the results are compared to a standard Convolutional Neural Network and a Spatial Transformer Network. Moreover, various methods for augmenting the training data are evaluated. This comparison shows the big advantages of Capsule Networks but also their restrictions. They give a big boost in solving the problems mentioned above, but their computational complexity is much higher than that of convolutional neural networks.

Keywords: Neural networks · Capsule network · Rotation invariance

1 Introduction

As long as there are human drivers left on the road, (semi-)autonomous cars have to read the signs which are designed for humans. Cameras imitate the human visual system most closely, so the artificial intelligence has to analyze the images captured by these sensors. In early computer vision systems the features (and possibly relations) of the objects to be detected were handcrafted. These systems had problems handling the very high intra-class variability. Neural networks can learn abstract representations of the given data themselves. Because these networks statistically approximate the training data, they are not good at recognizing new data which is shifted or rotated. There are basically two options to avoid this. One option is to generate more training data based on the given data but with a larger variance in rotation and shift. The other one is to use pooling layers in the network. Pooling reduces the resolution of the features to ignore these small shifts or rotations in the representation. It also gives a big performance advantage, because smaller feature maps are processed later in the network. But while downsampling the data, some information is lost.
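The trade-off just described can be made concrete with a toy 2×2 max-pooling example: a one-pixel shift inside a pooling block leaves the pooled map unchanged (the desired tolerance), but the exact position inside the block is lost (the information loss). The arrays are illustrative.

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling with stride 2: keep the strongest response in
    each block and discard its exact position within the block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A toy feature map and a copy shifted right by one pixel; the active
# pixels stay inside their 2x2 pooling blocks.
a = np.array([[1, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 2, 0],
              [0, 0, 0, 0]])
b = np.roll(a, 1, axis=1)

# Both inputs pool to the same 2x2 map: the small shift is absorbed,
# but the within-block positions can no longer be recovered.
```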
CNNs also don't consider spatial relationships between the subfeatures of the objects.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 33–40, 2018. https://doi.org/10.1007/978-3-030-01424-7_4

34 J. Kronenberger and A. Haselhoff

The newest approach, from Sabour et al. [12], tries to address this problem with capsules. These capsules contain multiple convolutional layers and try to represent abstract features. Inside one capsule this feature is represented in different rotations and scales, giving the capsule the ability to output the presence and multiple characteristics of this feature. This is inspired by the human vision and recognition system, where not only the presence of some features is analyzed but also their relations. This paper compares three techniques and their performance in scenarios with rotated images.

2 Related Work

Most work on image classification does not consider a special technique to model the rotation or shift of the object classes. Either the training set already contains enough variants or more data is augmented as in Sect. 3.2. Generating augmented data is the most common way of achieving some rotation invariance. Shijie et al. [14] demonstrated increased performance when adding augmented data to the training set. But they also remarked that adding too much augmentation may have negative effects. Fukushima [4] follows a proposal of Hubel and Wiesel [6] to implement complex and simple cells in the neural network; the complex cells perform an early type of pooling. This tries to adapt the human vision system. Another approach is presented by Simard et al. [15]. They created so-called tangent vectors which compactly represent the transformation invariance. Hadsell et al. [5] introduce a new technique to learn an invariant mapping using prior knowledge.
Their results also mention a big disadvantage of neural networks, which can also learn lighting and other variabilities instead of focusing on the objects. The Spatial Transformer Networks, proposed by Jaderberg et al. [7], can spatially manipulate the data within the network. These networks learn invariance to multiple transformations. Marcos et al. [10] extract rotation-invariant features from canonical filters. By doing so, the rotation invariance is directly encoded in the model. Cohen et al. [3] created so-called G-convolutions, in which multiple translations and rotations by 90 degrees are composed. Zhou et al. [17] introduced Oriented Response Networks, which use Active Rotation Filters that actively rotate during convolution and produce feature maps with location and orientation explicitly encoded.

The Capsule Network proposed by Sabour et al. [12] uses a different structure than traditional neurons. Instead of using weights and biases as in Eq. 1, where the data is represented by scalars,

   a_j = Σ_i w_i x_i + b,    (1)

they use no bias in their Eq. 2, and the data is represented as vectors:

   s_j = Σ_i c_ij û_{j|i},  û_{j|i} = W_ji u_i,    (2)

where û_{j|i} describes the affine transformation of the lower-level capsule output u_i and c_ij are the coupling coefficients.

Do Capsule Networks Solve the Problem of Rotation Invariance 35

This design of the capsule allows more capabilities in representing its features. Capsule Networks are also translation equivariant, which means they consider the spatial position of key features.

3 Dataset

In this comparison the GTSRB database [16] is used because its classes exhibit different problems, which are explained later. The images were cut from sequences recorded from a vehicle, leading to many similar images.

3.1 Preparation

The images in the dataset appear in different scales and aspect ratios. To feed the images through the network they are resized to 48 × 48 px. At this point no normalization was performed on the images. The networks use batch normalization instead.
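Equation (2) can be sketched in a few lines of NumPy; the capsule dimensions and the fixed coupling coefficients are illustrative assumptions (in dynamic routing, the c_ij come from a softmax over iteratively updated routing logits):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two lower-level capsules i with 4-dim outputs u_i feeding one
# higher-level capsule j with an 8-dim input; shapes are illustrative.
u = rng.standard_normal((2, 4))        # u_i
W = rng.standard_normal((2, 8, 4))     # W_ji, one matrix per lower capsule

# Prediction vectors u_hat_{j|i} = W_ji u_i (right-hand part of Eq. 2).
u_hat = np.einsum('idk,ik->id', W, u)

# Coupling coefficients c_ij, fixed here for illustration.
c = np.array([0.5, 0.5])

# s_j = sum_i c_ij * u_hat_{j|i} (left-hand part of Eq. 2).
s = (c[:, None] * u_hat).sum(axis=0)
```

Because the prediction vectors carry pose information (not just presence), agreement between them is what lets the higher capsule encode spatial relations between subfeatures.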
The original dataset has a widely spread number of samples per class. The distribution is visualized in Fig. 1.

Fig. 1. Amount of images in each class of the original GTSRB dataset.

This gives the network a skewed a priori probability of class occurrence. The network might learn that some classes are more important than others, which can be a big problem if the distribution does not correlate with the real-world distribution of the data or if two classes are hard to distinguish in the input data. Multiple datasets based on the original training set were created.

• Original (Or): Only images from the original dataset.
• A priori (Ap): Original images were copied to achieve the same amount of images in each class.
• Augmented (Au): Each class contains the same number of images, but new images are generated based on the original ones.

Lawrence et al. [9] described the problem of having an uneven balance of the individual classes. Buda et al. [1] presented different methods of addressing this issue.

3.2 Augmentation

To create the augmented images, their rotation, brightness, color and contrast were randomly changed. It is important not to change these parameters too much, to prevent getting completely black or white images and to not lose the effect of augmentation, as mentioned by Shijie et al. [14]. To not mix up the classes Keep left and Keep right, the images were not rotated more than 40° in each direction.

4 Approach

Given the three different datasets, the three networks were trained using TensorFlow. The convolutional network uses three convolution layers, where the first one is used for normalizing the color. It uses the ReLU activation function. Except for the fully connected layer, only small filters were selected to match the number of parameters of the capsule network. This network contains a total of 148,635 trainable parameters.
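The augmentation of Sect. 3.2 can be sketched as bounded random parameter sampling. Only the ±40° rotation limit comes from the text; the brightness and contrast ranges are assumptions, and applying the sampled angle to an image is left to an image library:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment_params():
    """Sample one set of augmentation parameters. Bounds are
    illustrative assumptions, except the +/-40 degree rotation limit
    imposed to keep 'Keep left' and 'Keep right' distinct."""
    return {
        'angle': rng.uniform(-40.0, 40.0),        # degrees
        'brightness': rng.uniform(-0.2, 0.2),     # additive shift
        'contrast': rng.uniform(0.8, 1.2),        # multiplicative gain
    }

def apply_photometric(img, p):
    """Apply brightness/contrast jitter; clipping to [0, 1] keeps
    intensities valid so images cannot become fully black or white."""
    out = img * p['contrast'] + p['brightness']
    return np.clip(out, 0.0, 1.0)
```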
The network uses several regularization techniques like dropout, and the batches are normalized after each convolution. The learning rate was set to decay exponentially starting from 0.001. The Capsule Network was adapted from a project by Neveu [11]. The code was modified to work with the three different datasets. Also, slight changes to the hyperparameters were made, resulting in a total of 207,947 trainable parameters. The Spatial Transformer Network's architecture was inspired by the CNN. The biggest difference is the spatial transform layer, where the network learns the translation, scale, rotation and clutter of an input image. The number of parameters is close to that of the CNN. Currently the training process of capsule networks is very slow: running on the same machine, this network trains about 5 times longer than the CNN.

5 Results

In this section several aspects are evaluated. In Sect. 5.1 the overall performance on the different training and test sets is compared. In Sect. 5.2 the rotation robustness of some classes is appraised. The figures refer (unless mentioned otherwise) to the original training and test data.

5.1 Accuracy

With the convolutional network, the results in Table 1 can be achieved when running for 30 epochs. These are not state-of-the-art results for the test set, but they show how much the accuracy drops when the augmented test set is evaluated with the network. When trained on the augmented training set, the network becomes much more robust against the augmented test set.

Stallkamp et al. [16] measured the human performance on the dataset at 98.97%. Sermanet et al. [13] got a performance of 98.31% using their Multi-Scale CNN; they therefore selected features not only from the last layer for

Table 1. Results for all three networks using the validation set, test set and the augmented test set.
            CNN                      CapsNet                  STN
            Or      Ap      Au       Or      Ap      Au       Or      Ap      Au
Val         0.9252  0.9766  0.9646   0.9944  0.9984  0.9980   0.9907  0.9995  0.9959
Test        0.8144  0.8253  0.8695   0.9393  0.9370  0.9441   0.9113  0.9336  0.9441
Augm. test  0.4920  0.4809  0.6358   0.6882  0.6979  0.7721   0.6583  0.6666  0.7846

classification. Ciresan et al. [2] achieved 99.46% with their multi-column network. Jin et al. [8] could reach a performance of 99.65%. But all these networks have a much higher number of trainable parameters.

The capsule network achieves higher accuracy than the CNN but needs to run longer until it reaches a stable state; 5,000 epochs worked best for the chosen set of parameters. While the results on the original test set are about 4% to 8% higher, on the augmented test set the capsule network scores about 14% higher. The spatial transformer network reaches accuracy similar to the capsule network on the default test set. When testing against the augmented data, the STN achieves about 1% higher accuracy.

5.2 Rotation

For evaluating the rotation robustness, the complete original test set was rotated from −90° to 90°. Angles outside this range were not considered because they may not be relevant for real-world applications. Figure 2a shows how the convolutional neural network (trained on the three training sets mentioned in Sect. 3.1) performs on these rotated images. The network trained with the augmented data obviously is more robust against rotation than the a priori trained network, which seems to overfit to unrotated data. In general, the network always performs better than guessing, which would be an accuracy of 0.0233. This is because the performance of some classes is not affected by rotations.
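The evaluation protocol of this section can be sketched as a small harness; `model` and `rotate` are caller-supplied stand-ins (e.g. a trained classifier and an image-rotation routine), not APIs from the paper:

```python
import numpy as np

def rotation_robustness(model, images, labels, angles, rotate):
    """Accuracy of `model` on the test set rotated by each angle.
    rotate(imgs, deg) returns the rotated images and model(imgs)
    returns predicted class ids."""
    return {deg: float(np.mean(model(rotate(images, deg)) == labels))
            for deg in angles}
```

With `angles = range(-90, 91, 5)` this reproduces the sweep from −90° to 90° used for the curves in Fig. 2.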
Fig. 2. Results on the different training sets with rotated test sets. (a) Accuracy of the CNN. (b) Accuracy of the CapsNet. (c) Accuracy of the STN.

The overall performance of the Capsule Network (Fig. 2b) is much better, and there is no accuracy drop when trained with the a priori training set compared to the original training set. The augmented training set gives a small boost in rotation robustness. This shows the strength of the Capsule Network in not being that prone to overfitting. The STN, visualized in Fig. 2c, has about the same characteristics as the Capsule Network but performs a bit better. Especially the augmented training set performs quite well considering the rotation.

Fig. 3. Results on the different training sets with rotated test sets. (a) Accuracy on class 31. (b) Accuracy on class 12. (c) Accuracy on class 15. (d) Accuracy on class 8. (e) Accuracy on class 13. (f) Accuracy on class 40.

When looking at single classes and their rotation robustness, most classes perform similarly to class 31, visualized in Fig. 3a. The angle where the accuracy drops may vary.
The accuracy with the augmented training data usually drops later, while the a priori data has a negative effect on the rotation robustness. Some of the classes where the performance differs from this pattern are explained separately.

The accuracy for the priority road sign (Fig. 3b) drops, as expected, every 45° using the Convolutional Neural Network. The Capsule Network is able to keep an accuracy of about 80% at these angles. The STN is not as sensitive to rotation as the CapsNet, but its peak performance is a bit worse.

The no vehicles sign is rotation invariant by definition. But as shown in Fig. 3c, the accuracy varies greatly. One reason for this behavior might be the background of the images, which was also learned by the network. But generally there are no big accuracy drops noticeable for this class. The CapsNet and the STN reach about 18% more accuracy.

The sign speed limit 120 performs very weakly with the CNN. The STN and the CapsNet produce similar results.

The sign Give way is the only class where the convolutional neural network performs better than the capsule network. It reaches its highest accuracies every 120°.

The last special class is the roundabout sign, where the capsule network shows very stable prediction performance (Fig. 3f). The drop at a rotation of 90° is only present with the original training data. The convolutional neural network reacts to this sign similarly to the Give way sign in Fig. 3e.

6 Conclusions

These results show that Capsule Networks are much more robust against rotated input data, but they are not the perfect solution to this problem. Because these networks are currently very slow to train, it is also possible to create an augmented training set or use Spatial Transformer Networks, which can achieve similar results on rotated inputs.
Nevertheless, Capsule Networks are a huge improvement for image classification because they do not simply learn a sort of average of all training data, but learn subfeatures and their parameters. This is a huge benefit when thinking about explainable artificial intelligence. Future work may have to "look inside" the capsules to extract more information about which features they learned in which representations.

References

1. Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249–259 (2018)
2. Ciresan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image classification. In: CVPR (2012)
3. Cohen, T.S., Welling, M.: Group equivariant convolutional networks (2016)
4. Fukushima, K., Miyake, S.: Neocognitron: a new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recogn. 15(6), 455–468 (1982)
5. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1735–1742. IEEE (2006)
6. Hubel, D.H., Wiesel, T.N.: Receptive fields of single neurones in the cat's striate cortex. J. Physiol. 148, 574–591 (1959)
7. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks (2015)
8. Jin, J., Fu, K., Zhang, C.: Traffic sign recognition with hinge loss trained convolutional neural networks. IEEE Trans. Intell. Transp. Syst. 15(5), 1991–2000 (2014)
9. Lawrence, S., Burns, I., Back, A., Tsoi, A.C., Giles, C.L.: Neural network classification and prior class probabilities. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 295–309. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_19
10.
Marcos, D., Volpi, M., Tuia, D.: Learning rotation invariant convolutional filters for texture classification. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2012–2017. IEEE (2016) 11. Neveu, T.: Capsnet-traffic-sign-classifier (2017). https://github.com/thibo73800/capsnet-traffic-sign-classifier 12. Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. In: Advances in Neural Information Processing Systems (2017) 13. Sermanet, P., LeCun, Y.: Traffic sign recognition with multi-scale convolutional networks. In: 2011 International Joint Conference on Neural Networks (IJCNN), pp. 2809–2813. IEEE (2011) 14. Shijie, J., Ping, W., Peiyi, J., Siping, H.: Research on data augmentation for image classification based on convolution neural networks. In: 2017 Chinese Automation Congress (CAC), pp. 4165–4170, October 2017 15. Simard, P.Y., Le Cun, Y.A., Denker, J.S., Victorri, B.: Transformation invariance in pattern recognition: tangent distance and tangent propagation. Int. J. Imaging Syst. Technol. 11(3), 181–197 (2000) 16. Stallkamp, J., Schlipsing, M., Salmen, J., Igel, C.: Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Netw. 32, 323–332 (2012) 17. Zhou, Y., Ye, Q., Qiu, Q., Jiao, J.: Oriented response networks. In: CVPR (2017) Balanced and Deterministic Weight-Sharing Helps Network Performance Oscar Chang(B) and Hod Lipson Data Science Institute, Columbia University, New York City, USA {oscar.chang,hod.lipson}@columbia.edu Abstract. Weight-sharing plays a significant role in the success of many deep neural networks, by increasing memory efficiency and incorporating useful inductive priors about the problem into the network. But understanding how weight-sharing can be used effectively in general is a topic that has not been studied extensively. Chen et al. [1] proposed HashedNets, which augment a multi-layer perceptron with a hash table, as a method for neural network compression.
We generalize this method into a framework (ArbNets) that allows for efficient arbitrary weight-sharing, and use it to study the role of weight-sharing in neural networks. We show that common neural networks can be expressed as ArbNets with different hash functions. We also present two novel hash functions, the Dirichlet hash and the Neighborhood hash, and use them to demonstrate experimentally that balanced and deterministic weight-sharing helps the performance of a neural network. Keywords: Weight-sharing · Weight tying · HashedNets · Hashing · Entropy 1 Introduction Most deep neural network architectures can be built using a combination of three primitive networks: the multi-layer perceptron (MLP), the convolutional neural network (CNN), and the recurrent neural network (RNN). These three networks differ in terms of where and how the weight-sharing takes place. We know that the weight-sharing structure is important, and in some cases essential, to the success of the neural network at a particular machine learning task. For example, a convolutional layer can be thought of as a sliding window algorithm that applies the same weights across different local segments of the input. This is useful for learning translation-invariant representations. Zhang et al. [10] showed that on a simple ten-class image classification problem like CIFAR10, applying a pre-processing step with 32,000 random convolutional filters boosted test accuracy from 54% to 83% using an SVM with a vanilla Gaussian kernel. Additionally, although the ImageNet challenge only started in 2010, from 2012 onwards all the winning models have been CNNs. (This research was supported in part by the US Defense Advanced Research Projects Agency (DARPA) Lifelong Learning Machines Program, grant HR0011-18-2-0020.) © Springer Nature Switzerland AG 2018 V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 41–50, 2018. https://doi.org/10.1007/978-3-030-01424-7_5
This suggests the importance of convolutional layers for the task of image classification. We show later on that balanced and deterministic weight-sharing helps network performance, and indeed, the weights in convolutional layers are shared in a balanced and deterministic fashion. We also know that tying the weights of encoder and decoder networks can be helpful. In an autoencoder with one hidden layer and no non-linearities, tying the weights of the encoder and the decoder achieves the same effect as Principal Components Analysis [8]. In language modeling tasks, tying the weights of the encoder and decoder for the word embeddings also results in increased performance as well as a reduction in the number of parameters used [5]. Developing general intuitions about where and how weight-sharing can be leveraged effectively is going to be very useful for the machine learning practitioner. Understanding the role of weight-sharing in a neural network from a quantitative perspective might also potentially lead us to discover novel neural network architectures. This paper is a first step towards understanding how weight-sharing affects the performance of a neural network. We make four main contributions:
– We propose a general weight-sharing framework called ArbNet that can be plugged into any existing neural network and enables efficient arbitrary weight-sharing between its parameters (Sect. 1.1).
– We show that deep networks can be formulated as ArbNets, and argue that the problem of studying weight-sharing in neural networks can be reduced to the problem of studying properties of the associated hash functions (Sect. 2.4).
– We show that balanced weight-sharing increases network performance (Sect. 5.1).
– We show that making an ArbNet hash function, which controls the weight-sharing, more deterministic increases network performance, but less so when it is sparse (Sect. 5.2).
1.1 ArbNet ArbNets are neural networks augmented with a hash table to allow for arbitrary weight-sharing. We can label every weight in a given neural network with a unique identifier, and each identifier maps to an entry in the hash table by computing a given hash function prior to the start of training. On the forward and backward passes, the network retrieves and updates weights, respectively, in the hash table using the identifiers. A hash collision between two different identifiers then implies weight-sharing between two weights. This mechanism of forcing hard weight-sharing is also known as the 'hashing trick' in some machine learning literature. A simple example of a hash function is the modulus hash:

w_i = table[i mod n] (1)

where the weight w_i with identifier i maps to the (i mod n)th entry of a hash table of size n. Another example is the uniform hash, where the weights are uniformly distributed across a table of fixed length:

w_i = table[Uniform(n)] (2)

An ArbNet is an efficient mechanism for forcing weight-sharing between any two arbitrarily selected weights, since the only overhead is the memory occupied by the hash table and the identifiers, and the compute time involved in initializing the hash table. 1.2 How the Hash Function Affects Network Performance As the load factor of the hash table goes up, or equivalently as the ratio of the size of the hash table to the size of the network goes down, the performance of the neural network goes down. This was demonstrated by Chen et al. [1]. While the load factor is a variable controlling the capacity of the network, it is not necessarily the most important factor in determining network performance. A convolutional layer has a much higher load factor than a fully connected layer, and yet it is much more effective at increasing network performance in a range of tasks, most notably image classification.
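The hashing-trick mechanism described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation; the names (`ArbLayer`, `modulus_hash`, `uniform_hash`) are hypothetical:

```python
import numpy as np

def modulus_hash(i, n, rng=None):
    """Deterministic hash of Eq. 1: identifier i maps to entry i mod n."""
    return i % n

def uniform_hash(i, n, rng):
    """Random hash of Eq. 2: identifier i maps to a uniformly drawn entry."""
    return int(rng.integers(n))

class ArbLayer:
    """Toy ArbNet layer: all weights live in a shared hash table.

    The hash is computed once per identifier before training; a collision
    between two identifiers means those two weights are shared."""

    def __init__(self, num_weights, table_size, hash_fn, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal(table_size) * 0.1
        self.index = np.array([hash_fn(i, table_size, rng)
                               for i in range(num_weights)])

    def weights(self):
        # Forward pass retrieves weights from the table via the identifiers.
        return self.table[self.index]

    def apply_grad(self, grad, lr=0.1):
        # Backward pass: a shared table entry accumulates the gradients of
        # every weight identifier that hashes to it.
        np.add.at(self.table, self.index, -lr * grad)

layer = ArbLayer(num_weights=8, table_size=4, hash_fn=modulus_hash)
# Identifiers 0 and 4 collide under i mod 4, so they share one weight.
assert layer.weights()[0] == layer.weights()[4]
```

The only overhead, as the text notes, is the table itself and the precomputed index array.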
There are at least two other basic questions we can ask:
– How does the balance of the hash table affect performance?
• The balance of the hash table indicates the evenness of the weight-sharing. We give a more precise definition in terms of Shannon entropy in the Experimental Setup section, but intuitively, a perfectly balanced weight-sharing scheme accesses each entry in the hash table the same number of times, while an unbalanced one tends to favor some entries over others.
– How does noise in the hash function affect performance?
• For a fixed identifier scheme, if the hash function is a deterministic operation, it will map to a fixed entry in the hash table. If it is a noisy operation, we cannot predict a priori which entry it will map into.
• We do not have a rigorous notion of 'noise', but we demonstrate in the Experimental Setup section an appropriate hash function whose parameter can be tuned to tweak the amount of noise.
We are interested in the answers to these questions across different levels of sparsity, since, as in the case of a convolutional layer, sparsity might influence the effect of the variable we are studying on the performance of the neural network. We perform experiments on two image classification tasks, MNIST and CIFAR10, and demonstrate that balance helps while noise hurts neural network performance. MNIST is a simpler task than CIFAR10, and the two tasks show the difference, if any, between when the neural network model has enough capacity to capture the complexity of the data and when it does not. 2 Common Neural Networks are MLP ArbNets The hash function associated with an MLP ArbNet is an exact specification of the weight-sharing patterns in the network, since an ordinary MLP does not share any weights.
2.1 Multi-layer Perceptrons An MLP consists of repeated applications of fully connected layers:

y_i = σ_i(W_i x_i + b_i) (3)

at the ith layer of the network, where σ_i is an activation function, W_i a weight matrix, b_i a bias vector, x_i the input vector, and y_i the output vector. None of the weights in any of the W_i are shared, so we can consider an MLP in the ArbNet framework as being augmented with the identity as the hash function. 2.2 Convolutional Neural Networks A CNN consists of repeated applications of convolutional layers, which in the 2D case take a form like the following:

Y_i^{(j,k)} = σ_i(Σ_m Σ_n W_i^{(m,n)} X_i^{(j−m,k−n)} + B_i^{(j,k)}) (4)

at the ith layer of the network, where the superscripts index the matrices, σ_i is an activation function (including pooling operations), W_i a weight matrix of size m by n, X_i the input matrix of size a by b, B_i a bias matrix, and Y_i the output matrix. The above equation produces one feature map. To produce l feature maps, we would have l different W_i and B_i, and stack the l resulting Y_i together. Notice that Eq. 4 can be rewritten in the form of Eq. 3:

y_i = σ_i(W′_i x_i + b_i) (5)

where the weight matrix W′_i is given by the equation:

W′_i^{(u,v)} = W_i^{((u+v)/b, (u+v) mod b)} if the indices are defined for W_i, and 0 otherwise (6)

W′_i has the form of a sparse Toeplitz matrix, and we can write a CNN as an MLP ArbNet with a hash function corresponding to Eq. 6. Convolutions where the stride or dilation is more than one have not been presented, for ease of explanation, but analogous results follow. 2.3 Recurrent Neural Networks An RNN consists of recurrent layers, which take a similar form to Eq. 3, except that W_i = W_j and B_i = B_j for i ≠ j, i.e., the same weights are shared across layers. This is equivalent to an MLP ArbNet where we number all the weights sequentially and the hash function is a modulus hash (Eq. 1) with n the size of each layer.
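The sparse Toeplitz structure behind Eq. 6 can be illustrated with a 1D analogue. This is a sketch under simplifying assumptions (stride 1, a single channel, 'valid' padding); `conv1d_as_matrix` is a hypothetical helper, not from the paper:

```python
import numpy as np

def conv1d_as_matrix(kernel, input_len):
    """Flatten a 1D convolution into a sparse Toeplitz-like weight matrix W',
    so that W' @ x equals sliding the kernel over x.  Every row reuses the
    same few kernel entries: that reuse is the weight-sharing."""
    k = len(kernel)
    out_len = input_len - k + 1
    W = np.zeros((out_len, input_len))
    for row in range(out_len):
        W[row, row:row + k] = kernel  # shared weights, shifted per row
    return W

kernel = np.array([1.0, -2.0, 1.0])
x = np.random.default_rng(0).standard_normal(10)
W = conv1d_as_matrix(kernel, len(x))
# Matches NumPy's convolution (kernel reversed, since np.convolve
# correlates against the flipped kernel).
assert np.allclose(W @ x, np.convolve(x, kernel[::-1], mode="valid"))
```

Reading the row construction as a hash function recovers the ArbNet view: each position (row, row + j) of W maps to shared table entry j.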
2.4 General Networks We have shown above that MLPs, CNNs, and RNNs can be written as MLP ArbNets associated with different hash functions. Since deep networks are built using a combination of these three primitive networks, it follows that deep networks can be expressed as MLP ArbNets. This shows the generality of the ArbNet framework. Fully connected layers do not share any weights, while convolutional layers share weights within a layer in a very specific pattern, resulting in sparse Toeplitz matrices when flattened out, and recurrent layers share the exact same weights across layers. The design space of potential neural networks is extremely big, and one could conceive of effective weight-sharing strategies that deviate from these three standard patterns of weight-sharing. In general, since any neural network, not just an MLP, can be augmented with a hash table, ArbNets are a powerful mechanism for studying weight-sharing in neural networks. The problem of studying weight-sharing in neural networks can then be reduced to the problem of studying the properties of the associated hash functions. 3 Related Work Chen et al. [1] proposed HashedNets for neural network compression, which is an MLP ArbNet where the hash function is computed layer-wise using xxHash prior to the start of training. Han et al. [4] also made use of the same layer-wise hashing strategy for the purposes of network compression, but hashed according to clusters found by a K-means algorithm run on the weights of a trained network. Our work generalizes this technique and uses it as an experimental tool to study the role of weight-sharing. Besides hard weight-sharing, it is also possible to do soft weight-sharing, where two different weights in a network are not forced to be equal, but are related to each other. Nowlan et al.
[7] implemented a soft weight-sharing strategy for the purposes of regularization, where the weights are drawn from a Gaussian mixture model. Ullrich et al. [9] also used Gaussian mixture models as soft weight-sharing for doing network compression. Another soft weight-sharing strategy, called HyperNetworks [3], uses an LSTM controller as a meta-learning algorithm to generate the weights of another network. 4 Experimental Setup In this paper, we limit our attention to studying certain properties of MLP ArbNets as tested on the MNIST and CIFAR10 image classification tasks. Our aim is not to best any existing benchmarks, but to show the differences in test accuracy that result from changing various properties of the hash function associated with the MLP ArbNet. 4.1 Balance of the Hash Table The balance of the hash table can be measured by the Shannon entropy:

H = −Σ_i p_i log p_i (7)

where p_i is the probability that the ith table entry will be used on a forward pass in the network. We propose to control this with a Dirichlet hash, which involves sampling from a symmetric Dirichlet distribution and using the output as the parameters of a multinomial distribution, which we then use as the hash function. The symmetric Dirichlet distribution has the following probability density function:

P(X) = (Γ(αN) / Γ(α)^N) ∏_{i=1}^{N} x_i^{α−1} (8)

where the x_i lie on the (N−1)-simplex. The Dirichlet hash is given by the following function:

w_i = table[Multinomial_α(n)] (9)

A high α leads to a balanced distribution (high Shannon entropy), and a low α leads to an unbalanced distribution (low Shannon entropy). The limiting case of α → ∞ results in a uniform distribution, which has maximum Shannon entropy. See Fig. 1 for a visualization of the effects of α on a hash table with 1000 entries. (Fig. 1. Heatmap of multinomial parameters drawn from different values of α)
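The Dirichlet hash of Eq. 9 and the entropy measure of Eq. 7 can be sketched as follows. This is an illustrative sketch, not the paper's code; the function names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_hash(num_weights, table_size, alpha):
    """Dirichlet hash (Eq. 9): draw multinomial parameters from a symmetric
    Dirichlet(alpha) (Eq. 8), then sample one table entry per weight
    identifier.  Large alpha gives balanced (high-entropy) sharing;
    small alpha gives unbalanced sharing."""
    p = rng.dirichlet(np.full(table_size, alpha))
    indices = rng.choice(table_size, size=num_weights, p=p)
    return indices, p

def shannon_entropy(p):
    """Balance measure of Eq. 7 (natural logarithm)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

_, p_balanced = dirichlet_hash(10_000, 1000, alpha=100.0)
_, p_skewed = dirichlet_hash(10_000, 1000, alpha=0.01)
# High alpha approaches the uniform-distribution entropy log(1000) ~ 6.91.
assert shannon_entropy(p_balanced) > shannon_entropy(p_skewed)
```

Sweeping `alpha` reproduces the qualitative behavior of Fig. 1: the multinomial parameters flatten out as α grows.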
4.2 Noise in the Hash Function A modulus hash and a uniform hash both have the property that the expected load of every entry in the hash table is the same. Hence, in expectation, both are balanced the same way, i.e., they have the same expected Shannon entropy. But the former is entirely deterministic while the latter is entirely random. It is therefore interesting to think about the effects of this source of noise, if any, on the performance of the neural network. We propose to investigate this with a Neighborhood hash, which composes a modulus hash with a uniform distribution over a specified radius. This is given by the following hash function:

w_i = table[(i + Uniform([−radius, radius])) mod n] (10)

When the radius is 0, the Neighborhood hash reduces to the modulus hash, and when the radius is at least half the size of the hash table, it reduces to the uniform hash. Controlling the radius thus allows us to control the intuitive notion of 'noise' in the specific setting where the expected load of all the table entries is the same. 4.3 Network Specification On MNIST, our ArbNet is a three-layer MLP (200-ELU-BN-200-ELU-BN-10-ELU-BN) with exponential linear units [2] and batch normalization [6]. On CIFAR10, our ArbNet is a six-layer MLP (2000-ELU-BN-2000-ELU-BN-2000-ELU-BN-2000-ELU-BN-2000-ELU-BN-10-ELU-BN) with exponential linear units and batch normalization. We trained both networks using SGD with learning rate 0.1 and momentum 0.9, and a learning rate scheduler that reduces the learning rate 10x every four epochs if there is no improvement in training accuracy. No validation set was used. 5 Results and Discussion 5.1 Dirichlet Hash We observe in Fig. 2 that on the MNIST dataset, increasing α has a direct positive effect on test accuracy, across different levels of sparsity.
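The Neighborhood hash of Eq. 10 (Sect. 4.2) can be sketched as follows; this is an illustrative toy with hypothetical names, assuming integer-valued jitter:

```python
import numpy as np

rng = np.random.default_rng(0)

def neighborhood_hash(i, n, radius):
    """Neighborhood hash of Eq. 10: a modulus hash jittered by uniform
    integer noise in [-radius, radius].  radius = 0 recovers the
    deterministic modulus hash; radius >= n/2 behaves like the fully
    random uniform hash, while keeping the expected load of every table
    entry the same."""
    return (i + int(rng.integers(-radius, radius + 1))) % n

n = 1000
# Deterministic at radius 0 ...
assert all(neighborhood_hash(i, n, radius=0) == i % n for i in range(n))
# ... noisy otherwise: repeated hashes of one identifier can differ,
# but stay within the (wrapped) neighborhood of i mod n.
samples = {neighborhood_hash(7, n, radius=100) for _ in range(50)}
assert len(samples) > 1
assert all(s >= 907 or s <= 107 for s in samples)
```

The `radius` parameter is thus the single knob for the intuitive notion of 'noise' used in the experiments.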
On the CIFAR10 dataset, when the weights are sparse, increasing α has a small positive effect, but at lower levels of sparsity it has a huge positive effect. This finding seems to indicate that it is more likely for SGD to get stuck in local minima when the weights are both non-sparse and shared unevenly. We can conclude that balance helps with network performance, but it is unclear whether it brings diminishing returns. Re-plotting the MNIST graph of Fig. 2 with the x-axis showing Shannon entropy (Eq. 7) instead of α, as in Fig. 3, gives us a better sense of scale. Note that in this case, a uniform distribution over 1000 entries would have a Shannon entropy of 6.91. The results shown in Fig. 3 suggest a linear trend at high sparsity and a concave trend at low sparsity, but more evidence is required to come to a conclusion. (Fig. 2. Effect of α (balance) in Dirichlet hash on network accuracy across different levels of sparsity) (Fig. 3. Effect of Shannon entropy (balance) in Dirichlet hash on network accuracy across different levels of sparsity) 5.2 Neighborhood Hash The trends in Fig. 4 are noisier, but it appears that an increase in radius has the overall effect of diminishing test accuracy. On MNIST, we notice that higher levels of sparsity result in a smaller drop in accuracy. The same effect seems to be present, though less pronounced, on CIFAR10, where we note an outlier in the case of sparsity 0.1, radius 0. We hypothesize that this effect occurs because the increase in noise leads to an increased probability of two geometrically distant weights in the network being forced to share the same weight. This is undesirable in the task of image classification, where local weight-sharing has proven advantageous, and is perhaps essential to the task. When the network is sparse, the positive effect of local weight-sharing is not prominent, and hence the noise does not affect network performance as much.
Thus, we can conclude that making the ArbNet hash more deterministic (equivalently, less noisy) boosts network performance, but less so when it is sparse. We notice that convolutional layers, when written as an MLP ArbNet as in Eq. 6, have a hash function that is both balanced (all the weights are used with the same probability) and deterministic (the hash function does not have any noise in it). This helps to explain the role weight-sharing plays in the success of convolutional neural networks. (Fig. 4. Effect of radius (noise) in Neighborhood hash on network accuracy across different levels of sparsity) 6 Conclusion Weight-sharing is very important to the success of deep neural networks. We proposed the use of ArbNets as a general framework under which weight-sharing can be studied, and investigated experimentally, for the first time, how balance and noise affect neural network performance in the specific case of an MLP ArbNet and two image classification datasets. References 1. Chen, W., Wilson, J.T., Tyree, S., Weinberger, K.Q., Chen, Y.: Compressing neural networks with the hashing trick, vol. 37 (2015) 2. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs) (2016) 3. Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks (2017) 4. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding (2016) 5. Inan, H., Khosravi, K., Socher, R.: Tying word vectors and word classifiers: a loss framework for language modeling (2017) 6. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift, vol. 37 (2015) 7. Nowlan, S.J., Hinton, G.E.: Simplifying neural networks by soft weight-sharing. Neural Comput. 4, 473–493 (1992) 8. Roweis, S.: EM algorithms for PCA and SPCA.
In: Neural Information Processing Systems, vol. 10 (1997) 9. Ullrich, K., Meeds, E., Welling, M.: Soft weight-sharing for neural network compression (2017) 10. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires re-thinking generalization (2017) Neural Networks with Block Diagonal Inner Product Layers Amy Nesky(B) and Quentin F. Stout Computer Science and Engineering, University of Michigan, Ann Arbor, MI 48109, USA {anesky,qstout}@umich.edu Abstract. We consider a modified version of the fully connected layer that we call a block diagonal inner product layer. These modified layers have weight matrices that are block diagonal, turning a single fully connected layer into a set of densely connected neuron groups. This idea is a natural extension of group, or depthwise separable, convolutional layers applied to the fully connected layers. Block diagonal inner product layers can be achieved either by initializing a purely block diagonal weight matrix or by iteratively pruning off-diagonal block entries. This method condenses network storage and speeds up the run time without significant adverse effect on the testing accuracy. Keywords: Neural networks · Block diagonal · Structured sparsity 1 Introduction Ideally, efforts to reduce the memory requirements of neural networks would also lessen their computational demand, but often these competing interests force a trade-off. Fully connected layers are unwieldy, yet they continue to be present in the most successful networks [13,23,28]. Our work addresses both memory and computational efficiency without compromise. Focusing our attention on the fully connected layers, we decrease network memory footprint and improve network runtime. There are a variety of methods to condense large networks without much harm to their accuracy. One such technique that has gained popularity is pruning [3,4,21], but traditional pruning has disadvantages related to network runtime.
Most existing pruning processes slow down network training, and the resulting condensed network is usually significantly slower to execute [3]. Sparse format operations require additional overhead that can greatly slow down performance unless one prunes nearly all weight entries, which can damage network accuracy. Localized memory access patterns can be computed faster than non-localized lookups. By implementing block diagonal inner product layers in place of fully connected layers, we condense neural networks in a structured manner that speeds up the final runtime and does little harm to the final accuracy. © Springer Nature Switzerland AG 2018 V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 51–61, 2018. https://doi.org/10.1007/978-3-030-01424-7_6 Block diagonal inner product layers can be implemented either by initializing a purely block diagonal weight matrix or by initializing a fully connected layer and focusing pruning efforts off the diagonal blocks to coax the dense weight matrix into structured sparsity. The first method reduces the gradient computation time and hence the overall training time. The latter method retains higher accuracy and supports the robustness of networks to shaping. That is, pruning can be used as a mapping between architectures, in particular, a mapping to more convenient architectures. Depending on how many iterations the pruning process takes, this method may also speed up training. We have converted a single fully connected layer into an ensemble of smaller inner product learners whose combined efforts form a stronger learner, in essence boosting the layer. These methods also bring artificial neural networks closer to the architecture of biological mammalian brains, which have more local connectivity [6]. 2 Related Work There is an assortment of criteria by which one may choose which weights to prune.
With any pruning method, the result is a sparse network that takes less storage space than its fully connected counterpart. Han et al. iteratively prune a network using the penalty method by adding a mask that disregards pruned parameters for each weight tensor [4]. This means that the number of required floating point operations decreases, but the number performed stays the same. Furthermore, masking out updates takes additional time. Han et al. report the average time spent on a forward propagation after pruning is complete and the resulting sparse layers have been converted to CSR format; for batch sizes larger than one, the sparse computations are significantly slower than the dense calculations [3]. More recently, there has been momentum in the direction of structured reduction of network architecture. Node pruning preserves some structure, but drastic node pruning can harm the network accuracy and requires additional weight fine-tuning [5,25]. Other approaches include storing a low-rank approximation of a layer's weight matrix [22] and training smaller models on outputs of larger models (distillation) [7]. Group lasso expands the concept of node pruning to convolutional filters [14,26,27]. That is, group lasso applies L1-norm regularization to entire filters. Sindhwani et al. propose structured parameter matrices characterized by low displacement rank that yield a high compression rate as well as fast forward and gradient evaluation [24]. Their work focuses on Toeplitz-related transforms of the fully connected layer weight matrix. However, speedup is generally only seen for compression of large weight matrices. According to their Fig. 3, the forward pass is slowed down for displacement rank higher than 1.5 × 10^−3 times the matrix dimension, and the backward pass is slowed down for displacement rank higher than 9 × 10^−4 times the matrix dimension.
Group, or depthwise separable, convolutions have been used in recent CNN architectures with great success [2,8,29]. In group convolutions, a particular filter does not see all of the channels of the previous layer. Block diagonal inner product layers apply this idea of separable neuron groups to the fully connected layers. This method transforms a fully connected layer into an ensemble of smaller fully connected neuron groups that boost the layer. 3 Methodology We consider two methods for implementing block diagonal inner product layers:
1. We initialize a layer with a purely block diagonal weight matrix and keep the number of connections constant throughout training.
2. We initialize a fully connected layer and iteratively prune entries off the diagonal blocks to achieve a block substructure.
Within a layer, all blocks have the same size. Method 2 is accomplished in three phases: a dense phase, an iterative pruning phase, and a block diagonal phase. In the dense phase, a fully connected layer is initialized and trained in the standard way. During the iterative pruning phase, focused pruning is applied to entries off the diagonal blocks using the weight decay method with the L1-norm. That is, if W is the weight matrix of a fully connected layer that we wish to push toward block diagonal, we add

α Σ_{i,j} |1_{i,j} W_{i,j}| (1)

to the loss function during the iterative pruning phase, where α is a tuning parameter and 1_{i,j} indicates whether W_{i,j} is off the diagonal blocks of W. The frequencies of regularization and pruning during this phase are additional hyperparameters. During this phase, masking out updates for pruned entries is more efficient than maintaining a sparse format.
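The penalty term of Eq. 1 can be sketched as follows. This is an illustrative sketch, not the authors' Caffe implementation; the helper names are hypothetical:

```python
import numpy as np

def off_block_mask(rows, cols, num_blocks):
    """Indicator 1_{i,j} of Eq. 1: 1 where W_{i,j} lies OFF the diagonal
    blocks, 0 inside them.  All blocks within a layer share one size."""
    mask = np.ones((rows, cols))
    rb, cb = rows // num_blocks, cols // num_blocks
    for b in range(num_blocks):
        mask[b * rb:(b + 1) * rb, b * cb:(b + 1) * cb] = 0.0
    return mask

def l1_off_block_penalty(W, mask, alpha):
    """The L1 weight-decay term added to the loss during the iterative
    pruning phase (Eq. 1)."""
    return alpha * np.sum(np.abs(mask * W))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
mask = off_block_mask(8, 8, num_blocks=4)
# A perfectly block diagonal matrix incurs zero penalty ...
assert l1_off_block_penalty(np.where(mask == 0, W, 0.0), mask, alpha=0.1) == 0.0
# ... while off-block entries are penalized.
assert l1_off_block_penalty(W, mask, alpha=0.1) > 0.0
```

The same mask can be reused to zero out gradient updates for pruned entries during the iterative pruning phase.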
When pruning is complete, to maximize speedup it is best to reformat the weight matrix once so that the blocks are condensed and adjacent in memory.¹ Batched smaller dense calculations for the blocks use cuBLAS strided batched multiplication [20]. There is a lot of flexibility in method 2 that can be tuned to specific user needs. More pruning iterations may increase the total training time but can yield higher accuracy and reduce overfitting. 4 Experiments: Speedup and Accuracy Our goal is to reduce the memory storage of the inner product layers while maintaining or reducing the final execution time of the network with minimal loss in accuracy. We will also see a reduction in total training time in some cases. All experiments are run on the Bridges NVIDIA P100 GPUs through the Pittsburgh Supercomputing Center. (¹ When using block diagonal layers, one should alter the output format of the previous layer and the expected input format of the following layer accordingly, in particular to row-major ordering.) (Fig. 1. Speedup when performing matrix multiplication using an n × n weight matrix and batch size 100. (Left) Speedup when performing only one forward matrix product. (Right) Speedup when performing all three matrix products involved in the forward and backward pass in gradient descent. Both images share the same key.) For speedup analysis, we timed block diagonal multiplications using n × n matrices with varying dimension sizes and varying numbers of blocks; we considered the forward pass and gradient updates. We also calculate an upper bound on the ratio of the number of pruning iterations to the number of pure block iterations that will yield speedup when using block diagonal method 2.
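The batched block computation described above (condensed blocks, one small product per block, in the spirit of cuBLAS strided batched multiplication) can be sketched in NumPy; the function name is hypothetical and this is an illustration, not the paper's GPU code:

```python
import numpy as np

def block_diag_forward(blocks, x):
    """Forward pass of a block diagonal inner product layer.

    blocks: (num_blocks, out_b, in_b) array -- the diagonal blocks stored
    condensed and adjacent in memory, mirroring the layout used for
    strided batched multiplication.
    x: (batch, num_blocks * in_b) input activations."""
    nb, out_b, in_b = blocks.shape
    batch = x.shape[0]
    xb = x.reshape(batch, nb, in_b)             # split the input per block
    yb = np.einsum('bki,koi->bko', xb, blocks)  # one small matmul per block
    return yb.reshape(batch, nb * out_b)

rng = np.random.default_rng(0)
nb, in_b, out_b, batch = 4, 8, 5, 3
blocks = rng.standard_normal((nb, out_b, in_b))
x = rng.standard_normal((batch, nb * in_b))

# Check against explicit multiplication by the full block diagonal matrix.
W = np.zeros((nb * out_b, nb * in_b))
for k in range(nb):
    W[k * out_b:(k + 1) * out_b, k * in_b:(k + 1) * in_b] = blocks[k]
assert np.allclose(block_diag_forward(blocks, x), x @ W.T)
```

Only the condensed `blocks` tensor is ever stored or multiplied, which is where the memory and runtime savings come from.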
For accuracy results, we ran experiments on the MNIST [16] dataset using a LeNet-5 [15] network, and on the SVHN [19] and CIFAR10 [10] datasets using Krizhevsky's cuda-convnet [11]. Cuda-convnet does not produce state-of-the-art accuracies for SVHN or CIFAR10, but it demonstrates the performance differences between our methods and others. We implement our work in Caffe, which provides these architectures; Caffe's MNIST example uses LeNet-5, and cuda-convnet can be found in Caffe's CIFAR10 'quick' example. 4.1 Speedup Figure 1 shows the speedup when performing matrix multiplication using an n × n weight matrix and batch size 100 when the weight matrix is purely block diagonal. The speedup when performing only the forward-pass matrix product is shown in the left pane, and the speedup when performing all gradient descent products is shown in the right pane. As the number of blocks increases, the overhead of cuBLAS strided batched multiplication can become noticeable; this library is not yet well optimized for performing many small matrix products [17]. However, with specialized batched multiplications for many small matrices, Jhurani et al. attain up to 6-fold speedup [9]. Using cuBLAS strided batched multiplication, maximum speedup is achieved when the number of blocks is 1/2^7 times the matrix dimension. When only timing the forward pass, the speedup is always greater than 1 when the number of blocks is at most 1/2^5 times the matrix dimension. When timing the forward and backward pass, the speedup is always greater than 1 when the number of blocks is at most 1/2^6 times the matrix dimension.
For a given inner product layer, using block diagonal method 2 we would see speedup during training if

\[
\frac{T(\text{FC}) - T(\text{Block})}{T(\text{Prune})} > \frac{y}{x}
\tag{2}
\]

where T(·) is the combined time to perform the forward and backward passes of an inner product layer in the given state, x is the number of pure block iterations, and y is the number of pruning iterations. T(Prune) is the time to regularize and apply a mask to the off-diagonal-block weights, which happens once per pruning iteration. Figure 2 plots the upper bound in ratio (2) against the number of blocks for a layer with an n × n weight matrix and batch size 100, for n = 2^7 to 2^15.

Fig. 2. Using batch size 100, upper bound on the ratio of the number of pruning iterations to the number of pure block iterations that will result in an overall training speedup when using block diagonal method 2.

Figure 3 shows timing results for the inner product layers in LeNet-5 (Left) and cuda-convnet (Right), which both have two inner product layers. We plot the forward time per inner product layer when the layers are purely block diagonal, the combined forward and backward time to do the three matrix products involved in gradient descent training when the layers are purely block diagonal, and the runtime of sparse matrix multiplication with random entries in CSR format using cuSPARSE [20]. For brevity we refer to a block diagonal network architecture as (b1, . . . , bn)-BD; bi = 1 indicates that the ith inner product layer is fully connected. FC is short for all inner product layers being fully connected. The points at which the forward sparse and forward block curves meet in each plot in Fig. 3 indicate the fully connected dense forward runtimes for each layer; these are made clearer with dotted, black, vertical lines.
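Condition (2) can be checked numerically for a candidate pruning schedule. The timings passed in below are illustrative placeholders, not measurements from the paper; the derivation in the comment is one way to arrive at the inequality.

```python
def pruning_schedule_gives_speedup(t_fc, t_block, t_prune, x, y):
    """Check condition (2) for block diagonal method 2.

    t_fc, t_block : forward+backward time of the layer when fully
                    connected and when purely block diagonal.
    t_prune       : extra per-iteration cost of regularizing and masking
                    the off-diagonal-block weights.
    x, y          : number of pure block and pruning iterations.

    Derivation sketch: pruning iterations still run the dense layer, so
    total time is roughly y*(t_fc + t_prune) + x*t_block, versus
    (x + y)*t_fc for fully connected training.  Comparing the two gives
    (t_fc - t_block) / t_prune > y / x.
    """
    return (t_fc - t_block) / t_prune > y / x
```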
In LeNet-5 (Left), the first inner product layer, ip1, has a 500 × 800 weight matrix, and the second has a 10 × 500 weight matrix, so the (b1, b2)-BD architecture has (800 × 500)/b1 + (500 × 10)/b2 nonzero weights across both inner product layers.

Fig. 3. For each inner product layer in LeNet-5 (Left) and cuda-convnet (Right): forward runtimes of block diagonal and CSR sparse formats, and combined forward and backward runtimes of block diagonal format. LeNet-5 uses batch size 64, and cuda-convnet uses batch size 100.

There is ≥1.4× speedup for b1 ≤ 50, or 8000 nonzero entries, when timing both forward and backward matrix products, and 1.6× speedup when b1 = 100, or 4000 nonzero entries, in the forward-only case. In cuda-convnet (Right), the first inner product layer, ip1, has a 64 × 1024 weight matrix, and the second has a 10 × 64 weight matrix. The (b1, b2)-BD architecture has (1024 × 64)/b1 + (64 × 10)/b2 nonzero entries across both inner product layers. In the ip1 layer, there is ≥1.26× speedup for b1 ≤ 32, or 2048 nonzero entries, when timing both forward and backward matrix products, and ≥1.65× speedup for b1 ≤ 64, or 1024 nonzero entries, in the forward-only case. In both plots we see that the sparse format performs poorly until there are fewer than 50 nonzero entries.

4.2 Accuracy Results

All hyperparameters and initialization distributions provided by Caffe's example architectures are left unchanged. Training is done with batched gradient descent using the cross-entropy loss function on the softmax of the output layer. In our experiments we performed only manual tuning of the new hyperparameters introduced by block diagonal method 2, such as the coefficient of the new regularization term (see Eq. 1) and the pruning modulus cutoff. In ShuffleNet, Zhang et al. note that stacking multiple group convolutions together can block information flow between channel groups and weaken representation [29].
To correct for this, they suggest dividing the channels in each group into subgroups and shuffling the outputs of the subgroups in one layer before feeding them to the next. Applying this approach to block inner product layers would require either moving entries in memory or doing more, smaller matrix products; both of these options would hurt efficiency. Using pruning to achieve the block diagonal structure, as in method 2, also addresses information flow. Pruning does add some work to the training iterations but, unlike the ShuffleNet method, adds no work to the final execution of the trained network. After pruning is complete, the learned weights are the result of a more complete picture; while the information flow has been constrained, it is preserved like a ghost in the remaining weights. Another alternative is to randomly shuffle whole blocks on each pass, as in the "random sparse convolution" layer in the CNN library cuda-convnet [12]. We found that for the inner product layers in LeNet-5 and Krizhevsky's cuda-convnet, the ShuffleNet method did not improve accuracy as much as randomly shuffling the whole blocks, so we do not include results using the ShuffleNet method. Table 1 shows the accuracy results for block diagonal method 1, method 1 with random block shuffling, method 2, and traditional iterative pruning, which uses the penalty method to prune weight entries without any confinement or organization. We show accuracy results for the most condensed net with block diagonal inner product layers and for the net with the fastest speedup in the inner product layers.

Table 1. Accuracy results on MNIST, SVHN, and CIFAR10 datasets.
                 Method 1   Rand. shuff.   Method 2   Trad. it. prune
MNIST (99.11% accurate when using FC)
  (10, 1)-BD     98.83%     98.81%         99.02%     99.04%
  (100, 10)-BD   98.39%     98.42%         98.65%     98.55%
SVHN (91.96% accurate when using FC)
  (8, 1)-BD      91.39%     91.46%         91.88%     91.15%
  (64, 2)-BD     89.21%     89.69%         90.02%     90.93%
CIFAR10 (76.29% accurate when using FC)
  (8, 1)-BD      75.07%     75.09%         76.05%     75.64%
  (64, 2)-BD     72.7%      73.45%         74.81%     75.18%

MNIST. We experimented on the MNIST dataset with the LeNet-5 framework [15] using a training batch size of 64 for 10000 iterations. LeNet-5 has two convolutional layers with pooling followed by two inner product layers with ReLU activation. FC achieves a final accuracy of 99.11%. In all cases testing accuracy remains within 1% of FC accuracy. Using traditional iterative pruning with L2 regularization, as suggested in [4], pruning until 4000 and 500 nonzero entries survived in ip1 and ip2 respectively gave an accuracy of 98.55%, but the forward multiplication was more than 8 times slower than in the dense fully connected case (see Fig. 3, Left). Implementing (100, 10)-BD method 2 with pruning, using 15 dense iterations and 350 pruning iterations, gave a final accuracy of 98.65%. (10, 1)-BD yielded ≈1.4× speedup for all gradient descent matrix products in both inner product layers after pruning is complete, and (100, 10)-BD condensed the inner product layers in LeNet-5 ≈81-fold.

SVHN. We experimented on the SVHN dataset with Krizhevsky's cuda-convnet [11] using batch size 100 for 9000 iterations. Cuda-convnet has three convolutional layers with ReLU activation and pooling, followed by two fully connected layers with no activation. (8, 1)-BD yielded ≈1.5× speedup for all gradient descent matrix products in both inner product layers when purely block diagonal, and (64, 2)-BD condensed the inner product layers in cuda-convnet ≈47-fold. Using FC we obtained a final accuracy of 91.96%. Table 1 shows all methods stayed under a 2.5% drop in accuracy.
Using traditional iterative pruning with L2 regularization until 1024 and 320 nonzero entries survived in the final two inner product layers respectively gave an accuracy of 90.93%, but the forward multiplication was more than 8 times slower than the dense fully connected computation. On the other hand, implementing (64, 2)-BD method 2 with pruning, which has corresponding numbers of nonzero entries, with 500 dense iterations and <1000 pruning iterations gave a final accuracy of 90.02%. This is ≈47-fold compression of the inner product layer parameters with only a 2% drop in accuracy compared to FC.

CIFAR10. We experimented on the CIFAR10 dataset with Krizhevsky's cuda-convnet [11] using batch size 100 for 9000 iterations. Using FC we obtained a final accuracy of 76.29%. Table 1 shows all methods stayed within a 4% drop in accuracy. Using traditional iterative pruning with L2 regularization until 1024 and 320 nonzero entries survived in the final two inner product layers gave an accuracy of 75.18%, but again the forward multiplication was more than 8 times slower than the dense fully connected computation. On the other hand, implementing (64, 2)-BD method 2 with pruning, which has corresponding numbers of nonzero entries, with 500 dense iterations and <1000 pruning iterations gave a final accuracy of 74.81%. This is ≈47-fold compression of the inner product layer parameters with only a 1.5% drop in accuracy. The total forward runtime of ip1 and ip2 in (64, 2)-BD is 1.6 times faster than in FC. To achieve comparable speed with the sparse format we used traditional iterative pruning to leave 37 and 40 nonzero entries in the final inner product layers, giving an accuracy of 73.01%. Thus implementing block diagonal layers with pruning yields accuracy and memory condensation comparable to traditional iterative pruning, with faster final execution time.

Whole-node pruning decreases accuracy more than corresponding reductions in the block diagonal setting.
Node pruning until ip1 had only 2 outputs, i.e. a 1024 × 2 weight matrix, and ip2 had a 2 × 10 weight matrix, for a total of 2068 weights between the two layers, gave a final accuracy of 59.67%. (64, 2)-BD has a total of 1344 weights between the two inner product layers and, with pruning, had a final accuracy 15.14% higher.

The final accuracy on an independent test set was 76.29% on CIFAR10 using the FC net, while the final accuracy on the training set itself was 83.32%. Using the (64, 2)-BD net without pruning, the accuracy on an independent test set was 72.49%, but on the training set it was 75.63%. With pruning, the accuracy of (64, 2)-BD on an independent test set was 74.81%, but on the training set it was 76.85%. Both block diagonal methods decrease overfitting; the block diagonal method with pruning decreases overfitting slightly more.

5 Conclusion

We have shown that block diagonal inner product layers can reduce network size, training time, and final execution time without significant harm to network performance. While traditional iterative pruning can reduce storage, the scattered surviving weights make sparse computation inefficient, slowing down both training and final execution. Our block diagonal methods address this inefficiency by confining dense regions to blocks along the diagonal. Without pruning, block diagonal method 1 allows for faster training. Method 2 preserves the learning with focused, structured pruning that reduces computation for speedup during execution. In our experiments, method 2 achieved higher accuracy than the purely block diagonal method. The success of method 2 supports the use of pruning as a mapping from large dense architectures to smaller, more efficient dense architectures.
Both methods make larger network architectures more feasible to train and use, since they convert a fully connected layer into a collection of smaller inner product learners working jointly to form a stronger learner. In particular, GPU memory constraints become less constricting.

There is considerable room for additional speedup with block diagonal layers. Dependency between layers poses a noteworthy bottleneck in network parallelization. With structured sparsity like ours, one no longer needs a full barrier between layers. Additional speedup would come from software optimized to support weight matrices with organized sparse form, such as blocks, rather than software optimized only for dense matrices. For example, for many small blocks, one can reach up to 6-fold speedup with specialized batched matrix multiplication [9]. Hardware has also been developing to better support sparse operations. Block format may be especially suitable for training on evolving architectures such as neuromorphic systems. These systems, which are far more efficient than GPUs at simulating mammalian brains, have a pronounced 2-D structure and are ill-suited to large dense matrix calculations [1,18].

Acknowledgments. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1256260. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).

References

1. Boahen, K.: Neurogrid: a mixed-analog-digital multichip system for large-scale neural simulations. Proc. IEEE 102(5), 699–716 (2014)
2. Chollet, F.: Xception: deep learning with depthwise separable convolutions. arXiv:1610.02357 (2017)
3.
Han, S., et al.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In: ICLR (2015)
4. Han, S., et al.: Learning both weights and connections for efficient neural networks. In: NIPS, pp. 1135–1143 (2015)
5. He, T., et al.: Reshaping deep neural network for fast decoding by node-pruning. In: IEEE ICASSP, pp. 245–249 (2014)
6. Herculano-Houzel, S.: The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost. PNAS 109(Supplement 1), 10661–10668 (2012)
7. Hinton, G., et al.: Distilling the knowledge in a neural network. In: NIPS (2014)
8. Ioannou, Y., et al.: Deep Roots: improving CNN efficiency with hierarchical filter groups. In: CVPR (2017)
9. Jhurani, C., et al.: A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices. J. Parallel Distrib. Comput. 75, 133–140 (2015)
10. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report, Computer Science, University of Toronto (2009)
11. Krizhevsky, A.: Cuda-convnet. Technical report, Computer Science, University of Toronto (2012)
12. Krizhevsky, A.: Cuda-convnet: high-performance C++/CUDA implementation of convolutional neural networks (2012)
13. Krizhevsky, A., et al.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106–1114 (2012)
14. Lebedev, V., et al.: Fast convnets using group-wise brain damage. In: CVPR (2016)
15. LeCun, Y., et al.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
16. LeCun, Y., et al.: The MNIST database of handwritten digits. Technical report
17. Masliah, I., et al.: High-performance matrix-matrix multiplications of very small matrices. In: Dutot, P.-F., Trystram, D. (eds.) Euro-Par 2016. LNCS, vol. 9833, pp. 659–671. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43659-3_48
18.
Merolla, P.A., et al.: A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345(6197), 668–673 (2014)
19. Netzer, Y., et al.: Reading digits in natural images with unsupervised feature learning. In: NIPS (2011)
20. Nickolls, J., et al.: Scalable parallel programming with CUDA. ACM Queue 6(2), 40–53 (2008)
21. Reed, R.: Pruning algorithms: a survey. IEEE Trans. Neural Netw. 4(5), 740–747 (1993)
22. Sainath, T.N., et al.: Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In: IEEE ICASSP (2013)
23. Simonyan, K., et al.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
24. Sindhwani, V., et al.: Structured transforms for small-footprint deep learning. In: NIPS, pp. 3088–3096 (2015)
25. Srinivas, S., et al.: Data-free parameter pruning for deep neural networks. arXiv:1507.06149 (2015)
26. Wen, W., et al.: Learning structured sparsity in deep neural networks. In: NIPS, pp. 2074–2082 (2016)
27. Yuan, M., et al.: Model selection and estimation in regression with grouped variables. J. Royal Stat. Soc. Ser. B 68(1), 49–67 (2006)
28. Zeiler, M.D., et al.: Visualizing and understanding convolutional networks. arXiv:1311.2901 (2013)
29. Zhang, X., et al.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. arXiv:1707.01083 (2017)

Training Neural Networks Using Predictor-Corrector Gradient Descent

Amy Nesky(B) and Quentin F. Stout

Computer Science and Engineering, University of Michigan, Ann Arbor, MI 48109, USA
{anesky,qstout}@umich.edu

Abstract. We improve the training time of deep feedforward neural networks using a modified version of gradient descent we call Predictor-Corrector Gradient Descent (PCGD). PCGD uses predictor-corrector inspired techniques to enhance gradient descent.
This method uses a sparse history of network parameter values to make periodic predictions of future parameter values in an effort to skip unnecessary training iterations. This method can cut the number of training epochs needed for a network to reach a particular testing accuracy by nearly one half when compared to stochastic gradient descent (SGD). PCGD can also outperform, with some trade-offs, Nesterov's Accelerated Gradient (NAG).

Keywords: Neural networks · Accelerated gradient methods

1 Introduction

The immense expressive power of artificial neural networks has greatly advanced machine learning and data science. Large networks can achieve unprecedented accuracy on intricate learning problems, yet their size consumes significant computational resources and, consequently, time [13]. Advances in compute power allow neural networks with millions of parameters to be trained on enormous, complex data sets, and the use of GPUs has decreased training time drastically, but new techniques for reducing network training time must arise for deep learning to progress.

In this work, we propose a new training technique called Predictor-Corrector Gradient Descent (PCGD) that reduces the number of iterations required to learn. In PCGD we monitor the trend of the parameters as the network learns with gradient descent, and periodically adjust each parameter by inferring future values from the trend. A number of standard gradient descent iterations between predictions act to refine the predicted approximations. This alternating process works in much the same way as predictor-corrector methods for solving ordinary differential equations. We will show that incorporating prediction into the training process makes learning significantly more efficient.

The human brain already utilizes predictions. Predictions are crucial to survival because they allow us to respond more appropriately to our surroundings

© Springer Nature Switzerland AG 2018
V.
Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 62–72, 2018. https://doi.org/10.1007/978-3-030-01424-7_7

and they improve reaction time. Perception is also impacted by the brain's predictions: our perceptions are a combination of expectations and sensory information [7,14]. Thus, if we wish to improve artificial neural network efficiency, integrating prediction into training is a natural modification.¹

2 Related Work

There is a plethora of work that supplements standard gradient descent in hopes of improving neural network training. Gradient noise and stale gradients have been successful adaptations of gradient descent [8,15]. Adaptive gradient techniques give frequently occurring features low learning rates and infrequent features high learning rates; these methods use the information-theoretic idea that infrequent features carry more information about the data distribution [5,6,10,23,24]. Momentum and Nesterov's Accelerated Gradient (NAG) accumulate a descent direction across iterations to alleviate zig-zagging and accelerate convergence [17,19]. There are also meta-learning methods that allow networks to be trained jointly with their learning algorithm. Meta-methods may intelligently adjust hyperparameters like the learning rate, or learn the entire update term, perhaps as a function of the batched gradient [1,4]. Each of these techniques complements gradient descent to improve network learning and can be used in conjunction with our methods.

Predictor-corrector methods are traditionally used in numerical analysis to integrate ordinary differential equations [22]. Since their inception, predictor-corrector methods have been used in a variety of fields that require optimization, such as the theoretical study of chemical reactions and time-varying convex optimization [9,21].
Prediction-correction has been incorporated into neural network training in the past by convolving a pair of neural networks, a prediction network and a correction network [25,26].

Scieur et al. propose a learning algorithm related to the one presented in this paper called Regularized Nonlinear Acceleration (RNA) [20]. RNA computes estimates of the optimum from a nonlinear average of a history of iterates produced by an optimization method such as gradient descent. As in RNA, the prediction step in PCGD is based on a history of parameter values obtained with gradient descent. However, our predictions use parameter-specific linear regression rather than a nonlinear average of complete historical iterates. Making parameter-specific predictions with linear regression allows our method to update predictions incrementally, which removes the need to keep all historical iterates relevant to a particular prediction. RNA must store the entire iterate history relevant to a particular prediction, which makes that method unfeasible for training large neural networks.

¹ One caution ought to be mentioned here: brain predictions also enable prejudices, so one must be careful how much trust is placed in predictions.

3 Methodology

PCGD uses best-fit predictions and stochastic gradient descent in tandem. When estimating the trend in the network parameters through training, we use fit functions for which the least squares problem has a closed-form solution via the normal equations. One could use more complex fit functions, but we want to avoid needing an extra iterative process. Using only least squares problems with closed-form solutions to make parameter predictions also saves memory, because they can be solved incrementally, avoiding the need to store a long history of network snapshots. We define the algorithm around the gradient descent iterations.
We make parameter predictions every p gradient descent iterations and collect snapshots of the network parameters every s-th gradient descent iteration, where $p > s$ and $s \mid p$. Parameter predictions consider only the previous $p/s$ network snapshots. Since $p > s$, only a sparse history of snapshots is considered. We call $p$ the prediction increment and $s$ the snapshot increment. For the remainder of this paper, the variables $p$ and $s$ retain this definition.

Suppose our network has $n$ weight and bias parameters. Let $f(a, x): \mathbb{R}^c \times \mathbb{R} \to \mathbb{R}$ be our chosen fit function class for parameter prediction. For each network parameter $\theta$, we aim to solve for $a$ such that $f(a, x)$ estimates a future value of $\theta$ for a chosen prediction length $x$. $f(a, x)$ has $c$ unknowns, where $c \le p/s$. Define $F(A, x): \mathbb{R}^{c \times n} \times \mathbb{R} \to \mathbb{R}^n$ such that the $i$th entry of $F(A, x)$ is $f(a_i, x)$, where $a_i$ is the $i$th column of $A$. When using PCGD, the network parameter vector $\theta \in \mathbb{R}^n$ receives the update

\[
v_t = -\epsilon \nabla L(\theta_t), \qquad
\theta_{t+1} =
\begin{cases}
F(A_{t+1}, l_{t+1}) & \text{if } t+1 \equiv 0 \bmod p \\
\theta_t + v_t & \text{otherwise}
\end{cases}
\tag{1}
\]

where $L$ is the desired loss function, $\epsilon$ is some learning rate, $l_{t+1} \ge p/s$ is an increasing prediction length, and $A_{t+1} \in \mathbb{R}^{c \times n}$ minimizes the $L_2$-norms of the columns of $J A_{t+1} - \Theta_{t+1}$. Here, $J \in \mathbb{R}^{(p/s) \times c}$ has entries $J_{i,j} = \partial f(a, i)/\partial a_j$, and the $i$th row of $\Theta_{t+1}$ is the vector $\theta_{t+1-p+is}$ for $i < p/s$ and $\theta_t + v_t$ for $i = p/s$.² Note that the columns of $A_{t+1}$ each solve independent least squares problems for particular network parameters; the systems are overdetermined if $c < p/s$. We use one fit function class, $f$, but calculate network-parameter-specific fit function variables. One could easily add regularizers or momentum to the velocity term $v_t$. $l_{t+1}$ is an increasing prediction length dependent on the gradient descent iteration, but one could also consider an adaptive, or parameter-specific, prediction length.
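Update (1) can be sketched end-to-end on a toy problem. Everything specific below is an illustrative assumption (the quadratic objective, learning rate, increments, and fixed prediction length), and the final snapshot row approximates the paper's $\theta_t + v_t$ row by snapshotting just before predicting.

```python
import numpy as np

def pcgd(grad, theta0, lr=0.1, p=20, s=4, steps=200, pred_len=6):
    """Sketch of PCGD update (1) with a linear fit f(a, x) = a1 + a2*x.

    Snapshots are taken every s iterations; every p iterations each
    parameter is replaced by its fitted value at x = pred_len (measured
    in snapshot units), then gradient descent resumes as the corrective
    phase.
    """
    theta = theta0.astype(float).copy()
    history = []                          # sparse snapshot history
    for t in range(steps):
        theta = theta - lr * grad(theta)  # corrective SGD step
        if (t + 1) % s == 0:
            history.append(theta.copy())
        if (t + 1) % p == 0 and len(history) >= p // s:
            snaps = np.stack(history[-(p // s):])        # (p/s, n)
            xs = np.arange(1, p // s + 1)
            # Shared design matrix J; per-parameter least squares lines.
            J = np.stack([np.ones_like(xs), xs], axis=1).astype(float)
            A, *_ = np.linalg.lstsq(J, snaps, rcond=None)  # (2, n)
            theta = A[0] + A[1] * pred_len               # predictive step
    return theta
```

On a simple convex problem the periodic extrapolation jumps ahead of plain gradient descent, and the corrective iterations between predictions refine the approximation.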
Iterations $t$ in which $t \equiv 0 \bmod p$ constitute the 'predictive' step in PCGD; all other gradient descent iterations comprise the 'corrective' step.

We solve for the prediction fit function variables $A_{t+1}$ incrementally so as to minimize the extra storage required to perform PCGD. Fit function variables are updated at snapshot intervals.

² Note that the Jacobian, $J$, is not specific to the column of $A_{t+1}$.

Let $\Theta^{(i)}_{t+1}$ denote the shorter matrix containing only the first $i$ rows of $\Theta_{t+1}$. Similarly, $J^{(i)}$ is the shorter matrix containing only the first $i$ rows of $J$. When $c$ snapshots have been recorded, we solve $J^{(c)} A_{t+1} = \Theta^{(c)}_{t+1}$ for the fit function variable matrix $A_{t+1}$; with $c$ snapshots this is a determined system. After this initial solve, only $A_{t+1}$ must still be stored; $\Theta^{(c)}_{t+1}$ is no longer needed. At snapshot intervals $c+1$ through $p/s$ we update the fit function variable matrix using the incremental least squares algorithm found in [3]. That is, for $i \in [c+1, p/s]$, we update

\[
A_{t+1} \leftarrow A_{t+1} + y_i \left( \theta^{(i)}_{t+1} - j_i A_{t+1} \right)
\tag{2}
\]

where $\theta^{(i)}_{t+1}$ is the $i$th row of $\Theta_{t+1}$, $j_i$ is the $i$th row of $J$, and $y_i$ is the solution to $(J^{(i)})^{\top} J^{(i)} y_i = j_i^{\top}$.

This process then repeats, overwriting old fit function variables and parameter history in memory. Since fit function variables are parameter specific, they can be updated layer-wise. If a network has $n$ total parameters, PCGD requires storing at most an additional $O(cn)$ values in memory at any one time during training when using a fit function with $c$ unknowns. The size of the extra storage is $c$ times the size of the layers not currently being updated, plus at most $2c$ times the size of the layer currently being updated.

By using an incremental least squares approach and solving for parameter-specific best-fit functions, we are able to conserve memory during training; without this approach one would need to store $np/s$ parameter history values.
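The incremental update (2) can be checked against a direct batch solve. This is a minimal sketch; the linear-fit design matrix in the usage below is an illustrative choice matching the linear fit used later in the paper, and after each row is incorporated the incremental solution matches `np.linalg.lstsq` on the rows seen so far.

```python
import numpy as np

def incremental_ls_update(A, J_i, j_i, theta_i):
    """One step of the incremental least squares update (2).

    A       : (c, n) current fit variables, exact for the rows seen so far.
    J_i     : (i, c) design matrix including the new row.
    j_i     : (c,)   the new design row.
    theta_i : (n,)   the new snapshot row of Theta.
    """
    # y_i solves (J_i^T J_i) y_i = j_i.
    y = np.linalg.solve(J_i.T @ J_i, j_i)
    # Rank-one correction toward the new row's residual.
    return A + np.outer(y, theta_i - j_i @ A)
```

Starting from the determined $c \times c$ solve and applying (2) row by row reproduces the full least squares solution without ever storing the whole snapshot history.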
This makes PCGD a feasible technique for training large networks provided $c$ is small. Given the same history, RNA would solve for $p/s$ coefficients for $p/s$ entire network snapshots to obtain a nonlinear average of the whole snapshots [20]. Hence, RNA would require storing all $np/s$ parameter history values. However, for the memory conservation afforded by incrementally updating fit functions, one pays a little extra work: rather than solving for $A_{t+1}$ directly, one must perform $p/s - c + 1$ incremental updates to $A_{t+1}$.

It should be noted that this is a general adaptation of stochastic gradient descent that is not specific to neural networks. This method may also be appropriate for other high-dimensional optimization problems.

4 Relationship to Nesterov's Accelerated Gradient

One could make predictions every iteration, which would bring our method closer to some existing accelerated gradient schemes. If one made predictions every iteration using a linear fit function, our algorithm could be written

\[
z_t =
\begin{cases}
\theta_t & \text{if } t < p \\
A_t \begin{bmatrix} 1 \\ l_t \end{bmatrix} & \text{otherwise}
\end{cases}
, \qquad
\theta_{t+1} = z_t - \epsilon \nabla L(z_t)
\]

where $A_t$ minimizes the $L_2$-norms of the columns of $J A_t - \Theta_t$. Here, $J \in \mathbb{R}^{(p/s) \times 2}$ has $\left[ 1^{i-1}, 2^{i-1}, \cdots, (p/s)^{i-1} \right]^{\top}$ for its $i$th column vector, and $\Theta_t \in \mathbb{R}^{(p/s) \times n}$ has $\theta_{t-p+is}$ for its $i$th row vector. With $p = 2$ and $s = 1$, this begins to look quite a bit like the NAG algorithm, which makes the update

\[
z_t = (1 - \gamma_{t-1})\,\theta_t + \gamma_{t-1}\,\theta_{t-1} \quad \text{with } z_0 = \theta_0, \qquad
\theta_{t+1} = z_t - \epsilon \nabla L(z_t)
\]

for a specifically chosen series $\{\gamma_t\}_{t=0}^{\infty}$. With $l_t = 2 - \gamma_{t-1}$ these methods are identical. For continuously differentiable, smooth, convex loss functions NAG can achieve a global convergence rate of $O(1/t^2)$ [2,17]. A natural extension of NAG incorporates a history of three points such that the update is

\[
\lambda_t = \frac{1 + \sqrt{1 + 4\lambda_{t-r}^2}}{2}, \qquad
z_t =
\begin{cases}
\dfrac{\lambda_{t-1}}{\lambda_t}\,\theta_{t-r+1} + \dfrac{\lambda_t - 1}{\lambda_t}\,\theta_t - \dfrac{\lambda_{t-1} - 1}{\lambda_t}\,\theta_{t-r} & \text{if } t > r \\
\theta_t & \text{otherwise}
\end{cases}
, \qquad
\theta_{t+1} = z_t - \epsilon \nabla L(z_t)
\tag{3}
\]

where $\lambda_0, \cdots, \lambda_{r-1} = 0$ and $r \in \mathbb{Z}_{>0}$.

Theorem 1.
Let $L$ be a convex, continuously differentiable and $\beta$-smooth function that admits a minimizer $\theta^* \in \mathbb{R}^n$. Given an arbitrary initialization $\theta_0 \in \mathbb{R}^n$, for $T > r$ and $\epsilon = 1/\beta$, update scheme (3) satisfies

\[
\sum_{t=T-r}^{T} \left( \frac{t+1}{r} \right)^2 \left( L(\theta_{t+1}) - L(\theta^*) \right) \le 2\beta \, \| z_r - \theta^* \|_2^2 .
\]

When $r = 1$ this reduces to NAG. If in addition we assume strong convexity of our objective function $L$, the convergence rate becomes clearer.

Corollary 1. Let $L$ be a strongly convex function with parameter $m > 0$ that is continuously differentiable and $\beta$-smooth and admits a minimizer $\theta^* \in \mathbb{R}^n$. Given an arbitrary initialization $\theta_0 \in \mathbb{R}^n$, for $T > r$ and $\epsilon = 1/\beta$, update scheme (3) satisfies

\[
\sum_{t=T-r}^{T} \left( \frac{t+1}{r} \right)^2 \left( L(\theta_{t+1}) - L(\theta^*) \right) \le \frac{2\beta^2 \, \| \theta_0 - \theta^* \|_2^2}{m r} .
\]

The order of $r$ in the denominator on each side of the above inequality is the same. Hence, for $m = \beta$, $\min_{t \in \{T-r, \cdots, T\}} \{ L(\theta_{t+1}) - L(\theta^*) \}$ converges at the same rate as NAG. The proofs of Theorem 1 and Corollary 1 can be found in Appendix A [16].

In this well-behaved, theoretical environment, updating based on a linear combination of older values maintains the convergence rate of NAG. However, update method (3) is not practical for deep learning because it requires $r\times$ the memory to save a history of network parameter values. Instead, making parameter predictions every $p$th iteration, as in update method (1), makes the additional memory requirement significantly more practical. In the setting of neural network parameters, update method (1) has the capacity to outperform NAG. Considering an evenly distributed history of values extending further into the past allows one to de-noise trends. By incorporating a longer history, method (1) can afford to make predictions further into the future while minimizing additional memory requirements.
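Update scheme (3) can be sketched numerically; for r = 1 it coincides with standard NAG. The quadratic objective, step count, and smoothness constant below are illustrative assumptions for the sketch, not part of the paper's experiments.

```python
import numpy as np

def extended_nag(grad, theta0, beta, r=1, steps=300):
    """Sketch of update scheme (3) with step size eps = 1/beta.

    Keeps a history of parameter values; with r = 1 this is Nesterov's
    accelerated gradient.  Storing the full history (r times the memory)
    is exactly what makes this impractical for large networks.
    """
    eps = 1.0 / beta
    lam = [0.0] * r                        # lambda_0, ..., lambda_{r-1} = 0
    hist = [np.asarray(theta0, dtype=float)]
    for t in range(steps):
        if t >= r:
            # lambda_t = (1 + sqrt(1 + 4*lambda_{t-r}^2)) / 2
            lam.append((1.0 + np.sqrt(1.0 + 4.0 * lam[t - r] ** 2)) / 2.0)
        if t > r:
            z = (lam[t - 1] / lam[t]) * hist[t - r + 1] \
                + ((lam[t] - 1.0) / lam[t]) * hist[t] \
                - ((lam[t - 1] - 1.0) / lam[t]) * hist[t - r]
        else:
            z = hist[t]
        hist.append(z - eps * grad(z))     # theta_{t+1} = z_t - eps*grad L(z_t)
    return hist[-1]
```

On a smooth strongly convex quadratic this converges, matching the behavior the theorem describes in the r = 1 case.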
In comparison to NAG, employing update scheme (1) requires more memory for the fit function variables $A_t$, but performs less work as the snapshot increment $s$ and prediction increment $p$ increase, since fit function updates and parameter predictions happen less often. One must strike a balance, though: for large $p$ and large $p/s$ one should be able to predict network parameters with more confidence, provided the chosen fit function is well suited to the trend, but large $p$ will exhibit delayed performance. Method (1) introduces a number of new hyperparameters that can be tuned for a particular task.

5 Experimental Results

The goal of our approach is to decrease the number of training epochs needed for a network to reach a particular testing accuracy. To test this, we ran experiments on the SVHN [18] and CIFAR10 [11] datasets using Krizhevsky's cuda-convnet with 4 hidden layers [12]. This net does not produce state-of-the-art accuracies for these datasets, but rather highlights the improvement seen by PCGD when compared to SGD. We implement our work in Caffe, which provides this architecture in its CIFAR10 "quick" example. We trained using batch size 100. Unless otherwise specified, hyperparameters and initialization distributions provided by Caffe's "quick" architecture are left unchanged. All experiments are run on the Bridges' NVIDIA P100 GPUs through the Pittsburgh Supercomputing Center. Training is done with batched gradient descent using the cross-entropy loss function on the softmax of the output layer.

In this paper we use only linear fit functions to make parameter predictions. That is, the fit function class is $f(a, x) = a_1 + a_2 x$ and the number of fit function variables to solve for per network parameter is $c = 2$. In this case, a network with $n$ parameters requires storing an additional $2n$ values.
If m is the maximum number of iterations we will train, p is the prediction increment and s is the snapshot increment, define g_(d,u)(b, t) = b_1 + b_2 (t/p)^d, where b is chosen such that g_(d,u)(b, 0) = p/s + u_1 and g_(d,u)(b, m) = p/s + u_2 for some u_1, u_2 ∈ [0, 2p/s], u_1 < u_2. We chose our prediction length such that l_t = g_(d,u)(b, t). This means that at iteration p, PCGD tries to predict what the network weights will be at iteration p + s·u_1 and sets the weights to those predicted values. Similarly, at iteration m, PCGD would try to predict what the network weights would be at iteration m + s·u_2, but we do not make the last, or last few, predictions because immediately after predicting there is often a slight drop in accuracy that needs to be corrected by some gradient descent steps. This slight drop after predicting could be minimized by less aggressive predictions or better fit function choices, but we chose simply to leave out the last few predictions. It is a good idea to keep u_1 small because parameter trends can change and we do not want to be over-influenced by start-up trends.

68 A. Nesky and Q. F. Stout

We compare PCGD with NAG and SGD. We also consider a hybrid method combining NAG and PCGD, abbreviated NAG-PCGD. To combine the two methods we nest NAG updates inside PCGD updates; the update scheme for NAG-PCGD is written out explicitly in Appendix B [16]. When training with PCGD and NAG-PCGD, we use prediction increment p = 150 and snapshot increment s = 15 for all of our experiments.

When plotting accuracy results, we plot the maximum testing accuracy seen so far by each training iteration against iterations. While training, testing accuracy is usually noisy, which can obscure differences in performance when comparing different methods. Plotting the maximum testing accuracy seen so far displays these differences more clearly.
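The prediction-length schedule g_(d,u) defined above admits a closed form for b from its two boundary conditions. A small sketch under our reading of the definition (the helper name and argument order are our own):

```python
def prediction_length(t, m, p, s, d, u1, u2):
    """Schedule g_(d,u)(b, t) = b1 + b2 * (t/p)**d, with b solved from the
    boundary conditions g(b, 0) = p/s + u1 and g(b, m) = p/s + u2."""
    b1 = p / s + u1                      # g(b, 0) = p/s + u1
    b2 = (u2 - u1) / (m / p) ** d        # makes g(b, m) = p/s + u2
    return b1 + b2 * (t / p) ** d

# With p = 150, s = 15 and (d, u) = (6, [5, 10]) over m = 9000 iterations,
# the prediction length grows from 15 at t = 0 to 20 at t = m.
```

Large d keeps the schedule near p/s + u_1 for most of training and ramps up only near the end, matching the advice above to keep early predictions conservative.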
There was no noticeable difference in the amount of noise seen in the testing accuracy for the various methods in our experiments.

5.1 SVHN

We experimented on the SVHN dataset with Krizhevsky's cuda-convnet [12]. The base learning rate was 0.001 and was dropped by a factor of 10 after 4,000 iterations. Testing took place every 50 training iterations. When training with PCGD and NAG-PCGD, we use prediction length l_t = g_(6,[5,10])(b, t). Figure 1 (Left) plots the maximum accuracy seen so far against iterations using standard SGD, NAG, PCGD and NAG-PCGD. Figure 1 (Right) plots the slopes of the curves in Fig. 1 (Left) versus iterations. We show the iterations of steepest accuracy increase to highlight the difference in convergence rates of the various methods. NAG and NAG-PCGD initially increase at nearly the same rate, which is ≈4× faster than PCGD and SGD. Around iteration 450, PCGD pulls ahead of SGD, begins to catch up to NAG and eventually surpasses it. NAG-PCGD tends to hug the top of all the other curves, exhibiting the benefits of both sub-methods. Confined to 2,000 iterations, NAG-PCGD gives the best results. At iteration 4,000, when the learning rate decreases by a factor of 10, there is another jump in accuracy where we can see the difference in convergence rates again on a smaller scale. After 9,000 iterations, the network trained using traditional SGD achieves a final accuracy of 91.96%, NAG has a final accuracy of 92.38%, PCGD has a final accuracy of 92.42%, and NAG-PCGD has a final accuracy of 92.34%. SGD hit a maximum testing accuracy of 92.06% at iteration 8,600; NAG took 4,700 iterations to reach this accuracy level, PCGD also took 4,700 iterations and NAG-PCGD took 5,100 iterations. That is, PCGD reached SGD's testing maximum in just over half the number of training iterations that SGD took.
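The "maximum testing accuracy seen so far" curves described above are simply a running maximum over the raw test-accuracy trace; a minimal sketch (the function and array names are ours):

```python
import numpy as np

def max_accuracy_so_far(test_acc):
    """Running maximum of a (noisy) test-accuracy trace: the value at
    test point i is the best accuracy observed up to and including i."""
    return np.maximum.accumulate(np.asarray(test_acc, dtype=float))

smoothed = max_accuracy_so_far([0.60, 0.72, 0.70, 0.80, 0.78])
```

Plotting this instead of the raw trace removes test-to-test noise while preserving the iteration at which each accuracy level was first reached.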
Fig. 1. (Left) Maximum accuracy results on the SVHN data set; testing takes place every 50 training iterations. (Right) Slope of the left figure versus iterations.

5.2 CIFAR10

We also trained Krizhevsky's cuda-convnet on CIFAR10 for 195,000 iterations. The base learning rate was 0.001. We dropped the learning rate by a factor of 10 after 60,000 iterations and again after 125,000 iterations. Testing took place every 250 training iterations. We used l_t = g_(4,[5,10])(b, t) for our prediction length at prediction intervals. Figure 2 (Left) shows maximum accuracy results through training using SGD, NAG, PCGD and NAG-PCGD. Again, we show only the iterations of steepest accuracy increase. Here, the testing increment is larger than our prediction increment, which may hide any initial convergence advantage of NAG over PCGD. Given more time to excel, PCGD shows performance advantages over NAG; NAG does not even consistently outperform SGD per iteration. At any one time, NAG is at most 3.18% more accurate than SGD, PCGD is at most 3.91% more accurate than SGD, and NAG-PCGD is at most 6.49% more accurate than SGD. Figure 2 (Right) shows, for a given accuracy, the percent of SGD iterations each method took to reach that accuracy. That is, if it took SGD x iterations to reach a particular accuracy for the first time, and PCGD took y iterations to reach that accuracy for the first time, then the value plotted for PCGD at that accuracy is 100 · y/x. This figure shows PCGD generally reaching particular accuracies before SGD, and NAG-PCGD generally reaching accuracies before PCGD. SGD took 114,000 iterations to become 81.7% accurate.
Training with NAG yielded 81.7% accuracy in 73% of the iterations required by SGD to reach this accuracy, training with PCGD yielded 81.7% accuracy in 56% of the iterations required by SGD, and training with NAG-PCGD yielded 81.7% accuracy in 50% of the iterations required by SGD. That is, PCGD took only 77% of the iterations required by NAG to reach 81.7% accuracy.

For these values of s and p, using PCGD does not noticeably increase the average iteration runtime when compared with SGD. For both methods, the average forward-backward pass took ≈46 ms when using batch size 100 on Bridges' NVIDIA P100 GPU; time was measured using caffe time benchmarks.

Fig. 2. Results on the CIFAR10 data set. (Left) Maximum accuracy versus iterations. Testing takes place every 250 training iterations. (Right) Percent of SGD iterations each method took to reach a particular accuracy.

6 Conclusion

We have developed a general adaptation to gradient descent and considered its impact in the case of training neural networks. Predictor-Corrector Gradient Descent reduces the number of iterations required to learn by incorporating traditional predictor-corrector inspired ideas into classic gradient descent. We have shown that PCGD can significantly decrease the number of training epochs needed for a network to reach a particular testing accuracy when compared to stochastic gradient descent. On both datasets considered, PCGD reduced the number of iterations required to reach SGD's maximum accuracy by nearly one half. When two identical networks are allowed to train for the same number of iterations, the network trained using PCGD regularly outperforms the network trained using SGD. We have also shown that PCGD can outperform Nesterov's Accelerated Gradient for more complex learning problems requiring more training.
By substantially reducing the number of iterations required to reach a particular accuracy, PCGD can make training large networks more feasible in cases where one can afford to increase the training storage by a small constant multiple.

We have also considered the theoretical case of a strongly convex, continuously differentiable and smooth objective function and showed that updating parameters as a linear combination of historical values preserves the convergence rate of NAG. Although our experimental environment is far from this hypothetical one, the theory holds true when using PCGD to train neural networks: after an initial delay, we found PCGD can outperform NAG.

In this work, we only used linear fit functions and a single prediction length for every network parameter. These choices worked well, but there is room for additional exploration. One may see further improvement by using a dynamic value for the prediction increment p.

Acknowledgments. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1256260. This work used the Extreme Science and Engineering Discovery Environment, which is supported by National Science Foundation grant number OCI-1053575. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center.

References

1. Andrychowicz, M., et al.: Learning to learn by gradient descent by gradient descent. In: NIPS (2016)
2. Beck, A., et al.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
3. Cassioli, A., et al.: An incremental least squares algorithm for large scale linear classification. Eur. J. Oper. Res. 224(3), 560–565 (2013)
4. Daniel, C., et al.: Learning step size controllers for robust neural network training. In: AAAI (2016)
5.
Dozat, T.: Incorporating Nesterov momentum into Adam. In: ICLR Workshop (2016)
6. Duchi, J., et al.: Adaptive subgradient methods for online learning and stochastic optimization. JMLR 12, 2121–2159 (2011)
7. Heeger, D.J.: Theory of cortical function. Proc. Natl. Acad. Sci. USA 114(8), 1773–1782 (2017)
8. Ho, Q., et al.: More effective distributed ML via a stale synchronous parallel parameter server. In: NIPS, pp. 1223–1231 (2013)
9. Hratchian, H., et al.: Steepest descent reaction path integration using a first-order predictor-corrector method. J. Chem. Phys. 133(22), 224101 (2010)
10. Kingma, D., et al.: Adam: a method for stochastic optimization. In: ICLR (2015)
11. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report, Computer Science, University of Toronto (2009)
12. Krizhevsky, A.: cuda-convnet. Technical report, Computer Science, University of Toronto (2012)
13. Krizhevsky, A., et al.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106–1114 (2012)
14. Luca, M.D., et al.: Optimal perceived timing: integrating sensory information with dynamically updated expectations. Sci. Rep. 6, 28563 (2016)
15. Neelakantan, A., et al.: Adding gradient noise improves learning for very deep networks. arXiv:1511.06807 (2015)
16. Nesky, A., et al.: Training neural networks using predictor-corrector gradient descent: Appendix (2018). http://www-personal.umich.edu/~anesky/PCGD appendix.pdf
17. Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady 27, 372–376 (1983)
18. Netzer, Y., et al.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
19. Polyak, B.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
20.
Scieur, D., et al.: Regularized nonlinear acceleration. In: NIPS (2016)
21. Simonetto, A., et al.: Prediction-correction methods for time-varying convex optimization. In: IEEE Asilomar Conference on Signals, Systems and Computers (2015)
22. Süli, E., et al.: An Introduction to Numerical Analysis, pp. 325–329 (2003)
23. Tieleman, T., et al.: Lecture 6a - rmsprop. COURSERA: Neural Networks for Machine Learning (2012)
24. Zeiler, M.D.: ADADELTA: an adaptive learning rate method. arXiv:1212.5701 (2012)
25. Zhang, Y., et al.: Prediction-adaptation-correction recurrent neural networks for low-resource language speech recognition. arXiv:1510.08985 (2015)
26. Zhang, Y., et al.: Speech recognition with prediction-adaptation-correction recurrent neural networks. In: IEEE ICASSP (2015)

Investigating the Role of Astrocyte Units in a Feedforward Neural Network

Peter Gergeľ(B) and Igor Farkaš
Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Mlynská dolina, 84248 Bratislava, Slovak Republic
peter.gergel@gmail.com, farkas@fmph.uniba.sk
http://cogsci.fmph.uniba.sk

Abstract. Current research in neuroscience has begun to shift perspective from neurons as sole information processors to including astrocytes as equal and cooperating units in this function. Recent evidence sheds new light on astrocytes and presents them as important regulators of neuronal activity and synaptic plasticity. In this paper, we present a multi-layer perceptron (MLP) with artificial astrocyte units which listen to and regulate hidden neurons based on their activity. We test the behavior and performance of this bio-inspired model on two classification tasks, the N-parity problem and the two-spirals problem, and show that the proposed models outperform the standard MLP. Interestingly, we have also discovered multiple regimes of astrocyte activity depending on the complexity of the problem.
Keywords: Glial cells · Astrocytes · MLP · Classification · Computational model

1 Introduction

Glial cells, predominantly astrocytes, have gained a lot of attention in neuroscience during the last few decades, as compelling evidence has shown that these cells are no longer considered passive and supportive but are actively involved in neuronal regulation and synaptic plasticity [1,12]. The classical view of astrocytes supports the idea that they are indispensable in the development of the central nervous system, providing metabolic and physical support to other neural cells, or maintaining homeostasis. It was assumed that astrocytes were not able to generate action potentials similar to neurons, or to be involved in brain functions such as information transfer and processing, learning, and plasticity, i.e. functions attributed solely to neurons.

However, recent research has challenged this view, as it was discovered that astrocytes have a resting membrane potential of ∼−80 mV, that there are ∼1.4 astrocytes for every neuron in the human cortex [3], and that a single astrocyte can encapsulate ∼10⁵ synapses [5]. This led to a novel concept of an intimate connection between neurons and astrocytes named the tripartite synapse. Moreover, astrocytes release gliotransmitters to local neurons and propagate Ca²⁺ waves using

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 73–83, 2018. https://doi.org/10.1007/978-3-030-01424-7_8

a cellular network called the glial syncytium, although this signaling occurs on a much slower time scale, ranging from seconds to minutes, as opposed to neurons, whose time scale is milliseconds. This implies the existence of a bidirectional communication between astrocytes and neurons whose importance is still not well understood.
Still, it is assumed that brain function and possibly higher cognition emerge from the coordinated activity of astrocytes and neurons in neuron–glia networks [11]. A better understanding of astrocyte–neuron coupling may provide building blocks for studying the regulatory capability of astrocytic networks on a larger scale. Computational models of neural networks extended with artificial glia may not only serve as an interesting novel concept, but mainly provide space for hypotheses about the potential roles of glial cells in biological neuronal circuits and networks.

In this paper we propose a model of an MLP extended with artificial astrocytes whose role is to regulate neuronal activity. For evaluating the model performance we chose the classification task using two datasets: N-parity and two spirals. The paper is organized as follows. Section 2 includes the related work. In Sect. 3, we describe various versions of the investigated model. In Sect. 4, we provide the experimental results. Section 5 concludes the paper.

2 Related Work

In computational neuroscience, two modeling paradigms have so far been considered: (1) biophysical, with a focus on low-level physical and chemical properties of a biological system, or (2) connectionist, which does not try to model every single aspect of a system but instead focuses on abstractions. Despite the plethora of biophysical models of astrocytes, connectionist modeling is still at an early stage.

The concept of artificial astrocytes in connectionist systems was first introduced in [6], where the authors augmented the hidden layer of an MLP with an astrocytic network whose function was to generate chaotic noise according to a given tent-map formula as a means of avoiding local minima during gradient optimization. On the two-spirals problem the model achieved better performance than the regular MLP.
Later, the same authors presented multiple works, including impulse astrocytes that actively listen to and regulate neurons based on their activity [7], a Hopfield network augmented with astrocytes [9], and neurogenesis driven by astrocytes [8].

A similar approach was taken in [13] and [2], where instead of modeling neuronal regulation the authors focused on modeling synaptic plasticity driven solely by astrocytes. Using an MLP in combination with evolutionary algorithms, they showed that the model with artificial astrocytes was superior to the model without them. Using computer simulations, they demonstrated that the model was able to learn various problems despite the fact that no gradient-based method was used for training the neural networks.

Finally, in [10], the authors presented a model, SONG-Net, that combines an MLP, a self-organizing map (SOM) and neuron–glial interactions. By evaluating the performance on four tasks, they showed that the proposed model achieved up to twelve times faster convergence with a lower error rate. However, the authors did not present glia as individual functional units; instead, they were used only as an inspiration for the concept of neuronal regulation.

3 Proposed Models

Here we present multiple models, all based on an MLP combined with artificial astrocytes. We start with the simplest model to allow faster in-depth exploration, and we gradually move toward adding more complex, yet biologically plausible, mechanisms.

3.1 A-MLP

Since the human cortex contains on average 1.4 astrocytes for each neuron, we simplify this notion and present a model with an astrocyte-to-neuron ratio of 1:1. Inspired by [7], we combine the hidden layer of an MLP with impulse astrocytes that listen to and modulate the activity of hidden neurons (Fig. 1).

Fig. 1. Basic MLP architecture with astrocyte units (A-MLP).
Each hidden neuron is paired with an astrocyte that listens to it and regulates its regime based on its activity. Since we consider binary classification problems, only one output unit is used.

The output of the i-th hidden neuron is given by the following formula:

h_i(t+1) = f(∑_{j=0}^{M} w_{ij} x_j(t) + α ψ_i(t))    (1)

where the activation function is

f(net) = 1 / (1 + exp(−net))    (2)

and the astrocyte activity is modified as

ψ_i(t) = 1 if θ < h_i(t−1), and ψ_i(t) = γ ψ_i(t−1) otherwise.    (3)

Each astrocyte contributes, with a weight α, to the activity of the hidden neuron (Eq. 1). When the neuron output exceeds the given threshold θ, the astrocyte activation is set to 1 and then starts to decay by a factor γ, where 0 < γ < 1. Note that the model has three free hyperparameters whose optimal values have to be found experimentally. Since each problem requires a different set of optimal parameters, finding them requires time-intensive computations. We try to address these issues by replacing the constant parameters with modifiable versions.

3.2 A-MLP(α)

Traditionally in supervised learning, the neuron weights are updated using a gradient descent method, better known as the error backpropagation algorithm. Since the astrocytic weight in Eq. 1 can be treated as any other weight, we can apply the same optimization method for its update (the derivation of the formula is provided in the appendix). Next, instead of using a single weight shared by all astrocytes, we equip each astrocyte unit with an individual weight. The activation rule for the hidden unit then becomes

h_i(t+1) = f(∑_{j=0}^{M} w_{ij} x_j(t) + α_i ψ_i(t))    (4)

3.3 A-MLP(θ)

Since we cannot directly update the parameter θ (Eq. 3) using a gradient-based method, we propose an unsupervised learning rule. It is relatively common that during training some neurons get trapped in one of two extremes, becoming either dead or permanently active.
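The A-MLP hidden-layer step (Eqs. 1–3) can be sketched as follows; this is a minimal illustration with our own function names and assumed shapes, not the authors' code:

```python
import numpy as np

def sigmoid(net):
    # Eq. 2: logistic activation
    return 1.0 / (1.0 + np.exp(-net))

def a_mlp_hidden_step(W, x, psi, alpha=-0.5, theta=0.5, gamma=0.5):
    """One A-MLP hidden-layer step. W is (H, M+1) including a bias column,
    x is (M+1,) with x[0] = 1, psi holds each astrocyte's current activity.
    Eq. 1: each astrocyte adds alpha * psi_i to neuron i's net input.
    Eq. 3: psi is reset to 1 when the neuron's output exceeds theta,
    otherwise it decays by the factor gamma (for use at the next step)."""
    h = sigmoid(W @ x + alpha * psi)
    psi_next = np.where(h > theta, 1.0, gamma * psi)
    return h, psi_next
```

With the hyperparameter values used later in Sect. 4 (α = −0.5, γ = 0.5, θ = 0.5), an astrocyte whose paired neuron stays below threshold halves its activity at every step.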
The weight update of such neurons becomes problematic: the gradient is close to zero, so no error propagates through a dead neuron and no update occurs. On the other hand, weights might grow to large values, affecting other neurons in the network and making the model unstable.

The same issue may occur in artificial astrocytes when the threshold θ is set too low, making the astrocytes fire all the time, or too high, preventing the neighboring neurons from exceeding the required activation. Moreover, since each neuron in the neural network develops its own role in the classification task, a single shared θ for all neurons may become more of a burden than a benefit.

To solve these problems, we propose an individual θ_i for each astrocyte and two variations of an update rule. In order to stabilize the astrocytic regime, we can set the threshold θ either directly to the mean value 〈ψ_i〉_t of an astrocyte unit (Eq. 5) or only shift the threshold slightly closer to the mean value (Eq. 6) using the learning speed η_θ. This forces the astrocyte to move only within its mean values, avoiding the critical values of 0 and 1. With a higher θ it becomes harder for the neuron's output to exceed the threshold, thus the astrocyte activity decays, and vice versa. Hence, the update rules are

θ_i(t+1) = 〈ψ_i(t)〉_t    (5)

and

θ_i(t+1) = θ_i(t) + η_θ (〈ψ_i(t)〉_t − θ_i(t))    (6)

introducing another free parameter, namely the length of the averaging window.

3.4 A-MLP(γ)

The hyperparameter γ can be updated based on the same principle as explained in the previous section. This time we update γ to achieve inverse correlation with the mean value of the astrocytic activity:

γ_i(t+1) = 1 − 〈ψ_i(t)〉_t    (7)

γ_i(t+1) = γ_i(t) + η_γ (1 − 〈ψ_i(t)〉_t − γ_i(t))    (8)

Higher values of γ are achieved during lower activity; thus a hypo-excited astrocyte holds its activation value for a longer period.
On the other hand, lower γ triggers faster output decay, forcing the astrocyte to avoid excessive stimulation.

3.5 A-MLP(γ, θ), A-MLP(α, γ, θ)

Finally, the last two models are simple combinations of the previous ideas. A-MLP(γ, θ) combines the models with dynamic θs and γs, and A-MLP(α, γ, θ) includes dynamic αs as well.

4 Experiments

We assess the performance of all six proposed models, with a standard MLP (without astrocyte units) as a baseline, on two difficult classification tasks: (1) the N-parity problem and (2) the two-spirals problem. All results are averaged over 100 simulations with different initial setups. The learning rate in the backpropagation algorithm is set to η = 0.1.

4.1 N-parity Problem

The task is to determine whether a binary input vector has an even or odd number of ones. More formally, an input vector has the form [x_1, …, x_N], x_i ∈ {0, 1}, and the target is y = (1 + ∑_{i=1}^{N} x_i) mod 2. Since the problem is notoriously difficult for machine learning algorithms to generalize to unseen patterns, we train the models on the full dataset (no train/test split), whose total size is 2^N. Starting with the MLP, we chose a hidden layer with N neurons (a higher number did not yield better results) and an output layer with only a single neuron (0 = odd input vector, 1 = even input vector). The proposed models with astrocyte units had the following values for the fixed hyperparameters: α = −0.5, γ = 0.5, θ = 0.5 (previously found using grid search). In Table 1 we present the performance of all models; although the models with astrocyte units lead on average to better performance, the differences are not statistically significant (p > 0.1). Next, in order to gain insight into the learned parameters, we display the distributions of astrocyte activities (shown in Fig. 2). It can be seen that astrocytes develop various regimes depending on the problem complexity.
With lower N it is possible to clearly detect N astrocyte regimes, but with higher N the profiles gradually lose their multimodality, albeit remaining non-uniformly distributed.

Table 1. Mean squared error (MSE) ± standard deviation over 100 instances on three parity problems trained for 10,000 epochs. Models with astrocyte units yield a lower error rate, although no statistical significance was found. In each task, the best model is denoted with *.

Model | 4-parity | 6-parity | 8-parity
MLP | 0.081 ± 0.060 | 0.065 ± 0.035 | 0.046 ± 0.070
A-MLP | 0.083 ± 0.086 | 0.059 ± 0.034* | 0.039 ± 0.023
A-MLP(α) | 0.080 ± 0.065 | 0.072 ± 0.054 | 0.073 ± 0.069
A-MLP(θ) | 0.083 ± 0.075 | 0.065 ± 0.036 | 0.037 ± 0.021*
A-MLP(γ) | 0.087 ± 0.065 | 0.062 ± 0.034 | 0.042 ± 0.026
A-MLP(γ, θ) | 0.074 ± 0.051* | 0.063 ± 0.055 | 0.042 ± 0.027
A-MLP(α, γ, θ) | 0.092 ± 0.072 | 0.078 ± 0.056 | 0.056 ± 0.028

4.2 Two-Spirals Problem

The two spirals consist of two interleaved sets of points in 2D space (Fig. 3). The problem is, given a point (x, y), to decide whether it belongs to the first or the second spiral. This is considered a complex nonlinear problem, hard for a standard MLP due to the high number of local minima, which are generally problematic for gradient-based models.

Fig. 2. Distributions of astrocyte activity (across 100 simulations) after being fully trained on a parity problem. With lower N it is possible to detect N peaks, suggesting that each astrocyte handles a single bit from an input vector. On the other hand, with higher N the peaks become less visible.

Fig. 3. Two-spirals problem, where the task is to separate the interleaved classes.

For the simulations we first found optimal hyperparameter values for the MLP and then used them in the models with astrocyte units. We used N = 30 hidden neurons (more units did not produce better results), 5,000 training epochs and a train/test dataset split in ratio 80:20.
For the models with astrocytes we found optimal hyperparameters using grid search (presented in Fig. 4) and hence used the values α = −0.1, γ = 0.5, θ = 0.1. Results averaged over 100 simulations are in Table 2, with A-MLP(γ, θ) being the best model, yielding a 50% lower error rate compared to the standard MLP. Similarly, we looked at the astrocyte activities of the fully trained network and observed a normal distribution, shown in Fig. 6.

Fig. 4. Grid search for optimal values of hyperparameters. Each heatmap fixes a single parameter (shown in the title) and displays all combinations of the other two parameters. Each cell in every heatmap is averaged over 5 simulations, with lighter color denoting better performance.

Table 2. Mean squared error ± standard deviation over 100 instances on the two-spirals task trained for 5,000 epochs. The best model, A-MLP(γ, θ), yields a 50% lower error rate compared to the MLP, with statistical significance (p < 0.001) (Fig. 5).

Model | Train | Test
MLP | 0.075 ± 0.067 | 0.094 ± 0.066
A-MLP | 0.073 ± 0.067 | 0.088 ± 0.068
A-MLP(α) | 0.050 ± 0.049 | 0.078 ± 0.050
A-MLP(θ) | 0.034 ± 0.045 | 0.049 ± 0.046
A-MLP(γ) | 0.068 ± 0.065 | 0.085 ± 0.063
A-MLP(γ, θ) | 0.030 ± 0.035* | 0.051 ± 0.041*
A-MLP(α, γ, θ) | 0.060 ± 0.051 | 0.095 ± 0.051

Fig. 5. Performance of the best model, A-MLP(γ, θ), compared to the MLP on both the training and testing sets.

Fig. 6. Normal distribution of astrocyte activity (N = 30) at the end of training, accumulated over 100 simulations.

5 Conclusion

Inspired by [7] and recent findings from biological research on astrocyte physiology and their interactions with surrounding neurons, we proposed artificial astrocyte units to be integrated in an MLP. It is known that astrocytes in the CNS form networks in which they communicate using Ca²⁺ waves, whose purpose, according to current knowledge, is to regulate neuronal activity and synaptic plasticity.
In this paper we focused exclusively on neuronal regulation using separate astrocytes, each maintaining a single neuron. Astrocytes contribute to the neuronal summation formula (Eq. 4) weighted by the factor α_i, which was either constant or dynamic. However, dynamically changing a weight along the negative gradient of the loss function does not always provide better results (as in the N-parity problem). We also proposed two methods for the dynamic update of both the astrocyte threshold and the decay (Eqs. 5–8), with the second formula performing better than the first; we used it in all our simulations.

We chose two classification problems, N-parity and two spirals, which are known to be rather problematic for machine learning algorithms, and used them to analyze the performance and behavior of our models. For both problems we first selected an MLP with optimal parameters (the number of hidden neurons, the learning rate, the initial weight distribution) and then used them in the models with astrocyte units. The results obtained for N-parity did not outperform the MLP, suggesting that all models had already converged to the global minimum. However, for the two spirals all our models performed better in terms of lower errors, with statistical significance (p < 0.001). Both problems developed unique astrocyte regimes in terms of output distributions, whose shape depended on the number of astrocytes in the case of the N-parity problem and was Gaussian in the two-spirals task. Understanding this phenomenon requires further investigation.

For our future research we would like to focus on a different set of problems, trying to explain why astrocyte regimes develop and how important they are for a given problem. We only focused on feedforward models, but it makes sense to apply the very same idea to recurrent neural networks. Another issue worth investigating would be to adjust the dynamics of the astrocytes.
In our models, astrocyte parameters were updated at the same speed as the weights, but it is known that the dynamics of biological astrocytes is much slower [4]. Last but not least, since we only focused on modulation of single neurons, we would like to connect astrocytes within the syncytium and incorporate their role in synaptic plasticity.

Acknowledgments. This work was supported by grant UK/256/2018 from Comenius University in Bratislava (P.G.) and the Slovak Grant Agency for Science, project VEGA 1/0796/18 (I.F.).

Appendix: Derivation of the Update Formula

Here we derive the formula for the stochastic (online) update of the astrocyte weights α_i in models A-MLP(α) and A-MLP(α, θ, γ). The goal is to minimize the loss function E(w) = ½(d − y(x))², moving the astrocytic weights along the negative gradient, i.e. Δα_i = −∂E(w)/∂α_i. Since E is differentiable with respect to α_i, we can write, using the chain rule,

Δα_i = −(∂E/∂y)(∂y/∂net_y)(∂net_y/∂h_i)(∂h_i/∂net_{h_i})(∂net_{h_i}/∂α_i)    (9)

Δα_i = −(d − y(x)) y(x)(1 − y(x)) w_i^{yh} h_i(1 − h_i) ψ_i    (10)

Δα_i = −δ_y w_i^{yh} h_i(1 − h_i) ψ_i    (11)

where δ_y = (d − y(x)) y(x)(1 − y(x)) and δ_i = δ_y w_i^{yh} h_i(1 − h_i), which yields the final formula:

Δα_i = −δ_i ψ_i    (12)

References

1. Allen, N.J., Barres, B.A.: Signaling between glia and neurons: focus on synaptic plasticity. Curr. Opin. Neurobiol. 15(5), 542–548 (2005)
2. Alvarellos-González, A., Pazos, A., Porto-Pazos, A.B.: Computational models of neuron-astrocyte interactions lead to improved efficacy in the performance of neural networks. Comput. Math. Methods Med. (2012). https://doi.org/10.1155/2012/476324
3. Bass, N.H., Hess, H.H., Pope, A., Thalheimer, C.: Quantitative cytoarchitectonic distribution of neurons, glia, and DNA in rat cerebral cortex. J. Comp. Neurol. 143(4), 481–490 (1971)
4. Cornell-Bell, A.H., Finkbeiner, S.M., Cooper, M.S., Smith, S.J.: Glutamate induces calcium waves in cultured astrocytes: long-range glial signaling. Science 247(4941), 470–473 (1990)
5.
Halassa, M.M., Fellin, T., Takano, H., Dong, J.H., Haydon, P.G.: Synaptic islands defined by the territory of a single astrocyte. J. Neurosci. 27(24), 6473–6477 (2007)

Investigating the role of astrocyte units in a feedforward neural network 83

6. Ikuta, C., Uwate, Y., Nishio, Y.: Multi-layer perceptron with chaos glial network. In: IEEE Workshop on Nonlinear Circuit Networks, pp. 11–13 (2009)
7. Ikuta, C., Uwate, Y., Nishio, Y.: Multi-layer perceptron with impulse glial network. In: IEEE Workshop on Nonlinear Circuit Networks, pp. 9–11 (2010)
8. Ikuta, C., Uwate, Y., Nishio, Y.: Investigation of multi-layer perceptron with pulse glial chain including neurogenesis. In: IEEE Workshop on Nonlinear Circuit Networks, pp. 70–72 (2014)
9. Ikuta, C., Uwate, Y., Nishio, Y., Yang, G.: Hopfield neural network with glial network. In: International Workshop on Nonlinear Circuits, pp. 369–372 (2012)
10. Marzouki, K.: Neuro-glial interaction: SONG-Net. In: Arik, S., Huang, T., Lai, W.K., Liu, Q. (eds.) ICONIP 2015. LNCS, vol. 9491, pp. 619–626. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-26555-1_70
11. Nedergaard, M., Ransom, B., Goldman, S.A.: New roles for astrocytes: redefining the functional architecture of the brain. Trends Neurosci. 26(10), 523–530 (2003)
12. Parpura, V., Basarsky, T.A., Liu, F., Jeftinija, K., Jeftinija, S., Haydon, P.G.: Glutamate-mediated astrocyte-neuron signalling. Nature 369(6483), 744 (1994)
13. Porto-Pazos, A.B., et al.: Artificial astrocytes improve neural network performance. PLoS ONE 6(4), e19109 (2011)

Interactive Area Topics Extraction with Policy Gradient

Jingfei Han1, Wenge Rong1(B), Fang Zhang2, Yutao Zhang2, Jie Tang2, and Zhang Xiong1
1 School of Computer Science and Engineering, Beihang University, Beijing, China {jfhan,w.rong,xiongz}@buaa.edu.cn
2 Department of Computer Science and Technology, Tsinghua University, Beijing, China {fang-zha15,yt-zhang13}@mails.tsinghua.edu.cn, jietang@tsinghua.edu.cn

Abstract.
Extracting representative topics and improving the extraction performance is rather challenging. In this work, we formulate a novel problem, called interactive area topics extraction, and propose a learning interactive topics extraction (LITE) model that regards this problem as a sequential decision-making process and constructs an end-to-end framework that exploits interaction with users. In particular, we use a recurrent neural network (RNN) decoder to address the problem and a policy gradient method to tune the model parameters in light of user feedback. Experimental results have shown the effectiveness of the proposed framework.

Keywords: Interactive area topics extraction · RNN decoder · Policy gradient

1 Introduction

Extracting representative topics of an area plays an increasingly important role in trend analysis and historical analysis. It can help researchers gain an overview of a discipline or area, grasp its development trend, and discover potential research directions [20]. In addition, newcomers to an area can be guided toward hot topics through topic extraction for that area. Much attention has been paid to extracting hypernym-hyponym relationships from large corpora or knowledge bases [10,18]. In an academic vocabulary, extracting hypernym-hyponym relationships is equivalent to extracting the topics of a given area [1]. However, automatic extraction from text yields too many topics per area. For example, the hypernym "AI" (Artificial Intelligence) includes coarse-grained hyponyms such as "Machine Learning" as well as fine-grained hyponyms such as "Support Vector Machine". It is necessary to extract the representative topics of a given area automatically because people cannot gain useful information from too many hypernym-hyponym relationships. Earlier works [2,12] mainly focused on extracting topics from documents rather than areas.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 84–93, 2018.
https://doi.org/10.1007/978-3-030-01424-7_9

Recently, Zhang et al. [20] tried to extract topics for areas based on a knowledge base [14], but the overall performance is strongly affected by that knowledge base. Hence, we try to use additional information, such as user feedback, to improve the extraction performance. Interactive area topics extraction has rarely been explored. The major challenges lie in formally formulating the problem, extracting the representative topics with user feedback, and designing experiments and evaluations to prove the method's effectiveness. To address these challenges, we design a learning interactive topics extraction (LITE) model that considers both the topic-sequence extraction of a given area and user feedback, and maps them into an end-to-end framework. The major contributions of this paper include: (1) We formulate interactive area topics extraction as a sequential decision-making problem and model the interaction with users. (2) We propose LITE, which applies a recurrent decoder to model the topic generation of a given area and uses a policy-gradient-based reinforcement learning method to incorporate user feedback, or interaction. (3) We design experiments on a synthetic dataset to evaluate the proposed model. Experimental results prove the effectiveness of the proposed model.

2 Related Work

Various approaches have been proposed to extract topics from documents and knowledge bases, including topic models and keyphrase extraction. As to topic models, Blei et al. [2] proposed Latent Dirichlet Allocation (LDA), whose main idea is that each document consists of many topics and each word appears in each topic with a different probability. The topic probabilities provide an explicit representation of a specific document. However, it is hard to identify what the topics stand for, because topics in LDA are multinomial distributions over words. We need to label the topics if we need this information.
Although many researchers have conducted extensive research on automatic labeling of LDA topics [9,11], there is a clear gap between automatic and manual labeling. Keyphrase extraction mainly follows two approaches: supervised learning and unsupervised learning. In supervised learning, Jiang et al. [7] rank a candidate keyword set by features; they regard the problem as a ranking problem and use Ranking SVM to address it. In unsupervised learning, Hasan and Ng divide existing research into four groups [6]: graph-based ranking, topic-based clustering, simultaneous learning, and language modeling. In addition, Zhang et al. [20] propose the FastKATE model, which feeds knowledge bases into the model to extract the topics of a given area. However, it is difficult to generate or obtain a clean taxonomy. Hence, we try to introduce interaction with users to improve the extraction performance. Many researchers have also tried to introduce user feedback to improve performance on their tasks. Yang et al. [19] predict a user's intention based on the user's feedback to some questions in order to understand user intention interactively. Carlson et al. [3] design a knowledge base called NELL, which learns iteratively by interacting with a human for 10–15 min each day. Deldjoo et al. [5] use interactive information to alter recommendation results so that the recommendations increase user satisfaction.

3 Methodology

3.1 Problem Formulation

First we give formal definitions of the basic terms in this research. Concept in the following refers to the set of all knowledge entities, like a vocabulary list. It contains any knowledge, from coarse-grained concepts such as "Computer Science" to fine-grained concepts such as "Backpropagation". An area is a subset of the concept set whose elements have hyponym concepts. A topic is also a subset of the concept set, but all of a topic's elements have hypernym concepts.
Let C denote the concept space, X the area space, and Y the topic space, where X ⊂ C and Y ⊂ C, and an area can itself be regarded as a topic. The problem we solve in this research can now be formally defined as follows: given a specific area x ∈ X and an integer K ∈ Z+, we extract a topic set y = {y1, y2, . . . , yK} that can represent the given area, where y ⊂ Y. Intuitively, we hope that by tuning our model's output, the extraction result will move closer to the feedback of most users. User feedback U is provided, where ui,x represents the ith user's evaluation of the topic extraction for a given area x. When considering user feedback, our target is to improve the extraction performance with that feedback. Thus, a mapping function f can be learned, formally defined as follows:

f : {x, y, u∗x | x ∈ X, y ⊂ Y, u∗x ∈ U} → {ŷ = {ŷi | ŷi ∈ Y, i = 1, 2, . . . , K}}   (1)

where u∗x refers to all feedback for the area x.

3.2 LITE Model

In this research a learning interactive topics extraction (LITE) model is proposed to obtain topics with user feedback. The model is divided into two steps, i.e., a pre-training step and an updating step, as shown in Fig. 1.

Pre-training. Considering that the length of the topic sequence of a given area is K, a mapping function from x to {y1, y2, . . . , yK} needs to be learned. Assume we have already extracted partial results {y1, y2, . . . , yk−1}; then both the area x and this partial extraction should be considered when extracting the kth topic. Here we use a recurrent neural network (RNN) decoder model, which uses an RNN to generate the topic sequence of a given area, to address the problem, and we define the input and output space (IO space) V = X ∪ Y. The left part of Fig. 1 shows the pre-training step of area topics extraction; here we set K = 4 for illustration only. Given an area x, which comes from the user input in a real application, the model generates the topic sequence one by one.

Fig. 1. An overview of the LITE model framework. x indicates a given area. The dashed node indicates a virtual hidden state, which is usually initialized to the zero vector.

Formally, the conditional distribution of y given x can be written as:

p(y|x; θ) = ∏ (k=1 to K) p(yk | x, y1, y2, . . . , yk−1; θ)   (2)

where yi is the ith topic extracted from the IO space V. We define y0 = x to simplify the formula. Areas and topics are represented as one-hot vectors, x, y ∈ R^|V|, where |V| is the size of the IO space. Let hk ∈ R^d be the hidden state at the kth extraction step. We have hk = g1(hk−1, yk−1), where k = 1, 2, . . . , K. g1 is an activation function such as tanh or ReLU, or a more complicated structure like a GRU unit [4]. The conditional distribution of the kth topic is

p(yk | yk−1, yk−2, . . . , y1, y0; θ) = g2(hk; θ)   (3)

where g2 must produce valid probabilities, e.g., a softmax. Thus, the extraction at step k is

y∗k = argmax (yk ∈ V) g2(hk; θ)   (4)

We first train this model and find an optimal parameter θ by maximizing the conditional log-probability on a training set S. Then we infer the topic sequence for every user query, which is regarded as an input area. In other words, the pre-trained model involves two phases: training and inference. In the training phase, we need a training set to find an optimal solution in the large search space. We use the training set to pre-train the model for the following two reasons. First, the model is difficult to converge because of its many parameters. Second, we hope to improve performance using user feedback; therefore, our model should be able to generate diverse results so that its parameters can be adjusted from user feedback. Hence, we use K-Nearest Neighbors (KNN) to generate a fixed, sorted topic sequence for every area and then shuffle the extracted topics. In particular, we add noise to the raw distances, which are measured with word2vec [13]. The training set can be generated as in Algorithm 1.
Algorithm 1. Generate training set using K-Nearest Neighbors.
Input: area space X, the number of shuffled samples per area T, output size K
Output: training set S
1: Initialize training set S = ∅
2: for each x ∈ X do
3:   tuple ← ComputeDistance(x)
4:   y ← tuple[0]
5:   distance ← tuple[1]
6:   for each t ∈ [1, T] do
7:     newDistance ← distance + noise
8:     sample ← SORTED(y, key = newDistance, descending = True)[1 : K]
9:     Append sample into S
10:   end for
11: end for
12: return S

A k-d tree can be adopted in Line 8 to reduce the time complexity [16].

In the inference phase, when we make inferences using the pre-trained model, we hope to find the optimal topic sequence ŷ∗ = {ŷ∗1, ŷ∗2, . . . , ŷ∗K}, where the probability of each extraction step ŷ∗k depends on the input area x and the previously extracted sequence {ŷ∗1, ŷ∗2, . . . , ŷ∗k−1}. However, finding the global optimum is intractable. Thus, we make the inference one step at a time: we choose the current topic with the highest probability given the input and the previous output, similar to text generation. Thus, the kth topic for the given area x and previous output {ŷ∗1, ŷ∗2, . . . , ŷ∗k−1} is

ŷ∗k = argmax (yk ∈ Y) log p(yk | x, ŷ∗1, ŷ∗2, . . . , ŷ∗k−1)   (5)

Updating. The pre-trained model can extract a topic sequence after training on many samples. However, its performance depends on the training set, which comes from an existing knowledge base and text corpus. Our analysis of popular knowledge bases such as the Wiki Taxonomy and the Microsoft Fields of Study, which will be introduced in Sect. 4.1, shows that there is a lot of noise in existing knowledge bases. Moreover, the notion of a representative hypernym-hyponym relationship is itself subjective, and we cannot capture all users' views with a static knowledge base. Hence, we try to improve performance through interaction with users.
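As a minimal sketch of this greedy, step-by-step inference (Eqs. (4)–(5)) — the function names, shapes, and toy g1/g2 below are our own illustration, not the paper's implementation:

```python
import numpy as np

def greedy_decode(h0, y0, g1, g2, K):
    """Greedily extract K topics, one per step.

    h0 -- initial hidden state (a zero vector in the paper's Fig. 1)
    y0 -- one-hot vector of the input area x (the paper sets y_0 = x)
    g1 -- recurrence: (h_prev, y_prev) -> h_k
    g2 -- output layer: h_k -> probability vector over the IO space V
    """
    h, y, extracted = h0, y0, []
    for _ in range(K):
        h = g1(h, y)                 # h_k = g1(h_{k-1}, y_{k-1})
        p = g2(h)                    # p(y_k | x, y_1, ..., y_{k-1})
        idx = int(np.argmax(p))      # pick the most probable topic (Eq. 5)
        y = np.eye(len(p))[idx]      # feed the choice back in as a one-hot input
        extracted.append(idx)
    return extracted
```

Beam search would be a natural alternative to this one-step argmax, but, as noted above, finding the global optimum is intractable.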
The user feedback matrix DU(x, y) denotes how well the pair (x, y) matches user feedback, where y is the inference result for area x. Given that, we can define the reward as a function of user feedback: R(x, y) = g(DU(x, y)), where g maps feedback into a reward. Our goal is then to maximize the expected reward:

J(θ|x) = E y∼p(y|x;θ) [R(x, y)]

We use the policy gradient [17] to maximize J(θ|x), whose gradient can be written as follows:

∇θ J(θ|x) = E y∼p(y|x;θ) [R(x, y) ∇θ log p(y|x; θ)]   (6)

We then update the parameters using stochastic gradient descent (SGD) or more advanced optimization algorithms such as Adam or RMSProp [15]. In summary, the procedure is described in Algorithm 2.

Algorithm 2. An overview of the LITE model using SGD.
Input: training set S, user feedback DU(x, y)
Output: model parameters θ
1: Compute p(y|x; θ) using the pre-trained model
2: for each user ui from all users U do
3:   Collect area query x of ui
4:   Sample topic extraction y for the given x according to p(y|x; θ)
5:   Calculate R(x, y) according to Dui(x, y)
6:   Calculate the gradient ∇θ J(θ|x) by Eq. (6)
7:   θ ← θ + α ∇θ J(θ|x)
8: end for
9: return θ

4 Experimental Study

4.1 Experiment Configuration

Computer Science Taxonomy Knowledge Base. We use three knowledge bases to obtain concepts: the Wiki Taxonomy tree1, the ACM CCS classification tree2, and the Microsoft Fields of Study3. We extract all concepts under "Computer Science" and merge them into a new CS taxonomy knowledge base, called the Computer Science Taxonomy Knowledge Base (CSTKB), where |C| = 13,738. We define Y = C. We then extract the concepts that have more than a threshold number of hyponyms (threshold = 100 in our experiments) and regard them as areas. Finally, we select 100 of them, i.e., |X| = 100.

1 https://dumps.wikimedia.org.
2 https://www.acm.org/publications/class-2012.
3 https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/.

Synthetic Interaction Data. We need user feedback to update our model. There are two reasons why we use synthetic user feedback to evaluate the performance of our model. First, different users may have different opinions about the same pair (x, y), because feedback is subjective. Second, to evaluate the proposed model we would need to label the data by crowdsourcing, and the annotators would need domain knowledge; collecting a large amount of user feedback in a short time is very costly. Hence, we generate synthetic interaction data according to a fixed rule. Every user has an ideal topic sequence y, called the user ideal list. We assume all user ideal lists are samples of the ground truth. For example, we define that area x has a best topic extraction y∗, which is the ground truth, where y∗i denotes the ith topic in y∗. The top-3 ideal list of user u1 may then be a slightly perturbed version of the first three topics of y∗, and likewise for user u2. We can recover the ground truth if we aggregate enough user ideal lists statistically.

Evaluation Metrics. When we measure the quality of an extraction, we need to remove the individual differences between users. Hence, we evaluate the results by comparing them with the ground truth. (1) P@k. P refers to precision. We set the number of topics extracted for a given area equal to the size of the ground truth. Given an area x ∈ X, a model's output sequence is compared with the ground truth. P@k measures the accuracy of the first k topics of the model's output compared with the ground truth:

P@k = |{y1, y2, . . . , yk} ∩ l| / k   (7)

where l is the ground-truth set. Since the order of topics is also important, we introduce MAP as follows. (2) MAP@k. Average precision (AP) rewards ranking correct topics higher:

AP@k = (1/k) Σ (r=1 to k) P@r × rel(r)   (8)

where rel(r) is a binary function indicating the relevance of the topic at rank r. When AP is used to measure the score given by users, l denotes the user ideal list.
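Read this way, Eqs. (7)–(8) take only a few lines to compute (helper names are ours; the paper's own evaluation code is not shown):

```python
def p_at_k(predicted, groundtruth, k):
    """Eq. (7): fraction of the first k predicted topics that appear in the
    ground-truth (or user ideal) set l."""
    return len(set(predicted[:k]) & set(groundtruth)) / k

def ap_at_k(predicted, groundtruth, k):
    """Eq. (8): P@r accumulated at the ranks r where the predicted topic is
    relevant, averaged over k, so correct topics ranked higher score more."""
    total = 0.0
    for r in range(1, k + 1):
        rel = 1 if predicted[r - 1] in groundtruth else 0
        total += p_at_k(predicted, groundtruth, r) * rel
    return total / k
```

MAP@k is then simply the mean of ap_at_k over all users' ideal lists.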
MAP@k denotes the mean AP@k over all user feedback.

Baseline Methods. We compare our method with the following baselines. (1) KNN. We use KNN to generate the training set for pre-training the model; hence, KNN is the best result achievable before introducing user feedback. (2) Counting. Given user feedback, an intuitive method is to adjust the results by users' clicks. However, we only obtain a score for the entire topic sequence, so in this method we assume that all forms of user feedback are available. (3) ε-greedy. This method can explore better topics in the candidate set while exploiting current experience. However, we only replace one randomly selected topic, because we only obtain the score of the entire output sequence, not of each topic in it. We set ε = 0.1 in our experiments.

4.2 Results and Discussion

In Algorithm 1, we set T = 100, K = 10. Assuming the ground truth is generated considering only the first layer's topics in the CSTKB, the user ideal lists can be generated by adding noise to the ground truth. User feedback can be collected by clicking every correct topic or by giving a score for the entire sequence. We choose the latter because users can then give a comprehensive evaluation based on both order and accuracy. We regard AP@K as the score given by users.

Table 1. Quantitative results comparing several methods.

Methods   | P@3    | P@5    | P@10   | MAP@3  | MAP@5  | MAP@10
KNN       | 0.2267 | 0.2080 | 0.1810 | 0.1833 | 0.1490 | 0.1086
ε-greedy  | 0.2067 | 0.1660 | 0.1450 | 0.1722 | 0.1267 | 0.0906
Counting  | 0.4633 | 0.3380 | 0.1810 | 0.4633 | 0.3380 | 0.1810
LITE      | 0.7700 | 0.6280 | 0.3590 | 0.7656 | 0.6202 | 0.3506

Table 1 demonstrates that the LITE model outperforms all baseline methods, which shows that LITE can adjust the pre-trained model through interaction with users. The performance of ε-greedy is inferior to KNN, because the solution space is huge and ε-greedy updates only part of the values from random topic samples.

(a) The range of T is 50 to 2000. (b) The range of T is 50 to 500.

Fig. 2.
MAP w.r.t. the number of user feedback items for one area.

In the experiments, we collect 100 users' feedback for each given area x ∈ X (|X| = 100) to improve the pre-trained model's performance. We then vary the number of feedback items and observe the performance. Assuming we have T feedback items for one area, Fig. 2 illustrates the performance with respect to T; T = 100 yields the best performance. However, because of the huge solution space we obtain locally rather than globally optimal parameters. The performance may not improve significantly even when more feedback is collected, because the extraction results may fluctuate near a local optimum; performance could be improved further with better initialization of the parameters. We list the top-10 topics extracted by three methods. As a case study, Table 2 presents the extracted topics for the "Data Mining" area.

Table 2. Top-10 topics in the "Data Mining" area using different methods, where bold items are those matching the ground truth.

ε-greedy                    | Counting              | LITE
Data Warehouse              | Data Visualization    | Data Visualization
Business Intelligence       | Big Data              | Big Data
Data Management             | Text Mining           | Information Extraction
Big Data                    | Business Intelligence | Sentiment Analysis
Expert System               | Machine Learning      | Text Mining
Machine Learning            | Data Analysis         | Business Analytics
Natural Language Processing | Information Retrieval | Decision Support System
Analytics                   | Data Integration      | Business Intelligence
Data Visualization          | Data Management       | Deep Learning
Data Analysis               | Data Warehousing      | Data Integration

5 Conclusion

In this paper, we propose LITE, an end-to-end framework that aims to extract the topics of a given area with user interaction. We conducted experiments on a real knowledge base and synthetic interaction data. Experimental results prove the effectiveness of the proposed method. We have deployed the proposed model in the AMiner system4 and collect user feedback to improve extraction performance.
A/B tests [8] can be used to evaluate the performance; we leave this to future work.

4 https://aminer.org.

References
1. Al-Zaidy, R.A., Giles, C.L.: Extracting semantic relations for scholarly knowledge base construction. In: Proceedings of the 12th IEEE International Conference on Semantic Computing, pp. 56–63 (2018)
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
3. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E.R., Mitchell, T.M.: Toward an architecture for never-ending language learning. In: Proceedings of the 24th AAAI Conference on Artificial Intelligence, pp. 1306–1313 (2010)
4. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734 (2014)
5. Deldjoo, Y., Frà, C., Valla, M., Cremonesi, P.: Letting users assist what to watch: an interactive query-by-example movie recommendation system. In: Proceedings of the 8th Italian Information Retrieval Workshop, pp. 63–66 (2017)
6. Hasan, K.S., Ng, V.: Automatic keyphrase extraction: a survey of the state of the art. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 1262–1273 (2014)
7. Jiang, X., Hu, Y., Li, H.: A ranking approach to keyphrase extraction. In: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 756–757 (2009)
8. Kohavi, R., Longbotham, R., Sommerfield, D., Henne, R.M.: Controlled experiments on the web: survey and practical guide. Data Min. Knowl. Discov. 18(1), 140–181 (2009)
9. Lau, J.H., Grieser, K., Newman, D., Baldwin, T.: Automatic labelling of topic models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 1536–1545 (2011)
10.
Liang, J., Zhang, Y., Xiao, Y., Wang, H., Wang, W., Zhu, P.: On the transitivity of hypernym-hyponym relations in data-driven lexical taxonomies. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, pp. 1185–1191 (2017)
11. Mei, Q., Shen, X., Zhai, C.: Automatic labeling of multinomial topic models. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 490–499 (2007)
12. Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (2004)
13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pp. 3111–3119 (2013)
14. Ponzetto, S.P., Strube, M.: WikiTaxonomy: a large scale knowledge resource. In: Proceedings of the 18th European Conference on Artificial Intelligence, pp. 751–752 (2008)
15. Ruder, S.: An overview of gradient descent optimization algorithms. CoRR abs/1609.04747 (2016)
16. Samet, H.: The Design and Analysis of Spatial Data Structures. Addison-Wesley, Boston (1990)
17. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Proceedings of the 1999 Annual Conference on Neural Information Processing Systems, pp. 1057–1063 (1999)
18. Wang, C., Fan, Y., He, X., Zhou, A.: Predicting hypernym-hyponym relations for Chinese taxonomy learning. Knowl. Inf. Syst. 1–26 (2018, in press)
19. Yang, Y., Tang, J.: Beyond query: interactive user intention understanding. In: Proceedings of the 2015 IEEE International Conference on Data Mining, pp. 519–528 (2015)
20. Zhang, F., Wang, X., Han, J., Wang, S.: Fast top-k area topics extraction with knowledge base.
In: Proceedings of the 2018 IEEE International Conference on Data Science in Cyberspace (2018)

Implementing Neural Turing Machines

Mark Collier(B) and Joeran Beel(B)
Trinity College Dublin, Dublin, Ireland {mcollier,joeran.beel}@tcd.ie

Abstract. Neural Turing Machines (NTMs) are an instance of Memory Augmented Neural Networks, a new class of recurrent neural networks which decouple computation from memory by introducing an external memory unit. NTMs have demonstrated superior performance over Long Short-Term Memory cells in several sequence learning tasks. A number of open source implementations of NTMs exist, but they are unstable during training and/or fail to replicate the reported performance of NTMs. This paper presents the details of our successful implementation of an NTM. Our implementation learns to solve three sequential learning tasks from the original NTM paper. We find that the choice of memory contents initialization scheme is crucial to successfully implementing an NTM. Networks with memory contents initialized to small constant values converge on average two times faster than with the next best memory contents initialization scheme.

Keywords: Neural Turing Machines · Memory Augmented Neural Networks

1 Introduction

Neural Turing Machines (NTMs) [4] are one instance of several new neural network architectures [4,5,11] classified as Memory Augmented Neural Networks (MANNs). MANNs' defining attribute is the existence of an external memory unit. This contrasts with gated recurrent neural networks such as Long Short-Term Memory cells (LSTMs) [7], whose memory is an internal vector maintained over time. LSTMs have achieved state-of-the-art performance in many commercially important sequence learning tasks, such as handwriting recognition [2], machine translation [12] and speech recognition [3].
However, MANNs have been shown to outperform LSTMs on several artificial sequence learning tasks that require a large memory and/or complicated memory access patterns, for example memorization of long sequences and graph traversal [4–6,11].

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 94–104, 2018.
https://doi.org/10.1007/978-3-030-01424-7_10

The authors of the original NTM paper did not provide source code for their implementation. Open source implementations of NTMs exist (see footnotes 1–7), but a number of them (see footnotes 5, 6 and 7) report that the gradients of their implementation sometimes become NaN during training, causing training to fail, while others report slow convergence or do not report the speed of learning of their implementation. The lack of a stable open source implementation of NTMs makes it more difficult for practitioners to apply NTMs to new problems and for researchers to improve upon the NTM architecture. In this paper we define a successful NTM implementation (see footnote 8) which learns to solve three benchmark sequential learning tasks [4]. We specify the set of choices governing our NTM implementation and conduct an empirical comparison of a number of memory contents initialization schemes identified in other open source NTM implementations. We find that the choice of how to initialize the contents of memory in an NTM is a key factor in a successful NTM implementation. Our TensorFlow implementation is available publicly under an open source license (see footnote 8).

2 Neural Turing Machines

NTMs consist of a controller network, which can be a feed-forward neural network or a recurrent neural network, and an external memory unit, which is an N × W memory matrix, where N represents the number of memory locations and W the dimension of each memory cell.
Whether or not the controller is a recurrent neural network, the entire architecture is recurrent, as the contents of the memory matrix are maintained over time. The controller has read and write heads which access the memory matrix. The effect of a read or write operation on a particular memory cell is weighted by a soft attentional mechanism. This addressing mechanism is similar to the attention mechanisms used in neural machine translation [1,9], except that it combines location-based addressing with the content-based addressing found in those attention mechanisms. In particular, for an NTM at each timestep t, for each read and write head the controller outputs a set of parameters: kt, βt ≥ 0, gt ∈ [0, 1], st (s.t. Σk st(k) = 1 and ∀k st(k) ≥ 0) and γt ≥ 1, which are used to compute the weighting wt over the N memory locations in the memory matrix Mt as follows:

wct(i) ← exp(βt K[kt, Mt(i)]) / Σ (j=0 to N−1) exp(βt K[kt, Mt(j)])   (1)

wct allows for content-based addressing, where kt represents a lookup key into memory and K is some similarity measure such as cosine similarity:

K[u, v] = (u · v) / (‖u‖ · ‖v‖)   (2)

Through a series of operations NTMs also enable iteration from the current or previously computed memory weights as follows:

wgt ← gt wct + (1 − gt) wt−1   (3)

w̃t(i) ← Σ (j=0 to N−1) wgt(j) st(i − j)   (4)

wt(i) ← w̃t(i)^γt / Σ (j=0 to N−1) w̃t(j)^γt   (5)

where (3) enables the network to choose between the current content-based weights and the previous weight vector, (4) enables iteration through memory by convolving the current weighting with a 1-D convolutional shift kernel, and (5) corrects for any blurring occurring as a result of the convolution operation. The vector rt read by a particular read head at timestep t is computed as:

rt ← Σ (i=0 to N−1) wt(i) Mt(i)   (6)

Each write head modifies the memory matrix at timestep t by outputting additional erase (et) and add (at) vectors:

M̃t(i) ← Mt−1(i)[1 − wt(i) et]   (7)

Mt(i) ← M̃t(i) + wt(i) at   (8)

Equations (1) to (8) define how addresses are computed and used to read and write from memory in an NTM, but many implementation details of an NTM are open to choice. In particular, the choice of the similarity measure K, the initial weightings w0 for all read and write heads, the initial state of the memory matrix M0, the choice of non-linearity to apply to the parameters outputted by each read and write head, and the initial read vector r0 are all undefined in an NTM's specification. While any choices for these satisfying the constraints on the parameters outputted by the controller would be a valid NTM, in practice these choices have a significant effect on the ability of an NTM to learn.

1 https://github.com/snowkylin/ntm.
2 https://github.com/chiggum/Neural-Turing-Machines.
3 https://github.com/yeoedward/Neural-Turing-Machine.
4 https://github.com/loudinthecloud/pytorch-ntm.
5 https://github.com/camigord/Neural-Turing-Machine.
6 https://github.com/snipsco/ntm-lasagne.
7 https://github.com/carpedm20/NTM-tensorflow.
8 Source code at: https://github.com/MarkPKCollier/NeuralTuringMachine.

3 Our Implementation

Memory contents initialization - We hypothesize that how the memory contents of an NTM are initialized may be a defining factor in the success of an NTM implementation.
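The addressing pipeline of Eqs. (1)–(5) and the read of Eq. (6) can be sketched for a single head as follows (a minimal NumPy illustration with our own naming, separate from the TensorFlow code released with this paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def address(M, w_prev, k, beta, g, s, gamma):
    """One NTM addressing step for a single head (Eqs. 1, 3-5).

    M: N x W memory matrix; w_prev: previous weighting (N,); k: lookup key (W,);
    beta >= 0; g in [0, 1]; s: shift distribution (N,); gamma >= 1.
    """
    # Eqs. (1)-(2): content addressing via cosine similarity, sharpened by beta
    sim = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    w_c = softmax(beta * sim)
    # Eq. (3): interpolate between content weights and the previous weighting
    w_g = g * w_c + (1 - g) * w_prev
    # Eq. (4): circular convolution of the weighting with the shift kernel s
    n = len(w_g)
    w_tilde = np.array([sum(w_g[j] * s[(i - j) % n] for j in range(n))
                        for i in range(n)])
    # Eq. (5): sharpen to undo blurring introduced by the convolution
    w = w_tilde ** gamma
    return w / w.sum()

def read(M, w):
    """Eq. (6): the read vector is a weighted sum of the memory rows."""
    return w @ M
```

With g = 1, an identity shift kernel and gamma = 1, this reduces to pure content-based attention over the memory rows.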
We compare the three memory contents initialization schemes that we identified in open source implementations of NTMs. In particular, we compare constant initialization, where all memory locations are initialized to 10−6; learned initialization, where we backpropagate through initialization; and random initialization, where each memory location is initialized to a value drawn from a truncated Normal distribution with mean 0 and standard deviation 0.5. We note that five of the seven implementations (see footnotes 1, 2, 3, 4 and 5) we identified randomly initialize the NTM's memory contents. We also identified an implementation which initialized memory contents to a small constant value (see footnote 6) and an implementation where the memory initialization was learned (see footnote 7). Constant initialization has the advantage of requiring no additional parameters and providing a stable known memory initialization during inference. Learned initialization has the potential advantage of learning an initialization that would enable complex non-linear addressing schemes [6] while also providing stable initialization after training. This comes at the cost of N ∗ W extra parameters. Random initialization has the potential advantage of acting as a regularizer, but it is possible that during inference memory contents may be in a space not encountered during training. Other parameter initialization - Instead of initializing the previously read vectors r_0 and address weights w_0 to bias values as per [4], we backpropagate through their initialization and thus initialize them to a learned bias vector. We argue that this initialization scheme provides sufficient generality for tasks that require more flexible initialization with little cost in extra parameters (the number of additional parameters is W ∗ H_r + N ∗ (H_r + H_w), where H_r is the number of read heads and H_w is the number of write heads).
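The three schemes can be sketched as follows (a NumPy illustration with the 128 × 20 memory used in the experiments; in the learned scheme the returned array would be a trainable parameter updated by backpropagation):

```python
import numpy as np

N, W = 128, 20  # memory size used in the paper's experiments

def constant_init():
    # Constant scheme: every cell set to a small constant; no extra parameters.
    return np.full((N, W), 1e-6)

def random_init(rng):
    # Random scheme: truncated Normal(0, 0.5). Here we resample values outside
    # two standard deviations; the exact truncation rule is our assumption.
    m = rng.normal(0.0, 0.5, size=(N, W))
    out_of_range = np.abs(m) > 1.0
    while out_of_range.any():
        m[out_of_range] = rng.normal(0.0, 0.5, size=out_of_range.sum())
        out_of_range = np.abs(m) > 1.0
    return m

def learned_init(params):
    # Learned scheme: `params` is a trainable (N, W) array that gradients flow
    # into during training; at inference it is a fixed learned constant.
    return params
```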
For example, if a NTM with multiple write heads wishes to write to different memory locations at timestep 1 using location based addressing, then w_0 must be initialized differently for each write head. Having to hard code this for each task is an added burden on the engineer, particularly when the need for such addressing may not be known a priori for a given task, thus we allow the network to learn this initialization. Similarity measure - For K, we follow [4] in using cosine similarity (2), which scales the dot product into the fixed range [−1, 1]. Controller inputs - At each timestep the controller is fed the concatenation of the input coming externally into the NTM, x_t, and the previously read vectors r_{t−1} from all of the read heads of the NTM. We note that such a setup has achieved performance gains for attentional encoder-decoders in neural machine translation [9]. Parameter non-linearities - Similarly to a LSTM, we force the contents of the memory matrix to be in the range [−1, 1] by applying the tanh function to the outputs of the controller corresponding to k_t and a_t, while we apply the sigmoid function to the corresponding erase vector e_t. We apply the function softplus(x) = log(exp(x) + 1) to satisfy the constraint β_t ≥ 0. We apply the logistic sigmoid function to satisfy the constraint g_t ∈ [0, 1]. In order to make the convolutional shift vector s_t a valid probability distribution we apply the softmax function. In order to satisfy γ_t ≥ 1 we first apply the softplus function and then add 1.

4 Methodology

4.1 Tasks

We test our NTM implementation on three of the five artificial sequence learning tasks described in the original NTM paper [4]. Copy - for the Copy task, the network is fed a sequence of random bit vectors followed by an end of sequence marker. The network must then output the input sequence. This requires the network to store the input sequence and then read it back from memory.
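A Copy task example as described above can be generated in a few lines of NumPy. The extra input channel used to encode the end-of-sequence marker is our assumed encoding, not a detail fixed by the paper:

```python
import numpy as np

def copy_task_example(rng, bits=8, max_len=20):
    """One Copy task example: 8-bit random vectors, length sampled from [1, 20]."""
    seq_len = int(rng.integers(1, max_len + 1))
    seq = rng.integers(0, 2, size=(seq_len, bits)).astype(float)
    # Input: the sequence, then an end-of-sequence marker on an extra channel,
    # then zeros while the network produces its answer.
    inputs = np.zeros((2 * seq_len + 1, bits + 1))
    inputs[:seq_len, :bits] = seq
    inputs[seq_len, bits] = 1.0  # EOS marker channel (assumed encoding)
    # Target: reproduce the stored sequence after the marker.
    targets = np.zeros((2 * seq_len + 1, bits))
    targets[seq_len + 1:] = seq
    return inputs, targets
```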
In our experiments we train and test our networks on 8-bit random vectors with sequences of length sampled uniformly from [1, 20]. Repeat Copy - similarly to the Copy task, with Repeat Copy the network is fed an input sequence of random bit vectors. Unlike the Copy task, this is followed by a scalar that indicates how many times the network should repeat the input sequence in its output sequence. We train and test our networks on 8-bit random vectors with sequences of length sampled uniformly from [1, 10] and number of repeats also sampled uniformly from [1, 10]. Associative Recall - Associative Recall is also a sequence learning problem with sequences consisting of random bit vectors. In this case the inputs are divided into items, with each item consisting of three 6-dimensional vectors. After being fed a sequence of items and an end of sequence marker, the network is then fed a query item which is an item from the input sequence. The correct output is the next item in the input sequence after the query item. We train and test our networks on sequences with the number of items sampled uniformly from [2, 6].

4.2 Experiments

We first run a set of experiments to establish the best memory contents initialization scheme. We compare the constant, random and learned initialization schemes on the above three tasks. We demonstrate below (Sect. 5) that the best such scheme is the constant initialization scheme. We then compare the NTM implementation described above (Sect. 3) under the constant initialization scheme to two other architectures on the Copy, Repeat Copy and Associative Recall tasks. We follow the NTM authors [4] in comparing our NTM implementation to a LSTM network. As no official NTM implementation has been made open source, as a further benchmark we compare our NTM implementation to the official implementation9 of a Differentiable Neural Computer (DNC) [5], a successor to the NTM.
This provides a guide as to how a stable MANN implementation performs on the above tasks. In all of our experiments, for each network we run training 10 times from different random initializations. To measure the learning speed, every 200 steps during training we evaluate the network on a validation set of 640 examples with the same distribution as the training set. For all tasks the MANNs had 1 read and 1 write head, with an external memory unit of size 128 × 20 and a LSTM controller with 100 units. The controller outputs were clipped elementwise to the range (−20, 20). The LSTM networks were all a stack of 3 × 256 units. All networks were trained with the Adam optimizer [8] with learning rate 0.001, and on the backward pass gradients were clipped to a maximum gradient norm of 50 as described in [10].

5 Results

5.1 Memory Initialization Comparison

We hypothesized that how the memory contents of a NTM were initialized would be a key factor in a successful NTM implementation. We compare the three memory initialization schemes we identified in open source NTM implementations. We then use the best identified memory contents initialization scheme as the default for our NTM implementation. Copy - Our NTM initialized according to the constant initialization scheme converges to near zero error approximately 3.5 times faster than the learned initialization scheme, while the random initialization scheme fails to solve the Copy task in the allotted time (Fig. 1). The learning curves suggest that initializing the memory contents to small constant values offers a substantial speed-up in convergence over the other two memory initialization schemes for the Copy task. Repeat Copy - A NTM initialized according to the constant initialization scheme converges to near zero error approximately 1.43 times faster than the learned initialization scheme and 1.35 times faster than the random initialization scheme (Fig. 2).
The relative speed of convergence between learned and random initialization is reversed as compared with the Copy task, but again the constant initialization scheme demonstrates substantially faster learning than either alternative. Associative Recall - A NTM initialized according to the constant initialization scheme converges to near zero error approximately 1.15 times faster than the learned initialization scheme and 5.3 times faster than the random initialization scheme (Fig. 3).

9 https://github.com/deepmind/dnc.

Fig. 1. Copy task memory initialization comparison - learning curves. Median error on 10 training runs (each) for a NTM initialized according to the constant, learned and random initialization schemes.

Fig. 2. Repeat Copy task memory initialization comparison - learning curves. Median error on 10 training runs (each) for a NTM initialized according to the constant, learned and random initialization schemes.

The constant initialization scheme demonstrates the fastest convergence to near zero error on all three tasks. We conclude that initializing the memory contents of a NTM to small constant values results in faster learning than backpropagating through memory contents initialization or randomly initializing memory contents. Thus, we use the constant initialization scheme as the default scheme for our NTM implementation.

Fig. 3. Associative Recall task memory initialization comparison - learning curves. Median error on 10 training runs (each) for a NTM initialized according to the constant, learned and random initialization schemes.

5.2 Architecture Comparison

Now that we have established that the best memory contents initialization scheme is constant initialization, we wish to test whether our NTM implementation using this scheme is stable and has similar speed of learning and generalization ability as claimed in the original NTM paper.
We compare the performance of our NTM to a LSTM and a DNC on the same three tasks as for our memory contents initialization experiments. Copy - Our NTM implementation converges to zero error in a number of steps comparable to the best published results on this task [4] (Fig. 4). Our NTM converges to zero error 1.2 times slower than the DNC, and as expected both MANNs learn substantially faster (4–5 times) than a LSTM.

Fig. 4. Copy task architecture comparison - learning curves. Median error on 10 training runs (each) for a DNC, NTM and LSTM.

Repeat Copy - As per [4], we also find that the LSTM performs better relative to the MANNs on Repeat Copy compared to the Copy task, converging only 1.44 times slower than a NTM, perhaps due to the shorter input sequences involved (Fig. 5). While both the DNC and the NTM demonstrate slow learning during the first third of training, both architectures then rapidly fall to near zero error before the LSTM. Despite the NTM learning more slowly than the DNC during early training, the DNC converges to near zero error just 1.06 times faster than the NTM.

Fig. 5. Repeat Copy task architecture comparison - learning curves. Median error on 10 training runs (each) for a DNC, NTM and LSTM.

Associative Recall - Our NTM implementation converges to zero error in a number of steps almost identical to the best published results on this task [4] and at the same rate as the DNC (Fig. 6). The LSTM network fails to solve the task in the time provided. Our NTM implementation learns to solve all three of the tasks from the original NTM paper [4] that we tested. Our implementation's speed to convergence and relative performance to LSTMs is similar to the results reported in the NTM paper. Speed to convergence for our NTM is only slightly slower than a DNC - another MANN. We conclude that our NTM implementation can be used reliably in new applications of MANNs.
6 Summary

NTMs are an exciting neural network architecture that achieves impressive performance on a range of artificial tasks. But the specification of a NTM leaves many free choices to the implementor, and no source code is provided that makes these choices and replicates the published NTM results. In practice the choices left to the implementor have a significant impact on the ability of a NTM to learn. We observe great diversity in how these choices are made amongst open source efforts to implement a NTM, many of which fail to replicate the published results.

Fig. 6. Associative Recall task architecture comparison - learning curves. Median error on 10 training runs (each) for a DNC, NTM and LSTM.

We have demonstrated that the choice of memory contents initialization scheme is crucial to successfully implementing a NTM. We conclude from the learning curves on three sequential learning tasks that learning is fastest under the constant initialization scheme. We note that the random initialization scheme, which was used in five of the seven identified open source implementations, was the slowest to converge on two of the three tasks and the second slowest on the Repeat Copy task. We have made our NTM implementation with the constant initialization scheme open source. Our implementation has learned the Copy, Repeat Copy and Associative Recall tasks at a comparable speed to previously published results and the official implementation of a DNC. Training of our NTM is stable and does not suffer from problems such as gradients becoming NaN reported in other implementations. Our implementation can be reliably used for new applications of NTMs. Additionally, further research on NTMs will be aided by a stable, performant open source NTM implementation.

References

1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
2. Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31(5), 855–868 (2009)
3. Graves, A., Mohamed, A.R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649. IEEE (2013)
4. Graves, A., Wayne, G., Danihelka, I.: Neural Turing machines. arXiv preprint arXiv:1410.5401 (2014)
5. Graves, A., et al.: Hybrid computing using a neural network with dynamic external memory. Nature 538(7626), 471 (2016)
6. Gulcehre, C., Chandar, S., Cho, K., Bengio, Y.: Dynamic neural Turing machine with soft and hard addressing schemes. arXiv preprint arXiv:1607.00036 (2016)
7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
9. Luong, T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421 (2015)
10. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: International Conference on Machine Learning, pp. 1310–1318 (2013)
11. Sukhbaatar, S., Weston, J., Fergus, R.: End-to-end memory networks. In: Advances in Neural Information Processing Systems, pp. 2440–2448 (2015)
12. Wu, Y., et al.: Google's neural machine translation system: bridging the gap between human and machine translation.
arXiv preprint arXiv:1609.08144 (2016)

A RNN-Based Multi-factors Model for Repeat Consumption Prediction

Zengwei Zheng1, Yanzhen Zhou1,2, Lin Sun1(B), and Jianping Cai1
1 Hangzhou Key Laboratory for IoT Technology and Application, Zhejiang University City College, Hangzhou, China
{zhengzw,sunl,jpcai}@zucc.edu.com
2 College of Computer Science and Technology, Zhejiang University, Hangzhou, China
zhouyanzhen@zju.edu.com

Abstract. Consumption is a common activity in people's daily life, and some reports show that repeat consumption even accounts for a greater portion of people's observed activities than novelty-seeking consumption. Therefore, modeling repeat consumption is very important for understanding human behavior. In this paper, we propose a multi-factors RNN (MF-RNN) model to predict users' repeat consumption behavior. We analyse some factors which can influence customers' daily repeat consumption and introduce those factors into the MF-RNN model to predict users' repeat consumption behavior. An empirical study on real-world data sets shows encouraging results for our approach. On the real-world dataset, the MF-RNN achieves good prediction performance, better than the Most Frequent, HMM, Recency, DYRC and LSTM methods. We compared the effect of different factors on customers' repeat consumption behavior, and found that the MF-RNN performs better than a non-factor RNN. Besides, we analyzed the differences in consumption behaviors between different cities and different regions in China.

Keywords: Repeat consumption · Recurrent Neural Network (RNN) · Multi-factors

1 Introduction

Nowadays, with the rapid development of mobile payment technology, people can make payments in a store with smartphone apps (such as Alipay, WeChat Pay and Apple Pay) instead of by cash. Therefore, how to use previous consumption records to model a user's repeat consumption behavior and predict which store the user is likely to visit in the future is very important.
The study of consumption behavior is to know the way an individual spends his resources in the process of consuming items. This approach comprises studies of the items that people buy, the reasons for buying and the timing. It is also about where they make the purchase and how frequently.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 105–115, 2018. https://doi.org/10.1007/978-3-030-01424-7_11

Due to the fact that people prefer the things they are familiar with, they may repeatedly interact with the same items over time. Therefore, users often visit the same stores that they have purchased from previously, such as shopping at the same fruit shop and eating regularly at the same restaurant. For that reason, repeat consumption accounts for a major portion of people's daily consumption behavior, and we focus on repeat consumption behavior in this paper. In real life, some factors can affect people's daily activities. For instance, we usually visit places in the vicinity of our office during workdays, but places near our home on holidays; we probably go out for outdoor activities when the weather is nice but stay indoors when the weather is terrible; we like cool drinks on summer days but choose hot things in winter instead. Therefore, we believe that people's daily repeat consumption behavior can be affected by some factors too. In this paper, we propose the MF-RNN model, which is based on an RNN and introduces several factors to predict people's repeat consumption behavior. Through analysis, we selected three influential factors: a holiday factor, a weather factor and a temperature factor. An empirical study on a real-world dataset shows encouraging results for our approach. The MF-RNN achieves encouraging performance for repeat consumption behavior prediction, better than the MF, HMM, Recency, DYRC and LSTM models.
And the MF-RNN with all three factors performs better than without any factors.

2 Related Works

Consumption behavior is an approach to know the way that an individual spends his resources in the process of consuming items. Consumption behavior analysis critically extends the domain of behavior analysis and behavioral economics into marketing theory. In the past, the ways of predicting consumers' behavior involved content-based recommendation, collaborative filtering-based recommendation, time series analysis and data mining. Content-based recommender systems are based on the idea that the features of items are useful in suggesting relevant and interesting items for users [1]. Collaborative filtering-based recommender systems identify users whose tastes are similar to that of a target user and then recommend items that the others have liked [2,3]. Time series analysis and data mining methods use historical data to extract features to model the user's consumption behavior [4]. But the study of repeat consumption behavior is somewhat different from that of consumption behavior: it focuses on predicting whether or not the user will repeat-purchase items which he has consumed previously. The problems of how and why users repeatedly consume certain items have been approached from several angles in various disciplines [5]. Some of the earliest works focus on understanding repeat behavior on the web, like re-searching queries and website revisitation. Adar, Teevan and Dumais [6] carried out a large-scale analysis of revisitation, and classified websites into different groups based on how often they attract revisitors. Then, those researchers explored the relationship between the content change in web pages and people's revisitation to these pages [7]. Teevan et al.
[8] studied query logs to find repeat queries in web search, and found that more than 40% of queries are repeat queries. Then, many methods have been proposed to predict people's repeated consumption behavior. Anderson et al. [9] analyzed the dynamics of repeat consumption. They studied the pattern by which a user consumes the same item repeatedly over time, in a wide variety of domains ranging from check-ins at the same business location to re-watches of the same video, and found that recency of consumption is a strong predictor of future consumption. Chen et al. [10] formulated the problem of recommendation for repeat consumption with user implicit feedback, then proposed a time-sensitive personalized pairwise ranking (TS-PPR) method based on user behavioral features. Rafailidis and Nanopoulos [11] presented the CTF and W-CTF models for recommending items with repeat consumption, by capturing the rate at which the current preferences of each user shift over time and by exploiting side information in a coupled tensor factorization technique. Zhang et al. [12] proposed a dining recommender system termed NDRS, which applies recommendation strategies according to different novelty-seeking statuses. They first designed a CRF (Conditional Random Field) with constraints to infer novelty-seeking status, then proposed a context-aware collaborative filtering method and an HMM (Hidden Markov Model) with temporal regularity for novel and regular restaurant recommendation. Christina and Lars [13] developed the multinomial SVM (Support Vector Machine) item recommender system MN-SVM-IR to calculate personalized item recommendations for a repeat-buying scenario. Although there are many methods for predicting repeated consumption behavior, most focus on the features of the consumers or the items and rarely consider other relevant information.
3 Methodology

A Recurrent Neural Network (RNN) is a neural network whose output depends not only on the weights applied to the current input, but also on the present state of the network. Augmented by the inclusion of recurrent edges that span adjacent time steps, the RNN introduces a notion of time to the model [14]. In other words, the feedback from the hidden layer not only goes to the output, but also goes into the hidden layer at the next time step. Thus, the RNN has some memory. In previous research, RNNs have proved to be very useful in sequence learning problems. RNNs can be employed in text processing, image captioning, machine translation, video captioning and handwriting recognition. In this paper, we propose a prediction model based on an RNN combined with several other influential factors to predict users' repeat consumption behavior.

3.1 RNN-based Multi-factors Prediction Model

As mentioned above, we selected 3 different factors as the influential factors in the repeat consumption behavior prediction case. Then, we defined the MF-RNN model, which is a three-layer network comprising an input layer, a hidden layer and an output layer, shown in Fig. 1.

Fig. 1. The framework of MF-RNN.

Input Layer: The input layer X is a vector consisting of four normalized input sequences as in Eq. (1): S is the visited offline store sequence, H is the holiday factor sequence, C is the weather factor sequence and T is the temperature factor sequence. The output data Y is the prediction result representing the offline store which the customer will visit at the next time step.

X = [S, H, C, T]   (1)

Hidden Layer: Z is the hidden layer; its state z_t at time t is affected by the current input x_t and the state of the hidden layer at the previous time step, z_{t−1}:

z_t = f(U x_t + W z_{t−1} + b_z)   (2)

where U is the weight between the input and hidden layers, W is the recurrent weight between the hidden layers at adjacent time steps, and b_z is the bias in the hidden layer.
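A single forward step of Eq. (2) can be sketched in NumPy. The choice of tanh for f and the layer sizes below are our own illustrative assumptions:

```python
import numpy as np

def rnn_step(x_t, z_prev, U, W, b_z):
    # Eq. (2): the new hidden state mixes the current input with the
    # previous hidden state through the recurrent weight W.
    return np.tanh(U @ x_t + W @ z_prev + b_z)  # tanh as f is our assumption

# Illustrative shapes: 4 input channels (store, holiday, weather, temperature)
# and 50 hidden units (both chosen here only for illustration).
rng = np.random.default_rng(0)
U = 0.1 * rng.normal(size=(50, 4))
W = 0.1 * rng.normal(size=(50, 50))
b_z = np.zeros(50)

z = np.zeros(50)
for x_t in rng.normal(size=(10, 4)):  # unroll over a length-10 input sequence
    z = rnn_step(x_t, z, U, W, b_z)
```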
Output Layer: The output layer Y is the prediction result representing the offline stores the user will visit at the next time step. The output at time t is calculated as in Eq. (3), where V is the weight between the hidden and output layers, g is an activation function and b_y is the bias in the output layer.

y_t = g(V z_t + b_y)   (3)

Then, we used the historical data to train this network. The Back Propagation Through Time (BPTT) algorithm is employed in the training process to calculate the parameters U, V, W, b_z and b_y. The loss function of the network is defined as in Eq. (4), where e_t is the loss at each step.

E = Σ_t e_t   (4)

The gradient of V is calculated as in Eq. (5), where y′_t is the supervision information at time step t.

∇V = ∂E/∂V = Σ_t (y_t − y′_t) ⊗ z_t   (5)

Then we define two operators, δ_t^z as in Eq. (6) and δ_t^y as in Eq. (7), calculate the gradients of U and W as in Eqs. (8) and (9), and finally calculate the gradients of the two biases as in Eqs. (10) and (11). After the parameter training process, we obtain a trained network to calculate the output data Y and thus predict the user's future repeat consumption behavior.

δ_t^z = ∂E / ∂(f(U x_t + W z_{t−1} + b_z))   (6)

δ_t^y = ∂E / ∂(g(V z_t + b_y))   (7)

∇U = ∂E/∂U = Σ_t ∂e_t/∂U = Σ_t δ_t^z × x_t   (8)

∇W = ∂E/∂W = Σ_t ∂e_t/∂W = Σ_t δ_t^z × z_{t−1}   (9)

Δb_z = ∂E/∂b_z = Σ_t ∂e_t/∂b_z = Σ_t δ_t^z   (10)

Δb_y = ∂E/∂b_y = Σ_t ∂e_t/∂b_y = Σ_t δ_t^y   (11)

3.2 Influential Factors Selection

Holiday Factor: There is a big difference between people's daily activities on holidays and on workdays. For example, people prefer the restaurant near their office for lunch on workdays but the restaurant near home on holidays, and people can often visit a supermarket during holidays but cannot when they are at work. Therefore, the holiday factor will affect people's repeat consumption behavior.
In this paper, according to Chinese statutory holiday arrangements, we generate a holiday sequence for each user, where 1 represents a holiday and 0 a workday. Weather Factor: Weather can also affect people's daily activities. For instance, people can do outdoor activities like going to a playground or visiting a park when the weather is good, but they tend to stay indoors when it is rainy or snowy. Thus, we choose the weather as an influential factor in this repeat consumption study. The types of weather are diverse, including sunny, cloudy, rainy and snowy. In this paper, we classify the weather into the following categories based on the type and severity of the weather, and give them different labels, as shown in Table 1. Temperature Factor: In addition to the holiday factor and the weather factor, temperature can also influence people's daily consumption behavior. When the temperature is very high, people may buy a cool drink or ice-cream. And when the temperature is low, people may prefer hot tea or hot coffee. For each user we generate two temperature sequences containing the highest and lowest temperature on each day.

Table 1. Different weather conditions and their labels.

Weather type  Label
Sunny          0
Light Rain    −0.5
Heavy Rain    −1
Light Snow    −1.5
Heavy Snow    −2

4 Experiment

4.1 Dataset

The dataset we used in this study is a real-world dataset [15]. It consists of consumers' records of paying with Alipay at offline stores. This dataset includes 2000 shops in different cities across the country and covers July 1st 2015 to October 31st 2016. We selected 1057 consumers who consumed more than 120 times and at more than 3 different stores. We calculated the information entropy of each user's consumption sequence according to Eq. (12):

H(X) = E[log2(1/p(x_i))] = −Σ_{i=1}^{n} p(x_i) log2 p(x_i)   (12)

where p(x_i) in Eq. (12) represents the probability of the event x_i.
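Equation (12) can be computed directly from a user's store-visit sequence, for example:

```python
from collections import Counter
from math import log2

def consumption_entropy(visits):
    """Shannon entropy (Eq. (12)) of a sequence of visited store IDs."""
    n = len(visits)
    counts = Counter(visits)
    # H = sum_i p(x_i) * log2(1 / p(x_i))
    return sum((c / n) * log2(n / c) for c in counts.values())
```

A user alternating evenly between two stores has entropy 1.0, while a user who always visits the same store has entropy 0, matching the intuition that higher entropy means less predictable behavior.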
The information entropy can be used to measure the uncertainty of a random variable. The higher the information entropy of a user's consumption record sequence, the more complex and unpredictable the consumer's consumption behavior is. We then divided all consumers into 3 groups by information entropy, as shown in Table 2. The consumption behavior of the users in Group3 is the most unpredictable.

Table 2. Consumers grouped according to the information entropy of their consumption record sequences.

                      Group1   Group2    Group3
Information entropy   0∼0.5    0.5∼1.0   >1.0
Number of consumers   448      296       313

4.2 Baselines Comparison

In this paper, we compared the performance of the proposed method with some other baselines. These methods are: Most Frequent (MF): We consider consumption frequency a particular aspect of people's consumption behavior, so we choose the most frequently visited store as a baseline in our experiments. Hidden Markov Model (HMM): The HMM is a powerful statistical tool for modeling generative sequences that can be characterized by an underlying process generating an observable sequence. It is one of the most basic and extensively used statistical tools for modeling discrete time series. Recency [9]: This baseline assumes that recently consumed items are more likely to be reconsumed. DYRC [16]: This method proposes a mixed weighted scheme to recommend repeat items based on item popularity and the recency effect. Long Short-Term Memory (LSTM): LSTM introduces a memory cell, a unit of computation that replaces traditional artificial neurons in the hidden layer of a network. With these memory cells, networks are able to overcome some difficulties with training encountered in earlier recurrent nets. In the experiment, we set 50 hidden units in the network, and chose MAE (Mean Absolute Error) as the loss function and Adam as the optimizer to train the network.
A linear function was selected as the activation function in this network. In all six methods, we set 60 as the length of the training sequence. Fig. 2 illustrates the prediction accuracy of all the baselines and the MF-RNN model on all three groups of customers. The MF method got the lowest prediction accuracy, close to the HMM. And we can find that the neural network methods give a great performance improvement over the other four methods. Finally, the model we propose achieves 83.5% prediction accuracy on the most unpredictable group, the best performance among all six methods. On Group3, the MF-RNN improves on MF by 26.0%, on HMM by 23.8%, on Recency by 24.1%, on DYRC by 21.8%, and on LSTM by 6.8%.

Fig. 2. Prediction accuracy comparison among six methods.

4.3 Influential Factor Analysis

In order to understand which factor has the greatest impact on consumer behavior, we compared the prediction accuracy of different influential factors with the MF-RNN model. The results of the experiment on Group3 customers are shown in Table 3. We can find that the prediction model with all three factors has the best performance on the real-world dataset. The MF-RNN improves by 2.5% over the RNN without any factors. This shows that introducing natural influential factors can improve the performance of the prediction model. The MF-RNN improves over the non-factor RNN by 1.8% by introducing the holiday factor, by 1.3% by introducing the weather factor, and by 1.7% by introducing the temperature factor.

Table 3. Prediction results of different influential factors on Group3.

                      Non-factor (RNN)   +Holiday factor   +Weather factor   +Temperature factor   +All factors
Prediction accuracy   0.810              0.828             0.823             0.827                 0.835

In general, consumers in different cities may have different lifestyles and hence different daily activities. We therefore ran experiments to compare repeat consumption behavior in cities of different tiers.
We divided the 313 customers in Group3 into two groups according to the city they live in. The first group includes 219 users who live in first- and second-tier cities, such as Beijing, Shanghai, and Hangzhou; the second group includes 94 users who live in other, smaller cities. We compared the differences in repeat consumption behavior between these two groups of users. The results are shown in Table 4. The holiday factor has the greatest impact on users in first- and second-tier cities. We attribute this to the fast pace of life in these cities: most users in such big cities are office workers, whose daily consumption behavior differs between workdays and holidays. Lifestyles in small cities are different: the results show that it is the weather factor, not the holiday factor, that matters most for consumers in small cities.

Table 4. Prediction accuracy comparison between very large cities and small cities on Group3.

                    +Holiday factor  +Weather factor  +Temperature factor
Very large cities   0.837            0.820            0.826
Other small cities  0.808            0.810            0.806

We also analyzed the differences in consumption behavior across regions. We divided the customers in Group3 into a south group (255 customers) and a north group (58 customers) according to their location, and compared the differences in consumption behavior between them. The results are shown in Table 5. The north group is the most sensitive to the temperature factor, probably because of the extreme temperature changes in the northern China regions, while the weather factor has a greater impact on the south group than on the north group.

Table 5. Prediction accuracy comparison between south China and north China on Group3.
              +Holiday factor  +Weather factor  +Temperature factor
South China   0.826            0.820            0.824
North China   0.838            0.837            0.840

5 Conclusion

In this paper, we proposed a prediction framework based on MF-RNN to predict customers' repeat consumption behavior. The method uses a three-layer RNN structure and introduces three factors, i.e., holiday, weather, and temperature, to model repeat consumption behavior. We compared the method with several baselines; the experimental results show that our MF-RNN performs better than MF, HMM, Recency, DYRC, and LSTM. We then compared the effect of the different factors on customers' repeat consumption behavior: after introducing the three factors, MF-RNN performs better, with prediction accuracy improved by 2.5% over the RNN without any factors. Finally, we found large differences in consumption behavior between different cities and regions in China. Therefore, to a certain extent, our research has practical significance for predicting repeat consumption behavior.

Acknowledgement. This work was supported by the Zhejiang Provincial Natural Science Foundation of China (No. LY17F020008).

References

1. Ricci, F.: Recommender Systems Handbook. Springer, New York (2011). https://doi.org/10.1007/978-1-4899-7637-6
2. Herlocker, J.L., Konstan, J.A., Borchers, A., et al.: An algorithmic framework for performing collaborative filtering. In: 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 230-237. ACM (1999)
3. Ekstrand, M.D., Riedl, J.T., Konstan, J.A.: Collaborative filtering recommender systems. Found. Trends Hum.-Comput. Interact. 4(2), 81-173 (2011)
4. Yi, Z., Wang, D., Hu, K., et al.: Purchase behavior prediction in M-commerce with an optimized sampling method. In: IEEE International Conference on Data Mining Workshop, pp. 1085-1092. IEEE Computer Society (2015)
5.
Russell, C.A., Levy, S.J.: The temporal and focal dynamics of volitional reconsumption: a phenomenological investigation of repeated hedonic experiences. J. Consum. Res. 39(2), 341-359 (2011)
6. Adar, E., Teevan, J., Dumais, S.T.: Large scale analysis of web revisitation patterns. In: SIGCHI Conference on Human Factors in Computing Systems, pp. 1197-2008. ACM (2008)
7. Adar, E., Teevan, J., Dumais, S.T.: Resonance on the web: web dynamics and revisitation patterns. In: SIGCHI Conference on Human Factors in Computing Systems, pp. 1381-1390. ACM (2009)
8. Teevan, J., Adar, E., Jones, R., et al.: Information re-retrieval: repeat queries in Yahoo's logs. In: 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 151-158. ACM (2007)
9. Anderson, A., Kumar, R., Tomkins, A., et al.: The dynamics of repeat consumption. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 419-430. International World Wide Web Conference Committee (2014)
10. Chen, J., Wang, C., Wang, J.: Recommendation for repeat consumption from user implicit feedback. IEEE Trans. Knowl. Data Eng. 28(11), 3083-3097 (2016)
11. Rafailidis, D., Nanopoulos, A.: Repeat consumption recommendation based on users preference dynamics and side information. In: 24th International Conference on World Wide Web, pp. 99-100. ACM (2015)
12. Zhang, F., Zheng, K., Yuan, N.J., et al.: A novelty-seeking based dining recommender system. In: 24th International Conference on World Wide Web, pp. 1362-1372. International World Wide Web Conference Committee (2015)
13. Lichtenthäler, C., Schmidt-Thieme, L.: Multinomial SVM item recommender for repeat-buying scenarios. In: Spiliopoulou, M., Schmidt-Thieme, L., Janning, R. (eds.) Data Analysis, Machine Learning and Knowledge Discovery. SCDAKO, pp. 189-197. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-01595-8_21
14.
Lipton, Z.C., Berkowitz, J., Elkan, C.: A critical review of recurrent neural networks for sequence learning. Comput. Sci. (2015)
15. Tianchi big data contest. https://tianchi.aliyun.com/competition/index.htm. Accessed 2 May 2018
16. Benson, A.R., Kumar, R., Tomkins, A.: Modeling user consumption sequences. In: 25th International Conference on World Wide Web, pp. 519-529. International World Wide Web Conference Committee (2016)

Practical Fractional-Order Neuron Dynamics for Reservoir Computing

Taisuke Kobayashi(B)

Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan
kobayashi@is.naist.jp

Abstract. This paper proposes a practical reservoir computing scheme with fractional-order leaky integrator neurons, which yields a longer memory capacity than the normal leaky integrator. In general, the fractional-order derivative needs all memories leading from the initial state to the current state. Although this feature is useful from the viewpoint of memory capacity, keeping all memories is intractable, in particular for reservoir computing with many neurons. A reasonable approximation to the fractional-order neuron dynamics is therefore introduced, deriving a model that exponentially decays past memories beyond a threshold. This derivation is regarded as a natural extension of reservoir computing with the leaky integrator, which has been used most commonly. The proposed method is compared with reservoir computing methods with normal neurons and leaky integrator neurons by solving four kinds of regression and classification problems with time-series data. As a result, the proposed method shows superior results in all of the problems.

Keywords: Reservoir computing · Fractional-order leaky integrator · Regression and classification

1 Introduction

Recently, the recurrent neural network (RNN) has become a general approach to predict and classify time-series data, coupled with recent deep learning technology [6].
An RNN is a neural network with recursive connections in the hidden layer, which enable it to store past inputs for a certain period as internal states; these are useful in solving real problems that do not have the Markov property. Let us call "memory capacity" the extent to which past inputs (and outputs) can be reflected in the next outputs. A long memory capacity is suitable for handling time-series data. As a means to improve the memory capacity, long short-term memory (LSTM) and its relatives [3,7] have been major proposals, and they have indeed achieved excellent results. For embedded systems, however, backpropagation through time (BPTT) in RNNs is sometimes intractable in terms of calculation cost, since its calculation graph grows with time. A method to update LSTM parameters using an evolutionary algorithm instead of BPTT was proposed [18], but sufficient computational resources are still needed.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 116-125, 2018. https://doi.org/10.1007/978-3-030-01424-7_12

Fig. 1. Concept of reservoir computing with fractional-order leaky integrator: the reservoir layer consists of fractional-order leaky integrator neurons; each neuron has a practical memory trace, which approximates the original one for computation at constant cost, to improve the memory capacity.

Under such circumstances, reservoir computing (RC) [11], typified by the echo state network (ESN) [8] and the liquid state machine (LSM) [13], has been proposed as a special case of RNN (see the left side of Fig. 1). Like an RNN, RC has recursive connections in the hidden layer (called the reservoir layer), although the connections are sparsely given in general. The decisive difference between RNN and RC lies in the weights to be learned: in RC, only the readout weights that generate outputs from the internal states of the reservoir layer are trained.
The other weights, i.e., the input and reservoir weights, are randomly given as constants. BPTT is thus no longer required in RC; instead, RC is learned easily in a linear regression manner. Even in terms of performance in predicting or classifying time-series data, RC is known not to be inferior to RNNs. In addition, since the number of learning parameters increases only linearly with the number of neurons, a huge number of neurons can easily be placed in the reservoir layer, as in the cerebellum [23].

To improve the memory capacity in RC, two important dynamics should be considered: (i) the network dynamics of the reservoir layer [5,17]; and (ii) the neuron dynamics in the reservoir layer [9,12,15,21]. Note that their combination has been reported to yield a remarkable improvement of memory capacity [22]. Each is summarized below; a new method for (ii), the neuron dynamics, is proposed in this paper.

With regard to (i), the network dynamics, the reservoir layer, which is randomly given in general, has been structured explicitly. For instance, Rodan and Tino showed that several network models with minimal recursive connections have sufficient performance for prediction [17]. Gallicchio et al. achieved longer memory capacity by deepening the reservoir layer [5]. These results are, however, gained through trial-and-error (or heuristic) design based on the intuition of researchers, namely, they are not derived mathematically.

On the other hand, (ii), the neuron dynamics, often employs firing models of biological neurons. In particular, the leaky integrator has shown its utility in the LSM [21]. It was later transferred to the ESN, which does not directly deal with neuron firing, by Jaeger et al., and improved the memory capacity [9]. Recently, Lun et al. extended the leaky integrator to one that holds past internal states with different leaking rates for a certain period [12].
This model indeed improved the performance of RC, but it is heuristically designed. Teka et al. proposed a new leaky integrator that explicitly holds all past internal states according to a mathematically sophisticated approach, i.e., the fractional-order derivative [19,20]. The fractional-order leaky integrator has recently been introduced to the ESN by Pahnehkolaei et al. [15]. However, holding all past internal states in memory is practically infeasible, in particular in RC with many neurons.

Hence, this paper proposes RC with practical fractional-order leaky integrators (FLRC), a block diagram of which is shown on the right side of Fig. 1. That is, a reasonable approximation is applied to the fractional-order neurons so as to calculate their dynamics at a constant cost. With this approximation, the past internal states beyond a threshold are treated in a recursive manner without holding their values explicitly. FLRC can also be regarded as an extension of RC with a leaky integrator (LRC) obtained by introducing a new parameter, named the fractional rate. The proposed FLRC was evaluated on four kinds of time-series data prepared as benchmarks for regression or classification problems. We found that the proposed FLRC outperformed the conventional RC and LRC in all benchmarks.

2 Preliminaries

2.1 Reservoir Computing with Leaky Integrator Neurons

RC is a recurrent neural network that updates only the readout weights, W^out. The other weights, i.e., the input weights, W^in, the feedback weights, W^fb, and the recursive weights, W^rec, are randomly given constants. Regarding W^rec, however, the magnitude of its eigenvalues is limited to satisfy the echo state property. This paper employs ρ(|W^rec|) = 0.999, where ρ(·) gives the maximum eigenvalue, as the echo state property according to ref. [24].
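As a concrete illustration of this constraint, a random reservoir matrix can be generated and rescaled to a target spectral radius. The following is a minimal NumPy sketch of ours (function name and density are assumptions; it rescales by the spectral radius of W itself, whereas the paper applies ρ(·) to the element-wise absolute matrix):

```python
import numpy as np

def make_reservoir(n_neurons, density=0.1, spectral_radius=0.999, seed=0):
    """Sparse random recurrent weights W_rec, rescaled so that the
    maximum absolute eigenvalue equals spectral_radius (0.999 here,
    matching the echo state property setting in the text)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_neurons, n_neurons))
    W *= rng.random((n_neurons, n_neurons)) < density  # keep ~10% of entries
    rho = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (spectral_radius / rho)
```

Since eigenvalues scale linearly with the matrix, the rescaled reservoir has spectral radius exactly 0.999 up to floating-point error.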
The RC dynamics with N leaky integrator neurons is given as follows:

I_t = f(W^rec x_{t-1} + W^in u_t + W^fb y_{t-1})   (1)
x_t = (1 - aβ) x_{t-1} + β I_t   (2)
y_t = g(W^out [x_t, u_t]^T)   (3)

where x are the internal states of the respective neurons, and u and y are the inputs and outputs of the system, respectively. f(·) is the activation function (the hyperbolic tangent in general), and g(·) is a task-dependent function: a linear function in regression and the softmax function in classification. a is usually set to 1 for simplicity. Note that, when β = 1, the above equations match the basic RC. β is generally a scalar, but in this paper it is vectorized so that each neuron has a different time constant. In addition, W^fb and direct inputs to the outputs are ignored for simplicity. That is, the LRC dynamics in this paper is defined as follows:

I_t = f(W^rec x_{t-1} + W^in u_t)   (4)
x_t = (1 - β) x_{t-1} + β I_t   (5)
y_t = g(W^out x_t)   (6)

2.2 Learning of Readout Weights

Let us introduce the way to update the readout weights, W^out. Depending on the task (regression or classification), the loss function is defined as follows:

L = Σ_{i=1}^{n_b} (1/2) ‖y_i - t_i‖_2^2   (regression)
L = -Σ_{i=1}^{n_b} ln(y_i)^T t_i   (classification)   (7)

where n_b is the size of the mini batch and t are the supervisory signals. W^out is updated to minimize L, generally by the recursive least squares method for a linear regressor. In this paper, however, stochastic gradient descent (SGD) is employed, since the latest SGD methods (Adam [10] with L2 regularization in this paper) can generate a stable gradient at every time step. Note that the learning rate η is given as 10^{-3}/N.

3 Fractional-Order Neuron Dynamics

3.1 Derivation of Fractional-Order Leaky Integrator

In this section, the practical dynamics of fractional-order leaky integrator neurons are derived. Although the derivation process basically follows refs. [19,20], note that the way of handling discrete time is partially corrected.
In addition, the dynamics of a single neuron is derived for simplicity of description. First of all, the derivative of the fractional-order leaky integrator neuron is defined as follows:

d^α x_t / dt^α = (-x_t + I_t) τ^{-1}   (8)

where τ > 0 is the time constant, and α ∈ (0, 1] is the order of the fractional derivative, named the fractional rate. The left side of the above equation can be approximated by the following numerical integration using the L1 scheme of the Caputo fractional derivative [4]:

d^α x_t / dt^α ≈ (δ^{-α} / Γ(2 - α)) Σ_{k=0}^{t-1} (x_{k+1} - x_k) [(t - k)^{1-α} - (t - 1 - k)^{1-α}]   (9)

where δ is the time step and Γ(·) is the gamma function. When δ^α τ^{-1} Γ(2 - α) is replaced by C, the above two equations are merged as follows:

C(-x_t + I_t) = x_t - x_{t-1} + Σ_{k=0}^{t-2} (x_{k+1} - x_k) [(t - k)^{1-α} - (t - 1 - k)^{1-α}]   (10)

The last term on the right side of the above equation is defined as m_{t-1}, named the memory trace:

m_{t-1} := Σ_{k=0}^{t-2} (x_{k+1} - x_k) [(t - k)^{1-α} - (t - 1 - k)^{1-α}]   (11)

In that case, x_t is derived in a recursive manner:

(1 + C) x_t = x_{t-1} + C I_t - m_{t-1}
x_t = (1/(1 + C)) (x_{t-1} - m_{t-1}) + (C/(1 + C)) I_t
∴ x_t = (1 - β)(x_{t-1} - m_{t-1}) + β I_t   (12)

where β replaces C/(1 + C). As can be seen in Eq. (12), Eq. (5) is extended by adding the memory trace. When α = 1, this equation is equivalent to Eq. (5), since the memory trace is no longer stored (see Eq. (11)). In addition, when β = 1, the memory trace does not affect the internal state, thereby matching the basic RC. In the original derivation in refs. [19,20], (1 - β) was not multiplied with the memory trace, which could always affect the internal state unless α = 1.

3.2 Approximation to the Memory Trace

However, the memory trace defined in Eq. (11) is intractable to calculate numerically, because it requires all internal states from the initial time 0 to the current time t. For a single neuron there is still room for computation, but it is infeasible in RC with many neurons. A reasonable approximation is therefore applied to calculate the memory trace feasibly.
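To make the cost issue concrete, here is a direct Python transcription of Eqs. (11) and (12) for a single scalar neuron (our own sketch, not the paper's code). Note the per-step loop over the entire history, which is exactly what the approximation of Sect. 3.2 avoids:

```python
def memory_trace(x_hist, alpha):
    """Exact memory trace m_{t-1} of Eq. (11): requires the whole state
    history x_0, ..., x_{t-1}, i.e. O(t) work at every time step."""
    t = len(x_hist)
    m = 0.0
    for k in range(t - 1):  # k = 0, ..., t-2
        weight = (t - k) ** (1 - alpha) - (t - 1 - k) ** (1 - alpha)
        m += (x_hist[k + 1] - x_hist[k]) * weight
    return m

def fractional_step(x_hist, I_t, alpha, beta):
    """One fractional leaky-integrator update, Eq. (12)."""
    m = memory_trace(x_hist, alpha)
    return (1 - beta) * (x_hist[-1] - m) + beta * I_t
```

With alpha = 1 every weight vanishes, so the update reduces to the plain leaky integrator of Eq. (5), as stated in the text.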
Now, a parameter n ∈ ℕ is introduced for the approximation. The memory trace m_t is divided by using n as follows:

m_t = Σ_{k=0}^{t-1-n} (x_{k+1} - x_k) [(t + 1 - n - k)^{1-α} - (t - n - k)^{1-α}] × [(t + 1 - k)^{1-α} - (t - k)^{1-α}] / [(t + 1 - n - k)^{1-α} - (t - n - k)^{1-α}]
    + Σ_{k=t-n}^{t-1} (x_{k+1} - x_k) [(t + 1 - k)^{1-α} - (t - k)^{1-α}]   (13)

Fig. 2. Decay rate γ(α, n, t, k), plotted (a) with respect to α and (b) with respect to n: t is fixed at 500; as α and n increase, the approximation accuracy is expected to worsen; note that n yields the precise memory trace up to n, although this fact cannot be shown in this figure.

The latter summation can be computed at a constant cost, and an efficient solver has been proposed in ref. [14]. The former summation matches m_{t-n} if the multiplied coefficients, named decay rates, are excluded. They are therefore approximated by a value unrelated to k:

[(t + 1 - k)^{1-α} - (t - k)^{1-α}] / [(t + 1 - n - k)^{1-α} - (t - n - k)^{1-α}] =: γ(α, n, t, k) ≈ γ(α, n)   (14)

Since γ(α, n) is independent of k, it can be taken out of the summation. Namely, m_t is approximated in a recursive manner as follows:

m_t ≈ γ(α, n) m_{t-n} + Σ_{k=t-n}^{t-1} (x_{k+1} - x_k) [(t + 1 - k)^{1-α} - (t - k)^{1-α}]   (15)

Note that, even after this approximation, the memory trace is 0 when α = 1. The effect of this approximation can be confirmed from the plots of the decay rate γ(α, n, t, k) summarized in Fig. 2. A smaller α yields quicker convergence to 1, which makes it easier to improve the approximation accuracy. A smaller n also improves the approximation accuracy, while the latest memory trace up to n steps back can be calculated without the approximation. Then, to prioritize minimization of cost, n = 1 is employed in this paper.
m_t ≈ γ(α, 1) m_{t-1} + (x_t - x_{t-1})(2^{1-α} - 1)   (16)

As approximation methods for γ, several choices can be considered: an optimistic method, approximating γ(α, n) by 1; a worst-case method, approximating γ(α, n) by the minimum decay rate in the approximated range; and a method mixing them. In the mixed method with mixing rate ζ (= 0.1 in this paper), γ(α, 1) is given as follows:

γ(α, 1) = ζ + (1 - ζ) (3^{1-α} - 2^{1-α}) / (2^{1-α} - 1)   (17)

4 Performance Evaluation

4.1 Benchmark Problems

10th-Order NARMA System (NARMA). The task in the nonlinear auto-regressive moving average (NARMA) system [1] is to predict the next output from the current input and output. The input s is generated uniformly from the interval [0, 0.5]. The output of the 10th-order system, y, is given by the following equation:

y(t + 1) = 0.3 y(t) + 0.05 y(t) Σ_{i=0}^{9} y(t - i) + 1.5 s(t - 9) s(t) + 0.1   (18)

That is, a long memory capacity is required to predict the output, since inputs and outputs up to 10 steps back are necessary. In this paper, 500 steps have been generated as one sequence, and 50 sequences are prepared as a dataset.

Inverse Kinematics for a Two-Link Arm (IK). The task for the two-link arm, which has links 1 and 2 of length 0.5 m, is to predict the joint angles and their angular velocities, (θ1, θ2) and (θ̇1, θ̇2), from the reference of the arm tip, (x, y). The reference trajectory is generated by sine waves with several frequencies for the respective axes. This inverse kinematics can be solved analytically as follows:

θ1 = atan2(y, x) - atan2(√(x² + y² - d1²), d1)
θ2 = -θ1 + atan2(y, x) + atan2(√(x² + y² - d2²), d2)   (19)

where d1 = (x² + y² + ℓ1² - ℓ2²)/(2ℓ1) and d2 = (x² + y² - ℓ1² + ℓ2²)/(2ℓ2), with ℓ1, ℓ2 the link lengths. The angular velocities are given by backward differences. In this paper, 500 steps have been generated from the specific reference trajectory as one sequence, and 50 sequences are prepared as a dataset.

Walking Path Classification (MovementAAL). This dataset is provided by ref. [2].
The task in this paper is to classify the walking path into six paths according to the current four radio signal strengths (RSS). Although classification tasks with time-series data are generally evaluated by one classification per sequence, here they are evaluated by classifications at the respective time steps. This setting is harder than the general one and helps clarify the performance of the classifiers. Note that the size of the mini batch is smaller than for the other datasets (n_b = 5 for this dataset and n_b = 50 for the others) due to the shorter sequences.

Activity Classification (AReM). This dataset is provided by ref. [16]. Seven activities, i.e., bending1, bending2, cycling, lying, sitting, standing, and walking, are classified from six inputs, i.e., the respective means and variances of three RSS. As in the walking path classification, this task is also evaluated by the classification results at the respective time steps.

4.2 Evaluation Criteria

The prepared dataset is divided into training data (75%) and test data (25%). Note that this division is conducted with 10 patterns for statistical evaluation. After learning for 20 epochs with the training data, each method is evaluated with the test data. Evaluation differs between the regression problems (the first two benchmarks) and the classification problems (the remaining two). In the regression problems, the criterion is the normalized mean square error (NMSE):

NMSE = ‖y - t‖_2^2 / ‖t‖_2^2   (20)

A smaller value means higher regression performance. The classification criterion is the accuracy (ACC) of classification at each time step:

ACC = T_cor / T_all   (21)

where T_cor is the cumulative time successfully classified and T_all is the total time of the dataset. A larger value means higher classification performance.

4.3 Results

Three methods, i.e., RC with α, β = 1, LRC with α = 1 and β ∈ (0, 1), and FLRC (proposal) with α, β ∈ (0, 1), were compared in terms of the evaluation criteria.
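The two criteria can be written directly in NumPy; this is a trivial sketch of ours for Eqs. (20) and (21) (function and argument names are not from the paper):

```python
import numpy as np

def nmse(y, t):
    """Normalized mean square error, Eq. (20): ||y - t||^2 / ||t||^2."""
    y, t = np.asarray(y, dtype=float), np.asarray(t, dtype=float)
    return float(np.sum((y - t) ** 2) / np.sum(t ** 2))

def acc(predicted, target):
    """Per-time-step accuracy, Eq. (21): T_cor / T_all, i.e. the fraction
    of time steps whose predicted label matches the target label."""
    predicted, target = np.asarray(predicted), np.asarray(target)
    return float(np.mean(predicted == target))
```

For instance, a prediction identical to the target gives NMSE = 0, and a label sequence matching the target at 3 of 4 steps gives ACC = 0.75.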
Note that α and β were generated uniformly (or fixed to 1) for the respective neurons, although they are usually scalars specialized for the task to be learned. This design aims to eliminate task-dependent optimization and to improve generalization. Except for α and β, all other network constants, such as W^rec, are common across the methods.

Table 1. Means and standard deviations of the evaluation results: in each benchmark, the best results are shown in bold; the means of FLRC with N = 500 outperformed the other methods in all benchmarks.

Benchmark            RC                      LRC                     FLRC (proposal)
                     N=100       N=500       N=100       N=500       N=100       N=500
NARMA (NMSE×10³)     37.8 ± 0.6  55.2 ± 1.3  37.0 ± 0.5  35.0 ± 0.4  35.9 ± 0.5  33.2 ± 0.5
IK (NMSE×10²)        13.3 ± 1.7  13.3 ± 1.7  12.4 ± 1.6  12.2 ± 1.6  12.3 ± 1.6  12.0 ± 1.6
MovementAAL (ACC%)   38.4 ± 3.5  38.7 ± 3.0  52.6 ± 4.8  54.6 ± 4.2  58.6 ± 3.5  60.6 ± 3.4
AReM (ACC%)          63.0 ± 4.1  62.8 ± 4.2  68.2 ± 4.6  68.8 ± 4.8  68.7 ± 4.7  69.8 ± 4.7

The results are shown in Fig. 3(a) and (b) and Table 1. As can be seen, FLRC outperformed the conventional RC and LRC in all benchmarks. In NARMA and MovementAAL, significant superiority was confirmed, while the remaining two showed no significant differences. This is because the former two benchmarks require a longer memory capacity than the latter two. In particular, NARMA absolutely needs the longer memory capacity in accordance with Eq. (18), while LRC is sufficient to predict the trajectory of the two-link arm from the current states and the next references in IK (see Eq. (19)).

(a) Regression problems (b) Classification problems

Fig. 3.
Box plots of the evaluation results: the number after each method name is the number of neurons N, which basically improved performance to some extent; FLRC outperformed the conventional methods, namely RC and LRC, in all benchmarks, although its superiority in IK and AReM was not significant.

5 Conclusion

This paper proposed practical fractional-order leaky integrator neurons for RC, named FLRC, which yield a long memory capacity. Although the fractional-order derivative generally needs all memories leading from the initial state to the current state, and this feature is intractable for RC with many neurons, a reasonable approximation to the fractional-order neuron dynamics yields a model that exponentially decays past memories beyond a threshold. This derivation is regarded as a natural extension of the normal leaky integrator. FLRC was compared with the conventional RC and LRC by solving four kinds of regression and classification problems with time-series data. As a result, FLRC achieved superior results in all of the problems. Future work is to analyze the optimal design of α and β. Alternatively, they could be dynamically optimized by SGD.

References

1. Atiya, A.F., Parlos, A.G.: New results on recurrent network training: unifying the algorithms and accelerating convergence. IEEE Trans. Neural Netw. 11(3), 697-709 (2000)
2. Bacciu, D., Barsocchi, P., Chessa, S., Gallicchio, C., Micheli, A.: An experimental characterization of reservoir computing in ambient assisted living applications. Neural Comput. Appl. 24(6), 1451-1464 (2014)
3. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
4. Diethelm, K., Ford, N.J., Freed, A.D., Luchko, Y.: Algorithms for the fractional calculus: a selection of numerical methods. Comput. Methods Appl. Mech. Eng.
194(6-8), 743-773 (2005)
5. Gallicchio, C., Micheli, A., Pedrelli, L.: Deep reservoir computing: a critical experimental analysis. Neurocomputing 268, 87-99 (2017)
6. Hermans, M., Schrauwen, B.: Training and analysing deep recurrent neural networks. In: Advances in Neural Information Processing Systems, pp. 190-198 (2013)
7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735-1780 (1997)
8. Jaeger, H., Haas, H.: Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science 304(5667), 78-80 (2004)
9. Jaeger, H., Lukoševičius, M., Popovici, D., Siewert, U.: Optimization and applications of echo state networks with leaky-integrator neurons. Neural Netw. 20(3), 335-352 (2007)
10. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: International Conference for Learning Representations, pp. 1-15 (2015)
11. Lukoševičius, M., Jaeger, H.: Reservoir computing approaches to recurrent neural network training. Comput. Sci. Rev. 3(3), 127-149 (2009)
12. Lun, S.x., Yao, X.s., Hu, H.f.: A new echo state network with variable memory length. Inf. Sci. 370, 103-119 (2016)
13. Maass, W., Markram, H.: On the computational power of circuits of spiking neurons. J. Comput. Syst. Sci. 69(4), 593-616 (2004)
14. Marinov, T., Ramirez, N., Santamaria, F.: Fractional integration toolbox. Fract. Calc. Appl. Anal. 16(3), 670-681 (2013)
15. Pahnehkolaei, S.M.A., Alfi, A., Machado, J.T.: Uniform stability of fractional order leaky integrator echo state neural network with multiple time delays. Inf. Sci. 418, 703-716 (2017)
16. Palumbo, F., Gallicchio, C., Pucci, R., Micheli, A.: Human activity recognition using multisensor data fusion based on reservoir computing. J. Ambient Intell. Smart Environ. 8(2), 87-107 (2016)
17. Rodan, A., Tino, P.: Minimum complexity echo state network. IEEE Trans. Neural Netw. 22(1), 131-144 (2011)
18.
Schmidhuber, J., Wierstra, D., Gagliolo, M., Gomez, F.: Training recurrent networks by Evolino. Neural Comput. 19(3), 757-779 (2007)
19. Teka, W., Marinov, T.M., Santamaria, F.: Neuronal spike timing adaptation described with a fractional leaky integrate-and-fire model. PLoS Comput. Biol. 10(3), e1003526 (2014)
20. Teka, W.W., Upadhyay, R.K., Mondal, A.: Fractional-order leaky integrate-and-fire model with long-term memory and power law dynamics. Neural Netw. 93, 110-125 (2017)
21. Verstraeten, D., Schrauwen, B., Stroobandt, D., Van Campenhout, J.: Isolated word recognition with the liquid state machine: a case study. Inf. Process. Lett. 95(6), 521-528 (2005)
22. Xue, F., Li, Q., Li, X.: The combination of circle topology and leaky integrator neurons remarkably improves the performance of echo state network on time series prediction. PLoS ONE 12(7), e0181816 (2017)
23. Yamazaki, T., Igarashi, J.: Realtime cerebellum: a large-scale spiking network model of the cerebellum that runs in realtime using a graphics processing unit. Neural Netw. 47, 103-111 (2013)
24. Yildiz, I.B., Jaeger, H., Kiebel, S.J.: Re-visiting the echo state property. Neural Netw. 35, 1-9 (2012)

An Unsupervised Character-Aware Neural Approach to Word and Context Representation Learning

Giuseppe Marra1,2(B), Andrea Zugarini1,2, Stefano Melacci2, and Marco Maggini2

1 DINFO, University of Firenze, Florence, Italy
{g.marra,andrea.zugarini}@unifi.it
2 DIISM, University of Siena, Siena, Italy
mela@diism.unisi.it, marco.maggini@unisi.it

Abstract. In the last few years, neural networks have been intensively used to develop meaningful distributed representations of words and of the contexts around them. When these representations, also known as "embeddings", are learned from large unsupervised corpora, they can be transferred to different tasks with positive effects in terms of performance, especially when only a few supervised examples are available.
In this work, we further extend this concept, and we present an unsupervised neural architecture that jointly learns word and context embeddings, processing words as sequences of characters. This allows our model to spot the regularities that are due to word morphology and to avoid the need for a fixed-size input vocabulary of words. We show that we can learn compact encoders that, despite the relatively small number of parameters, reach high-level performance in downstream tasks, comparing them with related state-of-the-art approaches or with fully supervised methods.

Keywords: Recurrent Neural Networks · Unsupervised learning · Word and context embeddings · Natural Language Processing · Deep learning

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 126-136, 2018. https://doi.org/10.1007/978-3-030-01424-7_13

1 Introduction

Recent advances in Natural Language Processing (NLP) are characterized by the development of techniques that compute powerful word embeddings and by the extensive use of neural language models. Word Embeddings (WEs) aim at representing individual words in a low-dimensional continuous space, in order to exploit its topological properties to model semantic or grammatical relationships between words. In particular, they are based on the assumption that functionally or semantically related words appear in similar contexts. Although the idea of continuous word representations was proposed several years ago [4], their importance became strongly popular mostly after the work of Mikolov et al. [13], when the CBOW and Skip-Gram models were introduced as
Moreover, the learning objective function is task-independent, so that it allows the development of embeddings suitable for several NLP tasks. WEs are generally constituted by a single vector representing each specific word in a vocabulary V of N = |V| words. The requirement of a predefined vocabulary is an important limitation for every NLP model: rare and Out-Of-Vocabulary (OOV) words will not have a meaningful vector representation. Moreover, WEs do not take into account morphological properties of words. For instance, the same suffix ing may suggest that two words have some functional similarity. Hence, the information conveyed by the sequence of characters representing a word may be useful to tackle both the problem of unseen words and the modelling of morphology for in-vocabulary tokens. For instance, the character structure of tokens can also help to detect Named Entities, usually treated as OOV elements, recognizing proper nouns by means of capital letters, or acronyms. Furthermore, a character-based model can deal with noise caused by typos, slang, etc., which are common issues in open-domain systems such as conversational agents or sentiment analysis tools.
There are several NLP tasks in which it is useful to generate vectorial representations of contexts too. In fact, polysemy and homonymy cause inherent semantic ambiguities in language interpretation that can only be resolved by looking at the surrounding context, which is the goal of the Word Sense Disambiguation (WSD) task. Neural approaches have been developed to learn context embeddings, such as context2vec [12].
In this work we propose a character-based unsupervised model to learn both context and word embeddings from generic text. The model consists of a hierarchy of two distinct Bidirectional Long Short Term Memories (Bi-LSTMs) [18], to encode words as sequences of characters and word-level contextual representations, respectively.
Our unsupervised learning approach, despite being more compact than other related algorithms, yields generic embeddings with features that can be efficiently exploited in different NLP tasks requiring either word or context embeddings, such as chunking and WSD, as we show in our comparisons.
The paper is structured as follows. First, in Sect. 2 the related work is summarized. Then, we describe the proposed model in Sect. 3. Section 4 reports our experimental results and Sect. 5 draws our conclusions and the directions for future work.

2 Related Work

Our unsupervised computational scheme follows the one of the CBOW instance of the word2vec algorithm [13]. The method we propose in this paper is inspired by the ideas behind context2vec [12], which we extend with a bidirectional recurrent neural model that processes words as sequences of characters. We also focus on a single encoder that we use both to represent words alone and words belonging to a context.
There are several approaches that jointly learn task-oriented (supervised) word and character-based representations, which are subsequently either concatenated or combined by a non-linear function. In [14] a gate adaptively decides how to mix the two representations, whereas the models proposed in [16,17] exploit the concatenation of word embeddings and character representations to address Part-Of-Speech (POS) Tagging and Named Entity Recognition (NER), respectively. Differently, our work focuses on a single character-level encoder that is trained in an unsupervised manner.
There exists a number of different approaches that extract vectorial representations directly from the character sequences of words, mostly focused on Language Modeling (LM) or Character Language Modeling (CLM). These representations are generally computed by either Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), mostly LSTMs [5]. Ling et al.
[11] applied Bidirectional LSTMs [18] to learn task-dependent character-level features for Language Modeling and POS tagging, showing particular improvements in morphologically rich languages such as Turkish. A multi-layered Hierarchical Recurrent Neural Network was applied in [7] to solve CLM. Differently from our approach, the output of this model is a distribution over characters, while we exploit word-level predictions. The character-aware model of [10] is based on a highway network on top of 1-d convolutional filters processing input characters. The resulting output is then handled by an LSTM for an LM task. The highway-network output provides the distributed representation of a word. In [9] different architectures, mostly based on CNNs, are studied in LM tasks.
The proposed approach differs from most of the previous ones (1) for the learning mechanism, which is completely unsupervised on large text corpora, thus allowing the development of task-independent representations, and (2) for the architecture, which is aimed at obtaining character-aware representations of both contexts and words that are suitable for a large variety of NLP applications.

3 The Character-Aware Neural Model

The proposed model is organized as a hierarchical architecture based on Bi-LSTMs processing sentences. Each sentence is first split into a sequence of words using space characters (i.e. whitespaces, tabs, newlines, etc.) as separators. Words are further split into sequences of characters, so that there is no need to specify a vocabulary in advance. Then, the character sequence of an input word x is processed to obtain its vectorial representation (word embedding), while the character sequences of the surrounding words are used to encode the context to which x belongs (context embedding). Given the current sentence, the context of x comprises the words that precede and follow x.
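The sentence-to-characters preprocessing described above can be sketched in plain Python as follows. The character dictionary construction and the reserved index for unseen characters are our own illustrative assumptions; the paper only notes that the character set is small.

```python
def build_char_dict(corpus):
    """Collect the (small) set of characters seen in a training corpus.

    Index 0 is reserved for characters never seen at training time
    (an assumption of this sketch, not taken from the paper).
    """
    chars = sorted({c for sentence in corpus for c in sentence if not c.isspace()})
    return {c: i + 1 for i, c in enumerate(chars)}

def encode_sentence(sentence, char_dict):
    """Split on whitespace, then map each word to character indices.

    No word vocabulary is needed: any string, including OOV words
    and typos, gets a well-defined character-index sequence.
    """
    return [[char_dict.get(c, 0) for c in word] for word in sentence.split()]

corpus = ["The cat is sleepy", "The dog was barking"]
char_dict = build_char_dict(corpus)
encoded = encode_sentence("The zzz is sleepy", char_dict)  # "zzz" was never seen as a word
```

Note how the unseen token "zzz" still receives a valid (if uninformative) encoding, which is exactly the property that removes the fixed word vocabulary.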
Inspired by the CBOW scheme [13], our model is trained to predict the current word given its context. In the following we describe each layer of the proposed architecture.

3.1 Word and Context Embeddings

We consider an input sentence s composed of n words, s = (x_1, ..., x_n), where each word is a sequence of characters x_i = (c_{i,1}, ..., c_{i,|x_i|}), being |x_i| the length of the sequence x_i. Each character c_{ij} is encoded as an index in a dictionary of C characters and it is mapped to a real vector \hat{c}_{ij} \in \mathbb{R}^{d_c} as

\hat{c}_{ij} = W_c \cdot 1(c_{ij}),   (1)

where W_c \in \mathbb{R}^{C \times d_c} is the matrix of the learnable character representations, each of them of size d_c, while 1(\cdot) is a function returning a one-hot representation of its integer input. Note that C is quite small, in the order of hundreds, compared to common word vocabularies, whose size is in the order of hundreds of thousands.
For each input word x_i, the first layer of the model extracts a word embedding e_i, using a bidirectional recurrent neural network with LSTM cells (Bi-LSTM) [2]. Let \overrightarrow{r}_c and \overleftarrow{r}_c be the forward and backward components of a Bi-LSTM taking a sequence of character embeddings as input and returning their internal states \overrightarrow{h}_c and \overleftarrow{h}_c after the entire sequence has been processed. The embedding e_i of the word x_i is then the concatenation of \overrightarrow{h}_c and \overleftarrow{h}_c:

e_i = [\overrightarrow{h}_c, \overleftarrow{h}_c] = [\overrightarrow{r}_c(\hat{c}_{i,1} \ldots \hat{c}_{i,|x_i|}), \overleftarrow{r}_c(\hat{c}_{i,|x_i|} \ldots \hat{c}_{i,1})],   (2)

where we indicated with [\cdot, \cdot] the concatenation operation and we emphasized the backward nature of \overleftarrow{r}_c by showing the character sequence in reverse order.
The second layer follows a similar scheme to compute the contextual embedding \hat{e}_i of the word x_i in the sentence s. Let \overrightarrow{r}_e and \overleftarrow{r}_e be the forward and backward components of a Bi-LSTM taking as inputs the embeddings of the left context of x_i (i.e. [e_1, ..., e_{i-1}]) and of the right context of x_i (i.e. [e_{i+1}, ..., e_n]), respectively.
Given the Bi-LSTM internal states \overrightarrow{h}_e and \overleftarrow{h}_e obtained after processing the input left and right context sequences, the contextual embedding \hat{e}_i of the word x_i is then obtained by projecting the concatenation of \overrightarrow{h}_e and \overleftarrow{h}_e into a lower-dimensional space by means of a Multi-Layer Perceptron (MLP), with the goal of merging and compressing the left and right context representations,

\hat{e}_i = MLP([\overrightarrow{h}_e, \overleftarrow{h}_e]) = MLP([\overrightarrow{r}_e(e_1 \ldots e_{i-1}), \overleftarrow{r}_e(e_n \ldots e_{i+1})]).   (3)

The overall architecture is sketched in Fig. 1. Notice that e_i is the embedding of word x_i, whereas \hat{e}_i is the representation of x_i in the context of s without including x_i itself. Hence, the model computes at the same time word (Eq. (2)) and context (Eq. (3)) embeddings for a specific word.

3.2 Learning Algorithm

Both word and context representations are learned following the unsupervised approach used in CBOW [12,13]. Given a corpus of textual data, the objective of our model is to predict each word given the representation of its surrounding context (Eq. (3)). In particular, the context embedding of Eq. (3) is projected into the space of the corpus vocabulary using a linear projection. Instead of performing a softmax activation and minimizing the cross-entropy (as commonly done in LM tasks), the whole network is trained by minimizing the Noise Contrastive Estimation (NCE) loss function [3]. NCE belongs to a family of classification algorithms which approximate a softmax regression by means of sampling methods. NCE is particularly helpful in all those cases in which the number of output units is prohibitively high, as it is for our (and related) model.

Fig. 1.
The sentence "The cat is sleepy" is fed to our model, with target word cat. The sequence of character embeddings (orange squares on the bottom) is processed by the word-level Bi-LSTM, yielding the word embeddings (yellow-red squares in the middle). The context-level Bi-LSTM processes the word embeddings in the left and right contexts of cat, to compute a representation of the whole context (blue-green squares on the top). Such representation is used to predict the target word cat, after having projected it by means of an MLP. (Color figure online)

One could argue that a vocabulary of words is still needed, since it is required to make the aforementioned word prediction. However, this is not a limitation, since it is only necessary at training time, while it is not needed when deploying the model. In principle, a different approach would be feasible, where the context representation of Eq. (3) is decoded into a sequence of characters that represent the word to predict. We tried both approaches and we found the word-level prediction to give the best results. Thanks to the dynamic behaviour of the context-level RNNs, our model can deal with contexts of any length. In this work, the state of the RNN r_e is reset at the beginning of a new sentence, to reduce the variability of the contexts.

4 Experimental Results

We conducted different experiments to evaluate the word and context representations developed by the proposed model. In particular, we first trained our model on a large corpus. Then, we detached the learned word and context encoders and considered the tasks of Chunking and Word Sense Disambiguation (WSD), exploiting our word and context embeddings as features for each task-specific classifier, as shown in Fig. 2. Depending on the problem at hand, it may be useful to use either both the word and context embeddings or only one of them.
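The two-level encoder of Eqs. (1)–(3), whose outputs are reused here as features, can be sketched at the shape level as follows. This is a NumPy illustration under simplifying assumptions of ours: plain tanh RNN cells stand in for the LSTM cells, a single linear layer stands in for the MLP, all weights are random, and the layer sizes are made up for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d_c, d_h = 30, 8, 16            # char dictionary size, char emb size, RNN state size

W_c = rng.normal(size=(C, d_c))    # learnable character embeddings, Eq. (1)

def rnn(seq, W_in, W_rec):
    """Plain tanh RNN standing in for one LSTM direction; returns the final state."""
    h = np.zeros(W_rec.shape[0])
    for v in seq:
        h = np.tanh(W_in @ v + W_rec @ h)
    return h

# character-level Bi-RNN, one weight set per direction
Wf_in, Wf_rec = rng.normal(size=(d_h, d_c)), rng.normal(size=(d_h, d_h))
Wb_in, Wb_rec = rng.normal(size=(d_h, d_c)), rng.normal(size=(d_h, d_h))

def word_embedding(char_ids):
    """Eq. (2): concatenate the forward and backward final states."""
    chars = [W_c[i] for i in char_ids]
    return np.concatenate([rnn(chars, Wf_in, Wf_rec),
                           rnn(chars[::-1], Wb_in, Wb_rec)])

d_e = 2 * d_h                      # word embedding size
Ef_in, Ef_rec = rng.normal(size=(d_h, d_e)), rng.normal(size=(d_h, d_h))
Eb_in, Eb_rec = rng.normal(size=(d_h, d_e)), rng.normal(size=(d_h, d_h))
W_mlp = rng.normal(size=(d_h, 2 * d_h))  # one linear layer standing in for the MLP

def context_embedding(words, i):
    """Eq. (3): encode the left and right contexts of word i, then project."""
    e = [word_embedding(w) for w in words]
    h_fwd = rnn(e[:i], Ef_in, Ef_rec)            # left context, forward
    h_bwd = rnn(e[i + 1:][::-1], Eb_in, Eb_rec)  # right context, backward
    return np.tanh(W_mlp @ np.concatenate([h_fwd, h_bwd]))

sentence = [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9, 10, 11]]  # words as char-index lists
e_hat = context_embedding(sentence, 1)  # context of the 2nd word, excluding the word itself
```

Note that the target word never enters `context_embedding`, matching the paper's definition of the context representation.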
Any other additional features can also be concatenated to these representations to obtain a richer input vector. We also evaluated the robustness of our model to character-level noise. Hence, we considered the WSD task when the input words are perturbed by typos, modelled as random replacements of single characters. Finally, we report some qualitative examples, showing the nearest neighbours for both word and context representations of a set of sample words.

Model Setup. Our model has been trained on the ukWaC corpus1 (2 billion words). The size d_c of the character embeddings is set to 50, whereas word and context embeddings are of sizes 1000 and 600, respectively. The MLP that maps the RNN states into the context embeddings has one hidden layer of 1200 units with ReLU activation functions. These settings are inspired by those used in the context2vec architecture [12] (the structure of the last projection layer described in Subsect. 3.2 is the same). The complete encoding model has around 7 million trainable parameters, which is about 16 times smaller than the context2vec model in [12]; this is due to the fact that words are encoded using an RNN that does not depend on the vocabulary size.

Fig. 2. Examples of how word and context embeddings can be used in a generic task, and in the cases of Chunking and WSD of this paper.

1 http://wacky.sslmit.unibo.it/doku.php?id=corpora

Chunking. Chunking is a classical NLP problem whose goal is to tag text segments with labels defining their syntactic roles, e.g. noun phrase (NP) or verbal phrase (VP). Each word is uniquely associated with a single tag expressing the segment class and its position within the phrase.
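As a concrete illustration of this tagging scheme (a toy example of our own; the tag inventory below is the standard B-/I-/O chunk notation, not taken from the paper):

```python
# One chunk tag per token: B- opens a chunk, I- continues it, O is outside any chunk.
sentence = ["The", "black", "dog", "was", "barking", "."]
tags     = ["B-NP", "I-NP", "I-NP", "B-VP", "I-VP", "O"]

def chunks(tokens, tags):
    """Group tokens into (label, words) segments from their B-/I-/O tags."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("I-") and out and out[-1][0] == tag[2:]:
            out[-1][1].append(tok)           # continue the currently open chunk
        else:
            label = "O" if tag == "O" else tag[2:]
            out.append((label, [tok]))       # open a new segment
    return out

segments = chunks(sentence, tags)
```

Here dog carries the tag I-NP, i.e. it sits inside the noun-phrase chunk "The black dog", matching the example discussed in the text.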
An instance of Chunking classification is shown in Fig. 2, where the word dog is marked with the label I-NP, standing for Inside-chunk Noun Phrase. A standard benchmark for Chunking is the CoNLL 2000 dataset, which contains 211,727 tokens in the training set and 47,377 tokens in the test set. The chunk tag is predicted by training a classifier that receives as input only the concatenation of the word and context embeddings computed by the model. This vector is projected onto a 600-dimensional space, and further processed by a Bi-LSTM that outputs vectors of size 500, which are finally mapped to the space of 23 classes representing the chunk tags. Weights are updated using the Adam optimizer with default hyper-parameters and weight decay regularization with a factor of 0.001. We compared several variants of the proposed model and the resulting F1 scores are shown in Table 1. We report results when using only Word Embeddings (WE), only Context Embeddings (CE), and both of them (WE+CE). In this case we also considered WE and CE that are not generated by our model, but that are variables of the whole architecture trained with the task-level supervision. Both feature types (WE and CE) are needed to achieve better performance, as expected. This experiment highlights the importance of using embeddings that are pre-trained with our model, which allows us to obtain the best F1 score of 93.30. This value can be compared with the results reported by Collobert et al. [1] (94.32) and by Huang et al. [6] (94.46), taking into account that in our case we did not make use of any hand-crafted feature nor of any kind of post-processing to adjust incoherent predictions. Moreover, when adding POS tagging features, our model reaches the same performance (93.94) as the state-of-the-art architecture [6] without Conditional Random Fields.
Hence, we can conclude that the proposed architecture provides word and context embeddings that convey enough information to reach competitive performance. Furthermore, it should be considered that the number of parameters in the model is dramatically reduced with respect to such competitors, since there is no word vocabulary.

Table 1. Results on the Chunking task - different input features.

Input features             F1 %
Our WE only                89.68
Our CE only                89.59
Our WE + Our CE            93.30
WE + CE trained on task    89.83

Word Sense Disambiguation. Experiments on WSD were carried out within the evaluation framework proposed in [15], which collects multiple benchmarks (Senseval*, SemEval*, and a merged collection, ALL). The goal of WSD is to identify the correct sense of words. We followed the commonly used IMS approach [19], which is based on an SVM classifier on top of conventional WSD features. We compare our method against the original IMS model and other instances of it in which the WSD features are augmented with different context embeddings. We report the results in Tables 2 and 3. Our embeddings outperform both the IMS with only conventional features and word2vec embeddings, opportunely averaged [8]; moreover, our method is competitive with context2vec representations. It is also worth mentioning that, to the best of our knowledge, the use of context2vec features as input of the IMS is a novel attempt in the literature.

Table 2. Word Sense Disambiguation in the benchmarks collected in [15]. The best results (F1 %) are obtained by the context2vec model, which however has 16 times more parameters than the proposed model and no capability to deal with OOV tokens.

Model              Senseval2   Senseval3   SemEval2007   SemEval2013   SemEval2015   ALL
IMS                70.2        68.8        62.2          65.3          69.3          68.1
IMS+word2vec       72.2        69.9        62.9          66.2          71.9          69.6
IMS+context2vec    73.8        71.9        63.3          68.1          72.7          71.1
IMS+Our CE         72.8        70.5        62.0          66.2          71.9          69.9

Table 3.
Overall results (F1 %) grouped by Part of Speech (ALL benchmark [15]).

Model              Noun   Adjective   Verb   Adverb
IMS                70.0   75.2        56.0   83.2
IMS+word2vec       71.8   76.1        57.4   83.5
IMS+context2vec    73.1   77.0        60.5   83.5
IMS+Our CE         71.3   76.6        58.1   83.8

Robustness to Typos. Many NLP applications have to deal with noisy textual data. Indeed, misspelled words are likely to be set as OOV in models based on word dictionaries. We compare the proposed model against context2vec on a WSD task (ALL benchmark), when introducing an increasing probability of randomly perturbing a character of a word. Conventional WSD features are completely removed for both models, which only use context-level representations. Figure 3 shows how the F1 score decreases as the noise probability increases. Both models suffer from word perturbations, but the character-aware embeddings yield a slower degradation in performance, which allows our model to outperform context2vec for high levels of noise.

Fig. 3. Robustness to typos in a WSD task (ALL benchmark [15]). The "noise probability" represents the probability of having a typo in a word.

Qualitative Evaluation. One of the most intriguing properties of embeddings is their capability to capture semantic and syntactic similarities in the topology of the embedding space. Such characteristic is illustrated by means of examples for both the representations (word and context) obtained by the proposed model. Distances between the distributed representations are computed by the cosine similarity. In Table 4 we show the 5 nearest neighbours for some given words. The examples show that the character-based model is capable of capturing both morphological and semantic similarities.

Table 4. Top-5 closest words for a given target word.
Turkish     Sometimes     Usually     Happiness
Danish      Somehow       Normally    Weirdness
Welsh       Altogether    Basically   Fairness
French      Perhaps       Barely      Deformity
Kurdish     Nonetheless   Typically   Ripeness
Swedish     Heretofore    Formerly    Smoothness

For the evaluation of context representations, we considered 8 sentences related to 2 different topics (4 sentences each): capitals of states and pizza. A context embedding is obtained by considering the tokens around the word capital or pizza. Then, a random sentence is chosen as query, and the remaining sentences are sorted according to the distance between the query context embedding and their vectors. An example is shown in Table 5, where it is clear that all the contexts related to pizza instances are closer to the query than sentences concerning capitals.

Table 5. Some contexts sorted by descending cosine similarity with respect to the query context "I like eating [ ] with cheese and ham" of (unused) target word pizza.

Query: I like eating [ ] with cheese and ham.                    pizza

Contexts sorted by descending cosine similarity:
Do you like to eat [ ] with cheese and salami?                   pizza
Did you eat [ ] at lunch?                                        pizza
What is the best [ ] i can eat here?                             pizza
Paris is the [ ] and most populous city in France ...            capital
London is the [ ] and most populous city of England ...          capital
Rome is the [ ] of Italian Republic                              capital
Washington , D.C. , .... , is the [ ] of the United States       capital

5 Conclusions

We presented an unsupervised neural model that can develop task-independent word and context representations using character-level inputs. We trained our model on a 2 billion word corpus, and the resulting word and context encoders were used to produce robust input features to approach some popular NLP tasks (Chunking, WSD).
The proposed model has shown the capability of building powerful representations that are competitive with state-of-the-art embeddings generated by models with a significantly larger number of parameters. Our future work will include applications of this model to conversational systems.

References

1. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
2. Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)
3. Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In: AISTATS, pp. 297–304 (2010)
4. Hinton, G.E., McClelland, J.L., Rumelhart, D.E.: Distributed representations. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations (1986)
5. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
6. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
7. Hwang, K., Sung, W.: Character-level language modeling with hierarchical recurrent neural networks. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5720–5724. IEEE (2017)
8. Iacobacci, I., Pilehvar, M.T., Navigli, R.: Embeddings for word sense disambiguation: an evaluation study. In: ACL (Volume 1: Long Papers), pp. 897–907 (2016)
9. Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., Wu, Y.: Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410 (2016)
10. Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-aware neural language models. In: AAAI, pp. 2741–2749 (2016)
11. Ling, W., et al.: Finding function in form: compositional character models for open vocabulary word representation.
In: EMNLP, pp. 1520–1530 (2015)
12. Melamud, O., Goldberger, J., Dagan, I.: context2vec: learning generic context embedding with bidirectional LSTM. In: Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pp. 51–61 (2016)
13. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
14. Miyamoto, Y., Cho, K.: Gated word-character recurrent language model. In: Proceedings of the 2016 Conference on EMNLP, pp. 1992–1997 (2016)
15. Raganato, A., Camacho-Collados, J., Navigli, R.: Word sense disambiguation: a unified evaluation framework and empirical comparison. In: EACL (2017)
16. Santos, C.D., Zadrozny, B.: Learning character-level representations for part-of-speech tagging. In: ICML, pp. 1818–1826 (2014)
17. Santos, C.N.d., Guimaraes, V.: Boosting named entity recognition with neural character embeddings. arXiv preprint arXiv:1505.05008 (2015)
18. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
19. Zhong, Z., Ng, H.T.: It makes sense: a wide-coverage word sense disambiguation system for free text. In: ACL, pp. 78–83 (2010)

Towards End-to-End Raw Audio Music Synthesis

Manfred Eppe(B), Tayfun Alpay, and Stefan Wermter
Knowledge Technology, Department of Informatics, University of Hamburg, Vogt-Koelln-Str. 30, 22527 Hamburg, Germany
{eppe,alpay,wermter}@informatik.uni-hamburg.de
http://www.informatik.uni-hamburg.de/WTM/

Abstract. In this paper, we address the problem of automated music synthesis using deep neural networks and ask whether neural networks are capable of realizing timing, pitch accuracy and pattern generalization for automated music generation when processing raw audio data. To this end, we present a proof of concept and build a recurrent neural network architecture capable of generalizing appropriate musical raw audio tracks.
Keywords: Music synthesis · Recurrent neural networks

1 Introduction

Most contemporary music synthesis tools generate symbolic musical representations, such as MIDI messages, Piano Roll, or ABC notation. These representations are later transformed into audio signals by using a synthesizer [8,12,16]. Symbol-based approaches have the advantage of offering relatively small problem spaces compared to approaches that use the raw audio waveform. A problem with symbol-based approaches is, however, that fine nuances in music, such as timbre and microtiming, must be explicitly represented as part of the symbolic model. Established standards like MIDI allow only a limited representation, which restricts the expressiveness and hence also the producible audio output.
An alternative is to directly process raw audio data for music synthesis. This is independent of any restrictions imposed by the underlying representation and, therefore, offers a flexible basis for realizing fine tempo changes, differences in timbre even for individual instruments, or for the invention of completely novel sounds. The disadvantage of such approaches is, however, that the representation space is continuous, which makes them prone to generating noise and other inappropriate audio signals.
In this work, we provide a proof of concept towards filling this gap and develop a baseline system to investigate how problematic the large continuous representation space of raw audio music synthesis actually is. We hypothesize that a recurrent network architecture is capable of synthesizing non-trivial musical patterns directly in wave form while maintaining an appropriate quality in terms of pitch, timbre, and timing.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 137–146, 2018. https://doi.org/10.1007/978-3-030-01424-7_14

Fig. 1. The practical application and workflow of our system.
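The representational difference discussed above can be made concrete with a toy contrast of our own (the event fields, tempo, and sample rate below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Symbolic representation: a handful of discrete note events (MIDI-like),
# here as (onset in beats, MIDI note number, duration in beats).
symbolic_track = [(0.0, 36, 1.0), (1.0, 43, 1.0), (2.0, 36, 1.0)]

# Raw audio representation of roughly the same bass line: one continuous
# waveform, here 3 s at a toy sample rate of 8000 Hz.
sr = 8000
t = np.arange(3 * sr) / sr

def midi_to_hz(note):
    """Standard MIDI-to-frequency conversion (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

audio = np.zeros_like(t)
for onset, note, dur in symbolic_track:      # assume 60 bpm: 1 beat = 1 s
    idx = (t >= onset) & (t < onset + dur)
    audio[idx] = 0.5 * np.sin(2 * np.pi * midi_to_hz(note) * t[idx])
```

The symbolic track is three tuples; the raw track is 24,000 continuous samples in which any timbre or microtiming nuance can in principle be expressed, which is exactly the trade-off between the small symbolic problem space and the flexible but noise-prone continuous one.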
The practical context in which we situate our system is depicted in Fig. 1. Our system is supposed to take a specific musical role in an ensemble, such as generating a bassline, lead melody, harmony or rhythm, and to automatically generate appropriate audio tracks given the audio signals from the other performers in the ensemble. To achieve this goal, we train a recurrent artificial neural network (ANN) architecture (described in Fig. 2) to learn to synthesize a well-sounding single instrument track that fits an ensemble of multiple other instruments. For example, in the context of a classic rock ensemble, we often find a composition of lead melody, harmony, bass line, and drums. Our proposed system will learn to synthesize one of these tracks, say bass, given the others, i.e., lead melody, harmony and drums. Herein, we do not expect the resulting system to be able to fully replace a human musician, but rather focus on specific measurable aspects. Specifically, we investigate:

1. Timing and beat alignment, i.e., the ability to play a sequence of notes that are temporally aligned correctly to the song's beat.
2. Pitch alignment, i.e., the ability to generate a sequence of notes that is correct in pitch.
3. Pattern generalization and variation, i.e., the ability to learn general musical patterns, such as alternating the root and the 5th in a bass line, and to apply these patterns in previously unheard songs.

We hypothesize that our baseline model offers these capabilities to a fair degree.

2 Related Work

An example of a symbolic approach for music generation, melody invention and harmonization has been presented by Eppe et al. [4,6], who build on concept blending to realize the harmonization of common jazz patterns. The work by Liang et al. [12] employs a symbol-based approach with recurrent neural networks (RNNs) to generate music in the style of Bach chorales.
The authors demonstrate that their system is capable of generalizing appropriate musical patterns and applying them to previously unheard input. An advanced general artistic framework that also offers symbol-based melody generation is Magenta [16]. Magenta's Performance-RNN module is able to generate complex polyphonic musical patterns. It also supports micro timing and advanced dynamics, but the underlying representation is still symbolic, which implies that the producible audio data is restricted. For example, novel timbre nuances cannot be generated from scratch. As another example, consider the work by Hung et al. [8], who demonstrate an end-to-end approach for automated music generation using a MIDI representation and Piano Roll representation.
Contemporary approaches for raw audio generation usually lack the generalization capability for higher-level musical patterns. For example, the Magenta framework also involves NSynth [3], a neural synthesizer tool focusing on high timbre quality of individual notes of various instruments. The NSynth framework itself is, however, not capable of generating sequences of notes, i.e., melodies or harmonies, and the combination with the Performance-RNN Magenta melody generation tool [16] still uses an intermediate symbolic musical representation, which restricts the produced audio signal. Audio generation has also been investigated in depth in the field of speech synthesis. For example, the WaveNet architecture [15] is a general-purpose audio-synthesis tool that has mostly been employed in the speech domain. It has inspired the Tacotron text-to-speech framework, which provides expressive results in speech synthesis [18]. To the best of our knowledge, however, WaveNet, or derivatives of it, have not yet been demonstrated to be capable of generalizing higher-level musical patterns in the context of generating a musical track that fits other given tracks.
There exist some recent approaches to sound generation operating on raw waveforms without any external knowledge about musical structure, chords or instruments. A simple approach is to perform regression in the frequency domain using RNNs and to use a seed sequence after training to generate novel sequences [9,14]. We are, however, not aware of existing work that has been evaluated with appropriate empirical metrics. In our work, we perform such an evaluation and determine the quality of the produced audio signals in terms of pitch and timing accuracy.

3 A Baseline Neural Model for Raw Audio Synthesis

For this proof of concept we employ a simple baseline core model consisting of two Gated Recurrent Unit (GRU) [2] layers that encode 80 Mel spectra into a dense bottleneck representation and then decode this bottleneck representation back to 80 Mel spectra (see Fig. 2). Similar neural architectures have proven to be very successful for various other audio processing tasks in robotics and signal processing (e.g. [5]), and we have experimented with several alternative architectures using also dropout and convolutional layers, but found that these variations did not improve the pitch and timing accuracy significantly. We also performed hyperparameter optimization using a Tree-Parzen estimator [1] to determine the optimal number of layers and number of units in each layer. We found that for most experiments two GRU layers of 128 units each for the encoder and the decoder, and a Dense layer consisting of 80 units as a bottleneck representation produced the best results. The dense bottleneck layer is useful because it forces the neural network to learn a Markovian compressed representation of the input signal, where each generated vector of dense activations is independent of the previous ones.
This restricts the signals produced during the testing phase of the system, such that they are close to the signals that the system learned from during the training phase.

Fig. 2. Our proposed network for mapping the Mel spectra to a dense bottleneck representation, back to Mel spectra, and then to linear frequency spectra.

To transform the Mel spectra generated by the decoding GRU layers back into an audio signal, we combine our model with techniques known from speech synthesis that have been demonstrated to generate high-quality signals [15]. Specifically, instead of using the Griffin-Lim algorithm [7] to transform the Mel spectra into audio signals directly, we use a CBHG network to transform the 80 Mel coefficients into 1000 linear frequency coefficients, which are then transformed into an audio signal using Griffin-Lim. The CBHG network [11] is composed of a Convolutional filter Bank, a Highway layer, and a bidirectional GRU. It acts as a sequence transducer with feature-learning capabilities. This module has been demonstrated to be very efficient within the Tacotron model for speech synthesis [18], in the sense that fewer Mel coefficients, and therefore fewer network parameters, are required to produce high-quality signals [15]. Our loss function is also inspired by recent work on speech synthesis, specifically the Tacotron [18] architecture: we employ a joint loss function that involves an L1 loss on the Mel coefficients plus a modified L1 loss on the linear frequency spectra in which low frequencies are prioritized.

4 Data Generation

To generate the training and testing audio samples, we use a publicly available collection of around 130,000 MIDI files¹. The dataset includes various kinds of musical genres, including pop, rock, rap, electronic music, and classical music. Each MIDI file consists of several tracks that contain sequences of messages that indicate which notes are played, how hard they are played, and on which channel they are played.
Each channel is assigned one or more instruments. A problem with this dataset is that it is only very loosely annotated and very diverse in terms of musical genre, musical complexity, and instrument distribution. We do not expect our proof-of-concept system to be able to cope with the full diversity of the dataset and, therefore, only select those files that meet the following criteria:

1. They contain between 4 and 6 different channels, and each channel must be assigned exactly one instrument.
2. They are from a similar musical genre. For this work, we select classical pop and rock from the 60s and 70s and select only songs from the following artists: The Beatles, The Kinks, The Beach Boys, Simon and Garfunkel, Johnny Cash, The Rolling Stones, Bob Dylan, Tom Petty, and Abba.
3. We eliminate duplicate songs.
4. They contain exactly one channel with the specific instrument to extract. For this work, we consider bass, reed, and guitar as instruments to extract. The bass channel represents a rhythm instrument that is present in most of the songs, yielding large amounts of data. The reed channel is often used for the lead melody, and guitar tracks often contain chords consisting of three or more notes.

As a result, we obtain 78 songs with an extracted guitar channel, 61 songs with an extracted reed channel, and 128 songs with an extracted bass channel. We split the songs such that 80% are used for training and 20% for testing for each instrument. For each file, we extract the channel with the instrument that we want to synthesize, generate a raw audio (.wav) file from that channel, and chunk the resulting file into sliding windows of 11.5 s, with a window step size of 6 s. We then discard those samples which contain a low-amplitude audio signal with an average root-mean-square energy of less than 0.2.

¹ https://redd.it/3ajwe4, accessed 18/01/18.
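The chunking and energy-filtering step above can be sketched as follows; the function name and the mono-waveform input are our assumptions, but the window length (11.5 s), step size (6 s), and RMS threshold (0.2) are taken from the text:

```python
import numpy as np

def chunk_and_filter(signal, sr, win_s=11.5, step_s=6.0, rms_min=0.2):
    """Chunk a mono waveform (sample rate sr) into sliding windows of
    11.5 s with a 6 s step, discarding windows whose root-mean-square
    energy is below 0.2."""
    win, step = int(win_s * sr), int(step_s * sr)
    chunks = [signal[i:i + win] for i in range(0, len(signal) - win + 1, step)]
    return [c for c in chunks if np.sqrt(np.mean(c ** 2)) >= rms_min]
```

For a 30 s recording this yields four windows (starting at 0, 6, 12, and 18 s); a near-silent recording yields none.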
5 Results and Evaluation

To obtain results, we trained the system for 40,000 steps with a batch size of 32 samples and generated a separate model for each instrument. For the training, we used an Adam optimizer [10] with an adaptive learning rate. We evaluate the system empirically by developing appropriate metrics for pitch, timing, and variation, and we also perform a qualitative evaluation in terms of the generalization capabilities of the system. We furthermore present selected samples of the system output and describe qualitatively to what extent the system is able to produce high-level musical patterns.

5.1 Empirical Evaluation

For the empirical evaluation, we use a metric that compares the audio signals of a generated track with the original audio track for each song in the test subset of the dataset. The metric considers three factors: timing accuracy, pitch accuracy, and variation.

Timing Accuracy. For the evaluation of the timing of a generated track, we compute the onsets of each track and compare them with the beat times obtained from the MIDI data. Onset estimation is realized by locating note onset events by picking peaks in an onset strength envelope [13]. The timing error is estimated as the mean time difference between the detected onsets and the nearest 32nd notes. Results are illustrated in Fig. 3 for bass, guitar, and reed track generation. The histograms show that there exists only a small difference in the timing error between the generated and the original tracks, specifically for the generated bass tracks. Hence, we conclude that the neural architecture is very accurate in timing. This coincides with our subjective impression gained from the individual samples depicted in Sect. 5.2. The computed mean error is between 20 ms and 40 ms, which is the same as for the original track. Since the onset estimation sometimes generates wrong onsets (cf. the double onsets in the original track of Ob-La-Di, Ob-La-Da, Sect.
5.2), we hypothesize that the error results from this inaccuracy rather than from inaccurate timing.

Fig. 3. Timing results for bass, guitar and reed track generation. The x-axis denotes the average error in ms and the y-axis the number of samples in a histogram bin.

Pitch Accuracy. We measure the pitch accuracy of the generated audio track by determining the base frequency of consecutive audio frames of 50 ms. Determining the base frequency is realized by quadratically interpolated FFT [17], and we compare it to the closest frequency of the 12 semitones in the chromatic scale over seven octaves. The resulting error is normalized w.r.t. the frequency interval between the two nearest semitones, and averaged over all windows for each audio sample.

Fig. 4. Pitch accuracy results for bass, guitar and reed track generation; the x-axis denotes the average pitch error in fractions of the half interval between the two closest semitone frequencies.

The results (Fig. 4) show that the system is relatively accurate in pitch, with a mean error of 11%, 7%, and 5.5% of the frequency interval between the nearest two semitones for bass, guitar, and reed, respectively. However, in particular for the bass, this is a significantly larger error than the error of the original track. The samples depicted in Sect. 5.2 confirm these results subjectively, as the produced sound is generally much less clean than the MIDI-generated data, and there are several noisy artifacts and chunks that are clearly outside of the chromatic frequency spectrum.

Variation. To measure variation appropriateness, we consider the number of tones and the number of different notes in each sample. However, in contrast to pitch and timing, it is not possible to compute an objective error for the amount of variation in a musical piece.
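The timing and pitch metrics above can be sketched as follows. This is a simplified reconstruction, not the authors' code: onset times and base frequencies are assumed to be given (in the paper they come from an onset-strength peak picker [13] and a quadratically interpolated FFT [17]), and the pitch deviation is measured as a fraction of a semitone in the log-frequency domain, which approximates the paper's normalisation by the interval between the two nearest semitones:

```python
import numpy as np

def timing_error(onset_times, tempo_bpm):
    """Mean absolute distance (in seconds) between detected onsets and the
    nearest 32nd-note grid position at the given tempo."""
    grid = 60.0 / tempo_bpm / 8.0          # duration of a 32nd note
    onsets = np.asarray(onset_times, dtype=float)
    return float(np.mean(np.abs(onsets - np.round(onsets / grid) * grid)))

def pitch_error(f0, a4=440.0):
    """Deviation of a detected base frequency from the nearest equal-tempered
    semitone, as a fraction of the interval between the two nearest semitones
    (0 = exactly on pitch, 0.5 = a quarter tone off)."""
    semitones = 12.0 * np.log2(f0 / a4)    # distance from A4 in semitones
    return abs(semitones - round(semitones))
```

At 120 bpm the 32nd-note grid has 62.5 ms spacing, so an onset at 130 ms is 5 ms off the grid; a frequency exactly on A4 yields a pitch error of 0.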
Hence, we directly compare the variation in the generated samples with the variation in the original samples and assume implicitly that the original target track has a perfect amount of notes and tones. Hence, to compute the variation appropriateness v, we compare the number of original notes (n_orig) and tones (t_orig) with the number of generated notes (n_gen) and tones (t_gen), as described in Eq. 1:

v = v_notes · v_tones, where
v_tones = t_orig / t_gen if t_orig < t_gen, and t_gen / t_orig otherwise;
v_notes = n_orig / n_gen if n_orig < n_gen, and n_gen / n_orig otherwise.   (1)

Results are illustrated in Fig. 5. The histograms show that there are several cases where the system produces the same amount of variation as the original tracks. The average variation value is approximately 0.5 for all instruments. However, we do not consider this value as a strict criterion for the quality of the generated tracks, but rather as an indicator to demonstrate that the system is able to produce tracks that are not too different from the original tracks.

Fig. 5. Variation of generated tracks compared to the original track for three different instruments.

5.2 Qualitative Evaluation

To evaluate the generated audio files qualitatively, we investigate the musical patterns of the generated examples. The patterns that we found range from simple sequences of quarter notes, over salient accentuations and breaks, to common musical patterns like minor and major triads. In the following, we analyze two examples of generated bass lines and, to demonstrate how the approach generalizes over different instruments, also one example of a generated flute melody. We visualize the samples using beat-synchronous chromagrams with indicated onsets (vertical white lines). The upper chromagrams represent the original melodies and the lower chromagrams the generated ones. Audio samples where the original tracks are replaced by the generated ones are linked with the song titles. Op. 74 No.
15 Andantino Grazioso - Mauro Giuliani.² The piece was written for guitar and flute, and we obtained this result by training the network on all files in our dataset that contain these two instruments. The newly generated flute track differs significantly from the original one, although style and timbre are very similar. All notes of the generated track are played in the D major scale, the same as the original track. The beat is also the same, even though the network generates more onsets overall. Near the end of the track, the flute plays a suspended C# which dissolves correctly into the tonic chord D. This shows how the network successfully emulates the harmonic progression of the original.

The Beatles - Ob-La-Di, Ob-La-Da.³ Most generated samples are similar to the illustrated one from The Beatles - Ob-La-Di, Ob-La-Da, where the generated notes are in the same key as the original composition, including the timings of chord changes. In some places, however, alternative note sequences have formed, as can be seen in the first section of the chromagram, where the F-G is replaced by a D-G pattern, and in the middle section of the chromagram, where the D is exchanged with an A for two beats.

Bob Dylan - Positively 4th Street.⁴ In some instances, the generated track contains melodies that are also played by other instruments (e.g. the left hand of the piano often mirrors the bassline). For these cases, we observed that the network has learned to imitate the key tones of other instruments. This results in generated tracks that are nearly identical to the original tracks, as illustrated in the following chromagram of Positively 4th Street. However, while the original bass sequence has been generated by a MIDI synthesizer, the new sample sounds much more bass-like and realistic. This means that our system can effectively be used to synthesize an accurate virtual instrument, which can be exploited as a general mechanism to re-synthesize specific tracks.

² http://www.publications.eppe.eu/data/Giuliani Op74 No15 Andantino grazioso merged
³ http://www.publications.eppe.eu/data/The Beatles Ob-La-Di Ob-La-Da merged.wav
⁴ http://www.publications.eppe.eu/data/Bob Dylan Positively 4th Street merged.wav

6 Conclusion

We have presented a neural architecture for raw audio music generation, and we have evaluated the system in terms of pitch, timing, variation, and pattern generalization. The metrics that we applied are sufficiently appropriate to determine whether our baseline neural network architecture, or future extensions of it, have the potential to synthesize music directly in waveform, instead of using symbolic representations that restrict the possible outcome. We found that this is indeed the case, as the system is very exact in terms of timing, relatively exact in pitch, and because it generates a similar amount of variation as original music. We also conclude that the system applies appropriate standard musical patterns, such as playing common cadences. Examples like Positively 4th Street also show that our system is potentially usable as a synthesizer to enrich and replace MIDI-generated tracks. As future work, we want to investigate to what extent the system implicitly learns high-level musical features and patterns like cadences and triads, and how it uses such patterns to generate appropriate musical audio data.

Acknowledgments. The authors gratefully acknowledge partial support from the German Research Foundation DFG under project CML (TRR 169) and the European Union under project SECURE (No. 642667).

References

1. Bergstra, J., Yamins, D., Cox, D.: Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: International Conference on Machine Learning (ICML) (2013)
2.
Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. In: Neural Information Processing Systems (NIPS) (2014)
3. Engel, J., et al.: Neural audio synthesis of musical notes with WaveNet autoencoders. Technical report (2017). http://arxiv.org/abs/1704.01279
4. Eppe, M., et al.: Computational invention of cadences and chord progressions by conceptual chord-blending. In: Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), pp. 2445–2451 (2015)
5. Eppe, M., Kerzel, M., Strahl, E.: Deep neural object analysis by interactive auditory exploration with a humanoid robot. In: International Conference on Intelligent Robots and Systems (IROS) (2018)
6. Eppe, M., et al.: A computational framework for concept blending. Artif. Intell. 256(3), 105–129 (2018)
7. Griffin, D., Lim, J.: Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 32(2), 236–243 (1984)
8. Huang, A., Wu, R.: Deep learning for music. Technical report (2016). https://arxiv.org/pdf/1606.04930.pdf
9. Kalingeri, V., Grandhe, S.: Music generation using deep learning. Technical report (2016). https://arxiv.org/pdf/1612.04928.pdf
10. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)
11. Lee, J., Cho, K., Hofmann, T.: Fully character-level neural machine translation without explicit segmentation. Trans. Assoc. Comput. Linguist. 5, 365–378 (2017)
12. Liang, F., Gotham, M., Johnson, M., Shotton, J.: Automatic stylistic composition of Bach chorales with deep LSTM. In: Proceedings of the 18th International Society for Music Information Retrieval Conference, pp. 449–456 (2017)
13. McFee, B., et al.: librosa: audio and music signal analysis in Python. In: Python in Science Conference (SciPy) (2015)
14.
Nayebi, A., Vitelli, M.: GRUV: algorithmic music generation using recurrent neural networks. Technical report, Stanford University (2015)
15. van den Oord, A., et al.: WaveNet: a generative model for raw audio. Technical report (2016). http://arxiv.org/abs/1609.03499
16. Simon, I., Oore, S.: Performance RNN: generating music with expressive timing and dynamics (2017). https://magenta.tensorflow.org/performance-rnn
17. Smith, J.O.: Spectral Audio Signal Processing. W3K Publishing (2011)
18. Wang, Y., et al.: Tacotron: towards end-to-end speech synthesis. Technical report, Google, Inc. (2017). http://arxiv.org/abs/1703.10135

Real-Time Hand Prosthesis Biomimetic Movement Based on Electromyography Sensory Signals Treatment and Sensors Fusion

João Olegário de Oliveira de Souza(&), José Vicente Canto dos Santos, Rodrigo Marques de Figueiredo, and Gustavo Pessin
UNISINOS University, São Leopoldo, Brazil
jolegario@unisinos.br

Abstract. The human hand is a very sophisticated and useful instrument, essential for all types of tasks, from delicate, high-precision manipulations to tasks that require great force. Researchers have long studied the biomechanics of the human hand in order to reproduce it in robotic hands, to be used as prostheses to replace lost limbs or in robots. In this study, we present the implementation (electronics design, acquisition, treatment, processing, and control) of different sensors for the control of prostheses. The sensors studied and implemented are: inertial, electromyography (EMG), force, and slip. The tests showed reasonable results, with some sliding and dropping of objects. These sensors will be used in a more complex system that fuses the sensor information through Artificial Neural Networks (ANNs), and new tests will be performed for different scenarios.
Keywords: Sensors fusion · Biomimetic prosthesis · Artificial neural networks (ANN)

1 Introduction

Human beings can perform many activities using their hands. Hence, the loss of a hand limits a person's capability to carry out daily activities. Psychologically, it is also very difficult for anyone to accept the amputation of a limb [1]. About 30% of the roughly 4 million amputees worldwide have lost an upper limb [2]. Hand prostheses are solutions to help people with upper-limb loss. To diminish the psychological damage, aesthetic prostheses emerged to hide the deficiency, but they offer no movement. The development of digital systems integrated into new prosthesis designs has enabled movement functions. Everyday situations, such as handling objects, become possible for amputees when movement prostheses are available. However, nowadays, even a prosthesis with simple and limited movements is very expensive. Thus, research into new technologies for biomimetic prostheses becomes necessary.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 147–156, 2018. https://doi.org/10.1007/978-3-030-01424-7_15

The interface between the patient's muscles and the acquisition system is a critical part of this project. Measuring the muscle signals and extracting useful information from them is very challenging. The system for prosthesis control is based on sensor fusion by a trained artificial neural network, which will allow new movement features to be implemented at any time. The artificial neural network will be trained with the data measured from the muscle (EMG sensor) and from the accelerometer, force, and slip sensors connected to the prosthesis. The intention of this project is to build a functional hand-prosthesis prototype with a set of possible movements, such as pinching, catching, and holding objects, and making social gestures like pointing and waving.
First, the tests will be run on non-amputated people; after that, the prototype will be ready to be tested on amputees. This proposal aims at a product with a low production cost, in order to make it available to a larger number of amputees. We already have a first version of the hand prosthesis. We used an open-source design and printed it on a 3D printer. With preliminary hardware and software, the first prototype of the prosthetic hand already performs simple movements. The pressure of the five fingers can be monitored while it holds an object, and it can detect when an object is slipping. These results are presented below.

2 Methodology

This project focuses on two main points in the study of robotic systems: electromyographic signal acquisition and the selective interpretation of those signals for operating robotic devices. Another feature of the project is to apply artificial intelligence by using embedded artificial neural networks for electromyographic pattern recognition. The system will also learn and adapt to changes in the environment, habits, and other behavior modifications. Figure 1 shows the proposed workflow for the three-tier system (sensors, processing, and operation).

Fig. 1. Proposed system architecture

2.1 Electromyography (EMG)

Electromyography (EMG) is a monitoring technique for evaluating the electrical activity produced by the skeletal muscles. The result of these measurements is the potential difference between two or more sensors applied to the patient's skin, as a function of time. Much relevant information is contained in the time course of EMG signals, such as the total time of muscle activation, the intensity of the movement, and the variation in behavior across repetitions of a movement.

2.2 EMG Acquisition

The EMG signal is a continuous-time signal obtained through a sensor applied to the patient near the muscles whose activity we want to measure.
There are two kinds of sensors: surface and intramuscular. In this project we chose to use surface sensors, for a practical and non-invasive process. According to SENIAM (Surface EMG for the Non-Invasive Assessment of Muscles) [3], Ag-AgCl surface sensors should be used with a conductive gel for a measurement that is stable over time and avoids undesirable noise. The EMG signal is obtained from the potential difference between two (or more) surface sensors, and acquisition systems can be divided into mono-polar and multi-polar ones [4]. The mono-polar configuration requires a reference sensor, typically placed very far from the sensor we want to measure, in order to acquire simple signals. In the multi-polar configuration the potential difference is acquired through three or more points: a reference point and two or more signal points relative to the reference, and the potential difference is obtained by subtracting these signals. In this project a bipolar sensor was used for better acquisition of information from different muscle movements. Figure 2 shows the fixation of the electrodes on the arm for the tests in this work.

Fig. 2. Fixation of the electrodes on the arm

2.3 Signal Treatment

Typically, EMG signals occupy low frequencies, from 70 Hz to 500 Hz [5]. However, it is not easy to extract useful information from the signal without an electronic circuit that separates the real data of the muscle movement from noise. Noise is any signal other than the EMG information, such as cardiac beats, neighboring muscles, or electric coupling between wires and sensors. An electronic circuit was developed with different kinds of active frequency filters in order to capture the real EMG data. After good filtering, the signal is converted to discrete time. With the data correctly sampled, it is possible to evaluate the behavior of the signal in the time domain or even in the frequency domain.
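As a digital counterpart to the analog filtering stage described above, the 70-500 Hz band of interest can be isolated with a band-pass filter; this is our own illustrative sketch (the paper uses analog active filters, and the filter order and sampling rate here are assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def emg_bandpass(signal, fs, low=70.0, high=500.0, order=4):
    """Zero-phase Butterworth band-pass keeping the 70-500 Hz EMG band,
    suppressing low-frequency artefacts (e.g. cardiac beats, motion) and
    high-frequency noise. fs must be at least 2 * high."""
    nyq = fs / 2.0
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, signal)  # forward-backward filtering: no phase lag
```

For example, filtering a mixture of a 10 Hz artefact and a 200 Hz component sampled at 2 kHz leaves the 200 Hz component essentially untouched while removing the 10 Hz one.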
Many tools, such as digital filters or Fourier analysis, are useful to extract important information and perform good data mining. After the choice of the analog-to-digital converter device, we have almost all the data captured from muscle movements ready to be stored in a database. As will be explained below, this database is the information used, in the first step, to train an artificial neural network, which is the main processing core of this project. After good training, this neural network will be embedded into the main core, which will control the actuators that move the prosthetic hand. An electronic circuit was developed to sample the analog signal in real time and supply the main core, already programmed, to move the prosthetic hand.

2.4 The Prosthetic Hand

The prototype of the hand prosthesis (Fig. 3) was based on the InMoov [6] open-source design for full-size 3D printing. The material used for the construction of its mechanical structure was ABS, and the hand has 16 degrees of freedom. Each finger moves with a plastic-coated steel cable and a servomotor (TowerPro MG995). The return of each finger to the rest position is done with a rubber band attached to it. The five servomotors are located in the forearm of the prosthesis, and the electric drive was realized by an ATmega2560 microcontroller embedded on an Arduino. This first prototype has many issues, like gaps and imprecise movements, and a better one should be built.

Fig. 3. Test holding the plastic cup with minimal force

At this stage of the work, the pressure of the five fingers of the prosthetic hand is monitored when it holds an object, and the hand can detect when an object is slipping. For this monitoring, force and slip sensors are used, and each finger has its own servomotor. The myoelectric signal from a human arm is used to open and close the fingers of the prosthetic hand.
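The grip-monitoring behaviour described above (hold with minimal force, add force when slip is detected) can be sketched as a simple per-finger control step; this is our own illustration of the idea, not the authors' firmware, and all names and gain values are assumptions:

```python
def grip_step(force_reading, slip_detected, target_force, gain=0.1, slip_boost=0.2):
    """One control step for a finger: close towards a target grip force,
    and add extra closing force whenever the slip sensor fires."""
    command = gain * (target_force - force_reading)  # proportional force control
    if slip_detected:
        command += slip_boost                        # react to a slipping object
    return command
```

A real controller would run this loop per finger at the sampling rate of the force and slip sensors and translate the command into a servomotor position increment.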
For the signal capture, non-invasive surface electrodes were used. To measure the applied force, the FSR400 force sensor from Interlink Electronics was used (Fig. 4). The slip sensor (Fig. 5) used was the LDT0-028K from Measurement Specialties [7]. The slip sensor is responsible for detecting whether an object is slipping from the fingers.

Fig. 4. Force sensor FSR400 and fixation

Fig. 5. Slip sensor LDT0-028K

2.5 Database and Artificial Neural Networks

Artificial neural networks (ANNs) are often used to determine these relationships because artificial neurons can learn nonlinear behaviors [8–10]. The arrangement of artificial neurons increases the ANN's learning capacity [11, 12]. ANNs are regularly used for correlating stochastic variables in other fields, such as weather forecasting [13], load forecasting in electric systems [14], satellite image classification [15], and others. ANNs are also used in industrial applications and academic studies [14, 16]. Massively parallel distributed ANNs can process generic behaviors and mimic patterns [17]. They are based on processing units called artificial neurons [18]. Different types of neural networks exist in the literature, such as feedforward neural networks. Feedforward neural networks consist of input, hidden, and output layers (Fig. 6). MLP and RBF networks are the two most commonly used types of feedforward neural networks. They differ in the way the hidden layer performs its computation. The MLP uses inner products, and training is done through backpropagation. In an RBF network, each neuron in the hidden layer computes the Euclidean distance between an input vector and a point in the neuron, which can be viewed as a centre vector [19]. After the first tests of the prosthetic hand, different Multilayer Perceptron (MLP) and Radial Basis Function (RBF) neural network techniques will be used and compared to control the movements of the prosthesis.
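The difference between the two hidden-layer computations can be sketched as follows (a generic illustration of the MLP inner product versus the RBF distance-to-centre response, not the model trained in this work):

```python
import numpy as np

def mlp_hidden(x, weights, bias):
    """MLP hidden layer: inner product of the input with the weight matrix,
    followed by a nonlinearity (tanh here)."""
    return np.tanh(x @ weights + bias)

def rbf_hidden(x, centres, gamma=1.0):
    """RBF hidden layer: each unit responds to the squared Euclidean distance
    between the input vector and its centre vector, via a Gaussian."""
    d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)
```

An RBF unit outputs 1 exactly at its centre and decays with distance, whereas an MLP unit responds to a projection of the whole input space; this is why the two networks partition the feature space differently.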
Fig. 6. A feedforward neural network

2.6 Sensors Fusion

Using two different sensors to acquire information, instead of one, brings better and more accurate results; sensor fusion algorithms are based on this idea [20]. Although the principle behind the idea is simple, a functional algorithm is needed to implement it; an algorithm to fuse the information from two or more sensors can be based on optimal estimation [21]. The combination of the sensor information and the subsequent state estimation can be done in a coherent way, so that the uncertainty is reduced. The Kalman filter is a state-estimator algorithm widely used to optimally estimate the unknown state of a linear dynamic system from noise-corrupted measurements [22]. In this work, after the first tests, an Artificial Neural Network will be used as the algorithm to fuse the information from the EMG, slip, force, and accelerometer sensors.

3 Preliminary Results

For the system kick-start and preliminary tests, the InMoov [6] open-source project was used. The prosthetic hand was prototyped using a 3D printer at the University. In this first version, an Arduino was used to acquire the input signals (myoelectric and other sensors) and to operate the servomotors that control the fingers of the prosthetic hand. At the current status of the project, the neural network has not yet been implemented. To analyze the results, a video of each test was recorded and then checked frame by frame with a video-analysis software called Tracker. The objects used in the tests were: a plastic cup, a tennis ball, and a whiteboard eraser. In the first test with the cup (Fig. 7), the prosthetic hand applied only minimal force to hold the object. It did not monitor slipping (using only the EMG and force sensors), applying the minimum force to the object. In Fig. 9(a) we can verify that the cup moved 44.8 mm down, relative to its initial position.
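The optimal-estimation idea behind the Kalman filter mentioned in Sect. 2.6 can be sketched in one dimension; this is a textbook sketch for a (nearly) constant state, not part of the system built here, and the noise parameters are assumptions:

```python
def kalman_1d(measurements, meas_var, process_var=1e-4, x0=0.0, p0=1.0):
    """Minimal 1-D Kalman filter: fuses a stream of noisy scalar measurements
    of a (nearly) constant state into one estimate whose uncertainty shrinks
    as evidence accumulates."""
    x, p = x0, p0                      # state estimate and its variance
    for z in measurements:
        p += process_var               # predict: uncertainty grows slightly
        k = p / (p + meas_var)         # Kalman gain: how much to trust z
        x += k * (z - x)               # update estimate towards measurement
        p *= (1.0 - k)                 # update: uncertainty shrinks
    return x
```

Feeding it repeated measurements of the same value drives the estimate towards that value; with several sensors, each measurement update would use that sensor's own noise variance.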
Fig. 7. Holding and slipping test of the plastic cup

Then, in the second test (Fig. 8), the prosthetic hand monitored slipping and applied more force when it detected that the object started to slip.

Fig. 8. Holding and slipping test of the plastic cup

The prosthetic hand was tested ten times. It is possible to observe in Fig. 9(b) that the object slid less than in the previous test.

Fig. 9. Results with the plastic cup - (a) without slip sensor and (b) all sensors.

The third test was with the tennis ball (Fig. 10). The system monitored slipping and applied more force when it detected that the object started to slip. The system was tested ten times, and on average the ball slid 2.5 mm down. Figure 11 shows one of the completed tests.

Fig. 10. Holding and slipping test of the tennis ball

Fig. 11. Result with the tennis ball

As a final test, a whiteboard eraser was used (Fig. 12). The prosthetic hand was tested ten times with this object, and the eraser did not fall. The lowest slip value was 1.63 mm, the highest was 18.75 mm, and the mean slip was 5.63 mm. Figure 13 shows the result of one of the tests.

Fig. 12. Holding and slipping test of the whiteboard eraser

Fig. 13. Result with the whiteboard eraser

4 Conclusion

This study focused on performing a set of experiments in which a prosthetic hand grasps different objects using EMG, force, and slip sensors. This work supports the thesis that simulation alone is not sufficient to evaluate the characteristics of real systems, since some behaviors cannot be predicted. Future work can apply sensor fusion techniques to improve the prosthetic hand. With more accurate sensing techniques, different tests with various objects can be used for a better performance comparison of the systems.

References

1.
Pillet, J., Didierjean-Pillet, A.: Aesthetic hand prosthesis: gadget or therapy? Presentation of a new classification. J. Hand Surg. 26(6), 523–528 (2001)
2. Toledo, C., Leija, L., Munoz, R., Vera, A., Ramirez, A.: Upper limb prostheses for amputations above elbow: a review. In: Health Care Exchanges, PAHCE 2009, Pan American, pp. 104–108. IEEE (2009)
3. Hermens, H.J., Freriks, B., Disselhorst-Klug, C., Rau, G.: Development of recommendations for SEMG sensors and sensor placement procedures. J. Electromyogr. Kinesiol. 10(5), 361–374 (2000)
4. Duchêne, J., Goubel, F.: Surface electromyogram during voluntary contraction: processing tools and relation to physiological. Crit. Rev. Biomed. Eng. 21(4), 313–397 (1993)
5. Delsys Homepage. Neuromuscular Research Center, Boston University. http://www.delsys.com/library/papers. Accessed 31 Mar 2018
6. Langevin, G.: InMoov | Open-Source 3D Printed Life-Size Robot (2015). http://inmoov.fr/project/. Accessed 15 Sept 2017
7. LDT with Crimps Vibration Sensor/Switch. Measurement Specialties (2015). https://www.variohm.com/images/datasheets/ENG_DS_LDT_with_Crimps_A.pdf. Accessed 25 Sept 2017
8. Philip Chen, C.L., Liu, Y.J., Wen, G.X.: Fuzzy neural network-based adaptive control for a class of uncertain nonlinear stochastic systems. IEEE Trans. Cybern. 44(5), 583–593 (2014)
9. Li, K., Huang, Z., Cheng, Y.C., Lee, C.H.: A maximal figure-of-merit learning approach to maximizing mean average precision with deep neural network based classifiers. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4503–4507 (2014)
10. Tong, S., Wang, T., Li, Y., Zhang, H.: Adaptive neural network output feedback control for stochastic nonlinear systems with unknown dead-zone and unmodeled dynamics. IEEE Trans. Cybern. 44(6), 910–921 (2014)
11.
Yu, Z., Li, S.: Neural-network-based output-feedback adaptive dynamic surface control for a class of stochastic nonlinear time-delay systems with unknown control directions. Neurocomputing 129, 540–547 (2014) 12. Zeng, X., Hui, Q., Haddad, W.M., Hayakawa, T., Bailey, J.M.: Synchronization of biological neural network systems with stochastic perturbations and time delays. J. Franklin Inst. 351(3), 1205–1225 (2014) 13. Culclasure, A.: Using neural networks to provide local weather forecasts. Electronic Theses & Dissertations, Jack N. Averitt College of Graduate Studies (COGS) (2013) 14. Hayati, M., Shirvany, Y.: Artificial neural network approach for short term load forecasting for Illam region. World Acad. Sci. Eng. Technol. 28, 280–284 (2007) 15. Piscini, A., et al.: A neural network approach for the simultaneous retrieval of volcanic ash parameters and SO2 using modis data. Atmos. Meas. Tech. 7(12), 4023 (2014) 16. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Master’s thesis. Department of Computer Science University of Toronto (2009) 17. Shamir, R.R., et al.: A Method for Predicting the Outcomes of Combined Pharmacologic and Deep Brain Stimulation Therapy for Parkinson’s Disease. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8674, pp. 188–195. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10470-6_24 18. Haykin, S.: Neural Networks and Learning Machines, vol. 3. Pearson Education, Upper Saddle River (2009) 19. Sereno, F., Marques de Sá, J.P., Matos, A., Bernardes, J.: A comparative study of MLP and RBF neural nets in the estimation of the fetal weight and length. In: Campilho, A., Mendonça, A. (eds.) Proceedings of RECPAD 2000 - 11th Portuguese Conference on Pattern Recognition, University of Porto (2000) 20. Waltz, E., Llinas, J.: Multisensor Data Fusion, vol. 685. Artech House, Norwood (1990) 21. 
Surachai, P., Afzulpurkar, N.: Sensor Fusion Techniques in Navigation Application for Mobile Robot. INTECH Open Access Publisher (2011) 22. Manyika, J., Durrant-Whyte, H.: Data Fusion and Sensor Management: A Decentralized Information - Theoretic Approach. Ellis Horwood, London (1994) An Exploration of Dropout with RNNs for Natural Language Inference Amit Gajbhiye1(B), Sardar Jaf1, Noura Al Moubayed1, A. Stephen McGough2, and Steven Bradley1 1 Department of Computer Science, Durham University, Durham, UK {amit.gajbhiye,sardar.jaf,noura.al-moubayed,s.p.bradley}@durham.ac.uk 2 School of Computing, Newcastle University, Newcastle upon Tyne, UK stephen.mcgough@ncl.ac.uk Abstract. Dropout is a crucial regularization technique for the Recur- rent Neural Network (RNN) models of Natural Language Inference (NLI). However, dropout has not been evaluated for the effectiveness at different layers and dropout rates in NLI models. In this paper, we propose a novel RNN model for NLI and empirically evaluate the effect of applying dropout at different layers in the model. We also investigate the impact of varying dropout rates at these layers. Our empirical eval- uation on a large (Stanford Natural Language Inference (SNLI)) and a small (SciTail) dataset suggest that dropout at each feed-forward con- nection severely affects the model accuracy at increasing dropout rates. We also show that regularizing the embedding layer is efficient for SNLI whereas regularizing the recurrent layer improves the accuracy for Sci- Tail. Our model achieved an accuracy 86.14% on the SNLI dataset and 77.05% on SciTail. Keywords: Neural networks · Dropout · Natural Language Inference 1 Introduction Natural Language Understanding (NLU) is the process to enable computers to understand the semantics of natural language text. The inherent complexities and ambiguities in natural language text make NLU challenging for computers. Natural Language Inference (NLI) is a fundamental step towards NLU [14]. 
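To make the NLI task concrete, an instance pairs a premise with a hypothesis and one of three labels. The examples below are invented for illustration, not actual SNLI records:

```python
# Illustrative SNLI-style premise-hypothesis pairs (invented examples,
# not actual records from the dataset); each pair carries one of the
# three NLI class labels.
LABELS = {"entailment", "contradiction", "neutral"}

examples = [
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "A musician is performing.",
     "label": "entailment"},
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The stage is empty.",
     "label": "contradiction"},
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The man is a famous rock star.",
     "label": "neutral"},
]

assert all(e["label"] in LABELS for e in examples)
```
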
NLI involves logically inferring a hypothesis sentence from a given premise sentence. The recent release of a large public dataset, the Stanford Natural Language Inference (SNLI) corpus [2], has made it feasible to train complex neural network models for NLI. Recurrent Neural Networks (RNNs), particularly bidirectional LSTMs (BiLSTMs), have shown state-of-the-art results on the SNLI dataset [9]. However, RNNs are susceptible to overfitting, the case in which a neural network learns the exact patterns present in the training data but fails to generalize to unseen data [21]. In NLI models, regularization techniques such as early stopping [4], L2 regularization and dropout [20] are used to prevent overfitting.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 157–167, 2018. https://doi.org/10.1007/978-3-030-01424-7_16

158 A. Gajbhiye et al.

For RNNs, dropout is an effective regularization technique [21]. The idea of dropout is to randomly omit computing units in a neural network during training but to keep all of them for testing. Dropout consists of element-wise multiplication of the neural network layer activations with a zero-one mask (r_j) during training. Each element of the zero-one mask is drawn independently from r_j ∼ Bernoulli(p), where p is the probability with which a unit is retained in the network. During testing, the activations of the layer are multiplied by p [19].

Dropout is a crucial regularization technique for NLI [9,20]. However, the location of dropout varies considerably between NLI models and is typically chosen by trial-and-error experiments with different locations in the network. To the best of our knowledge, no prior work has evaluated the effectiveness of dropout locations and rates in RNN NLI models. In this paper, we study the effect of applying dropout at different locations in an RNN model for NLI. We also investigate the effect of varying the dropout rate.
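The classic train/test behavior described above can be sketched in a few lines of plain Python; the layer values and the retention probability are illustrative:

```python
import random

def dropout_train(activations, p, rng):
    """Classic dropout at training time: keep each unit with probability p,
    i.e. multiply by a zero-one mask r_j ~ Bernoulli(p)."""
    mask = [1.0 if rng.random() < p else 0.0 for _ in activations]
    return [a * m for a, m in zip(activations, mask)]

def dropout_test(activations, p):
    """At test time all units are kept and the activations are scaled by p."""
    return [a * p for a in activations]

rng = random.Random(0)
layer = [0.5, -1.2, 0.8, 2.0]
print(dropout_train(layer, p=0.5, rng=rng))  # some units randomly zeroed
print(dropout_test(layer, p=0.5))            # -> [0.25, -0.6, 0.4, 1.0]
```

Modern frameworks usually use the equivalent "inverted" variant (scale by 1/p during training instead), but the formulation above matches the description of [19] used in this paper.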
Our results suggest that applying dropout at every feed-forward connection, especially at higher dropout rates, degrades the performance of the RNN. Our best model achieves an accuracy of 86.14% on the SNLI dataset and an accuracy of 77.05% on the SciTail dataset. To the best of our knowledge, this research is the first exploratory analysis of dropout for NLI. The main contributions of this paper are as follows: (1) an RNN model based on BiLSTMs for NLI; (2) a comparative analysis of different locations and dropout rates in the proposed RNN NLI model; (3) recommendations for the usage of dropout in RNN models for the NLI task.

The layout of the paper is as follows. In Sect. 2, we describe the related work. In Sect. 3, we discuss the proposed RNN-based NLI model. Experiments and results are presented in Sect. 4. Recommendations for the application of dropout are presented in Sect. 5. We conclude in Sect. 6.

2 Related Work

RNN NLI models follow a general architecture. It consists of: (1) an embedding layer that takes as input the word embeddings of the premise and hypothesis; (2) a sentence encoding layer, generally an RNN, that generates representations of the input; (3) an aggregation layer that combines the representations; and (4) a classifier layer that classifies the relationship (entailment, contradiction or neutral) between premise and hypothesis.

Different NLI models apply dropout at different layers of this general architecture. The NLI models proposed by Ghaeini et al. [9] and Tay et al. [20] apply dropout to each feed-forward layer in the network, whereas others apply dropout only to the final classifier layer [13]. Bowman et al. [2] apply dropout only to the input and output of the sentence encoding layers. The models proposed by Bowman et al. [3] and Choi et al. [7] apply dropout to the output of the embedding layer and to the input and output of the classifier layer. Chen et al. [4] and Cheng et al.
[6] use dropout but do not elaborate on its location. Dropout rates are also crucial for NLI models [15]. Even models that apply dropout at the same locations vary their dropout rates.

An Exploration of Dropout with RNNs for Natural Language Inference 159

Previous research on dropout for RNNs, in applications such as neural language models [16], handwriting recognition [18] and machine translation [21], has established that dropout should not be applied to the recurrent connections of RNNs, as it disrupts the long-term dependencies in sequential data. Bluche et al. [1] studied dropout at different places relative to the LSTM units in the network proposed in [18] for handwriting recognition. Their results show that a significant performance difference is observed when dropout is applied at distinct places. They concluded that applying dropout only after recurrent layers (as applied by Pham et al. [18]) or between every feed-forward layer (as done by Zaremba et al. [21]) does not always yield good results. Cheng et al. [5] investigated the effect of applying dropout in LSTMs. They randomly switch off the outputs of various gates of the LSTM, achieving an optimal word error rate when dropout is applied to the output, forget and input gates.

Evaluations in previous research were conducted on datasets with fewer samples. We evaluate our RNN model on the large SNLI dataset (570,000 data samples) as well as on the smaller SciTail dataset (27,000 data samples). Furthermore, previous studies concentrate only on the location of dropout in the network with a fixed dropout rate; we further investigate the effect of varying dropout rates. We focus on the application of the widely used conventional dropout [19] to the non-recurrent connections of RNNs.

Fig. 1. The Recurrent Neural Network model with possible dropout locations

3 Recurrent Neural Network Model for NLI Task

The RNN NLI model that we have developed follows the general architecture of NLI models and is depicted in Fig.
1. The model combines the intra-attention model [13] with a soft-attention mechanism [11]. The embedding layer takes as input the word embeddings of a sentence of length L. The recurrent layer, with BiLSTM units, encodes the sentence. Next, the intra-attention layer generates the attention-weighted sentence representation following Eqs. (1)-(3):

M = tanh(W^y Y + W^h R_avg ⊗ e_L)   (1)
α = softmax(w^T M)   (2)
R = Y α^T   (3)

where W^y and W^h are trained projection matrices, w^T is the transpose of the trained parameter vector w, Y is the matrix of hidden output vectors of the BiLSTM layer, R_avg is obtained by average pooling of Y, e_L ∈ R^L is a vector of 1s, α is a vector of attention weights, and R is the attention-weighted sequence representation. The attention-weighted sequence representations generated for the premise and the hypothesis are denoted R_p and R_h. The attention-weighted representation gives more importance to the words that are important to the semantics of the sequence and also captures its global context.

The interaction between R_p and R_h is performed by the inter-attention layer, following Eqs. (4)-(6):

I_v = R_p^T R_h   (4)
R̃_p = softmax(I_v) R_h   (5)
R̃_h = softmax(I_v) R_p   (6)

where I_v is the interaction vector. R̃_p contains the words that are relevant given the content of sequence R_h. Similarly, R̃_h contains the words that are important with respect to the content of sequence R_p. The final sequence encoding is obtained from the element-wise multiplication of the intra-attention-weighted and inter-attention-weighted representations as follows:

F_p = R̃_p ⊙ R_p   (7)
F_h = R̃_h ⊙ R_h   (8)

To classify the relationship between premise and hypothesis, a relation vector is formed from the encodings of the premise and hypothesis generated in Eqs.
(7) and (8), as follows:

v_p,avg = averagepooling(F_p),  v_p,max = maxpooling(F_p)   (9)
v_h,avg = averagepooling(F_h),  v_h,max = maxpooling(F_h)   (10)
F_relation = [v_p,avg; v_p,max; v_h,avg; v_h,max]   (11)

where v is a vector of length L. The relation vector F_relation is fed to the MLP layer. A three-way softmax layer outputs the probability of each NLI class.

4 Experiments and Results

4.1 Experimental Setup

The standard train, validation and test splits of SNLI [2] and SciTail [10] are used in the empirical evaluations. The validation set is used for hyper-parameter tuning. The non-regularized model is our baseline. The parameters of the baseline model are selected separately for the SNLI and SciTail datasets by a grid search over the combinations of L2 regularization [1e-4, 1e-5, 1e-6], batch size [32, 64, 256, 512] and learning rate [0.001, 0.0003, 0.0004]. The Adam optimizer [12], with the first momentum set to 0.9 and the second to 0.999, is used. The word embeddings are initialized with pre-trained 300-D GloVe 840B vectors [17]. Extensive experiments with dropout locations and numbers of hidden units were conducted; however, we show only the best results for reasons of brevity and space.

4.2 Dropout at Different Layers for the NLI Model

Table 1 presents the models, i.e. the different combinations of layers to whose outputs dropout is applied in the model depicted in Fig. 1. Table 2 shows the results for the models in Table 1. Each model is evaluated with dropout rates ranging from 0.1 to 0.5 with a granularity of 0.1.

Table 1. Models with the corresponding layers to the outputs of which dropout is applied.
Model     Layers with dropout
Model 1   No Dropout (Baseline)
Model 2   Embedding
Model 3   Recurrent
Model 4   Embedding and Recurrent
Model 5   Recurrent and Intra-Attention
Model 6   Inter-Attention and MLP
Model 7   Recurrent, Inter-Attention and MLP
Model 8   Embedding, Inter-Attention and MLP
Model 9   Embedding, Recurrent, Inter-Attention and MLP
Model 10  Recurrent, Intra-Attention, Inter-Attention and MLP
Model 11  Embedding, Intra-Attention, Inter-Attention and MLP
Model 12  Embedding, Recurrent, Intra-Attention, Inter-Attention and MLP
Model 13  Embedding, Recurrent, Inter-Attention and MLP

Dropout at Individual Layers. We first apply dropout at each layer, including the embedding layer. Although the embedding layer is the largest layer, it is often not regularized in many language applications [8]. However, we observe the benefit of regularizing it. For SNLI, the highest accuracy is achieved when the embedding layer is regularized (Model 2, DR 0.4). For SciTail, the highest accuracy is obtained when the recurrent layer is regularized (Model 3, DR 0.1). The noise injected by dropout at the lower layers prevents the higher fully connected layers from overfitting. We further experimented with regularizing the higher fully connected layers (Intra-Attention, Inter-Attention and MLP) individually; however, no significant performance gains were observed.

162 A. Gajbhiye et al.

Table 2. Model accuracy with varying dropout rates for the SNLI and SciTail datasets. Bold numbers show the highest accuracy for the model within the dropout range.
Model     Dataset  DR 0.1  DR 0.2  DR 0.3  DR 0.4  DR 0.5
Model 1   SNLI     84.45 (no dropout)
          SciTail  74.18 (no dropout)
Model 2   SNLI     84.56   84.59   84.42   86.14   84.85
          SciTail  75.45   75.12   74.22   73.10   74.08
Model 3   SNLI     84.12   84.21   83.76   81.04   79.63
          SciTail  76.15   75.78   73.50   73.19   75.26
Model 4   SNLI     83.83   85.22   84.34   80.82   79.92
          SciTail  74.65   76.08   74.22   74.46   73.19
Model 5   SNLI     84.72   83.43   72.89   70.49   62.13
          SciTail  75.87   75.13   75.26   73.71   72.25
Model 6   SNLI     84.17   84.32   83.71   82.79   81.68
          SciTail  73.85   75.68   75.26   73.95   73.28
Model 7   SNLI     84.33   82.97   82.00   81.15   79.25
          SciTail  73.75   75.02   74.37   73.37   73.42
Model 8   SNLI     84.67   85.82   84.60   84.14   83.94
          SciTail  73.80   73.52   69.29   75.82   73.89
Model 9   SNLI     84.44   83.05   82.09   81.64   79.62
          SciTail  75.68   76.11   75.96   70.84   74.55
Model 10  SNLI     84.45   80.95   75.31   70.81   69.34
          SciTail  73.30   75.21   74.98   74.65   71.59
Model 11  SNLI     84.31   82.43   78.94   74.93   70.54
          SciTail  75.63   73.47   74.93   74.93   70.32
Model 12  SNLI     84.32   82.60   73.36   71.53   66.67
          SciTail  73.47   75.63   74.74   73.42   74.40

Dropout at Multiple Layers. We next explore the effect of applying dropout at multiple layers. For SNLI and SciTail, the models achieve higher performance when dropout is applied to the embedding and recurrent layers (Model 4, DR 0.2). This supports the importance of regularizing the embedding and recurrent layers, as shown for the individual layers. It is interesting to note that regularizing the recurrent layer helps SciTail (Model 7, DR 0.2), whereas regularizing the embedding layer helps SNLI (Model 8, DR 0.2). A possible explanation is that on the smaller SciTail dataset the model cannot afford to lose information in the input, whereas on the larger SNLI dataset the model has a chance to learn even with a loss of information in the input.
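Reading off the best configuration per model and dataset from Table 2 is simply an argmax over dropout rates; a minimal sketch using a small subset of the reported accuracies:

```python
# Subset of Table 2: accuracy (%) by (model, dataset) across dropout rates.
results = {
    ("Model 2", "SNLI"):    {0.1: 84.56, 0.2: 84.59, 0.3: 84.42, 0.4: 86.14, 0.5: 84.85},
    ("Model 3", "SciTail"): {0.1: 76.15, 0.2: 75.78, 0.3: 73.50, 0.4: 73.19, 0.5: 75.26},
}

def best_rate(by_rate):
    """Return the (dropout_rate, accuracy) pair with the highest accuracy,
    i.e. the bold entry of a row in Table 2."""
    rate = max(by_rate, key=by_rate.get)
    return rate, by_rate[rate]

for key, by_rate in results.items():
    print(key, best_rate(by_rate))
# Model 2/SNLI peaks at DR 0.4 (86.14); Model 3/SciTail at DR 0.1 (76.15)
```
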
Also, the results from models 7 and 8 suggest that applying dropout at a single lower layer (embedding or recurrent, depending on the amount of training data) and at the inputs and outputs of the MLP layer improves performance. We can infer from models 9, 10, 11 and 12 that applying dropout at each feed-forward connection helps prevent the model from overfitting on SciTail (DR 0.1 and 0.2). However, for both datasets and all dropout locations, the performance of the models decreases as the dropout rate increases (Sect. 4.4).

Fig. 2. Convergence curves: (a) baseline model for SNLI (Model 1), (b) best model for SNLI (Model 2, DR 0.4), (c) 100-unit model for SciTail (Model 13, DR 0.4), (d) 300-unit model for SciTail (Model 9, DR 0.2).

4.3 The Effectiveness of Dropout for Overfitting

We study the efficacy of dropout against overfitting. The main results are shown in Fig. 2. For SNLI, Fig. 2(a)-(b) shows the convergence curves for the baseline model and the model achieving the highest accuracy (Model 2, DR 0.4). The convergence curves show that dropout is very effective in preventing overfitting. However, for the smaller SciTail dataset, when regularizing multiple layers we observe that the model achieving the highest accuracy (Model 9, DR 0.2) overfits significantly (Fig. 2(d)). This overfitting is due to the large model size. With the limited training data of SciTail, our model with a higher number of hidden units fits the relationship between the premise and the hypothesis on the training data very closely (Fig. 2(d)). However, these relationships are not representative of the validation data, and thus the model does not generalize well. When we reduced the model size (50, 100 and 200 hidden units), we achieved the best accuracy for SciTail with 100 hidden units (Table 3). The convergence curve (Fig. 2(c)) shows that dropout effectively prevents overfitting in the model with 100 hidden units in comparison to 300 units.
Furthermore, for the SciTail dataset, the model with 100 units achieved higher accuracy in almost all experiments when compared to the models with 50, 200 and 300 hidden units.

Table 3. Accuracy of the 100-unit model on the SciTail dataset.

Model     Dataset  DR 0.1  DR 0.2  DR 0.3  DR 0.4  DR 0.5
Model 13  SciTail  76.72   76.25   77.05   72.58   74.22

The results of this experiment suggest that, given the high learning capacity of RNNs, an appropriate choice of model size for the amount of training data is essential. Dropout alone may be insufficient to prevent overfitting in such scenarios.

4.4 Dropout Rate Effect on Accuracy and Dropout Location

We next investigate the effect of varying dropout rates on the accuracy of the models and on the various dropout locations. Figure 3 illustrates the varying dropout rates and the corresponding test accuracy for SNLI. We observe some distinct trends in the plot. First, the dropout rate and location do not affect the accuracy of models 2 and 8 relative to the baseline. Second, in the dropout range [0.2–0.5], the dropout locations affect the accuracy of the models significantly. Increasing the dropout rate from 0.2 to 0.5 decreases the accuracy of models 5 and 12 significantly, by 21.3% and 15.9% respectively. For most of the models (3, 4, 6, 7, 9 and 10) a dropout rate of 0.5 decreases accuracy.

Fig. 3. Plot showing the variation of test accuracy across the dropout range for SNLI.

From the experiments on the SciTail dataset (Fig. 4), we observe that the dropout rate and its location do not have a significant effect on most of the models, with the exception of model 8 (which shows erratic performance). Finally, in almost all experiments a large dropout rate (0.5) decreases the accuracy of the models. A dropout rate of 0.5 works for a wide range of neural networks and tasks [19]. However, our results show that this is not desirable for RNN models of NLI.
Based on our evaluations, a dropout range of [0.2–0.4] is advised.

Fig. 4. Plot showing the variation of test accuracy across the dropout range for SciTail.

5 Recommendations for Dropout Application

Based on our empirical evaluations, the following is recommended for regularizing an RNN model for the NLI task: (1) The embedding layer should be regularized for large datasets such as SNLI. For smaller datasets such as SciTail, regularizing the recurrent layer is an efficient option. The noise injected by dropout at these layers prevents the higher fully connected layers from overfitting. (2) When regularizing multiple layers, regularizing a lower layer (embedding or recurrent, depending on the amount of data) together with the inputs and outputs of the MLP layer should be considered. The performance of our model decreased when dropout was applied at each intermediate feed-forward connection. (3) When dropout is applied at multiple feed-forward connections, it is almost always better to apply it at a lower rate, in the range [0.2–0.4]. (4) Given the high learning capacity of RNNs, an appropriate choice of model size for the amount of training data is essential; dropout alone may be insufficient to prevent overfitting otherwise.

6 Conclusions

In this paper, we reported the outcome of experiments conducted to investigate the effect of applying dropout at different layers in an RNN model for the NLI task. Based on our empirical evaluations, we recommended probable locations of dropout for gaining high performance on the NLI task. Through extensive exploration of the correct dropout locations in our model, we achieved accuracies of 86.14% on the SNLI dataset and 77.05% on SciTail. In future research, we aim to investigate the effect of different dropout rates at distinct layers.

References

1.
Bluche, T., Kermorvant, C., Louradour, J.: Where to apply dropout in recurrent neural networks for handwriting recognition? In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 681–685. IEEE (2015)
2. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642. Association for Computational Linguistics (2015)
3. Bowman, S.R., Gauthier, J., Rastogi, A., Gupta, R., Manning, C.D., Potts, C.: A fast unified model for parsing and sentence understanding. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1466–1477 (2016)
4. Chen, Q., Zhu, X., Ling, Z.H., Inkpen, D.: Natural language inference with external knowledge. arXiv preprint arXiv:1711.04289 (2017)
5. Cheng, G., Peddinti, V., Povey, D., Manohar, V., Khudanpur, S., Yan, Y.: An exploration of dropout with LSTMs. In: Proceedings of Interspeech (2017)
6. Cheng, J., Dong, L., Lapata, M.: Long short-term memory-networks for machine reading. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 551–561 (2016)
7. Choi, J., Yoo, K.M., Lee, S.G.: Learning to compose task-specific tree structures. AAAI (2017)
8. Gal, Y., Ghahramani, Z.: A theoretically grounded application of dropout in recurrent neural networks. In: Advances in Neural Information Processing Systems, pp. 1019–1027 (2016)
9. Ghaeini, R., et al.: DR-BiLSTM: dependent reading bidirectional LSTM for natural language inference. arXiv preprint arXiv:1802.05577 (2018)
10. Khot, T., Sabharwal, A., Clark, P.: SciTail: a textual entailment dataset from science question answering. In: Proceedings of AAAI (2018)
11. Kim, Y., Denton, C., Hoang, L., Rush, A.M.: Structured attention networks.
In: Proceedings of ICLR (2017)
12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
13. Liu, Y., Sun, C., Lin, L., Wang, X.: Learning natural language inference using bidirectional LSTM model and inner-attention. CoRR abs/1605.09090 (2016)
14. MacCartney, B.: Natural language inference. Ph.D. thesis, Stanford University (2009)
15. Munkhdalai, T., Yu, H.: Neural tree indexers for text understanding. In: Proceedings of the Conference of the Association for Computational Linguistics, vol. 1, p. 11. NIH Public Access (2017)
16. Pachitariu, M., Sahani, M.: Regularization and nonlinearities for neural language models: when are they needed? arXiv preprint arXiv:1301.5650 (2013)
17. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
18. Pham, V., Bluche, T., Kermorvant, C., Louradour, J.: Dropout improves recurrent neural networks for handwriting recognition. In: 2014 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 285–290. IEEE (2014)
19. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
20. Tay, Y., Tuan, L.A., Hui, S.C.: A compare-propagate architecture with alignment factorization for natural language inference. arXiv preprint arXiv:1801.00102 (2017)
21. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014)

Neural Model for the Visual Recognition of Animacy and Social Interaction

Mohammad Hovaidi-Ardestani1,2, Nitin Saini1,2, Aleix M. Martinez3, and Martin A.
Giese1(B)
1 Section of Computational Sensomotorics, Department of Cognitive Neurology, CIN and HIH, University Clinic Tübingen, Ottfried-Müller-Str. 25, 72076 Tübingen, Germany
martin.giese@uni-tuebingen.de
2 IMPRS for Cognitive and Systems Neuroscience, Tübingen, Germany
3 Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210, USA

Abstract. Humans reliably attribute social interpretations and agency to highly impoverished stimuli, such as interacting geometrical shapes. While it has been proposed that this capability is based on high-level cognitive processes, such as probabilistic reasoning, we demonstrate that it might also be accounted for by rather simple, physiologically plausible neural mechanisms. Our model is a hierarchical neural network architecture with two pathways that analyze form and motion features. The highest hierarchy level contains neurons that have learned combinations of relative position, motion, and body-axis features. The model reproduces psychophysical results on the dependence of perceived animacy on motion smoothness and the orientation of the body axis. In addition, the model correctly classifies six categories of social interactions that have been frequently tested in the psychophysical literature. For the generation of training data we propose a novel algorithm, derived from dynamic human navigation models, which allows the generation of arbitrary numbers of abstract social interaction stimuli by self-organization.

Keywords: Hierarchy · Neural network model · Animacy · Social interaction perception

1 Introduction

Humans can spontaneously decode animacy and social interactions from strongly impoverished stimuli. A classical study by Heider and Simmel [1] demonstrated that humans very consistently derive interpretations in terms of social interactions from simple geometrical figures that move around in the two-dimensional plane.
The figures were interpreted as living agents, to which even personality traits were attributed. More recent studies have characterized in more detail which critical features of simple stimuli affect the perception of animacy, that is, whether an object is perceived as alive [2–4].

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 168–177, 2018. https://doi.org/10.1007/978-3-030-01424-7_17

Neural Model for the Visual Recognition of Animacy and Social Interaction 169

Furthermore, detailed studies have focused on the perception of social interactions between multiple moving shapes, e.g. focusing on 'chasing' or 'fighting' [5,6]. Six interaction types have been used in a number of studies [7–9]; McAleer and Pollick [9] showed that these categories can be reliably classified from stimuli showing moving circular disks whose movements were derived from real interactions.

Coarse neural substrates of the processing of such stimuli have been identified in fMRI studies. Animacy has been studied by modulating the movement parameters of individual moving shapes [10–12], and stimuli similar to those of Heider and Simmel have frequently been used in studies addressing Theory of Mind [13,14]. In fMRI and monkey studies, regions such as the superior temporal sulcus (STS) and human area TPJ were found to be selective for these stimuli [15–18]. In spite of this localization of the relevant cortical areas, the underlying exact neural circuits of this processing remain entirely unclear. Some theories have associated the processing of such abstract stimuli with probabilistic reasoning [19,20], while others have linked it to lower-level visual processing [6]. So far, no ideas exist for how such functions could be accounted for by physiologically plausible neural circuits.
The goal of this paper is to present a simple neural model that reproduces some of the key observations from psychophysical experiments on the perception of animacy and social interactions from simple abstract stimuli. The model in its present form is simple, but in principle extendable to the processing of more complex stimuli that also require the processing of shape details or of shapes in clutter. The model is an extension of classical models of the visual processing stream that account for the processing of object shape and actions [21–24]. However, such models have never been applied to account for the perception of animacy or social interaction. Our attempt to use these types of architectures is motivated by recent work showing that models of this type for the recognition of hand actions also account for the perception of causality from simple stimulus displays consisting of moving disks [25]. This modeling work also predicted the existence of neurons in macaque cortex that are specifically involved in the visual perception of causality [26]. Here we show that a model based on similar principles accounts for the perception of animacy and social interactions.

In the following section, we first describe how we generated a stimulus set for training the neural model, devising a generative model for social interaction stimuli that is based on a dynamical systems approach. We then describe the architecture of the model. The following section describes the results, followed by a brief discussion.

2 Stimulus Synthesis

For the training of neural network models a sufficiently large set of stimuli is required. The problem is that only a rather small set of stimuli from the classical psychophysical studies is publicly available. For a meaningful application of learning-based neural network approaches, a sufficiently large training data set with similar properties thus needs to be generated.

170 M. Hovaidi-Ardestani et al.
In our study we used movies showing individual moving agents and interactions of two agents (chasing, playing, following, flirting, guarding, fighting), as described in psychophysical studies [7–9].

In order to model the interaction of two moving agents, we exploited a dynamical systems approach, which has previously been used very successfully for the modeling of human navigation [27]. The underlying idea, originally derived from robotics [28], is to define a dynamical system of differential equations for the heading directions φ_i and the instantaneous propagation speeds v_i of the interacting agents (in our case i = 1, 2). The specified movement depends on goal and obstacle points in the two-dimensional plane, where the other agent can also act as a goal or an obstacle. We modified a model for human steering behavior during walking [29] to reproduce the movements during social interactions. The resulting dynamics is given by the following differential equation for the heading direction:

φ̈_i = −b φ̇_i − k_g (φ_i − ψ_g,i)(e^(−c_1 d_g,i) + c_2) + Σ_{n=1..N_obst} k_o (φ_i − ψ_o,ni) e^(−c_3 |φ_i − ψ_o,ni|) e^(−c_4 d_o,ni)   (1)

The variables ψ_g,i and d_g,i signify the absolute direction of the current goal point and the distance of the goal from the agent in the 2D plane. Likewise, ψ_o,ni and d_o,ni signify the absolute direction and distance of obstacle number n from the agent, where N_obst is the number of relevant obstacles, and the k_m and c_m are constants. The forward speed of the agents is specified by the two stochastic differential equations

τ v̇_i = −v_i + F_i(d_g,i) + k_ε ε_i(t)   (2)

where ε_i(t) is Gaussian white noise. The two functions F_i that specify the distance dependence of the speed dynamics are different for the two agents:

F_1(d) = 1 / (1 + e^(−c_5 (d − c_6))) − c_7 e^(−k d)   (3)

F_2(d) = c_8 / (1 + e^(−c_9 (d − c_10))) − c_11 e^(−k d) + c_12   (4)

The goal point of the second agent was typically the first agent.
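The heading and speed dynamics above can be integrated numerically with a simple Euler scheme. The sketch below drops the obstacle terms, uses made-up constants, and replaces the paper's F_i(d) with a generic sigmoid stand-in; it only illustrates the structure of Eqs. (1)-(2), not the parameters of Table 1:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate(steps=500, dt=0.01, seed=0):
    """Euler integration of a simplified two-agent steering dynamics:
    damped heading attraction to the goal direction (Eq. (1) without the
    obstacle sum) and noisy speed relaxation (Eq. (2)). Constants are
    illustrative, not the paper's fitted values."""
    rng = random.Random(seed)
    b, kg, c1, c2, tau, k_eps = 3.0, 2.0, 0.1, 0.2, 1.0, 0.1
    # state per agent: position (x, y), heading phi, heading rate dphi, speed v
    agents = [dict(x=0.0, y=0.0, phi=0.0, dphi=0.0, v=0.5),
              dict(x=5.0, y=5.0, phi=math.pi, dphi=0.0, v=0.5)]
    goal1 = (10.0, 10.0)  # fixed goal point for agent 1
    for _ in range(steps):
        # agent 2 treats agent 1's current position as its goal (chasing)
        goals = [goal1, (agents[0]["x"], agents[0]["y"])]
        for a, g in zip(agents, goals):
            dx, dy = g[0] - a["x"], g[1] - a["y"]
            d_g = math.hypot(dx, dy)          # distance to goal
            psi_g = math.atan2(dy, dx)        # absolute goal direction
            # Eq. (1) without obstacles: damped spring toward psi_g
            ddphi = -b * a["dphi"] - kg * (a["phi"] - psi_g) * (math.exp(-c1 * d_g) + c2)
            # Eq. (2): speed relaxes to a distance-dependent target, plus noise
            F = sigmoid(d_g - 1.0)            # stand-in for F_i(d_g)
            dv = (-a["v"] + F + k_eps * rng.gauss(0.0, 1.0)) / tau
            a["dphi"] += dt * ddphi
            a["phi"] += dt * a["dphi"]
            a["v"] += dt * dv
            a["x"] += dt * a["v"] * math.cos(a["phi"])
            a["y"] += dt * a["v"] * math.sin(a["phi"])
    return agents

agents = simulate()
print(agents[0]["x"], agents[0]["y"])  # agent 1 has moved toward its goal
```
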
The goal points for the first agent were given by a sequence of fixed positions, which were randomly generated by uniformly sampling from the 2D plane and rejecting samples that were closer than a fixed distance to the last sample. Since it turned out that the influence of the obstacle terms on the speed dynamics was rather low, we dropped the obstacle terms from the speed control dynamics. Table 1 provides an overview of the model parameters for the six simulated behaviors. We generated 50 stimuli for each interaction class. Figure 1 shows example paths of the agents for the different behaviors in typical simulations.

Table 1. Parameters of the simulation algorithm.

                 Agent 1               Agent 2
                 kε   c5  c6  c7  k    kε   c8   c9  c10  c11  c12  k
Guarding (Gu)    0    1   5   0   0    0    1    1   3    0    0.5  0
Following (FO)   0    10  7   0   0    0    1    4   4    0    0    0
Fighting (FI)    1    1   3   1   0.1  1    1    1   3    1    0    0.1
Chasing (CH)     0    10  7   0   0    0    1    1   7    0    0    0
Flirting (FL)    0    1   5   0   0    1    0.6  1   2    1    0    0.5
Playing (PL)     0    1   5   0   0    1    1    1   10   0    0.5  0

Fig. 1. Sample trajectories for the six different social interactions. Colors indicate the positions of the two agents (agent 1: blue, agent 2: red). Color saturation indicates time, the color fading out after long times. (Color figure online)

3 Model Architecture

An overview of the model architecture is shown in Fig. 2. Building on classical biologically inspired models for shape and action processing [21,22], the model comprises a form and a motion pathway, each consisting of a hierarchy of feature detectors. Presently, these pathways were modeled following these classical papers, which was sufficient for the tested simple stimuli.
Form Pathway: The form pathway of the simple model implementation used here comprises only three hierarchy levels. The first is composed of (even and odd) Gabor filters with 8 different orientations (cf. [22]), whose centers were placed on a grid of 120 by 120 points across the pixel image.

Fig. 2. Model consisting of a form and a motion pathway. ME signifies a layer of motion energy detectors, and RPM the relative position map. The top level of the model is formed by neural detectors for the perceived animacy, and a network that classifies six different types of interactions. (See text for details.)

The outputs of this Gabor filter array are pooled by the next layer using a maximum operation over a grid of 41 by 41 filters, separately for the different orientations, in order to increase the position invariance of the representation. The highest level of the form pathway is formed by Gaussian radial basis functions, which are trained with the shapes of the agents in different 2D orientations. As opposed to many other object recognition architectures, these shape-selective neurons have receptive fields of limited size (about 20% of the width of the image), which is consistent with neural data from area IT [30]. The outputs of this layer thus provide information about the identity of the agents, their positions, and their orientation in the image plane. The signal u_k(φ, x, y) is the output activity of the neural detectors detecting shape k at the 2D position (x, y). Summing this signal over all φ provides a neural activity distribution u_k^p(x, y) whose peak signals the position of agent k in the image. This signal is used to compute the velocity and the relative positions of the moving elements or animate objects.
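The readout just described, marginalizing the detector output over all orientations φ and taking the peak of the resulting position map, can be sketched with toy data; grid sizes and activities are made up for illustration:

```python
def position_map(u, n_orient, n_x, n_y):
    """Sum the shape detectors' output u[phi][x][y] over all orientations
    phi, yielding the position activity distribution u^p(x, y)."""
    return [[sum(u[p][x][y] for p in range(n_orient))
             for y in range(n_y)] for x in range(n_x)]

def peak_position(up):
    """Location of the maximum of the position map (the agent's position)."""
    return max(((x, y) for x in range(len(up)) for y in range(len(up[0]))),
               key=lambda xy: up[xy[0]][xy[1]])

# toy detector output: 4 orientations on a 5x5 grid, activity bump at (3, 1)
u = [[[0.0] * 5 for _ in range(5)] for _ in range(4)]
for p in range(4):
    u[p][3][1] = 0.25
up = position_map(u, 4, 5, 5)
pos = peak_position(up)
```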
Similarly, by summing over the positions one obtains an activity distribution u_k^φ(φ) over the directions, with a peak at φ_k.

Motion Pathway: This pathway analyzes the 2D motion and the relative motion of the moving agents. As input we use the time-dependent signals u_k^p(x, y) for each agent, which feed a field of standard motion energy detectors (ME in Fig. 2), resulting in an output that encodes the motion energy in terms of a four-dimensional neural activity distribution (dropping the index k in the following) u^v(x, y, v_x, v_y, t), where v = (v_x, v_y) is the preferred velocity vector of the motion energy detector. Pooling this output activity distribution over all spatial positions using a maximum operation, a position-invariant neural representation of velocity is obtained. From this, a neural representation of motion direction is obtained by pooling this activity distribution over all neurons with the same (similar) motion direction, resulting in a one-dimensional activity distribution u^θ(θ, t) over the motion direction θ, from which the direction can easily be estimated by computing a population vector (footnote 1). The same applies to the length of the velocity vector v = |v| (footnote 2). In order to also compute the acceleration of the agents, we transmit the position-invariant activity distribution u^v(v_x, v_y, t) as input to another field of motion energy detectors, which computes from this an energy distribution u^a(a_x, a_y, t) over the acceleration vectors a = (a_x, a_y). By pooling over directions, an activity distribution over the length of these vectors a = |a| is computed, and again this parameter can be estimated by a simple population vector. The population estimates of θ, v and a enter the animacy computation (see below).
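The population-vector readouts mentioned above can be sketched as follows; the preferred values and activities are toy data, and the function names are illustrative:

```python
import cmath
import math

def decode_direction(preferred, activities):
    """Population-vector estimate of an encoded angle: the argument of the
    activity-weighted sum of unit vectors at the preferred directions."""
    num = sum(u * cmath.exp(1j * th) for th, u in zip(preferred, activities))
    return cmath.phase(num / sum(activities))

def decode_speed(preferred, activities):
    """For a non-circular variable such as speed, the estimate is the
    activity-weighted mean of the preferred values."""
    return sum(v * u for v, u in zip(preferred, activities)) / sum(activities)

dirs = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
theta_hat = decode_direction(dirs, [0.2, 1.0, 0.2, 0.0])  # peak at pi/2
v_hat = decode_speed([1.0, 2.0, 3.0], [0.0, 1.0, 1.0])    # mean of 2 and 3
```

The complex-exponential form makes the direction estimate insensitive to the circular wrap-around at ±π, which a plain weighted mean of angles would not be.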
For analyzing the relative motion of the two agents, following [22], the output distributions u_k^p(x, y) of the form pathway are also fed into a gain field network that computes a representation of the position of the second agent in a coordinate frame that is centered on the first. Its output is computed as a convolution-like integral of the form

    u_R^p(x, y) = \int_{x', y'} u_1^p(x', y')\, u_2^p(x + x', y + y')\, dx'\, dy'

This output defines a neural relative position map that represents the position of agent 2 as an activity peak in a coordinate frame that is centered on the first agent. The integral is taken over a finite region of shifts |(x, y)| < D, implying that situations where the agents have a distance substantially larger than D will not produce an output peak. This makes sense, since agents that are too distant do not produce the percept of a social interaction. The activity distribution u_R^p(x, y, t) is again processed by a cascade of two levels of motion energy detectors in order to compute the relative speed and acceleration of the two agents. Population estimates of the relative distance d_R = |p_R|, the relative velocity v_R, and the relative acceleration a_R enter the interaction classifier.

Recognition Level: The highest level of the model consists of a circuit that derives the perceived animacy of the two agents, and another one that classifies the perceived interaction class. The neurons detecting instantaneous animacy (dropping again the index k and time) multiply two inputs derived from the signals of both pathways: B = A_1 A_2. The first signal measures the alignment of the body axis of the moving agent with its direction of motion.

Footnote 1: A simple estimate of the encoded angle is given by θ̂ = arg( (Σ_m exp(iθ_m) u^θ(θ_m, t)) / (Σ_m u^θ(θ_m, t)) ), where the θ_m are the preferred directions of the neurons.
Footnote 2: Here the estimator is v̂ = (Σ_m v_m u^v(v_m, t)) / (Σ_m u^v(v_m, t)), where the v_m are the preferred speeds of the neurons.
It is given by the scalar product of the activity distributions over the body axis of the agent, u^φ(φ), and the motion direction of the agent, u^θ(θ), in the form A_1 = Σ_n u^φ(θ_n) u^θ(θ_n). The second signal A_2 linearly combines information about the speed, and the magnitude changes and angular changes of the speed, which are given by a and the angular component of a. The linear mixing weights of the animacy neurons were estimated by fitting the psychophysical results from [2]. Final animacy responses were computed as time averages over the whole trajectories.

The second circuit at the top level of the model classifies the different interaction types based on the following features: the speeds v_i and accelerations a_i of the agents, and the relative position p_R, relative velocity v_R, and relative acceleration a_R of the agents. These features served as inputs for different classifier models. We tested a multi-layer perceptron, linear and nonlinear discriminant analysis (see also [31]), k-nearest-neighbor classification, and a linear and a nonlinear support vector machine.

4 Results

Results on animacy detection are shown in Fig. 3. The model reproduces at least qualitatively the dependence of animacy ratings on direction and speed changes [2]. In these experiments an agent shape moved along a straight line and then suddenly changed speed or direction by different amounts. In addition, the model reproduces the fact that a moving figure that has a body axis, like a rectangle, results in stronger perceived animacy than a circle, and that the rating is higher if the body axis is aligned with the motion direction than if it is not aligned [2]. Figure 4 shows example results from the application of the different classifiers (6 interaction types).
The classifiers were trained on movies generated with the stimulus generation algorithm described in Sect. 2. The linear SVM classifier achieves 99% correct classifications on this data set; see Table 2 for the results with the other classifiers. Most importantly, the model also achieved 100% correct classifications on the example videos from [9], even though these movies were not used for training.

Table 2. Classification results with the different classifier models for the 6 interaction behaviors in the study [9].

Classifier            Accuracy
Linear SVM            99.0%
Gaussian kernel SVM   96.3%
LDA                   94.7%
KNN                   94.7%
Nonlinear LDA         94.3%
Neural network        94.0%

Fig. 3. Simulation results for animacy perception in comparison with experimental results. (a), (d): Dependence of animacy ratings on the size of direction changes. (b), (e): Dependence of animacy ratings on the size of speed changes. (c), (f): Effect of alignment of the body axis with the motion direction, compared with a moving circle (no body axis).

Fig. 4. Confusion matrices for the best (linear SVM) and the worst (KNN) classifier; TP: true positive rate, FN: false negative rate. 50 videos per class.

5 Conclusion

Our model accounts, by a combination of very elementary neural mechanisms, for a number of classical results on the perception of animacy and social interaction from abstract figures. To our knowledge this is the first neural model that can account for such results. Evidently the model is only a proof of concept with many shortcomings, a major one being that the accuracy of the form and motion pathways that provide input to the animacy and interaction detection has to be improved. Since the model is in principle consistent with deep architectures for form and action recognition that can achieve high performance levels, it seems likely that it can be extended to the processing of much more challenging stimulus material.
Even in its simple form, the model shows that animacy and social interaction judgements might partly be derived by very elementary operations in hierarchical neural vision systems, without the need for sophisticated or accurate probabilistic inference.

Acknowledgments. This work was supported by: HFSP RGP0036/2016; the European Commission HBP FP7-ICT2013-FET-F/604102 and COGIMON H2020-644727; the DFG KA 1258/15-1; and BMBF CRNC FK: 01CQ1704.

References

1. Heider, F., Simmel, M.: An experimental study of apparent behavior. Am. J. Psychol. 57(2), 243-259 (1944)
2. Tremoulet, P.D., Feldman, J.: Perception of animacy from the motion of a single object. Perception 29, 943-951 (2000)
3. Tremoulet, P.D., Feldman, J.: The influence of spatial context and the role of intentionality in the interpretation of animacy from motion. Percept. Psychophys. 68(6), 1047-1058 (2006)
4. Hernik, M., Fearon, P., Csibra, G.: Action anticipation in human infants reveals assumptions about anteroposterior body structure and action. In: Proceedings, Biological Sciences (2014)
5. Scholl, B.J., Tremoulet, P.D.: Perceptual causality and animacy. Trends Cogn. Sci. 4(8), 299-309 (2000)
6. Gao, T., Scholl, B.J.: Perceiving animacy and intentionality. In: Rutherford, M.D., Kuhlmeier, V.A. (eds.) Social Perception. The MIT Press (2013)
7. Blythe, P., Miller, G.F., Todd, P.M.: How motion reveals intention: categorizing social interactions. In: Gigerenzer, G., Todd, P. (eds.) Simple Heuristics That Make Us Smart, pp. 257-285. Oxford University Press, London (1999)
8. Barrett, H.C., Todd, P.M., Miller, G.F., Blythe, P.W.: Accurate judgments of intention from motion cues alone: a cross-cultural study. Evol. Hum. Behav. 26(4), 313-331 (2005)
9. McAleer, P., Pollick, F.E.: Understanding intention from minimal displays of human activity. Behav. Res. Methods 40, 830-839 (2008)
10.
Schultz, J., Friston, K.J., O'Doherty, J., Wolpert, D.M., Frith, C.D.: Activation in posterior superior temporal sulcus parallels parameter inducing the percept of animacy. Neuron 45(4), 625-635 (2005)
11. Morito, Y., Tanabe, H.C., Kochiyama, T., Sadato, N.: Neural representation of animacy in the early visual areas: a functional MRI study. Brain Res. Bull. 79(5), 271-280 (2009)
12. Shultz, S., McCarthy, G.: Perceived animacy influences the processing of human-like surface features in the fusiform gyrus. Neuropsychologia 60, 115-120 (2014)
13. Blakemore, S.-J., Boyer, P., Pachot-Clouard, M., Meltzoff, A., Segebarth, C., Decety, J.: The detection of contingency and animacy from simple animations in the human brain. Cereb. Cortex 13(8), 837-844 (2003)
14. Yang, D.Y.-J., Rosenblau, G., Keifer, C., Pelphrey, K.A.: An integrative neural model of social perception, action observation, and theory of mind. Neurosci. Biobehav. Rev. 51, 263-275 (2015)
15. Lahnakoski, J.M., et al.: Naturalistic FMRI mapping reveals superior temporal sulcus as the hub for the distributed brain network for social perception. Front. Hum. Neurosci. 6, 233 (2012)
16. Isik, L., Koldewyn, K., Beeler, D., Kanwisher, N.: Perceiving social interactions in the posterior superior temporal sulcus. PNAS 114, E9145-E9152 (2017)
17. Sliwa, J., Freiwald, W.A.: A dedicated network for social interaction processing in the primate brain. Science 356(6339), 745-749 (2017)
18. Walbrin, J., Downing, P., Koldewyn, K.: Neural responses to visually observed social interactions. Neuropsychologia 112, 31-39 (2018)
19. Baker, C.L., Saxe, R., Tenenbaum, J.B.: Action understanding as inverse planning. Cognition 113, 329-349 (2009)
20. Shu, T., Peng, Y., Fan, L., Lu, H., Zhu, S.-C.: Perception of human interaction based on motion trajectories: from aerial videos to decontextualized animations. Top. Cogn. Sci.
10(1), 225-241 (2018)
21. Riesenhuber, M., Poggio, T.: Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019-1025 (1999)
22. Giese, M.A., Poggio, T.: Neural mechanisms for the recognition of biological movements. Nat. Rev. Neurosci. 4, 179-192 (2003)
23. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system for action recognition. In: IEEE 11th International Conference on Computer Vision (2007)
24. Fleischer, F., Caggiano, V., Thier, P., Giese, M.A.: Physiologically inspired model for the visual recognition of transitive hand actions. J. Neurosci. 33(15), 6563-6580 (2013)
25. Fleischer, F., Christensen, A., Caggiano, V., Thier, P., Giese, M.A.: Neural theory for the perception of causal actions. Psychol. Res. 76(4), 476-493 (2012)
26. Caggiano, V., Fleischer, F., Pomper, J.K., Giese, M.A., Thier, P.: Mirror neurons in monkey premotor area F5 show tuning for critical features of visual causality perception. Curr. Biol. 26(22), 3077-3082 (2016)
27. Warren, W.H.: The dynamics of perception and action. Psychol. Rev. 113(2), 358-389 (2006)
28. Schöner, G., Dose, M.: A dynamical systems approach to task-level system integration used to plan and control autonomous vehicle motion. Robot. Auton. Syst. 10(4), 253-267 (1992)
29. Fajen, B.R., Warren, W.H.: Behavioral dynamics of steering, obstacle avoidance, and route selection. J. Exp. Psychol. Hum. Percept. Perform. 29(2), 343-362 (2003)
30. DiCarlo, J.J., Zoccolan, D., Rust, N.C.: How does the brain solve visual object recognition? Neuron 73(3), 415-434 (2012)
31. You, D., Hamsici, O.C., Martinez, A.M.: Kernel optimization in discriminant analysis. IEEE Trans. Pattern Anal. Mach. Intell.
33(3), 631-638 (2011)

Attention-Based RNN Model for Joint Extraction of Intent and Word Slot Based on a Tagging Strategy

Dongjie Zhang1,2, Zheng Fang1,2, Yanan Cao2(✉), Yanbing Liu2, Xiaojun Chen2, and Jianlong Tan2

1 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
2 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
{zhangdongjie,fangzheng,caoyanan,liuyanbing,chenxiaojun,tanjianlong}@iie.ac.cn

Abstract. In this paper, we propose an attention-based recurrent neural network model based on a tagging strategy for intent detection and word slot extraction. Unlike other joint models, which divide the joint task into two sub-models with shared parameters, we explore a tagging strategy that incorporates the intent detection task and the word slot extraction task in a single sequence labeling model. We conducted experiments on a public dataset, and the results show that the tagging strategy methods outperform most of the existing pipelined and joint methods. Our tagging strategy model obtained a 97.65% accuracy rate on the intent detection task and a 95.15% F1 score on the word slot extraction task.

Keywords: Intent detection · Word slot extraction · Joint model · Attention mechanism · Tagging strategy

1 Introduction

Intent detection and word slot extraction are two basic issues in the field of Natural Language Understanding, and these two tasks are usually handled separately [19]. Intent detection and word slot extraction can be regarded as a sentence classification task and a sequence tagging task, respectively. Traditionally, we solve these problems in sequential order, extracting the word slots first and then detecting the intent of the given sentence. This separated framework makes the task easy to handle and can deal with the issues of the different subtasks more flexibly. However, it assumes that the two tasks are uncorrelated and can therefore be treated by independent models, which in many cases is not true.
Thus, the results of the word slot extraction can affect the outcome of the intent detection through the propagation of errors. Compared with pipeline models, the joint learning framework handles the two tasks using a single model [2]. The joint model can integrate the information of word slots and of intent by sharing collective parameters, and it has been shown to perform well on the joint extraction task [20]. These joint models can make the intent detection and word slot extraction process simpler, as we only need to train one model to fine-tune both tasks.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 178-188, 2018. https://doi.org/10.1007/978-3-030-01424-7_18

Although the aforementioned joint methods can handle the two subtasks in a single model, they can also produce redundant information by extracting word slots and intents separately. Generally, these frameworks need two classifiers with separate label collections: one for intent extraction and another for word slot labeling, so the total number of labels is the combined size of the two label collections. However, this may produce redundant labeling results: if there is a slot s that never appears with the intent i, the model may still label a word as word slot s along with the intent i. In addition, the errors of the two classifiers inevitably propagate to each other during training of the joint model.

In our work, we model the relation of word slots and intent directly by using only one sequential label classifier instead of extracting the word slots and intents separately. We redefine a set of tags containing the information of the word slot and the intent of the whole sentence. Based on this tagging strategy, the joint extraction of word slot and intent can be converted into a sequence tagging problem.
With this strategy, we can easily use sequence-to-sequence models to handle the two tasks simultaneously without complicated feature engineering. However, one word slot may be associated with various intents in different sentences, and the words indicating the intent of the sentence may be located far away from the current input word. Many sequence labeling models are capable of capturing long-distance dependency information, but they still focus strongly on the parts around the current input word. The attention mechanism, which has achieved satisfactory results in the field of machine translation [1], can effectively learn global attention information of the sequence by emphasizing the influence of key words on the model results. Specifically, we asked whether the attention mechanism can be utilized in, and improve, our joint tagging model. We therefore implemented the attention mechanism in our joint model to make it more sensitive to key information, especially long-distance information indicating the intent.

In this paper, we focus on resolving the issues of redundant labeling results and of the error propagation intrinsic to the pipeline as well as to traditional joint training models on the word slot extraction and intent detection tasks. Based on this motivation, we applied a tagging strategy together with an end-to-end model, transforming the joint extraction task into a sequence tagging problem. In order to handle the diverse relations between word slots and intents, we introduce global sentence information through an attention mechanism to enhance the effect of the sentence intent. Experiments on the ATIS data set show that our joint model significantly outperforms pipeline and traditional joint models.

The rest of our paper is organized as follows. In Sect. 2, we introduce related work on RNN sequence labeling models and the attention mechanism for sequence labeling. In Sect.
3, we describe our labeling strategy and end-to-end RNN extraction models in detail. In Sect. 4 we present the settings and results of our experiments. Finally, we conclude the work in Sect. 5.

2 Related Works

Intent detection and word slot extraction correspond to two fundamental problems, text classification and sequence labeling, which are the basis of many natural language applications and are usually solved in a pipeline manner. For intent detection, Support Vector Machines (SVMs) [3], deep neural network methods [14] and Convolutional Neural Networks (CNNs) [7] have been widely used. The boosting method [16] and its improved variant with dependency-parsing-based sentence simplification [17] can handle complex, longer and natural utterances more effectively. An adaptation of the recursive neural network also achieved competitive performance on the intent detection task [2].

In the case of word slot extraction, the most popular methods include Maximum Entropy Markov Models (MEMMs) [10], Conditional Random Fields (CRFs) [13] and Recurrent Neural Networks (RNNs) [11]. Label dependency is beneficial for the word slot extraction task and can be exploited by feeding in the previous output label [9]. RNN-CRF networks can also be used for the word slot extraction task [5]. In general, simple Recurrent Neural Networks and Convolutional Neural Networks have been shown to significantly outperform the previous state-of-the-art Maximum Entropy Markov Models and Conditional Random Fields, and deep Long Short-Term Memory networks (LSTMs) have in particular been proposed for the word slot extraction task [20].

In addition, joint training models have become a research hotspot. The joint model based on Recursive Neural Networks integrated the two subtasks into one compositional model by providing an elegant mechanism for incorporating both discrete syntactic structure and continuous-space word and phrase representations [2].
The CNN-CRF model can be jointly trained by extracting features automatically from CNN layers and sharing them with the intent model [19].

Recently, a novel tagging strategy has been proposed for the joint extraction of entities and relations [22]. Results show that the tagging methods are better than most of the existing pipelined and joint learning methods, without identifying entities and relations separately. That task mainly focuses on extracting a triplet consisting of two entities and the relation between them. Unlike traditional models, this work proposed a tagging strategy that labels triplets directly rather than extracting entities and relationships separately. To implement this tagging strategy, a new set of labels containing information about the entities and the relation between them was designed. With this tagging strategy, the joint extraction of entities and relations can be transformed into a sequence labeling problem. In this way, a sequence labeling model can be conveniently used to handle the joint task without complex feature engineering. However, this tagging strategy still has deficiencies in identifying overlapping relationships, and the treatment of the diverse associations between two corresponding entities still needs to be refined.

3 Proposed Methods

3.1 The Tagging Strategy

Traditional models label the intent and the word slots separately, as Table 1 shows. The labels of intent and word slot are divided into two collections.

Table 1. The word slots and intent of a sentence instance in the ATIS corpus.

Sentence   Flights  From  Boston     To  Kansas    City     On  Friday
Word slot  O        O     B-fromloc  O   B-toloc   I-toloc  O   B-depart_time
Intent     Flight

In order to avoid redundant labeling results and propagation interaction, we adopt a new tagging strategy. How the results are tagged is shown in Fig. 1.
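Concretely, the fusion of per-word slot tags with the sentence intent, and the recovery of the intent by majority vote, can be sketched in a few lines. The helper names are illustrative, and we assume, following the corpus example above, that "O" words carry no intent symbol:

```python
from collections import Counter

def joint_tags(slot_tags, intent):
    """Fuse per-word slot tags with the sentence intent into single joint
    tags: "O" stays "O"; any slot tag gets "-<intent>" appended."""
    return [t if t == "O" else t + "-" + intent for t in slot_tags]

def decode_intent(tags):
    """Recover the sentence intent as the majority intent symbol over the
    non-"O" tags (the intent is the last "-"-separated field)."""
    intents = [t.rsplit("-", 1)[1] for t in tags if t != "O"]
    return Counter(intents).most_common(1)[0][0]

# "flights from boston to atlanta", intent "flight"
slots = ["O", "O", "B-fromloc", "O", "B-toloc"]
tags = joint_tags(slots, "flight")
```

Here `tags` becomes `["O", "O", "B-fromloc-flight", "O", "B-toloc-flight"]`, and `decode_intent(tags)` returns `"flight"`.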
Based on our tagging strategy, each word is assigned a tag that contains three parts: the word's position in the word slot, the word slot type, and the intent of the whole sentence. The symbol "O" represents the "Other" tag, which means that the corresponding word is not in any of the word slots. In addition to the symbol "O", we apply the "BIES" symbols (Begin, Inside, End, Single) to represent the position information within the word slot. The word slot type is obtained from a predefined set. The intent type symbol is also taken from a predefined set, and the intent of all words in a given sequence is exactly the same. Thus, the total number of tags is N_t = N_p × N_s × N_i − U, where N_p is the number of "BIES" position information symbols, N_s is the size of the word slot set, N_i is the number of all intents, and U is the number of redundant labels.

Fig. 1. An instance of our tagging strategy. The word slot symbols "fromloc" and "toloc" represent the departure and destination of the flight; the "flight" symbol expresses the intent of asking for flight information.

As shown in Fig. 1, the word "atlanta" is assigned the tag "B-fromloc-flight". The position information is marked as "B", the word slot type is marked as "fromloc", the intent type is marked as "flight", and the three parts of the tag are connected by the symbol "-". The intent of a sentence is obtained from the majority of the intent symbols of all the words.

3.2 Attention-Based RNN Model

In recent years, end-to-end models based on recurrent neural networks have been widely used in sequence labeling tasks [12, 20]. In this paper, we investigate an end-to-end model that produces the extraction results, as Fig. 2 shows. It contains an embedding layer, a bi-directional RNN layer and a hidden layer with an attention mechanism.

Fig. 2. The illustration of our model with a word embedding layer, a bi-LSTM layer and a hidden layer.
y_i is the hidden layer output, h_i is the hidden layer state and c_i is the attention context vector.

The Bi-RNN Layer. In the sequence labeling task, we learn a function f: X → Y that maps an input sequence X = (x_1, x_2, …, x_T) to its corresponding label sequence Y = (y_1, y_2, …, y_T), explicitly aligned with the input. In our joint task, we want to find the best label sequence Y given the input words X such that:

    \hat{y} = \arg\max_Y P(Y \mid X)   (1)

The bidirectional RNN model has been proven to capture the semantic and sequential information of each word effectively in sequence tagging tasks by reading sentences bidirectionally. In our proposed model, we use a bidirectional RNN layer reading the input sequence in both the forward and the backward direction. The forward RNN, reading the input sequence in its original order, generates a hidden state fh_i at each time step i. Similarly, the backward RNN, reading the input sequence in reverse order, generates a sequence of hidden states [bh_1, bh_2, …, bh_T]. The hidden state of the bidirectional RNN layer at each time step i is the combination of the forward state fh_i and the backward state bh_i: h_i = [fh_i; bh_i]. Each hidden state h_i carries information about the entire input sequence, with a strong focus on the parts around the i-th word. The hidden state h and the bi-RNN output y are then combined with an attention context vector c to produce the label distribution.

The Attention Mechanism. The attention mechanism can be regarded as the process of selectively filtering a small amount of important information from all the provided information while ignoring most of the non-important information [18]. This process is reflected in the calculation of the weight coefficients: the greater a weight is, the more the model focuses on its corresponding value. The weight represents the importance, and the value is its corresponding information.
In the joint extraction task, the attention mechanism can provide the classifier with global attention information by giving different weights to the words.

The attention mechanism is applied in a hidden layer above the bi-RNN layer. We initialize the hidden layer state using the last hidden state of the bi-RNN layer, following the approach in [2]. At each time step i, the hidden layer state s_i is calculated as a function of the previous bi-RNN output y_{i−1}, the bi-RNN hidden state h_i and the attention context vector c_i:

    s_i = f(y_{i-1}, h_i, c_i)   (2)

The attention context distribution c is generated from the hidden states h of the bidirectional RNN. In detail, c_i is calculated as the weighted sum of the bi-RNN states h = (h_1, h_2, …, h_T) [2]:

    c_i = \sum_{j=1}^{T} a_{i,j} h_j   (3)

    a_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T} \exp(e_{i,k})}   (4)

    e_{i,k} = g(s_{i-1}, h_k)   (5)

where g is a feed-forward neural network. The attention context vector c_i provides additional information to the hidden layer and can be viewed as a weighted sum of the sequential features of the RNN hidden layer states (h_1, h_2, …, h_T). In this way, the attention mechanism provides globally weighted information for generating the labels.

The Bias Loss Function. In order to enhance the influence of word slots, we tried the RMSprop optimization method [15] with the loss function defined as:

    L = \max \sum_{j=1}^{|D|} \sum_{i=1}^{T} (1 + \alpha I(O)) \log(p_i = y_i \mid x_i, \theta)   (6)

    p_i^{(j)} = \frac{\exp(o_i^{(j)})}{\sum_{k=1}^{N_t} \exp(o_i^{(k)})}   (7)

where |D| is the size of the data set, T is the length of the sequence, y_i is the label of the i-th word, p_i is the normalized probability of the tags as defined in formula (7), N_t is the total number of tags, o_i is the output for the i-th word, and α is the bias weight of the loss function. The larger α is, the more influence the corresponding tag has.
I(O) is a binary function that distinguishes the loss of the tag "O" from that of the word slot tags; it is defined as follows:

I(O) = 0 if tag = O; 1 if tag ≠ O

4 Experiments

For a better comparison with previous methods and to demonstrate the effect of our method, we carried out experiments on the Air Travel Information System (ATIS) pilot corpus. Our model was then compared with previous pipeline and joint training models to demonstrate its performance on both the independent and the joint tasks.

4.1 Experimental Settings

Dataset. The ATIS (Airline Travel Information Systems) data set [4] is widely used for intent detection and word slot extraction. It contains the conversation texts of people making flight reservations. In this work, we follow the ATIS corpus setup used in [9, 11, 16, 19]. The training set contains 4,978 conversation texts from the ATIS-2 and ATIS-3 corpora, and the test set contains 893 conversation texts from the ATIS-3 NOV93 and DEC94 data sets. There are 127 distinct word slot labels and 18 intent types. We use the F1 score to evaluate word slot extraction, and classification accuracy to evaluate intent detection.

Hyperparameters. In our experiments, the LSTM cell is used as the basic RNN unit. Our LSTM implementation follows the design in [21]. The number of cells in the LSTM layer is 128. We set the initial LSTM forget gate bias to 1 [6]. Our model uses a single LSTM layer; multilayer LSTMs will be explored in future work. The word embedding dimension is set to 128. We randomly initialize the word embeddings and fine-tune them during backpropagation. The training batch size is 16. The dropout rate on the fully connected network is set to 0.5 for regularization [21]. To prevent exploding gradients, the maximum gradient clipping value is set to 5. The bias of the loss function is set to 10, and the number of attention heads is set to 10.
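As a concrete illustration of Eqs. (6)-(7), the following sketch computes the biased cross-entropy for one sequence, written as a loss to minimize (the negative of the objective, averaged over tokens). It is our own illustration with hypothetical shapes, not the authors' implementation.

```python
import numpy as np

ALPHA = 10.0  # bias weight; the paper sets it to 10

def bias_loss(logits, gold_tags, tag_names):
    """Biased cross-entropy of Eqs. (6)-(7): tokens whose gold tag is not
    'O' are weighted by (1 + ALPHA), boosting the influence of word slots.

    logits    : (T, N_t) unnormalized outputs o_i
    gold_tags : length-T list of gold tag indices y_i
    tag_names : list mapping a tag index to its tag string
    """
    # Eq. (7): softmax over the N_t tags
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    p = z / z.sum(axis=1, keepdims=True)
    loss = 0.0
    for i, y in enumerate(gold_tags):
        # I(O) from the definition above: 0 for tag 'O', 1 otherwise
        w = 1.0 if tag_names[y] == "O" else 1.0 + ALPHA
        loss -= w * np.log(p[i, y])
    return loss / len(gold_tags)
```

With ALPHA = 10, a misclassified slot token contributes eleven times the loss of a misclassified "O" token, which is exactly the intended bias toward word slots.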
We apply Adam optimization to our model following the settings in [8].

4.2 Intent Detection Task Results

We first report the results on the independent tasks of intent detection and word slot extraction. We used the bi-LSTM model as our baseline, compared the performance of our proposed model with previously reported methods on the intent detection task, and present the results in Table 2.

Table 2. The results on the independent task of intent detection

Model | Intent accuracy (%)
Recursive NN [2] | 95.40
Boosting [16] | 95.62
Boosting + Simplified sentences [17] | 96.98
bi-LSTM | 97.14
bi-LSTM with attention | 97.31
bi-LSTM with bias loss function | 97.20
bi-LSTM with attention-bias loss function | 97.65

As the table shows, our joint methods perform better than pipelined methods on intent detection. Both the attention-based bi-LSTM joint model and the bi-LSTM joint model with the bias loss function improve on the bi-LSTM model. Moreover, the bi-LSTM model combined with both the attention mechanism and the bias loss function achieved the best intent detection accuracy. This can be attributed to the combination of the attention mechanism and the bias loss function, which allows the model to learn sequence-level information more efficiently.

While training the attention model, we found that the attention mechanism helps to enhance the influence of long-distance keywords when labeling the intent. As shown in Fig. 3, the attention weights at the beginning of the sentence are higher when we label the last word "thursday". The word slot of "thursday" is a date slot, which may appear in many sentences with different intents. We therefore need to know the intent of the sentence as well as the slot of the word "thursday" before labeling the word "B-depart_date-flight".
Evidently, the words at the beginning of the sentence carry most of the intent information, and the attention mechanism can effectively recruit this long-distance information to resolve ambiguous intents. This partly explains the good performance of our joint model on the intent detection task.

Fig. 3. The distribution of the attention weights when labeling the last word "thursday" of a sentence with the intent "flight". Darker shading indicates higher attention weight.

4.3 Word Slot Extraction Task Results

Table 3 shows the performance of our proposed model on word slot extraction alongside previously reported results. Once again, the joint model performs better than the pipeline methods. In addition, the attention-based model gives a slightly better F1 score than the non-attention-based models. The reason may be that the attention mechanism finds additional supporting information in the input word sequence for word slot label prediction. Overall, the attention-based RNN models outperform those without the attention mechanism, and the bias loss function is helpful for word slot extraction. However, when we combine the attention mechanism and the bias loss function in the bi-LSTM model, the F1 score drops slightly. We suspect the weight of the bias loss function may disrupt the attention weights during backpropagation. As the bias is set manually, it is difficult to select a perfectly suitable hyperparameter, and a poor choice introduces human error into the training process. In future work, we will try to optimize the bias by treating it as a model parameter.

Table 3.
The results on the independent task of word slot extraction

Model | F1 score (%)
CNN-CRF [19] | 94.35
RNN with Label Sampling [9] | 94.89
Hybrid RNN [11] | 95.06
Deep LSTM [20] | 95.08
bi-LSTM | 94.89
bi-LSTM with attention | 95.15
bi-LSTM with bias loss function | 95.13
bi-LSTM with attention-bias loss function | 95.05

4.4 Joint Task Results

Table 4 shows our tagging model's performance on the joint extraction task of intent and word slots, compared with previously reported results.

Table 4. The results of the joint task on intent detection and word slot extraction

Model | F1 score (%) | Intent accuracy (%)
RecNN [2] | 93.22 | 95.40
RecNN + Viterbi [2] | 93.96 | 95.40
bi-LSTM | 94.89 | 97.09
bi-LSTM with attention | 95.15 | 97.20
bi-LSTM with bias loss function | 95.13 | 96.89
bi-LSTM with attention-bias loss function | 95.05 | 97.09

As shown in the table, the joint model using the tagging strategy achieved promising performance on both intent detection and word slot extraction. The attention-based bi-LSTM achieved the best performance in our experiments. However, the combined model with both the attention mechanism and the bias loss function still has considerable room for improvement.

We examined the bad cases in the results, most of which were caused by the token "UNK", which represents low-frequency words. In addition, many of the mislabeled word slots are themselves infrequent. We speculate that, because of the limited data size, the training data could not cover all cases well, especially low-frequency words and word slots. In future work, we will scale up the data set and adopt a deeper RNN model to further improve the performance of our model.

The experimental results show the effectiveness of our proposed method, but it still falls short on identifying multiple tags. In future work, we will replace the softmax function in the output layer with multiple classifiers, so that a word can be labeled with multiple tags.
In this way, the word tagging process can be transformed into a multi-label classification problem, which can solve the problem of multiple tags. Although our model can enhance the effect of word slot words, the associations between word slots and sentence intent still require refinement in future work.

5 Conclusion

In this paper, we explored a tagging strategy and investigated end-to-end RNN models to jointly extract intents and word slots. We further improved our joint tagging model with an attention mechanism to address the diversified relationships between word slots and intents. Based on our tagging strategy, the joint task of intent detection and word slot extraction is greatly simplified, as only one sequence tagging model needs to be trained and deployed. We conducted experiments on a public dataset, and the results show that our joint model achieves better performance on the benchmark ATIS task than most existing pipelined and joint models, on both the independent and the joint extraction tasks.

Acknowledgement. This work was supported by the National Key Research and Development Program of China (No. 2018YFB1004703).

References

1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
2. Guo, D., Tur, G., Yih, W., Zweig, G.: Joint semantic utterance classification and slot filling with recursive neural networks. In: 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 554–559. IEEE (2014)
3. Haffner, P., Tur, G., Wright, J.H.: Optimizing SVMs for complex call classification. In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), vol. 1, pp. I–I. IEEE (2003)
4. Hemphill, C.T., Godfrey, J.J., Doddington, G.R.: The ATIS spoken language systems pilot corpus.
In: Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, 24–27 June 1990 (1990)
5. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
6. Jozefowicz, R., Zaremba, W., Sutskever, I.: An empirical exploration of recurrent network architectures. In: International Conference on Machine Learning, pp. 2342–2350 (2015)
7. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
9. Liu, B., Lane, I.: Recurrent neural network structured output prediction for spoken language understanding. In: Proceedings of the NIPS Workshop on Machine Learning for Spoken Language Understanding and Interactions (2015)
10. McCallum, A., Freitag, D., Pereira, F.C.: Maximum entropy Markov models for information extraction and segmentation. In: ICML, vol. 17, pp. 591–598 (2000)
11. Mesnil, G., et al.: Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Trans. Audio Speech Lang. Process. 23(3), 530–539 (2015)
12. Mikolov, T., Kombrink, S., Burget, L., Černocký, J., Khudanpur, S.: Extensions of recurrent neural network language model. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5528–5531. IEEE (2011)
13. Raymond, C., Riccardi, G.: Generative and discriminative algorithms for spoken language understanding. In: Eighth Annual Conference of the International Speech Communication Association (2007)
14. Sarikaya, R., Hinton, G.E., Ramabhadran, B.: Deep belief nets for natural language call-routing. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5680–5683. IEEE (2011)
15. Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude.
COURSERA Neural Netw. Mach. Learn. 4(2), 26–31 (2012)
16. Tur, G., Hakkani-Tür, D., Heck, L.: What is left to be understood in ATIS? In: 2010 IEEE Spoken Language Technology Workshop (SLT), pp. 19–24. IEEE (2010)
17. Tur, G., Hakkani-Tür, D., Heck, L., Parthasarathy, S.: Sentence simplification for spoken language understanding. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5628–5631. IEEE (2011)
18. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 6000–6010 (2017)
19. Xu, P., Sarikaya, R.: Convolutional neural network based triangular CRF for joint intent detection and slot filling. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 78–83. IEEE (2013)
20. Yao, K., Peng, B., Zhang, Y., Yu, D., Zweig, G., Shi, Y.: Spoken language understanding using long short-term memory neural networks. In: 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 189–194. IEEE (2014)
21. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014)
22. Zheng, S., Wang, F., Bao, H., Hao, Y., Zhou, P., Xu, B.: Joint extraction of entities and relations based on a novel tagging scheme. arXiv preprint arXiv:1706.05075 (2017)

Using Regular Languages to Explore the Representational Capacity of Recurrent Neural Architectures

Abhijit Mahalunkar and John D. Kelleher
Dublin Institute of Technology, Dublin, Ireland
abhijit.mahalunkar@mydit.ie, john.d.kelleher@dit.ie

Abstract. The presence of Long Distance Dependencies (LDDs) in sequential data poses significant challenges for computational models. Various recurrent neural architectures have been designed to mitigate this issue. In order to test these state-of-the-art architectures, there is a growing need for rich benchmarking datasets.
However, one of the drawbacks of existing datasets is the lack of experimental control with regard to the presence and/or degree of LDDs. This lack of control limits the analysis of model performance in relation to the specific challenge posed by LDDs. One way to address this is to use synthetic data with the properties of subregular languages. The degree of LDDs within the generated data can be controlled through the k parameter, the length of the generated strings, and the choice of forbidden strings. In this paper, we explore the capacity of different RNN extensions to model LDDs by evaluating these models on a sequence of synthesized SPk datasets, where each subsequent dataset exhibits a longer degree of LDD. Even though SPk languages are simple, the presence of LDDs has a significant impact on the performance of recurrent neural architectures, making these languages prime candidates for benchmarking tasks.

Keywords: Sequential models · Long distance dependency · Recurrent neural networks · Regular languages · Strictly piecewise languages

1 Introduction

A Recurrent Neural Network (RNN) is able to model temporal data efficiently [1]. In theory, RNNs are capable of modeling infinitely long dependencies. A long distance dependency (LDD) describes a contingency (or interaction) between two (or more) elements in a sequence that are separated by an arbitrary number of positions. LDDs often occur in natural language; for example, in English, subjects and verbs must agree: compare "The dog in that house is aggressive" with "The dogs in that house are aggressive". However, in practice, successfully training an RNN to model LDDs is still extremely difficult, due in part to exploding or vanishing gradients [2,3].

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 189–198, 2018. https://doi.org/10.1007/978-3-030-01424-7_19

There have been significant advances in this domain, and various architectures have been developed to address the issue of LDDs [4–11]. Indeed, the fact that a number of RNN extensions are specifically designed to address the problem of modeling LDDs is a testament to the fundamental importance of the challenge they pose. In order to test the representational capacity of these models and aid the future development of new models, there is a growing need for large datasets which manifest various degrees of LDDs. Various benchmarking datasets and tasks which exhibit such properties are currently employed [4,7,12,13]. However, using them provides no experimental control over the degree of LDD the datasets exhibit. Although the copy and add tasks [4] do offer control over this factor, the datasets generated via this scheme do not possess complexity comparable to datasets sampled from real-world sequential processes.

Strictly k-Piecewise (SPk) languages, as studied by Rogers et al. [14], are proper subclasses of the piecewise testable languages [15]. SPk languages are natural and can express some of the kinds of LDDs found in natural languages [16,18]. In relation to research on LDDs, SPk languages are particularly interesting because, by controlling the parameter k and the length of the strings, one can control the maximum LDD in the dataset, and, by choosing appropriate forbidden strings, it is possible to simulate a natural dataset exhibiting a certain degree of LDD. These properties make SPk languages prime candidates for benchmarking tasks.

Contribution: This research used a finite-state implementation of an SP2 grammar to generate strings of varying length, from 2 to 500. SP2 is analogous to subject-verb agreement in the English language; using this grammar therefore generates LDDs of similar complexity, and controlling the length of the generated strings controls the maximum LDD span in the dataset. Appropriate forbidden strings were chosen.
State-of-the-art sequential data models were trained to predict the next character on every generated dataset. It was observed that as the length of the strings in the datasets increased, the perplexity of the models increased. This is due in part to the limitations of the representational capacity of these models. However, of the models tested, Recurrent Highway Networks displayed the lowest perplexity on the character prediction task for long sequences exhibiting very high LDDs.

2 Recurrent Neural Architectures for LDDs

The focus of this paper is to experimentally evaluate the ability of modern RNN architectures to model LDDs by testing current state-of-the-art models on datasets of SPk sequences which exhibit LDDs of varying lengths. For our experiments we chose the following architectures as representative of RNNs: Long Short-Term Memory [4], Recurrent Highway Networks [9] and Orthogonal RNNs [10]. This choice of networks was based on the facts that (a) each of these networks was specifically designed to address performance issues of the standard RNN when modeling LDD datasets, and (b) taken together, the selected models cover the different approaches to the problem of LDDs found in the literature.

LSTMs were an early effort to address the vanishing gradient effect by introducing "constant error carousels", which enforce constant error flow and thereby bridge minimal time lags in excess of 1000 discrete steps. Neural Turing Machines are memory-augmented networks composed of a network controller and a memory bank; these components allow the network to attend to different memory locations. Recurrent Highway Networks (RHNs) extend the LSTM architecture to allow step-to-step transition depths larger than one.
Orthogonal RNNs (ORNNs) extend the standard RNN architecture by enforcing soft or hard orthogonality on the weight matrix.

3 Benchmarking Datasets

There is a relatively small number of datasets that are popular for testing the representational capacity of RNNs. Most of these datasets are known to exhibit LDDs, which is a necessary criterion for their selection as benchmarking datasets. The Penn Treebank (PTB) [12] is one of these datasets. It consists of over 4.5 million words of American English and was constructed by sampling English sentences from a range of sources. The WikiText language modeling dataset [7], released in 2016, has become a popular choice for language modeling experiments. It is a collection of over 100 million tokens extracted from Wikipedia articles. This dataset is much larger than the PTB, which is the primary reason it is preferred to the PTB in recent work. Although the PTB and WikiText differ in the sources from which their sentences are sampled, both datasets exclusively contain English sentences. Hence both are constrained by English grammar, and will therefore exhibit similar LDD characteristics. Moreover, it is unclear exactly what these LDDs are, because the data is sampled from a natural process (the English language) whose LDD characteristics have not been accurately estimated.

The difficulty of using naturally occurring datasets to investigate LDDs has been recognized, and several synthetic benchmarks have been used to test the ability of RNNs to learn LDDs in sequential data. The copy and adding tasks, introduced in [4], are one such example. The copy task entails remembering an input sequence followed by a string of blank inputs. The sequence is terminated by a delimiter, after which the network must reproduce the input sequence, ignoring the string of blank inputs that followed the original sequence [10].
This task gives the experimenter a great degree of control over the length of the LDDs in the dataset synthesized to train and test a model. Another method of testing models on simulated LDDs is to train them to predict MNIST image classes [13]. This is achieved by feeding all 784 pixels of an MNIST image sequentially to the model under test and training the network to classify the image category. Every image is fed to the network pixel by pixel, starting from the top-left pixel and finishing at the bottom-right pixel. This simulates an LDD of length 784, as the network has to remember all 784 pixels in order to classify the image.

Formal languages have previously been used to train RNNs and investigate their inner workings. The Reber grammar [19] was used to train various first-order RNNs [21,22], and also served as a benchmarking dataset in the original work on LSTM models [4]. The regular languages studied by Tomita [20] were used to train RNNs to learn the grammatical structure of strings. A very recent example of research using formal languages to evaluate RNNs is Avcu et al. [17]. The work presented in this paper falls within this tradition of analysis, but it extends the previous research on formal languages by: (a) broadening the variety of LDDs within the generated datasets, (b) evaluating a broader variety of models, and (c) using language model perplexity as the evaluation metric.

4 Formal Language Theory and Regular Languages

Formal Language Theory (FLT) finds use in various domains of science. Primarily developed to study the computational basis of human language, FLT is now used extensively to analyze any rule-governed system [23–25]. Regular languages are the simplest grammars (type-3 grammars) within the Chomsky hierarchy and are described by regular expressions. Subregular languages, e.g.
Strictly k-Piecewise or Strictly k-Local languages, are subclasses of the regular languages. These languages can be recognized by mechanisms much less complicated than finite-state automata. Many aspects of human language, such as local and non-local dependencies, are similar to subregular languages [26], and there are certain types of LDDs in human language which admit a finite-state characterization [18]. These types of LDD can be modeled using Strictly k-Piecewise languages.

4.1 Strictly Piecewise Languages

In order to explain how we used SPk languages to generate datasets appropriate to our experimental goals, it is first necessary to introduce these languages. Following [14,16,17], a language L is described over a finite set of symbols, i.e. an alphabet, denoted by Σ. The symbols are analogous to words or characters in English, notes in music theory, genes in genomics, etc. The set Σ* is the free monoid over Σ: the set of finite sequences of zero or more elements of Σ. For example, for Σ = {a, b, c}, Σ* contains all concatenations of a, b, and c: {λ, a, ab, ba, cac, acbabc, ...}. The string of length zero is denoted by λ, and w_i denotes the i-th word/string of L. The length of a string u is denoted |u|. A stringset (or formal language) is a subset of Σ*. If u and v are strings, uv denotes their concatenation. For all u, v, w, x ∈ Σ*, if x = uwv, then w is a substring of x. For example, bc is a substring of abcd, as concatenating a, bc, d yields abcd. Similarly, a string v is a subsequence of a string w iff v = σ_1σ_2...σ_n and w ∈ Σ*σ_1Σ*σ_2Σ*...Σ*σ_nΣ*, where σ_i ∈ Σ. For example, the string bd is a subsequence of length k = 2 of abcd, and acd is a subsequence of length k = 3 of the same string abcd, but the string db is not a subsequence of abcd. A subsequence of length k is called a k-subsequence. Let subseq_k(w) denote the set of subsequences of w up to length k.
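The set subseq_k(w) can be computed directly from this definition. The following sketch (our own illustration, using the paper's abcd examples) enumerates index combinations of each length up to k:

```python
from itertools import combinations

def subseq_k(w, k):
    """subseq_k(w): the set of all subsequences of string w up to length k,
    including the empty string lambda."""
    subs = set()
    for n in range(k + 1):
        # every increasing choice of n positions yields one n-subsequence
        for idx in combinations(range(len(w)), n):
            subs.add("".join(w[i] for i in idx))
    return subs

# bd and acd are subsequences of abcd; db is not, since order must be preserved
s = subseq_k("abcd", 3)
```

For a string of distinct symbols such as abcd, subseq_k("abcd", 2) contains exactly 11 elements: λ, the 4 single symbols, and the 6 ordered pairs.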
A Strictly Piecewise grammar can be defined as a set of permissible subsequences: the grammar G accepts exactly those strings whose subsequences of length up to k are permissible according to G. Consider a language L over Σ = {a, b, c, d}. The grammar G_SP2 = {aa, ac, ad, ba, bb, bc, bd, ca, cb, cc, cd, da, db, dc, dd} comprises the permissible subsequences of length k = 2. Note, however, that although {ab} is a logically possible subsequence of length k, it is not in the grammar. Subsequences which are not in the grammar are called forbidden strings. The string u = [bbcbdd], where |u| = 6, belongs to G_SP2, because it is composed of subsequences that are in the grammar. Similarly, the string v = [bbdbbbcbddaa], where |v| = 12, belongs to G_SP2. However, the string w = [bbabbbcbdd] does not, because {ab} is a forbidden subsequence. This condition applies to strings of any length. One can likewise define SP grammars for k = 3 and k = 4 over Σ = {a, b}, denoted G_SP3 and G_SP4. For example, G_SP3 = {aaa, aab, abb, baa, bab, bba, bbb}, with {aba} as the forbidden string: the string [aaaaaaab] of length 8 is a valid G_SP3 string, while [aaaaabaa] is invalid. Thus, a grammar reflecting the dataset one intends to simulate can be designed by selecting appropriate permissible strings; for a specific language, the forbidden strings can be computed.¹ Note that to define an SPk grammar it is necessary to specify at least one forbidden string.

Fig. 1. The automaton for G_SP2 (k = 2) which generates strings of length 6.

Figure 1 illustrates the finite-state diagram of G_SP2 for strings of length 6 with forbidden string {ab}. Traversing a path from state 1 to state 11 generates valid G_SP2 strings of length 6, e.g. {accdda, caaaaa}. It can also be noted that there is no path which generates a string containing an {ab} subsequence, e.g. {abcccc}.
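The membership condition can also be checked directly from the definition, without building the automaton. The following sketch is our own illustration, verified against the paper's examples (the grammar shown here uses {ab} as its only forbidden 2-subsequence):

```python
from itertools import combinations

# G_SP2: all 2-subsequences over {a,b,c,d} except the forbidden {ab}
G_SP2 = {"aa", "ac", "ad", "ba", "bb", "bc", "bd", "ca", "cb",
         "cc", "cd", "da", "db", "dc", "dd"}

def in_sp_grammar(w, grammar, k=2):
    """A string belongs to an SPk grammar iff every one of its
    k-subsequences is a permissible subsequence of the grammar."""
    for idx in combinations(range(len(w)), k):
        if "".join(w[i] for i in idx) not in grammar:
            return False
    return True
```

Checked against the text: bbcbdd and bbdbbbcbddaa are accepted, while bbabbbcbdd is rejected because it contains the forbidden subsequence ab. (This brute-force check is quadratic in |w| for k = 2; the finite-state automaton of Fig. 1 does the same job in linear time.)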
Using the methodology described above, of choosing strings of appropriate length, one can simulate appropriate LDDs in a dataset. One can also control the number of dependent elements by choosing an appropriate k. Forbidden strings allow certain combinations to be eliminated from the generated datasets, which is useful when simulating real-world datasets.

¹ See Sect. 5.2, "Finding the shortest forbidden subsequences", in [16] for a method to compute the forbidden sequences of a particular SP language.

5 Experiment

In this experiment, we generate 4 datasets of the SP2 language. For each dataset we train an LSTM, an ORNN and an RHN, and evaluate and compare the performance of the models.

5.1 Generating the SP2 Dataset

For our experiment, Σ = {a, b, c, d} was selected, and the forbidden strings of this language were chosen as {ab, bc}. In order to introduce various degrees of LDD, strings of length l were generated in random order, where 2 ≤ l ≤ 500. For every l, the number of strings n_l satisfies n_l ≤ 1,000,000, which allowed for a uniform distribution of strings of all lengths. These strings were grouped into 4 datasets as described in Table 1. Within each dataset, strings were randomly ordered to avoid biased gradients. Because of the size of the full datasets, subsets of the generated data were used for training the neural networks.

Table 1. Datasets of the SP2 language

Dataset | Min length | Max length | Max LDD | Original | Sample
Dataset 1 | 2 | 20 | 20 | 15 MB | 15 MB
Dataset 2 | 21 | 100 | 100 | 470 MB | 50 MB
Dataset 3 | 101 | 200 | 200 | 1.5 GB | 100 MB
Dataset 4 | 201 | 500 | 500 | 9.9 GB | 200 MB

The strings were generated using the tool foma [27]. A post-processing Python script was developed to select the smaller samples from the original datasets 1, 2, 3 and 4, as described in Table 1. Every dataset is made up of strings of varying l. The Python script was also used to randomize the order of the strings (with respect to length), so as not to bias the models.²
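The authors generated the strings with the finite-state tool foma. As a self-contained illustration, short SP2 strings with the same forbidden 2-subsequences {ab, bc} can also be rejection-sampled directly. This is our own sketch, practical only for small lengths, since the acceptance probability shrinks rapidly as strings grow; the finite-state approach is what makes lengths up to 500 feasible.

```python
import random

SIGMA = "abcd"
FORBIDDEN = {"ab", "bc"}  # forbidden 2-subsequences, as in the experiment

def has_forbidden_2subseq(w):
    """True if any forbidden pair occurs as a (not necessarily adjacent)
    subsequence of w."""
    seen = set()
    for ch in w:
        for prev in seen:
            if prev + ch in FORBIDDEN:
                return True
        seen.add(ch)
    return False

def sample_sp2_string(length, rng=random):
    """Rejection-sample one valid SP2 string of the given length."""
    while True:
        w = "".join(rng.choice(SIGMA) for _ in range(length))
        if not has_forbidden_2subseq(w):
            return w

# e.g. draw one valid string of length 6
w = sample_sp2_string(6)
```

Note that the forbidden-subsequence check scans each string once while remembering which symbols have already occurred, mirroring how the automaton of Fig. 1 tracks state.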
5.2 Training Task

All the networks were trained on a character prediction task. For each network type (LSTM, ORNN, RHN), a network was trained on each of the 4 SP2 datasets, and also on a standard English-language dataset. The English-language datasets were included in the experiments to provide a comparison of model performance when the vocabulary and type of data are varied. For the LSTM and ORNN the PTB was used as the standard English-language dataset, and for the RHN the Text8 dataset was used. Note that the experimental task was kept constant across all datasets: although the PTB and Text8 datasets are often used for word prediction, in these experiments the networks were trained and evaluated on character prediction on the PTB and Text8 datasets as well.

For the SPk languages, the generated datasets were split into training (60%), validation (20%) and test (20%) sets. The LSTM³ models with dropout were trained as advised in [28]; the ORNN⁴ models were trained as recommended in [10]; and the RHN⁵ models were trained following [9]. The performance of all three network types was measured by computing the perplexity of the network after each epoch. The performance curve of the LSTM model is plotted in Fig. 2a, that of the ORNN model in Fig. 2b, and that of the RHN in Fig. 2c.

Fig. 2. Perplexity vs. training epoch for the recurrent neural architectures: (a) LSTM network, (b) Orthogonal RNN, (c) Recurrent Highway Network.

² Source code available at https://github.com/silentknight/ICANN2018.
³ LSTM source: https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py.
⁴ ORNN source: https://github.com/veugene/spectre_release.
⁵ RHN source: https://github.com/julian121266/RecurrentHighwayNetworks.

6 Analysis

In Fig.
2, we visualize the impact of increasing LDDs while training all three architectures. Our results show that during the initial phase of training, the LSTM network displayed perplexity directly proportional to the degree of LDDs present in the dataset: dataset 4 (LDDs on the order of 500) presents higher perplexity than the other datasets. However, every dataset eventually exhibits lower perplexity after epoch 20. Compared with the PTB task, the LSTM network shows lower perplexity when modeling datasets 1, 2, 3 and 4 during the initial phase of training. This is due in part to the small vocabulary of the SP2 datasets (Σ = {a, b, c, d}); a small vocabulary tends to lower the entropy of a sequence. The PTB has a much larger vocabulary, which increases the entropy and eventually the perplexity. Selecting more forbidden strings leads to a much richer grammar: the SPk languages generated for this experiment contained only 2 forbidden strings, which yields a less rich grammar than the PTB (English grammar). Nevertheless, one can observe that the LSTM model learns the PTB much faster than the SP2 languages (its curve drops earlier). This can be directly attributed to the presence of longer LDDs in the SP2 datasets.

Orthogonal RNNs enforce soft orthogonality to address the vanishing gradient problem. Compared with LSTM training on the PTB, the perplexity of both architectures is very similar during the initial training phase, but the ORNN's performance does not improve with further training as the LSTM's does. The impact of vocabulary size is also evident in this case: the perplexity for the PTB is much higher than for the SP2 datasets. However, it can be seen that ORNNs trained on datasets 1 and 2 present higher perplexity than on datasets 3 and 4 (longer LDDs), suggesting that the ORNN models overfit datasets 1 and 2 but are able to generalize on datasets 3 and 4.
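The perplexity reported in these curves is the standard character-level language-model metric: the exponential of the mean negative log-probability the model assigns to the observed characters. A minimal sketch (our own illustration, not tied to any of the three implementations):

```python
import numpy as np

def perplexity(probs):
    """Perplexity over a held-out sequence: exp of the mean negative
    log-likelihood of the probabilities assigned to the true next
    characters.

    probs : iterable of the probabilities the model assigned to each
            actually-observed next character.
    """
    nll = -np.mean(np.log(probs))
    return float(np.exp(nll))
```

This also makes the vocabulary-size effect discussed above concrete: a model that spreads its mass uniformly over SP2's 4-symbol alphabet has perplexity exactly 4, whereas a uniform model over the PTB's much larger character set starts far higher.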
This could be attributed to orthogonal weight initializations, which make learning longer dependencies easier. Focusing on the graph for the Recurrent Highway Networks, it can be observed that the model tended to exhibit lower perplexity on SP2 datasets with higher degrees of LDDs. This could be attributed to the architecture of the network: due to the increased depth of the recurrent transitions in these networks, the model was able to achieve good performance on datasets with long LDDs. However, on datasets with lower degrees of LDDs these models tend to overfit and, thus, exhibit higher perplexity. Furthermore, comparing the RHN graph on the Text8 dataset with the LSTM and ORNN graphs on the PTB, it is apparent that RHNs are better at handling larger vocabularies: the RHN graph for Text8 is lower than the LSTM and ORNN graphs on the PTB.

7 Conclusion

In this paper, we used SPk languages to generate benchmarking datasets for LDDs. We trained various RNNs on the generated datasets and analyzed their performance. The analysis revealed that SPk languages are able to generate datasets with varying degrees of LDDs. Consequently, using SPk languages gives experimental control over the generation of rich datasets, by controlling k, the length of the strings, and the vocabulary of the generated language, and by choosing appropriate forbidden strings. The analysis also revealed that RHNs have a much better capability (as compared with LSTMs and ORNNs) to model LDDs.

Acknowledgements. This research was partly supported by the ADAPT Research Centre, funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund. The research was also supported by an IBM Shared University Research Award. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research under the NVIDIA GPU Grant programme.

References

1.
Elman, J.L.: Finding structure in time. Cogn. Sci. 14, 179–211 (1990)
2. Hochreiter, S.: Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich (1991)
3. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994)
4. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
5. Graves, A., Wayne, G., Danihelka, I.: Neural Turing machines. CoRR (2014)
6. Salton, G.D., Ross, R.J., Kelleher, J.D.: Attentive language models. In: Proceedings of the 8th International Joint Conference on Natural Language Processing, pp. 441–450 (2017)
7. Merity, S., Xiong, C., Bradbury, J., Socher, R.: Pointer sentinel mixture models. In: ICLR 2016 (2016)
8. Chang, S., et al.: Dilated recurrent neural networks. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 77–87. Curran Associates, Inc. (2017)
9. Zilly, J.G., Srivastava, R.K., Koutník, J., Schmidhuber, J.: Recurrent highway networks. In: Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR, vol. 70 (2017)
10. Vorontsov, E., Trabelsi, C., Kadoury, S., Pal, C.: On orthogonality and learning recurrent networks with long term dependencies. In: Proceedings of ICML 2017 (2017)
11. Henaff, M., Szlam, A., LeCun, Y.: Recurrent orthogonal networks and long-memory tasks. In: Proceedings of the 33rd International Conference on Machine Learning, PMLR, vol. 48, pp. 2034–2042 (2016)
12. Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993). ISSN 0891-2017
13. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
14. Rogers, J., et al.: On languages piecewise testable in the strict sense. In: Ebert, C., Jäger, G., Michaelis, J.
(eds.) MOL 2007/2009. LNCS (LNAI), vol. 6149, pp. 255–265. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14322-9_19
15. Simon, I.: Piecewise testable events. In: Brakhage, H. (ed.) GI-Fachtagung 1975. LNCS, vol. 33, pp. 214–222. Springer, Heidelberg (1975). https://doi.org/10.1007/3-540-07407-4_23
16. Ogihara, M., Tarui, J. (eds.): TAMC 2011. LNCS, vol. 6648. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20877-5
17. Avcu, E., Shibata, C., Heinz, J.: Subregular complexity and deep learning. In: Proceedings of the Conference on Logic and Machine Learning in Natural Language (LaML 2017), vol. 1, pp. 20–33 (2017)
18. Heinz, J., Rogers, J.: Estimating strictly piecewise distributions. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 886–896 (2010)
19. Reber, A.S.: Implicit learning of artificial grammars. J. Verbal Learn. Verbal Behav. 6(6), 855–863 (1967)
20. Tomita, M.: Learning of construction of finite automata from examples using hill-climbing. In: Proceedings of the Fourth International Cognitive Science Conference, pp. 105–108 (1982)
21. Casey, M.: The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Comput. 8(6), 1135–1178 (1996)
22. Smith, A.W., Zipser, D.: Encoding sequential structure: experience with the real-time recurrent learning algorithm. Proc. IJCNN I, 645–648 (1989)
23. Chomsky, N.: Three models for the description of language. IRE Trans. Inf. Theory 2, 113–124 (1956)
24. Chomsky, N.: On certain formal properties of grammars. Inf. Control 2, 137–167 (1959)
25. Fitch, W.T., Friederici, A.D.: Artificial grammar learning meets formal language theory: an overview. Philos. Trans. R. Soc. B Biol. Sci. 367(1598), 1933–1955 (2012)
26. Jäger, G., Rogers, J.: Formal language theory: refining the Chomsky hierarchy. Philos. Trans. R. Soc. B Biol. Sci.
367(1598), 1956–1970 (2012)
27. Hulden, M.: Foma: a finite-state compiler and library. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 29–32 (2009)
28. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. In: Proceedings of ICLR (2015)

Learning Trends on the Fly in Time Series Data Using Plastic CGP Evolved Recurrent Neural Networks

Gul Muhammad Khan1 and Durr-e-Nayab2

1 Electrical Engineering Department, UET Peshawar, Peshawar, Pakistan, gk502@uetpeshawar.edu.pk
2 Computer System Engineering Department, UET Peshawar, Peshawar, Pakistan, nayaab_khan@nwfpuet.edu.pk

Abstract. An approach of Direct Online Learning (DOL) to incorporate developmental plasticity in Recurrent Neural Networks, termed Plastic Cartesian Genetic Programming evolved Recurrent Neural Network (PCGPRNN), is proposed to exploit the trends in foreign currency data to forecast future currency rates, while reshaping its connectivity and biasing factors and selecting various parameters from the input vector 'on the fly' according to the traversed trends. The developed model learns in real time and exhibits the optimum topology for the best possible output using neuro-evolution. The network performance is observed in a range of scenarios with varying network parameters and various currencies and trading indexes, obtaining competitive results. Networks trained to predict single instances are further explored in independent scenarios to predict various time intervals in advance, achieving remarkable results.

Keywords: Cartesian genetic programming · Developmental plasticity · Foreign currency exchange · Neuro evolution · Recurrent Neural Networks

1 Introduction

Plasticity in neural networks is an efficient phenomenon that exists in biological neural networks [19].
In artificial neural networks (ANNs), plasticity is attained through the ability to change aspects of the network in response to environmental conditions [15–17]. Such networks are attracting growing attention because of their ability to train live in a variable task environment. Similarly, recurrent neural networks, i.e. systems with memory, have fast learning ability, which makes them efficient for dealing with challenging scenarios [8]. Developmental plastic neural networks, when combined with feedback capabilities, give rise to a novel neural network approach that unites the capabilities of developmental plasticity and recurrent networks. In this work, this approach is evolved for the prediction of foreign currency exchange rates. The mechanism is called Plastic Cartesian Genetic Programming evolved Recurrent Neural Network (PCGPRNN). The proposed dynamic ANN creates new neural sub-systems when it encounters a dynamic learning scheme, and achieves premiere performance due to the feedback mechanism. The plasticity allows the network to customize its aspects while the problem domain is evolving, and to solve multiple linear/nonlinear problems without suffering from damaging interference. Catastrophic interference, or catastrophic forgetting, occurs when a neural network trained on one problem forgets how to solve it when trained on another problem [2, 18]. The feedback mechanism of the network is both constructive and destructive, and proves useful and convenient for unsteady financial time series data because of its quick learning ability [8].

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 199–207, 2018. https://doi.org/10.1007/978-3-030-01424-7_20
PCGPRNN is an exceptional technique because of the existence of a feedback mechanism in the network, which takes the system status inputs into consideration while continuously transforming the morphology at runtime. PCGPRNN, as opposed to feed-forward networks, can process arbitrary input sequences because of its ability to exploit internal memory [8]. Through this work our target is a neural network technique that integrates bio-inspired developmental measures with feedback. The Plastic Cartesian Genetic Programming based Recurrent Neural Network (PCGPRNN) proposed in this work is obtained as a dynamic network with the capability to continually adjust its design and weights in response to the external environment. Foreign Exchange (Forex) rates are directly influenced by many macro- and micro-economic factors collectively, besides international relations and the global state of business. Likewise, the inflation rate, the rate of interest, per capita income, the role of speculation, the cost of manufacturing, industry, economic growth in terms of Gross Domestic Income, political stability and the Relative Strength Index (RSI) of stocks, as well as the economies of other countries, change the worth of a currency instantaneously as well as in the long run [23]. Traditional statistical models show better performance on linear data sets than on non-linear data sets such as stock indices, where they exhibit limitations [6]. A Hidden Markov Model (HMM) was used for time series forecasting in [1], but proved susceptible to external factors such as stock indices, making it ambiguous and inflexible. A support vector machine used in [27] learnt the progression of volatility levels of forex data predicted by a Hidden Markov Model and predicted approximations of the level; improved outputs were obtained by executing a numerical measure of actual data. The two ANN schemes mentioned in [4], Multilayer Perceptron (MLP) and Volterra networks, have also been exploited for time series forecasting.
A multi-neural ANN comprising a master network with three sub-networks was evolved in [15] to forecast US Dollar and Taiwan Dollar (TWD) exchange rates and their dependence on five macro-economic factors. The MLP forecasting model produces efficient and accurate forecasts of USD/Euro exchange rates three days ahead, but its performance plunged with the intrusion of external factors [11]. In [24] an extension of the traditional application of Genetic Programming, in conjunction with trigonometric operators, was proposed in the domain of daily currency exchange rate prediction. High-order statistical functions for analyzing each system's performance using daily returns of the British Pound and Japanese Yen were proposed with a unique representation; it was shown that using high-order statistical functions integrated with trigonometric functions outperformed the traditional models. The trend of Genetic Programming based prediction models for Forex rate prediction started recently [24–26]. Before this, statistical models [22] and ANN and data mining concepts were employed [21].

2 Literature Review

Generative and developmental approaches to artificial neural networks dynamically change aspects of the network continuously during the problem solving phase. Nolfi introduced an indirect mapping ANN model [17]. In [10] a similar network was proposed, where a single cell utilizes the processes of mitosis and migration to form a 2-D neural network. The plastic neural model introduced in [12] could develop itself at run time, influenced by changes in the environment. In [14], Floreano et al. explored the behavioral sturdiness of synaptic plasticity by evolving neuro-controllers to solve a light switching scenario with no reward mechanism. The HyperNEAT encoding scheme in [5] was exploited for the evolution of synaptic weights and learning rule parameter sets, with poor testing results in the T-maze foraging bee scenario.
The learning capability of the agents was enhanced later in [7]. In [18] a developmental model of neurons, comprising seven chromosomes encoding various computational functions of a biological neuron, is presented to demonstrate learning with development. In [20] the interaction of Hebbian homo-synaptic plasticity with fast non-Hebbian hetero-synaptic plasticity is demonstrated to be sufficient for assembly formation, supporting memory recall in a spiking recurrent network model with excitatory and inhibitory neurons; blocking any component of plasticity prevented robust functioning as a memory network. The work here uses Cartesian Genetic Programming to obtain suitable computational functions for internal processing and developmental rules. The learning potential of the system is evaluated on the time series forecasting scenario of currency exchange. Promising results are obtained in this work, demonstrating robustness in dealing with dynamic scenarios. Miller pioneered Cartesian Genetic Programming (CGP) for the evolution of digital circuits in 1999 [13]. It comprises a two-dimensional graphical architecture, unlike the traditional tree based structure of genetic programming, with function nodes arranged in Cartesian format and interconnected in a feed-forward manner. CGP has been evaluated in diverse fields of application, generating fascinating and competitive results [13].

3 Plastic Cartesian Genetic Programming Evolved Recurrent Neural Network

Plasticity in the form of dynamic weights, topology and complexity of the network is incorporated in a Cartesian Genetic Programming (CGP) based Recurrent Neural Network (RNN) to explore the ability of online learning at runtime. The RNN provides the state information as part of the input parameters, thus making the network Markovian [8]. Plasticity is introduced by providing additional genes to make the developmental decisions [9].
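The Cartesian program representation described above can be sketched in a few lines: a genotype is a list of function nodes, each reading from the program inputs or from earlier nodes in a feed-forward manner. This is a generic illustration of CGP evaluation, not the authors' implementation; the node encoding and function set here are assumptions.

```python
def eval_cgp(genotype, inputs, functions):
    """Evaluate a minimal feed-forward CGP genotype.

    genotype: list of nodes (function_index, in1, in2), where in1/in2
    address earlier values (program inputs first, then previous nodes).
    The last node is taken as the program output.
    """
    values = list(inputs)
    for f_idx, a, b in genotype:
        values.append(functions[f_idx](values[a], values[b]))
    return values[-1]

# Hypothetical two-function node set for illustration.
FUNCS = [lambda x, y: x + y, lambda x, y: x * y]

# Genotype computing (x0 + x1) * x0: node 2 = x0 + x1, node 3 = node2 * x0.
GENOTYPE = [(0, 0, 1), (1, 2, 0)]
```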
The CGPRNN has a feedback mechanism, which feeds one or more system outputs back into the system. The general approach of Plastic CGPRNN is depicted in Fig. 1, which shows the basic CGPRNN block, illustrating the network parameters and the developmental gene that introduces plasticity into the network.

Fig. 1. A generalized approach of PCGPRNN

Development in the network occurs as a reflection of the system's weighted output passed through a log-sigmoid function. The uniqueness of the approach is that changes in the network take place in real time in response to the flow of data through the network, modifying its architecture, topology, complexity and weights.

4 Experimental Setup

The PCGPRNN currency forecaster model introduced here exploits currency exchange rate data acquired from the Australian Reserved Bank (ARB) for training and testing of the model. Daily exchange rates of US Dollars are considered for up to 500 days to train the model. Testing is performed on an independent set of exchange rate data covering 1000 days, for the following currencies: Korean Won (KW), Indonesian Rupiah (IDR), Canadian Dollars (CAD), Singapore Dollars (SGD), New Zealand Dollars (NZD), Taiwanese Dollars (TWD), Great Britain Pounds (GBP), Euros (EUR), Swiss Franc (CHF), Japanese Yen (YEN) and Malaysian Ringgits (MR). Initially, random populations of PCGPRNN networks are produced for training purposes; these networks develop during the run time of a particular generation. Ten independent networks with different genotype sizes, each working on five independent random seeds, are introduced for training purposes. The maximum number of generations is restricted to one million in each training phase.
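Section 3 states that development is triggered by the system's weighted output passed through a log-sigmoid function, but the exact rule is not given in the text; the sketch below therefore assumes a simple threshold on that gate to decide whether a developmental (mutation) step fires. The function names and the threshold value are hypothetical.

```python
import math

def log_sigmoid_gate(weighted_output):
    """Squash the system's weighted output into (0, 1). The paper states
    development reflects this value; the decision rule is assumed."""
    return 1.0 / (1.0 + math.exp(-weighted_output))

def maybe_develop(genotype, weighted_output, mutate, threshold=0.5):
    """If the gate fires, return a mutated copy of the genotype (a
    runtime topology change); otherwise leave it unchanged."""
    if log_sigmoid_gate(weighted_output) > threshold:
        return mutate(genotype)
    return genotype
```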
The optimal trained networks are then evaluated for their performance on ten different currencies. Five inputs are allowed per neuron, log-sigmoid is used as the activation function, the system inputs are ten (10) in number, and the mutation rate is set at 10%. The evolutionary strategy used is 1 + λ, with λ set to 9, representing the number of offspring. These parameters are based on the previous performance of CGPANN and CGPRNN [2, 3, 8].

5 Results and Analysis

Experimentation is performed on offline historical data, and performance is obtained from the difference between estimated and actual values during the training and testing processes. The system takes ten days of daily averaged currency values as input, and the eleventh day's currency value is estimated. Once the optimal system is achieved during the training phase, it is tested on new data sets, keeping the historical output values hidden from the system. The system predicts the unknown eleventh day's exchange rate, which is compared with the actual exchange rate to assess the system's performance. The experiments are carried out for the network architecture described above, and the results are shown in Tables 1, 2, 3 and 4. Table 1 lists the results of the PCGPRNN forecaster model during the training phase in terms of Mean Absolute Percentage Error (MAPE) values. The performance of the network is analyzed, and the best performance is attained by the 150-node network, securing a MAPE value of 1.537. Table 2 shows the performance of the PCGPRNN model during the testing phase. The results show that the best results are accomplished on the Korean Won (KW) data set for the network of 150 nodes, securing a MAPE value of 1.1315.

Table 1. Training phase results of PCGPRNN

Nodes  MAPE
50     1.715
100    1.698
150    1.537
200    1.700
250    2.928
300    1.621
350    1.571
400    1.574
450    2.010
500    1.541

Table 2.
Testing phase results of PCGPRNN model (MAPE; columns are network sizes in nodes)

Data  50    100   150   200   250   300   350   400   450   500
SDR   1.71  1.69  1.53  1.70  2.92  1.62  1.57  1.57  2.01  1.54
CNY   2.45  2.21  2.28  2.25  3.21  2.40  2.26  2.26  3.07  2.29
IDR   1.89  1.61  1.56  1.69  3.18  1.80  1.55  1.63  2.59  1.57
KW    1.79  1.18  1.13  1.23  3.91  3.60  1.14  1.40  2.82  1.13
TD    2.34  1.50  1.45  1.59  5.14  4.13  1.45  1.80  3.84  1.45
MR    1.90  1.68  1.62  1.75  3.10  7.32  1.62  1.68  2.48  1.63
HKD   1.87  1.58  1.53  1.67  3.11  7.58  1.52  1.61  2.57  1.54
CAD   1.98  1.62  1.59  1.74  3.22  7.99  1.57  1.67  2.79  1.61
NZD   5.08  1.76  1.68  1.85  5.55  3.93  1.69  2.05  4.10  1.68
CHF   7.38  1.85  1.73  1.98  8.22  3.63  1.75  2.19  4.52  1.73
GBP   15.9  1.84  1.74  1.94  26.4  7.48  1.73  1.85  2.87  1.75
EUR   15.1  1.82  1.73  1.87  25.6  7.14  1.73  1.78  2.51  1.73
TWI   12.1  2.24  2.11  2.27  21.4  5.97  2.11  2.22  3.33  2.11
YEN   17.2  1.56  1.52  1.68  28.2  8.03  1.51  1.59  2.58  1.54
Avg   11.5  1.36  1.30  1.43  20.9  5.59  1.29  1.49  2.75  1.31

Table 3 presents a comparison of the accuracy of the PCGPRNN forecaster model with contemporary ANN models introduced previously for similar exchange rates. PCGPRNN, at 98.87%, appears to outperform all of them. Note that all the other networks are static, whereas PCGPRNN continues to change at runtime in response to the input data patterns.

Table 3. Comparison of PCGPRNN with contemporary ANN models

Network                         Accuracy
Multi Layer Perceptron [4]      72
Volterra Network [4]            76
AFERFM [1]                      81.2
HFERFM [1]                      69.9
Back Propagation Network [15]   62.27
Multi Neural Network [15]       66.82
CGPANN [2]                      98.84
PCGPRNN (Proposed)              98.87

Learning with Development. In order to evaluate the 'learning on the fly' capability of the network, we have tested the network for its performance in a completely new scenario. We have evaluated the performance of PCGPRNN in predicting more days ahead (i.e. 7, 10, 15, 30, 60) rather than a single day. Table 4 shows the MAPE values of the proposed PCGPRNN model for multiple days' prediction. It can be observed that the proposed model performs well in the advance prediction scenarios as well.
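The MAPE metric reported in Tables 1, 2 and 4, and the ten-day sliding window used to form each (input, eleventh-day target) pair, can be sketched as follows. This is an illustration of the stated setup, not the authors' code:

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, the metric reported above."""
    return 100.0 / len(actual) * sum(
        abs(a - p) / abs(a) for a, p in zip(actual, predicted))

def sliding_windows(rates, window=10):
    """Split a rate series into (10-day input, 11th-day target) pairs,
    matching how the forecaster is fed during training and testing."""
    return [(rates[i:i + window], rates[i + window])
            for i in range(len(rates) - window)]
```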
Table 4. Testing results of PCGPRNN model for multiple days' prediction (MAPE)

Currency  7       10      15      30      60
SDR       3.8298  3.9776  4.5585  5.8555  7.9523
CNY       3.0987  3.4626  4.0178  5.3280  7.5935
IDR       2.2317  2.5736  3.3274  4.7754  7.2544
KW        2.9470  3.2516  3.9662  5.8721  9.3758
TD        3.1521  3.5315  4.1070  5.6000  7.9183
MR        3.0304  3.3461  3.8826  5.1105  7.2978
HKD       3.1586  3.4931  3.9919  5.2200  7.1626
CAD       3.2143  3.9818  4.8699  6.2117  8.8012
NZD       3.8877  4.4869  5.4938  6.5251  9.9959
CHF       3.4591  4.3415  5.3881  7.0680  8.6958
GBP       3.3462  4.0830  4.8516  5.5046  8.0757
EUR       4.2062  5.0777  6.4202  7.9186  9.8737
TWI       2.9801  3.3947  3.8889  5.0798  6.6557
YEN       2.7573  3.1262  3.8094  4.8327  6.2890

6 Conclusion and Future Enhancements

We have advanced the forecasting of foreign currency volatility in the global market by combining the power of neuro-evolution with DOL. The recurrent models used in this work exhibit self-modifying and orientation capabilities. Cartesian Genetic Programming is explored to encode the dynamic computational networks, which are evolved for their learning behavior at runtime in the proposed system. It incorporates synaptic as well as developmental plasticity in Recurrent Neural Networks. The resulting model, the Plastic Cartesian Genetic Programming evolved Recurrent Neural Network (PCGPRNN), exploits the trends in foreign exchange to predict upcoming currency exchange rates, while developing its topology, synaptic connectivity and other architectural components, including the input vector, 'on the fly'. The results demonstrated the system to be robust and able to learn on the fly to predict the volatile nature of foreign exchange rates.

References

1. Philip, A.A., Tofiki, A.A., Bidemi, A.A.: Artificial neural network model for forecasting foreign exchange rate. World Comput. Sci. Inf. Technol. J. 1(3), 110–118 (2011)
2. Khan, G.M., Nayab, D., Mehmud, S.A., Zafar, M.H.: Evolving dynamic forecasting model for foreign currency exchange rates using plastic neural networks.
In: IEEE 12th International Conference on Machine Learning and Applications, ICMLA (2013)
3. Nayab, D., Muhammad Khan, G., Mahmud, S.A.: Prediction of foreign currency exchange rates using CGPANN. In: Iliadis, L., Papadopoulos, H., Jayne, C. (eds.) EANN 2013. CCIS, vol. 383, pp. 91–101. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41013-0_10
4. Kryuchin, O.V., Arzamastsev, A.A., Troitzsch, K.G.: The prediction of currency exchange rates using artificial neural networks. Exch. Organ. Behav. Teach. J., no. 4 (2011)
5. Risi, S., Stanley, K.O.: Indirectly encoding neural plasticity as a pattern of local rules. In: Doncieux, S., Girard, B., Guillot, A., Hallam, J., Meyer, J.-A., Mouret, J.-B. (eds.) SAB 2010. LNCS (LNAI), vol. 6226, pp. 533–543. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15193-4_50
6. Kadilar, C., Alada, H.: Forecasting the exchange rate series with ANN: the case of Turkey. Econ. Stat. Chang. 9, 17–29 (2009)
7. Galeshchuk, S., Mukherjee, S.: Deep networks for predicting direction of change in foreign exchange rates. Intell. Syst. Account. Financ. Manag. 24, 100–110 (2017)
8. Khan, M.M., Khan, G.M., Miller, J.F.: Efficient representation of recurrent neural networks for Markovian/non-Markovian non-linear control problems. In: International Conference on Intelligent Systems Design and Applications, pp. 615–620 (2010)
9. Khan, M.M., Khan, G.M., Miller, J.F.: Developmental plasticity in Cartesian genetic programming artificial neural networks. In: Proceedings of the International Conference on Informatics in Control, Automation and Robotics, pp. 449–458 (2011)
10. Cangelosi, A., Nolfi, S., Parisi, D.: Cell division and migration in a 'genotype' for neural networks. Netw. Comput. Neural Syst. 5, 497–515 (1994)
11. Pacelli, V., Bevilacqua, V., Azzollini, M.: An artificial neural network model to forecast exchange rates. J. Intell. Learn. Syst. Appl. 3(2A), 57–69 (2011)
12.
Upegui, A., Perez-Uribe, A., Thoma, Y., Sanchez, E.: Neural development on the Ubichip by means of dynamic routing mechanisms. In: Hornby, G.S., Sekanina, L., Haddow, P.C. (eds.) ICES 2008. LNCS, vol. 5216, pp. 392–401. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85857-7_35
13. Miller, J.F.: Cartesian Genetic Programming. Natural Computing Series. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-17310-3
14. Floreano, D., Urzelai, J.: Evolutionary robots with on-line self-organization and behavioral fitness. Neural Netw. 13(4), 431–443 (2000)
15. Chen, A.P., Hsu, Y.C., Hu, K.F.: A hybrid forecasting model for foreign exchange rate based on a multi-neural network. In: Fourth International Conference on Natural Computation, ICNC, vol. 5, pp. 293–298 (2008)
16. Coleman, O.J., Blair, A.D.: Evolving plastic neural networks for online learning: review and future directions. In: Thielscher, M., Zhang, D. (eds.) AI 2012. LNCS (LNAI), vol. 7691, pp. 326–337. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35101-3_28
17. Nolfi, S., Miglino, O., Parisi, D.: Phenotypic plasticity in evolving neural networks. In: Proceedings of the International Conference from Perception to Action, pp. 146–157. IEEE Press (1994)
18. Khan, G.M., Miller, J.F., Halliday, D.M.: A developmental model of neural computation using Cartesian genetic programming. In: Proceedings of the Genetic and Evolutionary Computation Conference (Companion), pp. 2535–2542. ACM (2007)
19. Massobrio, P., et al.: In vitro studies of neuronal networks and synaptic plasticity in invertebrates and in mammals using multielectrode arrays. Neural Plast. (2015)
20. Zenke, F., Agnes, E.J., Gerstner, W.: Diverse synaptic plasticity mechanisms orchestrated to form and retrieve memories in spiking neural networks. Nat. Commun. 6, 6922 (2015)
21. Ravi, V., Lal, R., Kiran, N.R.: Foreign exchange rate prediction using computational intelligence methods. Int. J. Comput. Inf. Syst. Ind. Manag.
Appl. 4, 659–670 (2012)
22. FOREX Tutorial: Economic Theories, Models, Feeds & Data. http://www.investopedia.com/university/forexmarket/forex5.asp. Accessed Sep 2017
23. Patel, P.J., Patel, N.J., Patel, A.R.: Factors affecting currency exchange rate, economical formulas and prediction models. International Journal of Application or Innovation in Engineering & Management 3, 53–56 (2014)
24. Schwaerzel, R., Bylander, T.: Predicting currency exchange rates by genetic programming with trigonometric functions and high-order statistics. In: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation. ACM (2006)
25. Alvarez Diaz, M.: Speculative strategies in the foreign exchange market based on genetic programming predictions. Appl. Financ. Econ. 20(6), 465–476 (2010)
26. Shylajan, C.S., Sreejesh, S., Suresh, K.G.: Rupee-dollar exchange rate and macroeconomic fundamentals: an empirical analysis using flexible-price monetary model. J. Int. Bus. Econ. 12(2), 89–105 (2011)
27. Shioda, K., Deng, S., Sakurai, A.: Prediction of foreign exchange market states with support vector machine. In: 2011 10th International Conference on Machine Learning and Applications and Workshops (ICMLA), vol. 1. IEEE (2011)

Noise Masking Recurrent Neural Network for Respiratory Sound Classification

Kirill Kochetov, Evgeny Putin, Maksim Balashov, Andrey Filchenkov, and Anatoly Shalyto

Computer Technologies Lab, ITMO University, 49 Kronverksky Pr., 197101 St. Petersburg, Russia
{kskochetov,eoputin,balashov,afilchenkov,shalyto}@corp.ifmo.ru

Abstract. In this paper, we propose a novel architecture called noise masking recurrent neural network (NMRNN) for lung sound classification. The model jointly learns to extract only important respiratory-like frames without redundant noise and then, by exploiting this information, is trained to classify lung sounds into four categories: normal, containing wheezes, containing crackles, and containing both wheezes and crackles. We compare the performance of our model with machine learning based models. As a result, the NMRNN model reaches state-of-the-art performance on a recently introduced publicly available respiratory sound database.

Keywords: Respiratory sound classification · Recurrent neural networks · Deep learning

1 Introduction

In the last decades many machine learning (ML) approaches have been introduced to analyze respiratory cycle sounds, including crackles, coughs and wheezes [1–6]. However, almost all conventional ML models rely solely on hand-crafted features. Furthermore, highly complex preprocessing steps are required to make use of the designed features [4–6]. Thus, merely ML-based models may not be robust to external/internal noises in lung sounds and may not generalize their performance across different software and measuring devices. However, to be used in clinics, respiratory tracking systems have to reach high classification accuracy. From that perspective, deep learning (DL) models [7] have gained a lot of attention in the community. DL-based models primarily rely on high-level abstract representations of the data that are learned through the training of the models. Due to this fact, DL models reach state-of-the-art performance on a range of tasks including image recognition [8], speech recognition [9] and time series forecasting [10]. In this work, we propose an architecture of recurrent neural network (RNN) called NMRNN that is trained in an end-to-end manner to simultaneously detect noise in respiratory cycles and to classify lung sounds into several categories: normal, wheezes, crackles, or wheezes and crackles. In other words, our model
© Springer Nature Switzerland AG 2018. V. Kůrková et al.
(Eds.): ICANN 2018, LNCS 11141, pp. 208–217, 2018. https://doi.org/10.1007/978-3-030-01424-7_21
itself decides what information, and from what time points, it should use to make effective predictions of respiratory sounds. The crucial feature of the model is that it is trained without applying any hand preprocessing stages, such as slicing of individual respiratory cycles. Through extensive testing, the proposed model has reached state-of-the-art performance on a recently published large open database of lung sound records [11]. The rest of the paper is organized as follows. In Sect. 2, we review several notable works on respiratory sound classification using ML and DL based models. A detailed description of NMRNN is given in Sect. 3. Sections 4 and 5 present results and a comparative study with solely ML-based models. Conclusions are presented in Sect. 6.

2 Related Work

Recently, a comprehensive comparative study of applying different ML models to automatic wheeze detection was done in [4]. The authors used many models, including a feed-forward neural network, random forest (RF) and support vector machine (SVM), and trained them on two datasets: phonopneumogram samples and the Dubrovnik General Hospital (DGH) dataset. To reduce the influence of cardiovascular and muscular noise, they applied a Yule-Walker filter followed by an STFT procedure. Then, two types of features were extracted from the lung sounds: MFCC (Mel-frequency cepstral coefficient) features and some statistical features. The authors reported that their best model with statistical features got 93.62% and 91.77% accuracy on the phonopneumogram and DGH datasets, respectively. Meanwhile, based on MFCC features, an SVM model reached 99% accuracy on both datasets.
In [12], the authors proposed to use hidden Markov models (HMMs) coupled with Gaussian mixture models (GMMs) for the classification of respiratory sounds into four categories: normal, containing wheezes, crackles, and both crackles and wheezes. The main idea behind applying HMMs was that they can take into account frame position in a sequence, which leads to better accuracy compared to GMMs. To handle noise in the sound records, they applied a spectral subtraction technique [13]. MFCCs extracted from the records were used as input features to the model. In addition to the MFCC features, obtained in the range from 50 Hz to 2000 Hz, the first time derivatives of the MFCCs were used to track feature dynamics and to decorrelate the feature vectors, resulting in a feature set of size 30. As a result, an ensemble of 28 HMMs with 5 states and 1 Gaussian per state achieved scores of 0.495 and 0.396 on the cross-validation and second evaluation, respectively. In both experiments different patients were used for training and testing, so this was a fair validation and we can compare these results with ours.

One of the most successful attempts at applying DL models to respiratory sound classification was made in [14]. The authors used convolutional neural networks (CNNs) to detect wheezes in lung sound records. First, the respiratory records were augmented by shifting the sound samples by several time frames.

210 K. Kochetov et al.

Then, STFT features were computed, followed by standard normalization. Lastly, the normalized spectrograms of lung sounds were used to train a 2D CNN. The final model achieved 99% accuracy and 0.96 AUC on the dataset.

3 Method

RNNs are a class of artificial neural networks (ANNs) able to process temporal data such as sound and text. RNNs can use their internal state (memory) and feedback to process sequences of inputs. LSTM (long short-term memory) and GRU (gated recurrent unit) networks [15,16] are popular variants of RNNs.
They have shown strong performance on sequence-related tasks such as NLP (natural language processing) [17] and speech recognition [18]. We use both LSTM and GRU units in our experiments.

NMRNN is based on three main ideas:
1. Adapt RNNs, which are designed for time-scale data and can consider all the information from the sequential frames of the input signal.
2. Distinguish noise and content automatically during training.
3. Make predictions using only breath (without noise), because noise can include biased anomalies similar to wheezes or crackles.

Fig. 1. NMRNN architecture. The stacked noise RNN predicts one noise label per frame using the original MFCC data. The MASK block adds an attention mechanism over the most important frames with respiratory cycles. The stacked anomalies RNN predicts one anomaly label per sample using the highlighted data from the MASK block.

The NMRNN model consists of three parts: a noise classifier, a respiratory (or anomaly) classifier, and a kind of attention called MASK. A schematic overview of the model is shown in Fig. 1.

First of all, before model training, each sound sample was split into frames of equal length. There is one anomaly label per sound sample and one noise label per frame. The noise classifier is a stacked RNN called NRNN, which predicts a noise label for every frame of the sample. During training, NRNN optimizes a cross-entropy loss calculated for each output:

LCE(p, q) = −∑x p(x) × log(q(x)). (1)

The predicted noise labels then propagate through a masking layer called MASK, where each original frame is multiplied by a masking coefficient: the output is (1 − X) × Y, where X is the predicted noise label (X = 1 for a noise frame) and Y is the frame. The anomaly classifier is a stacked RNN called ARNN, which predicts one anomaly label per sample (over all frames). ARNN takes the highlighted frames from the MASK block as input data and optimizes a cross-entropy loss for one label per sample.
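The masking step above can be sketched in a few lines of NumPy. This is only an illustration of the (1 − X) × Y coefficient, not the authors' implementation, and the array shapes are assumptions:

```python
import numpy as np

def mask_frames(frames, noise_probs):
    """MASK block: multiply each frame Y by (1 - X), where X is the
    predicted noise label (X = 1 for a pure-noise frame)."""
    # frames: (T, F) matrix of T frames with F features each
    # noise_probs: (T,) vector of per-frame noise predictions in [0, 1]
    return (1.0 - noise_probs)[:, None] * frames

frames = np.array([[1.0, 2.0],    # respiratory-like frame
                   [3.0, 4.0]])   # frame predicted as noise
noise_probs = np.array([0.0, 1.0])
masked = mask_frames(frames, noise_probs)
# -> the noisy frame is zeroed out, the clean frame passes unchanged
```

In the full model, gradients flowing through the masked frames reach both the anomaly and the noise classifier, which is what lets the mask be learned jointly.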
The final loss of the proposed architecture is:

Lmodel = a1 × LCEnoise + a2 × LCEanom. (2)

The values of the coefficients a1 and a2 reflect the idea that the main goal of the model is anomaly classification, not noise classification.

The proposed MASK mechanism is simple and efficient, and was inspired by the gating technique used in the GRU cell, where memory is rewritten at each time step using only the important information from the input. The NRNN parameters are optimized using both the NRNN and ARNN losses, so together the NRNN and MASK mechanisms not only mask noise frames but also highlight useful subsamples with respiratory-like content. The attention mechanism used in the current model is not the same as that usually used in seq2seq models [19]. The main difference is that seq2seq attention commonly creates a context vector as a weighted sum of the encoder hidden states and combines it with the current decoder hidden state; attention in seq2seq thus extends the sight of the decoder during sequence prediction. Our MASK layer relies on both the predicted noise and anomaly labels, because it receives gradients from both RNN blocks. We conducted an additional experiment to show that the model with the MASK mechanism outperforms the model without it in terms of classification metrics.

The main feature of the NMRNN method is its ability to perform end-to-end classification without any manual preprocessing steps such as slicing breath into separate cycles. The only commonly used preprocessing step we applied was splitting the data into equal frames. The number of frames does not affect model training or testing either.

4 Experiments

In this study, logistic regression (LR), random forest (RF), a gradient boosting machine (GBM), an SVM-based classifier [20], and a standard RNN were used as baselines for comparison with the NMRNN model. For the baseline experiments, we used the same preprocessing as in [4].
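The weighted loss of Eq. 2 can be sketched as follows (NumPy, with the coefficient values a1 = 0.3 and a2 = 0.7 reported later in Sect. 5; the epsilon for numerical stability is our own addition):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # Eq. 1: L_CE(p, q) = -sum_x p(x) * log q(x)
    return -np.sum(p * np.log(q + eps))

def model_loss(noise_true, noise_pred, anom_true, anom_pred,
               a1=0.3, a2=0.7):
    # Eq. 2: L_model = a1 * L_CE_noise + a2 * L_CE_anom
    return (a1 * cross_entropy(noise_true, noise_pred)
            + a2 * cross_entropy(anom_true, anom_pred))

# perfect predictions give (near-)zero loss
loss = model_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                  np.array([0.0, 1.0]), np.array([0.0, 1.0]))
```

In the paper's setup, the noise term is summed over every frame of a sample while the anomaly term is computed once per sample.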
4.1 Database

For training and evaluation, the ICBHI Scientific Challenge database was used [11]. The database contains audio samples collected independently by two research teams in two different countries over several years. It consists of 920 annotated audio samples from 126 patients and includes 6898 respiratory cycles: 1864 with crackles, 886 with wheezes, and 506 with both crackles and wheezes. The database summary is presented in Table 1. There is a lot of noise in the recordings: 1840 noise cycles in all the data and 1366 in the AKGC417L data. This simulates real-life conditions and makes the classification algorithm more robust and stable against noise.

Table 1. Database summary. The Recordings columns include statistics about separate sound recordings; the Cycles columns include statistics about individual respiratory cycles.

Num of                 Recordings               Cycles
                       All equipment  AKGC417L  All equipment  AKGC417L
Patients               126            56        126            56
Samples                920            683       6898           4697
Normal breath          287            196       3642           2226
Wheezes                134            77        886            512
Crackles               297            252       1864           1578
Wheezes and crackles   202            158       506            381

4.2 Experiments Setup

In this work, we conducted several experiments, using different data and preprocessing steps for each. The key idea of all the experiments is to compare the proposed approach with other machine learning models in different situations in terms of performance and robustness.

1. A simple binary noise classification experiment for an initial model check.
2. 4-class anomaly classification using an individual respiratory cycle as input.
3. 4-class anomaly classification using sound samples with several respiratory cycles each (end-to-end classification).

The aim of the first experiment is to check the ability of RNN and NMRNN to learn respiratory and noise cycle interval lengths and frequencies. The second experiment compares our baseline models with the recently proposed method of [12].
The second experiment is illustrative, but it has one critical limitation: it is not an end-to-end experiment, because the lung sounds first have to be split into respiratory cycles, and there is no automatic universal solution for this task yet. So, each new lung sound record would need to be split into respiratory cycles manually.

For this reason, the third experiment was conducted. Its aim is to check the ability of the models to find what input information is important and where it is located in the multidimensional feature space. As an end-to-end classifier, the model needs to find the respiratory-dependent features in the data by itself.

There are also two data variants for each experiment: we use all available data, and data recorded only with the AKGC417L microphone. The main idea of the second variant is to show that the models can achieve better performance using a single, unbiased data source. All experiments were conducted on a computer with an Intel Core i7-6900 CPU, 128 GB of RAM, and an NVIDIA GTX 1080Ti GPU.

4.3 Result Evaluation

Because the data set is unbalanced, we used sensitivity and specificity as statistical indicators of the models' performance. Sensitivity, specificity, and the overall score were proposed in the original data set papers [11,12]. The overall evaluation score is:

Score = (Sensitivity + Specificity) / 2. (3)

We used 5-fold cross-validation over patients to evaluate the results; it is important to note that no patient from the training set appears in the test set in any split. So, we used a fair, realistic division of the data for validation.

4.4 Preprocessing

To remove sounds caused by heartbeats, the signal components at low frequencies have to be suppressed. We use a high-pass finite impulse response (FIR) filter with cutoff frequency fc = 100 Hz to remove heartbeat sounds [12]. In this work, MFCC was used as the feature extractor.
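The evaluation protocol (Eq. 3 plus a patient-wise split) can be sketched as follows; `patient_folds` is a hypothetical helper written for illustration, not the authors' code:

```python
import numpy as np

def overall_score(sensitivity, specificity):
    # Eq. 3: Score = (Sensitivity + Specificity) / 2
    return (sensitivity + specificity) / 2.0

def patient_folds(patient_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs such that no patient
    appears in both the train and test parts of a split."""
    patients = np.unique(patient_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(patients)
    for test_patients in np.array_split(patients, n_folds):
        test_mask = np.isin(patient_ids, test_patients)
        yield np.where(~test_mask)[0], np.where(test_mask)[0]

# e.g. sensitivity 0.584 and specificity 0.73 average to a 0.657 score
score = overall_score(0.584, 0.73)
```

Grouping the split by patient, rather than by recording or cycle, is what prevents the optimistic bias of testing on sounds from patients seen during training.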
The lower and upper frequencies of the processed content were cut to 50 and 2000 Hz, respectively, because wheezes and crackles lie in this interval [12]. The frame length and frame step parameters were both set to 0.05 s using grid search optimization [21]. Every sound sample from the original database was sliced into pieces of 0.5 s each, and every piece was split into 10 non-overlapping frames. One MFCC set (13 values) was extracted from each frame, so every piece is described by 130 MFCC features. Each frame and sample corresponds to a breathing (presence of anomaly) label and a noise label. There are four breathing classes in the database: normal breathing, breathing with wheezes, with crackles, and with both wheezes and crackles.

During anomaly classification using all frames (one label per sound) or a subset of frames (one label per respiratory cycle), we want to predict the existence of anomalies in the overall sound sample or in a single respiratory cycle, respectively. So, for the baseline models, each sound sample or respiratory cycle was reshaped into a single flattened array. To account for different audio lengths, the final data samples were cut or filled using a standard padding technique. An augmentation technique with shifting (proposed in [14]) was also used to address the localization of respiratory cycles. PCA (principal component analysis) was used for dimensionality reduction (only for the baseline models).

5 Results

For the binary noise classification task, NMRNN achieved a 0.89 evaluation score, compared with the best baseline model, GBM, which reached only 0.53. This can be explained by the ability of an RNN to learn cycle and noise interval lengths and frequencies and to use this additional information during prediction.

Table 2. Results of 4-class classification of each respiratory cycle.
Metrics of the Jakovljevic HMM were not provided for the AKGC417L data.

                   All equipment           AKGC417L
Model              Sens   Spec   Score     Sens   Spec   Score
GBM                0.476  0.554  0.515     0.534  0.568  0.551
LR                 0.425  0.508  0.466     0.426  0.51   0.468
RF                 0.438  0.538  0.488     0.483  0.521  0.502
SVM                0.49   0.502  0.496     0.502  0.518  0.51
Jakovljevic [12]   0.423  0.567  0.495     -      -      -
RNN (ours)         0.584  0.73   0.657     0.617  0.741  0.679

The results of the 4-class classification of each respiratory cycle are presented in Table 2, which compares our baseline and NMRNN models with the HMM-based method proposed by Jakovljevic. All models were trained on MFCC features. The performance of our models is similar to that of the Jakovljevic HMM [12], except for NMRNN, which outperforms the competitors. It is therefore fair to compare the presented baseline models with the proposed RNN-based approach in the next experiment. Also, the models trained only on AKGC417L data show better scores, as expected, due to the reduced bias of the data distribution. The second experiment is less complex than the third one, because the data were manually sliced into respiratory cycles before training.

The results of end-to-end classification are given in Table 3. NMRNN clearly outperforms the other methods with respect to the chosen criterion. The main reason is that RNNs are designed to process this kind of data with temporal dependencies, whereas the other models struggle with the large dimensionality and the localization of respiratory cycles.

Table 3. Results of 4-class classification of each sound sample.

                All equipment           AKGC417L
Model           Sens   Spec   Score     Sens   Spec   Score
GBM             0.362  0.142  0.252     0.348  0.174  0.261
LR              0.348  0.184  0.266     0.366  0.236  0.301
RF              0.433  0.054  0.244     0.451  0.079  0.265
SVM             0.313  0.251  0.282     0.278  0.256  0.267
RNN (ours)      0.511  0.717  0.614     0.572  0.728  0.65
NMRNN (ours)    0.56   0.736  0.648     0.62   0.75   0.685
Neither PCA nor augmentation helps to solve these problems, because the baseline models are not adapted to unstable data with floating content, such as sound with several respiratory cycles. The MASK block with noise classification increases performance by about 0.035 in terms of score. This can be explained by the ability of the final model to concentrate only on frames with respiratory cycles rather than noise. The MASK block also helps to distinguish false-positive anomalies (biased noise) from real anomalies (crackles or wheezes), as shown in Fig. 2.

Fig. 2. Confusion matrices of RNN and NMRNN. The MASK block helps to resolve some sample similarities by masking false-positive anomalies detected in noise frames; as a result, both sensitivity and specificity were improved.

The models trained only on AKGC417L data show better performance, as in the previous experiments. This suggests that the model can be adapted to a single source and can, in theory, boost performance as the amount of unbiased training data increases.

We used grid search [21] as the optimization algorithm for finding the best hyperparameters of the baseline and RNN-based models. The best RNN-based model with the MASK block consists of 2-layer RNNs for both the NRNN and ARNN parts, with GRU cells of 256 units each. The coefficients a1 and a2 from Eq. 2 are 0.3 and 0.7, respectively, which corresponds to the main task of the model (anomaly classification). The overall architecture was trained using the Adam optimizer [22] with a learning rate of 0.0001.

6 Conclusion

In this paper, we proposed an RNN-based end-to-end model architecture called NMRNN that detects different anomalies in lung sound data while masking noise. The MASK block is powerful: it allows the model to consider only the relevant frames during classification. We believe that the trained MASK mechanism is a promising direction for further improvement.
The main contribution of this approach is that it is trained without any manual preprocessing steps, using respiratory records of any length. NMRNN reaches state-of-the-art performance compared with other ML models on the respiratory sound classification task and, including the recently proposed method of [12], on the individual respiratory cycle classification task. This study also shows the ability of the model to learn the lengths and frequencies of respiratory cycles and noise intervals. The experiments with the AKGC417L microphone motivate concentrating on a single data source when creating an approach applicable in real-life conditions.

Acknowledgements. This work was financially supported by the Government of the Russian Federation, Grant 08-08.

References

1. Bahoura, M., Pelletier, C.: Respiratory sounds classification using cepstral analysis and Gaussian mixture models. In: 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEMBS 2004, vol. 1, pp. 9–12. IEEE (2004)
2. Mayorga, P., Druzgalski, C., Morelos, R.L., Gonzalez, O.H., Vidales, J.: Acoustics based assessment of respiratory diseases using GMM classification. In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 6312–6316. IEEE (2010)
3. Palaniappan, R., Sundaraj, K., Sundaraj, S.: A comparative study of the SVM and K-NN machine learning algorithms for the diagnosis of respiratory pathologies using pulmonary acoustic signals. BMC Bioinform. 15(1), 223 (2014)
4. Milicevic, M., Mazic, I., Bonkovic, M.: Classification accuracy comparison of asthmatic wheezing sounds recorded under ideal and real-world conditions. In: 15th International Conference on Artificial Intelligence, Knowledge Engineering and Databases (AIKED 2016), Venice (2016)
5. Rocha, B.M., Mendes, L., Chouvarda, I., Carvalho, P., Paiva, R.P.: Detection of cough and adventitious respiratory sounds in audio recordings by internal sound analysis.
In: Maglaveras, N., Chouvarda, I., de Carvalho, P. (eds.) Precision Medicine Powered by pHealth and Connected Health. IP, vol. 66, pp. 51–55. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-7419-6_9
6. Serbes, G., Ulukaya, S., Kahya, Y.P.: An automated lung sound preprocessing and classification system based on spectral analysis methods. In: Maglaveras, N., Chouvarda, I., de Carvalho, P. (eds.) Precision Medicine Powered by pHealth and Connected Health. IP, vol. 66, pp. 45–49. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-7419-6_8
7. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
8. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI, vol. 4, p. 12 (2017)
9. Palaz, D., Magimai-Doss, M., Collobert, R.: Analysis of CNN-based speech recognition system using raw speech as input. Technical report, Idiap (2015)
10. Weigend, A.S.: Time Series Prediction: Forecasting the Future and Understanding the Past. Routledge, New York (2018)
11. Rocha, B.M., et al.: A respiratory sound database for the development of automated classification. In: Maglaveras, N., Chouvarda, I., de Carvalho, P. (eds.) Precision Medicine Powered by pHealth and Connected Health. IP, vol. 66, pp. 33–37. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-7419-6_6
12. Jakovljević, N., Lončar-Turukalo, T.: Hidden Markov model based respiratory sound classification. In: Maglaveras, N., Chouvarda, I., de Carvalho, P. (eds.) Precision Medicine Powered by pHealth and Connected Health. IP, vol. 66, pp. 39–43. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-7419-6_7
13. Berouti, M., Schwartz, R., Makhoul, J.: Enhancement of speech corrupted by acoustic noise. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 1979, vol. 4, pp. 208–211.
IEEE (1979)
14. Kochetov, K., Putin, E., Azizov, S., Skorobogatov, I., Filchenkov, A.: Wheeze detection using convolutional neural networks. In: Oliveira, E., Gama, J., Vale, Z., Lopes Cardoso, H. (eds.) EPIA 2017. LNCS (LNAI), vol. 10423, pp. 162–173. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-65340-2_14
15. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
16. Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)
17. Sundermeyer, M., Schlüter, R., Ney, H.: LSTM neural networks for language modeling. In: Thirteenth Annual Conference of the International Speech Communication Association (2012)
18. Graves, A., Mohamed, A., Hinton, G.: Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649. IEEE (2013)
19. Luong, M.-T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015)
20. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. SSS. Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7
21. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012)
22. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

Lightweight Neural Programming: The GRPU

Felipe Carregosa1(B), Aline Paes2, and Gerson Zaverucha1
1 Department of Systems Engineering and Computer Science, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brazil
{fborda,gerson}@cos.ufrj.br
2 Department of Computer Science, Institute of Computing, Universidade Federal Fluminense, Niterói, RJ, Brazil
alinepaes@ic.uff.br

Abstract.
Deep learning techniques have achieved impressive results over the last few years. However, they still have difficulty producing understandable results that clearly show the embedded logic behind the inductive process. One step in this direction is the recent development of neural differentiable programmers. In this paper, we design a neural programmer that can be easily integrated into existing deep learning architectures, with an amount of parameters similar to a single commonly used recurrent neural network. Tests conducted with the proposal suggest that it has the potential to induce algorithms even without any kind of special optimization, achieving competitive results on problems handled by more complex RNN architectures.

Keywords: Recurrent neural networks · Neural differentiable programmers

1 Introduction

Recently there has been renewed interest in merging traditional programming and neural networks (NNs), particularly thanks to more advanced automatic differentiation (AD) tools [8]. These tools can evaluate functions written in the host language's idiomatic structures, allowing programmers to easily and efficiently obtain the gradient of varied units of code with respect to their arguments. This makes it possible to augment the programming toolset with machine learning capabilities.

With a similar goal, neural differentiable programmers (NDPs) [9,11] have been developed to allow NNs to compose algorithms in more traditional ways. This allows them to potentially tackle hard problems involving complex arithmetic and logical reasoning. Thus, in order to model the input-output relationship, instead of applying a series of transformations directly over the input, NDPs

The authors would like to thank the Brazilian research agencies CNPq and CAPES for partially financing this research.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 218–227, 2018.
https://doi.org/10.1007/978-3-030-01424-7_22

choose a sequence of transformations from a predefined instruction set, yielding an explicit algorithm that transforms the input into the solution. Furthermore, they can decouple the learned logic from the specific input values, allowing for better generalization and reusability in different contexts. However, current NDP models focus on end-to-end solutions for specific contexts and problems, instead of being easily integrated into current deep learning models.

In this paper, we propose the Gated Recurrent Programmer Unit (GRPU), an NDP technique that can be easily integrated into any current model that uses a recurrent neural network (RNN). Moreover, GRPU uses around the same amount of parameters as a simple gated recurrent unit (GRU) [4], is agnostic in terms of external memory structure and data inputs, and can be extended in ways similar to RNNs, such as stacking and soft attention strategies. This way it provides a lightweight means of augmenting deep learning models with the induction of more traditional programs.

The rest of the paper is organized as follows. The next section briefly explains the GRU and the best-known NDPs in the literature. Section 3 details the model devised in this work, Sect. 4 presents the experiments we have conducted, and the last section concludes.

2 Preliminaries

Here we briefly explain GRUs, by which our model is inspired, and the most relevant neural programmers found in the related literature.

2.1 Gated Recurrent Unit (GRU)

Recently, a new, simpler architecture for RNNs has been developed: the gated recurrent unit (GRU) [4]. GRUs present performance comparable to the traditionally used long short-term memory (LSTM) [5] while using fewer parameters, as they have only two interacting layers instead of three: the update gate and the reset gate.
When the value computed at the reset gate is close to 0, the corresponding previous hidden state is erased and therefore ignored when creating the new state. This allows the GRU to drop information judged irrelevant. The update gate, on the other hand, controls how much information from the previous hidden state should be carried over directly to the current hidden state. This shortcut between the previous state and the following one allows information to be kept untouched indefinitely, helping with the vanishing gradient problem [3].

The value of the current hidden state is computed as ht = (1 − ut) ∗ ht−1 + ut ∗ h̃t, where u and r stand for the update and reset gates, respectively, and h̃t is the new candidate state: h̃t = tanh(W · [rt ∗ ht−1, xt] + b). The values of the update and reset gates are defined with their own sets of parameters: the update gate is computed as ut = σ(Wu · [ht−1, xt] + bu) and the reset gate as rt = σ(Wr · [ht−1, xt] + br).

2.2 Related Work: Neural Programmers

Neural differentiable programming (NDP) techniques try to combine the pattern matching and universal approximation nature of neural networks with the discrete series of operations of traditional algorithms [9]. Fundamentally, neural networks are simply a chain of geometric transformations, and finding one such transformation that can fully generalize each traditional operation, such as the arithmetic and logic operations, is hard and requires potentially large amounts of data. For example, even a simple sum or product of numbers is not a trivial task for a neural network to learn, especially considering the distortion caused by the nonlinear transformations that occur at each step.

Integrating algorithm-like aspects has been a trend since the success of attention models [12]. They allow the network to learn to choose the data it wants to access in a completely differentiable way.
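The GRU equations above can be sketched as a single NumPy step; this is a minimal illustration with arbitrary weights, not tied to any particular framework:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, Wu, bu, Wr, br, W, b):
    """One GRU step: compute the gates, the candidate state,
    and interpolate between the old state and the candidate."""
    hx = np.concatenate([h_prev, x])
    u = sigmoid(Wu @ hx + bu)                              # update gate
    r = sigmoid(Wr @ hx + br)                              # reset gate
    h_cand = np.tanh(W @ np.concatenate([r * h_prev, x]) + b)
    return (1 - u) * h_prev + u * h_cand                   # new state

rng = np.random.default_rng(0)
H, X = 3, 2  # hidden and input sizes (arbitrary)
h = gru_step(rng.standard_normal(H), rng.standard_normal(X),
             rng.standard_normal((H, H + X)), np.zeros(H),
             rng.standard_normal((H, H + X)), np.zeros(H),
             rng.standard_normal((H, H + X)), np.zeros(H))
```

Driving the update gate toward 0 (e.g. with a large negative bias bu) makes the step copy the previous state through unchanged, which is the shortcut discussed above.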
NDPs go one step further and apply the selection not only to the input data, but also to the operation applied to the data. For that, they comprise a selection of differentiable operations, and through soft attention they are able to select an operation for each step; the results of each step can then become the input of the following step. They thus possess the ability to induce algorithms that transform the original input into the desired output over multiple steps. The selection operation usually has the form

result = oplist(args)ᵀ · softmax(opcode),

where oplist is an N-sized vector in which each field is an operation such as sum or multiplication, and opcode is a vector with N values, generated by an RNN at each step.

Some of the most notable neural programmers are:
– The Neural Programmer [9] is a table-query-based model that, given an input question, selects a series of aggregate operations and a series of columns from the input table for each operation to be applied. The training phase involves finding the operations and column arguments that minimize the error with respect to the given output, using two LSTMs and two softmax layers.
– The Neural Programmer-Interpreter [10] is composed of a single LSTM and a domain-specific encoder for the state of the environment. The LSTM has three selector units to choose the next operation, its arguments, and when the subprogram terminates. It predicts only the next step of a program, not the full program at once, requiring the program trace as input.
– The Neural Random Access Machine [7] is a sequence-to-sequence programmer model in which every data register of the virtual machine it implements contains a pointer (a probability distribution) that can be transformed into new pointers through look-up-table-based operations. Each pointer can be used to read from or write to a memory tape using attention.
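The soft operation selection above can be sketched as follows (our own illustration; the two-operation `ops` list is an assumption):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def select_op(args, oplist, opcode):
    # result = oplist(args)^T . softmax(opcode): every candidate
    # operation is executed, and the opcode softly weights their results
    results = np.array([op(args) for op in oplist])
    return results @ softmax(opcode)

ops = [lambda a: a.sum(), lambda a: a.prod()]
args = np.array([2.0, 3.0])
# an opcode strongly favouring the first operation (the sum)
out = select_op(args, ops, np.array([50.0, 0.0]))  # -> ~5.0
```

Because every operation runs and only the weighting is learned, the whole selection stays differentiable; a near-one-hot softmax effectively picks a single operation.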
3 The Gated Recurrent Programmer Unit

We introduce a novel neural differentiable programmer architecture that focuses on a low footprint and easy integration with other neural architectures. It has considerably fewer parameters than the models described in the previous section, and it does not require complex input in either training or execution (such as tables, preprocessed lists, or program traces). Additionally, unlike the previous models, GRPU instructions can have any number of arguments, since no softmax selection is required, and any number of operations transforming those arguments in a single step.

3.1 The Architecture

Figure 1 exhibits the GRPU architecture, which is built upon the structure of a regular GRU. GRPU is not only easily exchangeable wherever a GRU can be used, enabling traditional algorithmic manipulation of its inputs, but it can also be implemented with just a few lines of code on top of the GRU. The fundamental difference between the two models is the way the new state is produced, but this small difference also affects how everything else is interpreted. Thus, in GRPU, the affine transformation is replaced by an arithmetic and logic unit (ALU), a module that executes one operation for each set of fields of the hidden state to produce the next state values.

The virtual machine (VM) state, which replaces the hidden state of the GRU, is hvm ∈ R^N, where N is both the summed sizes of the ALU operations' outputs and the summed sizes of their arguments. In other words, the VM state is both the arguments for the ALU and the outputs of the ALU. The ALU receives the previous VM state and returns a new candidate for the next state from the results of each operation. The reset gate, in this context, operates as the argument selector, responsible for determining which arguments will be fed to the ALU, setting the ones that should be ignored to zero.
The update gate defines which operations have their results kept and which are ignored; in the latter case, the previous values of the VM state are restored, and the operation is replaced by a NOP (no operation). The algorithm is therefore produced by generating the GRU gates [ut, rt] from the inputs, which is equivalent to producing the opcode [operations, operands]t. Computing every step gives the final algorithm, like the example displayed in Fig. 2.

Fig. 1. The basic Gated Recurrent Programmer Unit. Dashed lines are the input of the gates; normal lines are the hidden (VM) state path.

Fig. 2. Example of a two-step algorithm: −(arg1 + arg3). Each row has one argument and one operation throughout two recurrent steps. The reset gate selects the arguments for the ALU operations (grayed in the image with solid lines), while the update gate selects which operation results or arguments will be kept (grayed operation results).

Unlike with GRUs, though, the hidden state (the VM state) should not be used in the creation of the gate outputs, and therefore in the creation of the instructions. This is done so that the model can learn generic algorithms that can automatically deal with data not seen in the training base. In the current architecture this means that there is a direct mapping between the current input and the respective instruction. While this behavior is sometimes enough, we would like the model to use past information when creating the algorithm, and, for that reason, we include an additional controller unit, which acts in parallel to the programmer and has the same structure as the GRU. The complete model is depicted in Fig. 3 and represented by the following set of equations (Eqs. 1 to 5):

rt = σ(Wr · [hᶜt−1, xt] + br) (1)
ut = σ(Wu · [hᶜt−1, xt] + bu) (2)
h̃ᶜt = tanh(W · [rᶜt ∗ hᶜt−1, xt] + b) (3)
h̃ᵛᵐt[i] = ALU(rᵛᵐt, hᵛᵐt−1, externalt, operation[i]) (4)
ht = (1 − ut) ∗ ht−1 + ut ∗ h̃t (5)

Fig. 3.
The Gated Recurrent Programmer Unit. The upper part is the virtual machine, which executes the instruction according to the selections made by the gates. The lower part is the controller, which encodes a representation of all past inputs for the gates, producing instructions that are not just a mapping of the current input.

Here, vm denotes the Virtual Machine (VM) section and c the controller section of the state and gate outputs: h_t = [h^{vm}_t, h^c_t], r_t = [r^{vm}_t, r^c_t] and u_t = [u^{vm}_t, u^c_t] are the hidden state (formed by the concatenation of the VM and controller states), the reset gate (which assumes the task of argument selector for the VM state) and the update gate (which assumes the task of operation selector for the VM state), respectively. h̃_t = [h̃^{vm}_t, h̃^c_t] is the next state candidate. The ALU is a function that receives the VM state (arguments), the argument selection (reset gate output), any external data or differentiable memory that can be read/written through specific operations, and the list of operations to apply to the arguments.

3.2 The Arithmetic and Logic Unit (ALU)

The ALU natively supports n-ary operations, with the arguments selected directly by the argument selector. One aspect that must be considered, though, is the neutral element of each operation. The argument selector rejects arguments by multiplying them by zero. This behavior does not influence operations such as summation and the logical OR. In other cases, though, such as the product or the logical AND, a zero-valued (rejected) argument would force the result to be zero or False, respectively. To solve this issue, we introduce a transformation that makes rejected arguments (those with r_t[i] = 0) take the value one instead of zero, while selected arguments keep the argument value itself, which may include zero. Table 1 shows the output we would like in both cases. Table 1.
Target inputs for operations with neutral element 0 and 1.

Input (i)  Selector (r)  Neutral 0  Neutral 1
0          0             0          1
x          0             0          1
0          1             0          0
x          1             x          x

An additional complication is that the argument selector gate is not restricted to binary outputs; instead, it covers the entire interval between 0 and 1. To handle that, we need to work in a superset of Boolean algebra, such as Fuzzy Logic [6]. In particular, we choose the following generalized forms of the basic logic operators, though other options are also possible: x AND y = x ∗ y, x OR y = 1 − (1 − x) ∗ (1 − y) = x + y − x ∗ y, and NOT x = 1 − x.

Converting the Neutral 1 column of Table 1 into a sum-of-products representation in terms of i and r (where "." is the logical AND, "+" is the logical OR, and x̄ is the logical negation of x), we get ī.r̄ + i.r̄ + i.r. Next, by factoring r̄ out of the first two terms, we reach r̄.(ī + i) + i.r, and by applying the identity ī + i = 1, we reach Eq. 6.

r̄ + i.r    (6)

Then, replacing the Boolean operators with the fuzzy operators in the form (NOT r) OR (i AND r), we get (1 − r) OR (i ∗ r) = (1 − r) + (i ∗ r) − (1 − r) ∗ (i ∗ r) = 1 − r + i ∗ r − i ∗ r + i ∗ r², which brings us to Eq. 7.

1 − r + i ∗ r²    (7)

Similarly, the sum-of-products form for the Neutral 0 column of Table 1 is simply i AND r, which with the generalized operators is defined as i ∗ r; this is already how the reset gate output is applied to the hidden state. Thus, for any operation whose neutral element is zero we apply i ∗ r, and for any operation whose neutral element is one we apply Eq. 7 to its input.

For lower-arity operations, it is possible to simply eliminate some of the connections to the arguments (for example, a toggle operation only needs a connection to its own previous result), and/or to use aggregate functions. By averaging the reset gate outputs before multiplying the VM state, it is also possible to have a soft selection equivalent to the softmax.
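The two selection rules can be checked directly against the binary corners of Table 1. A minimal sketch (function names are ours):

```python
def select_neutral0(i, r):
    # rejected argument (r = 0) becomes 0, the neutral element of sum/OR
    return i * r

def select_neutral1(i, r):
    # fuzzy form of (NOT r) OR (i AND r), i.e. Eq. 7: a rejected argument
    # becomes 1, the neutral element of product/AND
    return 1 - r + i * r**2
```

At the corners, `select_neutral1(i, 0) = 1` for any input and `select_neutral1(i, 1) = i`, matching the Neutral 1 column of Table 1.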
Besides the operations that map arguments to results, algorithms also require testing and flow control, and for that we first have to define comparison operations. Comparison operations typically have arity two (such as equal, not equal, less than, greater than) or one (equal to zero, not equal to zero, etc.) and return one if the condition is true, or zero otherwise. We implement the differentiable not equal (and the equal, by simply subtracting it from 1) as |arg_1 − arg_2| / (|arg_1 − arg_2| + ε), where ε is a small constant to avoid division by zero. Greater than and less than can be implemented with a shifted sigmoid (logistic) function, approximating the Heaviside step function.

With the comparison operators, we can implement an element of control flow in the differentiable machine: the conditional operation. It makes the instruction execute only if the condition determined by a comparison operation, or a combination of them through logical operators, is met; otherwise the whole instruction is rejected. This is implemented by changing the operation selection mechanism according to Table 2, in which ũ_cond is the operation selector value (update gate value) for the conditional operation, ũ_op is the operation selector value for the target normal operation, h_cond is the result of the comparison used for the conditional, and u_op is the final operation selector value (the value of the operation, or a NOP, No Operation, equivalent to the update gate rejecting the operation). Simplifying the table as with the neutral element above:

u_op = ũ_op AND ((NOT ũ_cond) OR h_cond)    (8)

Using the same transformation inspired by Fuzzy Logic discussed above, we arrive at Eq. 9:

u_op = ũ_op ∗ (1 + ũ_cond ∗ (h_cond − 1))    (9)

To integrate it into the model equations, with u_t being the final output of the update gate used in Eq. 5, ũ^c_t the controller section and ũ^{vm}_t the VM section of the update gate calculated in Eq.
2:

u_t = [ũ^{vm}_t ∗ (1 + ũ_cond ∗ (h_cond − 1)), ũ^c_t]    (10)

If the rejection condition happens, the whole programmer section of the update gate is multiplied by a scalar zero, the new VM state becomes h^{vm}_{t−1}, and, therefore, the algorithm does not produce any effect in that step.

Table 2. Desired output when accepting or rejecting the input.

ũ_cond  h_cond     ũ_op       u_op
0 (-)   0 (-)      0 (-)      0 (nop)
0 (-)   0 (-)      1 (do op)  1 (op)
0 (-)   1 (-)      0 (-)      0 (nop)
0 (-)   1 (-)      1 (do op)  1 (op)
1 (if)  0 (false)  0 (-)      0 (nop)
1 (if)  0 (false)  1 (do op)  0 (nop)
1 (if)  1 (true)   0 (-)      0 (nop)
1 (if)  1 (true)   1 (do op)  1 (op)

3.3 Expanding the Model

Since the GRPU is similar in structure to a GRU, it can be extended in similar ways. For instance, by stacking a number of GRPUs it is possible to have different control flows, executing multiple operations per step, according to the number and order of transformations over the VM state. Another possibility is to use the encoder-decoder with soft attention [2] as inspiration, allowing the model to learn its own sequencing through the input, while also decoupling the input size from the program size.

4 Experimental Results

To produce the results presented here, we ran all the tests with TensorFlow [1] on a single GPU, Adam optimization, a learning rate of 10^-4, and otherwise default parameters and no regularization. The controller hidden state has size 100.

4.1 The Adding Problem

To evaluate the potential to learn long algorithms, we use a variant of the RNN Adding Problem described in [13]. In each step the network is fed with a control value of either −1, 0 or 1 and an input value in the range [0, 1]. If the control is 1, which always happens in exactly two of the steps, then the corresponding input value should be one of the operands in the sum. There are between 50 and 55 steps.
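A sample of this task can be generated as follows. This is our own sketch: since the text does not spell out when the control value −1 occurs, the generator leaves all non-marked steps at control 0, which is an assumption.

```python
import random

def adding_problem_sample(min_len=50, max_len=55, rng=random):
    """Generate one sample of the Adding Problem variant described above.

    Exactly two steps carry control value 1; their input values are the
    operands of the target sum. How the control value -1 is used is not
    specified here, so this sketch leaves all other controls at 0.
    """
    n = rng.randint(min_len, max_len)          # between 50 and 55 steps
    controls = [0] * n
    i, j = rng.sample(range(n), 2)             # the two marked positions
    controls[i] = controls[j] = 1
    values = [rng.random() for _ in range(n)]  # inputs in [0, 1]
    target = values[i] + values[j]
    return controls, values, target
```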
With a 10,000-sample training set and a 1,000-sample test set, a batch size of 100, and a bidirectional GRU with the outputs connected to a fully connected linear regression layer, the cited authors achieve a mean squared error of 0.0041 on the test set.

Using the GRPU, we feed only the control vector to the controller unit, to avoid dependence between the induction of the algorithm and the processed data. The ALU contains 3 operations: a READ operator that returns the control vector, an ADD operator, and a PRODUCT operator. This means that each step has to choose whether to store the result of each of the 3 possible operations or keep the previous argument, and to choose any combination of the 3 previous results as input for the operations, creating a very large search space with a program up to 55 instructions long. The output of the model is the result of the sum.

Table 3. Experiments. *Bidirectional GRU results from [13]

Configuration                                                1,000 epochs (training)  1,000 epochs (test)
Bidirectional GRU - batch 100 - 1,000 samples*               N/A                      0.0041
GRPU - batch 100 - 10,000 samples                            0.247                    0.759
GRPU - batch 32 - 32,000 samples                             0.0089                   0.00699
GRPU - batch 10 - 10,000 samples                             0.0000387                0.00709
GRPU - batch 10 - 10,000 samples - varying number of steps   0.000426                 0.000696
GRPU - batch 10 - 10,000 samples - (Multiplication Variant)  0.00616                  0.0166
GRPU - batch 10 - 10,000 samples - (Conditional Variant)     0.06                     0.06

Table 3 shows that using the same batch size as the baseline (100) leads to very poor performance, indicating that the model is more prone to getting stuck in local minima. Either increasing the number of samples or reducing the batch size, which increases the stochastic effect, brings the results much closer to those of the more complex traditional model. Starting with just 10 steps and increasing the number up to the target throughout the epochs yields the best generalization.
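The Conditional Variant in Table 3 exercises the comparison and conditional-selection machinery of Sect. 3.2 (Eqs. 8 to 10). That machinery can be sketched as follows; `EPS` and the sigmoid sharpness `k` are illustrative choices on our part, not values from the text.

```python
import math

EPS = 1e-6  # small constant avoiding division by zero (illustrative value)

def not_equal(a, b, eps=EPS):
    d = abs(a - b)
    return d / (d + eps)

def equal(a, b):
    return 1.0 - not_equal(a, b)

def greater_than(a, b, k=50.0):
    # shifted sigmoid approximating the Heaviside step function;
    # k is a sharpness parameter chosen here for illustration
    return 1.0 / (1.0 + math.exp(-k * (a - b)))

def conditional_selector(u_op, u_cond, h_cond):
    # Eq. 9: the operation is kept only if the attached condition holds
    return u_op * (1.0 + u_cond * (h_cond - 1.0))
```

At the binary corners, `conditional_selector` reproduces Table 2: with no conditional attached (u_cond = 0) the operation selector passes through unchanged, and with a conditional attached the operation is kept only when the comparison result is 1.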
4.2 Other Variations

Changing the example above from addition to product, with the input range changed to [0.5, 1.5] to prevent values frequently close to zero, allows us to evaluate the logic for operations with neutral element one. The network behaved similarly, reducing the error to adequate levels after the 1,000 epochs, as seen in the Multiplication Variant row of Table 3.

To test the conditional, we moved the control vector of the Adding Problem to the virtual machine, to be read by a second READ operator. We also added a conditional operation that checks whether its input is 1 and otherwise forces a NOP in that step. This brings the ALU to 5 operations, and the controller in this variation has no input besides its own state; it is therefore incapable of choosing on its own when to select the ADD operation and when to skip. This variation converges very fast, but easily gets stuck in a local minimum worse than the original variant.

5 Conclusions and Future Work

Here, we presented a novel Neural Programming architecture that can help build a framework connecting neural networks and traditional programming. It has the potential of helping both models that write programs autonomously and users who want to integrate their own logic within the neural network operation. The experiments exposed some of the issues of previous neural programmer works: the convergence of such models is not trivial, possibly because the stronger restriction of the search space may lead to more local minima. More research in this area could provide better insights into the model's behavior during training.
A number of further tests could be conducted in future work to better understand the potential of our model, such as tuning the hyper-parameters and ALU settings, adding regularization, experimenting with transfer learning and domain adaptation using the added transparency, evaluating deep GRPU models, and also techniques to extract efficient discrete algorithms.

References

1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
3. Bengio, Y., Simard, P.Y., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994)
4. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734. ACL (2014)
5. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
6. Klir, G.J., Yuan, B.: Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall, Upper Saddle River (1995)
7. Kurach, K., Andrychowicz, M., Sutskever, I.: Neural random access machines. ERCIM News 2016(107) (2016)
8. Maclaurin, D., Duvenaud, D., Adams, R.P.: Autograd: effortless gradients in numpy (2015)
9. Neelakantan, A., Le, Q.V., Sutskever, I.: Neural programmer: inducing latent programs with gradient descent. CoRR abs/1511.04834 (2015). http://arxiv.org/abs/1511.04834
10. Reed, S.E., de Freitas, N.: Neural programmer-interpreters. CoRR abs/1511.06279 (2015). http://arxiv.org/abs/1511.06279
11. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Advances in Neural Information Processing Systems (NIPS 2015), vol. 28, pp. 2692–2700 (2015)
12.
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 2048–2057 (2015)
13. Zhou, G.B., Wu, J., Zhang, C.L., Zhou, Z.H.: Minimal gated unit for recurrent neural networks. Int. J. Autom. Comput. 13(3), 226–234 (2016)

Towards More Biologically Plausible Error-Driven Learning for Artificial Neural Networks

Kristína Malinovská(B), Ľudovít Malinovský, and Igor Farkaš

Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
{malinovska,farkas}@fmph.uniba.sk
http://cogsci.fmph.uniba.sk/cnc/

Abstract. Since the standard error backpropagation algorithm for supervised learning was shown to be biologically implausible, alternative models of training that use only local activation variables have been proposed. In this paper we present a novel algorithm called UBAL, inspired by the GeneRec model. We briefly describe the model and show the performance of the algorithm on the XOR and 4-2-4 problems.

Keywords: Error-driven learning · Biological plausibility

In search of an alternative to error backpropagation [5], considered to be biologically implausible [1], O'Reilly proposed the GeneRec model [4]. Instead of propagating error values, in GeneRec neuron activation is propagated bidirectionally. The weight update is based on the difference in the net activation between the minus phase (producing output from input) and the plus phase (the desired value is "clamped" on the output layer and the activation spreads back to the hidden layer). Building on this principle, we proposed the BAL model [3] for bidirectional heteroassociative mappings, but failed to reach 100% convergence on the canonical 4-2-4 encoder task despite extensive experimental tuning [2].
As an improvement, we propose the Universal Bidirectional Activation-based Learning (UBAL) algorithm, with additional learning parameters enabling the model to also perform unidirectional association tasks such as classification. Like GeneRec, our model uses activation state differences, but with separate weight matrices M and W for each direction of activation flow. The activation is propagated in four phases (Fig. 1). As outlined in Table 1, in the forward prediction phase FP, the input is presented to layer p and the activation spreads to layer q, and vice versa for the backward prediction BP. Additionally, there are echo activation phases (FE and BE) in which the network's previous outputs q^{FP} and p^{BP} are echoed back to p and q through the weights M and W, respectively. The learning rule in Eqs. 1 and 2 takes as inputs the intermediate terms t (target) and e (estimate) from Table 2.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 228–231, 2018. https://doi.org/10.1007/978-3-030-01424-7_23

Fig. 1. Activation propagation in a network with input-output layers x and y and one hidden layer.

Table 1. Activation propagation rules; p and q denote two layers of the network connected by weight matrices W and M. Symbols b and d denote the biases and σ stands for the standard logistic activation function.

Direction and phase   Term      Value
Forward prediction    q^{FP}_j  σ(Σ_i w_{ij} p^{FP}_i + b_j)
Forward echo          p^{FE}_i  σ(Σ_j m_{ji} q^{FP}_j + d_i)
Backward prediction   p^{BP}_i  σ(Σ_j m_{ji} q^{BP}_j + d_i)
Backward echo         q^{BE}_j  σ(Σ_i w_{ij} p^{BP}_i + b_j)

Table 2. Definition of terms used in the learning rule.

Term name           Term   Value
Forward target      t^F_j  β^F_q q^{FP}_j + (1 − β^F_q) q^{BP}_j
Forward estimate    e^F_j  γ^F_q q^{FP}_j + (1 − γ^F_q) q^{BE}_j
Backward target     t^B_i  β^B_p p^{BP}_i + (1 − β^B_p) p^{FP}_i
Backward estimate   e^B_i  γ^B_p p^{BP}_i + (1 − γ^B_p) p^{FE}_i

Δw_{ij} = λ t^B_i (t^F_j − e^F_j)    (1)
Δm_{ij} = λ t^F_j (t^B_i − e^B_i)    (2)
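The four phases and the learning rule can be condensed into a short NumPy sketch for a single p-q layer pair, with the input clamped on p and the target clamped on q. This is our own reconstruction from Tables 1-2 and Eqs. 1-2; the function name and shapes are illustrative.

```python
import numpy as np

def sigma(z):
    # standard logistic activation
    return 1.0 / (1.0 + np.exp(-z))

def ubal_update(x, y, W, M, b, d, beta_f, gamma_f, beta_b, gamma_b, lam):
    """One UBAL update for a layer pair p-q (sketch of Tables 1-2, Eqs. 1-2).

    x is clamped on layer p, y on layer q; W maps p -> q, M maps q -> p.
    """
    # the four activation phases (Table 1)
    q_fp = sigma(W.T @ x + b)      # forward prediction
    p_fe = sigma(M.T @ q_fp + d)   # forward echo
    p_bp = sigma(M.T @ y + d)      # backward prediction (y clamped on q)
    q_be = sigma(W.T @ p_bp + b)   # backward echo

    # targets and estimates (Table 2); the clamped values stand in for
    # q^BP (= y) and p^FP (= x)
    t_f = beta_f * q_fp + (1 - beta_f) * y
    e_f = gamma_f * q_fp + (1 - gamma_f) * q_be
    t_b = beta_b * p_bp + (1 - beta_b) * x
    e_b = gamma_b * p_bp + (1 - gamma_b) * p_fe

    # learning rule (Eqs. 1 and 2)
    W = W + lam * np.outer(t_b, t_f - e_f)
    M = M + lam * np.outer(t_f, t_b - e_b)
    return W, M
```

Note that the update uses only locally available activations, which is the point of the model: no error signal is propagated backwards through the weights.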
The learning rate λ and the parameters β (target prediction strength) and γ (estimate prediction strength) used in the learning rule terms in Table 2 drive the network learning. Depending on their values, the network can accomplish different tasks. In Fig. 2 we present results from experiments with the 4-2-4 encoder, indicating that with a reasonable learning rate the network always converges to a solution. Unlike its predecessor BAL, given a certain parameter setup (Table 3), UBAL converges on the XOR task, as shown in Fig. 3. Preliminary results from further experiments suggest that UBAL could take us closer towards a biologically plausible alternative to error backpropagation.

Table 3. Parameters β and γ in our experiments, β^B = 1 − β^F.

       4-2-4 Encoder (X — H — Y)   XOR (X — H — Y)
β^F    1.0 – 0.5 – 0.0             0.01 – 1.0 – 0.0
γ^F    0.5 – 0.5                   0.0 – 0.0
γ^B    0.5 – 0.5                   0.0 – 0.0

Fig. 2. Results from 4-2-4 encoder experiments with varying λ (1000 nets). Success rate indicates how many networks were able to learn the task with 100% accuracy.

Fig. 3. Results from XOR experiments with varying hidden layer size (1000 nets) and λ = 0.2. Maximum training epochs: 20000.

Acknowledgment. This work was supported by grants VEGA 1/0796/18 and KEGA 017UK-4/2016.

References

1. Crick, F.: The recent excitement about neural networks. Nature 337(6203), 129–132 (1989)
2. Csiba, P., Farkaš, I.: Computational analysis of the bidirectional activation-based learning in autoencoder task. In: International Joint Conference on Neural Networks (IJCNN), pp. 1–6. IEEE (2015)
3. Farkaš, I., Rebrová, K.: Bidirectional activation-based neural network learning algorithm. In: Mladenov, V., Koprinkova-Hristova, P., Palm, G., Villa, A.E.P., Appollini, B., Kasabov, N. (eds.) ICANN 2013. LNCS, vol. 8131, pp. 154–161. Springer, Heidelberg (2013).
https://doi.org/10.1007/978-3-642-40728-4_20
4. O'Reilly, R.: Biologically plausible error-driven learning using local activation differences: the generalized recirculation algorithm. Neural Comput. 8(5), 895–938 (1996)
5. Rumelhart, D., Hinton, G., Williams, R.: Learning Internal Representations by Error Propagation, pp. 318–362. The MIT Press, Cambridge (1986)

Online Carry Mode Detection for Mobile Devices with Compact RNNs

Philipp Kuhlmann(B), Paul Sanzenbacher, and Sebastian Otte

Cognitive Modeling Group, Computer Science Department, University of Tübingen, Sand 14, 72076 Tübingen, Germany
kuhlmann.ph+icann18@gmail.com, sebastian.otte@uni-tuebingen.de

Abstract. Nowadays mobile devices are an essential part of our daily life. In particular, fitness tracking applications, which record our daily actions or exercise sessions, require a robust carry mode detection of the device. For a detailed and accurate analysis of the acquired data, it is essential to know the relative position, and thus the expected movement, of the phone relative to the performed actions. On the other hand, it is important that such a detection is as energy-efficient as possible, which rules out common deep convolutional approaches in advance. The contribution of this paper is twofold. First, we provide a mobile device carry mode dataset, which currently consists of 6 h and 28 min of labeled accelerometer recordings. Second, we developed a robust online method to estimate the carry mode of such a device, which allows robust classification of long sequences of data based on compact Recurrent Neural Networks (RNNs), particularly Long Short-Term Memories (LSTMs). Our approach is generally applicable, since it only requires data from an accelerometer, and is lightweight enough to run on small embedded devices. Specifically, we demonstrate that LSTMs can almost perfectly distinguish between the carry modes hand, bag and pocket.
Keywords: Mobile devices · RNN · LSTM · Carry mode detection

1 Introduction

Modern mobile devices contain a variety of sensors, including accelerometer, gyroscope, magnetometer and GPS, to name just a few. While single sensors or combinations thereof fulfill essential functions such as estimating the location or orientation of the device, they are also increasingly used in exercising or health applications. The broad availability of sensor data from a huge variety of sensors also allows for using machine learning techniques to extract all kinds of information. In particular, sequentially recorded sensor data can be processed using recurrent neural networks. We present an approach for classifying the carry mode of a mobile device using accelerometer data. The carry mode is classified into one of three categories, hands, bag and pocket, using an LSTM-based recurrent neural network [3]. Knowing the current carry mode can be useful for several applications. For example, the time required to pick up the phone when it starts ringing depends on its location, which is directly related to the carry mode. The carry mode information can therefore be used to notify the calling party so they can decide whether it is worth waiting for the call to be picked up. Another example application is to use the carry mode as a safety measure. While a certain safety mode is activated, the phone is assumed to be in the bag or pocket. As soon as the hand carry mode is detected in that safety mode, an alarm can be triggered if somebody is trying to steal the device.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 232–241, 2018. https://doi.org/10.1007/978-3-030-01424-7_24

Previous works have already explored the possibilities of using accelerometer measurements to estimate or classify external conditions. Hernandez et al.
[2] used accelerometer measurements to monitor physiological conditions such as the heart or breathing rate of the person carrying the device. Additionally, they proved that the measurement is possible regardless of the carry mode of the phone, but with significantly varying accuracy. Otte et al. [5] have demonstrated the potential of LSTM-based RNNs to classify the terrain type on which a mobile robot is driving by only evaluating the vibrations of the robot platform on the different terrain types. The aim of this paper, however, is to evaluate if and how well RNNs, particularly ones with relatively few parameters, are able to detect the current carry mode of the device in an online fashion, that is, incrementally classifying each new time step of input data without a time window.

The paper is organized as follows. First, in Sect. 2 the dataset that we used in our experiments is introduced. Second, the applied RNN architecture is motivated and sketched out in Sect. 3. Third, our experimental results are presented and discussed in Sect. 4. Finally, Sect. 5 recapitulates this study and gives ideas for the next research steps.

2 Dataset

Alongside our work, we recorded our own extensive dataset for training and verification purposes (available under http://cm.inf.uni-tuebingen.de). The complete dataset was recorded by multiple persons using different mobile devices. A detailed composition of the recorded data is shown in Table 1.

Table 1. Composition of our recorded dataset that is used for training and verification.

Phone            #Datapoints  Total length  #Recordings by class
                                            Pocket  Hands  Bag
Sony Xperia ZX1  555028       4:43:29       21      7      5
HTC M8 One       60087        36:04         5       4      2
OnePlus 5T       106782       58:35         4       3      0
Total            721897       6:28:09       30      14     7
2.1 Acquisition

For recording the training and validation sequences, an Android application is used that offers the user the possibility to select the carry mode and to add an optional comment in case there are unexpected irregularities in the recording. A single sequence is created by pressing a start button at the beginning and a stop button at the end. After a sequence is recorded, the application encodes the data as a JSON object and uploads it to a server, where it is stored in a database. If the upload fails, e.g., due to missing internet connectivity, the data is stored on-device and can be re-uploaded as soon as a connection is established again.

The recorded data includes the raw accelerometer data, consisting of the x, y, z acceleration of the mobile device in the device's local coordinate system, and the time-stamp of each sample. The time-stamp is used to detect and account for deviations from the requested sampling rate of 50 Hz. Each recording is tagged with the associated carry mode. We distinguish between bag, hands, and pocket, which are the most commonly used ways of carrying a mobile device and which have significantly different characteristics. Most of the data was recorded during everyday usage of the mobile devices, while some recordings were specially crafted to cover edge cases. To prevent over-fitting, we tried to vary the device's orientation and location between recordings. Figure 1 shows example extracts from different recordings.

2.2 Dataset Preparation

For training, sequences of length 100 or 200 were extracted from the recordings. Since, especially for the pocket and bag carry modes, the mobile device is not carried in the expected mode at the beginning and at the end of the recorded sequence, each recording is cut at both ends by three seconds.

Fig. 1. Extracts of sequences recorded in carry mode pocket for three different devices. All extracts contain 50 samples, which corresponds to one second.
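The trimming and windowing just described might look as follows. The 50 Hz rate and the three-second cut come from the text; the helper name and the non-overlapping window layout are our assumptions.

```python
SAMPLE_RATE = 50  # Hz, the requested sampling rate
TRIM_SECONDS = 3  # carry mode may be wrong near the start and end

def extract_windows(recording, window=100, trim_s=TRIM_SECONDS, rate=SAMPLE_RATE):
    """Cut `trim_s` seconds at both ends of a recording and split the
    remainder into fixed-length, non-overlapping windows."""
    trim = trim_s * rate
    core = recording[trim:len(recording) - trim]
    return [core[i:i + window] for i in range(0, len(core) - window + 1, window)]
```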
3 Recurrent Neural Networks

Due to their cyclic connections, RNNs are able to learn temporal dependencies in data sequences, whereas feed-forward networks can only learn static pattern mappings. In contrast to traditional RNNs, the aforementioned LSTM model [3], which can be seen as a differentiable memory cell, overcomes the problem of vanishing gradients. LSTMs are capable of handling even very long time lags of up to 10,000 time steps. Due to this and other capabilities, e.g., precise timing, precise value reproduction, or counting, LSTM-like RNNs unleash an impressive learning potential and are the de-facto standard for sequential learning tasks. Note that we applied specific RNN regularization [8]. Prior to classification we may optionally apply some preprocessing steps to prepare the data for the network.

3.1 Data Preprocessing

The preprocessing consists of multiple steps, depending on the mode of operation. First of all, the input data is split into multiple sequences of length n. The value of n depends on whether we are training or evaluating the model. When evaluating the model we use n = N, where N is the total length of the recording.

The second step is an optional dimensionality reduction via principal component analysis. The reduction was intended to prevent over-fitting on the device orientation, which can normally be detected due to the gravity acceleration. Nevertheless, the evaluation showed that the network does not over-fit on this, and that the dimensionality reduction significantly reduces the information available to the network.

Lastly, we have to take care that all three classification categories are equally represented in the training dataset, in order to prevent over-fitting on one category. Therefore we select only an equally sized subset of sequences for each category from the available input data.
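The balancing step can be sketched as follows; the function name and the random-subsampling policy are our assumptions, since the text only states that an equally sized subset per category is kept.

```python
import random
from collections import defaultdict

def balance_classes(sequences, labels, rng=random):
    """Keep an equally sized, randomly chosen subset of sequences per
    carry-mode class (illustrative sketch of the balancing step)."""
    by_class = defaultdict(list)
    for seq, lab in zip(sequences, labels):
        by_class[lab].append(seq)
    n = min(len(v) for v in by_class.values())  # size of the rarest class
    out_seqs, out_labels = [], []
    for lab, seqs in by_class.items():
        for seq in rng.sample(seqs, n):
            out_seqs.append(seq)
            out_labels.append(lab)
    return out_seqs, out_labels
```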
3.2 RNN Architecture

The input is fed into three convolution layers with kernel sizes k = 1, 3, 5, performing a convolution along the temporal axis of our input data stream. The outputs of the convolutional layers are concatenated and then used as input to each cell of our LSTM block, resulting in a sequence of 9 samples per time step. Each sample of the concatenated sequence is fed into an LSTM block consisting of c = 50 parallel independent LSTM cells; the only recurrent connections are within this block. Each LSTM cell receives the concatenated input and the recurrent output from every other cell. The output of each LSTM cell is connected to a fully connected layer with 20 neurons that uses a leaky ReLU activation function. The category mapping is achieved through a final fully connected layer with a softmax output function, resulting in a probability for each output category. An overview of the complete network structure is shown in Fig. 2. The network comprises a total of 13,173 trainable parameters.

Fig. 2. Network architecture consisting of multiple parallel temporal convolutions of different size, an LSTM block with multiple independent LSTM cells, and a fully-connected layer followed by a softmax function.

3.3 Training

Given the ground-truth one-hot vectors for each input sequence, the cross-entropy between the ground truth and the output of the network is minimized via Back-Propagation Through Time (BPTT) [6]. For optimization, we use the Adam optimizer [4] with a constant learning rate of η = 10^-5 and default parameters β1 = 0.9, β2 = 0.999, and ε = 10^-3.

3.4 Implementation Details

The network architecture as well as the training and testing procedures are implemented in TensorFlow [1], using Tensorpack as a training interface [7].
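As a sanity check, the stated total of 13,173 trainable parameters is consistent with the standard parameter-count formulas for the layers described, assuming a 3-axis accelerometer input and 3 filters per convolutional layer (so that the concatenation yields 9 values per time step). The helper below is our reconstruction, not code from the paper.

```python
def count_parameters(in_dim=3, conv_filters=3, kernel_sizes=(1, 3, 5),
                     lstm_cells=50, fc_units=20, n_classes=3):
    """Trainable-parameter count for the described architecture, using
    standard Conv1D/LSTM/Dense formulas (biases included)."""
    # each Conv1D layer: filters * (kernel * input_channels) + filters biases
    conv = sum(conv_filters * (k * in_dim + 1) for k in kernel_sizes)
    concat_dim = conv_filters * len(kernel_sizes)  # 9 samples per time step
    # LSTM block: 4 gates, each over (input + recurrent + bias) per cell
    lstm = 4 * ((concat_dim + lstm_cells + 1) * lstm_cells)
    fc = (lstm_cells + 1) * fc_units       # fully connected, leaky ReLU
    out = (fc_units + 1) * n_classes       # softmax output layer
    return conv + lstm + fc + out
```

With the defaults, this evaluates to 90 + 12,000 + 1,020 + 63 = 13,173, matching the figure quoted in the text.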
Although TensorFlow is not particularly efficient when it comes to training recurrent neural networks with low-dimensional sequences, it allows for fast prototyping, and models can be exported and run on mobile devices using TensorFlow Lite (https://www.tensorflow.org/mobile/tflite/). The preprocessing is performed online, as it does not require much computation time.

4 Experimental Results

4.1 Network Configurations

In order to find a suitable network architecture achieving high accuracy and fast convergence, several network components were added and explored. Using a single LSTM block is sufficient for achieving a high validation accuracy, whereas using two successive blocks drastically slows down the training process without increasing the overall accuracy. Appending additional fully-connected layers after the LSTM block can increase the convergence speed; however, they do in general not increase the final accuracy. Applying several parallel convolutional layers with different sizes of receptive fields allows the network to extract the most important local information from the input sequences and can also help to smooth out high-frequency noise. We found that the convolutional layers are especially helpful when training the architecture on data from multiple different devices, as the acceleration sensors in the devices have different noise characteristics.

4.2 Results

The network architecture (see Fig. 2) was trained on two datasets, containing sequences from three devices and from one single device, respectively. The training results and accuracies are listed in Table 2.

Table 2. Accuracies for different datasets and number of sequences on the training and validation set.

Dataset  Devices  Size            Training accuracy  Validation accuracy
1        3        1497 sequences  0.994              0.984
2        1        753 sequences   0.998              0.987

Fig. 3. Classification example of a sequence of the hands class.
Since the acceleration sensors in mobile devices have different signal characteristics, it is important to have a uniform distribution of training data across the different devices and carry modes in order for the model to generalize adequately. We can show that training the network on data from a single device can further increase the classification accuracy, as the network can adapt to these specific sensor characteristics. Another significant observation we made is that the hand carry mode is detected considerably better than the other carry modes. This is because the transition between the bag and the pocket carry mode is not as clear as between the hand carry mode and the others. While the mobile device follows a characteristic movement pattern when carried in a pocket or a bag, the movement is damped when carried in the hands. Classified example sequences for each class are shown in Figs. 3, 4, and 5. Figure 6 shows that the classification remains stable even when processing long sequences. We tested several different options for the training sequence length, where 100 and 200 proved to yield the best results.

Fig. 4. Classification example of a sequence of the pocket class.

Fig. 5. Classification example of a sequence of the bag class. Although the classification is generally correct, the noise level is much higher than for the other two classes.

Fig. 6. Classification example of a sequence of the bag class, which shows that the classification remains stable despite the long sequence length of over 10 min with over 36,000 samples.

Fig. 7. Comparison of convergence rates of the two datasets with different training sequence lengths. The different sample sizes for each curve result from the dataset sizes as well as from the number of epochs depending on the sequence length.
On the one hand, the longer sequence length of 200 time steps achieves significantly better results on the diverse dataset 1, which contains data captured from multiple devices. On the other hand, if the data is relatively uniform, a longer sequence length does not improve the results: convergence is still faster, but the final accuracy is lower. The convergence rates can be seen in Fig. 7.

Fig. 8. Classification output of the RNN facing alternating class transitions.

Finally, we investigated the behavior at class transitions, which is an important aspect when using RNNs in a continuous classification scenario without clear class boundaries. We found that in every case the RNNs were able to catch the class transitions successfully. We think that this might work out-of-the-box because of the applied regularization [8]. Figure 8 shows, by way of example, that the reference network is clearly able to detect the transitions between the alternating classes hands and pocket rapidly. Note that the ability of catching class transitions

5 Conclusion

We have shown that LSTM-based RNNs are a robust and easily implementable way to reliably detect the carry mode of mobile devices. The results indicate that this task is heavily impacted by the device model. Compared to other possible (deep learning) approaches, our specific network architecture is relatively lightweight (≈13,000 parameters) but still achieved a very high detection rate of nearly 99% on our carry mode dataset with three classes. We also showed that a step-by-step online detection is feasible, for which no large time window (as required, e.g., for spectral transformation-based approaches) is necessary.

Further work can explore a more diverse set of carry modes and try to stabilize long input sequences, which are generally a weak point of generic LSTM networks. Nonetheless, our architecture remained stable even over 10,000s of time steps of continuous classification.
Additionally, the work can be extended to also estimate the actual movement performed by the person carrying the device with the help of the predicted carry mode. It is also conceivable to randomly skip time steps in order to further improve the energy efficiency.

Acknowledgements. We would like to thank Denis Heid and Florian Grimm for testing the dataset recording application and for their effort in collecting a diverse dataset in many different real-life scenarios.

References

1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/
2. Hernandez, J., McDuff, D.J., Picard, R.W.: BioPhone: physiology monitoring from peripheral smartphone motions. In: 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 7180–7183. IEEE (2015)
3. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
4. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: 3rd International Conference for Learning Representations (2015)
5. Otte, S., Weiss, C., Scherer, T., Zell, A.: Recurrent neural networks for fast and robust vibration-based ground classification on mobile robots. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 5603–5608. IEEE (2016)
6. Werbos, P.J.: Backpropagation through time: what it does and how to do it. Proc. IEEE 78(10), 1550–1560 (1990)
7. Wu, Y., et al.: Tensorpack (2016). https://github.com/tensorpack/
8. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014)

Deep Learning

Deep CNN-ELM Hybrid Models for Fire Detection in Images

Jivitesh Sharma, Ole-Christopher Granmo, and Morten Goodwin

Center for Artificial Intelligence Research, University of Agder, Jon Lilletuns vei 9, 4879 Grimstad, Norway
{jivitesh.sharma,ole.granmo,morten.goodwin}@uia.no

Abstract. In this paper, we propose a hybrid model consisting of a deep convolutional feature extractor followed by a fast and accurate classifier, the Extreme Learning Machine, for the purpose of fire detection in images. The reason for using such a model is that deep CNNs used for image classification take a very long time to train. Even with pre-trained models, the fully connected layers need to be trained with backpropagation, which can be very slow. In contrast, we propose to employ the Extreme Learning Machine (ELM) as the final classifier, trained on a pre-trained deep CNN feature extractor. We apply this hybrid model to the problem of fire detection in images. We use the state-of-the-art deep CNNs VGG16 and Resnet50 and replace the softmax classifier with the ELM classifier. For both VGG16 and Resnet50, the number of fully connected layers is also reduced. In particular, in VGG16, which has 3 fully connected layers of 4096 neurons each followed by a softmax classifier, we replace two of these with an ELM classifier. The difference in convergence rate between fine-tuning the fully connected layers of pre-trained models and training an ELM classifier is enormous, around 20x to 51x speed-up. Also, we show that using an ELM classifier increases the accuracy of the system by 2.8% to 7.1% depending on the CNN feature extractor. We also compare our hybrid architecture with another hybrid architecture, the CNN-SVM model. Using an SVM as the classifier does improve accuracy compared to state-of-the-art deep CNNs, but our deep CNN-ELM model is able to outperform the deep CNN-SVM models.
(A preliminary version of some of the results in this paper appeared in "Deep Convolutional Neural Networks for Fire Detection in Images", Springer Proceedings of Engineering Applications of Neural Networks 2017 (EANN'17), Athens, Greece, 25–27 August.)

Keywords: Deep convolutional neural networks · Extreme learning machine · Image classification · Fire detection

1 Introduction

The problem of fire detection in images has received a lot of attention in the past from researchers in computer vision, image processing and deep learning.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 245–259, 2018. https://doi.org/10.1007/978-3-030-01424-7_25

This is a problem that needs to be solved without any compromise. Fire can cause massive and irrevocable damage to health, life and property. It has led to over 1,000 deaths a year in the US alone, with property damage in excess of one billion dollars. Besides, the fire detectors currently in use require different kinds of expensive hardware equipment for different types of fire [27].

What makes this problem even more interesting is the changing background environment, due to the varying luminous intensity of the fire, fire of different shades, different sizes, etc. There are also false alarms caused by environments resembling fire pixels, such as a room with a bright red/orange background and bright lights. Furthermore, the probability of occurrence of fire is quite low, so the system must be trained to handle imbalanced classification.

Various techniques have been used to classify between images that contain fire and images that do not. The state-of-the-art vision-based techniques for fire and smoke detection have been comprehensively evaluated and compared in [39]. The colour analysis technique has been widely used in the literature to detect and analyse fire in images and videos [4,24,31,37].
On top of colour analysis, many novel methods have been used to extract high-level features from fire images, such as texture analysis [4], dynamic temporal analysis with pixel-level filtering and spatial analysis with envelope decomposition and object labelling [40], and fire flicker and irregular fire shape detection with the wavelet transform [37]. These techniques give adequate performance but are currently outperformed by machine learning techniques.

A comparative analysis between colour-based models for the extraction of rules and a machine learning algorithm is done for the fire detection problem in [36]. The machine learning technique used in [36] is logistic regression, which is one of the simplest techniques in machine learning and still outperforms the colour-based algorithms in almost all scenarios. These scenarios consist of images containing different fire pixel colours of different intensities, with and without smoke.

Instead of explicitly designing features using image processing techniques, deep neural networks can be used to extract and learn relevant features from images. Convolutional Neural Networks (CNNs) are the most suitable choice for the task of image processing and classification.

In this paper, we employ state-of-the-art deep CNNs for fire detection and then propose to use hybrid CNN-ELM and CNN-SVM models to outperform deep CNNs. Such hybrid models have been used in the past for image classification, but the novelty of our approach lies in using state-of-the-art deep CNNs like VGG16 and Resnet50 as feature extractors and then replacing some or all of the fully connected layers with an ELM classifier. These models outperform deep CNNs in terms of accuracy, training time and size of the network. We also compare the CNN-ELM model with another hybrid model, CNN-SVM, and show that the CNN-ELM model gives the best performance.

The rest of the paper is organized in the following manner: Sect. 2 briefly describes the related work on CNNs for fire detection and hybrid models for image classification. Section 3 explains our work in detail, and Sect. 4 gives details of our experiments and presents the results. Section 5 summarizes and concludes our work.

2 Related Work

In this paper, we integrate state-of-the-art CNN hybrid models and apply them to the problem of fire detection in images. To the best of our knowledge, hybrid models have never been applied to fire detection. So, we present a brief overview of previous research on CNNs used for fire detection and on hybrid models separately in the next two sub-sections.

2.1 CNNs for Fire Detection

There have been many significant contributions from various researchers in developing a system that can accurately detect fire in the surrounding environment. But the most notable research in this field involves Deep Convolutional Neural Networks (Deep CNNs). Deep CNN models are currently among the most successful image classification models, which makes them ideal for a task such as fire detection in images. This has been demonstrated by previous research published in this area.

In [7], the authors use a CNN for the detection of fire and smoke in videos. A simple sequential CNN architecture, similar to LeNet-5 [18], is used for classification. The authors quote a testing accuracy of 97.9% with a satisfactory false positive rate. In [43], on the other hand, a very innovative cascaded CNN technique is used to detect fire in an image, followed by fine-grained localisation of the patches in the image that contain the fire pixels. The cascaded CNN consists of the AlexNet CNN architecture [17] with pre-trained ImageNet weights [28] and another small network after the final pooling layer which extracts patch features and labels the patches that contain fire. Different patch classifiers are compared.
The AlexNet architecture is also used in [34] to detect smoke in images. It is trained on a fairly large dataset containing smoke and non-smoke images for a considerably long time. The quoted accuracies for the large and small datasets are 96.88% and 99.4% respectively, with relatively low false positive rates. Another paper that uses the AlexNet architecture is [23]. This paper builds its own fire image and video dataset by simulating fire in images and videos using Blender. It adds fire to frames by adding fire properties like shadow, foreground fire, mask, etc. separately. The animated fire and video frames are composited using OpenCV [2]. The model is tested on real-world images. The results show reasonable accuracy with a high false positive rate.

As opposed to CNNs, which extract features directly from raw images, in some methods image/video features are extracted using image processing techniques and then given as input to a neural network. Such an approach has been used in [6]. The fire regions in video frames are obtained by threshold values in the HSV colour space. The general characteristics of fire are computed using these values from five continuous frames, and their mean and standard deviation are given as input to a neural network which is trained using backpropagation to identify forest fire regions. This method performs segmentation of images very accurately, and the results show high accuracy and low false positive rates.

In [11], a neural network is used to extract fire features based on the HSI colour model, which gives the fire area in the image as output. The next step is fire area segmentation, where the fire areas are roughly segmented and spurious fire areas like fire shadows and fire-like objects are removed by image difference. After this, the change in the shape of the fire is estimated by taking the contour image difference and the white pixel ratio to estimate the burning degree of the fire, i.e. no-fire, small, medium and large.
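The HSV thresholding and five-frame statistics described for [6] can be sketched as follows. This is our illustrative reconstruction: the threshold values, frame shape, and the choice of "fraction of fire-like pixels" as the per-frame statistic are assumptions, not the values used in the cited paper.

```python
import numpy as np

# Hypothetical sketch of HSV-based fire preprocessing: threshold five
# consecutive HSV frames to isolate fire-like pixels, then summarize the
# masked regions by their mean and standard deviation across the frames.
rng = np.random.default_rng(0)
frames = rng.random((5, 48, 64, 3))        # 5 frames; H, S, V scaled to [0, 1]

h, s, v = frames[..., 0], frames[..., 1], frames[..., 2]
# Illustrative thresholds: reddish hue, high saturation, high brightness.
mask = (h < 0.17) & (s > 0.5) & (v > 0.6)

# One statistic per frame: the fraction of fire-like pixels.
ratios = mask.reshape(5, -1).mean(axis=1)
features = np.array([ratios.mean(), ratios.std()])  # input to the classifier
print(features.shape)  # (2,)
```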
The experimental results in [11] show that the method is able to detect different fire scenarios with relatively good accuracy.

2.2 Hybrid Models for Image Classification

The classifier part of a deep CNN is a simple fully connected perceptron with a softmax layer at the end to output probabilities for each class. This section of the CNN has a high scope for improvement: since it consists of three to four fully connected layers containing thousands of neurons, it is hard and slow to train, even with pre-trained models that only require fine-tuning of these layers. This has led to the development of hybrid CNN models, which consist of a specialist classifier at the end.

Some researchers have employed the Support Vector Machine (SVM) as the final-stage classifier [1,21,25,33,38]. In [25], the CNN-SVM hybrid model is applied to many different problems like object classification, scene classification, bird sub-categorization, flower recognition, etc. A linear SVM is fed 'off-the-shelf convolutional features' from the last layer of the CNN. This paper uses the OverFeat network [30], which is a state-of-the-art object classification model. The paper shows, with exhaustive experimentation, that extraction of convolutional features by a deep CNN is the best way to obtain the relevant characteristics that distinguish one entity from another.

The CNN-SVM model is used in [21] and successfully applied to visual learning and recognition for multi-robot systems and to problems like human-swarm interaction and gesture recognition. This hybrid model has also been applied to gender recognition in [38]. The CNN used there is the AlexNet [17], pre-trained with ImageNet weights; the features extracted from the entire AlexNet are fed to an SVM classifier. A similar kind of research is done in [33], where the softmax layer and the cross-entropy loss are replaced by a linear SVM and a margin loss.
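To make the loss swap in [33] concrete, the sketch below (our own, on made-up scores) computes both objectives for one sample: the softmax cross-entropy a standard CNN head minimizes, and the multiclass margin (hinge) loss an SVM head minimizes.

```python
import numpy as np

# Softmax cross-entropy versus multiclass margin loss on the same raw scores.
scores = np.array([2.0, 0.5, -1.0])   # hypothetical class scores for one sample
y = 0                                 # true class index

# Softmax cross-entropy: -log of the softmax probability of the true class.
p = np.exp(scores - scores.max())
p /= p.sum()
xent = -np.log(p[y])

# Multiclass hinge loss with margin 1 (Crammer-Singer style).
margins = np.maximum(0.0, scores - scores[y] + 1.0)
margins[y] = 0.0
hinge = margins.sum()

# Here the true class beats every other score by at least the margin of 1,
# so the hinge loss is exactly zero while the cross-entropy stays positive.
print(hinge, xent > 0)
```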
The model of [33] is tested on some of the most well-known benchmark datasets, like CIFAR-10, MNIST and the Facial Expression Recognition challenge. The results show that it outperforms conventional deep CNNs.

In 2006, G.B. Huang introduced a new learning algorithm for a single hidden layer feedforward neural network, called the Extreme Learning Machine [13,14]. This technique is many times faster than backpropagation and SVM training, and outperformed them on various tasks. The ELM randomly initializes the input weights and analytically determines the output weights. It produces a minimum norm least-squares solution, which always achieves the smallest training error if there is a sufficient number of hidden neurons. There have been many variants of ELM for specific applications, which are summarised in [12].

This led to the advent of CNN-ELM hybrid models, which were able to outperform the CNN-SVM models on various applications. The major advantage of CNN-ELM models is the speed of convergence. In [29], the CNN-ELM model is used for Wireless Capsule Endoscopy (WCE) image classification. The softmax classifier of a CNN is replaced by an ELM classifier and trained on the features extracted by the CNN feature extractor. This model is able to outperform CNN-based classifiers.

The CNN-ELM model has also been used for handwritten digit classification [19,22]. In [19], a 'shallow' CNN is used for feature extraction and an ELM for classification. The shallow CNN together with the ELM speeds up the training process. Also, various weight initialization strategies have been tested for ELMs with different receptive fields. Finally, two strategies, namely the Constrained ELM (C-ELM) [44] and the Computed Input Weights ELM (CIW-ELM) [35], are combined in a two-layer ELM structure with receptive fields. This model was tested on the MNIST dataset and achieved a 0.83% testing error.
In [22], a deep CNN is used for the same application and tested on the USPS dataset. A shallow CNN with an ELM is tested on benchmark datasets like MNIST, NORB-small, CIFAR-10 and SVHN with various hyper-parameter configurations in [20]. Another similar hybrid model, which uses CNN features and a Kernel ELM as classifier, is used in [9] for age estimation from facial features. Another application where a CNN-ELM hybrid model has been applied is traffic sign recognition [41].

A different strategy for combining CNN feature extraction and ELM learning is proposed in [15]. Here, an ELM with a single hidden layer is inserted after every convolution and pooling layer and at the end as classifier. The ELM is trained by borrowing values from the next convolutional layer, and each ELM is updated after every iteration using backpropagation. This interesting architecture is applied to lane detection and achieves excellent performance.

A comparative analysis of the CNN-ELM and CNN-SVM hybrid models for object recognition on ImageNet is presented in [42]. Both models were tested for object recognition from different sources: Amazon, Webcam, Caltech and DSLR. The final results show that the CNN-ELM model outperforms the CNN-SVM model on all datasets, and using a Kernel ELM further increases accuracy.

Using the ELM as a final-stage classifier is not limited to image classification with CNNs; ELMs have also been used with DBNs for various applications [3,26].

3 The Fire Detector

In this paper, we propose to employ hybrid deep CNN models to perform fire detection. The AlexNet has been used by researchers in the past for fire detection
and has produced satisfactory results. We propose to use two deep CNN architectures that have outperformed the AlexNet on the ImageNet dataset, namely VGG16 [32] and Resnet50 [10]. We use these models with pre-trained ImageNet weights, which helps greatly when there is a lack of training data. So, we fine-tune only the ELM classifier on our dataset, which is fed the features extracted by the deep CNNs.

3.1 Deep ConvNet Models

The Convolutional Neural Network was first introduced in 1980 by Kunihiko Fukushima [8]. The CNN is designed to take advantage of two-dimensional structures like 2D images and to capture local spatial patterns. This is achieved with local connections and tied weights. It consists of one or more convolution layers with pooling layers between them, followed by one or more fully connected layers, as in a standard multilayer perceptron. CNNs are easier to train than deep fully connected networks because they have fewer parameters and local receptive fields.

In CNNs, kernels/filters are used to detect where particular features are present in an image by convolution with the image. The size of the filters gives rise to a locally connected structure, and each filter is convolved with the image to produce a feature map. The feature maps are usually sub-sampled using mean or max pooling. The reduction in parameters is due to the fact that convolution layers share weights. The reason behind parameter sharing is the assumption that the statistics of a patch of a natural image are the same as those of any other patch of the image. This suggests that features learned at one location can also be learned at other locations, so we can apply a learned feature detector anywhere in the image. This makes CNNs ideal feature extractors for images.

CNNs with many layers have been used for various applications, especially image classification. In this paper, we use two state-of-the-art deep CNNs that have achieved among the lowest error rates on image classification tasks: VGG16 and Resnet50, pre-trained on the ImageNet dataset, along with a few modifications. We also compare our modified and hybrid models with the original ones.
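The weight sharing described above can be made concrete with a minimal "valid" 2D convolution. This is our own illustrative code, not from the paper: a single 3×3 kernel (9 shared weights) is reused at every spatial location to produce the feature map.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Slide one shared kernel over the image ('valid' padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The SAME 9 weights are applied at every (i, j) location.
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

img = np.arange(36.0).reshape(6, 6)     # toy 6x6 "image"
kernel = np.ones((3, 3)) / 9.0          # a simple averaging filter
fmap = conv2d_valid(img, kernel)
print(fmap.shape)  # (4, 4): one response per spatial location
```

A fully connected layer over the same 6×6 input would need 36 weights per output unit; the shared kernel needs only 9 in total, which is the parameter saving discussed above.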
The VGG16 architecture was proposed by the Visual Geometry Group at the University of Oxford [32] and is a deep, simple, sequential network, whereas Resnet50, proposed by Microsoft Research [10], is an extremely deep graph-structured network with residual connections (which avoid the vanishing gradient problem; residual functions are also easier to train). We also test slightly modified versions of both networks, created by adding a fully-connected layer and fine-tuning on our dataset. We experimented with more fully connected layers as well, but the increase in accuracy was overshadowed by the increase in training time.

3.2 The Hybrid Model

We propose to use a hybrid architecture for fire detection in images. In this paper, instead of using a simple CNN as feature extractor, we employ state-of-the-art deep CNNs, VGG16 and Resnet50. Figure 1(a) and (b) show the architecture of the VGG16-ELM and Resnet50-ELM hybrid models, respectively. Usually, only the softmax classifier is replaced by another classifier (ELM or SVM) in a CNN to create a hybrid model. We go one step further by replacing the entire fully connected multi-layer perceptron with a single hidden layer ELM. This decreases the complexity of the model even further.

The Theory of the Extreme Learning Machine: The Extreme Learning Machine is a supervised learning algorithm [13]. The input to the ELM, in this case, consists of the features extracted by the CNNs. Let it be represented as (x_i, t_i), where x_i is the input feature instance and t_i is the corresponding class of the image. The inputs are connected to the hidden layer by randomly assigned weights w. The product of the inputs and their corresponding weights acts as input to the hidden layer activation function. The hidden layer activation function is a non-linear, non-constant, bounded, continuous, infinitely differentiable function that maps the input data to the feature space.
There is a catalogue of activation functions from which we can choose according to the problem at hand. We ran experiments for all activation functions, and the best performance was achieved with the multiquadric function:

f(x_i) = \sqrt{\|x_i - \mu_i\|^2 + a^2}    (1)

The hidden layer and the output layer are connected via weights β, which are to be analytically determined. The mapping from the feature space to the output space is linear. The inputs, the hidden neurons with their activation functions, the weights connecting the inputs to the hidden layer, and the output weights produce the final output function:

\sum_{i=1}^{L} \beta_i g(w_i \cdot x_j + b_i) = o_j    (2)

In matrix form, the output is:

H\beta = T    (3)

The error function used in the Extreme Learning Machine is the mean squared error:

E = \sum_{j=1}^{N} \left( \sum_{i=1}^{L} \beta_i g(w_i \cdot x_j + b_i) - t_j \right)^2    (4)

To minimize the error, we need the least-squares solution of the above linear system:

\|H\beta^* - T\| = \min_\beta \|H\beta - T\|    (5)

The minimum norm least-squares solution to the above linear system is given by:

\hat{\beta} = H^\dagger T    (6)

Properties of this solution:

1. Minimum training error: Equation (5) defines the least-squares solution, i.e. β* minimizes the training error ‖Hβ − T‖.
2. Smallest norm of weights: The minimum norm least-squares solution is given by the Moore-Penrose pseudo-inverse of H: β̂ = H†T.
3. Unique solution: The minimum norm least-squares solution of Hβ = T is unique, namely β̂ = H†T.

Detailed mathematical proofs of these properties and of the ELM algorithm can be found in [14].

Both VGG16 and Resnet50 extract rich features from the images. These features are fed to the ELM classifier, which finds the minimum norm least-squares solution. With a sufficient number of hidden neurons, the ELM outperforms the original VGG16 and Resnet50 networks. Both VGG16 and Resnet50 are pre-trained with ImageNet weights.
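The training rule of Eqs. (2)–(6) translates directly into a few lines of NumPy. The sketch below is our own illustration on random data, using a sigmoid activation for g rather than the multiquadric of Eq. (1); it is not the paper's implementation.

```python
import numpy as np

# Minimal ELM: random input weights stay fixed, the hidden activations form H,
# and the output weights are the minimum-norm least-squares solution
# beta = pinv(H) @ T  (Eq. (6)).
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 8))          # 120 feature vectors (stand-ins for CNN features)
T = np.eye(3)[rng.integers(0, 3, 120)]     # one-hot targets, 3 classes

L = 200                                    # number of hidden neurons
W = rng.standard_normal((8, L))            # random input weights w_i, never trained
b = rng.standard_normal(L)                 # random biases b_i
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden layer matrix, g(w_i . x_j + b_i)

beta = np.linalg.pinv(H) @ T               # analytic output weights (Moore-Penrose)
pred = (H @ beta).argmax(axis=1)
train_acc = (pred == T.argmax(axis=1)).mean()
print(train_acc)
```

With L at least as large as the number of samples, H almost surely has full row rank, so the least-squares fit is essentially exact on the training data, consistent with the minimum-training-error property above.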
So, only the ELM classifier is trained, on the features extracted by the CNNs.

Apart from fast training and accurate classification, this model has another advantage: it does not require large training data. In fact, our dataset consists of just 651 images, of which the ELM is trained on only 60%. This shows its robustness towards a lack of training data. A normal deep CNN would require a much larger amount of training data to fine-tune its fully-connected layers and the softmax classifier. Even the pre-trained VGG16 and Resnet50 models required at least 80% of the data for fine-tuning their fully-connected layers. And, as we show in the next section, a hybrid CNN-ELM trained with 60% training data outperforms pre-trained VGG16 and Resnet50 fine-tuned on 80% training data.

3.3 Paper Contributions

1. Previous hybrid models have used simple CNNs for feature extraction. We employ state-of-the-art deep CNNs to make feature extraction more efficient and to obtain relevant features, since the dataset is difficult to classify.
2. Other hybrid models simply replace the softmax classifier with an SVM or sometimes an ELM. We completely remove the fully connected layers to increase the speed of convergence, since no fine-tuning is needed, and also to reduce the complexity of the architecture. Since VGG16 and Resnet50 extract rich features and the ELM is an accurate classifier, we do not need the fully-connected layers. This decreases the number of layers by 2 in VGG16 and by 1 in Resnet50, i.e. by 8192 and 4096 neurons respectively.
3. The above point also justifies the use of complex feature extractors like VGG16 and Resnet50. If we used a simple CNN, we might not be able to remove the fully-connected layers, since the features might not be rich enough.
Due to this, the fully-connected layers would have to be fine-tuned on the dataset, which would increase training time and network complexity.
4. We also find that the data required for training the ELM classifier is smaller than the data required for fine-tuning the fully-connected layers of a pre-trained deep CNN.
5. We apply our hybrid model to the problem of fire detection in images (on our own dataset). To the best of our knowledge, this is the first time a hybrid ELM model has been applied to this problem.

4 Experiments

We conducted experiments to compare the training and testing accuracies and execution times of the VGG16 and Resnet50 models (including their modified versions) and of the hybrid VGG16 and Resnet50 models with an ELM classifier. We also compare our hybrid VGG16-ELM and Resnet50-ELM models with VGG16-SVM and Resnet50-SVM. We used pre-trained Keras [5] models and fine-tuned the fully-connected layers on our dataset. The models were trained on the following hardware: Intel i5 2.5 GHz, 8 GB RAM and an Nvidia GeForce GTX 820 2 GB GPU. Each model was trained on the dataset for 10 training epochs. The ADAM optimizer [16] with default parameters α = 0.001, β1 = 0.9, β2 = 0.999 and ε = 10^−8 was used to fine-tune the fully-connected layers of VGG16 and Resnet50 and their modified versions. The details of the dataset are given in the next subsection.

4.1 The Real World Fire Dataset

Since there is no benchmark dataset for fire detection in images, we created our own dataset by handpicking images from the internet. This dataset consists of 651 images, which is quite small, but it enables us to test the generalization capabilities of the models and their effectiveness and efficiency in extracting relevant features from images when training data is scarce. The dataset is divided into training and testing sets. The training set consists of 549 images: 59 fire images and 490 non-fire images.
The imbalance is deliberate, to replicate real-world situations, as the probability of occurrence of fire hazards is quite small. The datasets used in previous papers have been balanced, which does not imitate the real-world environment. The testing set contains 102 images: 51 images each of the fire and non-fire classes. As the training set is highly unbalanced and the testing set is exactly balanced, this makes a good test of whether the models are able to generalize well or not. For a model with good accuracy, it must be able to extract the distinguishing features from the small number of fire images. To extract such features from a small amount of data, the model must be deep enough. A poor model would just label all images as non-fire, which is exemplified in the results.

Fig. 1. Examples of fire images.

Apart from being unbalanced, the dataset contains a few images that are very hard to classify. It contains images from all scenarios: fire in a house, room or office, forest fire, with different illumination intensities and different shades of red, yellow and orange, small and big fires, fire at night, fire in the morning. The non-fire images contain a few images that are hard to distinguish from fire images, like a bright red room with high illumination, sunsets, red-coloured houses and vehicles, and bright lights with different shades of yellow and red. Figures 1(a) to (f) show fire images in different environments: indoor, outdoor, daytime, nighttime, forest fire, big and small fire. Figures 2(a) to (f) show the non-fire images that are difficult to classify. Considering these characteristics of our dataset, detecting fire can be a difficult task. We have made the dataset available online (https://github.com/UIA-CAIR/Fire-Detection-Image-Dataset) so that it can be used for future research in this area.
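The effect of the split described above can be quantified directly: a degenerate majority-class model ("always non-fire") looks good on the skewed training set but performs only at chance level on the balanced test set.

```python
# Majority-class baseline on the dataset split given in the text:
# training 59 fire / 490 non-fire, testing 51 fire / 51 non-fire.
train_fire, train_nonfire = 59, 490
test_fire, test_nonfire = 51, 51

majority_train_acc = train_nonfire / (train_fire + train_nonfire)
majority_test_acc = test_nonfire / (test_fire + test_nonfire)
print(round(majority_train_acc, 3), majority_test_acc)  # 0.893 0.5
```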
4.2 Results

Our ELM hybrid models are tested on our dataset and compared with the SVM hybrid models and the original VGG16 and Resnet50 Deep CNN models. Tables 1 and 2 show the results of the experiments. The dataset was randomly split into training and testing sets. Two cases were considered, depending on the amount of training data. The Deep CNN models (VGG16 and Resnet50) were trained only on 80% training data, since 60% is too little for these models. All the hybrid models were trained on both 60% and 80% of the training data.

Deep CNN-ELM Hybrid Models for Fire Detection in Images 255

Fig. 2. Examples of non-fire images that are difficult to classify

Table 1. Accuracy and execution time

Model                  | DT | Acctrain | Ttrain | TCtrain | Acctest | Ttest
VGG16 (pre-trained)    | 80 | 100      | 7149   | 6089    | 90.19   | 121
VGG16 (modified)       | 80 | 100      | 7320   | 6260    | 91.176  | 122
Resnet50 (pre-trained) | 80 | 100      | 15995  | 13916   | 91.176  | 105
Resnet50 (modified)    | 80 | 100      | 16098  | 13919   | 92.15   | 107
VGG16+SVM              | 60 | 99.6     | 2411   | 1352    | 87.4    | 89
VGG16+SVM              | 80 | 100      | 2843   | 1784    | 93.9    | 81
VGG16+ELM              | 60 | 100      | 1340   | 281     | 93.9    | 24
VGG16+ELM              | 80 | 100      | 1356   | 297     | 96.15   | 21
Resnet50+SVM           | 60 | 100      | 3524   | 1345    | 88.7    | 97
Resnet50+SVM           | 80 | 100      | 4039   | 1860    | 94.6    | 86
Resnet50+ELM           | 60 | 100      | 2430   | 251     | 98.9    | 32
Resnet50+ELM           | 80 | 100      | 2452   | 272     | 99.2    | 26

DT is the percentage of the total data used for training the models. Acctrain and Acctest are the training and testing accuracies, respectively. Ttrain and Ttest are the training and testing times for the models. TCtrain is the time required to train the classifier part of the models.

One point to note here is that the SVM hybrid models contain an additional fully-connected layer of 4096 neurons, while the ELM is connected directly to the last pooling layer. The results in Table 1 show that the ELM hybrid models outperform the VGG16, Resnet50 and SVM hybrid models, achieving higher accuracy and learning much faster.
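Because the ELM consumes the pooled CNN features directly, its training reduces to a random hidden projection followed by a single least-squares solve for the output weights. A minimal NumPy sketch of this idea (random Gaussian vectors stand in for CNN features here; the dimensions and function names are illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, Y, n_hidden=64):
    """Fit an ELM: random (untrained) input weights, least-squares output weights."""
    d = X.shape[1]
    W = rng.normal(size=(d, n_hidden))             # random input weights, never updated
    b = rng.normal(size=n_hidden)                  # random biases
    H = np.tanh(X @ W + b)                         # hidden-layer activations
    beta, *_ = np.linalg.lstsq(H, Y, rcond=None)   # pseudo-inverse solution for output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy stand-in for "CNN features": two Gaussian blobs (fire vs. non-fire).
X0 = rng.normal(-1.0, 0.5, size=(100, 8))
X1 = rng.normal(+1.0, 0.5, size=(100, 8))
X = np.vstack([X0, X1])
Y = np.array([[1, 0]] * 100 + [[0, 1]] * 100)      # one-hot class labels
W, b, beta = elm_train(X, Y)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
acc = (pred == Y.argmax(axis=1)).mean()
print(acc)
```

The closed-form solve is what makes the classifier-training times (TCtrain) in Table 1 so much smaller for the ELM hybrids than the iterative fine-tuning of fully-connected layers.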
In general, we can see that the hybrid models outperform the state-of-the-art Deep CNNs in terms of both accuracy and training time. Apart from accuracy and training time, another important point drawn from the results is the amount of training data required. As is well known, Deep Neural Networks (DNNs) require huge amounts of training data, so using pre-trained models can be highly beneficial, as only the fully-connected layers need to be fine-tuned. But with models like VGG16 and Resnet50, which have large fully-connected layers, even fine-tuning requires a large amount of training data. We had to train VGG16 and Resnet50 on at least 80% training data, otherwise they overfitted on the majority class, resulting in 50% accuracy. For the hybrid models, especially the ELM hybrids, the amount of training data required is much smaller. Even when trained on only 60% training data, the ELM models were able to outperform the original VGG16 and Resnet50 models trained on 80% training data. This shows that reducing the fully-connected layers, or replacing them with a better classifier, can reduce the amount of training data required. The ELM is also more robust to a lack of training data, which adds to this advantage. Among the hybrid models, the ELM hybrids outperform the SVM hybrids in terms of both testing accuracy and training time. We can also see that the hybrid models with Resnet50 as the feature extractor achieve better results than those with VGG16. This is due to the depth and the residual connections of Resnet50, in contrast to the simpler, shallower (compared to Resnet50) and sequential nature of VGG16. Table 2 compares results for different numbers of hidden neurons in the ELM. The accuracy increases as the number of hidden neurons increases. The models were tested with 2^12, 2^13 and 2^14 neurons.
The testing accuracy starts to decrease at 2^14 neurons, which means the model overfits. All the tests in Table 2 were conducted with 60% training data.

Table 2. Number of hidden neurons in ELM

CNN features               | # Hidden neurons | Testing accuracy
VGG16 feature extractor    | 4096             | 93.9
VGG16 feature extractor    | 8192             | 94.2
VGG16 feature extractor    | 16384            | 91.1 (overfitting)
Resnet50 feature extractor | 4096             | 98.9
Resnet50 feature extractor | 8192             | 99.2
Resnet50 feature extractor | 16384            | 96.9 (overfitting)

5 Conclusion

In this paper, we have proposed a hybrid model for fire detection. The hybrid model combines the feature-extraction capabilities of Deep CNNs with the classification ability of the ELM. The Deep CNNs used for creating the hybrid models are VGG16 and Resnet50 rather than a simple Deep CNN. The fully-connected layers are removed completely and replaced by a single-hidden-layer feedforward neural network trained with the ELM algorithm. This decreases the complexity of the network and increases the speed of convergence. We test our model on our own dataset, which was created to replicate a realistic environment: it covers different scenarios, with class imbalance reflecting the low likelihood of fire occurrence. The dataset is small in order to check the robustness of the models to a lack of training data, since deep networks require a considerable amount of it. Our hybrid model is compared with the original VGG16 and Resnet50 models and also with the SVM hybrid models. Our Deep CNN-ELM model outperforms all other models, improving accuracy by 2.8% to 7.1% with a training speed-up of 20x to 51x, and requires less training data to achieve higher accuracy on the problem of fire detection.

References

1. Azizpour, H., Razavian, A.S., Sullivan, J., Maki, A., Carlsson, S.: From generic to specific deep representations for visual recognition. CoRR, abs/1406.5774 (2014)
2. Bradski, G.: OpenCV. Dr. Dobb's J.
Soft. Tools 25, 120–126 (2000)
3. Cao, L., Huang, W., Sun, F.: A deep and stable extreme learning approach for classification and regression. In: Cao, J., Mao, K., Cambria, E., Man, Z., Toh, K.-A. (eds.) Proceedings of ELM-2014 Volume 1. PALO, vol. 3, pp. 141–150. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-14063-6_13
4. Chino, D.Y.T., Avalhais, L.P.S., Rodrigues Jr., J.F., Traina, A.J.M.: BoWFire: detection of fire in still images by integrating pixel color and texture analysis. CoRR, abs/1506.03495 (2015)
5. Chollet, F.: Keras (2015)
6. Zhao, J., et al.: Image based forest fire detection using dynamic characteristics with artificial neural networks. In: 2009 International Joint Conference on Artificial Intelligence, pp. 290–293, April 2009
7. Frizzi, S., Kaabi, R., Bouchouicha, M., Ginoux, J.M., Moreau, E., Fnaiech, F.: Convolutional neural network for video fire and smoke detection. In: IECON 2016 – 42nd Annual Conference of the IEEE Industrial Electronics Society, pp. 877–882, October 2016
8. Fukushima, K.: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36(4), 193–202 (1980)
9. Gürpinar, F., Kaya, H., Dibeklioglu, H., Salah, A.A.: Kernel ELM and CNN based facial age estimation. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 785–791, June 2016
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
11. Horng, W.-B., Peng, J.-W.: Image-based fire detection using neural networks. In: JCIS (2006)
12. Huang, G., Huang, G.-B., Song, S., You, K.: Trends in extreme learning machines: a review. Neural Netw. 61, 32–48 (2015)
13. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme learning machine: a new learning scheme of feedforward neural networks.
In: 2004 IEEE International Joint Conference on Neural Networks, Proceedings, vol. 2, pp. 985–990. IEEE (2004)
14. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme learning machine: theory and applications. Neurocomputing 70(1), 489–501 (2006)
15. Kim, J., Kim, J., Jang, G.-J., Lee, M.: Fast learning method for convolutional neural networks using extreme learning machine and its application to lane detection. Neural Netw. 87, 109–121 (2017)
16. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR, abs/1412.6980 (2014)
17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates Inc (2012)
18. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
19. McDonnell, M.D., Tissera, M.D., van Schaik, A., Tapson, J.: Fast, simple and accurate handwritten digit classification using extreme learning machines with shaped input-weights. CoRR, abs/1412.8307 (2014)
20. McDonnell, M.D., Vladusich, T.: Enhanced image classification with a fast-learning shallow convolutional neural network. CoRR, abs/1503.04596 (2015)
21. Nagi, J., Di Caro, G.A., Giusti, A., Nagi, F., Gambardella, L.M.: Convolutional neural support vector machines: hybrid visual pattern classifiers for multi-robot systems. In: ICMLA, no. 1, pp. 27–32. IEEE (2012)
22. Pang, S., Yang, X.: Deep convolutional extreme learning machine and its application in handwritten digit classification. Intell. Neurosci. 2016 (2016)
23. Polednik, T.: Detection of fire in images and video using CNN. Excel@FIT (2015)
24. Poobalan, K., Liew, S.C.: Fire detection algorithm using image processing techniques.
In: 3rd International Conference on Artificial Intelligence and Computer Science (AICS2015), October 2015
25. Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. CoRR, abs/1403.6382 (2014)
26. Ribeiro, B., Lopes, N.: Extreme learning classifier with deep concepts. In: Ruiz-Shulcloper, J., Sanniti di Baja, G. (eds.) CIARP 2013. LNCS, vol. 8258, pp. 182–189. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41822-8_23
27. Custer, R.B.R.: Fire detection: the state of the art. NBS Technical Note, US Department of Commerce (1974)
28. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015)
29. Yu, J.S., Chen, J., Xiang, Z.Q., Zou, Y.X.: A hybrid convolutional neural networks with extreme learning machine for WCE image classification. In: 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1822–1827, December 2015
30. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229 (2013)
31. Shao, J., Wang, G., Guo, W.: An image-based fire detection method using color analysis. In: 2012 International Conference on Computer Science and Information Processing (CSIP), pp. 1008–1011, August 2012
32. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556 (2014)
33. Tang, Y.: Deep learning using support vector machines. CoRR, abs/1306.0239 (2013)
34. Tao, C., Zhang, J., Wang, P.: Smoke detection based on deep convolutional neural networks. In: 2016 International Conference on Industrial Informatics - Computing Technology, Intelligent Technology, Industrial Information Integration (ICIICII), pp. 150–153, December 2016
35.
Tapson, J., de Chazal, P., van Schaik, A.: Explicit computation of input weights in extreme learning machines. CoRR, abs/1406.2889 (2014)
36. Toulouse, T., Rossi, L., Celik, T., Akhloufi, M.: Automatic fire pixel detection using image processing: a comparative analysis of rule-based and machine learning-based methods. Sig. Image Video Process. 10(4), 647–654 (2016)
37. Töreyin, B.U., Dedeoǧlu, Y., Güdükbay, U., Çetin, A.E.: Computer vision based method for real-time fire and flame detection. Patt. Recogn. Lett. 27(1), 49–58 (2006)
38. Wolfshaar, J.V.D., Karaaba, M.F., Wiering, M.A.: Deep convolutional neural networks and support vector machines for gender recognition. In: 2015 IEEE Symposium Series on Computational Intelligence, pp. 188–195, December 2015
39. Verstockt, S., Lambert, P., Van de Walle, R., Merci, B., Sette, B.: State of the art in vision-based fire and smoke detection. In: Luck, H., Willms, I. (eds.) 14th International Conference on Automatic Fire Detection, Proceedings, vol. 2, pp. 285–292. University of Duisburg-Essen, Department of Communication Systems (2009)
40. Vicente, J., Guillemant, P.: An image processing technique for automatically detecting forest fire. Int. J. Therm. Sci. 41(12), 1113–1120 (2002)
41. Zeng, Y., Xu, X., Fang, Y., Zhao, K.: Traffic sign recognition using deep convolutional networks and extreme learning machine. In: He, X., et al. (eds.) IScIDE 2015. LNCS, vol. 9242, pp. 272–280. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23989-7_28
42. Zhang, L., Zhang, D.: SVM and ELM: who wins? Object recognition with deep convolutional features from ImageNet. CoRR, abs/1506.02509 (2015)
43. Zhang, Q., Xu, J., Xu, L., Guo, H.: Deep convolutional neural networks for forest fire detection, February 2016
44. Zhu, W., Miao, J., Qing, L.: Constrained extreme learning machine: a novel highly discriminative random feedforward neural network. In: 2014 International Joint Conference on Neural Networks (IJCNN), pp.
800–807, July 2014

Siamese Survival Analysis with Competing Risks

Anton Nemchenko1(B), Trent Kyono1, and Mihaela Van Der Schaar1,2,3
1 University of California, Los Angeles, Los Angeles, CA 90095, USA santon834@g.ucla.edu
2 University of Oxford, Oxford OX1 2JD, UK
3 Alan Turing Institute, 96 Euston Rd, Kings Cross, London NW1 2DB, UK

Abstract. Survival analysis in the presence of multiple possible adverse events, i.e., competing risks, is a pervasive problem in many industries (healthcare, finance, etc.). Since only one event is typically observed, the incidence of an event of interest is often obscured by other related competing events. This nonidentifiability, or inability to estimate true cause-specific survival curves from empirical data, further complicates competing risk survival analysis. We introduce the Siamese Survival Prognosis Network (SSPN), a novel deep learning architecture for estimating personalized risk scores in the presence of competing risks. SSPN circumvents the nonidentifiability problem by avoiding the estimation of cause-specific survival curves and instead determines pairwise concordant time-dependent risks, where longer event times are assigned lower risks. Furthermore, SSPN is able to directly optimize an approximation to the C-discrimination index, rather than relying on well-known metrics which are unable to capture the unique requirements of survival analysis with competing risks.

Keywords: Survival analysis · Competing risks · Siamese neural networks · C-index

1 Introduction

1.1 Motivation

Survival analysis is a method for analyzing data where the outcome variable is the time to the occurrence of an event (death, disease, stock liquidation, mechanical failure, etc.) of interest. Competing risks are additional possible events or outcomes that "compete" with and may preclude or interfere with the desired event observation.
Though survival analysis is practiced across many disciplines (epidemiology, econometrics, manufacturing, etc.), this paper focuses on healthcare applications, where competing risk analysis has recently emerged as an important analytical tool in medical prognosis [9,22,26]. With an increasing aging population, the presence of multiple coexisting chronic diseases (multimorbidities) is on the rise, with more than two-thirds of people aged over 65 considered multimorbid.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11141, pp. 260–269, 2018. https://doi.org/10.1007/978-3-030-01424-7_26

Siamese Survival Analysis with Competing Risks 261

Developing optimal treatment plans for these patients with multimorbidities is a challenging problem, where the best treatment or intervention for a patient may depend upon the existence of, and susceptibility to, other competing risks. Consider oncology and cardiovascular medicine, where the risk of a cardiac disease may alter the decision on whether a cancer patient should undergo chemotherapy or surgery. Countless examples like this involving competing risks are pervasive throughout the healthcare industry and insufficiently addressed in its current state.

1.2 Related Works

Previous work on classical survival analysis has demonstrated the advantages of deep learning over statistical methods [14,18,27]. The Cox proportional hazards model [6] is the baseline statistical model for survival analysis, but it is limited because the dependent risk function is the product of a linear covariate function and a time-dependent function, which is insufficient for modeling complex non-linear medical data. [14] replaced the linear covariate function with a feed-forward neural network as input to the Cox PH model and demonstrated improved performance.
The current literature addresses competing risks with statistical methods (the Fine-Gray model [8]), classical machine learning (Random Survival Forests [12,13], multi-task learning [1]), etc., with limited success. These existing competing risk models are challenged by computational scalability issues for datasets with many patients and multiple covariates. To address this challenge, we propose a deep learning architecture for survival analysis with competing risks that optimizes the time-dependent discrimination index. This is not trivial and will be elaborated in the next section.

1.3 Contributions

In both machine learning and statistics, predictive models are compared in terms of the area under the receiver operating characteristic (ROC) curve or, in the survival analysis literature, the time-dependent discrimination index. The equivalence of the two metrics was established in [11]. Numerous works on supervised learning [4,19,20,23] have shown that training models to directly optimize the AUC, rather than the error rate (or accuracy), improves out-of-sample (generalization) performance in terms of AUC. In this work, we adopt and apply this idea to survival analysis with competing risks. We develop a novel Siamese feed-forward neural network [3] designed to optimize concordance and account for competing risks by specifically targeting the time-dependent discrimination index [2]. This is achieved by estimating risks in a relative fashion, so that the risk for the "true" event of a patient (i.e., the event which actually took place) must be higher than: all other risks for the same patient, and the risks for the same true event of other patients who experienced it at a later time. Furthermore, the risks for all the causes are estimated jointly, in an effort to generate a unified representation capturing the latent structure of the data and estimating cause-specific risks.

262 A. Nemchenko et al.

Because our neural network issues a joint
risk for all competing events, it compares the different risks for the different events at different times and arranges them in a concordant fashion (an earlier time means a higher risk, for any pair of patients). Unlike previous Siamese neural network architectures [3,5,25], developed for purposes such as learning the pairwise similarity between different inputs, our architecture aims to maximize the distance between the output risks for the different inputs. We overcome the discontinuity problem of the above metric by introducing a continuous approximation of the time-dependent discrimination function. This approximation is only evaluated at the survival times observed in the dataset. However, training a neural network only over the observed survival times would result in poor generalization and undesirable out-of-sample performance (in terms of the discrimination index computed at different times). In response, we add a loss term which, for any pair of patients, penalizes cases where the longer event time does not receive the lower risk. The nonidentifiability problem in competing risks arises from the inability to estimate the true cause-specific survival curves from empirical data [24]. We address this issue by bypassing the estimation of the individual cause-specific survival curves and utilizing concordant risks instead. Our implementation is agnostic to any underlying causal assumptions and therefore immune to nonidentifiability. We report statistically significant improvements over state-of-the-art competing risk survival analysis methods on both synthetic and real medical data.

2 Problem Formulation

We consider a dataset H comprising time-to-event information about N subjects who are followed up for a finite amount of time. Each subject (patient) experiences an event D ∈ {0, 1, ..., M}, where D is the event type. D = 0 means the subject is censored (lost to follow-up, or the study ended).
If D ∈ {1, ..., M}, then the subject experiences one of the events of interest (for instance, the subject develops cardiac disease). We assume that a subject can only experience one of the above events and that the censorship times are independent of them [7,8,10,17,22,24]. T is defined as the time-to-event, where we assume that time is discrete, T ∈ {t_1, ..., t_K} with t_1 = 0 (t_i denotes the elapsed time since t_1). Let H = \{(T_i, D_i, x_i)\}_{i=1}^{N}, where T_i is the time-to-event for subject i, D_i is the event experienced by subject i, and x_i ∈ R^S are the covariates of the subject (measured at baseline, which may include age, gender, genetic information, etc.). The Cumulative Incidence Function (CIF) [8] computed at time t for a certain event D is the probability of occurrence of that particular event before time t, conditioned on the covariates x of a subject, and is given as F(t, D|x) = Pr(T ≤ t, D|x). The cumulative incidence function evaluated at a certain point can be understood as the risk of experiencing a certain event before a specified time.

In this work, our goal is to develop a neural network that can learn the complex interactions in the data, specifically addressing competing risks survival analysis. In determining our loss function, we consider that the time-dependent discrimination index is the most commonly used metric for evaluating models in survival analysis [2]. Multiple publications in the supervised learning literature demonstrate that directly approximating the area under the curve (AUC) and training a classifier on that approximation leads to better generalization performance in terms of the AUC (see e.g. [4,19,20,23]). However, these ideas have not been explored in the context of survival analysis with competing risks. We follow the same principles to construct an approximation of the time-dependent discrimination index to train our neural network.
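For intuition about the CIF, with discrete times it can be estimated naively (ignoring censoring) as the fraction of subjects who experienced event D by time t. A toy sketch, assuming complete (uncensored) follow-up; the data and names are illustrative, not from the paper:

```python
import numpy as np

def empirical_cif(T, D, event, t):
    """Naive empirical CIF: P(T <= t, D = event), valid only without censoring."""
    T = np.asarray(T)
    D = np.asarray(D)
    return np.mean((T <= t) & (D == event))

# Five subjects, two competing events (1 and 2); D = 0 would mean censored.
T = [2, 5, 3, 7, 4]
D = [1, 1, 1, 1, 2]
print(empirical_cif(T, D, event=1, t=4))  # subjects with T <= 4 and D = 1
```

With censoring present, a proper estimator (e.g. Aalen-Johansen) is needed instead; the point here is only that the CIF at time t is a cumulative event probability, which is what the network's output will estimate.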
We first describe the time-dependent discrimination index. Consider an ordered pair of two subjects (i, j) in the dataset. If subject i experiences event m, i.e., D_i = m, and if subject j's time-to-event exceeds the time-to-event of subject i, i.e., T_j > T_i, then the pair (i, j) is a comparable pair. The set of all such comparable pairs is defined as the comparable set for event m, denoted X^m.

A model outputs the risk of subject x for experiencing event m before time t, given as R^m(t, x) = F(t, D = m|x). The time-dependent discrimination index for a certain cause m is the probability that the model accurately orders the risks of the comparable pairs of subjects in the comparable set for event m. The time-dependent discrimination index [2] for cause m is defined as

C_t(m) = \frac{\sum_{k=1}^{K} \mathrm{AUC}^m(t_k)\, w^m(t_k)}{\sum_{k=1}^{K} w^m(t_k)},  (1)

where

\mathrm{AUC}^m(t_k) = \Pr\{R^m(t_k, x_i) > R^m(t_k, x_j) \mid T_i = t_k, T_j > t_k, D_i = m\},  (2)

w^m(t_k) = \Pr\{T_i = t_k, T_j > t_k, D_i = m\}.  (3)

The discrimination index in (1) cannot be computed exactly, since the distribution that generates the data is unknown. However, it can be estimated using a standard estimator, which takes as input the risk values associated with the subjects in the dataset. [2] defines the estimator for (1) as

\hat{C}_t(m) = \frac{\sum_{i,j=1}^{N} 1\{R^m(T_i, x_i) > R^m(T_i, x_j)\} \cdot 1\{T_j > T_i, D_i = m\}}{\sum_{i,j=1}^{N} 1\{T_j > T_i, D_i = m\}}.  (4)

Note that in (4) only the numerator depends on the model. Henceforth, we consider only the numerator, which we write as

\bar{C}_t(m) = \sum_{i,j=1}^{N} 1\{R^m(T_i, x_i) > R^m(T_i, x_j)\} \cdot 1\{T_j > T_i, D_i = m\}.  (5)

The above equation can be simplified as

\bar{C}_t(m) = \sum_{i=1}^{|X^m|} 1\{R^m(T_i(left), X^m_i(left)) > R^m(T_i(left), X^m_i(right))\},  (6)

where 1(x) is the indicator function, X^m_i(left) (X^m_i(right)) is the left (right) element of the i-th comparable pair in the set X^m, and T_i(left) (T_i(right)) is the respective time-to-event.
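The estimator (5) can be computed directly by looping over ordered pairs. A small sketch, with a fixed per-subject score standing in for a hypothetical risk model (names and data are illustrative):

```python
def concordance_numerator(T, D, risk, m):
    """Numerator of the time-dependent discrimination estimator (eq. 5):
    count ordered pairs (i, j) with D_i = m, T_j > T_i and
    risk_m(T_i, subject i) > risk_m(T_i, subject j)."""
    n = len(T)
    total = 0
    for i in range(n):
        if D[i] != m:
            continue  # i must have experienced event m
        for j in range(n):
            if T[j] > T[i] and risk(T[i], i) > risk(T[i], j):
                total += 1
    return total

# Toy data: three subjects; D = 0 marks censoring.
T = [1, 2, 3]
D = [1, 1, 0]
scores = [0.9, 0.5, 0.2]                 # hypothetical model scores, time-independent here
risk = lambda t, idx: scores[idx]
print(concordance_numerator(T, D, risk, m=1))
```

Here subject 0 is concordant with subjects 1 and 2, and subject 1 with subject 2, so the numerator counts three concordant comparable pairs.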
In the next section, we use the simplification (6) to construct the loss function for the neural network.

3 Siamese Survival Prognosis Network

In this section, we describe the architecture of the network and the loss functions that we propose to train it. Denote by H a feed-forward neural network, visualized in Fig. 1. It is composed of a sequence of L fully connected hidden layers with "scaled exponential linear unit" (SELU) activations. The last hidden layer is fed to M layers of width K. Each neuron in the latter M layers estimates the probability that a subject x experiences cause m in the time interval t_k, given as Pr^m(t_k, x). For an input covariate vector x, the output of all the neurons is a vector of probabilities \{[\Pr^m(t_k, x)]_{k=1}^{K}\}_{m=1}^{M}. The estimate of the cumulative incidence function computed for cause m at time t_k is given as \tilde{R}^m(t_k, x) = \sum_{i=1}^{k} \Pr^m(t_i, x). The final output of the neural network for input x is the vector of estimates of the cumulative incidence function, H(x) = \{[\tilde{R}^m(t_k, x)]_{k=1}^{K}\}_{m=1}^{M}.

Fig. 1. Illustration of the architecture.

The loss function is composed of three terms: a discrimination term, an accuracy term, and a loss term. We cannot use the metric in (6) directly to train the network, because it is a discontinuous function (composed of indicators), which can impede training. We overcome this problem by approximating the indicator function with a scaled sigmoid, \sigma(\alpha x) = \frac{1}{1 + \exp(-\alpha x)}. The approximated discrimination index is given as

\hat{\bar{C}}_t(m) = \sum_{i=1}^{|X^m|} \sigma\big(\alpha\big[\tilde{R}^m(T_i(left), X^m_i(left)) - \tilde{R}^m(T_i(left), X^m_i(right))\big]\big).  (7)

The scaling parameter α determines the sensitivity of the loss function to discrimination. If the value of α is high, then the penalty for an error in discrimination is also very high.
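The effect of the scaling parameter α in (7) can be seen numerically: as α grows, the scaled sigmoid of a risk difference approaches the 0/1 indicator in (6). A small illustration (the risk difference value is arbitrary):

```python
import math

def sigma(x, alpha):
    """Scaled sigmoid used to smooth the indicator 1{x > 0}."""
    return 1.0 / (1.0 + math.exp(-alpha * x))

diff = 0.2  # risk difference R(left) - R(right) for one comparable pair
for alpha in (1, 10, 100):
    print(alpha, round(sigma(diff, alpha), 4))
```

For small α the pair contributes a value near 0.5 regardless of ordering; for large α the contribution saturates toward 1 for a concordant pair and toward 0 for a discordant one, recovering the hard indicator at the cost of vanishing gradients.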
Therefore, higher values of α guarantee that the subjects in a comparable pair are assigned concordant risk values. The discrimination part defined above captures a model's ability to discriminate subjects for each cause separately. We also need to ensure that the model can predict the cause accurately. We define the accuracy of a model in terms of a scaled sigmoid with scaling parameter κ as follows:

L_1 = \sum_{i=1}^{|X^m|} \sigma\Big(\kappa\Big[\tilde{R}^{D(left)}(T_i(left), X^m_i(left)) - \sum_{m \neq D(left)} \tilde{R}^m(T_i(left), X^m_i(left))\Big]\Big).  (8)

The accuracy term penalizes the risk functions only at the event times of the left subjects in comparable pairs. However, it is important that the neural network is optimized to produce risk values that interpolate well to other time intervals as well. Therefore, we introduce the loss term below:

L_2 = \beta \sum_{m=1}^{M} \sum_{i=1}^{|X^m|} \sum_{t_k} \big[\tilde{R}^m(t_k, X^m_i(right))\big]^2.  (9)