Scientific Program – ECDA 2019

The submission deadline is over, more than 150 submissions have been accepted for presenting at the conference, including 11 Plenary Talks and Special Sessions on interesting research themes like „Interpretable Machine Learning“, „Biostatistics and Bioinformatics“, or „Consumer Preferences and Marketing Analytics“. Also, parallel to conference talks, a Workshop on Partial Least Squares Structural Equation Modeling (PLS-SEM) by Marko Sarstedt (University of Magdeburg) took place on Wednesday (March 20, 2019, 9:00-17:00).

• The detailed Program of the ECDA (PDF file) and
• the Book of Abstracts (including index of authors) can be found here.

INVITED SPEAKERS (Plenary Talks)

Jan Baumbach, TU Munich, Germany
Title: „Big Data Science in Systems Medicine – A Disruptive View on Current Medicine“
Abstract: One major obstacle in current medicine and drug development is inherent in the way we define and approach diseases. We discuss the diagnostic and prognostic value of (multi-)omics panels. We have a closer look at breast cancer survival and treatment outcome, as case example, using gene expression panels – and we will discuss the current „best practice“in the light of critical statistical considerations. In addition, we introduce computational approaches for network-based medicine. We discuss novel developments in graph-based machine learning using examples ranging from Huntington’s disease mechanisms via Alzheimer’s drug target discovery back to where we started, i.e. breast cancer treatment optimization – but now from a systems medicine point of view.We conclude that multi-scale network medicine and modern artificial intelligence open new avenues to shape future medicine. We will also have a short glimpse on novel approaches for privacy-aware sensitive medical data sharing. We quickly introduce the concept of federated machine learning and blockchain-based consent management to build a Medical AI Store ensuring privacy by design and architecture.
Hans-Hermann Bock, RWTH Aachen, Germany
Title: „Co-Clustering – A Survey on Models, Algorithms and Applications“
Abstract: When a data set is represented by a two-dimensional two-mode matrix it can often be useful to cluster the rows and the columns of this matrix. Clustering can be performed either separately or simultaneously. This second approach is called two-way clustering, biclustering or co-clustering and will be the subject of this paper. Major applications are known, e.g., from gene analysis, marketing, social sciences and text mining. There exists a large number of co-clustering methods, surveys are given, e.g., by [2], [3], [5], [7]. While some methods are based on empirical argumentations only, others are derived from probabilistic models and/or proceed by optimizing a suitable bi-partitional clustering criterion. Different methods have been developed for data matrices with real-valued, binary or qualitative entries, eventually corresponding to different practical purposes. Methods are also distinguished by the type of domains that should be bi-partitioned: the two index sets of data vectors xij, the two coordinate spaces of two-dimensional data xi = (xi1, xi2), the row and column categories of a contingency table N = (nij), etc. From a methodological point of view models can be classified into block models (yielding bi-partitions) or latent block models (using mixture models), combined with maximum likelihood approaches, eventually leading to generalized k-means or EM algorithms. Contingency tables are typically bi-clustered by maximizing information-type measures. The paper presents a structured survey on the most important co-clustering models, related numerical algorithms and various recent extensions, e.g., for multiway tables or functional data.
[1] Bock, H.-H. (2003): Convexity-based clustering criteria : theory, algorithms, and applications in statistics. Statistical Methods & Applications 12, 293-317. [2] Bock, H.-H. (2014) : Probabilistic two-way clustering approaches with emphasis on the maximum interaction criterion. Archives of Data Science 1 (1), 3-20. [3] Govaert, G., Nadif, M. (2013) : Co-clustering: models, algorithms and applications. Wiley, 2013. [4] Govaert, G., Nadif, M. (2018): Mutual information, phi-squared and model-based co-clustering for contingency tables. Adv. in Data Analysis and Classification 12, 455-488. [5] Pontes, B., Giráldez, R., Aguilar-Ruiz, J.S. (2015): Biclustering on expression data: a review. Journal of Biomedical Informatics 57, 163-180. [6] Schepers, J., Bock, H.-H., Van Mechelen, I. (2017): Maximal interaction two-mode clustering. Journal of Classification 34 (1), 49-75. [7] Van Mechelen, I., Bock, H.-H., De Boeck. P. (2004): Two-mode clustering methods: a structured review. Statistical Methods in Medical Research 13, 363-394.

Harald Hruschka, University of Regensburg, Germany
Title: „Analyzing Retail Market Basket Data by Unsupervised Machine Learning Methods“
Abstract: We compare the performance of several unsupervised machine learning methods, namely binary factor analysis, two topic models (latent Dirichlet allocation and the correlated topic model), the restricted Boltzmann machine and the deep belief net, on a retail market basket data set. We shortly present these methods and outline their estimation. Performance is measured by log likelihood values for a holdout data set. Binary factor analysis vastly outperforms topic models. Both the restricted Boltzmann machine and the deep belief net on the other hand attain a similar performance advantage over binary factor analysis. We also show how to interpret the relationships between the most important hidden variables and observed category purchases. To demonstrate managerial implications, we compute relative basket size increase due to promoting each category for the better performing models. Recommendations which product categories to promote based on the restricted Boltzmann machine and the deep belief net not only have lower uncertainty due to their better predictive performance, they are also more convincing than those derived by binary factor analysis, which leave out most categories with high purchase frequencies.

Sadaaki Miyamoto, University of Tsukuba, Japan
Title: „Relations Among K-means, Mixture of Distributions, and Fuzzy Clustering“
Abstract: The method of K-means is best known among different clustering algorithms. Another class of algorithms using statistical models is EM algorithms, a typical of which is based on Gaussian Mixture Model (GMM). Yet another idea for clustering is to use fuzzy models. The best-known fuzzy method is fuzzy K-means (FKM), a fuzzy extension of the original K-means, which uses an alternate minimization of an objective function. Although it is obvious that fuzzy K-means are an extension of K-means, a relation between FKM and GMM is generally unknown, or they are considered to have no relations, since a statistical model and a fuzzy model are very different. In spite of this general understanding, we show that they have close methodological relations in this talk. The main point to be noted is that we have many generalizations of fuzzy K-means, among which some methods are equivalent or similar to statistical mixture models. Indeed, KL-information based clustering proposed by Ichihashi et al. (2000) is formally equivalent to the iterative solution of GMM using EM algorithm. Generalized method of fuzzy K-means include two more variables for adjusting cluster sizes (or cluster volumes) and cluster covariances. It should be noted that they are not exactly those in statistical models with prior distributions and covariances in clusters but have similar functions. We hence overview the idea of these three classes of algorithms and show a number of methods of generalized fuzzy K-means which are equivalent or similar to statistical mixture of distributions. We moreover introduce what we call a classification function, whereby we uncover theoretical properties of clustering results not only in fuzzy methods but also those by statistical models.To conclude, what we emphasize in this talk is that the above three classes of methods are not independent but have close relations that should be noted for considering theoretical insights and also useful for further methodological development in data clustering.

Masahiro Mizuta, Hokkaido University, Japan
Title: „Meta-Analysis and Symbolic Data Analysis“
Abstract: We point out an important relationship between meta-analysis and symbolic data analysis (SDA) and discuss methods to support meta-analysis with SDA. SDA was proposed by Professor Edwin Diday in 1988. In most of statistical analysis methods, the unit of analysis is an individual, but the unit of analysis in SDA is an object (or class, set, concept), which has a description, e.g. interval value, distributional value, multi values. Diday insisted that symbolic data is “any data taking care on the variation inside classes of standard observation.” A role of SDA is to analyze internal variations of the data or concepts. In the early stages of SDA’s development, we study interval valued data. Nowadays, there are so many types of symbolic data, including modal interval data, distributional data, multivalued data. SDA is used in many fields.
Meta-Analysis, proposed in 1976 is a method to derive results with several studies and is widely used in many fields, including social science, medical science, drug approval. In medical research, meta-analysis has been placed at the top of the evidence pyramid (Oxford Center for Evidence-Based Medicine’s levels of Evidence and Recommendation). There are two models used in meta-analysis, the fixed effect model and the random effects model. Heterogeneity in meta-analysis refers to the variation in study outcomes between studies.
Meta-analysis and SDA have been developed independently, but there is something in common between them. In a meta-analysis, the unit of analysis is an individual study rather than an individual study participant. This framework is almost the same as that of SDA. Meta-analysis is in the top of the evidence pyramid. However, even if formally conducting analytic meta-analysis of multiple studies, it is difficult to find a hidden structure. SDA is a powerful tool for exploratory meta-analysis. There are many tools in SDA community, and most of them are useful for exploratory meta-analysis; symbolic clustering is an effective tool for subgroup analysis in meta-analysis for example.
Francesco Palumbo, University of Naples Federico II, Italy
Title: „Archetypes, Prototypes and Other Types: The Current State and Prospects for Future Research“
Abstract: Statistical and machine learning can significantly speed up human knowledge development, helping to determine the basic categories in a relatively short amount of time. Exploratory data analysis (EDA) can be considered the forefather of statistical learning; it relies on the mind’s ability to learn from data and, in particular, it aims to summarize data-sets through a limited number of interpretable latent features or clusters offering cognitive geometric models to define categorizations. It can also be understood as the implementation of the human cognitive process extended to huge amounts of data: Big Data. But EDA alone cannot be the answer to the questions: „How many, and what are the categories to retain?“ and „What are the observations that can represent a category better than others in human cognitive processes?“. The concept of categorization implies data summarization in a limited number of well-separated groups that must be maximally and internally homogeneous at the same time. This contribution aims to present an overview of the most recent literature on the archetypal analysis (AA) and its related methods as statistical categorization approaches. At the same time some most recent approaches in prototype identification when dealing with large and huge data-sets are presented. In combination with consistent clustering approaches, AA helps to identify those observed or unobserved prototypes that satisfy Rosch’s definition. Those small number of groups that are maximally homogeneous within the units belonging to the same group and maximally heterogeneous among groups, and which allow the development of the human knowledge by the relationships between prototypes and a new unknown object.
Marko Sarstedt, Otto von Guericke University Magdeburg, Germany
Title: „CB-SEM or PLS-SEM? Five Perspectives and Five Recommendations“
Abstract: To estimate structural equation models, researchers can draw on two main approaches: Covariance-based structural equation modeling (CB-SEM) and partial least squares structural equation modeling (PLS-SEM). Concerns about the limitations of the different approaches might lead researchers to seek reassurance by comparing results across approaches. But should researchers expect the results from CB-SEM and PLS-SEM to agree, if the structure of the two models is otherwise the same? Differences in philosophy of science and different expectations about the research situation underlie five different perspectives on this question. We argue that the comparison of results from CB-SEM and PLS-SEM is misleading and misguided, capable of generating both false confidence and false concern. Instead of seeking confidence in the comparison of results across methods, which differ in their specific requirements, computational procedures, and imposed constraints on the model, researchers should focus on more fundamental aspects of research design. Based on our discussion, we derive recommendations for applied research using SEM.

Matthias Schmid, University of Bonn, Germany
Title: „Discrete Time-to-Event Analysis – Methods and Recent Developments“
Abstract: Discrete time-to-event analysis comprises a set of statistical methods to analyze the duration time until an event of interest occurs. In contrast to classical methods for survival analysis, which typically assume the duration time to be continuous, discrete-time methods apply to situations where time is measured (or recorded) on a discrete time scale t = 1, 2, …. These situations are likely to occur in longitudinal studies with fixed follow-up times (e.g., epidemiological studies or panel studies) in which it is only known that events have happened (or not) between pairs of consecutive points in time. Unlike the classical approaches, discrete-time methods are able to handle potentially large numbers of tied duration times. Furthermore, estimation is greatly facilitated by the fact that the log-likelihood of a discrete time-to-event model is closely linked to the log-likelihood of a model with binary outcome.The talk will provide an overview of the most popular methods for discrete time-to-event modeling. In addition, it will cover recent developments on model validation, tree-based approaches, and methods for discrete-time competing risks analysis.

Andrzej Sokolowski, University of Krakow, Poland
Title: „Composite Indicators and Rankings“
Abstract: Rankings are popular both in scientific research and in everyday life. Many institutions rank different objects from the best to the worst. So we have ranking of countries, universities, cities, hospitals, jobs, provinces, books, songs, actors, footballers etc. If such a ranking is based on just one variable measured in strong scale, then the task is trivial. The only thing we should decide is the direction – the bigger the better or the smaller the better. The problem is more complicated with multidimensional case, even with just two diagnostic variables.The aim of the paper is to present and discuss issues connected with the consecutive stages of the process of constructing composite indicators. They are as follows:
- Definition of the general criterion and possible sub-criteria
- Choosing diagnostic variables – initial and final list. Some unjustified statistical procedures
- Setting relations between general criterion and diagnostic variables (stimulants, destimulants, nominants). Merit and automatic identification methods
- Transformation of variables – normalization, standardization, other transformations
- Weighting systems
- Aggregation formulas
- Robustness and sensitivity

Linear ordering methods are sometimes divided into those with benchmark and without. It can be argued that every method has some benchmark – assumed or calculated from the data.

Short review of the literature will be given and some example provided and critically discussed – such as Human Development Index or 200 Best Jobs in USA. Finally, new propositions will be presented such as – iterative procedures, method for choosing the best composite indicator for w given set of variables, Multidimensional Scaling approach, and non-linear ordering.

Rainer Spang, University of Regensburg, Germany
Title: „Zero-Sum Regression“
The performance of Machine Learning algorithms can critically depend on the preprocessing of the input data. This is the case for many types of molecular data used in biomedicine including transcriptomics, proteomics and metabolomics profiles. Biases from experimentation and measurement can strongly effect performance, even after state of the art data normalization protocols where applied. Some Machine Learning algorithms propagate these biases more than others. For supervised learning problems we propose “zero-sum“ regression as a tool designed to be fairly robust against frequent biases. This is achieved by using generalized linear models combined with the zero-sum constraint, which causes the resulting models to be scale invariant.
Dirk Van den Poel, University of Ghent, Belgium
Title: „Using Microblogging to Predict IPO Success by Means of Apache Spark Big Data Analytics“
Abstract: This research investigates the influence of Twitter social media messages on the success/failure of Initial Public Offerings (IPOs). We analyze the (potential) impact of the number of tweets, their sentiment, and many other features on (1) the difference between the IPO price and the closing price of the stock at their first day of trading, and (2) the difference between the closing price on the first day of trading and the closing price after three months of trading.

SPECIAL SESSIONS

Bioinformatics and Biostatistics
Organization: Dominik Heider
This special session invites contributions from all aspects of biostatistics and bioinformatics, with a special emphasis on machine learning and statistical learning for biomedical problems, ranging from biotechnological applications towards medical diagnostics and prognostics. We further encourage work describing novel methods for preprocessing of biomedical data, e.g., feature selection methods.
Complexity, Data Science and Statistics Through Visualization and Classification
Organization: Carmela Iorio, Roberta Siciliano
Finding new ways to tackle complex policy issues at the nexus between water, energy and food resources is the key aim of the MAGIC Project (Moving Towards Adaptive Governance in Complexity: Informing Nexus Security – https://magic-nexus.eu – Horizon 2020, GA 689669) supporting this special session. Complexity, Data Science and Statistics are necessarily integrated when developing quantitative storytelling, visualization and classification tools for a more robust understanding of the Nexus. Complexity is not only refered to data dimensionality but also to the variety, the structure, the knowledge discovery process and fundamentally the statistical modelization. Examples of classification tools are fuzzy clustering procedures when dealing with time series and directional data. Examples of visualization storyboards are Integrated Multi-Metric Sankey and Tree Diagrams.
Consumer Preferences and Marketing Analytics
Organization: Reinhold Decker, Winfried Steiner
This track invites methodological, theoretical or empirical papers which aim to contribute to the general understanding of consumer preferences. The focus is laid on tools/methods like conjoint analysis or discrete choice analysis in a marketing context that aim at informing and improving management decisions. However, we also encourage work that deals with further quantitative techniques to extract preference information from consumer data.
Data Analysis in Finance
Organization: Krzysztof Jajuga
Finance is the most often explored area, where data analysis methods, including classification methods, are successfully applied. There is a plethora of different approaches of data analysis to be used for financial data, including stochastic approach and purely descriptive approach. The proposed session covers wide array of possible topics, including: Big Data in Finance, Time Series Analysis of Financial Market Data, Text Mining of Financial Databases, Financial Risk Analysis Methods, Analysis of Digital Assets.
Data Analysis Models in Economics and Business
Organization: Józef Pociecha
This is a proposal for those who are dealing with applications of data analysis and classification models, machine learning procedures, multivariate time series and other multivariate methods in various areas of economic and business research. We are waiting for examples of a new approach to such type of empirical investigations, useful both for analytics and practice.
Interpretable Machine Learning
Organization: Johannes Fürnkranz, Eneldo Loza Mencía, Ute Schmid
The interest in machine learning has greatly increased recently, in great measure due to impressive advances in some fields achieved by complex methods like deep learning. Many of these advances are about to make their way into society soon. This has renewed the demand on models that are not only accurate but also interpretable. Interpretable models allow the practitioner to obtain important insights about the task at hand. Moreover, ensuring model interpretability is of central importance in application domains where trust into the system is essential for its acceptance and where malfunctioning may result in legal liability.In this special session, we therefore solicit contributions that (i) discuss and evaluate the interpretability of various types of machine learning models, (ii) use representations which are understandable by experts and even non-experts, (iii) introduce models that can be inspected, verified, and possibly also modified by non-expert users, (iv) offer explanations or visualizations of their decisions, (v) develop methods for interpretable learning in complex domains like multi-target learning or structured output prediction.
Statistical Learning
Organization: Angela Montanari
The session aims at putting together new developments in the field of statistical learning (both supervised and unsupervised), emphasizing statistical models and assessment of uncertainty.

TUTORIALS

Partial Least Squares Structural Equation Modeling (PLS-SEM)
by Marko Sarstedt on Wednesday, March 20, 2019, 9:00-17:00 (Room S57 (RW I))
Partial least squares structural equation modeling (PLS-SEM) has recently received considerable attention in a variety of disciplines, including marketing, strategic management, management information systems, and many more.PLS is a composite-based approach to SEM, which aims at maximizing the explained variance of dependent constructs in the path model. Compared to other SEM techniques, PLS allows researchers to estimate very complex models with many constructs and indicator variables.Furthermore, PLS-SEM allows to estimate reflective and formative constructs and generally offers much flexibility in terms of data requirements.This full-day workshop introduces participants to the state-of-the-art of PLS-SEM using the SmartPLS 3 software. After a brief introduction to the basic principles of structural equation modeling, participants will learn the foundations of PLS-SEM and how to apply the method by means of the SmartPLS software. The workshop will cover various aspects related to the evaluation of measurement and structural model results. For this purpose, the instructor will make use of several examples and exercises.Further information can be found on the flyer to the tutorial.