XI Workshop on Probabilistic and Statistical Methods
February 25-27, 2026
The Workshop on Probabilistic and Statistical Methods (WPSM) is a meeting organized by the Joint Graduate Program in Statistics UFSCar/USP (PIPGEs, São Carlos, SP, Brazil) with the aim of discussing new developments in Probability, Statistics, and their applications.
Activities include 7 plenary conferences, 2 mini-conferences with national and international researchers, and one short course. Two special sessions are also planned: one on Complex Networks and another on Mathematical Statistics and Machine Learning. The program also features 2 poster sessions and 2 oral communication sessions.
Rodney Fonseca - UFBA
An initial screening of which covariates are relevant is a common practice in high-dimensional regression models. Classic feature screening selects only the subset of covariates correlated with the response variable. However, many important features might have a relevant albeit highly nonlinear relation with the response. One screening approach that handles nonlinearity is to compute the correlation between the response and nonparametric functions of each covariate. Wavelets are powerful tools for nonparametric and functional data analysis but are still seldom used in the feature screening literature. In this talk, we present a wavelet feature screening method that can be easily implemented. Theoretical and simulation results show that the proposed method can capture true covariates with high probability, even in highly nonlinear models. We also present an example with real data in a high-dimensional setting. Joint work with Pedro Morettin and Aluísio Pinheiro.
Acknowledgments: The author thanks the CeMEAI-CEPID program and the São Paulo Research Foundation (FAPESP), process 2013/07375-0.
Mariana Rodrigues Motta - UNICAMP
We consider unsupervised classification using a latent multinomial variable to categorize a scalar response into one of the L components of a mixture model that incorporates scalar and functional covariates. This process can be viewed as a hierarchical model with three levels. The first level models the scalar response according to a mixture of parametric distributions. The second level models the mixture probabilities using a generalized linear model with functional and scalar covariates, as proposed by Garcia et al. (2024). For correlated functional covariates, the third level accounts for the correlation among curves through a mixed model. We use basis expansions to reduce dimensionality and a Bayesian approach to estimate the parameters while providing predictions of the latent classification vector. The method is motivated by a study aiming to identify placebo responders in a clinical trial (normal mixture model).
Matthieu Jonckheere - CNRS LAAS (France)
We present the Fermat distance, a density-based estimator of weighted geodesic distances that takes into account the underlying density of the data. Consistency of the estimator is proven using tools from first-passage percolation. The macroscopic distance obtained depends on a single parameter: we discuss the choice of this parameter and the properties of the resulting distance for machine learning tasks, as well as new developments for improving its practical robustness.
Daniel Takahashi - UFRN
The system of taking turns during vocal exchanges is fundamental to the communication of several animal species, yet its developmental origins and neural mechanisms remain elusive. Marmoset monkeys readily exchange vocalizations when in acoustic contact with conspecifics. Their turn-taking capacity improves during development, decreasing the amount of overlap. We developed a stochastic dynamical systems model of marmoset vocal exchanges based on the interactions among three neural structures ("drive," "motor," and "auditory") with feedback connectivity. Fitting the model to empirical data, we found that the noise level in the auditory sensory system decreases during development, matching the timing of the improvement in the capacity to avoid overlapping calls and suggesting a major role for the auditory system in early development.
Keith Levin - University of Wisconsin-Madison (USA)
Spectral methods are widely used to estimate eigenvectors of a low-rank signal matrix subject to noise. Typically, the estimation error depends on the coherence of the signal matrix. We present a method whose entrywise estimation error is independent of coherence under Gaussian noise (spiked Wigner model), achieving optimal estimation rates up to logarithmic factors. Extensions to higher ranks and non-Gaussian settings show promising results. Additionally, new metric entropy bounds for rank-r singular subspaces under the 2-to-infinity distance are derived.
Carlos Cesar Trucios Maza - UNICAMP
Volatility models based on intraday data outperform classical GARCH approaches. We propose a range-based GARCH model that accounts for both leverage effect and extreme observations. An extensive Monte Carlo simulation and empirical applications to US stocks support the method's performance. The approach is available through a user-friendly R package, developed in joint work with Adi Pari (UNICAMP).
Karthik Bharath - University of Nottingham (UK)
Manifolds are appropriate representation spaces for many complex data objects, including directional data, shape data, and data from diffusion tensor imaging. Distributions on manifolds are typically difficult to specify in closed form, and efficient strategies to sample from unnormalised densities must navigate the twofold challenge of high dimensionality and the curvature of the manifold. Starting with an intrinsic Langevin diffusion on a compact Riemannian manifold with a target distribution as its invariant distribution, I will present an intrinsic discrete-time Markov chain for sampling from the target, and identify conditions under which error bounds on approximations of linear functionals of the target distribution match optimal bounds in the Euclidean case. I will present some illustrations on positively and negatively curved manifolds, and comment on extensions.
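For context (standard background, not specific to the talk): in the Euclidean case, the overdamped Langevin diffusion with invariant distribution \(\pi\) is

```latex
\mathrm{d}X_t = \tfrac{1}{2}\nabla \log \pi(X_t)\,\mathrm{d}t + \mathrm{d}B_t ,
```

and the intrinsic version on a Riemannian manifold replaces the Euclidean gradient by the Riemannian gradient and \(B_t\) by the Brownian motion generated by the Laplace-Beltrami operator.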
Silvana Schneider - UFRGS
In this work, we present approaches based on cure fraction models with dependent censoring. The association between failure times and dependent censoring can be accommodated either through frailty models or via copula functions. The marginal distributions are modeled using the Weibull distribution and the piecewise exponential (PE) distribution. The results of the simulation study show small relative bias and coverage probability close to the nominal value. To evaluate treatment adherence in Tuberculosis (TB) and subsequent outcomes, we consider a sample of TB/HIV co-infected patients from the Alvorada Cohort, an epidemiological study conducted in Alvorada, Rio Grande do Sul.
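As background (a standard formulation; details may differ from the authors' exact models): in the copula route, the joint survival function of the failure time \(T\) and the dependent censoring time \(C\) is tied together as

```latex
S_{T,C}(t_1, t_2) = C_\theta\bigl(S_T(t_1),\, S_C(t_2)\bigr),
```

where \(C_\theta\) is a copula with association parameter \(\theta\), while the cure fraction typically enters through an improper marginal of mixture form \(S_T(t) = \pi + (1-\pi)\,S_0(t)\), with \(\pi\) the cured proportion and \(S_0\) a proper (e.g. Weibull or piecewise exponential) survival function.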
Thiago Rodrigo Ramos - UFSCar
The phenomenon of measure concentration is central to understanding the behavior of high-dimensional probabilistic models in machine learning. We will discuss classical inequalities such as Hoeffding’s and McDiarmid’s, their applications to generalization error bounds, model selection, and the analysis of the geometry of high-dimensional data, emphasizing the intuition behind the results.
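As an illustrative sketch (not part of the talk), Hoeffding's inequality for i.i.d. variables bounded in \([0, 1]\) states that \(P(|\bar{X}_n - E[X]| \ge t) \le 2e^{-2nt^2}\). The minimal Python check below compares the empirical deviation frequency to the bound; the sample size, threshold, and number of trials are illustrative choices.

```python
import math
import random

# Hoeffding's inequality for i.i.d. X_i in [0, 1]:
#   P(|mean - E[X]| >= t) <= 2 * exp(-2 * n * t**2)
# Minimal sketch with illustrative parameters.
random.seed(0)
n, t, trials = 200, 0.1, 5000
deviations = 0
for _ in range(trials):
    mean = sum(random.random() for _ in range(n)) / n  # uniform, E[X] = 0.5
    if abs(mean - 0.5) >= t:
        deviations += 1
empirical = deviations / trials
bound = 2 * math.exp(-2 * n * t ** 2)   # here approx. 0.037
print(empirical <= bound)
```

The bound is loose here (the true deviation probability is far smaller), which is exactly the dimension-free generality that makes such inequalities useful for generalization bounds.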
Daiane de Souza Santos - USP
Cure fraction survival models analyze data in which a proportion of individuals is cured. We propose modeling the incidence with algorithms such as SVM, Extreme Gradient Boosting, Random Forests, decision trees, and neural networks, allowing complex relationships to be captured. An estimation method was developed, and simulations show that the model outperforms existing approaches, especially in modeling the incidence.
Andressa Cerqueira - UFSCar
In this talk, we present new results on the weak and strong consistency of the maximum and integrated conditional likelihood estimators for community detection in the Stochastic Block Model with k communities. We show that the maximum conditional likelihood achieves the optimal threshold for exact recovery in the logarithmic degree regime, while the integrated version attains a sub-optimal constant. Both methods are also weakly consistent in the divergent degree regime.
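For reference (a classical result, stated here only as context): in the balanced two-community SBM with within-community edge probability \(a \log n / n\) and between-community probability \(b \log n / n\), exact recovery is possible if and only if

```latex
\frac{(\sqrt{a} - \sqrt{b})^2}{2} > 1 .
```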
André Fujita - USP
Instead of using incidental features of networks, we focus on inference about the generating mechanisms, which are the true structures of interest. We present a nonparametric statistical framework for inference on these mechanisms, essential when parametric models are unsatisfactory.
Keith Levin - University of Wisconsin-Madison (USA)
We study networks built from correlated time series. We show that, under suitable conditions, applying the adjacency spectral embedding to these networks recovers the embeddings of the underlying series, encoding Fourier coefficients of the original signals.
Pablo Groisman - UBA (Argentina)
We analyze the consistency of k-means in metric spaces with unknown distances, establishing consistency results under measured-Gromov-Hausdorff convergence. Applications include estimation with Fermat and Isomap distances and estimation of barycenters of measures from samples.
Matthieu Jonckheere - CNRS LAAS (France)
We show that Q-learning can fail to learn optimal policies in infinite state spaces, using regenerative arguments based on "cookie" random walks.
Daniel Takahashi - UFRN
We propose a nonparametric guessing method for categorical time series that avoids explicit estimation of conditional probabilities. The method achieves a learning rate independent of the alphabet size and is nearly optimal, as shown by minimax lower bounds.
Uriel Moreira Silva (DEST-ICEx, UFMG)
In this work we propose a manifold version of the particle Metropolis-adjusted Langevin algorithm (pMALA) of Nemeth et al. (2016) for parameter inference in state space models (SSMs), which we name the particle manifold Metropolis-adjusted Langevin algorithm (pmMALA). Our method is a modification of pMALA that uses low-variance Hessian estimates of the log-target density as a metric tensor in the context of Riemannian Manifold Hybrid Monte Carlo (RMHMC) algorithms. A key ingredient for ensuring proper convergence of RMHMC methods is that the metric tensor be positive definite; here we satisfy this condition by employing adaptive step-size selection methods that originate in the nonlinear optimization literature and were recently adapted for RMHMC. The end result is a method that requires neither manual tuning of hyperparameters nor pre-conditioning of the covariance matrix based on a pilot run, and that can achieve optimal scaling under relatively weak conditions. We illustrate pmMALA's performance using both synthetic and real data, and show that it can obtain substantial inferential gains over conventional pMCMC and pMALA even in challenging nonlinear settings.
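As background (following Girolami and Calderhead's simplified manifold MALA; details may differ from the authors' exact formulation), the position-dependent proposal underlying such methods takes, for step size \(\varepsilon\) and metric tensor \(G(\theta)\),

```latex
\theta^{*} = \theta + \frac{\varepsilon^{2}}{2}\, G(\theta)^{-1} \nabla_{\theta} \log \pi(\theta)
           + \varepsilon\, G(\theta)^{-1/2}\, \xi, \qquad \xi \sim \mathcal{N}(0, I),
```

accepted or rejected with a Metropolis-Hastings correction; this requires \(G(\theta)\) to be positive definite, which motivates the Hessian-estimate and step-size machinery described above.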
Hellen Geremias Gatica Santos (Joint Graduate Program in Statistics, UFSCar/USP - PIPGEs, São Carlos, SP, Brazil and Instituto Carlos Chagas, Curitiba, PR, Brazil), Daiane Aparecida Zuanetti (Department of Statistics, UFSCar, São Carlos, SP, Brazil)
Variable selection is a recurring challenge in statistical modeling, especially in high-dimensional settings such as genome-wide association studies, which require effective strategies for controlling the False Discovery Rate (FDR). A false discovery occurs when a decision rule rejects a true null hypothesis (a false positive result). Popular methods such as LASSO (Least Absolute Shrinkage and Selection Operator), although producing more interpretable models with good predictive performance, do not offer formal guarantees of FDR control in an inferential context, which may lead to the inclusion of irrelevant variables in the final model. The knockoff filter has emerged as a recent alternative, grounded in the construction of variables that act as negative controls in the selection problem. In summary, the procedure consists of two steps: (i) for each variable in the design matrix, a copy (the knockoff) is constructed that preserves the correlation structure among the variables but is, by design, independent of the response; (ii) a variable selection algorithm (e.g., LASSO) is then applied to the augmented matrix containing the original variables and their knockoffs, followed by the computation of a statistic comparing each original and knockoff pair. Under the null hypothesis, this statistic is expected to be approximately symmetric around zero, allowing the definition of a threshold corresponding to a target FDR. This work aims to evaluate the use of the knockoff filter for selecting genetic markers, comparing two existing methods and three new proposals for knockoff generation, followed by their assessment together with the original variables using LASSO. The analyses were conducted under three scenarios with different dependence structures: independent variables (scenario 1), complex correlation typical of DNA microarrays (scenario 2), and simple local correlation observed in microsatellites (scenario 3). 
Knockoff quality was evaluated based on the preservation of the correlation structure among original variables and on the similarity between the Ridge regression coefficients of null original variables and their corresponding knockoffs. Although generated by simpler models, the knockoffs produced by the proposed methods showed quality comparable to those obtained by existing approaches, particularly in scenarios 2 and 3. LASSO applied either to the original variables alone or to the augmented matrices with knockoffs from each method was evaluated using FDR, power, F1 score, and Matthews correlation coefficient. Overall, LASSO with knockoffs demonstrated strong performance, achieving lower average FDR without substantial loss of power compared to LASSO applied only to the original variables. This result is especially relevant given the scalability and computational efficiency of LASSO. Additionally, the knockoffs generated by the proposed methods displayed competitive performance, with satisfactory FDR and power, particularly in scenarios with some degree of dependence among variables. Their conceptually simpler structures also facilitate application to real genetic data. Overall, the results suggest that LASSO is relatively robust to knockoff quality with respect to preserving the correlation structure of the original variables, maintaining good performance even when this structure is underestimated or overestimated by the knockoffs.
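The two-step knockoff procedure described above can be sketched in miniature. This is an illustrative simplification, not any of the paper's generators or its LASSO-based statistic: with independent standard Gaussian features (scenario 1), an independent fresh Gaussian copy is a valid knockoff, and we use the marginal correlation difference \(W_j = |\mathrm{corr}(X_j, y)| - |\mathrm{corr}(\tilde{X}_j, y)|\) to keep the sketch dependency-free. All sizes and coefficients are hypothetical.

```python
import random

random.seed(1)
n, p, q = 500, 50, 0.2                      # samples, features, target FDR
beta = [3.0] * 5 + [0.0] * (p - 5)          # first 5 features carry signal
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
Xk = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]  # knockoffs
y = [sum(b * x for b, x in zip(beta, row)) + random.gauss(0, 1) for row in X]

def corr(col, resp):
    # Pearson correlation between a feature column and the response.
    m = len(resp)
    mx, my = sum(col) / m, sum(resp) / m
    sxy = sum((a - mx) * (b - my) for a, b in zip(col, resp))
    sxx = sum((a - mx) ** 2 for a in col)
    syy = sum((b - my) ** 2 for b in resp)
    return sxy / (sxx * syy) ** 0.5

# Statistic comparing each original/knockoff pair; symmetric around 0 for nulls.
W = [abs(corr([r[j] for r in X], y)) - abs(corr([r[j] for r in Xk], y))
     for j in range(p)]

# Knockoff+ threshold: smallest t with (1 + #{W_j <= -t}) / #{W_j >= t} <= q.
ts = sorted(abs(w) for w in W if w != 0)
T = next((t for t in ts
          if (1 + sum(w <= -t for w in W)) / max(1, sum(w >= t for w in W)) <= q),
         float("inf"))
selected = [j for j in range(p) if W[j] >= T]
print(selected)  # the signal features 0-4 should dominate the selection
```

The threshold construction is what delivers the finite-sample FDR guarantee; swapping the correlation statistic for the LASSO-based one changes only the definition of `W`.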
Hugo Gobato Souto (USP & LuizaLabs), Francisco Louzada Neto (USP)
This paper introduces the Difference-in-Differences Bayesian Causal Forest (DiD-BCF), a novel non-parametric model addressing key challenges in DiD estimation, such as staggered adoption and heterogeneous treatment effects. DiD-BCF provides a unified framework for estimating Average (ATE), Group-Average (GATE), and Conditional Average Treatment Effects (CATE). A core innovation, its Parallel Trends Assumption (PTA)-based reparameterization, enhances estimation accuracy and stability in complex panel data settings. Extensive simulations demonstrate DiD-BCF's superior performance over established benchmarks, particularly under non-linearity, selection biases, and effect heterogeneity. Applied to U.S. minimum wage policy, the model uncovers significant conditional treatment effect heterogeneity related to county population, insights obscured by traditional methods. DiD-BCF offers a robust and versatile tool for more nuanced causal inference in modern DiD applications.
Bruno Martins Cordeiro (Department of Computer Science, Universidade Federal de Alfenas), Mateus Henrique Martins (Department of Computer Science, Universidade Federal de Alfenas), Iago Augusto Carvalho (Department of Computer Science, Universidade Federal de Alfenas)
Player transfers between clubs are a crucial financial component of professional soccer. This mechanism enables clubs to trade players, facilitating rapid capital acquisition. These transactions are documented on Transfermarkt, a website containing a substantial volume of data, including buy and sell transactions and loan transfers between soccer clubs. In this work, we employed data from Transfermarkt to construct a complex network based on historical transfer records from 1992 to 2024. We analyzed the structural properties of this network, contrasting the South American and European markets. Results indicate that the volume of transfers and the financial amounts involved have grown steadily in both networks, with the exception of the COVID-19 pandemic period. The European market exhibits a moderate average small-world coefficient of 3.2 and an average clustering coefficient close to 0.08. Furthermore, the degree distribution of the European network does not follow a power-law distribution, resembling a hybrid social network. In this case, clubs tend to transact with partners of their partners, but also negotiate with other soccer clubs outside of their clusters. Conversely, the South American network presents a smaller average small-world coefficient but a higher average clustering coefficient than its European counterpart, suggesting a higher level of connectivity among South American clubs. Future work will further explore the structural properties of these networks through community detection and centrality analysis.
Riquelme Nascimento dos Santos (PIPGES-UFSCar/USP), Ricardo Felipe Ferreira (UFSCar)
In this communication, I present a nonparametric approach to model selection in stochastic chains with variable-length memory, represented by structures known as context trees. The methodology is based on permutation tests applied locally to each possible context extension, using statistics built exclusively from empirical counts and avoiding strong parametric assumptions about the data distribution. We introduce the concept of partial exchangeability, which makes it possible to generate permuted resamples even under temporal dependence. To handle the multiplicity inherent in the hierarchical tree structure, we incorporate a procedure for rigorous control of the family-wise error rate (FWER), allowing uncertainty to be quantified globally. The combination of permutation tests and FWER control yields a conceptually sound procedure for inference on context trees, offering an alternative to likelihood-based methods. In addition, we carry out simulation studies comparing the method with other selection algorithms and demonstrate its applicability to real data.
Oluwafunmilayo Adenike Dawodu, Diego Carvalho Nascimento, Osafu Augustine Egbon, and Francisco Louzada (PIPGES, UFSCar-USP)
Count data are usually modelled with the Poisson distribution, which assumes the variance is a deterministic function of the mean. However, this may not capture the data's variability or account for overdispersion and outliers. These outliers may lie in the lower or upper tail, motivating the modelling of the interdecile range to better represent the data's spread. Understanding these outliers provides valuable insights into data points that deviate from the majority, which is a common analytical challenge. To address this issue, we developed a model that handles the (a)symmetric interdecile range obtained when aggregating multi-valued (daily) data into single-valued (weekly) symbolic data. The goal is to analyse the interdecile range and enable end-to-end monitoring of the virus's spread and geographical distribution to mitigate the risk of regional transmission. A Bayesian quantile regression based on the asymmetric Laplace distribution (ALD) was developed to model the interdecile range. Experimental results on synthetic datasets demonstrate that the proposed ALD model outperforms symmetric alternatives. We apply the proposed model to avian influenza (H5N1) data to reveal disease spread and identify regions with higher spillover exposure.
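For reference, Bayesian quantile regression exploits the fact that maximizing an ALD likelihood is equivalent to minimizing the check loss: for quantile level \(\tau\),

```latex
\rho_\tau(u) = u\bigl(\tau - \mathbb{I}\{u < 0\}\bigr), \qquad
\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^{\top}\beta\right),
```

since the ALD density with location \(\mu\), scale \(\sigma\) and skewness \(\tau\) is proportional to \(\exp\{-\rho_\tau((y-\mu)/\sigma)\}\).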
Victor Coscrato - UFSCar
In this mini-course, we will explore the implementation of neural networks using PyTorch, from fundamental concepts such as tensors and automatic differentiation to the implementation of a practical model. No prior knowledge of PyTorch is required, but familiarity with Python and basic machine learning concepts is recommended.
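As a taste of the two fundamentals mentioned above (a minimal sketch, not the course's actual material): PyTorch tensors can track gradients, and `backward()` performs automatic differentiation through the computation graph.

```python
import torch

# A scalar tensor with gradient tracking enabled.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x        # builds the computation graph for y = x^2 + 3x
y.backward()              # backpropagates through the graph
print(x.grad)             # dy/dx = 2x + 3, evaluated at x = 2, i.e. 7
```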
For additional information, please contact us here. We will get back to you as soon as possible.
(Submission)
Abstract submission deadline: Jan 31st
Notification of acceptance: Feb 9th
Deadline for early registration: Jan 10th
Workshop: Feb 25-27, 2026.
Undergraduate students: R$ 35,00
Graduate students: R$ 65,00
Others: R$ 135,00
(Registration)
Undergraduate students: R$ 60,00
Graduate students: R$ 100,00
Others: R$ 180,00
For lunch and dinner we recommend:
Workshop venue: Auditório Anfiteatro Bento Prado Júnior (UFSCar).
Special session on Complex Networks: Room 6-001, Auditório Fernão S. Rodrigues Germano (USP).
Special session on Mathematical Statistics and Machine Learning: Room 4-111, Auditório Prof. Luiz Antonio Favaro (USP).
For lodging and accommodations we recommend: Anacã São Carlos, Marklin Hotel & Suites and Sleep Inn
See below some information related to the previous editions of our workshop.
X WPSM |
February 21-23, 2024, ICMC-USP | Book of Abstracts |
IX WPSM |
February 9-11, 2022, UFSCar | Book of Abstracts |
VIII WPSM |
February 12-14, 2020, UFSCar | Book of Abstracts |
VII WPSM |
February 13-15, 2019, UFSCar | Book of Abstracts |
VI WPSM |
February 5-7, 2018, UFSCar | Book of Abstracts |
V WPSM |
February 6-8, 2017, ICMC-USP | Book of Abstracts |
IV WPSM |
February 1-3, 2016, UFSCar | Book of Abstracts • Flyer |
III WPSM |
February 9-11, 2015, ICMC-USP | Book of Abstracts • Flyer |
II WPSM |
February 5-7, 2014, UFSCar | Book of Abstracts • Flyer |
WPSM |
January 28-30, 2013, ICMC-USP | Book of Abstracts |
If you have any questions, please do not hesitate to contact us at wpsm.pipges@gmail.com.