Special Session 78: Special Session on Mathematics of Data Science and Applications

Enhanced Efficient Heterogeneous Graph Neural Networks

JIA CAI
Guangdong University of Finance and Economics
Peoples Rep of China
Co-Author(s):    Ranhui Yan
Abstract:
Heterogeneous Graph Neural Networks (HGNNs) have shown powerful performance on heterogeneous graph learning by aggregating information from different types of nodes and edges. However, existing heterogeneous graph models may confront three major challenges: (1) predefined meta-paths are required to capture the semantic relations between nodes of different types; (2) existing models have to stack many layers to learn long-range dependencies; (3) performance degradation and semantic confusion may occur as the network depth grows. In this talk, we introduce two models to address these challenges. Specifically, we develop an end-to-end Dense connected Heterogeneous Graph Convolutional Network (Dense-HGCN) to learn node representations. Dense-HGCN computes attention weights between different nodes and incorporates the information of previous layers into each layer's aggregation process via a specific fuse function. Moreover, Dense-HGCN leverages multi-scale information for node classification and other downstream tasks. Furthermore, we develop a Virtual Nodes based Heterogeneous Graph Convolutional Network (VN-HGCN). Virtual nodes are auxiliary nodes connected to all nodes of a given type in the graph, enabling efficient aggregation of long-range information across different types of nodes and edges.
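
As an illustrative sketch (my own toy code, not the authors' implementation; the function name and the fuse rule are hypothetical), the virtual-node idea can be mimicked in a few lines: a per-type virtual node averages the features of all nodes of its type and broadcasts the result back, so any two nodes of the same type exchange information within two hops regardless of graph distance.

```python
import numpy as np

def virtual_node_pass(features, type_ids, n_types):
    """One aggregation step through per-type virtual nodes (illustrative only).

    Each virtual node averages the features of all nodes of its type and
    broadcasts the result back; the 0.5/0.5 fuse is an arbitrary choice.
    """
    out = features.copy()
    for t in range(n_types):
        mask = (type_ids == t)
        if mask.any():
            v = features[mask].mean(axis=0)         # virtual-node state
            out[mask] = 0.5 * (features[mask] + v)  # fuse node and virtual-node info
    return out

X = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
types = np.array([0, 0, 1])       # two node types
H = virtual_node_pass(X, types, 2)
```

After one pass, the two type-0 nodes have already mixed each other's features through their shared virtual node.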

Exploring the Optimal Choice for Generative Processes in Diffusion Models

Yu Cao
Shanghai Jiao Tong University
Peoples Rep of China
Co-Author(s):    Jingrun Chen, Yixin Luo, and Xiang Zhou
Abstract:
The diffusion model has shown remarkable success in computer vision, but it remains unclear whether the ODE-based probability flow or the SDE-based diffusion model is superior, and under what circumstances. Comparing the two is challenging due to dependencies on data distributions, score training, and other numerical issues. In this talk, we will discuss a mathematical approach to this problem by considering two limiting scenarios: the zero-diffusion (ODE) case and the large-diffusion case. We will demonstrate that the time distribution of the score-training error determines the optimal dynamics, in the sense of minimizing the sampling error in the continuous-time setting. Numerical validation of this phenomenon is provided using various benchmark distributions as well as realistic datasets.
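
A minimal numerical sketch of the two dynamics being compared (my toy example, not the authors' setup): for data $\sim N(0,1)$ under a variance-exploding forward process $dx = dW$, the marginal at time $t$ is $N(0, 1+t)$ and the exact score is known, so the probability-flow ODE and the reverse-time SDE can both be simulated and should each transport $p_T$ back to (approximately) unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_steps, n = 4.0, 400, 20000
dt = T / n_steps

# Forward VE diffusion dx = dW from data ~ N(0,1): marginal N(0, 1+t),
# so the exact score is s(x,t) = -x / (1+t).
def score(x, t):
    return -x / (1.0 + t)

x_ode = rng.normal(0.0, np.sqrt(1.0 + T), n)   # both samplers start from p_T
x_sde = x_ode.copy()
for k in range(n_steps):
    t = T - k * dt
    # probability-flow ODE, integrated backward: dx/dt = -(1/2) s(x,t)
    x_ode += 0.5 * score(x_ode, t) * dt
    # reverse-time SDE, Euler-Maruyama backward: dx = s dt + dW-bar
    x_sde += score(x_sde, t) * dt + np.sqrt(dt) * rng.standard_normal(n)

var_ode, var_sde = x_ode.var(), x_sde.var()    # both should be close to 1
```

With an exact score the two agree in law; the talk's question is which one degrades less gracefully once the score carries training error.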

Critical transitions in the brain: modelling and control

Ting Gao
Huazhong University of Science and Technology
Peoples Rep of China
Co-Author(s):    
Abstract:
With the development of machine learning techniques, data-driven approaches are becoming increasingly important in real-world modeling with stochastic dynamical systems. However, in the field of neuroscience, these approaches are still in their early stages. We first review some neural inverse problems to deepen the understanding of neural dynamics in complex brain systems. Then we propose some applications to modeling tipping phenomena and optimal control in brain diseases. The methods include large deviation theory, optimal control, and the Schrödinger bridge.

Pairwise learning problems with regularization networks and Nyström subsampling approach

Ting Hu
Xi'an Jiaotong University
Peoples Rep of China
Co-Author(s):    Wang Cheng
Abstract:
Pairwise learning usually refers to learning problems that work with pairs of training samples, such as ranking, similarity and metric learning, and AUC maximization. To overcome the computational challenge of large-scale pairwise learning, this paper introduces the Nyström subsampling approach to coefficient-based regularized pairwise algorithms in the context of kernel networks. Our theorems establish that the resulting Nyström estimator achieves the minimax error over all estimators using the whole data, provided that the subsampling level is not too small. We derive the functional relation between the subsampling level and the regularization parameter that guarantees computational cost reduction and asymptotic optimality simultaneously. The Nyström coefficient-based pairwise learning method does not require the kernel to be symmetric or positive semi-definite, which provides more flexibility and adaptivity in the learning process. We apply the method to the bipartite ranking problem, improving the state-of-the-art theoretical results in previous works.
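
For intuition, the standard Nyström construction (a generic sketch, not this paper's coefficient-based estimator) approximates an $n \times n$ kernel matrix from $m \ll n$ sampled landmark points as $K \approx K_{nm} K_{mm}^{\dagger} K_{nm}^{\top}$, which is what makes large-scale computation tractable:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(A, B, gamma=0.5):
    """Gaussian RBF kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

X = rng.normal(size=(200, 3))
m = 60                                   # the subsampling level
idx = rng.choice(len(X), m, replace=False)

K_nm = rbf(X, X[idx])                    # n x m cross-kernel block
K_mm = rbf(X[idx], X[idx])               # m x m landmark block
K_approx = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T   # rank-m Nystrom approximation

K_full = rbf(X, X)
rel_err = np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full)
```

The theory in the talk concerns how large $m$ must be (relative to the regularization parameter) for such an approximation to retain minimax-optimal learning rates.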

An energy-stable machine-learning model of non-Newtonian hydrodynamics with molecular fidelity

Huan Lei
Michigan State University
USA
Co-Author(s):    Huan Lei
Abstract:
We introduce a machine-learning-based approach for constructing a continuum non-Newtonian fluid dynamics model directly from a micro-scale description. To faithfully retain molecular fidelity, we establish a micro-macro correspondence via a set of encoders for the micro-scale polymer configurations and their macro-scale counterparts, a set of nonlinear conformation tensors. The dynamics of these conformation tensors can be derived from a generalized extendable energy functional structure, and be learned from the micro-scale model with clear physical interpretation. The final model, named the deep non-Newtonian model (DeePN$^2$), takes the form of conventional non-Newtonian fluid dynamics models and ensures energy stability. Numerical results demonstrate the accuracy and robustness of DeePN$^2$.

Stochastic Gradient Methods: Bias, Stability and Generalization

Yunwen Lei
The University of Hong Kong
Peoples Rep of China
Co-Author(s):    Shuang Zeng
Abstract:
Recent developments in stochastic optimization often suggest biased gradient estimators to improve robustness, communication efficiency, or computational speed. Representative biased stochastic gradient methods (BSGMs) include zeroth-order stochastic gradient descent (SGD), clipped SGD, and SGD with delayed gradients. The practical success of BSGMs has motivated extensive convergence analyses to explain their impressive training behaviour. By comparison, there is far less work on their generalization analysis, which is a central topic in modern machine learning. In this paper, we present the first framework for studying the stability and generalization of BSGMs for convex and smooth problems. We apply our general result to develop the first stability bound for zeroth-order SGD with reasonable step-size sequences, and the first stability bound for clipped SGD. While our stability analysis is developed for general BSGMs, the resulting stability bounds for both zeroth-order SGD and clipped SGD match those of SGD under appropriate smoothing/clipping parameters. Furthermore, our stability analysis incorporates the training errors into the stability bounds and can therefore benefit from low-noise conditions.
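
As a concrete instance of a BSGM (a generic sketch of gradient clipping, not the paper's analysis), clipped SGD rescales each stochastic gradient to a norm cap $\tau$; the clipped gradient is a biased estimate of the true gradient, which is precisely what distinguishes these methods from plain SGD:

```python
import numpy as np

def clipped_sgd(grad_fn, w0, tau, lr, n_iter, rng):
    """Clipped SGD: rescale each stochastic gradient to norm at most tau.

    The rescaling introduces bias, making this a biased stochastic
    gradient method (BSGM) rather than plain SGD.
    """
    w = np.array(w0, dtype=float)
    for _ in range(n_iter):
        g = grad_fn(w, rng)
        norm = np.linalg.norm(g)
        if norm > tau:
            g = g * (tau / norm)      # clipping: the source of the bias
        w -= lr * g
    return w

rng = np.random.default_rng(0)
# Stochastic gradients of f(w) = ||w||^2 / 2 corrupted by heavy-tailed
# Student-t noise, a setting where clipping is commonly used.
grad = lambda w, rng: w + rng.standard_t(df=3, size=w.shape)
w_final = clipped_sgd(grad, np.full(5, 10.0), tau=1.0, lr=0.05, n_iter=2000, rng=rng)
```

Despite the heavy-tailed noise, the iterates contract toward the minimizer at the origin; the paper's contribution is bounding the algorithmic stability (and hence generalization) of exactly such updates.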

Learning, approximation and control

Qianxiao Li
National University of Singapore
Singapore
Co-Author(s):    
Abstract:
In this talk, we discuss some interesting problems and recent results on the interface of deep learning, approximation theory and control theory. Through a dynamical system viewpoint of deep residual architectures, the study of model complexity in deep learning can be formulated as approximation or interpolation problems that can be studied using control theory, but with a mean-field twist. In a similar vein, training deep architectures can be formulated as optimal control problems in the mean-field sense. We provide some basic mathematical results on these new control problems that arise, and discuss some applications in improving the efficiency, robustness and adaptability of deep learning models.
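
The dynamical-system viewpoint mentioned above is commonly written as follows (standard notation, not taken from the talk): a residual block is one explicit Euler step of a controlled ODE,

```latex
% Residual block as a forward-Euler step of a controlled dynamical system
x_{k+1} = x_k + \Delta t \, f(x_k, \theta_k)
\quad \longleftrightarrow \quad
\dot{x}(t) = f\big(x(t), \theta(t)\big), \qquad t \in [0, T],
```

so choosing the weights $\theta(\cdot)$ to minimize a terminal loss is an optimal control problem, and averaging over the data distribution gives its mean-field version.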

Beyond Unconstrained Features: Neural Collapse for Shallow Neural Networks with General Data

Shuyang Ling
NYU Shanghai
Peoples Rep of China
Co-Author(s):    Wanli Hong
Abstract:
Neural collapse (${\cal NC}$) is a phenomenon that occurs at the terminal phase of training deep neural networks (DNNs): the features of the data in the same class collapse to their respective sample means, and the sample means exhibit an equiangular tight frame (ETF). In the past few years, there has been a surge of works focusing on explaining why ${\cal NC}$ occurs and how it affects generalization. Since DNNs are notoriously difficult to analyze, most works mainly focus on the unconstrained feature model (UFM). While the UFM explains ${\cal NC}$ to some extent, it fails to provide a complete picture of how the network architecture and the dataset affect ${\cal NC}$. In this work, we focus on shallow ReLU neural networks and try to understand how the width, depth, data dimension, and statistical properties of the training dataset influence neural collapse. We provide a complete characterization of when ${\cal NC}$ occurs for two- or three-layer neural networks. This sufficient condition depends on the data dimension, sample size, and signal-to-noise ratio (SNR) in the data, rather than the width. For three-layer neural networks, we show that ${\cal NC}$ occurs as long as the second layer is sufficiently wide. Moreover, we show that even if ${\cal NC}$ occurs, the generalization can still be bad provided that the SNR in the data is sufficiently low. Our results significantly extend the state-of-the-art theoretical analysis of ${\cal NC}$ under the UFM to shallow nonlinear feature models, and characterize the emergence of ${\cal NC}$ via data properties and network architecture.
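
For reference, the simplex ETF of $K$ class means is standardly defined in the neural collapse literature (standard notation, not taken from this abstract) as any matrix of the form

```latex
M \;=\; \sqrt{\tfrac{K}{K-1}} \; U \left( I_K - \tfrac{1}{K} \mathbf{1}_K \mathbf{1}_K^{\top} \right),
```

where $U$ has orthonormal columns ($U^{\top} U = I_K$): the columns of $M$ have equal norms and pairwise inner products $-1/(K-1)$, i.e. the class means are maximally and equally separated.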

A Generative Model-Based Variational Method for Wasserstein Gradient Flow

Chengyu LIU
City University of Hong Kong
Hong Kong
Co-Author(s):    
Abstract:
In this work, we present a new generative model-based variational framework that utilizes Normalizing Flows (NFs) to solve Wasserstein Gradient Flows (WGFs), going beyond traditional chain-of-state methods by directly parameterizing the paths using the forward process of NFs as a representation of WGFs. Our approach enhances the capture of the Wasserstein gradient structure by incorporating a minimum action cost regularization term during the training of NFs. Additionally, we introduce an innovative adaptive sampling strategy that iteratively generates efficient stochastic collocation points, reflecting the evolving density estimates. This framework naturally incorporates the continuity equation within the network architecture, enabling the efficient estimation of high-dimensional density functions essential for free energy computations and improving the approximation of population dynamics. By integrating seamlessly with stochastic gradient descent techniques prevalent in deep learning, our method demonstrates robust performance on several illustrative problems, showcasing its potential in efficiently modeling the evolution of probability distributions in complex systems.

Error Analysis of Shallow Neural Network on Korobov Space

Yuqing Liu
City University of Hong Kong
Hong Kong
Co-Author(s):    
Abstract:
In this talk, an innovative approach to, and results on, the approximability of functions by shallow neural networks on Korobov spaces will be presented. Then, a generalization analysis of shallow neural networks for classification on Korobov spaces will be discussed. The talk is organized as follows. First, a dimension-independent rate for approximating functions from the Korobov space by ReLU shallow neural networks will be established. Building on this first main result, a generalization error bound will be presented. A specific example will be given to justify the novelty and sharpness of the main results. In addition, using the approximation error we obtain, we study the learning rates of the excess misclassification error with respect to the convex $\eta$-norm loss function $\phi(v)=(1 - v)_{+}^{\eta}$, $\eta\geq1$. The error under Tsybakov noise conditions is also discussed.

The Theory of Parameter Condensation in Neural Networks

Tao Luo
Shanghai Jiao Tong University
Peoples Rep of China
Co-Author(s):    
Abstract:
In this talk, we will first introduce the phenomenon of parameter condensation in neural networks, which refers to the tendency of certain parameters to converge towards the same values during training. Then, for certain types of networks, we prove that condensation occurs in the early stages of training. We further analyze which hyperparameters and training strategies influence parameter condensation. In some cases, we even provide a phase diagram that delineates whether parameter condensation occurs. We will also briefly discuss the relationship between parameter condensation and generalization ability. Finally, towards the end of the training, we study the set of global minima and present a detailed analysis of its geometric structure and convergence properties.

Approximation Rates for Shallow ReLU$^k$ Neural Networks on Sobolev Spaces via the Radon Transform

Tong Mao
King Abdullah University of Science and Technology
Saudi Arabia
Co-Author(s):    J.W. Siegel, J. Xu
Abstract:
We consider the problem of how efficiently shallow neural networks with the ReLU$^k$ activation function can approximate functions from Sobolev spaces $W^s(L_p(\Omega))$ with error measured in the $L_q(\Omega)$-norm. Utilizing the Radon transform and recent results from discrepancy theory, we provide a simple proof of nearly optimal approximation rates in a variety of cases, including when $q\leq p$, $p\geq 2$, and $s \leq k + (d+1)/2$. The rates we derive are optimal up to logarithmic factors, and significantly generalize existing results. An interesting consequence is that the adaptivity of shallow ReLU$^k$ neural networks enables them to obtain optimal approximation rates for smoothness up to order $s = k + (d+1)/2$, even though they represent piecewise polynomials of fixed degree $k$.

Deep Ridgelet Transform: Harmonic Analysis for Deep Neural Network

Sho Sonoda
RIKEN
Japan
Co-Author(s):    
Abstract:
The ridgelet transform has been developed to study neural network parameters, and it can describe the distribution of parameters. Mathematically, it is defined as a pseudo-inverse operator of neural networks. Namely, given a function $f$ and a network $NN[\gamma]$ with parameter $\gamma$, the ridgelet transform $R[f]$ for the network $NN$ satisfies the reconstruction formula $NN[R[f]]=f$. For depth-2 fully-connected networks on a Euclidean space, the ridgelet transform has been derived in closed form, so we can describe how the parameters are distributed. However, for many modern neural network architectures, a closed-form expression is not known. In this talk, I will introduce a systematic method for inducing generalized neural networks and their corresponding ridgelet transforms from group equivariant functions, and present an application to deep neural networks.
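
For the depth-2 fully-connected case mentioned above, the closed-form objects are standardly written as follows (a sketch in the usual notation of the ridgelet literature, not copied from the talk): the network is an integral transform over parameters $(a,b)$,

```latex
% Depth-2 network as an integral over parameters, with activation \sigma
NN[\gamma](x) \;=\; \int_{\mathbb{R}^d \times \mathbb{R}} \gamma(a,b)\, \sigma(a \cdot x - b)\, \mathrm{d}a\, \mathrm{d}b,
% and the ridgelet transform with respect to an admissible function \psi
R[f](a,b) \;=\; \int_{\mathbb{R}^d} f(x)\, \overline{\psi(a \cdot x - b)}\, \mathrm{d}x,
```

and the reconstruction formula $NN[R[f]] = f$ holds for suitable admissible pairs $(\sigma, \psi)$.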

Solving for the Mean Escape Time with Operator Learning and Deep Neural Networks

Nathanael Tepakbong
City University of Hong Kong
Hong Kong
Co-Author(s):    
Abstract:
Estimating the average amount of time needed for a stochastic particle to reach the boundary of a bounded domain $\Omega$ is a classical problem with applications in numerous fields. As there are no explicit solutions for this Mean Escape Time (MET) outside of the simplest domain geometries, practitioners have to rely on Monte Carlo simulations, series-expansion ansätze, or grid-based methods, which generally scale poorly with problem size and lack strong theoretical guarantees. By considering the MET as the solution of a PDE induced by the stochastic dynamics of the particle, we propose to learn the solution operator of the PDE by applying dimension reduction tools and Deep Learning. Leveraging results from the approximation theory of Deep Neural Networks, we provide a rigorous theoretical analysis of the sample complexity associated with the MET problem in our framework for different families of stochastic dynamics and domain geometries.
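
The PDE characterization can be checked in one dimension, where everything is explicit (my illustrative example with the Monte Carlo baseline the talk aims to improve on, not the operator-learning method itself): for standard Brownian motion on $\Omega=(0,1)$, the MET $u$ solves the generator equation $\tfrac12 u''(x) = -1$ with $u=0$ on the boundary, so $u(x) = x(1-x)$.

```python
import numpy as np

# MET of standard Brownian motion on (0,1): generator (1/2) d^2/dx^2,
# so (1/2) u'' = -1, u(0) = u(1) = 0, giving u(x) = x (1 - x).
def met_monte_carlo(x0, dt=5e-4, n_paths=1500, seed=0):
    """Euler-Maruyama estimate of the mean escape time from (0,1)."""
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, x0)
    t = np.zeros(n_paths)
    alive = np.ones(n_paths, dtype=bool)
    while alive.any():
        n_alive = int(alive.sum())
        x[alive] += np.sqrt(dt) * rng.standard_normal(n_alive)
        t[alive] += dt
        alive &= (x > 0.0) & (x < 1.0)    # a path dies when it exits
    return t.mean()

est = met_monte_carlo(0.5)   # exact MET from the center: 0.5 * (1 - 0.5) = 0.25
```

Even this tiny example shows the cost issue the talk targets: the simulation loops until the slowest path exits, and the cost grows quickly with dimension and domain complexity.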

Towards generative diffusion models in infinite dimension

Zhongjian Wang
Nanyang Technological University
Singapore
Co-Author(s):    
Abstract:
We provide an estimate of the Wasserstein distance between the data distribution and the generations of score-based generative models in an infinite-dimensional function space. The bound only assumes an $\epsilon$-accurate approximation of the score and Gaussian-type tail behavior of the data distribution. The key to the analysis is a Lipschitz bound on the score, which relates to a Hessian estimate for a viscous Hamilton-Jacobi equation (vHJ). The latter is shown by utilizing a kernel estimate that is independent of dimension. Our complexity bound scales linearly with the trace of a covariance operator related to the data distribution.

Approximation from Noisy and Blurring Data

Jianbin Yang
Hohai University
Peoples Rep of China
Co-Author(s):    
Abstract:
Approximation of functions from observed data is often needed. This has been widely studied in the literature when data is exact and the underlying function is smooth. However, the observed data is often contaminated with noise and the underlying function may be nonsmooth. To properly handle noisy and blurring data, any effective approximation scheme must contain a noise removal component. To well approximate nonsmooth functions, one needs to have a sparse approximation in, for example, the wavelet domain. This talk presents theoretical analysis and applications of such noise removal schemes through the lens of function approximation. For a given sample size, approximation from uniform grid data and scattered data is investigated.
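
The "noise removal via sparse approximation in the wavelet domain" idea can be sketched in a few lines (my generic example with a one-level Haar transform and soft thresholding, not the specific schemes analyzed in the talk):

```python
import numpy as np

def soft_threshold(c, lam):
    """Soft thresholding: shrink coefficients toward zero by lam.
    This is the basic noise-removal step applied in the wavelet domain."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

def haar_denoise(y, lam):
    """One-level orthonormal Haar decomposition, threshold the detail
    coefficients, reconstruct. A sketch of sparsity-based denoising."""
    a = (y[0::2] + y[1::2]) / np.sqrt(2)   # approximation coefficients
    d = (y[0::2] - y[1::2]) / np.sqrt(2)   # detail coefficients (mostly noise)
    d = soft_threshold(d, lam)
    out = np.empty_like(y)
    out[0::2] = (a + d) / np.sqrt(2)
    out[1::2] = (a - d) / np.sqrt(2)
    return out

rng = np.random.default_rng(0)
f = np.repeat([0.0, 1.0], 64)              # piecewise-constant (nonsmooth) signal
y = f + 0.3 * rng.standard_normal(f.size)  # noisy observations
err_noisy = np.linalg.norm(y - f)
err_denoised = np.linalg.norm(haar_denoise(y, lam=0.5) - f)
```

Because the nonsmooth signal is sparse in the Haar detail coefficients, shrinking those coefficients removes noise without destroying the jump.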

The role of structures in neural networks

Ding-Xuan Zhou
University of Sydney
Australia
Co-Author(s):    
Abstract:
The classical approximation theory developed 35 years ago is for fully-connected neural networks. This theory does not apply to neural networks with structures arising from applications of deep learning in speech recognition, computer vision, natural language processing, and many other domains. The structures and related network architectures raise some essential differences between the classical fully-connected neural networks and structured ones used in deep learning. This talk describes some approximation and generalization properties of structured neural networks such as deep convolutional neural networks.

StringNET: Neural Network based Variational Method for Transition Pathways

Xiang ZHOU
City University of Hong Kong
Hong Kong
Co-Author(s):    Jiayue Han, Shuting Gu and Xiang Zhou
Abstract:
Rare transition events in meta-stable systems under noisy fluctuations are crucial for many non-equilibrium physical and chemical processes. The primary contributions to reactive flux are predominantly near the transition pathways that connect two meta-stable states. This work examines the temperature-dependent maximum flux path, the minimum energy path, and the minimum action path at zero temperature. We propose the StringNET method for training these paths using variational formulations and deep learning techniques. Unlike traditional chain-of-state methods, StringNET directly parametrizes the paths through neural network functions, utilizing the arc-length parameter as the main input. The tasks of gradient descent and re-parametrization in the string method are unified into a single framework using loss functions to train deep neural networks. More importantly, the loss function for the maximum flux path is interpreted as a softmax approximation to the numerically challenging minimax problem of the minimum energy path. To compute the minimum energy path efficiently and robustly, we develop a pre-training strategy that includes the maximum flux path loss in the early training stage, accelerating the computation. We demonstrate the superior performance of this method through various analytical and chemical examples, as well as the two- and four-dimensional Ginzburg-Landau energy functionals.
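
The softmax interpretation above rests on a generic fact about log-sum-exp smoothing of a maximum, illustrated here with invented values (this is not StringNET's actual loss): the smoothed maximum of the energies along a discretized path lower-bounds the true maximum (with the mean normalization used here) and converges to it as the inverse temperature grows.

```python
import numpy as np

def softmax_energy(V, beta):
    """Smooth log-sum-exp surrogate for max(V): (1/beta) log mean exp(beta V).

    With the mean normalization it never exceeds the true maximum, and it
    converges to max(V) as beta -> infinity, turning a nonsmooth minimax
    problem into a differentiable one.
    """
    m = V.max()                                   # subtract max for stability
    return m + np.log(np.exp(beta * (V - m)).mean()) / beta

V = np.sin(np.linspace(0.0, np.pi, 200))          # energies along a toy path
approx = softmax_energy(V, beta=200.0)
exact = V.max()
```

Replacing the hard maximum of the energy along the path by this smooth surrogate is what makes the maximum-flux-path loss trainable by gradient descent.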