DataDrivenEnzymeRateEqs

DataDrivenEnzymeRateEqs.data_driven_rate_equation_selection
DataDrivenEnzymeRateEqs.display_rate_equation
DataDrivenEnzymeRateEqs.fit_rate_equation
DataDrivenEnzymeRateEqs.@derive_general_mwc_rate_eq
DataDrivenEnzymeRateEqs.@derive_general_qssa_rate_eq

DataDrivenEnzymeRateEqs.data_driven_rate_equation_selection — Method

data_driven_rate_equation_selection(
    general_rate_equation::Function,
    data::DataFrame,
    metab_names::Tuple{Symbol,Vararg{Symbol}},
    param_names::Tuple{Symbol,Vararg{Symbol}};
    range_number_params::Union{Nothing, Tuple{Int,Int}} = nothing,
    forward_model_selection::Bool = true,
    max_zero_alpha::Int = 1 + ceil(Int, length(metab_names) / 2),
    n_reps_opt::Int = 20,
    maxiter_opt::Int = 50_000,
    model_selection_method::String = "current_subsets_filtering",
    p_val_threshold::Float64 = 0.4,
    save_train_results::Bool = false,
    enzyme_name::String = "Enzyme",
    subsets_min_limit::Int = 1,
    subsets_max_limit::Union{Int, Nothing}=nothing,
    subsets_filter_threshold::Float64=0.1,
)

This function is used to perform data-driven rate equation selection using a general rate equation and data.

There are three model_selection methods:

current_subsets_filtering:

This method iteratively fits models that are subsets of the best-performing models from the previous iteration (by default, those with training loss within 10% of the minimum), saving the best model for each number of parameters based on training loss. The optimal number of parameters is selected using the Wilcoxon test on test scores from leave-one-out cross-validation (LOOCV), and the best equation is the best model with this optimal number of parameters.

cv_subsets_filtering:

This method implements current_subsets_filtering separately for each figure, leaving one figure out as a test set while training on the remaining data. For each number of parameters, it saves the test loss of the best subset for that figure. It uses the Wilcoxon test across all figures' results to select the optimal number of parameters. Then, for the chosen number, it trains all subsets with this number of parameters on the entire dataset and selects the best rate equation based on minimal training loss.

cv_all_subsets:

This method fits all subsets for each figure, using the other figures as training data and the left-out figure as the test set. It selects the best model for each number of parameters and figure based on training error and computes LOOCV test scores. The optimal number of parameters is determined by the Wilcoxon test across all figures' test scores. The best equation is the subset with minimal training loss for this optimal number of parameters when trained on the entire dataset.
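
The leave-one-figure-out splitting that the cross-validation methods rely on can be sketched in plain Julia. This is only an illustration under stated assumptions: `leave_one_figure_out` is a hypothetical helper, not a function of the package, and it works on a plain vector of source labels rather than on a DataFrame.

```julia
# Hypothetical helper (not part of DataDrivenEnzymeRateEqs): for each figure,
# return the row indices used for testing (that figure) and for training (all others).
function leave_one_figure_out(sources::AbstractVector{<:AbstractString})
    figures = unique(sources)
    return map(figures) do fig
        (
            test_figure = fig,
            train_rows = findall(!=(fig), sources),
            test_rows = findall(==(fig), sources),
        )
    end
end

# Labels as they might appear in the `source` column of the data.
sources = ["Figure 1", "Figure 1", "Figure 2", "Figure 3"]
splits = leave_one_figure_out(sources)

for s in splits
    println(s.test_figure, ": train rows ", s.train_rows, ", test rows ", s.test_rows)
end
```

Each split would then be fitted on its training rows and scored on its test rows, giving one test score per figure and per number of parameters.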

Arguments

general_rate_equation::Function: Function that takes a NamedTuple of metabolite concentrations (with metab_names keys) and parameters (with param_names keys) and returns an enzyme rate.
data::DataFrame: DataFrame containing the data with column Rate and columns for each metab_names where each row is one measurement. It also needs to have a column source that contains a string that identifies the source of the data. This is used to calculate the weights for each figure in the publication.
metab_names::Tuple: Tuple of metabolite names that correspond to the metabolites of general_rate_equation and to column names in data.
param_names::Tuple: Tuple of parameter names that correspond to the parameters of general_rate_equation.
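
As a minimal sketch of the expected function shape, the following defines a rate equation that takes one NamedTuple of metabolite concentrations and one of parameters. The metabolite names, parameter names, and the rate law itself are hypothetical choices made for illustration; they are not taken from the package.

```julia
# Hypothetical names for illustration: one substrate S and one product P.
metab_names = (:S, :P)
param_names = (:Vmax, :K_S, :K_P, :Keq)

# Illustrative reversible Michaelis-Menten-style rate law.
# `metabs` has the keys in metab_names, `params` the keys in param_names.
function example_rate_equation(metabs, params)
    numerator = params.Vmax * (metabs.S - metabs.P / params.Keq) / params.K_S
    denominator = 1 + metabs.S / params.K_S + metabs.P / params.K_P
    return numerator / denominator
end

metabs = (S = 2.0, P = 0.0)
params = (Vmax = 10.0, K_S = 2.0, K_P = 1.0, Keq = 5.0)
rate = example_rate_equation(metabs, params)
println(rate)
```

The keys of the two NamedTuples are what tie the function to metab_names, param_names, and the columns of data.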

Keyword Arguments

save_train_results::Bool: A boolean indicating whether to save the results of the training for each number of parameters as a csv file.
enzyme_name::String: A string for enzyme name that is used to name the csv files that are saved.
range_number_params::Tuple{Int,Int}: A tuple of integers representing the range of the number of parameters of general_rate_equation to search over.
forward_model_selection::Bool: A boolean indicating whether to use forward model selection (true) or reverse model selection (false).
max_zero_alpha::Int: An integer representing the maximum number of alpha parameters that can be set to 0.
n_reps_opt::Int: Number of repetitions of the optimization.
maxiter_opt::Int: Maximum number of iterations of the optimization algorithm.
model_selection_method::String: Model selection method used to find the best rate equation (default is current_subsets_filtering).
p_val_threshold::Float64: p-value threshold for the Wilcoxon test.
subsets_min_limit::Int: The minimum number of filtered subsets (those with training loss within subsets_filter_threshold of the minimum, 10% by default) that must be kept for each number of parameters. These subsets are used to generate the subsets for the next iteration (only subsets of these are considered). Relevant to model selection methods current_subsets_filtering and cv_subsets_filtering.
subsets_max_limit::Union{Int, Nothing}: The maximum number of filtered subsets (those with training loss within subsets_filter_threshold of the minimum, 10% by default) kept for each number of parameters.
subsets_filter_threshold::Float64: The threshold for filtering subsets in each iteration, given as a fraction (the default 0.1 corresponds to 10%). Only subsets with a training loss within this fraction of the best loss are kept. Relevant to model selection methods current_subsets_filtering and cv_subsets_filtering.
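
How the three subsets_* keyword arguments could interact is sketched below. This is a hypothetical reading of the descriptions above, not the package's implementation: `filter_subsets` is an invented name, "within the threshold" is assumed to mean loss ≤ (1 + threshold) × minimum loss, and losses are assumed to be positive.

```julia
# Hypothetical sketch of the filtering rule; not the package's code.
# Returns the indices of the subsets that are kept, ordered from best to worst loss.
function filter_subsets(
    losses::Vector{Float64};
    subsets_filter_threshold::Float64 = 0.1,
    subsets_min_limit::Int = 1,
    subsets_max_limit::Union{Int,Nothing} = nothing,
)
    order = sortperm(losses)                      # indices sorted by increasing loss
    cutoff = (1 + subsets_filter_threshold) * losses[order[1]]
    n_keep = count(<=(cutoff), losses)            # subsets within the threshold
    n_keep = max(n_keep, subsets_min_limit)       # keep at least the minimum
    if subsets_max_limit !== nothing
        n_keep = min(n_keep, subsets_max_limit)   # keep at most the maximum
    end
    return order[1:min(n_keep, length(losses))]
end

losses = [0.105, 0.100, 0.300, 0.109, 0.120]
kept = filter_subsets(losses)
println(kept)
```

With the defaults, only the three subsets whose loss is within 10% of the best loss (0.100) are kept; raising subsets_min_limit or setting subsets_max_limit changes how many of the sorted subsets are carried into the next iteration.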

Returns

NamedTuple: A named tuple with the following fields:
- results: DataFrame with train and test results
- best_n_params: optimal number of parameters
- best_subset_row: row of the best rate equation selected - includes fitted params


DataDrivenEnzymeRateEqs.display_rate_equation — Method

display_rate_equation(
    rate_equation::Function,
    metab_names::Tuple{Symbol,Vararg{Symbol}},
    param_names::Tuple{Symbol,Vararg{Symbol}};
    nt_param_removal_code = nothing
)

Return the symbolic rate equation for the given rate_equation function.

Arguments

rate_equation::Function: The rate equation function.
metab_names::Tuple{Symbol,Vararg{Symbol}}: The names of the metabolites.
param_names::Tuple{Symbol,Vararg{Symbol}}: The names of the parameters.
nt_param_removal_code::NamedTuple: The named tuple of the parameters to remove from the rate equation.


DataDrivenEnzymeRateEqs.fit_rate_equation — Method

fit_rate_equation(
    rate_equation::Function,
    data::DataFrame,
    metab_names::Tuple{Symbol, Vararg{Symbol}},
    param_names::Tuple{Symbol, Vararg{Symbol}};
    n_iter = 20
)

Fit rate_equation to data and return loss and best fit parameters.

Arguments

rate_equation::Function: Function that takes a NamedTuple of metabolite concentrations (with metab_names keys) and parameters (with param_names keys) and returns an enzyme rate.
data::DataFrame: DataFrame containing the data with column Rate and columns for each metab_names where each row is one measurement. It also needs to have a column source that contains a string that identifies the source of the data. This is used to calculate the weights for each figure in the publication.
metab_names::Tuple{Symbol, Vararg{Symbol}}: Tuple of metabolite names that correspond to the metabolites of rate_equation and column names in data.
param_names::Tuple{Symbol, Vararg{Symbol}}: Tuple of parameter names that correspond to the parameters of rate_equation.
n_iter::Int: Number of iterations to run the fitting process.

Returns

loss::Float64: Loss of the best fit.
params::NamedTuple: Best fit parameters with param_names keys

Example

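using DataDrivenEnzymeRateEqs, DataFrames
# The metabolite column (S) must match both metab_names and the key used in rate_equation.
data = DataFrame(
    Rate = [1.0, 2.0, 3.0],
    S = [1.0, 2.0, 3.0],
    source = ["Figure 1", "Figure 1", "Figure 2"]
)
# Michaelis-Menten rate law: Vmax * (S / K_S) / (1 + S / K_S)
rate_equation(metabs, params) = params.Vmax * (metabs.S / params.K_S) / (1 + metabs.S / params.K_S)
fit_rate_equation(rate_equation, data, (:S,), (:Vmax, :K_S))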


DataDrivenEnzymeRateEqs.@derive_general_mwc_rate_eq — Macro

@derive_general_mwc_rate_eq(metabs_and_regulators_kwargs...)

Derive a function that calculates the rate of a reaction using the general Monod-Wyman-Changeux (MWC) rate equation, given the list of substrates, products, and regulators that bind to specific catalytic or regulatory sites.

The general MWC rate equation is given by:

\[Rate = \frac{\left(V_{max}^a \prod_{i=1}^{n} \left(\frac{S_i}{K_{a, i}}\right) - V_{max, rev}^a \prod_{i=1}^{n} \left(\frac{P_i}{K_{a, i}}\right)\right) \cdot Z_{a, cat}^{n-1} \cdot Z_{a, reg}^n + L \left(V_{max}^i \prod_{i=1}^{n} \left(\frac{S_i}{K_{i, i}}\right) - V_{max, rev}^i \prod_{i=1}^{n} \left(\frac{P_i}{K_{i, i}}\right)\right) \cdot Z_{i, cat}^{n-1} \cdot Z_{i, reg}^n}{Z_{a, cat}^n \cdot Z_{a, reg}^n + L \cdot Z_{i, cat}^n \cdot Z_{i, reg}^n}\]

where:

$V_{max}^a$ is the maximum rate of the forward reaction in the active MWC state
$V_{max, rev}^a$ is the maximum rate of the reverse reaction in the active MWC state
$V_{max}^i$ is the maximum rate of the forward reaction in the inactive MWC state
$V_{max, rev}^i$ is the maximum rate of the reverse reaction in the inactive MWC state
$S_i$ is the concentration of the $i^{th}$ substrate
$P_i$ is the concentration of the $i^{th}$ product
$I_i$ is the concentration of the $i^{th}$ catalytic site inhibitor
$R_i$ is the concentration of the $i^{th}$ allosteric regulator
$K_{a, X}$ is the binding constant of the $X$ metabolite for active MWC state
$K_{i, X}$ is the binding constant of the $X$ metabolite for inactive MWC state
$Z_{a, cat}$ is the allosteric factor for the catalytic site in the active MWC state
$Z_{i, cat}$ is the allosteric factor for the catalytic site in the inactive MWC state
$Z_{a, reg}$ is the allosteric factor for the regulatory site in the active MWC state
$Z_{i, reg}$ is the allosteric factor for the regulatory site in the inactive MWC state
$L$ is the ratio of inactive to active enzyme conformations in the absence of ligands
$n$ is the oligomeric state of the enzyme
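
To make the structure of the equation concrete, here is a worked special case in plain Julia. It is a sketch under stated assumptions rather than the function the macro generates: a single substrate, an irreversible reaction, no inhibitors or regulators, so that $Z_{cat} = 1 + S/K$ in each state and $Z_{reg} = 1$. The name `simple_mwc_rate` is hypothetical.

```julia
# Illustrative special case of the MWC equation (assumptions: one substrate,
# irreversible, no inhibitors or regulators). Not the macro's output.
function simple_mwc_rate(S; Vmax_a, Vmax_i, K_a, K_i, L, n)
    Z_a = 1 + S / K_a   # catalytic-site factor, active state
    Z_i = 1 + S / K_i   # catalytic-site factor, inactive state
    numerator = Vmax_a * (S / K_a) * Z_a^(n - 1) + L * Vmax_i * (S / K_i) * Z_i^(n - 1)
    denominator = Z_a^n + L * Z_i^n
    return numerator / denominator
end

# With L = 0 only the active state exists and the rate reduces to Michaelis-Menten.
rate_active_only = simple_mwc_rate(1.0; Vmax_a = 1.0, Vmax_i = 0.0, K_a = 1.0, K_i = 1.0, L = 0.0, n = 2)

# With L = 1 half of the ligand-free enzyme is in a catalytically dead inactive state.
rate_mixed = simple_mwc_rate(1.0; Vmax_a = 1.0, Vmax_i = 0.0, K_a = 1.0, K_i = 1.0, L = 1.0, n = 2)

println((rate_active_only, rate_mixed))
```

In this case the active-only rate equals $V_{max}^a (S/K_a)/(1 + S/K_a)$, and adding an inactive state with equal binding constants but no catalytic activity halves it.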

Arguments

metabs_and_regulators_kwargs...: keyword arguments that specify the substrates, products, catalytic sites, regulatory sites, and other parameters of the reaction.

Returns

A function that calculates the rate of the reaction using the general MWC rate equation
A tuple of the names of the metabolites and parameters used in the rate equation


DataDrivenEnzymeRateEqs.@derive_general_qssa_rate_eq — Macro

@derive_general_qssa_rate_eq(metabs_and_regulators_kwargs...)

Derive a function that calculates the rate of a reaction using the Quasi Steady State Approximation (QSSA) given the list of substrates, products, and regulators.

The general QSSA rate equation is given by:

\[Rate = \frac{V_{max} \left(\frac{\prod_{i=1}^{n}S_i}{(K_{S1...Sn})^n}\right) - V_{max, rev} \left(\frac{\prod_{i=1}^{n}P_i}{(K_{P1...Pn})^n}\right)}{Z}\]

where:

$V_{max}$ is the maximum rate of the forward reaction
$V_{max, rev}$ is the maximum rate of the reverse reaction
$S_i$, $P_i$, $R_i$ are the concentrations of the $i^{th}$ substrate (S), product (P), or regulator (R)
$K_{X_1...X_n}$ is the kinetic constant for the metabolites $X_1, ..., X_n$
$Z$ is a combination of all terms containing products of [S], [P], and [R] divided by K_S_P_R
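
For the simplest case of one substrate and one product ($n = 1$, no regulators), the equation can be written out directly. The sketch below assumes that $Z = 1 + S/K_S + P/K_P$ for this case; the function name `uni_uni_qssa_rate` and the parameter values are hypothetical and are not produced by the macro.

```julia
# Illustrative uni-uni case of the QSSA equation (assumption: Z = 1 + S/K_S + P/K_P).
function uni_uni_qssa_rate(S, P; Vmax, Vmax_rev, K_S, K_P)
    Z = 1 + S / K_S + P / K_P
    return (Vmax * S / K_S - Vmax_rev * P / K_P) / Z
end

# No product present: the rate is the forward Michaelis-Menten rate.
forward = uni_uni_qssa_rate(1.0, 0.0; Vmax = 1.0, Vmax_rev = 0.5, K_S = 1.0, K_P = 2.0)

# Forward and reverse terms cancel: the net rate is zero.
balanced = uni_uni_qssa_rate(1.0, 4.0; Vmax = 1.0, Vmax_rev = 0.5, K_S = 1.0, K_P = 2.0)

println((forward, balanced))
```

The general equation adds further terms to $Z$ for every combination of bound substrates, products, and regulators.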

Arguments

metabs_and_regulators_kwargs...: keyword arguments that specify the substrates, products, catalytic sites, regulatory sites, and other parameters of the reaction.

Returns

A function that calculates the rate of the reaction using the general QSSA rate equation
A tuple of the names of the metabolites and parameters used in the rate equation
