You would want to use a flexible formulation that would capture non-linearity automatically, e.g., some version of a generalized additive model. A poor man's choice is a polynomial $x_k$, $x_k^2$, ..., $x_k^{p_k}$, but such polynomials produce terrible overswings at the ends of the range of their respective variables. A much better formulation would be to use (cubic) B-splines (see a random intro note from the first page of Google here, and a good book, here). B-splines are a sequence of local humps:
http://ars.sciencedirect.com/content/image/1-s2.0-S0169743911002292-gr2.jpg
The height of the humps is determined from your (linear, logistic, other GLM) regression, as the function you are fitting is simply
$$ \theta = \beta_0 + \sum_{k=1}^K \beta_k B\Bigl( \frac{x-x_k}{h_k} \Bigr) $$
for the specified functional form of your hump $B(\cdot)$. By far the most popular version is a bell-shaped smooth cubic spline:
$$
B(z) = \left\{ \begin{array}{ll} \frac14 (z+2)^3, & -2 \le z \le -1 \\ \frac14 (3|x|^3 - 6x^2 +4 ), & -1 < x < 1 \\ \frac14 (2-x)^3, & 1 \le x \le 2 \\ 0, & \mbox{otherwise} \end{array} \right.
$$
On the implementation side, all you need to do is to set up 3-5-10-whatever number of knots $x_k$ would be reasonable for your application and create the corresponding 3-5-10-whatever variables in the data set with the values of $B\Bigl( \frac{x-x_k}{h_k} \Bigr) $. Typically, a simple grid of values is chosen, with $h_k$ being twice the mesh size of the grid, so that at each point, there are two overlapping B-splines, as in the above plot.
It's fairly straight forward. This is a log-linear model. A one-unit change in $x_i$ is associated with a $100 * \beta_i$ percent change in $y$.
$100\% * 0.0642 = 6.42\% $
We can now interpret it in percentages. For each one unit increase in educ, there is a 6.42% increase in your outcome variable.
ETA in light of whuber's comment:
To be more precise, the relationship between the percent change in $y$ and change in $x$ is:
$100\% * (e^{\beta_i} - 1)$
So,
$100\% * (e^{0.0642} -1) = 6.630564\%$
A whuber states, this is important as the absolute value of $\beta_i$ grows larger than 0.1. The simple $\beta_i * 100$ rule serves as an easy approximation when $|\beta_i| < 0.1$.
Best Answer
It is a bad idea not to include baseline BMI as this implicitly makes a strong assumption about the relationship between pre- and post-treatment BMI.
$$ Y_{change} = Y_{post} - Y_{pre} = \alpha + \beta\text{Female} + \theta\text{Treatment} + \text{Error} $$ is equivalent to $$ \color{white}{Y_{change} = } Y_{post} = \alpha + \beta\text{Female} + Y_{pre} + \theta\text{Treatment} + \text{Error} $$
If the outcome is change from baseline but you don't include pre-treatment BMI in the predictors, you assume that the coefficient for pre-treatment BMI is fixed at 1. The regression can handle estimating one more coefficient so you don't have to make unnecessary assumptions.
Including pre-treatment BMI as a predictor also adjusts for any differences in BMI between the treatment and control group, which can occur by chance even in a randomized study. Since you are working with observational data, it is even more important to adjust for possible confounders. (A confounder is a covariate that is associated with the treatment and/or the outcome. For example, diet and exercise are potential confounders for weight loss.)
After we add the effect of pre-treatment BMI the model becomes:
$$ \text{(1)} \quad Y_{change} = \alpha + \beta\text{Female} + \gamma Y_{pre} + \theta\text{Treatment} + \text{Error} $$
In fact, consider modeling post-treatment BMI rather than the change in BMI:
$$ \text{(2)} \quad Y_{post} = \alpha + \beta\text{Female} + \gamma Y_{pre} + \theta\text{Treatment} + \text{Error} $$
The treatment effect $\theta$ is the same in models (1) and (2). But the interpretation is more straightforward in (2).
Finally, the plots suggest that you should include an interaction between Gender and Treatment. This is the most interesting feature in the data: the response is very different between males and females in the treatment and control groups. (I guess this is what you mean by non-linear relationship.) Re-doing these plots with post-BMI on the y-axis might be even more interesting.
You can read more about adjusting for pre-treatment measurements in Chapter 19, Section 3 of Regression and Other Stories [1] and in the BBR course notes, which argue strongly against modeling change from baseline and in favor of modeling post-treatment outcome [2].
[1] A. Gelman, J. Hill, and A. Vehtari. Regression and Other Stories. Cambridge University Press, 2020. See Chapter 19, Section 3 for a discussionabout pre-treatment predictors.
[2] Biostatistics for Biomedical Research course notes. Available online.
Previous CV posts discuss change from baseline in lots more detail. Thank you to @EdM for the references.
Is it valid to include a baseline measure as control variable when testing the effect of an independent variable on change scores?
Best practice when analysing pre-post treatment-control designs