Your use of the letter $n$ for something other than the sample size is irritating. I'll write consistently with that usage and take $m$ for the sample size.
You have a design matrix
$$
X=\begin{bmatrix} 1 & x_{11} & \cdots & x_{1n} \\
1 & x_{21} & \cdots & x_{2n} \\
\vdots & \vdots & & \vdots \\
1 & x_{m1} & \cdots & x_{mn} \end{bmatrix}
$$
Then
$$
\begin{bmatrix} Y_1 \\ \vdots \\ Y_m \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1n} \\
1 & x_{21} & \cdots & x_{2n} \\
\vdots & \vdots & & \vdots \\
1 & x_{m1} & \cdots & x_{mn} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_n \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_m \end{bmatrix}.
$$
Write this as
$$
Y=X\beta+\varepsilon.
$$
Then the least-squares estimate is
$$
\hat\beta = (X^T X)^{-1} X^T Y.
$$
The potentially messy part --- the only nonlinear part --- is the matrix inversion.
If you want just $\hat\beta_k$, put a row vector in front of the above, with the $k$th entry equal to $1$ and the others $0$.
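A minimal NumPy sketch of the whole computation (the toy numbers and variable names are mine, not from the question):

```python
import numpy as np

# Hypothetical toy data: m = 5 observations, n = 2 predictors.
x = np.array([[0.5, 1.2],
              [1.0, 0.7],
              [1.5, 2.3],
              [2.0, 1.9],
              [2.5, 3.1]])
Y = np.array([1.1, 1.4, 2.8, 3.0, 4.2])

# Design matrix: a leading column of ones, then the predictors.
X = np.column_stack([np.ones(len(Y)), x])

# Least-squares estimate (X^T X)^{-1} X^T Y.
# Solving the normal equations is preferable to forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Picking out beta_hat_k alone: multiply by a selector row vector.
k = 1
e_k = np.eye(X.shape[1])[k]      # row vector with a 1 in position k, zeros elsewhere
print(beta_hat, e_k @ beta_hat)
```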
Let us assume that the probability of observing $\boldsymbol{x}$ under the condition that we are in class $\mathcal{C}_i$ is given by
$$p(\boldsymbol{x}|\mathcal{C}_i)=\dfrac{\exp(-\boldsymbol{w}^T_i\boldsymbol{x})}{\sum_{l=1}^{J}\exp(-\boldsymbol{w}^T_l\boldsymbol{x})}.$$
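A small NumPy sketch of this probability model (the minus sign in the exponent is kept exactly as written above; the weights and the input are made-up numbers):

```python
import numpy as np

def class_probabilities(x, W):
    """p(x | C_i) for every class i, with the rows of W as the weight vectors w_i."""
    scores = np.exp(-W @ x)           # exp(-w_i^T x) for each class i
    return scores / scores.sum()      # normalise over the J classes

# Hypothetical example: J = 3 classes, 2 features.
W = np.array([[ 0.2, -0.5],
              [ 1.0,  0.3],
              [-0.4,  0.8]])
x = np.array([1.5, -0.7])
print(class_probabilities(x, W))      # the J probabilities sum to 1
```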
Suppose we get data $\mathcal{D}=\{(\boldsymbol{x}_1,\boldsymbol{t}_1),\ldots,(\boldsymbol{x}_N,\boldsymbol{t}_N) \}$, in which $\boldsymbol{t}$ encodes the class as a one-hot vector. With three classes, $\boldsymbol{t}=[0,0,1]^T$ signifies that the corresponding observation $\boldsymbol{x}$ is from the third class.
We are assuming that all observations are drawn independently (this allows us to multiply all probability distributions) from the same distribution (this allows using the same probability distribution for all observations). We rewrite the previous probability using $\boldsymbol{t}=[t_1,\ldots,t_J]^T$. Note that the components $t_j$ are scalars (not bold face).
$$p(\boldsymbol{x}|\boldsymbol{t})=\prod_{j=1}^{J}\left[\dfrac{\exp(-\boldsymbol{w}^T_j\boldsymbol{x})}{\sum_{l=1}^{J}\exp(-\boldsymbol{w}^T_l\boldsymbol{x})}\right]^{t_j}.$$
The likelihood of observing the data $\mathcal{D}$ is given by
$$p(\boldsymbol{x}_1,\ldots,\boldsymbol{x}_N|\boldsymbol{t}_1,\ldots,\boldsymbol{t}_N)=\prod_{n=1}^{N}\prod_{j=1}^{J}\left[\dfrac{\exp(-\boldsymbol{w}^T_j\boldsymbol{x}_n)}{\sum_{l=1}^{J}\exp(-\boldsymbol{w}^T_l\boldsymbol{x}_n)}\right]^{t_{nj}}.$$
Hence, the log-likelihood is given by
$$\log p(\boldsymbol{x}_1,\ldots,\boldsymbol{x}_N|\boldsymbol{t}_1,\ldots,\boldsymbol{t}_N)=\sum_{n=1}^{N}\sum_{j=1}^{J} t_{nj} \log\left[\dfrac{\exp(-\boldsymbol{w}^T_j\boldsymbol{x}_n)}{\sum_{l=1}^{J}\exp(-\boldsymbol{w}^T_l\boldsymbol{x}_n)}\right],$$
in which $t_{nj}$ is the $j^{\text{th}}$ component of the class vector $\boldsymbol{t}_n$ for the $n^\text{th}$ observation $\boldsymbol{x}_n$.
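As a sanity check, a NumPy sketch of this log-likelihood for a stack of observations `X`, one-hot targets `T`, and weight matrix `W` (all names and numbers are my own illustration):

```python
import numpy as np

def log_likelihood(X, T, W):
    """Double sum over n and j of t_nj * log p(x_n | C_j)."""
    scores = np.exp(-X @ W.T)                       # N x J matrix of exp(-w_j^T x_n)
    P = scores / scores.sum(axis=1, keepdims=True)  # normalise each row over the J classes
    return np.sum(T * np.log(P))

# Hypothetical data: N = 4 observations, 2 features, J = 3 classes.
X = np.array([[ 1.5, -0.7],
              [ 0.3,  0.9],
              [-1.2,  0.4],
              [ 2.0,  1.1]])
T = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])          # one-hot targets t_n
W = np.zeros((3, 2))               # some trial weight vectors w_j as rows
print(log_likelihood(X, T, W))     # equals 4 * log(1/3) for zero weights
```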
Best Answer
The same holds for any finite number of predictors, i.e., $$ p(y_i;\vec{x}_{i})=\frac{1}{1+\exp\{-\vec{x}_i'\beta\}}, $$ where $\vec{x}_i=(1, x_{1i},\ldots,x_{pi})$ and $\beta=(\beta_0,\ldots,\beta_p)$. Thus the likelihood function is a product of Bernoulli terms, $$ \mathcal{L}(\beta)=\prod_{i=1}^n p(y_i;\vec{x}_{i})^{y_i} \,(1-p(y_i;\vec{x}_{i}))^{1-y_{i}}. $$
E.g., if you have two predictors, i.e., $x_1$ and $x_2$, and a total of $n$ observations, i.e., $\{(y_i, x_{1i}, x_{2i})\}_{i=1}^n$, then the likelihood function is given by $$ \prod_{i=1}^n p(y_i; x_{1i}, x_{2i})^{y_i} \,(1-p(y_i; x_{1i}, x_{2i}))^{1-y_{i}}, $$ where $$ p(y_i; x_{1i}, x_{2i})=\frac{1}{1+\exp\{-(\beta_0 + \beta_1x_{1i}+\beta_2x_{2i})\}}. $$
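For concreteness, a short NumPy sketch of that two-predictor log-likelihood (data and coefficient values are invented; in practice you would maximise it numerically):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood: sum of y_i*log(p_i) + (1 - y_i)*log(1 - p_i)."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))   # p(y_i; x_1i, x_2i)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data: n = 5 observations with two predictors x1, x2.
x1 = np.array([0.5, 1.2, -0.3, 2.0, 0.8])
x2 = np.array([1.0, -0.4, 0.7, 1.5, -1.1])
y  = np.array([1, 0, 1, 1, 0])

X = np.column_stack([np.ones_like(x1), x1, x2])   # intercept, x1, x2
beta = np.array([0.1, 0.5, -0.2])                 # trial values of beta_0, beta_1, beta_2
print(log_likelihood(beta, X, y))                 # maximise this over beta to fit the model
```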