Applying Backward Euler to your system gives the following two equations,
\begin{equation}
y_{1}^{n+1} = y_{1}^n + h\,f(y_1^{n+1}, y_2^{n+1}),
\end{equation}
\begin{equation}
y_{2}^{n+1} = y_{2}^n + h\,g(y_1^{n+1}, y_2^{n+1}).
\end{equation}
In theory this scheme is more stable than the explicit one, so you can start testing your implementation by using larger time steps and checking whether the method stays stable. Given what you explained above, I suspect that either you have a bug in your code or you are introducing errors into the initial values by using Forward Euler to start the computation, which is not valid for this scheme. I assume $f$ and $g$ are both known functions.
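To make this concrete, here is a minimal sketch of the scheme in Python. The right-hand sides `f` and `g` below are placeholder linear examples (assumptions, not taken from the question), and each implicit step is solved by fixed-point iteration, which converges for sufficiently small $h$:

```python
# Backward Euler for the 2x2 system above, sketched with placeholder
# right-hand sides f and g (assumed linear examples -- substitute your own).

def f(y1, y2):          # assumed example coupling
    return -2.0 * y1 + y2

def g(y1, y2):          # assumed example coupling
    return y1 - 2.0 * y2

def backward_euler_step(y1, y2, h, tol=1e-12, max_iter=100):
    """One implicit step: solve z1 = y1 + h*f(z1, z2), z2 = y2 + h*g(z1, z2)."""
    z1, z2 = y1, y2                       # initial guess: previous values
    for _ in range(max_iter):
        z1_new = y1 + h * f(z1, z2)
        z2_new = y2 + h * g(z1, z2)
        done = abs(z1_new - z1) + abs(z2_new - z2) < tol
        z1, z2 = z1_new, z2_new
        if done:
            break
    return z1, z2

def integrate(y1, y2, h, n_steps):
    for _ in range(n_steps):
        y1, y2 = backward_euler_step(y1, y2, h)
    return y1, y2
```

For a linear system one could instead solve the $2\times2$ linear step exactly; the fixed-point loop is shown because it also works for nonlinear $f$ and $g$.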
General argument based on the implicit function theorem
The implicit system that you have to solve is
$$
y_i=y_0+h\bigl(a_{i1}f(t_1,y_1)+\dots+a_{is}f(t_s,y_s)\bigr).\tag{1a}
$$
This can be written as one big system
$$Y=Y_0+h(A\otimes I)F(T,Y),\tag{1b}$$
where
$$
Y=\pmatrix{y_1\\\vdots\\y_s}, ~~Y_0=\pmatrix{y_0\\\vdots\\y_0}, ~~ F(T,Y)=\pmatrix{f(t_1,y_1)\\\vdots\\f(t_s,y_s)}
$$
and the tensor product is the Kronecker product. For $h=0$ this has the obvious and unique solution $Y=Y_0$. The composite map $F$ is as smooth as the given function $f$, and $Y\mapsto Y-h(A\otimes I)F(T,Y)$ has an invertible Jacobian at $Y=Y_0$ for sufficiently small $h$. The existence and uniqueness of the solution of equation (1) in dependence of $h$ in some small interval around $0$ can now be asserted via the implicit function theorem.
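As a quick numerical illustration of that Jacobian condition: for a linear $f$ the Jacobian of $Y\mapsto Y-h(A\otimes I)F(T,Y)$ at $Y=Y_0$ is $I-h(A\otimes J_f)$, which stays invertible for small $h$. The tableau and the matrix $J_f$ below are assumptions chosen only for the demonstration:

```python
import numpy as np

# Check invertibility of the Jacobian I - h (A (x) J_f) of the stacked map
# Y -> Y - h (A (x) I) F(T, Y) at Y = Y0, for a few step sizes h.
# A (2-stage Gauss tableau) and J_f (Jacobian of an assumed linear f)
# are illustration choices, not part of the original argument.

A = np.array([[1/4, 1/4 - np.sqrt(3)/6],
              [1/4 + np.sqrt(3)/6, 1/4]])   # 2-stage Gauss tableau
J_f = np.array([[-2.0, 1.0],
                [1.0, -2.0]])               # assumed Jacobian of f

n, s = J_f.shape[0], A.shape[0]
for h in (0.0, 0.1, 1.0):
    jac = np.eye(n * s) - h * np.kron(A, J_f)
    # a nonzero determinant confirms invertibility for this h
    print(h, np.linalg.det(jac))
```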
Direct solution in the linear case
If $f(t,y)=My+g(t)$, as in the question, then the implicit system is in fact linear and can be solved as an $ns\times ns$ linear system,
$$
Y=Y_0+h(A\otimes I)[(I\otimes M)Y+G(T)]
\\\implies\\
Y=[I-h(A\otimes M)]^{-1}[Y_0+h(A\otimes I)G(T)]
$$
where $G(T)$ is again the vertically stacked collection of vectors $(g(t_1),\dots,g(t_s))$.
As no further numerical method is involved, the only restriction on $h$ is that $h^{-1}$ should not be an eigenvalue of $(A\otimes M)$. This is satisfied if $h\|A\|\,\|M\|<1$.
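A sketch of one step computed this way with `numpy.kron`; the matrix $M$, the forcing $g$, and the tableau (implicit midpoint, $A=[[1/2]]$, $b=[1]$, $c=[1/2]$) are assumptions for illustration:

```python
import numpy as np

# One implicit RK step for y' = M y + g(t), solving the stage system
# directly via Y = [I - h (A (x) M)]^{-1} [Y0 + h (A (x) I) G(T)].
# M, g, and the implicit-midpoint tableau are illustration assumptions.

M = np.array([[0.0, 1.0],
              [-1.0, 0.0]])        # assumed: harmonic oscillator
g = lambda t: np.zeros(2)          # assumed: no forcing

A = np.array([[0.5]])              # implicit midpoint tableau
b = np.array([1.0])
c = np.array([0.5])
s, n = len(b), M.shape[0]

def irk_step(y0, t0, h):
    Y0 = np.tile(y0, s)                               # stacked y0
    T = t0 + c * h
    G = np.concatenate([g(t) for t in T])             # stacked g(t_i)
    lhs = np.eye(n * s) - h * np.kron(A, M)
    Y = np.linalg.solve(lhs, Y0 + h * np.kron(A, np.eye(n)) @ G)
    stages = Y.reshape(s, n)                          # stage values y_i
    F = stages @ M.T + np.array([g(t) for t in T])    # f(t_i, y_i)
    return y0 + h * (b @ F)                           # update with weights b

y, t, h = np.array([1.0, 0.0]), 0.0, 0.1
for _ in range(100):
    y = irk_step(y, t, h)
    t += h
```

For the skew-symmetric $M$ chosen here, the implicit midpoint rule preserves $\|y\|$ exactly (up to round-off), which is a convenient sanity check.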
Using the Banach fixed-point theorem directly for the non-linear case
One can make this a bit more quantitative using the Banach fixed-point theorem for the fixed-point map $$P(Y)=Y_0+h(A\otimes I)F(T,Y).$$ Using a suitable composite norm on $Y$, for instance $\|Y\|=\max_{k=1,\dots,s}\|y_k\|$, one finds that $P$ has a Lipschitz constant $$L(P)=h\|A\|\,L(f)$$ on some ball $\|Y-Y_0\|\le R$ where $R$ is such that $f$ has Lipschitz constant $L(f)$ on $\|y-y_0\|\le R$. Choosing $h$ small enough, one can achieve $L(P)\le\frac12$.
Then with
$$\|P(Y)-Y_0\|\le \|P(Y_0)-Y_0\|+\|P(Y)-P(Y_0)\|\le h\|A\|\|F(T,Y_0)\|+L(P)\|Y-Y_0\|$$
one also needs to ensure that the first term is at most $R/2$, so that $\|Y-Y_0\|\le R\implies \|P(Y)-Y_0\|\le R$ and all conditions of the fixed-point theorem are satisfied.
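The iteration itself can be sketched directly. Below, $f(t,y)=\sin(y)$ is an assumed example with global Lipschitz constant $L(f)=1$; with the implicit-midpoint tableau $A=[[1/2]]$ and $h=1/2$ the contraction factor is $L(P)=h\|A\|L(f)=1/4\le\frac12$, so the fixed-point iteration converges geometrically:

```python
import numpy as np

# Fixed-point iteration for P(Y) = Y0 + h (A (x) I) F(T, Y).
# f(t, y) = sin(y), the tableau, and h are illustration assumptions;
# contraction factor L(P) = h * ||A|| * L(f) = 0.25 <= 1/2.

f = lambda t, y: np.sin(y)         # assumed right-hand side, L(f) = 1
A = np.array([[0.5]])              # implicit midpoint tableau
h, y0, t0 = 0.5, np.array([1.0]), 0.0
s, n = 1, 1

Y0 = np.tile(y0, s)
T = t0 + h * np.array([0.5])
AkI = np.kron(A, np.eye(n))

def P(Y):
    F = np.concatenate([f(T[i], Y[i*n:(i+1)*n]) for i in range(s)])
    return Y0 + h * AkI @ F

Y = Y0.copy()
for k in range(60):
    Y_new = P(Y)
    err = np.linalg.norm(Y_new - Y)
    Y = Y_new
    if err < 1e-14:                # geometric convergence, factor 1/4
        break
# Y now solves the implicit stage equation Y = P(Y)
```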
Some classes of implicit methods
Examples that you might want to explore are the Gauß methods and the SDIRK methods. The Gauß methods, based on Gauß quadrature, attain the highest possible order ($2s$) for a given number of stages $s$. The SDIRK methods (Singly Diagonally Implicit Runge-Kutta) have a lower triangular matrix $(a_{ij})$ with identical diagonal entries. This has the advantage that the implicit system can be solved stage by stage, and in each stage a simplified Newton method can be employed that reuses the same matrix.
Butcher tableaus for some lower-order examples can be found in these slides by Butcher.
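The stage-by-stage structure of an SDIRK method can be sketched as follows. The tableau (Alexander's L-stable 2-stage SDIRK with $\gamma=1-1/\sqrt2$) and the linear test problem $y'=My$ are assumptions for illustration; the point is that every stage solves a system with the same matrix $I-h\gamma M$, which therefore needs to be factored only once:

```python
import numpy as np

# One step of a 2-stage SDIRK method for the linear problem y' = M y.
# Tableau: Alexander's L-stable SDIRK2 with gamma = 1 - 1/sqrt(2) (assumed).
# Every stage uses the SAME matrix I - h*gamma*M, inverted once per step.

gamma = 1.0 - 1.0 / np.sqrt(2.0)
A = np.array([[gamma, 0.0],
              [1.0 - gamma, gamma]])
b = np.array([1.0 - gamma, gamma])

M = np.array([[-2.0, 1.0],
              [1.0, -2.0]])              # assumed stiff linear problem
n = M.shape[0]

def sdirk2_step(y0, h):
    solve_once = np.linalg.inv(np.eye(n) - h * gamma * M)  # reused matrix
    stages = []
    for i in range(len(b)):
        # (I - h*gamma*M) y_i = y0 + h * sum_{j<i} a_ij M y_j
        rhs = y0 + h * sum(A[i, j] * (M @ stages[j]) for j in range(i))
        stages.append(solve_once @ rhs)
    return y0 + h * sum(b[i] * (M @ stages[i]) for i in range(len(b)))

y = np.array([1.0, 0.0])
for _ in range(50):
    y = sdirk2_step(y, 0.1)
```

In production code one would store an LU factorization of $I-h\gamma M$ rather than the explicit inverse, but the reuse pattern is the same.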
Best Answer
Let us introduce the vector of unknowns $Y = (y_1,y_2)^\top$ which satisfies the ODE system $Y'(t) = F(t,Y)$. At each step, one needs to solve $$ Y_{n+1} = Y_{n} + h\left(\theta\, F(t_n,Y_n) + \left(1-\theta\right) F(t_{n+1},Y_{n+1}) \right) , $$ where $t_n = nh$ and $Y_n \simeq Y(t_n)$. In general, this is a nonlinear equation for $Y_{n+1}$, which is rewritten $G(Y_{n+1}) = 0$, where $$ G : Y \mapsto Y - h\left( \left(1-\theta\right) F(nh+h,Y) + \theta\, F(nh,Y_n) \right) - Y_{n} \, . $$ It is solved at each step using Newton's method $$ Y_{n+1}^{k+1} = Y_{n+1}^{k} - \left( G'(Y_{n+1}^{k})\right)^{-1}\, G(Y_{n+1}^{k}) \, , $$ where $$ G' : Y \mapsto I - \left(1-\theta\right) h\, \partial_Y F(nh+h,Y) $$ is the Jacobian matrix of $G$, and $\partial_Y F(nh+h,\cdot)$ is the Jacobian matrix of $F(nh+h,\cdot)$. The initial guess is $Y_{n+1}^{0} = Y_n$, and the final value is $Y_{n+1}^\infty = Y_{n+1}$.
The code below performs the proposed integration:
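Since the original listing is not reproduced here, the following is a minimal sketch of the $\theta$-method with Newton's method, using the harmonic oscillator $Y'=MY$ (an assumption standing in for the original test problem) whose analytical solution is a circle in the $y_1$-$y_2$ phase plane:

```python
import numpy as np

# Sketch of the theta-method above, with Newton's method at each step.
# Test problem Y' = F(t, Y) = M Y (harmonic oscillator) is an assumption;
# its analytical solution traces a circle in the y1-y2 phase plane.

M = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
F = lambda t, Y: M @ Y
dF = lambda t, Y: M            # Jacobian of F(t, .), here constant

def theta_step(Yn, t, h, theta=0.5, tol=1e-12, max_iter=20):
    G = lambda Y: (Y - h * ((1 - theta) * F(t + h, Y)
                            + theta * F(t, Yn)) - Yn)
    Y = Yn.copy()                              # initial guess Y^0 = Yn
    for _ in range(max_iter):
        Gp = np.eye(len(Yn)) - (1 - theta) * h * dF(t + h, Y)
        delta = np.linalg.solve(Gp, G(Y))      # Newton correction
        Y = Y - delta
        if np.linalg.norm(delta) < tol:
            break
    return Y

Y, t, h = np.array([1.0, 0.0]), 0.0, 0.05
for _ in range(200):                           # integrate to t = 10
    Y = theta_step(Y, t, h)
    t += h
```

For this linear problem Newton converges in a single iteration, and with $\theta=\tfrac12$ (Crank-Nicolson) the scheme preserves $\|Y\|$ exactly up to round-off, so the numerical trajectory stays on the analytical circle.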
A comparison of the numerical solution (blue) with the analytical solution (red) in the phase space $y_1$-$y_2$ is given in the graphics output