Note that the linearity assumption you're speaking of only says that the conditional mean of $Y_i$ given $X_i$ is a linear function. You cannot use the value of $R^2$ to test this assumption.
This is because $R^2$ is merely the squared correlation between the observed and predicted values, and the value of the correlation coefficient does not uniquely determine the relationship between $X$ and $Y$ (linear or otherwise). In particular, both of the following scenarios are possible. I will discuss each in turn:
(1) High $R^2$ but the linearity assumption still wrong in an important way: The trick here is to exploit the fact that correlation is very sensitive to outliers. Suppose you have predictors $X_1, ..., X_n$ generated from a mixture distribution that is standard normal $99\%$ of the time and a point mass at $M$ the other $1\%$, and a response variable that is
$$ Y_i = \begin{cases}
Z_i & {\rm if \ } X_i \neq M \\
M & {\rm if \ } X_i = M \\
\end{cases} $$
where $Z_i \sim N(\mu,1)$ and $M$ is a positive constant much larger than $\mu$, e.g. $\mu=0, M=10^5$. Then $X_i$ and $Y_i$ will be almost perfectly correlated:
u = runif(1e4) > .99    # indicator for the 1% point-mass component of the mixture
x = rnorm(1e4)          # standard normal draws
x[u] = 1e5              # the point mass at M = 1e5
y = rnorm(1e4)          # Z_i ~ N(0,1), unrelated to x
y[x == 1e5] = 1e5       # Y_i = M whenever X_i = M
cor(x,y)
[1] 1
despite the fact that the expected value of $Y_i$ given $X_i$ is not linear - in fact it is a discontinuous step function and the expected value of $Y_i$ doesn't even depend on $X_i$ except when $X_i = M$.
(2) Low $R^2$ but the linearity assumption still satisfied: The trick here is to make the amount of "noise" around the linear trend large. Suppose you have a predictor $X_i$ and response $Y_i$ and the model
$$ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i $$
was the correct model. Therefore, the conditional mean of $Y_i$ given $X_i$ is a linear function of $X_i$, so the linearity assumption is satisfied. But if ${\rm var}(\varepsilon_i) = \sigma^2$ is large relative to $\beta_1^2 \, {\rm var}(X_i)$, then $R^2$ will be small. For example,
x = rnorm(200)
y = 1 + 2*x + rnorm(200, sd=5)   # linear in x, but with a large error variance
cor(x,y)^2                       # the sample R^2 is small
[1] 0.1125698
Therefore, assessing the linearity assumption is not a matter of checking whether $R^2$ lies within some tolerable range; it is rather a matter of examining scatter plots of the response against the predictors (or the predicted values) and making a (perhaps subjective) judgment.
Re: What to do when the linearity assumption is not met and transforming the IVs also doesn't help?!!
When non-linearity is an issue, it may be helpful to look at plots of the residuals vs. each predictor: if there is any noticeable pattern, this can indicate non-linearity in that predictor. For example, if this plot reveals a "bowl-shaped" relationship between the residuals and the predictor, this may indicate a missing quadratic term in that predictor; other patterns may indicate a different functional form. In some cases, it may be that you haven't tried the right transformation, or that the true model isn't linear in any transformed version of the variables (although it may still be possible to find a reasonable approximation).
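To make this concrete, here is a minimal simulated sketch (the data and the quadratic fix are invented purely for illustration): a residual-vs-predictor plot from a misspecified linear fit shows a bowl-shaped pattern, which disappears once the quadratic term is added.
set.seed(1)
x = rnorm(200)
y = 1 + 2*x + 1.5*x^2 + rnorm(200)   # true relationship is quadratic
fit_lin = lm(y ~ x)                  # misspecified linear fit
plot(x, resid(fit_lin))              # residuals show a "bowl-shaped" pattern
fit_quad = lm(y ~ x + I(x^2))        # add the quadratic term
plot(x, resid(fit_quad))             # the pattern is gone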
Regarding your example: based on the predicted vs. actual plots (the 1st and 3rd plots in the original post) for the two different dependent variables, it seems to me that the linearity assumption is tenable in both cases. In the first of these plots, it looks like there may be some heteroskedasticity, but the relationship does look pretty linear. In the second, the relationship also looks linear, but it is rather weak, as indicated by the large scatter around the line (i.e. the large error variance); this is why you're seeing a low $R^2$.
It is very often stated that minimizing the sum of squared residuals is preferred over minimizing the sum of absolute residuals because it is computationally simpler. But it may also be better for other reasons: namely, if the assumptions are true (and this is not so uncommon), then it provides a solution that is (on average) more accurate.
Maximum likelihood
Least squares regression and quantile regression (when performed by minimizing the absolute residuals) can be seen as maximizing the likelihood function for Gaussian- and Laplace-distributed errors, respectively, and are in this sense very much related.
Gaussian distribution:
$$f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
with the log-likelihood being maximized when minimizing the sum of squared residuals
$$\log \mathcal{L}(x) = -\frac{n}{2} \log (2 \pi) - n \log(\sigma) - \frac{1}{2\sigma^2} \underbrace{\sum_{i=1}^n (x_i-\mu)^2}_{\text{sum of squared residuals}} $$
Laplace distribution:
$$f(x) = \frac{1}{2b} e^{-\frac{\vert x-\mu \vert}{b}}$$
with the log-likelihood being maximized when minimizing the sum of absolute residuals
$$\log \mathcal{L}(x) = -n \log (2) - n \log(b) - \frac{1}{b} \underbrace{\sum_{i=1}^n |x_i-\mu|}_{\text{sum of absolute residuals}} $$
Note: the Laplace distribution and the sum of absolute residuals relate to the median, but the approach can be generalized to other quantiles by giving different weights to negative and positive residuals.
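As a rough numerical check of these correspondences (simulated data; the 0.9 quantile is just an example choice), one can minimize each loss directly and compare the minimizers with the sample mean, median and an empirical quantile:
set.seed(1)
x = rexp(1e4)    # skewed data, so the mean, median and quantiles differ
sq_loss  = function(m) sum((x - m)^2)
abs_loss = function(m) sum(abs(x - m))
qt_loss  = function(m, tau = 0.9) sum(ifelse(x > m, tau, 1 - tau) * abs(x - m))
optimize(sq_loss,  range(x))$minimum   # ~ mean(x)
optimize(abs_loss, range(x))$minimum   # ~ median(x)
optimize(qt_loss,  range(x))$minimum   # ~ quantile(x, 0.9), roughly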
Known error distribution
When we know the error distribution (that is, when the assumptions are likely to hold), it makes sense to choose the associated likelihood function; maximizing it yields an (asymptotically) efficient estimator.
Very often the errors are (approximately) normally distributed. In that case using least squares is the best way to estimate the parameter $\mu$ (which is both the mean and the median of the distribution): it has the lowest variance among all unbiased estimators. One can even say something stronger: it is stochastically dominant (see the illustration in this question comparing the distribution of the sample median and the sample mean).
So, when the errors are normally distributed, the sample mean is a better estimator of the distribution's median than the sample median, and, analogously, least squares regression is a better estimator of the quantiles than minimizing the sum of absolute residuals.
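A small simulation sketch of this efficiency comparison under normal errors (the sample size and number of replications are arbitrary choices):
set.seed(1)
sims = replicate(1e4, {
  z = rnorm(100)
  c(mean = mean(z), median = median(z))
})
apply(sims, 1, var)   # the median's variance is roughly pi/2 times the mean's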
Because so many problems deal with normally distributed errors, the least squares method is very popular. To work with other types of distributions one can use a generalized linear model (GLM). The method of iteratively reweighted least squares, which is used to fit GLMs, also works for the Laplace distribution (i.e. for absolute deviations), which is equivalent to finding the median (or, in the generalized version, other quantiles).
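A rough sketch of that iteratively reweighted least squares idea for absolute deviations (simulated data; the 1e-6 floor is just a numerical guard against division by zero): each iteration solves a weighted least squares problem with weights $1/|r_i|$, and the fit drifts towards the median-regression line.
set.seed(1)
x = rnorm(200)
y = 1 + 2*x + rnorm(200)
w = rep(1, length(y))
for (i in 1:50) {
  fit = lm(y ~ x, weights = w)          # weighted least squares step
  w = 1 / pmax(abs(resid(fit)), 1e-6)   # reweight by 1/|residual|
}
coef(fit)   # compare with quantreg::rq(y ~ x, tau = 0.5), if that package is available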
Unknown error distribution
Robustness
The median and other quantiles have the advantage that they are very robust to the type of distribution: the actual values do not matter much, since the quantiles only depend on the ordering of the observations. So no matter what the distribution is, minimizing the absolute residuals (which is equivalent to finding the median, or another quantile in the weighted version) works very well.
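A tiny illustration of that robustness (one artificial outlier added to otherwise standard normal data):
set.seed(1)
z = c(rnorm(99), 1e3)   # one wild observation
mean(z)                 # dragged far towards the outlier
median(z)               # essentially unchanged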
The question becomes complex and broad here, and it depends on what type of knowledge we have, or do not have, about the distribution function. For instance, a distribution may be approximately normal but with some additional outliers. This can be dealt with by removing the extreme values. Such trimming even works when estimating the location parameter of the Cauchy distribution, where a truncated (trimmed) mean can be a better estimator than the median. So not only in the ideal situation when the assumptions hold, but also in some less ideal applications (e.g. additional outliers), there may be good robust methods that still use some form of a sum of squared residuals instead of a sum of absolute residuals.
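A simulation sketch of that Cauchy example (the trimming fraction of 38% from each tail is an assumption here, roughly the amount usually reported as optimal): compare the sampling variances of the trimmed mean and the median as estimators of the location parameter.
set.seed(1)
est = replicate(1e4, {
  z = rcauchy(100)
  c(trimmed = mean(z, trim = 0.38), median = median(z))
})
apply(est, 1, var)   # the heavily trimmed mean tends to come out slightly better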
I imagine that regression with truncated residuals might be computationally much more complex. So it may actually be quantile regression that gets performed because it is computationally simpler (not simpler than ordinary least squares, but simpler than truncated least squares).
Biased/unbiased
Another issue is biased versus unbiased estimators. Above I described the maximum likelihood estimate of the mean, i.e. the least squares solution, as a good or preferable estimator because it has the lowest variance of all unbiased estimators (when the errors are normally distributed). However, biased estimators may be even better (in the sense of a lower expected sum of squared errors).
This makes the question again broad and complex: there are many different estimators and many different situations in which to apply them. Using an adapted sum-of-squared-residuals loss function often works well to reduce the error (e.g. all kinds of regularization methods), but it need not work well in all cases. Intuitively, it is not strange to imagine that, since the sum-of-squared-residuals loss function is optimal among unbiased estimators, the optimal biased estimator is probably something close to it.
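As one minimal illustration that a biased estimator can beat an unbiased one in mean squared error (a simple shrinkage of the sample mean towards zero; the true mean of 0.2 and the shrinkage factor of 0.5 are arbitrary choices):
set.seed(1)
mu = 0.2
est = replicate(1e5, {
  z = rnorm(10, mean = mu)
  c(unbiased = mean(z), shrunk = 0.5 * mean(z))   # shrink towards zero
})
rowMeans((est - mu)^2)   # MSE: the biased, shrunken estimator comes out smaller here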
Best Answer
Quantile regression assumes very little about the distribution of $Y$, which makes it robust, but that robustness comes at the cost of efficiency.
You might also consider semiparametric regression (e.g., proportional odds or hazards models) which are more efficient and also allow you to estimate the mean.
My RMS course notes go a bit more into quantile and semiparametric regression in the chapter on ordinal models for continuous $Y$.