The location on the image plane will give you a ray on which the object lies. You’ll need to use other information to determine where along this ray the object actually is, though. That information is lost when the object is projected onto the image plane. Assuming that the object is somewhere on the road plane is a huge simplification. Now, instead of trying to find the inverse of a perspective mapping, you only need to find a perspective projection of the image plane onto the road. That’s a fairly straightforward construction similar to the one used to derive the original perspective projection.
Start by working in camera-relative coordinates. A point $\mathbf p_i$ on the image plane has coordinates $(x_i,y_i,f)^T$. The original projection maps all points on the ray $\mathbf p_i t$ onto this point. Now, we’re assuming that the road is a plane, so it can be represented by an equation of the form $\mathbf n\cdot(\mathbf p_o-\mathbf r)=0$, where $\mathbf n$ is a normal to the plane and $\mathbf r$ is some known point on it. We seek the intersection of the ray and this plane, which will satisfy $\mathbf n\cdot(\mathbf p_i t-\mathbf r)=0$. Solving for $t$ and substituting gives $$\mathbf p_o = {\mathbf n\cdot \mathbf r \over \mathbf n\cdot \mathbf p_i}\mathbf p_i.$$ Moving to homogeneous coordinates, this mapping is the linear transformation represented by the matrix $$
M = \begin{pmatrix}1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ {n_x \over \mathbf n\cdot\mathbf r} & {n_y \over \mathbf n\cdot\mathbf r} & {n_z \over \mathbf n\cdot\mathbf r} & 0\end{pmatrix},
$$ i.e., $$
\mathbf p_o = M\begin{pmatrix}x_i \\ y_i \\ f \\ 1\end{pmatrix}.
$$ Once you have this, it should be obvious how to complete the mapping back to world coordinates.
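As a minimal numpy sketch of this construction (the helper names are mine, not part of any library), here is both the direct ray–plane intersection and its homogeneous-matrix form:

```python
import numpy as np

def backproject_to_plane(p_i, n, r):
    """Map an image-plane point p_i = (x_i, y_i, f) onto the road plane
    n . (p - r) = 0; everything is in camera-relative coordinates."""
    denom = n @ p_i
    if abs(denom) < 1e-12:
        raise ValueError("viewing ray is parallel to the road plane")
    return (n @ r) / denom * p_i  # p_o = (n.r / n.p_i) p_i

def plane_matrix(n, r):
    """The same mapping as the homogeneous 4x4 matrix M above."""
    M = np.eye(4)
    M[3, :3] = n / (n @ r)  # last row: (n_x, n_y, n_z) / (n.r), then 0
    M[3, 3] = 0.0
    return M
```

For instance, with the road at $y=2$ in camera coordinates ($\mathbf n=(0,1,0)^T$, $\mathbf r=(0,2,0)^T$) and $f=1$, the image point $(0.1,0.5)$ back-projects to $(0.4,2,4)^T$.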
All that’s left is to find the parameters $\mathbf n$ and $\mathbf r$ that describe the road plane in camera coordinates. That’s also pretty simple. Since we’re taking the road to be the plane $y=0$ in world coordinates, its normal there is $(0,1,0)^T$. As for a known point on the road, the origin will do. Another reasonable choice is the point at which the camera’s optical axis meets the road, since the camera-relative coordinates of that point will be of the form $(0,0,z)^T$. Convert both of these into camera-relative coordinates, and you’re done.
Note that you don’t necessarily need to know anything about the camera to compute a perspective transformation that will map from the image plane to the road plane. If you can somehow find four pairs of corresponding points on these two planes (with no three of the points in either plane collinear), i.e., a pair of quadrilaterals, a planar perspective transformation that relates them can be computed fairly easily. See here for details. Essentially, you calibrate the camera view by matching a region of the image to a known region in the road plane.
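For the record, here is a sketch of that computation using the standard direct linear transform (DLT), which recovers the $3\times3$ homography from four or more point correspondences. The function names are mine, and there is no coordinate normalization or outlier handling, so treat it as illustrative only:

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate H (3x3, up to scale) with dst ~ H @ src in homogeneous
    coordinates, from N >= 4 correspondences (no three points collinear).
    src, dst: arrays of shape (N, 2)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography's entries form the null vector of A,
    # i.e., the singular vector for the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 3)

def apply_homography(H, pt):
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])
```

With exact correspondences from a pair of quadrilaterals, the recovered transform agrees with the true one on any fifth point.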
Update 2018.10.22: If you have the complete camera matrix $P$, which you do, there’s a fairly straightforward way to construct the back-mapping to points on the road with a few matrix operations. We choose a coordinate system for the road plane, which gives us a $4\times3$ matrix $M$ that maps from these plane coordinates to world coordinates, i.e., $\mathbf X = M\mathbf x$. The image of this point is $PM\mathbf x$. If $PM$ is invertible, which it will be unless the camera center is on the road plane, the matrix $(PM)^{-1}$ maps from image to plane coordinates, and so the back-mapping from image to world coordinates on the road is $M(PM)^{-1}$. For the plane $Y=0$, a natural choice for $M$ is $$M=\begin{bmatrix}1&0&0\\0&0&0\\0&1&0\\0&0&1\end{bmatrix},$$ which simply inserts a $Y$-coordinate of zero to obtain world coordinates. You can adjust the origin of this coordinate system by changing the last column of $M$.
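In numpy this construction is only a few lines. `road_backmapping` is a name I've made up; the sketch assumes `P` is a known $3\times4$ camera matrix whose center is off the road plane:

```python
import numpy as np

def road_backmapping(P):
    """Given a 3x4 camera matrix P, return the 4x3 matrix M (P M)^{-1}
    that maps homogeneous image points to homogeneous world points on Y = 0."""
    # M inserts a Y-coordinate of zero: plane coords (X, Z, 1) -> (X, 0, Z, 1).
    M = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
    # np.linalg.inv raises LinAlgError if P M is singular,
    # i.e., if the camera center lies on the road plane.
    return M @ np.linalg.inv(P @ M)
```

Round-tripping a road point through `P` and back recovers it, up to the homogeneous scale factor.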
I can answer this:
> What is the affine transformation converting world coordinates to camera coordinates? (camera world coordinates: $c=(c_x,c_y,c_z)^\top$; visual center world coordinates: $v=(v_x,v_y,v_z)^\top$)
I'm assuming the traditional camera image coordinates (before projection), with $z$ pointing "into" the image, $x$ pointing from left to right, and $y$ pointing downward.
Now let's track how the axes must be rotated without translation:
1. The new $z$ axis ($z'$) will point along $v-c$.
1. The new $x$ axis ($x'$) is perpendicular to both the world $z$ axis and $z'$.
1. The new $y$ axis ($y'$) is perpendicular to $x'$ and $z'$.
You can find three vectors that point along the new axes in world coordinates, normalize them, then put them in the rows of a $3\times 3$ matrix $R$: this converts world coordinates to rotated camera orientation.
Finally, if you know the translation $t$ in world coordinates (here $t=(-10,-10,-10)^\top$, which carries the camera's world position to the origin), then the translation in camera coordinates is $t'=Rt$.
Let's actually carry this out for your example, constructing a triad of orthogonal vectors:
$z'=(-1,-1,-1)^\top$, pointing in the direction the camera must face.
$x'=z'\times z=(-1,1,0)^\top$.
$y'=z'\times x'=(1,1,-2)^\top$.
Normalizing these and using them as the rows of a matrix you get:
$$
R=\frac{1}{\sqrt{6}}\begin{bmatrix}
-\sqrt{3}&\sqrt{3}&0\\
1&1&-2\\
-\sqrt{2}&-\sqrt{2}&-\sqrt{2}
\end{bmatrix}
$$
Then $t'=Rt=(0,0,10\sqrt{3})^\top$.
Notice that the angle of declination is an odd angle, $\arctan(1/\sqrt{2})\approx 35.26^\circ$, rather than exactly $45^\circ$. (I had a hard time seeing this at first, but if you draw a cube and check the angle between $(1,1,0)$ and $(1,1,1)$ you'll see what I mean.)
Now you've converted world coordinates to a rotated frame that is aligned with your camera's frame but differs from it by a translation.
This gives you the resulting affine transformation $\begin{bmatrix}R&t'\\0_{1\times 3}&1\end{bmatrix}$
which carries world coordinates to camera coordinates.
As a sanity check, you can confirm that the world's origin maps to camera $(0,0,10\sqrt{3})^\top$ and that the world camera location $(10,10,10)$ now maps to the camera's origin. A third check of your choice should be sufficient to convince you this is the right $R$ and $t'$.
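Those sanity checks are easy to run numerically. Here's a sketch in numpy (the function name is mine; `up` is the world $z$-axis, and the cross-product order matches the construction above):

```python
import numpy as np

def look_at_rotation(c, v, up=np.array([0.0, 0.0, 1.0])):
    """Rows of R are the camera axes x', y', z' expressed in world
    coordinates: z' along the view direction v - c, x' = z' x up,
    y' = z' x x' (image x right, y down, z into the image)."""
    z = v - c
    z = z / np.linalg.norm(z)
    x = np.cross(z, up)
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)  # already unit length: z and x are orthonormal
    return np.vstack([x, y, z])

c = np.array([10.0, 10.0, 10.0])   # camera position in world coordinates
v = np.zeros(3)                    # visual center: the world origin
R = look_at_rotation(c, v)
t_cam = R @ (-c)                   # t' = R t with t = -c
```

Here `R` reproduces the matrix above, `t_cam` comes out as $(0,0,10\sqrt{3})^\top$, and `R @ c + t_cam` is the zero vector, confirming the camera maps to its own origin.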
One caveat: I'm not 100% sure the cross product $z'\times z$ is always taken in this order. I picked it this way on this occasion because it gave the right orientation for $x'$ and $y'$ in the end. Hopefully that is all consistent, but maybe there is some sign ambiguity after all.
> The second question is how to construct the "UP" vector.
I don't understand what you are asking. If you mean the camera coordinates for the direction of the world $z$-axis, that direction is just $R(0,0,1)^\top$; since it's a direction rather than a point, the translation $t'$ doesn't apply.
> Finally, I will have to rotate the camera as well, from "landscape" to "portrait" orientation.
I'm interpreting this to mean that you'd want to rotate the image plane so that the $y$-axis is horizontal, which can be done with a $\pi/2$ rotation either way around the camera $z$-axis. One such rotation is:
$$U=
\begin{bmatrix}
0&-1&0\\
1&0&0\\
0&0&1\end{bmatrix}
$$
$U$ gives the rotation in the clockwise direction around the $z$ axis (which would look to be counterclockwise if you are looking up the $z$ axis into the picture) and $U^\top$ would give the rotation in the other direction.
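A quick numerical check of $U$'s properties (numpy, illustrative only):

```python
import numpy as np

# Quarter-turn about the camera z-axis (landscape -> portrait).
U = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])

# U carries the camera x-axis onto the y-axis and leaves z untouched;
# composing with R above gives the portrait world-to-camera rotation U @ R.
```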
Best Answer
Turns out the math is sound. The problem was due to image-dimension conventions differing across Python packages: one library mapped $(y,x)$ to (width, height) and another mapped $(x,y)$ to (width, height). Correcting for this removed the offset.