I'm on the right track now. I did as I suggested earlier:
Should I apply the object's transform to the target point and to a given vector between the camera eye point and target point, then I'll get my new target and my new camera eye point (which is the end of my new vector)? Will that be enough information to setup the camera pose properly?
but I've also applied the transform to the up-vector of the camera to get it right. Result: for any kind of translation/rotation applied to the object, the camera moves properly (Yeah!). The only problem that remains is when I scale the object. If I shrink it, it gets smaller in the camera view after the transform is applied (and vice versa).
See if this reasoning helps. In 3D computer graphics, everything is rendered from the viewpoint of an imaginary camera, or viewer. The 3D content being viewed is represented by coordinates in a world coordinate system, and the camera location and orientation are also specified in world coordinates.
The viewing transform (the guts of all 3D graphics) consists of two 3D transforms, followed by a 3D --> 2D projection.
Coordinate translate: World to Eye. First, all world points must be put into local viewing coordinates where the camera defines the origin. This is done as a coordinate translation, where the camera's position coordinates are subtracted from each world point's coordinates. The result is that all points now revolve around an origin at the camera. This is how you want it, since if the camera (viewer) pans or tilts, points stay the same distance away, only changing direction.
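The translation step can be sketched in a few lines of Python (a minimal illustration; the function and variable names are my own):

```python
# Step 1 (World -> Eye translation): subtract the camera's world
# position from each world point, so the camera becomes the origin
# of the local viewing coordinate system.

def world_to_eye(point, camera_pos):
    """Coordinates of a world point relative to the camera position."""
    return tuple(p - c for p, c in zip(point, camera_pos))

# A camera at (10, 0, 5) viewing the world point (12, 3, 4):
print(world_to_eye((12, 3, 4), (10, 0, 5)))  # (2, 3, -1)
```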
Coordinate rotate: Local world to Eye gaze rotator. In the most general formulation, the camera can adopt any possible 3D orientation. Not only can the camera's center axis point out in any 3D direction, held in that pointing direction, the camera can be spun around the axis 360 degrees. (I'm gonna hurl!)
Therefore, a general-purpose way to represent the orientation of the camera (viewer) is to use a 3x3 rotation matrix (avoiding the use of angles). The camera has its own local x-y-z axes. Some camera defs have it looking out the local x-axis (and some look out the local y-axis), with z pointing out the top of the camera. In any event, the orientation of the camera in world coordinates is represented by 3 direction vectors comprising the rows of the 3x3:
3x3 matrix R =
[ camera_x_axis_dir ]
[ camera_y_axis_dir ]
[ camera_z_axis_dir ]
Visually, you can imagine what these numbers mean. Imagine the camera coming with its 3 axes as 1-meter-long sticks that point out from the camera, forming perfect right angles: the x-stick points out to the right, the y-stick forward, and the z-stick upward. Wherever the camera is set up, at whatever orientation, slide the camera over to the world origin without upsetting its orientation. Construct a unit sphere about the origin. Where does the cam's x-axis stick touch the unit sphere? Those [ x y z ] world coordinates are the numbers in camera_x_axis_dir. Same for the y and z sticks. All camera orientations can be represented numerically in this simple manner. This is a 1:1 representation: there is only one set of 9 numbers representing a unique camera orientation. This is better for software than using angles to represent orientation, since angles are not 1:1. (The information about 3D orientation is said to be overcompressed in roll-pitch-yaw angle form.)
Once you know the camera orientation, the next step in rendering a 3D point is to coordinate-rotate the point so that the apparent location of the point is expressed in camera-oriented coordinates. If I'm looking directly at a 3D point, after coordinate rotation the point will fall on the camera's local y-axis (x and z coords will be zero). If the point is directly behind the camera, its transformed coordinates put it on the negative y-axis. If the camera is looking at the world upside-down, the coordinate rotate will flip the world upside down.
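The rotation step can be sketched the same way (again a sketch with made-up names; R stores the camera's axes as rows, as described, and the optical axis is the local y-axis):

```python
# Step 2 (Eye-gaze rotation): the rows of R are the camera's local
# x, y, z axes expressed as unit vectors in world coordinates.
# Multiplying a camera-relative point by R gives its components along
# the camera's own axes (each component is a dot product with one axis).

def rotate_to_camera(R, p):
    return tuple(sum(R[i][j] * p[j] for j in range(3)) for i in range(3))

# A camera looking along world +x (gaze = local y-axis), up = world +z:
R = [(0, -1, 0),   # local x (right) = world -y
     (1, 0, 0),    # local y (gaze)  = world +x
     (0, 0, 1)]    # local z (up)    = world +z

# A point 5 units directly ahead lands on the local y-axis (x = z = 0):
print(rotate_to_camera(R, (5, 0, 0)))   # (0, 5, 0)
# Directly behind the camera -> negative local y-axis:
print(rotate_to_camera(R, (-5, 0, 0)))  # (0, -5, 0)
```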
The last step (assuming the optical axis is +y) is to place the viewing screen directly in front of the viewer, perpendicular to the optical axis, and project points onto it. The ray going out from the camera to the point is intersected (conceptually) with the view screen, and the intersection coordinates (x, z) say where to draw the 3D point on the 2D view screen. (Da Vinci was the first tech-artist to popularize this model of the projective transform.)
Not mentioned so far: there is also some backspace clipping of 3D content that is unviewable because it is located behind the camera. The rotate transform makes this decision easy, as all 3D content with negative y-coords (in local, rotated camera coordinates) is in the viewer's backspace.
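The projection step, plus the backspace clip, can be sketched as follows, assuming the optical axis is +y and the view screen sits one unit in front of the camera (the names and screen distance are illustrative):

```python
def project(p, screen_dist=1.0):
    """Project a camera-space point onto the 2D view screen, or return
    None when the point is in the viewer's backspace (y <= 0)."""
    x, y, z = p
    if y <= 0:                     # behind the camera: clip it
        return None
    # Similar triangles: the ray from the eye through (x, y, z) crosses
    # the screen plane y = screen_dist at these screen coordinates.
    return (screen_dist * x / y, screen_dist * z / y)

print(project((2, 4, 1)))   # (0.5, 0.25)
print(project((0, -3, 1)))  # None: clipped backspace content
```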
I hope this clarifies what the 3D viewing transform steps are, why they are needed, and the order undertaken.
Best Answer
Working in homogeneous coordinates, the Cartesian equation $ax+by+cz+d=0$ can be expressed as $\mathbf\pi^T\mathbf X=0$, where the homogeneous vector $\mathbf\pi=[a:b:c:d]$. If $\mathtt M$ is a nonsingular transformation matrix, then $$\mathbf\pi^T\mathbf X=\mathbf\pi^T\mathtt M^{-1}\mathtt M\mathbf X=(\mathtt M^{-T}\mathbf\pi)^T(\mathtt M\mathbf X) = 0,$$ which shows that the vectors that represent planes are covariant: if points transform as $\mathbf X'=\mathtt M\mathbf X$, then planes transform as $\mathbf\pi'=\mathtt M^{-T}\mathbf\pi$.
In your case, the equation of the plane in camera coordinates is given by the point-normal form $\mathbf N\cdot(\mathbf X-\mathbf P)=0$, so we have $\mathbf\pi_C=[\mathbf N^T;-\mathbf N^T\mathbf P]^T$. We have for the world-to-camera mapping the matrix $\mathtt M = \left[\begin{array}{c|c}\mathtt R & \mathbf T\end{array}\right]$ and so camera-coordinate planes are transformed into world coordinates by $(\mathtt M^{-1})^{-T} = \mathtt M^T$, i.e., $$\mathbf\pi_W = \mathtt M^T\mathbf\pi_C = \left[\begin{array}{c|c} \mathtt R^T & \mathbf 0 \\ \hline \mathbf T^T & 1\end{array}\right]\begin{bmatrix} \mathbf N \\ -\mathbf N^T \mathbf P\end{bmatrix} = \begin{bmatrix} \mathtt R^T \mathbf N \\ \mathbf N^T\mathbf T-\mathbf N^T\mathbf P \end{bmatrix}.$$
For your example, $\mathbf\pi_C = [1,2,1,-9]^T$ and $$\mathbf\pi_W = \left[\begin{array}{r}0&0&1&0\\-1&0&0&0\\0&-1&0&0\\3&3&9&1\end{array}\right]\left[\begin{array}{r}1\\2\\1\\-9\end{array}\right] = \left[\begin{array}{r}1\\-1\\-2\\9\end{array}\right]$$ and so the equation of the plane in world coordinates is $x-y-2z+9=0$.
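The worked numbers can be checked with a few lines of Python (a sketch; the helper names are mine, and M is the 4x4 world-to-camera matrix whose transpose appears above):

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

# World-to-camera matrix M = [R | T] (with bottom row [0 0 0 1] appended),
# where R^T and T = (3, 3, 9) are read off the example above.
M = [[0, -1, 0, 3],
     [0, 0, -1, 3],
     [1, 0, 0, 9],
     [0, 0, 0, 1]]

pi_C = [1, 2, 1, -9]                 # x + 2y + z - 9 = 0 in camera coords
pi_W = matvec(transpose(M), pi_C)
print(pi_W)                          # [1, -1, -2, 9] -> x - y - 2z + 9 = 0
```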
Using your approach, transform $\mathbf P_C$ to world coordinates: $$\mathbf P_W = \left[\begin{array}{c|c} \mathtt R^T & -\mathtt R^T\mathbf T \end{array}\right] \begin{bmatrix}\mathbf P_C\\1\end{bmatrix} = \left[\begin{array}{r}0&0&1&-9\\-1&0&0&3\\0&-1&0&3\end{array}\right]\begin{bmatrix}1\\4\\0\\1\end{bmatrix} = \left[\begin{array}{r}-9\\2\\-1\end{array}\right].$$ Compared to what you described in your question, it looks like you only translated $\mathbf P_C$, but to convert to world coordinates it must be both translated and rotated. Normal vectors are covariant, so $$\mathbf N_W = (\mathtt R^{-1})^{-T}\mathbf N_C = \mathtt R^T\mathbf N_C = \left[\begin{array}{r}0&0&1\\-1&0&0\\0&-1&0\end{array}\right]\begin{bmatrix}1\\2\\1\end{bmatrix} = \left[\begin{array}{r}1\\-1\\-2\end{array}\right],$$ giving for the world-coordinate equation of the plane $$(1,-1,-2)\cdot(x+9,y-2,z+1)=x-y-2z+9=0$$ as above. Comparing this to your calculation, you transformed the normal vector incorrectly as well.
We can check distances, as you suggest: The distance of the plane from the camera (camera-coordinate origin) is $${[1,2,1]\cdot[1,4,0]\over\|[1,2,1]\|} = {9\over\sqrt6}.$$ The world coordinates of the camera are the last column of the camera-to-world matrix, and the world-coordinate distance of this point from the plane is $${|[1,-1,-2]\cdot[-9,3,3]+9|\over\|[1,-1,-2]\|} = {9\over\sqrt6}.$$ You can also check that this is indeed the correct plane by transforming a few points on it to world coordinates and then plugging those coordinates into its world-coordinate equation.
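That distance check is easy to script as well (a sketch; the plane and point values are the ones computed above):

```python
from math import sqrt, isclose

N_C, P_C = [1, 2, 1], [1, 4, 0]      # plane normal and point, camera coords
d_cam = abs(sum(n * x for n, x in zip(N_C, P_C))) / sqrt(sum(n * n for n in N_C))

pi_W = [1, -1, -2, 9]                # plane in world coordinates
cam_W = [-9, 3, 3]                   # camera position in world coordinates
d_world = (abs(sum(a * x for a, x in zip(pi_W[:3], cam_W)) + pi_W[3])
           / sqrt(sum(a * a for a in pi_W[:3])))

print(d_cam, d_world)                # both equal 9 / sqrt(6)
print(isclose(d_cam, d_world))       # True
```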