Here’s an approach that anticipates doing more with the camera view later.
We’ll be working in two dimensions, but the same technique applies in three. We will assume that the camera view is a perspective projection as illustrated here:
This will necessitate working in homogeneous coordinates.
The first thing to do is to switch to the camera’s coordinate system. The origin of this coordinate system is at the camera’s position and by convention, the camera sights along the negative $y'$ direction (negative $z'$ in 3-d). The world to camera transformation is thus a translation to the camera’s position followed by a rotation. The matrix of this translation is easy to produce. It’s simply $$T=\pmatrix{1&0&-x_4\\0&1&-y_4\\0&0&1}.$$ For the angle $\phi$ that the camera’s line of sight makes with the world $x$-axis, we have $$\cos\phi = {x_4-x_3\over\|P_4-P_3\|}\\\sin\phi = {y_4-y_3\over\|P_4-P_3\|}.$$ To get aligned with the camera’s line of sight, we start by rotating clockwise through this angle, but we also need to rotate clockwise by an additional 90 degrees to get it to point down the camera’s negative $y'$-axis. Putting those two rotations together produces the rotation matrix $$R=\pmatrix{-\sin\phi&\cos\phi&0\\-\cos\phi&-\sin\phi&0\\0&0&1},$$ with $\cos\phi$ and $\sin\phi$ as above. Combining these two matrices, we have $$RT = \pmatrix{-\sin\phi&\cos\phi&x_4\sin\phi-y_4\cos\phi\\-\cos\phi&-\sin\phi&x_4\cos\phi+y_4\sin\phi\\0&0&1},$$ i.e., $$\begin{align}x'&=-(x-x_4)\sin\phi+(y-y_4)\cos\phi\\y'&=-(x-x_4)\cos\phi-(y-y_4)\sin\phi.\end{align}$$
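The two formulas for $x'$ and $y'$ translate almost directly into code. Here is a minimal sketch in Python; the function name `world_to_camera` and the tuple-based point representation are my own choices for illustration, and it assumes $P_3\ne P_4$.

```python
import math

def world_to_camera(q, p3, p4):
    """Map world point q into camera coordinates (x', y').

    The camera sits at p4 and sights along p4 - p3, which ends up
    on the negative y' axis, as in the text.
    """
    dx, dy = p4[0] - p3[0], p4[1] - p3[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("p3 and p4 must be distinct points")
    cos_phi, sin_phi = dx / norm, dy / norm

    # Translation T: move the camera position to the origin.
    tx, ty = q[0] - p4[0], q[1] - p4[1]

    # Rotation R: clockwise by phi plus a further 90 degrees.
    x_cam = -tx * sin_phi + ty * cos_phi
    y_cam = -tx * cos_phi - ty * sin_phi
    return x_cam, y_cam
```

For instance, with the camera at the origin looking along the world $x$-axis (`p3 = (-1, 0)`, `p4 = (0, 0)`), the point `(2, 0)` straight ahead lands at $x'=0$, $y'=-2$.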
The line labeled “i” in the above diagram is the image plane, which is perpendicular to the camera’s line of sight and at a distance $f$ from the camera (the focal distance). Note that, since the camera is looking down the negative $y'$-axis, $f<0$. The perspective projection $M$ maps a point in the $x$-$y$ plane onto the intersection of the image plane with the ray emanating from the camera and passing through the point. If we take $f=-1$, then the bounds of the visible region in the image plane are $\pm\tan\theta$, so if the $x'$-coordinate of the projection of a point is in this range, then it’s visible.
In the camera coordinate system, a projection matrix is very simple: $$P=\pmatrix{1&0&0\\0&1&0\\0&\frac1f&0}.$$ Putting this all together, given a point $Q=(x,y)$, we compute $$M(Q)=PRT\pmatrix{x\\y\\1}$$ and recover the projected $x'$-coordinate by dividing the first component of the resulting vector by the third. We can save ourselves a bit of work, though, by taking advantage of $P$’s simple form. Note that $$\pmatrix{1&0&0\\0&1&0\\0&-1&0}\pmatrix{x'\\y'\\1}=\pmatrix{x'\\y'\\-y'},$$ so we really only need to transform the target point into camera coordinates, after which we can just check that $-\tan\theta\le-x'/y'\le\tan\theta$. We might have $y'=0$, however, so let’s rewrite this as $|x'|\le|y'|\tan\theta$ to avoid dividing by zero.
You might object that the projection also maps points behind the camera onto the image plane, but that’s easily dealt with: check the sign of the camera-relative $y'$-coordinate. If it’s positive, the point is behind the camera, so there’s no need to compute its projection. You can eliminate the $y'=0$ case at the same time. If this seems backwards to you, you can always have the camera point in the positive $y'$ direction instead so that visible points have a positive $y'$-coordinate, but you’ll have to modify $R$ and $P$ accordingly.
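Putting the camera transform, the sign check on $y'$ and the test $|x'|\le|y'|\tan\theta$ together gives something like the following Python sketch. The name `is_visible_2d` is hypothetical, `theta` is taken to be the half-angle of the field of view in radians (less than $\pi/2$), and $f=-1$ is assumed as above.

```python
import math

def is_visible_2d(q, p3, p4, theta):
    """Return True if world point q is inside the 2-d field of view.

    Camera at p4, sighting along p4 - p3, half-angle theta (radians).
    """
    dx, dy = p4[0] - p3[0], p4[1] - p3[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("p3 and p4 must be distinct points")
    cos_phi, sin_phi = dx / norm, dy / norm

    # World -> camera coordinates (translation, then rotation).
    tx, ty = q[0] - p4[0], q[1] - p4[1]
    x_cam = -tx * sin_phi + ty * cos_phi
    y_cam = -tx * cos_phi - ty * sin_phi

    # Visible points have negative y'; this also rejects y' == 0.
    if y_cam >= 0.0:
        return False

    # Division-free form of -tan(theta) <= -x'/y' <= tan(theta).
    return abs(x_cam) <= abs(y_cam) * math.tan(theta)
```

With the camera at the origin looking along the world $x$-axis and $\theta=\pi/4$, the point `(2, 1)` passes the test, while `(1, 2)` and anything with a negative $x$-coordinate do not.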
As I mentioned above, the same approach works in 3-d, except that you’ll be working with $4\times4$ matrices. The rotation matrix will be a bit more complicated, but the translation will still be straightforward. Taking $f=-1$ again, the projection will result in $(x',y',z',-z')$. Assuming that the field of view is a circular cone, the test for visibility will then be $$x'^2+y'^2\le z'^2\tan^2\theta.$$
Postscript: This is, of course, overkill when the field of view is a right circular cone, whether in two dimensions or three. Checking that $(Q-P_4)\cdot(P_4-P_3)\ge\|Q-P_4\|\,\|P_4-P_3\|\cos\theta$ is much simpler and more efficient. However, the procedure that I’ve outlined here applies generally to any size and shape aperture, which becomes much more interesting when you move to three dimensions.
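For comparison, the dot-product test is only a few lines. This is a sketch in Python with a hypothetical function name `in_cone`; it works for points of any dimension given as equal-length sequences, and `theta` is again the half-angle in radians.

```python
import math

def in_cone(q, p3, p4, theta):
    """Check (Q - P4).(P4 - P3) >= |Q - P4| |P4 - P3| cos(theta)."""
    axis = [b - a for a, b in zip(p3, p4)]   # P4 - P3, the sight line
    to_q = [b - a for a, b in zip(p4, q)]    # Q - P4
    dot = sum(a * t for a, t in zip(axis, to_q))
    norms = (math.sqrt(sum(a * a for a in axis))
             * math.sqrt(sum(t * t for t in to_q)))
    # Note: q == p4 gives 0 >= 0, so the camera position itself passes.
    return dot >= norms * math.cos(theta)
```

Unlike the matrix version, this does not produce image-plane coordinates, so it cannot be adapted to a non-circular aperture.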
Question 1:
Assume your screen coordinate system is centered at one of the corners of the screen field and the axes are aligned with the two perpendicular edges of the screen field meeting at that corner.
Assume you know the position of the orthogonal projection $C$ of the focal point $F$ of the camera onto the screen (for example, it is the center of the rectangular field of the camera, as it appears in the picture attached to your post). Let $C$ have coordinates $(c_1, \, c_2)$ in pixel units.
Assume each pixel is a square of edge-length $\text{px}$.
Assume you know the focal distance $f$ between the focal point $F$ of the camera and the screen of the camera, i.e. if $F$ is the focal point, then you know $\text{dist}(F, \, C) = f$.
Assume you are given a pixel $P$ on the screen with pixel coordinates $(x_{px}, \, y_{px})$
Then the angle $\phi$ between the ray from $F$ through the pixel $P$ and the camera's optical axis $FC$ satisfies
$$\tan(\phi) \, = \, \text{px} \, \frac{ \sqrt{\,(x_{px} - c_1)^2 + (y_{px} - c_2)^2\,}}{f} $$
$$\phi = \arctan\left(\text{px} \, \frac{ \sqrt{\,(x_{px} - c_1)^2 + (y_{px} - c_2)^2\,}}{f}\right) $$
Also, one probably needs $\cos(\phi)$ and $\sin(\phi)$ rather than $\phi$ itself, so
$$\cos(\phi) \, = \, \frac{f}{ \sqrt{\,\text{px}^2\left[(x_{px} - c_1)^2 + (y_{px} - c_2)^2\right] + f^2\,}} $$
$$\sin(\phi) \, = \, \text{px} \, \frac{\sqrt{\,(x_{px} - c_1)^2 + (y_{px} - c_2)^2\,}}{ \sqrt{\,\text{px}^2\left[(x_{px} - c_1)^2 + (y_{px} - c_2)^2\right] + f^2\,}} $$
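As a rough sketch of this computation in Python: the function name `pixel_angle` and its argument names are made up for illustration, the pixel offsets from $C$ are taken as $x_{px}-c_1$ and $y_{px}-c_2$, and the pixel size and $f$ are assumed to be given in the same length unit.

```python
import math

def pixel_angle(x_px, y_px, c1, c2, pixel_size, f):
    """Angle between the ray through pixel (x_px, y_px) and the optical axis.

    Returns (phi, cos_phi, sin_phi). pixel_size and f must share a unit.
    """
    # Metric distance on the screen from C to the pixel.
    r = pixel_size * math.hypot(x_px - c1, y_px - c2)
    # Distance from the focal point F to the pixel.
    hyp = math.hypot(r, f)
    phi = math.atan2(r, f)      # same as arctan(r / f) for f > 0
    return phi, f / hyp, r / hyp
```

Returning the cosine and sine directly avoids a round trip through `atan` followed by `cos` and `sin`.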
Question 2:
Yes, there will be significant changes. In the case on the diagram with the car and the drone, the vertical axis $H$, the camera's optical axis and the line connecting the drone with the car are coplanar (all three lie in the same plane) and it is very easy to calculate the angle between the vertical axis $H$ and the car as the sum of the angle $\theta$, between the vertical axis $H$ and the camera axis, with the angle $\phi$, between the camera axis and the car. But in general, the three lines above are not coplanar. If you know the angle $\psi$ between (i) the plane formed by the vertical axis $H$ and the camera's optical axis $FC$, and (ii) the plane formed by the camera's optical axis $FC$ and the line connecting the drone with the car, then the angle $\sigma$ between the vertical axis $H$ and the line between the drone and the car is calculated by the spherical law of cosines
$$\cos(\sigma) = \cos(\theta) \cos(\phi) + \sin(\theta) \sin(\phi)\cos(\psi)$$ and then
$$\text{GSD}_{\text{rate}} = \frac{1}{\cos(\sigma)} = \frac{1}{\cos(\theta) \cos(\phi) + \sin(\theta) \sin(\phi)\cos(\psi)}$$ The simplified coplanar case corresponds to $\psi = \pi$, for which $\cos(\psi) = -1$, and then $$\cos(\theta) \cos(\phi) + \sin(\theta) \sin(\phi)\cos(\pi) = \cos(\theta) \cos(\phi) - \sin(\theta) \sin(\phi)$$ and then $$\cos(\theta) \cos(\phi) - \sin(\theta) \sin(\phi) = \cos(\theta + \phi)$$ Thus, you recover the original simplified formula. The angle $\psi$ can be calculated from the image on the screen, in a manner very similar to the answer to question 1, as long as we know the pixel coordinates $(x_{\text{vert}}, \, y_{\text{vert}})$ of the point $Q$ at which the vertical axis $H$ intersects the plane of the screen. Then by the Euclidean law of cosines
$$|PQ|^2 = |PC|^2 + |QC|^2 - 2\, |PC| |QC| \cos(\psi)$$
so
$$\cos(\psi) = \frac{\,|PC|^2 \, + \, |QC|^2 \, - \, |PQ|^2\,}{2 \, |PC| |QC|}$$
or more explicitly
$$\cos(\psi) = \frac{\,(x_{\text{px}} - c_1)^2 + (y_{\text{px}} - c_2)^2 \, + \, (x_{\text{vert}} - c_1)^2 + (y_{\text{vert}} - c_2)^2\, - \, (x_{\text{px}} - x_{\text{vert}})^2 - (y_{\text{px}} - y_{\text{vert}})^2\,}{2 \, \sqrt{(x_{\text{px}} - c_1)^2 + (y_{\text{px}} - c_2)^2\, } \, \sqrt{(x_{\text{vert}} - c_1)^2 + (y_{\text{vert}} - c_2)^2}}$$
Alternatively, you can use the dot product formula $$\cos(\psi) = \frac{(x_{\text{px}} - c_1)(x_{\text{vert}} - c_1) + (y_{\text{px}} - c_2)(y_{\text{vert}} - c_2)}{ \sqrt{(x_{\text{px}} - c_1)^2 + (y_{\text{px}} - c_2)^2\, } \, \sqrt{(x_{\text{vert}} - c_1)^2 + (y_{\text{vert}} - c_2)^2}}$$
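The dot product form and the spherical law of cosines can be combined as in the Python sketch below. The function names `cos_psi` and `gsd_rate` are hypothetical, points are `(x, y)` pixel tuples, and angles are in radians.

```python
import math

def cos_psi(p, q, c):
    """cos(psi) for pixel p, vertical-axis point q and center c (dot product form)."""
    ax, ay = p[0] - c[0], p[1] - c[1]
    bx, by = q[0] - c[0], q[1] - c[1]
    denom = math.hypot(ax, ay) * math.hypot(bx, by)
    if denom == 0.0:
        raise ValueError("p and q must both differ from c")
    return (ax * bx + ay * by) / denom

def gsd_rate(theta, phi, cos_psi_value):
    """1 / cos(sigma), with cos(sigma) from the spherical law of cosines."""
    cos_sigma = (math.cos(theta) * math.cos(phi)
                 + math.sin(theta) * math.sin(phi) * cos_psi_value)
    return 1.0 / cos_sigma
```

Passing `cos_psi_value = -1` reproduces the coplanar result $1/\cos(\theta+\phi)$.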
Edit 1. How to calculate the coordinates $(x_{\text{vert}}, \, y_{\text{vert}})$. Assume you can determine two points $(x_1, \, y_1)$ and $(x_2, \, y_2)$ on the screen lying on the edge of an object, or on an axis, that is the projection of an object or an axis in 3D perpendicular to the ground. In the picture, for example, the grey pole in the middle could be one such object (or it could be the vertical edge of a building or something like that). Then, construct the unit vector $(u_{\text{vert}}, \, v_{\text{vert}})$ where
\begin{align}
u_{\text{vert}} \, &=\, \frac{x_2 \, -\, x_1}{\sqrt{(x_2-x_1)^2 + (y_2 - y_1)^2}}\\
v_{\text{vert}} \, &=\, \frac{y_2 \, -\, y_1}{\sqrt{(x_2-x_1)^2 + (y_2 - y_1)^2}}
\end{align}
Then
\begin{align}
x_{\text{vert}} \, &=\, c_1 \, +\, f\,\tan(\theta)\, u_{\text{vert}}\\
y_{\text{vert}} \, &=\, c_2 \, +\, f\, \tan(\theta)\, v_{\text{vert}}\\
\end{align}
Edit 2.
Assume the camera is initially aligned with the vertical axis $H$.
Assume, in order to describe the motion of the camera better, that we translate the camera's coordinate system to the projected focal center $C$, so that the world's coordinate axes $x$ and $y$ are exactly aligned with the camera's coordinate axes $x$ and $y$.
Assume that the camera is first tilted at the angle $\theta$ and after that rotated (around the vertical axis $H$) at the angle $\lambda$ (which looks like it is the case on the photo). Now, during the $\theta-$tilt the camera's $y-$axis is rotated in 3D space, but it always intersects the vertical axis $H$. After that, when the $\lambda-$rotation around $H$ takes place, the camera's $y-$axis is rotated again, but its intersection point with the $H$ axis stays fixed (because every point on the $H$ axis stays fixed during a rotation around $H$). That intersection point is $(x_{\text{vert}}, \, y_{\text{vert}})$. Therefore, the latter lies on the $y-$axis of the camera's coordinate system, centered at $C$. Consequently,
\begin{align}
x_{\text{vert}} \, &=\, c_1 \\
y_{\text{vert}} \, &=\, c_2 \, - \, f\,\tan(\theta)\\
\end{align}
In this case, the formula for $\cos(\psi)$ simplifies to
$$\cos(\psi) = \frac{ c_2 - y_{\text{px}} }{ \sqrt{(x_{\text{px}} - c_1)^2 + (y_{\text{px}} - c_2)^2\, } }$$
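A quick numerical check of this simplification, as a Python sketch: the function names are hypothetical, and it assumes $f\tan(\theta)>0$ so that $Q$ lies directly below $C$ in pixel coordinates.

```python
import math

def cos_psi_general(p, q, c):
    """Dot product form of cos(psi) for pixel p, point q and center c."""
    ax, ay = p[0] - c[0], p[1] - c[1]
    bx, by = q[0] - c[0], q[1] - c[1]
    return (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))

def cos_psi_simplified(p, c):
    """Simplified form, valid when q = (c1, c2 - k) for some k > 0."""
    return (c[1] - p[1]) / math.hypot(p[0] - c[0], p[1] - c[1])

# Illustrative values only: a 640x480 screen center and an arbitrary pixel.
center = (320.0, 240.0)
pixel = (400.0, 100.0)
q_vert = (center[0], center[1] - 50.0)   # any positive offset gives the same result
general = cos_psi_general(pixel, q_vert, center)
simplified = cos_psi_simplified(pixel, center)
```

The two values agree because the length $|QC| = f\tan(\theta)$ cancels between numerator and denominator.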
Comment. I am not an expert on pixels, to be honest, but common sense suggests that each pixel is a little square whose edges are parallel to the screen's coordinate axes. All pixels have the same edge length, which I call the pixel size and denote by $\text{px}$, measured in centimeters or millimeters, whichever you have as information. When using measurements taken on the screen in pixels, we convert them to metric measurements by multiplying by the pixel size $\text{px}$. That is why the first formulas, which combine pixel coordinates with the focal distance $f$, require scaling by the pixel size. When working only with measurements from the screen, there is no need to multiply by the pixel size, because everything is a ratio and the factors cancel out.
Best Answer
The most efficient way I've found is to project the point onto an arbitrary projection plane (corresponding to the view frustum, or camera field of view), and check that the projected point is within the rectangular boundaries of the visible projection plane.
Let the focal point of the camera be at origin. The view from the camera is a bipyramid, with apex at the focal point. We can ignore the pyramid on the other side of the focal point (there's just the image sensor there).
Choose an arbitrary projection plane, at a distance $d \gt 0$ from the focal point. (You normally already have such a projection plane, for describing the scene in 2D; if you do not, choose $d = 1$ for simplicity.)
Let $\hat{n}$ be the unit vector along the sight line, i.e. pointing from the focal point towards the projection plane and perpendicular to it. The point where the sight line meets the projection plane is usually at the center of the projection plane, but it is quite possible to choose a skewed camera frustum (frustum being the portion of the pyramid in front of the camera that is projected to the projection plane).
Let $\hat{u}$ be the unit vector "right" along the projection plane, and $\hat{v}$ the unit vector "up" along the projection plane. Let the projection plane be $w$ wide (in world units, not projection plane units), and $h$ tall.
An arbitrary point $\vec{p}$ must be in front of the camera, $$\vec{p} \cdot \hat{n} \gt 0 \tag{1a}\label{G1a}$$for it to be visible. Usually, we use $$\vec{p} \cdot \hat{n} \gt d \tag{1b}\label{G1b}$$ i.e. only consider points beyond the projection plane, ignoring points between the focal point and the projection plane. Choose which one is appropriate for you.
Next, calculate $$\begin{aligned} \vec{p}^\prime &= \frac{d \, \vec{p}}{\hat{n} \cdot \vec{p}} - d \hat{n} \\ u^\prime &= \vec{p}^\prime \cdot \hat{u} \\ v^\prime &= \vec{p}^\prime \cdot \hat{v} \\ \end{aligned} \tag{2a}\label{G2a}$$ where $\vec{p}^\prime$ is the relative vector from origin of projection plane, to where the ray along vector $\vec{p}$ intersects the projection plane. Its first term is $\vec{p}$ projected to the projection plane, and the second term ($d \hat{n}$) is the origin of the projection plane. $u^\prime$ and $v^\prime$ are the coordinates on the projection plane. If the projection plane is centered, then the limits for $u^\prime$ and $v^\prime$ are $$\left\lbrace \begin{aligned} -\frac{w}{2} \le u^\prime & \le \frac{w}{2} \\ -\frac{h}{2} \le v^\prime & \le \frac{h}{2} \\ \end{aligned} \right. \tag{2b}\label{G2b}$$ otherwise the projected point is outside the projection plane visible to the camera.
To recap, in step-by-step algorithm form:
Calculate $$d^\prime = \vec{p} \cdot \hat{n}$$ If $$d^\prime \le 0$$ then point $\vec{p}$ is behind the camera (or level with the focal point, where the division in the next step would fail), and outside the field of view.
Otherwise,
Calculate $$\vec{p}^\prime = d \left( \frac{1}{d^\prime}\vec{p} - \hat{n} \right)$$
Calculate $$u^\prime = \vec{p}^\prime \cdot \hat{u}$$ If $$u^\prime \gt u_\max$$ or $$u^\prime \lt u_\min$$ the point is outside the camera field of view.
Note that when the view is centered, $u_\min = -w/2$, and $u_\max = w/2$.
Otherwise,
Calculate $$v^\prime = \vec{p}^\prime \cdot \hat{v}$$ If $$v^\prime \gt v_\max$$ or $$v^\prime \lt v_\min$$ the point is outside the camera field of view.
Note that when the view is centered, $v_\min = -h/2$, and $v_\max = h/2$.
Otherwise, point $\vec{p}^\prime$ is within the bounds on the projection plane, and therefore point $\vec{p}$ is within the field of view.
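The steps above can be sketched in Python as follows. The function name `in_field_of_view` and its argument order are my own for illustration; vectors are plain 3-tuples in camera-relative coordinates (focal point at the origin), and $\hat{n}$, $\hat{u}$, $\hat{v}$ are assumed to be orthonormal.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def in_field_of_view(p, n, u, v, d, u_min, u_max, v_min, v_max):
    """Early-exit visibility test for point p relative to the focal point."""
    d_prime = dot(p, n)
    # d' == 0 is rejected too, to avoid dividing by zero below.
    if d_prime <= 0.0:
        return False

    # p' = d * (p / d' - n): offset from the projection plane origin.
    scale = d / d_prime
    p_proj = tuple(scale * pi - d * ni for pi, ni in zip(p, n))

    u_prime = dot(p_proj, u)
    if u_prime < u_min or u_prime > u_max:
        return False

    v_prime = dot(p_proj, v)
    return v_min <= v_prime <= v_max
```

For a centered view of width `w` and height `h`, call it with `u_min=-w/2`, `u_max=w/2`, `v_min=-h/2`, `v_max=h/2`.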
The maximum cost of this test, ignoring setting $d$, $w$, $h$, $\hat{n}$, $\hat{u}$, and $\hat{v}$ that only change if the camera direction changes, is three vector dot products, one vector-scalar product and one division of vector by a scalar, one vector subtraction, two scalar absolute values, two multiplications by two, and three comparisons. This is very cheap and very efficient.
You can make it even more efficient by using an additional vector $\vec{o}$ pointing to the negative corner of the projection plane, and $\vec{p}^\prime = \frac{d\,\vec{p}}{\hat{n}\cdot\vec{p}} - \vec{o}$, and $\vec{u} = \hat{u}/w$ and $\vec{v} = \hat{v}/h$ instead of $\hat{u}$ and $\hat{v}$ above. Then, the valid range is $0 \le u^\prime \le 1$ and $0 \le v^\prime \le 1$, and you shave off a couple of (scalar) operations.
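A sketch of this normalized variant in Python: the names `make_normalized_view` and `in_view_normalized` are hypothetical, and the corner vector $\vec{o}$ is computed here for a centered view, which is an assumption (for a skewed frustum you would supply your own $\vec{o}$).

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def make_normalized_view(n, u, v, d, w, h):
    """Precompute o, u/w and v/h; redo only when the camera changes."""
    # Negative corner of a centered projection plane (assumption).
    o = tuple(d * ni - 0.5 * w * ui - 0.5 * h * vi
              for ni, ui, vi in zip(n, u, v))
    u_scaled = tuple(ui / w for ui in u)
    v_scaled = tuple(vi / h for vi in v)
    return o, u_scaled, v_scaled

def in_view_normalized(p, n, d, o, u_scaled, v_scaled):
    """Visibility test with projection plane coordinates scaled to [0, 1]."""
    d_prime = dot(p, n)
    if d_prime <= 0.0:          # behind the camera, or division by zero
        return False
    scale = d / d_prime
    p_proj = tuple(scale * pi - oi for pi, oi in zip(p, o))
    u_prime = dot(p_proj, u_scaled)
    if u_prime < 0.0 or u_prime > 1.0:
        return False
    v_prime = dot(p_proj, v_scaled)
    return 0.0 <= v_prime <= 1.0
```

A side benefit of the $[0,1]$ range is that $u^\prime$ and $v^\prime$ can be multiplied by the image width and height to get pixel coordinates.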