Suppose that you have random variables $X_1,\dots,X_n$ (whose values will be observed in your experiment) that are conditionally independent, given that $\Theta=\theta$, with conditional densities $f_{X_i\mid\Theta}(\,\cdot\mid\theta)$, for $i=1,\dots,n$. This is your (postulated) statistical (conditional) model, and the conditional densities express, for each possible value $\theta$ of the (random) parameter $\Theta$, your uncertainty about the values of the $X_i$'s before you have access to any real data. With the help of the conditional densities you can, for example, compute conditional probabilities like
$$
P\{X_1\in B_1,\dots,X_n\in B_n\mid \Theta=\theta\} = \int_{B_1\times\dots\times B_n} \prod_{i=1}^n f_{X_i\mid\Theta}(x_i\mid\theta)\,dx_1\dots dx_n \, ,
$$
for each $\theta$.
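As a concrete sketch (taking the $X_i$'s to be conditionally i.i.d. $N(\theta,1)$ given $\Theta=\theta$, an assumption made only for this illustration), conditional independence lets the $n$-fold integral above factor into a product of one-dimensional probabilities:

```python
import math

def normal_cdf(x, mu, sigma=1.0):
    # CDF of N(mu, sigma^2) via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def conditional_prob(intervals, theta):
    # P{X_1 in B_1, ..., X_n in B_n | Theta = theta} for conditionally
    # i.i.d. N(theta, 1) observables: by conditional independence, the
    # n-fold integral factors into a product of 1-D probabilities.
    prob = 1.0
    for a, b in intervals:
        prob *= normal_cdf(b, theta) - normal_cdf(a, theta)
    return prob

# Probability that each of three observables lands in (-1, 1), given theta = 0.
p = conditional_prob([(-1.0, 1.0)] * 3, theta=0.0)
```

Here `p` is just $P\{|Z|<1\}^3$ for a standard normal $Z$, since each factor is the same one-dimensional probability.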
After you have access to an actual sample $(x_1,\dots,x_n)$ of values (realizations) of the $X_i$'s, observed in one run of your experiment, the situation changes: there is no longer uncertainty about the observables $X_1,\dots,X_n$. Suppose that the random $\Theta$ takes values in some parameter space $\Pi$. Now, for those known (fixed) values $(x_1,\dots,x_n)$, you define a function
$$
L_{x_1,\dots,x_n} : \Pi \to \mathbb{R} \,
$$
by
$$
L_{x_1,\dots,x_n}(\theta)=\prod_{i=1}^n f_{X_i\mid\Theta}(x_i\mid\theta) \, .
$$
Note that $L_{x_1,\dots,x_n}$, known as the "likelihood function", is a function of $\theta$. In this "after you have data" situation, the likelihood $L_{x_1,\dots,x_n}$ contains, for the particular conditional model that we are considering, all the information about the parameter $\Theta$ contained in this particular sample $(x_1,\dots,x_n)$. In fact, $L_{x_1,\dots,x_n}$ is a sufficient statistic for $\Theta$.
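To make the "function of $\theta$" point concrete (assuming, for illustration only, a Bernoulli conditional model with $f_{X_i\mid\Theta}(x\mid\theta)=\theta^x(1-\theta)^{1-x}$), the likelihood is the same product as the conditional density, but with the sample held fixed and $\theta$ varying:

```python
def likelihood(theta, sample):
    # L_{x_1,...,x_n}(theta) = prod_i f_{X_i|Theta}(x_i | theta), here for
    # Bernoulli data: the same product as the conditional density, but the
    # x_i are now fixed numbers and theta is the variable.
    L = 1.0
    for x in sample:
        L *= theta**x * (1.0 - theta)**(1 - x)
    return L

sample = [1, 0, 1, 1, 0]                 # an observed realization (x_1,...,x_n)
grid = [k / 100 for k in range(1, 100)]  # candidate theta values in (0, 1)
theta_hat = max(grid, key=lambda t: likelihood(t, sample))
# The maximizer lands on the sample mean, 3/5.
```

A different observed sample would give a different function of $\theta$, which is exactly the "after-sample" character of the likelihood.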
Answering your question: to understand the difference between the concepts of conditional density and likelihood, keep in mind their mathematical definitions (they are clearly different mathematical objects, with different properties), and also remember that the conditional density is a "pre-sample" object/concept, while the likelihood is an "after-sample" one. I hope that all this also helps you answer why Bayesian inference (using your way of putting it, which I don't think is ideal) is done "using the likelihood function and not the conditional distribution": the goal of Bayesian inference is to compute the posterior distribution, and to do so we condition on the observed (known) data.
This isn't a stats question but a question relating to basic properties of calculus and algebra.
It may help to consider a simpler problem, to avoid any confusion about the issue:
$$\frac{\partial{}}{\partial{p_i}} \sum_{i = 1}^n p_i^2$$
Think of the summation written out:
$$
\frac{\partial{}}{\partial{p_i}} (p_1^2 + p_2^2 + ... + p_{i-1}^2 + p_i^2 + p_{i+1}^2 + ... + p_n^2)
$$
Take the derivative term by term:
$$
= \frac{\partial{p_1^2}}{\partial{p_i}} + \frac{\partial{p_2^2}}{\partial{p_i}} + ... + \frac{\partial{p_{i-1}^2}}{\partial{p_i}} + \frac{\partial{p_{i}^2}}{\partial{p_i}} + \frac{\partial{p_{i+1}^2}}{\partial{p_i}} + ... + \frac{\partial{p_{n}^2}}{\partial{p_i}}
$$
Now take those derivatives (leaving the $i^{\rm{th}}$ term unevaluated for the moment):
$$
= 0 + 0 + ... + 0 + \frac{\partial{p_{i}^2}}{\partial{p_i}} + 0 + ... + 0
$$
and we now see why the summation disappears: only one term is nonzero:
$$
= \frac{\partial{p_{i}^2}}{\partial{p_i}} = 2p_i
$$
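A quick numerical check of this result (a sketch using a central finite difference; the function names here are my own):

```python
def f(p):
    # f(p) = sum_j p_j^2
    return sum(x * x for x in p)

def partial(func, p, i, h=1e-6):
    # Central finite-difference approximation to d(func)/dp_i.
    q = list(p)
    q[i] += h
    up = func(q)
    q[i] -= 2.0 * h
    down = func(q)
    return (up - down) / (2.0 * h)

p = [0.2, 0.5, 0.3]
approx = partial(f, p, 1)  # should agree with 2 * p[1] = 1.0
```

Perturbing $p_i$ leaves every other term of the sum unchanged, which is the same observation as setting all the other partial derivatives to zero above.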
Your question is the same but with a different, slightly more complicated function.
Regarding the original problem:
$$
l(\mathbf{p}, \mathbf{y}) = \sum_{i = 1}^n \left( \ln(p_i) + y_i \ln(1 - p_i) \right)
$$
is fine, but as soon as you took derivatives, you went astray.
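Taking the log-likelihood exactly as stated, only the $i$-th term of the sum survives differentiation, so $\partial l/\partial p_i = 1/p_i - y_i/(1-p_i)$; a finite-difference check (my own sketch, with made-up numbers) confirms this:

```python
import math

def l(p, y):
    # The log-likelihood exactly as stated in the question.
    return sum(math.log(pi) + yi * math.log(1.0 - pi) for pi, yi in zip(p, y))

def partial_i(p, y, i, h=1e-6):
    # Central finite-difference approximation to dl/dp_i.
    q = list(p)
    q[i] += h
    up = l(q, y)
    q[i] -= 2.0 * h
    down = l(q, y)
    return (up - down) / (2.0 * h)

p, y = [0.2, 0.5, 0.3], [1, 0, 1]
i = 0
analytic = 1.0 / p[i] - y[i] / (1.0 - p[i])  # only the i-th term survives
approx = partial_i(p, y, i)
```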
Best Answer
To answer your first question: yes, the likelihood function $l(x \mid \theta)$ is the same regardless of your perspective on whether $\theta$ is a random variable or fixed, that is, regardless of whether you're a Bayesian or a frequentist. Likelihood calculations always assume that your data, which are random, were generated by some fixed parameter of a distribution. So you always want to think about it as: "What is the parameter that probably generated the data I'm seeing?" Every time you run the experiment you'll get new data, and your estimate of that parameter will change. I hope that explains why it's not written as $l(\theta \mid x)$. Inferences on $\theta$ given data are posterior distribution ($\pi(\theta \mid x)$) inferences, not likelihood inferences. As you stated:
$$\pi(\theta \mid x) = \frac{l(x \mid \theta)\pi(\theta)}{l(x)}$$
Where $l(x) = \int l(x \mid \theta)\,\pi(\theta)\,d\theta$ is the marginal likelihood: the likelihood of the data given $\theta$, weighted by the prior and integrated over every possible value of $\theta$. You're effectively removing $\theta$ from your posterior calculation by marginalizing it out. The numerator, on the other hand, is the likelihood of the data given $\theta$ weighted by your prior, $\pi(\theta)$. For more info please browse the site or ask any question you have. There is a lot one can say about the likelihood function.
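Here's a minimal numeric sketch of that calculation (assuming Bernoulli data and a flat prior on $(0,1)$, with the integrals replaced by midpoint-rule grid sums; all choices here are mine for illustration):

```python
def likelihood(theta, sample):
    # l(x | theta) for Bernoulli(theta) observations.
    L = 1.0
    for x in sample:
        L *= theta**x * (1.0 - theta)**(1 - x)
    return L

sample = [1, 0, 1, 1, 0]
n_grid = 200
dx = 1.0 / n_grid
grid = [(k + 0.5) * dx for k in range(n_grid)]  # midpoints of (0, 1)
prior = [1.0] * n_grid                          # flat prior pi(theta)

# Marginal likelihood l(x) = integral of l(x|theta) * pi(theta) dtheta.
marginal = sum(likelihood(t, sample) * pr for t, pr in zip(grid, prior)) * dx

# Posterior pi(theta|x) = l(x|theta) * pi(theta) / l(x), on the grid.
posterior = [likelihood(t, sample) * pr / marginal
             for t, pr in zip(grid, prior)]

total = sum(posterior) * dx                     # integrates to 1 by construction
post_mean = sum(t * q for t, q in zip(grid, posterior)) * dx
```

Dividing by the marginal likelihood is exactly the normalization step: it's what makes the posterior a proper density over $\theta$.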
For the second question, I wouldn't get too tripped up on taking the derivative with respect to a value that you are conditioning on. You aren't conditioning on a specific value of $\theta$ (although this is possible when calculating the observed Fisher information). You're simply indicating that $\theta$ is not a random variable, because it's not in this case. That way, when you take expectations, you treat $\theta$ as a constant (even if you don't know what that number is) and you take expectations over $x$, which follows some probability distribution (binomial, Poisson, ...) with some fixed parameter $\theta$.
Regarding the partial derivative: again, don't worry about the notation; it could have been a regular derivative. The Fisher information is often written with a second derivative with respect to $\theta$ (so two derivatives), and in the multiparameter case it's a matrix, so partial derivatives are common notation.
$$I(\theta \mid x)= - \text{E}_x \left [ \frac{\partial{}^2}{\partial{\theta}^2} \log [ l(x \mid \theta) ] \right ]$$
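For a concrete case (a single Bernoulli($\theta$) observation, my choice for illustration), the expectation over $x$ with $\theta$ held constant can be computed term by term, and it reproduces the known closed form $1/(\theta(1-\theta))$:

```python
def neg_second_deriv_loglik(x, theta):
    # -d^2/dtheta^2 of log l(x|theta) for one Bernoulli observation,
    # where log l = x*log(theta) + (1 - x)*log(1 - theta).
    return x / theta**2 + (1 - x) / (1.0 - theta)**2

def fisher_information(theta):
    # Expectation over x ~ Bernoulli(theta); theta is a fixed constant here,
    # so the sum runs over the two possible values of the random x.
    return sum(px * neg_second_deriv_loglik(x, theta)
               for x, px in [(0, 1.0 - theta), (1, theta)])

theta = 0.3
I = fisher_information(theta)  # matches the closed form 1 / (theta * (1 - theta))
```

Note that the randomness enters only through $x$: the sum weights each possible observation by its probability under the fixed $\theta$.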