You certainly do have sampling error, even if you don't know what population your sample came from.
As with random samples, if all you want to do is make statements about the sample itself, then you do not need p values or any form of inferential statistics. Indeed, you don't need any specific sample size either. I can measure myself and my wife and say "I am taller than she is" (N of 2). I can just measure myself and say "I am 5 foot 8" (N = 1).
However, even with non-random samples you are usually interested in inference. You therefore have to assume either a) that the non-randomness in your sample isn't affecting things (a dangerous assumption!) or b) that there is some population from which your sample is random, and that that population is interesting.
In real life, this often gets blurred. In the many cases where there is no way to take a random sample (too expensive, too impractical, unethical, illegal, impossible) people often write as if they are inferring to something sort of in between a and b.
It seems like you're imagining a very simple sampling model.
The simplest model for sampling is aptly called Simple Random Sampling. You select a subset of the population (e.g., by dialing phone numbers at random) and ask whoever answers how they're voting. If, out of 1,000 respondents, 487 say Clinton, 463 say Trump, and the remainder give you some wacky answer, then the polling firm would report that 49% of voters prefer Clinton, while 46% prefer Trump. However, polling firms do a lot more than this. A simple random sample gives equal weight to every data point. Now suppose your sample contains--by chance--600 men and 400 women, which clearly isn't representative of the population as a whole. If men as a group lean one way, while women lean the other, this will bias your result. Since we have pretty good demographic statistics, you can weight* the responses by counting the women's responses a bit more and the men's a bit less, so that the weighted response represents the population better. Polling organizations have more complicated weighting models that can make a non-representative sample resemble a more representative one.
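To make the weighting idea concrete, here is a minimal sketch of that kind of adjustment for the 600 men / 400 women sample. The support shares and the 50/50 population split are made-up numbers for illustration, and real pollsters weight on several variables at once, so treat this as a toy version rather than how any particular firm does it.

```python
# Hypothetical sample matching the example above: 600 men, 400 women.
sample_counts = {"men": 600, "women": 400}

# ASSUMPTION: made-up share of each group backing some candidate A.
support_for_a = {"men": 0.40, "women": 0.55}

# ASSUMPTION: known population shares (e.g., from census data); 50/50 for simplicity.
population_share = {"men": 0.50, "women": 0.50}

n = sum(sample_counts.values())

# Unweighted estimate: every respondent counts equally.
unweighted = sum(sample_counts[g] * support_for_a[g] for g in sample_counts) / n

# Weight for each group = population share / sample share.
# Over-represented groups get a weight below 1, under-represented ones above 1.
weights = {g: population_share[g] / (sample_counts[g] / n) for g in sample_counts}

# Weighted estimate: a weighted average of the responses.
weighted = sum(
    weights[g] * sample_counts[g] * support_for_a[g] for g in sample_counts
) / sum(weights[g] * sample_counts[g] for g in sample_counts)

print(f"weights:    {weights}")
print(f"unweighted: {unweighted:.3f}")
print(f"weighted:   {weighted:.3f}")
```

With these invented numbers the men are down-weighted and the women up-weighted, so the weighted estimate moves toward the women's answers compared with the raw average.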
The idea of weighting the sampled responses is on pretty firm statistical ground, but there is some flexibility in choosing what factors contribute to the weights. Most pollsters do reweight based on demographic factors like gender, age, and race. Given this, you might think that party identification (Democratic, Republican, etc) should also be included, but it turns out that most polling firms do not use it in their weights: party (self)-identification is tangled up with the voter's choice in a way that makes it less useful.
Many polling outfits also report their results among "likely voters". In these, respondents are either selected or weighted based on the likelihood that they'll actually turn up to the polls. This model is undoubtedly data-driven too, but the precise choice of factors allows for some flexibility. For example, including interactions between the candidate's and the voter's race (or gender) wasn't even sensible until 2008 or 2016, but I suspect they have some predictive power now.
In theory, you could include all sorts of things as weighting factors: musical preference, eye color, etc. However, demographic factors are popular choices for weighting factors because:
- Empirically, they correlate well with voter behavior. Obviously, there is no iron-clad law that 'forces' white men to lean Republican, but over the last fifty years, they have tended to.
- The population values are well known (e.g., from the census or Vital Records)
However, pollsters also see the same news everyone else does, and can adjust the weighting variables if necessary.
There are also some "fudge factors" that are sometimes invoked to explain poll results. For example, respondents sometimes are reluctant to give "socially-undesirable" answers. The Bradley Effect posits that white voters sometimes downplay their support for white candidates running against a minority to avoid appearing racist. It is named after Tom Bradley, an African-American gubernatorial candidate who narrowly lost the election despite leading comfortably in the polls.
Finally, you're completely correct that the very act of asking someone's opinion can change it. Polling firms try to write their questions in a neutral way. To avoid issues with the order of possible responses, the candidates' names might be listed in random order. Multiple versions of a question are also sometimes tested against each other. This effect can also be exploited for nefarious ends in a push poll, where the interviewer isn't actually interested in collecting responses but in influencing them. For example, a push poll might ask "Would you vote for [Candidate A] even if it was reported that he was a child molester?".
* You might also set explicit targets for your sample, like including 500 men and 500 women. This is called stratified sampling--the population is stratified into different groups, and each group is then sampled randomly. In practice, this isn't done very often for polls, because you'd need to stratify into a lot of exhaustive groups (e.g., college-educated men between 18-24 in urban Texas).
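For comparison with weighting after the fact, stratified sampling with explicit targets can be sketched roughly as follows. The population records, the `stratified_sample` helper, and the 500/500 targets are all hypothetical; a real sampling frame would be far messier.

```python
import random


def stratified_sample(population, key, targets, seed=0):
    """Draw a simple random sample of the target size from each stratum.

    `key` maps a record to its stratum label; `targets` maps each label
    to the number of records wanted from that stratum.
    """
    rng = random.Random(seed)

    # Split the population into strata.
    strata = {}
    for person in population:
        strata.setdefault(key(person), []).append(person)

    # Sample each stratum separately, without replacement.
    sample = []
    for group, size in targets.items():
        sample.extend(rng.sample(strata[group], size))
    return sample


# ASSUMPTION: a toy population of 10,000 records, half men and half women.
population = [
    {"id": i, "sex": "man" if i % 2 == 0 else "woman"} for i in range(10_000)
]

sample = stratified_sample(
    population,
    key=lambda person: person["sex"],
    targets={"man": 500, "woman": 500},
)

n_men = sum(1 for person in sample if person["sex"] == "man")
print(len(sample), n_men)
```

The group sizes come out exactly on target by construction, which is the point. The cost is that you need to know each person's stratum before you sample, and that gets impractical fast as the number of groups grows.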
Best Answer
A particularly biased / non-representative sample is unlikely if you sample randomly.
In an ideal world you'd have a non-random sample which represents the population perfectly, such that the proportion of every demographic is the same in the sample as it is in the population as a whole.
This is a pretty hard problem to solve in the real world though (to say the least), as you'd need to understand every demographic and how it affects your results. You might say "white, 24-year-old, college-educated women" is specific enough and you just need to make sure your sample has the right proportion of such people (and similarly for every other similar demographic), but they may be more or less likely to act in a certain way based on where they live, where they studied, where they grew up, their religion and many other factors. So you need to take all of that into account too. That'll be a whole lot of work, and in the process you'll probably answer your original query anyway without ever using the sample you generated. Basically, doing that just doesn't make a whole lot of sense.
In the real world a random sample is a "good enough" attempt to obtain an accurate representation of the population.
Now it is indeed possible to get a random sample that doesn't reflect what the population as a whole looks like particularly well (i.e. a "biased" sample).
But the probability of getting any given sample when sampling randomly decreases significantly as the sample becomes more biased and a less accurate representation of the population as a whole. This applies especially when you have larger samples.
This is acceptable since statistics is generally about having high confidence of being correct rather than having absolute certainty.
Think of it this way: if 70% of your population is women and you randomly pick one person, you have a 70% chance of picking a woman. So you would expect roughly 70% of your random sample to be women. The maths might not work out to exactly 70% in all cases, but that's the general idea. So the sample proportions should roughly correspond to the proportions of the population as a whole. You should be rather surprised if your sample somehow ends up with 0% women.
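To put rough numbers on that intuition, here is a small sketch that computes exact binomial probabilities for the 70%-women population. It assumes each person is picked independently (sampling with replacement, or from a very large population); the `prob_at_most` helper and the sample sizes are just choices for illustration.

```python
from math import comb


def prob_at_most(k, n, p):
    """P(X <= k) where X ~ Binomial(n, p), computed by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


p_women = 0.70  # share of women in the population, from the example above

# Chance that a random sample of 10 people contains no women at all.
p_zero_women = prob_at_most(0, 10, p_women)
print(f"P(0 women in a sample of 10) = {p_zero_women:.2e}")

# Chance that women make up at most half of the sample, for growing sample sizes.
tail = {n: prob_at_most(n // 2, n, p_women) for n in (10, 100, 1000)}
for n, prob in tail.items():
    print(f"n = {n:>4}: P(at most 50% women) = {prob:.3g}")
```

The probability of a badly unrepresentative sample shrinks quickly as the sample grows, which is why larger random samples can usually be trusted to look like the population.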
There could also be issues depending on how you obtain a random sample. If you want to sample from everyone living in a country, you could, for example, get a random subset of registered voters or people with driver's licences. But then your sample would be heavily biased towards people who are registered to vote or have driver's licences.
This may also lead to a partially random sample where you combine differently-sized random samples from different sources such that the end result is more representative of the population as a whole, although I'm not sure whether or how often this is done in practice. Finding a single data source for the entire population would be preferable.
But that's a whole other question.