# Data Collection

**Data** is the word we use for the information that we collect in order to do our research (the singular of this word is datum, but it is rarely used). Data collection is also known as sampling. It might not seem obvious, but HOW you go about obtaining your subjects can be as crucial to the validity of your outcome as the question you ask and the type of statistical procedure you decide to use to analyze your data. For an excellent data collection plan...

There are two broad categories of data collection in research:

- Probability Sampling
- Non-probability sampling

Probability sampling is also called random sampling and is considered to be the most powerful and desirable method because, theoretically, each member of the larger population from which the sample is drawn has an equal chance of being chosen. Of course, it may occur to you that this is very easy to imagine but very hard to execute. Even if you have complete control over the sampling procedure (let's say you have 3,000+ experimental rats on which to test your new cancer treatment), you can see right away that any subjects you pull from this pool are, by definition, NOT a random sample of the larger population. They may be randomly chosen from your subject pool, but the fact that they were in your pool to begin with means that they were NOT randomly selected from that population. How can we randomly sample human beings in similar studies? If they have the cancer we are trying to treat, they are also, by definition, NOT randomly selected.

Systematic sampling might get us around some (but not all) of these problems. In a more benign example, let's say we are surveying hospital patients to determine what factors cause them to perceive their interactions with the nursing staff as positive and comfortable. If we surveyed all the patients in several hospitals, we would not be creating a random sample. However, if we chose every ith (let's say 10th) patient admitted to all 20 hospitals within 30 miles of our university, then we would come closer to obtaining some of the advantages of a probabilistic selection without being truly probabilistic in our procedures. Every patient in all 20 hospitals had a 10% chance of being chosen, but that is still not random.
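The "every 10th patient" rule above can be sketched in a few lines of Python. This is only an illustration: the admission list and the name `systematic_sample` are made up for the example, and the optional random starting offset is a common refinement that the text itself does not mention.

```python
import random


def systematic_sample(items, step, start=None, seed=None):
    """Return every `step`-th item of `items`, beginning at index `start`.

    If `start` is omitted, a random offset in [0, step) is drawn
    (an assumption added for this sketch, not part of the text).
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    if start is None:
        start = random.Random(seed).randrange(step)
    return items[start::step]


# Hypothetical admission list: 200 patients in order of admission.
admissions = [f"patient_{n:04d}" for n in range(1, 201)]

# start=9 picks the 10th, 20th, 30th, ... patient admitted.
sample = systematic_sample(admissions, step=10, start=9)
print(len(sample))   # 20
print(sample[:3])    # ['patient_0010', 'patient_0020', 'patient_0030']
```

Note that once the starting point is fixed, the whole sample is determined by the order of admission, which is one way to see why this procedure is not truly random.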

Stratified sampling is useful when we know that the larger population, to which we wish to generalize our conclusions, has two or more subpopulations. For example, let's say we are curious about whether or not nursing students feel adequately prepared for their quantitative analysis studies by their high school mathematics coursework. It might occur to you that our population of nursing students has a large female subpopulation and a smaller but still substantial male subpopulation. So we might want to stratify our sample relative to the proportion of females and males at the school: if your school has 400 female and 180 male students, you might want to take 10% from each group (40 females and 18 males). Or, in this case, because mathematics education techniques and trends change from generation to generation, we might want to contrast our 18-to-25-year-olds with our 26-to-35-year-olds, our 36-to-45-year-olds, and so on, and we would take 10% of each group.

Non-probability sampling means that there will be no way to even approximate a chance to be selected, or that you don't try to approximate it.
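Before turning to the non-probability methods, the proportional draw in the stratified example (10% of 400 female and 180 male students) might be sketched as follows. The student identifiers and the function name `stratified_sample` are hypothetical, and drawing at random within each stratum is an assumption of this sketch.

```python
import random


def stratified_sample(strata, fraction, seed=None):
    """Draw the same fraction at random from each stratum.

    `strata` maps a stratum name to the list of its members.
    """
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        k = round(len(members) * fraction)  # proportional allocation
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical enrollment matching the numbers in the text.
strata = {
    "female": [f"F{n:03d}" for n in range(400)],
    "male": [f"M{n:03d}" for n in range(180)],
}

chosen = stratified_sample(strata, fraction=0.10, seed=1)
print({name: len(members) for name, members in chosen.items()})
# {'female': 40, 'male': 18}
```

The same function would work for the age-group version of the example by keying `strata` on age bands instead of sex.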

Contrast the method of the quota sample with the stratified sampling described above. You decide to just find 5 nursing students (any five) in each age group and ask them how well-prepared they feel by their high school math courses to take quantitative analysis, without even bothering with the relative proportions of the age groups. Finally, the convenience sample is just what the name says: convenient. It consists of the subjects who just happen to be there and available. If I want to know how my Introductory Psychology students at Santa Monica College like the Virtual Office Hours system for posting student questions for faculty, I merely survey them at the end of the course.
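To make the contrast with stratified sampling concrete, here is a rough sketch of the quota approach: take students in whatever order they turn up until each age group has five, with no random draw and no attention to group proportions. The names `quota_sample` and `age_band`, the list of students, and the cut-offs between bands are all invented for the illustration.

```python
def age_band(person):
    """Map a (name, age) pair to one of the age groups from the text."""
    _, age = person
    if age <= 25:
        return "18-25"
    if age <= 35:
        return "26-35"
    return "36-45"


def quota_sample(respondents, group_of, quota):
    """Keep respondents in arrival order until each group holds `quota`."""
    filled = {}
    for person in respondents:
        bucket = filled.setdefault(group_of(person), [])
        if len(bucket) < quota:
            bucket.append(person)
    return filled


# Hypothetical students in the order we happen to meet them (ages 18 to 45).
walk_ins = [(f"student_{i}", 18 + (i * 5) % 28) for i in range(60)]

picked = quota_sample(walk_ins, age_band, quota=5)
print({band: len(people) for band, people in picked.items()})
# {'18-25': 5, '26-35': 5, '36-45': 5}
```

Nothing here involves chance: who ends up in the sample depends entirely on who was encountered first, which is what separates a quota sample from a stratified one.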
