/sci/ - Science & Math


File: 134 KB, 600x902, Discriminant-Analysis-of-Principal-Components-DAPC-to-infer-population-substructure-A.png.jpg
No.9542622

Hello /sci/, how do I learn about PCA? I have a classification problem and I know PCA is probably the best way to sort it out given the data I have. I have an intuitive understanding of what PCA is (taking a big data cloud and finding the axis with the most variance) but I need to understand its nuances as a tool so I don't make any dumb mistakes and can talk to stats experts in a knowledgeable way. I also need to understand clustering better. Any book/article recommendations about how to use PCA as a tool without getting too bogged down in the theory?

>> No.9542662

PCA isn't a classification tool though. You can use it to do an external check to see if a given set of parameters might work as a classifier, but it doesn't have a way to do any of the things classifiers do.

Anyway as a rule of thumb if you have a classification problem, throw random forests on it and see if you get anything. If random forests can't do anything with your data you might as well stop

>> No.9542695

>>9542662

Sorry, I wasn't specific enough. I want to do PCA then do some type of clustering. I don't know about random forests but they look like they would solve the problem as well. What's a good resource to learn about them?

>> No.9542698

>>9542695
None come to mind but I'm pretty sure some googling will pull up Python and R packages. Best way to learn is doing.

>> No.9542890

>>9542622

PCA is a dimension reduction / variance explanation tool. It's useful in multivariate analysis for projecting your data onto orthogonal (i.e. uncorrelated) components. Each component is a linear combination of your original variables. If your dataset has correlated variables, it seeks to combine those variables into a supervariable that best explains their variance. Then it seeks another supervariable to explain as much of the remaining variance as possible, and so on. It's just a variable combiner; a way to make your dataset smaller. That can be useful in some machine learning problems (like basic clustering) because it reduces the number of variables / dimensions to a manageable and interpretable size.
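The "supervariable" idea above can be sketched in a few lines of plain Python. This is only a toy illustration, not how a stats package does it: it finds each component by power iteration on the covariance matrix and then removes that component's variance (deflation) before looking for the next one. The function name `pca`, the toy data and the seeds are all made up for the example; for real work a library routine (e.g. `prcomp` in R or `PCA` in scikit-learn) would be the usual choice.

```python
import math
import random


def pca(rows, k, iters=500, seed=0):
    """Return the first k principal axes and the share of variance each explains."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    # Center every variable so the covariance measures spread around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    x = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in x) / (n - 1) for j in range(d)] for i in range(d)]
    total = sum(cov[i][i] for i in range(d))  # total variance = trace of cov

    axes, shares = [], []
    for _component in range(k):
        # Power iteration: repeated multiplication by cov converges to the
        # direction of largest remaining variance.
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _step in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            v = [c / norm for c in w]
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        axes.append(v)
        shares.append(lam / total)
        # Deflation: subtract the variance this axis explains, then look again.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return axes, shares


# Toy data (synthetic): x and y are strongly correlated, z is unrelated noise.
data_rng = random.Random(0)
data = []
for _row in range(200):
    t = data_rng.gauss(0, 3)
    data.append([t + data_rng.gauss(0, 0.3),
                 2 * t + data_rng.gauss(0, 0.3),
                 data_rng.gauss(0, 1)])

axes, shares = pca(data, 3)
print("variance share per component:", [round(s, 3) for s in shares])
print("first axis (weights on x, y, z):", [round(a, 3) for a in axes[0]])
```

In this toy set the first component should be mostly a mix of x and y, which is the "combine the correlated variables" behaviour described above, while z ends up on a component of its own.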

However, it often performs shittily on real datasets due to strong constraints (like orthogonality). There are better tools out there to reduce dimensionality while taking into account the "shape" of the data (e.g. self-organizing maps).

As for clustering, pre-conditioning your data with a dimension reduction technique like PCA can definitely help if you suspect many of your variables are correlated. It can help with classification too. Just be aware of what your transformed variables actually represent to gain intuition about what you're doing.

>> No.9542905
File: 91 KB, 600x378, Druidbro.jpg

>>9542890
this. PCA is just about squishing down a big n-dimensional data set into two or three dimensions so you can visualize it better and identify which variables actually contribute significantly to variance. (a good rule of thumb is that any principal component that doesn't explain >10% of total variance isn't important.)
actual cluster analysis requires further techniques; PCA just isn't about that.
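The 10% rule of thumb mentioned above is easy to apply once you have the eigenvalues (the variance along each component). A minimal sketch, where the function name `important_components` and the eigenvalues are hypothetical and chosen only for illustration:

```python
def important_components(eigenvalues, threshold=0.10):
    """Return (index, share) for components explaining more than `threshold` of total variance."""
    total = sum(eigenvalues)
    shares = [ev / total for ev in eigenvalues]
    return [(i, s) for i, s in enumerate(shares) if s > threshold]


# Hypothetical eigenvalues of a covariance matrix, largest first.
eigenvalues = [5.2, 2.1, 0.8, 0.6, 0.3]
kept = important_components(eigenvalues)
for index, share in kept:
    print(f"PC{index + 1}: {share:.1%} of total variance")
```

The 10% cutoff is a convention, not a law; the threshold is a parameter here so it can be changed to whatever your field uses.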

t. guy who did relative warp analysis (PCA for morphometrics) for his thesis.

>> No.9543807

>>9542890

Thanks for the advice, I do expect many of the variables I'm using to be correlated with one another. I think I've confused myself because whenever I've seen PCA it's been a prelude to doing some type of clustering, so I thought the two procedures were linked (apparently k-means clustering does have some relationship to PCA that I can't completely follow). Can you recommend any resources for learning about clustering with PCA as an input?

>>9542905

See my comment above, I think I'm more stuck on the clustering than the PCA, if you have any recommendations about where I can learn about either that'd be helpful.

>>9542698

I have a biology background, not a CS background. I fuck around with R and Python but a problem I've run into is that a lot of the packages seem to assume a pretty in-depth knowledge of the background maths. I usually agree that doing is the best way to learn but I think I need to learn a bit more about the theory so that I can be sure I'm not just manipulating my data in a way that doesn't have real-world relevance. As I understand it you can basically manipulate cluster analysis to make whatever clusters you want, so I want to know enough to be able to exercise intelligent judgement in what I'm doing.
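The worry above, that cluster analysis will hand you whatever clusters you ask for, can be shown with a small sketch. It runs a bare-bones k-means (Lloyd's algorithm with a simple farthest-point start, both choices made for this example) on uniformly random points that have no group structure at all. The algorithm still returns k clusters for every k, and the fit (within-cluster sum of squares) keeps improving as k grows. All names and data here are made up for illustration.

```python
import math
import random
from collections import Counter


def kmeans(points, k, iters=100):
    """Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    # Farthest-point initialization: deterministic, spreads the centers out.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _step in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Move every center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, wss


# Structureless toy data: 200 points spread uniformly over the unit square.
rng = random.Random(42)
points = [[rng.random(), rng.random()] for _ in range(200)]

wss_by_k = {}
labels_by_k = {}
for k in (2, 3, 4, 5):
    labels_by_k[k], wss_by_k[k] = kmeans(points, k)
    sizes = sorted(Counter(labels_by_k[k]).values())
    print(f"k={k}: cluster sizes {sizes}, within-cluster SS {wss_by_k[k]:.2f}")
```

So getting clusters back proves nothing by itself. The usual guard is some check that is independent of the algorithm, e.g. whether the clusters are stable under resampling or line up with something you know about the samples.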

>> No.9543820

Ignore OP. It's just another /pol/ moron that doesn't understand genetics or what PCA is.

>> No.9543832

>>9542622
PCA is not a classification technique, it's a dimensionality reduction method. Read The Elements of Statistical Learning and practice on Kaggle.

>> No.9543834

>>9543820
Pls no bully, it's not actually genetic data, or even human data, that I want to do PCA/Clustering on (although PCA with SNP data is what I'm most familiar with, which I think is why my OP was a bit confused).

>> No.9543843

>>9543834
Do you even understand WHY you want to do PCA/clustering? You stated in the OP that you have a classification problem. Clustering/PCA has nothing to do with solving a classification problem, unless you want to use the clusters as additional features.

>> No.9543858

>>9543834
Look, you're not smarter than the combined knowledge of all geneticists that have done research for the last few decades.

Just buy a textbook on human population genetics and educate yourself.

Oh wait, let me guess... you think education is liberal brainwashing, and science is a conspiracy and only you, begging for help on 4chan, are smart enough to figure it all out?

>> No.9543927

>>9542622
we use PCA as a classification tool in the sense that the model for our data may have n components and we don't know in advance what n is, so PCA can help us resolve this

>> No.9544367
File: 131 KB, 1200x793, moody_Belfast1955-1.jpg

>>9543858

Lol, I just said I'm not working on genetic data, and I'm not interested in PCA and clustering to solve a population genetics question. I came to /sci/ to ask a /sci/ question. I literally ask for book/article recommendations in the OP, what about my posts suggests I don't trust conventional educational resources? I don't know why you're imputing some ulterior motive to me.

>>9543843
>>9543927

I butchered the explanation in my OP. Basically, I have data about the concentration of a bunch of chemical compounds in different varieties of cannabis. For each variety I have data on about 100 different compounds, and it's likely that some of the chemicals are part of the same synthetic path so the levels of certain groups will likely co-correlate. I'm familiar with analyses where people take SNP data, do PCA then clustering to find plant varieties that are closely related. I also know transcription experiments where you measure levels of a bunch of mRNA molecules, do PCA then clustering and see what families of proteins get upregulated together. My idea was to do PCA (because for each cannabis variety I have concentration data on 100 compounds) and then to do clustering to find cannabis varieties that would likely have similar medicinal effects (because they would have chemical profiles similar to one another). Perhaps there are better ways to do the clustering, someone else suggested random forests, but I still think the PCA step is worth it to reduce noise in the data.
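The workflow described above (standardize the compound concentrations, reduce with PCA, then cluster the varieties on the first few components) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the data are synthetic (two made-up chemotypes, six compounds in two co-varying groups), the PCA is a toy power-iteration version, and k-means with k=2 is just one possible clustering choice.

```python
import math
import random


def standardize(rows):
    """Z-score each column so no compound dominates just because of its scale."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) for j in range(d)]
    return [[(r[j] - means[j]) / sds[j] for j in range(d)] for r in rows]


def pca_scores(rows, k, rng, iters=300):
    """Project already-standardized rows onto their first k principal axes."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)] for i in range(d)]
    axes = []
    for _component in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _step in range(iters):  # power iteration for the leading axis
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            v = [c / norm for c in w]
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        # Deflate so the next pass finds the next axis.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
        axes.append(v)
    return [[sum(r[j] * a[j] for j in range(d)) for a in axes] for r in rows]


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with a deterministic farthest-point start; returns labels."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _step in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Synthetic profiles: 20 varieties per chemotype, 6 compounds each.
# Compounds 0-2 track one pathway level, compounds 3-5 track another.
rng = random.Random(1)
profiles = []
for chemotype in (0, 1):
    for _variety in range(20):
        path_a = (10.0 if chemotype == 0 else 2.0) + rng.gauss(0, 0.5)
        path_b = (2.0 if chemotype == 0 else 10.0) + rng.gauss(0, 0.5)
        profiles.append([path_a + rng.gauss(0, 0.3) for _ in range(3)] +
                        [path_b + rng.gauss(0, 0.3) for _ in range(3)])

scores = pca_scores(standardize(profiles), k=2, rng=rng)
labels = kmeans(scores, k=2)
print("chemotype 0 labels:", labels[:20])
print("chemotype 1 labels:", labels[20:])
```

The toy groups are far apart on purpose, so the clusters come out clean. Real chemical profiles will be messier, and how many components to keep and how many clusters to ask for are choices that need their own justification.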

I think I've confused things by using terminology and colloquial terms interchangeably, so sorry for that.

>> No.9544391

>>9544367
https://en.wikipedia.org/wiki/Factor_analysis#Exploratory_factor_analysis_versus_principal_components_analysis

maybe this section is helpful, there seems to be some autism at play here in the field as well

>> No.9544407

>>9544367
So in your OP you claim you need help understanding PCA... then you go on to claim you already know how to do PCA here >>9543834


Which is it? Based on the wording of your posts, it's completely obvious you do not understand what PCA is, or how bioinformatics works. I'm also going to bet that you think you can do bioinformatic analysis on your own and somehow disprove the entire scientific community. If that's true then you're a fool.

You fail to realize that little PCA figures have little to do with human genetics. People who believe in pseudoscientific ideas of race only cling to them because you can make it appear, on a graph at least, that humans belong to separate groups. This only fools people who don't understand genetics or how PCA works. Your fixation with this PCA is inane and a waste of time.

>> No.9544412

>>9542622
Watch this: https://www.youtube.com/watch?v=a9jdQGybYmE

It helped me get an A on my multivariate exam. Also watch the previous SVD part.

In general, don't be shy to use Youtube. Amazing stuff there.

>> No.9545435

>>9544367
Usually when I'm examining RNA seq data I only use PCA to make sure I don't have a weird replicate fucking up my experiment. I'd look for GO term enrichment to find enriched pathways.

>> No.9545550

>>9543820
>>9543858
>>9544407
You're being a dick, and seem to be projecting on OP.