/sci/ - Science & Math

File: 134 KB, 600x902, Discriminant-Analysis-of-Principal-Components-DAPC-to-infer-population-substructure-A.png.jpg

Anonymous Sun Feb 25 00:17:04 2018 No.9542622

Hello /sci/, how do I learn about PCA? I have a classification problem and I know PCA is probably the best way to sort it out given the data I have. I have an intuitive understanding of what PCA is (taking a big data cloud and finding the axis with the most variance) but I need to understand its nuances as a tool so I don't make any dumb mistakes and can talk to stats experts in a knowledgeable way. I also need to understand clustering better. Any book/article recommendations about how to use PCA as a tool without getting too bogged down in the theory?

Anonymous Sun Feb 25 00:41:55 2018 No.9542662

PCA isn't a classification tool though. You can use it as an external check to see if a given set of parameters might work as a classifier, but it doesn't have a way to do any of the things classifiers do.

Anyway as a rule of thumb if you have a classification problem, throw random forests on it and see if you get anything. If random forests can't do anything with your data you might as well stop

Anonymous Sun Feb 25 01:07:19 2018 No.9542695

>>9542662
Sorry, I wasn't specific enough. I want to do PCA then do some type of clustering. I don't know about random forests but they look like they would solve the problem as well. What's a good resource to learn about them?

Anonymous Sun Feb 25 01:09:06 2018 No.9542698

>>9542695
None come to mind but I'm pretty sure some googling will pull up Python and R packages. Best way to learn is by doing.

Anonymous Sun Feb 25 03:07:53 2018 No.9542890

>>9542622

PCA is a dimension reduction / variance explanation tool. It's useful in multivariate analysis for projecting your data onto orthogonal (i.e. uncorrelated) components. Each component is a linear combination of your original variables. If your dataset has correlated variables, it seeks to combine those variables into a supervariable that best explains their variance. Then it seeks another supervariable, orthogonal to the first, that explains as much of the remaining variance as possible, and so on. It's just a variable combiner; a way to make your dataset smaller. That can be useful in some machine learning problems (like basic clustering) because it reduces the number of variables / dimensions to a manageable and interpretable size.

However, it often performs shittily on real datasets due to strong constraints (like orthogonality). There are better tools out there to reduce dimensionality while taking into account the "shape" of the data (e.g. self-organizing maps).

As for clustering, pre-conditioning your data with a dimension reduction technique like PCA can definitely help if you suspect many of your variables are correlated. It can help with classification too. Just be aware of what your transformed variables actually represent to gain intuition about what you're doing.
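
To make the description above concrete (orthogonal components, each a linear combination of the original variables, ordered by how much variance they explain), here is a minimal sketch in Python using only NumPy. The function name `pca` and the made-up data are assumptions for illustration, not anything from the thread; real packages wrap the same idea with more options.

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD of the mean-centred data matrix (rows = samples)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each variable
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # each row: weights of one linear combination
    scores = Xc @ components.T                   # samples expressed in the new variables
    variance = S**2 / (X.shape[0] - 1)           # variance along every component
    ratio = variance / variance.sum()            # fraction of total variance explained
    return scores, components, ratio[:n_components]

# Made-up data: columns 0 and 1 are driven by one shared factor, column 2 is noise.
rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
X = np.hstack([
    factor + 0.1 * rng.normal(size=(200, 1)),
    2 * factor + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

scores, components, ratio = pca(X, n_components=2)
print(np.round(components @ components.T, 6))          # close to identity: directions are orthogonal
print(np.round(np.corrcoef(scores, rowvar=False), 6))  # off-diagonals near zero: scores are uncorrelated
print(np.round(ratio, 3))                              # first component carries most of the variance
```

The two correlated columns end up merged into the first component, which is the "supervariable" idea; the rows of `components` show the weights, which is what you would inspect to see what a transformed variable actually represents.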

Anonymous Sun Feb 25 03:19:31 2018 No.9542905
File: 91 KB, 600x378, Druidbro.jpg

>>9542890
this. PCA is just about squishing down a big n-dimensional data set into two or three dimensions so you can visualize it better and identify which variables actually contribute significantly to variance. (good rule of thumb is any principal component that doesn't explain >10% of total variance isn't important.)
actual cluster analysis requires further techniques; PCA just isn't about that.

t. guy who did relative warp analysis (PCA for morphometrics) for his thesis.
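
The >10% rule of thumb mentioned in the post above can be sketched like this. It is only a heuristic from the thread, not a universal cutoff, and the helper names and the example variance shares below are hypothetical.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                               # centre each variable
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variance = singular_values**2                         # proportional to per-component variance
    return variance / variance.sum()

def components_to_keep(ratio, threshold=0.10):
    """Count components whose share of variance exceeds the threshold."""
    return int(np.sum(np.asarray(ratio) > threshold))

# Hypothetical variance shares, already sorted from largest to smallest.
example_ratio = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
kept = components_to_keep(example_ratio)          # 3 components exceed 10%
cumulative = np.cumsum(example_ratio)[kept - 1]   # share of variance those 3 retain
print(kept, round(float(cumulative), 2))
```

On your own data you would pass the output of `explained_variance_ratio(X)` to `components_to_keep` and check the cumulative share as a sanity check on how much information the cut throws away.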

Anonymous Sun Feb 25 12:38:20 2018 No.9543807

>>9542890

Thanks for the advice, I do expect many of the variables I'm using to be correlated with one another. I think I've confused myself because whenever I've seen PCA it's been a prelude to doing some type of clustering, so I thought the two procedures were linked (apparently k-means clustering does have some relationship to PCA that I can't completely follow). Can you recommend any resources for learning about clustering with PCA as an input?

>>9542905

See my comment above, I think I'm more stuck on the clustering than the PCA, if you have any recommendations about where I can learn about either that'd be helpful.

>>9542698

I have a biology background, not a CS background. I fuck around with R and Python but a problem I've run into is that a lot of the packages seem to assume a pretty in-depth knowledge of the background maths. I usually agree that doing is the best way to learn but I think I need to learn a bit more about the theory so that I can be sure I'm not just manipulating my data in a way that doesn't have real-world relevance. As I understand it you can basically manipulate cluster analysis to make whatever clusters you want, so I want to know enough to be able to exercise intelligent judgement in what I'm doing.

Anonymous Sun Feb 25 12:42:57 2018 No.9543820

Ignore OP. It's just another /pol/ moron that doesn't understand genetics or what PCA is.

Anonymous Sun Feb 25 12:50:02 2018 No.9543832

>>9542622
PCA is not a classification technique, it's a dimensionality reduction method. Read The Elements of Statistical Learning and practice on Kaggle.

Anonymous Sun Feb 25 12:50:35 2018 No.9543834

>>9543820
Pls no bully, it's not actually genetic data, or even human data, that I want to do PCA/Clustering on (although PCA with SNP data is what I'm most familiar with, which I think is why my OP was a bit confused).

Anonymous Sun Feb 25 12:54:42 2018 No.9543843

>>9543834
Do you even understand WHY you want to do PCA/clustering? You've stated in the OP that you have a classification problem. Clustering/PCA has nothing to do with solving the classification problem, unless you want to use clusters as additional features.

Anonymous Sun Feb 25 13:00:35 2018 No.9543858

>>9543834
Look, you're not smarter than the combined knowledge of all geneticists that have done research for the last few decades.

Just buy a textbook on human population genetics and educate yourself.

Oh wait, let me guess... you think education is liberal brainwashing, and science is a conspiracy and only you, begging for help on 4chan, are smart enough to figure it all out?

Anonymous Sun Feb 25 13:24:18 2018 No.9543927

>>9542622
we use PCA as a classification tool in the sense that the model for our data may have n components and we don't know in advance what n is, so PCA can help us resolve this

Anonymous Sun Feb 25 16:22:41 2018 No.9544367
File: 131 KB, 1200x793, moody_Belfast1955-1.jpg

>>9543858

Lol, I just said I'm not working on genetic data, and I'm not interested in PCA and clustering to solve a population genetics question. I came to /sci/ to ask a /sci/ question. I literally ask for book/article recommendations in the OP, what about my posts suggests I don't trust conventional educational resources? I don't know why you're imputing some ulterior motive to me.

>>9543843
>>9543927

I butchered the explanation in my OP. Basically, I have data about the concentration of a bunch of chemical compounds in different varieties of cannabis. For each variety I have data on about 100 different compounds, and it's likely that some of the chemicals are part of the same synthetic pathway, so the levels of certain groups will likely be correlated. I'm familiar with analyses where people take SNP data, do PCA then clustering to find plant varieties that are closely related. I also know of transcription experiments where you measure levels of a bunch of mRNA molecules, do PCA then clustering and see what families of proteins get upregulated together. My idea was to do PCA (because for each cannabis variety I have concentration data on 100 compounds) and then to do clustering to find cannabis varieties that would likely have similar medicinal effects (because they would have chemical profiles similar to one another). Perhaps there are better ways to do the clustering, someone else suggested random forests, but I still think the PCA step is worth it to reduce noise in the data.

I think I've confused things by using terminology and colloquial terms interchangeably, so sorry for that.
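
The workflow described in the post above (standardize the compound concentrations, reduce them with PCA, then cluster the varieties in the reduced space) can be sketched as follows. Everything here is an assumption for illustration: the data are randomly generated stand-ins for 60 varieties and 100 compounds, the function names are made up, and k-means with k = 2 is just one possible clustering choice. scikit-learn's `StandardScaler`, `PCA` and `KMeans` cover the same steps if you prefer a library.

```python
import numpy as np

def standardize(X):
    """Scale every compound to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0                           # leave constant columns unscaled
    return (X - X.mean(axis=0)) / std

def pca_scores(X, n_components):
    """Project samples onto the leading principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, n_init=10, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; keeps the best of n_init random starts."""
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _restart in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _step in range(n_iter):
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)         # assign each sample to its nearest centre
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        inertia = ((X - centers[labels]) ** 2).sum()   # within-cluster sum of squares
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels

# Synthetic stand-in data: 2 groups of 30 varieties, 100 compounds each.
rng = np.random.default_rng(42)
n_per_group, n_compounds = 30, 100
concentrations = rng.normal(size=(2 * n_per_group, n_compounds))
concentrations[n_per_group:, :20] += 5.0          # second group: 20 compounds elevated together

scores = pca_scores(standardize(concentrations), n_components=2)
labels = kmeans(scores, k=2)
print(labels)
```

Because the number of clusters and the number of retained components are both chosen by hand here, it is worth rerunning with different values and checking whether the grouping is stable before reading anything into it.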

Anonymous Sun Feb 25 16:32:53 2018 No.9544391

>>9544367
https://en.wikipedia.org/wiki/Factor_analysis#Exploratory_factor_analysis_versus_principal_components_analysis
maybe this section is helpful, there seems to be some autism at play here in the field as well

Anonymous Sun Feb 25 16:38:22 2018 No.9544407

>>9544367
So in your OP you claim you need help understanding PCA... then you go on to claim you already know how to do PCA here >>9543834

Which is it? Based on the wording of your posts, it's completely obvious you do not understand what PCA is, or how bioinformatics works. I'm also going to bet that you think you can do bioinformatic analysis on your own and somehow disprove the entire scientific community. If that's true then you're a fool.

You fail to realize that little PCA figures have little to do with human genetics. People that believe in pseudoscientific ideas of race only cling to them because you can make it appear, on a graph at least, that humans belong to separate groups. This only fools people who do not understand genetics or how PCA works. Your fixation with this PCA is inane and a waste of time.

Anonymous Sun Feb 25 16:40:13 2018 No.9544412

>>9542622
Watch this: https://www.youtube.com/watch?v=a9jdQGybYmE
It helped me get an A on my multivariate exam, also watch the previous SVD part. In general, don't be shy about using YouTube. Amazing stuff there.

Anonymous Mon Feb 26 00:50:05 2018 No.9545435

>>9544367
Usually when I'm examining RNA seq data I only use PCA to make sure I don't have a weird replicate fucking up my experiment. I'd look for GO term enrichment to find enriched pathways.

Anonymous Mon Feb 26 01:58:37 2018 No.9545550

>>9543820
>>9543858
>>9544407
You're being a dick, and seem to be projecting on OP.
