
/sci/ - Science & Math



File: 99 KB, 1200x1600, Neural_network_example.png
No.14956046

Neural network noob here, I have a question.
I am making a fully connected (dense) neural network to perform regression from feature vector (250 elements) to feature vector (150 elements).
Are there any rules or guidelines for choosing the number of layers or number of neurons per layer?
To me it seems like recommendations/rules like this should exist, as this looks to be a classical neural network problem, but I have found nothing while searching the literature.

Do such rules exist?
How many layers/neurons should be used here?
Any advice is appreciated

>> No.14956215

>>14956046
No, there are no recommendations other than 'deep is better than wide', but that's probably not even true anymore either. This is where neural nets stop being a science or math and become an art, and also how a shit ton of people were getting published in the field 5-10 years ago. All they had to do was find a network that performed better than the previous best at some banal problem.

In theory, a wide enough single layer should be able to solve any problem (in the same way a Fourier or Taylor expansion works), but no one does that because of efficiency.

My recommendation is to look into the literature for a neural net solving a similar problem with a similar dataset, find what structure they used, then copy it and potentially modify bits and pieces as you see fit.

This is also why other ML/statistical techniques have become really popular. Designing neural nets is a pain. I use BARTs (Bayesian additive regression trees) pretty regularly myself.

>> No.14956285

>>14956215
thank you for the answer, I will try both "deeper than wide" and "wider than deep (extremely shallow)" and see what works best.
From reading the literature I started seeing for myself that it is becoming more of an art than a science.
Why is a very wide, shallow network inefficient?

The problem with copying other publications is that I am using rather unique self-made training/validation data; it not having been used anywhere before is a big hook for my work. And everyone just seems to be using dense nets for these types of problems anyway

>> No.14956288

It's like cooking food. We know which ingredients paired with which techniques make good food, but there might be some really good recipe that no one's figured out yet.

Google something like "neural network topology" to learn the general guidelines.

>> No.14956354

>>14956288
thank you for the advice, will do
However, combining every reasonably possible number of layers with every reasonably possible number of neurons in each layer seems like an impossibly huge task, even for small networks

>> No.14956551

>>14956046
>>14956215
Linking your comment for anon.
https://en.m.wikipedia.org/wiki/Universal_approximation_theorem

>> No.14956554

>>14956354
That's why you do an optimization loop over these variables ("hyperparameters") => hyperparameter tuning

take a look at "optuna", a framework for that task
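
rough sketch of what that loop could look like (optuna on top of whatever training code you have; the ranges and the train_and_eval function here are just placeholders):

import optuna

def objective(trial):
    # sample a candidate architecture and learning rate
    n_layers = trial.suggest_int("n_layers", 1, 4)
    width = trial.suggest_int("width", 32, 1024, log=True)
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    # train_and_eval is your own code: build the net with these
    # settings, train it, and return the validation loss
    return train_and_eval(n_layers, width, lr)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)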

>> No.14956578

>>14956354
The problem often defines what network architecture will work best - CNNs being a great example of this.

For most "simple" problems having more than 2 or 3 hidden layers rarely helps, and having those layers have too many nodes can lead to overfitting. So testing the 9 combinations of {1,2,3} x {1 x N, 2 x N, 4 x N}, where N is the input vector size, will work for many problems or give you a great idea where to start from.

>> No.14956597

>>14956215
So basically you are saying that computer science researchers found a way to make brains, and now they just feed them shit, pick a random number of nodes and params, and then it makes itself without a single researcher understanding what is happening in the hidden layers?

>> No.14956600

Isn't it something like 3 layers usually?
I'm just a statistics student, but one would expect the number of neurons to be related to the number of hidden parameters.

>> No.14956603

>>14956597
The original algorithms for neural networks were based on how they thought the brain worked at the time (1960s/70s). Turns out they were wrong, so they are not simulating the brain at all. Even today it takes a shitload of computing power to simulate a single human neuron.

>> No.14956614

>>14956554
I thought hyperparameters were associated with a static network and let you tune the learning rate, etc., not represent the structure of the network itself, like the number of layers (different number of layers = different network)
Is that right?

>> No.14956618

>>14956578
Will try this
but you seem to restrict the number of neurons in each layer to the same size as the input, instead of making it less or more than that.
Is there some sort of hidden reason for it, or was it just as an example?

>> No.14956627

>>14956597
No, I'm saying it's not exactly clear why one certain network is better than another for a given dataset. This is mostly because you often don't know what the underlying true relationship is between input and output.

All a neural network is, is a summation of functions within summations of functions, all fitted to the data. It's not exactly clear why one given number of summations fed into another function is particularly better than another setup; it just IS, and is just closer to the true underlying relationship. It doesn't help that these problems typically have a high number of dimensions (both on the input and output), making visualization really difficult. If it was just one input and one output, it'd be pretty easy to fit (and you should ideally be able to find an understandable empirical relationship).
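
In plain numpy terms, a one-hidden-layer net is literally just this (toy sizes, nothing domain specific):

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# y = W2 @ relu(W1 @ x + b1) + b2: one fitted sum of functions
# fed into another fitted sum
W1, b1 = np.random.randn(8, 3), np.zeros(8)
W2, b2 = np.random.randn(2, 8), np.zeros(2)

x = np.random.randn(3)
y = W2 @ relu(W1 @ x + b1) + b2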

>> No.14956629

>>14956215
>>14956288
>>14956554
could something like this be considered rules/guidelines for dense net construction: https://towardsdatascience.com/17-rules-of-thumb-for-building-a-neural-network-93356f9930af ?
No explanations for them are given, so they kind of seem like asspulls, but it's better than nothing I guess, will have to try these too

>> No.14956644

>>14956618
> Is there some sort of hidden reason for it
Just somewhere to start from. You seem to think that adding or removing a few nodes makes a noticeable difference. Like the guidelines in >>14956629 suggest, scale the hidden layer size to the input, but then tweak the dropout (node drop/kill) percentage.
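
Something like this, as a PyTorch sketch (the 0.2 is just a starting value for the knob you'd sweep):

import torch.nn as nn

def mlp(n_in, n_out, p_drop=0.2):
    # hidden width tied to the input size; p_drop is what you tweak
    return nn.Sequential(
        nn.Linear(n_in, n_in), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(n_in, n_in), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(n_in, n_out),
    )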

>> No.14956659

>>14956644
>You seem to think that adding or removing a few nodes makes a noticeable difference
I don't, but if the input is, say, 200, then there should be a difference between having hidden layers of 800, 400, 200, 100 or 50, should there not? Each of those steps is a considerable fraction of the input size

>> No.14956669

>>14956659
> then there should be a difference between having hidden layers be 800, 400, 200, 100 or 50
Actually for many problems you'll find that not to be true. Think about it, there is only a finite amount of information embedded in the input data so after a certain layer size adding more nodes is pointless, there's nothing else left to extract.

>> No.14956681

>>14956669
I understand, but that is not true going the opposite way - there should be a minimum size, right?
Obviously using 1 node per layer would be nonsense, but would using 2, 5, 10 or 100 be better? Just blindly sweeping through the numbers means there are tens of networks to train and test

>> No.14956692

>>14956681
Yes, there will be a minimum size, which is typically half the input size.

>> No.14956705

>>14956692
Oh, so it is an empirically determined limit
Does the depth change depending on how wide the layers are?

>> No.14956716

>>14956705
Like all your questions, the answers are in general problem-dependent.

>> No.14956875

>>14956716
so basically, the answers to almost all questions relevant to me are "trial and error, see what works for your specific input-output datasets", correct?

>> No.14956931

Go with a feeling.

If it's not back-connected it's just a series of filters and not even AI

>> No.14956995

Layers are generated by a given shared parameter, such as the sharpness of pixel color change; there are as many as there are groups and subgroups.

>> No.14957081

>>14956614
No.

Hyperparameters are all the parameters you can tweak that are not inside your model itself (the weights and biases, i.e. what forms your function approximation after training).

Try optuna and report back. Also, there are few cases where hyperparameter tuning rescues your ass. If it is not working, something else is wrong. HP tuning will, in general, only squeeze performance out of your network. Follow some standard networks and use your brain. Mostly, garbage in equals garbage out.

>> No.14958094

>>14956046
There might be, but my math is way too low. It's a black box toy with a lot of knobs and anything goes, usually. If it doesn't, turning the knobs harder usually won't work because you are most likely missing a different knob or mechanism.
Learning rate, depth and width only hinder you when you go full retard with the values, and won't be the deal breaker alone.
In general people just gradually taper down the width each layer to the output size needed. How gradual? Go figure.
Other options such as the activation function, weight initialization, the dataset you feed it, and validation can help more for it to generalize something useful. Maybe a momentum term.

>> No.14958891

bump

>> No.14958936
File: 199 KB, 637x945, 1667487868610203.jpg

>>14956046
I think a fun idea for neural networks is NEAT. That type of network evolves its shape instead of only training weights. It can add and remove neurons.

You don't have to guess a network structure because it evolves one. So your problem of not knowing where to connect things or how many is sidestepped.

>NEAT information
https://www.cs.ucf.edu/~kstanley/neat.html
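
If you want to poke at it from Python, the neat-python package looks roughly like this (just a sketch; the config file and my_fitness are yours to supply):

import neat

def eval_genomes(genomes, config):
    for genome_id, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        # score how well net.activate(x) matches your targets
        genome.fitness = my_fitness(net)  # your own scoring function

config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     "neat_config.txt")  # population size, mutation rates, etc.
pop = neat.Population(config)
winner = pop.run(eval_genomes, 100)  # evolve for up to 100 generations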

>> No.14959337

>>14958936
is it something like a liquid state machine or extreme learning machine?

>> No.14959482

>>14956046
I actually work in the ML field (a specific domain related to medicine).
I can tell you right now that the biggest mistake people make is having too small a dataset for a neural network, and that, for tabular data, XGBoost >> neural nets until a certain N.
>>14956681
Generally it depends on your data problems, and you would normally not "just" use feedforward layers. It's conventional to make the layer sizes powers of 2 (64, 128, 256, 512, 1024, etc.); historically it had to do with computational efficiency and in some settings it probably still does, but that's not necessarily true with modern libraries (I could be wrong). Is your feature vector in the real number space or one-hot encoded/label encoded? If the latter, I'd use an embedding layer before the rest of the feedforward.
Regardless, I'd just heuristically go 250 -> 256 -> 256 -> 256 -> 150, then 250 -> 512 -> 256 -> 256 -> 256 -> 150, and then maybe 250 -> 1024 -> 512 -> 256 -> 150. Why? Because there isn't nearly enough information about your problem to do anything more than guess. You don't have to fully train to completion; usually the initial gradient descent loss changes are a good indicator of what your final model will be capable of.
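
For reference, the first of those shapes is just this in PyTorch (a sketch, not a recommendation beyond the guess above):

import torch.nn as nn

# 250 -> 256 -> 256 -> 256 -> 150, plain ReLU MLP for regression
model = nn.Sequential(
    nn.Linear(250, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 150),  # linear output, pair with an MSE-style loss
)
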
Regardless, I would spend more time with things like learning rate and cosine annealing/warm restarts; it's no secret in the field that despite "WOW LOOK AT NEW ARCHITECTURE", meta-analysis shows that most of the performance gains in the past X years are from tweaking training strategy (obviously not talking about transformers vs. feedforward layers only, but mainly about the many flavors of GANs/Transformers/RNNs etc).
>>14959337
>NEAT
It's more sophisticated than what I'm about to say, but essentially it's an evolutionary algorithm that prunes connections between nodes, adds new nodes, mutates nodes and connections, etc. So it evolves the architecture of the neural network itself, while under normal SGD you have a static architecture and only change the weights

>> No.14959503
File: 1.19 MB, 1440x810, odomtech.png

>>14956046
dr. robert duncan, the cia whistleblower on AI and synthetic telepathy, has a good video on neural networks and how they can be used to retrain the neural networks in a human's brain.
https://www.youtube.com/watch?v=m67l0plq86o

>> No.14959634

>>14956603
I don't even understand why anyone is impressed with AI. Imagine what a human would do if they could play 3 million games of chess in ten minutes. AI is staggeringly retarded.

>> No.14960436

>>14957081
tried it, limited success so far
I made sure my data had enough variation before starting the training

>> No.14960437

>>14958936
Like I said, I am a novice at this and don't think I know how to implement something so advanced.
An interesting idea though, maybe will give this a try some time in the future once I have more practical know-how

>> No.14960440

>>14959482
Thank you for the suggestions, will try these architectures
All my input data are real numbers

>> No.14960493
File: 105 KB, 1007x1175, 1664819140534093.jpg

>>14959337
I don't know what these things are

>>14960437
There are a lot of TV shows on YouTube about making NEAT play Flappy Bird.

I played with it some. NEAT works a lot better if you declare the networks to be "gendered" and have the top 80% of "female" networks reproduce with the top 20% of "male" networks and just delete all the other ones. I got the idea from someone complaining about women on YouTube.

This categorization only applies when the genetic portion of the algorithm is used.

I don't usually suggest Python, because I dislike Python, but it has a lot of nice AI libraries and examples you can use to play with NEAT with a lot less effort than you are imagining.

>tl;dr: If some teenager on YouTube can make NEAT play Flappy Bird, you can too.

>question
Is there a good alternative to YouTube? A lot of times I go there to hear anime music and after 2 or 3 songs get autoplayed somehow into CNN or sometimes a foreign network that plays stories about the Ukraine war.

I don't care at all about that war, except I hope one side or the other wins soon so prices on things can return to normal and bad people in my country can quit using a battle over an irrelevant territory on the other side of the planet as an excuse to censor the internet to save people from """misinformation""" or """disinformation""".

Also sometimes it autoplays me into boring people /g/ recommends like Luke Smith.

Neither of these things is related to anime music, and both are unwelcome disruptions.

>> No.14960781

bumping for actual science on /sci/

>> No.14961772

>>14960781
second it

>> No.14961775

>>14960440
have tried them, one (the first one) worked rather well, will try hyperparameter tuning on it

>> No.14963544

>>14960493
I know about Python supremacy regarding ML, but I am currently using MATLAB because doing ML with it ties into my whole workflow better; literally everything else is in MATLAB.
And it's not that bad, all I use is the base functionality

>> No.14963980

bump

>> No.14965592

the thread seems to have fizzled out. welp, it was a good neural net discussion while it lasted

>> No.14966467

It's a bit of a toss-up usually. Start small and work up so you don't overfit right off the bat. Do a few experiments to establish hyperparameter bounds and then use a derivative-free solver (like Bayesian optimization or any global method) to optimize those hyperparameters.
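
A sketch of that with scikit-optimize, if you want something concrete (the bounds and train_and_eval are placeholders for your own setup):

from skopt import gp_minimize
from skopt.space import Integer, Real

space = [Integer(1, 4, name="n_layers"),
         Integer(32, 1024, name="width"),
         Real(1e-4, 1e-1, prior="log-uniform", name="lr")]

def objective(params):
    n_layers, width, lr = params
    return train_and_eval(n_layers, width, lr)  # your loop, returns val loss

result = gp_minimize(objective, space, n_calls=30)
print(result.x, result.fun)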

>> No.14967223

>>14966467
already doing that, nevertheless, thank you

>> No.14967224

>>14959482
where are these rules from?
how are they known?
is it from some sort of article?
can you provide a doi/link?

>> No.14967237
File: 279 KB, 1120x935, 3243554.jpg

>>14956046
AI "engineers" are retarded come monkeys who literally just come up with arbitrary heuristics through trial and error that can only be done with a megacorporate computing center. There is no arcane knowledge. There is no trick. There is no real rationale. It's literally just "layers go brrrrr".

>> No.14969384

>>14967237
"helpful" input

>> No.14969427

>>14969384
>t. AI fan

>> No.14969776

>>14957081
ok, I have tried hyperparameter tuning for a network with a reduced dataset.
now that I have the "best" training parameters, would the same network with the same training parameters but more data perform better than the one with the reduced training dataset?
I have not yet tried to optimize the architecture (neurons/layer or number of layers), because I got more or less satisfactory results with a suggested architecture
in principle, optimal training conditions (hyperparameters) would be different for a network with a different architecture, right?

>> No.14970020

>>14969427
not by a longshot

>> No.14970268

I'm also new to this and from my understanding it's arbitrary, but not like VERY arbitrary, like 50 hidden layers. You just gotta play with it and find the right amount

>> No.14970500

>>14970268
I seem to be getting this a lot
the lack of rigor & general rules is killing me here

>> No.14970926

>>14967224
>where are these rules from?
Which ones? If you mean the power-of-2 rule (512, 1024), that's just field culture. Some things like CUDA efficiency do increase when you set, say, the batch size to a power of 2 (Tensor Cores; at its core your computer is binary/byte-based, so sizes that are powers of 2 or divisible by 8 are theoretically the most efficient to load).
As far as learning adjustments vs. architectures, that's just from watching the field evolve. Warm restarts come from this paper https://arxiv.org/abs/1608.03983 but there are many variations on it, which almost always improve results, all things being equal. There is a paper I can't find at the moment that did a comparison of most SOTA models and found that proper hyperparameter optimization is the main driver; they were able to get several old and new model types to reach the exact same performance by spending time on hyperparameter optimization for all of them (reading between the lines: when people introduce a new model, they want it to look good, so they'll spend a shit-ton of time optimizing their model. Then they will take a comparison model, do 0 work on it, use it as-is, and go tada! Our optimized model is now SOTA compared to last month's published model which we did not also optimize nicely!). But a lot of stuff is just picked up from different papers; lots of little details that appear to work most of the time.
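
(PyTorch already ships that schedule; minimal sketch of wiring it up, where model and train_one_epoch are your own code:)

import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

for epoch in range(100):
    train_one_epoch(model, optimizer)  # your own training step
    scheduler.step()
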
>>14969776
Generally you would split train/validation/test (70%/15%/15% of total dataset) and use your training set for model training and validation set to pick the best hyperparameters. You would then build your model with those hyperparameters on training + val data and report the performance on the test set (which should NEVER be touched until you have a "final" model). Finally, you'd rebuild the model with all the data for any future predictions to make sure you're taking advantage of all data.
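
In sklearn terms something like this (X, y being your full arrays; 70/15/15 as above):

from sklearn.model_selection import train_test_split

X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, random_state=0)  # 15% of the original total
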
>>14970500
Learning theory is not arbitrary, it defines the scaffold of how you should approach your problem with ML. Most people just don't understand that it's probabilistic in nature and domain-specific.

>> No.14971161

>>14956046
machine learning is literally trial and error

>> No.14971428

>>14970921
>>14970926
Thank you so much for the informative answer and the reference
Regarding the "field culture" - there still must have been something (for example some old journal article or textbook) that started this trend, or did different people just come up with it independently?
And regarding the warm restarts - as I understand it, it is essentially just a different/new solver, right?

I have 2 more questions for you:
1) Should hyperparameter optimization be done on an untrained ("empty") network or can it be done on a network that was previously trained using sub-optimal hyperparameters (for example, a different learning rate, different batch size, etc)? In architecture (number of layers, neurons per layer, dropout, etc) the networks are the same in both cases
2) I do my hyperparameter optimization on a dataset that is smaller than my entire dataset but similar in structure (the range and variation of the values are similar in both cases), in order to reduce the time needed for hyperparameter optimization - I do not have powerful hardware.
Is it reasonable to assume that once I have chosen the optimal hyperparameters for a network by using a reduced dataset, further training with the bigger dataset would improve performance by using the same hyperparameters?

>> No.14971435

>>14970926
do you have any learning resources, like journal articles or textbooks, about this "learning theory"? Getting familiar even with the basics would help me a lot, as it seems like currently I know nothing

>> No.14971988

>>14971161
this guy >>14970926 would disagree

>> No.14972376

this is one of the best threads on /sci/ right now OP, have a bump

>> No.14972563

I hate it, either you have theoretical work with no applications and since it's analysis you need to memorize 10000 identities to be able to prove anything, or you're making up building blocks based on intuition and throwing them together in different combinations until it works (using several GPU-months of time)

>> No.14974137

>>14972563
is a GTX 1050 Ti sufficient for training deep learning networks?

>> No.14974159

>>14974137
it's sufficient for training muh ballz

>> No.14974277

>>14974137
It depends on the network size, and maybe on how easy the task is to train, but it's definitely not enough for the methods institutions and companies use for optimizing large networks, which means repeating the training many times (maybe 1000s of times if they have / rent enough GPUs) with different configurations of the network. Finding more efficient ways of picking configurations to test is an active area of research

>> No.14974492

>>14974277
Yes, picking the right architecture is one of the problems I am facing now - optimizing multiple hyperparameters is very slow with my equipment, but it is the only thing I have

>> No.14975082

>>14956215
I thought you needed at least two hidden layers in order to be able to model non-linear effects in the data?

>> No.14976094

>>14975082
you only need 1, because then the output layer applies a linear transformation to the nonlinear activations of the hidden layer
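
quick toy demo of that point if anyone wants to run it (PyTorch, one hidden layer approximating sin on a small interval):

import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 512).unsqueeze(1)
y = torch.sin(x)

# single hidden layer, linear output
net = nn.Sequential(nn.Linear(1, 256), nn.Tanh(), nn.Linear(256, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

print(loss.item())  # should end up very small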

>> No.14977043

>>14976094
yes, 1 layer is sufficient, but the question is - how many neurons are sufficient?

>> No.14977769

>>14956046
I do algotrading and also wanted to understand neural networks better so that I could apply them effectively, I wanted good THEORY. I'd tell you what I learned, but I feel like I'd be leaking alpha, so I won't :p. Keep searching the literature for answers, ask the right questions, it's out there.....
also 99.999% of people who do machine learning don't even know what induction is, it's crazy, it blows my fucking mind.

>> No.14978689

>>14977769
Is there any good literature you can recommend?

>> No.14978893

>>14977769
>I'd tell you what I learned, but I feel like I'd be leaking alpha, so I won't :p.
which is not possible since pure maths is disconnected from applied maths. Conclusion, you know nothin.

>> No.14979534

bump, am now interested in getting some answers like OP asked in >>14971435

>> No.14980824

>>14977769
I mean like textbooks and journal articles and stuff

>> No.14981337

bumping for scientific interest

>> No.14982927

sadly, good textbook sources are hard to come by, as a lot of the developments in the field are very new

>> No.14984213

Anyone still here?

>> No.14985724

It would be great if anyone could answer these questions: >>14971428