/sci/ - Science & Math


File: 30 KB, 474x714, stats.jpg
No.10805292

Thread for anyone currently working in the Data Meme business.

Is pic related any good? I'm currently employed as a Data Scientist, but my main contribution to the team is that I write better Python/SQL/Q than the rest of them. It's fine, but I want to go beyond that and get more stats/ML knowledge.

Also fuck R, take the Juliapill.

>> No.10805328
File: 25 KB, 432x432, download (14).jpg

>>10805292
Old school, must read... Bump for being Juliapilled. Haven't read that yet though.

>> No.10805478

>>10805292
Based Julia. I'm currently reading time series analysis by Shumway.

>> No.10805548

>>10805292
Haven't read the books that compete with ISL but it's a standard.

How did you get your job, OP? What are your skills that helped you find a DS job? Portfolio? I'm doing Kaggle comps as I'm finishing my CS degree.

>> No.10805651

>>10805548
I started out in a less technical role more similar to a Data Analyst and gradually got more technical.

Basically for DS you either go in that way, or go straight in after a STEM degree or whatever.

The field is still somewhat loose, so it also kind of depends what the job is, i.e. whether they need an ML engineer, a data analyst, a data engineer (read: DBA with more technical ability) etc

As a start, make sure you have
>Python (esp. pandas, numpy, some plotting library)
>SQL of some flavour
>Knowing basic Excel can actually help a lot
>depending on what you're actually doing and what industry, a lower-level language like C or Java might be a good idea

Totally depends on industry though.

>> No.10805746

>>10805651
thanks 4 reply
have a bump

>> No.10807502
File: 10 KB, 256x256, wes_mckinney.jpg

I pray to Wes every night

>> No.10807551

You don't just read one book though. I'm reading about GLMs, categorical data, time series analysis, survival analysis, Bayesian learning, statistical learning, etc.

>> No.10807558

>>10807551
Any recommendations?

I might pick up Shumway's time series if it's any good

>> No.10807613

>>10805292
Yes, good motivation plus simple examples.
http://sgsa.berkeley.edu/current_students/books/

>> No.10807616

>>10805292
Julia book stats
https://people.smp.uq.edu.au/YoniNazarathy/julia-stats/StatisticsWithJulia.pdf

>> No.10807654

>>10805292
Red pill me on Julia?

>> No.10807662

>>10807558
I thought Shumway was overcomplicated. I only read parts of the ARIMA section though

>> No.10807687

>>10805328
>>10805478
What do these books cover?

t. time series anon that hasn't posted yet

>> No.10807706

So at work, I've been experimenting with this thing where I apply a smoothing convolution to a time series, forecast the smoothed series, and then deconvolve it back to linear space using Richardson-Lucy. The idea is that some of the noise in the signal might not actually be completely noise, and that it might actually be a predictable signal with some slight variation in how late or how soon it arrives. The convolution smooths it all over so that the "noise" occurs more uniformly in time, and then the deconvolution recovers the original signal.

What does /data/ think?
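
A minimal sketch of the deconvolution step only, assuming a non-negative series and a known, odd-length, normalised smoothing kernel (Richardson-Lucy needs both). The function names are illustrative and not from any library, and the forecasting step in between is left out. Whether forecasting the smoothed series actually helps is not something this sketch tests.

```python
def convolve_same(signal, kernel):
    """Convolve with a centred, odd-length kernel; edges are zero-padded."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0.0
        for j, k in enumerate(kernel):
            idx = i + half - j
            if 0 <= idx < len(signal):
                total += signal[idx] * k
        out.append(total)
    return out


def richardson_lucy(observed, kernel, iterations=50, eps=1e-12):
    """Iteratively estimate the non-negative signal that `kernel` blurred."""
    flipped = kernel[::-1]
    # Start from the observed data; eps keeps the estimate strictly positive.
    estimate = [max(v, eps) for v in observed]
    for _ in range(iterations):
        blurred = convolve_same(estimate, kernel)
        # Ratio of what was observed to what the current estimate predicts.
        ratio = [o / max(b, eps) for o, b in zip(observed, blurred)]
        # Correlate the ratio with the kernel (convolution with the flip).
        correction = convolve_same(ratio, flipped)
        estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate
```

For example, blurring `[0, 0, 0, 10, 0, 0, 0, 5, 0, 0, 0]` with `[0.25, 0.5, 0.25]` and running `richardson_lucy` on the result sharpens the two peaks again. With real, noisy data the iteration count acts as a regulariser, so more iterations are not automatically better.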

>> No.10807716
File: 84 KB, 732x924, Data%2C_2366.jpg

This needs to be the OP image of the next thread.

>> No.10807735

>>10805292
Go for Elements if you already have some stats/ML knowledge I'd say.
Also,
>be bioinformatics student
>pretty much everything in the subject I'm doing my thesis in is written in R
I've learned to appreciate R but I'd like to take the Julia pill at some point as well.
>>10807551
>>10807558
Seconding this, specifically GLMs and Bayesian learning. Doesn't necessarily have to be books though; notes/reviews would be cool too.

>> No.10807749

Am I a bad data scientist if I don't know shit about statistics, but am really good with optimization/regression/signals processing?

>> No.10808158

>>10807687
Starts out introducing the field. Then goes through time domain analysis, frequency domain analysis, state space representation+kalman filtering. ARIMA is a part of time domain for instance.

>> No.10808161

>>10807558
Look up Agresti's books

>> No.10808165

>>10807749
Haven't you done statistical signal processing?

>> No.10808169
File: 74 KB, 408x634, 1562430535828.png

Show me interesting or shitty stat representations

>> No.10809779

>>10807749
Depends on what you define as "not knowing shit"

>> No.10810127

>>10805292

>Theoretical Statistics: Topics for a Core Course
>Mathematical Statistics - Jun Shao
>Statistical Inference George Casella and Roger L. Berger

Which did it best, anons? Everything else is for engineer retards who can't into proofs

>> No.10810132

>>10809779
I took an introductory stats course in college and got a B-. That was the last stats course I took. Really good at amath, though.

>> No.10810509

Wtf bros I'm a statistician and I'm always cucked from jobs by 'data scientists'

Trying to swap now, any advice? I'm still doing my master's in stats but I really need to be employable soon

>> No.10810550

>>10810509

Just learn programming and you'll be just as good or better.

>> No.10810691

Explain ARIMA to me like I'm a retard

I don't get how it works and I may need to implement it soon

>> No.10810882
File: 6 KB, 224x225, ohfug.jpg

>>10810691
>not in sklearn

>> No.10812610

bump

>> No.10812616

Today I wrote code to make accurate weekly forecasts years in advance using only ten randomly scattered datapoints over a period of six months. Get on my level, plebs.

>> No.10814120

>>10812616
Forecasts of what, fag

>> No.10814131

>>10814120
How many of each product will be sold within each zipcode of the US.

>> No.10814136

>>10814131
>zero units sold every month because the product is imaginary

Genius

>> No.10814225

How much actual statistics do I need to know for a data analyst job? I'm good at kaggle competitions, I know statistical learning theory, but stuff like hypothesis testing goes over my head, haven't bothered.

>> No.10814410

>>10810127
I just downloaded statistical inference and it wasn't as heavy on regression as I was hoping. Do you know any book that covers multivariate regression well?

>> No.10815977

>>10814410
Just learn optimization m8.

>> No.10815996

>>10815977
I just downloaded Convex Optimization by Boyd and Vandenberghe. Will convex optimization help me estimate functions from datasets?

>> No.10816005

OP, ISLR is a great place to start and covers concepts really well.
It is EXTREMELY light on math though, so don't think you're getting the full treatment.

t. took an actuary exam that used ISLR as part of its material, but went way beyond it in math

>> No.10816014

Hey guys, I'm in a program called SharpestMinds. They connect you with a mentor and plug you into a hiring network.
I'm not here to advertise for them, just saying if you are a pretty well-qualified guy who's knocked out some MOOCs, but you still can't land a job, SM might be what you're looking for.

*Full disclosure you'll have to give them 10% of your first year's salary.*

*Full disclosure you have to find a mentor who will actually take you on, which means passing a coding and stats challenge*

*Full disclosure you should basically be 80% of the way there in qualifications*

However, they don't get paid anything until they get you a job. So it's a cool concept.
Currently they're averaging a new mentee hire every 3 days.

Just to be clear, it is fucking hard to get an entry level DS job if you don't have experience. And even if you get the interview they'll probably keep lobbing technical questions at you until they find a way to trip you up.

>> No.10816028

>>10810509
>doing my master's in stats but I really need to be employable soon

Don't sweat it man. Take a breather. Do some MOOCs for data science and maybe a bootcamp. Yes the bootcamp's expensive but with an MS Stats you can start out 100k as a data scientist. Just look for more stats-oriented roles.
t. was in the same boat as you a year ago

>> No.10816088

>>10816028
Not him but can you get a decent position with just a certificate and no degree? I'm currently a Junior in a B.Sc Materials Science and Engineering program but I'm not being challenged at all and I'm honestly struggling to afford school. To keep it short, I'm looking to make decent money with the certificate (~$60k) to pay off debt, finish undergrad, and then research AI applications to Mat Sci.

>> No.10816097

>>10816088
No man, to get a data science job you often need a grad STEM degree.
I'm not saying it's absolutely necessary, I've known some good data scientists with just a bachelor's in marketing. But employers won't take you seriously without the bachelor's.
If I were you I'd just lighten the courseload and go work in a restaurant bro. You're not in any rush and sometimes doing that a few years can do you some good.

>> No.10816175

>>10816097
Well, what if it's a bachelor's in time away status? I figured the biggest push for any applications I put forward would be my projects rather than my certificate; will they really trash my application in spite of my projects simply because I don't have a degree?

>> No.10816263

>>10814225
Very little stats, from experience it's mostly data cleaning. Stuff like hypothesis testing will hopefully come with practice, it's unintuitive and only makes sense if you understand data generation processes (i.e. where the test statistic comes from). A first course in probability should give you that knowledge. After that, you can push further by studying decision theory and stuff.

>> No.10816357

>>10815996
It's useful for finding the maximum likelihood estimator in the convex case.

>> No.10816369

>>10816357
That's not what I need. I intend to learn convex optimization eventually so it's still nice to have the book I guess.

>> No.10816376

>>10816369
MLE is used for time series analysis, but you aren't doing that?

>> No.10816930

>>10805292
bump

>> No.10816942

Can someone intuitively explain how lasso and ridge differ from normal linear regression? I know that they use some absolute/squared penalty, but what exactly does it penalize?

>> No.10817010

>>10816376
It's more that my data isn't convex unless I fuck with it or add another dimension, but maybe I'm misunderstanding. The dataset that I'm trying to work with is five dimensional and some of the dimensions are pretty sparse. I was going to try to use Bayesian regression or whatever looked most suitable after I had learned enough statistics.

>> No.10817052

>>10816942
Yeah man.
Ridge and Lasso introduce a penalty on the values of the linear regression coefficients, and try to shrink certain coefficient values in order to reduce the RMSE. Ridge uses the L2-norm and Lasso uses the L1-norm.

Lasso can completely eliminate irrelevant coefficients, ridge can only make them very small.

It's covered quite well in the OP book actually (which is free, just google ISLR)
This guy has a bunch of good videos
https://www.youtube.com/watch?v=5asL5Eq2x0A
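
A rough sketch of the "shrink vs. eliminate" difference, under a strong simplifying assumption: the predictors are orthonormal, so each penalized coefficient has a closed form in terms of the ordinary least-squares one. The objectives assumed here are RSS + lam * sum(b^2) for ridge and RSS + lam * sum(|b|) for lasso. Function names are made up for illustration; with correlated predictors you would need a real solver.

```python
def ridge_shrink(ols_coefs, lam):
    """Ridge with orthonormal predictors: scale every coefficient down."""
    return [b / (1.0 + lam) for b in ols_coefs]


def lasso_shrink(ols_coefs, lam):
    """Lasso with orthonormal predictors: soft-threshold at lam / 2."""
    threshold = lam / 2.0
    out = []
    for b in ols_coefs:
        magnitude = max(abs(b) - threshold, 0.0)
        if magnitude == 0.0:
            out.append(0.0)  # small coefficients are removed outright
        else:
            out.append(magnitude if b > 0 else -magnitude)
    return out
```

With OLS coefficients `[3.0, -0.4, 0.1]` and `lam = 1`, ridge gives `[1.5, -0.2, 0.05]` (everything halved, nothing zero), while lasso gives `[2.5, 0.0, 0.0]` (the two small coefficients are dropped).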

>> No.10818242

>>10817010
>my data isn't convex unless I fuck with it or add another dimension
splain

>> No.10818294

>>10818242
One of the dimensions strictly increases over time (with a few exceptions) and the rest are dimensions of time

>> No.10818317
File: 113 KB, 356x249, I_don't_need_it.png

>>10810882
fuggg

>> No.10818598

>>10810550
Programming is like literacy in a bunch of fields these days.

>> No.10819140

>>10818294
Show us an example.

>> No.10819150

i did a data science masters.
my job title is data scientist.
but all i ever do at work is descriptive statistics (which a highschooler could do).
And i am not allowed to use python. I have to use all these gayass proprietary languages that nobody else uses.

FUG

>> No.10819262

>>10810691
Start with learning ARCH and GARCH models, do a few calculations by hand. Then ARIMA will begin to make sense!
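
For reference, a stripped-down sketch of the simplest non-trivial case, ARIMA(1,1,0): difference the series once (the "I"), fit a one-lag autoregression to the differences by least squares (the "AR"), then forecast the differences and add them back up. There is no moving-average term here, and the function names are illustrative only. A real implementation would use maximum likelihood and handle the MA part.

```python
def difference(series):
    """First differences: the 'I' (integrated) part with d = 1."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(x):
    """Least-squares fit of x[t] = c + phi * x[t-1] + noise."""
    prev, curr = x[:-1], x[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var  # fails if the differences are all identical
    c = mean_curr - phi * mean_prev
    return c, phi


def forecast_arima_110(series, steps):
    """Forecast `steps` values ahead with a hand-rolled ARIMA(1,1,0)."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = c + phi * last_diff  # predict the next change
        level += last_diff               # undo the differencing
        forecasts.append(level)
    return forecasts
```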

>> No.10819383

>>10819140
I don't have one handy. I want to predict gross sales as part of some scheduling software I'm trying to make. I'm starting by breaking the time dimension up in a way that I think will make the data smoother and easier to predict, specifically as year, day of the year, day of the week, and hours. I haven't learned enough yet to know if this approach is flawed or invalid, but I figured that representing the data this way would be better for dealing with holidays and outliers. How far off am I, and what books do I need to read?
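
The decomposition described above could look something like this, using only the standard library. The dictionary keys and the function name are placeholders, not anything from the poster's code.

```python
from datetime import datetime


def time_features(ts):
    """Split a timestamp into the four components named in the post."""
    return {
        "year": ts.year,
        "day_of_year": ts.timetuple().tm_yday,  # 1..366
        "day_of_week": ts.weekday(),            # Monday = 0, Sunday = 6
        "hour": ts.hour,
    }
```

`time_features(datetime(2019, 7, 4, 15, 30))` returns year 2019, day of year 185, day of week 3 (a Thursday) and hour 15.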

>> No.10821230

>>10805292
Nice

>> No.10821291
File: 137 KB, 798x1200, phenotype.jpg

>>10805292
>Daniela Witten

>> No.10821417
File: 23 KB, 692x526, 1559334643687.jpg

>working around codemonkeys
>don't really know what I'm doing
>use the phrase "linear regression"
>they all act super impressed like it's some kind of AI wizardry
So this is the true power of data science

>> No.10822208

>>10821417
My manager once asked me to do linear regression for a classification problem. 90% in professional jobs are retarded which is why the data science meme is possible

>> No.10822212

>>10819150
Are you using SAS?

>> No.10822236

Hey anons. Looks like I'll be implementing a conv net in Keras pretty soon, working off an architecture described in a paper. I'll have help from my lead, who has more experience with this kind of thing. Personally I've worked with neural nets a good bit, but never from the ground-up like this. Any pointers?

>> No.10822835

>>10822236
Hyperparameter tuning is your friend. You can design the world's best model, but it won't do anything if you give it the wrong hyperparameter values. Use this https://scikit-optimize.github.io/#skopt.gp_minimize

>> No.10823331

What is the job title of someone who develops software systems and implements/integrates ml components into them?

>> No.10824130

>>10822835
How does it compare to RandomizedSearchCV for example? I use that to narrow down the choices before a small GridSearchCV. How do you guys find parameters?

>> No.10824582

>>10824130
It finds the optimum after ~20 samples instead of 200+ samples.

>> No.10825117
File: 171 KB, 618x380, 18437217314.jpg

rec me some book on probability and statistics for morons that doesn't bring in american textbook style 800 page bullshit, their own applets and whatnot

>> No.10825125

>>10805292
I've read that one in class. It was decent.

>> No.10825134

>>10816088
check out citrine.io, they're a big name in your area of interest.

>> No.10825298
File: 22 KB, 348x499, r.jpg

>>10825117
I found this was a good practical book. Not too rigorous and gives you a lot of practice with applications with datasets.

>> No.10825495

>>10823331
Machine learning engineer my dude. Pay is good.

>> No.10825503

>>10825298
this made my balls shrivel up

>> No.10826320

>>10825503
>this made my balls shrivel up
what does this even mean?

>> No.10826324

>>10826320
programming and stats are against life itself

>> No.10826327

>>10826324
programming + stats = data science

>> No.10827093
File: 56 KB, 768x960, 714827381.jpg

>>10825298
>start reading
>babby language
>check website
>department of biological sciences

>> No.10827099

>julia
Stop using meme languages

>> No.10827154

>>10827099
Start using good languages.

>> No.10827166

>>10826327
yes that’s right >>10827154

>> No.10827201

>>10819383
You are on step one.

About a hundred more to go from there. Good luck !

>> No.10827665

>>10827201
That's not helpful

>> No.10827672

>>10807616
>UQ

m8

>> No.10827893

>>10810127
Casella or Mood.

>>10814410
Seber, Lee. Or Applied Multivariate Statistical Analysis by Johnson and Wichern.

>> No.10828041

>>10827893
>Applied Multivariate Statistical Analysis by Johnson and Wichern
This is excellent. Thanks.

>> No.10828276

>>10827093
What did you expect? It’s an intro book for non-stats students who want to pick up just enough data analysis for their own studies.

>> No.10828300

>>10828276
>pick up just enough data analysis for their own studies
This is the reason why psychology is a meme science with <30% reproducible results.

>> No.10828420

>>10817052
You're retarded. Ridge and Lasso regularization have no correlation with RMSE. They're correlated with a reduction in variance.

>> No.10828451

>>10828300
dilate

>> No.10828533

>>10828420
they reduce rmse on the test set

>> No.10828897

>>10828533
that's not necessarily true. if you apply l1/l2 norm on an already underfit dataset, you're not going to get a lower rmse

>> No.10828942

>>10828451
cope harder

>> No.10829580

>>10828533
Why do you think penalized terms in objective function would reduce rmse?

>> No.10829884

The virgin statistician vs the chad optimization expert.

>> No.10830177

I just finished reading this book, took me about 1 ½ months without doing the exercises.
I think I understood 60 - 65 % of what I read, does this make me too brainlet for ML/stats?

>> No.10830995

>>10819383
Not sure about the nature of your problem but if we are talking about sales that can be made 24/7 (e.g. online website sales) then it might be a good idea to warp the intraday time dimension so that 11:55PM and 01:30AM do not have a larger distance between them than, say, 1:30PM and 3:05PM. That is of course if the optimisation method of your choice depends on a distance metric. The warping may be done by engineering a time feature through polar coordinate transformation.
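
One common way to do that warping, sketched with the standard library: map the time of day to an angle and use the (cos, sin) point on the unit circle as two features. The function names are illustrative. The same trick applies to day of week or day of year by changing the period.

```python
import math


def encode_time_of_day(hour, minute=0):
    """Place a clock time on the unit circle so that midnight wraps around."""
    minutes_per_day = 24 * 60
    angle = 2 * math.pi * (hour * 60 + minute) / minutes_per_day
    return (math.cos(angle), math.sin(angle))


def distance(a, b):
    """Euclidean distance between two encoded times."""
    return math.hypot(a[0] - b[0], a[1] - b[1])
```

In this encoding, 23:55 and 01:30 are exactly as far apart as 13:30 and 15:05, since both pairs are 95 minutes apart on the clock. On the raw hour axis the first pair would look more than 22 hours apart.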

>> No.10831019

>>10822208
It will take years for management to catch up to data science concepts - if they ever do. There are some that try it through courses aimed at managers (you see more and more schools cashing in on it, like MIT with their "AI" course). Ultimately though, it will remain the task of the data scientist to adequately and interestingly "pitch" an idea to management such that they understand it and think it makes them look fancy at the same time. This is not easy to do. You can usually distinguish the quality of a data scientist by his/her ability to be able to explain advanced concepts on a retard level.

>> No.10831043
File: 50 KB, 381x500, 51vPMQ3gJWL.jpg

I know this book gets some hate from data scientists but if you are new to the ML field and want to get a light-weight, easy to read, hands-on and relatively wide overview of it, I can really recommend it. If someone asks what books to read as a beginner to ML, this is always at the top of my list. Everything else goes from there if you wish to deep dive into individual topics / math / models.

>> No.10831085

Does anyone have experience with unsupervised machine learning and what has been your mental state after this exposure? I am currently applying unsupervised anomaly detection at work and I have to say the combination of unsupervised + anomaly detection in a low signal to noise environment just drains the life out of you - it's beyond soul crushing.

>> No.10831155

>>10828897
of course thats why you do proper model/hyperparameter selection
>>10829580
can reduce overfitting

>> No.10831175

How in the everloving fuck can anyone like data engineering. Cleaning data is the most mindless and soul-crushing programming experience I've ever had.

Mathematical modelling and comparing model types is great fun, but sorting through shitty data is a nightmare. How does /data/ get through it?

>> No.10831182

>>10831175
I guess every job has its downside elements - data handling and cleaning is definitely one of them in the data science / ML field.

>> No.10831247

Senior undergrad statistics major here, been messing around with ML in R for a while. Anyone know any good ways to make a little extra cash with ML?

>> No.10831578

>>10830995
>if we are talking about sales that can be made 24/7 (e.g. online website sales) then it might be a good idea to warp the intraday time dimension so that 11:55PM and 01:30AM do not have a larger distance between them than, say, 1:30PM and 3:05PM
It's not 24/7, but that's an interesting consideration. I probably want to do that for the weekday and day of the year. Are there any other considerations I might have missed?

>> No.10831645

>>10830177
If you do the exercises you'll understand 90-100%. Work it bitch

>> No.10832860

>>10831175
Some of the database and automation stuff is somewhat interesting, and setting up a fully automated data pipeline is pretty satisfying

In addition, the Data Scientists who understand your work are your customers rather than a load of salespeople or management who don't have a clue

But it seems like a pretty thankless job tbqh

>> No.10832866

Can someone actually explain the advantages of Julia over R without memeing?

>> No.10832916

>>10832866
SUUPA SUPEEDO

Also nice REPL

>> No.10833044

>>10832916
And here i thought R was already quite quick with tabular data, linear algebra, inverting matrices etc. Always seemed faster than Python to me

>> No.10833056

>>10833044
Tbqhwu it depends what you're doing

Julia can be a little annoying in that you have to compile packages before use, but once you do it's fast as fuck

So it depends how much speed you want. Languages like Q are designed for use cases where speed is the only consideration, R is for when you don't give a fuck about speed

>> No.10833426

>>10805292
Sysadmin with experience in Python here.
Where do I start? I tried getting into ML and data science using a crash course made by Google, but it got hard to follow very quickly.

>> No.10834224

what is "data science"
Is it related to CS

>> No.10834235

>>10834224
>Is it related to CS
no.

>> No.10834255

>>10834224
Nah

HR cunts think it is though

>> No.10834824

Is professor Han at UIUC good? He has a book with 10000+ citations, but is he good at instructing?

>> No.10834917

>>10831175
they get paid pretty well.

t. a data engineer who makes more than the data scientists at my company

>> No.10835104

>>10834917
what's the difference between the titles?

>> No.10835443

>>10835104
Data engineers just store things. They don't really interact with the data at all.

>> No.10837114

What do you do when your new job has spent millions of dollars on a new project they hired you to lead, only for you to discover that they didn't really know what they were doing, so they built something totally useless?

>> No.10837197

Fucking spent hours working on this webscraper

It turns out beautifulsoup won't work if there’s a slash / on the end of this address.

Anyone doing a data science project? I’m scraping property data from a public website

>> No.10837201

>>10831175
It’s not just data cleaning, it’s a lot of pipelining data around. And at the end of the day that’s pure computer science/software engineering
So there are guys with CS PhDs who work with Luigi and Airflow and stuff just setting up machine learning jobs and making sure it all runs

>> No.10837678

>>10837197
are you using beautifulsoup to fetch the page?
the way i usually use it is use another http client to get the HTML as a string then pass that to beautifulsoup to do the parsing, it shouldn't care about the address that way except for resolving links on the page
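
A sketch of that split between fetching and parsing. To keep it runnable without extra installs, the standard library's `urllib.request` and `html.parser` stand in for requests and BeautifulSoup here; that swap is mine, not the poster's setup, and `LinkCollector` / `extract_links` are made-up names. The parser only ever sees a string; the base URL is used purely to resolve relative links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (needs network access)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Parse an HTML string; no network traffic happens here."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

One place where a trailing slash does change behaviour is link resolution: a relative link `item/1` resolves to `https://example.com/listings/item/1` against `https://example.com/listings/`, but to `https://example.com/item/1` against `https://example.com/listings`. Whether that is what bit the poster above is a guess.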

>> No.10838206

Is linear programming /data/?

>> No.10838745
File: 13 KB, 160x373, snap.jpg

>>10837678
I use requests to fetch the page then parse it with beautifulsoup

Right now I'm making a separate .json file for every property in the database, and there are about 250k. I'm gonna have to run this over multiple nights to fetch all the data.

In the meantime I'll do a little EDA of the little observations I have.

>> No.10839571
File: 211 KB, 664x701, 1546105196146.jpg

Just started my MSc Mathematics & Statistics. I'm studying it while I work as a Data Analyst (SQL, Python kinda stuff). Coming from a B. IT (CS but with no Math) hoping this will help me elevate my career to the next level.

Is this the right move? I turned down a Master of Data Science because I felt it was a bit of a meme degree.

>> No.10839606

>>10839571
Depends on what type of work you'll be doing during your msc

>> No.10839659

>>10839606
It's MSc by coursework. The topics are pretty much a Bachelor of Mathematics with an applied stats focus, but with room for directed courses, which I'm going to fill with ML courses.

>> No.10839822

>>10805292
The startup I'm working in is growing quite fast, it's clear that in one year or less the tools we are currently using for data analysis won't be able to keep up. So what tools do you guys use? We use mainly BigQuery

>> No.10839908
File: 859 KB, 1296x797, mathMemes.png

>>10831043

If I do some of the projects in there, can I land myself an ML software engineer or data scientist role? (Assuming a BS in CS)

Also,
>inb4 CS Meme shitposting

>> No.10840502

Is there a way to apply attention mechanism to outputs of multiple LSTM encoders but combine it into a single LSTM decoder?

>> No.10840520

>>10839571
>I turned down a Master of Data Science because I felt it was a bit of a meme degree.
>he actually listens to what the NEETs on 4chan say
Loser

>> No.10840527

>>10805292
>Fuck R
No, fuck you

>> No.10840537
File: 6 KB, 403x178, application.png

>he wants to land a job in data science
just fking LOL

>> No.10840539

>>10840527
this desu

>> No.10840544

>>10831175
You're right that a lot of it can be soul crushing data munging. I much prefer the data science side of it, and still do it as often as I can, but unfortunately data engineering pays better.

I absolutely hate database modeling and SQL, but actually designing and building data pipelines can be fun. My statistical learning is pretty solid, but I need to get better at the hardcore ML deep learning stuff, and hopefully I can pivot into ML/AI Engineer of some sort

>> No.10840551

>>10833056
Fuck base Julia and base R for a quick sec.

Consider R in the context of Tidyverse, and the breadth of robust and well performing packages available for stuff from PCAs, to nonlinear modeling, and network analysis, how the fuck can anyone claim that Julia is better than R?

>> No.10840633

>>10837197
You mean Requests? BS parses HTML responses, you shouldn't need the address

>> No.10840671

I'm working on time series analysis and I need to find a transformation that maps sequences of various lengths to the same length. Tried Fourier transform already, seems to work to a certain degree. Dynamic time warping is not an option, my boss will rip my head off and shit in my throat if I suggest this one more time (and he is probably right).
Do you maybe have an idea?
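
One simple baseline, offered as an assumption rather than something from the thread: resample every sequence to a fixed length with linear interpolation. This is just a uniform stretch of the time axis, so it only makes sense when the length differences are small and spread evenly, not when the timing drifts locally. `resample_linear` is a hypothetical helper name.

```python
def resample_linear(seq, target_len):
    """Stretch or squeeze `seq` to `target_len` points by linear interpolation."""
    if len(seq) < 2 or target_len < 2:
        raise ValueError("need at least two points in and out")
    last = len(seq) - 1
    scale = last / (target_len - 1)  # source index step per output point
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = min(int(pos), last)     # left neighbour, clamped for safety
        hi = min(lo + 1, last)       # right neighbour
        frac = pos - lo
        out.append(seq[lo] * (1 - frac) + seq[hi] * frac)
    return out
```

For instance, `resample_linear([0, 10], 5)` gives `[0.0, 2.5, 5.0, 7.5, 10.0]`, and a 101-point sequence can be brought down to 100 points with the first and last values preserved.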

>> No.10840788

>>10840551
They're claiming base Julia is better than base R, the fact that R and Python have better ecosystems is why Julia isn't that usable yet

>> No.10841030

>>10840671
What's wrong with DTW? Without knowing that, I'm not sure how to help.

>> No.10841071

>>10841030
The data in between the recognized correspondence points can't be used.

>> No.10841078

>>10841071
So only certain subsequences are correlated? Can you post a graph?

>> No.10841086

>>10841078
Unfortunately I can't post a graph because everything is in the company network. They don't allow me to take the laptop with me and I'm 2 hours away from my work place.

>So only certain subsequences are correlated?
Exactly! One subsequence is like 1% longer than the other. Not a big deal, but I can't feed that to machine learning models. I can't even do point-wise comparisons.

>> No.10841582

>>10840788
But in the modern age of programming, it doesn't even make sense to consider a language outside of its ecosystem. Base Julia may be faster than base Python or base R, but things like Pandas, numpy and Tidyverse are C optimized, and Julia isn't much faster, if at all, than a given C subroutine.

There is utility in the entire world speaking English in the same way that there could be utility if we all shared the same programming languages (i.e. growth in ecosystems). There are always ways to optimize and improve programming, but Julia offers nothing that makes it revolutionary to me outside "it's faster".

>> No.10841719

>>10841582
There's an enormous cost to jumping back and forth between interpreted and compiled code multiple times within a single line. You will never get "native performance" with something like numpy. You'll just get performance that isn't absolute dogshit.

>> No.10841729

>>10841582
>Julia offers nothing that makes it revolutionary to me outside "it's faster".
The entire language has end-to-end autodifferentiation. You can write an arbitrary piece of code and turn it into a differentiable model, loops, if statements, and all.

>> No.10841811

>>10840551
>>10840788
>>10841582
Will R die once other languages start to get the same amount of packages?

>> No.10841826

>>10841811
R is already dead.

>> No.10841844

>>10841826
Any sources for this?
I'm a biofag who spent the last 5 months using R, it's even better than my python now.
I don't want all these R-skills to be for naught bros.

>> No.10841851

>>10841844
this please help me I don’t want to learn another language

>> No.10841897

>>10841851
>>10841844
>>10841826
>>10841811
Data scientist in silicon valley, R is nowhere close to dying. The R ecosystem has improved tremendously even in the past 2 years, as has the cross-communication between R and Python.

>> No.10842621

>>10840520
Really depends on the program. I've seen some people with masters of data science from a degree mill and their coursework was literally in excel and power bi, no real technical meat to it.

>> No.10843006

>>10841811
Naa, R is a specialized tool for doing stats aimed at non-programmers, so it will always be relevant.

>> No.10843051
File: 166 KB, 884x1364, 71hX4xNc9NL.jpg

Anyone wanting to share some experiences with this?

>> No.10843059

If you want to be a BIG DATA code monkey like the pajeets you barely even need An Introduction to Statistical Learning since most of the major tools are now bundled kits that you can run out of the box.
If you want to actually LEARN something you pick up statistics books. Machine Learning was literally a solved field in Statistics in the 70s

>> No.10844957

>>10843059
>Machine Learning was literally a solved field in Statistics in the 70s
Imagine being this clueless.

>> No.10845040

how is this thread still going - what did I miss?

>> No.10845217

>>10845040
Not much. Juliafags screeching "no u" and some "bros, how do i x?" posts.

>> No.10845260

My college offers a "master’s in applied AI” from the electrical engineering department. Is it worth it?

>> No.10845403

>>10845260
Depends on your skills and interests, what you would do otherwise, as well as the college.
Probably though

>> No.10845572

>>10845260
How is anybody fucking supposed to know that? Take a look at the curriculum and decide for yourself. That being said, the fact that they're using the word AI instead of applied statistics or applied machine learning is a danger sign.

>> No.10845601

I'm testing out an unscented Kalman filter using the filterpy library for Python. Wondering if the pykalman library is better here.

Regarding the design of my filter, I am trying to figure out 2D position from a mix of 1D velocity, angle, angle rate of change, and 2D acceleration data. The book I read said that more data is always better, but I'm having a hard time tuning the filter in a way that it doesn't diverge quickly. I am feeding it data at 500 Hz and using RTS smoothing on the filtered data. Is it too tedious to tune the filter with this much data, or is more data always a good thing?

>> No.10845602

>>10843059
I wonder if this is a troll or if you're really this uninformed.

>>10845260
I think it would get you a job in data science.
However if I were you I would choose something more theoretical or science-heavy for grad school.
If you have a related STEM background, you just need some MOOCs and motivation to learn data science. But if you go to school for just data science it kind of narrows your horizons.
If you have a good background in generalized linear models and statistics you are 80% of the way to understanding most machine learning algorithms.

>> No.10845604

>>10843051
Monitoring, looks interesting as fuck actually

>> No.10845675

>>10840551
>Fuck base Julia and base R for a quick sec
>"uuh, ignoring the fact that Julia is inherently better..."
R and python are garbage and only survive because of brainlets and boomers

>> No.10846348

>>10843059
Data science is more optimization than statistics. Git gud.

>> No.10846412

Looking for an IQ (Can be ASVAB or old SAT) to Income dataset to do some testing on.

Thanks.

>> No.10846898

How good is the book in the OP? I'm taking an Introduction to Data Science and Machine Learning course this fall but the majority of the course will be based on the lecture notes. They have that book as a reference and I'm wondering if it is worth the read.

>> No.10846923

>>10846898
Very easy, short and free book
Perfect non-pop-sci introduction.
https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf

>> No.10847247

>>10834224
statistical analysis of stochastic time series

>> No.10847714

>>10846412
Ask /pol/.

>> No.10847731

hi anons

I am an actuarial science graduate, I want to start on the path of data science using Python, what books and courses do you recommend?

I prefer books because I get bored watching videos, but please do recommend a course if you've got a good one. thanks

>> No.10847735

>>10805292
Read it for a graduate class at UT Austin. Personally, I just didn't like it. Andrew Ng's course notes were clearer, more concise.

>> No.10847764

>>10816014
10%, big ooeff mr goldstein, thats a rough cut of the top.

>> No.10848314

>>10847731
dataquest.io is good, it has no videos.

Honestly if you want to get handy with Python and data science, you're going down a long road and you're gonna need lots of practice. So you need to get up and code every day on different sites like Udacity, dataquest, datacamp, edabit, etc.

I did 3 actuary exams myself. If you took SRM you should have a great understanding of data science topics. If you haven't taken SRM, take it.

>> No.10848599

>>10841826

Nope. Matlab is dead. R is the future.

>> No.10848939

>>10841897
>>10841851
>>10841826
>>10848599
Statistics masters student here, saying R is dead is like saying Java is dead.

>> No.10849073

>>10843051
Introductory Econometrics by Wooldridge isn't bad, that's what we used at my uni for econometrics and it'll teach you the basics of regression, time series etc. While it has a fair bit of math, it explains it pretty well. Plus you can get it as well as the datasets used for free online after 2 Google searches.

The data comes in Stata format but Python+pandas or R can read it easily. I believe that there's actually a Python module which contains the data sets so you can import whatever you want right in, like in scikit.
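
For what it's worth, reading a Stata file with pandas goes through `pandas.read_stata`. A minimal sketch, assuming pandas is installed; the file name `wage_demo.dta` and its two columns are made up here (the file is written out first so the snippet is self-contained), they are not one of the book's actual datasets:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for one of the textbook's .dta files.
toy = pd.DataFrame({"wage": [3.10, 3.24, 6.00], "educ": [11, 12, 16]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wage_demo.dta")  # made-up file name
    toy.to_stata(path, write_index=False)      # write a Stata file
    df = pd.read_stata(path)                   # read it back with pandas

print(df.describe())
```

With a real download you would only need the `pd.read_stata(path)` line, pointed at wherever the file lives.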

>> No.10849076

>>10847735
>UT Austin
ayyyyy

>> No.10849160

Say I'm working on a non-convex optimization problem and I want to try some regularization, how do I compute the norm of a function in its reproducing kernel Hilbert space? How do I select a proper kernel?
This is kinda out of my area of expertise and I'm having some trouble figuring it out.
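
For reference on the norm part: when f lies in the span of kernel sections, f = sum_i alpha_i k(x_i, .), its squared RKHS norm is alpha^T K alpha, with K the Gram matrix K_ij = k(x_i, x_j). A minimal numpy sketch, assuming a Gaussian (RBF) kernel; the kernel choice, the bandwidth `gamma`, and the points and coefficients are purely illustrative, and none of this says which kernel suits the actual problem:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clip tiny negative round-off

def rkhs_norm_sq(alpha, K):
    """Squared RKHS norm of f = sum_i alpha_i k(x_i, .), i.e. alpha^T K alpha."""
    return float(alpha @ K @ alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # made-up sample points
alpha = rng.normal(size=5)    # made-up expansion coefficients
K = rbf_gram(X, gamma=0.5)
print(rkhs_norm_sq(alpha, K))
```

Used as a regularizer, this quantity would be added to the objective with a tuning weight, which is the usual kernel ridge setup.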

>> No.10849673
File: 1.87 MB, 331x197, 1558373551380.gif [View same] [iqdb] [saucenao] [google]
10849673

>>10841582
>tidyverse is C optimized

>> No.10849681

>>10843059
Top tier brainlet post. Go read Pattern Recognition and Machine Learning

>> No.10850034

Please help me, I think I'm gunna get fired...

I have taken one intro to statistics in undergrad. No one in my research group is a statistician. I hope we hire an epidemiologist in the future.

I have the number of injuries at 15 hospitals (1 of the hospitals is the "hospital of interest"), as well as the corresponding demographic data of each injured patient (age, gender, date of injury, cause of injury, level of injury, date of death). These are the only hospitals in the country, and each serves a known region with a known age- and sex-stratified population (info I can get from gov stats website).

My objectives are as follows:

(1) get a general picture of the situation in terms of age, gender, cause of injury, level of injury at (a) each hospital and (b) nationwide for all 15 hospitals combined.

(2) compare the hospital of interest's age, gender, cause of injury, level of injury with (a) each hospital and (b) nationwide for all 15 hospitals combined.

I have 6 years of data, so I was thinking of looking at the variables annually. Aside from this I'm really out of my element here and I've been reading similar studies (though there really aren't that many) to learn more.

Any help/suggestions is appreciated.

>> No.10850065

Holy shit this thread reads like a math/computerscience drop out meeting.
Lmfao.

Imagine thinking you are smart because you know Python, Statistics (lol) and linear algebra lmfao

>> No.10850103

>>10850065
you have no idea how easy it is to impress managers with something as simple as PCA.

>> No.10850107

>>10850103
The insane part is, 10 years ago people who wrote in Python were seen as literal garbage. I can't understand why that changed.

I think there are just too few people left that actually know how to code, computer science and math so that all the drop outs and losers with no math background can only into Python and easy unoptimized shit.

>> No.10850139

Book in OP:

> "This book provides an introduction to statistical learning methods. It is aimed for upper level undergraduate students, masters students and Ph.D. students in the non-mathematical sciences. The book also contains a number of R labs with detailed explanations on how to implement the various methods in real life settings, and should be a valuable resource for a practicing data scientist."

> non mathematical sciences

HAHAHAHAH holy shit. how deep do you have to fall

>> No.10850167

>>10850107
To be fair, many applied fields require fast prototyping of ideas which Python is particularly well-suited for. Programmers for compiled languages usually enter the stage when an idea has been completely fleshed out and ready for implementation.

On the other hand, Python can be twice as powerful in the hands of a user who actually knows the benefits and limitations of, e.g., many of the tools available in scikit-learn. You can easily spot the difference between someone who knows a bit of Python, has done some Coursera courses, and can apply 5-liners found in some blog post, vs. somebody who can actually build custom pipelines for the project at hand, make changes to base estimator classes, etc.
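
To make "custom pipelines" concrete, here is a minimal sketch, assuming scikit-learn and numpy are installed. The `Log1pTransformer` class and the synthetic data are hypothetical, invented just for this illustration; only `BaseEstimator`, `TransformerMixin`, `Pipeline`, `StandardScaler` and `LogisticRegression` are actual scikit-learn names:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class Log1pTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical custom step: apply log(1 + x) to non-negative features."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))

# Synthetic, skewed features; the label depends only on the first column.
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(200, 3))
y = (X[:, 0] > np.median(X[:, 0])).astype(int)

pipe = Pipeline([
    ("log1p", Log1pTransformer()),   # custom step
    ("scale", StandardScaler()),     # stock step
    ("clf", LogisticRegression()),   # stock estimator
])
pipe.fit(X, y)
print(pipe.score(X, y))  # training accuracy, for illustration only
```

Because the custom step follows the fit/transform convention, it should drop into cross-validation and grid search like any built-in transformer.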

>> No.10850187

>>10850167

Yes, that difference is 10 hours at most. This has to be a joke.

>> No.10850958

>>10849160
What in the actual fuck are you working on?

>> No.10850986

>>10850958
Your mom

>> No.10850992

>>10805292
>general
>no tips on how to start studying the subject
Come on, OP.

>> No.10851004

>>10850992
I don't even know where to start. I'm currently doing an MSc in Applied Math.

>> No.10851039

>>10850958
a non-convex optimization problem

>> No.10851604

>>10850034
this seems easy, did you get in a job you don't know anything about?

>> No.10851605

>>10850139
undergrad and master students, (in mathematical sciences)
AND
phd's in non-mathematical sciences.

>> No.10851608

>>10850992
but he posted it.

>> No.10851746

Aussie here, currently working as a BI Dev and have a B.IT. My degree didn't cover any required maths but voluntarily took one base level Calc & Algebra course and really liked it. I wanna do a Masters to elevate myself above the competition in searching for better jobs and hopefully fill in some gaps from my prev. degree. Help me choose bros:

Master of Data Science from UNSW (top aussie uni)
(AUD $48,000)
https://studyonline.unsw.edu.au/online-programs/master-data-science

Master of Computer Science from Georgia Tech (online) (AUD $11,000)
http://www.omscs.gatech.edu/home

Master of Computer Science from University of Illinois (online) (AUD $30,800)
https://www.coursera.org/degrees/master-of-computer-science-illinois

Master of Computer Science from Arizona State University (online) (AUD $22,000)
https://www.coursera.org/degrees/master-of-computer-science-asu

>> No.10851755

>>10851746
Georgia Tech is the cheapest option and it's among the top engineering/CS schools in the U.S.

>> No.10851788

>>10851039
Regularization is for regression. It should not be used if you're just solving a general optimization problem that isn't related to training a model.

>> No.10851849

>>10851605
>>10851605

that's not what it says

>> No.10852038

>>10851605
Ever heard of Oxford comma?

>"I saw a dog, a horse and a cat cross the street."
Means: "I saw three animals cross the street; a dog, a horse and a cat."

>"I saw a dog, a horse, and a cat cross the street."
Means: "I saw a cat cross the street. I also saw a dog and a horse."

There's no Oxford comma in >>10850139 so it literally means that the book is aimed for students in non-mathematical sciences.

>> No.10852054

>>10850034
Basic exploratory data analysis will solve half of that. learn2pandas + use the pandas-profiling library in a jupyter notebook and you can find all that out within an hour or 2.
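
A rough sketch of what that could look like with plain pandas, assuming one row per injured patient. The column names, the hospital labels, and every value below are made up for illustration, and hospital "A" is only a stand-in for the hospital of interest:

```python
import pandas as pd

# Made-up records; column names and values are illustrative only.
df = pd.DataFrame({
    "hospital": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "age": [34, 61, 25, 47, 52, 19, 70, 38],
    "gender": ["M", "F", "M", "M", "F", "F", "M", "M"],
    "cause": ["fall", "traffic", "fall", "sport",
              "fall", "traffic", "fall", "sport"],
    "injury_date": pd.to_datetime([
        "2014-03-01", "2015-07-12", "2016-01-30", "2014-11-05",
        "2017-06-18", "2015-02-02", "2018-09-09", "2019-04-21",
    ]),
})
df["year"] = df["injury_date"].dt.year

# (1) General picture per hospital and nationwide.
per_hospital = df.groupby("hospital")["age"].describe()
nationwide = df["age"].describe()
cause_share = pd.crosstab(df["hospital"], df["cause"], normalize="index")

# (2) Hospital of interest (hypothetically "A") vs all hospitals combined.
is_target = df["hospital"] == "A"
compare = pd.DataFrame({
    "hospital_A": df.loc[is_target, "cause"].value_counts(normalize=True),
    "nationwide": df["cause"].value_counts(normalize=True),
}).fillna(0.0)

# Annual injury counts per hospital, for looking at each year separately.
annual = df.groupby(["year", "hospital"]).size().unstack(fill_value=0)

print(per_hospital, cause_share, compare, annual, sep="\n\n")
```

These are raw counts and proportions only. Turning them into rates would mean dividing by the age- and sex-stratified population of each hospital's region, which this sketch does not attempt.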

>> No.10852063

>>10852038

Oh no an introductory maths book to help people

>> No.10852111

>>10851788
Why not? Regression boils down to an optimization problem, and isn't "training a model" just parameter estimation? My problem looks a lot like non-linear least squares.
Anyway, I don't think this discussion is relevant to my original question about RKHS.

>> No.10852116

>>10852063

> oh no data science is literally a meme for people who failed at math and CS

>> No.10852126

>>10851788
>>10852111
Also, regularization is literally constraints handled with Lagrange multipliers, except the multiplier is a tuning parameter instead of a variable. Why couldn't it be useful in other kinds of constrained optimization problems?
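
Written out as a sketch (the symbols f, g, theta, t and lambda are introduced here for illustration, with f the objective and g the penalty function):

```latex
\min_\theta f(\theta) \quad \text{s.t.} \quad g(\theta) \le t
\qquad\longleftrightarrow\qquad
\mathcal{L}(\theta, \lambda) = f(\theta) + \lambda \big( g(\theta) - t \big), \quad \lambda \ge 0

\text{penalized form:} \quad \min_\theta \; f(\theta) + \lambda \, g(\theta), \quad \lambda \text{ fixed by the user}
```

For convex f and g, under the usual regularity conditions, every budget t matches some lambda. For a non-convex problem that correspondence is not guaranteed, so the penalized and constrained forms can have different solutions.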

>> No.10852226

>>10805292
>take the Juliapill
Julia mode in emacs is not comfy at all. :(

>> No.10853576

bump

>> No.10853608

>>10850139
lol

>> No.10854097

>>10851788
regularization is just applying a penalty to any parameter being estimated.

if you approach your analyses in a Bayesian context, you realize that regularization is just using a partially/weakly informative prior to shrink parameters towards zero/group mean.

see U(-infty, infty) vs Normal(0, 100) vs Normal(0, 1) vs Laplace(0, 1) vs horseshoe e.g. Normal(0, \sigma), \sigma ~ Cauchy+(0, 1).
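
Spelled out for a linear model with Gaussian noise (a sketch under those assumptions; the symbols y, X, beta, sigma, tau and b are introduced here for illustration, with independent priors on each coefficient, and lambda refers to the objective ||y - X beta||^2 + lambda * penalty):

```latex
\hat\beta_{\mathrm{MAP}}
  = \arg\max_\beta \big[ \log p(y \mid X, \beta) + \log p(\beta) \big]
  = \arg\min_\beta \Big[ \tfrac{1}{2\sigma^2} \lVert y - X\beta \rVert_2^2 - \log p(\beta) \Big]

\text{flat prior:}\quad -\log p(\beta) = \text{const}
  \;\Rightarrow\; \text{ordinary least squares}

\beta_j \sim \mathrm{Normal}(0, \tau^2):\quad
  -\log p(\beta) = \tfrac{1}{2\tau^2} \lVert \beta \rVert_2^2 + \text{const}
  \;\Rightarrow\; \text{ridge with } \lambda = \sigma^2 / \tau^2

\beta_j \sim \mathrm{Laplace}(0, b):\quad
  -\log p(\beta) = \tfrac{1}{b} \lVert \beta \rVert_1 + \text{const}
  \;\Rightarrow\; \text{lasso with } \lambda = 2\sigma^2 / b
```

A tighter prior (smaller tau or b) means a larger lambda, i.e. more shrinkage.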

>> No.10854099
File: 16 KB, 375x376, stan_logo.png [View same] [iqdb] [saucenao] [google]
10854099

>>10854097
like, in general, most problems in the actual modeling part of data science are MUCH easier if you approach them from a Bayesian context. most "tricks" in machine learning/data science are just a change in prior choice.

Stan master race.

>> No.10854128

>>10850107
>>10850139
>>10850187
This is the guy in your group who works on weekends and doesn't talk to anybody during lunch. Either that or an undergrad LARPer.

>> No.10854419

>>10852111
>>10852126
>>10854097
You have no idea what you're doing, and you're going to get bad results. Regularization is a prior to enforce regression model simplicity. If you're trying to optimize a non-regression problem, it makes literally no fucking sense to apply regularization. You're not going to converge to the correct optimum because you arbitrarily decided that you want to find answers close to zero, and this will also mess with your objective function. Constraints, if you really do have them, should be handled using barrier functions.

>> No.10854454

>>10819150
>implying this is a bad thing
just rush the easy work and teach yourself some fancy tricks during your downtime, then shoot for a senior role at a different company

>> No.10854469

>>10834224
yes and no. "data science" is a term that includes a lot of topics, including but not limited to:
>web scraping (scripting in python, JS, etc.)
>data warehousing (databases & api development with SQL and an object-oriented lang)
>machine learning (typically python)
>linear and multivariate regression (applied statistics, usually done in R or SAS)
>data modeling (a set of decisions you have to make before writing code)
>data visualization (a meme term for making charts and powerpoints)
>technical writing (explaining numbers to boomers)
a lot of these require programming, but "data scientist" could mean anything from "machine learning dev with high code quality" to "zoomer who uses microsoft office really well and sometimes writes python scripts."

>> No.10854479

>>10851746
Seconding Georgia Tech. University of Illinois is good too, but there's no reason to pay three times the price. That being said, I'd apply to all 4 to see if you qualify for any scholarships.

>> No.10854522

>>10810127
All of these are shit for brainlet engineers. What should a real mathematician (IQ > 190) read?

>> No.10854525

>>10854469
So you’re saying it’s a meme?

>> No.10854563

>>10854522
How to stop sucking dicks and get a job. Written by me.

>> No.10854569

>>10854563
Shut the fuck up NEET. Everyone knows you still suck cocks

>> No.10854743

>>10854419
Why do you try to respond when you clearly don't understand what you are talking about?

>> No.10854749

>>10810127
I took 2 classes on Casella and Berger and can confirm it is indeed hard as fuck.

t. MS Math

>> No.10854759

Should I take an introductory course in machine learning even if I've never programmed much before?

>> No.10854847

>>10805651
Would you say that those 6 week bootcamps are worth it?

I only have a Chemical Engineering masters currently working at a university mostly doing scientific programming, but I have a lot of good GitHub contributions and some minor software dev freelance work. The oil industry slump is scaring me and I desperately want to get into DS/ML.

>> No.10854855

>>10854847
I recommend going into programming instead, Data Science would be a lot of retraining when it sounds like you're already a competent programmer. Just apply for some positions that say software engineer or something like that.

>> No.10854874

>>10854855
>Just apply for some positions that say software engineer or something like that.
I have, but I've never gotten an interview for any of those applications. The only freelance work I've ever found is through friends/colleagues. I'm not sure how exactly to transition into software dev. There's a lot of work for software engineers working on control systems etc., but they don't seem to look twice at a ChemE's CV (which is a bit ridiculous since I've taught both ChemEs/EEs process control as an assistant lecturer, but whatever, such is life).

The only reason I thought DS/ML is because my research has some applications for ML, in fact I have some limited contributions to fundamental open source libraries in the field (not ML libraries themselves, but the SciPy stack).

I have until the end of the year when my stipend expires, but after that I might actually starve.

>> No.10854897

>>10854874
Like employers care about your EE/ChemE experience for programming. In fact, putting that in your resume/CV makes you sound overqualified and that you would expect higher pay just for having education in an unrelated field.

>> No.10854904

>>10854897
>In fact, putting that in your resume/CV makes you sound overqualified and that you would expect higher pay just for having education in an unrelated field.
I don't really know how to tailor my CV otherwise though.

My only experience developing libraries is on scientific libraries.

>> No.10854908

>>10854904
I think I'm just going to bite the bullet and start applying for CS PhD/Master programmes in ML.

Thanks anyway Anons.

>> No.10854910

>>10854904
I don't know what to tell you then, it doesn't sound like you have any relevant experience in programming or data science right now. Just find some bootcamp online, complete it, then pray you get a job.

>> No.10856103

>>10854569
Nope. Senior data scientist. Try again, retard.

>> No.10857021

Can I get a DS/ML job without a degree? I have a couple years in uni but want to quit. I have a "good" mathematical background, python data stack, SQL, some kaggle competitions.

>> No.10857026

>>10854897
Retard.

>> No.10857043

I did a bachelors degree in sociology, a masters degree in survey methodology afterwards while studying stats + machine learning in my free time. Currently working in bioinformatics doing data science stuff. Roast me please

>> No.10857066

>>10857021
I think you probably need at least a bachelors in something. I have heard of many people from multiple different disciplines entering data science but they always at least had a bachelors degree.

>> No.10857831
File: 595 KB, 3456x1988, map_algorithms_spmf_data_mining097.png [View same] [iqdb] [saucenao] [google]
10857831

Guys! I need help. I'm trying to decide if I want to do a masters degree or not.

My current credentials
>Bachelors of computer science with good enough grades to do a masters program
>3 years of work experience as a software developer
>Currently just got a very high paying remote job for a unicorn startup, 43 hours a week
>I'm a really fast programmer (previous job I could get all my sprints work done in 3 days if I tried and stopped shitposting)
>Relatively smart and good at math
>Work is essentially building distributed systems to handle very large numbers of requests
>Familiar with python, statistical learning concepts

Goals
>Obtain statistical prowess - be able to apply it on my day to day to my work
>Work at google/facebook as a senior by 27 (currently 24)
>Learn some more math, I love math
>Publish a paper or two


Here's the kicker. If I did a masters degree, I'd do it while working full time. There is no way I'll give up my current income. I know I can do this, because I worked full time when I finished my CS bachelors. Should I do a masters degree, or just fashion one myself?

So it comes down to

>Make my own study programme; and do open source work to show progress for prospective employers
>Do a masters degree and almost die, but come out on the other end a stronger man


plz halp

>> No.10858207

>>10857831
Do you REALLY need a Masters to achieve your goals?

>> No.10858230

/data/, I have only ever taken one course in probability. what do I have to study to go into data science?

>> No.10858312

>>10858230
statistics

>> No.10858314

>>10854749
lol retard

>> No.10858362

>>10858207
no not really, but it would definitely help put me on the right path..

>> No.10859371

Why does L1 regularization induce sparse solutions?

>> No.10859377

>>10859371
Because it penalizes solutions for being not sparse.

>> No.10859911
File: 9 KB, 353x143, (PNG Image, 353 × 143 pixels).png [View same] [iqdb] [saucenao] [google]
10859911

>>10859371
because L2's contours are a circle while L1's are a rotated square, with the corners on the axes, so it's more likely to hit coefficients at 0

not sure if i transferred the intuition nicely, check pic
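
The same intuition in numbers, as a minimal numpy sketch for the one-dimensional case (equivalently an orthonormal design, where each coefficient decouples). The function names and the coefficient values are made up for illustration; the two formulas are the standard closed-form minimizers given in the docstrings:

```python
import numpy as np

def ridge_shrink(b, lam):
    """Minimizer of 0.5*(x - b)^2 + 0.5*lam*x^2: rescales, never exactly zero."""
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    """Minimizer of 0.5*(x - b)^2 + lam*|x|: soft-thresholding."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b = np.array([3.0, -0.4, 0.2, -1.5, 0.05])  # made-up unpenalized coefficients
print(ridge_shrink(b, 0.5))  # every entry shrunk, none exactly zero
print(lasso_shrink(b, 0.5))  # entries with |b| <= 0.5 are set to exactly zero
```

The L2 penalty scales everything by the same factor, while the L1 penalty subtracts a fixed amount and clips at zero, which is where the exact zeros come from.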

>> No.10859963

>>10859371
It penalizes small errors a lot but is forgiving of larger ones (compared to L2, for example).
So you get a solution that has no small variations but once in a while one HUGE variation.
So it's sparse in the sense that it makes lots of residuals pile up at zero and in exchange allows a few residuals to take huge values.