Great news for Othot as they just secured $1.7 Million in their first round of outside financing. For more details and a shoutout to Dr. Mark Voortman, check out MercuryNews.
Note: This is a repost of my original blog post at OThot, where I also work as Senior Data Scientist.
By now you have probably heard about the “Big Data” hype in one form or another, or read about how other companies are achieving success harnessing their data. With all the attention for Big Data and the accompanying field of Data Science to make sense of the data, it would be no surprise if you want your company to benefit as well from one or more Data Scientists generating actionable insights from your ever-growing sea of data.
When you are at this stage with your company and start to search for some great Data Scientists, you will quickly find out that these people are in short supply. Now why is that? And why is it potentially dangerous to simply promote or retrain one of your company’s developers into a Data Scientist role if your search keeps turning up empty?
The reason why Data Scientists are scarce is twofold. First off, the aforementioned Big Data hype with its growing need for Data Scientists (with the amount of data outgrowing the number of data analysts) is creating a high demand for Data Scientists. With many Data Science positions opening up, there is also the troublesome side effect of developers starting to market themselves as Data Scientists, while having little to none of the required expertise.
Secondly, and somewhat problematic, is the required skill set for a Data Scientist. Why problematic? Well, this is expressed clearly in the by now classic illustration of the Data Science skill set, the Data Science Venn Diagram:
The three main skills (indicated by the primary colors) are hacking skills, math and stats knowledge, and substantive expertise. What this implies is that you need to be a programmer and statistician, while also having a lot of experience in these fields and in the relevant problem domain and business context. Given that each of these skills on their own already poses a challenge when you want to find a great candidate, then searching for all of them combined in one person can send you on a wild goose chase.
Earlier I mentioned the danger of turning developers (with hacking skills) into Data Scientists, which, if you look at the diagram, might put you in the Danger Zone! The reason is that if you have hackers without substantive Math and Statistics knowledge you "[..] give people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created". This could give rise to flat-out wrong business decisions based on wrong interpretations of the data.
This is not meant to say that you cannot turn a developer into a Data Scientist, but rather that you have to be aware that you also have to teach them the required math and statistics background. Or the other way around, you need to make sure that you are teaching your statisticians to gain better development skills. Nowadays there are many resources on the internet for learning Data Science. These will get you started, but it will take a lot of time and practice to gain enough experience for a desired level of proficiency.
We believe there might be a better alternative to growing your own in-house Data Science team and that is to have OThot be of service. OThot can either take the Data Science challenge off your hands completely or complement your Data Scientists with substantial expertise in all required skills. So don’t hesitate to reach out if you want to know more about what OThot has to offer!
 Big Data, Big Hype? http://www.wired.com/insights/2014/04/big-data-big-hype/
 Big Data Is A Big Problem That’s Getting Bigger: http://www.forbes.com/sites/larrymyler/2015/07/29/big-data-is-a-big-problem-thats-getting-bigger/
 Who is ready for some big data success stories: http://www.forbes.com/sites/howardbaldwin/2015/06/08/whos-ready-for-some-big-data-success-stories/
 Growth of Data vs Growth of Data Analysts: http://www.delphianalytics.net/wp-content/uploads/2013/04/GrowthOfDataVsDataAnalysts.png
 The Data Science Venn Diagram: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
 How to actually learn data science: https://www.dataquest.io/blog/how-to-actually-learn-data-science/
Two months ago a good friend of mine (Jeff Rose), who is a former DeepMind engineer (acquired by Google), presented me with an opportunity to come work at ThinkTopic to build intelligent tools for a big publisher that involves image search, collaborative filtering, an intelligent catalog, and much more by applying Deep Learning techniques for learning in Neural Networks. Additionally most development work would be done in Clojure (a Lisp dialect on the JVM), which for me really is the icing on the cake, being a longtime Clojure and functional programming fan.
So after almost 4 years of working as Chief Architect and Data Scientist at Bottlenose, I decided to take on this new challenge to start working for ThinkTopic, of which today is my first day.
I had a great time at Bottlenose working with a great group of people. It has been amazing to see the company grow to where it is right now, and the potential of where it could go. Getting the KPMG investment has been a great milestone and together with all the work over the years the idea of no longer being part of Bottlenose feels a bit weird. Bottlenose has given me a lot in terms of experience, joy and friendships.
Having said that, I'm keeping my desk at the Hackers and Founders building in Amsterdam, so I'm not going cold turkey on my former colleagues (only switching rooms).
I'm excited to go on this new ThinkTopic adventure, which for today means diving back into Convolutional Neural Networks. May the fun and challenge begin.
Note: This is a repost of my original blog post at OThot, where I also work as Senior Data Scientist.
In statistics there is the old and ongoing debate of Frequentism versus Bayesianism, which has been humorously depicted in the following popular XKCD cartoon:
In this cartoon we see the Frequentist statistician finding that the odds (p-value) of the neutrino detector lying are below the (arbitrary) significance level of 0.05, saying that it is unlikely that the machine is lying, and therefore concluding that the Sun must have exploded. The Bayesian, on the other hand, also takes his prior knowledge about the Sun into account and determines that the Sun's track record of billions of years of not exploding vastly outweighs the evidence from the neutrino detector, so the detector is most likely lying.
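The Bayesian's reasoning in the cartoon can be sketched with Bayes' rule. In the cartoon the detector lies when two dice both come up six (probability 1/36); the prior probability that the Sun has just exploded is not given anywhere, so the one-in-a-billion value below is purely an assumption for illustration, as is the function name.

```python
from fractions import Fraction

def posterior_exploded(prior_exploded, p_lie=Fraction(1, 36)):
    """P(Sun exploded | detector says 'yes'), computed with Bayes' rule."""
    # The detector says "yes" either because the Sun exploded and it tells
    # the truth, or because the Sun is fine and it lies.
    p_yes_given_exploded = 1 - p_lie
    p_yes_given_fine = p_lie
    evidence = (p_yes_given_exploded * prior_exploded
                + p_yes_given_fine * (1 - prior_exploded))
    return p_yes_given_exploded * prior_exploded / evidence

# Hypothetical prior (an assumption, not from the cartoon): 1 in a billion.
prior = Fraction(1, 10**9)
post = posterior_exploded(prior)
print(float(post))  # still a tiny probability, despite the "yes" answer
```

Even after the detector answers "yes", the posterior stays tiny under this assumed prior, which is why the Bayesian is happy to take the bet.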
The use of explicit priors with Bayesianism versus implicit priors with Frequentism (yes there are priors, but they are fixed) is one difference most statisticians know about. However, there are more subtle differences that carry big consequences and can sometimes lead to contradictory conclusions between the two approaches.
The popular blogging site "Pythonic Perambulations" has a great series of technical posts that give a practical introduction to Frequentism and Bayesianism, which are highly recommended. In this series Jake VanderPlas explains the differences with great clarity. They can be summarized as follows:
- Frequentists and Bayesians disagree about the definition of probability
- Frequentism considers probabilities to be objective and related to frequencies of real or hypothetical events
- Bayesianism considers probabilities to be subjective and measures degrees of knowledge or belief
As a result he explains: "[..] frequentists consider model parameters to be fixed and data to be random, while Bayesians consider model parameters to be random and data to be fixed." 
This actually has far-reaching consequences for the use of Frequentism in science, where you most often have one dataset (i.e., fixed data) about which you want to make inferences, and you are not interested in inferences about hypothetical other datasets. Using Frequentism in science answers the wrong question, because you want answers for your specific dataset. Therefore the use of p-values and confidence intervals in this context is useless.
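To make the "hypothetical other datasets" point concrete, here is a minimal sketch that estimates a p-value by literally simulating repeated studies under pure chance. The coin-flip scenario (60 heads out of 100), the function name and the number of replications are all made up for illustration.

```python
import random

def simulated_p_value(observed_heads, n_flips, n_replications=10_000, seed=0):
    """Estimate a one-sided p-value by simulating hypothetical replications."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_replications):
        # One imaginary replication of the study with a fair coin.
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        if heads >= observed_heads:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_replications

# Hypothetical observation: 60 heads in 100 flips.
p = simulated_p_value(observed_heads=60, n_flips=100)
print(p)
```

Note that the result is a statement about the simulated datasets that were never observed, not about the one dataset actually in hand.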
But you might ask, why then is the use of p-values the de facto standard in scientific research, if it is fundamentally wrong? Good question. The problem is that confidence intervals are easy to compute and often give similar results to the Bayesian approach. This doesn't change the fact that the approach is flat-out invalid, and doesn't support the conclusions made.
Recently the science community started acknowledging this fact and we are now starting to see journals, e.g., "Basic and Applied Social Psychology", where research using p-values is being rejected. This has not gone unnoticed, as both Nature, the international weekly journal of science, and Scientific American wrote about it in depth and both proposed Bayesian statistics as a good alternative.
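A simple case where the two approaches give the same numbers is estimating the mean of normally distributed measurements with a known standard deviation. The sketch below assumes a flat prior on the mean for the Bayesian side; the data values, the known sigma of 0.2 and the function names are hypothetical.

```python
from statistics import NormalDist, mean

def frequentist_ci(data, sigma, level=0.95):
    """Classic confidence interval: sample mean +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / len(data) ** 0.5
    m = mean(data)
    return (m - half_width, m + half_width)

def bayesian_credible_interval(data, sigma, level=0.95):
    """Credible interval under an (assumed) flat prior on the mean.

    With a flat prior the posterior of the mean is
    Normal(sample mean, sigma / sqrt(n)).
    """
    posterior = NormalDist(mu=mean(data), sigma=sigma / len(data) ** 0.5)
    tail = (1 - level) / 2
    return (posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail))

data = [4.8, 5.1, 5.3, 4.9, 5.2]  # hypothetical measurements
ci = frequentist_ci(data, sigma=0.2)
cred = bayesian_credible_interval(data, sigma=0.2)
print(ci)
print(cred)
```

The two intervals coincide numerically here, yet they answer different questions: the first describes the long-run behaviour of the procedure over repeated datasets, the second describes the parameter given this dataset.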
The Scientific American article adds the following about p-values, and confirms the aforementioned hypothetical other datasets problem with Frequentism:
"Unfortunately, p-values are also widely misunderstood, often believed to furnish more information than they do. Many researchers have labored under the misbelief that the p-value gives the probability that their study’s results are just pure random chance. But statisticians say the p-value’s information is much more non-specific, and can be interpreted only in the context of hypothetical alternative scenarios: The p-value summarizes how often results at least as extreme as those observed would show up if the study were repeated an infinite number of times when in fact only pure random chance were at work. This means that the p-value is a statement about imaginary data in hypothetical study replications, not a statement about actual conclusions in any given study"
Needless to say, at OThot we are big proponents of the Bayesian approach for statistical inferences. In a previous blog post by Mark Voortman, we already started talking about and explaining the Bayesian approach, and you can safely expect more of that.
 XKCD: Frequentists vs. Bayesians http://www.xkcd.com/1132/
 Frequentism and Bayesianism: A Practical Introduction http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/
 Frequentism and Bayesianism III: Confidence, Credibility, and why Frequentism and Science do not Mix http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/
 Journal of Basic and Applied Social Psychology http://www.tandfonline.com/doi/abs/10.1080/01973533.2015.1012991
 Statistics: P values are just the tip of the iceberg http://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412
 Scientists Perturbed by Loss of Stat Tools to Sift Research Fudge from Fact http://www.scientificamerican.com/article/scientists-perturbed-by-loss-of-stat-tools-to-sift-research-fudge-from-fact
 What Do Searching For a Plane and a Doctor’s Diagnosis Have in Common? http://www.othot.com/what-do-searching-for-a-plane-and-a-doctors-diagnosis-have-in-common
In the past couple of months I managed to complete 4 online courses on Coursera, which is an education platform that partners with top universities to offer free MOOCs (Massive Open Online Courses). In this post I would like to share my experiences with these courses and my plans for future courses.
The reason for me to start following online courses has to do with what I already pointed out in my previous blog post "My System to Win Big". I want to keep on improving myself and acquire more knowledge in domains of my interest. Part of it is revisiting knowledge I'm "supposed" to know to gain an even better understanding, and the other part is exploring new uncharted territories. I also wanted to follow courses where I would have to practice and hone my programming skills.
But why follow online courses with a rigid schedule and deadlines if you can simply study books or solve some tough problems on Project Euler at your own pace? Well, for me as it turns out, having a fixed course schedule relieves me of the burden of planning and serves as a nice stick to finish on time, which made it easier to get into the habit of spending evening hours on education. This way I didn't have to spend my will-power every time to get started, and now that I have grown this study habit and with it the discipline to follow through, I'm in a much better position to also study at my own pace.
The courses for which I have received statements of accomplishment are, in chronological order:
- Machine Learning
- Algorithms: Design and Analysis, Part 1
- Bioinformatics Algorithms, Part 1
- Learning How to Learn
In the subsequent sections I will describe each course and my experiences in more detail.
Machine Learning
Overview: Stanford University // June-September 2014 // 12 weeks of study // 6 hours per week // Coursera link
Half a year ago at Bottlenose I was shifting from primarily a Software Architect role to more of a Data Scientist role, and therefore spending more time on Machine Learning problems. So I thought it was valuable to refresh my knowledge in this particular domain. Besides reading up on some study books I decided to enroll in the Stanford Machine Learning course, which presented a nice overview with some basic programming exercises. Topics included (as listed on the Coursera page):
- Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks).
- Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning).
- Best practices in Machine Learning (bias/variance theory, innovation process in machine learning and AI).
Given that I was already very familiar with these topics, combined with an excellent presentation by instructor Andrew Ng, it may not come as a surprise that I found the course easy to follow and the programming exercises not hard to implement with only the (frequent) annoyances of coding in Matlab.
It felt good to revisit all the topics in the course, which presented me once again with the wide variety of Machine Learning approaches and techniques. Perhaps in hindsight other specialized courses on Machine Learning on a more graduate level would have been more worthwhile. However, I don't regret my time spent on this course as it never hurts to go back to the basics once in a while. Additionally, as this was my first online course, it was a great way to get started with Coursera and familiarize myself with this new educational format. And after finding out that it worked really well for me, I instantly signed up for a lot of other (more specialized) courses.
Algorithms: Design and Analysis, Part 1
Overview: Stanford University // October-December 2014 // 6 weeks of study // 7 hours per week // Coursera link
After finishing the Machine Learning exercises in Matlab I wanted my next Coursera course to require a "real" and more interesting programming language, and at the time I was already reading up on two other programming languages that I wanted to put into practice: Julia and Rust.
I was already aware of the "Algorithms: Design and Analysis" class, which two of my friends had already completed and recommended, making it an interesting candidate for my next course. Having finished a similar course (in Java) during my studies at Delft University I thought this course would be a walk in the park and a good playground for testing the waters with Julia and Rust. At the same time I was starting with the "Bioinformatics Algorithms" class, which also focused on implementing algorithms, but in the domain of biology (more on Bioinformatics later). I started both courses with Julia.
I can honestly say that I really like the Julia language, which was born out of its creators' wish to have a programming language that is:
... open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.
(Blog: Why we created Julia)
The first week's programming assignment, a counting algorithm piggybacking on merge sort, I also implemented in Rust after finishing the Julia version. The Rust implementation was a painful delivery, which was mostly me fighting the compiler and having a hard time finding clues on the web due to a lot of breaking changes in Rust's development towards the 1.0 release version. Add to that the resulting performance being slower than with my Julia implementation, and I quickly decided to leave Rust alone, at least until the language would be more stable (1.0 alpha was released Jan 2015).
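For readers curious what such a counting algorithm on top of merge sort can look like, here is a minimal Python sketch. It assumes the task is counting inversions (pairs of elements that are out of order), which is the textbook example of this technique; it is not my original Julia or Rust submission, and the function name is made up.

```python
def sort_and_count(items):
    """Return (sorted copy of items, number of inversions in items)."""
    if len(items) <= 1:
        return list(items), 0
    mid = len(items) // 2
    left, left_inv = sort_and_count(items[:mid])
    right, right_inv = sort_and_count(items[mid:])

    merged = []
    split_inv = 0
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] is smaller than every element still left in `left`,
            # so it forms an inversion with each of them.
            merged.append(right[j])
            j += 1
            split_inv += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, left_inv + right_inv + split_inv

print(sort_and_count([2, 4, 1, 3, 5]))  # ([1, 2, 3, 4, 5], 3)
```

The counting rides along with the merge step, so the whole thing keeps the O(n log n) running time of merge sort instead of the O(n^2) of comparing every pair.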
While I was getting the hang of Julia, I also started exploring possibilities to introduce the language at Bottlenose, but found the language too immature just yet (at version 0.3), especially compared with Python and its vast amount of available (scientific) packages. It would have been interesting to complete the entire course in Julia, but I deemed it more valuable to switch to Python along the way and perhaps revisit Julia again in a couple of years from now.
Let me now turn to the actual course contents after this (quite long) Julia-Rust-experience-intermezzo. When I mentioned that I believed this course to be an easy ride, I was actually quite mistaken. The lectures had a lot of technical and mathematical depth, and the quizzes were often very challenging. You simply cannot fly through this course without a good understanding of the introduced concepts, which are luckily very well presented by course instructor Tim Roughgarden. All in all I can highly recommend this course for both expert and aspiring computer scientists for learning (or revisiting) several fundamental principles of algorithm design, and I'm looking forward to participating in Part 2 of this course later this year.
Bioinformatics Algorithms, Part 1
Overview: UC San Diego // October 2014 - Feb 2015 // 10 weeks of study // 10 hours per week
I always had a general interest in biology, and genetics in particular, and if you combine this with my interest in algorithms and Data Science, you can see why I had a course in bioinformatics high on my wish list. Along came "Bioinformatics Algorithms" on Coursera and I could no longer resist signing up and was eager to get started. The syllabus consists of chapters of the interactive textbook "Bioinformatics Algorithms: an Active Learning Approach":
- Where in the Genome Does DNA Replication Begin? (Algorithmic Warmup)
- How Do We Sequence Antibiotics? (Brute Force Algorithms)
- Which DNA Patterns Act As Cellular Clocks? (Randomized Algorithms)
- How Do We Assemble Genomes? (Graph Algorithms)
- How Do We Compare Biological Sequences? (Dynamic Programming Algorithms)
- Are There Fragile Regions in the Human Genome? (Combinatorial Algorithms)
In a way this was the "Algorithms: Design and Analysis" course all over again applied to the biological domain. The practical application of the algorithms really made this course stand out for me, and made all algorithms more tangible. The interactive book accompanied by the online lectures had great production value and introduced concepts and terminology very well.
The programming assignments throughout the book were the real meat of the course, and it is where you will spend most (90%) of your time. While most exercises were not too hard, there was still a big chunk of problems that were very challenging. Sometimes this was due to the automated solution checker being a bit too strict in the solutions it would accept (with no feedback on why you were wrong), but mostly they were just hard problems to solve. Once you worked your way through the chapter and finished all the exercises, the corresponding quiz was easy to pass.
To get a statement of accomplishment you needed to score 70%, which is definitely doable. I went the extra mile and focused on scoring above 85% for an accomplishment with distinction, which meant no hiding from the difficult parts. In the end I was very proud that I achieved my statement of accomplishment with distinction.
I'm looking forward to part 2 of the course, which will start this month.
Learning How to Learn
Overview: UC San Diego // January 2015 // 4 weeks of study // 2 hours per week // Coursera link
Difficulty: Very easy
When you are spending your spare time following online courses and you notice there isn't enough time in the week to follow all the courses you would like, you have to make choices. And it is not only the choices that are difficult, you also want the courses that you decide to follow to have a lasting impact and not be quickly forgotten when you are moving on to another course. This brings us to the topic of how you can learn to learn more effectively, which is what the course "Learning how to learn" has to offer.
While the course is very easy and lacks real depth, it is still beneficial to at least watch the lectures and the interviews, as there might be some tips and tricks you can pick up that will improve your learning capabilities and help you overcome procrastination when it hits you.
My key takeaway points are:
- Recall: A great way to improve your understanding and ability to form strong memories is to pause for a moment after you have read some text, look away, and force yourself to recall what you just read. Formulating your thoughts really helps, so talking and explaining to others is another big plus.
- Focus on "process" not "product".
- Make to-do lists for the next day.
- Exercise really helps when you get stuck on some hard problem. Shifting your focus can make your subconscious and diffuse mode of thinking work for you in the background.
- Skim through an article or paper to get a sense of the context to help structure new knowledge.
In 2015 I want to take at least the following courses, of which the first two are continuations of two courses I already completed:
- Algorithms: Design and Analysis, Part 2
- Bioinformatics Algorithms, Part 2
- Probabilistic Graphical Models
Additionally I'm looking forward to following more courses specialized in the field of bioinformatics (e.g. genetics, bio-medicine, evolution, neuroscience) and Data Science (mining datasets, pattern discovery, advanced statistics).
I also want to finish Linear and Integer Programming of which I already completed 2 out of 7 weeks, but discontinued the course due to time constraints of other overlapping courses.
I want to conclude with two tips for when you are going to embark on your own Coursera journey:
- Do not take too many courses at once. It might seem sometimes that you could easily squeeze in another course, but they often take more time than you anticipate. Additionally, if you end up stalling in one or more courses, you might get demotivated and stop altogether.
- Find yourself some friends who are already following online courses or talk them into joining you. Having a study group of like-minded individuals will really help you stay on course and make the ride less lonely and a lot more fun.
Good luck with your own online education!
One and a half years after securing our $3.6 Million series A investment round to bring 'Trendfluence' to the enterprise, we are now proud at Bottlenose to have raised (and continue to raise) our series B round with a staggering amount of $13.4 Million, bringing the total raised funds north of $17 Million.
Back in December we already disclosed that KPMG International took a "substantial equity share" in Bottlenose, but did not yet reveal the size of the investment nor names of other investors that were ready to follow KPMG Capital's lead. Today Techcrunch reported on our Series B round, listing our additional investors and describing our growing capabilities in enterprise (real-time) stream data analytics. The series B round is still open and we plan to raise significant venture debt on top of the $13.4 million.
At Bottlenose we have exciting times ahead of us.
You can read the full story on Techcrunch.
Back in January 2014 I read a book by Scott Adams (of Dilbert fame) titled: "How to Fail at Almost Everything and Still Win Big". The book was brought to my attention when I was reading up on some interesting research and ideas on passion and discipline (Cal Newport comes to mind, explaining why following your passion is bad career advice). Reading Adams's book definitely struck a chord with me and even motivated me to change my workflow and long-held beliefs on what to focus on in life. Two things stood out for me:
- Passion is bullshit. It is only after you have put in the work and are lucky enough in having success that passion comes, not before.
- Goals are for losers. Systems are for winners.
Before reading the book I had often asked myself questions like: What is my greatest passion and am I working every day on the things I'm most passionate about? What kind of startup or company would fit my passion? I often had a hard time trying to find the perfect answer to these questions, which resulted in inertia to get started and in some cases I was setting myself up for ever higher and greater goals.
Reading about the system approach as an alternative to goal setting made a lot of sense to me. The system approach can be summarized as follows:
- Create habits and reserve your will-power for learning new habits.
- Reduce daily decisions to routine.
- Avoid setting (high) goals; instead have a system that will keep you going
- Keep up your personal energy, which is your primary metric
- Simplicity is key
- A system will keep you going, also in moments of failure
One of the key points here is the creation of (good) habits and the discipline to carry them out. A recent blog post titled: "Screw motivation, what you need is discipline" argues for the same point. It basically boils down to growing habits and not letting yourself be driven by your emotional state. So stop waiting until you feel good (enough) to start a particular task and get into the habit of allocating time to just do stuff to get things done. Focus on the process (your system) and not the product (deliverables and goals).
While in general you want to be as productive as you can be, having a family with a third kid on the way (due any moment now) means I have to get even more organized and disciplined to get things done. The way in which I manage this is what I want to present to you next by outlining my system along several dimensions that Scott describes in his book. I have been adhering to this system for a full year and besides some minor changes and occasional missteps I'm still following through, and it has definitely made a hugely positive impact on my life and productivity.
Now without further ado, My System to Win Big:
One or more of the following activities I do daily for at least 30 to 60 minutes
- Push-ups / Sit-ups / Plank / Triceps-dips / 7-minute workout cycles (Honestly this is the one I want to become more of a routine several days a week.)
- Football (Soccer)
Needless to say you need a good amount of sleep.
- Go to bed before 00:00 (Yes, this one is hard if you do remote work for US companies from Europe)
- sleep 8 hours
Being flexible here means being in charge of your own schedule. Making a fixed planning and schedule for things you want to do is highly recommended for young parents, as long as you are the one who chooses to make time for it.
- Plan holidays/time-off
- Plan fixed slots for daily exercise and sports
- Plan stuff that makes you want to get out of bed for it
I'm a big proponent of the Paleo lifestyle, and try to avoid processed foods as much as possible and keep it real.
- Only occasionally bread/rice/potatoes
- No Candy/Cookies/Chips
- No Sodas
- No Coffee
- Occasionally alcohol (celebrations)
- Meal skipping is good
- Fasting once in a while is great (irregular intervals)
- Snacks: nuts and fruits
- Eating more vegetables by improving seasoning
- Starches like Oatmeal
Steadily improving / Skills
- Programming (Python, Clojure, Julia, Euler, Math)
- Book reading / knowledge gathering
- Blogging (on side-projects / programming)
- Practicing playing the Piano (still a noob here)
- Coursera courses
Imagine an incredible future
- related: stay optimistic and opportunistic
- Be there for my kids and wife
- Offer free advice and guidance for start-ups and companies
- Engage at meetups and help people out
- Value what you already have
- Enjoy the moment
Have a healthy dose of Stoicism
- Avoid the pursuit of materialistic goals and don't get attached to stuff
My system helped me become more productive and happier overall, which may not yet be the fame and fortune that Scott Adams achieved, but it is still a big win to me, and I'm ready when any big opportunity arises in the future to take it head on.
Of course my system isn't perfect. Of course my system is tailored to suit me and can be different for you. Of course I'm still failing every so often, but at least I have my system to hold on to and to remind me how I want things to go. Get yourself into the habit of working on what matters to you and with a little bit of luck and serendipity you can win big too!
After years of talking and thinking about creating our own Artificial Intelligence (AI), Mark Voortman and I finally decided a couple of months ago to actually start working on it. So project brAIn was born. With brAIn we have the modest goal to build an AI that will change the world as we know it. Using Deep Learning techniques we are building towards some cool applications.