Museum Treasures – AI at the National Museum in Warsaw

September 13, 2018 / Data science / by Agata Chęcińska

Object recognition is commonly applied to street and satellite photos, diagram analysis and text recognition. After a team including several deepsense.ai data scientists took first place in the National Museum in Warsaw’s HackArt hackathon by designing a “Museum Treasures” game, the technique may soon be used to popularize art and culture, too.

In May 2018, the National Museum in Warsaw organized its HackArt hackathon. The task was to combine seemingly disparate fields: museology, art history and artificial intelligence. The goal of HackArt was to create tools: AI-based applications, bots and plug-ins that could help solve challenges the museum set.

The idea

Our focus was on the target group of parents and children, and on answering three questions:

  • How to encourage families to visit the museum
  • How to make the visit interesting for children
  • How to build interest, among children and their parents, in the museum’s resources

Somewhere along the way, the idea of scavenger hunts came up, along with record-breakingly popular augmented reality games such as Ingress and Pokémon GO. Ingress players visiting a city fight for portals to another world; in Pokémon GO, players look for Pokémon in the least expected places.
That’s how the idea for the “Museum Treasures” game came about: a game in which children and parents get a map with fragments of paintings belonging to particular categories (animals or trees, for example). Armed with their maps, they then find the paintings that contain the fragments. The game’s formula was simple, but we also wanted it to be individualized and constantly changing. It was essential that parents with children be able to play it multiple times, and that families searching for the treasures follow a number of different “paths” through the museum. How could we go about doing that? The solution ended up using AI.

The execution

Artificial intelligence allows for the automation of many activities, especially tedious and time-consuming ones. Object recognition in street and satellite photos, diagram analysis, text recognition and analysis – there are countless applications. Now the automatic recognition of what can be found in paintings hanging in a museum can be added to the list.
Everyone participating in the hackathon had access to 200 photos of museum pieces. Our solution was to create a database of specific fragments of images – animals, trees, houses, feet, you name it – which could be used to create the various paths of a treasure hunt by category. The database would be extensive, and elements from new exhibitions could be added to it once they were digitized. The only limitation in selecting the fragments was the set of element classes that the object detection model we intended to use could recognize.

Image analysis

Take, for example, Antoni Brodowski’s painting “Parys in a hat”. How would the object detection model work in this case?

In this image, the model will recognize, for example, the head, the hand, the cap and the human, together with the areas where these elements appear (the x, y coordinates of a bounding rectangle) and the probability p (the confidence with which the model found the element). A dictionary of fragments is thus created:

{category_1: image_id, detected fragment, (x, y), probability p}
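
For concreteness, a single hypothetical entry in such a dictionary might look as follows in Python (the image ID, coordinates and probability below are made up for illustration):

# One hypothetical entry in the dictionary of fragments
# (image ID, box coordinates and probability are illustrative)
fragments = {
    "cap": [("parys_in_a_hat", "cap", (0.21, 0.08), 0.93)],
}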


Object detection models (among many other kinds of models) can be found in open-source code repositories on GitHub. Depending on your needs and situation, a model can be trained from scratch using the data available or, alternatively, pre-trained, ready-made models can be used. We chose the latter approach, because training a model requires a set of photos and labels, where the labels define the category of each element and the coordinates of its occurrence in each photo. Because we lacked such labels, we went with an object detection model from the TensorFlow Models (Google Vision API) repository, which allows quick detection across 545 categories.
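
For readers who want to reproduce this setup, here is a minimal sketch of how such a pre-trained frozen model is typically loaded with the TensorFlow 1.x object detection API; the file name is illustrative, while the tensor names are the standard ones exposed by models from that repository:

# A minimal sketch of loading a pre-trained object detection model
# (TensorFlow 1.x frozen graph); the file name is illustrative
import tensorflow as tf

detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
        graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name="")

sess = tf.Session(graph=detection_graph)
# Standard input/output tensors exposed by models from the repository
image_tensor = detection_graph.get_tensor_by_name("image_tensor:0")
detection_boxes = detection_graph.get_tensor_by_name("detection_boxes:0")
detection_scores = detection_graph.get_tensor_by_name("detection_scores:0")
detection_classes = detection_graph.get_tensor_by_name("detection_classes:0")
num_detections = detection_graph.get_tensor_by_name("num_detections:0")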


The code that enables objects to be found in images from a set is located in the stared/hackart-you-in-artwork repository. Below is a fragment (lightly adapted) that detects objects in an image and saves the results to a JSON file. For each photo on the list to be processed, the code runs a neural network that recognizes the objects in the picture. As a result, we receive thousands of frame proposals, along with the class and the probability of that class occurring in the area in question. We rejected results that were too uncertain, while adding the remaining ones to the dictionary for use in further stages of working with the data. The code also draws the results on the image and saves them to files, which lets us visually verify the results obtained.

# Names such as TEST_IMAGE_PATHS, THRESH, sess and the detection_* tensors
# are defined earlier in the full script (cf. the loading sketch above);
# load_image_into_numpy_array is a helper from the same script.
import json

import cv2
import numpy as np
from PIL import Image
from tqdm import tqdm
from object_detection.utils import visualization_utils as vis_util

results = dict()
counter = 0
for image_path in tqdm(list(TEST_IMAGE_PATHS)):
    base_name = image_path.split('/')[-1][:-4]
    image = Image.open(image_path)
    image_np = load_image_into_numpy_array(image)
    image_np_expanded = np.expand_dims(image_np, axis=0)
    # Run the detection network on the image
    (boxes, scores, classes, num) = sess.run(
        [detection_boxes, detection_scores,
         detection_classes, num_detections],
        feed_dict={image_tensor: image_np_expanded})
    image_np = cv2.cvtColor(image_np, cv2.COLOR_BGR2RGB)
    objects = []
    for s, c, b in zip(scores[0], classes[0], boxes[0]):
        if s > THRESH:  # reject detections that are too uncertain
            b = list(b)
            # The TF object detection API returns normalized boxes
            # in [ymin, xmin, ymax, xmax] order
            objects.append({"prob": s,
                            "name": str(category_index[c]['name']),
                            "ymin": b[0], "xmin": b[1],
                            "ymax": b[2], "xmax": b[3]})
    results[image_path.split('/')[-1]] = objects
    # Visualization of the detection results, for visual verification
    vis_util.visualize_boxes_and_labels_on_image_array(
        image_np,
        np.squeeze(boxes),
        np.squeeze(classes).astype(np.int32),
        np.squeeze(scores),
        category_index,
        min_score_thresh=THRESH,
        use_normalized_coordinates=True,
        line_thickness=8)
    cv2.imwrite('%s.jpg' % base_name, image_np)
    counter += 1

print(results)
with open('oidv3.json', 'w') as f:
    json.dump(results, f)

Source: https://github.com/stared/hackart-you-in-artwork/blob/master/aux/scripts_karol/evaluate_on_images.py

Outline of how the game works

Preparing the dictionary of fragments and their categories was only part of the task. At the same time, we had to develop the application demo, focusing on basic functionalities in order to build the skeleton of the solution:

The application was intended to work automatically: after entering a new set of photos, a dictionary of categories is created (with additional filters to improve the quality of the items received), from which the user is then presented with 5 categories to choose from. During the hackathon we had a limited set of photos (→ lower credibility of automatically generated elements) and time constraints, so we supervised some of the tasks. For example, we checked the quality of the elements generated and merged several categories into one: cat, dog, fish, … → animals.
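
As a rough illustration of that supervision step, the probability filtering and category merging could look like this in Python (the mapping and threshold are made up; the detection format follows the JSON produced by the script above):

# Illustrative filtering and merging of detector output
# (the label-to-category mapping below is made up)
MERGE = {"cat": "animals", "dog": "animals", "fish": "animals",
         "tree": "trees", "palm tree": "trees", "house": "buildings"}

def build_categories(detections, min_prob=0.5):
    categories = {}
    for image_id, objects in detections.items():
        for obj in objects:
            if obj["prob"] < min_prob:
                continue  # drop uncertain detections
            label = MERGE.get(obj["name"], obj["name"])
            categories.setdefault(label, []).append({
                "image_id": image_id,
                "box": (obj["xmin"], obj["ymin"], obj["xmax"], obj["ymax"]),
                "prob": obj["prob"],
            })
    return categories
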
We built the web application in Vue.js. We made the following assumptions about the “Museum Treasures” game:

  • It could be played in an “analog” version: downloading a PDF with fragments of images and information about which room they can be found in → a designated “path” through the rooms; in this case, the player doesn’t need a smartphone or tablet, which may be important for parents and school trips.
  • It could also be played electronically, using a smartphone or tablet, with the same information as above.

In both cases, the user first selects the category, and then receives a list of fragments (unique, random set) that must be found in the paintings.
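A toy sketch of drawing such a unique, random set for a chosen category might look like this (the function and parameter names are ours, for illustration):

import random

def generate_path(categories, category, n_items=5, seed=None):
    """Draw a random, repeatable set of fragments for one category."""
    rng = random.Random(seed)
    candidates = categories.get(category, [])
    return rng.sample(candidates, min(n_items, len(candidates)))
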
During the hackathon we were able to create the basic version of the application, which allows users to select categories and shows the relevant fragments. We did not have time to hammer out the map- and PDF-related functionalities.

Nor, during the event, did we have metadata about the location of the paintings in the collection, though these data can be added at a later stage of the work. The result of our efforts was a demo version.
To see how the app works, check out the video below:

Further development ideas

In its basic form, “Museum Treasures” has players look for images, but in further development, verification, rewards and gamification could be added to enrich the experience. Defining the rules of the game, its goals and motivation was very important and involved determining the age of the players: the challenges that await a five-year-old will differ from those 12-year-olds will take on. We believed the game could also be interesting for adults, as paths could likewise be created for mature users. We had several ideas for introducing these elements and developing them further; you can read a few of them below.

Verification

  • Metadata – to confirm that a painting has been found, the player enters information about the painter, the year the work was created, etc. These types of data can easily be added to the dictionary, and questions can be generated based on them.
  • Photos – a more advanced form of verification requires the participant to take a picture of the painting rather than note down information. The application could be enriched with a module comparing the photographed image with the source fragment. This solution is much more technically complex, and a photo may sometimes be of too poor quality to be verified.

Clues

  • Metadata – hints about the painter.
  • Generating descriptions with AI – using algorithms to generate captions describing what is found in the picture. This could be an interesting extra, though such captions don’t always work properly as clues.

The prize for correctly finding all the images could be stickers or other small gadgets, plus a badge that could be shared on social media.
The game could also be developed for group play – ultimately, schools could use it. Some of our ideas also assume gamification:

  • two groups follow separate paths, with their times compared at the end,
  • two groups follow paths that end up at the same place, thus allowing the players to meet at the end of the game (and talk about who came in first).

The games could also be offered to adults: Races based on categories such as “Lips, lips” or “Buttocks over the centuries” are often very popular among adults.

Future plans

We’re currently working on getting our pilot solution up and running. It will be based on one of the museum’s exhibits – a collection of 19th-century paintings. We would like to create a basic version of the game, which would then be tested for its level of difficulty, among other factors.

Wait, so loans need to be repaid? The home credit risk prediction competition on Kaggle

September 6, 2018 / Data science / by Konrad Budek

It was far and away the most popular Kaggle competition to date, attracting more than 8,000 data scientists globally. The deepsense.ai team of Paweł Godula (team leader and deepsense.ai’s Director of Customer Analytics), Michał Bugaj and Aliaksandr Varashylau took fifth place overall and first place on the public leaderboard.

The goal of the competition, launched by Home Credit Group, was to build a model that could predict the probability of a bank’s customer repaying a cash loan (90% of the training data) or installment loan (10% of the training data). Combining an exciting, real-life challenge with a high-quality dataset, it became the most popular featured competition ever hosted on Kaggle.

The sandbox raiders

There can be no doubt that being a data scientist is fun: playing with various datasets, finding patterns and unearthing the needles hidden in the depths of the digital haystack. This time, the dataset was a marvel to behold. Why?

  1. The bank behind the competition provided data on roughly 300,000 customers, including details on credit history, properties, family status, earnings and geographic location.
  2. To enrich the dataset, the bank provided information about the customers’ credit history taken from external sources, mostly credit-rating institutions.
  3. The level of detail provided was astonishing. Participants could analyze the credit history of customers at the level of a single installment of a single loan.
  4. While the personal data was of course perfectly anonymized, the features were not. This enabled endless feature engineering, which is every data scientist’s dream.

In other words, the dataset was the perfect sandbox, allowing all of the participants to step into credit underwriters’ shoes for more than 3 months.
Our solution was based on three steps, described briefly below.

1. Hand crafting more than 10,000 features

Out of 10,000 features, we carefully chose the 2,000 strongest for the final model.
Endless brainstorming and countless creative sessions and discussions gave us more than 10,000 features that could possibly explain a default on a loan. As most of these features carried largely duplicate information, we used an algorithm for automatic feature selection based on feature importance. This procedure enabled us to eliminate ~8,000 features and reduce the training time significantly, while improving the cross-validation score at the same time.
The heavily tuned, five-fold bagged LightGBM model built on these 2,000 features was our submission’s workhorse.
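
A minimal sketch of such importance-based selection, assuming a pandas DataFrame X with the candidate features and a binary target y (the model settings below are illustrative, not our exact competition configuration):

import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Fit a baseline model on all candidate features
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

# Rank features by gain-based importance and keep the strongest 2,000
importance = model.booster_.feature_importance(importance_type="gain")
top_idx = np.argsort(importance)[::-1][:2000]
X_selected = X[X.columns[top_idx]]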

2. Using deep learning to extract interactions among different data sources

We wondered how we could capture the interactions between signals coming from different data sources. For example, what if 20 months ago someone was rejected by an external credit bureau, was late on an installment payment, and applied for a loan at our branch? These types of interactions are very hard for humans to capture because of the sheer number of possible combinations. So we turned to deep learning and recast the problem as an image classification problem.
How?
We created a single vector of user characteristics coming from different data sources for every month of user history, going as far back as 96 months (8 years was the cutoff in most data sources). We then stacked those vectors to create a very sparse “user image” and, finally, fed this image into a neural network.

The network architecture was as follows:

  • Normalization – division by global max in every row
  • Input (the “user image” in the format n_characteristics x 96 months – we looked 8 years into the past)
  • 1-D convolution spanning 2 consecutive months (to see the change between periods)
  • Bidirectional LSTM
  • Output
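
A minimal Keras sketch of this architecture could look as follows (the framework choice, layer sizes and optimizer here are illustrative, not our exact competition setup):

from tensorflow.keras import layers, models

n_characteristics = 64  # illustrative width of the monthly feature vector

model = models.Sequential([
    # Input: 96 monthly vectors, assumed already normalized row-wise;
    # the 1-D convolution spans 2 consecutive months to capture change
    layers.Conv1D(64, kernel_size=2, activation="relu",
                  input_shape=(96, n_characteristics)),
    # Bidirectional LSTM summarizing the 8-year history
    layers.Bidirectional(layers.LSTM(64)),
    # Output: predicted probability of default
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])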

This model trained rather quickly – around 30 minutes on a GTX 1080 – and gave us a significant improvement over an already very strong model with 2,000 hand-crafted features. This means the network was able to extract information beyond what those hand-crafted features captured.
We believe in this approach, particularly in more commercial settings, where the actual metric of a model is not only its accuracy but also the time (i.e. cost) the data science team needs to develop it. For us, the opportunity cost was only the sleep we missed, which we gladly gave up to take part in this amazing competition. However, most businesses take a more rational and less emotional approach to data science and prefer cheap models to expensive ones. Deep learning offers an attractive alternative to manual feature engineering, and is able to extract meaningful information from time-series bank data.

3. Using nested models

One of the things that bothered us throughout the competition was the somewhat arbitrary nature of the various group-bys we performed while hand-crafting features. For example, we supposed that an overdue installment from five years ago would be less important than one from just a month ago. But what is the exact relationship? The traditional way is to test different thresholds using the cross-validation score, but there is a more elegant way: let the model figure it out.
What we did was build a set of “limited-power” models, each using only a single source of data (for example, only a credit card’s history). The purpose was to force the model to find all possible relationships in the given data source, even at the cost of accuracy. Below are the AUC (area under the curve) scores we got from models using only one data source:

  • Previous application: 0.63
  • Credit card balance: 0.58
  • Pos cash balance: 0.54
  • Installment payments: 0.58
  • Bureau: 0.61
  • Bureau balance: 0.55

The very low AUC scores for these models were hardly surprising, as the individual data sources carry enormous amounts of noise. Even for defaulting clients, the majority of past behaviors (past loans, past installments, past credit cards) were fine. The point was to identify those few behaviors that were common across defaulters.
The way to use those models is to extract the most “default-like” behaviors and use them to describe every user. For example, a very strong feature in the final model was “the maximum default score on a single behavior of a particular user”. Another was “the number of behaviors with a default score exceeding 0.2 for a particular user”.
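
In pandas, deriving such user-level features from per-behavior scores could look like this (the column names are illustrative):

import pandas as pd

# behavior_scores: one row per past behavior (e.g., a single past loan),
# with the nested model's predicted default probability in `score`
user_features = behavior_scores.groupby("user_id")["score"].agg(
    max_behavior_score="max",
    n_risky_behaviors=lambda s: (s > 0.2).sum(),
)
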
Using features from these models further improved an already very strong model: The models learned to abstract whether a particular behavior would lead to a default or not.
In summary, the final model used the following features:

  • More than 2,000 hand-crafted features, selected out of 10,000 features created during brainstorming and creative sessions
  • One feature from the neural network from Step 2
  • Around 40-50 features coming from “nested models” described in Step 3

The portfolio of models we tested included XGBoost, LightGBM, Random Forest, Ridge Regression and neural networks. LightGBM proved to be the best model, especially with heavily tuned regularization hyperparameters (the two most important being feature fraction and L2 regularization).


The model was prepared by Michał Bugaj, Aliaksandr Varashylau and Paweł Godula (Customer Analytics Director at deepsense.ai), who led the team. It predicted whether a borrower would default on a loan with 80% AUC, meaning that a randomly selected defaulter (a person who defaulted on a loan) would be ranked by the model as riskier than a randomly selected non-defaulter 80% of the time.
Our solution ranked fifth, just a tenth of a percentage point less effective than the leader on the private leaderboard. We took first place on the public leaderboard.
The competition itself was a great experience, both for the organization behind it and for the participants, as the resulting models proved effective and business-oriented. Remember our blog post about launching a Kaggle competition? This one may qualify as perfect – the kind of standard that might fit right in Sèvres, alongside the prototype meter.

Four ways to use a Kaggle competition to test artificial intelligence in business

August 24, 2018 / Data science / by Konrad Budek and Patryk Miziuła

For companies seeking ways to test AI-driven solutions in a safe environment, running a competition for data scientists is a great and affordable way to go – when it’s done properly.

According to a McKinsey report, only 20% of companies consider themselves adopters of AI technology while 41% remain uncertain about the benefits that AI provides. Considering the cost of implementing AI and the organizational challenges that come with it, it’s no surprise that smart companies seek ways to test the solutions before implementing them and get a sneak peek into the AI world without making a leap of faith.
That’s why more and more organizations are turning to data science competition platforms like Kaggle, CrowdAI and DrivenData. Making a data science-related challenge public and inviting the community to tackle it comes with many benefits:

  • Low initial cost – the company needs only to provide data scientists with data, pay the entrance fee and fund the award. There are no further costs.
  • Validating results – participants provide the company with verifiable, working solutions.
  • Establishing contacts – a lot of companies and professionals take part in Kaggle competitions. The ones who tackled the challenge may be potential vendors for your company.
  • Brainstorming the solution – data science is a creative field, and there’s often more than one way to solve a problem. Sponsoring a competition means you’re sponsoring a brainstorming session with thousands of professional and passionate data scientists, including the best of the best.
  • No further investment or involvement – the company gets immediate feedback. If an AI solution is deemed efficacious, the company can move forward with it; if not, its involvement ends with funding the award, avoiding further costs.

While numerous organizations – big e-commerce websites and state administrations among them – sponsor competitions and leverage the power of the data science community, running a competition is not at all simple. An excellent example is the competition the US National Oceanic and Atmospheric Administration sponsored when it needed a solution that would recognize and differentiate individual right whales within the population. Ultimately, what proved most efficacious was the principle of facial recognition applied to the topsides of the whales, which were obscured by weather, water and the distance between the photographer above and the whales far below. To check if this was even possible, and how accurate a solution might be, the organization ran a Kaggle competition, which deepsense.ai won.


Having won several such competitions, we have encountered both brilliant and not-so-brilliant ones. That’s why we decided to prepare a guide for every organization interested in testing potential AI solutions in Kaggle, CrowdAI or DrivenData competitions.

Recommendation 1. Deliver participants high-quality data

The quality of your data is crucial to attaining a meaningful outcome. Without the data, even the best machine learning model is useless. This also applies to data science competitions: without quality training data, the participants will not be able to build a working model. This is a great challenge when it comes to medical data, where obtaining enough information is problematic for both legal and practical reasons.

  • Scenario: A farming company wants to build a model to identify soil type from photos and probing results. Although there are six classes of farming soil, the company is able to deliver sample data for only four. Considering that, running the competition would make no sense – the machine learning model wouldn’t be able to recognize all the soil types.

Advice: Ensure your data is complete, clear and representative before launching the competition.

Recommendation 2. Build clear and descriptive rules

Competitions are put together to achieve goals, so the model has to produce a useful outcome. And “useful” is the point here. Because those participating in the competition are not professionals in the field they’re producing a solution for, the rules need to be based strictly on the case and the model’s further use. Including even basic guidelines will help them to address the challenge properly. Lacking these foundations, the outcome may be right but totally useless.

  • Scenario: A mapping of the distribution of children below the age of 7 in a city will be used to optimize social, educational and healthcare policies. To make the mapping useful, it is crucial to include additional guidelines in the rules: the areas mapped need to be bordered by streets, rivers, rail lines, district boundaries and other topographical features of the city. Lacking these, many models might map the distribution by cutting the city into 10-meter-wide, kilometer-long stripes – the segmentation is done, but the outcome is totally useless due to the lack of proper guidelines in the competition rules.

Advice: Think about usage and include the respective guidelines within the rules of the competition to make it highly goal-oriented and common sense driven.

Recommendation 3. Make sure your competition is crack-proof

Kaggle competition winners take home fame and the award, so participants are motivated to win. The competition organizer needs to remember that there are dozens (sometimes thousands) of brainiacs looking for “unorthodox” ways to win the competition. Here are three examples:

  • Scenario 1: A city launched a competition in February 2018 to predict traffic patterns based on historical data (2010-2016). The prediction had to be made for the first half of 2017, with the real data from that period serving as the benchmark. Googling away, participants found that data, so it was easy to fabricate a model that could “predict” with 100% accuracy. That’s why the city decided to provide an additional, non-public dataset to enrich the data and validate whether the models were really doing predictive work.

However, competitions are often cracked in more sophisticated ways. Sometimes data may “leak”: data scientists get access to data they shouldn’t see and use it to tailor a solution that spots the outcome rather than actually predicting it.

  • Scenario 2: Participants are challenged to predict users’ ages from internet usage data. Before the competition, the large company running it noticed that every record carried a long alphanumeric ID with the user’s age embedded in it. Running the competition without deleting the ID would have allowed participants to crack it instead of building a predictive model.

Benchmark data is often shared with participants to let them polish their models. By comparing the input data and the benchmark it is sometimes possible to reverse-engineer the outcome. The practice is called leaderboard probing and can be a serious problem.

  • Scenario 3: The competition calls for a model to predict a person’s clothing size based on height and body mass. To get the benchmark, the participant has to submit 10 sample sizes. The benchmark then compares the outcome with the real size and returns an average error. By submitting properly selected numbers enough times, the participant cracks the benchmark. Anticipating the potential subterfuge, the company opts to provide a public test set and a separate dataset to run the final benchmark and test the model.

Advice: Look for every possible way your competition could be cracked and never underestimate your participants’ determination to win.

Recommendation 4. Spread the word about your competition

One of the benefits of running a competition is that you get access to thousands of data scientists, from beginners to superstars, who brainstorm various solutions to the challenge. Playing with data is fun and participating in competitions is a great way to validate and improve skills, show proficiency and look for customers. Spreading the word about your challenge is almost as important as designing the rules and preparing the data.

  • Scenario: A state administration is in need of a predictive model. It has come up with some attractive prizes and published the upcoming challenge for data scientists on its website. As these steps may not yield the results it’s looking for, it decides to sponsor a Kaggle competition to draw thousands of data scientists to the problem.

Advice: Choose a popular platform and spread the word about the competition by sending invitations and promoting the competition on social media. Data scientists swarm to Kaggle competitions by the thousands. It stands to reason that choosing a platform to maximize promotion is in your best interest.

Conclusion

Running a competition on Kaggle or a similar platform can not only help you determine if an AI-based solution could benefit your company, but also potentially provide the solution, proof of concept and the crew to implement it at the same time. Could efficiency be better exemplified?


Just remember: run a competition that makes sense. Although most data scientists engage in competitions just to win or to validate their skills, it is always better to invest time and energy in something meaningful. Whether a competition makes sense is easier to spot than many companies running them realize.
Preparing a model that is able to recognize plastic waste in a pile of trash is relatively easy. Building an automated machine to sort the waste is a whole different story. Although there is nothing wrong with probing the technology, it is much better to run a competition that will give feedback that can be used to optimize the company’s short- and long-term future performance. Far too many competitions either don’t make sense or produce results that are never used. Even if the competition itself proves successful, who really has the time or resources to do fruitless work?

Three reasons why data analysts make the perfect data scientists

August 9, 2018 / Data science / by Konrad Budek

Data scientist is considered the hottest job of 2018, and for good reason. Combining tech skills, a business-oriented attitude and a seemingly godlike ability to build artificial intelligence, data scientists are indeed both highly desired and hard to come by.

Recruiting data scientists comes with several challenges, with their high salaries being the least significant one. As the profession is new, most candidates have neither the educational background nor the long employment history that might be considered ideal. The lion’s share comes from IT, but a significant number are scientists and business people.
That’s why looking within your organization, particularly among data analysts, is a good way to find a data scientist. Below we discuss three benefits of this approach.

Overlapping skills, or why data analysts are already halfway to becoming data scientists

Data analysts not only have to be proficient with data-processing tools, but also need the business acumen to harvest a database for meaningful insights. Analysts have to be data-driven, curious and inquiring, with a problem-solver’s mentality.
Adding programming skills to this skill set can go a long way toward making the perfect data scientist. In fact, data analysts often do the very work that data scientists try to automate with machine learning models: preparing segmentations or rooting out anomalies within a database are among their daily tasks. Data analysts know which parts of a business are, and are not, good candidates for automation. Used to handling data-oriented tasks and having a strong business background, they have the key qualities to become data scientists. What they need is to acquire the missing competence: building models.

Building loyalty within the team – reducing the turnover rate and gaining skills

Reducing churn on teams is a key challenge HR departments face as employees come and go, seeking a pay rise or new opportunities. So what motivates people to stay? According to Shift Learning, 70% of employees consider training and development opportunities a reason to stay. Giving data analysts the chance to move into the hottest job of 2018 through training alone is a growth opportunity like no other – for the company and the employee.
By transforming a data analyst into a data scientist, the company is both building a stronger bond with the employee and acquiring the skills it requires.

No need for recruitment – improve efficiency to lower costs

Seeking employees is a tedious process, with job postings revealing only the tip-of-the-tip of the iceberg. HR specialists need to review resumes, pick the best candidates and conduct interviews, and the company has to invest time and money in the process.
What’s more, it is far from certain that a given hire will fit the organization’s culture or work ethic.
Data analysts who already work for the organization are trained, fit the culture and have domain knowledge. So the cost of recruitment is a non-issue – there is no need for onboarding or elaborate introductions to the work. And there are likely to be no other surprises.
Last but not least, data scientists tend to be picky, as they have a plethora of job offers. If you are not Google Brain or another tech giant, attracting them can be challenging. On the other hand, data analysts currently working for the company are already on board and trust the brand.


Summary

There is a slew of technical skills and a sizable body of knowledge to be mastered in order to enter the ranks of data scientists. In fact, though, learning the tech part of the job is not rocket science. If a data analyst already working for the organization can translate data into business value, then he or she is more than halfway there.
The distance between the world of AI and the business development department is smaller than one might expect, and many data analysts are already practically there without even knowing it. Supporting them with knowledge and training may be the best way to give them the skills they need – and would no doubt love to acquire – while keeping them at the company by offering a great opportunity to grow.

Online course vs. instructor-led training – how to develop your team’s new skills?

July 27, 2018 / Data science / by Anna Kowalczyk

The e-learning market is anticipated to be worth $37.6 billion by 2020. On the other hand, the dropout rate of massive open online courses (MOOCs) is upwards of 87%! It’s tempting to send your team to any of the popular e-learning platforms to pick up the new skills they need. An important question, then, is whether online courses are superior to instructor-led training.

The opportunities online education brings

Online learning opens up a raft of benefits, flexibility foremost among them. You don’t have to find a training time that suits everyone on the team, as each person can adjust their other responsibilities and study on their own schedule, no matter where they are.
Some courses are offered in collaboration with top universities, which provides a wide range of topics to choose from and further guarantees quality. Moreover, leading companies and industry experts participate in creating the courses, so you can be sure that the educational content is up to date and has practical value.
There is also the issue – or non-issue, as the case may be – of location. Instead of doing research and looking for a decent vendor, you can just send the team to a data visualization course at the University of Illinois, a Python course at the University of Michigan or a data science course at Johns Hopkins University, all while your business is headquartered in Australia.
Your employees will surely appreciate the fact that you invest in their development and will be proud of the certification they get after finishing the course. Participants can learn at their own pace, and, if receiving a course certificate isn’t the aim, dedicate more time to one thing while skipping those parts with less value.
Last but not least, MOOCs are usually affordable; you can receive a group discount or wait for a sale season. Some platforms even offer free courses.

So why is the dropout rate so high?

We have already agreed that online education provides a lot of freedom. Of course, there is a set of guidelines and rules students should follow. The question is, do they have enough self-control and determination to stick with it? Online courses, especially in demanding and ambitious fields such as machine learning, require great self-discipline. It is essential that students be able to balance their priorities to finish the course. Hence the dropout rate north of 87%.
Although MOOCs have enjoyed wide publicity and numerous institutions (including MIT and Stanford) have invested heavily in developing and promoting such courses, the jury is out on their effectiveness. Staying motivated and keeping up with assignments is not a piece of cake, even if you work in a team.

Teams need support when learning

Online learning is awesome, no doubt. But it comes with a few drawbacks. The absence of direct interaction with an instructor makes online education more of a monolog than a dialog. When it comes to educating teams, face-to-face communication is paramount, as everyone has to share the same understanding of the problem and the solutions; it therefore should not be replaced by technology. Learning from a live instructor helps students remain focused while enabling instructors to keep students motivated. Online courses don’t provide these opportunities, which may be a huge hurdle.
According to a study conducted by Susan Dynarski, professor of education, public policy and economics at the University of Michigan, online courses tend to be more beneficial for proficient students, while leaving the less motivated off track. Without a teacher present to help students with problems, online courses tend to lose their efficiency, while instructor-led sessions can be adjusted to the actual level of the team being trained.
The lack of an individual approach, and of support exactly when your team needs it, makes the transition from theoretical learning to the practical application real-life problems require a challenging one. During an online course, students won’t receive instructor support as the subjects become more and more difficult, leaving some feeling overwhelmed.
According to a study conducted by researchers at the University of Pennsylvania, the instructional approach and instructor or peer feedback have a huge impact on the effectiveness of training. Direct and interactive instruction, experiential learning and an individual approach all increase participant engagement. Online courses use peer reviews to enhance learning, allowing students to learn from each other’s experience by providing feedback. However, this approach is not fully controlled by the instructor and may not always give students constructive feedback. Instructor-led courses can, giving the less focused or motivated attendees an opportunity to actually benefit from the course.


Why companies choose online courses vs. instructor-led training

According to Capterra, reducing costs is one reason companies decide to train teams online. But can a low price give you the quality you need? The problem with MOOCs begins with the fact that, as their name says, they’re “massive” and “open”. They address many student profiles, and the materials are not tailored to any particular needs. Companies sometimes choose courses with little understanding of what the course requires and have unrealistic expectations of it or of their employees’ abilities. Aside from their attractive price, online courses are easier to arrange than instructor-led training. You don’t have to look for a vendor and arrange a program – just choose a course covering a relevant topic online.
Nevertheless, managers should look ahead and think about the eventual outcome. Will the team put the knowledge it gains during an online course to good use? An internal instructor-led training program grounded in real projects may cost more than an online course, but that money will soon enough be recouped. Ask yourself whether you can afford ineffective training.

How to effectively develop technical skills in-house

Interaction with others can help you boost your knowledge and increase your interest in a particular topic. Indeed, nothing is more motivating than the people around you. Group training can help your team develop practical skills that are increasingly important in the professional world. Thanks to teamwork and brainstorming, they can tackle more complex problems than they could individually – for example, by developing new approaches to resolve an issue or pooling their knowledge.
Some instructor-led courses provide a learning experience tailored to a company’s technical goals and business strategy. Teams learn by building projects through hands-on, code-based training, so they can apply their new skills and experience in practice. This practical approach can be followed up by a mentoring program to cement the new competencies. A group can understand a given technology more thoroughly when it applies it to specific business use cases – something online courses, which can’t address a company’s particular problems, don’t offer.
So if your team needs new skills that are crucial for your company, are online courses really the answer, or is instructor-led training the way to go? The former seems the riskier of the two, particularly if it is quality results and budget efficiency you’re after. Notably, the Association for Talent Development has found that companies offering comprehensive training have 218% higher income per employee. That is a win-win for employee and employer, taking into account both the individual’s development and the company’s strategic goals. The optimal solution combines the best educational practices with real-life business cases; participants can then turn their knowledge around and use it in their everyday work.

Keras or PyTorch as your first deep learning framework

June 26, 2018 / Data science, Deep learning, Machine learning / by Piotr Migdal and Rafał Jakubanis

So, you want to learn deep learning? Whether you want to start applying it to your business, base your next side project on it, or simply gain marketable skills – picking the right deep learning framework to learn is the essential first step towards reaching your goal.

We strongly recommend that you pick either Keras or PyTorch. These are powerful tools that are enjoyable to learn and experiment with. We know them both from the teacher’s and the student’s perspective. Piotr has delivered corporate workshops on both, while Rafał is currently learning them.
(See the discussion on Hacker News and Reddit).

Introduction

Keras and PyTorch are open-source frameworks for deep learning gaining much popularity among data scientists.

  • Keras is a high-level API capable of running on top of TensorFlow, CNTK, Theano, or MXNet (or as tf.contrib within TensorFlow). Since its initial release in March 2015, it has gained favor for its ease of use and syntactic simplicity, facilitating fast development. It’s supported by Google.
  • PyTorch, released in October 2016, is a lower-level API focused on direct work with array expressions. It has gained immense interest in the last year, becoming a preferred solution for academic research, and applications of deep learning requiring optimizing custom expressions. It’s supported by Facebook.

Before we discuss the nitty-gritty details of both frameworks (well described in this Reddit thread), we want to preemptively disappoint you – there’s no straight answer to the question “which one is better?”. The choice ultimately comes down to your technical background, needs, and expectations. This article aims to give you a better idea of which of the two frameworks you should pick as your first.

TL;DR:

Keras may be easier to get into and experiment with standard layers, in a plug & play spirit.
PyTorch offers a lower-level approach and more flexibility for the more mathematically-inclined users.

Ok, but why not any other framework?

TensorFlow is a popular deep learning framework. Raw TensorFlow, however, abstracts computational graph-building in a way that may seem both verbose and not explicit. Once you know the basics of deep learning, that is not a problem. But for anyone new to it, sticking with Keras as TensorFlow’s officially-supported interface should be easier and more productive.
[Edit: Recently, TensorFlow introduced Eager Execution, enabling the execution of any Python code and making the model training more intuitive for beginners (especially when used with tf.keras API).]
While you may find some Theano tutorials, it is no longer in active development. Caffe lacks flexibility, while Torch uses Lua (though its rewrite is awesome :)). MXNet, Chainer, and CNTK are currently not widely popular.

Keras vs. PyTorch: Ease of use and flexibility

Keras and PyTorch differ in terms of the level of abstraction they operate on.
Keras is a higher-level framework wrapping commonly used deep learning layers and operations into neat, lego-sized building blocks, abstracting the deep learning complexities away from the precious eyes of a data scientist.
PyTorch offers a comparatively lower-level environment for experimentation, giving the user more freedom to write custom layers and look under the hood of numerical optimization tasks. Development of more complex architectures is more straightforward when you can use the full power of Python and access the guts of all functions used. This, naturally, comes at the price of verbosity.
Consider this head-to-head comparison of how a simple convolutional network is defined in Keras and PyTorch:

Keras

from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(MaxPool2D())
model.add(Conv2D(16, (3, 3), activation='relu'))
model.add(MaxPool2D())
model.add(Flatten())
model.add(Dense(10, activation='softmax'))

PyTorch

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3)
        self.conv2 = nn.Conv2d(32, 16, 3)
        self.fc1 = nn.Linear(16 * 6 * 6, 10)
        self.pool = nn.MaxPool2d(2, 2)
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 6 * 6)
        x = F.log_softmax(self.fc1(x), dim=-1)
        return x
model = Net()

The code snippets above give a little taste of the differences between the two frameworks. As for the model training itself – it requires around 20 lines of code in PyTorch, compared to a single line in Keras. Enabling GPU acceleration is handled implicitly in Keras, while PyTorch requires us to specify when to transfer data between the CPU and GPU.
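
For illustration, the contrast looks roughly like this (the data and loader names are placeholders, not code from a specific project):

# Keras: training is a single call (after compile)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(x_train, y_train, batch_size=32, epochs=10)

# PyTorch: an explicit training loop, assuming the Net defined above
# and a DataLoader `train_loader` (add .to(device) calls for GPU use)
import torch

net = Net()
optimizer = torch.optim.Adam(net.parameters())
for epoch in range(10):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = F.nll_loss(outputs, labels)  # Net ends with log_softmax
        loss.backward()
        optimizer.step()
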
If you’re a beginner, the high-levelness of Keras may seem like a clear advantage. Keras is indeed more readable and concise, allowing you to build your first end-to-end deep learning models faster, while skipping the implementational details. Glossing over these details, however, limits the opportunities for exploration of the inner workings of each computational block in your deep learning pipeline. Working with PyTorch may offer you more food for thought regarding the core deep learning concepts, like backpropagation, and the rest of the training process.
That said, Keras, being much simpler than PyTorch, is by no means a toy – it’s a serious deep learning tool used by beginners, and seasoned data scientists alike.
For instance, in the Dstl Satellite Imagery Feature Detection Kaggle competition, the 3 best teams used Keras in their solutions, while our deepsense.ai team (4th place) used a combination of PyTorch and (to a lesser extent) Keras.
It is worth considering whether your applications of deep learning will require flexibility beyond what pure Keras has to offer. Depending on your needs, Keras might just be that sweet spot following the rule of least power.

SUMMARY

  • Keras – more concise, simpler API
  • PyTorch – more flexible, encouraging deeper understanding of deep learning concepts

Keras vs. PyTorch: Popularity and access to learning resources

A framework’s popularity is not only a proxy of its usability. It is also important for community support – tutorials, repositories with working code, and discussion groups. As of June 2018, Keras and PyTorch are both enjoying growing popularity, both on GitHub and in arXiv papers (note that most papers mentioning Keras also mention its TensorFlow backend). According to a KDnuggets survey, Keras and PyTorch are the fastest growing data science tools.

Unique mentions of deep learning frameworks in arxiv papers (full text) over time, based on 43K ML papers over last 6 years. So far TF mentioned in 14.3% of all papers, PyTorch 4.7%, Keras 4.0%, Caffe 3.8%, Theano 2.3%, Torch 1.5%, mxnet/chainer/cntk <1%. (cc @fchollet) pic.twitter.com/YOYAvc33iN

— Andrej Karpathy (@karpathy) March 10, 2018

While both frameworks have satisfactory documentation, PyTorch enjoys stronger community support – their discussion board is a great place to visit if you get stuck (you will get stuck) and the documentation or StackOverflow don’t provide the answers you need.
Anecdotally, we found well-annotated, beginner-level deep learning courses on a given network architecture easier to come across for Keras than for PyTorch, making the former somewhat more accessible for beginners. The readability of Keras code and the unparalleled ease of experimentation it offers may be why it is more widely covered by deep learning enthusiasts, tutors and hardcore Kaggle winners.
For examples of great Keras resources and deep learning courses, see “Starting deep learning hands-on: image classification on CIFAR-10“ by Piotr Migdał and “Deep Learning with Python” – a book written by François Chollet, the creator of Keras himself. For PyTorch resources, we recommend the official tutorials, which offer a slightly more challenging, comprehensive approach to learning the inner-workings of neural networks. For a concise overview of PyTorch API, see this article.

SUMMARY

  • Keras – Great access to tutorials and reusable code
  • PyTorch – Excellent community support and active development

Keras vs. PyTorch: Debugging and introspection

Keras, which wraps a lot of computational chunks in abstractions, makes it harder to pin down the exact line that causes you trouble.
PyTorch, being the more verbose framework, allows us to follow the execution of our script, line by line. It’s like debugging NumPy – we have easy access to all objects in our code and are able to use print statements (or any standard Pythonic debugging) to see where our recipe failed.
A Keras user creating a standard network has an order of magnitude fewer opportunities to go wrong than a PyTorch user does. But once something does go wrong, it hurts a lot, and it’s often difficult to locate the actual line of code that breaks. PyTorch offers a more direct, unconvoluted debugging experience regardless of model complexity. Moreover, when in doubt, you can readily look up the PyTorch repo and read its legible code.

SUMMARY

  • PyTorch – way better debugging capabilities
  • Keras – (potentially) less frequent need to debug simple networks

Keras vs. PyTorch: Exporting models and cross-platform portability

What are the options for exporting and deploying your trained models in production?
PyTorch saves models as pickles, which are Python-based and not portable, whereas Keras takes advantage of a safer approach with JSON + H5 files (though saving with custom layers in Keras is generally more difficult). There is also Keras in R, in case you need to collaborate with a data analyst team using R.
Running on Tensorflow, Keras enjoys a wider selection of solid options for deployment to mobile platforms through TensorFlow for Mobile and TensorFlow Lite. Your cool web apps can be deployed with TensorFlow.js or keras.js. As an example, see this deep learning-powered browser plugin detecting trypophobia triggers, developed by Piotr and his students.
Exporting PyTorch models is more taxing, due to their dependence on Python code, and the currently recommended approach is to start by translating your PyTorch model to Caffe2 using ONNX.
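
In code, the two approaches described above look roughly like this (file names are illustrative):

# Keras: architecture as JSON plus weights as HDF5
with open("model.json", "w") as f:
    f.write(model.to_json())
model.save_weights("weights.h5")

# PyTorch: pickle-based serialization of the model's parameters
import torch
torch.save(net.state_dict(), "net.pt")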

SUMMARY

  • Keras – more deployment options (directly and through the TensorFlow backend), easier model export.

Keras vs. PyTorch: Performance

Donald Knuth famously said:

Premature optimization is the root of all evil (or at least most of it) in programming.

In most instances, differences in speed benchmarks should not be the main criterion for choosing a framework, especially when it is being learned. GPU time is much cheaper than a data scientist’s time. Moreover, while learning, performance bottlenecks will be caused by failed experiments, unoptimized networks, and data loading; not by the raw framework speed. Yet, for completeness, we feel compelled to touch on this subject. We recommend these two comparisons:

  • TensorFlow, Keras and PyTorch comparison by Wojtek Rosiński
  • Comparing Deep Learning Frameworks: A Rosetta Stone Approach by Microsoft (make sure to check notebooks to get the taste of different frameworks). For a detailed explanation of the multi-GPU framework comparisons, see this article.

PyTorch is as fast as TensorFlow, and potentially faster for recurrent neural networks. Keras is consistently slower. As the author of the first comparison points out, the gains in computational efficiency of higher-performing frameworks (i.e. PyTorch and TensorFlow) will in most cases be outweighed by the fast development environment and ease of experimentation that Keras offers.


SUMMARY:

  • As far as training speed is concerned, PyTorch outperforms Keras

Keras vs. PyTorch: Conclusion

Keras and PyTorch are both excellent choices for your first deep learning framework to learn.

If you’re a mathematician, researcher, or otherwise inclined to understand what your model is really doing, consider choosing PyTorch. It really shines where more advanced customization (and debugging thereof) is required (e.g. object detection with YOLOv3 or LSTMs with attention), or when we need to optimize array expressions other than neural networks (e.g. matrix decompositions or word2vec algorithms).

Keras is without a doubt the easier option if you want a plug & play framework: to quickly build, train, and evaluate a model, without spending much time on mathematical implementation details.
EDIT: For a side-by-side code comparison on a real-life example, see our new article: Keras vs. PyTorch: Alien vs. Predator recognition with transfer learning.

Knowledge of the core concepts of deep learning is transferable. Once you master the basics in one environment, you can apply them elsewhere and hit the ground running as you transition to new deep learning libraries.

We encourage you to try out simple deep learning recipes in both Keras and PyTorch. What are your favourite and least favourite aspects of each? Which framework experience appeals to you more? Let us know in the comment section below!

Would you and your team like to learn more about deep learning in Keras, TensorFlow and PyTorch? See our tailored training offers.


Spot the flaw – visual quality control in manufacturing

April 19, 2018/in Data science, Deep learning, Machine learning /by Konrad Budek

Quality assurance in manufacturing is demanding and expensive, yes, but also absolutely crucial. After all, selling flawed goods results in returns and disappointed customers. Harnessing the power of image recognition and deep learning may significantly reduce the cost of visual quality control while also boosting overall process efficiency.

According to “Forbes”, automating quality testing with machine learning can increase defect detection rates by up to 90%. Machines never tire, lose focus or need a break. And every product on a production line is inspected with the same focus and meticulousness.
Yield losses – the products that need to be reworked due to defects – may be one of the biggest cost drivers in the production process. In semiconductor production, testing costs and yield losses can constitute up to 30% of total production costs.

Related:  Playing Atari with deep reinforcement learning - deepsense.ai’s approach

Time and money for quality

Traditional quality control is time-consuming. It is manually performed by specialists testing the products for flaws. Yet the process is crucial for business, as product quality is the pillar a brand will stand on. It is also expensive. Electronics industry giant Flex claims that for every 1 dollar it spends creating a product, it lays out 100 more on resolving quality issues.
Since the inception of image recognition software, manufacturers have been able to incorporate IP cameras into the quality control process. Most of the implementations are based on complex systems of triggers, and with the conditions predefined by programmers, the cameras can spot only a limited number of flaws. While that technology may not have been worthy of the title of game changer, the image recognition revolution was only one step further on.
[Image: fish processing on the assembly line]

Deep learning about perfection

Artificial intelligence can enhance a company's ability to spot flawed products. Instead of embedding complex and lengthy lists of possible flaws into an algorithm, the algorithm learns the product's features. With a vision of the perfect product, the software can easily spot imperfect ones.
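
To illustrate the idea with a rough sketch of our own (not any vendor's actual system): a convolutional autoencoder trained only on photos of flawless products learns to reconstruct them well, so a high reconstruction error on a new photo flags a likely defect. The image size and the error threshold below are assumptions.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    # Autoencoder over 64x64 grayscale product photos (size assumed)
    inputs = keras.Input(shape=(64, 64, 1))
    x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(8, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(8, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)
    autoencoder = keras.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")
    # autoencoder.fit(good_images, good_images, epochs=10)  # good_images: flawless samples only

    def looks_defective(image, threshold=0.01):
        # High reconstruction error means the photo deviates from the learned "perfect" product;
        # the threshold would be tuned on held-out images of good products.
        reconstruction = autoencoder.predict(image[None, ...], verbose=0)[0]
        return float(np.mean((image - reconstruction) ** 2)) > threshold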

Related:  Five hottest big data trends 2018 for the techies

Visual quality control in Fujitsu

A great example of how AI combined with vision systems can improve product quality is on display at Fujitsu's Oyama factory. The Image Recognition System the company uses not only helps it ensure the production of parts of optimal quality, but also supervises the assembly process. This dual role has markedly boosted the company's efficiency.
As the company stated, its earlier solution lacked the flexibility today's fast-moving world demands. Powering it up with AI allowed Fujitsu to quickly adapt the software to new products without time-consuming recalibration. With the AI solution in place, Fujitsu reduced its development time by 80% while keeping part recognition rates above 97%.
As the solution proved successful, Fujitsu deployed it at all of its production sites.
Visual quality control is also making inroads into agricultural product packing. One company has recently introduced a high-performance fruit sorting machine that uses computer vision and machine learning to classify skin defects. The operator can teach the sorting platform to distinguish between different types of blemishes and sort the fruit into sophisticated pack grades. The solution combines hardware, software and operational optimization to reduce the complexity of the sorting process.

Related:  What is the best method of efficiently training machine learning for teams?

Summary

As automation becomes more widespread and manufacturing more complex, factories will need to employ AI. Self-learning machines ultimately allow companies forward-thinking enough to use them to reduce operational costs while maintaining the highest quality possible.
However, an out-of-the-box solution is not always the best option. Limited flexibility and lower accuracy are the most significant obstacles most companies face. Sometimes building an in-house team of machine learning experts is the best way to provide both the competence and the ability to tailor the right solutions for one's business. But as building an internal team to design visual quality control is more than challenging, finding a reliable partner to gain knowledge from may be the better way to start.


What is the best method of efficiently training machine learning for teams?

April 6, 2018/in Data science /by Kamila Stępniowska

According to a recent study, the average Briton has an attention span of 14 minutes. People tend to lose interest faster when they find the subject boring or complicated. That's why the quality of training comes not only from the information provided but also from the form in which it is served.

“You had my curiosity, but now you have my attention”

Adults learn selectively. We pay attention and learn what we find potentially useful and beneficial, or what we consider to be interesting. If I lay out for data scientists an example of a new convnet architecture that is ten times better than ResNet on the ImageNet dataset – I might well gain their attention. At the same time, if you are a sales representative who trades office furniture on the EMEA market, there is a great chance that I have “lost you”. You will never focus on this sentence… unless neural networks happen to be your hobby (in which case you already know that data science can help you at work too).
The point here is that good training is not only about the quality of knowledge involved, though that is crucial. Effective training is also about how new knowledge is served up, relating new information to previous experience, know-how and upcoming personal and work-oriented goals. This approach works for any learning process, data science education included.

Related:  Why do we need more data scientists and why should you become one?

Data science training in house

At the Training & Development Hub, we have built the 4T method – Tailored Team Training Tracks – which represents a fourth approach to machine learning education, incorporating good practices in providing a learning experience into data science education.
The 4T method is built on the assumption that practical education in data science should be provided in-company to software developers, data analysts and other specialists familiar with computer programming or statistics. The 4T method can also be used by data scientists seeking to boost their skills. In both cases, the learning experience is tailored to the team's technical and business goals and delivered via hands-on, code-based training structured in training tracks.

A win-win situation for the employee and employer

This approach to education benefits both the future data scientist (the learner) and his or her manager and company. 4T is focused on providing knowledge, experience and good practices, considering the nature of adult learners and teamwork benefits. The method emphasizes that the training needs to meet the specific practical goals understood by the individual and shared by his or her entire team, and it needs to provide skills that are going to be used in practice right away.
When these things happen, the company naturally uses the new knowledge to tackle its challenges. Because acquiring the new knowledge is part of the job, the individual is strongly motivated, and reasonable goals are set to be reached with the team, so its members have to work together using data science techniques.

Your internal data science team

Employers have plenty of reasons to build an internal data science team, but should bear in mind a couple of issues. First, there is a dearth of data scientists on the market. Further, good data scientists should have broad knowledge in numerous fields, such as computer programming, statistics and math, and that knowledge should be accompanied by a problem-solving mindset or at least a few years of experience in defining and solving work-related problems.
Such a skill set is nothing to sneeze at, and such employees are few and far between. Given that reality, is it worth the investment to retrain current employees such as software engineers and developers to build an internal data science team, or does outsourcing remain the better option? If the company wants to grow its teams' skills and develop know-how that will stay in-house, building a data science team from scratch is indeed a step worth considering.

Related:  Five trends for business to surf the big data wave

Tailored Team Training Tracks

The 4T method was created by the Training & Development Hub based on seven years of experience in building data science solutions for our customers, and three years’ experience providing professional data science training in Europe and the US.
The method has four components that build a unique framework for training in data science and potentially other technological fields: Tailored, Team, Training, Tracks. Below you will find an overview of each of the components.

Tailored

Every training is designed to meet the specific participants' educational needs and, most essentially, the technical and business goals of the group involved. To give you an example, let's say that a company would like to start a new project recognizing skin defects from photos. In this case, two different approaches – each based on deep neural networks – can be used: classification and anomaly detection. The training program provides all participants with knowledge about these exact techniques and the skills required to use them properly, no matter which approach they ultimately choose to work with. The participants can then turn around and use them right away in their everyday work.
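
As a hedged illustration of the classification variant (our own sketch, not part of any specific curriculum): with labelled photos in hand, fine-tuning a pretrained backbone is only a few lines in a modern framework.

    from tensorflow import keras
    from tensorflow.keras import layers

    # Pretrained backbone plus a small binary head: "defect" vs. "no defect"
    backbone = keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False,
        weights="imagenet", pooling="avg")
    backbone.trainable = False  # first train only the new head
    model = keras.Sequential([
        backbone,
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(train_images, train_labels, epochs=5)  # labelled photos (hypothetical data)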

Team

The training should be provided for a team, or at least for a group of people working together who will be able to use the knowledge and skills they have gained in a common project. That is extremely beneficial for employers: their employees' experience will be "calibrated" for future projects, which usually involve whole teams and require a similar level of knowledge and a shared understanding of issues and possible solutions. Thus armed, companies can smoothly and quickly take on new data science projects.
On the other hand, it is also useful for the participant: while adult learners' motivation is largely internal, the group approach and the need for cooperation are also strong stimuli.

Related:  Playing Atari with deep reinforcement learning - deepsense.ai’s approach

Training

In data science, the learning experience should be provided by hands-on, code-based, intensive training using real-life examples. In each minute of a training, participants should know why they are doing what they are doing. This is why the training should be as practical and project-based as possible.
In data science, that means working on datasets related to the team's goals, writing modules for the experiments (or, in long-term, project-based programs, running many experiments) and calibrating models, to name just a few of the activities involved. The training might be framed as a workshop, with mentoring sessions, online exercises, projects and other online and offline forms added.

Tracks

Blocking training sessions in one thematic track works better than having participants attend a bunch of one-off workshops that are not necessarily connected. Within well-designed development programs, knowledge, experience and good practices are much more efficiently transferred into projects. The track should be an end-to-end educational experience built up from workshops, mentoring sessions, projects and other educational forms that will result in a real, useful solution.
[Image: Tailored Team Training Tracks]
The 4T method was developed and is used by deepsense.ai to provide high-quality technical training. It takes into account both the individual employee's development and the company's strategic goals. We believe that it's the way to go – the optimal solution combining the best educational practices with real-life business cases. So far, so good: our clients are satisfied with this approach and regularly help us understand how it can be tailored ever more precisely. If you have any questions or ideas regarding 4T, let us know in the comments! We'll be glad to hear your thoughts.


How to start with machine learning wisely and become a data scientist?

March 28, 2018/in Data science /by Kamila Stępniowska

The job title data scientist wasn't widely used even a short fifteen years ago. There remains a limited (but rapidly growing) number of universities that offer a master's degree in data science. So, what are the most promising and effective ways to become a data scientist? Let's analyze what we already have on the education market and what the perfect learning path should look like.

The most popular ways to become a data scientist

Adult education is strongly connected with lifelong education and learning experience – the process of gaining knowledge, practice and developing a problem-solving mindset supported by a deep motivation to learn.
What are the most common ways to gain a data science learning experience? There are three baseline paths – academia (still), bootcamps and online courses. There are also more informal ways, including community-based short courses (e.g. Introduction to Data Science by PyData) or old-fashioned, one-on-one classes. In this article I will focus on the three most common and well-structured paths.

Related:  What is the best method of efficiently training machine learning for teams?

Universities – prestige & a problem-solving mindset

The top ten universities (QS Top Universities Ranking, 2018), including MIT, Stanford, Cambridge and ETH Zurich, offer superior data science courses. Graduates gain not only knowledge of specific technologies but also develop a problem-solving mindset. Universities teach students how to process issues, ask questions and look for valid answers – in other words, how to think like a scientist. They also offer prestige and peerless social networking. What universities may fail to provide is experience turning knowledge into practice. They also require years to complete, can be exclusive and are not an inexpensive means of gaining knowledge.
[Image: Universities – prestige and a problem-solving mindset]

Bootcamps – experience

Bootcamps are a shorter alternative to a university education (usually 3-12 months) that in most cases teach technologies and good practices. As with academia, most bootcamps are designed to be a full-time activity. What sets them apart is the experience and project work they provide.
Thanks to internship programs and strong connections with companies, bootcamps are an easy path to one's first junior data scientist job. The downside of the bootcamp route is that graduates have fewer opportunities to learn how to think about more complex issues – there is simply not enough time to do so. Bootcamps are a great option for individuals who are fully committed to starting a new career in a completely new field, but the learning experience has its limitations.

Online courses – knowledge

Online courses are a third approach. Stanford, MIT and EPFL offer a wide range of courses, as do established online education players such as Coursera or Udacity. Online courses are particularly suited for those who have learned how to learn. If you do not have that skill, you may well get lost in the forest of information. Data science, neural networks, machine learning and business uses of AI can all be tackled online.
Some of the courses are of very good quality and very accessible, though relatively few provide complex knowledge, and even fewer the comprehensive learning experience – that is, knowledge, practice and the current state of the art. The challenge with online courses is that they demand extreme self-motivation and exploration. Unfortunately, if you lack basic knowledge about data science and experience as a learner, then it's hard to use online courses effectively.

Related:  Spot the flaw - visual quality control in manufacturing

Are these enough?

Each of the above three forms of education suffers from the same problem: they fail to provide the opportunity to apply knowledge in real projects. Two of them likewise come up short in providing experience and a problem-solving mindset. Both of these factors are crucial in adult education and essential for employers.
[Image: Data Science Training Types]
* Per person.
** Instant application of the knowledge and skills is very important in adult education. It helps in establishing internal motivation and defining goals. In this case, instant application would mean use in a particular project or other job-related tasks.
*** Prestige is an additional variable that comes from classical, academy-based education.

The fourth way into data science

The most effective route to an education in data science would, however, be different from Gurdjieff's, and we won't be referring here to yogis, monks and fakirs. Instead, let's do a simple exercise. If you take a look at random LinkedIn profiles, you'll see that data scientists now very often have a background in statistics, physics, math or programming. If you are considering working as a data scientist, it's quite possible that this is a part of your experience too. It's not a coincidence.
There's little doubt that it's faster and more effective to teach an engineer who knows how to code, or a researcher who already knows statistics and has limited experience in R (a programming language), to build machine learning models than to teach the same thing to someone who does not have that experience. If you are an engineer or researcher, data science is a natural career choice.

Related:  Five hottest big data trends 2018 for the techies

Learning while building a project

An effective learning experience is also a matter of purpose. You need to understand why you are learning and how you will be able to apply your new skills and experience in practice. If you are a software developer and you know that you need to gain specific knowledge to build a model that can solve an issue in your project, you are more willing to learn the technology that will enable you to do that. You will learn faster and potentially understand the technology more thoroughly, because you will be able to apply it by building the solution.
[Image: Learning while building a project]

Team training as a part of your job

Let's add one more variable. Who can benefit from you becoming a data scientist besides yourself? Your employer, of course. This brings me to the fourth way of becoming a data scientist – going through an internal training program grounded in real projects. This may be a training series combined with mentoring sessions, based on a real project that will be run in-company. This kind of training would be provided by an employer as an investment in a future internal data science team. The solution is mutually beneficial for the company and the individual alike.
So, the fourth route to an education in data science is on-the-job training based on team learning (giving you motivation and a sense of purpose) and real projects (focusing your efforts on a practical goal). Such training functions as a bridge to solving an existing issue or building an outstanding future solution with your and your team's new competencies. Because basic programming knowledge is required, and an academic background or at least experience in solving programming, statistical or economic issues is also helpful, the program may not be for everyone. It's an advanced path for professionals who value their time and have a strong need to develop.
[Image: Characteristics of the Fourth Way Training]
* Assumes that the training participants are already familiar with problem solving from academia or years of work experience.
** Depends on the participants' previous experience and educational needs.

Teaching data science effectively

Based on many years of working with data scientists and software developers in the US and Europe, I would suggest that none of the main current data science educational paths – universities, bootcamps and online courses – fully covers the job-oriented potential in this field. None of the three offers effective, advanced training that gives you knowledge and experience ready to use in your job. The fourth way is a new force on the advanced, professional educational market, and it will give you an opportunity to develop within your company.
The next post will discuss the practical implementation of the fourth way using the 4T training method. Enjoy!


Why do we need more data scientists and why should you become one?

March 22, 2018/in Data science /by Anna Kowalczyk

If you have come across this article, you already know what data science is and how it can be used. That’s great! Now, you are probably wondering why there is so much fuss about data science. If you want to know why you should become a data scientist, the facts speak for themselves!

Demand

According to LinkedIn’s 2017 U.S. Emerging Jobs Report, the number of data scientists has grown over 650% since 2012. Yet there are still too few people exploiting the opportunities in this field. Why has it grown so fast?
Companies need to use data to run and grow their everyday business. The fundamental goal of data science is to help companies make quicker and better decisions, which can take them to the top of their market, or at least – especially in the toughest red oceans – be a matter of long-term survival. The number of companies prepared to use big data is increasing. As Dresner Advisory Services laid out in their Big Data Analytics Market Study, forty percent of non-users expect to adopt big data in the next two years.
What is more, you can apply machine learning on smaller data sets, such as ones from a local company’s social media or shopping gift card history. This provides even more opportunities and increases the demand for data scientists. Job growth in the next decade is expected to exceed growth from the past ten years, creating 11.5M jobs by 2026, according to the U.S. Bureau of Labor Statistics. Companies are building up their data science teams to embrace data analytics and will make it integral to their success. Why are these analytics so important? Is it worth working for one of these companies? You will find the answer in the next two chapters.

Influence

Data science is changing how decisions are made, and companies are adopting a data-driven approach on a huge scale. Data-driven decisions made with advanced data analytics benefit all manner of companies, from global behemoths to medium-sized companies down to local businesses looking to get ahead. Lack of data is rarely an issue – mountains of it are collected every single second, and we are beginning to understand the potential and influence it can have. Data sets in the right hands can help predict and shape the future.
The problem is getting data sets to mingle. It is the data scientist's role to transform organisations from reactive environments with static and aged data to automated ones that continuously learn in real time. The forecast is simple – data is a valuable resource, and investing in it will definitely pay off.

Tractica forecasts that worldwide revenue from deployments of AI software, hardware, and services will increase from $14.9 billion in 2017 to $23.6 billion in 2018, a year-over-year increase of 58%.

Do we need more data scientists?

Now, knowing that data science is in huge demand, you are probably wondering who is going to do all the work. Do we have enough data scientists? Maybe the market is already flush with experts. Nothing could be further from the truth – data scientists are few and far between, and highly sought after. IBM predicts demand for data scientists will soar 28% by 2020. Machine learning and data science are generating more jobs than there are experts to fill them, which is why these two fields are the fastest growing tech employment areas today.

Why should you become a data scientist?

Let’s start from the bottom of Maslow’s pyramid of human needs, which you secure with money. According to Glassdoor, in 2016 data science was the single highest paid profession. If data is money, as they say, then this should come as no surprise. The combination of skills necessary to do data science the right way is not common. The good news, however, is that if you want to become a data scientist and are willing to develop yourself, you are very likely to succeed. A background in mathematics, statistics or physics is a good foundation to build upon. You don’t necessarily need to have finished a data science program. We write a lot about learning methods on our blog, which you’ll see if you read our next post. Sign up for our newsletter if you would like to be updated.

Make the world easier

Besides its financial and economic aspects, data science is simply a fascinating discipline, one which affects many areas of our everyday lives and makes the world a better place. We already use it in many fields, such as quick and easy customer service, intelligent navigation, recommendations and voice-to-text. You can even improve the resolution of an image with deep learning.
We don't have enough space to chronicle all of the ways that data science is improving people's lives. It is indispensable to the banking sector, where it is used to detect fraud by analyzing the behavior of financial institutions in real time. Elsewhere, robots will be used to help the elderly and the disabled gain mobility and independence. Data science makes these breakthroughs accessible to individuals, solves social problems and modernizes business. Most importantly, you can take part in the revolution data science is bringing about.

Things that matter

Among the many reasons you would want to become a data scientist is that you can make a positive contribution to society. Data science can give you some pretty super superpowers. One of them is reshaping industries like healthcare. The amount of data produced about patients and illnesses rises by the second, opening new opportunities for better structured and more informed healthcare. The challenge is to carefully analyze the data in order to be able to recognize problems quickly and accurately – like deepsense.ai did in diagnosing diabetic retinopathy with deep learning.
Did you know that deep learning can help predict dangerous seismic events and keep miners safe? Underground mining is fraught with threats including fires, methane outbreaks or seismic tremors and bumps. An automatic system for predicting and alerting against such dangers is of utmost importance – and also a great challenge for data scientists. Our deepsense.ai team created a machine learning model for the Data Mining Challenge: Predicting Dangerous Seismic Events in Active Coal Mines, which was the winning solution, and one we take great pride in.
Another superpower is saving rare species. When you think of rescuing endangered animals, you picture remote jungles and scientists chasing them. That stereotype has changed a lot in recent years. Complex predictive models and algorithms can create insights that help scientists analyze threats to wildlife and create solutions that can save animals – all from the relative comfort of a desk. In fact, it was at our very desktops that we created the "Facebook for whales", and it works with 87% accuracy!

Fun

Psst… There’s just one more thing. Data science can simply be fun. Can deep learning play Atari games? Yes! Or perhaps you want to make art even if you aren’t an artist. Data scientists can do it. The only limitation is your imagination!
In the next post you will find inspiration on how to become a data scientist and what kinds of data science training are available on the market, along with their advantages and drawbacks. Stay tuned!