Divide-and-conquer: Constructing decision trees

 
A Pollinator in Pink……..HFDF! flickr photo by The Manic Macrographer shared under a Creative Commons (BY) license

The problem of constructing a decision tree can be expressed recursively. First, select an attribute to place at the root node and make one branch for each possible value. This splits up the example set into subsets, one for every value of the attribute. Now the process can be repeated recursively for each branch, using only those instances that actually reach the branch. If at any time all instances at a node have the same classification, stop developing that part of the tree.

The only thing left to decide is how to determine which attribute to split on, given a set of examples with different classes. If we had a measure of the purity of each node, we could choose the attribute that produces the purest daughter nodes. The measure of purity that we will use is called the information and is measured in units called bits. Associated with a node of the tree, it represents the expected amount of information that would be needed to specify whether a new instance should be classified yes or no, given that the example reached that node.

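[Figure 4.2: tree stumps, one for each candidate attribute; (a) splits on outlook]

For outlook, the numbers of yes and no classes at the three leaf nodes are [2,3], [4,0], and [3,2], respectively, and the information values of these nodes are:

info([2,3]) = 0.971 bits
info([4,0]) = 0.0 bits
info([3,2]) = 0.971 bits

We can calculate the average information value of these, taking into account the number of instances that go down each branch (five down the first and third and four down the second):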

info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits

This average represents the amount of information that we expect would be necessary to specify the class of a new instance, given the tree structure in Figure 4.2(a). Before we created any of the nascent tree structures in Figure 4.2, the training examples at the root comprised nine yes and five no nodes, corresponding to an information value of

info([9,5]) = 0.940 bits

Thus the tree in Figure 4.2(a) is responsible for an information gain of

gain(outlook) = info([9,5]) − info([2,3], [4,0], [3,2]) = 0.940 − 0.693 = 0.247 bits
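The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function names `info` and `average_info` are illustrative, and the per-branch counts [2,3], [4,0], [3,2] are assumed from the standard weather data that the Figure 4.2 example is based on (they match the five, four, and five instances per branch mentioned above).

```python
from math import log2

def info(counts):
    """Information (entropy) in bits for a list of class counts, e.g. [9, 5]."""
    total = sum(counts)
    # Empty classes contribute nothing, so they are skipped.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def average_info(branches):
    """Average of info() over the branches, weighted by branch size."""
    total = sum(sum(branch) for branch in branches)
    return sum(sum(branch) / total * info(branch) for branch in branches)

root = [9, 5]                                # nine yes, five no at the root
outlook_branches = [[2, 3], [4, 0], [3, 2]]  # assumed [yes, no] counts per branch

before = info(root)
after = average_info(outlook_branches)
gain = before - after

print(f"info at the root:       {before:.3f} bits")
print(f"average info for split: {after:.4f} bits")
print(f"gain(outlook):          {gain:.3f} bits")
```

Computed at full precision the average comes out at about 0.6935 bits, which the worked example reports as 0.693; the gain still rounds to 0.247 bits.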

The way forward is clear. We calculate the information gain for each attribute and choose the one that gains the most information to split on. In the situation of Figure 4.2,

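gain(outlook) = 0.247 bits
gain(temperature) = 0.029 bits
gain(humidity) = 0.152 bits
gain(windy) = 0.048 bits

so we select outlook as the splitting attribute at the root of the tree.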
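To see where the per-attribute gains come from, here is a hedged sketch that computes them directly from data. The `WEATHER` table is my assumption: it is the 14-instance weather data set used in the book listed in the bibliography, typed in by hand, and the names `info` and `gain` are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# Assumed data: (outlook, temperature, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def info(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def gain(rows, index):
    """Information gain from splitting rows on the attribute at position index."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[index]].append(row[-1])  # group class labels by attribute value
    remainder = sum(len(s) / len(rows) * info(s) for s in subsets.values())
    return info([row[-1] for row in rows]) - remainder

gains = {name: gain(WEATHER, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)

for name, value in gains.items():
    print(f"gain({name}) = {value:.3f} bits")
print("split on:", best)
```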

Calculating information

Before examining the detailed formula for calculating the amount of information required to specify the class of an example given that it reaches a tree node with a certain number of yes’s and no’s, consider first the kind of properties we would expect this quantity to have:


1. When the number of either yes’s or no’s is zero, the information is zero.

2. When the number of yes’s and no’s is equal, the information reaches a maximum.

3. The information must obey the multistage property illustrated previously.

Remarkably, it turns out that there is only one function that satisfies all these properties, and it is known as the information value or entropy:

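entropy(p1, p2, ..., pn) = −p1 log p1 − p2 log p2 − ... − pn log pn

(The logarithms are base 2, so the result is in bits.)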

The arguments p1, p2, ... of the entropy formula are expressed as fractions that add up to one, so that, for example, info([2,3,4]) = entropy(2/9, 3/9, 4/9). Thus the multistage decision property can be written in general as

entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q/(q + r), r/(q + r))

where p + q + r = 1.
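The multistage property can be checked numerically. The sketch below assumes the usual statement of the property, in which a three-way decision is made in two stages (first p versus the rest, then q versus r), and tries it on the fractions 2/9, 3/9, 4/9 from the example above. The function name `entropy` is illustrative.

```python
from math import isclose, log2

def entropy(*fractions):
    """Entropy in bits of fractions that add up to one; zero terms are skipped."""
    return -sum(p * log2(p) for p in fractions if p > 0)

p, q, r = 2 / 9, 3 / 9, 4 / 9

# Decide among all three classes in a single stage.
one_stage = entropy(p, q, r)

# Decide "p or not p" first, then split the remaining (q + r) share into q and r.
two_stage = entropy(p, q + r) + (q + r) * entropy(q / (q + r), r / (q + r))

print(one_stage, two_stage)
print("equal:", isclose(one_stage, two_stage))
```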

Because of the way the log function works, you can calculate the information measure without having to work out the individual fractions:

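info([2,3,4]) = −2/9 × log 2/9 − 3/9 × log 3/9 − 4/9 × log 4/9
= [−2 log 2 − 3 log 3 − 4 log 4 + 9 log 9] / 9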
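Here is a small sketch of that shortcut, under the assumption that it relies on the identity log(c/N) = log c − log N: the information for counts that total N can be computed from the raw counts alone. Both function names are made up for this illustration.

```python
from math import isclose, log2

def info_from_fractions(counts):
    """Turn the counts into fractions first, then apply the entropy formula."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_from_counts(counts):
    """Same quantity from the raw counts: (N log N - sum of c log c) / N."""
    total = sum(counts)
    return (total * log2(total) - sum(c * log2(c) for c in counts if c > 0)) / total

a = info_from_fractions([2, 3, 4])
b = info_from_counts([2, 3, 4])
print(a, b)
print("equal:", isclose(a, b))
```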

The method that has been described using the information gain criterion is essentially the same as one known as ID3 (https://en.wikipedia.org/wiki/ID3_algorithm).
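Putting the pieces together, the recursive procedure with the information gain criterion can be sketched as follows. This is a simplified illustration in the spirit of ID3, not the original algorithm's code: it only handles nominal attributes, it creates branches only for values that actually occur in the data, and the data set and all function names are my assumptions (the rows are the standard weather data from the book in the bibliography).

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def split(rows, attr):
    """Group rows (dicts) by their value for attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups

def gain(rows, attr, target):
    """Information gain from splitting rows on attr."""
    remainder = sum(len(g) / len(rows) * info([r[target] for r in g])
                    for g in split(rows, attr).values())
    return info([r[target] for r in rows]) - remainder

def build_tree(rows, attributes, target):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:      # all instances agree: stop developing this part
        return labels[0]
    if not attributes:             # nothing left to split on: take the majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree(subset, remaining, target)
                   for value, subset in split(rows, best).items()}}

NAMES = ["outlook", "temperature", "humidity", "windy", "play"]
RAW = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
ROWS = [dict(zip(NAMES, line.split())) for line in RAW.splitlines()]

tree = build_tree(ROWS, NAMES[:-1], "play")
print(tree)
```

On this data the root split is outlook, the sunny branch is then split on humidity, the rainy branch on windy, and the overcast branch is already pure.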

A series of improvements to ID3 culminated in a practical and influential system for decision tree induction called C4.5 (https://en.wikipedia.org/wiki/C4.5_algorithm). These improvements include methods for dealing with numeric attributes, missing values, noisy data, and generating rules from trees.

 

Bibliography

Ian H. Witten, Eibe Frank. (1999). Data mining: practical machine learning tools and techniques. Elsevier.

Ethics and security

By definition, ethics is the branch of philosophy that involves systematizing, defending, and recommending concepts of right and wrong conduct. In computer science and security, we practice ethics whenever we make a decision that harms or protects the end user. One example is deciding to salt and hash stored passwords and encrypt the database so that, in case of a security breach, most of the information remains safe for everyone registered in it.
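As a rough illustration of what salting means in practice, here is a minimal sketch using only the Python standard library. It assumes passwords are stored as salted PBKDF2 hashes rather than encrypted; the function names and the iteration count are my own choices for the example, not a recommendation for any particular system.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 200_000  # illustrative work factor (assumption); tune for real hardware

def hash_password(password):
    """Return (salt, digest) for storage; a fresh random salt is used per password."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, expected_digest):
    """Recompute the hash with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected_digest)

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))
print(verify_password("a wrong guess", salt, digest))
```

Because every user gets a different salt, two users with the same password end up with different stored digests, so a leaked table is much harder to attack in bulk.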

Safety and security of the end user.

One of the goals of computer scientists and computer engineers is to make software that will be used by end users. As the architects of software used by many, we have to take into account the safety of the people we are serving. This is one of the many reasons why testing is important and why software without testing is unreliable.

Safety and greed.

Many of us have installed software from the App Store or the Play Store, and whenever we install these apps we are prompted with a security dialog asking us to agree to the kinds of access we grant to the application. Some of us barely read what appears in front of us, but I warn you that this is very important: it is a contract in which the end user agrees to share personal information with whoever developed the app. As developers, we cannot treat the end user's personal information as an asset to sell; this information is important, and protecting it is our duty.

The risks and management challenges in the digital era.

This month I decided it was a good time to start managing all of my passwords. For a long time I had managed my information in a pretty simple manner, giving little thought to the security issues that some decisions may lead to. A password seems trivial, yet it is the one and only key to most of the websites I visit, and it is so tightly bound to my web identity that, if discovered, it could lead to identity theft. Over the years I have created countless accounts, too many to remember which websites I am registered on, and so I found myself vulnerable. On which sites have I registered, and on how many of them have I used the same password? Just imagining this vital and private information in the wrong hands, where it could tear my entire world apart, was something I was not willing to let happen. Think for a second that one of these small websites is attacked and its users' information is compromised: this simple slip could cost you a ton of money if the password you use for smalltaquito.mx is the same as your Amazon one.

-laughs in spanish-

That is the reason I decided it was time to move on: my lack of security was putting me at risk and making it easy for others to hack me. This was my motivation to start using a password manager and to learn how it can change the way I use the internet, how I log in to different websites, and how to store a secure password.
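The core idea, one long random password per site, is easy to sketch. The snippet below is only an illustration of that idea using Python's `secrets` module; it is not how any particular password manager works, and the function name, the length, and the second site name are hypothetical.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits + string.punctuation

def generate_password(length=20):
    """Build a password from cryptographically secure random choices."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

# A different password for every site, so one breach does not unlock the others.
sites = ["smalltaquito.mx", "amazon.example"]  # the second name is a placeholder
passwords = {site: generate_password() for site in sites}

for site, password in passwords.items():
    print(site, password)
```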

Accessibility was one of the major factors that kept me at bay whenever I tried to use these tools, but now it is not complicated to carry my passwords with me. In fact, ease of access is one of the reasons I like the password manager LastPass (see https://www.lastpass.com/es): it makes it easy to keep track of my passwords across many computers, cellphones, tablets, or any other digital device.

Digital Identity

Online identity is a social presence that an internet user establishes in online communities and websites such as Facebook or Twitter.

As a whole, online identity defines you as a user of the internet.

By expressing opinions on blogs and other social media, users define a tacit identity, which can be considered an actively constructed presentation of oneself. Many people, like myself, prefer to use a pseudonym as their personal identity, while many others prefer to use their names online.

But what is identity?

Identity is personally identifiable information: sensitive information that can be used on its own or combined with other information to identify someone. Specifically, it is any information that can be used to distinguish or trace an individual, such as a name, social security number, date and place of birth, mother’s maiden name, home address or email address, passport number, driver’s license number, credit card numbers, telephone number, or any other information that is linked to a specific person.

Today's digital streaming media sites like Twitch have raised a new risk to identity: being identified on the internet can have severe consequences.

Identity theft

Identity theft happens when someone steals your personal information and uses it without permission. Thieves can run up your credit accounts, get new credit cards, medical treatment, or a job, write stolen, fake, or altered checks, siphon checking and savings accounts, or take out loans for large-ticket items, all in your name.

Easy to access for convenience = easy for others to access

While most of us are aware of the risks of the digital world and like to believe that what we have is secure, there are risks that materialize every day and that we have to take into account. For example, we store our passwords on many web pages, and if one of them suffers a security breach and our information is compromised, the attackers may obtain our passwords and more. In the real world many people recycle the same password across different websites, so losing one password compromises several of them. We also have to take into account that the information we hand over to companies must be equally well secured so that it is not passed on to other people; this is another potential flaw in security.

References

https://www.lynda.com/IT-Infrastructure-tutorials/Theft-identity/616732/643952-4.html