Saturday, September 30, 2017

Decision Tree Models Based on Matthew North’s Book, for the eReader-Training and eReader-Scoring Datasets (Assignment 4)

Abstract

     This analysis was prepared to fulfill an assignment for the Big Data course and to uncover useful information from the example datasets.


The Attributes:

  • User_ID: A numeric, unique identifier assigned to each person who has an account on the company’s web site.
  • Gender: The customer’s gender, as identified in their customer account. In this data set, it is recorded as ‘M’ for male and ‘F’ for female. The Decision Tree operator can handle non-numeric data types.
  • Age: The person’s age at the time the data were extracted from the web site’s database. This is calculated to the nearest year by taking the difference between the system date and the person’s birthdate as recorded in their account.
  • Marital_Status: The person’s marital status as recorded in their account. People who indicated on their account that they are married are entered in the data set as ‘M’. Since the web site does not distinguish among types of single people, those who are divorced or widowed are included with those who have never been married (indicated in the data set as ‘S’).
  • Website_Activity: This attribute is an indication of how active each customer is on the company’s web site. Working with Richard, we used the web site database’s records of the duration of each customer’s visits to calculate how frequently, and for how long each time, the customers use the web site. This is then translated into one of three categories: Seldom, Regular, or Frequent.
  • Browsed_Electronics_12Mo: This is simply a Yes/No column indicating whether or not the person browsed for electronic products on the company’s web site in the past year.
  • Bought_Electronics_12Mo: Another Yes/No column indicating whether or not they purchased an electronic item through Richard’s company’s web site in the past year.
  • Bought_Digital_Media_18Mo: This attribute is a Yes/No field indicating whether or not the person has purchased some form of digital media (such as MP3 music) in the past year and a half. This attribute does not include digital book purchases.
  • Bought_Digital_Books: Richard believes this attribute will likely be the best indicator of buying behavior relative to the company’s new eReader. Thus, it has been set apart from the purchase of other types of digital media. Further, this attribute indicates whether or not the customer has ever bought a digital book, not just in the past year or so.
  • Payment_Method: This attribute indicates how the person pays for their purchases. In cases where the person has paid in more than one way, the mode, or most frequent method of payment, is used. There are four options:
    • Bank Transfer : payment via e-check or other form of wire transfer directly from the bank to the company.
    • Website Account : the customer has set up a credit card or permanent electronic funds transfer on their account so that purchases are directly charged through their account at the time of purchase.
    • Credit Card : the person enters a credit card number and authorization each time they purchase something through the site.
    • Monthly Billing : the person makes purchases periodically and receives a paper or electronic bill which they pay later either by mailing a check or through the company web site’s payment system.
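The “mode, or most frequent method of payment” rule is an ordinary statistical mode over one customer’s purchase history. As a small illustration in plain Python (the tie-break by first occurrence is my own assumption, since the description does not specify one):

```python
from collections import Counter

def payment_method_mode(payments):
    """Return the most frequent payment method for one customer.

    `payments` lists the method used for each of that customer's
    purchases. Ties break toward the first-seen method (an assumed
    convention, not stated in the book).
    """
    counts = Counter(payments)
    return max(counts, key=counts.get)

# Toy purchase history, not real customer data.
history = ["Credit Card", "Bank Transfer", "Credit Card", "Monthly Billing"]
print(payment_method_mode(history))  # Credit Card
```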
Steps:

1. Open RapidMiner and create a new, blank process

2. Import the data files 7 - eReaderAdoption-Training.csv and 7 - eReaderAdoption-Scoring.csv

     
   Here is an example of our data

3. Designing the process

    The first thing we need to do is drag the training and scoring data from the repository onto the process page, then pick the “Set Role” operator from the Operators menu, as in the picture below:



     In the “Set Role” operator, we set the attribute name to “User_ID” and the target role to “id”, as shown in the picture above.


     After setting the role of User_ID, add a second “Set Role” operator from the Operators menu, as before, and connect it to the training process line.

    Configure this “Set Role” operator the same way as the previous one, but set the attribute name to “eReader_Adoption” and the target role to “label”.
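What the two Set Role operators accomplish can be sketched outside RapidMiner in plain Python: the “id” column is set aside and the “label” column is separated from the regular attributes before any learning happens. The header and rows below are toy values, not the real CSV contents:

```python
def split_roles(header, rows, id_col="User_ID", label_col="eReader_Adoption"):
    """Mimic RapidMiner's Set Role: pull out the id and label columns
    so only the regular attributes are fed to the learner."""
    id_i, label_i = header.index(id_col), header.index(label_col)
    ids = [r[id_i] for r in rows]
    labels = [r[label_i] for r in rows]
    features = [[v for j, v in enumerate(r) if j not in (id_i, label_i)]
                for r in rows]
    return ids, labels, features

# Toy rows; the id numbers and class values are invented for illustration.
header = ["User_ID", "Gender", "Bought_Digital_Books", "eReader_Adoption"]
rows = [[56031, "M", "Y", "Innovator"],
        [25913, "F", "N", "Late Majority"]]
ids, labels, X = split_roles(header, rows)
print(labels)  # ['Innovator', 'Late Majority']
```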

4. Designing decision tree

    Take the “Decision Tree” operator from the Operators menu and choose “gain_ratio” as the criterion; this means we use the basic C4.5 decision tree. After placing “Decision Tree” on the training line, add an “Apply Model” operator, also from the Operators menu.
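For reference, “gain_ratio” is the C4.5 splitting criterion: information gain divided by the split’s intrinsic information, which penalizes attributes with many distinct values. A minimal standalone sketch of the computation (not RapidMiner’s implementation; the toy rows are invented):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attribute_index):
    """C4.5 gain ratio of splitting `rows` on one nominal attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    # Intrinsic information: penalizes many-valued attributes.
    split_info = -sum(len(p) / n * math.log2(len(p) / n)
                      for p in partitions.values())
    return info_gain / split_info if split_info > 0 else 0.0

# Toy example: a Yes/No attribute that perfectly separates two classes.
rows = [("Y",), ("Y",), ("N",), ("N",)]
labels = ["Adopter", "Adopter", "Late", "Late"]
print(gain_ratio(rows, labels, 0))  # 1.0, a perfect split
```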


Results:

  • Frequent Decision Tree


  • Frequent Decision Tree Description


    • Regular Decision Tree


    • Regular Decision Tree Description


    • Seldom Decision Tree
    • Seldom Decision Tree Description

    • Adoption Graph
    • Payment Method Graph

    Conclusion:

         After collecting all the data above, we have information about people’s payment-method behavior. People seldom adopt a new payment method, and most people use Website Account and Bank Transfer as their payment methods.

    Sunday, September 24, 2017

    Prediction Model using Rapid Miner (Assignment 3)

    RapidMiner Studio is a powerful visual programming environment for rapidly building complete predictive analytic workflows. This all-in-one tool features hundreds of pre-defined data preparation and machine learning algorithms to efficiently support all your data science needs. It can be used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development, and it supports all steps of the data mining process, including data preparation, results visualization, validation, and optimization. I use RapidMiner as my tool to finish my assignment: creating a prediction model from the election (pemilu) training data given by my lecturer.

    In this case, we have to predict whether the legislative candidates are elected or not, using the following algorithms:

    1. Decision Tree (C4.5)
    2. Naïve Bayes (NB)
    3. K-Nearest Neighbor (K-NN)

    And for the evaluation / accuracy testing, we are asked to use 10-fold cross-validation (X-Validation).




    This diagram shows the flow of my process, from importing the data, to setting the role, to creating the visualization.

    The steps are:
    1. Open Rapidminer and click new process and open with the blank space.
    2. Then you can start it.
    3. In operators menu search ‘Read Excel’, ‘Set Role’, and ‘Validations’ then drag and drop to the Process window.
    4. Double click on ‘Read Excel’ then input ‘datapemilukpu.xls’.
    5. Click on ‘Set Role’ and edit on parameters box, input “TERPILIH ATAU TIDAK” and change Target Role become Label.
    6. Connect the operators dot to dot.
    7. Double click on ‘Validation’ to open the Training and Testing.
    8. In operator box, search ‘Decision Tree’/’K-NN’/’Naive Bayes’ then drag and drop in the Training space.
    9. In operator box, search ‘Apply Model’ and ‘Performance’ then drag and drop in Testing space.
    10. 10. Then connect the operators.
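The X-Validation step above can be sketched outside RapidMiner as a plain 10-fold loop. The learner below is a stand-in majority-class predictor, not the actual Decision Tree, Naive Bayes, or k-NN; the point is only the fold/train/test/average mechanics:

```python
def k_fold_accuracy(X, y, train_fn, predict_fn, k=10):
    """10-fold cross-validation as in RapidMiner's X-Validation:
    train on k-1 folds, test on the held-out fold, average accuracy."""
    folds = [list(range(i, len(y), k)) for i in range(k)]
    scores = []
    for test_idx in folds:
        train_idx = [i for i in range(len(y)) if i not in test_idx]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict_fn(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k

# Stand-in learner: always predicts the training fold's majority class.
def train_majority(X, y):
    return max(set(y), key=y.count)

def predict_majority(model, x):
    return model

# Toy candidate data: 70 elected, 30 not (invented, not the real dataset).
X = [[i] for i in range(100)]
y = ["elected"] * 70 + ["not elected"] * 30
print(round(k_fold_accuracy(X, y, train_majority, predict_majority), 2))
```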


    • Decision Tree
    • Naive Bayes

    • K-NN
    • References
    https://rapidminer.com/products/studio/
    https://kevinbwstudenttelkomuniversity.wordpress.com/2016/10/23/big-data-task-evaluation-prediction-elektabilitas-caleg/

    Saturday, September 16, 2017

    IMPLEMENTING BIG DATA ANALYSIS TO GET INFORMATION ABOUT MYSELF IN DOTA 2 (Assignment 2)


    • Abstract:
    Dota 2 is one of the most played games in the world. It has 113 heroes with unique skills and dozens of items, which makes it an attractive game that everyone should try.
    With so many heroes and items in Dota 2, I tried to do some research using big data. I have played 700 games of Dota 2 so far, and I want to know what unique information comes up after playing those 700 games.

    In this case, I am using Dotabuff to help provide all the data I need to find my unique information. Here is some of the information I got.

    • Win rate by activity by day of week:

    Here is my win rate by activity by day of week. The higher the bar, the more often I played on that day. As you can see, my highest win rate is on Saturday at 53%, followed by Monday at 50%. On the other days, I lost more often than I won.

    • Win rate by activity by hour of day:

    Here is my win rate by activity by hour of day. The higher the bar, the more often I played at that hour. As you can see, my highest win rates are at 1 a.m., followed by 9 a.m., 9 p.m., and 7 p.m. At the other hours, I lost more often than I won.
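Both activity charts boil down to grouping matches and dividing wins by games played. A minimal sketch of that computation in Python, using a made-up match list rather than my real Dotabuff history:

```python
from collections import defaultdict

def win_rate_by(matches, key):
    """Win rate (%) of matches grouped by a key such as day of week,
    the way Dotabuff's activity charts summarize match history."""
    played = defaultdict(int)
    won = defaultdict(int)
    for m in matches:
        played[m[key]] += 1
        won[m[key]] += m["win"]
    return {k: round(100 * won[k] / played[k], 2) for k in played}

# Toy match history (invented for illustration).
matches = [
    {"day": "Saturday", "win": 1}, {"day": "Saturday", "win": 1},
    {"day": "Saturday", "win": 0}, {"day": "Monday", "win": 1},
    {"day": "Monday", "win": 1}, {"day": "Monday", "win": 0},
    {"day": "Monday", "win": 0},
]
print(win_rate_by(matches, "day"))  # {'Saturday': 66.67, 'Monday': 50.0}
```

The same function works for win rate by hour of day: just store an `"hour"` field on each match and group by that key instead.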

    • Most played heroes with win rates:



             This picture shows my top 10 most played heroes of all time. The top 3 in the table are my signature heroes, the ones I use the most. As you can see, my highest win-rate hero is Witch Doctor with a 67.5% win rate, followed by Warlock at 65.22%, Shadow Fiend at 52.38%, and Viper at 51.43%.

    • Conclusions:
         After gathering all this information, I now know when I should play Dota 2 and which heroes I should pick if I want to win.

    Sunday, September 10, 2017

    IMPLEMENTATION OF BIG DATA FOR DISASTER MANAGEMENT (Assignment 1)


    • Objective
    A core objective of this plan is for disaster recovery to be recorded and monitored as evaluation and analysis material, so that future disaster recovery can be carried out better.


    • Problems

    Indonesia is located in a disaster-prone area and can be considered a disaster laboratory because of its geographical, geological, and demographic conditions. Disaster intensity is increasing and becoming more complex, so disasters must be handled with a multi-sectoral, multi-disciplinary approach, in an integrated and coordinated manner. This emphasizes the need for a disaster management system. Law No. 24/2007 on Disaster Management serves as the basis for developing the national disaster countermeasure system.


    • Solution Idea

    Indonesia lies between the Asian and Australian continental plates and sits on a chain of volcanoes, which makes earthquakes likely: both earthquakes caused by continental plate shifts and tectonic earthquakes caused by volcanic activity. Because Indonesia is an archipelago, it also has the potential for tsunami disasters, such as the devastating tsunami that once struck Aceh. Data like this become very large, and they are collected as study and analysis material that can be used as a basis for decision-making.

    • Methodology
    In this case, the writer uses a case study; we can see the data that the writer presents.


    • Measurement


    1. Utilization of Big Data, combining disaster and disaster-prevention data for further analysis, modeling, and computing.
    2. Increased resilience of information technology, enabling real-time sensing, visualization, analysis, experimentation, prediction, and sensitive decision-making in all critical circumstances.
    3. Development of fundamental knowledge and innovation for the resilience and sustainability of civil infrastructure and distributed infrastructure networks.
    Source: http://suyatno.dosen.akademitelkom.ac.id/wp-content/uploads/2015/11/Master-Plan-Big-Data-dan-Manajemen-Bencana.pdf