Sunday, December 3, 2017

#ASSIGNMENT8 [FINAL] SENTIMENT ANALYSIS TO IMPROVE COMPANY INSIGHT INTO CUSTOMERS

Implementing Big Data to Gain More Customer Insight through Sentiment Analysis

Introduction
Nowadays, people prefer public online transportation to their own vehicles, because it saves time and energy and preserves privacy. As time goes by, however, people have grown unsatisfied with the service given by the companies; in Indonesia, Gojek, Grab, and Uber are the big players in public online transportation. These companies rarely engage with customers personally, and they do not turn customer complaints into advertisements that make customers feel heard.
How do we solve this problem?

In this case, we are going to crawl text data from Twitter with the keyword “@GrabID”, because people complain via Twitter by mentioning Grab’s official account, @GrabID. I use the Orange application because it is simple to use.
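For readers who want to reproduce the crawl outside Orange, here is a minimal Python sketch using the tweepy library. The credentials are placeholders and the tweet count is an arbitrary assumption; only the keyword “@GrabID” is taken from this post.

# Minimal sketch of the same crawl with tweepy (placeholder credentials).
import tweepy

auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET",       # placeholder API keys
    "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET",
)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Collect up to 200 recent Indonesian tweets mentioning @GrabID.
tweets = [
    status.text
    for status in tweepy.Cursor(api.search_tweets, q="@GrabID", lang="id").items(200)
]
print(f"Collected {len(tweets)} tweets")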
Here is the data crawled from @GrabID; it will become our data set.





After we process the data, we get information like this:







From the picture above, we can see there are 1,462 documents containing 3,927 words. We can turn this data into a bag of words, like this:
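As a reference point, the same bag-of-words step takes only a few lines of Python. This is a minimal scikit-learn sketch, assuming tweets is the list of crawled tweet texts from the earlier sketch.

# Minimal bag-of-words sketch with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(tweets)   # documents x vocabulary count matrix

print(bow.shape)                         # e.g. (1462, 3927): documents x words
word_counts = bow.sum(axis=0).A1         # total count of each word
top_words = sorted(zip(vectorizer.get_feature_names_out(), word_counts),
                   key=lambda pair: -pair[1])[:10]
print(top_words)                         # ten most frequent words overall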





Conclusion


After we know the big picture of our customers’ complaints, we can serve our customers with more insight. Grab, for example, can make its advertisements more efficient by addressing the most common complaints directly in them, so customers feel heard.

Sunday, October 1, 2017

ADVANTAGES AND DISADVANTAGES OF VARIOUS METHODS OF DATA MINING CLASSIFICATION (Assignment 5)

Naïve Bayes

Advantages:
• Handles both quantitative and discrete data.
• Robust to isolated noise points, since such points are averaged out when the conditional probabilities are estimated.
• Requires only a small amount of training data to estimate the parameters (means and variances of the variables) needed for classification.
• Handles missing values by ignoring the instance during the probability estimation.
• Fast and space-efficient.
• Robust against irrelevant attributes.

Disadvantages:
• Breaks down if a conditional probability is zero: the predicted probability for that class becomes zero as well (unless smoothing is applied).
• Assumes the variables are independent.

Decision Tree

Advantages:
• A complex, very global decision region can be broken down into simpler, more specific ones.
• Eliminates unnecessary computations.
• Flexible in selecting different features at different internal nodes.
• Can avoid the overlap problem by using fewer criteria at each internal node without significantly reducing the quality of the resulting decision.

Disadvantages:
• Overlap occurs, especially when the number of classes and criteria is very large; this can increase decision time and the amount of memory required.
• Errors accumulate from each level in a large decision tree.
• Designing the optimal decision tree is difficult; the quality of the decision obtained from a decision tree depends on how the tree is designed.

K-NN

Advantages:
• Robust to noisy training data and effective when the training data is large.
• More effective with large training data.
• Can produce more accurate predictions.

Disadvantages:
• The most optimal value of k, which states the number of nearest neighbors, must be determined.
• The computational cost is quite high, because the distance to every instance of the training sample must be computed for each query instance.
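The zero-probability weakness of Naïve Bayes is easy to demonstrate in code. Here is a small illustrative sketch (the tiny data set is invented purely for illustration) showing that a feature value never seen together with a class drives that class’s predicted probability to nearly zero, and that Laplace smoothing avoids this:

# Illustrating the Naive Bayes zero-probability problem and Laplace smoothing.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

X = np.array([[0], [0], [1], [1], [2]])   # feature value 2 never occurs with class 1
y = np.array([0, 0, 1, 1, 0])

for alpha in (1e-10, 1.0):                # ~no smoothing vs. Laplace smoothing
    model = CategoricalNB(alpha=alpha).fit(X, y)
    print(alpha, model.predict_proba([[2]]))  # P(class | x=2) for each class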








References:
Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011.
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Elsevier, 2006.
Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, Elsevier, 2011.
Matthew North, Data Mining for the Masses, Creative Commons Attribution, 2012.

Saturday, September 30, 2017

Decision Tree Models Based on Matthew North’s Book for the Datasets: eReader-Training and eReader-Scoring Analysis (Assignment 4)

Abstract

This analysis was made to fulfill an assignment for the big data subject and to uncover the unique information in the example data.


The Attributes:

  • User_ID: A numeric, unique identifier assigned to each person who has an account on the company’s web site.
  • Gender: The customer’s gender, as identified in their customer account. In this data set, it is recorded as ‘M’ for male and ‘F’ for female. The Decision Tree operator can handle non-numeric data types.
  • Age: The person’s age at the time the data were extracted from the web site’s database. This is calculated to the nearest year by taking the difference between the system date and the person’s birthdate as recorded in their account.
  • Marital_Status: The person’s marital status as recorded in their account. People who indicated on their account that they are married are entered in the data set as ‘M’. Since the web site does not distinguish between types of single people, those who are divorced or widowed are included with those who have never been married (indicated in the data set as ‘S’).
  • Website_Activity: This attribute is an indication of how active each customer is on the company’s web site. Working with Richard, we used the web site database’s information, which records the duration of each customer’s visits to the web site, to calculate how frequently, and for how long each time, the customers use the web site. This is then translated into one of three categories: Seldom, Regular, or Frequent.
  • Browsed_Electronics_12Mo: This is simply a Yes/No column indicating whether or not the person browsed for electronic products on the company’s web site in the past year.
  • Bought_Electronics_12Mo: Another Yes/No column indicating whether or not they purchased an electronic item through Richard’s company’s web site in the past year.
  • Bought_Digital_Media_18Mo: This attribute is a Yes/No field indicating whether or not the person has purchased some form of digital media (such as MP3 music) in the past year and a half. This attribute does not include digital book purchases.
  • Bought_Digital_Books: Richard believes that as an indicator of buying behavior relative to the company’s new eReader, this attribute will likely be the best indicator. Thus, this attribute has been set apart from the purchase of other types of digital media. Further, this attribute indicates whether or not the customer has ever bought a digital book, not just in the past year or so.
  • Payment_Method: This attribute indicates how the person pays for their purchases. In cases where the person has paid in more than one way, the mode, or most frequent method of payment is used. There are four options:
    • Bank Transfer : payment via e-check or other form of wire transfer directly from the bank to the company.
    • Website Account : the customer has set up a credit card or permanent electronic funds transfer on their account so that purchases are directly charged through their account at the time of purchase.
    • Credit Card : the person enters a credit card number and authorization each time they purchase something through the site.
    • Monthly Billing : the person makes purchases periodically and receives a paper or electronic bill which they pay later either by mailing a check or through the company web site’s payment system.
Steps:

1. Open RapidMiner with a new blank process

2. Import the data from 7 - eReaderAdoption-Training.csv and 7 - eReaderAdoption-Scoring.csv

     
   Here is our data example

3. Designing the process

The first thing we need to do is drag the training and scoring data from the repository to the process page, then pick the “Set Role” operator from the Operators menu, like the picture below:



We set the attribute name to “User_ID” and the target role to “id” in the “Set Role” operator, as in the picture above.


After setting the role of User_ID, add one more “Set Role” operator from the Operators menu, as before, and connect it to the training process line.

Configure this second “Set Role” the same way as the previous one, but with the attribute name “eReader_Adoption” and the target role “label”.

4. Designing decision tree

Take the “Decision Tree” operator from the Operators menu and choose “gain_ratio” as the criterion; this means we use the basic C4.5-style decision tree. After putting the “Decision Tree” on the training line, we add the “Apply Model” operator, also from the Operators menu.
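For comparison, here is a rough Python sketch of the same workflow. The file names are the ones used in step 2; note that scikit-learn’s decision tree offers “entropy” (information gain) rather than RapidMiner’s gain_ratio criterion, so this is only an approximation of C4.5, not the exact same model.

# Rough Python counterpart of the RapidMiner process above.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("7 - eReaderAdoption-Training.csv")
scoring = pd.read_csv("7 - eReaderAdoption-Scoring.csv")

# Mirror the two Set Role steps: User_ID is an id, eReader_Adoption the label.
y = train["eReader_Adoption"]
X = pd.get_dummies(train.drop(columns=["User_ID", "eReader_Adoption"]))
X_new = pd.get_dummies(scoring.drop(columns=["User_ID"]))
X_new = X_new.reindex(columns=X.columns, fill_value=0)   # align dummy columns

# criterion="entropy" approximates C4.5's information-gain splitting.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
scoring["prediction"] = tree.predict(X_new)              # the Apply Model step
print(scoring[["User_ID", "prediction"]].head())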


Results:

  • Frequent Decision Tree


  • Frequent Decision Tree Description


    • Regular Decision Tree


    • Regular Decision Tree Description


    • Seldom Decision Tree
    • Seldom Decision Tree Description

    • Adoption Graph
    • Payment Method Graph

    Conclusion:

After we collect all the results above, we have information about customers’ payment behavior: customers are mostly slow to adopt a new payment method, and most of them use a website account or a bank transfer as their payment method.

    Sunday, September 24, 2017

    Prediction Model using Rapid Miner (Assignment 3)

RapidMiner Studio is a powerful visual programming environment for rapidly building complete predictive analytics workflows. This all-in-one tool features hundreds of pre-defined data preparation and machine learning algorithms to efficiently support all your data science needs. It can be used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development, and it supports all steps of the data mining process, including data preparation, results visualization, validation, and optimization. I used RapidMiner as my tool to finish my assignment on creating a prediction model using the training data on election (pemilu) results given by my lecturer.

In this case, we have to predict whether the legislative candidates are elected or not using the following
algorithms:

    1. Decision Tree (C4.5)
    2. Naïve Bayes (NB)
    3. K-Nearest Neighbor (K-NN)

And for the evaluation/accuracy testing, we are asked to use 10-fold cross-validation.




This diagram shows the flow of my activity, from inputting the data and setting the role to creating the visualization.

The steps are:
1. Open RapidMiner, click New Process, and start with a blank process.
2. Then you can start building the process.
3. In the Operators menu, search for ‘Read Excel’, ‘Set Role’, and ‘Validation’, then drag and drop them into the Process window.
4. Double-click ‘Read Excel’, then load ‘datapemilukpu.xls’.
5. Click ‘Set Role’ and, in the Parameters box, enter “TERPILIH ATAU TIDAK” and change the target role to Label.
6. Connect the operators dot to dot.
7. Double-click ‘Validation’ to open the Training and Testing panels.
8. In the Operators box, search for ‘Decision Tree’/‘K-NN’/‘Naive Bayes’, then drag and drop it into the Training panel.
9. In the Operators box, search for ‘Apply Model’ and ‘Performance’, then drag and drop them into the Testing panel.
10. Then connect the operators; a Python sketch of the same workflow is shown below.
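For readers without RapidMiner, here is a minimal Python sketch of the same 10-fold cross-validation. Only the file name datapemilukpu.xls and the label column TERPILIH ATAU TIDAK come from the steps above; everything else about the spreadsheet’s layout is an assumption.

# Minimal sketch: 10-fold cross-validation of the three classifiers.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

data = pd.read_excel("datapemilukpu.xls")
y = data["TERPILIH ATAU TIDAK"]                      # the label set in Set Role
X = pd.get_dummies(data.drop(columns=["TERPILIH ATAU TIDAK"]))

for name, model in [("Decision Tree", DecisionTreeClassifier(criterion="entropy")),
                    ("Naive Bayes", GaussianNB()),
                    ("K-NN", KNeighborsClassifier())]:
    accuracy = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: mean accuracy {accuracy:.3f}")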


    • Decision Tree
    • Naive Bayes

    • K-NN
• References
    https://rapidminer.com/products/studio/
    https://kevinbwstudenttelkomuniversity.wordpress.com/2016/10/23/big-data-task-evaluation-prediction-elektabilitas-caleg/

    Saturday, September 16, 2017

    IMPLEMENTING BIG DATA ANALYSIS TO GET INFORMATION ABOUT MYSELF IN DOTA 2 (Assignment 2)


    ·         Abstract:
Dota 2 is one of the most played games in the world. It has 113 heroes with unique skills and dozens of items, which makes it an attractive game for almost anyone.
With so many heroes and items in Dota 2, I tried to do a small study using big data. I have played about 700 games of Dota 2 so far, and I wanted to know what unique information comes up from those 700 games.

In this case, I am using Dotabuff to provide all the data I need to find my unique information. Here is some of the information that I got.

    ·         Win rate by activity by day of week:

Here is my win rate by activity by day of week. The higher the bar, the more often I played on that day. As you can see, my highest win rate is on Saturday at 53%, followed by Monday at 50%. On the remaining days, I lost more often than I won.

    ·         Win rate by activity by hour of day :

Here is my win rate by activity by hour of day. The higher the bar, the more often I played at that hour. As you can see, my highest win rates are at 1 a.m., followed by 9 a.m., 9 p.m., and 7 p.m. At the other hours, I lost more often than I won.
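A chart like Dotabuff’s win rate by day of week could also be computed from one’s own match history. Here is a minimal pandas sketch; the file name and its columns (a match start time and a won flag) are hypothetical.

# Hypothetical sketch: win rate by day of week from a personal match log.
import pandas as pd

matches = pd.read_csv("my_dota_matches.csv", parse_dates=["start_time"])
matches["day"] = matches["start_time"].dt.day_name()

# "won" is assumed to be a 0/1 flag, so its mean is the win rate.
by_day = matches.groupby("day")["won"].agg(games="count", win_rate="mean")
print(by_day.sort_values("win_rate", ascending=False))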

    ·         Most played heroes with win rates:



This picture shows my top 10 most played heroes of all time. The top 3 in the table are my signature heroes, meaning the heroes I use most of the time. As you can see, my highest win rate hero is Witch Doctor with a 67.5% win rate, followed by Warlock with 65.22%, Shadow Fiend with 52.38%, and Viper with 51.43%.

    ·         Conclusions:
After I got all this information, I now know what to do if I want to win: when I should play Dota 2, and which heroes I should pick.

    Sunday, September 10, 2017

    IMPLEMENTATION OF BIG DATA FOR DISASTER MANAGEMENT (Assignment 1)


•  Objective
A core objective of this plan is for disaster recovery to be monitored and
recorded as evaluation and analysis material, so that future
disaster recovery can be done better.


    • Problems

Indonesia is located in a disaster-prone area and can be considered a disaster
laboratory because of its geographical, geological, and demographic conditions. Disaster intensity
is increasing and disasters are becoming more complex, so a multi-sectoral and
multi-disciplinary approach must be used, in an integrated
and coordinated manner. This emphasizes the need for a disaster management system.
Law No. 24/2007 on Disaster Management serves as a basis for
developing the National Disaster Management System.


    • Solution Idea

Indonesia lies between the Asian and Australian continental plates,
as well as on a chain of volcanoes, which makes Indonesia prone to
earthquakes, both earthquakes caused by continental plate shifts (tectonic
earthquakes) and earthquakes due to volcanic activity. Because Indonesia is
an archipelago, it also has the potential for a tsunami disaster, such as
the devastating tsunami that once occurred in Aceh. Data like these become very
large and should be administered as study or analysis material that can serve as a
basis for decision-making.

    • Methodology
In this case, the writer uses a case study. We can see the data that the writer presents.


    • Measurement


1. Utilization of Big Data combined with disaster prevention and management data for further analysis, modeling, and computing capabilities.
2. Increased resilience of information technology to enable real-time sensing, visualization, analysis, experimentation, prediction, and time-sensitive decision-making in all critical circumstances.
3. Development of fundamental knowledge and innovation for the resilience and sustainability of civil infrastructure and networks of distributed infrastructure.
    Source: http://suyatno.dosen.akademitelkom.ac.id/wp-content/uploads/2015/11/Master-Plan-Big-Data-dan-Manajemen-Bencana.pdf

    Tuesday, May 2, 2017

ICT Final Assignment (Delete Soon)

This is the conclusion of the journal article titled Information and Communication Technology (ICT) Literacy: Integration and Assessment in Higher Education.
    http://www.iiisci.org/journal/cv$/sci/pdfs/p890541.pdf

This study provides some evidence for the convergent and discriminant validity of the ETS ICT Literacy Assessment, paving the way for its use to evaluate instructional programs on ICT literacy. In current work, we are assessing the effectiveness of an innovative ICT literacy instructional method by comparing student performance on the ICT Literacy Assessment before and after instruction. Our overall goals are to understand how first-year students acquire information-processing skills, identify best practices for integrating information literacy into the curriculum, and assess the impact of skill acquisition on overall academic achievement.