Data Analytics And BigML For Crowd Funding Projects

Variables in Crowd Funding Data Set

Business intelligence and data visualization are among the most important activities in today's business world. Data visualization covers the techniques used to explore data with the help of statistical tools. In this study, we analyse a data set on fund collection for different types of projects, using a range of statistical tools and techniques. The aim is to find facts that would help someone obtain money through crowd funding for a creative project. The analysis yields advice for project success and a general picture of how crowd funding works for different kinds of projects, so it should be useful to people who want to create similar crowd funding projects. In particular, we want to identify the attributes in the crowd funding data set that matter most for project success. We use software such as BigML and SPSS, and we apply basic descriptive statistics, graphical analysis, and inferential statistical analysis. Let us look at this research study in detail.

For this research study, we analyse the crowd funding data set using different statistical software packages. The data were downloaded from Blackboard. The data set PleaseFundThis.xlsx has many variables, such as project name, date launched, duration in days, goal in $, percent raised, project state, amount pledged in $, major category, and minor category. All variables and their measurement scales are listed below:

No.	Variable	Scale
1	project_name	Nominal
2	date_launched	Nominal
3	duration_days	Ratio
4	goal_$	Ratio
5	percent_raised	Ratio
6	project_state	Nominal
7	amt_pledged_$	Ratio
8	major_category	Nominal
9	minor_category	Nominal
10	project_updated_count	Ratio
11	city	Nominal
12	region	Nominal
13	number_of_pledgers	Ratio
14	comments_count	Ratio
15	avg_amt$_per_pledger	Ratio
16	project_has_video	Nominal
17	project_has_facebook_page	Nominal
18	facebook_friends_count	Ratio
19	project_has_pledge_rewards	Nominal
20	lowest_pledge_level_$	Ratio
21	highest_pledge_level_$	Ratio
22	total_count_of_pledge_levels	Ratio
23	success	Nominal

We use descriptive and inferential statistical techniques to analyse the variables listed above.

Graphical Analysis

In this section, we look at the graphical analysis of the variables in the crowd funding data set, starting with histograms. The required histograms are given below:

The histogram for project duration in days shows that most projects run for 30 days, and the median duration is also 30 days. It is therefore recommended that a new project run for 30 days or close to it. A 30-day duration is the most popular choice, and most people prefer it for short as well as long projects. The data were collected from all over the world for different kinds of projects, including short films and dramas, so this finding should apply worldwide. No project runs for more than 60 days, so creators have to complete their fund collection within one or two months.

Graphical Analysis of Variables

Next, we look at the histogram for the variable project update count, given below.

This histogram shows that low project update counts are the most frequent, and that the frequency falls as the update count rises. The variable is right skewed: most projects post only a few updates.

Statistical Analysis

In this section, we look at the statistical analysis of the crowd funding data set, starting with frequency distributions for the categorical variables. A frequency distribution gives a general idea of how the projects are spread across the categories of a variable. The frequency distribution for the major category of the project is given below:

Tally for Discrete Variables: major_category

major_category	Count
Art	2577
Comics	886
Dance	378
Design	1475
Fashion	1265
Film & Video	5967
Food	1334
Games	2091
Music	6160
Photography	775
Publishing	3672
Technology	705
Theater	1162
N=	28447

This frequency distribution shows that the categories people use most for their projects are music, film and video, publishing, and art. It is therefore advisable to place a new project in one of these categories for a better chance of success.
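The ranking of categories can be made concrete by converting the tally into percentage shares. The following is a minimal sketch that uses the counts from the major_category tally; the function name `category_shares` is introduced here for illustration only.

```python
# Counts copied from the major_category tally in this section.
major_category_counts = {
    "Art": 2577, "Comics": 886, "Dance": 378, "Design": 1475,
    "Fashion": 1265, "Film & Video": 5967, "Food": 1334, "Games": 2091,
    "Music": 6160, "Photography": 775, "Publishing": 3672,
    "Technology": 705, "Theater": 1162,
}


def category_shares(counts):
    """Return (category, count, percent) tuples, largest count first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, n, 100.0 * n / total) for name, n in ranked]


shares = category_shares(major_category_counts)

# Print the four largest categories with their share of all projects.
for name, n, pct in shares[:4]:
    print(f"{name:<14}{n:>6}{pct:>7.1f}%")
```

Taken together, the four largest categories account for roughly 65% of the 28447 projects. A large share shows popularity only; the success rate within each category would need a cross-tabulation with project_success, which is not part of this tally.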

Next, we look at the frequency distribution for the variable minor category, given below:

Tally for Discrete Variables: minor_category

minor_category	Count
Animation	268
Art	565
Art Book	272
Board & Card Games	294
Children’s Book	651
Classical Music	305
Comics	886
Conceptual Art	103
Country & Folk	589
Crafts	242
Dance	378
Design	195
Digital Art	80
Documentary	1634
Electronic Music	180
Fashion	1265
Fiction	1022
Film & Video	1135
Food	1334
Games	266
Graphic Design	166
Hardware	229
Hip-Hop	353
Illustration	154
Indie Rock	813
Jazz	261
Journalism	149
Mixed Media	290
Music	1885
Narrative Film	754
Nonfiction	887
Open Hardware	44
Open Software	115
Painting	272
Performance Art	296
Periodical	169
Photography	775
Poetry	134
Pop	462
Product Design	1114
Public Art	381
Publishing	388
Rock	1075
Sculpture	194
Short Film	1465
Tabletop Games	555
Technology	317
Theater	1162
Video Games	976
Webseries	711
World Music	237
N=	28447

Frequency distributions for the remaining categorical variables in the data set are summarised below:

project_has_video	Count
FALSE	4440
TRUE	24007
N=	28447

project_has_facebook_page	Count
No	7969
Yes	20478
N=	28447

project_has_pledge_rewards	Count
Yes	28447
N=	28447

project_success	Count
FALSE	14368
TRUE	14079
N=	28447

It is observed that 4440 projects do not have a video, while 24007 projects do. The analysis also shows that 7969 projects do not have their own Facebook page, while 20478 projects do. It is therefore important to create a Facebook page for a new project, and more generally to create profiles on different social media sites to stay in contact with people. All projects have pledge rewards. Finally, 14368 projects are categorised as failed and 14079 as successful.

Next, we look at descriptive statistics for the variables in the crowd funding data set, starting with duration in days. The descriptive statistics for this variable are given below:

Variable	N	Mean	Median	TrMean	StDev	SE Mean	Minimum	Maximum	Q1	Q3
duration	28447	32.750	30.000	32.383	10.980	0.065	1.000	60.000	30.000	35.000

The average project duration is 32.75 days, with a standard deviation of 10.98 days.
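The columns of the descriptive statistics table can be reproduced with a short routine. The sketch below is illustrative only: the function name `describe` and the sample values are hypothetical, the 5% trim on each tail is an assumption about how TrMean is defined, and quartile conventions differ between packages, so Q1 and Q3 may not match other software exactly.

```python
import math
import statistics


def describe(values, trim=0.05):
    """Summary statistics in the style of the tables in this section."""
    data = sorted(values)
    n = len(data)
    k = int(n * trim)              # observations dropped from each tail (assumed 5%)
    trimmed = data[k:n - k]
    stdev = statistics.stdev(data)  # sample standard deviation
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "N": n,
        "Mean": statistics.mean(data),
        "Median": statistics.median(data),
        "TrMean": statistics.mean(trimmed),
        "StDev": stdev,
        "SE Mean": stdev / math.sqrt(n),  # standard error of the mean
        "Minimum": data[0],
        "Maximum": data[-1],
        "Q1": q1,
        "Q3": q3,
    }


# Hypothetical durations in days, for illustration only (not taken from the data set).
sample = [30, 30, 45, 30, 60, 15, 30, 35, 30, 21]
summary = describe(sample)

# Consistency check on the reported table: SE Mean = StDev / sqrt(N).
se_reported = 10.980 / math.sqrt(28447)
print(round(se_reported, 3))
```

The last two lines recompute the standard error of the mean from the reported standard deviation and sample size for duration, which agrees with the SE Mean column of the table.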

Descriptive statistics for the variable goal amount in $ are given below:

Descriptive Statistics: goal_$

Variable	N	Mean	Median	TrMean	StDev	SE Mean	Minimum	Maximum	Q1	Q3
goal_$	28447	20575	5000	9186	241016	1429	1	21474836	2000	12000

Descriptive statistics for the remaining variables in the data set are summarised below:

Descriptive Statistics: percent_raised

Variable	N	Mean	Median	TrMean	StDev	SE Mean	Minimum	Maximum	Q1	Q3
percent_	28447	121	73	68	1758	10	0	240716	5	113

Descriptive Statistics: amt_pledged_$

Variable	N	Mean	Median	TrMean	StDev	SE Mean	Minimum	Maximum	Q1	Q3
amt_pled	28447	10196	1710	3999	91367	542	0	8596475	290	5675

Descriptive Statistics: project_update_count

Variable	N	Mean	Median	TrMean	StDev	SE Mean	Minimum	Maximum	Q1	Q3
project_	28447	3.219	1.000	2.467	5.228	0.031	0.000	147.000	0.000	4.000

Descriptive Statistics: number_of_pledgers

Variable	N	Mean	Median	TrMean	StDev	SE Mean	Minimum	Maximum	Q1	Q3
number_o	28447	133.2	28.0	53.6	1124.9	6.7	0.0	91584.0	6.0	80.0

Descriptive Statistics: comments_count

Variable	N	Mean	Median	TrMean	StDev	SE Mean	Minimum	Maximum	Q1	Q3
comments	28447	30.3	0.0	2.1	740.1	4.4	0.0	59463.0	0.0	3.0

Descriptive Statistics: facebook_friends_count

Variable	N	N*	Mean	Median	TrMean	StDev	SE Mean	Minimum	Maximum	Q1	Q3
facebook	17886	10561	479.22	221.00	354.24	777.86	5.82	0.00	5358.00	0.00	596.00

Descriptive Statistics: total_count_of_pledge_levels

Variable	N	Mean	Median	TrMean	StDev	SE Mean	Minimum	Maximum	Q1	Q3
total_co	28447	9.2036	8.0000	8.7629	5.2298	0.0310	1.0000	31.0000	6.0000	11.0000

Next, we use inferential statistics to check some claims about the variables in the data set. The first claim is that the average goal amount in $ is the same for every duration period in days. To check it, we use the one-way analysis of variance (ANOVA) F test. The null and alternative hypotheses for this test are as follows:

Null hypothesis: H₀: There is no statistically significant difference between the average goal amounts for the different duration periods in days.

Alternative hypothesis: H_a: There is a statistically significant difference between the average goal amounts for the different duration periods in days.

We use a 5% level of significance for this test. The ANOVA table is given below:

One-way ANOVA: goal_$ versus duration_days

Analysis of Variance for goal_$

Source	DF	SS	MS	F	P
duration	59	6.836E+12	1.159E+11	2.00	0.000

The p-value for this ANOVA test is reported as 0.000, which is less than the alpha value of 0.05, so we reject the null hypothesis that there is no statistically significant difference between the average goal amounts for the different duration periods in days.

There is sufficient evidence to conclude that the average goal amounts differ across the duration periods in days.
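The F statistic in a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The following is a minimal sketch of that calculation in plain Python; the function name `one_way_anova` and the three small groups are hypothetical and only illustrate the arithmetic, not the actual goal_$ data.

```python
def one_way_anova(groups):
    """Return (df_between, df_within, F) for a list of numeric groups."""
    all_values = [x for group in groups for x in group]
    n_total = len(all_values)
    grand_mean = sum(all_values) / n_total

    # Between-group sum of squares: group size times squared distance
    # of each group mean from the grand mean.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    # Within-group sum of squares: squared distance of each value
    # from its own group mean.
    ss_within = sum(
        (x - sum(group) / len(group)) ** 2
        for group in groups
        for x in group
    )

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_stat


# Hypothetical goal amounts ($) for three duration periods.
groups = [[1000, 1500, 2000], [3000, 3500, 4000], [5000, 5500, 6000]]
df_between, df_within, f_stat = one_way_anova(groups)

# Consistency check on the reported table: MS = SS / DF for the duration row.
ms_check = 6.836e12 / 59
```

The final line confirms that the reported mean square for duration follows from the reported sum of squares and its 59 degrees of freedom. Obtaining the p-value requires the F distribution, which the Python standard library does not provide; a routine such as `scipy.stats.f_oneway` returns both the statistic and the p-value.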

Next, we test one more claim: that the average number of pledge levels (total_count_of_pledge_levels) is the same for failed and successful projects. To check this hypothesis, we use the two-sample t test for population means. The null and alternative hypotheses for this test are as follows:

Null hypothesis: H₀: The average number of pledge levels is the same for failed and successful projects.

Alternative hypothesis: H_a: The average number of pledge levels is not the same for failed and successful projects.

We use a 5% level of significance for this test.

Output for this test is given as below:

Two-Sample T-Test and CI: total_count_of_pledge_levels, project_success

Two-sample T for total_count_of_pledge_levels

project_success	N	Mean	StDev	SE Mean
FALSE	14368	8.37	4.76	0.040
TRUE	14079	10.06	5.54	0.047

Difference = mu (FALSE) - mu (TRUE)
Estimate for difference: -1.6859
95% CI for difference: (-1.8060, -1.5657)
T-Test of difference = 0 (vs not =): T-Value = -27.50 P-Value = 0.000 DF = 27652

The p-value for this test is reported as 0.000, which is less than the alpha value of 0.05, so we reject the null hypothesis that the average number of pledge levels is the same for failed and successful projects.

There is sufficient evidence to conclude that the average number of pledge levels differs between failed and successful projects; successful projects offer about 1.7 more pledge levels on average.
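The t value, degrees of freedom, and confidence interval can be approximately recomputed from the group summaries alone. The sketch below assumes the unequal-variance (Welch) form of the two-sample t test, which is consistent with the non-pooled degrees of freedom in the output; the function name `welch_from_summary` is hypothetical. Because the inputs are the rounded means and standard deviations from the table, the results agree with the reported output only approximately.

```python
import math
from statistics import NormalDist


def welch_from_summary(n1, mean1, sd1, n2, mean2, sd2, confidence=0.95):
    """Welch two-sample t statistic, degrees of freedom, and CI from summaries."""
    v1 = sd1 ** 2 / n1           # squared standard error, group 1
    v2 = sd2 ** 2 / n2           # squared standard error, group 2
    se = math.sqrt(v1 + v2)      # standard error of the difference
    diff = mean1 - mean2
    t_value = diff / se
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    # With tens of thousands of degrees of freedom the t distribution is
    # practically normal, so a normal critical value is used as an approximation.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return t_value, df, (diff - z * se, diff + z * se)


# Group summaries for total_count_of_pledge_levels: failed first, successful second.
t_value, df, ci = welch_from_summary(14368, 8.37, 4.76, 14079, 10.06, 5.54)
print(round(t_value, 1), round(df), tuple(round(b, 2) for b in ci))
```

The recomputed t value is close to -27.5 and the interval contains the reported estimate of -1.6859, in line with the output above. For an exact p-value from summary statistics, `scipy.stats.ttest_ind_from_stats` with `equal_var=False` performs the same test.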

Results and Conclusions

The analysis of the data set reveals a number of facts about the different variables. The most important results are summarised below:

Most projects run for 30 days, and the median project duration is also 30 days.
Low project update counts are the most frequent, and the frequency falls as the update count rises. This variable is right skewed.
The categories people use most for their projects are music, film and video, publishing, and art.
4440 projects do not have a video, while 24007 projects do. Likewise, 7969 projects do not have their own Facebook page, while 20478 projects do, so it is important to create a Facebook page for a new project.
14368 projects are categorised as failed, while 14079 projects are categorised as successful.
The average project duration is 32.75 days, with a standard deviation of 10.98 days.
There is sufficient evidence to conclude that the average goal amounts differ across the duration periods in days.
There is sufficient evidence to conclude that the average number of pledge levels differs between failed and successful projects.

Benefits of Statistical Data Analysis Using Different Software

People and organizations commonly use Excel spreadsheets to maintain their data. Great Eastern University, a very large university located in Melbourne, Australia, also uses Excel spreadsheets to maintain the data of its 20000 current students and millions of past students. The university should instead use statistical and analytics tools such as Power Pivot, Power BI, Tableau, BigML, SPSS, geospatial tools, Google Analytics, Minitab, Matlab, SAS, R, and IBM Watson. These packages make statistical data analysis more reliable. Excel does not provide advanced analysis out of the box and needs add-ons or extensions for advanced work. Its output tables are poorly suited to reporting, whereas statistical packages produce well-formatted tables. Excel cannot perform advanced statistical tests directly, so much of the analysis has to be completed with manual commands. BigML, SPSS, and similar packages are premium software products used for a wide variety of statistical analysis, including data compilation, preparation, graphics, modelling, and analysis. They play an important role in market research, surveying, healthcare, and the social sciences. If your business or organization uses Microsoft Excel for market research or any other business-related research, you should consider using SPSS instead. Compared with Excel, statistical software products offer easier and quicker access to basic functions such as descriptive statistics through pull-down menus. They provide a wide range of charts and graphs to choose from, give faster access to statistical tests, and make machine learning easier to apply.

Advantages of using advanced data analytics tools at Great Eastern University

Using advanced statistical software products such as Power Pivot, Power BI, Tableau, BigML, SPSS, geospatial tools, Google Analytics, Minitab, Matlab, SAS, R, and IBM Watson at Great Eastern University would bring many benefits. The university would save time and cost. It could present results in a clear and attractive way. Data analysis would become more reliable and easier, and the tables needed for an analytical study would be readily available. Data keeping and handling would be easier than with spreadsheets. With these products, the university could present all types of information at a click.

With the statistical software products discussed above, the university could increase its number of students. It could also analyse results from social media, analytics tools, and geospatial tools to improve the experience of all students. Data analysis related to students would reveal facts on which the university can base its decisions. In this way, analytics work would help increase student retention at the university.

Ken Rudin, the Director of Analytics at Facebook, has said that organizations must “focus on impacts, not insights”. In other words, during data analytics work the emphasis should be on the impact of the factors or variables being analysed, not on collecting insights about treatments, factors, or variables for their own sake.

Implementation of Ken’s suggestion at Great Eastern University

According to Ken Rudin, organizations must focus on impacts and not insights. To implement this suggestion at Great Eastern University, the university needs to use advanced statistical software products for data analysis, and the management team or administration should focus on the impact of the analysis and avoid extended discussion that goes beyond the results obtained.
